About This Project

The Neural Scribe investigates whether modern neural language models can detect authorship differences in ancient Greek texts. This page documents the methodology, data sources, and how to reproduce the analysis.

Analysis Pipeline

The analysis proceeds in four stages, each designed to minimise bias and ensure reproducibility.

Corpus Preparation

Texts are fetched from the MorphGNT repository (morphological annotations of the SBLGNT). Each text is segmented into overlapping chunks of 150 words with a 75-word stride, giving 50% overlap between consecutive chunks to increase resolution along the text while preserving local context. Only the lemma column is extracted, with punctuation stripped, so the neural pipeline operates on the same token stream as the classic baseline.
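The segmentation step can be sketched as follows; `chunk_lemmas` is an illustrative name (the actual logic lives in data_loader.py and may differ), but it shows the 150-word window with a 75-word stride:

```python
def chunk_lemmas(lemmas, size=150, stride=75):
    """Segment a lemma sequence into overlapping fixed-size chunks.

    With size=150 and stride=75, consecutive chunks share 75 lemmas,
    so each position (away from the edges) is covered by two chunks.
    """
    chunks = []
    for start in range(0, len(lemmas), stride):
        chunk = lemmas[start:start + size]
        if len(chunk) < size:  # drop a short trailing remainder
            break
        chunks.append(chunk)
    return chunks

# Example: 300 dummy lemmas -> chunks starting at positions 0, 75, 150
lemmas = [f"w{i}" for i in range(300)]
chunks = chunk_lemmas(lemmas)  # 3 chunks of 150 lemmas each
```

Dropping the short final remainder keeps all chunks the same length, which simplifies the frequency and embedding steps downstream.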

Classic Stylometry Baseline

The classic baseline applies PCA to z-scored relative frequencies of the 100 most frequent words (MFW), a standard approach in computational stylometry. This establishes that surface-level word distributions alone cannot distinguish the disputed letters from undisputed Paul, motivating the neural approach. Chunks are treated as documents, and the PCA coordinates are saved for visualisation.
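A minimal sketch of this baseline, assuming the NumPy/scikit-learn stack from the requirements; `mfw_pca` is a hypothetical name, not necessarily the function in run_classic_baseline.py:

```python
import numpy as np
from collections import Counter
from sklearn.decomposition import PCA

def mfw_pca(chunks, n_mfw=100, n_components=2):
    """PCA on z-scored relative frequencies of the most frequent words.

    chunks: list of token lists (one list per chunk-document).
    Returns PCA coordinates (n_chunks, n_components) and the MFW vocabulary.
    """
    totals = Counter(tok for chunk in chunks for tok in chunk)
    vocab = [w for w, _ in totals.most_common(n_mfw)]
    col = {w: j for j, w in enumerate(vocab)}
    freqs = np.zeros((len(chunks), len(vocab)))
    for i, chunk in enumerate(chunks):
        for w, c in Counter(chunk).items():
            if w in col:
                freqs[i, col[w]] = c / len(chunk)         # relative frequency
    z = (freqs - freqs.mean(0)) / (freqs.std(0) + 1e-12)  # z-score per word
    coords = PCA(n_components=n_components).fit_transform(z)
    return coords, vocab
```

Z-scoring each word column puts frequent and rare MFW on the same scale before PCA, so no single high-frequency word dominates the components.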

Neural Embedding Probe

Each chunk is embedded using pranaydeeps/Ancient-Greek-BERT with mean pooling over attention-masked tokens. A Pauline centroid and standard deviation are computed from the 314 undisputed chunks only (a one-class approach: no disputed text influences the reference distribution). Each target chunk's cosine distance from this centroid is then z-scored against the distribution of Pauline distances to produce a σ-distance.
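The one-class scoring can be sketched as follows; `sigma_distances` is an illustrative name, and the real script may organise the centroid and normalisation differently:

```python
import numpy as np

def sigma_distances(pauline_emb, target_emb):
    """One-class cosine-distance scoring against a Pauline centroid.

    pauline_emb: (N, 768) embeddings of the undisputed chunks
    target_emb:  (M, 768) embeddings of the chunks to score
    Returns each target chunk's cosine distance from the Pauline centroid,
    z-scored against the Pauline chunks' own distances (sigma-distance).
    """
    centroid = pauline_emb.mean(axis=0)

    def cos_dist(X, c):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        c = c / np.linalg.norm(c)
        return 1.0 - X @ c

    # Reference distribution: how far Pauline chunks sit from their centroid
    ref = cos_dist(pauline_emb, centroid)
    mu, sd = ref.mean(), ref.std()
    return (cos_dist(target_emb, centroid) - mu) / sd
```

By construction, the undisputed chunks themselves average a σ-distance of zero, so a target text with a strongly positive mean sits outside the normal Pauline spread.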

Statistical Testing

For each disputed text, a one-sample t-test checks whether its mean σ-distance differs significantly from zero. Cohen's d measures practical effect size. A Spearman rank correlation tests whether the distances form a gradient matching scholarly consensus rejection rates. Percentile analyses examine what fraction of chunks exceed P75/P90/P95 thresholds of the Pauline distribution.

Data Sources

📜 MorphGNT / SBLGNT

The primary text source is the MorphGNT project, which provides morphological annotations of the SBL Greek New Testament. Each word is annotated with part of speech, parsing code, text form, word form, normalised form, and lemma.

github.com/morphgnt

📚 Scholarly Consensus

Rejection rates for Pauline authorship are based on surveys of critical scholarship (Ehrman 2012, Brown 1997, Kümmel 1975). The rates represent the approximate percentage of scholars who consider each letter pseudepigraphal or non-Pauline.

Texts Analysed

Group     Text              MorphGNT File          Chunks
Anchor    Romans            66-Ro-morphgnt.txt     47
Anchor    1 Corinthians     67-1Co-morphgnt.txt    46
Anchor    2 Corinthians     68-2Co-morphgnt.txt    30
Anchor    Galatians         69-Ga-morphgnt.txt     15
Anchor    Philippians       71-Php-morphgnt.txt
Anchor    1 Thessalonians   73-1Th-morphgnt.txt
Anchor    Philemon          78-Phm-morphgnt.txt
Target    Colossians        72-Col-morphgnt.txt    20
Target    2 Thessalonians   74-2Th-morphgnt.txt    10
Target    Ephesians         70-Eph-morphgnt.txt    31
Target    1 Timothy         75-1Ti-morphgnt.txt    20
Target    2 Timothy         76-2Ti-morphgnt.txt    15
Target    Titus             77-Tit-morphgnt.txt     8
Control   Hebrews           79-Heb-morphgnt.txt    65

Embedding Model

Ancient Greek BERT

We use pranaydeeps/Ancient-Greek-BERT, a BERT-base model pre-trained on a large corpus of Ancient Greek texts including literary, philosophical, and religious writings. The model uses a custom Ancient Greek tokeniser and vocabulary.

  • Architecture: BERT-base (12 layers, 768 hidden, 12 heads)
  • Pre-training corpus: Ancient Greek texts
  • Tokeniser: WordPiece, custom Greek vocabulary
  • Usage: zero-shot (no fine-tuning)

Embedding Extraction

For each 150-word chunk:

  • Tokenise with model tokeniser (max 512 sub-word tokens)
  • Forward pass through the model
  • Extract last hidden state
  • Apply attention-mask-weighted mean pooling
  • Result: one 768-dimensional vector per chunk

Mean pooling with attention masking was selected over CLS-token extraction based on empirical testing showing more stable results for this corpus.
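The pooling arithmetic above can be shown in isolation (model loading is omitted here; `mean_pool` is an illustrative name for the sketch):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Attention-mask-weighted mean pooling over sub-word embeddings.

    last_hidden_state: (batch, seq_len, hidden) from the model's final layer
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts                          # (batch, hidden)

# Dummy example: two real tokens and one padding token
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)  # tensor([[2., 3.]])
```

With the real model, `last_hidden_state` comes from a forward pass of `AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")` under `torch.no_grad()`, and the mask comes from the tokeniser output; masking ensures padding tokens never dilute the chunk vector.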

Reproducibility

All code and data are open-source. Follow these steps to reproduce the analysis.

Quick Start

# Clone the repository
git clone https://github.com/Agnieszkachr/neural-scribe-pilot.git
cd neural-scribe-pilot

# Install dependencies
pip install -r requirements.txt

# Step 1: Run classic stylometry baseline
python run_classic_baseline.py

# Step 2: Run neural probe
python run_neural_probe.py

# Step 3: Generate visualizations
python visualize_dissociation.py
python create_gradient_plot.py

Dependencies

Package        Purpose
transformers   HuggingFace model loading (Ancient Greek BERT)
torch          Neural network inference
numpy          Numerical computation
scipy          Statistical testing (t-tests, Spearman ρ)
scikit-learn   PCA, dimensionality reduction
matplotlib     Static figure generation

Project Structure

neural-scribe-pilot/
├── data_loader.py           # Corpus fetching & segmentation
├── run_classic_baseline.py  # PCA on MFW z-scores
├── run_neural_probe.py      # One-class neural classification
├── visualize_dissociation.py # Method comparison plot
├── create_gradient_plot.py  # Gradient scatter
├── requirements.txt
├── data/                    # MorphGNT files (auto-downloaded)
├── results/                 # JSON outputs & embeddings
│   ├── classic_results.json
│   ├── neural_results.json
│   └── embeddings.npz
└── site/                    # This website