The Neural Scribe investigates whether modern neural language models can detect authorship differences in ancient Greek texts. This page documents the methodology, data sources, and how to reproduce the analysis.
The analysis proceeds in four stages, each designed to minimise bias and ensure reproducibility.
Texts are fetched from the MorphGNT repository (morphological annotations of the SBLGNT). Each text is segmented into overlapping chunks of 150 words with a 75-word stride, maximising positional resolution along the text while preserving local context. Only the lemma column is extracted, with punctuation stripped, to align with the classic baseline pipeline.
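The chunking step can be sketched as follows. This is a minimal illustration of 150-word windows with a 75-word stride; the function name `segment` and the choice to drop a short trailing chunk are assumptions, not taken from the repository.

```python
def segment(lemmas, size=150, stride=75):
    """Split a lemma sequence into overlapping fixed-size chunks.

    A trailing fragment shorter than `size` is dropped so every chunk
    has identical length (an illustrative choice, not confirmed by the
    repository code).
    """
    return [lemmas[i:i + size]
            for i in range(0, len(lemmas) - size + 1, stride)]


# A 400-lemma text yields chunks starting at positions 0, 75, 150, 225.
words = [f"w{i}" for i in range(400)]
chunks = segment(words)
```

With a 75-word stride every lemma (except near the text edges) appears in two chunks, which is what gives the overlapping coverage described above.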
The baseline is a traditional stylometric analysis: PCA on z-scored frequencies of the 100 most frequent words (MFW). It establishes that surface-level word distributions cannot distinguish the disputed letters from undisputed Paul, motivating the neural approach. Chunks are treated as documents, and PCA coordinates are saved for visualisation.
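The MFW baseline can be sketched in a few lines. This is a simplified, NumPy-only version (PCA via SVD rather than scikit-learn, which the pipeline itself uses); the function name `mfw_pca` and the toy corpus are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def mfw_pca(chunks, n_mfw=100, n_components=2):
    # Rank lemmas by total frequency across the whole corpus.
    totals = Counter(w for c in chunks for w in c)
    vocab = [w for w, _ in totals.most_common(n_mfw)]
    # Relative frequency of each MFW in each chunk.
    X = np.array([[c.count(w) / len(c) for w in vocab] for c in chunks])
    # z-score each feature column (guarding zero-variance columns).
    sd = X.std(axis=0)
    X = (X - X.mean(axis=0)) / np.where(sd == 0, 1.0, sd)
    # PCA via SVD of the centred matrix; project onto top components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T


# Toy corpus: six 50-lemma chunks drawn from five function words.
rng = np.random.default_rng(0)
toy = [list(rng.choice(["kai", "de", "en", "ho", "autos"], 50))
       for _ in range(6)]
coords = mfw_pca(toy, n_mfw=5)  # one 2-D coordinate per chunk
```

Restricting the feature set to the most frequent words is the standard stylometric move: function-word frequencies are topic-independent and hard for an author to consciously control.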
Each chunk is embedded using pranaydeeps/Ancient-Greek-BERT with mean pooling
over attention-masked tokens. A Pauline centroid and standard deviation are computed from
the 314 undisputed chunks only (one-class approach). Each target chunk's cosine distance
from this centroid is then z-score normalised to produce a σ-distance metric.
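The one-class σ-distance described above can be sketched as follows. The embeddings here are random placeholders standing in for the BERT chunk vectors, and the function name `sigma_distances` is an assumption; only the procedure (centroid, cosine distance, z-scoring against the undisputed distribution) comes from the text.

```python
import numpy as np

def sigma_distances(pauline, targets):
    """σ-distance: cosine distance from the Pauline centroid, z-scored
    against the distance distribution of the undisputed chunks."""
    centroid = pauline.mean(axis=0)

    def cos_dist(M, v):
        return 1 - (M @ v) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v))

    ref = cos_dist(pauline, centroid)   # reference: undisputed chunks only
    mu, sd = ref.mean(), ref.std()
    return (cos_dist(targets, centroid) - mu) / sd


# Placeholder 8-dim embeddings for 314 undisputed chunks.
rng = np.random.default_rng(1)
pauline = rng.normal(size=(314, 8))
targets = pauline[:5].copy()            # chunks drawn from the same pool
sigma = sigma_distances(pauline, targets)
```

Because the normalisation uses only the undisputed chunks, a target chunk's σ-distance reads directly as "how many standard deviations farther from the Pauline centroid than a typical Pauline chunk".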
For each disputed text, a one-sample t-test checks whether its mean σ-distance differs significantly from zero. Cohen's d measures practical effect size. A Spearman rank correlation tests whether the distances form a gradient matching scholarly consensus rejection rates. Percentile analyses examine what fraction of chunks exceed P75/P90/P95 thresholds of the Pauline distribution.
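The statistical battery can be sketched with SciPy. The σ-distance samples and rejection rates below are synthetic placeholders, not the project's actual numbers, and the helper name `probe_stats` is an assumption.

```python
import numpy as np
from scipy import stats

def probe_stats(sigma_by_text, rejection_rate):
    """Per-text one-sample t-test of mean σ-distance against zero,
    Cohen's d, and a Spearman ρ between mean σ-distance and the
    scholarly rejection rate across texts."""
    per_text = {}
    for name, s in sigma_by_text.items():
        s = np.asarray(s)
        t, p = stats.ttest_1samp(s, 0.0)
        per_text[name] = {"t": t, "p": p,
                          "d": s.mean() / s.std(ddof=1)}  # Cohen's d
    names = list(sigma_by_text)
    rho, _ = stats.spearmanr(
        [np.mean(sigma_by_text[n]) for n in names],
        [rejection_rate[n] for n in names])
    return per_text, rho


# Synthetic σ-distances and made-up rejection rates for illustration.
rng = np.random.default_rng(2)
sigma = {"Colossians": rng.normal(1.0, 0.5, 20),
         "Ephesians": rng.normal(2.0, 0.5, 31),
         "2 Thessalonians": rng.normal(0.5, 0.5, 10)}
rates = {"Colossians": 0.6, "Ephesians": 0.7, "2 Thessalonians": 0.5}
per_text, rho = probe_stats(sigma, rates)
```

The Spearman correlation is the key gradient test: it asks only whether texts scholars reject more strongly also sit farther from the Pauline centroid, without assuming a linear relationship.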
The primary text source is the MorphGNT project, which provides morphological annotations of the SBL Greek New Testament. Each word is annotated with part of speech, parsing code, text form, word form, normalised form, and lemma.
Rejection rates for Pauline authorship are based on surveys of critical scholarship (Ehrman 2012, Brown 1997, Kümmel 1975). The rates represent the approximate percentage of scholars who consider each letter pseudepigraphal or non-Pauline.
| Group | Text | MorphGNT File | Chunks |
|---|---|---|---|
| Anchor | Romans | 66-Ro-morphgnt.txt | 47 |
| Anchor | 1 Corinthians | 67-1Co-morphgnt.txt | 46 |
| Anchor | 2 Corinthians | 68-2Co-morphgnt.txt | 30 |
| Anchor | Galatians | 69-Ga-morphgnt.txt | 15 |
| Anchor | Philippians | 71-Php-morphgnt.txt | — |
| Anchor | 1 Thessalonians | 73-1Th-morphgnt.txt | — |
| Anchor | Philemon | 78-Phm-morphgnt.txt | — |
| Target | Colossians | 72-Col-morphgnt.txt | 20 |
| Target | 2 Thessalonians | 74-2Th-morphgnt.txt | 10 |
| Target | Ephesians | 70-Eph-morphgnt.txt | 31 |
| Target | 1 Timothy | 75-1Ti-morphgnt.txt | 20 |
| Target | 2 Timothy | 76-2Ti-morphgnt.txt | 15 |
| Target | Titus | 77-Tit-morphgnt.txt | 8 |
| Control | Hebrews | 79-Heb-morphgnt.txt | 65 |
We use pranaydeeps/Ancient-Greek-BERT, a BERT-base model pre-trained on
a large corpus of Ancient Greek texts including literary, philosophical, and religious
writings. The model uses a custom Ancient Greek tokeniser and vocabulary.
For each 150-word chunk, the lemma sequence is tokenised, passed through the model, and the final-layer token embeddings are mean-pooled over attention-masked positions to produce a single chunk vector. Mean pooling with attention masking was selected over CLS-token extraction based on empirical testing showing more stable results for this corpus.
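The pooling operation can be shown in isolation. This NumPy sketch mirrors what the pipeline does with torch tensors and the model's last hidden state; the array shapes and the function name `masked_mean_pool` are illustrative assumptions.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Mean-pool token embeddings, counting only real tokens.

    hidden: (batch, seq_len, dim) last-hidden-state array
    mask:   (batch, seq_len) attention mask of 0/1; padding positions
            contribute nothing to the average.
    """
    m = mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * m).sum(axis=1)             # zero out padding
    counts = np.clip(m.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    return summed / counts


# Two sequences of four tokens; the first has two padding positions
# carrying junk values that the mask should exclude.
hidden = np.ones((2, 4, 3))
hidden[0, 2:] = 5.0
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)   # both rows average to 1.0
```

Without the mask, padding embeddings would be averaged into short chunks' vectors, which is presumably why plain (unmasked) mean pooling is less stable than this masked variant.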
All code and data are open-source. Follow these steps to reproduce the analysis.
```shell
# Clone the repository
git clone https://github.com/Agnieszkachr/neural-scribe-pilot.git
cd neural-scribe-pilot

# Install dependencies
pip install -r requirements.txt

# Step 1: Run classic stylometry baseline
python run_classic_baseline.py

# Step 2: Run neural probe
python run_neural_probe.py

# Step 3: Generate visualizations
python visualize_dissociation.py
python create_gradient_plot.py
```
| Package | Purpose |
|---|---|
| transformers | HuggingFace model loading (Ancient Greek BERT) |
| torch | Neural network inference |
| numpy | Numerical computation |
| scipy | Statistical testing (t-tests, Spearman ρ) |
| scikit-learn | PCA, dimensionality reduction |
| matplotlib | Static figure generation |
```
neural-scribe-pilot/
├── data_loader.py            # Corpus fetching & segmentation
├── run_classic_baseline.py   # PCA on MFW z-scores
├── run_neural_probe.py       # One-class neural classification
├── visualize_dissociation.py # Method comparison plot
├── create_gradient_plot.py   # Gradient scatter
├── requirements.txt
├── data/                     # MorphGNT files (auto-downloaded)
├── results/                  # JSON outputs & embeddings
│   ├── classic_results.json
│   ├── neural_results.json
│   └── embeddings.npz
└── site/                     # This website
```