About This Project

The Neural Scribe investigates whether modern neural language models can detect authorship differences in ancient Greek texts. This page documents the methodology, data sources, and how to reproduce the analysis.

Analysis Pipeline

The analysis proceeds in four stages, each designed to minimise bias and ensure reproducibility.

Corpus Preparation

Texts are fetched from the MorphGNT repository (morphological annotations of the SBLGNT). Each text is segmented into overlapping chunks of 150 words with a 75-word stride, giving 50% overlap between consecutive chunks to increase resolution along the text while preserving local context. Only the lemma column is extracted, with punctuation stripped, so the neural pipeline operates on the same token stream as the classic baseline.
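The segmentation step can be sketched as follows; `chunk_lemmas` is an illustrative name (the actual logic lives in data_loader.py and may differ), but it shows the 150-word window with a 75-word stride:

```python
def chunk_lemmas(lemmas, size=150, stride=75):
    """Segment a lemma sequence into overlapping fixed-size chunks.

    With size=150 and stride=75, consecutive chunks share 75 lemmas,
    so each position (away from the edges) is covered by two chunks.
    """
    chunks = []
    for start in range(0, len(lemmas), stride):
        chunk = lemmas[start:start + size]
        if len(chunk) < size:  # drop a short trailing remainder
            break
        chunks.append(chunk)
    return chunks

# Example: 300 dummy lemmas -> chunks starting at positions 0, 75, 150
lemmas = [f"w{i}" for i in range(300)]
chunks = chunk_lemmas(lemmas)  # 3 chunks of 150 lemmas each
```

Dropping the short final remainder keeps all chunks the same length, which simplifies the frequency and embedding steps downstream.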

Classic Stylometry Baseline

The classic baseline applies PCA to z-scored relative frequencies of the 100 most frequent words (MFW), a standard approach in computational stylometry. This establishes that surface-level word distributions alone cannot distinguish the disputed letters from undisputed Paul, motivating the neural approach. Chunks are treated as documents, and the PCA coordinates are saved for visualisation.
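A minimal sketch of this baseline, assuming the NumPy/scikit-learn stack from the requirements; `mfw_pca` is a hypothetical name, not necessarily the function in run_classic_baseline.py:

```python
import numpy as np
from collections import Counter
from sklearn.decomposition import PCA

def mfw_pca(chunks, n_mfw=100, n_components=2):
    """PCA on z-scored relative frequencies of the most frequent words.

    chunks: list of token lists (one list per chunk-document).
    Returns PCA coordinates (n_chunks, n_components) and the MFW vocabulary.
    """
    totals = Counter(tok for chunk in chunks for tok in chunk)
    vocab = [w for w, _ in totals.most_common(n_mfw)]
    col = {w: j for j, w in enumerate(vocab)}
    freqs = np.zeros((len(chunks), len(vocab)))
    for i, chunk in enumerate(chunks):
        for w, c in Counter(chunk).items():
            if w in col:
                freqs[i, col[w]] = c / len(chunk)         # relative frequency
    z = (freqs - freqs.mean(0)) / (freqs.std(0) + 1e-12)  # z-score per word
    coords = PCA(n_components=n_components).fit_transform(z)
    return coords, vocab
```

Z-scoring each word column puts frequent and rare MFW on the same scale before PCA, so no single high-frequency word dominates the components.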

Neural Embedding Probe

Each chunk is embedded using pranaydeeps/Ancient-Greek-BERT with mean pooling over attention-masked tokens. A Pauline centroid and standard deviation are computed from the 314 undisputed chunks only (a one-class approach: no disputed text influences the reference distribution). Each target chunk's cosine distance from this centroid is then z-scored against the distribution of Pauline distances to produce a σ-distance.
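The one-class scoring can be sketched as follows; `sigma_distances` is an illustrative name, and the real script may organise the centroid and normalisation differently:

```python
import numpy as np

def sigma_distances(pauline_emb, target_emb):
    """One-class cosine-distance scoring against a Pauline centroid.

    pauline_emb: (N, 768) embeddings of the undisputed chunks
    target_emb:  (M, 768) embeddings of the chunks to score
    Returns each target chunk's cosine distance from the Pauline centroid,
    z-scored against the Pauline chunks' own distances (sigma-distance).
    """
    centroid = pauline_emb.mean(axis=0)

    def cos_dist(X, c):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        c = c / np.linalg.norm(c)
        return 1.0 - X @ c

    # Reference distribution: how far Pauline chunks sit from their centroid
    ref = cos_dist(pauline_emb, centroid)
    mu, sd = ref.mean(), ref.std()
    return (cos_dist(target_emb, centroid) - mu) / sd
```

By construction, the undisputed chunks themselves average a σ-distance of zero, so a target text with a strongly positive mean sits outside the normal Pauline spread.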

Statistical Testing

For each disputed text, a one-sample t-test checks whether its mean σ-distance differs significantly from zero. Cohen's d measures practical effect size. A Spearman rank correlation tests whether the distances form a gradient matching scholarly consensus rejection rates. Percentile analyses examine what fraction of chunks exceed P75/P90/P95 thresholds of the Pauline distribution.

Data Sources

📜 MorphGNT / SBLGNT

The primary text source is the MorphGNT project, which provides morphological annotations of the SBL Greek New Testament. Each word is annotated with part of speech, parsing code, text form, word form, normalised form, and lemma.

github.com/morphgnt

📚 Scholarly Consensus

Rejection rates for Pauline authorship are based on surveys of critical scholarship (Ehrman 2012, Brown 1997, Kümmel 1975). The rates represent the approximate percentage of scholars who consider each letter pseudepigraphal or non-Pauline.

Texts Analysed

Group     Text              MorphGNT File          Chunks
Anchor    Romans            66-Ro-morphgnt.txt     47
Anchor    1 Corinthians     67-1Co-morphgnt.txt    46
Anchor    2 Corinthians     68-2Co-morphgnt.txt    30
Anchor    Galatians         69-Ga-morphgnt.txt     15
Anchor    Philippians       71-Php-morphgnt.txt
Anchor    1 Thessalonians   73-1Th-morphgnt.txt
Anchor    Philemon          78-Phm-morphgnt.txt
Target    Colossians        72-Col-morphgnt.txt    20
Target    2 Thessalonians   74-2Th-morphgnt.txt    10
Target    Ephesians         70-Eph-morphgnt.txt    31
Target    1 Timothy         75-1Ti-morphgnt.txt    20
Target    2 Timothy         76-2Ti-morphgnt.txt    15
Target    Titus             77-Tit-morphgnt.txt     8
Control   Hebrews           79-Heb-morphgnt.txt    65

Embedding Model

Ancient Greek BERT

We use pranaydeeps/Ancient-Greek-BERT, a BERT-base model pre-trained on a large corpus of Ancient Greek texts including literary, philosophical, and religious writings. The model uses a custom Ancient Greek tokeniser and vocabulary.

  • Architecture: BERT-base (12 layers, 768 hidden, 12 heads)
  • Pre-training corpus: Ancient Greek texts
  • Tokeniser: WordPiece, custom Greek vocabulary
  • Usage: zero-shot (no fine-tuning)

Embedding Extraction

For each 150-word chunk:

  • Tokenise with model tokeniser (max 512 sub-word tokens)
  • Forward pass through the model
  • Extract last hidden state
  • Apply attention-mask-weighted mean pooling
  • Result: one 768-dimensional vector per chunk

Mean pooling with attention masking was selected over CLS-token extraction based on empirical testing showing more stable results for this corpus.
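The pooling arithmetic above can be shown in isolation (model loading is omitted here; `mean_pool` is an illustrative name for the sketch):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Attention-mask-weighted mean pooling over sub-word embeddings.

    last_hidden_state: (batch, seq_len, hidden) from the model's final layer
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts                          # (batch, hidden)

# Dummy example: two real tokens and one padding token
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)  # tensor([[2., 3.]])
```

With the real model, `last_hidden_state` comes from a forward pass of `AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")` under `torch.no_grad()`, and the mask comes from the tokeniser output; masking ensures padding tokens never dilute the chunk vector.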

Reproducibility

All code and data are open-source. Follow these steps to reproduce the analysis.

Quick Start

# Clone the repository
git clone https://github.com/Agnieszkachr/neural-scribe-pilot.git
cd neural-scribe-pilot

# Install dependencies
pip install -r requirements.txt

# Step 1: Run classic stylometry baseline
python run_classic_baseline.py

# Step 2: Run neural probe
python run_neural_probe.py

# Step 3: Generate visualizations
python visualize_dissociation.py
python create_gradient_plot.py

Dependencies

Package        Purpose
transformers   HuggingFace model loading (Ancient Greek BERT)
torch          Neural network inference
numpy          Numerical computation
scipy          Statistical testing (t-tests, Spearman ρ)
scikit-learn   PCA, dimensionality reduction
matplotlib     Static figure generation

Project Structure

neural-scribe-pilot/
├── data_loader.py           # Corpus fetching & segmentation
├── run_classic_baseline.py  # PCA on MFW z-scores
├── run_neural_probe.py      # One-class neural classification
├── visualize_dissociation.py # Method comparison plot
├── create_gradient_plot.py  # Gradient scatter
├── requirements.txt
├── data/                    # MorphGNT files (auto-downloaded)
├── results/                 # JSON outputs & embeddings
│   ├── classic_results.json
│   ├── neural_results.json
│   └── embeddings.npz
└── site/                    # This website