The Neural Scribe — Authorship Gradient Detection in the Pauline Corpus

Overview

Can neural language models detect what two centuries of scholarship have debated? We use zero-shot embeddings from Ancient Greek BERT to measure semantic distance between undisputed Pauline texts and letters of contested authorship.

314 Pauline anchor chunks
169 disputed text chunks
Spearman ρ gradient: 0.704
Hebrews divergence: p < 0.0001

Research Question

Scholars have long debated which New Testament letters attributed to Paul were genuinely written by him. Traditional stylometry (word-frequency analysis) struggles because the disputed letters are stylistically very close to authentic Paul. Our approach asks: do modern neural embeddings — which capture deep semantic and syntactic patterns — reveal an authorship gradient that correlates with scholarly consensus?

Text	Scholarly Status	Rejection Rate
Undisputed Paul	Authentic	0%
Colossians	Contested	~40%
2 Thessalonians	Contested	~50%
Ephesians	Contested	~60%
1 Timothy	Likely pseudepigraphal	~80%
2 Timothy	Likely pseudepigraphal	~80%
Titus	Likely pseudepigraphal	~80%
Hebrews	Non-Pauline (consensus)	100%

Methodology

A one-class classification approach that avoids circular reasoning by building the baseline exclusively from undisputed texts.

🧠 One-Class Classification

The baseline centroid and variance are computed using only undisputed Pauline texts (Romans, 1–2 Corinthians, Galatians, Philippians, 1 Thessalonians, Philemon). Disputed texts are then measured against this baseline without influencing it — avoiding the circular reasoning that plagues traditional multi-class approaches.

📐 Distance Metric

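Cosine distance between each mean-pooled chunk embedding and the Pauline centroid, z-scored against the mean and standard deviation of the undisputed Pauline chunks' own distances:

0.0σ → the typical distance of a Pauline chunk from the centroid
Positive → further from the centroid than typical Paul
Negative → closer to the centroid than typical Paul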

Cosine distance is the literature standard for sentence embeddings (Reimers & Gurevych 2019), measuring directional similarity independent of magnitude.
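
The one-class baseline and the z-scored distance described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, the vectors are toy 3-dimensional stand-ins for mean-pooled BERT embeddings, and plain Python lists are used so the sketch runs without dependencies (a real pipeline would more likely use NumPy or PyTorch).

```python
import math
from statistics import mean, stdev


def cosine_distance(u, v):
    """1 - cosine similarity; depends on direction only, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def fit_baseline(pauline_vectors):
    """Build the baseline from undisputed chunks only (one-class setup)."""
    n = len(pauline_vectors)
    centroid = [sum(col) / n for col in zip(*pauline_vectors)]
    dists = [cosine_distance(v, centroid) for v in pauline_vectors]
    # Mean and SD of the Pauline chunks' own distances define the sigma scale.
    return centroid, mean(dists), stdev(dists)


def z_distance(vector, baseline):
    """Distance of one chunk from the centroid, in Pauline sigma units."""
    centroid, mu, sigma = baseline
    return (cosine_distance(vector, centroid) - mu) / sigma


# Toy data (assumed): four "Pauline" chunks and one clearly different chunk.
pauline = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.2], [0.8, 0.3, 0.0]]
baseline = fit_baseline(pauline)
pauline_z = [z_distance(v, baseline) for v in pauline]
outlier_z = z_distance([0.1, 1.0, 0.9], baseline)
```

By construction the Pauline z-scores average to zero, which is why a disputed text can land below zero: its chunks sit closer to the centroid than a typical Pauline chunk does.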

📊 Statistical Framework

One-sample t-test: target mean ≠ 0
Cohen's d: effect size for practical significance
Spearman ρ: gradient correlation with scholarly consensus
Percentile analysis: chunk-level distribution outside P75/P90/P95
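
As a rough sketch of the per-text statistics listed above, the one-sample t statistic and Cohen's d can both be computed from a text's chunk-level z-distances. The helper name and the sample values are illustrative assumptions; the p-value itself would come from a t distribution with n − 1 degrees of freedom (for example via `scipy.stats.ttest_1samp`), which is left out here to keep the example dependency-free.

```python
import math
from statistics import mean, stdev


def one_sample_summary(z_scores):
    """Summarise a text's chunk distances against the null value of 0."""
    n = len(z_scores)
    m = mean(z_scores)
    s = stdev(z_scores)              # sample SD (n - 1 denominator)
    se = s / math.sqrt(n)            # standard error of the mean
    return {
        "n": n,
        "mean": m,
        "cohens_d": m / s,           # effect size relative to the null of 0
        "t": m / se,                 # one-sample t statistic
        "df": n - 1,
    }


# Hypothetical chunk-level distances for one disputed text.
sample = [0.9, 1.4, -0.2, 0.7, 1.1, 0.3, 1.6, 0.5]
summary = one_sample_summary(sample)
```

Note that t = d · √n, so a small text needs a larger effect size than a long one to reach the same significance level.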

⚙️ Technical Setup

Model: pranaydeeps/Ancient-Greek-BERT
Pooling: mean with attention mask
Window: 150 words, 75-word stride
Significance: α = 0.05 (two-tailed)
Data: SBLGNT / MorphGNT morphological annotations
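
The windowing in the setup above (150 words, 75-word stride) might look like the sketch below. How the original pipeline treats a trailing partial window is not stated, so dropping it is an assumption, as is the function name.

```python
def sliding_chunks(words, window=150, stride=75):
    """Split a token list into overlapping fixed-size windows.

    Assumption: trailing partial windows are dropped; a text shorter
    than one window is returned as a single short chunk.
    """
    if len(words) <= window:
        return [words]
    last_start = len(words) - window
    return [words[s:s + window] for s in range(0, last_start + 1, stride)]


# Placeholder tokens standing in for a 400-word Greek text.
words = [f"w{i}" for i in range(400)]
chunks = sliding_chunks(words)
```

With a stride of half the window, each chunk shares its second half with the first half of the next one.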

Authorship Gradient

Semantic distance from the Pauline baseline plotted against scholarly rejection rate. The positive trend (ρ = 0.704) suggests that texts more frequently rejected by scholars tend to diverge further in neural embedding space, although with only seven texts the correlation does not reach significance at α = 0.05 (p = 0.077).

Spearman ρ = 0.704, p = 0.077 (n = 7 texts)

Reading the Chart

Each point represents a disputed text. The x-axis shows the percentage of scholars who reject Pauline authorship; the y-axis shows how far the text's embedding centroid sits from the undisputed Pauline centroid (in standard deviations). The dashed trend line captures the positive correlation. Hebrews and 1 Timothy — coloured in rose — stand out as the clearest outliers, while 2 Thessalonians and Ephesians (green) sit close to or below the Pauline baseline.
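
The reported correlation can be recomputed from the page's own numbers: the rejection rates in the Research Question table and the mean distances in the Statistical Results table. The sketch below is a dependency-free Spearman implementation with average ranks for ties; the helper names are illustrative, and the t statistic is the usual approximation with n − 2 degrees of freedom.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Order: Colossians, 2 Thess, Ephesians, 1 Tim, 2 Tim, Titus, Hebrews.
rejection = [40, 50, 60, 80, 80, 80, 100]
distance = [0.24, -0.22, -0.03, 0.79, 0.11, 1.11, 1.21]

rho = spearman_rho(rejection, distance)
n = len(rejection)
t_stat = rho * math.sqrt((n - 2) / (1 - rho ** 2))
print(round(rho, 3), round(t_stat, 2))  # 0.704 2.22
```

A t of about 2.22 on 5 degrees of freedom falls between the two-tailed critical values for p = 0.10 and p = 0.05, in line with the reported p = 0.077.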

Statistical Results

Distance from Paul (in σ units) with 95% confidence intervals and significance testing. Only Hebrews and 1 Timothy reach statistical significance.

Error bars show 95% confidence intervals · Baseline at σ = 0

Text	N chunks	Distance (σ)	95% CI	P-Value	Cohen's d	Verdict
Paul (baseline)	314	0.00	—	—	—	Baseline
Colossians	20	0.24	[−0.20, 0.69]	0.261	0.25	Indistinguishable
2 Thessalonians	10	−0.22	[−0.70, 0.26]	0.320	−0.26	Indistinguishable
Ephesians	31	−0.03	[−0.39, 0.33]	0.868	−0.03	Indistinguishable
1 Timothy	20	0.79	[0.39, 1.20]	0.0007	0.84	Significant ✦
2 Timothy	15	0.11	[−0.61, 0.83]	0.748	0.09	Indistinguishable
Titus	8	1.11	[−0.10, 2.32]	0.067	0.89	Marginal
Hebrews	65	1.21	[0.75, 1.67]	<0.0001	0.81	Significant ✦

Chunk Distribution Analysis

What percentage of each text's chunks fall outside Pauline percentile thresholds? Texts with true authorial divergence should show many chunks in the tails.

Dashed lines show expected Pauline baselines (25%, 10%, 5%)

Interpreting the Bars

By definition, 25% of Pauline chunks exceed P75, 10% exceed P90, and 5% exceed P95. Values substantially above these baselines indicate systematic divergence. Hebrews shows the most extreme pattern: 58.5% of its chunks exceed P75 and 38.5% exceed P95, nearly eight times the expected 5%. At the P75 threshold alone, 1 Timothy is higher still, with 65% of its chunks above it. In contrast, 2 Thessalonians shows fewer extreme chunks than expected, with 0% above P90, which is consistent with its negative mean distance.
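
The percentile analysis amounts to taking thresholds from the Pauline distance distribution and counting how many of a target text's chunks exceed them. The following is a minimal sketch with made-up distances; the function names and the linear-interpolation percentile are assumptions rather than the project's actual choices.

```python
def percentile(sorted_values, q):
    """Percentile with linear interpolation between neighbouring values."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def exceedance_rates(baseline, target, cutoffs=(75, 90, 95)):
    """Percentage of target chunks above each baseline percentile."""
    ordered = sorted(baseline)
    rates = {}
    for q in cutoffs:
        threshold = percentile(ordered, q)
        rates[q] = 100 * sum(d > threshold for d in target) / len(target)
    return rates


# Toy data (assumed): 101 evenly spaced baseline distances, 5 target chunks.
baseline = [i / 100 for i in range(101)]
target = [0.5, 0.8, 0.92, 0.97, 0.99]
rates = exceedance_rates(baseline, target)
print(rates)  # {75: 80.0, 90: 60.0, 95: 40.0}
```

Comparing each rate with its expected value (25%, 10%, 5%) gives the multiples quoted in the text, for example 38.5 / 5 ≈ 7.7 for Hebrews at P95.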

Neural Embedding Space

PCA projection of BERT embedding chunks reveals clear geometric separation between Pauline and non-Pauline texts in the latent space.

PCA of Ancient Greek BERT embeddings · 150-word chunks, mean pooled

Three Distinct Regions

The scatter plot shows embedding chunks projected via PCA. Pauline texts (indigo dots) form a tight cluster in the positive PC1 region. Hebrews (rose triangles) separates cleanly into negative PC1, confirming neural divergence. Colossians (amber diamonds) occupies a distinct position on PC2, intermediate between Paul and Hebrews — consistent with its contested status. This geometric structure is invisible to classic word-frequency methods.

Classic Stylometry Baseline

PCA on most-frequent-word z-scores (100 MFW) shows high lexical similarity across all texts — classic methods cannot distinguish disputed letters from Paul.

PCA on 100 MFW z-scores · PC1 explains 4.98%, PC2 explains 3.72% of variance

Why Classic Stylometry Fails Here

Classic stylometry relies on surface-level word frequencies. All Pauline and pseudo-Pauline texts cluster tightly in PCA space (blue box) because the disputed authors successfully mimicked Paul's vocabulary. Only Hebrews shows slight separation on PC2. Colossians (amber) is the most distant on PC1, yet the total variance explained is extremely low (4.98% + 3.72%), meaning even these small differences may be noise. This motivates the neural approach, which captures deeper semantic and syntactic patterns invisible to word-counting methods.
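
For comparison, the feature matrix behind a most-frequent-word baseline could be built roughly as follows before PCA is applied. This is a sketch under assumptions: the function name is hypothetical, the three tiny "documents" are toy token lists, and the population standard deviation is used for the z-scores, which may differ from the original setup.

```python
from collections import Counter
from statistics import mean, pstdev


def mfw_zscores(docs, n_mfw=100):
    """Relative frequencies of the corpus-wide most frequent words, z-scored per word."""
    counts = [Counter(doc) for doc in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = [w for w, _ in total.most_common(n_mfw)]

    # Relative frequency of each MFW in each document.
    rel = [[c[w] / len(doc) for w in vocab] for c, doc in zip(counts, docs)]

    # Z-score each word column across documents.
    cols = list(zip(*rel))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) for col in cols]
    z = [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mus, sds)]
        for row in rel
    ]
    return vocab, z


# Toy token lists (assumed), not real chunks from the corpus.
docs = [
    ["καὶ", "ὁ", "καὶ", "ἐν"],
    ["καὶ", "ὁ", "ὁ", "δὲ"],
    ["ἐν", "καὶ", "δὲ", "δὲ"],
]
vocab, z = mfw_zscores(docs, n_mfw=3)
```

Because these features only count how often common words occur, two authors with similar function-word habits produce nearly identical rows, whatever they do with syntax or meaning.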

Method Dissociation

The neural probe detects divergence where classic stylometry does not — a dissociation that validates the complementary value of deep embeddings.

Left axis: neural distance (σ) · Right axis: classic lexical similarity (0–1)

Two Methods, Two Stories

Classic stylometry (amber bars) shows uniformly high lexical overlap with Paul across all texts, ranging from 0.82 (Hebrews) to 0.97 (2 Thessalonians). The neural distance measure (indigo bars) reveals a starkly different picture: Hebrews (1.21σ), Titus (1.11σ), and 1 Timothy (0.79σ) stand well above the baseline, although the Titus estimate rests on only 8 chunks and is marginal (p = 0.067), while 2 Thessalonians (−0.22σ) and Ephesians (−0.03σ) remain close to or below it. This dissociation suggests that neural embeddings capture authorship-related information that surface-level word frequencies miss.

Key Findings

What the neural probe reveals about the Pauline corpus.

✅ Gradient Detected

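Semantic distance correlates positively with scholarly rejection rates (ρ = 0.704). Texts more frequently rejected as non-Pauline tend to show greater neural divergence, although the trend is not strictly monotonic (2 Thessalonians and Ephesians sit below Colossians, and 2 Timothy sits far below the other Pastorals) and, with only 7 data points, it falls short of conventional significance (p = 0.077).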

✅ Hebrews Validated

Hebrews — universally regarded as non-Pauline — shows the strongest divergence (1.21σ, p < 0.0001, d = 0.81), with 38.5% of chunks outside the 95th percentile. This validates the method's ability to detect genuine authorial differences.

🔍 Pastoral Split

1 Timothy shows clear divergence (d = 0.84, p = 0.0007), while 2 Timothy remains indistinguishable (d = 0.09, p = 0.748). This may reflect different compositional histories or varying degrees of authentic material embedded in each letter — a finding consistent with some recent scholarship.

📌 Contested Letters Close

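2 Thessalonians (−0.22σ) and Ephesians (−0.03σ) show slightly negative distances, meaning their chunks sit marginally closer to the centroid than a typical Pauline chunk, but neither differs significantly from the baseline (p = 0.320 and p = 0.868). This is consistent with either authentic authorship or high-quality stylistic mimicry that captures deep semantic patterns beyond mere word-frequency imitation.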

Limitations & Future Directions

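Overlapping chunks: the 75-word stride makes adjacent 150-word chunks share half their words, so the nominal chunk counts overstate the effective sample size and the reported p-values and confidence intervals are likely too optimistic; future work will use dependence-aware inference (block bootstrap).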
Zero-shot only: Fine-tuned domain-specific models may reveal additional signal currently masked by generic embeddings.
Punctuation removed: Greek punctuation was stripped to match the classical baseline; the neural pipeline should retain it for richer syntactic encoding.
Small sample: the gradient falls short of significance (p = 0.077) with only 7 data points; expanding the corpus to additional NT texts will increase statistical power.
Single model: Cross-model convergence analyses (multiple embedding architectures) will strengthen confidence in the detected gradient.
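
To illustrate the dependence-aware inference mentioned in the first limitation, here is a minimal moving-block bootstrap for the mean of a text's chunk distances. It is a sketch of one possible approach, not the planned implementation: the function name, the block length of 2 (adjacent chunks overlap, chunks two steps apart do not), the number of resamples, and the sample values are all assumptions.

```python
import random
from statistics import mean


def block_bootstrap_ci(values, block_len=2, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling contiguous blocks of chunks."""
    rng = random.Random(seed)
    n = len(values)
    starts = range(n - block_len + 1)      # valid block start positions
    n_blocks = -(-n // block_len)          # ceil(n / block_len)
    boot_means = []
    for _ in range(n_boot):
        resample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            resample.extend(values[s:s + block_len])
        boot_means.append(mean(resample[:n]))  # trim to the original length
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical chunk-level distances, in document order.
sample = [0.9, 1.4, 1.2, 0.7, 0.1, 0.3, 1.6, 1.5, 0.5, -0.2, 0.4, 0.8]
ci_low, ci_high = block_bootstrap_ci(sample)
```

Resampling blocks rather than single chunks keeps neighbouring, overlapping chunks together, so the interval reflects their dependence instead of treating each chunk as an independent observation.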
