Machine learning models for identifying citation structures in classical texts and resolving bibliographic references to canonical URNs.
Current README largely contains notes for my own use.
Project Status: Early Development
- ✅ Data pipeline implemented (extraction task)
- ✅ Model initialization and embedding handling
- ✅ Fine-tuning for extraction task
- ⏳ XML file processing using fine-tuned model (in progress)
- ⏳ URN resolution implementation (planned)
Requirements:
- Python 3.13+
- uv package manager (recommended) or pip
Setup:
# Clone repository
git clone https://github.com/andrmayo/perseus-citation-model.git
cd perseus-citation-model
# Install with uv (recommended)
uv sync
# Or install with pip
pip install -e ".[dev]"
This project provides two complementary ML tasks for working with citations in TEI-encoded XML documents from the Perseus Digital Library:
- Tag Extraction: Identify and extract citation tags (<cit>, <quote>, <bibl>) from plain text
- URN Resolution: Map bibliographic references to Canonical Text Services (CTS) URNs
Both tasks share data pipelines and preprocessing infrastructure but use different model architectures appropriate to each problem.
Input: Plain text extracted from TEI XML documents
Output: Token-level tags identifying citation boundaries
Target tags:
- <cit> - Citation container
- <quote> - Quoted text
- <bibl> - Bibliographic reference
Note: <cit> tags surround a quote-bibl pair, so they can be inserted by rule once the <bibl> and <quote> elements have been identified.
Challenges:
- Variable citation formats
- Mixed languages (Greek, Latin, English)
- Context-dependent identification
Input: Bibliographic reference text (e.g., "Hdt. 8.82")
Output: CTS URN (e.g., "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82")
Examples:
- "Hom. Il. 7.268" → "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"
- "Thuc. 3.38" → "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3.38"
- "Plat. Rep. 332D" → "urn:cts:greekLit:tlg0059.tlg030.perseus-grc2:332d"
Challenges:
- Abbreviated author names (Hdt., Hom., Thuc.)
- Work title variations and abbreviations
- Range notation (e.g., "7.268-272", "sqq.")
- Unresolvable references to modern scholarship (e.g., "ARV2, 987")
- Missing URNs for ~12% of citations
Training data is in JSONL format with two files:
resolved.jsonl (~216K examples) - Citations with URNs:
{
"bibl": "Hdt. 8.82",
"quote": "",
"xml_context": "...full XML context with tags...",
"filename": "xml_files/viaf17286815.viaf001.xml",
"urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
"ref": "hdt. 8.82",
"n_attrib": "Hdt. 8.82",
"doc_cit_urn": ":citations-28.3"
}
unresolved.jsonl (~30K examples) - Citations without URNs:
{
"bibl": "FR, pl. 167,2",
"quote": "",
"xml_context": "...full XML context with tags...",
"filename": "xml_files/viaf114145308.viaf001.xml",
"urn": "",
"ref": "fr pl. 167,2",
"n_attrib": "",
"doc_cit_urn": ":citations-24.1"
}
Key fields:
- bibl: Bibliographic reference text
- quote: Quoted text (often empty)
- xml_context: XML snippet with tags for tag extraction training
- urn: CTS URN (empty for unresolved citations)
- ref: Normalized reference text
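As a rough illustration of how these records could be read, here is a minimal sketch. `CitationRecord` and `load_citations` are hypothetical names, not the package's actual loader; only the field names come from the examples above.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class CitationRecord:
    """Hypothetical container for the key fields listed above."""
    bibl: str
    quote: str
    xml_context: str
    urn: str
    ref: str

    @property
    def is_resolved(self) -> bool:
        # An empty "urn" field marks an unresolved citation.
        return bool(self.urn)


def load_citations(lines: Iterable[str]) -> Iterator[CitationRecord]:
    """Parse JSONL lines, skipping blank lines and ignoring other fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        yield CitationRecord(
            bibl=item.get("bibl", ""),
            quote=item.get("quote", ""),
            xml_context=item.get("xml_context", ""),
            urn=item.get("urn", ""),
            ref=item.get("ref", ""),
        )
```

An open file handle can be passed directly, since iterating over a file yields its lines.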
Fine-tune a pre-trained transformer model (DeBERTa) for sequence labeling using BIO tagging.
Input Text → Tokenizer → Transformer Encoder → Linear Layer → Softmax → BIO Tags
Each token is labeled with one of:
- O - Outside any citation tag
- B-CIT - Beginning of <cit> tag
- I-CIT - Inside <cit> tag
- B-QUOTE - Beginning of <quote> tag
- I-QUOTE - Inside <quote> tag
- B-BIBL - Beginning of <bibl> tag
- I-BIBL - Inside <bibl> tag
Example:
Text: Hom. Il. 7.268 - 272 : "Ajax hurled a rock"
Tags: B-BIBL I-BIBL I-BIBL I-BIBL I-BIBL I-BIBL O B-QUOTE I-QUOTE I-QUOTE I-QUOTE
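To make the labeling scheme concrete, the sketch below converts entity spans into BIO tags. It works on whitespace tokens for readability, whereas the real pipeline labels subword tokens; `spans_to_bio` is a hypothetical helper name.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["Hom.", "Il.", "7.268", ":", '"Ajax', "hurled", "a", 'rock"']
tags = spans_to_bio(len(tokens), [(0, 3, "BIBL"), (4, 8, "QUOTE")])
```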
Currently, the project uses microsoft/deberta-v3-base, for the following reasons:
- Superior contextual understanding for nested structures
- Better multilingual handling (Greek, Latin, English)
- State-of-the-art performance on token classification
- 1-3% F1 improvement over RoBERTa on similar tasks
One alternative that might be worth trying is roberta-base. It is a good choice if:
- Faster inference is needed (~10-15% faster than DeBERTa)
- Memory is constrained
- A strong baseline is all that is required
To help the model learn label sequences rather than independent per-token labels, I've added a CRF decoding layer. A secondary benefit (in this case) is that it enforces valid label sequences, which here just means that an I- label can only follow a B- or I- label of the same type.
Input Text → Tokenizer → Transformer Encoder → Linear Layer → CRF Layer → BIO Tags
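The validity constraint can be illustrated without a CRF. The sketch below is not the CRF layer itself (a CRF learns transition scores and decodes with Viterbi); it only checks the BIO rule and applies a greedy repair. Both function names are hypothetical.

```python
from typing import List


def is_valid_transition(prev: str, curr: str) -> bool:
    """An I-X tag is only valid directly after B-X or I-X."""
    if not curr.startswith("I-"):
        return True
    entity = curr[2:]
    return prev in (f"B-{entity}", f"I-{entity}")


def repair_bio(tags: List[str]) -> List[str]:
    """Greedy repair: promote an orphaned I-X to B-X."""
    repaired: List[str] = []
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            tag = "B-" + tag[2:]
        repaired.append(tag)
        prev = tag
    return repaired
```

A CRF makes such post-hoc repair unnecessary, because invalid transitions can be given very low scores during decoding.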
The training pipeline automatically splits data by filename to prevent data leakage:
from perscit_model.extraction.train import split_data

train_path, val_path, test_path = split_data(
    input_file="cit_data/resolved.jsonl",
    output_dir="model_data/extraction",
    train_ratio=0.8,  # Default from config
    val_ratio=0.1,
    test_ratio=0.1,
    seed=42  # Default from config
)
Implementation details:
- Splits by filename (not individual examples) to prevent data leakage
- Tries multiple shuffles to get close to target ratios
- Saves split configuration to prevent accidental re-splitting
- Reuses existing splits if configuration matches
- Training done via curriculum learning:
  - Phase 1 on just passages with citations
  - Phase 2 on all dictionary XML files, especially LSJ, with data augmentation described below
  - Phase 3 on commentary XML files, with some <bibl> and <quote> tags randomly retained and their contents labelled as "O" so the model learns to ignore existing tagged citations
NB: The LSJ is about half of the raw amount of text in the phase 2 training
corpus, and has more than half of the citation tokens. The LSJ also has certain
distorting features, above all the consistent use of <author> and <title>
tags in citations. Hence, to prevent over-reliance on these tags, they get
stripped at a rate of 50%. It would be risky to strip them at a higher rate than
this unless done during preprocessing, given that it is likely to produce a
strong correlation between text length and the presence of citations. It should
also probably only be used alongside random token dropping (which makes this
correlation noisier, but doesn't eliminate it in expectation).
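A minimal sketch of the tag-stripping augmentation, assuming the 50% rate is applied per passage (the actual granularity may differ). The function name and the regular expression are illustrative, not taken from the codebase.

```python
import random
import re

# Matches opening and closing <author> and <title> tags, with optional attributes.
_AUTHOR_TITLE_TAG = re.compile(r"</?(?:author|title)\b[^>]*>")


def maybe_strip_author_title(xml_text: str, rng: random.Random, rate: float = 0.5) -> str:
    """With probability `rate`, remove <author>/<title> tags but keep their text."""
    if rng.random() < rate:
        return _AUTHOR_TITLE_TAG.sub("", xml_text)
    return xml_text
```

Passing an explicit `random.Random(seed)` keeps the augmentation reproducible across runs.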
The extraction model was trained on 512-token windows centered around citations. To process full XML documents that exceed this context window, we need a sliding window approach with special handling for:
- Context window boundaries (avoiding edge effects)
- Split entities (citations spanning multiple windows)
- Existing citations (preserving already-identified citations)
- Reconstructing <cit> tags (wrapping bibl-quote pairs)
Problem: Citations near window edges may lack sufficient context, reducing prediction quality.
Solution: Process with overlapping windows, but only trust predictions from the center region.
Algorithm:
- Window parameters:
  - Window size: 512 tokens (same as training)
  - Stride: 256 tokens (50% overlap)
  - Reliable region: Center 256 tokens of each window
  - Exception: First and last windows trust predictions to document edges
- Coverage guarantee: Every position in the document falls in the reliable region of at least one window
- Handles split entities: With 50% overlap, any entity shorter than 256 tokens fits entirely inside at least one window; an entity that crosses the boundary between two reliable regions is stitched back together by the character-level merge described below
Example:
Document: [============================]
Window 1: [512 tokens]
[ reliable ]
Window 2: [512 tokens]
[ reliable ]
Window 3: [512 tokens]
[ reliable ]
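The window layout can be sketched as follows. `plan_windows` is a hypothetical helper; it assumes the reliable regions should tile the document exactly and adds a final window flush with the document end when the regular stride leaves a tail uncovered.

```python
from typing import List, Tuple


def plan_windows(
    n_tokens: int, window: int = 512, stride: int = 256
) -> List[Tuple[int, int, int, int]]:
    """Return (win_start, win_end, trust_start, trust_end) token ranges, end exclusive."""
    margin = (window - stride) // 2  # untrusted tokens on each side (128 by default)
    starts = list(range(0, max(n_tokens - window, 0) + 1, stride))
    # Add a final window flush with the document end if the tail is not covered.
    if starts[-1] + window < n_tokens:
        starts.append(n_tokens - window)
    plan: List[Tuple[int, int, int, int]] = []
    for i, start in enumerate(starts):
        end = min(start + window, n_tokens)
        # The first window trusts from the document start; later windows
        # continue where the previous reliable region stopped.
        trust_start = 0 if i == 0 else plan[-1][3]
        # The last window trusts up to the document end.
        trust_end = n_tokens if i == len(starts) - 1 else end - margin
        plan.append((start, end, trust_start, trust_end))
    return plan
```

For a 1000-token document this yields three windows whose reliable regions cover tokens 0-384, 384-640 and 640-1000.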
Goal: Supplement (not replace) existing citation tags in XML files.
Approach: Post-hoc filtering after inference.
Algorithm:
- Extract existing citations:
  - Parse XML and identify all <bibl>, <quote>, <cit> tags
  - Store as character-level spans: [(start_pos, end_pos, tag_type), ...]
- Strip citation tags for inference:
  - Remove <bibl>, <quote>, <cit> tags (keep other XML tags)
  - Run inference on stripped text
- Merge predictions:
  - Filter out predicted entities that overlap with existing citations
  - Keep existing citations unchanged
  - Insert new predicted citations
Why this works:
- Model sees natural context around existing citations
- Simple conflict resolution (existing citations take precedence)
- Can later extend to more sophisticated merge strategies
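The merge step could look like the sketch below. It assumes existing and predicted spans are expressed in the same character coordinates (those of the tag-stripped text); the names are illustrative.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, tag_type), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open character ranges share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_with_existing(existing: List[Span], predicted: List[Span]) -> List[Span]:
    """Keep all existing spans; add only predictions that touch none of them."""
    kept = [p for p in predicted if not any(overlaps(p, e) for e in existing)]
    return sorted(existing + kept)
```

The pairwise check is quadratic, which should be acceptable for per-document span counts; an interval index would be an option if it ever becomes a bottleneck.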
Problem: Overlapping windows produce multiple predictions for the same tokens.
Solution: Merge predictions at the character level, only using reliable regions.
Algorithm:
- Initialize: Create character-level label array: char_labels = ['O'] * len(text)
- For each window:
  - Get token-level predictions from model
  - Convert to character-level using tokenizer offset mapping
  - Determine reliable character range (e.g., chars corresponding to tokens 128-384)
  - Update char_labels only in the reliable region
- Extract entities: Convert final char_labels to entity spans using BIO logic
Advantages:
- Handles split entities automatically (continuous character regions)
- Clean merging of overlapping predictions
- No special logic needed for window boundaries
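A compact sketch of the character-level merge, under two assumptions: token offsets are document-level character offsets, and whitespace between two tokens of the same entity should not split it. `write_window` and `extract_spans` are hypothetical names.

```python
from typing import List, Tuple


def write_window(
    char_labels: List[str],
    token_tags: List[str],
    offsets: List[Tuple[int, int]],
    trust_start: int,
    trust_end: int,
) -> None:
    """Write tags of tokens with index in [trust_start, trust_end) into char_labels."""
    for idx in range(trust_start, min(trust_end, len(token_tags))):
        tag = token_tags[idx]
        c0, c1 = offsets[idx]
        for c in range(c0, c1):
            # Only the first character of a B- token keeps the B- prefix.
            char_labels[c] = tag if (c == c0 or tag == "O") else "I-" + tag[2:]


def extract_spans(text: str, char_labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert character-level BIO labels into (start, end, label) spans."""
    spans: List[Tuple[int, int, str]] = []
    start = end = 0
    label = None
    for i, tag in enumerate(char_labels):
        if tag == "O":
            # Whitespace between tokens of one entity does not close it.
            if label is not None and not text[i].isspace():
                spans.append((start, end, label))
                label = None
        elif tag.startswith("B-") or tag[2:] != label:
            if label is not None:
                spans.append((start, end, label))
            start, end, label = i, i + 1, tag[2:]
        else:  # I- tag continuing the open entity
            end = i + 1
    if label is not None:
        spans.append((start, end, label))
    return spans


text = "See Hdt. 8.82 now"
labels = ["O"] * len(text)
write_window(labels, ["O", "B-BIBL", "I-BIBL", "O"], [(0, 3), (4, 8), (9, 13), (14, 17)], 0, 4)
spans = extract_spans(text, labels)
```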
Recall: The model doesn't predict <cit> tags (no training examples). These
must be inserted logically.
Pattern matching algorithm:
- Identify adjacent pairs:
  - <bibl> immediately followed by <quote> (same parent, adjacent siblings)
  - <quote> immediately followed by <bibl> (same parent, adjacent siblings)
- Wrapping conditions:
  - Must be direct siblings in XML tree
  - Only whitespace allowed between elements
  - Not already inside a <cit> tag
  - Don't wrap if non-whitespace text appears between them
- Insert <cit> wrapper:
  - Wrap the pair: <cit><bibl>...</bibl><quote>...</quote></cit>
  - Preserve whitespace between elements
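One way to implement these rules is with the standard-library ElementTree API, as in the sketch below. It assumes un-namespaced tag names (real TEI files may carry a namespace prefix) and `wrap_pairs` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

PAIRS = {("bibl", "quote"), ("quote", "bibl")}


def wrap_pairs(root: ET.Element) -> None:
    """Wrap adjacent bibl/quote siblings in a new <cit> element, in place."""
    for parent in list(root.iter()):
        if parent.tag == "cit":
            continue  # children of an existing <cit> are already wrapped
        children = list(parent)
        i = 0
        while i < len(children) - 1:
            first, second = children[i], children[i + 1]
            # Text between siblings is stored in the first element's tail.
            only_ws_between = not (first.tail or "").strip()
            if (first.tag, second.tag) in PAIRS and only_ws_between:
                cit = ET.Element("cit")
                cit.tail, second.tail = second.tail, None
                pos = list(parent).index(first)
                parent.remove(first)
                parent.remove(second)
                cit.extend([first, second])  # first.tail (whitespace) is preserved
                parent.insert(pos, cit)
                i += 2
            else:
                i += 1
```

Working on the parsed tree rather than on raw strings makes the "direct siblings" and "not already inside a cit" conditions straightforward to check.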
Example:
Before: See <bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote> for details.
After: See <cit><bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote></cit> for details.
High-level pipeline:
class XMLCitationProcessor:
    """Process full XML documents with sliding window inference."""

    def __init__(self, model_path, window_size=512, stride=256):
        self.model = InferenceModel(model_path)
        self.window_size = window_size
        self.stride = stride

    def process_file(self, xml_path, preserve_existing=True):
        """
        Process XML file and insert citation tags.

        Steps:
        1. Parse XML and extract existing citations
        2. Strip citation tags from text
        3. Create sliding windows
        4. Run inference on each window
        5. Merge predictions at character level (reliable regions only)
        6. Filter predictions that conflict with existing citations
        7. Wrap bibl-quote pairs in <cit> tags
        8. Insert all tags into final XML
        9. Validate and return result
        """
        pass
Key parameters:
- window_size: Token length of each window (default: 512)
- stride: Token offset between consecutive window starts (default: 256)
- preserve_existing: Keep existing citation tags (default: True)
- max_bibl_quote_distance: Max chars between bibl/quote for wrapping (default: 100)
Transformer only:
- Learning rate: 2e-5 to 5e-5
- Batch size: 16-32
- Epochs: 3-5
- Warmup steps: 500
- Weight decay: 0.01
Minimum:
- GPU: 8GB VRAM (e.g., RTX 2070, T4)
- RAM: 16GB
- Storage: 10GB for model + data
Recommended:
- GPU: 16GB+ VRAM (e.g., V100, A100, RTX 3090)
- RAM: 32GB
- Storage: 50GB
URN resolution maps bibliographic reference strings to Canonical Text Services (CTS) URNs. This is a different ML problem from tag extraction - it's a structured prediction or sequence-to-sequence task rather than token classification.
Strategy: Use perseus-citation-processor for 87.6% of cases, hierarchical DNN classifiers for the remaining 12.4%.
Advantages:
- Guaranteed valid URNs - classification over known catalog, no hallucination
- Interpretable - see which stage (author/work) succeeded or failed
- Efficient - small vocabularies (~500 authors, ~50 works per author)
- High precision - rule-based handles well-formatted citations (87.6%)
- Debuggable - can inspect author confidence vs work confidence separately
Architecture:
Input: "Hdt. 8.82"
↓
[perseus-citation-processor] → 87.6% resolved directly
↓ (if unresolved or low confidence)
[Author Classifier (DNN)] → tlg0016 [conf: 0.95]
↓
[Work Classifier (DNN)] → tlg0016.tlg001 [conf: 0.92]
↓
[Passage Parser (rules)] → 8.82
↓
[Edition Selector (rules)] → perseus-grc2
↓
Assemble: "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"
Components:
- Rule-based baseline: perseus-citation-processor (Go binary)
- Author classifier: DeBERTa classification over ~500 Greek/Latin authors
- Work classifier: DeBERTa classification over works (conditioned on author)
- Passage parser: Rule-based extraction of passage references
- Edition selector: Rule-based (Greek quote → grc, Latin quote → lat)
- Confidence scorer: DeBERTa binary classifier for quality control
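The rule-based tail of the pipeline (passage parsing and URN assembly) might look like the sketch below. The regular expression is an illustrative assumption that covers only simple trailing references such as "8.82" or "332D"; ranges like "7.268-272" and "sqq." are not handled.

```python
import re
from typing import Optional

# Trailing passage reference: dot-separated numbers, optionally ending in a
# Stephanus-style section letter (e.g. "8.82", "7.268", "332D").
_PASSAGE = re.compile(r"(\d+(?:\.\d+)*[A-Ea-e]?)\s*$")


def parse_passage(citation: str) -> Optional[str]:
    """Return the lower-cased trailing passage reference, or None if absent."""
    match = _PASSAGE.search(citation.strip())
    return match.group(1).lower() if match else None


def assemble_urn(namespace: str, work: str, edition: str, passage: str) -> str:
    """Join the pipeline outputs into a CTS URN string."""
    return f"urn:cts:{namespace}:{work}.{edition}:{passage}"


urn = assemble_urn("greekLit", "tlg0016.tlg001", "perseus-grc2", parse_passage("Hdt. 8.82"))
```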
Models: T5, BART, ByT5, or encoder-decoder transformers
Why NOT recommended:
- URNs are structured catalog lookups, not creative text generation
- Can hallucinate invalid author/work combinations
- Less interpretable - black box generation
- Wasteful - learns URN syntax instead of just citation→URN mappings
- No guaranteed validity - requires complex constrained decoding
When to consider:
- If you need to handle completely new authors not in any catalog (unlikely for classical texts)
- If you want to experiment with end-to-end learning
- As a baseline to compare against hierarchical classification
Architecture:
from transformers import T5ForConditionalGeneration, AutoTokenizer
# ByT5 for character-level handling of Greek/Latin
model = T5ForConditionalGeneration.from_pretrained("google/byt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
# Training format: "resolve citation: Hdt. 8.82" → "urn:cts:greekLit:..."
input_text = "resolve citation: Hdt. 8.82"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
# Major issue: May generate syntactically valid but semantically wrong URNs
# e.g., "urn:cts:greekLit:tlg9999.tlg999.perseus-grc2:8.82" (invalid author)
Verdict: Use hierarchical classification (Approach 1) instead.
Strategy: Retrieve similar citations from database, use ML to rank/select.
Advantages:
- Leverages existing resolved citations
- Good for rare author/work combinations
- Can explain predictions via retrieved examples
Architecture:
Input: "Thuc. 3.38"
↓
Embedding model → Dense vector representation
↓
Vector DB → Retrieve top-K similar resolved citations
↓
Ranking model → Score candidates
↓
Output: Highest-scoring URN
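To show the retrieval idea without any model, the sketch below swaps in character trigram cosine similarity for the embedding model and a plain dictionary for the vector database. It is a stand-in for illustration only, and all names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple


def ngrams(text: str, n: int = 3) -> Counter:
    """Bag of character n-grams over the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, resolved: Dict[str, str], k: int = 3) -> List[Tuple[str, str, float]]:
    """Return the k resolved citations most similar to the query, best first."""
    q = ngrams(query)
    scored = [(bibl, urn, cosine(q, ngrams(bibl))) for bibl, urn in resolved.items()]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]
```

A retrieved neighbor only supplies the author and work part of the URN; the passage component would still have to be parsed from the query itself.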
Recommended Approach: Hierarchical Classification (NOT seq2seq)
This implementation uses:
- Rule-based baseline (perseus-citation-processor) for 87.6% of citations
- Hierarchical DNN classifiers for the remaining 12.4%
- Author classifier (DeBERTa)
- Work classifier (DeBERTa, conditioned on author)
- Confidence scorer (DeBERTa) to validate rule-based outputs
Why hierarchical classification over seq2seq?
- Guaranteed valid URNs (no hallucination)
- Interpretable decisions (author vs work failures)
- Efficient (small vocabularies)
- Better uncertainty handling (top-k at each stage)
This project uses the perseus-citation-processor as the rule-based foundation:
Performance on 246K citations:
- ✅ 87.6% resolution rate (216K resolved)
- ❌ 12.4% unresolved (30K citations)
- ⚡ Fast: 28 seconds for 125 XML files
- 📚 Comprehensive author/work mappings (Greek, Latin, Scholia)
The DNN components handle the remaining 12.4% plus add confidence scoring.
Input Citation
↓
[perseus-citation-processor] ← Rule-based (87.6% coverage)
↓
┌───┴────┐
↓ ↓
Resolved Unresolved
↓ ↓
[Confidence [DNN Resolution
Scorer] Model]
↓ ↓
High conf? URN + conf
↓ ↓
Yes→Output Merge→Output
No↘ ↗
[DNN Model]
Purpose: Resolve the 30K citations that rule-based system couldn't handle.
Approach: Hierarchical Classification (NOT seq2seq generation)
Why Classification over Seq2Seq:
- URNs are structured catalog lookups, not creative text
- Guaranteed valid outputs - cannot hallucinate invalid author/work combinations
- Interpretable - see exactly which stage succeeded/failed
- Efficient - small vocabulary (~500 authors × ~50 works = 25K combinations)
- Better uncertainty handling - top-k predictions at each stage
Architecture: Multi-stage classification pipeline
"Hdt. 8.82"
↓
[Author Classifier] → tlg0016 (Herodotus) [confidence: 0.95]
↓
[Work Classifier] → tlg0016.tlg001 (Histories) [confidence: 0.92]
↓
[Edition Selector] → perseus-grc2 (rule-based: Greek text → grc)
↓
[Passage Parser] → 8.82 (rule-based extraction)
↓
Assemble: "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"
Implementation:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class AuthorClassifier:
    """Stage 1: Classify citation to author URN"""

    def __init__(self):
        # Load perseus-citation-processor author catalog
        self.author_catalog = self.load_author_catalog()  # ~500 Greek/Latin authors
        self.author_to_id = {urn: i for i, urn in enumerate(self.author_catalog)}
        self.id_to_author = {i: urn for urn, i in self.author_to_id.items()}
        # DeBERTa classifier over author vocabulary
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/deberta-v3-base",
            num_labels=len(self.author_catalog)
        )
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    def predict(self, citation_text, context=""):
        """Predict top-k authors with confidence scores"""
        # Input format: "[citation] [SEP] [context]"
        text = f"{citation_text} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        # Get predictions
        outputs = self.model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        top_k = torch.topk(probs[0], k=3)
        # Return [(author_urn, confidence), ...]
        return [(self.id_to_author[idx.item()], prob.item())
                for idx, prob in zip(top_k.indices, top_k.values)]

class WorkClassifier:
    """Stage 2: Classify to work URN (conditioned on author)"""

    def __init__(self):
        # Group works by author
        self.works_by_author = self.load_work_catalog()  # {tlg0016: [tlg001, tlg002, ...]}
        self.max_works = max(len(works) for works in self.works_by_author.values())
        # Classifier conditioned on author
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/deberta-v3-base",
            num_labels=self.max_works
        )
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    def predict(self, citation_text, author_urn, context=""):
        """Predict work for given author"""
        # Get candidate works for this author
        candidates = self.works_by_author.get(author_urn, [])
        if not candidates:
            return []
        # Input format: "[author] [SEP] [citation] [SEP] [context]"
        text = f"{author_urn} [SEP] {citation_text} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        outputs = self.model(**inputs)
        # Filter to only valid works for this author
        valid_logits = outputs.logits[0, :len(candidates)]
        probs = torch.softmax(valid_logits, dim=-1)
        # Return [(work_urn, confidence), ...] sorted by confidence
        results = [(candidates[i], probs[i].item()) for i in range(len(candidates))]
        return sorted(results, key=lambda x: x[1], reverse=True)

class HierarchicalURNResolver:
    """Complete hierarchical URN resolution system"""

    def __init__(self):
        self.author_classifier = AuthorClassifier()
        self.work_classifier = WorkClassifier()
        self.passage_parser = PassageParser()  # Rule-based
        self.edition_selector = EditionSelector()  # Rule-based

    def resolve(self, citation_text, context="", quote=""):
        """Resolve citation to URN with confidence score"""
        # Stage 1: Classify author
        author_candidates = self.author_classifier.predict(citation_text, context)
        # Stage 2: For each author candidate, classify work
        urn_candidates = []
        for author_urn, author_conf in author_candidates[:3]:  # Top-3 authors
            work_candidates = self.work_classifier.predict(
                citation_text, author_urn, context
            )
            for work_urn, work_conf in work_candidates[:3]:  # Top-3 works
                # Stage 3: Parse passage (rule-based)
                passage = self.passage_parser.extract(citation_text)
                # Stage 4: Select edition (rule-based)
                edition = self.edition_selector.select(author_urn, quote)
                # Stage 5: Assemble URN
                namespace = self.get_namespace(author_urn)  # greekLit or latinLit
                full_urn = f"urn:cts:{namespace}:{work_urn}.{edition}:{passage}"
                # Combined confidence (product of stages)
                confidence = author_conf * work_conf
                urn_candidates.append((full_urn, confidence, {
                    'author_conf': author_conf,
                    'work_conf': work_conf,
                    'author': author_urn,
                    'work': work_urn
                }))
        # Return best candidate
        if urn_candidates:
            best = max(urn_candidates, key=lambda x: x[1])
            return best[0], best[1], best[2]
        else:
            return None, 0.0, {}
Training Data Format:
# Training examples from resolved.jsonl (216K examples)
# Split into author and work classification tasks
# Stage 1: Author Classification
author_examples = [
{
"citation": "Hdt. 8.82",
"context": "",
"label": "tlg0016" # Herodotus
},
{
"citation": "Soph. OT 151",
"context": "τᾶς πολυχρύσου Πυθῶνος",
"label": "tlg0011" # Sophocles
},
{
"citation": "Plat. Rep. 332D",
"context": "",
"label": "tlg0059" # Plato
}
]
# Stage 2: Work Classification (conditioned on author)
work_examples = [
{
"citation": "Hdt. 8.82",
"author": "tlg0016",
"context": "",
"label": "tlg0016.tlg001" # Histories
},
{
"citation": "Soph. OT 151",
"author": "tlg0011",
"context": "τᾶς πολυχρύσου Πυθῶνος",
"label": "tlg0011.tlg004" # Oedipus Tyrannus
},
{
"citation": "Hom. Il. 7.268",
"author": "tlg0012",
"context": "Ajax hurled a rock",
"label": "tlg0012.tlg001" # Iliad (not Odyssey)
}
]
Features Used:
- Citation text (required): "Hdt. 8.82"
- Context (when available): Surrounding text or quote
- Author URN (for work classification): Condition on stage 1 output
- Ground truth URN: Parse author/work from the urn field
Purpose: Identify incorrect resolutions from perseus-citation-processor.
Problem: From perseus-citation-processor README: "just because the processor resolves a citation doesn't mean that it resolves it correctly"
Model: DeBERTa-based Binary Classifier
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

class URNConfidenceScorer(nn.Module):
    def __init__(self, model_name="microsoft/deberta-v3-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        hidden_size = self.encoder.config.hidden_size
        feature_size = 10  # Engineered features
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + feature_size, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, citation_text, urn, context=""):
        # Encode: "[citation] [SEP] [URN] [SEP] [context]"
        text = f"{citation_text} [SEP] {urn} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        outputs = self.encoder(**inputs)
        pooled = outputs.last_hidden_state[:, 0, :]  # CLS token, shape (1, hidden_size)
        # Compute engineered features, shape (1, feature_size)
        features = self.compute_features(citation_text, urn, context)
        combined = torch.cat([pooled, features], dim=1)
        confidence = self.classifier(combined)
        return confidence

    def compute_features(self, citation, urn, context):
        """Engineered features for confidence scoring (feature helpers not shown)"""
        return torch.tensor([
            self.citation_urn_match_score(citation, urn),
            self.has_ambiguous_abbrev(citation),
            self.context_language_match(context, urn),
            self.passage_validity(urn),
            self.author_frequency(urn),
            self.work_frequency(urn),
            self.quote_presence(context),
            self.citation_length(citation),
            self.urn_complexity(urn),
            self.catalog_match_count(citation)
        ], dtype=torch.float32).unsqueeze(0)  # add batch dimension to match `pooled`
Training Data:
# Positive examples (confident resolutions)
positive_examples = [
{
"citation": "Hdt. 8.82",
"urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
"context": "Greek text context",
"label": 1.0 # High confidence
}
]
# Ambiguous cases (medium confidence)
ambiguous_examples = [
{
"citation": "Arist. Met.", # Metaphysics or Meteorology?
"urn": "urn:cts:greekLit:tlg0086.tlg025.perseus-grc2:",
"context": "",
"label": 0.6 # Medium confidence
}
]
# Incorrect resolutions (low confidence)
negative_examples = [
{
"citation": "Hom. 7.268", # Iliad or Odyssey?
"urn": "urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:7.268", # Odyssey
"context": "Ajax hurled a rock", # Ajax is in Iliad!
"label": 0.1 # Low confidence - context mismatch
}
]
Purpose: Combine rule-based and DNN intelligently.
class HybridURNResolver:
    def __init__(self):
        self.rule_based = PerseusProcessorWrapper()  # Calls Go binary
        self.dnn_resolver = HierarchicalURNResolver()  # Hierarchical classifier
        self.confidence_scorer = URNConfidenceScorer()
        # Thresholds (tune on validation set)
        self.high_confidence_threshold = 0.85
        self.low_confidence_threshold = 0.50

    def resolve(self, citation_text, context="", quote=""):
        """Return (urn, confidence, source, details); details is {} for rule-based results"""
        # Step 1: Try rule-based
        rule_urn, rule_status = self.rule_based.resolve(citation_text)
        # Step 2: Score rule-based result
        if rule_urn:
            # The scorer returns a (1, 1) tensor; convert it to a plain float
            confidence = self.confidence_scorer(citation_text, rule_urn, context).item()
            if confidence > self.high_confidence_threshold:
                return rule_urn, confidence, "rule-based", {}
            # Medium confidence - get DNN opinion
            elif confidence > self.low_confidence_threshold:
                dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                    citation_text, context, quote
                )
                # Compare: rule-based vs DNN
                if dnn_conf > confidence:
                    return dnn_urn, dnn_conf, "dnn-override", dnn_details
                else:
                    return rule_urn, confidence, "rule-based-verified", {}
            # Low confidence - prefer DNN
            else:
                dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                    citation_text, context, quote
                )
                return dnn_urn, dnn_conf, "dnn-preferred", dnn_details
        # Step 3: Rule-based failed, use DNN only
        else:
            dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                citation_text, context, quote
            )
            return dnn_urn, dnn_conf, "dnn-only", dnn_details
Phase 1: Train Author Classifier
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
import json
import numpy as np

# Step 1: Extract author labels from URNs
def parse_urn_to_author(urn):
    """Extract author URN from full CTS URN"""
    # urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82 → tlg0016
    try:
        parts = urn.split(":")
        author_work = parts[3]  # tlg0016.tlg001.perseus-grc2
        author = author_work.split(".")[0]  # tlg0016
        return author
    except (AttributeError, IndexError):  # urn is None or has too few components
        return None

# Step 2: Build author vocabulary from perseus-citation-processor data
author_catalog = set()
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        if author:
            author_catalog.add(author)
author_to_id = {author: i for i, author in enumerate(sorted(author_catalog))}
id_to_author = {i: author for author, i in author_to_id.items()}

# Step 3: Prepare training data
train_data = []
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        if author and author in author_to_id:
            train_data.append({
                "text": f"{item['bibl']} [SEP] {item.get('quote', '')}",
                "label": author_to_id[author]
            })
# Train/val split
dataset = Dataset.from_list(train_data)
dataset = dataset.train_test_split(test_size=0.1)

# Step 4: Train DeBERTa classifier
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(author_catalog)
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

def compute_metrics(eval_pred):
    # Needed so that metric_for_best_model="accuracy" has a metric to read
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

tokenized_dataset = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments(
    output_dir="./outputs/resolution/models/author-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    compute_metrics=compute_metrics,
)
trainer.train()
Phase 2: Train Work Classifier
# Step 1: Parse work labels from URNs
def parse_urn_to_work(urn):
    """Extract work URN from full CTS URN"""
    # urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82 → tlg0016.tlg001
    try:
        parts = urn.split(":")
        author_work = parts[3]  # tlg0016.tlg001.perseus-grc2
        work = ".".join(author_work.split(".")[:2])  # tlg0016.tlg001
        return work
    except (AttributeError, IndexError):  # urn is None or has too few components
        return None

# Step 2: Build work vocabulary grouped by author
works_by_author = {}
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        work = parse_urn_to_work(item['urn'])
        if author and work:
            if author not in works_by_author:
                works_by_author[author] = set()
            works_by_author[author].add(work)

# Step 3: Prepare training data (conditioned on author)
train_data = []
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        work = parse_urn_to_work(item['urn'])
        if author and work:
            # Input includes author URN to condition on
            train_data.append({
                "text": f"{author} [SEP] {item['bibl']} [SEP] {item.get('quote', '')}",
                "author": author,
                "label": work
            })
# Create label mapping (per-author work indices)
# ... (similar training setup)
Phase 3: Train Confidence Scorer
- Curate labeled examples with confidence scores
- Use cross-validation on resolved.jsonl
- Add manually labeled ambiguous cases (~1K examples)
- Generate synthetic negative examples (incorrect author/work pairs)
Phase 4: Evaluate Hierarchical Classifier
- Test on held-out validation set (10% of resolved.jsonl)
- Measure component accuracy:
  - Author classification accuracy
  - Work classification accuracy (given correct author)
  - End-to-end URN exact match
- Test on 30K unresolved.jsonl examples
- Manual evaluation on random sample for accuracy
Conservative estimates:
| Component | Coverage (baseline) or gain (percentage points) |
|---|---|
| Rule-based baseline | 87.6% |
| DNN on unresolved (30-50% success) | +4-6% |
| Confidence filtering (catch incorrect rules) | +2-3% |
| Total hybrid coverage | ~94-97% |
Note: Even resolving 1/3 of the unresolved cases would be a significant improvement.
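The DNN row of the table follows from simple arithmetic on the baseline, sketched here for reference. The 30-50% success range is the estimate from the table, not a measured result.

```python
baseline = 87.6                 # % of citations resolved by the rule-based processor
unresolved = 100.0 - baseline   # % left for the DNN resolver

# Gain in percentage points if the DNN resolves 30% or 50% of the remainder.
dnn_gain = (0.30 * unresolved, 0.50 * unresolved)

# Resulting coverage before any confidence filtering is counted.
after_dnn = tuple(baseline + gain for gain in dnn_gain)
```

This gives a gain of roughly 3.7 to 6.2 points, which the table rounds to +4-6%, for a coverage of roughly 91.3% to 93.8% before the confidence-filtering row is added.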
For URN Resolution:
- Exact match accuracy: Percentage of perfectly resolved URNs
- Component accuracy: Separate metrics for author, work, passage
- Coverage: Percentage of citations with confident predictions
- Precision@K: Accuracy when allowing top-K predictions
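These metrics are simple to compute once predictions are available. The sketch below uses hypothetical function names and reads Precision@K as "the gold URN appears among the top K candidates".

```python
from typing import Optional, Sequence


def exact_match(predicted: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of citations whose predicted URN equals the gold URN."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def hit_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of citations whose gold URN is among the top-k candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)


def coverage(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of citations with a prediction at or above the threshold."""
    return sum(c >= threshold for c in confidences) / len(confidences)
```

Component accuracy can reuse `exact_match` after reducing both predicted and gold URNs to their author or work prefix.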
Unlike tag extraction, URN resolution should be split by unique citation patterns rather than documents to test generalization:
- Test on unseen author abbreviations
- Test on unseen work titles
- Test on seen authors but unseen works
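One way to approximate a pattern-based split is to hold out whole author abbreviations, as sketched below. Using the first whitespace-separated token of the bibl string as the grouping key is an assumption made for illustration; the function name is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_abbreviation(
    bibls: List[str], test_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Hold out every citation of a random subset of author abbreviations."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for bibl in bibls:
        parts = bibl.split()
        key = parts[0].lower() if parts else ""
        groups[key].append(bibl)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test = [b for key in keys[:n_test] for b in groups[key]]
    train = [b for key in keys[n_test:] for b in groups[key]]
    return train, test
```

Because groups are held out whole, no abbreviation in the test set is ever seen during training, which is the generalization case described above.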
Completed:
- ✅ Data pipeline (JSONL → BIO format with special tokens)
- ✅ DeBERTa token classification model
- ✅ Training pipeline with data splitting by filename
- ✅ Evaluation metrics (seqeval for BIO tagging)
- ✅ Inference model for predictions on plain text
- ✅ Test suite (98 tests covering data loading, model, training)
Remaining:
- ⏳ Train on full dataset and tune hyperparameters
- ⏳ Error analysis on test set predictions
- ⏳ DeBERTa+CRF implementation (if baseline insufficient)
- ⏳ Production deployment and inference optimization
Phase 1: Rule-based Baseline (Week 1-2)
- Integration: Wrap perseus-citation-processor as Python callable
- Baseline metrics: Establish 87.6% resolution rate on 246K citations (216K resolved)
- Error analysis: Analyze 30K unresolved.jsonl patterns
- Catalog extraction: Load author/work URNs from perseus-citation-processor data
Phase 2: Hierarchical Classifiers (Week 3-5)
- Catalog extraction: Extract author/work vocabularies from resolved.jsonl URNs
- Train author classifier: DeBERTa classification over ~500 authors
  - Input: citation + context
  - Output: author URN (e.g., tlg0016)
- Train work classifier: DeBERTa classification conditioned on author
  - Input: author + citation + context
  - Output: work URN (e.g., tlg0016.tlg001)
- Implement passage parser: Rule-based extraction of passage references
- Evaluate components: Measure author accuracy, work accuracy, end-to-end
- Test on unresolved: Evaluate on 30K unresolved.jsonl citations
Phase 3: Confidence Scorer (Week 5-6)
- Training data curation: Label confident vs ambiguous resolutions
- Train DeBERTa classifier: (citation, URN, context) → confidence score
- Threshold tuning: Calibrate confidence thresholds on validation set
- Cross-validation: Test on ambiguous cases from resolved.jsonl
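Threshold tuning on a validation set might be sketched as below. This assumes the scorer yields one confidence per prediction and that each prediction has a known correct/incorrect label; `pick_threshold` and `min_precision` are hypothetical names, and the target precision is an arbitrary placeholder.

```python
def pick_threshold(scores, correct, min_precision=0.95):
    """Return the lowest threshold whose accepted predictions reach min_precision.

    Choosing the lowest qualifying threshold keeps coverage as high as possible.
    Returns None if no threshold meets the precision target.
    """
    for threshold in sorted(set(scores)):
        accepted = [ok for score, ok in zip(scores, correct) if score >= threshold]
        if accepted and sum(accepted) / len(accepted) >= min_precision:
            return threshold
    return None
```

Precision is not guaranteed to rise monotonically with the threshold, which is why every candidate value is checked rather than stopping at the first drop.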
Phase 4: Hybrid System (Week 7-8)
- Orchestrator implementation: Combine rule-based + DNN components
- End-to-end evaluation: Measure coverage improvement (target: 94-97%)
- Component analysis: Track rule-based vs DNN vs hybrid performance
- Error analysis: Identify remaining failure modes
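One possible shape for the orchestrator is sketched here: try the rule-based resolver first, fall back to the DNN components, and accept the DNN answer only above a confidence threshold. The callables `rule_based`, `neural` and `scorer` are stand-ins for components that do not exist yet, and the default threshold is an assumption.

```python
from typing import Callable, Optional, Tuple

Resolver = Callable[[str], Optional[str]]   # citation -> URN or None
Scorer = Callable[[str, str], float]        # (citation, URN) -> confidence


def resolve(citation: str, rule_based: Resolver, neural: Resolver,
            scorer: Scorer, threshold: float = 0.8) -> Tuple[Optional[str], str]:
    """Return (urn, source), where source records which component answered."""
    urn = rule_based(citation)
    if urn is not None:
        return urn, "rule-based"

    urn = neural(citation)
    if urn is not None and scorer(citation, urn) >= threshold:
        return urn, "dnn"

    return None, "unresolved"
```

Returning the source label alongside the URN makes it straightforward to tally how many citations each component resolved during component analysis.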
perseus-citation-model/
│
├── cit_data/                              # Raw training data (not in repo)
│   ├── resolved.jsonl                     # 216K citations with URNs
│   └── unresolved.jsonl                   # 30K citations without URNs
│
├── model_data/                            # Partitioned data
│   ├── extraction                         # Partitions for extraction task
│   └── resolution                         # Partitions for resolution task
├── outputs/                               # Fine-tuned model weights from training
├── src/                                   # Source code
│   └── perscit_model/                     # Main package
│       ├── __init__.py
│       ├── shared/                        # Shared utilities across tasks
│       │   ├── __init__.py
│       │   ├── data_loader.py             # Base JSONL loader, tokenization
│       │   └── training_utils.py          # Training configuration utilities
│       ├── extraction/                    # Task 1: Tag Extraction (BIO tagging)
│       │   ├── __init__.py
│       │   ├── data_loader.py             # XML → special tokens → BIO labels
│       │   ├── model.py                   # DeBERTa token classification model
│       │   ├── train.py                   # Training pipeline and data splitting
│       │   ├── evaluate.py                # Evaluation on test set
│       │   └── inference.py               # Inference model for predictions
│       └── resolution/                    # Task 2: URN Resolution
│           ├── __init__.py
│           └── data_loader.py             # Citation data loading for resolution
│
├── configs/                               # Configuration files
│   └── extraction/
│       └── baseline.yaml                  # Hyperparameters (model, max_length)
│
├── tests/                                 # Test suite (98 tests)
│   ├── conftest.py                        # Shared fixtures (mock tokenizer)
│   ├── fixtures/                          # Test data
│   │   └── sample_extraction.jsonl        # 5 real citation examples
│   ├── unit/                              # Fast unit tests (88 tests, ~3s)
│   │   ├── test_extraction_dataset.py     # BIO label generation tests
│   │   ├── test_extraction_loader.py      # Data loader tests
│   │   ├── test_extraction_pipeline.py    # End-to-end pipeline tests
│   │   ├── test_resolution_loader.py      # Resolution data tests
│   │   └── test_shared_data_loader.py     # Shared utility tests
│   └── integration/                       # Slow integration tests (10 tests, ~8s)
│       └── test_extraction_model.py       # Real model loading/training tests
│
├── pyproject.toml                         # Project config, dependencies, test settings
├── .gitignore
├── .python-version                        # Python 3.13
└── README.md
Key Implementation Details:
Instead of word-level BIO tagging with complex alignment, we use special tokens:
- XML → Special Tokens: Replace <bibl>, <quote>, <cit> tags with [BIBL_START], [BIBL_END], etc.
- Add to Vocabulary: Special tokens added to DeBERTa tokenizer (won't be split)
- Tokenize: DeBERTa tokenizes text with special tokens intact
- Generate BIO Labels: State machine generates labels based on special token positions
- Strip Special Tokens: Remove special tokens from input while keeping labels aligned
Example:
XML:          <bibl>Hdt. 8.82</bibl> some context
              ↓
Special:      [BIBL_START] Hdt. 8.82 [BIBL_END] some context
              ↓
Tokens:       [CLS]  [BIBL_START]  Hdt  .   8   .   82  [BIBL_END]  some  context  [SEP]
              ↓
Labels:       -100   -100          B-   I-  I-  I-  I-  -100        O     O        -100
              ↓ (strip special tokens)
Final Tokens: [CLS]  Hdt  .   8   .   82  some  context  [SEP]
Final Labels: -100   B-   I-  I-  I-  I-  O     O        -100
Key point: The model sees only [CLS] Hdt . 8 . 82 some context [SEP]
during training, NOT the special tokens. It must learn to predict citation
boundaries from the context alone.
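The label-generation state machine can be sketched like this. It is a simplified illustration rather than the code in `extraction/data_loader.py`: it merges label generation and marker stripping into one pass, handles only non-nested `<bibl>` and `<quote>` spans, and assumes the marker names `[QUOTE_START]`/`[QUOTE_END]` and the label names `B-BIBL`, `I-BIBL`, etc.

```python
IGNORE = -100  # label for tokens excluded from the loss ([CLS], [SEP])

# Assumed marker tokens; only the BIBL pair is spelled out in this README.
MARKERS = {
    "[BIBL_START]": ("open", "BIBL"),
    "[BIBL_END]": ("close", "BIBL"),
    "[QUOTE_START]": ("open", "QUOTE"),
    "[QUOTE_END]": ("close", "QUOTE"),
}
MODEL_SPECIAL = {"[CLS]", "[SEP]"}


def bio_labels(tokens):
    """Return (kept_tokens, labels) with marker tokens removed and labels aligned."""
    kept, labels = [], []
    current = None     # tag currently open, or None when outside any tag
    at_start = False   # True until the first token inside the tag is labeled
    for tok in tokens:
        if tok in MARKERS:
            kind, tag = MARKERS[tok]
            current = tag if kind == "open" else None
            at_start = kind == "open"
            continue  # markers are dropped, so the model never sees them
        kept.append(tok)
        if tok in MODEL_SPECIAL:
            labels.append(IGNORE)
        elif current is None:
            labels.append("O")
        else:
            labels.append(("B-" if at_start else "I-") + current)
            at_start = False
    return kept, labels
```

Because the markers only update the state and are never emitted, the returned tokens and labels stay the same length without any alignment step.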
Advantages over word-level alignment:
- No complex subword↔word alignment logic
- Special tokens guaranteed not to split
- Simpler, more reliable label generation
- Handles malformed XML gracefully (BeautifulSoup repair)
Embedding Resizing:
- Base DeBERTa vocab: 128,000 tokens
- +6 special tokens = 128,006 tokens
- New embeddings initialized to mean of existing embeddings (training stability)
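The mean initialization itself is simple arithmetic, shown here on plain lists so the example runs without any ML dependencies. This is only a toy illustration; the real model would do the same thing on the embedding tensor, and `extend_embeddings` is a hypothetical name.

```python
def extend_embeddings(matrix, n_new):
    """Return a new matrix with n_new rows appended, each the mean of the existing rows."""
    n_rows = len(matrix)
    # Column-wise mean over all existing embedding vectors.
    mean_row = [sum(column) / n_rows for column in zip(*matrix)]
    return matrix + [list(mean_row) for _ in range(n_new)]


# Tiny stand-in for the 128,000-row embedding matrix.
base = [[1.0, 2.0], [3.0, 4.0]]
resized = extend_embeddings(base, n_new=6)
print(len(resized), resized[-1])
```

Starting the new rows at the mean keeps them within the range of the existing embeddings, which is the stability argument made above.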
# Fast unit tests only (default)
pytest # 88 tests in ~3s
# Integration tests (downloads real models)
pytest tests/integration # 10 tests in ~8s
# All tests
pytest tests # 98 tests in ~9s
The two tasks can be combined into a complete citation processing pipeline:
Raw Text → Tag Extraction → URN Resolution → Structured Citations
Example workflow:
- Input: "Homer mentions this in Il. 7.268-272: 'Ajax hurled a rock'"
- Tag Extraction: Identify <bibl>Il. 7.268-272</bibl> and <quote>Ajax hurled a rock</quote>
- URN Resolution: Map "Il. 7.268-272" → "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"
- Output: Structured citation with linked canonical reference
This enables:
- Automated citation extraction from plain text
- Linking to canonical text passages
- Cross-referencing across documents
- Building citation networks in classical scholarship
General:
- TEI Guidelines: https://tei-c.org/release/doc/tei-p5-doc/en/html/
- CTS URN Specification: http://cite-architecture.github.io/cts_spec/
- Perseus Digital Library: https://www.perseus.tufts.edu/
Tag Extraction:
- HuggingFace Token Classification: https://huggingface.co/docs/transformers/tasks/token_classification
- pytorch-crf: https://github.com/kmkurn/pytorch-crf
- seqeval metrics: https://github.com/chakki-works/seqeval
- DeBERTa paper: https://arxiv.org/abs/2006.03654
URN Resolution:
- DeBERTa paper: https://arxiv.org/abs/2006.03654
- Hierarchical classification: https://arxiv.org/abs/1904.02817
- CTS URN Specification: http://cite-architecture.github.io/cts_spec/
- CTS API documentation: http://cite-architecture.github.io/cts/
- perseus-citation-processor: https://github.com/andrewbird2/perseus-citation-processor
- Fuzzy string matching: https://github.com/seatgeek/fuzzywuzzy