Perseus Citation Model

Machine learning models for identifying citation structures in classical texts and resolving bibliographic references to canonical URNs.

Current README largely contains notes for my own use.

Project Status: Early Development

✅ Data pipeline implemented (extraction task)
✅ Model initialization and embedding handling
✅ Fine-tuning for extraction task
⏳ XML file processing using fine-tuned model (in progress)
⏳ URN resolution implementation (planned)

Installation

Requirements:

Python 3.13+
uv package manager (recommended) or pip

Setup:

# Clone repository
git clone https://github.com/andrmayo/perseus-citation-model.git
cd perseus-citation-model

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e ".[dev]"

Overview

This project provides two complementary ML tasks for working with citations in TEI-encoded XML documents from the Perseus Digital Library:

Tag Extraction: Identify and extract citation tags (<cit>, <quote>, <bibl>) from plain text
URN Resolution: Map bibliographic references to Canonical Text Services (CTS) URNs

Both tasks share data pipelines and preprocessing infrastructure but use different model architectures appropriate to each problem.

Task Definitions

Tag Extraction

Input: Plain text extracted from TEI XML documents
Output: Token-level tags identifying citation boundaries

Target tags:

<cit> - Citation container
<quote> - Quoted text
<bibl> - Bibliographic reference

Note: <cit> tags surround a quote-bibl pair, so they can be inserted by rule once the <bibl> and <quote> elements have been identified.

Challenges:

Variable citation formats
Mixed languages (Greek, Latin, English)
Context-dependent identification

URN Resolution

Input: Bibliographic reference text (e.g., "Hdt. 8.82")
Output: CTS URN (e.g., "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82")

Examples:

"Hom. Il. 7.268" → "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"
"Thuc. 3.38" → "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3.38"
"Plat. Rep. 332D" → "urn:cts:greekLit:tlg0059.tlg030.perseus-grc2:332d"

Challenges:

Abbreviated author names (Hdt., Hom., Thuc.)
Work title variations and abbreviations
Range notation (e.g., "7.268-272", "sqq.")
Unresolvable references to modern scholarship (e.g., "ARV2, 987")
Missing URNs for ~12% of citations

Data Format

Training data is in JSONL format with two files:

resolved.jsonl (~216K examples) - Citations with URNs:

{
  "bibl": "Hdt. 8.82",
  "quote": "",
  "xml_context": "...full XML context with tags...",
  "filename": "xml_files/viaf17286815.viaf001.xml",
  "urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
  "ref": "hdt. 8.82",
  "n_attrib": "Hdt. 8.82",
  "doc_cit_urn": ":citations-28.3"
}

unresolved.jsonl (~30K examples) - Citations without URNs:

{
  "bibl": "FR, pl. 167,2",
  "quote": "",
  "xml_context": "...full XML context with tags...",
  "filename": "xml_files/viaf114145308.viaf001.xml",
  "urn": "",
  "ref": "fr pl. 167,2",
  "n_attrib": "",
  "doc_cit_urn": ":citations-24.1"
}

Key fields:

bibl: Bibliographic reference text
quote: Quoted text (often empty)
xml_context: XML snippet with tags for tag extraction training
urn: CTS URN (empty for unresolved citations)
ref: Normalized reference text

Task 1: Tag Extraction

Fine-tune a pre-trained transformer model (DeBERTa) for sequence labeling using BIO tagging.

Architecture

Input Text → Tokenizer → Transformer Encoder → Linear Layer → Softmax → BIO Tags

BIO Tagging Scheme

Each token is labeled with one of:

O - Outside any citation tag
B-CIT - Beginning of <cit> tag
I-CIT - Inside <cit> tag
B-QUOTE - Beginning of <quote> tag
I-QUOTE - Inside <quote> tag
B-BIBL - Beginning of <bibl> tag
I-BIBL - Inside <bibl> tag

Example:

Text:  Hom.    Il.     7.268   -       272     :  "Ajax    hurled   a        rock"
Tags:  B-BIBL  I-BIBL  I-BIBL  I-BIBL  I-BIBL  O  B-QUOTE  I-QUOTE  I-QUOTE  I-QUOTE
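
To make the scheme concrete, here is a minimal sketch of decoding a BIO sequence into labeled spans. The function name `bio_to_spans` and the lenient handling of a stray I- tag (treated as the start of a new span) are assumptions for illustration, not the project's decoder.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only when the label type matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated like B- (lenient repair).
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["Hom.", "Il.", "7.268", "-", "272", ":", '"Ajax', "hurled", "a", 'rock"']
tags = ["B-BIBL", "I-BIBL", "I-BIBL", "I-BIBL", "I-BIBL", "O",
        "B-QUOTE", "I-QUOTE", "I-QUOTE", "I-QUOTE"]
spans = bio_to_spans(tags)
for label, start, end in spans:
    print(label, " ".join(tokens[start:end]))
```

Each span indexes into the token list, so the same spans can be mapped back to character offsets with the tokenizer's offset mapping.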

Model Selection

The project currently uses microsoft/deberta-v3-base, for the following reasons:

Superior contextual understanding for nested structures
Better multilingual handling (Greek, Latin, English)
State-of-the-art performance on token classification
1-3% F1 improvement over RoBERTa on similar tasks

One alternative worth trying is roberta-base, which is a reasonable choice when:

Faster inference is needed (~10-15% faster than DeBERTa)
Memory is constrained
A strong baseline is sufficient

CRF Decoding Layer

To help the model learn label sequences rather than independent per-token labels, I've added a CRF decoding layer. A secondary benefit here is that it enforces valid label sequences (which really just means that an I- label can only follow a B- or I- label of the same type).

Architecture with CRF

Input Text → Tokenizer → Transformer Encoder → Linear Layer → CRF Layer → BIO Tags
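
The validity constraint mentioned above can be written down as a transition table. The sketch below is only an illustration of that rule in plain Python; the names `is_valid_transition`, `is_valid_sequence`, and `allowed` are hypothetical, and it does not show how the project's CRF layer is implemented or trained.

```python
LABELS = ["O", "B-CIT", "I-CIT", "B-QUOTE", "I-QUOTE", "B-BIBL", "I-BIBL"]


def is_valid_transition(prev, curr):
    """An I-X label may only follow B-X or I-X; every other transition is allowed."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def is_valid_sequence(tags):
    """Check a whole sequence, treating the position before the first token as O."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    return True


# allowed[i][j] is True when LABELS[i] -> LABELS[j] is a legal transition.
# A table like this could be used to mask transition scores during decoding.
allowed = [[is_valid_transition(p, c) for c in LABELS] for p in LABELS]

print(is_valid_sequence(["B-BIBL", "I-BIBL", "O", "B-QUOTE"]))
print(is_valid_sequence(["O", "I-BIBL"]))
```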

Dataset Splitting

The training pipeline automatically splits data by filename to prevent data leakage:

from perscit_model.extraction.train import split_data

train_path, val_path, test_path = split_data(
    input_file="cit_data/resolved.jsonl",
    output_dir="model_data/extraction",
    train_ratio=0.8,  # Default from config
    val_ratio=0.1,
    test_ratio=0.1,
    seed=42  # Default from config
)

Implementation details:

Splits by filename (not individual examples) to prevent data leakage
Tries multiple shuffles to get close to target ratios
Saves split configuration to prevent accidental re-splitting
Reuses existing splits if configuration matches
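
The core idea of splitting by filename can be sketched in a few lines. This is a simplified illustration, not the body of split_data: the function name `split_by_filename` is hypothetical, ratios are applied to the number of files rather than the number of examples, and the multiple-shuffle search and the saved split configuration are left out.

```python
import random
from collections import defaultdict


def split_by_filename(records, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Assign whole files to train/val/test so no file contributes to two splits."""
    by_file = defaultdict(list)
    for rec in records:
        by_file[rec["filename"]].append(rec)

    # Sort before shuffling so the result depends only on the seed.
    filenames = sorted(by_file)
    random.Random(seed).shuffle(filenames)

    n_train = int(len(filenames) * train_ratio)
    n_val = int(len(filenames) * val_ratio)
    groups = {
        "train": filenames[:n_train],
        "val": filenames[n_train:n_train + n_val],
        "test": filenames[n_train + n_val:],
    }
    return {name: [r for fn in fns for r in by_file[fn]] for name, fns in groups.items()}


# Toy data: 10 files with 3 citations each.
records = [{"filename": f"xml_files/doc{i}.xml", "bibl": f"ref {i}.{j}"}
           for i in range(10) for j in range(3)]
splits = split_by_filename(records)
print({name: len(items) for name, items in splits.items()})
```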

Training Details

Training follows a three-phase curriculum:
- Phase 1: only passages that contain citations
- Phase 2: all dictionary XML files, especially LSJ, with the data augmentation described below
- Phase 3: commentary XML files, with some <bibl> and <quote> tags randomly retained and their contents labelled "O", so that the model learns to ignore citations that are already tagged

NB: The LSJ accounts for about half of the raw text in the phase 2 training corpus and more than half of the citation tokens. It also has certain distorting features, above all the consistent use of <author> and <title> tags in citations. To prevent over-reliance on these tags, they are stripped at a rate of 50%. Stripping them at a higher rate would be risky unless done during preprocessing, since it is likely to produce a strong correlation between text length and the presence of citations. Tag stripping should also probably only be used alongside random token dropping (which makes this correlation noisier, but doesn't eliminate it in expectation).
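
As a rough illustration of the tag stripping described above, the sketch below removes <author> and <title> tags (keeping their text) with a given probability. It assumes the decision is made once per passage and that the tags carry no namespace prefix; the function name `maybe_strip_author_title` and the regex are hypothetical, and the project's augmentation code may work differently.

```python
import random
import re

# Matches opening and closing <author> and <title> tags, with or without attributes.
TAG_RE = re.compile(r"</?(?:author|title)\b[^>]*>")


def maybe_strip_author_title(xml_text, rate=0.5, rng=random):
    """With probability `rate`, drop <author>/<title> tags but keep their contents."""
    if rng.random() < rate:
        return TAG_RE.sub("", xml_text)
    return xml_text


passage = '<bibl><author>Hom.</author> <title>Il.</title> 7.268</bibl>'
print(maybe_strip_author_title(passage, rate=1.0))
print(maybe_strip_author_title(passage, rate=0.0))
```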

Processing Full XML Documents

The extraction model was trained on 512-token windows centered around citations. To process full XML documents that exceed this context window, we need a sliding window approach with special handling for:

Context window boundaries (avoiding edge effects)
Split entities (citations spanning multiple windows)
Existing citations (preserving already-identified citations)
Reconstructing <cit> tags (wrapping bibl-quote pairs)

Sliding Window Strategy: "Reliable Center" Method

Problem: Citations near window edges may lack sufficient context, reducing prediction quality.

Solution: Process with overlapping windows, but only trust predictions from the center region.

Algorithm:

Window parameters:
- Window size: 512 tokens (same as training)
- Stride: 256 tokens (50% overlap)
- Reliable region: Center 256 tokens of each window
- Exception: First and last windows trust predictions to document edges
Coverage guarantee: Every position in the document falls in the reliable region of at least one window
Handles split entities: With 50% overlap, any entity of up to 256 tokens appears in full within at least one window, although its final labels may be drawn from the reliable regions of two adjacent windows

Example:

Document: [============================]
Window 1:  [512 tokens]
           [  reliable ]
Window 2:         [512 tokens]
                  [  reliable ]
Window 3:                [512 tokens]
                         [  reliable ]
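
A minimal sketch of how the window boundaries and reliable regions could be computed, in token indices. The function name `make_windows` is hypothetical, and so is the handling of the final window: when the document length is not a multiple of the stride, this sketch adds one extra window aligned to the end of the document and lets its reliable region start where the previous one stopped.

```python
def make_windows(n_tokens, window_size=512, stride=256):
    """Return (start, end, reliable_start, reliable_end) tuples; ends are exclusive."""
    margin = (window_size - stride) // 2  # 128 tokens of context on each side by default
    starts = list(range(0, max(n_tokens - window_size, 0) + 1, stride))
    # Add a final window aligned to the document end if the tail is not yet covered.
    if starts[-1] + window_size < n_tokens:
        starts.append(n_tokens - window_size)

    windows = []
    reliable_start = 0  # the first window is trusted from the document start
    for i, start in enumerate(starts):
        end = min(start + window_size, n_tokens)
        is_last = i == len(starts) - 1
        # The last window is trusted up to the document end.
        reliable_end = n_tokens if is_last else end - margin
        windows.append((start, end, reliable_start, reliable_end))
        reliable_start = reliable_end  # reliable regions tile the document
    return windows


for window in make_windows(1000):
    print(window)
```

Because each reliable region starts where the previous one ends, every token is labeled by exactly one window in this sketch.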

Handling Existing Citations

Goal: Supplement (not replace) existing citation tags in XML files.

Approach: Post-hoc filtering after inference.

Algorithm:

Extract existing citations:
- Parse XML and identify all <bibl>, <quote>, <cit> tags
- Store as character-level spans: [(start_pos, end_pos, tag_type), ...]
Strip citation tags for inference:
- Remove <bibl>, <quote>, <cit> tags (keep other XML tags)
- Run inference on stripped text
Merge predictions:
- Filter out predicted entities that overlap with existing citations
- Keep existing citations unchanged
- Insert new predicted citations
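
The merge step can be sketched as a simple overlap filter. This assumes that existing and predicted spans are both expressed as (start_pos, end_pos, tag_type) character offsets into the same stripped text; the function names `overlaps` and `merge_with_existing` are hypothetical.

```python
def overlaps(a, b):
    """True when two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_with_existing(existing, predicted):
    """Keep every existing span; add predicted spans that touch none of them."""
    kept = [p for p in predicted if not any(overlaps(p, e) for e in existing)]
    return sorted(existing + kept)


existing = [(10, 20, "bibl")]
predicted = [(15, 25, "bibl"), (30, 40, "quote"), (20, 22, "quote")]
merged = merge_with_existing(existing, predicted)
print(merged)
```

Spans that merely touch an existing citation (one ends where the other starts) are kept, since they share no characters.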

Why this works:

Model sees natural context around existing citations
Simple conflict resolution (existing citations take precedence)
Can later extend to more sophisticated merge strategies

Character-Level Label Merging

Problem: Overlapping windows produce multiple predictions for the same tokens.

Solution: Merge predictions at the character level, only using reliable regions.

Algorithm:

Initialize: Create character-level label array: char_labels = ['O'] * len(text)
For each window:
- Get token-level predictions from model
- Convert to character-level using tokenizer offset mapping
- Determine reliable character range (e.g., chars corresponding to tokens 128-384)
- Update char_labels only in the reliable region
Extract entities: Convert final char_labels to entity spans using BIO logic
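
A minimal sketch of the merging steps above, without a real tokenizer. It assumes each window supplies its token labels, the tokens' character offsets into the full text (no special tokens), and the reliable token range; the function names and this input layout are hypothetical. Whitespace between a token and a following I- token is labeled too, so that entities stay contiguous at the character level.

```python
def merge_window_predictions(text_len, windows):
    """windows: iterable of (token_labels, offsets, (reliable_start, reliable_end))."""
    char_labels = ["O"] * text_len
    for token_labels, offsets, (rel_start, rel_end) in windows:
        for t in range(rel_start, min(rel_end, len(token_labels))):
            label = token_labels[t]
            c_start, c_end = offsets[t]
            if c_end <= c_start:
                continue  # skip tokens that map to an empty character range
            if label.startswith("B-"):
                # Only the first character carries B-; the rest continue the entity.
                inside = "I-" + label[2:]
                char_labels[c_start:c_end] = [label] + [inside] * (c_end - c_start - 1)
            elif label.startswith("I-"):
                # Extend back over the gap to the previous token.
                gap_start = offsets[t - 1][1] if t > 0 else c_start
                char_labels[gap_start:c_end] = [label] * (c_end - gap_start)
            else:
                char_labels[c_start:c_end] = ["O"] * (c_end - c_start)
    return char_labels


def chars_to_entities(char_labels):
    """Convert character-level BIO labels into (start, end, label) spans."""
    entities, label, start = [], None, None
    for i, tag in enumerate(char_labels + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and tag[2:] == label:
            continue
        if label is not None:
            entities.append((start, i, label))
            label = None
        if tag != "O":
            label, start = tag[2:], i
    return entities


text = "See Hdt. 8.82 now"
offsets = [(0, 3), (4, 8), (9, 13), (14, 17)]
# Two overlapping windows; each contributes labels only for its reliable tokens.
window_a = (["O", "B-BIBL", "I-BIBL", "O"], offsets, (0, 2))
window_b = (["O", "O", "I-BIBL", "O"], offsets, (2, 4))
entities = chars_to_entities(merge_window_predictions(len(text), [window_a, window_b]))
print([(text[s:e], label) for s, e, label in entities])
```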

Advantages:

Handles split entities automatically (continuous character regions)
Clean merging of overlapping predictions
No special logic needed for window boundaries

Wrapping bibl-quote Pairs in `<cit>` Tags

Recall: The model doesn't predict <cit> tags (no training examples). These must be inserted logically.

Pattern matching algorithm:

Identify adjacent pairs:
- <bibl> immediately followed by <quote> (same parent, adjacent siblings)
- <quote> immediately followed by <bibl> (same parent, adjacent siblings)
Wrapping conditions:
- Must be direct siblings in XML tree
- Only whitespace allowed between elements
- Not already inside a <cit> tag
- Don't wrap if non-whitespace text appears between them
Insert <cit> wrapper:
- Wrap the pair: <cit><bibl>...</bibl><quote>...</quote></cit>
- Preserve whitespace between elements

Example:

Before: See <bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote> for details.
After:  See <cit><bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote></cit> for details.
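
The wrapping rules can be sketched with the standard library's ElementTree. This is an illustration under simplifying assumptions: tags carry no TEI namespace, and "not already inside a <cit> tag" is checked against the direct parent only. The function name `wrap_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

PAIRS = {("bibl", "quote"), ("quote", "bibl")}


def wrap_pairs(root):
    """Wrap adjacent bibl/quote siblings separated only by whitespace in <cit>."""
    for parent in list(root.iter()):
        if parent.tag == "cit":
            continue  # the pair is already wrapped
        children = list(parent)
        i = 0
        while i < len(children) - 1:
            first, second = children[i], children[i + 1]
            between = first.tail or ""  # text between the two sibling elements
            if (first.tag, second.tag) in PAIRS and between.strip() == "":
                position = list(parent).index(first)
                parent.remove(first)
                parent.remove(second)
                cit = ET.Element("cit")
                cit.append(first)   # first.tail (the whitespace) stays inside <cit>
                cit.append(second)
                # Text after the pair now follows </cit> instead of the second element.
                cit.tail, second.tail = second.tail, None
                parent.insert(position, cit)
                i += 2
            else:
                i += 1
    return root


source = ("<p>See <bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote>"
          " for details.</p>")
wrapped = ET.tostring(wrap_pairs(ET.fromstring(source)), encoding="unicode")
print(wrapped)
```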

Implementation Overview

High-level pipeline:

class XMLCitationProcessor:
    """Process full XML documents with sliding window inference."""

    def __init__(self, model_path, window_size=512, stride=256):
        self.model = InferenceModel(model_path)
        self.window_size = window_size
        self.stride = stride

    def process_file(self, xml_path, preserve_existing=True):
        """
        Process XML file and insert citation tags.

        Steps:
        1. Parse XML and extract existing citations
        2. Strip citation tags from text
        3. Create sliding windows
        4. Run inference on each window
        5. Merge predictions at character level (reliable regions only)
        6. Filter predictions that conflict with existing citations
        7. Wrap bibl-quote pairs in <cit> tags
        8. Insert all tags into final XML
        9. Validate and return result
        """
        pass

Key parameters:

window_size: Token length of each window (default: 512)
stride: Token step between consecutive window starts (default: 256, i.e. 50% overlap)
preserve_existing: Keep existing citation tags (default: True)
max_bibl_quote_distance: Max chars between bibl/quote for wrapping (default: 100)

Training Recommendations

Hyperparameters

Transformer only:

Learning rate: 2e-5 to 5e-5
Batch size: 16-32
Epochs: 3-5
Warmup steps: 500
Weight decay: 0.01

Hardware Requirements

Minimum:

GPU: 8GB VRAM (e.g., RTX 2070, T4)
RAM: 16GB
Storage: 10GB for model + data

Recommended:

GPU: 16GB+ VRAM (e.g., V100, A100, RTX 3090)
RAM: 32GB
Storage: 50GB

Task 2: URN Resolution

Overview

URN resolution maps bibliographic reference strings to Canonical Text Services (CTS) URNs. This is a different ML problem from tag extraction: it is a structured prediction task over whole reference strings (approached below as hierarchical classification) rather than token classification.

Recommended Approaches

Approach 1: Rule-based + Hierarchical Classification (Recommended)

Strategy: Use perseus-citation-processor for 87.6% of cases, hierarchical DNN classifiers for the remaining 12.4%.

Advantages:

Guaranteed valid URNs - classification over known catalog, no hallucination
Interpretable - see which stage (author/work) succeeded or failed
Efficient - small vocabularies (~500 authors, ~50 works per author)
High precision - rule-based handles well-formatted citations (87.6%)
Debuggable - can inspect author confidence vs work confidence separately

Architecture:

Input: "Hdt. 8.82"
  ↓
[perseus-citation-processor] → 87.6% resolved directly
  ↓ (if unresolved or low confidence)
[Author Classifier (DNN)] → tlg0016 [conf: 0.95]
  ↓
[Work Classifier (DNN)] → tlg0016.tlg001 [conf: 0.92]
  ↓
[Passage Parser (rules)] → 8.82
  ↓
[Edition Selector (rules)] → perseus-grc2
  ↓
Assemble: "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"

Components:

Rule-based baseline: perseus-citation-processor (Go binary)
Author classifier: DeBERTa classification over ~500 Greek/Latin authors
Work classifier: DeBERTa classification over works (conditioned on author)
Passage parser: Rule-based extraction of passage references
Edition selector: Rule-based (Greek quote → grc, Latin quote → lat)
Confidence scorer: DeBERTa binary classifier for quality control
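
The two rule-based pieces at the end of the pipeline, passage parsing and URN assembly, can be sketched as follows. This is a deliberately narrow illustration: it only accepts a simple dotted passage in the last whitespace-separated token and returns None for ranges such as "7.268-272". The function names `extract_passage` and `assemble_urn` are hypothetical and are not the PassageParser used by the project.

```python
import re

# A dotted sequence of numbers, each optionally followed by one letter (e.g. 332D).
PASSAGE_RE = re.compile(r"\d+[A-Za-z]?(?:\.\d+[A-Za-z]?)*")


def extract_passage(citation):
    """Return the trailing passage reference in lowercase, or None if there is none."""
    tokens = citation.split()
    if tokens and PASSAGE_RE.fullmatch(tokens[-1]):
        return tokens[-1].lower()
    return None


def assemble_urn(namespace, work, edition, passage):
    """Join the resolved parts into a CTS URN string."""
    return f"urn:cts:{namespace}:{work}.{edition}:{passage}"


passage = extract_passage("Hdt. 8.82")
print(assemble_urn("greekLit", "tlg0016.tlg001", "perseus-grc2", passage))
print(extract_passage("Plat. Rep. 332D"))
print(extract_passage("Hom. Il. 7.268-272"))
```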

Approach 2: Sequence-to-Sequence Model (NOT Recommended)

Models: T5, BART, ByT5, or encoder-decoder transformers

Why NOT recommended:

URNs are structured catalog lookups, not creative text generation
Can hallucinate invalid author/work combinations
Less interpretable - black box generation
Wasteful - learns URN syntax instead of just citation→URN mappings
No guaranteed validity - requires complex constrained decoding

When to consider:

If you need to handle completely new authors not in any catalog (unlikely for classical texts)
If you want to experiment with end-to-end learning
As a baseline to compare against hierarchical classification

Architecture:

from transformers import T5ForConditionalGeneration, AutoTokenizer

# ByT5 for character-level handling of Greek/Latin
model = T5ForConditionalGeneration.from_pretrained("google/byt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")

# Training format: "resolve citation: Hdt. 8.82" → "urn:cts:greekLit:..."
input_text = "resolve citation: Hdt. 8.82"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)

# Major issue: May generate syntactically valid but semantically wrong URNs
# e.g., "urn:cts:greekLit:tlg9999.tlg999.perseus-grc2:8.82" (invalid author)

Verdict: Use hierarchical classification (Approach 1) instead.

Approach 3: Retrieval-Augmented Generation

Strategy: Retrieve similar citations from database, use ML to rank/select.

Advantages:

Leverages existing resolved citations
Good for rare author/work combinations
Can explain predictions via retrieved examples

Architecture:

Input: "Thuc. 3.38"
  ↓
Embedding model → Dense vector representation
  ↓
Vector DB → Retrieve top-K similar resolved citations
  ↓
Ranking model → Score candidates
  ↓
Output: Highest-scoring URN
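
To show the shape of the retrieval step without any ML dependencies, the sketch below swaps the embedding model and vector DB for character-trigram Jaccard similarity over a small in-memory list. This is a stand-in chosen for illustration only; the function names `trigrams` and `retrieve` are hypothetical, and the ranking model is omitted.

```python
def trigrams(text):
    """Character trigrams of a lowercased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def retrieve(query, resolved, k=3):
    """Return the k most similar (score, reference, urn) triples by Jaccard similarity."""
    query_grams = trigrams(query)
    scored = []
    for reference, urn in resolved:
        ref_grams = trigrams(reference)
        score = len(query_grams & ref_grams) / len(query_grams | ref_grams)
        scored.append((score, reference, urn))
    return sorted(scored, reverse=True)[:k]


resolved = [
    ("Hdt. 8.82", "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"),
    ("Thuc. 3.38", "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3.38"),
    ("Hom. Il. 7.268", "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"),
]
candidates = retrieve("Thuc. 3.39", resolved, k=2)
for score, reference, urn in candidates:
    print(round(score, 2), reference, urn)
```

The retrieved URN carries the neighbor's passage, so in a full system only the author and work portion would be reused and the passage parsed from the query itself.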

Implementation: Hybrid Rule-based + DNN System

Recommended Approach: Hierarchical Classification (NOT seq2seq)

This implementation uses:

Rule-based baseline (perseus-citation-processor) for 87.6% of citations
Hierarchical DNN classifiers for the remaining 12.4%
- Author classifier (DeBERTa)
- Work classifier (DeBERTa, conditioned on author)
Confidence scorer (DeBERTa) to validate rule-based outputs

Why hierarchical classification over seq2seq?

Guaranteed valid URNs (no hallucination)
Interpretable decisions (author vs work failures)
Efficient (small vocabularies)
Better uncertainty handling (top-k at each stage)

Foundation: perseus-citation-processor

This project uses the perseus-citation-processor as the rule-based foundation:

Performance on 246K citations:

✅ 87.6% resolution rate (216K resolved)
❌ 12.4% unresolved (30K citations)
⚡ Fast: 28 seconds for 125 XML files
📚 Comprehensive author/work mappings (Greek, Latin, Scholia)

The DNN components handle the remaining 12.4% plus add confidence scoring.

Architecture Overview

Input Citation
       ↓
[perseus-citation-processor] ← Rule-based (87.6% coverage)
       ↓
   ┌───┴────┐
   ↓        ↓
Resolved  Unresolved
   ↓        ↓
[Confidence  [DNN Resolution
 Scorer]      Model]
   ↓        ↓
High conf?  URN + conf
   ↓        ↓
Yes→Output  Merge→Output
   No↘    ↗
    [DNN Model]

DNN Component 1: Unresolved Citation Resolver

Purpose: Resolve the 30K citations that the rule-based system couldn't handle.

Approach: Hierarchical Classification (NOT seq2seq generation)

Why Classification over Seq2Seq:

URNs are structured catalog lookups, not creative text
Guaranteed valid outputs - cannot hallucinate invalid author/work combinations
Interpretable - see exactly which stage succeeded/failed
Efficient - small vocabulary (~500 authors × ~50 works = 25K combinations)
Better uncertainty handling - top-k predictions at each stage

Architecture: Multi-stage classification pipeline

"Hdt. 8.82"
    ↓
[Author Classifier] → tlg0016 (Herodotus) [confidence: 0.95]
    ↓
[Work Classifier] → tlg0016.tlg001 (Histories) [confidence: 0.92]
    ↓
[Edition Selector] → perseus-grc2 (rule-based: Greek text → grc)
    ↓
[Passage Parser] → 8.82 (rule-based extraction)
    ↓
Assemble: "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"

Implementation:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class AuthorClassifier:
    """Stage 1: Classify citation to author URN"""

    def __init__(self):
        # Load perseus-citation-processor author catalog
        self.author_catalog = self.load_author_catalog()  # ~500 Greek/Latin authors
        self.author_to_id = {urn: i for i, urn in enumerate(self.author_catalog)}
        self.id_to_author = {i: urn for urn, i in self.author_to_id.items()}

        # DeBERTa classifier over author vocabulary
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/deberta-v3-base",
            num_labels=len(self.author_catalog)
        )
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    def predict(self, citation_text, context=""):
        """Predict top-k authors with confidence scores"""
        # Input format: "[citation] [SEP] [context]"
        text = f"{citation_text} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

        # Get predictions
        outputs = self.model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        top_k = torch.topk(probs[0], k=3)

        # Return [(author_urn, confidence), ...]
        return [(self.id_to_author[idx.item()], prob.item())
                for idx, prob in zip(top_k.indices, top_k.values)]


class WorkClassifier:
    """Stage 2: Classify to work URN (conditioned on author)"""

    def __init__(self):
        # Group works by author
        self.works_by_author = self.load_work_catalog()  # {"tlg0016": ["tlg0016.tlg001", "tlg0016.tlg002", ...]}
        self.max_works = max(len(works) for works in self.works_by_author.values())

        # Classifier conditioned on author
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/deberta-v3-base",
            num_labels=self.max_works
        )
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    def predict(self, citation_text, author_urn, context=""):
        """Predict work for given author"""
        # Get candidate works for this author
        candidates = self.works_by_author.get(author_urn, [])
        if not candidates:
            return []

        # Input format: "[author] [SEP] [citation] [SEP] [context]"
        text = f"{author_urn} [SEP] {citation_text} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

        outputs = self.model(**inputs)

        # Filter to only valid works for this author
        valid_logits = outputs.logits[0, :len(candidates)]
        probs = torch.softmax(valid_logits, dim=-1)

        # Return [(work_urn, confidence), ...] sorted by confidence
        results = [(candidates[i], probs[i].item()) for i in range(len(candidates))]
        return sorted(results, key=lambda x: x[1], reverse=True)


class HierarchicalURNResolver:
    """Complete hierarchical URN resolution system"""

    def __init__(self):
        self.author_classifier = AuthorClassifier()
        self.work_classifier = WorkClassifier()
        self.passage_parser = PassageParser()  # Rule-based
        self.edition_selector = EditionSelector()  # Rule-based

    def resolve(self, citation_text, context="", quote=""):
        """Resolve citation to URN with confidence score"""
        # Stage 1: Classify author
        author_candidates = self.author_classifier.predict(citation_text, context)

        # Stage 2: For each author candidate, classify work
        urn_candidates = []
        for author_urn, author_conf in author_candidates[:3]:  # Top-3 authors
            work_candidates = self.work_classifier.predict(
                citation_text, author_urn, context
            )

            for work_urn, work_conf in work_candidates[:3]:  # Top-3 works
                # Stage 3: Parse passage (rule-based)
                passage = self.passage_parser.extract(citation_text)

                # Stage 4: Select edition (rule-based)
                edition = self.edition_selector.select(author_urn, quote)

                # Stage 5: Assemble URN
                namespace = self.get_namespace(author_urn)  # greekLit or latinLit
                full_urn = f"urn:cts:{namespace}:{work_urn}.{edition}:{passage}"

                # Combined confidence (product of stages)
                confidence = author_conf * work_conf

                urn_candidates.append((full_urn, confidence, {
                    'author_conf': author_conf,
                    'work_conf': work_conf,
                    'author': author_urn,
                    'work': work_urn
                }))

        # Return best candidate
        if urn_candidates:
            best = max(urn_candidates, key=lambda x: x[1])
            return best[0], best[1], best[2]
        else:
            return None, 0.0, {}

Training Data Format:

# Training examples from resolved.jsonl (216K examples)
# Split into author and work classification tasks

# Stage 1: Author Classification
author_examples = [
    {
        "citation": "Hdt. 8.82",
        "context": "",
        "label": "tlg0016"  # Herodotus
    },
    {
        "citation": "Soph. OT 151",
        "context": "τᾶς πολυχρύσου Πυθῶνος",
        "label": "tlg0011"  # Sophocles
    },
    {
        "citation": "Plat. Rep. 332D",
        "context": "",
        "label": "tlg0059"  # Plato
    }
]

# Stage 2: Work Classification (conditioned on author)
work_examples = [
    {
        "citation": "Hdt. 8.82",
        "author": "tlg0016",
        "context": "",
        "label": "tlg0016.tlg001"  # Histories
    },
    {
        "citation": "Soph. OT 151",
        "author": "tlg0011",
        "context": "τᾶς πολυχρύσου Πυθῶνος",
        "label": "tlg0011.tlg004"  # Oedipus Tyrannus
    },
    {
        "citation": "Hom. Il. 7.268",
        "author": "tlg0012",
        "context": "Ajax hurled a rock",
        "label": "tlg0012.tlg001"  # Iliad (not Odyssey)
    }
]

Features Used:

Citation text (required): "Hdt. 8.82"
Context (when available): Surrounding text or quote
Author URN (for work classification): Condition on stage 1 output
Ground truth URN: Parse author/work from urn field

DNN Component 2: Confidence Scorer

Purpose: Identify incorrect resolutions from perseus-citation-processor.

Problem: From perseus-citation-processor README: "just because the processor resolves a citation doesn't mean that it resolves it correctly"

Model: DeBERTa-based Binary Classifier

from transformers import AutoModel, AutoTokenizer
import torch  # needed for torch.cat and torch.tensor below
import torch.nn as nn

class URNConfidenceScorer(nn.Module):
    def __init__(self, model_name="microsoft/deberta-v3-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)

        hidden_size = self.encoder.config.hidden_size
        feature_size = 10  # Engineered features

        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + feature_size, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, citation_text, urn, context=""):
        # Encode: "[citation] [SEP] [URN] [SEP] [context]"
        text = f"{citation_text} [SEP] {urn} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)

        outputs = self.encoder(**inputs)
        pooled = outputs.last_hidden_state[:, 0, :]  # CLS token

        # Compute engineered features
        features = self.compute_features(citation_text, urn, context)

        combined = torch.cat([pooled, features], dim=1)
        confidence = self.classifier(combined)

        return confidence

    def compute_features(self, citation, urn, context):
        """Engineered features for confidence scoring, shaped (1, 10) to match the pooled output"""
        # The helper methods below are placeholders; each is expected to return a number.
        return torch.tensor([[
            self.citation_urn_match_score(citation, urn),
            self.has_ambiguous_abbrev(citation),
            self.context_language_match(context, urn),
            self.passage_validity(urn),
            self.author_frequency(urn),
            self.work_frequency(urn),
            self.quote_presence(context),
            self.citation_length(citation),
            self.urn_complexity(urn),
            self.catalog_match_count(citation)
        ]], dtype=torch.float)

Training Data:

# Positive examples (confident resolutions)
positive_examples = [
    {
        "citation": "Hdt. 8.82",
        "urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
        "context": "Greek text context",
        "label": 1.0  # High confidence
    }
]

# Ambiguous cases (medium confidence)
ambiguous_examples = [
    {
        "citation": "Arist. Met.",  # Metaphysics or Meteorology?
        "urn": "urn:cts:greekLit:tlg0086.tlg025.perseus-grc2:",
        "context": "",
        "label": 0.6  # Medium confidence
    }
]

# Incorrect resolutions (low confidence)
negative_examples = [
    {
        "citation": "Hom. 7.268",  # Iliad or Odyssey?
        "urn": "urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:7.268",  # Odyssey
        "context": "Ajax hurled a rock",  # Ajax is in Iliad!
        "label": 0.1  # Low confidence - context mismatch
    }
]

DNN Component 3: Hybrid Orchestrator

Purpose: Route each citation between the rule-based processor and the DNN resolver, based on the confidence score.

class HybridURNResolver:
    def __init__(self):
        self.rule_based = PerseusProcessorWrapper()  # Calls Go binary
        self.dnn_resolver = HierarchicalURNResolver()  # Hierarchical classifier
        self.confidence_scorer = URNConfidenceScorer()

        # Thresholds (tune on validation set)
        self.high_confidence_threshold = 0.85
        self.low_confidence_threshold = 0.50

    def resolve(self, citation_text, context="", quote=""):
        # Step 1: Try rule-based
        rule_urn, rule_status = self.rule_based.resolve(citation_text)

        # Step 2: Score rule-based result
        if rule_urn:
            # .item() turns the scorer's (1, 1) tensor into a plain float
            confidence = self.confidence_scorer(citation_text, rule_urn, context).item()

            if confidence > self.high_confidence_threshold:
                # Always return (urn, confidence, source, details) so callers can unpack uniformly
                return rule_urn, confidence, "rule-based", {}

            # Medium confidence - get DNN opinion
            elif confidence > self.low_confidence_threshold:
                dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                    citation_text, context, quote
                )

                # Compare: rule-based vs DNN
                if dnn_conf > confidence:
                    return dnn_urn, dnn_conf, "dnn-override", dnn_details
                else:
                    return rule_urn, confidence, "rule-based-verified", {}

            # Low confidence - prefer DNN
            else:
                dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                    citation_text, context, quote
                )
                return dnn_urn, dnn_conf, "dnn-preferred", dnn_details

        # Step 3: Rule-based failed, use DNN only
        else:
            dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                citation_text, context, quote
            )
            return dnn_urn, dnn_conf, "dnn-only", dnn_details

Training Strategy

Phase 1: Train Author Classifier

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
import json

# Step 1: Extract author labels from URNs
def parse_urn_to_author(urn):
    """Extract author URN from full CTS URN"""
    # urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82 → tlg0016
    try:
        parts = urn.split(":")
        author_work = parts[3]  # tlg0016.tlg001.perseus-grc2
        author = author_work.split(".")[0]  # tlg0016
        return author
    except (AttributeError, IndexError):
        # Not a string, or too few ":"-separated parts to be a CTS URN
        return None

# Step 2: Build author vocabulary from perseus-citation-processor data
author_catalog = set()
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        if author:
            author_catalog.add(author)

author_to_id = {author: i for i, author in enumerate(sorted(author_catalog))}
id_to_author = {i: author for author, i in author_to_id.items()}

# Step 3: Prepare training data
train_data = []
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        if author and author in author_to_id:
            train_data.append({
                "text": f"{item['bibl']} [SEP] {item.get('quote', '')}",
                "label": author_to_id[author]
            })

# Train/val split
dataset = Dataset.from_list(train_data)
dataset = dataset.train_test_split(test_size=0.1)

# Step 4: Train DeBERTa classifier
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(author_catalog)
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

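import numpy as np

def compute_metrics(eval_pred):
    """Accuracy on the validation set; required by metric_for_best_model below"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

training_args = TrainingArguments(
    output_dir="./outputs/resolution/models/author-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    compute_metrics=compute_metrics,
)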

trainer.train()

Phase 2: Train Work Classifier

# Step 1: Parse work labels from URNs
def parse_urn_to_work(urn):
    """Extract work URN from full CTS URN"""
    # urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82 → tlg0016.tlg001
    try:
        parts = urn.split(":")
        author_work = parts[3]  # tlg0016.tlg001.perseus-grc2
        work = ".".join(author_work.split(".")[:2])  # tlg0016.tlg001
        return work
    except (AttributeError, IndexError):
        # Not a string, or too few ":"-separated parts to be a CTS URN
        return None

# Step 2: Build work vocabulary grouped by author
works_by_author = {}
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        work = parse_urn_to_work(item['urn'])
        if author and work:
            works_by_author.setdefault(author, set()).add(work)

# Step 3: Prepare training data (conditioned on author)
train_data = []
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        work = parse_urn_to_work(item['urn'])

        if author and work:
            # Input includes author URN to condition on
            train_data.append({
                "text": f"{author} [SEP] {item['bibl']} [SEP] {item.get('quote', '')}",
                "author": author,
                "label": work
            })

# Create label mapping (per-author work indices)
# ... (similar training setup)
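
The per-author label mapping mentioned in the last comment could be built along the following lines. This is a minimal sketch rather than the project's implementation: `build_work_label_maps` and `encode_work_labels` are hypothetical helper names, and the toy data only imitates the shape of `works_by_author` and `train_data` from the code above.

```python
def build_work_label_maps(works_by_author):
    """Map each author's works to contiguous indices (sorted for reproducibility)."""
    return {
        author: {work: i for i, work in enumerate(sorted(works))}
        for author, works in works_by_author.items()
    }


def encode_work_labels(train_data, work_to_id):
    """Return copies of the examples with the work string replaced by its per-author index."""
    encoded = []
    for example in train_data:
        author_map = work_to_id.get(example["author"], {})
        if example["label"] in author_map:
            encoded.append({**example, "label": author_map[example["label"]]})
    return encoded


# Toy data in the same shape as works_by_author / train_data (illustrative only)
toy_works = {
    "tlg0016": {"tlg0016.tlg001"},
    "tlg0012": {"tlg0012.tlg002", "tlg0012.tlg001"},
}
toy_train = [
    {"text": "tlg0012 [SEP] Hom. Od. 1.1 [SEP] ", "author": "tlg0012", "label": "tlg0012.tlg002"},
]
work_to_id = build_work_label_maps(toy_works)
encoded = encode_work_labels(toy_train, work_to_id)
```

Because indices restart at 0 for every author, the same integer means different works for different authors, so the classifier head (or a mask over its logits) has to be chosen per author at training and inference time.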

Phase 3: Train Confidence Scorer

Curate labeled examples with confidence scores
Use cross-validation on resolved.jsonl
Add manually labeled ambiguous cases (~1K examples)
Generate synthetic negative examples (incorrect author/work pairs)
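
One possible way to generate the synthetic negatives is to pair each resolved citation with a work it does not refer to. The sketch below assumes each item carries a `bibl` string and a `work` identifier; `make_negative_examples` and the field names are hypothetical and not taken from the project code.

```python
import random


def make_negative_examples(items, works, seed=0):
    """Pair each citation with a randomly chosen wrong work to get label-0 examples."""
    rng = random.Random(seed)  # seeded so the negatives are reproducible
    works = sorted(works)
    negatives = []
    for item in items:
        wrong = [w for w in works if w != item["work"]]
        if not wrong:
            continue  # no alternative work to swap in
        negatives.append({"bibl": item["bibl"], "work": rng.choice(wrong), "label": 0})
    return negatives


positives = [
    {"bibl": "Hdt. 8.82", "work": "tlg0016.tlg001", "label": 1},
    {"bibl": "Hom. Il. 7.268", "work": "tlg0012.tlg001", "label": 1},
]
catalog = {"tlg0016.tlg001", "tlg0012.tlg001", "tlg0003.tlg001"}
negatives = make_negative_examples(positives, catalog)
```

Uniformly random swaps give easy negatives; sampling wrong works by the same author would likely produce harder and more useful ones.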

Phase 4: Evaluate Hierarchical Classifier

Test on held-out validation set (10% of resolved.jsonl)
Measure component accuracy:
- Author classification accuracy
- Work classification accuracy (given correct author)
- End-to-end URN exact match
Test on 30K unresolved.jsonl examples
Manual evaluation on random sample for accuracy

Expected Performance

Conservative estimates:

Component	Coverage (baseline) / improvement
Rule-based baseline	87.6%
DNN on unresolved (30-50% success)	+4-6%
Confidence filtering (catch incorrect rules)	+2-3%
Total hybrid coverage	~94-97%

Note: Even resolving one third of the unresolved cases would be a significant improvement.
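
The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures from the table (87.6% baseline, 30-50% DNN success on the unresolved share, +2-3 points from confidence filtering); treating the gains as simply additive is an assumption.

```python
baseline = 87.6                 # rule-based coverage, in percent
unresolved = 100.0 - baseline   # share of citations the rules miss

# The DNN is assumed to resolve 30-50% of the unresolved share
dnn_low, dnn_high = 0.30 * unresolved, 0.50 * unresolved

# Confidence filtering is estimated at +2-3 points in the table
filter_low, filter_high = 2.0, 3.0

total_low = baseline + dnn_low + filter_low
total_high = baseline + dnn_high + filter_high
print(f"DNN gain: +{dnn_low:.1f} to +{dnn_high:.1f} points")
print(f"Hybrid coverage: {total_low:.1f}% to {total_high:.1f}%")
```

This gives a DNN gain of +3.7 to +6.2 points and a hybrid coverage of 93.3% to 96.8%, which matches the rounded ~94-97% in the table.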

Evaluation Metrics

For URN Resolution:

Exact match accuracy: Percentage of perfectly resolved URNs
Component accuracy: Separate metrics for author, work, passage
Coverage: Percentage of citations with confident predictions
Precision@K: Fraction of citations whose correct URN appears among the top-K predictions
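
A minimal sketch of how these metrics could be computed from gold URNs and ranked predictions. The function names, the input format (one best-first candidate list per citation, empty when the system abstains), and the choice to divide every metric by the total number of citations are assumptions made for illustration.

```python
def split_urn(urn):
    """Split a CTS URN into (author, work, passage); missing parts become None."""
    parts = urn.split(":")
    ids = parts[3].split(".") if len(parts) > 3 else []
    author = ids[0] if ids else None
    work = ".".join(ids[:2]) if len(ids) > 1 else None
    passage = parts[4] if len(parts) > 4 else None
    return author, work, passage


def resolution_metrics(gold, ranked_predictions, k=3):
    """Compute coverage, exact match, component accuracy, and top-k accuracy."""
    n = len(gold)
    if n == 0:
        raise ValueError("need at least one gold URN")
    answered = [(g, p) for g, p in zip(gold, ranked_predictions) if p]
    exact = sum(g == p[0] for g, p in answered)
    top_k = sum(g in p[:k] for g, p in answered)
    author = sum(split_urn(g)[0] == split_urn(p[0])[0] for g, p in answered)
    work = sum(split_urn(g)[1] == split_urn(p[0])[1] for g, p in answered)
    return {
        "coverage": len(answered) / n,
        "exact_match": exact / n,
        "author_accuracy": author / n,
        "work_accuracy": work / n,
        f"top_{k}_accuracy": top_k / n,
    }


prefix = "urn:cts:greekLit:"
gold = [
    prefix + "tlg0016.tlg001.perseus-grc2:8.82",
    prefix + "tlg0012.tlg001.perseus-grc2:7.268",
    prefix + "tlg0003.tlg001.perseus-grc2:3.38",
    prefix + "tlg0016.tlg001.perseus-grc2:1.1",
]
ranked = [
    [prefix + "tlg0016.tlg001.perseus-grc2:8.82"],        # exact match
    [prefix + "tlg0012.tlg002.perseus-grc2:7.268",        # right author, wrong work
     prefix + "tlg0012.tlg001.perseus-grc2:7.268"],       # correct URN ranked second
    [],                                                   # system abstained
    [prefix + "tlg0016.tlg001.perseus-grc2:1.2"],         # right work, wrong passage
]
metrics = resolution_metrics(gold, ranked, k=3)
```

Dividing by the total number of citations makes an abstention count as a miss; dividing by the answered citations instead would report precision on the covered subset.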

Data Split Considerations

Unlike tag extraction, which is split by document, URN resolution should be split by unique citation pattern so that evaluation tests generalization:

Test on unseen author abbreviations
Test on unseen work titles
Test on seen authors but unseen works
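
A pattern-based split could look like the sketch below. It assumes that a citation's "pattern" is its abbreviation with the trailing passage numbers removed, and that each item has a `bibl` field; `citation_pattern` and `split_by_pattern` are hypothetical names. Hashing the pattern keeps the assignment stable across runs without storing a list of held-out patterns.

```python
import hashlib
import re


def citation_pattern(bibl):
    """Reduce a reference to its abbreviation, e.g. 'Hdt. 8.82' -> 'hdt'."""
    return re.sub(r"[\d\s.,\u2013-]+$", "", bibl.strip()).lower()


def split_by_pattern(items, test_fraction=0.1):
    """Assign every citation sharing a pattern to the same side of the split."""
    train, test = [], []
    for item in items:
        key = citation_pattern(item["bibl"])
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32  # stable value in [0, 1)
        (test if bucket < test_fraction else train).append(item)
    return train, test


items = [
    {"bibl": b}
    for b in ["Hdt. 8.82", "Hdt. 1.1", "Hom. Il. 7.268", "Thuc. 3.38", "Hom. Il. 1.1"]
]
train, test = split_by_pattern(items, test_fraction=0.5)
```

With this grouping, no abbreviation seen in training appears in the test set. Testing "seen authors but unseen works" would need a second grouping keyed on the work rather than the whole abbreviation.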

Next Steps

Tag Extraction

Completed:

✅ Data pipeline (JSONL → BIO format with special tokens)
✅ DeBERTa token classification model
✅ Training pipeline with data splitting by filename
✅ Evaluation metrics (seqeval for BIO tagging)
✅ Inference model for predictions on plain text
✅ Test suite (98 tests covering data loading, model, training)

Remaining:

⏳ Train on full dataset and tune hyperparameters
⏳ Error analysis on test set predictions
⏳ DeBERTa+CRF implementation (if baseline insufficient)
⏳ Production deployment and inference optimization

URN Resolution

Phase 1: Rule-based Baseline (Week 1-2)

Integration: Wrap perseus-citation-processor as a Python callable
Baseline metrics: Establish 87.6% resolution rate on 216K citations
Error analysis: Analyze 30K unresolved.jsonl patterns
Catalog extraction: Load author/work URNs from perseus-citation-processor data

Phase 2: Hierarchical Classifiers (Week 3-5)

Catalog extraction: Extract author/work vocabularies from resolved.jsonl URNs
Train author classifier: DeBERTa classification over ~500 authors
- Input: citation + context
- Output: author URN (e.g., tlg0016)
Train work classifier: DeBERTa classification conditioned on author
- Input: author + citation + context
- Output: work URN (e.g., tlg0016.tlg001)
Implement passage parser: Rule-based extraction of passage references
Evaluate components: Measure author accuracy, work accuracy, end-to-end
Test on unresolved: Evaluate on 30K unresolved.jsonl citations
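
The rule-based passage parser could start from a single regular expression over the end of the reference. This is a sketch under the assumption that passages are dot-separated numbers with an optional range; `parse_passage` and `build_urn` are hypothetical names, and real references will need more cases (Stephanus pages, line letters, and so on).

```python
import re

# Trailing passage reference such as "8.82", "7.268-272" or "1.1.3"
PASSAGE_RE = re.compile(r"(\d+(?:\.\d+)*)(?:\s*[-\u2013]\s*(\d+(?:\.\d+)*))?\s*$")


def parse_passage(bibl):
    """Return (start, end) passage strings from the end of a reference, or None."""
    match = PASSAGE_RE.search(bibl.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)  # end is None when there is no range


def build_urn(work_urn, passage):
    """Join a resolved work/edition URN with a passage reference."""
    return f"{work_urn}:{passage}"


start, end = parse_passage("Hdt. 8.82")
urn = build_urn("urn:cts:greekLit:tlg0016.tlg001.perseus-grc2", start)
```

Here the work classifier would supply the work/edition part of the URN and the parser only contributes the passage, so the two components can be evaluated separately.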

Phase 3: Confidence Scorer (Week 5-6)

Training data curation: Label confident vs ambiguous resolutions
Train DeBERTa classifier: (citation, URN, context) → confidence score
Threshold tuning: Calibrate confidence thresholds on validation set
Cross-validation: Test on ambiguous cases from resolved.jsonl
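
Threshold tuning on the validation set can be as simple as scanning candidate thresholds and keeping the lowest one that still meets a target precision. The sketch below is one way to do that; `tune_threshold`, the 0.9 target, and the toy scores are all illustrative assumptions.

```python
def tune_threshold(scores, correct, min_precision=0.95):
    """Pick the lowest threshold whose accepted predictions reach min_precision.

    scores: confidence per prediction; correct: whether each prediction was right.
    Returns (threshold, precision, coverage), or None if no threshold qualifies.
    """
    best = None
    for threshold in sorted(set(scores), reverse=True):
        accepted = [ok for s, ok in zip(scores, correct) if s >= threshold]
        precision = sum(accepted) / len(accepted)
        coverage = len(accepted) / len(scores)
        if precision >= min_precision:
            best = (threshold, precision, coverage)  # lower thresholds overwrite
    return best


val_scores = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40]
val_correct = [True, True, True, False, True, False]
result = tune_threshold(val_scores, val_correct, min_precision=0.9)
```

On the toy data, the result is a threshold of 0.90 with precision 1.0 and coverage 0.5: accepting anything scored lower would admit the wrong prediction at 0.80 and drop precision below the target.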

Phase 4: Hybrid System (Week 7-8)

Orchestrator implementation: Combine rule-based + DNN components
End-to-end evaluation: Measure coverage improvement (target: 94-97%)
Component analysis: Track rule-based vs DNN vs hybrid performance
Error analysis: Identify remaining failure modes
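
The orchestrator logic can be sketched as a small function that tries the rules first, falls back to the DNN, and filters the result by confidence. The three callables are stand-ins for components that do not exist yet, and their signatures are assumptions, as is the 0.5 default threshold.

```python
def resolve_citation(bibl, context, rule_resolver, dnn_resolver, scorer, threshold=0.5):
    """Try the rule-based resolver first, fall back to the DNN, then filter by confidence.

    rule_resolver(bibl) -> URN or None
    dnn_resolver(bibl, context) -> URN or None
    scorer(bibl, urn, context) -> confidence in [0, 1]
    """
    urn, source = rule_resolver(bibl), "rules"
    if urn is None:
        urn, source = dnn_resolver(bibl, context), "dnn"
    if urn is None:
        return {"urn": None, "source": None, "confidence": 0.0}
    confidence = scorer(bibl, urn, context)
    if confidence < threshold:
        return {"urn": None, "source": source, "confidence": confidence}
    return {"urn": urn, "source": source, "confidence": confidence}


# Toy stand-ins for the real components (hypothetical)
RULE_TABLE = {"Hdt. 8.82": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"}


def toy_rules(bibl):
    return RULE_TABLE.get(bibl)


def toy_dnn(bibl, context):
    if bibl.startswith("Thuc"):
        return "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3.38"
    return None


def toy_scorer(bibl, urn, context):
    return 0.9


by_rules = resolve_citation("Hdt. 8.82", "", toy_rules, toy_dnn, toy_scorer)
by_dnn = resolve_citation("Thuc. 3.38", "", toy_rules, toy_dnn, toy_scorer)
unresolved = resolve_citation("ibid.", "", toy_rules, toy_dnn, toy_scorer)
```

Recording the `source` of every resolution makes the component analysis above straightforward, since results can be grouped by rules, DNN, or unresolved.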

Project Structure

perseus-citation-model/
│
├── cit_data/                          # Raw training data (not in repo)
│   ├── resolved.jsonl                 # 216K citations with URNs
│   └── unresolved.jsonl               # 30K citations without URNs
│
├── model_data/                        # Partitioned data
│   ├── extraction                     # Partitions for extraction task
│   └── resolution                     # Partitions for resolution task
│
├── outputs/                           # Fine-tuned model weights from training
│
├── src/                               # Source code
│   └── perscit_model/                 # Main package
│       ├── __init__.py
│       ├── shared/                    # Shared utilities across tasks
│       │   ├── __init__.py
│       │   ├── data_loader.py         # Base JSONL loader, tokenization
│       │   └── training_utils.py      # Training configuration utilities
│       ├── extraction/                # Task 1: Tag Extraction (BIO tagging)
│       │   ├── __init__.py
│       │   ├── data_loader.py         # XML → special tokens → BIO labels
│       │   ├── model.py               # DeBERTa token classification model
│       │   ├── train.py               # Training pipeline and data splitting
│       │   ├── evaluate.py            # Evaluation on test set
│       │   └── inference.py           # Inference model for predictions
│       └── resolution/                # Task 2: URN Resolution
│           ├── __init__.py
│           └── data_loader.py         # Citation data loading for resolution
│
├── configs/                           # Configuration files
│   └── extraction/
│       └── baseline.yaml              # Hyperparameters (model, max_length)
│
├── tests/                             # Test suite (98 tests)
│   ├── conftest.py                    # Shared fixtures (mock tokenizer)
│   ├── fixtures/                      # Test data
│   │   └── sample_extraction.jsonl   # 5 real citation examples
│   ├── unit/                          # Fast unit tests (88 tests, ~3s)
│   │   ├── test_extraction_dataset.py    # BIO label generation tests
│   │   ├── test_extraction_loader.py     # Data loader tests
│   │   ├── test_extraction_pipeline.py   # End-to-end pipeline tests
│   │   ├── test_resolution_loader.py     # Resolution data tests
│   │   └── test_shared_data_loader.py    # Shared utility tests
│   └── integration/                   # Slow integration tests (10 tests, ~8s)
│       └── test_extraction_model.py      # Real model loading/training tests
│
├── pyproject.toml                     # Project config, dependencies, test settings
├── .gitignore
├── .python-version                    # Python 3.13
└── README.md

Key Implementation Details:

Extraction Data Pipeline (Special Tokens Approach)

Instead of word-level BIO tagging with complex alignment, we use special tokens:

XML → Special Tokens: Replace <bibl>, <quote>, <cit> tags with [BIBL_START], [BIBL_END], etc.
Add to Vocabulary: Special tokens added to DeBERTa tokenizer (won't be split)
Tokenize: DeBERTa tokenizes text with special tokens intact
Generate BIO Labels: State machine generates labels based on special token positions
Strip Special Tokens: Remove special tokens from input while keeping labels aligned

Example:

XML:      <bibl>Hdt. 8.82</bibl> some context
↓
Special:  [BIBL_START] Hdt. 8.82 [BIBL_END] some context
↓
Tokens:   [CLS] [BIBL_START] Hdt . 8 . 82 [BIBL_END] some context [SEP]
↓
Labels:   -100  -100          B-  I- I- I- I- -100        O    O       -100
↓ (strip special tokens)
Final Tokens: [CLS] Hdt . 8 . 82 some context [SEP]
Final Labels: -100  B-  I- I- I- I- O    O       -100

Key point: The model sees only [CLS] Hdt . 8 . 82 some context [SEP] during training, NOT the special tokens. It must learn to predict citation boundaries from the context alone.
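
The label-generation and stripping steps can be sketched as a small state machine over the token sequence. This is an illustration, not the code in extraction/data_loader.py: the marker names other than [BIBL_START] and [BIBL_END] are guesses following the same pattern, the full label names (B-BIBL, I-BIBL) are assumed, and labels are kept as strings here, whereas a real pipeline would map them to integer ids.

```python
IGNORE = -100  # label for tokens the loss should skip

# Hypothetical marker names following the [BIBL_START]/[BIBL_END] pattern
MARKERS = {
    "[BIBL_START]": ("start", "BIBL"), "[BIBL_END]": ("end", "BIBL"),
    "[QUOTE_START]": ("start", "QUOTE"), "[QUOTE_END]": ("end", "QUOTE"),
}
STRIP_ONLY = {"[CIT_START]", "[CIT_END]"}  # removed without producing labels
SKIP = {"[CLS]", "[SEP]"}


def bio_labels(tokens):
    """Walk the tokens once, drop the markers, and emit one label per kept token."""
    kept, labels = [], []
    current, at_start = None, False
    for token in tokens:
        if token in STRIP_ONLY:
            continue
        if token in MARKERS:
            kind, tag = MARKERS[token]
            current, at_start = (tag, True) if kind == "start" else (None, False)
            continue  # the marker is stripped from the model input
        kept.append(token)
        if token in SKIP:
            labels.append(IGNORE)
        elif current is None:
            labels.append("O")
        else:
            labels.append(("B-" if at_start else "I-") + current)
            at_start = False
    return kept, labels


tokens = "[CLS] [BIBL_START] Hdt . 8 . 82 [BIBL_END] some context [SEP]".split()
final_tokens, final_labels = bio_labels(tokens)
```

Because the markers are removed in the same pass that produces the labels, tokens and labels stay aligned by construction, which is the point of the special-tokens approach.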

Advantages over word-level alignment:

No complex subword↔word alignment logic
Special tokens guaranteed not to split
Simpler, more reliable label generation
Handles malformed XML gracefully (BeautifulSoup repair)

Model Initialization

Embedding Resizing:

Base DeBERTa vocab: 128,000 tokens
+6 special tokens = 128,006 tokens
New embeddings initialized to mean of existing embeddings (training stability)

Testing

# Fast unit tests only (default)
pytest                    # 88 tests in ~3s

# Integration tests (downloads real models)
pytest tests/integration  # 10 tests in ~8s

# All tests
pytest tests              # 98 tests in ~9s

End-to-End Pipeline

The two tasks can be combined into a complete citation processing pipeline:

Raw Text → Tag Extraction → URN Resolution → Structured Citations

Example workflow:

Input: "Homer mentions this in Il. 7.268-272: 'Ajax hurled a rock'"
Tag Extraction: Identify <bibl>Il. 7.268-272</bibl> and <quote>Ajax hurled a rock</quote>
URN Resolution: Map "Il. 7.268-272" → "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"
Output: Structured citation with linked canonical reference

This enables:

Automated citation extraction from plain text
Linking to canonical text passages
Cross-referencing across documents
Building citation networks in classical scholarship

Useful Links

General:

TEI Guidelines: https://tei-c.org/release/doc/tei-p5-doc/en/html/
CTS URN Specification: http://cite-architecture.github.io/cts_spec/
Perseus Digital Library: https://www.perseus.tufts.edu/

Tag Extraction:

HuggingFace Token Classification: https://huggingface.co/docs/transformers/tasks/token_classification
pytorch-crf: https://github.com/kmkurn/pytorch-crf
seqeval metrics: https://github.com/chakki-works/seqeval
DeBERTa paper: https://arxiv.org/abs/2006.03654

URN Resolution:

DeBERTa paper: https://arxiv.org/abs/2006.03654
Hierarchical classification: https://arxiv.org/abs/1904.02817
CTS URN Specification: http://cite-architecture.github.io/cts_spec/
CTS API documentation: http://cite-architecture.github.io/cts/
perseus-citation-processor: https://github.com/andrewbird2/perseus-citation-processor
Fuzzy string matching: https://github.com/seatgeek/fuzzywuzzy
