Perseus Citation Model

Machine learning models for identifying citation structures in classical texts and resolving bibliographic references to canonical URNs.

Current README largely contains notes for my own use.

Project Status: Early Development

  • ✅ Data pipeline implemented (extraction task)
  • ✅ Model initialization and embedding handling
  • ✅ Fine-tuning for extraction task
  • ⏳ XML file processing using fine-tuned model (in progress)
  • ⏳ URN resolution implementation (planned)

Installation

Requirements:

  • Python 3.13+
  • uv package manager (recommended) or pip

Setup:

# Clone repository
git clone https://github.com/andrmayo/perseus-citation-model.git
cd perseus-citation-model

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e ".[dev]"

Overview

This project provides two complementary ML tasks for working with citations in TEI-encoded XML documents from the Perseus Digital Library:

  1. Tag Extraction: Identify and extract citation tags (<cit>, <quote>, <bibl>) from plain text
  2. URN Resolution: Map bibliographic references to Canonical Text Services (CTS) URNs

Both tasks share data pipelines and preprocessing infrastructure but use different model architectures appropriate to each problem.

Task Definitions

Tag Extraction

Input: Plain text extracted from TEI XML documents

Output: Token-level tags identifying citation boundaries

Target tags:

  • <cit> - Citation container
  • <quote> - Quoted text
  • <bibl> - Bibliographic reference

Note: <cit> tags surround a quote-bibl pair, so they can be inserted by rule once the <bibl> and <quote> elements have been identified.

Challenges:

  • Variable citation formats
  • Mixed languages (Greek, Latin, English)
  • Context-dependent identification

URN Resolution

Input: Bibliographic reference text (e.g., "Hdt. 8.82")

Output: CTS URN (e.g., "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82")

Examples:

  • "Hom. Il. 7.268" → "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"
  • "Thuc. 3.38" → "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3.38"
  • "Plat. Rep. 332D" → "urn:cts:greekLit:tlg0059.tlg030.perseus-grc2:332d"
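To make the URN layout in these examples concrete, here is a minimal sketch that splits a CTS URN into its parts. The `CtsUrn` dataclass and `parse_cts_urn` function are illustrative names, not part of this repository, and the sketch assumes every URN carries at least an author and a work component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtsUrn:
    namespace: str  # e.g. "greekLit"
    author: str     # text group, e.g. "tlg0016"
    work: str       # e.g. "tlg001"
    edition: str    # e.g. "perseus-grc2" ("" if absent)
    passage: str    # e.g. "8.82" ("" if absent)


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split 'urn:cts:<namespace>:<author>.<work>[.<edition>][:<passage>]'."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    work_parts = parts[3].split(".")
    if len(work_parts) < 2:
        raise ValueError(f"missing work component: {urn!r}")
    edition = ".".join(work_parts[2:])
    passage = parts[4] if len(parts) > 4 else ""
    return CtsUrn(parts[2], work_parts[0], work_parts[1], edition, passage)


parsed = parse_cts_urn("urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82")
```

Keeping author, work, edition and passage as separate fields mirrors the way the resolution task is later broken into stages.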

Challenges:

  • Abbreviated author names (Hdt., Hom., Thuc.)
  • Work title variations and abbreviations
  • Range notation (e.g., "7.268-272", "sqq.")
  • Unresolvable references to modern scholarship (e.g., "ARV2, 987")
  • Missing URNs for ~12% of citations

Data Format

Training data is in JSONL format with two files:

resolved.jsonl (~216K examples) - Citations with URNs:

{
  "bibl": "Hdt. 8.82",
  "quote": "",
  "xml_context": "...full XML context with tags...",
  "filename": "xml_files/viaf17286815.viaf001.xml",
  "urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
  "ref": "hdt. 8.82",
  "n_attrib": "Hdt. 8.82",
  "doc_cit_urn": ":citations-28.3"
}

unresolved.jsonl (~30K examples) - Citations without URNs:

{
  "bibl": "FR, pl. 167,2",
  "quote": "",
  "xml_context": "...full XML context with tags...",
  "filename": "xml_files/viaf114145308.viaf001.xml",
  "urn": "",
  "ref": "fr pl. 167,2",
  "n_attrib": "",
  "doc_cit_urn": ":citations-24.1"
}

Key fields:

  • bibl: Bibliographic reference text
  • quote: Quoted text (often empty)
  • xml_context: XML snippet with tags for tag extraction training
  • urn: CTS URN (empty for unresolved citations)
  • ref: Normalized reference text
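A minimal sketch of reading records in this format, shown on an in-memory sample so it runs without the data files. `iter_citations` is a hypothetical helper, and the sample lines are trimmed copies of the records above.

```python
import io
import json
from typing import Iterator, TextIO


def iter_citations(handle: TextIO) -> Iterator[dict]:
    """Yield one citation record per non-empty JSONL line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two trimmed sample records: one resolved, one unresolved.
sample = io.StringIO(
    '{"bibl": "Hdt. 8.82", "quote": "", '
    '"urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"}\n'
    '{"bibl": "FR, pl. 167,2", "quote": "", "urn": ""}\n'
)
records = list(iter_citations(sample))
resolved = [r for r in records if r["urn"]]
unresolved = [r for r in records if not r["urn"]]
```

With the real files, the same function can be applied to `open("cit_data/resolved.jsonl", encoding="utf-8")`.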

Task 1: Tag Extraction

Fine-tune a pre-trained transformer model (DeBERTa) for sequence labeling using BIO tagging.

Architecture

Input Text → Tokenizer → Transformer Encoder → Linear Layer → Softmax → BIO Tags

BIO Tagging Scheme

Each token is labeled with one of:

  • O - Outside any citation tag
  • B-CIT - Beginning of <cit> tag
  • I-CIT - Inside <cit> tag
  • B-QUOTE - Beginning of <quote> tag
  • I-QUOTE - Inside <quote> tag
  • B-BIBL - Beginning of <bibl> tag
  • I-BIBL - Inside <bibl> tag

Example:

Text:  Hom.    Il.     7.268   -       272     :  "Ajax    hurled   a        rock"
Tags:  B-BIBL  I-BIBL  I-BIBL  I-BIBL  I-BIBL  O  B-QUOTE  I-QUOTE  I-QUOTE  I-QUOTE
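As a sketch of how BIO tags map back to entity spans, the function below groups a tag sequence into (label, start, end) token ranges. `bio_to_spans` is an illustrative name, not the project's decoder, and it treats a stray I- tag with no matching B- as O, which is one possible policy among several.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (label, start, end) token spans, end exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the span that is currently open.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        # A stray I- tag with no matching B- is treated as O here.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["Hom.", "Il.", "7.268", "-", "272", ":", '"Ajax', "hurled", "a", 'rock"']
tags = ["B-BIBL", "I-BIBL", "I-BIBL", "I-BIBL", "I-BIBL", "O",
        "B-QUOTE", "I-QUOTE", "I-QUOTE", "I-QUOTE"]
spans = bio_to_spans(tags)
```

Here the five reference tokens form one BIBL span, the colon is outside any span, and the four quoted tokens form one QUOTE span.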

Model Selection

The project currently uses microsoft/deberta-v3-base, for the following reasons:

  • Superior contextual understanding for nested structures
  • Better multilingual handling (Greek, Latin, English)
  • State-of-the-art performance on token classification
  • 1-3% F1 improvement over RoBERTa on similar tasks

One alternative that might be worth trying is roberta-base, which makes sense when:

  • Faster inference is needed (~10-15% faster than DeBERTa)
  • Memory is constrained
  • A strong baseline is sufficient

CRF Decoding Layer

To help the model learn label sequences rather than individual token labels, I've added a CRF decoding layer. A secondary benefit in this case is that it enforces valid label sequences, which here just means that an I- label can only follow a B- or I- label of the same type.

Architecture with CRF

Input Text → Tokenizer → Transformer Encoder → Linear Layer → CRF Layer → BIO Tags
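The validity constraint can be illustrated without any ML library. The sketch below is not the trained CRF, which also learns transition scores; it only applies hard BIO constraints during Viterbi decoding over per-token scores. The label set (without CIT) and all function names are assumptions made for the example.

```python
import math

LABELS = ["O", "B-BIBL", "I-BIBL", "B-QUOTE", "I-QUOTE"]


def allowed(prev: str, curr: str) -> bool:
    """I-X may only follow B-X or I-X; every other transition is allowed."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True


def constrained_viterbi(emissions):
    """Best valid label path for a list of per-token {label: score} dicts."""
    # A sequence may not start with an I- label.
    best = {
        lab: (-math.inf if lab.startswith("I-") else emissions[0][lab], [lab])
        for lab in LABELS
    }
    for scores in emissions[1:]:
        step = {}
        for curr in LABELS:
            candidates = [
                (best[prev][0] + scores[curr], best[prev][1] + [curr])
                for prev in LABELS
                if allowed(prev, curr)
            ]
            step[curr] = max(candidates, key=lambda c: c[0])
        best = step
    return max(best.values(), key=lambda c: c[0])[1]
```

Greedy per-token argmax can output O followed by I-BIBL; the constrained decoder instead picks the best-scoring path that respects the B-/I- ordering.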

Dataset Splitting

The training pipeline automatically splits data by filename to prevent data leakage:

from perscit_model.extraction.train import split_data

train_path, val_path, test_path = split_data(
    input_file="cit_data/resolved.jsonl",
    output_dir="model_data/extraction",
    train_ratio=0.8,  # Default from config
    val_ratio=0.1,
    test_ratio=0.1,
    seed=42  # Default from config
)

Implementation details:

  • Splits by filename (not individual examples) to prevent data leakage
  • Tries multiple shuffles to get close to target ratios
  • Saves split configuration to prevent accidental re-splitting
  • Reuses existing splits if configuration matches
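For illustration, a simplified version of the split-by-filename idea might look like the following. This is not the code in `perscit_model.extraction.train`; `split_by_filename` and its `tries` parameter are hypothetical, it assumes a non-empty list of records, and it leaves out the saved split configuration.

```python
import random
from collections import defaultdict


def split_by_filename(records, ratios=(0.8, 0.1, 0.1), seed=42, tries=20):
    """Assign whole files to train/val/test, keeping the shuffle whose
    example-level proportions land closest to the target ratios."""
    by_file = defaultdict(list)
    for rec in records:
        by_file[rec["filename"]].append(rec)
    files = sorted(by_file)
    total = len(records)
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(tries):
        order = files[:]
        rng.shuffle(order)
        splits, counts = ([], [], []), [0, 0, 0]
        part = 0
        for name in order:
            # Move to the next split once the current one has met its quota.
            while part < 2 and counts[part] >= ratios[part] * total:
                part += 1
            splits[part].append(name)
            counts[part] += len(by_file[name])
        err = sum(abs(c / total - r) for c, r in zip(counts, ratios))
        if err < best_err:
            best, best_err = splits, err
    return [[rec for name in names for rec in by_file[name]] for names in best]
```

Because files differ a lot in size, a single shuffle can miss the target ratios badly, which is why several shuffles are compared.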

Training Details

  • Training done via curriculum learning:
    • Phase 1 on just passages with citations
    • Phase 2 on all dictionary XML files, especially LSJ, with data augmentation described below
    • Phase 3 on commentary XML files, with some <bibl> and <quote> tags randomly retained and their contents labelled "O", so that the model learns to ignore citations that are already tagged

NB: The LSJ is about half of the raw amount of text in the phase 2 training corpus, and has more than half of the citation tokens. The LSJ also has certain distorting features, above all the consistent use of <author> and <title> tags in citations. Hence, to prevent over-reliance on these tags, they get stripped at a rate of 50%. It would be risky to strip them at a higher rate than this unless done during preprocessing, given that it is likely to produce a strong correlation between text length and the presence of citations. It should also probably only be used alongside random token dropping (which makes this correlation noisier, but doesn't eliminate it in expectation).
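A rough sketch of the tag-stripping augmentation, assuming the 50% rate is applied per passage (the note above does not say whether it is per passage or per tag). `strip_author_title` is a hypothetical helper, not the project's implementation.

```python
import random
import re

# Matches opening or closing <author>/<title> tags, with optional attributes.
_TAG_RE = re.compile(r"</?(?:author|title)\b[^>]*>")


def strip_author_title(xml_text: str, rate: float = 0.5, rng=None) -> str:
    """With probability `rate`, remove <author>/<title> tags but keep their text."""
    rng = rng or random.Random()
    if rng.random() < rate:
        return _TAG_RE.sub("", xml_text)
    return xml_text


passage = '<bibl><author>Hom.</author> <title>Il.</title> 7.268</bibl>'
stripped = strip_author_title(passage, rate=1.0)
```

Passing a seeded `random.Random` as `rng` keeps the augmentation reproducible across runs.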

Processing Full XML Documents

The extraction model was trained on 512-token windows centered around citations. To process full XML documents that exceed this context window, we need a sliding window approach with special handling for:

  1. Context window boundaries (avoiding edge effects)
  2. Split entities (citations spanning multiple windows)
  3. Existing citations (preserving already-identified citations)
  4. Reconstructing <cit> tags (wrapping bibl-quote pairs)

Sliding Window Strategy: "Reliable Center" Method

Problem: Citations near window edges may lack sufficient context, reducing prediction quality.

Solution: Process with overlapping windows, but only trust predictions from the center region.

Algorithm:

  1. Window parameters:

    • Window size: 512 tokens (same as training)
    • Stride: 256 tokens (50% overlap)
    • Reliable region: Center 256 tokens of each window
    • Exception: First and last windows trust predictions to document edges
  2. Coverage guarantee: Every position in the document appears in the center of at least one window

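  3. Handles split entities: With 50% overlap, any entity shorter than 256 tokens appears in full within at least one window, though not necessarily within a single reliable center; entities that straddle two centers are stitched together at the character level (see Character-Level Label Merging)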

Example:

Document: [============================]
Window 1:  [512 tokens]
           [  reliable ]
Window 2:         [512 tokens]
                  [  reliable ]
Window 3:                [512 tokens]
                         [  reliable ]
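The window layout can be sketched as a small planning function. `plan_windows` is an illustrative name; the sketch assumes token indices, adds a final window flush with the document end when the stride does not land there, and lets each reliable region start where the previous one stopped so that the regions tile the document.

```python
def plan_windows(n_tokens, window=512, stride=256):
    """Return (start, end, reliable_start, reliable_end) tuples in token indices."""
    margin = (window - stride) // 2  # 128 tokens trimmed from each side
    starts = list(range(0, max(n_tokens - window, 0) + 1, stride))
    # Make sure the last window reaches the end of the document.
    if starts[-1] + window < n_tokens:
        starts.append(n_tokens - window)
    plans = []
    for i, start in enumerate(starts):
        end = min(start + window, n_tokens)
        # First window is trusted from the document start; later windows
        # continue exactly where the previous reliable region stopped.
        rel_start = 0 if i == 0 else plans[-1][3]
        # Last window is trusted through to the document end.
        rel_end = n_tokens if i == len(starts) - 1 else end - margin
        plans.append((start, end, rel_start, rel_end))
    return plans


plans = plan_windows(1000)
```

For a 1000-token document this gives three windows whose reliable regions are 0-384, 384-640 and 640-1000.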

Handling Existing Citations

Goal: Supplement (not replace) existing citation tags in XML files.

Approach: Post-hoc filtering after inference.

Algorithm:

  1. Extract existing citations:

    • Parse XML and identify all <bibl>, <quote>, <cit> tags
    • Store as character-level spans: [(start_pos, end_pos, tag_type), ...]
  2. Strip citation tags for inference:

    • Remove <bibl>, <quote>, <cit> tags (keep other XML tags)
    • Run inference on stripped text
  3. Merge predictions:

    • Filter out predicted entities that overlap with existing citations
    • Keep existing citations unchanged
    • Insert new predicted citations

Why this works:

  • Model sees natural context around existing citations
  • Simple conflict resolution (existing citations take precedence)
  • Can later extend to more sophisticated merge strategies
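The merge step reduces to an interval-overlap filter. A minimal sketch, assuming spans are (start_pos, end_pos, tag_type) tuples with an exclusive end position; `overlaps` and `merge_with_existing` are illustrative names.

```python
def overlaps(a, b):
    """True if half-open character spans (start, end, ...) share any position."""
    return a[0] < b[1] and b[0] < a[1]


def merge_with_existing(existing, predicted):
    """Keep every existing span; add predicted spans that touch none of them."""
    kept = [p for p in predicted if not any(overlaps(p, e) for e in existing)]
    return sorted(existing + kept)


existing = [(10, 20, "bibl")]
predicted = [(15, 25, "bibl"), (30, 40, "quote")]
merged = merge_with_existing(existing, predicted)
```

The prediction at 15-25 is dropped because it collides with the existing citation, while the one at 30-40 is added.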

Character-Level Label Merging

Problem: Overlapping windows produce multiple predictions for the same tokens.

Solution: Merge predictions at the character level, only using reliable regions.

Algorithm:

  1. Initialize: Create character-level label array: char_labels = ['O'] * len(text)

  2. For each window:

    • Get token-level predictions from model
    • Convert to character-level using tokenizer offset mapping
    • Determine reliable character range (e.g., chars corresponding to tokens 128-384)
    • Update char_labels only in the reliable region
  3. Extract entities: Convert final char_labels to entity spans using BIO logic

Advantages:

  • Handles split entities automatically (continuous character regions)
  • Clean merging of overlapping predictions
  • No special logic needed for window boundaries
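A sketch of the merge and span extraction, under two assumptions: token offsets are (char_start, char_end) pairs into the full text, as a fast tokenizer's offset mapping provides, and the gap before an I- token (usually whitespace) belongs to the same entity. The function names are illustrative, and back-to-back entities of the same type with no gap between them are not separated.

```python
def merge_window(char_labels, token_labels, offsets, reliable):
    """Write one window's token labels into char_labels, reliable tokens only.

    offsets[i] is the (char_start, char_end) of token i in the full text;
    reliable is the (first, last_exclusive) token range to trust.
    """
    first, last = reliable
    for i in range(first, min(last, len(token_labels))):
        start, end = offsets[i]
        label = token_labels[i]
        # Extend I- labels back over the gap to the previous token.
        if label.startswith("I-") and i > 0:
            start = offsets[i - 1][1]
        for pos in range(start, end):
            char_labels[pos] = label
    return char_labels


def chars_to_spans(char_labels):
    """Group labelled characters into (tag, start, end) spans, end exclusive."""
    spans, open_tag, start, prev = [], None, 0, "O"
    for pos, label in enumerate(list(char_labels) + ["O"]):
        tag = None if label == "O" else label[2:]
        starts_new = label.startswith("B-") and not prev.startswith("B-")
        if open_tag is not None and (tag != open_tag or starts_new):
            spans.append((open_tag, start, pos))
            open_tag = None
        if tag is not None and open_tag is None:
            open_tag, start = tag, pos
        prev = label
    return spans


text = "See Hdt. 8.82 now"
offsets = [(0, 3), (4, 8), (9, 13), (14, 17)]
labels = ["O", "B-BIBL", "I-BIBL", "O"]
char_labels = ["O"] * len(text)
merge_window(char_labels, labels, offsets, reliable=(0, 2))  # first window
merge_window(char_labels, labels, offsets, reliable=(2, 4))  # second window
spans = chars_to_spans(char_labels)
```

Even though the two halves of the reference come from different reliable regions, they end up in one continuous character span.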

Wrapping bibl-quote Pairs in <cit> Tags

Recall: The model doesn't predict <cit> tags (no training examples). These must be inserted logically.

Pattern matching algorithm:

  1. Identify adjacent pairs:

    • <bibl> immediately followed by <quote> (same parent, adjacent siblings)
    • <quote> immediately followed by <bibl> (same parent, adjacent siblings)
  2. Wrapping conditions:

    • Must be direct siblings in XML tree
    • Only whitespace between the two elements (no intervening text)
    • Not already inside a <cit> tag
  3. Insert <cit> wrapper:

    • Wrap the pair: <cit><bibl>...</bibl><quote>...</quote></cit>
    • Preserve whitespace between elements

Example:

Before: See <bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote> for details.
After:  See <cit><bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote></cit> for details.
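The wrapping rules can be sketched with the standard library's ElementTree. This is a simplified illustration: `wrap_pairs` is a hypothetical function, it assumes tag names without an XML namespace, and it only checks the direct parent for an existing <cit>.

```python
import xml.etree.ElementTree as ET

PAIR = {"bibl", "quote"}


def wrap_pairs(root):
    """Wrap adjacent <bibl>/<quote> siblings in <cit>, editing the tree in place."""
    for parent in list(root.iter()):
        if parent.tag == "cit":
            continue  # children are already inside a <cit>
        i = 0
        while i < len(parent) - 1:
            a, b = parent[i], parent[i + 1]
            # The text between the two elements is stored as a.tail.
            only_space = not (a.tail or "").strip()
            if {a.tag, b.tag} == PAIR and only_space:
                cit = ET.Element("cit")
                # Text after the pair now follows the <cit> wrapper.
                cit.tail, b.tail = b.tail, None
                parent.remove(a)
                parent.remove(b)
                cit.extend([a, b])
                parent.insert(i, cit)
            i += 1
    return root


root = ET.fromstring(
    "<p>See <bibl>Hdt. 8.82</bibl> <quote>Ajax hurled a rock</quote> for details.</p>"
)
wrapped = ET.tostring(wrap_pairs(root), encoding="unicode")
```

The whitespace between the two elements stays inside the new <cit>, and the trailing text stays outside it.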

Implementation Overview

High-level pipeline:

class XMLCitationProcessor:
    """Process full XML documents with sliding window inference."""

    def __init__(self, model_path, window_size=512, stride=256):
        self.model = InferenceModel(model_path)
        self.window_size = window_size
        self.stride = stride

    def process_file(self, xml_path, preserve_existing=True):
        """
        Process XML file and insert citation tags.

        Steps:
        1. Parse XML and extract existing citations
        2. Strip citation tags from text
        3. Create sliding windows
        4. Run inference on each window
        5. Merge predictions at character level (reliable regions only)
        6. Filter predictions that conflict with existing citations
        7. Wrap bibl-quote pairs in <cit> tags
        8. Insert all tags into final XML
        9. Validate and return result
        """
        pass

Key parameters:

  • window_size: Token length of each window (default: 512)
  • stride: Token step between the starts of consecutive windows (default: 256, i.e. 50% overlap with 512-token windows)
  • preserve_existing: Keep existing citation tags (default: True)
  • max_bibl_quote_distance: Max chars between bibl/quote for wrapping (default: 100)

Training Recommendations

Hyperparameters

Transformer only:

  • Learning rate: 2e-5 to 5e-5
  • Batch size: 16-32
  • Epochs: 3-5
  • Warmup steps: 500
  • Weight decay: 0.01

Hardware Requirements

Minimum:

  • GPU: 8GB VRAM (e.g., RTX 2070, T4)
  • RAM: 16GB
  • Storage: 10GB for model + data

Recommended:

  • GPU: 16GB+ VRAM (e.g., V100, A100, RTX 3090)
  • RAM: 32GB
  • Storage: 50GB

Task 2: URN Resolution

Overview

URN resolution maps bibliographic reference strings to Canonical Text Services (CTS) URNs. This is a different ML problem from tag extraction - it's a structured prediction or sequence-to-sequence task rather than token classification.

Recommended Approaches

Approach 1: Rule-based + Hierarchical Classification (Recommended)

Strategy: Use perseus-citation-processor for 87.6% of cases, hierarchical DNN classifiers for the remaining 12.4%.

Advantages:

  • Guaranteed valid URNs - classification over known catalog, no hallucination
  • Interpretable - see which stage (author/work) succeeded or failed
  • Efficient - small vocabularies (~500 authors, ~50 works per author)
  • High precision - rule-based handles well-formatted citations (87.6%)
  • Debuggable - can inspect author confidence vs work confidence separately

Architecture:

Input: "Hdt. 8.82"
  ↓
[perseus-citation-processor] → 87.6% resolved directly
  ↓ (if unresolved or low confidence)
[Author Classifier (DNN)] → tlg0016 [conf: 0.95]
  ↓
[Work Classifier (DNN)] → tlg0016.tlg001 [conf: 0.92]
  ↓
[Passage Parser (rules)] → 8.82
  ↓
[Edition Selector (rules)] → perseus-grc2
  ↓
Assemble: "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"

Components:

  1. Rule-based baseline: perseus-citation-processor (Go binary)
  2. Author classifier: DeBERTa classification over ~500 Greek/Latin authors
  3. Work classifier: DeBERTa classification over works (conditioned on author)
  4. Passage parser: Rule-based extraction of passage references
  5. Edition selector: Rule-based (Greek quote → grc, Latin quote → lat)
  6. Confidence scorer: DeBERTa binary classifier for quality control

Approach 2: Sequence-to-Sequence Model (NOT Recommended)

Models: T5, BART, ByT5, or encoder-decoder transformers

Why NOT recommended:

  • URNs are structured catalog lookups, not creative text generation
  • Can hallucinate invalid author/work combinations
  • Less interpretable - black box generation
  • Wasteful - learns URN syntax instead of just citation→URN mappings
  • No guaranteed validity - requires complex constrained decoding

When to consider:

  • If you need to handle completely new authors not in any catalog (unlikely for classical texts)
  • If you want to experiment with end-to-end learning
  • As a baseline to compare against hierarchical classification

Architecture:

from transformers import T5ForConditionalGeneration, AutoTokenizer

# ByT5 for character-level handling of Greek/Latin
model = T5ForConditionalGeneration.from_pretrained("google/byt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")

# Training format: "resolve citation: Hdt. 8.82" → "urn:cts:greekLit:..."
input_text = "resolve citation: Hdt. 8.82"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)

# Major issue: May generate syntactically valid but semantically wrong URNs
# e.g., "urn:cts:greekLit:tlg9999.tlg999.perseus-grc2:8.82" (invalid author)

Verdict: Use hierarchical classification (Approach 1) instead.

Approach 3: Retrieval-Augmented Generation

Strategy: Retrieve similar citations from database, use ML to rank/select.

Advantages:

  • Leverages existing resolved citations
  • Good for rare author/work combinations
  • Can explain predictions via retrieved examples

Architecture:

Input: "Thuc. 3.38"
  ↓
Embedding model → Dense vector representation
  ↓
Vector DB → Retrieve top-K similar resolved citations
  ↓
Ranking model → Score candidates
  ↓
Output: Highest-scoring URN
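To show the shape of the retrieval step without an embedding model or vector database, the sketch below substitutes character trigram cosine similarity for both. This is a stand-in technique chosen for the example, not the architecture above, and all function names are illustrative.

```python
import math
from collections import Counter


def ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, resolved: list, k: int = 3):
    """Return the k (score, bibl, urn) triples most similar to `query`."""
    q = ngrams(query)
    scored = [(cosine(q, ngrams(bibl)), bibl, urn) for bibl, urn in resolved]
    return sorted(scored, reverse=True)[:k]


resolved = [
    ("Hdt. 8.82", "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"),
    ("Hom. Il. 7.268", "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"),
    ("Thuc. 3.38", "urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3.38"),
]
top = retrieve("Thuc. 3.40", resolved, k=1)[0]
```

The retrieved neighbour supplies a candidate author and work; the passage still has to come from the query itself.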

Implementation: Hybrid Rule-based + DNN System

Recommended Approach: Hierarchical Classification (NOT seq2seq)

This implementation uses:

  1. Rule-based baseline (perseus-citation-processor) for 87.6% of citations
  2. Hierarchical DNN classifiers for the remaining 12.4%
    • Author classifier (DeBERTa)
    • Work classifier (DeBERTa, conditioned on author)
  3. Confidence scorer (DeBERTa) to validate rule-based outputs

Why hierarchical classification over seq2seq?

  • Guaranteed valid URNs (no hallucination)
  • Interpretable decisions (author vs work failures)
  • Efficient (small vocabularies)
  • Better uncertainty handling (top-k at each stage)

Foundation: perseus-citation-processor

This project uses the perseus-citation-processor as the rule-based foundation:

Performance on 246K citations:

  • 87.6% resolution rate (216K resolved)
  • 12.4% unresolved (30K citations)
  • ⚡ Fast: 28 seconds for 125 XML files
  • 📚 Comprehensive author/work mappings (Greek, Latin, Scholia)

The DNN components handle the remaining 12.4% plus add confidence scoring.

Architecture Overview

Input Citation
       ↓
[perseus-citation-processor] ← Rule-based (87.6% coverage)
       ↓
   ┌───┴────┐
   ↓        ↓
Resolved  Unresolved
   ↓        ↓
[Confidence  [DNN Resolution
 Scorer]      Model]
   ↓        ↓
High conf?  URN + conf
   ↓        ↓
Yes→Output  Merge→Output
   No↘    ↗
    [DNN Model]

DNN Component 1: Unresolved Citation Resolver

Purpose: Resolve the 30K citations that the rule-based system couldn't handle.

Approach: Hierarchical Classification (NOT seq2seq generation)

Why Classification over Seq2Seq:

  • URNs are structured catalog lookups, not creative text
  • Guaranteed valid outputs - cannot hallucinate invalid author/work combinations
  • Interpretable - see exactly which stage succeeded/failed
  • Efficient - small vocabulary (~500 authors × ~50 works = 25K combinations)
  • Better uncertainty handling - top-k predictions at each stage

Architecture: Multi-stage classification pipeline

"Hdt. 8.82"
    ↓
[Author Classifier] → tlg0016 (Herodotus) [confidence: 0.95]
    ↓
[Work Classifier] → tlg0016.tlg001 (Histories) [confidence: 0.92]
    ↓
[Edition Selector] → perseus-grc2 (rule-based: Greek text → grc)
    ↓
[Passage Parser] → 8.82 (rule-based extraction)
    ↓
Assemble: "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"

Implementation:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class AuthorClassifier:
    """Stage 1: Classify citation to author URN"""

    def __init__(self):
        # Load perseus-citation-processor author catalog
        self.author_catalog = self.load_author_catalog()  # ~500 Greek/Latin authors
        self.author_to_id = {urn: i for i, urn in enumerate(self.author_catalog)}
        self.id_to_author = {i: urn for urn, i in self.author_to_id.items()}

        # DeBERTa classifier over author vocabulary
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/deberta-v3-base",
            num_labels=len(self.author_catalog)
        )
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    def predict(self, citation_text, context=""):
        """Predict top-k authors with confidence scores"""
        # Input format: "[citation] [SEP] [context]"
        text = f"{citation_text} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

        # Get predictions (inference only, so skip gradient tracking)
        with torch.no_grad():
            outputs = self.model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        top_k = torch.topk(probs[0], k=min(3, probs.shape[-1]))

        # Return [(author_urn, confidence), ...]
        return [(self.id_to_author[idx.item()], prob.item())
                for idx, prob in zip(top_k.indices, top_k.values)]


class WorkClassifier:
    """Stage 2: Classify to work URN (conditioned on author)"""

    def __init__(self):
        # Group works by author
        self.works_by_author = self.load_work_catalog()  # {tlg0016: [tlg001, tlg002, ...]}
        self.max_works = max(len(works) for works in self.works_by_author.values())

        # Classifier conditioned on author
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/deberta-v3-base",
            num_labels=self.max_works
        )
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    def predict(self, citation_text, author_urn, context=""):
        """Predict work for given author"""
        # Get candidate works for this author
        candidates = self.works_by_author.get(author_urn, [])
        if not candidates:
            return []

        # Input format: "[author] [SEP] [citation] [SEP] [context]"
        text = f"{author_urn} [SEP] {citation_text} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

        outputs = self.model(**inputs)

        # Filter to only valid works for this author
        valid_logits = outputs.logits[0, :len(candidates)]
        probs = torch.softmax(valid_logits, dim=-1)

        # Return [(work_urn, confidence), ...] sorted by confidence
        results = [(candidates[i], probs[i].item()) for i in range(len(candidates))]
        return sorted(results, key=lambda x: x[1], reverse=True)


class HierarchicalURNResolver:
    """Complete hierarchical URN resolution system"""

    def __init__(self):
        self.author_classifier = AuthorClassifier()
        self.work_classifier = WorkClassifier()
        self.passage_parser = PassageParser()  # Rule-based
        self.edition_selector = EditionSelector()  # Rule-based

    def resolve(self, citation_text, context="", quote=""):
        """Resolve citation to URN with confidence score"""
        # Stage 1: Classify author
        author_candidates = self.author_classifier.predict(citation_text, context)

        # Stage 2: For each author candidate, classify work
        urn_candidates = []
        for author_urn, author_conf in author_candidates[:3]:  # Top-3 authors
            work_candidates = self.work_classifier.predict(
                citation_text, author_urn, context
            )

            for work_urn, work_conf in work_candidates[:3]:  # Top-3 works
                # Stage 3: Parse passage (rule-based)
                passage = self.passage_parser.extract(citation_text)

                # Stage 4: Select edition (rule-based)
                edition = self.edition_selector.select(author_urn, quote)

                # Stage 5: Assemble URN
                namespace = self.get_namespace(author_urn)  # greekLit or latinLit
                full_urn = f"urn:cts:{namespace}:{work_urn}.{edition}:{passage}"

                # Combined confidence (product of stages)
                confidence = author_conf * work_conf

                urn_candidates.append((full_urn, confidence, {
                    'author_conf': author_conf,
                    'work_conf': work_conf,
                    'author': author_urn,
                    'work': work_urn
                }))

        # Return best candidate
        if urn_candidates:
            best = max(urn_candidates, key=lambda x: x[1])
            return best[0], best[1], best[2]
        else:
            return None, 0.0, {}
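The `PassageParser` used above is not defined in this README. As a rough idea of what rule-based passage extraction might involve, here is a regex sketch; `extract_passage` is a hypothetical function, it only handles a passage at the end of the citation, and it does not expand ranges or abbreviations such as "sqq.".

```python
import re

# One passage unit: dotted numbers, each optionally followed by a letter,
# e.g. "8.82", "7.268", "332D".
_UNIT = r"\d+[a-zA-Z]?(?:\.\d+[a-zA-Z]?)*"
# A trailing unit, optionally followed by a range end ("7.268-272").
_PASSAGE_RE = re.compile(rf"({_UNIT})(?:\s*[-–]\s*({_UNIT}))?\s*$")


def extract_passage(citation: str) -> str:
    """Return the passage part of a citation, lower-cased, or '' if none is found."""
    match = _PASSAGE_RE.search(citation.strip())
    if not match:
        return ""
    start, end = match.group(1), match.group(2)
    return (f"{start}-{end}" if end else start).lower()


passage = extract_passage("Hdt. 8.82")
```

Lower-casing matches the "332D" → "332d" convention in the Plato example near the top of this README.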

Training Data Format:

# Training examples from resolved.jsonl (216K examples)
# Split into author and work classification tasks

# Stage 1: Author Classification
author_examples = [
    {
        "citation": "Hdt. 8.82",
        "context": "",
        "label": "tlg0016"  # Herodotus
    },
    {
        "citation": "Soph. OT 151",
        "context": "τᾶς πολυχρύσου Πυθῶνος",
        "label": "tlg0011"  # Sophocles
    },
    {
        "citation": "Plat. Rep. 332D",
        "context": "",
        "label": "tlg0059"  # Plato
    }
]

# Stage 2: Work Classification (conditioned on author)
work_examples = [
    {
        "citation": "Hdt. 8.82",
        "author": "tlg0016",
        "context": "",
        "label": "tlg0016.tlg001"  # Histories
    },
    {
        "citation": "Soph. OT 151",
        "author": "tlg0011",
        "context": "τᾶς πολυχρύσου Πυθῶνος",
        "label": "tlg0011.tlg004"  # Oedipus Tyrannus
    },
    {
        "citation": "Hom. Il. 7.268",
        "author": "tlg0012",
        "context": "Ajax hurled a rock",
        "label": "tlg0012.tlg001"  # Iliad (not Odyssey)
    }
]

Features Used:

  1. Citation text (required): "Hdt. 8.82"
  2. Context (when available): Surrounding text or quote
  3. Author URN (for work classification): Condition on stage 1 output
  4. Ground truth URN: Parse author/work from urn field

DNN Component 2: Confidence Scorer

Purpose: Identify incorrect resolutions from perseus-citation-processor.

Problem: From perseus-citation-processor README: "just because the processor resolves a citation doesn't mean that it resolves it correctly"

Model: DeBERTa-based Binary Classifier

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn

class URNConfidenceScorer(nn.Module):
    def __init__(self, model_name="microsoft/deberta-v3-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)

        hidden_size = self.encoder.config.hidden_size
        feature_size = 10  # Engineered features

        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + feature_size, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, citation_text, urn, context=""):
        # Encode: "[citation] [SEP] [URN] [SEP] [context]"
        text = f"{citation_text} [SEP] {urn} [SEP] {context}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)

        outputs = self.encoder(**inputs)
        pooled = outputs.last_hidden_state[:, 0, :]  # CLS token

        # Compute engineered features
        features = self.compute_features(citation_text, urn, context)

        combined = torch.cat([pooled, features], dim=1)
        confidence = self.classifier(combined)

        return confidence

    def compute_features(self, citation, urn, context):
        """Engineered features for confidence scoring, shape (1, feature_size).

        The helper methods called below are placeholders, not defined in this sketch.
        """
        # Nested list gives a batch dimension so torch.cat(dim=1) in forward() works
        return torch.tensor([[
            self.citation_urn_match_score(citation, urn),
            self.has_ambiguous_abbrev(citation),
            self.context_language_match(context, urn),
            self.passage_validity(urn),
            self.author_frequency(urn),
            self.work_frequency(urn),
            self.quote_presence(context),
            self.citation_length(citation),
            self.urn_complexity(urn),
            self.catalog_match_count(citation)
        ]], dtype=torch.float32)

Training Data:

# Positive examples (confident resolutions)
positive_examples = [
    {
        "citation": "Hdt. 8.82",
        "urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
        "context": "Greek text context",
        "label": 1.0  # High confidence
    }
]

# Ambiguous cases (medium confidence)
ambiguous_examples = [
    {
        "citation": "Arist. Met.",  # Metaphysics or Meteorology?
        "urn": "urn:cts:greekLit:tlg0086.tlg025.perseus-grc2:",
        "context": "",
        "label": 0.6  # Medium confidence
    }
]

# Incorrect resolutions (low confidence)
negative_examples = [
    {
        "citation": "Hom. 7.268",  # Iliad or Odyssey?
        "urn": "urn:cts:greekLit:tlg0012.tlg002.perseus-grc2:7.268",  # Odyssey
        "context": "Ajax hurled a rock",  # Ajax is in Iliad!
        "label": 0.1  # Low confidence - context mismatch
    }
]

DNN Component 3: Hybrid Orchestrator

Purpose: Combine the rule-based and DNN resolvers, routing each citation according to the confidence score.

class HybridURNResolver:
    def __init__(self):
        self.rule_based = PerseusProcessorWrapper()  # Calls Go binary
        self.dnn_resolver = HierarchicalURNResolver()  # Hierarchical classifier
        self.confidence_scorer = URNConfidenceScorer()

        # Thresholds (tune on validation set)
        self.high_confidence_threshold = 0.85
        self.low_confidence_threshold = 0.50

    def resolve(self, citation_text, context="", quote=""):
        # Step 1: Try rule-based
        rule_urn, rule_status = self.rule_based.resolve(citation_text)

        # Step 2: Score rule-based result
        if rule_urn:
            # The scorer returns a 1-element tensor; convert it to a Python float
            confidence = self.confidence_scorer(citation_text, rule_urn, context).item()

            if confidence > self.high_confidence_threshold:
                return rule_urn, confidence, "rule-based", {}

            # Medium confidence - get DNN opinion
            elif confidence > self.low_confidence_threshold:
                dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                    citation_text, context, quote
                )

                # Compare: rule-based vs DNN
                # (every branch returns a 4-tuple so callers can unpack uniformly)
                if dnn_conf > confidence:
                    return dnn_urn, dnn_conf, "dnn-override", dnn_details
                else:
                    return rule_urn, confidence, "rule-based-verified", {}

            # Low confidence - prefer DNN
            else:
                dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                    citation_text, context, quote
                )
                return dnn_urn, dnn_conf, "dnn-preferred", dnn_details

        # Step 3: Rule-based failed, use DNN only
        else:
            dnn_urn, dnn_conf, dnn_details = self.dnn_resolver.resolve(
                citation_text, context, quote
            )
            return dnn_urn, dnn_conf, "dnn-only", dnn_details

Training Strategy

Phase 1: Train Author Classifier

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
import json

# Step 1: Extract author labels from URNs
def parse_urn_to_author(urn):
    """Extract author URN from full CTS URN"""
    # urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82 → tlg0016
    try:
        parts = urn.split(":")
        author_work = parts[3]  # tlg0016.tlg001.perseus-grc2
        author = author_work.split(".")[0]  # tlg0016
        return author
    except (AttributeError, IndexError):  # urn is None, empty, or malformed
        return None

# Step 2: Build author vocabulary from perseus-citation-processor data
author_catalog = set()
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        if author:
            author_catalog.add(author)

author_to_id = {author: i for i, author in enumerate(sorted(author_catalog))}
id_to_author = {i: author for author, i in author_to_id.items()}

# Step 3: Prepare training data
train_data = []
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        if author and author in author_to_id:
            train_data.append({
                "text": f"{item['bibl']} [SEP] {item.get('quote', '')}",
                "label": author_to_id[author]
            })

# Train/val split
dataset = Dataset.from_list(train_data)
dataset = dataset.train_test_split(test_size=0.1)

# Step 4: Train DeBERTa classifier
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(author_catalog)
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./outputs/resolution/models/author-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

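def compute_metrics(eval_pred):
    """Report accuracy so metric_for_best_model="accuracy" can find eval_accuracy."""
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    compute_metrics=compute_metrics,
)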

trainer.train()

Phase 2: Train Work Classifier

# Step 1: Parse work labels from URNs
def parse_urn_to_work(urn):
    """Extract work URN from full CTS URN"""
    # urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82 → tlg0016.tlg001
    try:
        parts = urn.split(":")
        author_work = parts[3]  # tlg0016.tlg001.perseus-grc2
        work = ".".join(author_work.split(".")[:2])  # tlg0016.tlg001
        return work
    except (AttributeError, IndexError):  # urn is None or has too few ":" parts
        return None

# Step 2: Build work vocabulary grouped by author
works_by_author = {}
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        work = parse_urn_to_work(item['urn'])
        if author and work:
            if author not in works_by_author:
                works_by_author[author] = set()
            works_by_author[author].add(work)

# Step 3: Prepare training data (conditioned on author)
train_data = []
with open("cit_data/resolved.jsonl") as f:
    for line in f:
        item = json.loads(line)
        author = parse_urn_to_author(item['urn'])
        work = parse_urn_to_work(item['urn'])

        if author and work:
            # Input includes author URN to condition on
            train_data.append({
                "text": f"{author} [SEP] {item['bibl']} [SEP] {item.get('quote', '')}",
                "author": author,
                "label": work
            })

# Create label mapping (per-author work indices)
# ... (similar training setup)
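
The per-author label mapping mentioned in the last comment could look roughly like the sketch below. It assumes `works_by_author` has the shape built in Step 2 (author ID mapped to a set of work IDs). The helper name `build_work_label_maps` and the sample data are hypothetical, and the real training setup may differ.

```python
def build_work_label_maps(works_by_author):
    """Map each author's works to contiguous indices 0..n-1 (hypothetical helper)."""
    return {
        author: {work: i for i, work in enumerate(sorted(works))}
        for author, works in works_by_author.items()
    }

# Illustrative data only; the second work ID for tlg0012 is assumed for the example
works_by_author = {
    "tlg0016": {"tlg0016.tlg001"},
    "tlg0012": {"tlg0012.tlg002", "tlg0012.tlg001"},
}
work_to_id = build_work_label_maps(works_by_author)

# Replace the string label from Step 3 with its per-author index
example = {"author": "tlg0012", "label": "tlg0012.tlg002"}
example["label"] = work_to_id[example["author"]][example["label"]]
```

Sorting the works before enumerating keeps the indices stable across runs, which matters when a saved classifier head is reloaded later.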

Phase 3: Train Confidence Scorer

  • Curate labeled examples with confidence scores
  • Use cross-validation on resolved.jsonl
  • Add manually labeled ambiguous cases (~1K examples)
  • Generate synthetic negative examples (incorrect author/work pairs)
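
One possible way to generate the synthetic negatives is to pair each resolved citation with a work it does not belong to. The sketch below assumes items with `bibl` and `urn` fields as in resolved.jsonl; the function names, the `candidate` field, and the use of label 0 for "incorrect" are assumptions for illustration.

```python
import random

def work_of(urn):
    """Return the 'author.work' part of a CTS URN, e.g. tlg0016.tlg001."""
    return ".".join(urn.split(":")[3].split(".")[:2])

def synthetic_negative(item, work_catalog, rng):
    """Pair the citation with a wrong work; return None if no other work exists."""
    true_work = work_of(item["urn"])
    candidates = sorted(w for w in work_catalog if w != true_work)
    if not candidates:
        return None
    wrong_work = rng.choice(candidates)
    return {"bibl": item["bibl"], "candidate": wrong_work, "label": 0}

item = {"bibl": "Hdt. 8.82", "urn": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"}
catalog = {"tlg0016.tlg001", "tlg0012.tlg001"}
negative = synthetic_negative(item, catalog, random.Random(0))
```

Passing a seeded `random.Random` keeps the generated negatives reproducible between runs.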

Phase 4: Evaluate Hierarchical Classifier

  • Test on held-out validation set (10% of resolved.jsonl)
  • Measure component accuracy:
    • Author classification accuracy
    • Work classification accuracy (given correct author)
    • End-to-end URN exact match
  • Test on 30K unresolved.jsonl examples
  • Manually evaluate a random sample of predictions for accuracy

Expected Performance

Conservative estimates:

| Component | Coverage | Improvement |
|---|---|---|
| Rule-based baseline | 87.6% | |
| DNN on unresolved (30-50% success) | | +4-6% |
| Confidence filtering (catch incorrect rules) | | +2-3% |
| Total hybrid coverage | ~94-97% | |

Note: Even resolving one third of the unresolved cases would be a significant improvement.
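
The DNN estimate follows from simple arithmetic on the baseline: the rules leave 12.4% of citations unresolved, and the DNN is assumed to resolve 30-50% of those. A quick check using only the figures quoted above:

```python
baseline = 87.6                      # rule-based coverage, in percent
unresolved = 100.0 - baseline        # 12.4% of citations have no URN yet

# Assumed DNN success rate on the unresolved citations: 30% to 50%
dnn_gain_low = unresolved * 0.30     # about 3.7 percentage points
dnn_gain_high = unresolved * 0.50    # about 6.2 percentage points

# Confidence filtering is estimated above at +2 to +3 points
total_low = baseline + dnn_gain_low + 2.0
total_high = baseline + dnn_gain_high + 3.0

print(f"DNN gain: {dnn_gain_low:.1f} to {dnn_gain_high:.1f} points")
print(f"Hybrid coverage: {total_low:.1f}% to {total_high:.1f}%")
```

The resulting range (roughly 93% to 97%) is consistent with the rounded total quoted above.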

Evaluation Metrics

For URN Resolution:

  • Exact match accuracy: Percentage of perfectly resolved URNs
  • Component accuracy: Separate metrics for author, work, passage
  • Coverage: Percentage of citations with confident predictions
  • Precision@K: Accuracy when allowing top-K predictions
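
As a rough sketch, the first three metrics could be computed as below. It assumes predictions are URN strings, with `None` standing for "no confident prediction", and that every rate is taken over all gold citations; the function names are hypothetical, and Precision@K is left out because it needs ranked candidate lists.

```python
def split_urn(urn):
    """Return (author, work, passage) from a CTS URN, or None if malformed."""
    parts = urn.split(":")
    if len(parts) < 5:
        return None
    ids = parts[3].split(".")
    if len(ids) < 2:
        return None
    return ids[0], ".".join(ids[:2]), parts[4]

def resolution_metrics(predicted, gold):
    """Compute exact-match, component, and coverage rates over all gold URNs."""
    if not gold:
        raise ValueError("gold must not be empty")
    counts = {"exact": 0, "author": 0, "work": 0, "passage": 0, "covered": 0}
    for pred, ref in zip(predicted, gold):
        if pred is None:
            continue  # no confident prediction: counts against every rate
        counts["covered"] += 1
        counts["exact"] += pred == ref
        p, g = split_urn(pred), split_urn(ref)
        if p and g:
            counts["author"] += p[0] == g[0]
            counts["work"] += p[1] == g[1]
            counts["passage"] += p[2] == g[2]
    return {name: value / len(gold) for name, value in counts.items()}

gold = [
    "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82",
    "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268",
]
predicted = ["urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.83", None]
metrics = resolution_metrics(predicted, gold)
```

In this example the first prediction has the right author and work but the wrong passage, so it counts toward component accuracy and coverage but not toward exact match.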

Data Split Considerations

Unlike the tag extraction data, which is split by document, the URN resolution data should be split by unique citation pattern, so that evaluation tests generalization:

  • Test on unseen author abbreviations
  • Test on unseen work titles
  • Test on seen authors but unseen works
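
A pattern-based split could be sketched as follows. The grouping key used here (the text before the first digit, lowercased) is only an assumption about what counts as a "citation pattern"; hashing the key puts every citation that shares a pattern in the same partition.

```python
import hashlib
import re

def pattern_key(bibl):
    """Hypothetical grouping key: everything before the first digit, lowercased."""
    return re.match(r"\D*", bibl).group().strip().lower()

def assign_split(bibl, test_fraction=0.1):
    """Deterministically map a citation to 'train' or 'test' by its pattern key."""
    digest = hashlib.sha256(pattern_key(bibl).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Citations with the same abbreviation always share a partition
same_partition = assign_split("Hdt. 8.82") == assign_split("Hdt. 1.1")
```

Note that `test_fraction` applies to patterns rather than to individual citations, so the share of examples in the test set depends on how often each pattern occurs.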

Next Steps

Tag Extraction

Completed:

  • ✅ Data pipeline (JSONL → BIO format with special tokens)
  • ✅ DeBERTa token classification model
  • ✅ Training pipeline with data splitting by filename
  • ✅ Evaluation metrics (seqeval for BIO tagging)
  • ✅ Inference model for predictions on plain text
  • ✅ Test suite (98 tests covering data loading, model, training)

Remaining:

  • ⏳ Train on full dataset and tune hyperparameters
  • ⏳ Error analysis on test set predictions
  • ⏳ DeBERTa+CRF implementation (if baseline insufficient)
  • ⏳ Production deployment and inference optimization

URN Resolution

Phase 1: Rule-based Baseline (Week 1-2)

  1. Integration: Wrap perseus-citation-processor as Python callable
  2. Baseline metrics: Establish 87.6% resolution rate on 216K citations
  3. Error analysis: Analyze 30K unresolved.jsonl patterns
  4. Catalog extraction: Load author/work URNs from perseus-citation-processor data

Phase 2: Hierarchical Classifiers (Week 3-5)

  1. Catalog extraction: Extract author/work vocabularies from resolved.jsonl URNs
  2. Train author classifier: DeBERTa classification over ~500 authors
    • Input: citation + context
    • Output: author URN (e.g., tlg0016)
  3. Train work classifier: DeBERTa classification conditioned on author
    • Input: author + citation + context
    • Output: work URN (e.g., tlg0016.tlg001)
  4. Implement passage parser: Rule-based extraction of passage references
  5. Evaluate components: Measure author accuracy, work accuracy, end-to-end
  6. Test on unresolved: Evaluate on 30K unresolved.jsonl citations

Phase 3: Confidence Scorer (Week 5-6)

  1. Training data curation: Label confident vs ambiguous resolutions
  2. Train DeBERTa classifier: (citation, URN, context) → confidence score
  3. Threshold tuning: Calibrate confidence thresholds on validation set
  4. Cross-validation: Test on ambiguous cases from resolved.jsonl

Phase 4: Hybrid System (Week 7-8)

  1. Orchestrator implementation: Combine rule-based + DNN components
  2. End-to-end evaluation: Measure coverage improvement (target: 94-97%)
  3. Component analysis: Track rule-based vs DNN vs hybrid performance
  4. Error analysis: Identify remaining failure modes
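
The orchestrator in Phase 4 might take a shape like the sketch below: try the rules first, fall back to the DNN, and keep a result only if the confidence scorer accepts it. All three callables, the function name, and the 0.5 threshold are placeholders, not existing components.

```python
def resolve_hybrid(citation, rule_resolver, dnn_resolver, scorer, threshold=0.5):
    """Return the first candidate URN whose confidence reaches the threshold."""
    for source, resolver in (("rules", rule_resolver), ("dnn", dnn_resolver)):
        urn = resolver(citation)
        if urn is not None and scorer(citation, urn) >= threshold:
            return {"urn": urn, "source": source}
    return {"urn": None, "source": None}

# Stand-ins for the real components, for illustration only
rule_table = {"Hdt. 8.82": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2:8.82"}
result = resolve_hybrid(
    "Hdt. 8.82",
    rule_resolver=rule_table.get,
    dnn_resolver=lambda citation: None,
    scorer=lambda citation, urn: 0.9,
)
```

Recording the `source` of each resolution makes it straightforward to track rule-based versus DNN performance separately, as the component analysis step calls for.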

Project Structure

perseus-citation-model/
│
├── cit_data/                          # Raw training data (not in repo)
│   ├── resolved.jsonl                 # 216K citations with URNs
│   └── unresolved.jsonl               # 30K citations without URNs
│
├── model_data/                        # Partitioned data
│   ├── extraction                     # Partitions for extraction task
│   └── resolution                     # Partitions for resolution task
│
├── outputs/                           # Fine-tuned model weights from training
│
├── src/                               # Source code
│   └── perscit_model/                 # Main package
│       ├── __init__.py
│       ├── shared/                    # Shared utilities across tasks
│       │   ├── __init__.py
│       │   ├── data_loader.py         # Base JSONL loader, tokenization
│       │   └── training_utils.py      # Training configuration utilities
│       ├── extraction/                # Task 1: Tag Extraction (BIO tagging)
│       │   ├── __init__.py
│       │   ├── data_loader.py         # XML → special tokens → BIO labels
│       │   ├── model.py               # DeBERTa token classification model
│       │   ├── train.py               # Training pipeline and data splitting
│       │   ├── evaluate.py            # Evaluation on test set
│       │   └── inference.py           # Inference model for predictions
│       └── resolution/                # Task 2: URN Resolution
│           ├── __init__.py
│           └── data_loader.py         # Citation data loading for resolution
│
├── configs/                           # Configuration files
│   └── extraction/
│       └── baseline.yaml              # Hyperparameters (model, max_length)
│
├── tests/                             # Test suite (98 tests)
│   ├── conftest.py                    # Shared fixtures (mock tokenizer)
│   ├── fixtures/                      # Test data
│   │   └── sample_extraction.jsonl   # 5 real citation examples
│   ├── unit/                          # Fast unit tests (88 tests, ~3s)
│   │   ├── test_extraction_dataset.py    # BIO label generation tests
│   │   ├── test_extraction_loader.py     # Data loader tests
│   │   ├── test_extraction_pipeline.py   # End-to-end pipeline tests
│   │   ├── test_resolution_loader.py     # Resolution data tests
│   │   └── test_shared_data_loader.py    # Shared utility tests
│   └── integration/                   # Slow integration tests (10 tests, ~8s)
│       └── test_extraction_model.py      # Real model loading/training tests
│
├── pyproject.toml                     # Project config, dependencies, test settings
├── .gitignore
├── .python-version                    # Python 3.13
└── README.md

Key Implementation Details:

Extraction Data Pipeline (Special Tokens Approach)

Instead of word-level BIO tagging with complex alignment, we use special tokens:

  1. XML → Special Tokens: Replace <bibl>, <quote>, <cit> tags with [BIBL_START], [BIBL_END], etc.
  2. Add to Vocabulary: Special tokens added to DeBERTa tokenizer (won't be split)
  3. Tokenize: DeBERTa tokenizes text with special tokens intact
  4. Generate BIO Labels: State machine generates labels based on special token positions
  5. Strip Special Tokens: Remove special tokens from input while keeping labels aligned

Example:

XML:      <bibl>Hdt. 8.82</bibl> some context
↓
Special:  [BIBL_START] Hdt. 8.82 [BIBL_END] some context
↓
Tokens:   [CLS] [BIBL_START] Hdt . 8 . 82 [BIBL_END] some context [SEP]
↓
Labels:   -100  -100          B-  I- I- I- I- -100        O    O       -100
↓ (strip special tokens)
Final Tokens: [CLS] Hdt . 8 . 82 some context [SEP]
Final Labels: -100  B-  I- I- I- I- O    O       -100

Key point: The model sees only [CLS] Hdt . 8 . 82 some context [SEP] during training, NOT the special tokens. It must learn to predict citation boundaries from the context alone.
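
Steps 4 and 5 can be illustrated with a small state machine over token strings. This is a simplified sketch, not the code in `extraction/data_loader.py`: it works on strings instead of token IDs, handles only the bibl and quote markers, and the `[QUOTE_START]`/`[QUOTE_END]` names and the `B-BIBL`-style label names are assumptions.

```python
IGNORE = -100  # label value ignored by the loss, as in the example above
OPENERS = {"[BIBL_START]": "BIBL", "[QUOTE_START]": "QUOTE"}  # quote names assumed
CLOSERS = {"[BIBL_END]", "[QUOTE_END]"}
PASSTHROUGH = {"[CLS]", "[SEP]"}

def bio_labels(tokens):
    """Return (kept_tokens, labels) with the marker tokens stripped out."""
    kept, labels = [], []
    current = None     # entity type we are inside, or None
    at_start = False   # True until the first token of the entity is labeled
    for tok in tokens:
        if tok in OPENERS:
            current, at_start = OPENERS[tok], True
        elif tok in CLOSERS:
            current = None
        elif tok in PASSTHROUGH:
            kept.append(tok)
            labels.append(IGNORE)
        else:
            kept.append(tok)
            if current is None:
                labels.append("O")
            else:
                labels.append(("B-" if at_start else "I-") + current)
                at_start = False
    return kept, labels

tokens = ["[CLS]", "[BIBL_START]", "Hdt", ".", "8", ".", "82",
          "[BIBL_END]", "some", "context", "[SEP]"]
final_tokens, final_labels = bio_labels(tokens)
```

Because markers are dropped in the same pass that assigns labels, the two output lists stay aligned by construction.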

Advantages over word-level alignment:

  • No complex subword↔word alignment logic
  • Special tokens guaranteed not to split
  • Simpler, more reliable label generation
  • Handles malformed XML gracefully (BeautifulSoup repair)

Model Initialization

Embedding Resizing:

  • Base DeBERTa vocab: 128,000 tokens
  • +6 special tokens = 128,006 tokens
  • New embeddings initialized to mean of existing embeddings (training stability)
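
The mean initialization can be shown on a toy embedding table. The sketch below uses plain Python lists so that it runs without downloading a model; with transformers, the resize itself is normally done with `model.resize_token_embeddings(len(tokenizer))`, after which the new rows can be overwritten in the same spirit. The function name is hypothetical.

```python
def extend_with_mean(embeddings, num_new):
    """Append num_new rows, each equal to the column-wise mean of the existing rows."""
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    return embeddings + [list(mean_row) for _ in range(num_new)]

base = [[1.0, 2.0], [3.0, 6.0]]           # toy table: 2 tokens, 2 dimensions
resized = extend_with_mean(base, num_new=6)
```

Starting the new rows at the mean places them inside the region the pretrained embeddings already occupy, which is the stability argument made above.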

Testing

# Fast unit tests only (default)
pytest                    # 88 tests in ~3s

# Integration tests (downloads real models)
pytest tests/integration  # 10 tests in ~8s

# All tests
pytest tests              # 98 tests in ~9s

End-to-End Pipeline

The two tasks can be combined into a complete citation processing pipeline:

Raw Text → Tag Extraction → URN Resolution → Structured Citations

Example workflow:

  1. Input: "Homer mentions this in Il. 7.268-272: 'Ajax hurled a rock'"
  2. Tag Extraction: Identify <bibl>Il. 7.268-272</bibl> and <quote>Ajax hurled a rock</quote>
  3. URN Resolution: Map "Il. 7.268-272" → "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:7.268"
  4. Output: Structured citation with linked canonical reference

This enables:

  • Automated citation extraction from plain text
  • Linking to canonical text passages
  • Cross-referencing across documents
  • Building citation networks in classical scholarship
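
Step 3 of the example workflow can be approximated with a rule-based toy resolver. The lookup table below holds only the two abbreviations that appear in this README, and the regular expression and function name are assumptions; the planned classifiers would replace the table lookup.

```python
import re

# Hypothetical lookup built from the two examples in this README
WORK_URNS = {
    "hdt.": "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2",
    "il.": "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2",
}

# Abbreviation (no digits), then a dotted passage such as 7.268
BIBL_RE = re.compile(r"^\s*(?P<abbrev>\D+?)\s*(?P<start>\d+(?:\.\d+)*)")

def resolve_bibl(bibl):
    """Return a CTS URN for the start of the cited passage, or None if unknown."""
    match = BIBL_RE.match(bibl)
    if match is None:
        return None
    base = WORK_URNS.get(match.group("abbrev").lower())
    if base is None:
        return None
    return f"{base}:{match.group('start')}"

urn = resolve_bibl("Il. 7.268-272")
```

As in the workflow example, only the start of a range such as 7.268-272 is kept; the end of the range is ignored in this sketch.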

Useful Links

General:

Tag Extraction:

URN Resolution:

About

Code for training transformer models to do citation extraction and resolution for Perseus XML files (according to TEI EpiDoc standards)
