This repository contains a dictionary-based method for identifying architectural discourse in Late Antique and Medieval Latin texts, applied here to a test corpus of Latin inscriptions.
The workflow scores each inscription against term dictionaries, then evaluates the dictionary matches against GLiNER named-entity recognition models.
EccLexica is being developed as part of the ANR E-cclesia project.
.
├── data/
│   ├── dicts/
│   │   ├── auto_terms.csv
│   │   ├── asso_terms.csv
│   │   └── mat_terms.csv
│   └── sample.csv
│
├── output/
│
├── arch-score.py
├── gliner-eval.py
└── README.md
- Main dataset
  The main dataset is not stored in this repository due to its size.
  Please download EDCS_text_cleaned_2022-09-12.json from Zenodo:
  👉 https://zenodo.org/records/7072337
  After downloading, place the file in the following location:
  data/EDCS_text_cleaned_2022-09-12.json
- Sample dataset (data/sample.csv)
  - 100 inscriptions.
  - Used for testing and GLiNER evaluation.
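Before running the scripts, it can help to confirm that both input files are where the layout above expects them. The following is a minimal sketch: the helper name missing_inputs is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Input paths taken from the dataset description above.
EXPECTED = [
    "data/EDCS_text_cleaned_2022-09-12.json",
    "data/sample.csv",
]


def missing_inputs(root="."):
    """Return the expected input files that are absent under `root`.

    Hypothetical helper for illustration; not part of the repository.
    """
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]


if __name__ == "__main__":
    # Print one line per missing file; print nothing if all are present.
    for rel in missing_inputs():
        print(f"missing: {rel}")
```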
arch-score.py computes an architectural concentration score (0–100) for each inscription based on the presence and distribution of architectural vocabulary.
Main steps:
- Filtering
  - Date range: 399–1199 CE.
  - Excludes a predefined list of eastern provinces.
- Lemmatization
  - Uses spaCy’s Latin model (la_core_web_md) on cleaned interpretive text.
- Scoring
  - Uses three term categories:
    - Autonomous (e.g. basilica, ecclesia)
    - Associative (e.g. murus, porta)
    - Material (e.g. marmor, aurum)
  - Score combines:
    - Term count
    - Co-occurrence of term types
    - Proximity of terms in the text
    - Term density
- Output generation
  - Full scored dataset.
  - Score-based subsets (per 10-point bin).
  - High-score subset (score > 50).
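To make the scoring and binning steps concrete, here is a minimal sketch of how four such components could be combined into a 0–100 score. It is an illustration under stated assumptions, not the formula used by arch-score.py: the weights, the proximity window, the count cap, the mini-dictionaries, and the function names arch_score and score_bin are all hypothetical. It expects text that has already been lemmatized.

```python
from itertools import combinations

# Hypothetical mini-dictionaries built from the examples above;
# the real term lists live in data/dicts/*.csv.
TERMS = {
    "autonomous": {"basilica", "ecclesia"},
    "associative": {"murus", "porta"},
    "material": {"marmor", "aurum"},
}

# Assumed weights (they sum to 1.0); not the values used by arch-score.py.
WEIGHTS = {"count": 0.4, "cooccurrence": 0.2, "proximity": 0.2, "density": 0.2}


def arch_score(lemmas, window=5, count_cap=5):
    """Return a heuristic 0-100 score for a list of lemmas."""
    hits = [(i, cat) for i, lemma in enumerate(lemmas)
            for cat, vocab in TERMS.items() if lemma in vocab]
    if not hits:
        return 0.0
    positions = [i for i, _ in hits]
    pairs = list(combinations(positions, 2))
    parts = {
        # Term count, capped so that long texts do not dominate.
        "count": min(len(hits), count_cap) / count_cap,
        # Co-occurrence: share of the term categories that are present.
        "cooccurrence": len({cat for _, cat in hits}) / len(TERMS),
        # Proximity: share of hit pairs lying within `window` tokens.
        "proximity": (sum(abs(a - b) <= window for a, b in pairs) / len(pairs)
                      if pairs else 0.0),
        # Density: dictionary hits per token.
        "density": len(hits) / len(lemmas),
    }
    return round(100 * sum(WEIGHTS[k] * parts[k] for k in WEIGHTS), 1)


def score_bin(score):
    """Map a score to a 10-point bin label; 100 is folded into the top bin."""
    low = min(int(score // 10) * 10, 90)
    return "90-100" if low == 90 else f"{low}-{low + 9}"
```

Under these assumed weights, a short text such as ["basilica", "murus", "marmor"] scores highly because all three categories occur close together, while a text with no dictionary hits scores 0. The high-score subset would then simply be the rows whose score exceeds 50.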
gliner-eval.py evaluates GLiNER NER models against the dictionary-based architectural terms.
What it does:
- Loads a scored sample dataset from output/sample.csv.
- Runs multiple GLiNER models on:
  - Lemmatized text
  - Cleaned interpretive text
- Compares GLiNER-extracted entities with dictionary-based terms.
Evaluated labels:
- building/type
- building/part
- building/material
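The extraction step for these labels might look roughly like the sketch below. It relies on the gliner package's GLiNER.from_pretrained and predict_entities calls. The helper names load_model and extract_terms, the 0.5 threshold, and the lowercasing of entity text are assumptions made for illustration, and the model ID is left for the caller to supply.

```python
# Label strings as listed above.
LABELS = ["building/type", "building/part", "building/material"]


def load_model(name):
    """Load a GLiNER checkpoint; `name` is the model ID you want to evaluate."""
    # Imported lazily so that extract_terms can be used without gliner installed.
    from gliner import GLiNER
    return GLiNER.from_pretrained(name)


def extract_terms(model, text, labels=LABELS, threshold=0.5):
    """Return {label: set of lowercased entity strings} predicted for `text`.

    Hypothetical helper; the threshold and lowercasing are assumptions.
    """
    found = {label: set() for label in labels}
    for ent in model.predict_entities(text, labels, threshold=threshold):
        found.setdefault(ent["label"], set()).add(ent["text"].lower())
    return found
```

Running the same helper once on the lemmatized text and once on the cleaned interpretive text gives two term sets per inscription, each of which can be compared with the dictionary matches.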
Metrics computed:
- Precision
- Recall
- F1 score
- Percentage and count of overlapping terms
- Number of inscriptions with at least one shared term
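As a rough illustration of these metrics, the sketch below compares two sets of terms using plain Python sets. The function names are hypothetical. Defining the overlap percentage as shared terms over the union of both sets is an assumption, since the script may define it differently.

```python
def overlap_metrics(predicted, reference):
    """Compare GLiNER terms (`predicted`) with dictionary terms (`reference`)."""
    shared = predicted & reference
    union = predicted | reference
    precision = len(shared) / len(predicted) if predicted else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "overlap_count": len(shared),
        # Assumed definition: shared terms as a percentage of all distinct terms.
        "overlap_pct": 100 * len(shared) / len(union) if union else 0.0,
    }


def inscriptions_with_overlap(pairs):
    """Count (predicted, reference) pairs sharing at least one term."""
    return sum(1 for predicted, reference in pairs if predicted & reference)
```

For example, if GLiNER returns {"basilica", "porta", "altar"} and the dictionary finds {"basilica", "porta", "murus", "marmor"}, two terms are shared, giving a precision of 2/3 and a recall of 1/2.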
- Python 3.9+
- Core libraries:
  - pandas, numpy, scikit-learn
  - spacy (+ la_core_web_md)
  - gliner
- Run arch-score.py to filter inscriptions and compute architectural scores.
- Inspect or subset results in the output/ directory.
- Run gliner-eval.py on a scored sample to compare dictionary-based detection with GLiNER models.
- Scores are heuristic and intended for comparative and exploratory analysis, not absolute classification.
Heřmánková, Petra. “EDCS”. Zenodo, September 12, 2022. https://doi.org/10.5281/zenodo.7072337.

