This repository contains the implementation of a complete OCR text correction pipeline specifically designed for medieval manuscripts. The system combines Kraken for line segmentation, TrOCR for initial text recognition, and a fine-tuned ByT5 model for OCR error correction.
Medieval manuscripts present unique challenges for OCR systems due to:
- Historical writing styles and letterforms
- Abbreviations and contractions common in medieval texts
- Varying print quality and document degradation
- Complex layouts with irregular line spacing
Our solution addresses these challenges with a multi-stage pipeline in which a fine-tuned ByT5 model corrects the errors left by the initial OCR pass.
graph TB
subgraph FT ["Fine-Tuning Phase"]
A[Raw OCR Line]
F[Verified Ground<br/>Truth Line]
B[Manual Alignment]
C[Aligned Line Pairs]
D[ByT5 Fine-Tuning]
M[Trained ByT5<br/>Correction Model]
A --> B
F --> B
B --> C
C --> D
D --> M
end
subgraph IP ["Inference Pipeline"]
IMG[New Manuscript<br/>Image]
SEG[Kraken Line<br/>Segmentation]
LINES[Line Images]
OCR[TrOCR Medieval<br/>Model]
NOISY[Raw OCR Output]
CORR[ByT5 Correction]
FINAL[Corrected<br/>Output]
IMG --> SEG
SEG --> LINES
LINES --> OCR
OCR --> NOISY
NOISY --> CORR
CORR --> FINAL
end
M -.-> CORR
style A fill:#bbdefb,color:#000
style F fill:#c8e6c9,color:#000
style B fill:#fff59d,color:#000
style C fill:#ffcc80,color:#000
style D fill:#f8bbd9,color:#000
style M fill:#a5d6a7,color:#000
style IMG fill:#90caf9,color:#000
style SEG fill:#81d4fa,color:#000
style LINES fill:#e0e0e0,color:#000
style OCR fill:#ffab91,color:#000
style NOISY fill:#e0e0e0,color:#000
style CORR fill:#ffcc80,color:#000
style FINAL fill:#a5d6a7,color:#000
Fine-Tuning Phase:
- Data Preparation: Raw OCR lines and verified ground truth texts are collected
- Manual Alignment: OCR outputs are aligned with their corresponding ground truth
- Dataset Creation: Aligned line pairs form the training dataset
- Model Training: ByT5 model is fine-tuned on medieval OCR correction patterns
Inference Pipeline:
- Line Segmentation: Kraken automatically detects and segments text lines
- Initial OCR: TrOCR processes each line with a medieval-optimized model
- Error Correction: Fine-tuned ByT5 model corrects OCR errors
- Text Assembly: Individual corrected lines are combined into final output
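The inference stages listed above can be pictured as a simple chain of functions. The following is only a minimal sketch, not the repository's implementation: `run_pipeline` and the three stage callables are hypothetical names, and joining corrected lines with newlines is an assumption about how text assembly works.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage signatures. In the real pipeline these roles are
# played by Kraken (segment), TrOCR (recognize) and ByT5 (correct).
Segmenter = Callable[[str], Sequence[object]]  # image path -> line images
Recognizer = Callable[[object], str]           # line image -> raw OCR text
Corrector = Callable[[str], str]               # raw OCR text -> corrected text


def run_pipeline(
    image_path: str,
    segment: Segmenter,
    recognize: Recognizer,
    correct: Corrector,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run segmentation, OCR and correction line by line, then assemble."""
    line_results: List[Tuple[str, str]] = []
    for line_image in segment(image_path):
        raw_text = recognize(line_image)
        corrected_text = correct(raw_text)
        line_results.append((raw_text, corrected_text))
    # Assumption: the final text is the corrected lines in reading order.
    final_text = "\n".join(corrected for _, corrected in line_results)
    return final_text, line_results


# Stand-in stages so the sketch runs without any model downloads.
fake_lines = ["ti perlo peccato:cessi uolse esser tenta"]
fake_fixes = {fake_lines[0]: "ti per lo peccato: Cossi uolse esser tenta"}
text, results = run_pipeline(
    "manuscript.jpg",
    segment=lambda path: fake_lines,
    recognize=lambda line_image: line_image,
    correct=lambda raw: fake_fixes.get(raw, raw),
)
print(text)
```

Passing the stages in as callables keeps each one replaceable, which matches the modular layout of the diagram.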
Our dataset consists of 10,643 text line pairs extracted from medieval manuscripts:
| Column | Description | Example |
|---|---|---|
| `line_id` | Unique identifier | 0033_033_line_30 |
| `image_path` | Path to image | image.png |
| `text` | Ground truth text | Ius dei occultaret. la terza ut tetatis fa |
| `ocr_prediction` | Raw OCR output | Ius dei occultare. la terza un tẽtatis fa |
| `page_id` | Source page identifier | 0033_033 |
| `line_number` | Line position in page | 30 |
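As an illustration of the schema above, the sketch below reads such rows with the standard library. It assumes the dataset is a comma-separated file with a header row using exactly these column names; `LinePair` and `load_line_pairs` are hypothetical helpers, not part of the repository.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass
class LinePair:
    line_id: str
    image_path: str
    text: str            # ground truth transcription
    ocr_prediction: str  # raw OCR output for the same line
    page_id: str
    line_number: int


def load_line_pairs(handle: TextIO) -> List[LinePair]:
    """Parse CSV rows into LinePair records (assumes a header row)."""
    pairs = []
    for row in csv.DictReader(handle):
        pairs.append(
            LinePair(
                line_id=row["line_id"],
                image_path=row["image_path"],
                text=row["text"],
                ocr_prediction=row["ocr_prediction"],
                page_id=row["page_id"],
                line_number=int(row["line_number"]),
            )
        )
    return pairs


# In-memory sample built from the example row in the table.
SAMPLE = (
    "line_id,image_path,text,ocr_prediction,page_id,line_number\n"
    "0033_033_line_30,image.png,"
    "Ius dei occultaret. la terza ut tetatis fa,"
    "Ius dei occultare. la terza un tẽtatis fa,"
    "0033_033,30\n"
)
pairs = load_line_pairs(io.StringIO(SAMPLE))
print(pairs[0].line_id, pairs[0].line_number)
```

In the example row, `line_id` is the `page_id` followed by `_line_` and the `line_number`.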
- Total lines: 10,643
- Average line length: ~45 characters
- Character Error Rate (before correction): ~12.3%
- Training/Validation/Test split: 80%/10%/10%
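For readers unfamiliar with the metric, character error rate is commonly defined as the Levenshtein edit distance between OCR output and ground truth, divided by the ground truth length. The sketch below follows that common definition on a single line pair. It compares raw code points without Unicode normalization, and it is an assumption that the figure reported here was computed this way (for example, whether it is averaged per line or over the whole corpus is not stated).

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # reference char missing in OCR
                    current[j - 1] + 1,      # extra char in OCR output
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


ground_truth = "Ius dei occultaret. la terza ut tetatis fa"
ocr_output = "Ius dei occultare. la terza un tẽtatis fa"
print(edit_distance(ground_truth, ocr_output))
print(f"{cer(ground_truth, ocr_output):.2%}")
```

For this pair the OCR output differs by three edits (a dropped `t`, `n` for `t`, and `ẽ` for `e`) over 42 reference characters, roughly 7.1%.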
Note: the 12.3% baseline CER reflects the TrOCR medieval model before ByT5 correction. The paper reports ~38% using Kraken's default model as the OCR baseline.
The early printed books used for OCR and post-correction tasks originate from the MAGIC digital archive, which provides open access to digitized manuscripts. Our training data was created by aligning OCR outputs with manually verified transcriptions based on these sources.
Required: Python 3.9 - 3.12
Important: Python 3.13 is NOT compatible due to Kraken dependencies. Use Python 3.12 for best compatibility.
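If it helps, the version constraint above can be checked before installing anything. This is a small sketch using only the standard library; `is_supported` is a hypothetical helper and the bounds are copied from the requirement stated above.

```python
import sys

SUPPORTED_MIN = (3, 9)   # lowest supported Python version
SUPPORTED_MAX = (3, 12)  # highest supported Python version


def is_supported(version=None) -> bool:
    """Return True if the (major, minor) version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if not is_supported():
    print(
        f"Python {sys.version_info[0]}.{sys.version_info[1]} is outside the "
        "supported range 3.9 - 3.12; consider a Python 3.12 environment."
    )
```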
# Clone the repository
git clone https://github.com/yahyamomtaz/medieval-ocr-pipeline.git
cd medieval-ocr-pipeline
# Install dependencies
pip install -r requirements.txt
# Install Kraken for line segmentation
pip install kraken
The pipeline uses two pre-trained models that are automatically downloaded on first run:
Models (auto-downloaded from HuggingFace):
- TrOCR Medieval: `medieval-data/trocr-medieval-print`
- Fine-tuned ByT5 Correction: `yayamomt/byt5-medieval-ocr-correction`
The pipeline automatically downloads both models from HuggingFace Hub on first run. Simply install dependencies and run:
python complete_ocr_pipeline.py --image_path your_image.jpg
The models will be cached locally for future use (~1.5GB total download).
Process a medieval manuscript image through the entire pipeline:
python complete_ocr_pipeline.py \
--image_path manuscript.jpg \
--output_file corrected_text.txt
This produces:
- `corrected_text.txt`: Final corrected text
- `detailed_results.txt`: Line-by-line comparison
- Individual line images (if `--keep_temp` flag used)
from complete_ocr_pipeline import process_complete_image
# Process image and get results
final_text, line_results = process_complete_image(
image_path="path/to/manuscript.jpg",
output_file="output.txt"
)
print(f"Processed {len(line_results)} lines")
print(f"Final text: {final_text[:100]}...")
If you want to train with your own dataset:
# Prepare dataset in the required format
python prepare_dataset.py \
--input_csv your_data.csv \
--output_dir processed_dataset/
Train the correction model:
python Byt5_finetune.py \
--dataset_path dataset/dataset.csv \
--output_dir ./byt5-ocr-correction \
--num_epochs 4 \
--batch_size 2 \
--learning_rate 5e-5
| Parameter | Default | Description |
|---|---|---|
| `--num_epochs` | 4 | Number of training epochs |
| `--batch_size` | 2 | Training batch size |
| `--learning_rate` | 5e-5 | Learning rate |
| `--max_length` | 128 | Maximum sequence length |
| `--warmup_steps` | 250 | Warmup steps for scheduler |
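To give a feel for what these defaults imply, the sketch below estimates the number of optimizer steps and traces a learning rate curve. It rests on several assumptions that the table does not confirm: a linear warmup followed by linear decay, one optimizer step per batch (no gradient accumulation, single device), and a training set of 80% of the 10,643 lines. The byte budget check relies on ByT5 operating on UTF-8 bytes plus one end-of-sequence token; whether the training script counts `--max_length` that way is also an assumption. All function names are hypothetical.

```python
import math

TOTAL_LINES = 10_643
TRAIN_FRACTION = 0.8
BATCH_SIZE = 2
NUM_EPOCHS = 4
PEAK_LR = 5e-5
WARMUP_STEPS = 250
MAX_LENGTH = 128


def total_training_steps() -> int:
    """Optimizer steps, assuming one step per batch."""
    train_lines = int(TOTAL_LINES * TRAIN_FRACTION)
    steps_per_epoch = math.ceil(train_lines / BATCH_SIZE)
    return steps_per_epoch * NUM_EPOCHS


def linear_warmup_decay(step: int, total_steps: int) -> float:
    """Assumed schedule: ramp from 0 to PEAK_LR, then decay linearly to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - WARMUP_STEPS)


def fits_byte_budget(line: str, max_length: int = MAX_LENGTH) -> bool:
    """Check UTF-8 byte length plus one end-of-sequence token."""
    return len(line.encode("utf-8")) + 1 <= max_length


steps = total_training_steps()
print(steps)
for probe in (0, 125, WARMUP_STEPS, steps // 2, steps):
    print(probe, linear_warmup_decay(probe, steps))
print(fits_byte_budget("Uctus ẽ iesus in desertũ a sai"))
```

Under these assumptions the warmup covers only a small fraction of training, and lines near the reported ~45 character average fit comfortably within 128 bytes even when they contain multi-byte abbreviation marks.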
| Original OCR | Corrected Text | Ground Truth |
|---|---|---|
| Ius dei occultare. la terza un tẽtatis fa | Ius dei occultaret. la terza ut tetatis fa | Ius dei occultaret. la terza ut tetatis fa |
| ti perlo peccato:cessi uolse esser tenta | ti per lo peccato: Cossi uolse esser tenta | ti per lo peccato: Cossi uolse esser tenta |
| Uctus ẽ iesus in desertũ a sai | Vctus est iesus in desertum a spi | Vctus est iesus in desertum a spi |
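To see which characters the correction step changed in rows like these, a character-level diff is often enough. The sketch below uses `difflib.SequenceMatcher` from the standard library; `char_edits` is a hypothetical helper, and the matcher's opcodes are a readable alignment rather than a guaranteed minimal edit script.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def char_edits(ocr: str, corrected: str) -> List[Tuple[str, str, str]]:
    """List (operation, old_text, new_text) for every non-matching span."""
    matcher = SequenceMatcher(None, ocr, corrected, autojunk=False)
    return [
        (tag, ocr[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


ocr_line = "ti perlo peccato:cessi uolse esser tenta"
corrected_line = "ti per lo peccato: Cossi uolse esser tenta"
for tag, old, new in char_edits(ocr_line, corrected_line):
    print(f"{tag:8} {old!r} -> {new!r}")
```

A listing like this makes it easy to spot recurring correction patterns, such as restored word spacing or expanded abbreviation marks.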
- Kraken OCR engine for line segmentation
- TrOCR team for the medieval manuscript model
- Google for the ByT5 architecture
- MDPI Electronics for publishing our research
Citation: If you use this work in your research, please cite our paper:
@Article{electronics14153083,
AUTHOR = {Momtaz, Yahya and Laccetti, Lorenza and Russo, Guido},
TITLE = {Modular Pipeline for Text Recognition in Early Printed Books Using Kraken and ByT5},
JOURNAL = {Electronics},
VOLUME = {14},
YEAR = {2025},
NUMBER = {15},
ARTICLE-NUMBER = {3083},
URL = {https://www.mdpi.com/2079-9292/14/15/3083},
ISSN = {2079-9292},
DOI = {10.3390/electronics14153083}
}
Feel free to contact me via LinkedIn