
erstcl/ocr-service-LCT-2025-hackathon

 
 



Moscow Archives OCR Service

Web service for automated processing and indexing of historical archive documents

Project Presentation

System for automatic extraction and indexing of information from archival document images with support for pre-revolutionary Russian orthography.


Overview

The service provides comprehensive document processing capabilities:

  • Image preprocessing — alignment, contrast enhancement, noise reduction
  • OCR recognition — handwritten and printed text extraction
  • Historical orthography support — recognition of pre-revolutionary characters (ѣ, ѳ, і, ѵ, ъ)
  • Named entity recognition — extraction of names, dates, addresses, archival codes
  • Verification and correction — user-guided result validation
  • Data export — multiple output formats
  • Quality monitoring — processing statistics and accuracy metrics
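
Text recognized in pre-revolutionary spelling is often easier to search when a modernized copy is stored next to it. The sketch below is an illustration only, not the service's own code: `normalize_orthography` is a hypothetical helper that replaces the letters listed above with their usual modern equivalents and drops the word-final hard sign.

```python
import re

# Pre-reform letters and their usual modern replacements (illustrative mapping).
_LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "ѳ": "ф", "Ѳ": "Ф",
    "і": "и", "І": "И",
    "ѵ": "и", "Ѵ": "И",
})

# A hard sign directly before a word boundary, i.e. at the end of a word.
_FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize_orthography(text: str) -> str:
    """Return a modern-spelling copy of `text` suitable for search indexing."""
    without_final_sign = _FINAL_HARD_SIGN.sub("", text)
    return without_final_sign.translate(_LETTER_MAP)
```

A hard sign inside a word (as in "объявленіе") is kept, because only the word-final position is matched.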

Supported Formats

  • Input: JPG, JPEG, TIFF, PDF
  • Output: JSON, CSV, XML
  • Max file size: 100MB
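
A minimal sketch of how these limits could be checked before a file is accepted. The helper name `upload_errors` and the extension set are assumptions for illustration; the actual checks in `web_service.py` may differ.

```python
from pathlib import Path
from typing import List

# Extensions derived from the input formats listed above (assumed spelling).
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".tiff", ".pdf"}
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB in bytes


def upload_errors(filename: str, size_bytes: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file is acceptable."""
    errors = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"Unsupported file type: {extension or 'no extension'}")
    if size_bytes > MAX_FILE_SIZE:
        errors.append(f"File is {size_bytes} bytes; the limit is {MAX_FILE_SIZE}")
    return errors
```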

Quick Start

1. Clone the repository

git clone <repository-url> archive-ocr-service
cd archive-ocr-service

2. Install dependencies

pip install -r requirements.txt

3. Configure environment variables

Create a .env file:

# Yandex API (optional for extended features)
YANDEX_API_KEY=your_yandex_api_key_here
YANDEX_FOLDER_ID=your_folder_id_here

# Database settings
DATABASE_URL=sqlite:///./archive_service.db

# File settings
MAX_FILE_SIZE=104857600
UPLOAD_DIR=uploads
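
As a rough sketch, the variables above could be read into a typed settings object as follows. This assumes the `.env` values have already been loaded into the process environment; the `Settings` class and `load_settings` function are hypothetical names, not part of the service.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    max_file_size: int
    upload_dir: str
    yandex_api_key: Optional[str]    # optional, only for extended features
    yandex_folder_id: Optional[str]  # optional, only for extended features


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, falling back to the documented defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:///./archive_service.db"),
        max_file_size=int(env.get("MAX_FILE_SIZE", str(100 * 1024 * 1024))),
        upload_dir=env.get("UPLOAD_DIR", "uploads"),
        yandex_api_key=env.get("YANDEX_API_KEY") or None,
        yandex_folder_id=env.get("YANDEX_FOLDER_ID") or None,
    )
```

Note that `MAX_FILE_SIZE=104857600` is the same 100 MB limit expressed in bytes (100 × 1024 × 1024).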

4. Run the service

python web_service.py

5. Access the interface

Open your browser at http://localhost:8000


Project Structure

archive-ocr-service/
├── web_service.py                    # Main FastAPI service
├── integrated_archive_processor.py   # OCR processor
├── requirements.txt                  # Python dependencies
├── .env                             # Environment variables
├── Dockerfile                       # Docker image
├── docker-compose.yml              # Docker Compose config
├── README.md                        # Documentation
├── docs/                           # Documentation
│   ├── deployment-guide.pdf        # Deployment guide
│   └── api-docs.md                 # API documentation
├── uploads/                        # Uploaded files
├── static/                         # Static files
└── tests/                          # Tests

Configuration

Core Settings

# In web_service.py
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
LOW_CONFIDENCE_THRESHOLD = 0.75     # Low confidence threshold
DATABASE_URL = "sqlite:///./archive_service.db"
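
One plausible use of `LOW_CONFIDENCE_THRESHOLD` is to pick out recognized segments that a user should verify. The sketch below assumes each segment is a dict with a `confidence` value between 0 and 1; that record shape and the function name are illustrative, not taken from the service.

```python
from typing import Dict, List

LOW_CONFIDENCE_THRESHOLD = 0.75


def segments_needing_review(
    segments: List[Dict],
    threshold: float = LOW_CONFIDENCE_THRESHOLD,
) -> List[Dict]:
    """Return the segments whose confidence is strictly below the threshold."""
    return [segment for segment in segments if segment["confidence"] < threshold]
```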

OCR Parameters

# In integrated_archive_processor.py
use_angle_cls=True                  # Angle classification
det_db_thresh=0.2                   # Detection threshold
det_db_box_thresh=0.3              # Bounding box threshold
max_side_len=4096                  # Maximum resolution
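
To illustrate what a resolution cap such as `max_side_len` implies, the sketch below computes the size an image would be scaled to, assuming the limit applies to the longer side and the aspect ratio is preserved. `fit_within` is a hypothetical helper; the OCR engine performs its own resizing internally.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side_len: int = 4096) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side does not exceed max_side_len."""
    longest = max(width, height)
    if longest <= max_side_len:
        return width, height  # already small enough, leave unchanged
    scale = max_side_len / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```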

Docker Deployment

Build the image

docker build -t archive-ocr-service .

Run the container

docker run -d \
  --name archive-ocr \
  -p 8000:8000 \
  -v "$(pwd)/uploads:/app/uploads" \
  -v "$(pwd)/.env:/app/.env" \
  archive-ocr-service

Using Docker Compose

docker-compose up -d

API Endpoints

Core Operations

  • GET / — Web interface
  • POST /upload — Upload document
  • GET /status/{task_id} — Processing status
  • GET /results/{task_id} — OCR results
  • GET /stats — Statistics

Verification and Correction

  • POST /correct/{task_id} — Correct text
  • POST /verify/{task_id} — Verify segment

Data Export

  • GET /export/{task_id} — Export document
  • POST /export — Extended export

Service

  • GET /health — Health check
  • GET /docs — Swagger documentation

System Requirements

Minimum Requirements

  • OS: Ubuntu 20.04+, Windows 10+, macOS 10.15+
  • Python: 3.8+
  • RAM: 4GB (8GB recommended)
  • Disk: 10GB free space
  • Internet: required for downloading OCR models on first run

Recommended Requirements

  • CPU: 4+ cores
  • RAM: 16GB+
  • GPU: NVIDIA GPU with CUDA support (optional)
  • SSD: for faster file access

Testing

Run tests

python -m pytest tests/

Health check

curl http://localhost:8000/health

Example API usage

# Upload file
curl -X POST "http://localhost:8000/upload" \
     -F "file=@document.jpg"

# Check status (replace {task_id} with the actual task ID)
curl "http://localhost:8000/status/{task_id}"

# Get results (replace {task_id} with the actual task ID)
curl "http://localhost:8000/results/{task_id}"
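
The same status and results calls can be scripted. The sketch below polls the status endpoint until the task finishes and then fetches the results. The response field `status` and its values `completed` and `failed` are assumptions, since the response schema is not documented here; check `/docs` for the real one.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def fetch_json(path: str) -> dict:
    """GET a path on the service and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path) as response:
        return json.load(response)


def wait_for_results(task_id: str, fetch=fetch_json,
                     poll_seconds: float = 2.0, max_polls: int = 150) -> dict:
    """Poll /status/{task_id} until done, then return /results/{task_id}."""
    for _ in range(max_polls):
        state = fetch(f"/status/{task_id}").get("status")  # assumed field name
        if state == "completed":                            # assumed value
            return fetch(f"/results/{task_id}")
        if state == "failed":                               # assumed value
            raise RuntimeError(f"Task {task_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task {task_id} did not finish after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the polling logic testable without a running server.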

Security

  • File type validation
  • File size limits (100MB)
  • Input data sanitization
  • CORS configuration
  • Operation logging
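
File type validation is more robust when it inspects the file contents instead of trusting the extension. As a hedged illustration (not the service's actual check), the sketch below recognizes the supported input formats by their well-known leading bytes; `sniff_file_type` is a hypothetical name.

```python
from typing import Optional

# Well-known file signatures ("magic bytes") for the supported input formats.
_SIGNATURES = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"%PDF-", "pdf"),
)


def sniff_file_type(header: bytes) -> Optional[str]:
    """Return the detected type for the first bytes of a file, or None if unrecognized."""
    for signature, kind in _SIGNATURES:
        if header.startswith(signature):
            return kind
    return None
```

Reading the first 8 bytes of an upload is enough for all of the signatures above.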

Monitoring and Logging

View logs

tail -f web_service.log

Metrics tracked

  • Number of processed documents
  • Average recognition confidence
  • Processing time
  • Error statistics
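
A minimal sketch of how these metrics could be aggregated from per-document records. The record fields (`confidence`, `seconds`, `error`) and the function name are assumptions for illustration; the `/stats` endpoint may compute and name them differently.

```python
from statistics import mean
from typing import Dict, List


def summarize(records: List[Dict]) -> Dict:
    """Aggregate per-document records into the tracked metrics."""
    succeeded = [r for r in records if r.get("error") is None]
    return {
        "processed": len(records),
        "errors": len(records) - len(succeeded),
        # Confidence is averaged over successful documents only.
        "avg_confidence": mean(r["confidence"] for r in succeeded) if succeeded else None,
        "avg_seconds": mean(r["seconds"] for r in records) if records else None,
    }
```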

Updates

git pull
pip install -r requirements.txt --upgrade
python web_service.py

Authors

Team: CUphoria
Organizer: Digital Transformation Leaders
Year: 2025

About

OCR/HTR web service for digitizing Moscow church records (18th-19th centuries). Extracts structured data (names, dates, addresses) from historical documents using NER and reports WER quality metrics. Hackathon project for the Moscow Main Archive.
