Moscow Archives OCR Service

A web service that automatically extracts and indexes information from images of historical archive documents, with support for pre-revolutionary Russian orthography.

Overview

The service provides comprehensive document processing capabilities:

Image preprocessing — alignment, contrast enhancement, noise reduction
OCR recognition — handwritten and printed text extraction
Historical orthography support — recognition of pre-revolutionary characters (ѣ, ѳ, і, ѵ, ъ)
Named entity recognition — extraction of names, dates, addresses, archival codes
Verification and correction — user-guided result validation
Data export — multiple output formats
Quality monitoring — processing statistics and accuracy metrics
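
To illustrate what historical orthography support can mean for indexing, here is a minimal sketch that maps the pre-revolutionary letters listed above to their modern equivalents, following the 1918 spelling reform. The function name `modernize` and the idea of normalizing text before indexing are assumptions made for illustration; the service's actual handling may differ.

```python
import re

# Letters abolished by the 1918 reform and their modern replacements.
_LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "ѳ": "ф", "Ѳ": "Ф",
    "і": "и", "І": "И",
    "ѵ": "и", "Ѵ": "И",
})

# A hard sign at the end of a word was dropped by the reform;
# a hard sign inside a word (e.g. "объявление") is kept.
_FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def modernize(text: str) -> str:
    """Return text with pre-reform letters replaced (hypothetical helper)."""
    return _FINAL_HARD_SIGN.sub("", text.translate(_LETTER_MAP))
```

Under this sketch, a search index could store both the original transcription and the modernized form, so that queries in modern spelling still match.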

Supported Formats

Input: JPG, JPEG, TIFF, PDF
Output: JSON, CSV, XML
Max file size: 100MB
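
As a rough sketch of how these limits could be enforced before a file is accepted, the check below tests the extension and the size. The names `ALLOWED_EXTENSIONS` and `is_acceptable_upload` are hypothetical, and the service's real validation may inspect file contents as well.

```python
from pathlib import Path

# Assumed from the format list above; not taken from the service's code.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".tiff", ".pdf"}
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB in bytes


def is_acceptable_upload(filename: str, size_bytes: int) -> bool:
    """Return True if the extension is supported and the size is within the limit."""
    extension = Path(filename).suffix.lower()
    return extension in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_FILE_SIZE
```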

Quick Start

1. Clone the repository

git clone <repository-url>
cd archive-ocr-service

2. Install dependencies

pip install -r requirements.txt

3. Configure environment variables

Create a .env file:

# Yandex API (optional for extended features)
YANDEX_API_KEY=your_yandex_api_key_here
YANDEX_FOLDER_ID=your_folder_id_here

# Database settings
DATABASE_URL=sqlite:///./archive_service.db

# File settings
MAX_FILE_SIZE=104857600
UPLOAD_DIR=uploads
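
For orientation, the following sketch shows one way these variables could be read in Python, with defaults mirroring the values above. The `Settings` class and `load_settings` function are hypothetical. The sketch reads the process environment only, so the .env file would first have to be loaded, for example with python-dotenv's `load_dotenv()`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables shown above."""
    database_url: str
    max_file_size: int
    upload_dir: str
    yandex_api_key: Optional[str]
    yandex_folder_id: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, falling back to defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:///./archive_service.db"),
        max_file_size=int(env.get("MAX_FILE_SIZE", str(100 * 1024 * 1024))),
        upload_dir=env.get("UPLOAD_DIR", "uploads"),
        # The Yandex keys are optional, so empty or missing values become None.
        yandex_api_key=env.get("YANDEX_API_KEY") or None,
        yandex_folder_id=env.get("YANDEX_FOLDER_ID") or None,
    )
```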

4. Run the service

python web_service.py

5. Access the interface

Open your browser at http://localhost:8000

Project Structure

archive-ocr-service/
├── web_service.py                    # Main FastAPI service
├── integrated_archive_processor.py   # OCR processor
├── requirements.txt                  # Python dependencies
├── .env                              # Environment variables
├── Dockerfile                        # Docker image
├── docker-compose.yml                # Docker Compose config
├── README.md                         # Documentation
├── docs/                             # Documentation
│   ├── deployment-guide.pdf          # Deployment guide
│   └── api-docs.md                   # API documentation
├── uploads/                          # Uploaded files
├── static/                           # Static files
└── tests/                            # Tests

Configuration

Core Settings

# In web_service.py
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
LOW_CONFIDENCE_THRESHOLD = 0.75     # Low confidence threshold
DATABASE_URL = "sqlite:///./archive_service.db"
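
A plausible use of LOW_CONFIDENCE_THRESHOLD is to flag recognized segments for manual verification. The sketch below assumes each segment is a dict with a `confidence` key; that shape and the function name are hypothetical, not taken from the service's code.

```python
from typing import Dict, List

LOW_CONFIDENCE_THRESHOLD = 0.75  # value from the Core Settings above


def segments_needing_review(segments: List[Dict]) -> List[Dict]:
    """Return the segments whose confidence is below the threshold."""
    return [s for s in segments if s["confidence"] < LOW_CONFIDENCE_THRESHOLD]
```

With this rule, a segment at exactly 0.75 is not flagged; only values strictly below the threshold are.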

OCR Parameters

# In integrated_archive_processor.py
use_angle_cls=True                  # Angle classification
det_db_thresh=0.2                   # Detection threshold
det_db_box_thresh=0.3              # Bounding box threshold
max_side_len=4096                  # Maximum resolution
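
If max_side_len caps the longest image side before detection, which is an assumption about how the underlying OCR engine interprets it, the effect on image dimensions can be sketched as follows. The helper `fit_within_max_side` is hypothetical.

```python
from typing import Tuple

MAX_SIDE_LEN = 4096  # value from the OCR Parameters above


def fit_within_max_side(width: int, height: int,
                        max_side: int = MAX_SIDE_LEN) -> Tuple[int, int]:
    """Scale dimensions down proportionally so the longest side fits max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave unchanged
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under that reading, an 8192 x 4096 scan would be processed at 4096 x 2048, while smaller images keep their original size.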

Docker Deployment

Build the image

docker build -t archive-ocr-service .

Run the container

docker run -d \
  --name archive-ocr \
  -p 8000:8000 \
  -v "$(pwd)/uploads:/app/uploads" \
  -v "$(pwd)/.env:/app/.env" \
  archive-ocr-service

Using Docker Compose

docker-compose up -d

API Endpoints

Core Operations

GET / — Web interface
POST /upload — Upload document
GET /status/{task_id} — Processing status
GET /results/{task_id} — OCR results
GET /stats — Statistics

Verification and Correction

POST /correct/{task_id} — Correct text
POST /verify/{task_id} — Verify segment

Data Export

GET /export/{task_id} — Export document
POST /export — Extended export

Service

GET /health — Health check
GET /docs — Swagger documentation

System Requirements

Minimum Requirements

OS: Ubuntu 20.04+, Windows 10+, macOS 10.15+
Python: 3.8+
RAM: 4GB (8GB recommended)
Disk: 10GB free space
Internet: required for downloading OCR models on first run

Recommended Requirements

CPU: 4+ cores
RAM: 16GB+
GPU: NVIDIA GPU with CUDA support (optional)
SSD: for faster file access

Testing

Run tests

python -m pytest tests/

Health check

curl http://localhost:8000/health

Example API usage

# Upload file
curl -X POST "http://localhost:8000/upload" \
     -F "file=@document.jpg"

# Check status (replace {task_id} with the ID of your task)
curl "http://localhost:8000/status/{task_id}"

# Get results
curl "http://localhost:8000/results/{task_id}"
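
The same status and results calls can be scripted in Python. This is a minimal sketch using only the standard library. The `status` field and its `completed` and `failed` values are assumptions about the response format and should be checked against the Swagger documentation at /docs.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

BASE_URL = "http://localhost:8000"


def fetch_json(url: str) -> Dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_results(task_id: str,
                     fetch: Callable[[str], Dict] = fetch_json,
                     poll_seconds: float = 2.0,
                     max_polls: int = 150) -> Dict:
    """Poll /status/{task_id} until done, then return /results/{task_id}."""
    for _ in range(max_polls):
        status = fetch(f"{BASE_URL}/status/{task_id}")
        state = status.get("status")  # hypothetical field name
        if state == "completed":      # hypothetical value
            return fetch(f"{BASE_URL}/results/{task_id}")
        if state == "failed":         # hypothetical value
            raise RuntimeError(f"Task {task_id} failed: {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task {task_id} did not finish after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the polling logic testable without a running server.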

Security

File type validation
File size limits (100MB)
Input data sanitization
CORS configuration
Operation logging

Monitoring and Logging

View logs

tail -f web_service.log

Metrics tracked

Number of processed documents
Average recognition confidence
Processing time
Error statistics
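
As an illustration of how the metrics listed above could be aggregated, the sketch below summarizes a list of per-document records. The record keys (`confidence`, `seconds`, `error`) and the function name are hypothetical; the service's own statistics may be computed differently.

```python
from statistics import mean
from typing import Dict, List


def summarize(records: List[Dict]) -> Dict[str, object]:
    """Aggregate per-document records into the four tracked metrics."""
    succeeded = [r for r in records if r.get("error") is None]
    return {
        "processed_documents": len(succeeded),
        "average_confidence": (
            mean(r["confidence"] for r in succeeded) if succeeded else None
        ),
        "average_processing_seconds": (
            mean(r["seconds"] for r in succeeded) if succeeded else None
        ),
        "error_count": len(records) - len(succeeded),
    }
```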

Updates

git pull
pip install -r requirements.txt --upgrade
python web_service.py

Authors

Team: CUphoria
Organizer: Digital Transformation Leaders
Year: 2025
