Web service for automated processing and indexing of historical archive documents
System for automatic extraction and indexing of information from archival document images with support for pre-revolutionary Russian orthography.
The service provides comprehensive document processing capabilities:
- Image preprocessing — alignment, contrast enhancement, noise reduction
- OCR recognition — handwritten and printed text extraction
- Historical orthography support — recognition of pre-revolutionary characters (ѣ, ѳ, і, ѵ, ъ)
- Named entity recognition — extraction of names, dates, addresses, archival codes
- Verification and correction — user-guided result validation
- Data export — multiple output formats
- Quality monitoring — processing statistics and accuracy metrics
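The historical-orthography support above can be illustrated with a small post-OCR normalization pass. This is a minimal sketch, not the service's actual code: the `OLD_TO_MODERN` mapping and the `modernize` function are assumptions for illustration, covering the listed pre-revolutionary characters and the word-final hard sign.

```python
import re

# Illustrative mapping of pre-1918 Russian characters to modern
# equivalents (hypothetical; not the service's actual implementation).
OLD_TO_MODERN = {
    "ѣ": "е", "Ѣ": "Е",   # yat
    "ѳ": "ф", "Ѳ": "Ф",   # fita
    "і": "и", "І": "И",   # decimal i
    "ѵ": "и", "Ѵ": "И",   # izhitsa
}

def modernize(text: str) -> str:
    """Map pre-revolutionary characters to modern ones and drop
    the hard sign at word ends; word-internal 'ъ' is kept."""
    for old, new in OLD_TO_MODERN.items():
        text = text.replace(old, new)
    return re.sub(r"ъ\b", "", text)
```

For example, `modernize("хлѣбъ")` yields `"хлеб"`.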
Supported formats:

- Input: JPG, JPEG, TIFF, PDF
- Output: JSON, CSV, XML
- Maximum file size: 100 MB
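A pre-upload check mirroring these documented limits might look as follows. This is a sketch: `validate_upload` and `ALLOWED_EXTENSIONS` are illustrative names, not part of the service's API.

```python
import os

# Limits taken from the documented constraints above; the function
# itself is a hypothetical client-side check, not the service's code.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".tiff", ".pdf"}
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB

def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file violates the documented limits."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 100 MB limit")
```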
```bash
git clone <repository-url>
cd archive-ocr-service
pip install -r requirements.txt
```

Create a `.env` file:
```env
# Yandex API (optional, for extended features)
YANDEX_API_KEY=your_yandex_api_key_here
YANDEX_FOLDER_ID=your_folder_id_here

# Database settings
DATABASE_URL=sqlite:///./archive_service.db

# File settings
MAX_FILE_SIZE=104857600
UPLOAD_DIR=uploads
```

Start the service:

```bash
python web_service.py
```

Then open http://localhost:8000 in your browser.
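The `.env` variables above might be read in application code roughly like this (a sketch assuming plain environment variables, e.g. loaded via python-dotenv; variable names match the file above, defaults are the documented values):

```python
import os

# Read settings with the documented defaults; illustrative, not the
# service's actual startup code.
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./archive_service.db")
MAX_FILE_SIZE = int(os.getenv("MAX_FILE_SIZE", "104857600"))  # bytes (100 MB)
UPLOAD_DIR = os.getenv("UPLOAD_DIR", "uploads")
YANDEX_API_KEY = os.getenv("YANDEX_API_KEY")  # None when the optional key is absent
```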
```
archive-ocr-service/
├── web_service.py                    # Main FastAPI service
├── integrated_archive_processor.py   # OCR processor
├── requirements.txt                  # Python dependencies
├── .env                              # Environment variables
├── Dockerfile                        # Docker image
├── docker-compose.yml                # Docker Compose config
├── README.md                         # Documentation
├── docs/                             # Documentation
│   ├── deployment-guide.pdf          # Deployment guide
│   └── api-docs.md                   # API documentation
├── uploads/                          # Uploaded files
├── static/                           # Static files
└── tests/                            # Tests
```
Key settings in `web_service.py`:

```python
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB
LOW_CONFIDENCE_THRESHOLD = 0.75    # segments below this are flagged for review
DATABASE_URL = "sqlite:///./archive_service.db"
```

OCR parameters in `integrated_archive_processor.py`:

```python
use_angle_cls=True      # text-angle classification
det_db_thresh=0.2       # text detection threshold
det_db_box_thresh=0.3   # bounding-box threshold
max_side_len=4096       # maximum image side length
```

Build and run with Docker:

```bash
docker build -t archive-ocr-service .
docker run -d \
  --name archive-ocr \
  -p 8000:8000 \
  -v $(pwd)/uploads:/app/uploads \
  -v $(pwd)/.env:/app/.env \
  archive-ocr-service
```

Or with Docker Compose:

```bash
docker-compose up -d
```

API endpoints:

- `GET /` — Web interface
- `POST /upload` — Upload a document
- `GET /status/{task_id}` — Processing status
- `GET /results/{task_id}` — OCR results
- `GET /stats` — Statistics
- `POST /correct/{task_id}` — Correct recognized text
- `POST /verify/{task_id}` — Verify a segment
- `GET /export/{task_id}` — Export a document
- `POST /export` — Extended export
- `GET /health` — Health check
- `GET /docs` — Swagger documentation
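The verification and correction flow can be sketched around the `LOW_CONFIDENCE_THRESHOLD` setting (0.75 in `web_service.py`): segments whose OCR confidence falls below it are routed to a reviewer. The segment data structure and function below are hypothetical, for illustration only:

```python
LOW_CONFIDENCE_THRESHOLD = 0.75  # mirrors the setting in web_service.py

def needs_review(segments):
    """Return the segments a reviewer should verify or correct.
    Segment shape is assumed: {"text": str, "confidence": float}."""
    return [s for s in segments if s["confidence"] < LOW_CONFIDENCE_THRESHOLD]

segments = [
    {"text": "Иванъ", "confidence": 0.92},
    {"text": "Пєтровъ", "confidence": 0.61},
]
print(needs_review(segments))  # only the 0.61 segment remains
```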
Minimum:

- OS: Linux (Ubuntu 20.04+), Windows 10+, or macOS 10.15+
- Python: 3.8+
- RAM: 4 GB (8 GB recommended)
- Disk: 10 GB free space
- Internet: required to download OCR models on first run

Recommended:

- CPU: 4+ cores
- RAM: 16 GB+
- GPU: NVIDIA GPU with CUDA support (optional)
- SSD: for faster file access
Run the tests:

```bash
python -m pytest tests/
```

Check service health:

```bash
curl http://localhost:8000/health
```

Example workflow:

```bash
# Upload a file
curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.jpg"

# Check status
curl "http://localhost:8000/status/{task_id}"

# Get results
curl "http://localhost:8000/results/{task_id}"
```

Security measures:

- File type validation
- File size limits (100MB)
- Input data sanitization
- CORS configuration
- Operation logging
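Input sanitization of uploaded filenames, for instance, might look like this. The function is illustrative, not the service's actual implementation:

```python
import os
import re

def sanitize_filename(name: str) -> str:
    """Neutralize path traversal and unsafe characters in an
    uploaded filename (hypothetical example)."""
    name = os.path.basename(name)          # strip any path components
    name = re.sub(r"[^\w.\-]", "_", name)  # keep word chars, dots, dashes
    return name[:255]                      # cap the length
```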
View logs:

```bash
tail -f web_service.log
```

Monitored metrics:

- Number of processed documents
- Average recognition confidence
- Processing time
- Error statistics
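Aggregating these metrics can be sketched as follows; the result record shape (`status`, `confidence`) and the `summarize` function are assumptions for illustration:

```python
def summarize(results):
    """Compute processed-document count, average recognition
    confidence, and error count from hypothetical task records."""
    done = [r for r in results if r["status"] == "done"]
    errors = sum(1 for r in results if r["status"] == "error")
    avg_conf = (sum(r["confidence"] for r in done) / len(done)) if done else 0.0
    return {
        "processed": len(done),
        "avg_confidence": round(avg_conf, 3),
        "errors": errors,
    }
```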
To update:

```bash
git pull
pip install -r requirements.txt --upgrade
python web_service.py
```

Team: CUphoria
Organizer: Digital Transformation Leaders
Year: 2025