A high-performance FastAPI service for PDF document processing, OCR (Optical Character Recognition), and text parsing using Google Cloud Vision API. The prototype uses regex-based classification and handles one document at a time; Vertex embedding is disabled and the KAG handoff is active.
This is a prototype version with the following characteristics:
- Single-document mode only - No multi-document handling
- Regex-based classification - Template matching using legal keywords (no Vertex Matching Engine)
- Vertex embedding disabled - Embeddings set to null/placeholder values
- KAG handoff active - Unified KAG Writer component with automatic schema-compliant generation
- Deterministic results - Consistent outputs for the same test document
- Complete artifact generation - parsed_output.json, classification_verdict.json, kag_input.json
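
To make the regex-based classification concrete, here is a minimal sketch of keyword matching with a deterministic tie-break. The pattern table, the labels, and the `classify` function are illustrative assumptions, not the project's actual keyword database or classifier.

```python
import re
from typing import Dict, List

# Hypothetical keyword patterns per label; the project's real keyword list may differ.
LEGAL_PATTERNS: Dict[str, List[str]] = {
    "contract": [r"\bagreement\b", r"\bparty of the first part\b", r"\bhereinafter\b"],
    "court_order": [r"\bpetitioner\b", r"\brespondent\b", r"\bhon'?ble court\b"],
}


def classify(text: str) -> Dict[str, object]:
    """Score each label by counting regex matches and return the best one."""
    scores: Dict[str, int] = {}
    matched: Dict[str, List[str]] = {}
    for label in sorted(LEGAL_PATTERNS):  # sorted for a stable iteration order
        hits: List[str] = []
        for pattern in LEGAL_PATTERNS[label]:
            hits.extend(m.group(0).lower() for m in re.finditer(pattern, text, re.IGNORECASE))
        scores[label] = len(hits)
        matched[label] = hits
    # Highest score wins; ties break alphabetically so the same text always gives the same verdict.
    best = min(scores, key=lambda label: (-scores[label], label))
    if scores[best] == 0:
        return {"label": "unknown", "score": 0, "matched_patterns": []}
    return {"label": best, "score": scores[best], "matched_patterns": matched[best]}


verdict = classify("This Agreement is made between the parties, hereinafter the Vendor.")
print(verdict["label"], verdict["score"])
```

Because the scoring is a plain match count with a fixed tie-break, repeated runs on the same document give the same verdict, which is the property the prototype's deterministic-results note describes.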
- PDF Processing: Convert PDF documents to high-quality images
- OCR Integration: Google Cloud Vision API for accurate text extraction
- Document AI: Integration with Google Document AI for structured parsing
- Pipeline Orchestration: Unified workflow combining PDF → Images → OCR → DocAI → Classification → KAG
- Regex Classification: Pattern-based document classification using legal keywords
- KAG Integration: Automatic `kag_input.json` generation with unified schema
- Schema Compliance: Structured output pairing DocAI results with classifier verdicts
- Multi-language Support: Configurable language hints for better OCR accuracy
- File Management: Upload, process, and manage document processing workflows
- Admin Tools: Data purge operations and usage analytics
- Modular Architecture: Router-based design for easy feature expansion
- Background Processing: Async processing for large documents
- Health Monitoring: Comprehensive health checks and status endpoints
- DocAI Compatible: Output format compatible with Google Document AI
- KAG Writer: Unified component for automatic knowledge input generation
- Python 3.8 or higher
- Google Cloud Project with Vision API enabled
- Google Cloud Service Account with Vision API permissions
git clone https://github.com/DocuMint-AI/ai-backend.git
cd ai-backend

Option A: Using uv (Recommended)

# Run setup script (installs uv if needed and sets up environment)
./setup.sh # Linux/Mac
# or
setup.bat # Windows
# Or manually:
uv venv # Create virtual environment
uv pip install -r requirements.txt # Install dependencies

Option B: Using traditional Python venv

python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Create a .env file in the project root:

# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT_ID=your-project-id
GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
# Application Configuration
DATA_ROOT=./data
IMAGE_FORMAT=PNG
IMAGE_DPI=300
LANGUAGE_HINTS=en,es,fr
MAX_FILE_SIZE_MB=50

# Download your service account key file from Google Cloud Console
# Place it in a secure location and update GOOGLE_APPLICATION_CREDENTIALS

# Start the development server
uv run main.py
# Or using uvicorn with auto-reload
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Without auto-reload, using 4 worker processes
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

- API Documentation: http://localhost:8000/docs
- Alternative Docs: http://localhost:8000/redoc
- Health Check: http://localhost:8000/health
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | API information and available endpoints |
| `GET` | `/health` | Health check and service status |
| `POST` | `/upload` | Upload PDF files for processing |
| `POST` | `/ocr-process` | Process uploaded PDFs with OCR |
| `GET` | `/results/{uid}` | Retrieve processing results |
| `GET` | `/folders` | List all processing folders |
| `DELETE` | `/cleanup/{uid}` | Clean up processing folders |

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/admin/purge` | Execute data cleanup operations |
| `GET` | `/admin/data-usage` | Get storage usage statistics |

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/docai/parse` | Parse document with Google Document AI |
| `POST` | `/api/docai/parse/batch` | Batch process multiple documents |
| `GET` | `/api/docai/processors` | List available DocAI processors |
| `GET` | `/api/docai/config` | Get DocAI configuration |

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/v1/process-document` | Complete pipeline: PDF → Images → OCR → DocAI |
| `GET` | `/api/v1/pipeline-status/{pipeline_id}` | Get real-time processing status |
| `GET` | `/api/v1/pipeline-results/{pipeline_id}` | Retrieve complete pipeline results |
| `GET` | `/api/v1/health` | Orchestration service health check |

The orchestration API provides a single endpoint for the complete workflow with integrated classification and KAG handoff:
# Process a document through the complete MVP pipeline
curl -X POST "http://localhost:8000/api/v1/process-document" \
-F "file=@contract.pdf" \
-F "language_hints=en,hi" \
-F "confidence_threshold=0.8"
# Response includes pipeline_id and processing artifacts
{
"success": true,
"pipeline_id": "abc123-def456",
"message": "Document processing completed successfully in 45.2s",
"total_processing_time": 45.2,
"final_results_path": "data/processed/pipeline_result_abc123-def456.json",
"stage_timings": {
"upload": 2.1,
"ocr": 15.4,
"docai": 20.3,
"classification": 1.2,
"kag": 3.8,
"saving": 2.4
}
}

MVP Pipeline Flow:
- Upload PDF - Secure file upload and validation
- PDF → Images - Multi-library fallback conversion
- Vision OCR - Google Cloud Vision text extraction
- Document AI - Structured document parsing
- Regex Classification - Pattern-based document categorization
- KAG Processing - Knowledge Augmented Generation preparation
- Artifact Generation - Save classification_verdict.json, kag_input.json, feature_vector.json
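
The flow above, together with the `stage_timings` field in the sample response, suggests a simple sequential runner that times each stage. The sketch below shows one way that could look; `run_pipeline` and the placeholder stage functions are assumptions made for illustration and do not reflect how the project's orchestration router is actually written.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# A stage is a name plus a function that takes and returns the shared context.
Stage = Tuple[str, Callable[[Dict[str, Any]], Dict[str, Any]]]


def run_pipeline(stages: List[Stage], context: Dict[str, Any]) -> Dict[str, Any]:
    """Run stages in order, passing a shared context and timing each stage."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        started = time.perf_counter()
        context = stage(context)
        timings[name] = round(time.perf_counter() - started, 3)
    return {
        "success": True,
        "stage_timings": timings,
        "total_processing_time": round(sum(timings.values()), 3),
        "context": context,
    }


# Placeholder stages standing in for the real OCR and classifier calls.
def fake_ocr(ctx: Dict[str, Any]) -> Dict[str, Any]:
    return {**ctx, "text": "sample agreement text"}


def fake_classify(ctx: Dict[str, Any]) -> Dict[str, Any]:
    label = "contract" if "agreement" in ctx["text"] else "unknown"
    return {**ctx, "label": label}


result = run_pipeline(
    [("ocr", fake_ocr), ("classification", fake_classify)],
    {"pdf": "contract.pdf"},
)
print(list(result["stage_timings"]), result["context"]["label"])
```

Keeping each stage as a plain function over a shared dictionary makes it easy to swap a stage for a stub in tests, at the cost of no type checking on the context keys.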
Generated Artifacts:
- classification_verdict.json - Document classification results with matched patterns
- kag_input.json - Structured handoff payload for downstream processing
- feature_vector.json - ML-ready features with classifier verdict (embeddings disabled)

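
As a rough illustration of artifact generation, the following sketch writes the three files as JSON with sorted keys, so identical input produces identical files. Only the file names come from this README; the `write_artifacts` helper and the payload fields are hypothetical and the real schemas are defined by the project.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_artifacts(out_dir: Path, artifacts: Dict[str, Dict[str, Any]]) -> Dict[str, Path]:
    """Write each artifact as pretty-printed JSON and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for filename, payload in artifacts.items():
        path = out_dir / filename
        # sort_keys keeps the file content identical for identical input.
        text = json.dumps(payload, indent=2, sort_keys=True, ensure_ascii=False)
        path.write_text(text, encoding="utf-8")
        written[filename] = path
    return written


# Hypothetical payloads, for illustration only.
verdict = {"label": "contract", "matched_patterns": ["agreement"]}
artifacts = {
    "classification_verdict.json": verdict,
    "kag_input.json": {"classifier_verdict": verdict, "parsed_document": {"text": "..."}},
    "feature_vector.json": {"classifier_verdict": verdict, "embedding": None},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_artifacts(Path(tmp), artifacts)
    reloaded = json.loads(paths["feature_vector.json"].read_text(encoding="utf-8"))
    print(sorted(paths), reloaded["embedding"])
```

The `None` embedding is serialized as JSON `null`, matching the note that embeddings are set to null or placeholder values in the prototype.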
curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf"

curl -X POST "http://localhost:8000/ocr-process" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_path": "/data/uploads/document.pdf",
    "language_hints": ["en", "es"],
    "force_reprocess": false
  }'

curl -X GET "http://localhost:8000/results/{uid}"

curl -X GET "http://localhost:8000/health"

For complete orchestration API documentation with examples, status monitoring, and result formats, see ORCHESTRATION_API.md

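
For callers who prefer Python over curl, the `/ocr-process` call can be sketched with the standard library alone. The `build_ocr_request` helper is a hypothetical name; the URL, headers, and body fields mirror the curl example, and the response format is not assumed here.

```python
import json
import urllib.request
from typing import List


def build_ocr_request(
    base_url: str,
    pdf_path: str,
    language_hints: List[str],
    force_reprocess: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /ocr-process endpoint."""
    body = json.dumps(
        {
            "pdf_path": pdf_path,
            "language_hints": language_hints,
            "force_reprocess": force_reprocess,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ocr-process",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ocr_request(
    "http://localhost:8000", "/data/uploads/document.pdf", ["en", "es"]
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=60) as response:
#     result = json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect and test without a running server.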
ai-backend/
├── main.py                          # FastAPI application entry point
├── routers/                         # Modular router architecture
│   ├── __init__.py                  # Router package initialization
│   ├── processing_handler.py        # Document processing endpoints
│   ├── doc_ai_router.py             # Document AI integration
│   └── orchestration_router.py      # MVP Pipeline orchestration
├── services/                        # Business logic and utilities
│   ├── doc_ai/                      # Document AI services
│   ├── preprocessing/               # Document preprocessing
│   │   ├── OCR-processing.py        # OCR processing logic
│   │   └── parsing.py               # Text parsing utilities
│   ├── template_matching/           # MVP Classification (NEW)
│   │   ├── legal_keywords.py        # Legal keyword database
│   │   └── regex_classifier.py      # Regex-based classifier
│   ├── kag_component.py             # KAG handoff component (NEW)
│   ├── feature_emitter.py           # Enhanced with classifier verdict
│   ├── util-services.py             # Utility functions
│   └── project_utils.py             # Project utilities
├── data/                            # Data storage directory
│   ├── uploads/                     # Uploaded files
│   ├── processed/                   # Pipeline results (NEW)
│   └── test-files/                  # Test documents
├── docs/                            # Documentation
│   ├── ORCHESTRATION_API.md         # Pipeline API docs (NEW)
│   └── ...                          # Other documentation
├── tests/                           # Test suite
├── requirements.txt                 # Python dependencies
├── test_orchestration.py            # Orchestration tests (NEW)
└── README.md                        # Project documentation

# Run the new MVP regex classification tests
python -m pytest tests/test_single_doc_regex.py -v
# Run quick validation
python tests/test_single_doc_regex.py
# Run migration tests
python test_orchestration.py

# Install test dependencies
pip install -r tests/test_requirements.txt
# Run all tests
pytest tests/
# Run specific test categories
pytest tests/test_ocr_processing.py
pytest tests/test_api_endpoints.py
pytest tests/test_single_doc_regex.py # MVP tests

| Variable | Description | Default |
|---|---|---|
| `GOOGLE_CLOUD_PROJECT_ID` | Google Cloud Project ID | Required |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account key | Required |
| `DATA_ROOT` | Data storage directory | ./data |
| `IMAGE_FORMAT` | Output image format | PNG |
| `IMAGE_DPI` | Image resolution | 300 |
| `LANGUAGE_HINTS` | OCR language hints | en |
| `MAX_FILE_SIZE_MB` | Maximum upload size | 50 |
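
The variables and defaults in the table can be read into a typed settings object along these lines. This is a sketch under stated assumptions: the `Settings` class and `load_settings` function are hypothetical names, and the code reads the process environment only, so it does not parse a `.env` file by itself.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    project_id: str
    credentials_path: str
    data_root: str
    image_format: str
    image_dpi: int
    language_hints: List[str]
    max_file_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    required = ("GOOGLE_CLOUD_PROJECT_ID", "GOOGLE_APPLICATION_CREDENTIALS")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    # LANGUAGE_HINTS is a comma-separated list such as "en,es,fr".
    hints = [h.strip() for h in env.get("LANGUAGE_HINTS", "en").split(",") if h.strip()]
    return Settings(
        project_id=env["GOOGLE_CLOUD_PROJECT_ID"],
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        data_root=env.get("DATA_ROOT", "./data"),
        image_format=env.get("IMAGE_FORMAT", "PNG"),
        image_dpi=int(env.get("IMAGE_DPI", "300")),
        language_hints=hints,
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "50")),
    )


settings = load_settings(
    {
        "GOOGLE_CLOUD_PROJECT_ID": "your-project-id",
        "GOOGLE_APPLICATION_CREDENTIALS": "path/to/your/credentials.json",
        "LANGUAGE_HINTS": "en, es,fr",
    }
)
print(settings.image_dpi, settings.language_hints)
```

Passing the environment as a mapping keeps the loader testable without touching the real process environment; failing early on the two required variables avoids a less obvious error at the first Google Cloud call.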
- Create a Google Cloud Project
- Enable the Cloud Vision API
- Create a service account with Vision API permissions
- Download the service account key file
- Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable
# Build the container
docker build -t ai-backend .
# Run the container
docker run -p 8000:8000 \
-e GOOGLE_CLOUD_PROJECT_ID=your-project \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
-v /path/to/credentials.json:/app/credentials.json:ro \
  ai-backend

docker-compose up -d

The application provides comprehensive logging and monitoring:
- Health Checks: Service status and dependency checks
- Processing Metrics: OCR success rates and processing times
- Storage Analytics: Data usage and cleanup statistics
- Error Tracking: Detailed error logs with tracebacks
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation for API changes
- Ensure all tests pass before submitting
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Check the `/docs` directory for detailed guides
- API Docs: Interactive documentation at the `/docs` endpoint
- Issues: Report bugs and request features via GitHub Issues
- Discussions: Join project discussions on GitHub
- Regex-based document classification system
- KAG (Knowledge Augmented Generation) component integration
- Single-document mode enforcement
- Vertex embedding disabled for prototype
- Complete artifact generation (classification_verdict.json, kag_input.json, feature_vector.json)
- Enhanced pipeline orchestration with 6-stage processing
- Comprehensive MVP test suite
- Modular router architecture implementation
- Google Cloud Vision API integration
- DocAI-compatible output format
- Comprehensive test suite
- Admin tools and monitoring
- FastAPI - Modern web framework
- Google Cloud Vision - OCR capabilities
- PyMuPDF - PDF processing
- Pillow - Image manipulation
Built with ❤️ by the DocuMint-AI Team