AI Backend Document Processing API - MVP Prototype

A high-performance FastAPI service for PDF document processing, OCR (Optical Character Recognition), and text parsing using the Google Cloud Vision API. This prototype uses regex-based classification, processes one document at a time (no multi-document handling), runs with Vertex embeddings disabled, and keeps the KAG handoff active.

🎯 MVP Prototype Features

This is a prototype version with the following characteristics:

  • ✅ Single-document mode only - No multi-document handling
  • ✅ Regex-based classification - Template matching using legal keywords (no Vertex Matching Engine)
  • ✅ Vertex embedding disabled - Embeddings set to null/placeholder values
  • ✅ KAG handoff active - Unified KAG Writer component with automatic schema-compliant generation
  • ✅ Deterministic results - Consistent outputs for the same test document
  • ✅ Complete artifact generation - parsed_output.json, classification_verdict.json, kag_input.json

🚀 Features

  • 📄 PDF Processing: Convert PDF documents to high-quality images
  • 🔍 OCR Integration: Google Cloud Vision API for accurate text extraction
  • 🤖 Document AI: Integration with Google Document AI for structured parsing
  • 🔄 Pipeline Orchestration: Unified workflow combining PDF → Images → OCR → DocAI → Classification → KAG
  • 🏷️ Regex Classification: Pattern-based document classification using legal keywords
  • 🧠 KAG Integration: Automatic kag_input.json generation with unified schema
  • 📋 Schema Compliance: Structured output pairing DocAI results with classifier verdicts
  • 🌐 Multi-language Support: Configurable language hints for better OCR accuracy
  • 📁 File Management: Upload, process, and manage document processing workflows
  • ⚙️ Admin Tools: Data purge operations and usage analytics
  • 🏗️ Modular Architecture: Router-based design for easy feature expansion
  • ⚡ Background Processing: Async processing for large documents
  • 📊 Health Monitoring: Comprehensive health checks and status endpoints
  • 💾 DocAI Compatible: Output format compatible with Google Document AI
  • 🔧 KAG Writer: Unified component for automatic knowledge input generation

📋 Prerequisites

  • Python 3.8 or higher
  • Google Cloud Project with Vision API enabled
  • Google Cloud Service Account with Vision API permissions

🛠 Installation

1. Clone the Repository

git clone https://github.com/DocuMint-AI/ai-backend.git
cd ai-backend

2. Set Up Environment (Using uv - Recommended)

Option A: Using uv (Recommended)

# Run setup script (installs uv if needed and sets up environment)
./setup.sh          # Linux/Mac
# or
setup.bat           # Windows

# Or manually:
uv venv              # Create virtual environment  
uv pip install -r requirements.txt  # Install dependencies

Option B: Using traditional Python venv

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Configure Environment Variables

Create a .env file in the project root:

# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT_ID=your-project-id
GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json

# Application Configuration
DATA_ROOT=./data
IMAGE_FORMAT=PNG
IMAGE_DPI=300
LANGUAGE_HINTS=en,es,fr
MAX_FILE_SIZE_MB=50

4. Set Up Google Cloud Credentials

# Download your service account key file from Google Cloud Console
# Place it in a secure location and update GOOGLE_APPLICATION_CREDENTIALS

🚀 Quick Start

Development Mode

# Start the development server
uv run main.py

# Or using uvicorn with auto-reload
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Production Mode

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

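Access the API

With the server running on port 8000 as shown above, the API is reachable at http://localhost:8000, and interactive documentation is served at the /docs endpoint.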

📚 API Endpoints

Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | / | API information and available endpoints |
| GET | /health | Health check and service status |
| POST | /upload | Upload PDF files for processing |
| POST | /ocr-process | Process uploaded PDFs with OCR |
| GET | /results/{uid} | Retrieve processing results |
| GET | /folders | List all processing folders |
| DELETE | /cleanup/{uid} | Clean up processing folders |

Admin Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /admin/purge | Execute data cleanup operations |
| GET | /admin/data-usage | Get storage usage statistics |

Document AI Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/docai/parse | Parse document with Google Document AI |
| POST | /api/docai/parse/batch | Batch process multiple documents |
| GET | /api/docai/processors | List available DocAI processors |
| GET | /api/docai/config | Get DocAI configuration |

🔄 Pipeline Orchestration (NEW)

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/v1/process-document | Complete pipeline: PDF → Images → OCR → DocAI |
| GET | /api/v1/pipeline-status/{pipeline_id} | Get real-time processing status |
| GET | /api/v1/pipeline-results/{pipeline_id} | Retrieve complete pipeline results |
| GET | /api/v1/health | Orchestration service health check |
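
A client will usually poll the status endpoint until the pipeline finishes. The sketch below shows one way to do that. It assumes the status payload is a JSON object with a "status" field whose terminal values are "completed" and "failed"; those names are illustrative assumptions, and the actual schema is described in ORCHESTRATION_API.md. The helper names wait_for_pipeline and http_fetch_status are hypothetical and not part of this project.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

# Assumption: terminal values of the "status" field. Check ORCHESTRATION_API.md
# for the real schema before relying on these names.
TERMINAL_STATES = {"completed", "failed"}


def http_fetch_status(pipeline_id: str, base_url: str = "http://localhost:8000") -> Dict:
    """GET the pipeline status endpoint and decode the JSON body."""
    url = f"{base_url}/api/v1/pipeline-status/{pipeline_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_pipeline(
    fetch_status: Callable[[str], Dict],
    pipeline_id: str,
    interval_s: float = 2.0,
    max_polls: int = 30,
) -> Dict:
    """Call fetch_status until a terminal state is seen or max_polls is reached."""
    for attempt in range(max_polls):
        payload = fetch_status(pipeline_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if attempt < max_polls - 1:
            time.sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"pipeline {pipeline_id} still running after {max_polls} polls")
```

Against a running server this would be called as wait_for_pipeline(http_fetch_status, "abc123-def456"). Passing the fetch function as a parameter keeps the polling loop testable without a network connection.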

🔧 Usage Examples

🔄 Complete Document Pipeline (MVP Prototype)

The orchestration API provides a single endpoint for the complete workflow with integrated classification and KAG handoff:

# Process a document through the complete MVP pipeline
curl -X POST "http://localhost:8000/api/v1/process-document" \
  -F "file=@contract.pdf" \
  -F "language_hints=en,hi" \
  -F "confidence_threshold=0.8"

# Response includes pipeline_id and processing artifacts
{
  "success": true,
  "pipeline_id": "abc123-def456",
  "message": "Document processing completed successfully in 45.2s",
  "total_processing_time": 45.2,
  "final_results_path": "data/processed/pipeline_result_abc123-def456.json",
  "stage_timings": {
    "upload": 2.1,
    "ocr": 15.4,
    "docai": 20.3,
    "classification": 1.2,
    "kag": 3.8,
    "saving": 2.4
  }
}
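
The stage_timings object in the response can be used to see where the time goes. The short sketch below reads the sample response shown above, finds the slowest stage, and checks that the stage timings add up to total_processing_time. The function name summarize_timings is a hypothetical helper, and the sketch assumes real responses have the same shape as this sample.

```python
import json
from typing import Dict, Tuple

# The sample response from above, trimmed to the fields used here.
SAMPLE_RESPONSE = """
{
  "success": true,
  "pipeline_id": "abc123-def456",
  "total_processing_time": 45.2,
  "stage_timings": {
    "upload": 2.1,
    "ocr": 15.4,
    "docai": 20.3,
    "classification": 1.2,
    "kag": 3.8,
    "saving": 2.4
  }
}
"""


def summarize_timings(response: Dict) -> Tuple[str, float]:
    """Return the name of the slowest stage and the sum of all stage timings."""
    timings = response["stage_timings"]
    slowest = max(timings, key=timings.get)
    return slowest, sum(timings.values())


response = json.loads(SAMPLE_RESPONSE)
slowest, total = summarize_timings(response)
print(f"slowest stage: {slowest} ({response['stage_timings'][slowest]}s)")
print(f"stages sum to {total:.1f}s, reported total is {response['total_processing_time']}s")
```

For the sample data, the slowest stage is docai at 20.3s, and the six stage timings sum to the reported 45.2s.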

MVP Pipeline Flow:

  1. 📄 Upload PDF - Secure file upload and validation
  2. 🖼️ PDF → Images - Multi-library fallback conversion
  3. 👁️ Vision OCR - Google Cloud Vision text extraction
  4. 🧠 Document AI - Structured document parsing
  5. 🏷️ Regex Classification - Pattern-based document categorization
  6. 🤖 KAG Processing - Knowledge Augmented Generation preparation
  7. 💾 Artifact Generation - Save classification_verdict.json, kag_input.json, feature_vector.json
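
The flow above is a linear chain in which each stage consumes the previous stage's output. A minimal sketch of that pattern, with per-stage timing similar to the stage_timings field in the response, might look like the following. The stage names and the run_pipeline and make_stage helpers are hypothetical placeholders, not the project's actual orchestration code.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], payload: Any) -> Tuple[Any, Dict[str, float]]:
    """Run stages in order, feeding each output into the next, and time each one."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        started = time.perf_counter()
        payload = stage(payload)
        timings[name] = round(time.perf_counter() - started, 3)
    return payload, timings


def make_stage(label: str) -> Callable[[list], list]:
    """Placeholder stage: appends its label so the execution order is visible."""
    return lambda trail: trail + [label]


# Hypothetical stage names mirroring the seven steps listed above.
STAGE_NAMES = ["upload", "pdf_to_images", "ocr", "docai", "classification", "kag", "saving"]
stages = [(name, make_stage(name)) for name in STAGE_NAMES]

trail, timings = run_pipeline(stages, [])
print(trail)          # stages in the order they ran
print(list(timings))  # one timing entry per stage
```

A real implementation would replace the placeholder stages with calls into the OCR, Document AI, classification, and KAG services, and would need error handling for a stage that fails partway through.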

Generated Artifacts:

  • classification_verdict.json - Document classification results with matched patterns
  • kag_input.json - Structured handoff payload for downstream processing
  • feature_vector.json - ML-ready features with classifier verdict (embeddings disabled)
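
To illustrate what pattern-based classification with matched patterns can look like, here is a minimal sketch. The categories, keyword patterns, scoring rule, and verdict fields are all assumptions made for this example; the project's real keyword list lives in services/template_matching/legal_keywords.py and its verdict format may differ.

```python
import re
from typing import Dict, List

# Hypothetical categories and patterns, for illustration only.
PATTERNS: Dict[str, List[str]] = {
    "contract": [r"\bagreement\b", r"\bparties\b", r"\bhereinafter\b"],
    "court_order": [r"\bpetitioner\b", r"\brespondent\b", r"\bordered\b"],
    "lease": [r"\blessor\b", r"\blessee\b", r"\brent\b"],
}


def classify(text: str) -> Dict:
    """Return the best-matching label, a simple score, and the matched patterns."""
    matches = {
        label: [p for p in patterns if re.search(p, text, re.IGNORECASE)]
        for label, patterns in PATTERNS.items()
    }
    # Most hits wins; ties resolve alphabetically so the result is deterministic.
    label = min(matches, key=lambda k: (-len(matches[k]), k))
    hits = matches[label]
    if not hits:
        return {"label": "unknown", "score": 0.0, "matched_patterns": []}
    return {
        "label": label,
        "score": round(len(hits) / len(PATTERNS[label]), 2),
        "matched_patterns": hits,
    }


verdict = classify("This Agreement is made between the parties named below.")
print(verdict["label"], verdict["score"])
```

Because the classifier depends only on the input text and a fixed pattern table, the same document always produces the same verdict, which matches the deterministic-results goal of the prototype.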

Individual Step Processing

1. Upload a PDF File

curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf"

2. Process with OCR

curl -X POST "http://localhost:8000/ocr-process" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_path": "/data/uploads/document.pdf",
    "language_hints": ["en", "es"],
    "force_reprocess": false
  }'
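
The same request can be built from Python using only the standard library. This is a sketch that mirrors the curl call above; build_ocr_request is a hypothetical helper name, and the base URL assumes the development server from the Quick Start section.

```python
import json
import urllib.request
from typing import List


def build_ocr_request(
    pdf_path: str,
    language_hints: List[str],
    force_reprocess: bool = False,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) the POST /ocr-process request."""
    body = {
        "pdf_path": pdf_path,
        "language_hints": language_hints,
        "force_reprocess": force_reprocess,
    }
    return urllib.request.Request(
        f"{base_url}/ocr-process",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ocr_request("/data/uploads/document.pdf", ["en", "es"])

# Sending the request needs a running server:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```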

3. Get Processing Results

curl -X GET "http://localhost:8000/results/{uid}"

4. Health Check

curl -X GET "http://localhost:8000/health"

📖 For complete orchestration API documentation with examples, status monitoring, and result formats, see ORCHESTRATION_API.md

🏗 Project Structure

ai-backend/
├── main.py                      # FastAPI application entry point
├── routers/                     # Modular router architecture
│   ├── __init__.py              # Router package initialization
│   ├── processing_handler.py    # Document processing endpoints
│   ├── doc_ai_router.py         # Document AI integration
│   └── orchestration_router.py  # MVP Pipeline orchestration
├── services/                    # Business logic and utilities
│   ├── doc_ai/                  # Document AI services
│   ├── preprocessing/           # Document preprocessing
│   │   ├── OCR-processing.py    # OCR processing logic
│   │   └── parsing.py           # Text parsing utilities
│   ├── template_matching/       # MVP Classification (NEW)
│   │   ├── legal_keywords.py    # Legal keyword database
│   │   └── regex_classifier.py  # Regex-based classifier
│   ├── kag_component.py         # KAG handoff component (NEW)
│   ├── feature_emitter.py       # Enhanced with classifier verdict
│   ├── util-services.py         # Utility functions
│   └── project_utils.py         # Project utilities
├── data/                        # Data storage directory
│   ├── uploads/                 # Uploaded files
│   ├── processed/               # Pipeline results (NEW)
│   └── test-files/              # Test documents
├── docs/                        # Documentation
│   ├── ORCHESTRATION_API.md     # Pipeline API docs (NEW)
│   └── ...                      # Other documentation
├── tests/                       # Test suite
├── requirements.txt             # Python dependencies
├── test_orchestration.py        # Orchestration tests (NEW)
└── README.md                    # Project documentation

🧪 Testing

Run MVP Test Suite

# Run the new MVP regex classification tests
python -m pytest tests/test_single_doc_regex.py -v

# Run quick validation
python tests/test_single_doc_regex.py

# Run orchestration tests
python test_orchestration.py

Run Full Test Suite

# Install test dependencies
pip install -r tests/test_requirements.txt

# Run all tests
pytest tests/

# Run specific test categories
pytest tests/test_ocr_processing.py
pytest tests/test_api_endpoints.py
pytest tests/test_single_doc_regex.py  # MVP tests

🔧 Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| GOOGLE_CLOUD_PROJECT_ID | Google Cloud Project ID | Required |
| GOOGLE_APPLICATION_CREDENTIALS | Path to service account key | Required |
| DATA_ROOT | Data storage directory | ./data |
| IMAGE_FORMAT | Output image format | PNG |
| IMAGE_DPI | Image resolution | 300 |
| LANGUAGE_HINTS | OCR language hints | en |
| MAX_FILE_SIZE_MB | Maximum upload size | 50 |
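
As an illustration of how these variables and defaults fit together, the sketch below loads them into a typed settings object and fails early when a required variable is missing. The Settings class and load_settings function are hypothetical and are not taken from the project's code; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping

REQUIRED = ("GOOGLE_CLOUD_PROJECT_ID", "GOOGLE_APPLICATION_CREDENTIALS")


@dataclass(frozen=True)
class Settings:
    project_id: str
    credentials_path: str
    data_root: str
    image_format: str
    image_dpi: int
    language_hints: List[str]
    max_file_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from env, applying the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    # LANGUAGE_HINTS is a comma-separated list such as "en,es,fr".
    hints = [h.strip() for h in env.get("LANGUAGE_HINTS", "en").split(",") if h.strip()]
    return Settings(
        project_id=env["GOOGLE_CLOUD_PROJECT_ID"],
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        data_root=env.get("DATA_ROOT", "./data"),
        image_format=env.get("IMAGE_FORMAT", "PNG"),
        image_dpi=int(env.get("IMAGE_DPI", "300")),
        language_hints=hints,
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "50")),
    )


example = load_settings({
    "GOOGLE_CLOUD_PROJECT_ID": "your-project-id",
    "GOOGLE_APPLICATION_CREDENTIALS": "path/to/your/credentials.json",
    "LANGUAGE_HINTS": "en,es,fr",
})
print(example.language_hints, example.image_dpi)
```

Accepting the environment as a parameter makes the loader easy to exercise with a plain dictionary, as in the example call.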

Google Cloud Setup

  1. Create a Google Cloud Project
  2. Enable the Cloud Vision API
  3. Create a service account with Vision API permissions
  4. Download the service account key file
  5. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable

🚢 Deployment

Docker Deployment

# Build the container
docker build -t ai-backend .

# Run the container
docker run -p 8000:8000 \
  -e GOOGLE_CLOUD_PROJECT_ID=your-project \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
  -v /path/to/credentials.json:/app/credentials.json:ro \
  ai-backend

Docker Compose

docker-compose up -d

📊 Monitoring and Logging

The application provides comprehensive logging and monitoring:

  • Health Checks: Service status and dependency checks
  • Processing Metrics: OCR success rates and processing times
  • Storage Analytics: Data usage and cleanup statistics
  • Error Tracking: Detailed error logs with tracebacks

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add tests for new features
  • Update documentation for API changes
  • Ensure all tests pass before submitting

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Documentation: Check the /docs directory for detailed guides
  • API Docs: Interactive documentation at /docs endpoint
  • Issues: Report bugs and request features via GitHub Issues
  • Discussions: Join project discussions on GitHub

🔄 Changelog

Version 1.1.0 (MVP Prototype)

  • ✅ Regex-based document classification system
  • ✅ KAG (Knowledge Augmented Generation) component integration
  • ✅ Single-document mode enforcement
  • ✅ Vertex embedding disabled for prototype
  • ✅ Complete artifact generation (classification_verdict.json, kag_input.json, feature_vector.json)
  • ✅ Enhanced pipeline orchestration with 6-stage processing
  • ✅ Comprehensive MVP test suite

Version 1.0.0

  • ✅ Modular router architecture implementation
  • ✅ Google Cloud Vision API integration
  • ✅ DocAI-compatible output format
  • ✅ Comprehensive test suite
  • ✅ Admin tools and monitoring

🙏 Acknowledgments


Built with ❤️ by the DocuMint-AI Team
