charangajjala/DocuMind

📄 DocuMind: Intelligent Document Data Extraction

AI-powered document processing with custom schema support and visual grounding

Built with: React · TypeScript · Tailwind CSS · FastAPI · Python · Google Document AI · OpenAI


▶️ Demo

Document Upload & OCR Processing

  • Drag & Drop Interface: Upload documents (PDF, PNG, JPG, TIFF) through an intuitive drag-and-drop interface
  • Real-time Processing: Documents are processed instantly using Google Document AI with quality assessment
  • Visual Feedback: Progress indicators show OCR processing status and completion

Interactive Document Visualization

  • Bounding Box Overlay: View OCR results with color-coded bounding boxes for blocks, paragraphs, lines, and tokens
  • Hover Interactions: Hover over bounding boxes to preview text content and confidence scores
  • Element Filtering: Toggle visibility of different text elements (blocks, paragraphs, lines, tokens)
  • Click Selection: Click bounding boxes to highlight corresponding text in the results panel

Custom Schema Generation & Builder

  • Auto Schema Generation: AI automatically generates JSON schemas from document samples, optionally guided by user-provided instructions
  • Interactive Schema Builder: Design custom schemas with drag-and-drop field creation
  • Field Types: String, number, boolean, array, and nested object fields with validation
  • Additional Instructions: Add free-form instructions in the prompt box to guide the extraction process
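For illustration, a schema built this way might look like the following (the field names are hypothetical, and the exact schema envelope the backend expects may differ):

```json
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "total_amount": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoice_number", "total_amount"]
}
```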

Results Dashboard

  • Extracted Fields Display: View all extracted fields with confidence scores, values, and source text blocks
  • Bounding Box Highlights: Visual mapping shows exactly where each field was found in the document
  • Reasoning Tooltips: LLM explanations for how each field was identified and extracted
  • Quality Metrics: Processing time, confidence averages, schema validation status, and quality grades

LLM Debug Section

  • Complete Prompts View: Inspect the exact system and user prompts sent to the LLM, including all instructions and context
  • Multi-Stage Prompt Display: For hybrid extraction mode, view Stage 1 and Stage 2 prompts separately in organized tabs
  • Raw LLM Response: See the exact JSON response from the AI model before any backend post-processing or validation
  • Stage-Separated Responses: In hybrid mode, view raw responses from both Stage 1 (initial extraction) and Stage 2 (grounding) in separate tabs

Multi-Mode Extraction

Three extraction modes optimized for different needs:

  • Visual Grounding: LLM receives image + all OCR blocks (with coordinates) → directly identifies source blocks (highest accuracy, highest cost)
  • Text-Only (Manual Grounding): LLM receives full OCR text + image (no OCR blocks) → backend matches extracted values to OCR blocks (60-70% cost savings, lower accuracy)
  • Hybrid RAG: Stage 1: text + image extraction → Stage 2: filtered OCR blocks only (no image) → LLM grounds fields (60-80% token reduction, higher accuracy)
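The three modes map one-to-one onto the backend's extraction endpoints. A small client-side helper can make that routing explicit (a sketch: the endpoint paths come from the FastAPI backend, while the mode keys are illustrative, not part of the project):

```python
# Map each extraction mode to its backend endpoint.
# Paths are from the DocuMind API; the mode keys are illustrative.
MODE_ENDPOINTS = {
    "visual_grounding": "/extract/structured",
    "text_only": "/extract/structured/ocr-only",
    "hybrid_rag": "/extract/structured/hybrid",
}

def endpoint_for(mode: str) -> str:
    """Return the endpoint path for an extraction mode."""
    try:
        return MODE_ENDPOINTS[mode]
    except KeyError:
        raise ValueError(f"unknown extraction mode: {mode}") from None
```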

🖥️ Installation & Setup ⚙️

Prerequisites


πŸ“ Step 1: Clone the Repository

git clone <repo_link>
cd Data_Extraction

☁️ Step 2: Set Up Google Document AI

Follow the official setup guide: Google Document AI Quickstart

What you'll need:

  • Google Cloud Project with Document AI API enabled
  • Document AI Processor ID
  • Service account JSON credentials file

Save the credentials file as config/gcp-credentials.json in your project.
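Before wiring the file into the app, it can help to sanity-check it. The sketch below reads a service-account JSON and reports its project ID (standard library only; the required keys listed are the standard ones found in Google service-account files):

```python
import json
from pathlib import Path

def check_gcp_credentials(path: str) -> str:
    """Load a service-account JSON file and return its project_id."""
    creds = json.loads(Path(path).read_text())
    # Service-account files always carry these keys.
    for key in ("type", "project_id", "private_key", "client_email"):
        if key not in creds:
            raise ValueError(f"credentials file missing {key!r}")
    return creds["project_id"]
```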


🤖 Step 3: Set Up OpenAI

  1. Get your API key from the OpenAI Platform
  2. No endpoint or deployment name is needed for the direct API

🔧 Step 4: Configure Environment Variables

The project includes a template file with all available configuration options. Use it as a starting point:

# Copy the template file to create your .env file
cp config/env.template config/.env

# Or copy to project root (both locations work)
cp config/env.template .env

The template file is located at: config/env.template

Open the .env file you just created and fill in your actual values. The template includes:

Required Configuration

  1. Google Document AI (for OCR):

    GOOGLE_APPLICATION_CREDENTIALS=config/gcp-credentials.json
    GOOGLE_CLOUD_PROJECT=your-project-id
    GOOGLE_DOCUMENT_AI_PROCESSOR_ID=your-processor-id
    GOOGLE_DOCUMENT_AI_LOCATION=us
  2. OpenAI (for structured extraction):

    # API Key (required)
    OPENAI_API_KEY=your-api-key-here
    
    # Base URL and model (optional, defaults to api.openai.com/v1 and gpt-4o)
    # OPENAI_BASE_URL=https://api.openai.com/v1
    # OPENAI_MODEL=gpt-4o

Optional Configuration

The template also includes optional settings for:

  • Server configuration (host, port, log level)
  • Image processing limits
  • Model-specific overrides (for multiple models)

Important Notes:

  • ✅ Use the template: Always start with config/env.template - it has all available options documented
  • ✅ For Direct OpenAI: Set OPENAI_BASE_URL and OPENAI_MODEL (or leave defaults)
  • ⚠️ Never commit: The .env file is gitignored - keep your secrets safe!
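A startup check along these lines catches missing values early (a sketch using the variable names from the template above; the helper itself is not part of the project):

```python
import os

# Variable names taken from config/env.template.
REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_DOCUMENT_AI_PROCESSOR_ID",
    "OPENAI_API_KEY",
)

def missing_env(env: dict) -> list:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example: report anything missing from the current environment.
problems = missing_env(dict(os.environ))
if problems:
    print("Missing required configuration:", ", ".join(problems))
```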

🐍 Step 5: Install Python Dependencies

# Option 1: Using pip
pip install -e .

# Option 2: Using uv (faster)
pip install uv
uv pip install -e .

🚀 Step 6: Start the Backend

# Start the FastAPI server
python -m src.api.main

You should see:

✅ OpenAI structured extraction initialized. Models: ['gpt-4o-mini'], default='gpt-4o-mini'
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Backend is now running at: http://localhost:8000

API Documentation: http://localhost:8000/docs


🎨 Step 7: Install Frontend Dependencies

Open a new terminal window:

cd src/frontend
npm install

🌐 Step 8: Start the Frontend

npm run dev

You should see:

VITE v7.0.4  ready in 500 ms

➜  Local:   http://localhost:3000/
➜  Network: use --host to expose

Frontend is now running at: http://localhost:3000


✅ Step 9: Verify Installation

  1. Open http://localhost:3000 in your browser
  2. You should see the Document OCR interface
  3. Try uploading a test image (invoice, receipt, or any document)
  4. If everything works, you'll see OCR results with bounding boxes!

📋 Key Features

  • Custom Schema Generation: AI automatically generates JSON schemas from document samples - no manual schema design needed
  • Enterprise-Grade OCR: Powered by Google Document AI with 99%+ accuracy on printed documents
  • Visual Grounding: AI-powered field extraction with precise bounding box locations
  • Hybrid Extraction Pipeline: Combines text-only extraction with semantic block filtering for optimal results
  • Image Quality Assessment: Real-time quality scoring with defect detection and recommendations
  • Multi-Model Support: Configure multiple OpenAI models (GPT-4o, GPT-4o-mini, GPT-4.5)
  • Interactive UI: Modern React interface with real-time visualization and schema builder
  • RESTful API: Well-documented FastAPI backend with OpenAPI/Swagger documentation
  • Type Safety: Full TypeScript frontend and Pydantic backend for end-to-end type safety

🚀 Technical Details

💡 High-Level System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                             CLIENT LAYER                             │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  React Frontend (Port 3000)                                    │  │
│  │  • Document Upload UI                                          │  │
│  │  • Interactive Bounding Box Visualization                      │  │
│  │  • Schema Builder                                              │  │
│  │  • Results Dashboard                                           │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
                                  │
                                  │ HTTP/REST API
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           API GATEWAY LAYER                          │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  FastAPI Backend (Port 8000)                                   │  │
│  │  • /ocr/upload - Document processing                           │  │
│  │  • /extract/structured - Visual grounding extraction           │  │
│  │  • /extract/structured/ocr-only - Text-only extraction         │  │
│  │  • /extract/structured/hybrid - Hybrid RAG pipeline            │  │
│  │  • /schema/generate - AI-powered custom schema generation      │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
                    │                             │
        ┌───────────┴──────────┐      ┌──────────┴───────────┐
        ▼                      ▼      ▼                      ▼
┌─────────────────┐  ┌───────────────────────────────────────────┐
│  GOOGLE CLOUD   │  │               OPENAI API                  │
│  DOCUMENT AI    │  │  ┌─────────────────────────────────────┐  │
│                 │  │  │  GPT Models                         │  │
│  • OCR Engine   │  │  │  • Structured Extraction            │  │
│  • Layout Parse │  │  │  • Visual Grounding                 │  │
│  • Quality Score│  │  │  • Schema Generation                │  │
│  • Bounding Box │  │  │  • Reasoning & Confidence           │  │
│                 │  │  └─────────────────────────────────────┘  │
└─────────────────┘  └───────────────────────────────────────────┘
        │                              │
        └──────────────┬───────────────┘
                       ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           PROCESSING LAYER                           │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Core Services                                                 │  │
│  │  • DocumentOCRProcessor - OCR orchestration                    │  │
│  │  • VisuallyGroundedExtractor - AI extraction with grounding    │  │
│  │  • HybridGroundingService - RAG-style 2-stage pipeline         │  │
│  │  • ImageQualityAssessment - Quality scoring & defects          │  │
│  │  • SchemaValidator - JSON schema validation                    │  │
│  │  • ManualVisualGrounder - Backend text-to-block mapping        │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

🔄 Extraction Pipeline Modes

1. Visual Grounding Mode

Document Image → Google OCR → OpenAI (Image + Structured OCR Blocks)
                                    ↓
                        LLM returns field_mappings with block IDs
                                    ↓
                        Extracted Data + Bounding Boxes

LLM receives: Image + structured OCR blocks (block_id, text, confidence, bounding_box coordinates)
LLM returns: extracted_data + field_mappings (field → source_block_id)
Accuracy: Highest - LLM directly identifies source blocks using visual understanding
Cost: Highest - image tokens + full OCR block data
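For illustration, a response in this mode might carry mappings like the following (the field names, block IDs, and exact response envelope are hypothetical):

```json
{
  "extracted_data": { "invoice_number": "INV-1042", "total_amount": 512.00 },
  "field_mappings": { "invoice_number": "block_3", "total_amount": "block_17" },
  "confidence": { "invoice_number": 0.97, "total_amount": 0.92 }
}
```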

2. Text-Only Mode (Manual Grounding)

Document Image → Google OCR → OpenAI (Full Text + Image, no OCR blocks)
                                    ↓
                        LLM returns extracted_data (no field_mappings)
                                    ↓
                    Backend Manual Grounding Service
                    (fuzzy string matching to OCR blocks)
                                    ↓
                        Extracted Data + Bounding Boxes

LLM receives: Full OCR text (plain string) + image (no structured block data)
LLM returns: extracted_data + reasoning_map (no field_mappings)
Backend: Matches extracted values to OCR blocks using lexical/fuzzy matching
Cost savings: 60-70% - no structured OCR blocks in prompt, reduces tokens significantly
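The backend's lexical/fuzzy matching step can be approximated with the standard library. The sketch below is illustrative only and not the project's actual ManualVisualGrounder:

```python
from difflib import SequenceMatcher

def ground_value(value, blocks, threshold=0.6):
    """Map an extracted value to the best-matching OCR block.

    Each block is expected to look like {"block_id": ..., "text": ...}.
    Returns the best block_id, or None if no block clears the threshold.
    """
    best_id, best_score = None, threshold
    for block in blocks:
        score = SequenceMatcher(None, value.lower(), block["text"].lower()).ratio()
        if score > best_score:
            best_id, best_score = block["block_id"], score
    return best_id
```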

3. Hybrid RAG Mode

Document Image β†’ Google OCR
                    ↓
        Stage 1: OpenAI (Full Text + Image) → extracted_data + reasoning_map
                    ↓
        Stage 2: Semantic Block Filtering → top-K relevant blocks
                    ↓
        Stage 2: OpenAI (Full Text + Filtered Blocks, no image) → field_mappings
                    ↓
                        Extracted Data + Bounding Boxes

Stage 1 LLM receives: Full OCR text + image (same as Text-Only mode)
Stage 1 LLM returns: extracted_data + reasoning_map
Block filtering: Semantic + lexical matching identifies relevant OCR blocks
Stage 2 LLM receives: Full text + filtered structured blocks (no image)
Stage 2 LLM returns: field_mappings (field → source_block_id)
Cost savings: 60-80% token reduction vs Visual Grounding (no image in stage 2, filtered blocks only)
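The Stage 2 block filtering can be illustrated with a purely lexical scorer; the real pipeline combines semantic and lexical signals, which this simplification omits:

```python
def top_k_blocks(field_values, blocks, k=5):
    """Rank OCR blocks by word overlap with the Stage-1 extracted values.

    field_values: list of strings extracted in Stage 1.
    blocks: list of dicts like {"block_id": ..., "text": ...}.
    Returns the k highest-scoring blocks.
    """
    query_words = {w.lower() for v in field_values for w in v.split()}

    def score(block):
        return len(query_words & {w.lower() for w in block["text"].split()})

    return sorted(blocks, key=score, reverse=True)[:k]
```

Only the surviving top-K blocks are sent to the Stage 2 prompt, which is what drives the token reduction described above.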


πŸ—οΈ API Endpoints

Endpoint                        Method   Description
/health                         GET      Health check with service status
/ocr/upload                     POST     Upload file for OCR processing
/ocr/base64                     POST     Process base64-encoded image
/extract/structured             POST     Visual grounding extraction (image + OCR)
/extract/structured/ocr-only    POST     Text-only extraction with backend grounding
/extract/structured/hybrid      POST     Hybrid RAG pipeline (2-stage)
/schema/validate                POST     Validate JSON schema
/schema/generate                POST     Auto-generate schema from document
/docs                           GET      Interactive API documentation (Swagger)

🎟️ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


📜 References & Acknowledgements

  1. Google Cloud Document AI
    Enterprise-grade document processing with OCR, layout understanding, and quality assessment.
    Documentation

  2. OpenAI API
    GPT-4o models with structured outputs and visual understanding capabilities.
    Documentation

  3. FastAPI Framework
    Modern, fast web framework for building APIs with Python 3.11+ type hints.
    Documentation

  4. shadcn/ui Component Library
    Beautifully designed accessible components built with Radix UI and Tailwind CSS.
    Documentation

  5. RAG (Retrieval-Augmented Generation)
    Hybrid extraction approach inspired by RAG patterns for optimal accuracy and efficiency.
    Paper

  6. Structured Extraction with LLMs
    Best practices for extracting structured data from documents using vision-language models.
    OpenAI Documentation

  7. Cursor IDE
    AI-powered code editor that accelerates development with intelligent code completion and assistance.
    Documentation | Website
