- Drag & Drop Interface: Upload documents (PDF, PNG, JPG, TIFF) through an intuitive drag-and-drop interface
- Real-time Processing: Documents are processed instantly using Google Document AI with quality assessment
- Visual Feedback: Progress indicators show OCR processing status and completion
- Bounding Box Overlay: View OCR results with color-coded bounding boxes for blocks, paragraphs, lines, and tokens
- Hover Interactions: Hover over bounding boxes to preview text content and confidence scores
- Element Filtering: Toggle visibility of different text elements (blocks, paragraphs, lines, tokens)
- Click Selection: Click bounding boxes to highlight corresponding text in the results panel
- Auto Schema Generation: AI automatically generates JSON schemas from document samples, optionally guided by user-provided instructions
- Interactive Schema Builder: Design custom schemas with drag-and-drop field creation
- Field Types: String, number, boolean, array, and nested object fields with validation
- Additional Instructions: Add instructions in the prompt box to guide the extraction process
- Extracted Fields Display: View all extracted fields with confidence scores, values, and source text blocks
- Bounding Box Highlights: Visual mapping shows exactly where each field was found in the document
- Reasoning Tooltips: LLM explanations for how each field was identified and extracted
- Quality Metrics: Processing time, confidence averages, schema validation status, and quality grades
- Complete Prompts View: Inspect the exact system and user prompts sent to the LLM, including all instructions and context
- Multi-Stage Prompt Display: For hybrid extraction mode, view Stage 1 and Stage 2 prompts separately in organized tabs
- Raw LLM Response: See the exact JSON response from the AI model before any backend post-processing or validation
- Stage-Separated Responses: In hybrid mode, view raw responses from both Stage 1 (initial extraction) and Stage 2 (grounding) in separate tabs
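For reference, a schema of the kind the builder produces might look like the sketch below. The field names are illustrative (an invoice example), and the exact schema dialect depends on your configuration:

```python
# Illustrative invoice schema exercising the field types listed above:
# string, number, boolean, array, and nested object. These field names
# are examples, not the exact output of the schema generator.
import json

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "is_paid": {"type": "boolean"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

print(json.dumps(invoice_schema, indent=2))
```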
Three extraction modes optimized for different needs:
- Visual Grounding: LLM receives image + all OCR blocks (with coordinates) → directly identifies source blocks (highest accuracy, highest cost)
- Text-Only (Manual Grounding): LLM receives full OCR text + image (no OCR blocks) → backend matches extracted values to OCR blocks (60-70% cost savings, lower accuracy)
- Hybrid RAG: Stage 1: text + image extraction → Stage 2: filtered OCR blocks only (no image) → LLM grounds fields (60-80% token reduction, higher accuracy)
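Each mode maps to its own backend endpoint. A minimal client-side sketch (the endpoint paths come from this project's API; the helper name and base URL default are illustrative):

```python
# Hypothetical helper: pick the extraction endpoint for a given mode.
# Endpoint paths match this project's API; everything else is illustrative.
MODE_ENDPOINTS = {
    "visual": "/extract/structured",              # image + all OCR blocks
    "text_only": "/extract/structured/ocr-only",  # full text, backend grounding
    "hybrid": "/extract/structured/hybrid",       # 2-stage RAG pipeline
}

def endpoint_for(mode: str, base_url: str = "http://localhost:8000") -> str:
    """Return the full URL for the chosen extraction mode."""
    return base_url + MODE_ENDPOINTS[mode]

print(endpoint_for("hybrid"))  # → http://localhost:8000/extract/structured/hybrid
```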
- Python 3.11+ (Download)
- Node.js 18+ and npm (Download)
- Google Cloud Platform account (Sign up)
- OpenAI API access: OpenAI Platform
git clone <repo_link>
cd Data_Extraction
Follow the official setup guide: Google Document AI Quickstart
What you'll need:
- Google Cloud Project with Document AI API enabled
- Document AI Processor ID
- Service account JSON credentials file
Save the credentials file as config/gcp-credentials.json in your project.
- Get your API key from OpenAI Platform
- No endpoint or deployment name needed for direct API
The project includes a template file with all available configuration options. Use it as a starting point:
# Copy the template file to create your .env file
cp config/env.template config/.env
# Or copy to project root (both locations work)
cp config/env.template .env
The template file is located at: config/env.template
Open the .env file you just created and fill in your actual values. The template includes:
- Google Document AI (for OCR):
GOOGLE_APPLICATION_CREDENTIALS=config/gcp-credentials.json
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_DOCUMENT_AI_PROCESSOR_ID=your-processor-id
GOOGLE_DOCUMENT_AI_LOCATION=us
- OpenAI (for structured extraction):
# API Key (required)
OPENAI_API_KEY=your-api-key-here
# Base URL and model (optional, defaults to api.openai.com/v1 and gpt-4o)
# OPENAI_BASE_URL=https://api.openai.com/v1
# OPENAI_MODEL=gpt-4o
The template also includes optional settings for:
- Server configuration (host, port, log level)
- Image processing limits
- Model-specific overrides (for multiple models)
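Before starting the server, it can help to confirm the required variables are actually set. A minimal sanity check (a sketch, not part of the project; the variable names match the template above):

```python
# Sketch of a startup check: verify that the required variables from
# config/.env are present before launching the server. Not part of the
# project itself -- just an illustration.
import os

REQUIRED_VARS = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_DOCUMENT_AI_PROCESSOR_ID",
    "OPENAI_API_KEY",
]

def missing_vars(env: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_VARS if not env.get(k)]

# Example with only the credentials path set:
print(missing_vars({"GOOGLE_APPLICATION_CREDENTIALS": "config/gcp-credentials.json"}))
# → ['GOOGLE_CLOUD_PROJECT', 'GOOGLE_DOCUMENT_AI_PROCESSOR_ID', 'OPENAI_API_KEY']

# Against the real environment you would call: missing_vars(dict(os.environ))
```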
Important Notes:
- ✅ Use the template: Always start with config/env.template - it has all available options documented
- ✅ For Direct OpenAI: Set OPENAI_BASE_URL and OPENAI_MODEL (or leave defaults)
- ⚠️ Never commit: The .env file is gitignored - keep your secrets safe!
# Option 1: Using pip
pip install -e .
# Option 2: Using uv (faster)
pip install uv
uv pip install -e .
# Start the FastAPI server
python -m src.api.main
You should see:
✅ OpenAI structured extraction initialized. Models: ['gpt-4o-mini'], default='gpt-4o-mini'
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Backend is now running at: http://localhost:8000
API Documentation: http://localhost:8000/docs
Open a new terminal window:
cd src/frontend
npm install
npm run dev
You should see:
VITE v7.0.4 ready in 500 ms
➜  Local:   http://localhost:3000/
➜  Network: use --host to expose
Frontend is now running at: http://localhost:3000
- Open http://localhost:3000 in your browser
- You should see the Document OCR interface
- Try uploading a test image (invoice, receipt, or any document)
- If everything works, you'll see OCR results with bounding boxes!
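If you prefer a scripted check, here is a small health-probe sketch using only the standard library (assumes the default port; the /health endpoint is listed in the API reference):

```python
# Optional smoke test (sketch): ping the backend health endpoint before
# uploading documents. Requires the server from the previous step running.
from urllib.request import urlopen

def backend_is_up(base_url: str = "http://localhost:8000") -> bool:
    """Return True if GET /health responds with HTTP 200."""
    try:
        with urlopen(base_url + "/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or other network failure
        return False

if __name__ == "__main__":
    print("backend up:", backend_is_up())
```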
- Custom Schema Generation: AI automatically generates JSON schemas from document samples - no manual schema design needed
- Enterprise-Grade OCR: Powered by Google Document AI with 99%+ accuracy on printed documents
- Visual Grounding: AI-powered field extraction with precise bounding box locations
- Hybrid Extraction Pipeline: Combines text-only extraction with semantic block filtering for optimal results
- Image Quality Assessment: Real-time quality scoring with defect detection and recommendations
- Multi-Model Support: Configure multiple OpenAI models (GPT-4o, GPT-4o-mini, GPT-4.5)
- Interactive UI: Modern React interface with real-time visualization and schema builder
- RESTful API: Well-documented FastAPI backend with OpenAPI/Swagger documentation
- Type Safety: Full TypeScript frontend and Pydantic backend for end-to-end type safety
┌───────────────────────────────────────────────────────────────────────┐
│                             CLIENT LAYER                              │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                   React Frontend (Port 3000)                    │  │
│  │  • Document Upload UI                                           │  │
│  │  • Interactive Bounding Box Visualization                       │  │
│  │  • Schema Builder                                               │  │
│  │  • Results Dashboard                                            │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ HTTP/REST API
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│                           API GATEWAY LAYER                           │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                   FastAPI Backend (Port 8000)                   │  │
│  │  • /ocr/upload - Document processing                            │  │
│  │  • /extract/structured - Visual grounding extraction            │  │
│  │  • /extract/structured/ocr-only - Text-only extraction          │  │
│  │  • /extract/structured/hybrid - Hybrid RAG pipeline             │  │
│  │  • /schema/generate - AI-powered custom schema generation       │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
             │                                        │
             ▼                                        ▼
┌───────────────────┐      ┌─────────────────────────────────────────┐
│   GOOGLE CLOUD    │      │               OPENAI API                │
│    DOCUMENT AI    │      │  ┌───────────────────────────────────┐  │
│                   │      │  │            GPT Models             │  │
│  • OCR Engine     │      │  │  • Structured Extraction          │  │
│  • Layout Parse   │      │  │  • Visual Grounding               │  │
│  • Quality Score  │      │  │  • Schema Generation              │  │
│  • Bounding Box   │      │  │  • Reasoning & Confidence         │  │
│                   │      │  └───────────────────────────────────┘  │
└───────────────────┘      └─────────────────────────────────────────┘
             │                                        │
             └───────────────────┬────────────────────┘
                                 │
                                 ▼
┌───────────────────────────────────────────────────────────────────────┐
│                           PROCESSING LAYER                            │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                          Core Services                          │  │
│  │  • DocumentOCRProcessor - OCR orchestration                     │  │
│  │  • VisuallyGroundedExtractor - AI extraction with grounding     │  │
│  │  • HybridGroundingService - RAG-style 2-stage pipeline          │  │
│  │  • ImageQualityAssessment - Quality scoring & defects           │  │
│  │  • SchemaValidator - JSON schema validation                     │  │
│  │  • ManualVisualGrounder - Backend text-to-block mapping         │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
Document Image → Google OCR → OpenAI (Image + Structured OCR Blocks)
        ↓
LLM returns field_mappings with block IDs
        ↓
Extracted Data + Bounding Boxes
LLM receives: Image + structured OCR blocks (block_id, text, confidence, bounding_box coordinates)
LLM returns: extracted_data + field_mappings (field → source_block_id)
Accuracy: Highest - LLM directly identifies source blocks using visual understanding
Cost: Highest - image tokens + full OCR block data
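To illustrate the response shape, here is a sketch of joining field_mappings back to the OCR blocks to recover bounding boxes. The key names are illustrative, not the exact API contract (check /docs for the real response model):

```python
# Sketch: join a visual-grounding response back to OCR blocks.
# Key names ("bounding_box", "extracted_data", "field_mappings") are
# illustrative; the real response schema is documented at /docs.
ocr_blocks = {
    "b1": {"text": "INV-2024-001", "bounding_box": [10, 10, 120, 30]},
    "b2": {"text": "$1,250.00", "bounding_box": [400, 300, 480, 320]},
}
response = {
    "extracted_data": {"invoice_number": "INV-2024-001", "total": "$1,250.00"},
    "field_mappings": {"invoice_number": "b1", "total": "b2"},
}

def field_boxes(response: dict, blocks: dict) -> dict:
    """Map each extracted field to the bounding box of its source block."""
    return {
        field: blocks[block_id]["bounding_box"]
        for field, block_id in response["field_mappings"].items()
        if block_id in blocks
    }

print(field_boxes(response, ocr_blocks)["invoice_number"])  # → [10, 10, 120, 30]
```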
Document Image → Google OCR → OpenAI (Full Text + Image, no OCR blocks)
        ↓
LLM returns extracted_data (no field_mappings)
        ↓
Backend Manual Grounding Service
(fuzzy string matching to OCR blocks)
        ↓
Extracted Data + Bounding Boxes
LLM receives: Full OCR text (plain string) + image (no structured block data)
LLM returns: extracted_data + reasoning_map (no field_mappings)
Backend: Matches extracted values to OCR blocks using lexical/fuzzy matching
Cost savings: 60-70% - omitting structured OCR blocks from the prompt significantly reduces token usage
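The backend matching step can be illustrated with a small difflib sketch. This is an illustration of fuzzy string matching, not the project's actual ManualVisualGrounder logic:

```python
# Illustration of fuzzy grounding: pick the OCR block whose text is most
# similar to an extracted value. Uses stdlib difflib; the real matcher
# may score differently.
from difflib import SequenceMatcher

def best_block(value: str, blocks: dict) -> str:
    """Return the id of the OCR block whose text best matches `value`."""
    def score(block_id: str) -> float:
        return SequenceMatcher(
            None, value.lower(), blocks[block_id]["text"].lower()
        ).ratio()
    return max(blocks, key=score)

blocks = {
    "b1": {"text": "Invoice No: INV-001"},
    "b2": {"text": "Total Due: $99.50"},
}
print(best_block("$99.50", blocks))  # → b2 (best lexical match)
```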
Document Image → Google OCR
        ↓
Stage 1: OpenAI (Full Text + Image) → extracted_data + reasoning_map
        ↓
Stage 2: Semantic Block Filtering → top-K relevant blocks
        ↓
Stage 2: OpenAI (Full Text + Filtered Blocks, no image) → field_mappings
        ↓
Extracted Data + Bounding Boxes
Stage 1 LLM receives: Full OCR text + image (same as Text-Only mode)
Stage 1 LLM returns: extracted_data + reasoning_map
Block filtering: Semantic + lexical matching identifies relevant OCR blocks
Stage 2 LLM receives: Full text + filtered structured blocks (no image)
Stage 2 LLM returns: field_mappings (field → source_block_id)
Cost savings: 60-80% token reduction vs Visual Grounding (no image in Stage 2, filtered blocks only)
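The Stage 2 filtering step can be sketched as a purely lexical top-K filter. The real HybridGroundingService may also use semantic (embedding) similarity; the function and field names here are illustrative:

```python
# Sketch of Stage-2 block filtering: keep only the top-K OCR blocks that
# share tokens with the Stage-1 extracted values. Purely lexical; the
# actual service may combine this with semantic similarity.
def filter_blocks(extracted: dict, blocks: dict, k: int = 2) -> list:
    """Rank blocks by token overlap with extracted values; return top-k ids."""
    value_tokens = {t for v in extracted.values() for t in str(v).lower().split()}
    def overlap(block_id: str) -> int:
        return len(value_tokens & set(blocks[block_id]["text"].lower().split()))
    # sorted() is stable, so ties keep their original block order
    return sorted(blocks, key=overlap, reverse=True)[:k]

blocks = {
    "b1": {"text": "Invoice INV-001"},
    "b2": {"text": "Thank you for your business"},
    "b3": {"text": "Total: $99.50"},
}
print(filter_blocks({"invoice_number": "INV-001", "total": "$99.50"}, blocks))
# → ['b1', 'b3']
```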
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with service status |
| `/ocr/upload` | POST | Upload file for OCR processing |
| `/ocr/base64` | POST | Process base64-encoded image |
| `/extract/structured` | POST | Visual grounding extraction (image + OCR) |
| `/extract/structured/ocr-only` | POST | Text-only extraction with backend grounding |
| `/extract/structured/hybrid` | POST | Hybrid RAG pipeline (2-stage) |
| `/schema/validate` | POST | Validate JSON schema |
| `/schema/generate` | POST | Auto-generate schema from document |
| `/docs` | GET | Interactive API documentation (Swagger) |
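As an example, here is a hedged sketch of calling /ocr/base64 from Python using only the standard library. The JSON field name "image_base64" is an assumption; confirm the actual request model at /docs:

```python
# Sketch: build a POST request for /ocr/base64 with the standard library.
# The payload key "image_base64" is assumed -- check /docs for the real
# request schema before relying on it.
import base64
import json
from urllib.request import Request

def build_ocr_request(image_bytes: bytes,
                      base_url: str = "http://localhost:8000") -> Request:
    """Build a POST request carrying the image as base64-encoded JSON."""
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return Request(
        base_url + "/ocr/base64",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Live call (requires the backend running and a local image file):
# with open("test_invoice.png", "rb") as f:
#     resp = urllib.request.urlopen(build_ocr_request(f.read()))
#     print(json.load(resp))
```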
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Google Cloud Document AI
  Enterprise-grade document processing with OCR, layout understanding, and quality assessment.
  Documentation
- OpenAI API
  GPT-4o models with structured outputs and visual understanding capabilities.
  Documentation
- FastAPI Framework
  Modern, fast web framework for building APIs with Python 3.11+ type hints.
  Documentation
- shadcn/ui Component Library
  Beautifully designed accessible components built with Radix UI and Tailwind CSS.
  Documentation
- RAG (Retrieval-Augmented Generation)
  Hybrid extraction approach inspired by RAG patterns for optimal accuracy and efficiency.
  Paper
- Structured Extraction with LLMs
  Best practices for extracting structured data from documents using vision-language models.
  OpenAI Documentation
- Cursor IDE
  AI-powered code editor that accelerates development with intelligent code completion and assistance.
  Documentation | Website






