Skip to content

SamSpectre/multimodal-vision-api

Repository files navigation

Vision Platform API

Enterprise-grade Document Intelligence & Real-time Vision API

A production-ready REST API combining best-in-class document processing with low-latency robotics vision capabilities.


Key Features

Document Intelligence

  • Mistral OCR 3 - Industry-leading accuracy, outperforms GPT-4o, Google, and Azure
  • Table Extraction - Preserves structure in HTML format
  • Document Classification - Auto-detect invoices, contracts, forms, receipts
  • Multi-page PDF - Process entire documents with page-by-page results

Real-time Vision (Coming Soon)

  • ~50ms Latency - Powered by Groq's Llama 3.2 Vision
  • Face Detection - MediaPipe integration for real-time detection
  • Object Detection - Scene understanding for robotics
  • Video Processing - Frame extraction and analysis

Enterprise Ready

  • JWT + API Key Authentication - Secure access control
  • Health Checks - Kubernetes-ready liveness/readiness probes
  • Swagger Documentation - Auto-generated interactive API docs
  • Multi-Agent Architecture - Intelligent routing to specialist agents

Quick Start

1. Get API Access

# Clone the repository
git clone <repository-url>
cd Project-4_Vision&Langchain

# Create virtual environment
python -m venv .venv
.\.venv\Scripts\activate  # Windows
source .venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

2. Configure API Keys

Create a .env file:

# Required
OPENAI_API_KEY=sk-proj-...
MISTRAL_API_KEY=your_mistral_key

# Optional (for robotics vision)
GROQ_API_KEY=gsk_...

# Authentication (change in production!)
SECRET_KEY=generate_with_openssl_rand_hex_32
ADMIN_USERNAME=admin
ADMIN_PASSWORD_HASH=$2b$12$...

3. Start the Server

uvicorn api.main:app --reload --port 8000

4. Access the API


Authentication

Login

curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin123"}'

Response:

{
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...",
  "token_type": "bearer"
}

Use Token in Requests

curl http://localhost:8000/api/v1/agent/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Extract text from invoice.pdf"}'

API Key Alternative

For service-to-service communication:

curl http://localhost:8000/api/v1/documents/ocr \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"file_path": "/path/to/document.pdf"}'

API Endpoints

Authentication

Method Endpoint Description Auth
POST /api/v1/auth/login Login with credentials No
POST /api/v1/auth/token OAuth2 token endpoint No
GET /api/v1/auth/me Get current user Yes
GET /api/v1/auth/status Auth system status No

Multi-Agent Chat

Method Endpoint Description Auth
POST /api/v1/agent/chat Chat with AI agents Yes
POST /api/v1/agent/stream Stream responses (SSE) Yes
GET /api/v1/agent/status Agent status & models No
POST /api/v1/agent/reset Reset memory Admin

Document Processing

Method Endpoint Description Auth
POST /api/v1/documents/ocr Extract text Yes
POST /api/v1/documents/tables Extract tables Yes
POST /api/v1/documents/classify Classify document Yes
POST /api/v1/documents/analyze Full analysis Yes
POST /api/v1/documents/pdf Process PDF Yes
POST /api/v1/documents/upload Upload file Yes
GET /api/v1/documents/status Service status No

Health

Method Endpoint Description Auth
GET /api/v1/health/ Health status No
GET /api/v1/health/live Liveness probe No
GET /api/v1/health/ready Readiness probe No

Examples

Extract Text from Document

curl -X POST http://localhost:8000/api/v1/documents/ocr \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "file_base64": "BASE64_ENCODED_IMAGE",
    "extract_tables": true
  }'

Chat with Multi-Agent System

curl -X POST http://localhost:8000/api/v1/agent/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What type of document is invoice.pdf and extract all the line items?"
  }'

Upload and Process File

curl -X POST http://localhost:8000/api/v1/documents/upload \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@document.pdf"

Architecture

                         Client Request
                              ↓
                    ┌─────────────────┐
                    │   FastAPI App   │
                    │   + JWT Auth    │
                    └────────┬────────┘
                             ↓
              ┌──────────────┴──────────────┐
              ↓                             ↓
     ┌────────────────┐           ┌────────────────┐
     │   /documents   │           │    /agent      │
     │   Direct OCR   │           │  Multi-Agent   │
     └────────────────┘           └────────────────┘
              ↓                             ↓
              │                   ┌─────────────────┐
              │                   │   SUPERVISOR    │
              │                   │  (gpt-4o-mini)  │
              │                   └────────┬────────┘
              │                            ↓
              │              ┌─────────────┴─────────────┐
              ↓              ↓                           ↓
     ┌────────────────┐  ┌────────────────┐   ┌────────────────┐
     │  Mistral OCR 3 │  │ Document Agent │   │  Video Agent   │
     │  $2/1000 pages │  │ mistral-large  │   │ Groq (~50ms)   │
     └────────────────┘  └────────────────┘   └────────────────┘

Model Configuration

Component Model Provider Latency Cost
Supervisor gpt-4o-mini OpenAI ~100ms $0.15/1M tokens
Document Agent mistral-large-latest Mistral ~500ms ~$2/1M tokens
Video Agent llama-3.2-11b-vision Groq ~50ms $0.18/1M tokens
OCR mistral-ocr-2512 Mistral ~1s $2/1000 pages

Deployment

Docker (Coming Soon)

docker-compose up -d

Environment Variables

Variable Required Description
OPENAI_API_KEY Yes OpenAI API key for supervisor
MISTRAL_API_KEY Yes Mistral API key for OCR
GROQ_API_KEY No Groq API key for robotics vision
SECRET_KEY Yes JWT signing key (32+ bytes)
AUTH_ENABLED No Enable/disable auth (default: True)

Performance

Operation Typical Latency Throughput
OCR (single page) ~1-2s 30 pages/min
Document Classification ~500ms 120 req/min
Multi-Agent Chat ~3-5s 20 req/min
Video Frame Analysis ~50ms 1200 frames/min

Support & Contact


License

MIT License - See LICENSE file for details.


Roadmap

  • Document Intelligence (Mistral OCR 3)
  • Multi-Agent System (LangGraph)
  • Authentication (JWT + API Keys)
  • Video/Robotics Vision (Groq)
  • Async Job Processing (Celery)
  • Docker Deployment
  • Cloud Deployment (AWS/GCP/Azure)
  • Usage Analytics & Billing

Built with FastAPI, LangChain, Mistral AI, and Groq

About

Enterprise REST API for document OCR, table extraction, classification, and real-time robotics vision. Multi-agent routing with JWT auth.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages