A backend service that turns raw documents into structured JSON using a mix of text extraction, AI analysis and metadata extraction.
Upload a PDF, DOCX or TXT file and the API will:
- extract the text
- generate a short summary
- classify the document
- pull out a few useful key fields
- extract entities, including character offsets
- store everything in SQLite, per user
All of that comes back as a single, well-defined JSON object that can be used by other services or UIs.
- Features
- Architecture
- Tech Stack
- Getting Started
- API Reference
- Project Structure
- Planned Roadmap
- Development Notes
- Contributing
- License
- Document upload (PDF, DOCX, TXT) via FastAPI
- Text extraction
  - PDFs via `pypdf`
  - DOCX via `python-docx`
  - Plain text files as-is
- LLM-backed analysis using OpenAI GPT-4o-mini (or an OpenAI compatible API)
- Structured JSON output:
- summary (3–7 sentences)
- classification (`document_type` and `category`)
- key_fields (date, due_date, total_amount, person_or_company, reference_id)
- entities (PERSON, ORG, DATE, MONEY, LOCATION, OTHER) with optional offsets
- Entity character offsets
  - Each entity can include `start_offset` and `end_offset` into the original document text
  - Useful for UI highlighting or redaction workflows
- SQLite backed storage
- Users, documents, analyses and entities stored in SQLite
  - Per-user history with `/documents` and `/documents/{id}`
- JWT authentication
  - `/auth/register` and `/auth/login`
  - Bearer token required for analysis and document endpoints
- OpenAI compatible endpoint support
- Use OpenAI directly
  - Or point to an OpenAI compatible base URL via `OPENAI_BASE_URL`
- Interactive docs
  - Swagger UI at `/docs`
  - ReDoc at `/redoc`
- Simple HTML UI
  - Minimal upload page at `/`
  - Paste token, upload a file, see the raw JSON response
High level flow:
```mermaid
flowchart LR
    U[Client] -->|Upload file + JWT| A[FastAPI /analyze]
    A -->|Auth check| AUTH[JWT verification]
    A -->|Save temp file| F[uploads/]
    A -->|Detect type| T[Extractor]
    T -->|Extract text| TXT[Document text]
    TXT -->|Prompt| L[AI client]
    L -->|Strict JSON| J[AnalysisResult]
    J -->|Persist| DB[(SQLite)]
    DB -->|History and detail| H[History endpoints]
    A -->|Return JSON| U
```
The service is intentionally monolithic. A single FastAPI app owns routing, analysis, persistence and authentication.
The goal is to keep the code readable and straightforward rather than overly abstract.
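The request flow above boils down to a handful of plain functions. The sketch below is illustrative only: the function names, stubs and `AnalysisResult` shape are simplified stand-ins, not the project's actual module API.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisResult:
    summary: str
    document_type: str
    entities: list = field(default_factory=list)

def extract_text(filename: str, data: bytes) -> str:
    # The real extractor dispatches on file type: pypdf for PDF,
    # python-docx for DOCX; TXT is decoded as-is.
    if filename.lower().endswith(".txt"):
        return data.decode("utf-8")
    raise ValueError("only TXT is handled in this sketch")

def analyze(text: str) -> AnalysisResult:
    # The real service prompts the LLM and parses its strict-JSON reply;
    # stubbed here with a trivial summary.
    return AnalysisResult(summary=text[:100], document_type="other")

def handle_upload(filename: str, data: bytes) -> AnalysisResult:
    # Mirrors the /analyze flow: extract -> analyze -> (persist) -> return
    text = extract_text(filename, data)
    if not text.strip():
        raise ValueError("empty document")  # surfaces as the API's 400
    return analyze(text)
```

Keeping each stage a plain function is what makes the monolith easy to read: the route handler is little more than `handle_upload` plus auth and persistence.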
- Backend: FastAPI, Uvicorn
- Language: Python 3.12+
- AI / LLM: OpenAI client (GPT-4o-mini by default)
- Parsing: `pypdf`, `python-docx`
- Config: `python-dotenv`
- Auth: JWT using `python-jose` and `passlib`
- Database: SQLite via SQLAlchemy
- Python 3.10 or later (developed with 3.12)
- A working `pip` installation
- An OpenAI API key or a compatible provider
Clone the repository and set up a virtual environment:
```bash
git clone https://github.com/ZXRProductions/document-intel-api.git
cd document-intel-api

python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt
```

Copy the example file and fill in your details:

```bash
cp .env.example .env
```

Example `.env`:

```env
# LLM config
OPENAI_API_KEY=your_api_key_here
MODEL_NAME=gpt-4o-mini

# Optional OpenAI compatible endpoint
# Example: https://api.groq.com/openai/v1
OPENAI_BASE_URL=

# App behaviour
DEBUG=false
MAX_FILE_SIZE_MB=10

# Database (leave default for local SQLite)
# DATABASE_URL=sqlite:///document_intel.db

# Auth / security
SECRET_KEY=CHANGE_ME_IN_PRODUCTION
ACCESS_TOKEN_EXPIRE_MINUTES=60
```

From the project root:

```bash
uvicorn app.main:app --reload
```

Open:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
- Simple UI: http://127.0.0.1:8000/
Create a new user.
Request body:

```json
{
  "email": "user@example.com",
  "password": "yourpassword"
}
```

Response (201):

```json
{
  "id": 1,
  "email": "user@example.com"
}
```

Authenticate and obtain a JWT access token.

Form data (`application/x-www-form-urlencoded`):

- `username` (email)
- `password`

Example:

```bash
curl -X POST "http://127.0.0.1:8000/auth/login" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=user@example.com&password=yourpassword"
```

Response (200):

```json
{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "bearer"
}
```

Use this token in the Authorization header:

```
Authorization: Bearer <token>
```

Return basic information about the authenticated user.
Response:
```json
{
  "id": 1,
  "email": "user@example.com"
}
```

Simple health check.

Response:

```json
{
  "status": "ok",
  "detail": "Service is up."
}
```

Upload a document and get structured analysis back. Requires a valid Bearer token.
- Method: `POST`
- Content-Type: `multipart/form-data`
- Field: `file` (PDF, DOCX, TXT)
- Auth: `Authorization: Bearer <token>`
Possible responses:
- `200`: analysis JSON
- `400`: invalid file or empty text
- `401`: missing or invalid token
- `422`: extraction failed
- `500`: unexpected server or AI error
List analysed documents for the current user.
Response:
```json
[
  {
    "id": 1,
    "filename": "invoice-june.pdf",
    "created_at": "2025-11-28T19:22:15.123456",
    "document_type": "invoice",
    "category": "finance",
    "summary": "Short summary of the invoice..."
  }
]
```

Get full details for a single document owned by the current user.
Response:
```json
{
  "id": 1,
  "filename": "invoice-june.pdf",
  "mime_type": "application/pdf",
  "created_at": "2025-11-28T19:22:15.123456",
  "full_text": "Full extracted document text here...",
  "analysis": {
    "summary": "This invoice from Acme Corp...",
    "classification": {
      "document_type": "invoice",
      "category": "finance"
    },
    "key_fields": {
      "date": "2025-06-30",
      "due_date": "2025-07-30",
      "total_amount": "£1,250.00",
      "person_or_company": "Acme Corp",
      "reference_id": "INV-2025-0612"
    },
    "entities": [
      {
        "type": "ORG",
        "text": "Acme Corp",
        "start_offset": 15,
        "end_offset": 24
      },
      {
        "type": "PERSON",
        "text": "John Doe",
        "start_offset": 120,
        "end_offset": 128
      }
    ]
  }
}
```

If the document does not exist or is not owned by the current user, a 404 is returned.
Using curl:
```bash
curl -X POST "http://127.0.0.1:8000/analyze" \
  -H "accept: application/json" \
  -H "Authorization: Bearer <your_token>" \
  -F "file=@example_invoice.pdf"
```

Example response:

```json
{
  "summary": "This document is an invoice issued by Acme Corp to John Doe for consulting services rendered in June 2025. It lists the invoice number, billing address, service description, total amount due and payment terms. The invoice highlights a single line item with hourly consulting fees. Payment is due within 30 days via bank transfer. Contact details are provided for billing queries. The document serves as a formal request for payment.",
  "classification": {
    "document_type": "invoice",
    "category": "finance"
  },
  "key_fields": {
    "date": "2025-06-30",
    "due_date": "2025-07-30",
    "total_amount": "£1,250.00",
    "person_or_company": "Acme Corp",
    "reference_id": "INV-2025-0612"
  },
  "entities": [
    {
      "type": "ORG",
      "text": "Acme Corp",
      "start_offset": 10,
      "end_offset": 19
    },
    {
      "type": "PERSON",
      "text": "John Doe",
      "start_offset": 120,
      "end_offset": 128
    },
    {
      "type": "DATE",
      "text": "30 June 2025",
      "start_offset": 200,
      "end_offset": 212
    },
    {
      "type": "MONEY",
      "text": "£1,250.00",
      "start_offset": 260,
      "end_offset": 270
    }
  ]
}
```

The exact values will depend on the uploaded document and the model output.
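Because entity offsets index directly into the extracted document text, a consumer can highlight or redact entities without re-searching the text. A minimal sketch of a client-side highlighter (a hypothetical helper, not part of the API):

```python
def highlight(text: str, entities: list) -> str:
    # Insert [TYPE: ...] markers from the end of the string backwards,
    # so earlier offsets remain valid as the text grows.
    out = text
    for ent in sorted(entities, key=lambda e: e["start_offset"], reverse=True):
        s, e = ent["start_offset"], ent["end_offset"]
        out = out[:s] + f"[{ent['type']}: {out[s:e]}]" + out[e:]
    return out
```

For example, `highlight("Pay Acme Corp now", [{"type": "ORG", "text": "Acme Corp", "start_offset": 4, "end_offset": 13}])` returns `"Pay [ORG: Acme Corp] now"`.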
```
document-intel-api/
├─ app/
│  ├─ main.py        # FastAPI app, routes, error handling, simple HTML UI
│  ├─ extractor.py   # PDF / DOCX / TXT extraction
│  ├─ ai_client.py   # AI prompt and OpenAI client wrapper
│  ├─ auth.py        # JWT auth, register/login, current user dependency
│  ├─ database.py    # SQLAlchemy engine and session helpers
│  ├─ db_models.py   # ORM models (User, Document, Entity)
│  ├─ models.py      # Pydantic models (request and response)
│  ├─ config.py      # Environment and settings
│  └─ utils.py       # Helper utilities (file saving, normalisation, offsets)
│
├─ uploads/          # Temporary uploads (kept out of git)
│  └─ .gitkeep
│
├─ README.md
├─ .env.example
├─ requirements.txt
├─ .gitignore
└─ LICENSE
```
The current version focuses on a clean, end to end flow rather than every possible feature.
- More robust NER with confidence scores
- Richer entity types and relationships
- Better handling of long and complex documents
- Table extraction from invoices and reports
- PII redaction utilities for personal data
- A dedicated endpoint for follow up questions about a document
- More detailed analytics and metrics
- Pagination and filtering for `/documents`
- Optional soft deletion or archiving of documents
- API keys for service to service access
- Rate limiting options for public deployments
- A more polished HTML/JS front end
- Optional React dashboard with:
- drag and drop upload
- document list with filters
- detail view with highlighted entities
- simple charts for document distribution
- Dockerfile and container image build
- Example configuration for Railway, Render or Fly.io
- CORS tightening for production
- Health and readiness endpoints for container platforms
- pytest suite
- Fixtures with fake PDF, DOCX and TXT documents
- Mocked AI responses for deterministic tests
- Integration tests covering auth, analysis and history
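The "mocked AI responses" item can be as simple as injecting the LLM call so tests never touch the network. The names below are illustrative, not the project's current code:

```python
import json

def analyze_document(text: str, llm) -> dict:
    """Run the injected LLM callable and parse its strict-JSON reply."""
    raw = llm(f"Analyse this document and reply with JSON only:\n{text}")
    return json.loads(raw)

def test_analyze_with_mocked_llm():
    # A canned response makes the test deterministic and offline
    canned = (
        '{"summary": "stub", '
        '"classification": {"document_type": "invoice", "category": "finance"}}'
    )
    result = analyze_document("any text", llm=lambda prompt: canned)
    assert result["classification"]["document_type"] == "invoice"
```

The same injection point also lets integration tests exercise auth, persistence and history without spending API credits.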
Some practical details when working on the project:
- Documents and analyses are stored per user and only visible to the owner.
- Entity extraction is handled by the AI model, not a separate NER library.
- Offsets are calculated using a simple first match approach. It is usually enough for demos and small tools but can be refined.
- SQLite is the default to keep setup simple. Switching to Postgres or another database mainly involves updating `DATABASE_URL` and running migrations.
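The first-match offset approach mentioned above amounts to something like the following. This is a simplified illustration (the project's offset logic lives in `app/utils.py` and may differ in detail):

```python
def first_match_offsets(text: str, entity_text: str):
    """Offsets of the first occurrence of entity_text in text, else (None, None)."""
    start = text.find(entity_text)
    if start == -1:
        # The model returned text that is not present verbatim in the document
        return None, None
    return start, start + len(entity_text)
```

The main limitation is that repeated entity text always resolves to the first occurrence: `first_match_offsets("Invoice from Acme Corp", "Acme Corp")` returns `(13, 22)` even if `Acme Corp` appears again later in the document.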
If you spot something odd or want to extend the project, feel free to:
1. Fork the repository.
2. Create a feature branch:
   ```bash
   git checkout -b feature/my-idea
   ```
3. Commit your changes:
   ```bash
   git commit -am "Describe your change"
   ```
4. Push the branch:
   ```bash
   git push origin feature/my-idea
   ```
5. Open a pull request.
Bug reports and small improvements to the documentation are very welcome.
This project is licensed under the MIT License.
