A backend service that turns raw documents into structured JSON using a mix of text extraction, AI analysis and metadata extraction.
Upload a PDF, DOCX or TXT file and the API will:
- extract the text
- generate a short summary
- classify the document
- pull out a few useful key fields
- extract entities, including character offsets
- store everything in SQLite, per user
All of that comes back as a single, well-defined JSON object that can be used by other services or UIs.
- Features
- Architecture
- Tech Stack
- Getting Started
- API Reference
- Project Structure
- Planned Roadmap
- Development Notes
- Contributing
- License
- Document upload (PDF, DOCX, TXT) via FastAPI
- Text extraction
  - PDFs via `pypdf`
  - DOCX via `python-docx`
  - Plain text files as-is
- LLM-backed analysis using OpenAI GPT-4o-mini (or an OpenAI compatible API)
- Structured JSON output:
- summary (3–7 sentences)
- classification (`document_type` and `category`)
- key_fields (date, due_date, total_amount, person_or_company, reference_id)
- entities (PERSON, ORG, DATE, MONEY, LOCATION, OTHER) with optional offsets
- Entity character offsets
  - Each entity can include `start_offset` and `end_offset` into the original document text
  - Useful for UI highlighting or redaction workflows
- SQLite backed storage
- Users, documents, analyses and entities stored in SQLite
  - Per-user history with `/documents` and `/documents/{id}`
- JWT authentication
  - `/auth/register` and `/auth/login`
  - Bearer token required for analysis and document endpoints
- OpenAI compatible endpoint support
- Use OpenAI directly
  - Or point to an OpenAI compatible base URL via `OPENAI_BASE_URL`
- Interactive docs
  - Swagger UI at `/docs`
  - ReDoc at `/redoc`
- Simple HTML UI
  - Minimal upload page at `/`
  - Paste token, upload a file, see the raw JSON response
High level flow:
```mermaid
flowchart LR
    U[Client] -->|Upload file + JWT| A[FastAPI /analyze]
    A -->|Auth check| AUTH[JWT verification]
    A -->|Save temp file| F[uploads/]
    A -->|Detect type| T[Extractor]
    T -->|Extract text| TXT[Document text]
    TXT -->|Prompt| L[AI client]
    L -->|Strict JSON| J[AnalysisResult]
    J -->|Persist| DB[(SQLite)]
    DB -->|History and detail| H[History endpoints]
    A -->|Return JSON| U
```
The service is intentionally monolithic. A single FastAPI app owns routing, analysis, persistence and authentication.
The goal is to keep the code readable and straightforward rather than overly abstract.
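The request flow above boils down to a handful of plain functions. The sketch below is illustrative only: the function names, stubs and `AnalysisResult` shape are simplified stand-ins, not the project's actual module API.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisResult:
    summary: str
    document_type: str
    entities: list = field(default_factory=list)

def extract_text(filename: str, data: bytes) -> str:
    # The real extractor dispatches on file type: pypdf for PDF,
    # python-docx for DOCX; TXT is decoded as-is.
    if filename.lower().endswith(".txt"):
        return data.decode("utf-8")
    raise ValueError("only TXT is handled in this sketch")

def analyze(text: str) -> AnalysisResult:
    # The real service prompts the LLM and parses its strict-JSON reply;
    # stubbed here with a trivial summary.
    return AnalysisResult(summary=text[:100], document_type="other")

def handle_upload(filename: str, data: bytes) -> AnalysisResult:
    # Mirrors the /analyze flow: extract -> analyze -> (persist) -> return
    text = extract_text(filename, data)
    if not text.strip():
        raise ValueError("empty document")  # surfaces as the API's 400
    return analyze(text)
```

Keeping each stage a plain function is what makes the monolith easy to read: the route handler is little more than `handle_upload` plus auth and persistence.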
- Backend: FastAPI, Uvicorn
- Language: Python 3.12+
- AI / LLM: OpenAI client (GPT-4o-mini by default)
- Parsing: `pypdf`, `python-docx`
- Config: `python-dotenv`
- Auth: JWT using `python-jose` and `passlib`
- Database: SQLite via SQLAlchemy
- Python 3.10 or later (developed with 3.12)
- A working `pip` installation
- An OpenAI API key or a compatible provider
Clone the repository and set up a virtual environment:
```bash
git clone https://github.com/ZXRProductions/document-intel-api.git
cd document-intel-api

python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt
```

Copy the example file and fill in your details:

```bash
cp .env.example .env
```

Example `.env`:

```env
# LLM config
OPENAI_API_KEY=your_api_key_here
MODEL_NAME=gpt-4o-mini

# Optional OpenAI compatible endpoint
# Example: https://api.groq.com/openai/v1
OPENAI_BASE_URL=

# App behaviour
DEBUG=false
MAX_FILE_SIZE_MB=10

# Database (leave default for local SQLite)
# DATABASE_URL=sqlite:///document_intel.db

# Auth / security
SECRET_KEY=CHANGE_ME_IN_PRODUCTION
ACCESS_TOKEN_EXPIRE_MINUTES=60
```

From the project root:

```bash
uvicorn app.main:app --reload
```

Open:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
- Simple UI: http://127.0.0.1:8000/
Create a new user.
Request body:

```json
{
  "email": "user@example.com",
  "password": "yourpassword"
}
```

Response (201):

```json
{
  "id": 1,
  "email": "user@example.com"
}
```

Authenticate and obtain a JWT access token.

Form data (`application/x-www-form-urlencoded`):

- `username` (email)
- `password`

Example:

```bash
curl -X POST "http://127.0.0.1:8000/auth/login" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=user@example.com&password=yourpassword"
```

Response (200):

```json
{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "bearer"
}
```

Use this token in the Authorization header:

```
Authorization: Bearer <token>
```

Return basic information about the authenticated user.
Response:
```json
{
  "id": 1,
  "email": "user@example.com"
}
```

Simple health check.

Response:

```json
{
  "status": "ok",
  "detail": "Service is up."
}
```

Upload a document and get structured analysis back. Requires a valid Bearer token.
- Method: `POST`
- Content-Type: `multipart/form-data`
- Field: `file` (PDF, DOCX, TXT)
- Auth: `Authorization: Bearer <token>`
Possible responses:
- `200`: analysis JSON
- `400`: invalid file or empty text
- `401`: missing or invalid token
- `422`: extraction failed
- `500`: unexpected server or AI error
List analysed documents for the current user.
Response:
```json
[
  {
    "id": 1,
    "filename": "invoice-june.pdf",
    "created_at": "2025-11-28T19:22:15.123456",
    "document_type": "invoice",
    "category": "finance",
    "summary": "Short summary of the invoice..."
  }
]
```

Get full details for a single document owned by the current user.
Response:
```json
{
  "id": 1,
  "filename": "invoice-june.pdf",
  "mime_type": "application/pdf",
  "created_at": "2025-11-28T19:22:15.123456",
  "full_text": "Full extracted document text here...",
  "analysis": {
    "summary": "This invoice from Acme Corp...",
    "classification": {
      "document_type": "invoice",
      "category": "finance"
    },
    "key_fields": {
      "date": "2025-06-30",
      "due_date": "2025-07-30",
      "total_amount": "£1,250.00",
      "person_or_company": "Acme Corp",
      "reference_id": "INV-2025-0612"
    },
    "entities": [
      {
        "type": "ORG",
        "text": "Acme Corp",
        "start_offset": 15,
        "end_offset": 24
      },
      {
        "type": "PERSON",
        "text": "John Doe",
        "start_offset": 120,
        "end_offset": 128
      }
    ]
  }
}
```

If the document does not exist or is not owned by the current user, a 404 is returned.
Using curl:
```bash
curl -X POST "http://127.0.0.1:8000/analyze" \
  -H "accept: application/json" \
  -H "Authorization: Bearer <your_token>" \
  -F "file=@example_invoice.pdf"
```

Example response:

```json
{
  "summary": "This document is an invoice issued by Acme Corp to John Doe for consulting services rendered in June 2025. It lists the invoice number, billing address, service description, total amount due and payment terms. The invoice highlights a single line item with hourly consulting fees. Payment is due within 30 days via bank transfer. Contact details are provided for billing queries. The document serves as a formal request for payment.",
  "classification": {
    "document_type": "invoice",
    "category": "finance"
  },
  "key_fields": {
    "date": "2025-06-30",
    "due_date": "2025-07-30",
    "total_amount": "£1,250.00",
    "person_or_company": "Acme Corp",
    "reference_id": "INV-2025-0612"
  },
  "entities": [
    {
      "type": "ORG",
      "text": "Acme Corp",
      "start_offset": 10,
      "end_offset": 19
    },
    {
      "type": "PERSON",
      "text": "John Doe",
      "start_offset": 120,
      "end_offset": 128
    },
    {
      "type": "DATE",
      "text": "30 June 2025",
      "start_offset": 200,
      "end_offset": 212
    },
    {
      "type": "MONEY",
      "text": "£1,250.00",
      "start_offset": 260,
      "end_offset": 270
    }
  ]
}
```

The exact values will depend on the uploaded document and the model output.
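Because entity offsets index directly into the extracted document text, a consumer can highlight or redact entities without re-searching the text. A minimal sketch of a client-side highlighter (a hypothetical helper, not part of the API):

```python
def highlight(text: str, entities: list) -> str:
    # Insert [TYPE: ...] markers from the end of the string backwards,
    # so earlier offsets remain valid as the text grows.
    out = text
    for ent in sorted(entities, key=lambda e: e["start_offset"], reverse=True):
        s, e = ent["start_offset"], ent["end_offset"]
        out = out[:s] + f"[{ent['type']}: {out[s:e]}]" + out[e:]
    return out
```

For example, `highlight("Pay Acme Corp now", [{"type": "ORG", "text": "Acme Corp", "start_offset": 4, "end_offset": 13}])` returns `"Pay [ORG: Acme Corp] now"`.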
```
document-intel-api/
├─ app/
│  ├─ main.py        # FastAPI app, routes, error handling, simple HTML UI
│  ├─ extractor.py   # PDF / DOCX / TXT extraction
│  ├─ ai_client.py   # AI prompt and OpenAI client wrapper
│  ├─ auth.py        # JWT auth, register/login, current user dependency
│  ├─ database.py    # SQLAlchemy engine and session helpers
│  ├─ db_models.py   # ORM models (User, Document, Entity)
│  ├─ models.py      # Pydantic models (request and response)
│  ├─ config.py      # Environment and settings
│  └─ utils.py       # Helper utilities (file saving, normalisation, offsets)
│
├─ uploads/          # Temporary uploads (kept out of git)
│  └─ .gitkeep
│
├─ README.md
├─ .env.example
├─ requirements.txt
├─ .gitignore
└─ LICENSE
```
The current version focuses on a clean, end to end flow rather than every possible feature.
- More robust NER with confidence scores
- Richer entity types and relationships
- Better handling of long and complex documents
- Table extraction from invoices and reports
- PII redaction utilities for personal data
- A dedicated endpoint for follow up questions about a document
- More detailed analytics and metrics
- Pagination and filtering for `/documents`
- Optional soft deletion or archiving of documents
- API keys for service to service access
- Rate limiting options for public deployments
- A more polished HTML/JS front end
- Optional React dashboard with:
- drag and drop upload
- document list with filters
- detail view with highlighted entities
- simple charts for document distribution
- Dockerfile and container image build
- Example configuration for Railway, Render or Fly.io
- CORS tightening for production
- Health and readiness endpoints for container platforms
- pytest suite
- Fixtures with fake PDF, DOCX and TXT documents
- Mocked AI responses for deterministic tests
- Integration tests covering auth, analysis and history
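The "mocked AI responses" item can be as simple as injecting the LLM call so tests never touch the network. The names below are illustrative, not the project's current code:

```python
import json

def analyze_document(text: str, llm) -> dict:
    """Run the injected LLM callable and parse its strict-JSON reply."""
    raw = llm(f"Analyse this document and reply with JSON only:\n{text}")
    return json.loads(raw)

def test_analyze_with_mocked_llm():
    # A canned response makes the test deterministic and offline
    canned = (
        '{"summary": "stub", '
        '"classification": {"document_type": "invoice", "category": "finance"}}'
    )
    result = analyze_document("any text", llm=lambda prompt: canned)
    assert result["classification"]["document_type"] == "invoice"
```

The same injection point also lets integration tests exercise auth, persistence and history without spending API credits.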
Some practical details when working on the project:
- Documents and analyses are stored per user and only visible to the owner.
- Entity extraction is handled by the AI model, not a separate NER library.
- Offsets are calculated using a simple first match approach. It is usually enough for demos and small tools but can be refined.
- SQLite is the default to keep setup simple. Switching to Postgres or another database mainly involves updating `DATABASE_URL` and running migrations.
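The first-match offset approach mentioned above amounts to something like the following. This is a simplified illustration (the project's offset logic lives in `app/utils.py` and may differ in detail):

```python
def first_match_offsets(text: str, entity_text: str):
    """Offsets of the first occurrence of entity_text in text, else (None, None)."""
    start = text.find(entity_text)
    if start == -1:
        # The model returned text that is not present verbatim in the document
        return None, None
    return start, start + len(entity_text)
```

The main limitation is that repeated entity text always resolves to the first occurrence: `first_match_offsets("Invoice from Acme Corp", "Acme Corp")` returns `(13, 22)` even if `Acme Corp` appears again later in the document.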
If you spot something odd or want to extend the project, feel free to:
1. Fork the repository.
2. Create a feature branch:
   ```bash
   git checkout -b feature/my-idea
   ```
3. Commit your changes:
   ```bash
   git commit -am "Describe your change"
   ```
4. Push the branch:
   ```bash
   git push origin feature/my-idea
   ```
5. Open a pull request.
Bug reports and small improvements to the documentation are very welcome.
This project is licensed under the MIT License.
