Implementation Proposal
Leverage Eagle. Minimize AI. Keep costs down. Run in OpenShift.
DFL (Digital File Library) and AICT (AI Classifier Tool) can be delivered by extending eagle-api with a Docling microservice for document ingestion. The existing Eagle platform already handles 70% of what DFL needs. The net-new work is a Docling Python microservice (handling all document parsing, OCR, table extraction, and chunking) and a rule-based classifier in eagle-api. No LLM is required for v1.
Monthly cost: $0 (v1) to ~$3 (v2 with AI answers). Compare to the EPIC.search approach: ~$500–2000/mo.
Hosting: 100% OpenShift (free for team). Azure used ONLY for AI model API calls when absolutely needed (VLM fallback for degraded docs, conversational search).
| Where | What | Cost |
|---|---|---|
| OpenShift (free) | eagle-api, MongoDB, Typesense, ClamAV, S3 Object Storage, docling-service (CPU), all application code | $0 |
| Azure (pay-per-use) | OpenAI GPT-4.1-mini vision (remote VLM fallback after Granite-Docling fails), Azure OpenAI GPT-4.1-nano (v2 conversational search) | ~$1–3/mo + one-time ETL |
Rule: If it can run in a container, it runs in OpenShift. Azure is only for external API calls to AI models that cannot be self-hosted cost-effectively.
graph LR
subgraph "OpenShift (Free)"
API[eagle-api]
MONGO[(MongoDB)]
TS[(Typesense)]
S3[(S3 Object Store)]
CLAM[ClamAV]
DOCLING[docling-service<br/>Python, CPU-only]
end
subgraph "☁️ Azure (Pay-per-use API calls only)"
VLM[OpenAI GPT-4.1-mini Vision<br/>~$0.003/page, remote fallback only]
AOAI[OpenAI GPT-4.1-nano<br/>~$0.0005/query, v2 only]
end
DOCLING -.-> |VLM fallback for degraded docs| VLM
TS -.-> |conversational search, v2| AOAI
classDef free fill:#2a4a2a,stroke:#1a3a1a,color:#fff
classDef paid fill:#4a1a1a,stroke:#3a0a0a,color:#fff
class API,MONGO,TS,S3,CLAM,DOCLING free
class VLM,AOAI paid
graph LR
subgraph "Already Running in OpenShift (Free)"
direction TB
CLAM[ClamAV<br/>Virus scanning]
S3[(S3 Object Storage<br/>OpenShift)]
MONGO[(MongoDB 8.0<br/>Replica set + Change Streams)]
TS[(Typesense 30.1<br/>5 collections, RBAC proxy)]
UPLOAD[Upload Pipeline<br/>multer → scan → store]
SYNC[Change Stream Sync<br/>Real-time → Typesense]
EXTRACT[content-extract.js<br/>PDF text extraction]
CHUNK[chunker.js<br/>Page-based chunking]
RBAC[Scoped Search Keys<br/>allowed_roles filter injection]
end
| Component | Status | Location | DFL Reuse |
|---|---|---|---|
| ClamAV scanning | Prod + Test | `api/helpers/utils.js:76-99` | Every DFL upload scanned before storage |
| S3 Object Storage (OpenShift) | Prod | `api/helpers/minio.js` | File binary storage for all DFL docs |
| MongoDB replica set | Prod | Change Streams enabled | DFL metadata, job state, audit log |
| Typesense 30.1 | Prod | `typesense-sync/` (5 files) | Full-text search, facets, semantic (ready to enable) |
| Upload endpoints | Prod | `api/controllers/document.js` | POST/PUT with multer, ClamAV, S3 pipeline |
| Document model | Prod | `api/helpers/models/document.js` | read[]/write[]/delete[] arrays, audit fields |
| RBAC proxy | Prod | `api/controllers/typesenseProxy.js` | Role injection, filter sanitization, field stripping |
| Change Stream sync | Prod | `typesense-sync/src/index.js` | Real-time MongoDB→Typesense mirror |
| Full reindex | Prod | `typesense-sync/src/full-sync.js` | Zero-downtime alias swap nightly |
| PDF extraction | Prod | `typesense-sync/src/content-extract.js` | pdf-parse → text → chunks → Typesense |
| Chunker | Prod | `typesense-sync/src/chunker.js` | 4000-char pages, 200-char overlap |
| Popularity scoring | Prod | `typesense-sync/src/popularity-sync.js` | 30-day click scores |
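
The chunker row describes fixed-size pages of 4000 characters with a 200-character overlap. A minimal Python sketch of that windowing idea follows; it is not the code in `chunker.js`, and the function name and defaults are illustrative assumptions.

```python
def chunk_text(text: str, size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    window boundary still appears whole in one of the two chunks.
    Hypothetical helper: the real chunker.js may differ.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, each new chunk starts 3800 characters after the previous one.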
We don't need to build:
- A new search engine
- A new file storage service
- A new virus scanning service
- A new database
- A new RBAC system
- A new upload pipeline
- A new real-time sync mechanism
Effort: ~2 days | Cost: $0
Add DFL-specific fields to existing Document model:
// New fields on existing Document schema
holdState: { type: String, enum: ['staged', 'admitted'], default: 'staged' },
holdCompletedBy: { type: ObjectId, ref: 'User' },
holdCompletedDate: Date,
visibility: { type: String, enum: ['team', 'eao', 'idir', 'public'] },
classification: {
  documentType: String, // from controlled vocabulary
  topics: [String], // metatags
  orcsCode: String, // ORCS classification code
  retentionClass: String, // from ORCS lookup
},
aictSuggestions: {
  documentType: [{ name: String, confidence: Number }],
  topics: [{ name: String, confidence: Number }],
},
aictProcessedDate: Date,

Visibility → allowed_roles mapping (enforced by existing proxy):
| DFL Visibility | Typesense allowed_roles |
|---|---|
| `team` | `['project:<id>']` |
| `eao` | `['eao-staff', 'project:<id>']` |
| `idir` | `['idir-user', 'eao-staff', 'project:<id>']` |
| `public` | `['public', 'idir-user', 'eao-staff', 'project:<id>']` |
No changes to the Typesense proxy logic are needed: the allowed_roles filter injection already works.
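
The mapping is cumulative: each wider visibility level adds one role in front of the narrower list. A hedged Python sketch of how a sync step might derive `allowed_roles` from the `visibility` field is shown here; the function and constant names are hypothetical and not part of eagle-api.

```python
# Visibility levels from narrowest to widest, as listed in the mapping table.
VISIBILITY_ORDER = ["team", "eao", "idir", "public"]

# Role that each level adds on top of the narrower levels (assumed from the table).
ROLE_ADDED_BY_LEVEL = {"eao": "eao-staff", "idir": "idir-user", "public": "public"}


def allowed_roles(visibility: str, project_id: str) -> list[str]:
    """Return the Typesense allowed_roles list for a DFL visibility value."""
    if visibility not in VISIBILITY_ORDER:
        raise ValueError(f"unknown visibility: {visibility!r}")
    roles = [f"project:{project_id}"]  # the project team can always see the doc
    widest = VISIBILITY_ORDER.index(visibility)
    for level in VISIBILITY_ORDER[1:widest + 1]:
        roles.insert(0, ROLE_ADDED_BY_LEVEL[level])  # wider roles go in front
    return roles
```

Deriving the list from an ordered set of levels keeps the four table rows consistent with each other if a level is ever added.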
Cost: $0 (MIT license, runs on OpenShift CPU)
Deploy Docling as a dedicated Python microservice that handles ALL document parsing, OCR, table extraction, and chunking:
| Capability | How Docling Handles It | Reference |
|---|---|---|
| PDF (native text) | Standard pipeline, layout analysis | Architecture |
| PDF (scanned) | Tesseract CLI OCR (CPU, pluggable) | Full page OCR |
| Word (.docx) | DOCX backend | Multi-format |
| Excel (.xlsx) | XLSX backend | Supported formats |
| Email (.msg/.eml) | Email parsing | Supported formats |
| PowerPoint (.pptx) | PPTX backend | Multi-format |
| HTML | HTML backend | Supported formats |
| Images (PNG/TIFF/JPEG) | Image pipeline + OCR | Supported formats |
| Tables | TableFormer ML model (merged cells, spanning headers) | Table export |
| Chunking | HierarchicalChunker (RAG-optimized, by document structure) | Chunking |
| Degraded/complex docs | VLM: Granite-Docling-258M (local CPU) → Azure OpenAI GPT-4.1-mini vision (remote fallback) | VLM pipeline |
Replaces: pdf-parse, mammoth, xlsx, mailparser, pptx-parser, tesseract.js, custom chunker.js. ALL are superseded by one microservice.
Architecture:
flowchart TD
EAGLE[eagle-api] -->|POST /extract<br/>multipart file| DOCLING[docling-service<br/>CPU-only pod]
DOCLING --> NATIVE{Text layer present?}
NATIVE -->|Yes| EXTRACT[Algorithmic extraction<br/>No OCR needed]
NATIVE -->|No| STD[OCR: Tesseract CLI<br/>+ Layout + TableFormer]
EXTRACT --> CONF{confidence ≥ 0.70?}
STD --> CONF
CONF -->|Yes| OUT[Return chunks + metadata]
CONF -->|No| GRANITE[Granite-Docling-258M<br/>Local VLM, CPU-only]
GRANITE --> CONF2{confidence ≥ 0.70?}
CONF2 -->|Yes| OUT
CONF2 -->|No| VLM[Remote VLM<br/>Azure OpenAI GPT-4.1-mini]
VLM --> CONF3{confidence ≥ 0.70?}
CONF3 -->|Yes| OUT
CONF3 -->|No| FLAG[Flag: low_confidence = true]
FLAG --> OUT
OUT --> EAGLE
EAGLE --> MONGO[(MongoDB<br/>DocumentChunks)]
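
The flowchart amounts to a confidence-gated cascade: try the cheapest pipeline first, escalate only when confidence falls below the threshold, and flag the result if even the last stage falls short. A minimal Python sketch of that control flow follows, with each pipeline passed in as a plain callable. The `Extraction` type and the stage signature are assumptions for illustration, not Docling APIs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Extraction:
    text: str
    confidence: float
    pipeline_used: str
    low_confidence: bool = False


# A stage is (name, function); the function returns (text, confidence).
Stage = tuple[str, Callable[[bytes], tuple[str, float]]]


def extract_with_fallback(
    data: bytes, stages: Sequence[Stage], threshold: float = 0.70
) -> Extraction:
    """Run stages in order and stop at the first result meeting the threshold."""
    if not stages:
        raise ValueError("at least one stage is required")
    result = None
    for name, run in stages:
        text, confidence = run(data)
        result = Extraction(text, confidence, name)
        if confidence >= threshold:
            return result  # later, more expensive stages are never called
    # Every stage fell short: return the last attempt, flagged for review.
    result.low_confidence = True
    return result
```

Because later stages run only on failure, the paid remote VLM is reached only after both local pipelines score below the threshold.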
Deployment (Helm):
# docling-service pod (OpenShift, CPU-only)
image: custom Dockerfile (python + docling + tesseract-ocr)
resources:
  requests: { cpu: 500m, memory: 1Gi }
  limits: { cpu: 2000m, memory: 2Gi }
replicas: 1 # scale to 3+ during ETL batch
env:
  AZURE_OPENAI_ENDPOINT: (from secret)
  AZURE_OPENAI_API_KEY: (from secret)
  VLM_CONFIDENCE_THRESHOLD: "0.70"
  OCR_ENGINE: "tesseract_cli"

API contract (POST /extract):
// Response
{
  "metadata": {
    "format": "pdf", "pages": 42, "confidence": 0.87,
    "tables_detected": 3, "pipeline_used": "standard"
  },
  "chunks": [
    { "text": "...", "heading": "Section 1.2", "page": 3, "type": "paragraph" },
    { "text": "...", "heading": "Table 4", "page": 7, "type": "table" }
  ],
  "full_text": "...",
  "markdown": "..."
}

Effort: ~5 days | Cost: $0/document (no LLM)
Keyword scoring against YAML-defined controlled vocabularies:
# vocabularies/document-types.yaml
- name: "Environmental Assessment Certificate"
  indicators: ["certificate", "EA certificate", "environmental assessment certificate"]
  weight: 3
- name: "Application Information Requirements"
  indicators: ["AIR", "application information requirements", "valued component"]
  weight: 2
- name: "Assessment Report"
  indicators: ["assessment report", "findings", "recommendation"]
  weight: 1

Algorithm: Count keyword occurrences in extracted text → score → rank → return top-3 suggestions with confidence.
Human always reviews. Classifier suggests; staff confirms or overrides during Metadata Verification Hold. No auto-classification without human sign-off.
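
The keyword-scoring algorithm can be sketched in a few lines. This is a hedged Python illustration rather than the eagle-api implementation: the vocabulary is inlined instead of loaded from YAML, and "confidence" is assumed to be each type's share of the total weighted score, which the proposal does not define.

```python
import re

# Mirrors the YAML vocabulary; a real service would load it from the file.
VOCABULARY = [
    {"name": "Environmental Assessment Certificate",
     "indicators": ["certificate", "EA certificate",
                    "environmental assessment certificate"],
     "weight": 3},
    {"name": "Application Information Requirements",
     "indicators": ["AIR", "application information requirements",
                    "valued component"],
     "weight": 2},
    {"name": "Assessment Report",
     "indicators": ["assessment report", "findings", "recommendation"],
     "weight": 1},
]


def count_hits(text_lower: str, indicator: str) -> int:
    """Count whole-word, case-insensitive occurrences of one indicator."""
    pattern = r"\b" + re.escape(indicator.lower()) + r"\b"
    return len(re.findall(pattern, text_lower))


def classify(text: str, vocabulary=VOCABULARY, top_n: int = 3) -> list[dict]:
    """Return up to top_n suggestions as {name, confidence}, best first."""
    text_lower = text.lower()
    scores = {}
    for entry in vocabulary:
        hits = sum(count_hits(text_lower, ind) for ind in entry["indicators"])
        if hits:
            scores[entry["name"]] = hits * entry["weight"]
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Assumed definition: confidence = share of the total weighted score.
    return [{"name": name, "confidence": round(score / total, 2)}
            for name, score in ranked[:top_n]]
```

Two limitations of this sketch are worth knowing before tuning weights. Overlapping indicators are counted more than once ("environmental assessment certificate" also matches "certificate"), and case-insensitive matching lets the acronym "AIR" match the ordinary word "air".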
Effort: ~3 days | Cost: $0
Upload → auto-extract → auto-classify → HOLD → staff reviews → admit

- Documents in `holdState: 'staged'` are NOT synced to Typesense (invisible to search)
- Staff reviews AICT suggestions, corrects/approves metadata
- On admission: `holdState → 'admitted'`, change stream picks up, syncs to Typesense
This is a UI + workflow concern on eagle-admin. Backend is just a status field + sync filter.
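
A sketch of the "status field + sync filter" idea, in Python for brevity. The field names come from the schema extension above; the function names are hypothetical, and treating documents that have no `holdState` as not admitted is an assumption that a real migration of existing Eagle documents would need to settle.

```python
from datetime import datetime, timezone
from typing import Optional


def should_sync_to_typesense(doc: dict) -> bool:
    """Sync filter: only admitted documents become searchable."""
    return doc.get("holdState") == "admitted"


def admit(doc: dict, user_id: str, now: Optional[datetime] = None) -> dict:
    """Return a copy of a staged document marked as admitted by user_id."""
    if doc.get("holdState") != "staged":
        raise ValueError("only staged documents can be admitted")
    return {
        **doc,
        "holdState": "admitted",
        "holdCompletedBy": user_id,
        "holdCompletedDate": now or datetime.now(timezone.utc),
    }
```

The change stream handler would call the filter on every update, so the write that flips `holdState` is also the event that makes the document visible.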
Effort: ~1 day | Cost: $0
Add DFL fields to existing documents collection schema + enable embeddings on document_chunks:
// Add to documents collection
{ name: 'visibility', type: 'string', facet: true },
{ name: 'documentType', type: 'string', facet: true },
{ name: 'topics', type: 'string[]', facet: true },
{ name: 'orcsCode', type: 'string', facet: true },
{ name: 'holdState', type: 'string' },
// Enable semantic search on document_chunks (one-line change)
{ name: 'embedding', type: 'float[]', embed: {
  from: ['content'],
  model_config: { model_name: 'ts/e5-small' }
}}

Semantic search is then available at $0/query: Typesense generates embeddings locally.
Effort: ~10 days | Cost: $0
Staff-facing Angular interface for:
- Upload with hold workflow
- AICT suggestion review (accept/reject/override)
- Search with DFL facets (type, topic, ORCS, visibility)
- Batch operations (admit multiple, change visibility)
Reuses existing Angular patterns in eagle-admin.
Cost: $0 standard pipeline / ~$15–75 remote VLM fallback for degraded docs
Batch-process existing 65,000 EPIC documents:
- Read from current Minio storage
- POST to docling-service (scale replicas to 3+ for throughput)
- Store returned chunks in MongoDB
- Run through classifier
- Stage with `holdState: 'staged'` for review (or auto-admit known-good)
- Change stream syncs admitted docs to Typesense
Pattern already exists in full-sync.js (zero-downtime reindex). Adapt for ETL. Scale docling-service replicas during batch to parallelize.
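
The batch steps can be sketched as a small driver that fans documents out to a worker pool sized to the number of docling-service replicas. This is an illustrative Python sketch only: the storage read, the `POST /extract` call, and the classifier are injected as callables (all hypothetical names), and the real ETL would adapt `full-sync.js` instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def process_one(doc_id: str, fetch: Callable, extract: Callable,
                classify: Callable) -> dict:
    """Fetch one file, extract it, classify it, and stage the result."""
    data = fetch(doc_id)                 # read bytes from object storage
    extraction = extract(data)           # dict shaped like the /extract response
    suggestions = classify(extraction["full_text"])
    return {
        "_id": doc_id,
        "chunks": extraction["chunks"],
        "aictSuggestions": suggestions,
        "holdState": "staged",           # nothing is searchable until admitted
    }


def run_etl(doc_ids: Iterable[str], fetch: Callable, extract: Callable,
            classify: Callable, workers: int = 3):
    """Process documents in parallel; return (staged_records, failures)."""
    staged, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {doc_id: pool.submit(process_one, doc_id, fetch, extract, classify)
                   for doc_id in doc_ids}
        for doc_id, future in futures.items():
            try:
                staged.append(future.result())
            except Exception as exc:     # keep going; record the failure
                failed.append((doc_id, str(exc)))
    return staged, failed
```

Collecting failures instead of aborting lets a 65,000-document run finish and leaves a retry list for the documents that could not be read or parsed.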
Effort: ~1 day config | Cost: ~$0.002/AI query (only when user requests)
Typesense 30.x has built-in Conversational Search. One config object:
{
"id": "dfl-conv-model",
"model_name": "openai/gpt-4.1-nano",
"system_prompt": "Answer ONLY from provided context. Cite document name and page.",
"max_bytes": 16384
}

The user triggers the AI answer explicitly; it is never automatic. Source docs are always shown alongside for verification, which guards against hallucination by design.
| EPIC.search Component | Cost | Our Alternative | Our Cost |
|---|---|---|---|
| PostgreSQL + pgvector | ~$100/mo | Typesense (already running) | $0 |
| search-vector-api (Flask) | Ops overhead | Typesense built-in embeddings | $0 |
| search-model (Ollama/vLLM) | ~$200-500/mo GPU | Typesense `ts/e5-small` | $0 |
| search-api (Flask + OpenAI) | ~$200-1500/mo API | Typesense Conversational Search | ~$5/mo |
| search-web (React) | Dev time | eagle-admin/eagle-public (existing) | $0 |
| sentence-transformers (Python) | Ops overhead | Typesense built-in | $0 |
| Custom RAG pipeline | Dev time + LLM cost | Typesense built-in | Config only |
| LLM classification per doc | ~$1950 one-time + ongoing | Rule-based keyword scoring | $0 |
| Custom PDF/DOCX/XLSX extractors (6+ libs) | Maintenance | Docling (single unified pipeline) | $0 |
| Custom chunking logic | Maintenance | Docling HierarchicalChunker | $0 |
| Azure Document Intelligence | ~$293 one-time | Docling Tesseract + VLM fallback | ~$15-75 |
graph TB
subgraph "Existing - OpenShift (No Changes, Free)"
CLAM[ClamAV]
S3[(S3 Object Storage)]
MONGO[(MongoDB)]
PROXY[Typesense RBAC Proxy]
SYNC[Change Stream Sync]
end
subgraph "New - OpenShift (Free)"
DOCLING[docling-service<br/>Python microservice<br/>Parse + OCR + Tables + Chunk]
GRANITE_VLM[Granite-Docling-258M<br/>Local VLM, CPU-only]
CLASS[Rule-Based Classifier]
HOLD[Metadata Verification Hold]
SCHEMA[Schema: DFL fields]
EMBED[Typesense Embeddings<br/>ts/e5-small]
end
subgraph "Azure API calls (pay-per-use)"
VLM[GPT-4.1-mini Vision<br/>Remote VLM fallback]
CONV[GPT-4.1-nano<br/>Conversational search, v2]
end
DOCLING --> CLASS
DOCLING -.-> GRANITE_VLM
GRANITE_VLM -.-> VLM
CLASS --> HOLD
HOLD --> SYNC
SYNC --> EMBED
EMBED -.-> CONV
classDef existing fill:#2a4a2a,stroke:#1a3a1a,color:#fff
classDef new fill:#4a3a1a,stroke:#3a2a0a,color:#fff
classDef future fill:#4a1a1a,stroke:#3a0a0a,color:#fff
class CLAM,S3,MONGO,PROXY,SYNC existing
class DOCLING,GRANITE_VLM,CLASS,HOLD,SCHEMA,EMBED new
class VLM,CONV future
| Item | Monthly | One-Time |
|---|---|---|
| Typesense (OpenShift pod) | $0 | – |
| MongoDB (OpenShift pod) | $0 | – |
| S3 Object Storage (OpenShift) | $0 | – |
| ClamAV (OpenShift pod) | $0 | – |
| docling-service (OpenShift pod, CPU) | $0 | $0 |
| Rule-based classifier | $0 | $0 |
| VLM fallback: Granite-Docling-258M local + GPT-4.1-mini remote (~2%) | ~$0–0.10/day | ~$15–75 (ETL batch) |
| Conversational Search: GPT-4.1-nano (v2, user-triggered) | ~$1–3 | – |
| Total | $0 (v1) / $1–3 (v2) | $15–75 |
Pricing basis (May 2026): GPT-4.1-nano: $0.10/1M input, $0.40/1M output (about $0.0005/query). GPT-4.1-mini: $0.40/1M input, $1.60/1M output (about $0.003/page VLM). Granite-Docling-258M: $0 (local CPU). Azure DI Layout (upgrade path): $10/1K pages. OpenShift: $0. Docling: $0 (MIT).
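
The per-query and per-page figures follow from the token prices once a token budget is assumed. The Python sketch below shows that arithmetic. The token counts are assumptions chosen to reproduce the stated figures, not measurements, and the page and query volumes are only what the stated budgets imply at those unit costs.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one model call at per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# ASSUMED token budgets (not from the proposal):
# one conversational query = 4000 input + 250 output tokens on GPT-4.1-nano
nano_query = cost_per_call(4000, 250, 0.10, 0.40)    # roughly $0.0005
# one VLM page = 3500 input + 1000 output tokens on GPT-4.1-mini
mini_page = cost_per_call(3500, 1000, 0.40, 1.60)    # roughly $0.003

# What the stated budgets imply at those unit costs.
etl_pages_low = round(15 / mini_page)      # pages covered by the $15 estimate
etl_pages_high = round(75 / mini_page)     # pages covered by the $75 estimate
queries_low = round(1 / nano_query)        # queries per month at $1
queries_high = round(3 / nano_query)       # queries per month at $3
```

Under these assumptions the ETL estimate corresponds to 5,000 to 25,000 pages sent to the remote VLM, and the v2 budget to 2,000 to 6,000 AI answers per month.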
| Work Package | Dependencies |
|---|---|
| WP1: Schema extension | None |
| WP2: docling-service deployment | None |
| WP3: Rule-based classifier | WP2 |
| WP4: Metadata verification hold | WP1 |
| WP5: Typesense schema update | WP1 |
| WP6: eagle-admin UI | WP1, WP4, WP5 |
| WP7: Priority 1A ETL | WP2, WP3 |
| WP8: Conversational search (v2) | WP5 |
Critical path: WP1 + WP2 (parallel) → WP3 → WP7 (ETL). UI work (WP6) can start in parallel once WP1 is done; per the dependency table it also needs WP4 and WP5 before it can finish.
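
The dependency table can be checked mechanically. The Python sketch below uses the standard-library `graphlib` module to group the work packages into waves, where everything in a wave can run in parallel once the previous wave is done. The `build_waves` helper is illustrative; the dependencies are copied from the table.

```python
from graphlib import TopologicalSorter

# Work package -> packages it depends on (from the dependency table).
DEPENDENCIES = {
    "WP1": [],
    "WP2": [],
    "WP3": ["WP2"],
    "WP4": ["WP1"],
    "WP5": ["WP1"],
    "WP6": ["WP1", "WP4", "WP5"],
    "WP7": ["WP2", "WP3"],
    "WP8": ["WP5"],
}


def build_waves(dependencies: dict) -> list[list[str]]:
    """Group nodes into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the table has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = build_waves(DEPENDENCIES)
# waves == [["WP1", "WP2"], ["WP3", "WP4", "WP5"], ["WP6", "WP7", "WP8"]]
```

The table therefore allows three waves of work, with WP6, WP7 and WP8 all able to start once the second wave completes.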
See Technical Decisions for research-backed analysis of each.
1. VLM fallback threshold: What confidence score triggers Granite-Docling-258M (local) and then Azure OpenAI GPT-4.1-mini (remote)?
   - Recommended: 0.70 (pages below this get VLM re-processing)
   - Tunable via env var, no code change needed
2. Conversational Search timing: v1 (ship with search) or v2 (after core DFL)?
   - Recommendation: v2. Core DFL value is filing + finding, not AI chat.
3. Auto-admit threshold: Should high-confidence classifications skip hold?
   - Recommendation: No. Human review for v1. Consider auto-admit for v2 after classifier is proven.
- OpenShift-first: If it runs in a container, it runs in OpenShift (free). Azure only for AI API calls.
- No AI where rules work: Keyword scoring at $0 beats LLM at $0.03/doc
- Leverage existing infra: Eagle has 70% of what DFL needs running in OpenShift today
- Human in the loop: AICT suggests, staff decides. No black-box classification.
- AI only when user asks: Conversational search is opt-in, not default
- Source verification: Every AI answer shows source documents. Anti-hallucination by design.
- Extend, don't replace: Same codebase, same team, same deployment pipeline