Implementation Proposal

Daniel Truong edited this page May 26, 2026 · 4 revisions

DFL/AICT Implementation Proposal

Leverage Eagle. Minimize AI. Keep costs down. Run in OpenShift.

Executive Summary

DFL (Digital File Library) and AICT (AI Classifier Tool) can be delivered by extending eagle-api with a Docling microservice for document ingestion. The existing Eagle platform already handles 70% of what DFL needs. Net-new work is: a Docling Python microservice (handles all document parsing, OCR, table extraction, chunking), and a rule-based classifier in eagle-api. No LLM required for v1.

Monthly cost: $0 (v1) to ~$3 (v2 with AI answers). Compare to the EPIC.search approach: ~$500–2000/mo.

Hosting: 100% OpenShift (free for team). Azure used ONLY for AI model API calls when absolutely needed (VLM fallback for degraded docs, conversational search).


Hosting Strategy

| Where | What | Cost |
|---|---|---|
| OpenShift (free) | eagle-api, MongoDB, Typesense, ClamAV, S3 Object Storage, docling-service (CPU), all application code | $0 |
| Azure (pay-per-use) | Azure OpenAI GPT-4.1-mini vision (remote VLM fallback after Granite-Docling fails), Azure OpenAI GPT-4.1-nano (v2 conversational search) | ~$1–3/mo + one-time ETL |

Rule: If it can run in a container, it runs in OpenShift. Azure is only for external API calls to AI models that cannot be self-hosted cost-effectively.

graph LR
    subgraph "OpenShift (Free)"
        API[eagle-api]
        MONGO[(MongoDB)]
        TS[(Typesense)]
        S3[(S3 Object Store)]
        CLAM[ClamAV]
        DOCLING[docling-service<br/>Python, CPU-only]
    end

    subgraph "☁️ Azure (Pay-per-use API calls only)"
        VLM[OpenAI GPT-4.1-mini Vision<br/>~$0.003/page, remote fallback only]
        AOAI[OpenAI GPT-4.1-nano<br/>~$0.0005/query, v2 only]
    end

    DOCLING -.-> |VLM fallback for degraded docs| VLM
    TS -.-> |conversational search, v2| AOAI

    classDef free fill:#2a4a2a,stroke:#1a3a1a,color:#fff
    classDef paid fill:#4a1a1a,stroke:#3a0a0a,color:#fff
    class API,MONGO,TS,S3,CLAM,DOCLING free
    class VLM,AOAI paid

What Eagle Already Has (Reuse)

graph LR
    subgraph "Already Running in OpenShift (Free)"
        direction TB
        CLAM[ClamAV<br/>Virus scanning]
        S3[(S3 Object Storage<br/>OpenShift)]
        MONGO[(MongoDB 8.0<br/>Replica set + Change Streams)]
        TS[(Typesense 30.1<br/>5 collections, RBAC proxy)]
        UPLOAD[Upload Pipeline<br/>multer → scan → store]
        SYNC[Change Stream Sync<br/>Real-time → Typesense]
        EXTRACT[content-extract.js<br/>PDF text extraction]
        CHUNK[chunker.js<br/>Page-based chunking]
        RBAC[Scoped Search Keys<br/>allowed_roles filter injection]
    end

Detailed Inventory

| Component | Status | Location | DFL Reuse |
|---|---|---|---|
| ClamAV scanning | ✅ Prod + Test | api/helpers/utils.js:76-99 | Every DFL upload scanned before storage |
| S3 Object Storage (OpenShift) | ✅ Prod | api/helpers/minio.js | File binary storage for all DFL docs |
| MongoDB replica set | ✅ Prod | Change Streams enabled | DFL metadata, job state, audit log |
| Typesense 30.1 | ✅ Prod | typesense-sync/ (5 files) | Full-text search, facets, semantic (ready to enable) |
| Upload endpoints | ✅ Prod | api/controllers/document.js | POST/PUT with multer, ClamAV, S3 pipeline |
| Document model | ✅ Prod | api/helpers/models/document.js | read[]/write[]/delete[] arrays, audit fields |
| RBAC proxy | ✅ Prod | api/controllers/typesenseProxy.js | Role injection, filter sanitization, field stripping |
| Change Stream sync | ✅ Prod | typesense-sync/src/index.js | Real-time MongoDB→Typesense mirror |
| Full reindex | ✅ Prod | typesense-sync/src/full-sync.js | Zero-downtime alias swap nightly |
| PDF extraction | ✅ Prod | typesense-sync/src/content-extract.js | pdf-parse → text → chunks → Typesense |
| Chunker | ✅ Prod | typesense-sync/src/chunker.js | 4000-character pages, 200-character overlap |
| Popularity scoring | ✅ Prod | typesense-sync/src/popularity-sync.js | 30-day click scores |

What This Means

We don't need to build:

  • ❌ A new search engine
  • ❌ A new file storage service
  • ❌ A new virus scanning service
  • ❌ A new database
  • ❌ A new RBAC system
  • ❌ A new upload pipeline
  • ❌ A new real-time sync mechanism

What's Net-New (Must Build)

Work Package 1: Schema Extension

Effort: ~2 days | Cost: $0

Add DFL-specific fields to existing Document model:

// New fields on existing Document schema
holdState:           { type: String, enum: ['staged', 'admitted'], default: 'staged' },
holdCompletedBy:     { type: ObjectId, ref: 'User' },
holdCompletedDate:   Date,
visibility:          { type: String, enum: ['team', 'eao', 'idir', 'public'] },
classification: {
  documentType:      String,       // from controlled vocabulary
  topics:           [String],      // metatags
  orcsCode:          String,       // ORCS classification code
  retentionClass:    String,       // from ORCS lookup
},
aictSuggestions: {
  documentType:     [{ name: String, confidence: Number }],
  topics:           [{ name: String, confidence: Number }],
},
aictProcessedDate:   Date,

Visibility → allowed_roles mapping (enforced by existing proxy):

| DFL Visibility | Typesense allowed_roles |
|---|---|
| team | `['project:<id>']` |
| eao | `['eao-staff', 'project:<id>']` |
| idir | `['idir-user', 'eao-staff', 'project:<id>']` |
| public | `['public', 'idir-user', 'eao-staff', 'project:<id>']` |

No changes to the Typesense proxy logic are needed: the allowed_roles filter injection already works.
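
The mapping can be derived rather than hard-coded four times, since each wider visibility level only adds one role to the narrower one. The sketch below is illustrative: the function name `allowedRolesFor` and the constants are hypothetical, and only the role strings come from the mapping above.

```javascript
// Hypothetical helper: maps a DFL visibility level to Typesense allowed_roles.
// Role names come from the mapping above; everything else is illustrative.
const VISIBILITY_LEVELS = ['team', 'eao', 'idir', 'public'];
const ROLE_FOR_LEVEL = { eao: 'eao-staff', idir: 'idir-user', public: 'public' };

function allowedRolesFor(visibility, projectId) {
  const level = VISIBILITY_LEVELS.indexOf(visibility);
  if (level === -1) {
    throw new Error(`Unknown visibility: ${visibility}`);
  }
  // Each wider level adds one role in front of the narrower level's list.
  const roles = [`project:${projectId}`];
  for (let i = 1; i <= level; i++) {
    roles.unshift(ROLE_FOR_LEVEL[VISIBILITY_LEVELS[i]]);
  }
  return roles;
}
```

Rejecting unknown values, rather than defaulting to a level, keeps a typo in `visibility` from silently widening access.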


Work Package 2: Document Ingestion Microservice (Docling)

Cost: $0 (MIT license, runs on OpenShift CPU)

Deploy Docling as a dedicated Python microservice that handles ALL document parsing, OCR, table extraction, and chunking:

| Capability | How Docling Handles It | Reference |
|---|---|---|
| PDF (native text) | Standard pipeline, layout analysis | Architecture |
| PDF (scanned) | Tesseract CLI OCR (CPU, pluggable) | Full page OCR |
| Word (.docx) | DOCX backend | Multi-format |
| Excel (.xlsx) | XLSX backend | Supported formats |
| Email (.msg/.eml) | Email parsing | Supported formats |
| PowerPoint (.pptx) | PPTX backend | Multi-format |
| HTML | HTML backend | Supported formats |
| Images (PNG/TIFF/JPEG) | Image pipeline + OCR | Supported formats |
| Tables | TableFormer ML model (merged cells, spanning headers) | Table export |
| Chunking | HierarchicalChunker (RAG-optimized, by document structure) | Chunking |
| Degraded/complex docs | VLM: Granite-Docling-258M (local CPU) → Azure OpenAI GPT-4.1-mini vision (remote fallback) | VLM pipeline |

Replaces: pdf-parse, mammoth, xlsx, mailparser, pptx-parser, tesseract.js, and the custom chunker.js. All are superseded by one microservice.

Architecture:

flowchart TD
    EAGLE[eagle-api] -->|POST /extract<br/>multipart file| DOCLING[docling-service<br/>CPU-only pod]
    DOCLING --> NATIVE{Text layer present?}
    NATIVE -->|Yes| EXTRACT[Algorithmic extraction<br/>No OCR needed]
    NATIVE -->|No| STD[OCR: Tesseract CLI<br/>+ Layout + TableFormer]
    EXTRACT --> CONF{confidence ≥ 0.70?}
    STD --> CONF
    CONF -->|Yes| OUT[Return chunks + metadata]
    CONF -->|No| GRANITE[Granite-Docling-258M<br/>Local VLM, CPU-only]
    GRANITE --> CONF2{confidence ≥ 0.70?}
    CONF2 -->|Yes| OUT
    CONF2 -->|No| VLM[Remote VLM<br/>→ Azure OpenAI GPT-4.1-mini]
    VLM --> CONF3{confidence ≥ 0.70?}
    CONF3 -->|Yes| OUT
    CONF3 -->|No| FLAG[Flag: low_confidence = true]
    FLAG --> OUT
    OUT --> EAGLE
    EAGLE --> MONGO[(MongoDB<br/>DocumentChunks)]
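
The fallback cascade in the flowchart reduces to a loop over pipeline stages that stops at the first result meeting the threshold. This is a minimal sketch written in JavaScript for consistency with the other snippets, although the service itself is planned in Python. The stage objects and the names `granite_docling` and `remote_vlm` are hypothetical stand-ins; only `pipeline_used`, `low_confidence`, and the 0.70 threshold come from this page.

```javascript
// Sketch of the confidence cascade. Each stage is a hypothetical stand-in for
// the standard pipeline, Granite-Docling-258M, or the remote VLM call.
function extractWithFallback(file, stages, threshold = 0.70) {
  if (stages.length === 0) {
    throw new Error('No extraction stages configured');
  }
  let result = null;
  for (const stage of stages) {
    result = { ...stage.run(file), pipeline_used: stage.name };
    if (result.confidence >= threshold) {
      return { ...result, low_confidence: false };
    }
  }
  // Every stage fell short: return the last attempt, flagged for review.
  return { ...result, low_confidence: true };
}
```

Because stages run strictly in order, the remote (paid) stage is only invoked when both local stages score below the threshold.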

Deployment (Helm):

# docling-service pod (OpenShift, CPU-only)
image: custom Dockerfile (python + docling + tesseract-ocr)
resources:
  requests: { cpu: 500m, memory: 1Gi }
  limits: { cpu: 2000m, memory: 2Gi }
replicas: 1    # scale to 3+ during ETL batch
env:
  AZURE_OPENAI_ENDPOINT: (from secret)
  AZURE_OPENAI_API_KEY: (from secret)
  VLM_CONFIDENCE_THRESHOLD: "0.70"
  OCR_ENGINE: "tesseract_cli"

API contract (POST /extract):

// Response
{
  "metadata": {
    "format": "pdf", "pages": 42, "confidence": 0.87,
    "tables_detected": 3, "pipeline_used": "standard"
  },
  "chunks": [
    { "text": "...", "heading": "Section 1.2", "page": 3, "type": "paragraph" },
    { "text": "...", "heading": "Table 4", "page": 7, "type": "table" }
  ],
  "full_text": "...",
  "markdown": "..."
}
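
On the eagle-api side, the response has to be turned into records for the DocumentChunks collection. The mapper below is a sketch under assumptions: the output field names (`documentId`, `chunkIndex`, `pipelineUsed`, `lowConfidence`) are hypothetical, and only the input shape follows the contract above.

```javascript
// Hypothetical mapper from a POST /extract response to DocumentChunks records.
// Input fields follow the contract above; output field names are assumptions.
function toChunkRecords(documentId, response, threshold = 0.70) {
  const { metadata, chunks } = response;
  return chunks
    // Drop empty chunks so they never reach the search index.
    .filter((chunk) => chunk.text && chunk.text.trim().length > 0)
    .map((chunk, index) => ({
      documentId,
      chunkIndex: index,
      text: chunk.text,
      heading: chunk.heading ?? null,
      page: chunk.page ?? null,
      type: chunk.type ?? 'paragraph',
      pipelineUsed: metadata.pipeline_used,
      lowConfidence: metadata.confidence < threshold,
    }));
}
```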

Work Package 3: Rule-Based Classifier (AICT v1)

Effort: ~5 days | Cost: $0/document (no LLM)

Keyword scoring against YAML-defined controlled vocabularies:

# vocabularies/document-types.yaml
- name: "Environmental Assessment Certificate"
  indicators: ["certificate", "EA certificate", "environmental assessment certificate"]
  weight: 3
- name: "Application Information Requirements"
  indicators: ["AIR", "application information requirements", "valued component"]
  weight: 2
- name: "Assessment Report"
  indicators: ["assessment report", "findings", "recommendation"]
  weight: 1

Algorithm: Count keyword occurrences in extracted text → score → rank → return top-3 suggestions with confidence.
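
A minimal sketch of that algorithm follows, assuming the YAML has already been parsed into an array of `{ name, indicators, weight }` objects. Two details are assumptions rather than decisions recorded on this page: confidence is computed as each type's share of the total weighted score, and all-uppercase indicators such as "AIR" are matched case-sensitively so they do not fire on the ordinary word "air".

```javascript
// Sketch of weighted keyword scoring. The confidence formula and the
// case-sensitivity rule for acronyms are assumptions, not agreed design.
function countOccurrences(text, phrase) {
  const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // All-uppercase indicators (acronyms) are matched case-sensitively.
  const flags = phrase === phrase.toUpperCase() ? 'g' : 'gi';
  // Word boundaries stop "AIR" from matching inside "FAIR".
  const matches = text.match(new RegExp(`\\b${escaped}\\b`, flags));
  return matches ? matches.length : 0;
}

function classify(text, vocabulary, topN = 3) {
  const scored = vocabulary
    .map((entry) => {
      const hits = entry.indicators.reduce(
        (sum, indicator) => sum + countOccurrences(text, indicator), 0);
      return { name: entry.name, score: hits * entry.weight };
    })
    .filter((entry) => entry.score > 0);

  const total = scored.reduce((sum, entry) => sum + entry.score, 0);
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ name, score }) => ({
      name,
      // Share of the total weighted score, rounded to two decimals.
      confidence: Math.round((score / total) * 100) / 100,
    }));
}
```

Overlapping indicators (for example "certificate" inside "environmental assessment certificate") are counted separately in this sketch, which boosts types with nested phrases. Whether that is desirable is a tuning question for the vocabulary owners.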

Human always reviews. Classifier suggests; staff confirms or overrides during Metadata Verification Hold. No auto-classification without human sign-off.


Work Package 4: Metadata Verification Hold

Effort: ~3 days | Cost: $0

Upload → auto-extract → auto-classify → HOLD → staff reviews → admit

  • Documents in holdState: 'staged' are NOT synced to Typesense (invisible to search)
  • Staff reviews AICT suggestions, corrects/approves metadata
  • On admission: holdState → 'admitted', change stream picks up, syncs to Typesense

This is a UI + workflow concern on eagle-admin. Backend is just a status field + sync filter.
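
The backend side of the hold can be sketched as a predicate, an equivalent change stream filter, and a guarded state transition. The function and constant names are hypothetical; `holdState`, `holdCompletedBy`, and `holdCompletedDate` are the WP1 fields.

```javascript
// Sketch of the sync filter and the admit transition. Names are hypothetical;
// field names come from the WP1 schema extension.
function shouldSyncToTypesense(doc) {
  return doc.holdState === 'admitted';
}

// The same filter as a MongoDB change stream pipeline stage. Update events
// only carry the full document when the stream is opened with
// { fullDocument: 'updateLookup' }.
const ADMITTED_ONLY_PIPELINE = [
  { $match: { 'fullDocument.holdState': 'admitted' } },
];

function admit(doc, userId, now = new Date()) {
  if (doc.holdState !== 'staged') {
    throw new Error(`Cannot admit document in state: ${doc.holdState}`);
  }
  return { ...doc, holdState: 'admitted', holdCompletedBy: userId, holdCompletedDate: now };
}
```

One consequence worth checking before rollout: with a strict equality filter like this, existing documents that have no `holdState` stored would stop syncing unless they are backfilled.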


Work Package 5: Typesense Schema Update

Effort: ~1 day | Cost: $0

Add DFL fields to existing documents collection schema + enable embeddings on document_chunks:

// Add to documents collection
{ name: 'visibility', type: 'string', facet: true },
{ name: 'documentType', type: 'string', facet: true },
{ name: 'topics', type: 'string[]', facet: true },
{ name: 'orcsCode', type: 'string', facet: true },
{ name: 'holdState', type: 'string' },

// Enable semantic search on document_chunks (one-line change)
{ name: 'embedding', type: 'float[]', embed: {
    from: ['content'],
    model_config: { model_name: 'ts/e5-small' }
}}

Semantic search is then available at $0/query, because Typesense generates embeddings locally.
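
Once the `embedding` field exists, a hybrid (keyword plus semantic) query is mostly a matter of listing that field in `query_by`. The builder below is a sketch: the function name and defaults are hypothetical, and the `allowed_roles` filter is deliberately left out because the existing proxy injects it.

```javascript
// Hypothetical builder for a hybrid search on document_chunks. Field names
// follow the schema above; the function name and defaults are assumptions.
function buildChunkSearchParams(query, { perPage = 10 } = {}) {
  return {
    q: query,
    // Listing the auto-embedding field in query_by asks Typesense to combine
    // keyword and vector matching.
    query_by: 'content,embedding',
    // Keep the raw vectors out of the response payload.
    exclude_fields: 'embedding',
    per_page: perPage,
  };
}
```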


Work Package 6: eagle-admin UI (DFL Interface)

Effort: ~10 days | Cost: $0

Staff-facing Angular interface for:

  • Upload with hold workflow
  • AICT suggestion review (accept/reject/override)
  • Search with DFL facets (type, topic, ORCS, visibility)
  • Batch operations (admit multiple, change visibility)

Reuses existing Angular patterns in eagle-admin.


Work Package 7: Priority 1A ETL (Existing EPIC Documents)

Cost: $0 standard pipeline / ~$15–75 remote VLM fallback for degraded docs

Batch-process existing 65,000 EPIC documents:

  1. Read from current Minio storage
  2. POST to docling-service (scale replicas to 3+ for throughput)
  3. Store returned chunks in MongoDB
  4. Run through classifier
  5. Stage with holdState: 'staged' for review (or auto-admit known-good)
  6. Change stream syncs admitted docs to Typesense

Pattern already exists in full-sync.js (zero-downtime reindex). Adapt for ETL. Scale docling-service replicas during batch to parallelize.


Work Package 8: Conversational Search (Optional v2)

Effort: ~1 day config | Cost: ~$0.002/AI query (only when user requests)

Typesense 30.x has built-in Conversational Search. One config object:

{
  "id": "dfl-conv-model",
  "model_name": "openai/gpt-4.1-nano",
  "system_prompt": "Answer ONLY from provided context. Cite document name and page.",
  "max_bytes": 16384
}

User triggers AI answer explicitly. Not automatic. Source docs always shown alongside for verification. Prevents hallucination by design.


What We're NOT Building (Cost Avoidance)

| EPIC.search Component | Cost | Our Alternative | Our Cost |
|---|---|---|---|
| PostgreSQL + pgvector | ~$100/mo | Typesense (already running) | $0 |
| search-vector-api (Flask) | Ops overhead | Typesense built-in embeddings | $0 |
| search-model (Ollama/vLLM) | ~$200–500/mo GPU | Typesense ts/e5-small | $0 |
| search-api (Flask + OpenAI) | ~$200–1500/mo API | Typesense Conversational Search | ~$5/mo |
| search-web (React) | Dev time | eagle-admin/eagle-public (existing) | $0 |
| sentence-transformers (Python) | Ops overhead | Typesense built-in | $0 |
| Custom RAG pipeline | Dev time + LLM cost | Typesense built-in | Config only |
| LLM classification per doc | ~$1950 one-time + ongoing | Rule-based keyword scoring | $0 |
| Custom PDF/DOCX/XLSX extractors (6+ libs) | Maintenance | Docling (single unified pipeline) | $0 |
| Custom chunking logic | Maintenance | Docling HierarchicalChunker | $0 |
| Azure Document Intelligence | ~$293 one-time | Docling Tesseract + VLM fallback | ~$15–75 |

Architecture Summary

graph TB
    subgraph "✅ Existing - OpenShift (No Changes, Free)"
        CLAM[ClamAV]
        S3[(S3 Object Storage)]
        MONGO[(MongoDB)]
        PROXY[Typesense RBAC Proxy]
        SYNC[Change Stream Sync]
    end

    subgraph "🆕 New - OpenShift (Free)"
        DOCLING[docling-service<br/>Python microservice<br/>Parse + OCR + Tables + Chunk]
        GRANITE_VLM[Granite-Docling-258M<br/>Local VLM, CPU-only]
        CLASS[Rule-Based Classifier]
        HOLD[Metadata Verification Hold]
        SCHEMA[Schema: DFL fields]
        EMBED[Typesense Embeddings<br/>ts/e5-small]
    end

    subgraph "🔮 Azure API calls (pay-per-use)"
        VLM[GPT-4.1-mini Vision<br/>Remote VLM fallback]
        CONV[GPT-4.1-nano<br/>Conversational search, v2]
    end

    DOCLING --> CLASS
    DOCLING -.-> GRANITE_VLM
    GRANITE_VLM -.-> VLM
    CLASS --> HOLD
    HOLD --> SYNC
    SYNC --> EMBED
    EMBED -.-> CONV

    classDef existing fill:#2a4a2a,stroke:#1a3a1a,color:#fff
    classDef new fill:#4a3a1a,stroke:#3a2a0a,color:#fff
    classDef future fill:#4a1a1a,stroke:#3a0a0a,color:#fff

    class CLAM,S3,MONGO,PROXY,SYNC existing
    class DOCLING,GRANITE_VLM,CLASS,HOLD,SCHEMA,EMBED new
    class VLM,CONV future

Cost Summary

| Item | Monthly | One-Time |
|---|---|---|
| Typesense (OpenShift pod) | $0 | - |
| MongoDB (OpenShift pod) | $0 | - |
| S3 Object Storage (OpenShift) | $0 | - |
| ClamAV (OpenShift pod) | $0 | - |
| docling-service (OpenShift pod, CPU) | $0 | $0 |
| Rule-based classifier | $0 | $0 |
| VLM fallback: Granite-Docling-258M local + GPT-4.1-mini remote (~2%) | ~$0–0.10/day | ~$15–75 (ETL batch) |
| Conversational Search: GPT-4.1-nano (v2, user-triggered) | ~$1–3 | - |
| Total | $0 (v1) / $1–3 (v2) | $15–75 |

Pricing basis (May 2026): GPT-4.1-nano: $0.10/1M input, $0.40/1M output ($0.0005/query). GPT-4.1-mini: $0.40/1M input, $1.60/1M output ($0.003/page VLM). Granite-Docling-258M: $0 (local CPU). Azure DI Layout (upgrade path): $10/1K pages. OpenShift: $0. Docling: $0 (MIT).
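
The per-query and per-page figures can be sanity-checked from the token prices. In the sketch below the token counts per call are illustrative assumptions, chosen only because they reproduce the quoted unit costs. They are not measurements.

```javascript
// Prices are USD per 1M tokens, from the pricing basis above. Token counts
// passed to costPerCall below are assumptions, not measured values.
const PRICES = {
  'gpt-4.1-nano': { input: 0.10, output: 0.40 },
  'gpt-4.1-mini': { input: 0.40, output: 1.60 },
};

function costPerCall(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}

// How many units a budget buys at a given unit cost.
function unitsForBudget(budgetUsd, unitCostUsd) {
  return Math.round(budgetUsd / unitCostUsd);
}

// Assumed 4,000 input + 250 output tokens per conversational query.
const perQuery = costPerCall('gpt-4.1-nano', 4000, 250);
// Assumed 3,500 input + 1,000 output tokens per VLM page.
const perPage = costPerCall('gpt-4.1-mini', 3500, 1000);
```

At $0.003/page, the $15–75 ETL range corresponds to roughly 5,000 to 25,000 pages sent to the remote VLM. At $0.0005/query, $1–3 per month corresponds to roughly 2,000 to 6,000 conversational queries.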


Effort Summary

| Work Package | Dependencies |
|---|---|
| WP1: Schema extension | None |
| WP2: docling-service deployment | None |
| WP3: Rule-based classifier | WP2 |
| WP4: Metadata verification hold | WP1 |
| WP5: Typesense schema update | WP1 |
| WP6: eagle-admin UI | WP1, WP4, WP5 |
| WP7: Priority 1A ETL | WP2, WP3 |
| WP8: Conversational search (v2) | WP5 |

Critical path: WP1 + WP2 (parallel) → WP3 → WP7 (ETL). UI (WP6) can proceed in parallel after WP1.
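
The dependency list can be grouped into waves of work packages that are free to start together. This is an illustrative scheduling view only: it ignores effort estimates and team capacity, and the function name `planWaves` is hypothetical.

```javascript
// Dependencies copied from the Effort Summary above. Grouping into "waves"
// is an illustrative view, not a committed plan.
const DEPENDENCIES = {
  WP1: [], WP2: [], WP3: ['WP2'], WP4: ['WP1'], WP5: ['WP1'],
  WP6: ['WP1', 'WP4', 'WP5'], WP7: ['WP2', 'WP3'], WP8: ['WP5'],
};

function planWaves(dependencies) {
  const done = new Set();
  const waves = [];
  let remaining = Object.keys(dependencies);
  while (remaining.length > 0) {
    // A package is ready once all of its dependencies finished in earlier waves.
    const ready = remaining.filter(
      (wp) => dependencies[wp].every((dep) => done.has(dep)));
    if (ready.length === 0) throw new Error('Dependency cycle detected');
    ready.forEach((wp) => done.add(wp));
    remaining = remaining.filter((wp) => !done.has(wp));
    waves.push(ready);
  }
  return waves;
}
```

For the dependencies above this yields three waves: WP1 and WP2, then WP3, WP4 and WP5, then WP6, WP7 and WP8.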


Decision Points

See Technical Decisions for research-backed analysis of each.

  1. VLM fallback threshold: What confidence score triggers Granite-Docling-258M (local) and then Azure OpenAI GPT-4.1-mini (remote)?

    • Recommended: 0.70 (pages below this get VLM re-processing)
    • Tunable via env var, no code change needed
  2. Conversational Search timing: v1 (ship with search) or v2 (after core DFL)?

    • Recommendation: v2. Core DFL value is filing + finding, not AI chat.
  3. Auto-admit threshold: Should high-confidence classifications skip hold?

    • Recommendation: No. Human review for v1. Consider auto-admit for v2 after classifier is proven.

Principles

  1. OpenShift-first: If it runs in a container, it runs in OpenShift (free). Azure only for AI API calls.
  2. No AI where rules work: Keyword scoring at $0 beats LLM at $0.03/doc.
  3. Leverage existing infra: Eagle has 70% of what DFL needs running in OpenShift today.
  4. Human in the loop: AICT suggests, staff decides. No black-box classification.
  5. AI only when user asks: Conversational search is opt-in, not default.
  6. Source verification: Every AI answer shows source documents. Anti-hallucination by design.
  7. Extend, don't replace: Same codebase, same team, same deployment pipeline.
