Implementation Proposal

DFL/AICT Implementation Proposal

Leverage Eagle. Minimize AI. Keep costs down. Run in OpenShift.

Executive Summary

DFL (Digital File Library) and AICT (AI Classifier Tool) can be delivered by extending eagle-api with a Docling microservice for document ingestion. The existing Eagle platform already covers roughly 70% of what DFL needs. The main net-new components are a Docling Python microservice (handles all document parsing, OCR, table extraction, and chunking) and a rule-based classifier in eagle-api; the remaining work packages are schema, workflow, and UI extensions. No LLM is required for v1.

Monthly cost: $0 (v1) to ~$3 (v2 with AI answers). Compare to EPIC.search approach: ~$500–2000/mo.

Hosting: all application components run in OpenShift (free for the team). Azure is used ONLY for AI model API calls when absolutely needed (VLM fallback for degraded docs, conversational search).

Hosting Strategy

Where	What	Cost
OpenShift (free)	eagle-api, MongoDB, Typesense, ClamAV, S3 Object Storage, docling-service (CPU), all application code	$0
Azure (pay-per-use)	Azure OpenAI GPT-4.1-mini vision (remote VLM fallback after Granite-Docling fails), Azure OpenAI GPT-4.1-nano (v2 conversational search)	~$1–3/mo + one-time ETL

Rule: If it can run in a container, it runs in OpenShift. Azure is only for external API calls to AI models that cannot be self-hosted cost-effectively.

graph LR
    subgraph "OpenShift (Free)"
        API[eagle-api]
        MONGO[(MongoDB)]
        TS[(Typesense)]
        S3[(S3 Object Store)]
        CLAM[ClamAV]
        DOCLING[docling-service<br/>Python, CPU-only]
    end

    subgraph "☁️ Azure (Pay-per-use API calls only)"
        VLM[OpenAI GPT-4.1-mini Vision<br/>~$0.003/page, remote fallback only]
        AOAI[OpenAI GPT-4.1-nano<br/>~$0.0005/query, v2 only]
    end

    DOCLING -.-> |VLM fallback for degraded docs| VLM
    TS -.-> |conversational search, v2| AOAI

    classDef free fill:#2a4a2a,stroke:#1a3a1a,color:#fff
    classDef paid fill:#4a1a1a,stroke:#3a0a0a,color:#fff
    class API,MONGO,TS,S3,CLAM,DOCLING free
    class VLM,AOAI paid

What Eagle Already Has (Reuse)

graph LR
    subgraph "Already Running in OpenShift (Free)"
        direction TB
        CLAM[ClamAV<br/>Virus scanning]
        S3[(S3 Object Storage<br/>OpenShift)]
        MONGO[(MongoDB 8.0<br/>Replica set + Change Streams)]
        TS[(Typesense 30.1<br/>5 collections, RBAC proxy)]
        UPLOAD[Upload Pipeline<br/>multer → scan → store]
        SYNC[Change Stream Sync<br/>Real-time → Typesense]
        EXTRACT[content-extract.js<br/>PDF text extraction]
        CHUNK[chunker.js<br/>Page-based chunking]
        RBAC[Scoped Search Keys<br/>allowed_roles filter injection]
    end

Detailed Inventory

Component	Status	Location	DFL Reuse
ClamAV scanning	✅ Prod + Test	`api/helpers/utils.js:76-99`	Every DFL upload scanned before storage
S3 Object Storage (OpenShift)	✅ Prod	`api/helpers/minio.js`	File binary storage for all DFL docs
MongoDB replica set	✅ Prod	Change Streams enabled	DFL metadata, job state, audit log
Typesense 30.1	✅ Prod	`typesense-sync/` (5 files)	Full-text search, facets, semantic (ready to enable)
Upload endpoints	✅ Prod	`api/controllers/document.js`	POST/PUT with multer, ClamAV, S3 pipeline
Document model	✅ Prod	`api/helpers/models/document.js`	`read[]`/`write[]`/`delete[]` arrays, audit fields
RBAC proxy	✅ Prod	`api/controllers/typesenseProxy.js`	Role injection, filter sanitization, field stripping
Change Stream sync	✅ Prod	`typesense-sync/src/index.js`	Real-time MongoDB→Typesense mirror
Full reindex	✅ Prod	`typesense-sync/src/full-sync.js`	Zero-downtime alias swap nightly
PDF extraction	✅ Prod	`typesense-sync/src/content-extract.js`	pdf-parse → text → chunks → Typesense
Chunker	✅ Prod	`typesense-sync/src/chunker.js`	4,000-character pages with a 200-character overlap
Popularity scoring	✅ Prod	`typesense-sync/src/popularity-sync.js`	30-day click scores

What This Means

We don't need to build:

❌ A new search engine
❌ A new file storage service
❌ A new virus scanning service
❌ A new database
❌ A new RBAC system
❌ A new upload pipeline
❌ A new real-time sync mechanism

What's Net-New (Must Build)

Work Package 1: Schema Extension

Effort: ~2 days | Cost: $0

Add DFL-specific fields to existing Document model:

// New fields on existing Document schema
holdState:           { type: String, enum: ['staged', 'admitted'], default: 'staged' },
holdCompletedBy:     { type: ObjectId, ref: 'User' },
holdCompletedDate:   Date,
visibility:          { type: String, enum: ['team', 'eao', 'idir', 'public'] },
classification: {
  documentType:      String,       // from controlled vocabulary
  topics:           [String],      // metatags
  orcsCode:          String,       // ORCS classification code
  retentionClass:    String,       // from ORCS lookup
},
aictSuggestions: {
  documentType:     [{ name: String, confidence: Number }],
  topics:           [{ name: String, confidence: Number }],
},
aictProcessedDate:   Date,

Visibility → allowed_roles mapping (enforced by existing proxy):

DFL Visibility	Typesense `allowed_roles`
`team`	`['project:<id>']`
`eao`	`['eao-staff', 'project:<id>']`
`idir`	`['idir-user', 'eao-staff', 'project:<id>']`
`public`	`['public', 'idir-user', 'eao-staff', 'project:<id>']`

No changes to Typesense proxy logic needed — allowed_roles filter injection already works.
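
The mapping in the table is cumulative: each wider visibility level keeps every role from the narrower levels and adds one more. A minimal sketch of that rule follows. The helper name `allowedRolesFor` is hypothetical, and how eagle-api actually populates `allowed_roles` at sync time is an assumption.

```javascript
// Sketch only: role strings mirror the visibility table above.
// The function name and its placement in eagle-api are assumptions.
const VISIBILITY_LEVELS = ['team', 'eao', 'idir', 'public'];

// Role added at each level; wider levels keep every narrower role.
const ROLE_FOR_LEVEL = {
  eao: 'eao-staff',
  idir: 'idir-user',
  public: 'public',
};

function allowedRolesFor(visibility, projectId) {
  const depth = VISIBILITY_LEVELS.indexOf(visibility);
  if (depth === -1) {
    throw new Error(`Unknown visibility: ${visibility}`);
  }
  // Every document is always visible to its own project team.
  const roles = [`project:${projectId}`];
  for (const level of VISIBILITY_LEVELS.slice(1, depth + 1)) {
    roles.unshift(ROLE_FOR_LEVEL[level]);
  }
  return roles;
}

console.log(allowedRolesFor('idir', 'abc123'));
```

Rejecting unknown visibility values, rather than defaulting to a level, keeps a typo from silently widening access.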

Work Package 2: Document Ingestion Microservice (Docling)

Cost: $0 (MIT license, runs on OpenShift CPU)

Deploy Docling as a dedicated Python microservice that handles ALL document parsing, OCR, table extraction, and chunking:

Capability	How Docling Handles It	Reference
PDF (native text)	Standard pipeline, layout analysis	Architecture
PDF (scanned)	Tesseract CLI OCR (CPU, pluggable)	Full page OCR
Word (.docx)	DOCX backend	Multi-format
Excel (.xlsx)	XLSX backend	Supported formats
Email (.msg/.eml)	Email parsing	Supported formats
PowerPoint (.pptx)	PPTX backend	Multi-format
HTML	HTML backend	Supported formats
Images (PNG/TIFF/JPEG)	Image pipeline + OCR	Supported formats
Tables	TableFormer ML model (merged cells, spanning headers)	Table export
Chunking	HierarchicalChunker (RAG-optimized, by document structure)	Chunking
Degraded/complex docs	VLM: Granite-Docling-258M (local CPU) → Azure OpenAI GPT-4.1-mini vision (remote fallback)	VLM pipeline

Replaces: pdf-parse, mammoth, xlsx, mailparser, pptx-parser, tesseract.js, custom chunker.js — ALL superseded by one microservice.

Architecture:

flowchart TD
    EAGLE[eagle-api] -->|POST /extract<br/>multipart file| DOCLING[docling-service<br/>CPU-only pod]
    DOCLING --> NATIVE{Text layer present?}
    NATIVE -->|Yes| EXTRACT[Algorithmic extraction<br/>No OCR needed]
    NATIVE -->|No| STD[OCR: Tesseract CLI<br/>+ Layout + TableFormer]
    EXTRACT --> CONF{confidence ≥ 0.70?}
    STD --> CONF
    CONF -->|Yes| OUT[Return chunks + metadata]
    CONF -->|No| GRANITE[Granite-Docling-258M<br/>Local VLM, CPU-only]
    GRANITE --> CONF2{confidence ≥ 0.70?}
    CONF2 -->|Yes| OUT
    CONF2 -->|No| VLM[Remote VLM<br/>→ Azure OpenAI GPT-4.1-mini]
    VLM --> CONF3{confidence ≥ 0.70?}
    CONF3 -->|Yes| OUT
    CONF3 -->|No| FLAG[Flag: low_confidence = true]
    FLAG --> OUT
    OUT --> EAGLE
    EAGLE --> MONGO[(MongoDB<br/>DocumentChunks)]
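
The escalation in the flowchart is a cascade: try the cheapest pipeline first and move to the next one only when confidence falls below the threshold. The sketch below shows that control flow only. It is written in JavaScript to match the other snippets on this page, although docling-service itself is Python. The stage objects and the `extractWithFallback` name are hypothetical, and each stage is assumed to resolve to `{ chunks, confidence }`.

```javascript
// Sketch of the confidence cascade; stages are hypothetical stand-ins for
// the standard pipeline, Granite-Docling-258M, and the remote VLM.
async function extractWithFallback(file, stages, threshold = 0.70) {
  if (stages.length === 0) {
    throw new Error('At least one extraction stage is required');
  }
  let last = null;
  for (const stage of stages) {
    const result = await stage.run(file);
    last = { ...result, pipeline_used: stage.name };
    if (result.confidence >= threshold) {
      // Good enough: stop here so later (costlier) stages never run.
      return { ...last, low_confidence: false };
    }
  }
  // Every stage fell short: return the final attempt, flagged for review.
  return { ...last, low_confidence: true };
}
```

Because later stages run only on failure, the remote (paid) stage is reached only for documents that both local pipelines could not handle.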

Deployment (Helm):

# docling-service pod (OpenShift, CPU-only)
image: custom Dockerfile (python + docling + tesseract-ocr)
resources:
  requests: { cpu: 500m, memory: 1Gi }
  limits: { cpu: 2000m, memory: 2Gi }
replicas: 1    # scale to 3+ during ETL batch
env:
  AZURE_OPENAI_ENDPOINT: (from secret)
  AZURE_OPENAI_API_KEY: (from secret)
  VLM_CONFIDENCE_THRESHOLD: "0.70"
  OCR_ENGINE: "tesseract_cli"

API contract (POST /extract):

// Response
{
  "metadata": {
    "format": "pdf", "pages": 42, "confidence": 0.87,
    "tables_detected": 3, "pipeline_used": "standard"
  },
  "chunks": [
    { "text": "...", "heading": "Section 1.2", "page": 3, "type": "paragraph" },
    { "text": "...", "heading": "Table 4", "page": 7, "type": "table" }
  ],
  "full_text": "...",
  "markdown": "..."
}
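
On the eagle-api side, the response above has to be turned into one record per chunk before it is stored. A minimal sketch of that mapping follows. The output field names (`documentId`, `chunkIndex`, `content`, and so on) are assumptions, not the actual DocumentChunks schema; `content` is chosen only because the Typesense embedding field in Work Package 5 reads from a field of that name.

```javascript
// Hypothetical mapper from a POST /extract response to chunk records.
// Output field names are illustrative, not the real DocumentChunks schema.
function toChunkRecords(documentId, extractResponse) {
  const { metadata, chunks } = extractResponse;
  return chunks.map((chunk, index) => ({
    documentId,
    chunkIndex: index,                 // preserves reading order
    content: chunk.text,
    heading: chunk.heading ?? null,
    page: chunk.page ?? null,
    type: chunk.type,                  // e.g. 'paragraph' or 'table'
    pipelineUsed: metadata.pipeline_used,
    confidence: metadata.confidence,   // document-level score from the service
  }));
}
```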

Work Package 3: Rule-Based Classifier (AICT v1)

Effort: ~5 days | Cost: $0/document (no LLM)

Keyword scoring against YAML-defined controlled vocabularies:

# vocabularies/document-types.yaml
- name: "Environmental Assessment Certificate"
  indicators: ["certificate", "EA certificate", "environmental assessment certificate"]
  weight: 3
- name: "Application Information Requirements"
  indicators: ["AIR", "application information requirements", "valued component"]
  weight: 2
- name: "Assessment Report"
  indicators: ["assessment report", "findings", "recommendation"]
  weight: 1

Algorithm: Count keyword occurrences in extracted text → score → rank → return top-3 suggestions with confidence.

Human always reviews. Classifier suggests; staff confirms or overrides during Metadata Verification Hold. No auto-classification without human sign-off.
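
The scoring algorithm above can be sketched in a few lines. The proposal does not define how confidence is computed, so this sketch assumes it is each type's share of the total score; the function names are hypothetical, and the vocabulary is assumed to be the already-parsed contents of `vocabularies/document-types.yaml`.

```javascript
// Count non-overlapping occurrences of a phrase in a text.
function countOccurrences(text, phrase) {
  if (phrase.length === 0) return 0;
  let count = 0;
  let from = 0;
  while (true) {
    const at = text.indexOf(phrase, from);
    if (at === -1) return count;
    count += 1;
    from = at + phrase.length;
  }
}

// Sketch of the keyword classifier. Matching is case-insensitive substring
// matching, so a short indicator such as "AIR" also matches inside "repair".
function classify(text, vocabulary, topN = 3) {
  const haystack = text.toLowerCase();
  const scored = vocabulary
    .map((entry) => {
      const hits = entry.indicators.reduce(
        (sum, phrase) => sum + countOccurrences(haystack, phrase.toLowerCase()),
        0
      );
      return { name: entry.name, score: hits * entry.weight };
    })
    .filter((entry) => entry.score > 0);

  // Assumption: confidence = this type's share of all matched score.
  const total = scored.reduce((sum, entry) => sum + entry.score, 0);
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(({ name, score }) => ({ name, confidence: score / total }));
}
```

The return shape matches the `{ name, confidence }` entries of `aictSuggestions.documentType` in Work Package 1. Overlapping indicators (for example "certificate" and "EA certificate") each count, which inflates scores for types with nested phrases; word-boundary matching would be a natural refinement.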

Work Package 4: Metadata Verification Hold

Effort: ~3 days | Cost: $0

Upload → auto-extract → auto-classify → HOLD → staff reviews → admit

Documents in holdState: 'staged' are NOT synced to Typesense (invisible to search)
Staff reviews AICT suggestions, corrects/approves metadata
On admission: holdState → 'admitted', change stream picks up, syncs to Typesense

This is a UI + workflow concern on eagle-admin. Backend is just a status field + sync filter.
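
The sync filter mentioned above could look like the sketch below. The function name and the action strings are hypothetical; `operationType` and `fullDocument` are standard MongoDB change event fields. One assumption is worth checking before rollout: documents created before `holdState` existed carry no value for it, and the sketch treats them as admitted so that existing search results do not disappear.

```javascript
// Hypothetical decision helper for the change-stream sync worker.
// Returns what the worker should do with a MongoDB change event.
function syncActionFor(changeEvent) {
  if (changeEvent.operationType === 'delete') {
    return 'delete';
  }
  const doc = changeEvent.fullDocument;
  if (!doc) {
    return 'skip'; // no document body available on this event
  }
  // Assumption: legacy documents without holdState count as admitted.
  const holdState = doc.holdState ?? 'admitted';
  return holdState === 'admitted' ? 'upsert' : 'skip';
}
```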

Work Package 5: Typesense Schema Update

Effort: ~1 day | Cost: $0

Add DFL fields to existing documents collection schema + enable embeddings on document_chunks:

// Add to documents collection
{ name: 'visibility', type: 'string', facet: true },
{ name: 'documentType', type: 'string', facet: true },
{ name: 'topics', type: 'string[]', facet: true },
{ name: 'orcsCode', type: 'string', facet: true },
{ name: 'holdState', type: 'string' },

// Enable semantic search on document_chunks (one-line change)
{ name: 'embedding', type: 'float[]', embed: {
    from: ['content'],
    model_config: { model_name: 'ts/e5-small' }
}}

Semantic search is then available at $0/query — Typesense generates embeddings locally.

Work Package 6: eagle-admin UI (DFL Interface)

Effort: ~10 days | Cost: $0

Staff-facing Angular interface for:

Upload with hold workflow
AICT suggestion review (accept/reject/override)
Search with DFL facets (type, topic, ORCS, visibility)
Batch operations (admit multiple, change visibility)

Reuses existing Angular patterns in eagle-admin.

Work Package 7: Priority 1A ETL (Existing EPIC Documents)

Cost: $0 standard pipeline / ~$15–75 remote VLM fallback for degraded docs

Batch-process existing 65,000 EPIC documents:

1. Read each file from the current MinIO storage.
2. POST it to docling-service (scale replicas to 3+ for throughput).
3. Store the returned chunks in MongoDB.
4. Run the extracted text through the classifier.
5. Stage with holdState: 'staged' for review (or auto-admit known-good documents).
6. The change stream syncs admitted docs to Typesense.

Pattern already exists in full-sync.js (zero-downtime reindex). Adapt for ETL. Scale docling-service replicas during batch to parallelize.
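
Scaling replicas only helps if the ETL driver keeps several requests in flight at once. A minimal sketch of a bounded-concurrency runner follows. The name `runWithConcurrency` is hypothetical and this is not code from full-sync.js; the `worker` callback stands in for the per-document steps (read, extract, store, classify).

```javascript
// Process items with at most `limit` workers in flight.
// Failures are recorded per item so one bad document cannot stop the batch.
async function runWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function lane() {
    while (next < items.length) {
      const index = next++; // safe: JavaScript runs this synchronously
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, lane));
  return results;
}
```

Setting `limit` close to the number of docling-service replicas is a reasonable starting point, since each CPU-only pod is assumed to handle roughly one heavy document at a time.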

Work Package 8: Conversational Search (Optional v2)

Effort: ~1 day config | Cost: ~$0.0005/AI query (only when user requests)

Typesense 30.x has built-in Conversational Search. One config object:

{
  "id": "dfl-conv-model",
  "model_name": "openai/gpt-4.1-nano",
  "system_prompt": "Answer ONLY from provided context. Cite document name and page.",
  "max_bytes": 16384
}

The user triggers an AI answer explicitly; it is never automatic. Source documents are always shown alongside the answer so staff can verify it, which limits the impact of any hallucinated content.
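
The opt-in behaviour can be kept in one place by building the search parameters from a single flag. In the sketch below, the function name and the `aiAnswer` option are hypothetical. `conversation` and `conversation_model_id` are Typesense search parameters, but the exact `query_by` requirements for conversational queries are an assumption here and should be checked against the Typesense documentation for the deployed version.

```javascript
// Matches the "id" of the conversation model config shown above.
const CONVERSATION_MODEL_ID = 'dfl-conv-model';

// Build Typesense search parameters; AI answers are strictly opt-in.
function buildSearchParams(query, { aiAnswer = false } = {}) {
  const params = {
    q: query,
    query_by: 'content,embedding', // keyword + semantic (hybrid) search
    exclude_fields: 'embedding',   // do not ship raw vectors to the client
  };
  if (aiAnswer) {
    // Assumption: conversational queries run against the embedding field.
    params.query_by = 'embedding';
    params.conversation = true;
    params.conversation_model_id = CONVERSATION_MODEL_ID;
  }
  return params;
}
```

With the default call, no conversation parameters are sent, so no LLM cost is incurred unless the user asks for an answer.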

What We're NOT Building (Cost Avoidance)

EPIC.search Component	Cost	Our Alternative	Our Cost
PostgreSQL + pgvector	~$100/mo	Typesense (already running)	$0
search-vector-api (Flask)	Ops overhead	Typesense built-in embeddings	$0
search-model (Ollama/vLLM)	~$200-500/mo GPU	Typesense `ts/e5-small`	$0
search-api (Flask + OpenAI)	~$200-1500/mo API	Typesense Conversational Search	~$1–3/mo
search-web (React)	Dev time	eagle-admin/eagle-public (existing)	$0
sentence-transformers (Python)	Ops overhead	Typesense built-in	$0
Custom RAG pipeline	Dev time + LLM cost	Typesense built-in	Config only
LLM classification per doc	~$1950 one-time + ongoing	Rule-based keyword scoring	$0
Custom PDF/DOCX/XLSX extractors (6+ libs)	Maintenance	Docling (single unified pipeline)	$0
Custom chunking logic	Maintenance	Docling HierarchicalChunker	$0
Azure Document Intelligence	~$293 one-time	Docling Tesseract + VLM fallback	~$15-75

Architecture Summary

graph TB
    subgraph "✅ Existing — OpenShift (No Changes, Free)"
        CLAM[ClamAV]
        S3[(S3 Object Storage)]
        MONGO[(MongoDB)]
        PROXY[Typesense RBAC Proxy]
        SYNC[Change Stream Sync]
    end

    subgraph "🆕 New — OpenShift (Free)"
        DOCLING[docling-service<br/>Python microservice<br/>Parse + OCR + Tables + Chunk]
        GRANITE_VLM[Granite-Docling-258M<br/>Local VLM, CPU-only]
        CLASS[Rule-Based Classifier]
        HOLD[Metadata Verification Hold]
        SCHEMA[Schema: DFL fields]
        EMBED[Typesense Embeddings<br/>ts/e5-small]
    end

    subgraph "🔮 Azure API calls (pay-per-use)"
        VLM[GPT-4.1-mini Vision<br/>Remote VLM fallback]
        CONV[GPT-4.1-nano<br/>Conversational search, v2]
    end

    DOCLING --> CLASS
    DOCLING -.-> GRANITE_VLM
    GRANITE_VLM -.-> VLM
    CLASS --> HOLD
    HOLD --> SYNC
    SYNC --> EMBED
    EMBED -.-> CONV

    classDef existing fill:#2a4a2a,stroke:#1a3a1a,color:#fff
    classDef new fill:#4a3a1a,stroke:#3a2a0a,color:#fff
    classDef future fill:#4a1a1a,stroke:#3a0a0a,color:#fff

    class CLAM,S3,MONGO,PROXY,SYNC existing
    class DOCLING,GRANITE_VLM,CLASS,HOLD,SCHEMA,EMBED new
    class VLM,CONV future

Cost Summary

Item	Monthly	One-Time
Typesense (OpenShift pod)	$0	—
MongoDB (OpenShift pod)	$0	—
S3 Object Storage (OpenShift)	$0	—
ClamAV (OpenShift pod)	$0	—
docling-service (OpenShift pod, CPU)	$0	$0
Rule-based classifier	$0	$0
VLM fallback — Granite-Docling-258M local + GPT-4.1-mini remote (~2%)	~$0–0.10/day	~$15–75 (ETL batch)
Conversational Search — GPT-4.1-nano (v2, user-triggered)	~$1-3	—
Total	$0 (v1) / $1-3 (v2)	$15–75

Pricing basis (May 2026): GPT-4.1-nano: $0.10/1M input, $0.40/1M output (~$0.0005/query). GPT-4.1-mini: $0.40/1M input, $1.60/1M output (~$0.003/page VLM). Granite-Docling-258M: $0 (local CPU). Azure DI Layout (upgrade path): $10/1K pages. OpenShift: $0. Docling: $0 (MIT).
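
The per-query figure can be reproduced from the token prices with a short calculation. The token counts below are assumptions: a full 16,384-byte context is taken as about 4,096 tokens (roughly 4 bytes per token) and an answer as about 250 tokens. The same arithmetic shows what the $15–75 ETL budget implies at ~$0.003/page.

```javascript
// Prices in USD per 1M tokens, taken from the pricing basis above.
const PRICES = {
  'gpt-4.1-nano': { input: 0.10, output: 0.40 },
  'gpt-4.1-mini': { input: 0.40, output: 1.60 },
};

function callCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1e6;
}

// How many pages or queries a budget buys at a flat unit price.
function unitsForBudget(budgetUsd, unitPriceUsd) {
  return Math.round(budgetUsd / unitPriceUsd);
}

// Assumed token counts: 4,096 in, 250 out (see lead-in).
const perQueryUsd = callCostUsd('gpt-4.1-nano', 4096, 250);

console.log(perQueryUsd.toFixed(4));                                // 0.0005
console.log(unitsForBudget(15, 0.003), unitsForBudget(75, 0.003));  // 5000 25000
```

In other words, the one-time ETL range corresponds to roughly 5,000 to 25,000 pages reaching the remote VLM.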

Effort Summary

Work Package	Dependencies
WP1: Schema extension	None
WP2: docling-service deployment	None
WP3: Rule-based classifier	WP2
WP4: Metadata verification hold	WP1
WP5: Typesense schema update	WP1
WP6: eagle-admin UI	WP1, WP4, WP5
WP7: Priority 1A ETL	WP2, WP3
WP8: Conversational search (v2)	WP5

Critical path: WP1 and WP2 (in parallel) → WP3 → WP7 (ETL). UI work (WP6) can start in parallel once WP1 is done, but it cannot be completed until WP4 and WP5 are in place.
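
The dependency table can be turned into a build order mechanically. The sketch below groups work packages into waves, where every package in a wave depends only on packages from earlier waves. The `scheduleWaves` helper is illustrative and not part of any Eagle tooling.

```javascript
// Dependencies copied from the table above.
const DEPENDENCIES = {
  WP1: [],
  WP2: [],
  WP3: ['WP2'],
  WP4: ['WP1'],
  WP5: ['WP1'],
  WP6: ['WP1', 'WP4', 'WP5'],
  WP7: ['WP2', 'WP3'],
  WP8: ['WP5'],
};

// Group packages into waves that can each be worked on in parallel.
function scheduleWaves(deps) {
  const done = new Set();
  const waves = [];
  let remaining = Object.keys(deps);
  while (remaining.length > 0) {
    const ready = remaining.filter((wp) => deps[wp].every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error('Dependency cycle detected');
    }
    ready.forEach((wp) => done.add(wp));
    waves.push(ready);
    remaining = remaining.filter((wp) => !done.has(wp));
  }
  return waves;
}

console.log(scheduleWaves(DEPENDENCIES));
// [ [ 'WP1', 'WP2' ], [ 'WP3', 'WP4', 'WP5' ], [ 'WP6', 'WP7', 'WP8' ] ]
```

Waves show the earliest point at which each package is unblocked; they say nothing about duration, so the critical path still depends on the effort estimates.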

Decision Points

See Technical Decisions for research-backed analysis of each.

VLM fallback threshold: What confidence score triggers Granite-Docling-258M (local) and then Azure OpenAI GPT-4.1-mini (remote)?
- Recommended: 0.70 (pages below this get VLM re-processing)
- Tunable via env var, no code change needed
Conversational Search timing: v1 (ship with search) or v2 (after core DFL)?
- Recommendation: v2. Core DFL value is filing + finding, not AI chat.
Auto-admit threshold: Should high-confidence classifications skip hold?
- Recommendation: No. Human review for v1. Consider auto-admit for v2 after classifier is proven.

Principles

OpenShift-first — If it runs in a container, it runs in OpenShift (free). Azure only for AI API calls.
No AI where rules work — Keyword scoring at $0 beats LLM at $0.03/doc
Leverage existing infra — Eagle has 70% of what DFL needs running in OpenShift today
Human in the loop — AICT suggests, staff decides. No black-box classification.
AI only when user asks — Conversational search is opt-in, not default
Source verification — Every AI answer shows source documents. Anti-hallucination by design.
Extend, don't replace — Same codebase, same team, same deployment pipeline
