Daniel Truong edited this page May 26, 2026 · 4 revisions

DEMI — Digital Ecosystem Modernization Initiative

Architecture documentation for BC EAO's Digital File Library (DFL) and AI Classifier Tool (AICT).

Key Decision

Extend the existing Eagle ecosystem (eagle-api + Typesense) with a Docling microservice for document ingestion. eagle-api handles workflow/auth/classification; docling-service handles all document parsing, OCR, table extraction, and chunking as a separate Python pod.

Architecture Summary

OpenShift (free, CPU-only):
├── eagle-api (Node.js/Express) — workflow, auth, classification, API
├── docling-service (Python) — document parsing, OCR, tables, chunking
│   ├── Standard pipeline: Tesseract CLI + TableFormer (CPU)
│   └── Local VLM: Granite-Docling-258M (CPU, for complex layouts)
├── Typesense 30.x — keyword + semantic + faceted search + RAG
├── MongoDB — documents, metadata, audit log
├── S3 Object Storage — file binaries
└── ClamAV — virus scanning

Azure (pay-per-use API calls only):
├── GPT-4.1-mini vision — remote VLM fallback for degraded docs (~2%, after Granite fails)
└── GPT-4.1-nano — conversational search answers (v2, user-triggered)
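
The parsing tiers in the summary above (standard pipeline, then local Granite-Docling, then remote GPT-4.1-mini) can be read as an ordered fallback chain. The following is a minimal sketch of that routing only, not the docling-service implementation. The `ParseResult` type, the `parse_with_fallback` function, the per-page confidence score, and the `0.8` threshold are all hypothetical names and values introduced for illustration; the source does not say how a "failed" parse is detected.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A parser takes raw page bytes and returns (extracted_text, confidence in 0..1).
# ASSUMPTION: the confidence score is a hypothetical quality signal, not
# something the source documents for any of the three tiers.
Parser = Callable[[bytes], Tuple[str, float]]


@dataclass
class ParseResult:
    stage: str         # name of the tier that produced the text
    text: str          # extracted text
    confidence: float  # hypothetical quality score in 0..1


def parse_with_fallback(
    page: bytes,
    stages: Sequence[Tuple[str, Parser]],
    min_confidence: float = 0.8,  # ASSUMPTION: illustrative threshold
) -> ParseResult:
    """Try each tier in order; stop at the first result that is good enough.

    Later (more expensive) tiers are only called when every earlier tier
    scored below min_confidence. If no tier passes, the best attempt wins.
    """
    best: Optional[ParseResult] = None
    for name, parser in stages:
        text, confidence = parser(page)
        result = ParseResult(name, text, confidence)
        if confidence >= min_confidence:
            return result
        if best is None or confidence > best.confidence:
            best = result
    if best is None:
        raise ValueError("at least one stage is required")
    return best


if __name__ == "__main__":
    # Stub parsers stand in for Tesseract + TableFormer, Granite-Docling-258M,
    # and the remote GPT-4.1-mini call; the scores are made up.
    stages = [
        ("standard", lambda page: ("", 0.30)),
        ("granite-docling", lambda page: ("stub text", 0.92)),
        ("gpt-4.1-mini", lambda page: ("stub text", 0.99)),
    ]
    print(parse_with_fallback(b"%PDF-", stages).stage)  # prints "granite-docling"
```

Ordering the tiers from cheapest to most expensive keeps the paid Azure call as the last resort, which is consistent with the ~2% remote fallback figure quoted above.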

Documentation

| Document | Purpose |
|---|---|
| Project Plan | Business needs → work packages → delivery sequence |
| Technical Decisions | Research-backed technology choices with evidence |
| Architecture Overview | Master plan: Typesense-first, extend eagle-api |
| OCR Pipeline | OCR + text extraction detailed design |
| Eagle vs EPIC.search | Capability comparison: why Eagle is the better fit |
| ADR-001: Typesense | ADR: Typesense as unified search engine |
| ADR-002: Async Processing | ADR: MongoDB-based async job queue |
| ADR-003: Classification | ADR: No-LLM classification in v1 |
| Implementation Proposal | Costs, hosting, work packages |

Source Briefs

In briefs/:

  • Developer Brief - Digital File Library - APR 2026 2.docx
  • Developer Brief - AI Classifier Tool - APR 2026 3.docx
  • Developer Brief - EPIC.system Integration - APR 2026 1.docx
  • Context Based Tags in the DFL.vsdx

Cost

| Component | Monthly | Notes |
|---|---|---|
| OpenShift (all infra) | $0 | Free: Typesense, MongoDB, ClamAV, S3, and docling-service all run here |
| docling-service (CPU) | $0 | MIT license, CPU-only pod in OpenShift |
| VLM fallback (degraded docs) | ~$0–5 | Granite-Docling-258M local ($0) + GPT-4.1-mini remote (~2% of pages) |
| Conversational Search (optional v2) | ~$1–3 | GPT-4.1-nano at $0.0005/query, only when the user triggers it |
| Classification | $0 | Rule-based, no LLM |
| Total | $0–5 (v1) / $1–8 (v2) | vs $500–2,000/mo for the EPIC.search approach |

Pricing sources (May 2026): GPT-4.1-nano: $0.10 per 1M input tokens, $0.40 per 1M output tokens. GPT-4.1-mini: $0.40 per 1M input tokens, $1.60 per 1M output tokens. Granite-Docling-258M: $0 (local CPU). Azure Document Intelligence Layout (upgrade path): $10 per 1K pages. Docling: $0 (MIT license). One-time ETL: ~$15–75.
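
As a sanity check on the Conversational Search line, the per-query figure can be reproduced from the GPT-4.1-nano token prices quoted above. This is a rough worked example, not a measured result: the token counts per query (3,000 input, 500 output) and the monthly query volumes are assumptions chosen for illustration, and the function names are hypothetical.

```python
# GPT-4.1-nano prices from the pricing sources (USD per 1M tokens).
NANO_INPUT_PER_M = 0.10
NANO_OUTPUT_PER_M = 0.40

# ASSUMPTION: hypothetical token counts for one conversational search query
# (retrieved context plus question in, short answer out).
ASSUMED_INPUT_TOKENS = 3_000
ASSUMED_OUTPUT_TOKENS = 500


def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at the quoted per-token prices."""
    return (
        input_tokens * NANO_INPUT_PER_M + output_tokens * NANO_OUTPUT_PER_M
    ) / 1_000_000


def monthly_cost(queries_per_month: int, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a month of queries with uniform token counts."""
    return queries_per_month * query_cost(input_tokens, output_tokens)


if __name__ == "__main__":
    per_query = query_cost(ASSUMED_INPUT_TOKENS, ASSUMED_OUTPUT_TOKENS)
    print(f"per query: ${per_query:.4f}")  # prints "per query: $0.0005"

    # ASSUMPTION: illustrative monthly volumes, not usage data from the source.
    for queries in (2_000, 6_000):
        cost = monthly_cost(queries, ASSUMED_INPUT_TOKENS, ASSUMED_OUTPUT_TOKENS)
        print(f"{queries} queries/month: ${cost:.2f}")
```

Under these assumed token counts, the ~$1–3 monthly range corresponds to roughly 2,000 to 6,000 user-triggered queries per month.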

Related Repositories

| Repo | Role |
|---|---|
| eagle-api | Backend: extended with DFL features |
| eagle-admin | Staff UI: search + upload + metadata review |
| eagle-public | Public UI: document search |
| EPIC.search | Reference: OCR patterns salvageable, architecture not adopted |
