Home
Daniel Truong edited this page May 26, 2026 · 4 revisions
Architecture documentation for BC EAO's Digital File Library (DFL) and AI Classifier Tool (AICT).
Extend the existing Eagle ecosystem (eagle-api + Typesense) with a Docling microservice for document ingestion. eagle-api handles workflow/auth/classification; docling-service handles all document parsing, OCR, table extraction, and chunking as a separate Python pod.
OpenShift (free, CPU-only):
├── eagle-api (Node.js/Express) — workflow, auth, classification, API
├── docling-service (Python) — document parsing, OCR, tables, chunking
│ ├── Standard pipeline: Tesseract CLI + TableFormer (CPU)
│ └── Local VLM: Granite-Docling-258M (CPU, for complex layouts)
├── Typesense 30.x — keyword + semantic + faceted search + RAG
├── MongoDB — documents, metadata, audit log
├── S3 Object Storage — file binaries
└── ClamAV — virus scanning
Azure (pay-per-use API calls only):
├── GPT-4.1-mini vision — remote VLM fallback for degraded docs (~2%, after Granite fails)
└── GPT-4.1-nano — conversational search answers (v2, user-triggered)
| Document | Purpose |
|---|---|
| Project Plan | Business needs → work packages → delivery sequence |
| Technical Decisions | Research-backed technology choices with evidence |
| Architecture Overview | Master plan — Typesense-first, extend eagle-api |
| OCR Pipeline | OCR + text extraction detailed design |
| Eagle vs EPIC.search | Capability comparison — why Eagle is the better fit |
| ADR-001: Typesense | ADR: Typesense as unified search engine |
| ADR-002: Async Processing | ADR: MongoDB-based async job queue |
| ADR-003: Classification | ADR: No-LLM classification in v1 |
| Implementation Proposal | Costs, hosting, work packages |
In briefs/:
- Developer Brief - Digital File Library - APR 2026 2.docx
- Developer Brief - AI Classifier Tool - APR 2026 3.docx
- Developer Brief - EPIC.system Integration - APR 2026 1.docx
- Context Based Tags in the DFL.vsdx
| Component | Monthly | Notes |
|---|---|---|
| OpenShift (all infra) | $0 | Free — Typesense, MongoDB, ClamAV, S3, docling-service all run here |
| docling-service (CPU) | $0 | MIT license, CPU-only pod in OpenShift |
| VLM fallback (degraded docs) | ~$0–5 | Granite-Docling-258M local ($0) + GPT-4.1-mini remote (~2% of pages) |
| Conversational Search (optional v2) | ~$1–3 | GPT-4.1-nano at $0.0005/query, only when user triggers |
| Classification | $0 | Rule-based, no LLM |
| Total | $0–5 (v1) / $1–8 (v2) | vs $500–2000/mo for EPIC.search approach |
Pricing sources (May 2026): GPT-4.1-nano: $0.10/1M input, $0.40/1M output. GPT-4.1-mini: $0.40/1M input, $1.60/1M output. Granite-Docling-258M: $0 (local CPU). Azure DI Layout (upgrade path): $10/1K pages. Docling: $0 (MIT). One-time ETL: ~$15–75.
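The per-query and monthly figures in the cost table follow from the token prices listed above. The sketch below shows the arithmetic. The token counts per query (3,000 input, 500 output) and the monthly query volumes (2,000 to 6,000) are assumptions chosen to reproduce the $0.0005/query and ~$1–3/month figures; the source does not state them.

```python
# USD per 1M tokens, taken from the pricing sources line (May 2026).
INPUT_PER_M = {"gpt-4.1-nano": 0.10, "gpt-4.1-mini": 0.40}
OUTPUT_PER_M = {"gpt-4.1-nano": 0.40, "gpt-4.1-mini": 1.60}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one API call for the given token counts."""
    return (
        input_tokens * INPUT_PER_M[model] + output_tokens * OUTPUT_PER_M[model]
    ) / 1_000_000


def monthly_cost(model: str, calls: int, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of `calls` identical API calls in a month."""
    return calls * call_cost(model, input_tokens, output_tokens)


# Assumed query shape: 3,000 input tokens (question + retrieved chunks), 500 output.
per_query = call_cost("gpt-4.1-nano", 3_000, 500)
print(round(per_query, 6))  # 0.0005

# Assumed volumes: 2,000 and 6,000 user-triggered queries per month.
print(round(monthly_cost("gpt-4.1-nano", 2_000, 3_000, 500), 2))  # 1.0
print(round(monthly_cost("gpt-4.1-nano", 6_000, 3_000, 500), 2))  # 3.0
```

The same helper can be pointed at `gpt-4.1-mini` to estimate the remote VLM fallback, given a page count and per-page token usage for vision calls, neither of which the source specifies.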
| Repo | Role |
|---|---|
| eagle-api | Backend — extends with DFL features |
| eagle-admin | Staff UI — search + upload + metadata review |
| eagle-public | Public UI — document search |
| EPIC.search | Reference — OCR patterns salvageable, architecture not adopted |