Project Plan

Daniel Truong edited this page May 26, 2026 · 2 revisions

Project Plan — DEMI

Business needs → technical work → delivery sequence.


Business Needs

The Environmental Assessment Office (EAO) needs to modernize document management for Environmental Assessments (EAs). Three developer briefs define the scope:

1. Digital File Library (DFL)

Problem: Staff cannot efficiently find, file, or retrieve EA documents. 65,000+ existing documents in EPIC have limited metadata, no full-text search, and no structured filing system.

Business Outcomes:

  • Staff find documents in seconds (full-text + faceted search)
  • Documents are classified by type, topic, and ORCS code
  • Visibility controls enforce who sees what (team → EAO → IDIR → public)
  • Paper records and legacy scans become searchable
  • Filing backlog (27,000 paper records) has a digital path forward
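
The visibility ladder above (team → EAO → IDIR → public) can be read as an ordered scale. The sketch below is illustrative only: it assumes each audience contains the one before it, and the names `Visibility` and `can_view` are hypothetical, not part of eagle-api. In EPIC itself the existing RBAC layer would remain the enforcement point.

```python
from enum import IntEnum


class Visibility(IntEnum):
    """Audiences ordered from narrowest to broadest (assumed nesting)."""
    TEAM = 0
    EAO = 1
    IDIR = 2
    PUBLIC = 3


def can_view(viewer_audience: Visibility, doc_visibility: Visibility) -> bool:
    """Return True when the viewer may see the document.

    viewer_audience is the narrowest audience the viewer belongs to.
    A document is visible when it is shared at least that broadly.
    """
    return viewer_audience <= doc_visibility
```

Under this assumption a team member sees everything, while an anonymous public visitor sees only documents marked `PUBLIC`.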

2. AI Classifier Tool (AICT)

Problem: Manual metadata entry for every uploaded document is slow and inconsistent. Staff skip fields, use wrong categories, or leave documents untagged.

Business Outcomes:

  • Uploaded documents get automatic metadata suggestions (type, topic, ORCS)
  • Staff review and confirm AI suggestions; the AI assists staff and does not replace their judgement
  • Classification is consistent across 65,000 documents
  • Zero per-document cost (rule-based, no AI billing)
  • Future LLM upgrade path exists if accuracy is insufficient

3. EPIC.system Integration

Problem: DFL/AICT must work within the existing EPIC platform (eagle-api, eagle-admin, eagle-public), not as standalone services.

Business Outcomes:

  • No new applications to deploy or maintain
  • Existing authentication, RBAC, upload pipelines reused
  • Single system of record (MongoDB) for all document metadata
  • Search works across both existing EPIC content and new DFL filings

How Work Maps to Business Needs

| Business Need | Work Package(s) | Outcome |
| --- | --- | --- |
| Find documents fast | WP5 (Typesense schema) + WP6 (UI) | Full-text + semantic + faceted search |
| Classify documents | WP3 (Rule-based classifier) | Auto-suggestions, $0/doc |
| Human review before publish | WP4 (Metadata hold) | Nothing goes live without staff approval |
| Search scanned PDFs | WP2 (docling-service) | Tesseract CLI OCR + TableFormer, CPU-only |
| Search Office/Email/HTML | WP2 (docling-service) | Unified parsing for all formats |
| Handle complex/degraded docs | WP2 (docling-service VLM) | Granite-Docling-258M local + Azure OpenAI GPT-4.1-mini vision remote fallback |
| Handle legacy 65K docs | WP7 (ETL) | Batch reprocess via scaled docling replicas |
| Search across entities | WP5 (Typesense federated) | One search bar → docs + projects + activities |
| Search UX (suggestions, groups) | WP5 (Typesense analytics + grouped hits) | Query suggestions, results grouped by project |
| Admin result control | WP5 (Typesense curation) | Pin important docs, bury superseded, domain synonyms |
| Visibility controls | WP1 (Schema) + existing RBAC | Proxy allowed_roles already enforces visibility |
| AI answers (future) | WP8 (Conversational search) | GPT-4.1-nano via Typesense, user-triggered |
| No new infrastructure cost | All | Everything in OpenShift (free), Azure API calls only |

Delivery Phases

Phase 1: Foundation

| WP | Task | Delivers |
| --- | --- | --- |
| WP1 | Schema extension: add DFL fields to Document model | Visibility, classification, holdState fields |
| WP5 | Typesense schema update: facets, embeddings, synonyms, analytics, curation | Semantic + hybrid search, query suggestions, grouped results, domain synonyms |
| WP4 | Metadata verification hold: staged/admitted workflow | Upload → hold → review → publish pipeline |

Milestone: Documents can be uploaded with DFL metadata and held for review.
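
A minimal sketch of the staged/admitted hold, assuming every upload starts as staged and only a named reviewer can admit it. `HoldState`, `Document`, `admit`, and `is_published` are hypothetical names for illustration; the real fields would live on the EPIC Document model extended in WP1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HoldState(Enum):
    STAGED = "staged"      # uploaded, waiting for staff review
    ADMITTED = "admitted"  # reviewed and approved for publishing


@dataclass
class Document:
    name: str
    hold_state: HoldState = HoldState.STAGED  # assumption: uploads start on hold
    reviewed_by: Optional[str] = None


def admit(doc: Document, reviewer: str) -> Document:
    """Move a staged document to admitted after a named staff review."""
    if doc.hold_state is not HoldState.STAGED:
        raise ValueError(f"{doc.name} is already {doc.hold_state.value}")
    if not reviewer:
        raise ValueError("a reviewer is required to admit a document")
    doc.hold_state = HoldState.ADMITTED
    doc.reviewed_by = reviewer
    return doc


def is_published(doc: Document) -> bool:
    """Only admitted documents are treated as live."""
    return doc.hold_state is HoldState.ADMITTED
```

Keeping the transition in one function makes "nothing goes live without staff approval" a single check rather than a convention.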

Phase 2: Intelligence

| WP | Task | Delivers |
| --- | --- | --- |
| WP2 | docling-service: Python microservice for parsing, OCR, tables, chunking | All document formats → searchable, chunked text (RAG-ready) |
| WP3 | Rule-based classifier: keyword scoring + vocabularies | Auto-suggest type, topics, ORCS code |

Milestone: Uploaded documents get auto-extracted text (via Docling) and classification suggestions.
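
Keyword scoring against a vocabulary could look roughly like the sketch below. The vocabulary entries, the weights, and the function name `suggest_types` are all hypothetical placeholders; the real document types, topics, and ORCS mappings are still to be defined.

```python
import re
from collections import Counter

# Hypothetical vocabulary: document type -> weighted keywords.
VOCABULARY = {
    "Inspection Report": {"inspection": 3, "compliance": 2, "site": 1},
    "Application": {"application": 3, "proponent": 2, "proposed": 1},
    "Order": {"order": 3, "minister": 2, "pursuant": 1},
}


def suggest_types(text, vocabulary=VOCABULARY, top_n=3):
    """Return up to top_n (label, score) pairs, highest score first."""
    # Count lowercase word occurrences in the extracted text.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each label as the weighted sum of its keyword counts.
    scores = {
        label: sum(weight * words[kw] for kw, weight in keywords.items())
        for label, keywords in vocabulary.items()
    }
    # Sort by score descending, then label, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(label, score) for label, score in ranked if score > 0][:top_n]
```

Labels with a zero score are dropped, so a document that matches nothing gets no suggestion instead of an arbitrary one.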

Phase 3: Interface (parallel with Phase 2)

| WP | Task | Delivers |
| --- | --- | --- |
| WP6 | eagle-admin UI: DFL search, upload, review screens | Staff can search, upload, review, admit documents |

Milestone: Staff-usable DFL interface in eagle-admin.

Phase 4: Migration

| WP | Task | Delivers |
| --- | --- | --- |
| WP7 | Priority 1A ETL: batch-process 65K existing docs (scale docling replicas) | Legacy documents searchable + classified |

Milestone: All existing EPIC documents available in DFL search.
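
One possible way to sequence the batch run across docling replicas, sketched under assumptions: each document record carries a hypothetical `access_count`, and work is dealt round-robin so every replica starts with the most-accessed documents. `plan_batches` and the record shape are illustrative, not an existing EPIC API.

```python
def plan_batches(docs, replicas=3):
    """Order docs by access count (highest first) and deal them to replicas.

    docs is a list of dicts with hypothetical keys "id" and "access_count".
    Returns one queue of document ids per replica.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    ordered = sorted(docs, key=lambda d: d["access_count"], reverse=True)
    queues = [[] for _ in range(replicas)]
    for index, doc in enumerate(ordered):
        # Round-robin keeps the queues balanced to within one document.
        queues[index % replicas].append(doc["id"])
    return queues
```

Because the most-accessed documents sit at the head of every queue, they become searchable first even if the full run takes longer than planned.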

Phase 5: AI Enhancement (post-launch, optional)

| WP | Task | Delivers |
| --- | --- | --- |
| WP8 | Conversational search: Typesense + GPT-4.1-nano | "Ask a question" AI answers with source citations |

Milestone: Users can ask natural language questions and get answers grounded in actual documents.


Dependency Graph

```mermaid
graph LR
    WP1[WP1: Schema] --> WP4[WP4: Metadata Hold]
    WP1 --> WP5[WP5: Typesense Schema]
    WP2[WP2: docling-service] --> WP3[WP3: Classifier]
    WP3 --> WP7[WP7: ETL 65K docs]
    WP4 --> WP6[WP6: eagle-admin UI]
    WP5 --> WP6
    WP5 --> WP8[WP8: Conversational Search]

    classDef critical fill:#4a1a1a,stroke:#3a0a0a,color:#fff
    classDef parallel fill:#1a3a4a,stroke:#0a2a3a,color:#fff
    classDef optional fill:#3a3a1a,stroke:#2a2a0a,color:#fff
    class WP1,WP2,WP3,WP7 critical
    class WP4,WP5,WP6 parallel
    class WP8 optional
```

Critical path: WP1 + WP2 (parallel, no dependency) → WP3 → WP7 (searchable legacy corpus).

Parallel track: WP4/WP5 → WP6 (UI) runs alongside intelligence work.
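
The dependency edges can also be checked mechanically. The sketch below copies the edges from the graph into a dictionary and uses Python's standard `graphlib` to group work packages into waves that could start together. The function name `delivery_waves` is hypothetical, and the waves reflect dependencies only, not the staffing or scheduling choices behind the delivery phases.

```python
from graphlib import TopologicalSorter

# Each work package mapped to the packages it depends on (from the graph).
DEPENDS_ON = {
    "WP1": set(),
    "WP2": set(),
    "WP3": {"WP2"},
    "WP4": {"WP1"},
    "WP5": {"WP1"},
    "WP6": {"WP4", "WP5"},
    "WP7": {"WP3"},
    "WP8": {"WP5"},
}


def delivery_waves(depends_on):
    """Group work packages into waves whose dependencies are all complete."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on a circular dependency
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the edges above this yields WP1 and WP2 first, then WP3, WP4, and WP5, then WP6, WP7, and WP8.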


Cost Summary

| Phase | Monthly | One-Time | Notes |
| --- | --- | --- | --- |
| Phase 1–4 (v1) | $0 | ~$15–75 | Remote VLM fallback for degraded docs during ETL |
| Phase 5 (v2 AI answers) | $1–3 | | GPT-4.1-nano, user-triggered only |

Full pricing breakdown: Implementation Proposal


Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Docling accuracy on degraded scans | Some docs poorly extracted | Two-tier VLM: Granite-Docling-258M (local CPU) → Azure OpenAI GPT-4.1-mini vision (remote). Upgrade path: Azure DI Layout ($10/1K pages) |
| Rule-based accuracy below 60% | Staff find suggestions unhelpful | Upgrade to Typesense embedding similarity (still $0) |
| 65K ETL takes too long | Delayed availability | Scale docling replicas to 3+, prioritize most-accessed docs |
| docling-service resource usage | Pod eviction/OOM | Set memory limits at 2Gi, monitor during ETL, scale horizontally |
| UI complexity | Timeline slip | Ship minimal viable search first, iterate |
| Staff adoption | DFL unused | Involve staff in hold workflow design, make it faster than current process |

Decision Points (Need Business Input)

| # | Decision | Options | Recommendation |
| --- | --- | --- | --- |
| 1 | VLM confidence threshold? | A) 0.70 (more Azure calls), B) 0.50 (only worst docs) | A: better quality, small cost delta |
| 2 | Conversational search timing? | A) Ship with v1, B) Add after core DFL proven | B: core value is filing + finding |
| 3 | Auto-admit high-confidence classifications? | A) Always hold, B) Auto-admit >95% confidence | A for v1, revisit after accuracy data |
| 4 | Paper records (27K) timeline? | A) After digital, B) Parallel | A: prove pipeline on digital first |
| 5 | Public search (eagle-public) scope? | A) Staff-only initially, B) Public from day 1 | A: staff validates before public exposure |
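
To make decision 1 concrete, the sketch below shows how the threshold would route documents between the local and remote tiers. It assumes the local model reports a confidence between 0 and 1 and that a document falls back to the remote tier when that confidence is below the threshold. The function names and the sample scores are illustrative, not measured data.

```python
def needs_remote_fallback(local_confidence: float, threshold: float) -> bool:
    """True when the local result is too uncertain and the remote tier is used."""
    return local_confidence < threshold


def fallback_share(confidences, threshold):
    """Fraction of documents that would be sent to the remote tier."""
    if not confidences:
        return 0.0
    sent = sum(1 for c in confidences if needs_remote_fallback(c, threshold))
    return sent / len(confidences)


# Made-up confidence scores, for illustration only.
SAMPLE_CONFIDENCES = [0.95, 0.82, 0.66, 0.48, 0.91]
```

With these made-up scores, option A (0.70) sends 2 of 5 documents to the remote tier and option B (0.50) sends 1 of 5. Running the same calculation over real ETL confidence scores would put a number on the cost delta between the two options.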

Success Metrics

| Metric | Target | How Measured |
| --- | --- | --- |
| Document findability | <5 seconds to locate any document | Search latency + user feedback |
| Classification accuracy | >70% top-3 suggestions correct | Staff override rate during hold |
| Filing speed | 2x faster than current manual process | Time-to-admit per document |
| Corpus coverage | 100% of 65K docs searchable | ETL completion tracking |
| Cost | $0/month for v1 operations | Azure billing dashboard |
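
The classification accuracy target could be computed from hold-review records roughly as follows. This is a sketch that assumes each review stores the ranked suggestions and the label staff finally confirmed. The record shape and the name `top_k_accuracy` are hypothetical.

```python
def top_k_accuracy(reviews, k=3):
    """Share of reviews where the confirmed label is among the top k suggestions.

    reviews is a list of (suggested_labels, confirmed_label) pairs, with
    suggested_labels ordered from most to least likely.
    """
    if not reviews:
        return 0.0
    hits = sum(1 for suggested, confirmed in reviews if confirmed in suggested[:k])
    return hits / len(reviews)


# Made-up review records, for illustration only.
SAMPLE_REVIEWS = [
    (["Order", "Application", "Report"], "Order"),
    (["Report", "Order", "Application"], "Application"),
    (["Application", "Report", "Order", "Letter"], "Letter"),
    (["Report"], "Report"),
]
```

On the made-up records, three of four confirmed labels appear in the top three suggestions, giving 0.75, which would meet the >70% target.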

Next Steps

  1. Approve plan: confirm scope and decision points
  2. Start WP1: schema extension (unblocks WP4 and WP5; WP2 can start in parallel)
  3. Define vocabularies: document types, topics, ORCS mappings for classifier
  4. UI wireframes: eagle-admin DFL screens (can proceed in parallel with WP1–WP3)
