Project Plan
Business needs → technical work → delivery sequence.
The EAO needs to modernize document management for Environmental Assessments. Three developer briefs define the scope:
Problem: Staff cannot efficiently find, file, or retrieve EA documents. 65,000+ existing documents in EPIC have limited metadata, no full-text search, and no structured filing system.
Business Outcomes:
- Staff find documents in seconds (full-text + faceted search)
- Documents are classified by type, topic, and ORCS code
- Visibility controls enforce who sees what (team → EAO → IDIR → public)
- Paper records and legacy scans become searchable
- Filing backlog (27,000 paper records) has a digital path forward
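The visibility ladder in the outcomes above (team → EAO → IDIR → public) can be read as a set of nested audiences. The sketch below is one hypothetical way to model that reading; the `Visibility` enum and `can_view` helper are illustrative names, not part of eagle-api, and the assumption that each tier strictly contains the previous one would need to be confirmed against the existing RBAC rules.

```python
from enum import IntEnum


class Visibility(IntEnum):
    """Hypothetical tiers, ordered from narrowest to widest audience."""
    TEAM = 0
    EAO = 1
    IDIR = 2
    PUBLIC = 3


def can_view(viewer_tier: Visibility, doc_visibility: Visibility) -> bool:
    """Return True if the document is released to the viewer's tier or wider.

    Assumes nested audiences: a project-team member also counts as EAO,
    IDIR, and public, so a narrower viewer tier can see everything a
    wider tier can.
    """
    return doc_visibility >= viewer_tier


# A document released to all of EAO is visible to the project team,
# but not to an anonymous public visitor.
print(can_view(Visibility.TEAM, Visibility.EAO))
print(can_view(Visibility.PUBLIC, Visibility.EAO))
```

In practice the check would likely stay in the existing RBAC proxy; the sketch only shows the ordering logic.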
Problem: Manual metadata entry for every uploaded document is slow and inconsistent. Staff skip fields, use wrong categories, or leave documents untagged.
Business Outcomes:
- Uploaded documents get automatic metadata suggestions (type, topic, ORCS)
- Staff review and confirm AI suggestions; the suggestions assist staff judgment rather than replace it
- Classification is consistent across 65,000 documents
- Zero per-document cost (rule-based, no AI billing)
- A future LLM upgrade path exists if rule-based accuracy proves insufficient
Problem: DFL/AICT must work within the existing EPIC platform (eagle-api, eagle-admin, eagle-public), not as standalone services.
Business Outcomes:
- No new applications to deploy or maintain
- Existing authentication, RBAC, upload pipelines reused
- Single system of record (MongoDB) for all document metadata
- Search works across both existing EPIC content and new DFL filings
| Business Need | Work Package(s) | Outcome |
|---|---|---|
| Find documents fast | WP5 (Typesense schema) + WP6 (UI) | Full-text + semantic + faceted search |
| Classify documents | WP3 (Rule-based classifier) | Auto-suggestions, $0/doc |
| Human review before publish | WP4 (Metadata hold) | Nothing goes live without staff approval |
| Search scanned PDFs | WP2 (docling-service) | Tesseract CLI OCR + TableFormer, CPU-only |
| Search Office/Email/HTML | WP2 (docling-service) | Unified parsing for all formats |
| Handle complex/degraded docs | WP2 (docling-service VLM) | Granite-Docling-258M local + Azure OpenAI GPT-4.1-mini vision remote fallback |
| Handle legacy 65K docs | WP7 (ETL) | Batch reprocess via scaled docling replicas |
| Search across entities | WP5 (Typesense federated) | One search bar → docs + projects + activities |
| Search UX (suggestions, groups) | WP5 (Typesense analytics + grouped hits) | Query suggestions, results grouped by project |
| Admin result control | WP5 (Typesense curation) | Pin important docs, bury superseded, domain synonyms |
| Visibility controls | WP1 (Schema) + existing RBAC proxy | `allowed_roles` already enforces visibility |
| AI answers (future) | WP8 (Conversational search) | GPT-4.1-nano via Typesense, user-triggered |
| No new infrastructure cost | All | Everything runs in OpenShift (free); the only billed items are Azure API calls |
| WP | Task | Delivers |
|---|---|---|
| WP1 | Schema extension — add DFL fields to Document model | Visibility, classification, holdState fields |
| WP5 | Typesense schema update — facets, embeddings, synonyms, analytics, curation | Semantic + hybrid search, query suggestions, grouped results, domain synonyms |
| WP4 | Metadata verification hold — staged/admitted workflow | Upload → hold → review → publish pipeline |
Milestone: Documents can be uploaded with DFL metadata and held for review.
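The staged/admitted hold in WP4 can be pictured as a small state machine. The following is a minimal sketch under stated assumptions: `staged` and `admitted` come from the plan, while the `rejected` state, the `HeldDocument` class, and its field names are hypothetical and only illustrate the idea that nothing is published without an explicit reviewer action.

```python
from dataclasses import dataclass, field

# "staged" and "admitted" come from the plan; "rejected" is a hypothetical extra state.
TRANSITIONS = {
    "staged": {"admitted", "rejected"},
    "rejected": {"staged"},  # a rejected upload can be corrected and re-staged
    "admitted": set(),       # terminal in this sketch
}


@dataclass
class HeldDocument:
    doc_id: str
    hold_state: str = "staged"  # every upload starts on hold
    history: list = field(default_factory=list)

    def transition(self, new_state: str, reviewer: str) -> None:
        """Move to new_state if allowed, recording who made the change."""
        if new_state not in TRANSITIONS[self.hold_state]:
            raise ValueError(
                f"cannot move from {self.hold_state!r} to {new_state!r}"
            )
        self.history.append((self.hold_state, new_state, reviewer))
        self.hold_state = new_state

    @property
    def is_published(self) -> bool:
        """Only admitted documents are treated as live."""
        return self.hold_state == "admitted"


doc = HeldDocument("doc-1")
doc.transition("admitted", reviewer="staff-a")
print(doc.is_published, doc.history)
```

Keeping the allowed transitions in one table makes it easy to add states later (for example, an auto-admit path) without touching the transition logic.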
| WP | Task | Delivers |
|---|---|---|
| WP2 | docling-service — Python microservice for parsing, OCR, tables, chunking | All document formats → searchable, chunked text (RAG-ready) |
| WP3 | Rule-based classifier — keyword scoring + vocabularies | Auto-suggest type, topics, ORCS code |
Milestone: Uploaded documents get auto-extracted text (via Docling) and classification suggestions.
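To make "keyword scoring + vocabularies" in WP3 concrete, here is a rough sketch of how such a classifier could rank document types. The vocabulary entries, weights, and the `suggest_types` function are invented for illustration; the real vocabularies are still to be defined, and the scoring scheme may well differ.

```python
import re

# Hypothetical vocabulary: document type -> {keyword: weight}.
VOCABULARY = {
    "Inspection Report": {"inspection": 3, "compliance": 2, "site visit": 2},
    "Public Comment": {"comment": 3, "submission": 1, "concern": 1},
    "Assessment Report": {"assessment": 3, "effects": 2, "mitigation": 2},
}


def suggest_types(text, vocabulary=VOCABULARY, top_n=3):
    """Return up to top_n (label, share_of_score) pairs, best first.

    Each whole-word keyword hit adds its weight to the label's score.
    The share is the label's score divided by the total score, which is
    a relative ranking signal, not a calibrated probability.
    """
    lowered = text.lower()
    scores = {}
    for label, keywords in vocabulary.items():
        score = 0
        for keyword, weight in keywords.items():
            pattern = r"\b" + re.escape(keyword) + r"\b"
            score += len(re.findall(pattern, lowered)) * weight
        if score > 0:
            scores[label] = score
    total = sum(scores.values())
    # Sort by score descending, then label for a stable order on ties.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(label, score / total) for label, score in ranked[:top_n]]


print(suggest_types("Assessment of effects, with one public comment attached."))
```

Because the rules are plain dictionaries, staff could tune keywords and weights without code changes, which fits the $0/doc goal.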
| WP | Task | Delivers |
|---|---|---|
| WP6 | eagle-admin UI — DFL search, upload, review screens | Staff can search, upload, review, admit documents |
Milestone: Staff-usable DFL interface in eagle-admin.
| WP | Task | Delivers |
|---|---|---|
| WP7 | Priority 1A ETL — batch-process 65K existing docs (scale docling replicas) | Legacy documents searchable + classified |
Milestone: All existing EPIC documents available in DFL search.
| WP | Task | Delivers |
|---|---|---|
| WP8 | Conversational search — Typesense + GPT-4.1-nano | "Ask a question" AI answers with source citations |
Milestone: Users can ask natural language questions and get answers grounded in actual documents.
```mermaid
graph LR
    WP1[WP1: Schema] --> WP4[WP4: Metadata Hold]
    WP1 --> WP5[WP5: Typesense Schema]
    WP2[WP2: docling-service] --> WP3[WP3: Classifier]
    WP3 --> WP7[WP7: ETL 65K docs]
    WP4 --> WP6[WP6: eagle-admin UI]
    WP5 --> WP6
    WP5 --> WP8[WP8: Conversational Search]
    classDef critical fill:#4a1a1a,stroke:#3a0a0a,color:#fff
    classDef parallel fill:#1a3a4a,stroke:#0a2a3a,color:#fff
    classDef optional fill:#3a3a1a,stroke:#2a2a0a,color:#fff
    class WP1,WP2,WP3,WP7 critical
    class WP4,WP5,WP6 parallel
    class WP8 optional
```
Critical path: WP1 + WP2 (parallel, no dependency) → WP3 → WP7 (searchable legacy corpus).
Parallel track: WP4/WP5 → WP6 (UI) runs alongside intelligence work.
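The dependency edges in the graph above can be checked mechanically. The sketch below encodes them as a predecessor map and uses Python's standard `graphlib` (3.9+) to group work packages into waves that could start together. The `DEPENDS_ON` map is transcribed from the graph; `delivery_waves` is an illustrative helper name. The waves show the earliest possible ordering and are not a replacement for the phase plan.

```python
from graphlib import TopologicalSorter

# Work package -> set of packages it depends on (from the dependency graph).
DEPENDS_ON = {
    "WP1": set(),
    "WP2": set(),
    "WP3": {"WP2"},
    "WP4": {"WP1"},
    "WP5": {"WP1"},
    "WP6": {"WP4", "WP5"},
    "WP7": {"WP3"},
    "WP8": {"WP5"},
}


def delivery_waves(depends_on):
    """Group work packages into waves whose dependencies are all complete."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(delivery_waves(DEPENDS_ON), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

With these edges, WP1 and WP2 form the first wave, which matches the note that they can start in parallel with no dependency between them.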
| Phase | Monthly | One-Time | Notes |
|---|---|---|---|
| Phase 1–4 (v1) | $0 | ~$15–75 | Remote VLM fallback for degraded docs during ETL |
| Phase 5 (v2 AI answers) | $1–3 | — | GPT-4.1-nano, user-triggered only |
Full pricing breakdown: Implementation Proposal
| Risk | Impact | Mitigation |
|---|---|---|
| Docling accuracy on degraded scans | Some docs poorly extracted | Two-tier VLM: Granite-Docling-258M (local CPU) → Azure OpenAI GPT-4.1-mini vision (remote). Upgrade path: Azure DI Layout ($10/1K pages) |
| Rule-based accuracy below 60% | Staff find suggestions unhelpful | Upgrade to Typesense embedding similarity (still $0) |
| 65K ETL takes too long | Delayed availability | Scale docling replicas to 3+, prioritize most-accessed docs |
| docling-service resource usage | Pod eviction/OOM | Set memory limits at 2Gi, monitor during ETL, scale horizontally |
| UI complexity | Timeline slip | Ship minimal viable search first, iterate |
| Staff adoption | DFL unused | Involve staff in hold workflow design, make it faster than current process |
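For the ETL duration risk, the two mitigations (more replicas, most-accessed documents first) can be sketched as follows. The `access_count` field, the batch size, and the throughput figure are all hypothetical placeholders, and the estimate assumes throughput scales linearly with replica count, which real workloads rarely achieve exactly.

```python
def plan_batches(docs, batch_size):
    """Order documents by access count (highest first) and split into batches.

    Each doc is a dict with hypothetical keys "id" and "access_count".
    """
    ordered = sorted(docs, key=lambda d: (-d["access_count"], d["id"]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def estimated_hours(total_docs, docs_per_hour_per_replica, replicas):
    """Rough wall-clock estimate, assuming linear scaling across replicas."""
    if docs_per_hour_per_replica <= 0 or replicas <= 0:
        raise ValueError("throughput and replica count must be positive")
    return total_docs / (docs_per_hour_per_replica * replicas)


docs = [
    {"id": "a", "access_count": 2},
    {"id": "b", "access_count": 40},
    {"id": "c", "access_count": 7},
]
print([[d["id"] for d in batch] for batch in plan_batches(docs, batch_size=2)])

# 500 docs/hour/replica is a placeholder, not a measured docling figure.
print(estimated_hours(65_000, docs_per_hour_per_replica=500, replicas=3))
```

Measuring real per-replica throughput on a small sample before the full run would replace the placeholder rate with a usable number.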
| # | Decision | Options | Recommendation |
|---|---|---|---|
| 1 | VLM confidence threshold? | A) 0.70 (more Azure calls), B) 0.50 (only worst docs) | A — better quality, small cost delta |
| 2 | Conversational search timing? | A) Ship with v1, B) Add after core DFL proven | B — core value is filing + finding |
| 3 | Auto-admit high-confidence classifications? | A) Always hold, B) Auto-admit >95% confidence | A for v1, revisit after accuracy data |
| 4 | Paper records (27K) timeline? | A) After digital, B) Parallel | A — prove pipeline on digital first |
| 5 | Public search (eagle-public) scope? | A) Staff-only initially, B) Public from day 1 | A — staff validates before public exposure |
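Decision 1 is about when a page is escalated from the local model to the remote one. A minimal sketch of that routing, assuming each engine returns a confidence score between 0 and 1: the `Extraction` tuple, the `extract_with_fallback` function, and the choice to keep whichever result scores higher are all illustrative assumptions, and the two engines are passed in as plain callables rather than real Docling or Azure clients.

```python
from typing import Callable, NamedTuple


class Extraction(NamedTuple):
    text: str
    confidence: float  # assumed to be in [0, 1]
    engine: str


def extract_with_fallback(
    page,
    local: Callable[[object], Extraction],
    remote: Callable[[object], Extraction],
    threshold: float = 0.70,
) -> Extraction:
    """Try the local engine first; call the remote one only below threshold.

    A higher threshold sends more pages to the remote (billed) engine;
    a lower one escalates only the worst pages.
    """
    first = local(page)
    if first.confidence >= threshold:
        return first
    second = remote(page)
    # Keep whichever result is more confident (a design choice of this sketch).
    return second if second.confidence > first.confidence else first


def fake_local(page):
    return Extraction("local text", page["quality"], "local")


def fake_remote(page):
    return Extraction("remote text", 0.9, "remote")


print(extract_with_fallback({"quality": 0.6}, fake_local, fake_remote).engine)
print(extract_with_fallback({"quality": 0.6}, fake_local, fake_remote, threshold=0.50).engine)
```

Logging how often the remote branch fires at each threshold would give real data for choosing between options A and B.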
| Metric | Target | How Measured |
|---|---|---|
| Document findability | <5 seconds to locate any document | Search latency + user feedback |
| Classification accuracy | >70% top-3 suggestions correct | Staff override rate during hold |
| Filing speed | 2x faster than current manual process | Time-to-admit per document |
| Corpus coverage | 100% of 65K docs searchable | ETL completion tracking |
| Cost | $0/month for v1 operations | Azure billing dashboard |
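The classification accuracy metric (top-3 suggestions correct) could be computed from hold-review records along these lines. The record shape, a pair of ranked suggestions and the label staff confirmed, is an assumption made for the sketch, as is the `top3_accuracy` name.

```python
def top3_accuracy(reviews):
    """Share of reviews where the confirmed label was among the top 3 suggestions.

    reviews: iterable of (ranked_suggestions, confirmed_label) pairs.
    """
    reviews = list(reviews)
    if not reviews:
        raise ValueError("no reviews to score")
    hits = sum(
        1 for suggestions, confirmed in reviews if confirmed in suggestions[:3]
    )
    return hits / len(reviews)


# Made-up review records for illustration only.
sample = [
    (["Inspection Report", "Assessment Report"], "Inspection Report"),
    (["Public Comment", "Assessment Report", "Order"], "Order"),
    (["Assessment Report"], "Public Comment"),
    (["Order", "Public Comment", "Letter", "Inspection Report"], "Letter"),
]
print(f"{top3_accuracy(sample):.0%}")
```

A result above 0.70 on real review data would meet the stated target.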
- Approve plan — confirm scope and decision points
- Start WP1 — schema extension (unblocks everything else)
- Define vocabularies — document types, topics, ORCS mappings for classifier
- UI wireframes — eagle-admin DFL screens (can run in parallel with WP1–WP3)