
feat: implement Master Incident Data Lake - Supports File Once, Report Everywhere (along with Report Once, File Everywhere) #385

Open
utkarshqz wants to merge 6 commits into fireform-core:main from utkarshqz:feat/incident-data-lake

Conversation

@utkarshqz

Description

This PR implements the Master Incident Data Lake — a foundational architectural re-design that transforms FireForm from a single-shot form filler into a persistent, intelligent incident intelligence system.

The core problem it solves: Every time a transcript was processed, FireForm extracted data against one specific PDF template, filled it, and silently discarded everything else. Any spoken detail that didn't match a field in that particular template was permanently lost. Officers were forced to re-dictate information for every agency form they needed to fill.

This PR introduces a clean separation of two concerns that were previously conflated:

  1. Capture — extracting all spoken intelligence, fully and permanently.
  2. Report — generating a filled PDF for any agency template, at any time, from the stored data.

The result delivers two complementary paradigms:

"File Once. Report Everywhere."
One officer records an incident transcript once. Every agency — Fire, Police, EMS — generates their own department-specific filled PDF from that single persisted record. Zero repeated LLM calls.

"Report Once. Filed Everywhere."
Multiple officers at the same scene each submit their perspective. The system automatically merges all reports collaboratively into a single, authoritative incident record. One unified source of truth, contributed to by every responder.

Because all incident data is stored as structured JSON in a queryable database, the Data Lake also acts as a live intelligence backbone:

  • Any connected device — including the FireForm PWA — can query incident records via the REST API, making offline-first mobile reporting and background sync seamless.
  • Incident records can be searched, filtered, and aggregated for analytics, shift handover summaries, or cross-agency intelligence sharing.
  • Future integrations (CAD systems, RMS platforms, dashboards) can consume the same standardised JSON without any new LLM calls.

Fixes #384


🎯 Overview

Before this PR (single-shot, data-discarding pipeline):
  Officer A: Transcript → LLM → PDF (one template) → ❌ data discarded

After this PR — dual paradigm:

  FILE ONCE. REPORT EVERYWHERE.
  Officer A: Transcript → LLM → Master JSON ─┬─→ PDF A (Fire Dept)
                               (persisted)   ├─→ PDF B (Police)
                                             └─→ PDF C (EMS)
                                             (any template, zero new LLM calls)

  REPORT ONCE. FILED EVERYWHERE.
  Officer A: "Structure fire, 2 victims..."  ─┐
  Officer B: "Victim names: John, Mary..."   ─┼──→ Single merged Master JSON
  Officer C: "Fire suppressed at 14:32..."   ─┘    (one authoritative record)

  INTEGRATED EVERYWHERE.
  PWA (mobile) ────┐
  Web Dashboard ───┼──→ GET /incidents/{id}  →  Same JSON, any device
  CAD System ──────┘

🚀 Key Changes

1. IncidentMasterData Model (api/db/models.py)

Introduced a new SQLModel table that acts as the central Data Lake record:

from datetime import datetime
from typing import Optional

from sqlmodel import Field, SQLModel

class IncidentMasterData(SQLModel, table=True):
    incident_id: str = Field(primary_key=True)  # INC-2026-0401-0912 (auto-generated or supplied)
    master_json: str                            # Full extracted payload as JSON string
    transcript_text: str                        # Original raw transcript (permanent audit trail)
    location_lat: Optional[float] = None
    location_lng: Optional[float] = None
    officer_notes: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow)
    updated_at: datetime = Field(default_factory=datetime.utcnow)

Fully decoupled from Template — the Data Lake record exists independently of any PDF.
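The PR does not show the ID generator itself; a minimal sketch consistent with the `INC-2026-0401-0912` pattern (the function name is illustrative, not from the PR) could be:

```python
from datetime import datetime
from typing import Optional

def make_incident_id(now: Optional[datetime] = None) -> str:
    """Build an ID like INC-2026-0401-0912 (year, month+day, hour+minute)."""
    now = now or datetime.utcnow()
    return f"INC-{now:%Y}-{now:%m%d}-{now:%H%M}"

make_incident_id(datetime(2026, 4, 1, 9, 12))  # "INC-2026-0401-0912"
```

Note that minute-resolution IDs can collide when two incidents are filed in the same minute, so a production generator would likely append a sequence number or random suffix.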


2. Collaborative Consensus Merge (api/db/repositories.py)

The most critical new capability: multiple officers can submit reports for the same incident_id over time, and the system intelligently merges them.

| Scenario | Behaviour |
| --- | --- |
| Second officer sends null for a field that already has data | Existing value protected |
| Second officer adds a field not seen before | New field added to the lake |
| Both officers mention Notes or Description | Values appended with an [UPDATE: timestamp] tag |
| Second officer sends a corrected non-null value | Field updated |

This prevents the most dangerous failure mode in multi-officer reporting: a partial or hallucinated LLM response silently overwriting real, validated data.
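The merge behaviour condenses into a small pure function. This is a sketch, not the code in api/db/repositories.py; the function name, the APPEND_FIELDS set, and the timestamp format are assumptions:

```python
from datetime import datetime, timezone

# Fields whose values are appended rather than overwritten (assumption:
# the real list lives in api/db/repositories.py).
APPEND_FIELDS = {"Notes", "Description"}

def consensus_merge(existing: dict, incoming: dict) -> dict:
    """Merge a second officer's extraction into the stored master record."""
    merged = dict(existing)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    for key, value in incoming.items():
        if value is None:
            continue  # nulls never overwrite existing data
        if key in APPEND_FIELDS and merged.get(key):
            # Narrative fields accumulate instead of being replaced.
            merged[key] = f"{merged[key]} [UPDATE: {stamp}] {value}"
        else:
            # New field, or a corrected non-null value.
            merged[key] = value
    return merged
```

Keeping the merge a pure dict-to-dict function makes the null-protection rule trivial to unit-test without a database.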


3. Incident Endpoints (api/routes/incidents.py)

POST /incidents/extract

Accepts a raw transcript and optional incident_id. Runs the LLM extraction batch and either creates a new Data Lake record or merges into an existing one.

// Response — new incident
{ "incident_id": "INC-2026-0401-0912", "status": "created", "fields_extracted": 7 }

// Response — subsequent report by second officer
{ "incident_id": "INC-2026-0401-0912", "status": "merged", "total_fields": 11 }

POST /incidents/{incident_id}/generate/{template_id}

Generates a filled PDF for any registered agency template from the stored Data Lake — with zero new LLM calls. One incident. Any number of templates.

GET /incidents/{incident_id}

Returns the full raw master JSON for any stored incident — useful for debugging, auditing, or downstream integrations.

GET /incidents

Lists all stored incidents.
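Taken together, these endpoints support the "File Once, Report Everywhere" flow end to end. A stdlib-only client sketch (BASE and the helper name are illustrative; the endpoint paths follow the PR):

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "http://localhost:8000"  # assumption: default uvicorn address

def file_once_report_everywhere(transcript: str, template_ids: list) -> str:
    """Submit one transcript, then render a filled PDF per agency template."""
    # 1. Capture: one LLM extraction, persisted to the Data Lake.
    req = Request(f"{BASE}/incidents/extract?input_text={quote(transcript)}",
                  method="POST")
    with urlopen(req) as resp:
        incident_id = json.load(resp)["incident_id"]
    # 2. Report: any number of filled PDFs, zero new LLM calls.
    for template_id in template_ids:
        req = Request(f"{BASE}/incidents/{incident_id}/generate/{template_id}",
                      method="POST")
        with urlopen(req) as resp, \
             open(f"{incident_id}-{template_id}.pdf", "wb") as fh:
            fh.write(resp.read())
    return incident_id
```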


4. Schema-less LLM Extraction (src/llm.py)

The extraction prompt was updated to operate in two modes:

  • Template-guided mode (when templates are uploaded): Uses template field names as base context, while also instructing Mistral to invent additional descriptive keys for any other critical details found in the transcript.
  • Pure schema-less mode (no templates): Fully ad-hoc — Mistral invents all keys dynamically from the transcript alone.

The Python-level field filter that previously stripped any key not present in the active template schema has been removed. All extracted keys are accepted and stored.
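A sketch of how the two modes might shape the prompt (the wording and function name are illustrative, not the actual prompt in src/llm.py):

```python
from typing import List, Optional

def build_extraction_prompt(transcript: str,
                            template_fields: Optional[List[str]] = None) -> str:
    """Sketch of the two extraction modes described above."""
    if template_fields:
        # Template-guided: field names seed the prompt, extra keys allowed.
        guidance = (
            "Prefer these field names where they apply: "
            + ", ".join(template_fields)
            + ". Also invent descriptive keys for any other critical details."
        )
    else:
        # Pure schema-less: the model invents every key.
        guidance = "Invent a descriptive JSON key for every detail you find."
    return (
        "Extract all incident details from the transcript below as one flat "
        f"JSON object. {guidance}\n\nTranscript:\n{transcript}"
    )
```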


5. Frontend Integration (frontend/index.html)

  • Added an Incident ID input field, allowing officers to supply an existing ID to append data or trigger PDF generation from a stored record.
  • The Data Lake ID returned on creation is displayed in the UI response, making it easy to copy and reuse.
  • The existing template-fill workflow remains fully intact — the Data Lake is an invisible enhancement, not a replacement.

6. Test Coverage (tests/test_incidents.py)

13 new tests added, covering:

Unit tests (no Ollama needed):

  • Creating an incident record persists correctly to the database.
  • get_incident retrieves the correct record by ID; returns None for unknown IDs.
  • Consensus merge: null values do not overwrite existing valid data.
  • Consensus merge: Notes fields append with [UPDATE] tags.
  • Consensus merge: corrected non-null values are updated.

Integration tests (LLM mocked):

  • POST /incidents/extract → creates new incident, returns "status": "created".
  • POST /incidents/extract (same ID) → merges, returns "status": "merged".
  • GET /incidents/{id} → returns stored master JSON.
  • GET /incidents/{id} (unknown) → 404.
  • POST /incidents/{id}/generate/{template_id} (unknown incident) → 404.
  • POST /incidents/{id}/generate/{template_id} (unknown template) → 404.
  • GET /incidents → returns list of all stored incidents.
python -m pytest tests/test_incidents.py -v
# 13 passed, 0 failed
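For illustration, the "null never overwrites" unit test might take this shape (merge_master_json is a hypothetical stand-in for the repository's merge helper, not its real name):

```python
def merge_master_json(existing: dict, incoming: dict) -> dict:
    # Hypothetical stand-in: drop nulls, then overlay the rest.
    return {**existing, **{k: v for k, v in incoming.items() if v is not None}}

def test_null_values_do_not_overwrite_existing_data():
    existing = {"Address": "12 Elm St", "Victims": "2"}
    incoming = {"Address": None, "Victims": "3"}  # partial second report
    merged = merge_master_json(existing, incoming)
    assert merged["Address"] == "12 Elm St"  # null protected
    assert merged["Victims"] == "3"          # correction applied

test_null_values_do_not_overwrite_existing_data()
```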

7. Documentation (docs/SETUP.md)

A full 🗄️ Master Incident Data Lake section was added to SETUP.md, covering:

  • Architecture diagram (old vs new pipeline)
  • Step-by-step Data Lake workflow
  • Collaborative Consensus Merge behaviour table
  • Full API reference table
  • Updated environment variable reference (OLLAMA_TIMEOUT)
  • Test running instructions

🛠 Technical Highlights

  • Async architecture: The extraction endpoint uses httpx.AsyncClient with a configurable OLLAMA_TIMEOUT (default: 300s) to prevent event-loop blocking on high-latency local LLM hardware.
  • format: json enforced: All Ollama API payloads pass "format": "json" to force strict JSON output, eliminating parse failures from verbose LLM responses.
  • Graceful fallback: If LLM extraction returns zero fields (e.g. JSON parse failure), the system stores an empty record and returns a valid response — the user is never left with a 500 error.
  • Audit trail: Every raw transcript segment is permanently appended to the incident's transcript_text, creating an immutable history of all inputs for legal/compliance purposes.
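Two of these highlights are easy to sketch in isolation: the strict-JSON Ollama payload and the graceful-fallback parser. Function names are illustrative; only the `"format": "json"` key and the `OLLAMA_TIMEOUT` variable come from the PR:

```python
import json
import os

OLLAMA_TIMEOUT = float(os.environ.get("OLLAMA_TIMEOUT", "300"))  # seconds

def ollama_payload(prompt: str, model: str = "mistral") -> dict:
    # "format": "json" forces Ollama to emit strict JSON, avoiding parse
    # failures caused by verbose prose responses.
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}

def parse_llm_fields(raw: str) -> dict:
    """Graceful fallback: malformed LLM output yields an empty record, not a 500."""
    try:
        fields = json.loads(raw)
        return fields if isinstance(fields, dict) else {}
    except json.JSONDecodeError:
        return {}
```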

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing single-shot extract-and-fill pipeline is extended to the persistent Lake architecture; existing /forms endpoints are unaffected)
  • This change requires a documentation update

How Has This Been Tested?

Automated (13 tests, no Ollama required):

python -m pytest tests/test_incidents.py -v

Manual end-to-end verification:

  1. Start Ollama: ollama serve
  2. Start API: uvicorn api.main:app --reload
  3. Upload a template via POST /templates/upload
  4. Submit transcript: POST /incidents/extract?input_text=<text>
  5. Note the returned incident_id
  6. Submit a second transcript with same incident_id → verify "status": "merged"
  7. GET /incidents/{incident_id} → inspect full master JSON
  8. POST /incidents/{incident_id}/generate/{template_id} → download filled PDF

Test Configuration:

  • Python 3.11+
  • Ollama 0.17.7+ running mistral locally
  • SQLite (default) or PostgreSQL

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules



Development

Successfully merging this pull request may close these issues.

[FEAT]: Implement Master Data Lake Architecture for Schema-less Core Extraction
