
feat: implement Master Incident Data Lake - Supports File Once, Report Everywhere (along with Report Once, File Everywhere) #385

Open
utkarshqz wants to merge 6 commits into fireform-core:main from utkarshqz:feat/incident-data-lake

Conversation

@utkarshqz

Description

This PR implements the Master Incident Data Lake — a foundational architectural re-design that transforms FireForm from a single-shot form filler into a persistent, intelligent incident intelligence system.

The core problem it solves: Every time a transcript was processed, FireForm extracted data against one specific PDF template, filled it, and silently discarded everything else. Any spoken detail that didn't match a field in that particular template was permanently lost. Officers were forced to re-dictate information for every agency form they needed to fill.

This PR introduces a clean separation of two concerns that were previously conflated:

  1. Capture — extracting all spoken intelligence, fully and permanently.
  2. Report — generating a filled PDF for any agency template, at any time, from the stored data.

The result delivers two complementary paradigms:

"File Once. Report Everywhere."
One officer records an incident transcript once. Every agency — Fire, Police, EMS — generates their own department-specific filled PDF from that single persisted record. Zero repeated LLM calls.

"Report Once. Filed Everywhere."
Multiple officers at the same scene each submit their perspective. The system automatically merges all reports collaboratively into a single, authoritative incident record. One unified source of truth, contributed to by every responder.

Because all incident data is stored as structured JSON in a queryable database, the Data Lake also acts as a live intelligence backbone:

  • Any connected device — including the FireForm PWA — can query incident records via the REST API, making offline-first mobile reporting and background sync seamless.
  • Incident records can be searched, filtered, and aggregated for analytics, shift handover summaries, or cross-agency intelligence sharing.
  • Future integrations (CAD systems, RMS platforms, dashboards) can consume the same standardised JSON without any new LLM calls.

Fixes #384


🎯 Overview

Before this PR (single-shot, data-discarding pipeline):
  Officer A: Transcript → LLM → PDF (one template) → ❌ data discarded

After this PR — dual paradigm:

  FILE ONCE. REPORT EVERYWHERE.
  Officer A: Transcript → LLM → Master JSON ─┬─→ PDF A (Fire Dept)
                               (persisted)   ├─→ PDF B (Police)
                                             └─→ PDF C (EMS)
                                             (any template, zero new LLM calls)

  REPORT ONCE. FILED EVERYWHERE.
  Officer A: "Structure fire, 2 victims..."  ─┐
  Officer B: "Victim names: John, Mary..."   ─┼──→ Single merged Master JSON
  Officer C: "Fire suppressed at 14:32..."   ─┘    (one authoritative record)

  INTEGRATED EVERYWHERE.
  PWA (mobile) ────┐
  Web Dashboard ───┼──→ GET /incidents/{id}  →  Same JSON, any device
  CAD System ──────┘

🚀 Key Changes

1. IncidentMasterData Model (api/db/models.py)

Introduced a new SQLModel table that acts as the central Data Lake record:

from datetime import datetime
from typing import Optional

from sqlmodel import Field, SQLModel

class IncidentMasterData(SQLModel, table=True):
    incident_id: str = Field(primary_key=True)  # INC-2026-0401-0912 (auto-generated or supplied)
    master_json: str                            # Full extracted payload as JSON string
    transcript_text: str                        # Original raw transcript (permanent audit trail)
    location_lat: Optional[float] = None
    location_lng: Optional[float] = None
    officer_notes: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow)
    updated_at: datetime = Field(default_factory=datetime.utcnow)

Fully decoupled from Template — the Data Lake record exists independently of any PDF.
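The PR does not show the ID generator itself; a minimal sketch consistent with the `INC-2026-0401-0912` pattern (the function name is illustrative, not from the PR) could be:

```python
from datetime import datetime
from typing import Optional

def make_incident_id(now: Optional[datetime] = None) -> str:
    """Build an ID like INC-2026-0401-0912 (year, month+day, hour+minute)."""
    now = now or datetime.utcnow()
    return f"INC-{now:%Y}-{now:%m%d}-{now:%H%M}"

make_incident_id(datetime(2026, 4, 1, 9, 12))  # "INC-2026-0401-0912"
```

Note that minute-resolution IDs can collide when two incidents are filed in the same minute, so a production generator would likely append a sequence number or random suffix.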


2. Collaborative Consensus Merge (api/db/repositories.py)

The most critical new capability: multiple officers can submit reports for the same incident_id over time, and the system intelligently merges them.

| Scenario | Behaviour |
| --- | --- |
| Second officer sends null for a field that already has data | Existing value protected |
| Second officer adds a field not seen before | New field added to the lake |
| Both officers mention Notes or Description | Values appended with an [UPDATE: timestamp] tag |
| Second officer sends a corrected non-null value | Field updated |

This prevents the most dangerous failure mode in multi-officer reporting: a partial or hallucinated LLM response silently overwriting real, validated data.
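The merge behaviour condenses into a small pure function. This is a sketch, not the code in api/db/repositories.py; the function name, the APPEND_FIELDS set, and the timestamp format are assumptions:

```python
from datetime import datetime, timezone

# Fields whose values are appended rather than overwritten (assumption:
# the real list lives in api/db/repositories.py).
APPEND_FIELDS = {"Notes", "Description"}

def consensus_merge(existing: dict, incoming: dict) -> dict:
    """Merge a second officer's extraction into the stored master record."""
    merged = dict(existing)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    for key, value in incoming.items():
        if value is None:
            continue  # nulls never overwrite existing data
        if key in APPEND_FIELDS and merged.get(key):
            # Narrative fields accumulate instead of being replaced.
            merged[key] = f"{merged[key]} [UPDATE: {stamp}] {value}"
        else:
            # New field, or a corrected non-null value.
            merged[key] = value
    return merged
```

Keeping the merge a pure dict-to-dict function makes the null-protection rule trivial to unit-test without a database.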


3. Incident Endpoints (api/routes/incidents.py)

POST /incidents/extract

Accepts a raw transcript and optional incident_id. Runs the LLM extraction batch and either creates a new Data Lake record or merges into an existing one.

// Response — new incident
{ "incident_id": "INC-2026-0401-0912", "status": "created", "fields_extracted": 7 }

// Response — subsequent report by second officer
{ "incident_id": "INC-2026-0401-0912", "status": "merged", "total_fields": 11 }

POST /incidents/{incident_id}/generate/{template_id}

Generates a filled PDF for any registered agency template from the stored Data Lake — with zero new LLM calls. One incident. Any number of templates.

GET /incidents/{incident_id}

Returns the full raw master JSON for any stored incident — useful for debugging, auditing, or downstream integrations.

GET /incidents

Lists all stored incidents.
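Taken together, these endpoints support the "File Once, Report Everywhere" flow end to end. A stdlib-only client sketch (BASE and the helper name are illustrative; the endpoint paths follow the PR):

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "http://localhost:8000"  # assumption: default uvicorn address

def file_once_report_everywhere(transcript: str, template_ids: list) -> str:
    """Submit one transcript, then render a filled PDF per agency template."""
    # 1. Capture: one LLM extraction, persisted to the Data Lake.
    req = Request(f"{BASE}/incidents/extract?input_text={quote(transcript)}",
                  method="POST")
    with urlopen(req) as resp:
        incident_id = json.load(resp)["incident_id"]
    # 2. Report: any number of filled PDFs, zero new LLM calls.
    for template_id in template_ids:
        req = Request(f"{BASE}/incidents/{incident_id}/generate/{template_id}",
                      method="POST")
        with urlopen(req) as resp, \
             open(f"{incident_id}-{template_id}.pdf", "wb") as fh:
            fh.write(resp.read())
    return incident_id
```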


4. Schema-less LLM Extraction (src/llm.py)

The extraction prompt was updated to operate in two modes:

  • Template-guided mode (when templates are uploaded): Uses template field names as base context, while also instructing Mistral to invent additional descriptive keys for any other critical details found in the transcript.
  • Pure schema-less mode (no templates): Fully ad-hoc — Mistral invents all keys dynamically from the transcript alone.

The Python-level field filter that previously stripped any key not present in the active template schema has been removed. All extracted keys are accepted and stored.
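A sketch of how the two modes might shape the prompt (the wording and function name are illustrative, not the actual prompt in src/llm.py):

```python
from typing import List, Optional

def build_extraction_prompt(transcript: str,
                            template_fields: Optional[List[str]] = None) -> str:
    """Sketch of the two extraction modes described above."""
    if template_fields:
        # Template-guided: field names seed the prompt, extra keys allowed.
        guidance = (
            "Prefer these field names where they apply: "
            + ", ".join(template_fields)
            + ". Also invent descriptive keys for any other critical details."
        )
    else:
        # Pure schema-less: the model invents every key.
        guidance = "Invent a descriptive JSON key for every detail you find."
    return (
        "Extract all incident details from the transcript below as one flat "
        f"JSON object. {guidance}\n\nTranscript:\n{transcript}"
    )
```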


5. Frontend Integration (frontend/index.html)

  • Added an Incident ID input field, allowing officers to supply an existing ID to append data or trigger PDF generation from a stored record.
  • The Data Lake ID returned on creation is displayed in the UI response, making it easy to copy and reuse.
  • The existing template-fill workflow remains fully intact — the Data Lake is an invisible enhancement, not a replacement.

6. Test Coverage (tests/test_incidents.py)

13 new tests added, covering:

Unit tests (no Ollama needed):

  • Creating an incident record persists correctly to the database.
  • get_incident retrieves the correct record by ID; returns None for unknown IDs.
  • Consensus merge: null values do not overwrite existing valid data.
  • Consensus merge: Notes fields append with [UPDATE] tags.
  • Consensus merge: corrected non-null values are updated.

Integration tests (LLM mocked):

  • POST /incidents/extract → creates new incident, returns "status": "created".
  • POST /incidents/extract (same ID) → merges, returns "status": "merged".
  • GET /incidents/{id} → returns stored master JSON.
  • GET /incidents/{id} (unknown) → 404.
  • POST /incidents/{id}/generate/{template_id} (unknown incident) → 404.
  • POST /incidents/{id}/generate/{template_id} (unknown template) → 404.
  • GET /incidents → returns list of all stored incidents.
python -m pytest tests/test_incidents.py -v
# 13 passed, 0 failed
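For illustration, the "null never overwrites" unit test might take this shape (merge_master_json is a hypothetical stand-in for the repository's merge helper, not its real name):

```python
def merge_master_json(existing: dict, incoming: dict) -> dict:
    # Hypothetical stand-in: drop nulls, then overlay the rest.
    return {**existing, **{k: v for k, v in incoming.items() if v is not None}}

def test_null_values_do_not_overwrite_existing_data():
    existing = {"Address": "12 Elm St", "Victims": "2"}
    incoming = {"Address": None, "Victims": "3"}  # partial second report
    merged = merge_master_json(existing, incoming)
    assert merged["Address"] == "12 Elm St"  # null protected
    assert merged["Victims"] == "3"          # correction applied

test_null_values_do_not_overwrite_existing_data()
```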

7. Documentation (docs/SETUP.md)

A full 🗄️ Master Incident Data Lake section was added to SETUP.md, covering:

  • Architecture diagram (old vs new pipeline)
  • Step-by-step Data Lake workflow
  • Collaborative Consensus Merge behaviour table
  • Full API reference table
  • Updated environment variable reference (OLLAMA_TIMEOUT)
  • Test running instructions

🛠 Technical Highlights

  • Async architecture: The extraction endpoint uses httpx.AsyncClient with a configurable OLLAMA_TIMEOUT (default: 300s) to prevent event-loop blocking on high-latency local LLM hardware.
  • format: json enforced: All Ollama API payloads pass "format": "json" to force strict JSON output, eliminating parse failures from verbose LLM responses.
  • Graceful fallback: If LLM extraction returns zero fields (e.g. JSON parse failure), the system stores an empty record and returns a valid response — the user is never left with a 500 error.
  • Audit trail: Every raw transcript segment is permanently appended to the incident's transcript_text, creating an immutable history of all inputs for legal/compliance purposes.
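Two of these highlights are easy to sketch in isolation: the strict-JSON Ollama payload and the graceful-fallback parser. Function names are illustrative; only the `"format": "json"` key and the `OLLAMA_TIMEOUT` variable come from the PR:

```python
import json
import os

OLLAMA_TIMEOUT = float(os.environ.get("OLLAMA_TIMEOUT", "300"))  # seconds

def ollama_payload(prompt: str, model: str = "mistral") -> dict:
    # "format": "json" forces Ollama to emit strict JSON, avoiding parse
    # failures caused by verbose prose responses.
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}

def parse_llm_fields(raw: str) -> dict:
    """Graceful fallback: malformed LLM output yields an empty record, not a 500."""
    try:
        fields = json.loads(raw)
        return fields if isinstance(fields, dict) else {}
    except json.JSONDecodeError:
        return {}
```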

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing single-shot extract-and-fill pipeline is extended to the persistent Lake architecture; existing /forms endpoints are unaffected)
  • This change requires a documentation update

How Has This Been Tested?

Automated (13 tests, no Ollama required):

python -m pytest tests/test_incidents.py -v

Manual end-to-end verification:

  1. Start Ollama: ollama serve
  2. Start API: uvicorn api.main:app --reload
  3. Upload a template via POST /templates/upload
  4. Submit transcript: POST /incidents/extract?input_text=<text>
  5. Note the returned incident_id
  6. Submit a second transcript with same incident_id → verify "status": "merged"
  7. GET /incidents/{incident_id} → inspect full master JSON
  8. POST /incidents/{incident_id}/generate/{template_id} → download filled PDF

Test Configuration:

  • Python 3.11+
  • Ollama 0.17.7+ running mistral locally
  • SQLite (default) or PostgreSQL

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules



Development

Successfully merging this pull request may close these issues.

[FEAT]: Implement Master Data Lake Architecture for Schema-less Core Extraction
