The healthcare world is trapped in a "free-text silo" where much of patient data is locked in unstructured notes, invisible to modern analytical systems. This solution bridges this interoperability gap by transforming raw medical notes into useful data formats.
This repository is a live document of an engineering journey: translating unstructured medical notes into standardized FHIR R4 resources using LLMs.
Rather than a "finished product" manual, this README tracks our architectural pivots, experimental results, and the reasoning behind our technical decisions.
We have transitioned from a single-threaded extraction tool into a multi-threaded, self-auditing clinical engine. The system now captures not just the output, but the reasoning behind every clinical decision.
- [2026-03-05] Parallel Execution & Scalability:
- Performance: Implemented multi-threaded processing (
ThreadPoolExecutor) allowing for 2x faster results on large datasets. - CLI Control: Added arguments for flexible testing (
--n <count>and--parallelflags). - Thread Safety: Integrated
threading.Lockto ensure data integrity during high-concurrency
- Performance: Implemented multi-threaded processing (
- [2026-03-05] The "Deep Audit" (Explainability):
- Transparency: Now capturing the
thought_signature(Chain of Thought) from both the Extraction and Judging models. - Forensics: Reasoning is base64-encoded and stored in the
audit_log.csv(v0.8), allowing clinicians to audit why the AI assigned a specific code or score.
- Transparency: Now capturing the
- [2026-03-05] Unified Evaluation Engine:
- Consolidation: Merged deterministic structural valiation into the LLM-as-a-Judge suite.
- Precision: The Judge now performs a "Gatekeeper" check for FHIR structural integrity before proceeding to clinical qualitative analysis.
graph TD
subgraph Extraction_Assembly ["1. The Pipeline (Extraction & Assembly)"]
A[Clinical Note] --> B(Gemini Flash: Clinical Extraction)
B --> C{Pydantic Check}
C -- Valid --> D(FHIR Assembly & UUID Linking)
D --> E[FHIR Transaction Bundle]
end
subgraph Evaluation_Judge ["2. The Judge Suite (Unified Validation)"]
E --> F{Gatekeeper: Structural Check}
F -- Passed --> G(Gemini Pro: Qualitative Review)
F -- Failed --> H[Audit: Structural Error Log]
G --> I[Final Score & Strengths/Weaknesses]
end
subgraph Deep_Audit ["3. The Audit Trail (v0.8 Schema)"]
B -.-> J[Extraction Chain of Thought]
G -.-> K[Judge Chain of Thought]
J & K --> L[(v0.8 Enriched Audit CSV)]
I --> L
H --> L
end
style J stroke-dasharray: 5 5
style K stroke-dasharray: 5 5
evaluator.py: The Judge. Now a unified engine performing both deterministic structural validation and CMIO-level clinical review.pipeline.py: The Orchestrator. Handles structured extraction with reasoning capture and strict FHIR assembly.run_experiment.py: The High-Performance Runner. Supports parallel processing, CLI arguments, and thread-safe logging.models.py: The Schema Truth. Defines Pydantic models for both clinical data and judge evaluations.
Tested on the top 1% most complex transcribed notes.
| Metric | Result |
|---|---|
| Structural Integrity | 100% |
| Average Judge Score | 9.15/10 |
| Concurrency Support | 2 parallel threads |
| Audit Depth | Full reasoning signatures captured |
- Deep Reasoning Analysis: Systematically analyze the captured
thought_signatureand judging logs to understand extraction failures and patterns. - Recursive Improvement: Use these findings to refine prompts and logical assembly rules for higher clinical accuracy.
- PT-BR Adaptation: Calibrating prompts to handle Brazilian clinical nuances and datasets (e.g., BRATECA and SemClinBr).
- Multilingual Extraction: Ensuring 100% reliability for both EN-US and PT-BR transcripts using the same unified pipeline.
- Real Terminology APIs: Transition from standalone LLM extraction to live API consumption (e.g., NIH UMLS/VSAC) to verify and ground clinical codes.
- Agentic Conflict Resolver: Implementing a dedicated agent responsible for Contradiction Detection, cross-referencing extracted FHIR findings with the source text to surface clinical discrepancies.
- PII/PHI Scrubbing: Implement local de-identification using Microsoft Presidio to ensure sensitive clinical data is scrubbed before processing (supporting multi-language entities).
- Enterprise API: Deploy the pipeline as a production-grade FastAPI service.
- Traffic Management: Implement strict rate limits and session management to ensure system stability and cost control.
- Professional UI: Build a clean, intuitive frontend connected to the API for easy testing and demonstration.
- Voice Clinical Intake: Integrate OpenAI Whisper so users can input clinical notes directly via audio dictation (supporting multiple languages).
Disclaimer: This is an experimental research project. Do not use for actual PHI or clinical decision-making.