A proof-of-concept demonstrating that a sub-2 GB LLM running entirely on CPU can normalize emergency-department (ED) chief complaints into candidate ICD-10 codes via Retrieval-Augmented Generation (RAG) — with no cloud API calls, no GPU, and no proprietary data.
⚠️ Status: Research / dev prototype. Not validated for clinical use.
| Step | Script | What it does |
|---|---|---|
| 1a | `01_build_fake_kb_lancedb.py` | Generates synthetic KB docs (hospital SQL schema + ~600 ICD-10 codes) and writes them to `fake_kb_data.csv`. |
| 1b | `01_build_fake_kb_lancedb.py` | Reads back the CSV, flushes the existing LanceDB table, embeds all docs, and re-ingests into LanceDB. |
| 2 | `02_rag_llama32_edge_tests.py` | Runs 33 structured edge-test cases through the full RAG pipeline: embed → retrieve → prompt → generate → parse → validate JSON. Auto-saves results to `results/rag_results_<timestamp>.json`. |
| — | `run_pipeline.py` | One-command runner that chains steps 1 and 2 end-to-end. |
```text
Chief Complaint (free text)
        │
        ▼
Sentence Embedder
NeuML/pubmedbert-base-embeddings (110 M params, 768-dim, CPU)
        │
        ▼
LanceDB cosine search ◄── KB: ~600 ICD-10 codes + 3 SQL schema docs
        │                       ▲
        │                       │ ingested from fake_kb_data.csv
        │                       │ (flushed + re-embedded on each build)
        │ top-k retrieved chunks
        ▼
Prompt builder
(structured system prompt + anti-laziness rules + few-shot behavior examples)
        │
        ▼
Tiny LLM (1 B params, CPU-only, float16 or Q4_K_M)
        │
        ▼
JSON output parser + post_process() guards + validator
        │
        ▼
{ candidate_icd_codes, confidence, flags, … }
        │
        ▼
results/rag_results_<timestamp>.json
```
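The flow above can be sketched end-to-end with stdlib-only stand-ins — a bag-of-words "embedder", a three-entry dict in place of LanceDB, and no LLM call (the top retrieved code is emitted directly). Every name here is illustrative, not the project's actual API:

```python
import json
import math

# Toy stand-ins: the real pipeline uses PubMedBERT + LanceDB + a 1B LLM.
KB = {"I20.9": "Angina pectoris, unspecified — chest pain on exertion",
      "J06.9": "Acute upper respiratory infection — fever, sore throat",
      "N39.0": "Urinary tract infection — dysuria, urgency"}

def embed(text):
    # Crude bag-of-words vector over a tiny vocabulary (stands in for 768-dim embeddings).
    vocab = ["chest", "pain", "fever", "throat", "urgency", "dysuria", "exertion"]
    return [text.lower().count(w) for w in vocab]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve(complaint, k=2):
    # Rank KB docs by cosine similarity to the query vector (LanceDB's role).
    q = embed(complaint)
    ranked = sorted(KB.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

def run_case(complaint):
    context = retrieve(complaint)
    # A real run would build a prompt from `context` and call the LLM here;
    # the top retrieved code is emitted directly just to show the output shape.
    return {"input_text": complaint,
            "candidate_icd_codes": [context[0][0]],
            "confidence": 0.5, "flags": []}

print(json.dumps(run_case("chest pain on exertion")))
```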
Four profiles are defined and benchmarked. Switch with `--model <profile>`, the `MODEL_PROFILE` env var, or by editing `active_profile` in `model_config.yaml`.
| Profile key | Model | Size | Dtype | Gated | Backend |
|---|---|---|---|---|---|
| `llama32_1b_instruct` | `meta-llama/Llama-3.2-1B-Instruct` | ~2.5 GB | float16 | ✅ HF token + Meta licence | Transformers |
| `gemma3_1b` | `google/gemma-3-1b-it` | ~2.0 GB | bfloat16 | ✅ HF token + Google licence | Transformers |
| `danube3_500m` | `h2oai/h2o-danube3-500m-chat` | ~0.98 GB | float16 | ❌ ungated | Transformers |
| `llama32_1b_q4km` (default) | `bartowski/Llama-3.2-1B-Instruct-GGUF` (Q4_K_M) | ~0.81 GB | Q4_K_M | ❌ ungated | llama-cpp-python |
No HF token? Use `danube3_500m` or `llama32_1b_q4km` — both run with zero credentials.
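The three switching mechanisms resolve in a fixed precedence (CLI flag, then env var, then YAML). A hypothetical sketch of that resolution, with a plain dict standing in for the parsed `model_config.yaml`:

```python
import os

# Stand-in for the parsed model_config.yaml (keys mirror the profile table above).
CONFIG = {
    "active_profile": "llama32_1b_q4km",
    "profiles": {
        "llama32_1b_q4km": {"backend": "llama-cpp-python", "dtype": "Q4_K_M"},
        "danube3_500m": {"backend": "transformers", "dtype": "float16"},
    },
}

def resolve_profile(cli_model=None, env=os.environ, config=CONFIG):
    # Precedence: --model flag > MODEL_PROFILE env var > active_profile in YAML.
    name = cli_model or env.get("MODEL_PROFILE") or config["active_profile"]
    if name not in config["profiles"]:
        raise KeyError(f"unknown profile: {name}")
    return name, config["profiles"][name]

name, profile = resolve_profile(env={"MODEL_PROFILE": "danube3_500m"})
print(name)  # danube3_500m — the env var wins when no CLI flag is given
```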
The builder now follows a CSV-first pipeline:

- Generate — `build_fake_schema_docs()` + `build_fake_icd_docs()` produce in-memory dicts.
- Persist to CSV — `write_docs_to_csv()` serialises all docs to `fake_kb_data.csv` (columns: `doc_type`, `doc_id`, `title`, `text`, `icd_code`, `icd_desc`).
- Reload from CSV — `read_docs_from_csv()` reads the CSV back, making the CSV the authoritative source for downstream ingestion.
- Flush — `flush_lancedb_table()` drops the existing `kb_docs` table for a clean slate.
- Embed + ingest — Docs are vectorised with `NeuML/pubmedbert-base-embeddings` (768-dim, domain-tuned on PubMed/MEDLINE) and written to `lancedb_store/`.

Override paths via env vars: `KB_CSV_PATH` (CSV output), `LANCEDB_DIR` (store directory).
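A minimal sketch of the persist/reload steps using only the `csv` module — the function names mirror the builder's, but the doc contents and plumbing here are toy examples:

```python
import csv
import os
import tempfile

COLUMNS = ["doc_type", "doc_id", "title", "text", "icd_code", "icd_desc"]

def write_docs_to_csv(docs, path):
    # Serialise all in-memory docs; the CSV becomes the authoritative source.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(docs)

def read_docs_from_csv(path):
    # Read the CSV back for downstream embedding + LanceDB ingestion.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

docs = [{"doc_type": "icd", "doc_id": "icd_I20.9", "title": "I20.9",
         "text": "I20.9 — Angina pectoris, unspecified", "icd_code": "I20.9",
         "icd_desc": "Angina pectoris, unspecified"}]
path = os.path.join(tempfile.mkdtemp(), "fake_kb_data.csv")
write_docs_to_csv(docs, path)
print(read_docs_from_csv(path) == docs)  # the round trip is lossless for string fields
```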
Two document types are embedded and stored in `./lancedb_store`:

Three SQL `CREATE TABLE` definitions:

- `dbo.Patient` — MRN, DOB, Sex, PostalCode
- `dbo.Encounter` — EncounterDtm, FacilityCode, ChiefComplaint, TriageAcuity
- `dbo.Diagnosis` — DxCode, DxSystem, DxDescription, DxRank
| Clinical domain | Code range | Sample conditions |
|---|---|---|
| Symptoms & Signs | R00–R99 | Chest pain, syncope, fever, haematuria, altered mental status |
| Cardiovascular | I10–I99 | Angina, STEMI, NSTEMI, AFib, PE, aortic dissection, DVT, heart failure |
| Respiratory | J00–J99 | URI, pharyngitis, pneumonia, COPD exacerbation, asthma, pneumothorax |
| Gastrointestinal | K00–K95 | GERD, appendicitis, bowel obstruction, cholecystitis, pancreatitis, GI bleed |
| Genitourinary | N00–N99 | UTI, pyelonephritis, renal colic, kidney stones, PID, torsion |
| Neurology | G00–G99 | Meningitis, seizure, migraine, TIA, stroke, Bell palsy |
| Mental Health / Tox | F00–F99 | Alcohol withdrawal, opioid OD, psychosis, depression, panic, PTSD |
| Musculoskeletal | M00–M99 | Gout, OA, low back pain, sciatica, rotator cuff, fibromyalgia |
| Infectious Disease | A00–B99 | Sepsis, C. diff, herpes zoster, HIV, hepatitis, Lyme |
| Endocrine / Metabolic | E00–E90 | DKA, hypoglycaemia, thyroid storm, electrolyte disorders |
| Dermatology | L00–L99 | Cellulitis, abscess, urticaria, Stevens-Johnson |
| Eye / ENT | H00–H95 | Conjunctivitis, acute glaucoma, otitis media, Ménière's, vertigo |
| Haematology | D50–D89 | Anaemia, DIC, ITP, neutropenia, sickle-cell crisis |
| Injury / Trauma | S00–T98 | Fractures, concussion, burns, poisoning, anaphylaxis |
| Obstetrics | O00–O9A | Spontaneous abortion, pre-eclampsia, hyperemesis, PPROM |
| Neoplasms | C00–D49 | Common oncology presentations seen in the ED |
33 cases cover three behavioural categories:
| Case ID | Category | CTAS | What is being tested |
|---|---|---|---|
| `GREEN_angina_like` | GREEN | 3 | Happy path — chest tightness → I20.9 / R07.9 |
| `GREEN_uri_like` | GREEN | 4 | Happy path — fever + sore throat → J06.9 |
| `GREEN_uti_like` | GREEN | 4 | Happy path — pelvic pain + urgency → N39.0 |
| `RED_empty` | RED | 5 | Empty input → EMPTY_INPUT flag, zero ICD codes |
| `RED_nonsense` | RED | 5 | Garbage text → NONSENSE_INPUT / LOW_CONTEXT flag, zero ICD codes |
| `RED_schema_request` | RED | — | Prompt-injection asking for DB schema → rejected, no ICD codes |
| `EDGE_contains_code` | EDGE | 4 | Complaint already contains an ICD code string → test for echo / hallucination |
| `EDGE_conflicting_symptoms` | EDGE | 3 | Multi-system symptoms → CONFLICTING_SYMPTOMS flag |
| `EDGE_very_long` | EDGE | 3 | ~1 800-token input → truncation + token limit handling |
| Case ID | Category | What is being tested |
|---|---|---|
| `GREEN_chest_pain_exertional` | GREEN | Exertional chest pain → R07.x |
| `GREEN_sob_acute` | GREEN | Acute shortness of breath → J96.x |
| `GREEN_appendicitis_like` | GREEN | RLQ pain, nausea → K35.x |
| `GREEN_dvt_leg` | GREEN | Unilateral leg swelling → I82.4x |
| `GREEN_migraine_classic` | GREEN | Classic migraine with aura → G43.x |
| `GREEN_wrist_injury` | GREEN | Wrist injury / fracture → S52.x |
| `GREEN_cellulitis_leg` | GREEN | Red/warm leg → L03.1x |
| `GREEN_hypoglycemia` | GREEN | Shakiness, diaphoresis → E16.x |
| `GREEN_hypertension_headache` | GREEN | Headache + elevated BP → R51 / I10 |
| `GREEN_eye_redness` | GREEN | Red painful eye → H10.x |
| `GREEN_back_pain_acute` | GREEN | Acute low back pain → M54.5 |
| `GREEN_pediatric_ear` | GREEN | Ear pain, paediatric → H66.x |
| `GREEN_allergic_hives` | GREEN | Urticaria, allergic reaction → L50.x |
| `GREEN_vertigo` | GREEN | Dizziness / vertigo → R42 / H81.x |
| `GREEN_kidney_stone` | GREEN | Flank pain, haematuria → N20.x |
| Case ID | Category | What is being tested |
|---|---|---|
| `EDGE_vague_unwell` | EDGE | "Just not feeling right" — minimal info |
| `EDGE_sob_abbreviations` | EDGE | Heavy use of clinical abbreviations |
| `EDGE_overdose_intentional` | EDGE | Intentional overdose — safety-sensitive |
| `EDGE_seizure_postictal` | EDGE | Post-ictal state description → G40.x |
| `EDGE_pregnancy_bleeding` | EDGE | Early pregnancy bleeding → O20.x |
| `EDGE_mental_health` | EDGE | Psychiatric presentation → F-codes |
| `EDGE_hematuria_painless` | EDGE | Painless haematuria → R31.x |
| `EDGE_anaphylaxis` | EDGE | Anaphylactic reaction → T78.2 |
| `EDGE_foreign_body_ingested` | EDGE | Swallowed foreign body → T18.x |
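The harness checks structure per category rather than exact diagnoses. A hypothetical sketch of one case record and the kind of check applied to RED cases (record fields and function names are illustrative, not the harness's actual API):

```python
# One entry in the style of build_test_cases() — fields here are hypothetical.
case = {
    "case_id": "RED_empty",
    "category": "RED",
    "chief_complaint": "",
    "expect_flags": ["EMPTY_INPUT"],
    "expect_zero_codes": True,
}

def structurally_passes(output, case):
    # RED cases must raise the expected flag and emit zero ICD codes.
    if case["expect_zero_codes"] and output["candidate_icd_codes"]:
        return False
    return all(flag in output["flags"] for flag in case["expect_flags"])

print(structurally_passes({"candidate_icd_codes": [], "flags": ["EMPTY_INPUT"]}, case))  # True
```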
Every case produces a validated JSON object:
```json
{
  "input_text": "<exact ChiefComplaint text, char-for-char>",
  "normalized_chief_complaint": "<cleaned clinical phrase>",
  "candidate_icd_codes": ["I20.9"],
  "candidate_icd_rationales": ["Chest tightness on exertion, relieved at rest — consistent with stable angina."],
  "sql_fields_to_store": [
    "EncounterId", "ChiefComplaintRaw", "ChiefComplaintNormalized",
    "CandidateICD1", "CandidateICD1Confidence", "ModelName", "RunTimestampUTC"
  ],
  "confidence": 0.72,
  "flags": [],
  "model_used": "meta-llama/Llama-3.2-1B-Instruct"
}
```

Validation rules enforced on every output:
- All required keys present
- `candidate_icd_codes` contains only codes that appear in the retrieved context (grounding check)
- `input_text` matches the original complaint exactly
- `confidence` in `[0.0, 1.0]`; universal floor of `0.10` when codes are present
- Placeholder strings (e.g. `"string"`) are rejected
- On parse or validation failure: prompt is reinforced and retried once before raising
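A condensed sketch of these rules as a single checker — the real validator lives in the test harness; only the field names come from the output schema above:

```python
REQUIRED = {"input_text", "normalized_chief_complaint", "candidate_icd_codes",
            "candidate_icd_rationales", "sql_fields_to_store", "confidence",
            "flags", "model_used"}

def validate(output, original_complaint, retrieved_codes):
    errors = []
    if missing := REQUIRED - output.keys():
        errors.append(f"missing keys: {sorted(missing)}")
    if output.get("input_text") != original_complaint:
        errors.append("input_text must match the complaint char-for-char")
    for code in output.get("candidate_icd_codes", []):
        if code not in retrieved_codes:  # grounding check against retrieved context
            errors.append(f"ungrounded code: {code}")
    confidence = output.get("confidence", -1.0)
    if not 0.0 <= confidence <= 1.0:
        errors.append("confidence outside [0.0, 1.0]")
    elif output.get("candidate_icd_codes") and confidence < 0.10:
        errors.append("confidence below the universal 0.10 floor")
    return errors  # empty list == valid; non-empty triggers one reinforced retry
```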
`post_process()` guards (applied between generation and validation):

- Dict coercion — If `candidate_icd_rationales` contains `dict` objects (a frequent 1B-model artefact), each is unwrapped to its string value automatically.
- JSON-bleed sanitisation — If `normalized_chief_complaint` starts with `{` or contains `candidate_icd_codes`, the field is cleared to prevent schema leakage.
- Schema-echo guard — Detects and removes schema keywords (including `CANDIDATEICD1`, `MODELNAME`, `RUNTIMESTAMPUTC`) leaked into clinical fields.
- ICD fallback — When the LLM returns an empty `candidate_icd_codes` list, the retrieval rank-1 code is injected so validation always receives at least one code.
- Rationale padding / truncation — Pads with `"No rationale provided."` or truncates so rationales align 1:1 with codes.
- Spurious `LOW_CONTEXT` stripping — Removes the `LOW_CONTEXT` flag when codes were actually extracted successfully.
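Three of the six guards sketched in isolation — dict coercion, ICD fallback, and rationale alignment. The wrapper function and its argument names are hypothetical:

```python
def post_process(output, retrieval_ranked_codes):
    # Dict coercion — 1B models often emit {"rationale": "..."} instead of a string.
    output["candidate_icd_rationales"] = [
        next(iter(r.values())) if isinstance(r, dict) else r
        for r in output.get("candidate_icd_rationales", [])
    ]
    # ICD fallback — inject the retrieval rank-1 code when the LLM returns none.
    if not output.get("candidate_icd_codes") and retrieval_ranked_codes:
        output["candidate_icd_codes"] = [retrieval_ranked_codes[0]]
    # Rationale padding / truncation — keep rationales aligned 1:1 with codes.
    codes = output["candidate_icd_codes"]
    rationales = output["candidate_icd_rationales"]
    output["candidate_icd_rationales"] = (
        rationales + ["No rationale provided."] * len(codes))[:len(codes)]
    return output
```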
Tested on GitHub Codespace — AMD EPYC 9V74, 4 vCPUs, 32 GB RAM, no GPU.
| Model | Pass rate (33 cases) | Avg generation speed / latency | Peak RAM |
|---|---|---|---|
| Llama 3.2 1B Instruct (float16) | see MODEL_COMPARISON_REPORT.md | ~8–12 tok/s | ~2.5 GB |
| Gemma 3 1B Instruct (bfloat16) | see report | ~8–11 tok/s | ~2.0 GB |
| H2O-Danube3 500M (float16) | see report | ~14–18 tok/s | ~0.98 GB |
| Llama 3.2 1B Q4_K_M (GGUF) | 33/33 (100%) | ~10.7 s/case | ~0.81 GB |
The Q4_K_M profile achieved 33/33 structural passes with the PubMedBERT embedder and anti-laziness prompt. Prompt engineering alone reduced the ICD fallback rate from 48% to 29%. See `results/PIPELINE_REPORT_20260306.md` for the detailed per-case run report.
Full per-case pass/fail table and timing data: MODEL_COMPARISON_REPORT.md
- Python 3.10+
- ~3 GB free disk for model weights (less for GGUF / Danube3)
- No GPU required
```bash
pip install lancedb sentence-transformers transformers torch python-dotenv pyyaml pandas
```

For the GGUF profile, also install:

```bash
pip install llama-cpp-python
```

```bash
cp .env.example .env
# Edit .env — set HF_TOKEN if using a gated model, adjust HF_HOME to your cache path
```

```bash
python 01_build_fake_kb_lancedb.py
```

This runs the full CSV-first pipeline:

- Generates 605 KB docs (3 schema + 602 ICD-10)
- Writes them to `fake_kb_data.csv`
- Flushes any existing `kb_docs` table from LanceDB
- Embeds all docs and re-ingests into `./lancedb_store/`
```bash
# Default profile (llama32_1b_instruct) — saves JSON automatically
python -u 02_rag_llama32_edge_tests.py

# Ungated alternative (no HF token needed)
python -u 02_rag_llama32_edge_tests.py --model danube3_500m

# Specify custom results path
python -u 02_rag_llama32_edge_tests.py --save-results my_results.json

# Via environment variable
MODEL_PROFILE=gemma3_1b python -u 02_rag_llama32_edge_tests.py
```

Results are always persisted to `results/rag_results_<YYYYMMDD_HHMMSS>.json` (unless overridden with `--save-results`).
```bash
# Uses active_profile from model_config.yaml
python run_pipeline.py

# Override model
python run_pipeline.py --model danube3_500m

# Override output path
python run_pipeline.py --save-results path/to/output.json
```

`run_pipeline.py` chains steps 3 and 4: build KB from CSV → ingest LanceDB → run RAG tests → save JSON.
| Decision | Rationale |
|---|---|
| CPU-only inference | Targets Windows Server / developer laptops with no GPU; PyTorch CPU threads saturate available cores. |
| Grounded ICD codes only | The prompt forbids codes absent from the retrieved context, directly reducing hallucination. |
| Retry with reinforcement | On JSON parse / validation failure, stricter rules are appended and generation is retried once. |
| Heartbeat threads | Long model-load and generation phases emit periodic HEARTBEAT lines so the process never looks hung in CI or terminal. |
| Behaviour-only few-shot examples | In-prompt examples demonstrate empty/nonsense handling only — no diagnosis-anchoring — to avoid biasing ICD predictions. |
| `model_config.yaml` profiles | All model hyperparameters (dtype, max_new_tokens, repetition_penalty, backend) live in a single YAML file; switching models requires no code changes. |
| Domain-tuned embedder | NeuML/pubmedbert-base-embeddings (768-dim) replaces the generic all-MiniLM-L6-v2; dramatically improves retrieval for clinical/obstetric queries. |
| Anti-laziness prompt | Explicit instruction that the model must extract ≥ 1 ICD code if any retrieved context code is even partially relevant; reduced fallback rate by 40%. |
| Defensive `post_process()` pipeline | Six guards (dict coercion, JSON-bleed, schema-echo, ICD fallback, rationale alignment, flag cleanup) run between generation and validation, catching 1B-model output artefacts without requiring retries. |
| Universal confidence floor | confidence ≥ 0.10 whenever candidate_icd_codes is non-empty, preventing misleading zero-confidence outputs. |
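The heartbeat decision above is a small, self-contained pattern: a daemon thread emits a line on an interval until the main work signals it to stop. A sketch with `threading.Event` (interval and message text are illustrative):

```python
import threading
import time

def heartbeat(stop, interval=5.0, emit=print):
    # Emit a HEARTBEAT line every `interval` seconds until `stop` is set;
    # Event.wait() doubles as an interruptible sleep and returns True once set.
    beats = 0
    while not stop.wait(interval):
        beats += 1
        emit(f"HEARTBEAT {beats}: still working...")

stop = threading.Event()
thread = threading.Thread(target=heartbeat, args=(stop, 0.1), daemon=True)
thread.start()
time.sleep(0.35)  # stands in for a long model-load or generation phase
stop.set()
thread.join()
```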
- Swap in real data — Replace `build_fake_icd_docs()` with a reader for a real ICD-10 CSV (CMS tabular file or WHO release); it will be automatically written to `fake_kb_data.csv` and ingested.
- Add real EMR schema — Extend `build_fake_schema_docs()` with your actual `CREATE TABLE` definitions.
- Swap the generator — Any `AutoModelForCausalLM`-compatible model works; add a new profile block to `model_config.yaml`.
- Persist results — The `results/rag_results_<timestamp>.json` file includes full `case_outputs` per case; feed those into your pipeline or BI tool.
- Batch mode — Wrap `run_case()` in a loop over a CSV of real chief complaints, or extend `build_test_cases()` to load from a CSV.
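The batch-mode suggestion can be sketched as a thin wrapper. `run_case()` is stubbed here, and the `ChiefComplaint` column name is an assumption about your CSV layout:

```python
import csv
import io

def run_case(complaint):
    # Stub — the real run_case() executes the full RAG pipeline per complaint.
    return {"input_text": complaint, "candidate_icd_codes": [], "flags": []}

def run_batch(csv_file, text_column="ChiefComplaint"):
    # Stream rows from an open CSV file-like object and score each complaint.
    return [run_case(row[text_column]) for row in csv.DictReader(csv_file)]

sample = io.StringIO("ChiefComplaint\nchest pain on exertion\nfever and sore throat\n")
results = run_batch(sample)
print(len(results))  # 2 complaints in, 2 result objects out
```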
```text
.
├── 01_build_fake_kb_lancedb.py          # KB builder: generate → CSV → flush → embed → LanceDB
├── 02_rag_llama32_edge_tests.py         # RAG pipeline + 33-case edge-test harness (auto-saves JSON)
├── run_pipeline.py                      # One-command runner: KB build → RAG tests → JSON results
├── fake_kb_data.csv                     # Generated KB docs (CSV intermediary; auto-created by step 1)
├── model_config.yaml                    # Model profiles (switch without code changes)
├── lancedb_store/                       # Persisted vector store (flushed + rebuilt by step 1)
├── results/                             # Per-run JSON results (created by step 2 / run_pipeline.py)
├── MODEL_COMPARISON_REPORT.md           # Benchmark results across all four model profiles
├── TEST_REPORT.md                       # Detailed per-case pass/fail output
├── results/PIPELINE_REPORT_20260306.md  # Detailed pipeline run report with per-case analysis
├── .env.example                         # Environment variable template (copy → .env)
└── .gitignore
```