WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient-Facing Dialogue

This repository hosts the code, models, and datasets accompanying the paper. The work investigates how Automatic Speech Recognition (ASR) errors distort clinical meaning in patient-facing dialogue, and shows that traditional metrics like Word Error Rate (WER) fail to capture real clinical risk. The project includes scripts for aligning ground-truth utterances to their ASR-generated counterparts with an LLM-based semantic aligner, and for optimizing an LLM-as-a-Judge for clinical impact assessment using GEPA through DSPy.

Abstract

As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's kappa of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.

What This Repo Provides

28 ASR evaluation metrics across 3 tiers — from simple edit-distance metrics (WER, CER) through n-gram overlap (BLEU, ROUGE, METEOR) to learned semantic metrics (BERTScore, SeMaScore, Intelligibility, Heval)
LLM-based clinical impact judge — GEPA-optimized judge that replicates expert clinician assessment of ASR error severity
Two CLI evaluation scripts — evaluate a single GT/HYP pair (evaluate_example.py) or batch-process a CSV (evaluate_dataset.py). See scripts/README.md for full usage documentation.

Quick Start

git clone https://github.com/YOUR_ORG/wer-is-unaware.git && cd wer-is-unaware
uv sync                          # Tier 1 metrics (15 metrics, no GPU)

# Evaluate a single pair
python scripts/evaluate_example.py \
  --gt "I have been experiencing chest pain for three days" \
  --hyp "I have been experiencing chess pain for three days"

# List all available metrics
python scripts/evaluate_example.py --list-metrics
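
To make the example pair above concrete, here is a minimal sketch of word-level WER computed with a standard Levenshtein alignment. It is an illustration only, not the repository's implementation: the function name `word_error_rate` is hypothetical, and tokenization by plain whitespace splitting is an assumption. The pair differs by a single substitution ("chest" to "chess"), so WER stays low even though the clinical meaning changes.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length (hypothetical helper)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


gt = "I have been experiencing chest pain for three days"
hyp = "I have been experiencing chess pain for three days"
print(f"WER = {word_error_rate(gt, hyp):.3f}")  # one substitution over nine reference words
```

A score this small would usually be read as a near-perfect transcript, which is the gap the clinical impact judge is meant to address.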

Dependency Tiers

Install Command	What You Get	Notes
`uv sync`	Tier 1: WER, CER, BLEU, ROUGE, etc. (15 metrics)	No GPU needed
`uv sync --extra learned-semantic`	+ Tiers 2 & 3 (13 more metrics)	Models auto-download on first run (~10 GB)
`uv sync --extra bleurt`	+ BLEURT & Clinical BLEURT	TensorFlow required, manual checkpoint download
`uv sync --extra judge`	LLM clinical impact judge	Requires API key (`.env`)
`uv sync --extra all-metrics`	All 28 metrics
`uv sync --extra all`	Everything (metrics + judge + data-prep + plot + dev)

After install, download required NLTK data (needed for BLEU, METEOR):

python -c "import nltk; nltk.download('punkt_tab'); nltk.download('wordnet')"

Manual Downloads

Learned-semantic models (Tiers 2 & 3) auto-download from HuggingFace on first use (~10 GB total, cached in ~/.cache/huggingface/). Two metrics require manual checkpoint downloads:

BARTScore (ParaBank2): download the .pth checkpoint from neulab/BARTScore (checkpoint on Google Drive), then set in .env:
```
BARTSCORE_CHECKPOINT=path/to/bart_score.pth
```
Clinical BLEURT: download the checkpoint following the ClinicalBLEURT instructions (checkpoint on Google Drive), then set in .env:
```
CLINICAL_BLEURT_CHECKPOINT=path/to/ClinicalBLEURT
```

Standard BLEURT uses the test checkpoint bundled with the package (no manual download needed). See .env.example for all environment variable names.
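
A quick sanity check before running the checkpoint-dependent metrics can catch a missing or mistyped path early. The sketch below uses only the standard library and the two variable names from the `.env` snippets above. The helper `missing_checkpoints` is hypothetical, and it assumes the variables have already been exported into the process environment (how the repo itself loads `.env` is not covered here).

```python
import os
from pathlib import Path

# Variable names taken from the .env snippets above.
CHECKPOINT_VARS = ("BARTSCORE_CHECKPOINT", "CLINICAL_BLEURT_CHECKPOINT")


def missing_checkpoints(env=None):
    """Return the variables that are unset or point to a non-existent path."""
    env = os.environ if env is None else env
    problems = []
    for name in CHECKPOINT_VARS:
        value = env.get(name, "").strip()
        if not value or not Path(value).expanduser().exists():
            problems.append(name)
    return problems


if __name__ == "__main__":
    for name in missing_checkpoints():
        print(f"{name} is unset or its path does not exist")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.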

Folder Structure

metrics/ — ASR evaluation metrics toolkit (28 metrics across 3 tiers). API: from metrics import calculate_metric, list_metrics.
scripts/ — CLI evaluation scripts for single-pair and batch CSV evaluation. See scripts/README.md.
alignment/ — semantic alignment toolkit (aligner code, scripts, sample data, sample results). See alignment/README.md.
llm_judge/ — clinical impact judge (signatures, metrics, providers, optimizers, CLI, bundled dataset, saved judges). See llm_judge/README.md.
data_preparation/ — data pipeline scripts and PriMock57 transcript data. See data_preparation/README.md.
tests/ — test suite (pytest tests/).

Dataset

Our evaluation uses 21 consultations from the PriMock57 dataset of simulated primary-care consultations (CC BY 4.0). Ground-truth transcripts, ASR hypotheses, and preparation scripts are in data_preparation/.

Papadopoulos Korfiatis, A., Moramarco, F., Sarac, R. & Savkov, A. (2022). PriMock57: A Dataset Of Primary Care Mock Consultations. ACL 2022.

The repo ships two CSVs:

File	Rows	Description
`llm_judge/dataset/primock_data_final_outcomes.csv`	175	Clinician-annotated clinical-impact labels (judge dataset)
`metrics/data/primock_metrics_subset.csv`	157	Pre-computed metric scores for utterances with non-zero WER after non-lexical token (NLT) filtering

Column Reference

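Shared base columns (present in both CSVs; the 17 listed here plus `gt_context` and `hyp_context`, which appear in each CSV's own table below, give 19 in total):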

Column	Description
`composite_key`	Unique utterance identifier (`call_id` + `interaction_index`)
`dora_or_primock`	Dataset source identifier
`interaction_index`	Turn index within the consultation
`call_id`	Consultation identifier
`doctor`	Doctor identifier
`patient_ground_truth`	Original ground-truth patient utterance
`patient_hypothesis`	ASR-generated hypothesis
`alignment_similarity_score`	Semantic similarity score from the LLM aligner
`alignment_status`	Alignment classification (e.g. match, mismatch)
`provider`	ASR system provider
`clinician_a`	Clinician A's impact label
`justification_a`	Clinician A's justification
`clinician_b`	Clinician B's impact label
`justification_b`	Clinician B's justification
`resolved_label`	Resolved label after adjudication
`disagreement_final_reasoning_from_meeting`	Resolution reasoning (if clinicians disagreed)
`final_outcome`	Final clinical impact label (0 = No, 1 = Minimal, 2 = Significant)
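
Agreement between the two annotators can be checked directly from label columns such as `clinician_a` and `clinician_b`, using the same statistic the abstract reports for the judge. The following is a minimal, standard-library sketch of unweighted Cohen's kappa. It is not the paper's evaluation code: the function name `cohen_kappa` is hypothetical, and the toy labels are made up purely for illustration (they are not values from the dataset).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels using the 0/1/2 coding described for final_outcome; illustrative only.
rater_a = [0, 0, 1, 2, 2, 1, 0, 2]
rater_b = [0, 0, 1, 2, 1, 1, 0, 2]
print(f"kappa = {cohen_kappa(rater_a, rater_b):.3f}")
```

Unlike raw accuracy, kappa discounts the agreement two raters would reach by chance given how often each uses every label.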

Judge CSV only:

Column	Description
`norm_ground_truth`	Cleaned ground truth WITHOUT NLT filtering (fillers like "uh", "um" preserved)
`norm_hypothesis`	Cleaned hypothesis WITHOUT NLT filtering
`gt_context`	Preceding conversation context (ground-truth side)
`hyp_context`	Preceding conversation context (hypothesis side)

Metrics CSV only:

Column	Description
`gt_context`	Preceding conversation context (ground-truth side)
`hyp_context`	Preceding conversation context (hypothesis side)
`clean_ground_truth`	Cleaned ground truth WITH NLT filtering (fillers removed)
`clean_hypothesis`	Cleaned hypothesis WITH NLT filtering
28 metric columns	Pre-computed scores (names match registry keys — run `python scripts/evaluate_example.py --list-metrics` to list)

Why two text columns? The judge CSV uses norm_* text, which preserves non-lexical tokens (NLTs) such as "uh" and "um", because these are needed for utterance filtering. The metrics CSV uses clean_* text, with NLTs filtered out, because metrics are computed on cleaned text. Some utterances appear in the judge CSV but not the metrics CSV: they had zero WER after NLT filtering and were excluded from the metrics dataset.
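
The relationship between the two files can be inspected by comparing their `composite_key` columns. The sketch below uses only the standard library and runs on tiny in-memory CSVs with made-up keys and a made-up `wer` column. The helper name `keys_only_in_first` is hypothetical.

```python
import csv
import io


def keys_only_in_first(first_csv, second_csv, key="composite_key"):
    """Return sorted key values present in first_csv but absent from second_csv.

    Both arguments are open text file objects (or any iterable of CSV lines).
    """
    first_keys = {row[key] for row in csv.DictReader(first_csv)}
    second_keys = {row[key] for row in csv.DictReader(second_csv)}
    return sorted(first_keys - second_keys)


# Made-up miniature stand-ins for the judge and metrics CSVs.
judge_csv = io.StringIO("composite_key,final_outcome\nc1_0,0\nc1_1,2\nc2_0,1\n")
metrics_csv = io.StringIO("composite_key,wer\nc1_1,0.25\nc2_0,0.10\n")

print(keys_only_in_first(judge_csv, metrics_csv))  # keys missing from the metrics CSV
```

To run it on the shipped files, open each path from the table above with `open(path, newline="")` and pass the two file objects in the same order.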

Paper

Preprint available on arXiv: https://arxiv.org/abs/2511.16544

Citation

@misc{ellis2025werunawareassessingasr,
      title={WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue},
      author={Zachary Ellis and Jared Joselowitz and Yash Deo and Yajie He and Anna Kalygina and Aisling Higham and Mana Rahimzadeh and Yan Jia and Ibrahim Habli and Ernest Lim},
      year={2025},
      eprint={2511.16544},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.16544},
}
