This directory contains scripts to benchmark the on-device models used in MedGem against their original HuggingFace counterparts.
Before running any evaluation, set up the shared Python environment from the project root:
# From the project root (MedGem/)
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r evaluation/requirements.txt

Compares the ExecuTorch build of google/medgemma-1.5-4b-it against
the original HuggingFace model using a sample chest X-ray image and text prompt.
You must build and install the ExecuTorch Python module to run this evaluation:
git clone https://github.com/kamalkraj/executorch.git
cd executorch
git checkout feature/topp-sampling-support
./install_executorch.sh
cd ..

python evaluation/medgemma_evaluation.py
# With a custom image URL and prompt
python evaluation/medgemma_evaluation.py \
--image-url "https://example.com/xray.jpg" \
--prompt "What is visible in this image?"
# With a local image file
python evaluation/medgemma_evaluation.py \
--local-image path/to/xray.png \
--prompt "Describe this X-ray"
# Run only one backend
python evaluation/medgemma_evaluation.py --skip-hf # ExecuTorch only
python evaluation/medgemma_evaluation.py --skip-et # HF direct only
# Save results to JSON for programmatic comparison
python evaluation/medgemma_evaluation.py --output-json results.json

| Flag | Default | Description |
|---|---|---|
| --hf-model | google/medgemma-1.5-4b-it | HuggingFace model ID |
| --et-model | medgemma-1.5-4b-it-executorch/8192/model.pte | Path to the ExecuTorch model file |
| --et-tokenizer | medgemma-1.5-4b-it-executorch/tokenizer.model | Path to the ExecuTorch tokenizer |
| --image-url | (Wikimedia Chest X-ray) | URL of the image to evaluate (overridden by --local-image) |
| --local-image | - | Path to a local image file (overrides --image-url) |
| --prompt | Describe this X-ray | Prompt to provide with the image |
| --max-tokens | 1024 | Maximum new tokens to generate |
| --skip-hf | - | Skip the HuggingFace backend |
| --skip-et | - | Skip the ExecuTorch backend |
| --output-json | - | Write results to a JSON file for programmatic comparison |
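The precedence between --image-url and --local-image described in the table can be illustrated with a minimal sketch. This is not the code from medgemma_evaluation.py: `build_parser` and `resolve_image_source` are hypothetical names, only the two image flags are modelled, and the default URL is the one shown in the sample run's download message.

```python
import argparse

# Default URL taken from the sample run's "Downloading image from ..." line.
DEFAULT_IMAGE_URL = (
    "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the two image flags from the table."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--image-url", default=DEFAULT_IMAGE_URL)
    parser.add_argument("--local-image", default=None)
    return parser


def resolve_image_source(args: argparse.Namespace) -> tuple[str, str]:
    """Return ("local", path) when --local-image is set, else ("url", url)."""
    if args.local_image is not None:
        return ("local", args.local_image)
    return ("url", args.image_url)


if __name__ == "__main__":
    # Both flags given: the local file wins, as the table states.
    demo = build_parser().parse_args(
        ["--image-url", "https://example.com/xray.jpg", "--local-image", "path/to/xray.png"]
    )
    print(resolve_image_source(demo))
```

Under this assumption, passing both flags never triggers a download, which matches the "overrides" wording in the table.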
python evaluation/medgemma_evaluation.py

Downloading image from https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png...
[HF] Loading model google/medgemma-1.5-4b-it...
[HF] Preparing inputs...
[HF] Generating...
[ExecuTorch] Loading runner with medgemma-1.5-4b-it-executorch/8192/model.pte...
[ExecuTorch] Loading image processor...
[ExecuTorch] Preparing inputs...
[ExecuTorch] Generating...
================================================================================
RESULTS
================================================================================
--- HuggingFace ---
Time : 163.36s
Speed : 1.6 tokens/sec (265 tokens)
Text :
This is a chest X-ray image. Here's a description of what I see:
* **Overall Appearance:** The image shows the chest cavity, including the lungs, heart, ribs, and other structures.
* **Lungs:** The lungs appear clear, with no obvious signs of consolidation (like pneumonia) or large masses. The lung markings are visible, indicating normal lung tissue.
* **Heart:** The heart size appears within normal limits.
* **Mediastinum:** The mediastinum (the space between the lungs containing the heart, great vessels, trachea, etc.) appears normal in width.
* **Ribs and Bones:** The ribs and other bony structures are visible and appear intact.
* **Diaphragm:** The diaphragm (the muscle separating the chest from the abdomen) is visible at the bottom of the image.
* **No Obvious Abnormalities:** There are no obvious signs of acute pathology, such as pneumothorax (collapsed lung), pleural effusion (fluid in the pleural space), or significant consolidation.
**Important Note:** This is a visual description based on the image provided. A definitive diagnosis requires a qualified radiologist to interpret the image in the context of the patient's clinical history and other relevant information.
--- ExecuTorch ---
Time : 29.61s
Speed : 7.1 tokens/sec (210 tokens)
Text :
This chest X-ray shows the lungs, heart, and surrounding structures. Here's a general description:
* **Lungs:** The lungs appear clear, with no obvious consolidation (areas of dense opacity) or infiltrates. The lung markings are relatively uniform.
* **Heart:** The heart size appears within normal limits, although the exact cardiothoracic ratio (ratio of heart width to chest width) is difficult to determine precisely from this single image without specific measurements.
* **Ribs and Chest Wall:** The ribs and clavicles are visible.
* **Diaphragm:** The diaphragm is visible at the bottom of the chest.
* **Overall:** The image appears generally normal, with no obvious acute abnormalities.
**Important Note:** This is a visual description based on a single image. A proper medical interpretation requires considering the patient's clinical history, comparing the image to previous ones if available, and potentially using additional imaging techniques. This description is not a substitute for a professional medical diagnosis.
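
The Speed figures in the sample output are consistent with generated tokens divided by wall-clock time. The sketch below recomputes them from the numbers printed above; `tokens_per_second` is a hypothetical helper, not a function from the evaluation script. The two backends generated different numbers of tokens in this run, so the ratio is only a rough indication.

```python
def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Throughput as generated tokens divided by elapsed seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


# Token counts and timings copied from the sample output above.
hf_speed = tokens_per_second(265, 163.36)
et_speed = tokens_per_second(210, 29.61)

print(f"HuggingFace: {hf_speed:.1f} tokens/sec")
print(f"ExecuTorch : {et_speed:.1f} tokens/sec")
print(f"Throughput ratio (ExecuTorch / HuggingFace): {et_speed / hf_speed:.2f}x")
```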
Compares the ONNX (sherpa-onnx, int8-quantised) build of google/medasr against
the original HuggingFace model (AutoModelForCTC) using Word Error Rate (WER)
and Real-Time Factor (RTF). Results include a coloured word-level diff that
highlights insertions, deletions, and substitutions against the reference transcript.
The medasr-onnx/ directory must contain model.int8.onnx and tokens.txt.
huggingface-cli download kamalkraj/medasr-onnx --local-dir medasr-onnx

# Default: downloads the bundled test_audio.wav from the HuggingFace hub
python evaluation/medasr_evaluation.py
# Custom audio with ground-truth references for WER
python evaluation/medasr_evaluation.py \
--audio path/to/audio.wav \
--references "expected transcript here"
# Multiple files
python evaluation/medasr_evaluation.py \
--audio audio1.wav audio2.wav \
--references "ref one" "ref two"
# Run only one backend
python evaluation/medasr_evaluation.py --skip-hf-direct # ONNX only
python evaluation/medasr_evaluation.py --skip-onnx # HF direct only

| Flag | Default | Description |
|---|---|---|
| --onnx-dir | medasr-onnx | Directory with model.int8.onnx + tokens.txt |
| --hf-model | google/medasr | HuggingFace model ID |
| --audio | (hub test file) | One or more .wav files to evaluate |
| --references | (none) | Ground-truth transcripts for WER (one per audio file) |
| --num-threads | 2 | CPU threads for ONNX inference |
| --skip-onnx | - | Skip the ONNX backend |
| --skip-hf-direct | - | Skip the HuggingFace direct backend |
python evaluation/medasr_evaluation.py

[INFO] No --audio files supplied. Downloading test_audio.wav from HuggingFace hub …
[INFO] Downloaded to: /Users/kamalkraj/.cache/huggingface/hub/models--google--medasr/snapshots/7383e5e461baa820bdf9060652fe51b333bafba5/test_audio.wav
[INFO] Loading ONNX recognizer …
[INFO] Running ONNX inference …
======================================================================
ONNX (sherpa-onnx)
======================================================================
File : /Users/kamalkraj/.cache/huggingface/hub/models--google--medasr/snapshots/7383e5e461baa820bdf9060652fe51b333bafba5/test_audio.wav
Reference: Exam type CT chest PE protocol period. Indication 54 year old female, shortness of breath, evaluate for PE period. Technique standard protocol period. Findings colon. Pulmonary vasculature colon. The main PA is patent period. There are filling defects in the segmental branches of the right lower lobe comma compatible with acute PE period. No saddle embolus period. Lungs colon. No pneumothorax period. Small bilateral effusions comma right greater than left period. New paragraph. Impression colon Acute segmental PE right lower lobe period.
HYP: exam type ct chest pe protocol period indication 54 year old female shortness of breath evaluate for pe period technique standard protocol period findings colon pulmonary vasculature colon the main pa is patent period there are filling defects in the segmental branches of the right lower lobe comma compatible with acute pe period no saddle embolus period lungs colon no pneumothorax period small bilateral effusions comma right greater than left period new paragraph impression colon acute segmental pe right lower lobe period
WER: 0.00%: insertions 0, deletions 0, substitutions 0, ref tokens 82
exam type ct chest pe protocol period indication 54 year old female shortness of breath evaluate for pe period technique standard protocol period findings colon pulmonary vasculature colon the main pa is patent period there are filling defects in the segmental branches of the right lower lobe comma compatible with acute pe period no saddle embolus period lungs colon no pneumothorax period small bilateral effusions comma right greater than left period new paragraph impression colon acute segmental pe right lower lobe period
Duration : 43.80s Elapsed: 0.88s RTF: 0.020
--- Totals ---
Total audio : 43.80s
Total time : 0.88s
Overall RTF : 0.020
[INFO] Loading HuggingFace direct model (AutoModelForCTC) …
[INFO] Running HuggingFace direct inference on cpu …
======================================================================
HuggingFace direct (google/medasr)
======================================================================
File : /Users/kamalkraj/.cache/huggingface/hub/models--google--medasr/snapshots/7383e5e461baa820bdf9060652fe51b333bafba5/test_audio.wav
Reference: Exam type CT chest PE protocol period. Indication 54 year old female, shortness of breath, evaluate for PE period. Technique standard protocol period. Findings colon. Pulmonary vasculature colon. The main PA is patent period. There are filling defects in the segmental branches of the right lower lobe comma compatible with acute PE period. No saddle embolus period. Lungs colon. No pneumothorax period. Small bilateral effusions comma right greater than left period. New paragraph. Impression colon Acute segmental PE right lower lobe period.
HYP: exam type ct chest pe protocol period indication 54 year old female shortness of breath evaluate for pe period technique standard protocol period findings colon pulmonary vasculature colon the main pa is patent period there are filling defects in the segmental branches of the right lower lobe comma compatible with acute pe period no saddle embolus period lungs colon no pneumothorax period small bilateral effusions comma right greater than left period new paragraph impression colon acute segmental pe right lower lobe period
WER: 0.00%: insertions 0, deletions 0, substitutions 0, ref tokens 82
exam type ct chest pe protocol period indication 54 year old female shortness of breath evaluate for pe period technique standard protocol period findings colon pulmonary vasculature colon the main pa is patent period there are filling defects in the segmental branches of the right lower lobe comma compatible with acute pe period no saddle embolus period lungs colon no pneumothorax period small bilateral effusions comma right greater than left period new paragraph impression colon acute segmental pe right lower lobe period
Duration : 43.80s Elapsed: 0.44s RTF: 0.010
--- Totals ---
Total audio : 43.80s
Total time : 0.44s
Overall RTF : 0.010
======================================================================
COMPARISON SUMMARY
======================================================================
File WER ONNX WER Direct RTF ONNX RTF Direct
---------------------------------------------------------------------------
test_audio.wav 0.00% 0.00% 0.020 0.010
Mean WER — ONNX: 0.00% | Direct: 0.00%
Overall RTF — ONNX: 0.020 | Direct: 0.010
Note: When no --references are provided, the HF direct model output is used as the WER reference, measuring how faithfully the quantised ONNX model reproduces the original model's transcription. Deleted words are highlighted in red and inserted words in green in terminals that support ANSI colour codes.
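
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and RTF are conventionally computed. It is not the code from medasr_evaluation.py. The normalisation step (lowercasing and dropping punctuation) is an assumption inferred from the HYP lines in the sample output, and `normalize`, `word_error_rate`, and `real_time_factor` are hypothetical names.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return elapsed_s / audio_s


if __name__ == "__main__":
    print(word_error_rate("No saddle embolus period.", "no saddle embolus period"))
    # Timings copied from the sample output above.
    print(f"ONNX RTF  : {real_time_factor(0.88, 43.80):.3f}")
    print(f"Direct RTF: {real_time_factor(0.44, 43.80):.3f}")
```

When no references are supplied, the same `word_error_rate` call would take the HF direct transcript as its first argument and the ONNX transcript as its second.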
Compares the LiteRT (TFLite, int8-quantised) build of google/embeddinggemma-300m against
the original HuggingFace model (SentenceTransformer) using Cosine Similarity and inference time.
The embeddinggemma-300m-litert/ directory must contain embedding_gemma_no_normalize_q8.tflite.
The tokenizer.model is included by default in that directory, so no separate download is needed.
python evaluation/embedding_evaluation.py

| Flag | Default | Description |
|---|---|---|
| --model-path | embeddinggemma-300m-litert/embedding_gemma_no_normalize_q8.tflite | Path to the LiteRT model file |
| --tokenizer-path | embeddinggemma-300m-litert/tokenizer.model | Path to the SentencePiece tokenizer |
| --hf-model | google/embeddinggemma-300m | HuggingFace model ID |
Loading HF model: google/embeddinggemma-300m...
Loading LiteRT model: embeddinggemma-300m-litert/embedding_gemma_no_normalize_q8.tflite...
Warming up LiteRT...
==========================================================================================
Test Case | Cosine Sim | HF Time | LT Time | Speedup | Status
------------------------------------------------------------------------------------------
Which planet is known as the Red Planet? | 0.9986 | 0.1658s | 0.8831s | 0.19x | ✅ PASS
What are the symptoms of diabetes? | 0.9987 | 0.0737s | 0.8786s | 0.08x | ✅ PASS
How to treat a common cold? | 0.9989 | 0.0426s | 0.8869s | 0.05x | ✅ PASS
==========================================================================================
Encoding documents...
# Document (truncated) Sim Status
---------------------------------------------------------------------------
1 Mars, known for its reddish appearance, is often referr 0.9988 ✅ PASS
2 Diabetes symptoms include increased thirst, frequent ur 0.9988 ✅ PASS
3 Common cold treatment involves rest, fluids, and over-t 0.9988 ✅ PASS
4 The moon is Earth's only natural satellite. 0.9987 ✅ PASS
5 A healthy diet and regular exercise are important for w 0.9985 ✅ PASS
---------------------------------------------------------------------------
Average Doc Sim: 0.9987 | HF: 0.0939s | LiteRT (sequential): 4.4285s | Speedup: 0.02x
==========================================================================================
SUMMARY
------------------------------------------------------------------------------------------
Similarity threshold : 0.99
Avg query cosine similarity : 0.9987 (3/3 queries pass)
Avg document cosine similarity: 0.9987 (5/5 docs pass)
Avg HF query latency : 0.0940s
Avg LiteRT query latency : 0.8829s (post warm-up)
Overall result : ✅ ALL PASS
==========================================================================================
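
The pass/fail check and the Speedup column can be sketched as follows. This is an illustration under stated assumptions, not the code from embedding_evaluation.py: the 0.99 threshold is the value printed in the summary, the speedup is taken to be HF time divided by LiteRT time (which reproduces the 0.19x in the first row), and the function names are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.99  # value printed in the SUMMARY block above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of two vectors divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes(a: list[float], b: list[float], threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """True when the two embeddings are at least `threshold` similar."""
    return cosine_similarity(a, b) >= threshold


def speedup(hf_seconds: float, litert_seconds: float) -> float:
    """Assumed definition: HF time / LiteRT time, so values below 1 mean LiteRT was slower."""
    return hf_seconds / litert_seconds


if __name__ == "__main__":
    # Toy vectors for illustration only; real embeddings have many more dimensions.
    print(passes([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]))
    # Timings copied from the first query row above.
    print(f"{speedup(0.1658, 0.8831):.2f}x")
```

Cosine similarity ignores vector magnitude, so the comparison works even though the LiteRT file name suggests its outputs are not normalised.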