You have a pile of meeting recordings — Teams calls, podcasts, interviews — and you need them transcribed. The market gives you three very different answers, and the "right" one flips depending on volume:
- Hosted API (OpenAI `gpt-4o-transcribe`): zero ops, excellent quality, but you pay per audio minute, forever.
- Cheaper hosted sibling (`gpt-4o-mini-transcribe`): half the price. But how much accuracy do you give up? Nobody quotes that number.
- Self-hosted Whisper: near-zero marginal cost once you own the hardware, full data control… but slower, and Whisper is famous for hallucinating on silence and padding transcripts.
The usual "comparison" is a pricing-page screenshot. That's not a decision — it ignores that cost, speed, and accuracy trade off against each other, and the break-even moves with your volume and your GPU utilisation.
STTbench turns that hand-wave into numbers. It runs the same audio through all three backends, measures real cost, real wall-clock speed, and word-level accuracy (WER with a substitution / deletion / insertion breakdown), and drops a single self-contained report — chart included — so you can actually choose.
Point it at one recording and it produces a consolidated report.md with an embedded chart:
Plus a Winners callout and a results table:
| Model | Backend | Cost (USD) | Speed | WER | Sub | Del | Ins |
|---|---|---|---|---|---|---|---|
| `gpt-4o-transcribe` | OpenAI API | $0.2426 (real) | 21.7× RT | ref | - | - | - |
| `gpt-4o-mini-transcribe` | OpenAI API | $0.1213 (real) | 34.2× RT | 5.6% | 50 | 41 | 254 |
| `whisper-large-v3` | Local MLX | $0.0674 (est.) | 5.5× RT | 44.9% | 758 | 110 | 1,923 |
Real 40.4-minute podcast clip. The mini model is half the cost, fastest, and only 5.6% WER off the flagship. Local Whisper is effectively free but slower — and its 44.9% is mostly insertions (1,923), i.e. repetition/verbosity, not genuinely misheard words. That distinction is exactly what the Sub/Del/Ins split surfaces.
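For readers unfamiliar with the "× RT" unit in the Speed column, here is a minimal sketch of how such a figure can be measured, assuming it means audio duration divided by wall-clock time. The helper names `realtime_factor` and `timed` are hypothetical and not taken from STTbench.

```python
import time
from typing import Callable, Tuple


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio seconds processed per wall-clock second (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def timed(fn: Callable[[], str]) -> Tuple[str, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in for a real transcription call (hypothetical).
    text, elapsed = timed(lambda: "hello world")
    print(f"{realtime_factor(60.0, max(elapsed, 1e-9)):.1f}x RT")
```

Under this definition, a 10-minute clip transcribed in 60 seconds scores 10× RT.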
- 🎧 One input, three transcripts: `gpt-4o-transcribe` + `gpt-4o-mini-transcribe` via API, and Whisper locally on Apple MLX (no cloud, no OMLX server needed).
- 💸 Real cost for the API legs (duration × per-minute rate), estimated GPU-rental cost for local.
- ⏱️ Measured wall-clock speed for every backend, reported as ×real-time.
- 🎯 Word Error Rate with a Sub / Del / Ins breakdown — see why two transcripts diverge, not just that they do.
- 📐 Bring-your-own ground truth: `--reference transcript.txt` scores all three against a human-corrected transcript for true accuracy instead of mutual agreement.
- 📊 matplotlib chart with "↓ lower is better" / "↑ higher is better" hints, embedded straight into the report.
- 📈 Cost-sweep mode — no audio needed: plot cost vs. 1h→1000h of audio and find the break-even hours where self-hosting overtakes the API.
- 🔒 Trusted-models allow-list for the self-hosted column — provenance matters in regulated environments.
- ✂️ Auto-chunking — recordings over OpenAI's 25 MB upload limit are segmented with ffmpeg and stitched back transparently.
- 🪄 One-command bootstrap: `start.sh` creates the conda env, installs deps, and runs.
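The auto-chunking feature in the list above can be pictured with the sketch below. It is not STTbench's actual implementation: it assumes the audio is re-encoded to a fixed bitrate so the chunk length can be derived from the upload limit, and it reads the 25 MB limit conservatively as 25,000,000 bytes. The function names, the 64 kbps bitrate, and the 10% headroom are all illustrative assumptions; the ffmpeg options used (`-f segment`, `-segment_time`) belong to ffmpeg's segment muxer.

```python
import math
import subprocess
from pathlib import Path
from typing import List

UPLOAD_LIMIT_BYTES = 25_000_000  # assumption: conservative reading of "25 MB"


def max_chunk_seconds(bitrate_bps: int, headroom: float = 0.9) -> float:
    """Longest chunk (in seconds) that stays under the limit at a fixed bitrate."""
    return UPLOAD_LIMIT_BYTES * headroom * 8 / bitrate_bps


def chunk_count(duration_s: float, chunk_s: float) -> int:
    """Number of chunks needed to cover the whole recording."""
    return max(1, math.ceil(duration_s / chunk_s))


def segment_audio(src: Path, out_dir: Path, chunk_s: float,
                  bitrate_bps: int = 64_000) -> List[Path]:
    """Drop video, re-encode to mono AAC, and cut fixed-length chunks."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-y", "-i", str(src),
        "-vn", "-ac", "1",                    # audio only, mono
        "-c:a", "aac", "-b:a", str(bitrate_bps),
        "-f", "segment", "-segment_time", str(int(chunk_s)),
        str(out_dir / "chunk_%03d.m4a"),
    ]
    subprocess.run(cmd, check=True)           # requires ffmpeg on PATH
    return sorted(out_dir.glob("chunk_*.m4a"))


def stitch(transcripts: List[str]) -> str:
    """Join per-chunk transcripts in order, skipping empty ones."""
    return " ".join(t.strip() for t in transcripts if t.strip())
```

At 64 kbps these assumptions give chunks of roughly 47 minutes, so a two-hour recording would be split into three pieces. Cutting on fixed boundaries can split a word in half; a real implementation may prefer to cut on silence.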
# 1. Clone
git clone https://github.com/sw30labs/STTbench.git
cd STTbench
# 2. Add your OpenAI key
cp .env.example .env
$EDITOR .env # set OPENAI_API_KEY
# 3. Bootstrap the conda env (creates 'sttbench', py3.12) + run
./start.sh --input videos/meeting.mp4 --output-dir out

`start.sh` checks for a `sttbench` conda env, creates it if missing, installs `requirements.txt`, then runs the benchmark. `ffmpeg`/`ffprobe` must be on PATH (`brew install ffmpeg`).
Requirements: macOS on Apple Silicon (for the local MLX leg), conda, and ffmpeg. The two API legs work on any platform; pass `--no-local` to skip MLX.
# Full 3-way benchmark → out/<name>.report.md + chart + per-model transcripts
./start.sh --input meeting.mp4 --output-dir out
# API models only (skip the local MLX leg)
./start.sh --input meeting.mp4 --no-local
# Score against a human-corrected transcript for TRUE accuracy
./start.sh --input meeting.mp4 --reference corrected.txt
# Cost-only, no transcription / no API spend
./start.sh --input meeting.mp4 --no-transcribe
# Pure cost-sweep + break-even plot — no audio file required
./start.sh
./start.sh --min-hours 0.5 --max-hours 5000 --points 400

| Flag | Effect |
|---|---|
| `--input PATH` | Recording to benchmark (.mp4/.mov/.m4a/.wav/.mkv/…). Omit for cost-sweep only. |
| `--reference PATH` | Human transcript → WER becomes true accuracy instead of agreement. |
| `--no-local` | Skip the local MLX Whisper leg (API models only). |
| `--no-transcribe` | Cost-only: no API calls, no report. |
| `--no-plot` | Skip PNG charts. |
| `--local-model ID` | Pick the trusted local model (default `openai/whisper-large-v3`). |
| `--output-dir DIR` | Where reports, charts, and intermediate audio land. |
| `--token-mode` | Token-based pricing instead of per-minute (for the cost sweep). |
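To make the flag interactions concrete, here is a hedged `argparse` sketch that mirrors the table. It is not the project's actual parser: the `--output-dir` default of `out` and the `run_mode` helper are assumptions added for illustration, and the sweep flags (`--min-hours`, `--max-hours`, `--points`) are left out because their defaults are not documented here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (defaults partly assumed)."""
    p = argparse.ArgumentParser(prog="start.sh")
    p.add_argument("--input", metavar="PATH", default=None)
    p.add_argument("--reference", metavar="PATH", default=None)
    p.add_argument("--no-local", action="store_true")
    p.add_argument("--no-transcribe", action="store_true")
    p.add_argument("--no-plot", action="store_true")
    p.add_argument("--local-model", metavar="ID",
                   default="openai/whisper-large-v3")
    p.add_argument("--output-dir", metavar="DIR", default="out")  # assumed default
    p.add_argument("--token-mode", action="store_true")
    return p


def run_mode(args: argparse.Namespace) -> str:
    """Hypothetical helper: which of the documented modes a flag set selects."""
    if args.input is None:
        return "cost-sweep"      # no audio: sweep and break-even plot only
    if args.no_transcribe:
        return "cost-only"       # audio duration priced, no API calls
    return "benchmark"           # full transcription run


if __name__ == "__main__":
    ns = build_parser().parse_args(["--input", "meeting.mp4", "--no-local"])
    print(run_mode(ns), ns.local_model)
```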
WER is word-level Levenshtein edit distance, normalised (lowercase, punctuation stripped):
WER = (Substitutions + Deletions + Insertions) / reference_word_count
The backtrace splits the errors into the three types so you can tell misheard words (Sub) apart from dropped words (Del) and hallucinated/repeated words (Ins) — Whisper's failure mode is overwhelmingly the last one.
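The definition above can be sketched in a few lines of Python. This is an illustrative implementation of word-level Levenshtein distance with a backtrace, not necessarily the code STTbench ships. The names `normalise` and `wer_breakdown` are assumptions, and when several alignments have the same cost the Sub/Del/Ins split depends on the tie-breaking order chosen here (match, then substitution, then deletion, then insertion).

```python
import re
from typing import List, Tuple


def normalise(text: str) -> List[str]:
    """Lowercase, strip punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def wer_breakdown(reference: str, hypothesis: str) -> Tuple[float, int, int, int]:
    """Return (WER, substitutions, deletions, insertions)."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    n, m = len(ref), len(hyp)
    if n == 0:
        raise ValueError("reference has no words")

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner, counting each error type.
    sub = dele = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            sub += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return (sub + dele + ins) / n, sub, dele, ins


if __name__ == "__main__":
    print(wer_breakdown("The cat sat on the mat.", "the cat sat sat on a mat"))
```

In the example, the repeated "sat" counts as one insertion and "the" → "a" as one substitution, for a WER of 2/6. Note that WER can exceed 100% when the hypothesis contains many insertions, because the denominator is the reference word count.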
- No `--reference`: `gpt-4o-transcribe` is used as the reference. The numbers reflect agreement, not ground truth.
- With `--reference file.txt`: all three models (flagship included) are scored against your corrected transcript, giving true accuracy.
💡 Generate a starting point: `cp out/<name>.gpt-4o-transcribe.txt corrected.txt`, fix it by ear, then re-run with `--reference corrected.txt`.
All pricing lives as constants at the top of transcription_cost_compare.py — edit and re-run:
| Constant | Default | Meaning |
|---|---|---|
| `OPENAI_GPT4O_TRANSCRIBE_USD_PER_MIN` | 0.006 | Hosted per-minute price |
| `OPENAI_GPT4O_MINI_TRANSCRIBE_USD_PER_MIN` | 0.003 | Cheaper sibling |
| `GPU_USD_PER_HOUR` | 1.0 | Your GPU rental rate |
| `GPU_REALTIME_FACTOR` | 10.0 | Audio-minutes per GPU-minute |
| `GPU_UTILISATION` | 1.0 | 0.2 ⇒ idle 80% ⇒ effective 5× cost |
⚠️ Prices are placeholders as of mid-2025 — verify on the OpenAI pricing page before quoting a real budget. Break-even is highly sensitive to GPU utilisation: a GPU idle 80% of the time costs 5× its sticker rate.
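The arithmetic implied by these constants can be sketched as follows. The constant names and defaults come from the table; the function names and the exact formulas are an assumed reading (per-minute rate × duration for the API, GPU-hours scaled by 1/utilisation for self-hosting), not a copy of `transcription_cost_compare.py`. The sketch has no fixed-cost term, so it gives a break-even utilisation rather than break-even hours; how the script derives the latter is not shown here.

```python
OPENAI_GPT4O_TRANSCRIBE_USD_PER_MIN = 0.006
OPENAI_GPT4O_MINI_TRANSCRIBE_USD_PER_MIN = 0.003
GPU_USD_PER_HOUR = 1.0
GPU_REALTIME_FACTOR = 10.0   # audio-minutes per GPU-minute
GPU_UTILISATION = 1.0        # fraction of paid GPU time doing useful work


def api_cost(audio_minutes: float, usd_per_min: float) -> float:
    """Hosted API cost: duration times the per-minute rate."""
    return audio_minutes * usd_per_min


def self_hosted_cost(audio_minutes: float,
                     gpu_usd_per_hour: float = GPU_USD_PER_HOUR,
                     realtime_factor: float = GPU_REALTIME_FACTOR,
                     utilisation: float = GPU_UTILISATION) -> float:
    """Estimated GPU-rental cost; idle time inflates it by 1/utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    gpu_hours = (audio_minutes / 60.0) / realtime_factor
    return gpu_hours * gpu_usd_per_hour / utilisation


def break_even_utilisation(usd_per_min: float,
                           gpu_usd_per_hour: float = GPU_USD_PER_HOUR,
                           realtime_factor: float = GPU_REALTIME_FACTOR) -> float:
    """Utilisation below which the GPU costs more per audio-minute than the API."""
    gpu_usd_per_audio_min = gpu_usd_per_hour / 60.0 / realtime_factor
    return gpu_usd_per_audio_min / usd_per_min


if __name__ == "__main__":
    minutes = 60.0
    print(f"flagship API: ${api_cost(minutes, OPENAI_GPT4O_TRANSCRIBE_USD_PER_MIN):.4f}")
    print(f"GPU at 100%:  ${self_hosted_cost(minutes):.4f}")
    print(f"GPU at 20%:   ${self_hosted_cost(minutes, utilisation=0.2):.4f}")
```

With the default constants, one hour of audio costs $0.36 on the flagship API, $0.10 on a fully used GPU, and $0.50 on a GPU that is idle 80% of the time. Under this model the GPU stops being cheaper than the flagship API once utilisation falls below roughly 28%.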
The self-hosted column is restricted to an allow-list (TRUSTED_LOCAL_MODELS) — enterprise-grade checkpoints with clear provenance, not arbitrary HuggingFace forks: openai/whisper-large-v3, whisper-large-v2, distil-whisper/distil-large-v3, and the Azure STT container (disconnected SKU for air-gapped use). Unknown values are refused with a clear error.
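A minimal sketch of what such an allow-list check might look like. Only the constant name `TRUSTED_LOCAL_MODELS` and the three model IDs come from the text above; the `require_trusted` helper and its error message are hypothetical, and the Azure STT container is omitted because its identifier is not given here.

```python
TRUSTED_LOCAL_MODELS = frozenset({
    "openai/whisper-large-v3",
    "whisper-large-v2",
    "distil-whisper/distil-large-v3",
    # Azure STT container entry omitted: its identifier is not documented here.
})


def require_trusted(model_id: str) -> str:
    """Return model_id unchanged, or raise if it is not on the allow-list."""
    if model_id not in TRUSTED_LOCAL_MODELS:
        allowed = ", ".join(sorted(TRUSTED_LOCAL_MODELS))
        raise ValueError(
            f"Untrusted local model {model_id!r}; allowed values: {allowed}"
        )
    return model_id


if __name__ == "__main__":
    print(require_trusted("openai/whisper-large-v3"))
```

Checking by exact string match before any download happens means an unvetted fork is rejected even if its name differs from a trusted one by a single character.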
A run with --input meeting.mp4 --output-dir out writes:
out/
├── meeting.report.md # consolidated report (table + chart + transcripts)
├── meeting.benchmark.png # the metrics chart
├── meeting.gpt-4o-transcribe.txt # raw transcript per model
├── meeting.gpt-4o-mini-transcribe.txt
└── meeting.whisper-large-v3.txt
Everything under out/, plus .env and test media, is gitignored.
- WER without `--reference` is agreement, not truth: gpt-4o is the yardstick only because it's the de-facto standard.
- Local-leg cost is an estimate from the GPU-rental model; on your own Mac the marginal cost is ~$0, so the meaningful local number is the measured wall-clock time.
- Apple Silicon only for the local leg (MLX). Use `--no-local` elsewhere.
- Transcription is non-deterministic: `gpt-4o-transcribe` can collapse repetitive passages differently between runs, so transcript length varies slightly.
- Token-mode pricing is an estimate; real billing depends on the API response.
