🎙️ STTbench

Benchmark speech-to-text on the three axes that actually decide your bill: cost · speed · accuracy

The problem

You have a pile of meeting recordings — Teams calls, podcasts, interviews — and you need them transcribed. The market gives you three very different answers, and the "right" one flips depending on volume:

Hosted API (OpenAI gpt-4o-transcribe) — zero ops, excellent quality, but you pay per audio minute, forever.
Cheaper hosted sibling (gpt-4o-mini-transcribe) — half the price. But how much accuracy do you give up? Nobody quotes that number.
Self-hosted Whisper — near-zero marginal cost once you own the hardware, full data control… but slower, and Whisper is famous for hallucinating on silence and padding transcripts.

The usual "comparison" is a pricing-page screenshot. That's not a decision — it ignores that cost, speed, and accuracy trade off against each other, and the break-even moves with your volume and your GPU utilisation.

STTbench turns that hand-wave into numbers. It runs the same audio through all three backends, measures real cost, real wall-clock speed, and word-level accuracy (WER with a substitution / deletion / insertion breakdown), and drops a single self-contained report — chart included — so you can actually choose.

What you get

Point it at one recording and it produces a consolidated report.md with an embedded chart, plus a Winners callout and a results table:

| Model | Backend | Cost (USD) | Speed | WER | Sub | Del | Ins |
|---|---|---|---|---|---|---|---|
| `gpt-4o-transcribe` | OpenAI API | $0.2426 (real) | 21.7× RT | ref | n/a | n/a | n/a |
| `gpt-4o-mini-transcribe` | OpenAI API | $0.1213 (real) | 34.2× RT | 5.6% | 50 | 41 | 254 |
| `whisper-large-v3` | Local MLX | $0.0674 (est.) | 5.5× RT | 44.9% | 758 | 110 | 1,923 |

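Real 40.4-minute podcast clip, scored without --reference, so WER here is agreement with gpt-4o-transcribe rather than ground truth. The mini model is half the cost, the fastest, and differs from the flagship by only 5.6% WER. Local Whisper is effectively free on hardware you already own but slower, and its 44.9% is mostly insertions (1,923 of 2,791 errors), i.e. repetition/verbosity rather than genuinely misheard words. That distinction is exactly what the Sub/Del/Ins split surfaces.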

Features

🎧 One input, three transcripts — gpt-4o-transcribe + gpt-4o-mini-transcribe via API, and Whisper locally on Apple MLX (no cloud, no OMLX server needed).
💸 Real cost for the API legs (duration × per-minute rate), estimated GPU-rental cost for local.
⏱️ Measured wall-clock speed for every backend, reported as ×real-time.
🎯 Word Error Rate with a Sub / Del / Ins breakdown — see why two transcripts diverge, not just that they do.
📐 Bring-your-own ground truth — --reference transcript.txt scores all three against a human-corrected transcript for true accuracy instead of mutual agreement.
📊 matplotlib chart with "↓ lower is better" / "↑ higher is better" hints, embedded straight into the report.
📈 Cost-sweep mode — no audio needed: plot cost vs. 1h→1000h of audio and find the break-even hours where self-hosting overtakes the API.
🔒 Trusted-models allow-list for the self-hosted column — provenance matters in regulated environments.
✂️ Auto-chunking — recordings over OpenAI's 25 MB upload limit are segmented with ffmpeg and stitched back transparently.
🪄 One-command bootstrap — start.sh creates the conda env, installs deps, and runs.

Quickstart

# 1. Clone
git clone https://github.com/sw30labs/STTbench.git
cd STTbench

# 2. Add your OpenAI key
cp .env.example .env
$EDITOR .env          # set OPENAI_API_KEY

# 3. Bootstrap the conda env (creates 'sttbench', py3.12) + run
./start.sh --input videos/meeting.mp4 --output-dir out

start.sh checks for a sttbench conda env, creates it if missing, installs requirements.txt, then runs the benchmark. ffmpeg/ffprobe must be on PATH (brew install ffmpeg).

Requirements: macOS on Apple Silicon (for the local MLX leg), conda, and ffmpeg. The two API legs work on any platform; pass --no-local to skip MLX.

Usage

# Full 3-way benchmark → out/<name>.report.md + chart + per-model transcripts
./start.sh --input meeting.mp4 --output-dir out

# API models only (skip the local MLX leg)
./start.sh --input meeting.mp4 --no-local

# Score against a human-corrected transcript for TRUE accuracy
./start.sh --input meeting.mp4 --reference corrected.txt

# Cost-only, no transcription / no API spend
./start.sh --input meeting.mp4 --no-transcribe

# Pure cost-sweep + break-even plot — no audio file required
./start.sh
./start.sh --min-hours 0.5 --max-hours 5000 --points 400

Key flags

| Flag | Effect |
|---|---|
| `--input PATH` | Recording to benchmark (`.mp4/.mov/.m4a/.wav/.mkv/…`). Omit for cost-sweep only. |
| `--reference PATH` | Human transcript → WER becomes true accuracy instead of agreement. |
| `--no-local` | Skip the local MLX Whisper leg (API models only). |
| `--no-transcribe` | Cost-only: no API calls, no report. |
| `--no-plot` | Skip PNG charts. |
| `--local-model ID` | Pick the trusted local model (default `openai/whisper-large-v3`). |
| `--output-dir DIR` | Where reports, charts, and intermediate audio land. |
| `--token-mode` | Token-based pricing instead of per-minute (for the cost sweep). |
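
For readers who want to wrap or script the tool, the flags above map naturally onto an `argparse` parser. The sketch below is an assumption about how such a parser could look, not a copy of the one in transcription_cost_compare.py; the `build_parser` name, the `prog` string, and the `out` default for `--output-dir` are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    p = argparse.ArgumentParser(
        prog="sttbench",
        description="Benchmark speech-to-text on cost, speed, and accuracy.",
    )
    # Omitting --input means "cost-sweep only".
    p.add_argument("--input", metavar="PATH", default=None,
                   help="recording to benchmark")
    p.add_argument("--reference", metavar="PATH", default=None,
                   help="human-corrected transcript used as ground truth")
    p.add_argument("--no-local", action="store_true",
                   help="skip the local MLX Whisper leg")
    p.add_argument("--no-transcribe", action="store_true",
                   help="cost-only run: no API calls, no report")
    p.add_argument("--no-plot", action="store_true",
                   help="skip PNG charts")
    p.add_argument("--local-model", metavar="ID",
                   default="openai/whisper-large-v3",
                   help="trusted local model to run")
    p.add_argument("--output-dir", metavar="DIR", default="out",  # assumed default
                   help="where reports, charts, and intermediate audio land")
    p.add_argument("--token-mode", action="store_true",
                   help="token-based pricing for the cost sweep")
    return p
```

Calling `build_parser().parse_args(["--input", "meeting.mp4", "--no-local"])` yields a namespace with `input="meeting.mp4"` and `no_local=True`; an empty argument list leaves `input` as `None`, which corresponds to the cost-sweep-only mode.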

How accuracy is measured

WER is word-level Levenshtein edit distance, normalised (lowercase, punctuation stripped):

WER = (Substitutions + Deletions + Insertions) / reference_word_count

The backtrace splits the errors into the three types so you can tell misheard words (Sub) apart from dropped words (Del) and hallucinated/repeated words (Ins) — Whisper's failure mode is overwhelmingly the last one.
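
The formula and backtrace described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's own code: the names `WerResult`, `normalise`, and `word_error_rate` are hypothetical, and normalisation here only lowercases and strips ASCII punctuation.

```python
import string
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words if self.reference_words else 0.0


def normalise(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    ref, hyp = normalise(reference), normalise(hypothesis)
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dist[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete every reference word
    for j in range(cols):
        dist[0][j] = j          # insert every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner, counting each error type.
    sub = dele = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            sub, i, j = sub + 1, i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return WerResult(sub, dele, ins, len(ref))
```

For example, scoring the hypothesis "the cat sat sat on a mat" against the reference "The cat sat on the mat." gives one substitution, no deletions, and one insertion, so WER = 2/6. The full matrix costs O(n·m) memory, which is fine for a sketch but slow in pure Python for hour-long transcripts.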

No --reference: gpt-4o-transcribe is used as the reference. The numbers reflect agreement, not ground truth.
With --reference file.txt: all three models (flagship included) are scored against your corrected transcript — true accuracy.

💡 Generate a starting point: cp out/<name>.gpt-4o-transcribe.txt corrected.txt, fix it by ear, then re-run with --reference corrected.txt.

The cost model

All pricing lives as constants at the top of transcription_cost_compare.py — edit and re-run:

| Constant | Default | Meaning |
|---|---|---|
| `OPENAI_GPT4O_TRANSCRIBE_USD_PER_MIN` | `0.006` | Hosted per-minute price |
| `OPENAI_GPT4O_MINI_TRANSCRIBE_USD_PER_MIN` | `0.003` | Cheaper sibling |
| `GPU_USD_PER_HOUR` | `1.0` | Your GPU rental rate |
| `GPU_REALTIME_FACTOR` | `10.0` | Audio-minutes per GPU-minute |
| `GPU_UTILISATION` | `1.0` | Fraction of paid GPU time spent transcribing; `0.2` ⇒ idle 80% ⇒ effective 5× cost |

⚠️ Prices are placeholders as of mid-2025 — verify on the OpenAI pricing page before quoting a real budget. Break-even is highly sensitive to GPU utilisation: a GPU idle 80% of the time costs 5× its sticker rate.
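
One way to read these constants is sketched below. The constant names and defaults come from the table; the functions (`api_cost`, `gpu_cost`, `break_even_utilisation`) and the exact way the GPU terms combine are assumptions, chosen because they reproduce the figures in the sample results table. The 40.43-minute duration is back-computed from the table's $0.2426 at $0.006/min; the README itself only states 40.4 minutes.

```python
OPENAI_GPT4O_TRANSCRIBE_USD_PER_MIN = 0.006
OPENAI_GPT4O_MINI_TRANSCRIBE_USD_PER_MIN = 0.003
GPU_USD_PER_HOUR = 1.0
GPU_REALTIME_FACTOR = 10.0   # audio-minutes processed per GPU-minute
GPU_UTILISATION = 1.0        # fraction of paid GPU time doing useful work


def api_cost(audio_minutes: float, usd_per_min: float) -> float:
    """Hosted cost: duration times the per-minute rate."""
    return audio_minutes * usd_per_min


def gpu_cost(audio_minutes: float,
             usd_per_hour: float = GPU_USD_PER_HOUR,
             realtime_factor: float = GPU_REALTIME_FACTOR,
             utilisation: float = GPU_UTILISATION) -> float:
    """Estimated self-hosted cost (assumed formula).

    GPU-hours needed = audio-hours / realtime_factor; dividing by
    utilisation charges the idle time to the work that was done.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    gpu_hours = audio_minutes / 60 / realtime_factor
    return gpu_hours * usd_per_hour / utilisation


def break_even_utilisation(usd_per_min: float,
                           usd_per_hour: float = GPU_USD_PER_HOUR,
                           realtime_factor: float = GPU_REALTIME_FACTOR) -> float:
    """Utilisation below which the GPU costs more per audio-hour than the API."""
    api_per_audio_hour = usd_per_min * 60
    gpu_per_audio_hour = usd_per_hour / realtime_factor
    return gpu_per_audio_hour / api_per_audio_hour
```

With the defaults, a 40.43-minute clip comes to about $0.2426 and $0.1213 for the two hosted models and about $0.0674 for the GPU, matching the sample table. Dropping utilisation to 0.2 raises the GPU figure to about $0.337, and under this model the GPU only undercuts gpt-4o-transcribe when it is busy more than roughly 28% of the time (roughly 56% against the mini model).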

Trusted local models

The self-hosted column is restricted to an allow-list (TRUSTED_LOCAL_MODELS) — enterprise-grade checkpoints with clear provenance, not arbitrary HuggingFace forks: openai/whisper-large-v3, whisper-large-v2, distil-whisper/distil-large-v3, and the Azure STT container (disconnected SKU for air-gapped use). Unknown values are refused with a clear error.
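
A minimal sketch of such an allow-list check follows. `TRUSTED_LOCAL_MODELS` is the name used above; the `resolve_local_model` helper, the error wording, and the `azure-stt-container` identifier are hypothetical, since the README does not spell out the exact ID used for the Azure container.

```python
TRUSTED_LOCAL_MODELS = frozenset({
    "openai/whisper-large-v3",
    "whisper-large-v2",
    "distil-whisper/distil-large-v3",
    "azure-stt-container",  # hypothetical ID for the Azure STT container
})


def resolve_local_model(model_id: str = "openai/whisper-large-v3") -> str:
    """Return model_id unchanged if it is on the allow-list, else raise."""
    if model_id not in TRUSTED_LOCAL_MODELS:
        allowed = ", ".join(sorted(TRUSTED_LOCAL_MODELS))
        raise ValueError(
            f"Untrusted local model {model_id!r}; allowed values: {allowed}"
        )
    return model_id
```

Validating before any weights are downloaded means an unknown `--local-model` value fails fast, and listing the allowed IDs in the message tells the user what to pass instead.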

Output files

A run with --input meeting.mp4 --output-dir out writes:

out/
├── meeting.report.md              # consolidated report (table + chart + transcripts)
├── meeting.benchmark.png          # the metrics chart
├── meeting.gpt-4o-transcribe.txt  # raw transcript per model
├── meeting.gpt-4o-mini-transcribe.txt
└── meeting.whisper-large-v3.txt

Everything under out/, plus .env and test media, is gitignored.
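
If you post-process results in your own scripts, the naming scheme in the tree above can be reproduced with `pathlib`. The `output_paths` helper is an illustrative assumption, including the choice to drop any `org/` prefix from a model ID when building the transcript filename.

```python
from pathlib import Path


def output_paths(input_path: str, output_dir: str, models: list[str]) -> dict[str, Path]:
    """Map each artefact to its expected location under output_dir."""
    stem = Path(input_path).stem          # "videos/meeting.mp4" -> "meeting"
    out = Path(output_dir)
    paths = {
        "report": out / f"{stem}.report.md",
        "chart": out / f"{stem}.benchmark.png",
    }
    for model in models:
        short = model.split("/")[-1]      # "openai/whisper-large-v3" -> "whisper-large-v3"
        paths[model] = out / f"{stem}.{short}.txt"
    return paths
```

For instance, `output_paths("videos/meeting.mp4", "out", ["openai/whisper-large-v3"])` points the report at `out/meeting.report.md` and the Whisper transcript at `out/meeting.whisper-large-v3.txt`.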

Limitations

WER without --reference is agreement, not truth: gpt-4o-transcribe is the yardstick only because it's the de-facto standard.
Local-leg cost is an estimate from the GPU-rental model; on your own Mac the marginal cost is ~$0 — the meaningful local number is the measured wall-clock time.
Apple Silicon only for the local leg (MLX). Use --no-local elsewhere.
Transcription is non-deterministic — gpt-4o-transcribe can collapse repetitive passages differently between runs, so transcript length varies slightly.
Token-mode pricing is an estimate; real billing depends on the API response.

Built with ❤️ and Claude Code · SW3.0 Labs
