Skip to content

sw30labs/STTbench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🎙️ STTbench

Benchmark speech-to-text on the three axes that actually decide your bill: cost · speed · accuracy

Python 3.12 Platform OpenAI MLX Conda Made with Claude


The problem

You have a pile of meeting recordings — Teams calls, podcasts, interviews — and you need them transcribed. The market gives you three very different answers, and the "right" one flips depending on volume:

  • Hosted API (OpenAI gpt-4o-transcribe) — zero ops, excellent quality, but you pay per audio minute, forever.
  • Cheaper hosted sibling (gpt-4o-mini-transcribe) — half the price. But how much accuracy do you give up? Nobody quotes that number.
  • Self-hosted Whisper — near-zero marginal cost once you own the hardware, full data control… but slower, and Whisper is famous for hallucinating on silence and padding transcripts.

The usual "comparison" is a pricing-page screenshot. That's not a decision — it ignores that cost, speed, and accuracy trade off against each other, and the break-even moves with your volume and your GPU utilisation.

STTbench turns that hand-wave into numbers. It runs the same audio through all three backends, measures real cost, real wall-clock speed, and word-level accuracy (WER with a substitution / deletion / insertion breakdown), and drops a single self-contained report — chart included — so you can actually choose.


What you get

Point it at one recording and it produces a consolidated report.md with an embedded chart:

Sample benchmark chart: cost, speed, and word errors per model

Plus a Winners callout and a results table:

Model Backend Cost (USD) Speed WER Sub Del Ins
gpt-4o-transcribe OpenAI API $0.2426 (real) 21.7× RT ref
gpt-4o-mini-transcribe OpenAI API $0.1213 (real) 34.2× RT 5.6% 50 41 254
whisper-large-v3 Local MLX $0.0674 (est.) 5.5× RT 44.9% 758 110 1,923

Real 40.4-minute podcast clip. The mini model is half the cost, fastest, and only 5.6% WER off the flagship. Local Whisper is effectively free but slower — and its 44.9% is mostly insertions (1,923), i.e. repetition/verbosity, not genuinely misheard words. That distinction is exactly what the Sub/Del/Ins split surfaces.


Features

  • 🎧 One input, three transcriptsgpt-4o-transcribe + gpt-4o-mini-transcribe via API, and Whisper locally on Apple MLX (no cloud, no OMLX server needed).
  • 💸 Real cost for the API legs (duration × per-minute rate), estimated GPU-rental cost for local.
  • ⏱️ Measured wall-clock speed for every backend, reported as ×real-time.
  • 🎯 Word Error Rate with a Sub / Del / Ins breakdown — see why two transcripts diverge, not just that they do.
  • 📐 Bring-your-own ground truth--reference transcript.txt scores all three against a human-corrected transcript for true accuracy instead of mutual agreement.
  • 📊 matplotlib chart with "↓ lower is better" / "↑ higher is better" hints, embedded straight into the report.
  • 📈 Cost-sweep mode — no audio needed: plot cost vs. 1h→1000h of audio and find the break-even hours where self-hosting overtakes the API.
  • 🔒 Trusted-models allow-list for the self-hosted column — provenance matters in regulated environments.
  • ✂️ Auto-chunking — recordings over OpenAI's 25 MB upload limit are segmented with ffmpeg and stitched back transparently.
  • 🪄 One-command bootstrapstart.sh creates the conda env, installs deps, and runs.

Quickstart

# 1. Clone
git clone https://github.com/sw30labs/STTbench.git
cd STTbench

# 2. Add your OpenAI key
cp .env.example .env
$EDITOR .env          # set OPENAI_API_KEY

# 3. Bootstrap the conda env (creates 'sttbench', py3.12) + run
./start.sh --input videos/meeting.mp4 --output-dir out

start.sh checks for a sttbench conda env, creates it if missing, installs requirements.txt, then runs the benchmark. ffmpeg/ffprobe must be on PATH (brew install ffmpeg).

Requirements: macOS on Apple Silicon (for the local MLX leg), conda, and ffmpeg. The two API legs work on any platform; pass --no-local to skip MLX.


Usage

# Full 3-way benchmark → out/<name>.report.md + chart + per-model transcripts
./start.sh --input meeting.mp4 --output-dir out

# API models only (skip the local MLX leg)
./start.sh --input meeting.mp4 --no-local

# Score against a human-corrected transcript for TRUE accuracy
./start.sh --input meeting.mp4 --reference corrected.txt

# Cost-only, no transcription / no API spend
./start.sh --input meeting.mp4 --no-transcribe

# Pure cost-sweep + break-even plot — no audio file required
./start.sh
./start.sh --min-hours 0.5 --max-hours 5000 --points 400

Key flags

Flag Effect
--input PATH Recording to benchmark (.mp4/.mov/.m4a/.wav/.mkv/…). Omit for cost-sweep only.
--reference PATH Human transcript → WER becomes true accuracy instead of agreement.
--no-local Skip the local MLX Whisper leg (API models only).
--no-transcribe Cost-only: no API calls, no report.
--no-plot Skip PNG charts.
--local-model ID Pick the trusted local model (default openai/whisper-large-v3).
--output-dir DIR Where reports, charts, and intermediate audio land.
--token-mode Token-based pricing instead of per-minute (for the cost sweep).

How accuracy is measured

WER is word-level Levenshtein edit distance, normalised (lowercase, punctuation stripped):

WER = (Substitutions + Deletions + Insertions) / reference_word_count

The backtrace splits the errors into the three types so you can tell misheard words (Sub) apart from dropped words (Del) and hallucinated/repeated words (Ins) — Whisper's failure mode is overwhelmingly the last one.

  • No --reference: gpt-4o-transcribe is used as the reference. The numbers reflect agreement, not ground truth.
  • With --reference file.txt: all three models (flagship included) are scored against your corrected transcript — true accuracy.

💡 Generate a starting point: cp out/<name>.gpt-4o-transcribe.txt corrected.txt, fix it by ear, then re-run with --reference corrected.txt.


The cost model

All pricing lives as constants at the top of transcription_cost_compare.py — edit and re-run:

Constant Default Meaning
OPENAI_GPT4O_TRANSCRIBE_USD_PER_MIN 0.006 Hosted per-minute price
OPENAI_GPT4O_MINI_TRANSCRIBE_USD_PER_MIN 0.003 Cheaper sibling
GPU_USD_PER_HOUR 1.0 Your GPU rental rate
GPU_REALTIME_FACTOR 10.0 Audio-minutes per GPU-minute
GPU_UTILISATION 1.0 0.2 ⇒ idle 80% ⇒ effective 5× cost

⚠️ Prices are placeholders as of mid-2025 — verify on the OpenAI pricing page before quoting a real budget. Break-even is highly sensitive to GPU utilisation: a GPU idle 80% of the time costs 5× its sticker rate.

Trusted local models

The self-hosted column is restricted to an allow-list (TRUSTED_LOCAL_MODELS) — enterprise-grade checkpoints with clear provenance, not arbitrary HuggingFace forks: openai/whisper-large-v3, whisper-large-v2, distil-whisper/distil-large-v3, and the Azure STT container (disconnected SKU for air-gapped use). Unknown values are refused with a clear error.


Output files

A run with --input meeting.mp4 --output-dir out writes:

out/
├── meeting.report.md              # consolidated report (table + chart + transcripts)
├── meeting.benchmark.png          # the metrics chart
├── meeting.gpt-4o-transcribe.txt  # raw transcript per model
├── meeting.gpt-4o-mini-transcribe.txt
└── meeting.whisper-large-v3.txt

Everything under out/, plus .env and test media, is gitignored.


Limitations

  • WER without --reference is agreement, not truthgpt-4o is the yardstick only because it's the de-facto standard.
  • Local-leg cost is an estimate from the GPU-rental model; on your own Mac the marginal cost is ~$0 — the meaningful local number is the measured wall-clock time.
  • Apple Silicon only for the local leg (MLX). Use --no-local elsewhere.
  • Transcription is non-deterministicgpt-4o-transcribe can collapse repetitive passages differently between runs, so transcript length varies slightly.
  • Token-mode pricing is an estimate; real billing depends on the API response.

Built with ❤️ and Claude Code · SW3.0 Labs

About

Speech-to-text benchmark: cost, speed, and WER across gpt-4o-transcribe, gpt-4o-mini-transcribe, and local Apple MLX Whisper

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors