tts-eval

A small Python toolkit for benchmarking text-to-speech providers head-to-head. Generates clips from any of 16 TTS APIs through a common interface, stitches multi-speaker dialogs, and runs blind A/B brackets where you rate clips without knowing which provider produced them.

This is the code we used for the Best TTS Model in 2026 blind benchmark on Techstackups.

What's in the box

  • 16 provider adapters behind a single TTSProvider interface: OpenAI, xAI Grok, Google Gemini, ElevenLabs, Azure, AWS Polly, Deepgram, Hume, Mistral Voxtral, Rime, Cartesia, Groq Orpheus, LMNT, Inworld, Smallest, UnrealSpeech.
  • generate.py — one-shot CLI for generating clips from any subset of providers and voices.
  • tournament.py — blind bracket runner. Generate clips for a scenario, rate them A/B without knowing the providers, then reveal the results.
  • scenarios/ — five pre-built scenarios: Pride & Prejudice (dialog), The News (conversational dialog), Nineteen Eighty-Four (narration), Le Café (French/English code-switching), and an expressive annotations test.

Prerequisites

  • Python 3.10 or higher
  • ffmpeg on your PATH (used to stitch dialog clips). On macOS: brew install ffmpeg.
  • API keys for the providers you want to test. You only need keys for the ones you call.

Setup

Clone the repo and create a virtual environment:

git clone https://github.com/ritza-co/tts-eval.git
cd tts-eval
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Copy the env template and fill in keys for the providers you want to use:

cp .env.example .env

Leave the rest blank. Any provider whose key is missing will raise an error when called, but the others will still work.
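The fail-fast-per-provider behavior can be approximated like this (a minimal sketch, not the actual check in the codebase; the environment variable names in your .env will vary by provider):

```python
import os

def require_key(env_var: str) -> str:
    """Return an API key from the environment, or fail loudly.

    Mirrors the behavior described above: a provider whose key is
    missing raises as soon as it is called, without affecting the
    providers whose keys are set.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; add it to .env to use this provider"
        )
    return key
```

Checking lazily at call time, rather than validating every key at startup, is what lets a partially filled .env still work.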

Smoke test

Confirm everything's wired up by running the hello-world script. It makes one API call per provider and saves the audio to test_hello/:

python test_hello.py

Failed providers are listed at the end with the error message. This is the quickest way to check which of your keys are working.

Generating clips

Generate a clip from one or more providers with generate.py:

python generate.py "The quick brown fox jumps over the lazy dog." --providers xai openai

Useful flags:

  • --providers / -p — one or more provider keys. Defaults to all registered providers.
  • --voices — restrict to specific voice IDs. Unknown IDs are silently skipped per provider.
  • --instructions / -i — style instructions for providers that support them (currently OpenAI's gpt-4o-mini-tts).
  • --file / -f — read text from a file instead of the command line.
  • --play — play each clip after generating it.
  • --list-providers — print every registered provider and its voices.

Output lands in samples/<timestamp>/ along with a manifest.json describing what was generated.
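If you want to inspect a run programmatically, the manifest is plain JSON. The field names below ("clips", "provider", "voice", "file") are assumptions for illustration; check a real manifest.json for the exact schema:

```python
import json
from pathlib import Path

def summarize_manifest(path: str) -> list[str]:
    """List the clips recorded in a run's manifest.json.

    Field names here are hypothetical; adjust them to match the
    actual manifest schema.
    """
    manifest = json.loads(Path(path).read_text())
    return [
        f"{clip['provider']}/{clip['voice']} -> {clip['file']}"
        for clip in manifest.get("clips", [])
    ]
```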

Running a blind tournament

The tournament workflow has three steps: generate, test, results.

First, generate clips for a scenario. The bracket is shuffled so you can't see which slot belongs to which provider:

python tournament.py --scenario pride_and_prejudice generate

Then run the blind A/B test. You'll hear two clips per matchup and pick a winner without knowing the provider:

python tournament.py --scenario pride_and_prejudice test

When the bracket is complete, reveal the providers behind each slot:

python tournament.py --scenario pride_and_prejudice results

Output for each scenario lands in tournament/<scenario_id>/:

  • audio/ — generated WAV files, one per slot
  • bracket.json — the bracket structure
  • reveal.json — the slot-to-provider mapping (don't peek before testing)
  • results.json — generation metadata and your A/B picks

Available scenarios:

ID                     What it tests
pride_and_prejudice    Period dialog with British voices
the_news               Casual conversational dialog
nineteen_eighty_four   Narration, spare and foreboding
le_cafe                French/English code-switching
annotations            Expressive tags (laugh, sigh, whisper, etc.)

Run python tournament.py with no arguments to pick a scenario and command interactively.

Adding a provider

Each provider lives in providers/<name>.py and implements the TTSProvider interface from providers/base.py. Three things to define:

  1. name — short identifier string.
  2. voices — a list of Voice dataclasses describing the available voices.
  3. generate(text, voice_id, *, instructions=None) — call the API and return (audio_bytes, extension).

Then add the class to REGISTRY in providers/__init__.py. The new provider is immediately usable from generate.py and any scenario that lists it in VOICE_PAIRS or VOICES.

Adding a scenario

Scenarios live in scenarios/<name>.py. Each one defines a SCENARIO_ID, SCENARIO_LABEL, SCENARIO_TYPE, and a voice mapping per provider. The supported types are:

  • dialog — two speakers with a LINES list of (speaker, text) tuples and a VOICE_PAIRS mapping.
  • narration — single voice with a SEGMENTS list and a VOICES mapping.
  • annotated_dialog / annotated_narration — same as above but the script can differ per provider, useful when annotation syntax varies (ElevenLabs uses [laughs], xAI uses <laugh>, etc.).

See scenarios/pride_and_prejudice.py for the simplest dialog example.
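A minimal dialog scenario might look like this. The module-level constants match the names listed above; everything else (the file name, speakers, lines, and voice IDs) is invented for illustration:

```python
# scenarios/example_chat.py -- hypothetical scenario module

SCENARIO_ID = "example_chat"
SCENARIO_LABEL = "Example Chat"
SCENARIO_TYPE = "dialog"

# (speaker, text) tuples, as described above.
LINES = [
    ("ALICE", "Did you see the launch this morning?"),
    ("BOB", "I did. The stream cut out right at liftoff."),
]

# One (ALICE_voice, BOB_voice) pair per provider key; these voice
# IDs are placeholders.
VOICE_PAIRS = {
    "openai": ("alloy", "onyx"),
    "elevenlabs": ("voice_id_a", "voice_id_b"),
}
```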

Notes

  • The dialog stitcher in dialog.py calls the API once per line and concatenates with ffmpeg, with a configurable silence gap between turns. This is slower than passing the whole script at once but produces cleaner cuts and makes per-line latency measurable.
  • All clips are normalized to mono 24 kHz 16-bit WAV during stitching for consistent playback.
  • Audio output and tournament state are gitignored by default. If you want to share runs, commit them explicitly.
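The stitching approach in the first note can be sketched with ffmpeg's concat demuxer. This is not the actual dialog.py code, just an illustration of the technique: interleave a pre-rendered silence clip between turns, feed ffmpeg a list file, and re-encode to mono 24 kHz 16-bit PCM:

```python
def concat_command(
    clips: list[str], out: str, gap_file: str
) -> tuple[list[str], str]:
    """Build an ffmpeg concat-demuxer invocation for stitching clips.

    Returns the command argv and the contents of the list file
    (write the latter to clips.txt before running the command).
    """
    entries = []
    for i, clip in enumerate(clips):
        if i:
            # Insert a pre-rendered silence clip between turns.
            entries.append(f"file '{gap_file}'")
        entries.append(f"file '{clip}'")
    list_text = "\n".join(entries) + "\n"
    cmd = [
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", "clips.txt",
        "-ac", "1", "-ar", "24000", "-c:a", "pcm_s16le",
        out,
    ]
    return cmd, list_text
```

Re-encoding in the same pass is what normalizes clips from different providers to a single format.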
