A small Python toolkit for benchmarking text-to-speech providers head-to-head. Generates clips from any of 17 TTS APIs through a common interface, stitches dialogs, and runs blind A/B brackets where you rate clips without knowing which provider produced them.
This is the code we used for the "Best TTS Model in 2026" blind benchmark on Techstackups.
- 17 provider adapters behind a single `TTSProvider` interface: OpenAI, xAI Grok, Google Gemini, ElevenLabs, Azure, AWS Polly, Deepgram, Hume, Mistral Voxtral, Rime, Cartesia, Groq Orpheus, LMNT, Inworld, Smallest, UnrealSpeech.
- `generate.py` — one-shot CLI for generating clips from any subset of providers and voices.
- `tournament.py` — blind bracket runner. Generate clips for a scenario, rate them A/B without knowing the providers, then reveal the results.
- `scenarios/` — five pre-built scenarios: Pride & Prejudice (dialog), The News (conversational dialog), Nineteen Eighty-Four (narration), Le Café (French/English code-switching), and an expressive annotations test.
- Python 3.10 or higher
- `ffmpeg` on your PATH (used to stitch dialog clips). On macOS: `brew install ffmpeg`.
- API keys for the providers you want to test. You only need keys for the ones you call.
Clone the repo and create a virtual environment:
```sh
git clone https://github.com/ritza-co/tts-eval.git
cd tts-eval
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy the env template and fill in keys for the providers you want to use:

```sh
cp .env.example .env
```

Leave the rest blank. Any provider whose key is missing will raise an error when called, but the others will still work.
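For illustration, a filled-in `.env` might look like the fragment below. The variable names here are guesses — `.env.example` lists the exact names each adapter reads:

```
# Illustrative only; see .env.example for the real variable names.
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=...
DEEPGRAM_API_KEY=...
```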
Confirm everything's wired up by running the hello-world script. It makes one API call per provider and saves the audio to `test_hello/`:

```sh
python test_hello.py
```

Failed providers are listed at the end with the error message. This is the quickest way to check which of your keys are working.
Generate a clip from one or more providers with `generate.py`:

```sh
python generate.py "The quick brown fox jumps over the lazy dog." --providers xai openai
```

Useful flags:

- `--providers` / `-p` — one or more provider keys. Defaults to all registered providers.
- `--voices` — restrict generation to specific voice IDs. Unknown IDs are silently skipped per provider.
- `--instructions` / `-i` — style instructions for providers that support them (currently OpenAI's gpt-4o-mini-tts).
- `--file` / `-f` — read the text from a file instead of the command line.
- `--play` — play each clip after generating it.
- `--list-providers` — print every registered provider and its voices.
Output lands in `samples/<timestamp>/` along with a `manifest.json` describing what was generated.
The tournament workflow has three steps: generate, test, results.
First, generate clips for a scenario. The bracket is shuffled so you can't see which slot belongs to which provider:
```sh
python tournament.py --scenario pride_and_prejudice generate
```

Then run the blind A/B test. You'll hear two clips per matchup and pick a winner without knowing the provider:

```sh
python tournament.py --scenario pride_and_prejudice test
```

When the bracket is complete, reveal the providers behind each slot:

```sh
python tournament.py --scenario pride_and_prejudice results
```

Output for each scenario lands in `tournament/<scenario_id>/`:
- `audio/` — generated WAV files, one per slot
- `bracket.json` — the bracket structure
- `reveal.json` — the slot-to-provider mapping (don't peek before testing)
- `results.json` — generation metadata and your A/B picks
Available scenarios:
| ID | What it tests |
|---|---|
| `pride_and_prejudice` | Period dialog with British voices |
| `the_news` | Casual conversational dialog |
| `nineteen_eighty_four` | Narration, spare and foreboding |
| `le_cafe` | French/English code-switching |
| `annotations` | Expressive tags (laugh, sigh, whisper, etc.) |
Run `python tournament.py` with no arguments to pick a scenario and command interactively.
Each provider lives in `providers/<name>.py` and implements the `TTSProvider` interface from `providers/base.py`. Three things to define:
- `name` — short identifier string.
- `voices` — a list of `Voice` dataclasses describing the available voices.
- `generate(text, voice_id, *, instructions=None)` — call the API and return `(audio_bytes, extension)`.
Then add the class to `REGISTRY` in `providers/__init__.py`. The new provider is immediately usable from `generate.py` and any scenario that lists it in `VOICE_PAIRS` or `VOICES`.
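Put together, a new adapter might look like the sketch below. It is illustrative only: the real `Voice` dataclass and `TTSProvider` base class live in `providers/base.py`, the "Acme" provider is made up, and the stand-in definitions here exist just so the snippet is self-contained.

```python
from dataclasses import dataclass


@dataclass
class Voice:  # stand-in for providers.base.Voice
    id: str
    label: str


class AcmeTTS:  # would subclass providers.base.TTSProvider in the real repo
    """Hypothetical adapter for a made-up 'Acme' TTS API."""

    name = "acme"  # short identifier string, used as the provider key
    voices = [Voice(id="nova", label="Nova (en-US)")]

    def generate(self, text, voice_id, *, instructions=None):
        # A real adapter would call the provider's HTTP API here and return
        # the response body; this stub returns placeholder bytes so the
        # (audio_bytes, extension) contract is visible.
        audio_bytes = b"\x00\x00\x00\x00"
        return audio_bytes, "wav"
```

Once a class like this is added to `REGISTRY`, the rest of the toolkit can call it the same way as any built-in provider.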
Scenarios live in `scenarios/<name>.py`. Each one defines a `SCENARIO_ID`, `SCENARIO_LABEL`, `SCENARIO_TYPE`, and a voice mapping per provider. The supported types are:
- `dialog` — two speakers with a `LINES` list of `(speaker, text)` tuples and a `VOICE_PAIRS` mapping.
- `narration` — single voice with a `SEGMENTS` list and a `VOICES` mapping.
- `annotated_dialog` / `annotated_narration` — same as above but the script can differ per provider, useful when annotation syntax varies (ElevenLabs uses `[laughs]`, xAI uses `<laugh>`, etc.).
See `scenarios/pride_and_prejudice.py` for the simplest dialog example.
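As a sketch, a minimal `dialog` scenario module might look like this. The scenario itself, its lines, and the voice IDs are all made up — only the field names follow the conventions described above:

```python
# Hypothetical scenarios/haiku_duel.py — illustrative, not in the repo.
SCENARIO_ID = "haiku_duel"
SCENARIO_LABEL = "Haiku Duel"
SCENARIO_TYPE = "dialog"

# (speaker, text) tuples, read in order by the dialog stitcher.
LINES = [
    ("A", "An old silent pond."),
    ("B", "A frog jumps into the pond."),
    ("A", "Splash! Silence again."),
]

# One (voice for speaker A, voice for speaker B) pair per provider key.
# These voice IDs are placeholders, not necessarily valid for any provider.
VOICE_PAIRS = {
    "openai": ("voice_a", "voice_b"),
    "xai": ("voice_one", "voice_two"),
}
```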
- The dialog stitcher in `dialog.py` calls the API once per line and concatenates with ffmpeg, with a configurable silence gap between turns. This is slower than passing the whole script at once, but it produces cleaner cuts and makes per-line latency measurable.
- All clips are normalized to mono 24 kHz 16-bit WAV during stitching for consistent playback.
- Audio output and tournament state are gitignored by default. If you want to share runs, commit them explicitly.