feat(tts): add Soniox TTS provider end-to-end #748
Wires Soniox WebSocket TTS into the existing builder/factory pattern alongside ElevenLabs, Cartesia, and Sarvam. Templates can now select Soniox via tts_configuration.provider="soniox" with voice_id, model, and language.

- New app/ai/voice/tts/soniox.py: SonioxTTSConfig + build_soniox_tts thin wrapper over pipecat's SonioxTTSService, plus _generate_soniox_audio for greeting prep (pipecat's Soniox client is streaming-only, so the one-shot is a small WS exchange against the same protocol)
- Add SONIOX to TTSProvider enum and to TTSConfig docstring
- Wire soniox branch in get_tts_service and generate_audio
- Add hardcoded soniox defaults to BB_SPEECH_PROVIDER_DEFAULTS

Reuses the existing SONIOX_API_KEY from STT. BB_VOICE_DEFAULTS_SONIOX and BB_SONIOX_AGGREGATE_SENTENCES Redis keys are picked up automatically by the existing dynamic-config helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walkthrough
Adds Soniox as a new Text-to-Speech provider integrated into the Breeze Buddy voice agent. Changes span type definitions, service implementation, public exports, wiring into the audio pipeline, and default configuration.
Changes
Soniox TTS Provider Integration
Sequence Diagram(s)
sequenceDiagram
participant BBClient as Breeze Buddy Client
participant Service as get_tts_service
participant SonioxService as SonioxTTSService
participant Synthesis as _generate_soniox_audio
participant WebSocket as Soniox WebSocket
participant AudioProcessing as Audio Processing
BBClient->>Service: request TTS (provider="soniox")
Service->>Service: validate SONIOX_API_KEY
Service->>Service: fetch aggregation settings
Service->>SonioxService: build with config
SonioxService-->>Service: constructed service
Service-->>BBClient: SonioxTTSService ready
BBClient->>Synthesis: generate_audio(text)
Synthesis->>Synthesis: apply voice/model defaults
Synthesis->>Synthesis: parse & validate language
Synthesis->>WebSocket: send config + text JSON
WebSocket-->>Synthesis: stream base64 audio chunks
Synthesis->>Synthesis: decode & concatenate PCM
WebSocket-->>Synthesis: terminated signal
Synthesis-->>BBClient: pcm_s16le bytes
BBClient->>AudioProcessing: convert_to_mulaw(pcm)
AudioProcessing-->>BBClient: mulaw audio
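The last step of the diagram turns `pcm_s16le` bytes into mu-law. The project's `convert_to_mulaw` is not shown in this PR, so the following is only a minimal sketch of standard G.711 mu-law companding in pure Python; the function names are hypothetical, and resampling from 16 kHz to 8 kHz is deliberately left out.

```python
import struct

_BIAS = 0x84   # 132, added to the magnitude before locating the segment
_CLIP = 32635  # largest magnitude that does not overflow after adding the bias


def _encode_sample(sample: int) -> int:
    """Compand one signed 16-bit sample into one mu-law byte (G.711)."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # Mu-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Hypothetical helper: encode little-endian 16-bit mono PCM as mu-law."""
    if len(pcm) % 2:
        raise ValueError("pcm_s16le data must contain an even number of bytes")
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return bytes(_encode_sample(s) for s in samples)
```

Each 2-byte PCM sample becomes a single byte, so the output is half the size of the input before any resampling.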
Estimated code review effort: 3 (Moderate) | ~22 minutes
Pre-merge checks: 5 passed
Pull request overview
Adds Soniox as a first-class TTS provider in the Breeze Buddy TTS factory/builder flow, enabling templates to select tts_configuration.provider = "soniox" and supporting both streaming TTS (via pipecat) and one-shot greeting synthesis (direct WebSocket protocol).
Changes:
- Added Soniox defaults to dynamic per-provider TTS defaults (`BB_SPEECH_PROVIDER_DEFAULTS`).
- Introduced `app/ai/voice/tts/soniox.py` with `SonioxTTSConfig`, `build_soniox_tts`, and a one-shot `_generate_soniox_audio` WebSocket synth helper.
- Wired `soniox` into Breeze Buddy TTS service construction and greeting audio generation; extended `TTSProvider`/`TTSConfig` docs accordingly.
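As a rough illustration of what wiring a provider into a factory can look like, the sketch below dispatches on a provider enum. Only `TTSProvider`, `TTSConfig`, and the `voice_id`/`model`/`language` fields come from this PR; the stub builders, the `BUILDERS` table, and `get_tts_service_sketch` are hypothetical stand-ins, and the real `get_tts_service` may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class TTSProvider(str, Enum):
    # Only two providers are shown; the real enum has more members.
    ELEVENLABS = "elevenlabs"
    SONIOX = "soniox"


@dataclass
class TTSConfig:
    provider: str
    voice_id: Optional[str] = None
    model: Optional[str] = None
    language: Optional[str] = None


def build_soniox_stub(config: TTSConfig) -> dict:
    # Stand-in for build_soniox_tts: returns a dict instead of a pipecat service.
    return {"service": "soniox", "voice": config.voice_id, "model": config.model}


def build_elevenlabs_stub(config: TTSConfig) -> dict:
    return {"service": "elevenlabs", "voice": config.voice_id, "model": config.model}


# Hypothetical dispatch table: adding a provider means adding one entry.
BUILDERS: Dict[TTSProvider, Callable[[TTSConfig], dict]] = {
    TTSProvider.ELEVENLABS: build_elevenlabs_stub,
    TTSProvider.SONIOX: build_soniox_stub,
}


def get_tts_service_sketch(config: TTSConfig) -> dict:
    """Resolve the provider string to an enum member and call its builder."""
    try:
        provider = TTSProvider(config.provider)
    except ValueError:
        raise ValueError(f"Unsupported TTS provider: {config.provider!r}") from None
    return BUILDERS[provider](config)
```

A table keeps unknown providers failing loudly in one place, whereas an `if`/`elif` chain (which the diff context in this PR suggests is what the code actually uses) needs an explicit final `else`.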
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `app/core/config/dynamic.py` | Adds Soniox hardcoded provider defaults for Redis-override merge. |
| `app/ai/voice/tts/soniox.py` | Implements Soniox TTS builder + one-shot WebSocket greeting synthesis helper. |
| `app/ai/voice/tts/__init__.py` | Exposes Soniox builder/config via shared TTS package exports. |
| `app/ai/voice/agents/breeze_buddy/tts/__init__.py` | Adds soniox branches to get_tts_service and generate_audio. |
| `app/ai/voice/agents/breeze_buddy/template/types.py` | Extends TTSProvider enum and documents Soniox example config. |
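The "Redis-override merge" mentioned for `app/core/config/dynamic.py` is not shown in this review. A minimal sketch of the idea, assuming the override arrives as a JSON string and that malformed or missing overrides fall back to the hardcoded values, could look like this. `HARDCODED_DEFAULTS` and `resolve_provider_defaults` are hypothetical names; the default values are the ones listed in the PR description.

```python
import json
from typing import Any, Dict, Optional

# Hardcoded per-provider defaults (values taken from the PR description).
HARDCODED_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "soniox": {"voice_id": "Adrian", "model": "tts-rt-v1-preview", "language": "en"},
}


def resolve_provider_defaults(provider: str, redis_json: Optional[str]) -> Dict[str, Any]:
    """Merge an optional JSON override on top of the hardcoded defaults."""
    resolved = dict(HARDCODED_DEFAULTS.get(provider, {}))
    if not redis_json:
        return resolved
    try:
        override = json.loads(redis_json)
    except json.JSONDecodeError:
        # Assumption: a malformed override is ignored rather than fatal.
        return resolved
    if isinstance(override, dict):
        # Null values in the override do not erase a hardcoded default.
        resolved.update({k: v for k, v in override.items() if v is not None})
    return resolved
```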
    language: Optional[str] = None,
    sample_rate: int = 16000,
) -> bytes:
    """One-shot synth via Soniox WebSocket for greeting prep.

    Opens a single WebSocket, sends config + text + ``text_end:true``, collects
    base64-encoded audio chunks until ``terminated``, and returns the
    concatenated PCM bytes.

    Returns 16-bit little-endian PCM mono at the requested ``sample_rate``,
    matching ``convert_to_mulaw`` expectations for downstream telephony use.
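The collection step this docstring describes can be exercised without a network connection. The sketch below applies the same message handling to an in-memory list of JSON strings; the field names (`audio`, `terminated`, `error_code`, `error_message`) are the ones used in the PR's receive loop, while `collect_pcm` itself is a hypothetical helper and not part of the change.

```python
import base64
import json
from typing import Iterable, List


def collect_pcm(raw_messages: Iterable[str]) -> bytes:
    """Decode base64 audio chunks until a message carries ``terminated``."""
    chunks: List[bytes] = []
    for raw in raw_messages:
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip frames that are not valid JSON
        if not isinstance(msg, dict):
            continue  # skip JSON values that are not objects
        error_code = msg.get("error_code")
        if error_code is not None:
            error_message = msg.get("error_message", "")
            raise RuntimeError(f"Soniox TTS error {error_code}: {error_message}")
        audio_b64 = msg.get("audio")
        if audio_b64:
            chunks.append(base64.b64decode(audio_b64))
        if msg.get("terminated"):
            break  # audio in the terminating message is kept; later messages are ignored
    return b"".join(chunks)
```

Keeping this logic in a plain function over an iterable makes it easy to unit-test separately from the WebSocket plumbing.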
Actionable comments posted: 3
🧹 Nitpick comments (1)
app/ai/voice/tts/soniox.py (1)
28-32: 💤 Low value
`__all__` is not sorted (Ruff RUF022).
♻️ Proposed fix
 __all__ = [
     "SonioxTTSConfig",
+    "_generate_soniox_audio",
     "build_soniox_tts",
-    "_generate_soniox_audio",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/tts/soniox.py` around lines 28 - 32, The __all__ export list in
soniox.py is not alphabetically sorted; update the __all__ list to be sorted
lexicographically (e.g., "SonioxTTSConfig", "build_soniox_tts",
"_generate_soniox_audio" -> order them alphabetically) so it satisfies Ruff
RUF022; edit the __all__ variable declaration to reorder the strings accordingly
while leaving the exact symbol names unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a0ba975a-4f8d-4335-a31a-88b0b7a89768
📒 Files selected for processing (5)
app/ai/voice/agents/breeze_buddy/template/types.py
app/ai/voice/agents/breeze_buddy/tts/__init__.py
app/ai/voice/tts/__init__.py
app/ai/voice/tts/soniox.py
app/core/config/dynamic.py
    elif provider == "soniox":
        audio_data = await _generate_soniox_audio(
            text=text,
            voice=resolved.voice_id,
            model=resolved.model,
            language=resolved.language,
        )
        input_format = "raw"
Language lookup inconsistency between streaming and one-shot paths.
get_tts_service uses _parse_language (key-based: Language[code.upper().replace("-", "_")]) for robustness, but generate_audio forwards resolved.language as a raw string to _generate_soniox_audio which applies value-based Language(language). These resolve identically for lowercase BCP 47 codes ("en", "hi"), but diverge for uppercase inputs ("EN", "EN_IN"): the value-based path silently falls back to Language.EN, while the streaming path would correctly map to the intended enum member.
🛠️ Proposed fix: pre-parse with _parse_language before forwarding
-    elif provider == "soniox":
-        audio_data = await _generate_soniox_audio(
-            text=text,
-            voice=resolved.voice_id,
-            model=resolved.model,
-            language=resolved.language,
-        )
-        input_format = "raw"
+    elif provider == "soniox":
+        audio_data = await _generate_soniox_audio(
+            text=text,
+            voice=resolved.voice_id,
+            model=resolved.model,
+            language=_parse_language(resolved.language, Language.EN),
+        )
+        input_format = "raw"
This also requires updating _generate_soniox_audio's language parameter type from Optional[str] to Optional[Language] to skip re-parsing internally.
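The divergence between key-based and value-based lookup is easy to reproduce in isolation. The sketch below uses a small stand-in `Language` enum (an assumption; the real one comes from pipecat and has many more members) and two hypothetical helpers that mirror the two code paths described in this comment.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    # Stand-in for pipecat's Language enum; only three members are shown.
    EN = "en"
    EN_IN = "en-IN"
    HI = "hi"


def parse_by_key(code: Optional[str], default: Language) -> Language:
    """Key-based lookup, as described for _parse_language."""
    if not code:
        return default
    try:
        return Language[code.upper().replace("-", "_")]
    except KeyError:
        return default


def parse_by_value(code: Optional[str], default: Language) -> Language:
    """Value-based lookup, as described for the one-shot path."""
    if not code:
        return default
    try:
        return Language(code)
    except ValueError:
        return default


for code in ("hi", "EN_IN", "en-IN"):
    print(code, parse_by_key(code, Language.EN).name, parse_by_value(code, Language.EN).name)
```

With this stand-in enum, both helpers agree on `"hi"` and `"en-IN"`, but `"EN_IN"` resolves to `EN_IN` by key and silently falls back to `EN` by value.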
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/tts/__init__.py` around lines 262 - 269, The
streaming vs one-shot language handling is inconsistent: update the
generate_audio call site to pre-parse resolved.language using _parse_language
(the same logic get_tts_service uses) before passing it to
_generate_soniox_audio, and change _generate_soniox_audio's language parameter
type from Optional[str] to Optional[Language] so it skips internal value-based
parsing; reference get_tts_service, _parse_language, generate_audio,
_generate_soniox_audio, resolved.language and the Language enum when making the
change.
    logger.info(
        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}): {text[:50]}..."
    )
PII exposure risk: greeting text logged verbatim.
By the time _generate_soniox_audio is called, template variables (e.g., {{customer_name}}) are already substituted, so text[:50] can contain customer names. Per project guidelines, logging sensitive data is a major compliance risk.
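If some traceability is still wanted, one option is to log the text length plus a short hash instead of the text itself. The following is a minimal sketch under that assumption; `describe_text_for_log` and the `text_sha256` field name are hypothetical and not part of the PR.

```python
import hashlib


def describe_text_for_log(text: str, sample_rate: int) -> str:
    """Build a log line that carries metadata about the text, not its content."""
    # A truncated SHA-256 lets two log lines be matched to the same input
    # without writing the input itself.
    fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return (
        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}), "
        f"text_length={len(text)} chars, text_sha256={fingerprint}"
    )


print(describe_text_for_log("Hello Priya, your order is on the way", 16000))
```

A hash of a short, predictable greeting can still be guessed by brute force, so the fingerprint is best treated as a correlation aid rather than as anonymization.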
🛡️ Proposed fix: log metadata only
-    logger.info(
-        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}): {text[:50]}..."
-    )
+    logger.info(
+        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}), "
+        f"text_length={len(text)} chars"
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/tts/soniox.py` around lines 128 - 130, The current logger.info
in _generate_soniox_audio exposes substituted template text (PII); remove
logging of text[:50] and instead log only non-PII metadata—e.g., sample_rate,
voice/id, text length, and a redacted or hashed fingerprint if you need
traceability—and update the logger.info call in _generate_soniox_audio to output
those safe fields only so no customer-sensitive content is written to logs.
    async with websocket_connect(SONIOX_TTS_WS_URL) as ws:
        await ws.send(json.dumps(config_msg))
        await ws.send(json.dumps(text_msg))

        async for raw in ws:
            try:
                msg = json.loads(raw)
            except json.JSONDecodeError:
                continue

            error_code = msg.get("error_code")
            if error_code is not None:
                error_message = msg.get("error_message", "")
                raise Exception(f"Soniox TTS error {error_code}: {error_message}")

            audio_b64 = msg.get("audio")
            if audio_b64:
                audio_chunks.append(base64.b64decode(audio_b64))

            if msg.get("terminated"):
                break
No overall timeout on the WebSocket receive loop.
open_timeout=10 handles connection establishment, but once connected, async for raw in ws: blocks until either terminated=True arrives or the keepalive mechanism fires (~40 s at the default ping_interval=20 + ping_timeout=20). A silent Soniox-side hang stalls greeting preparation — and therefore call startup — for up to 40 seconds.
The websockets library itself recommends asyncio.timeout() (Python ≥ 3.11) for per-receive timeouts, and the project requires Python 3.11+.
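The effect of an overall deadline can be demonstrated with a fake stream that goes silent after one chunk. This sketch uses `asyncio.wait_for` rather than `asyncio.timeout` so that it also runs on interpreters older than 3.11; the idea is the same, and all names here (`stalled_stream`, `collect_with_deadline`) are hypothetical.

```python
import asyncio
from typing import AsyncIterator, List


async def stalled_stream() -> AsyncIterator[bytes]:
    """Fake server: sends one chunk, then goes silent."""
    yield b"chunk-1"
    await asyncio.sleep(3600)
    yield b"never-delivered"


async def _drain(stream: AsyncIterator[bytes], out: List[bytes]) -> None:
    async for chunk in stream:
        out.append(chunk)


async def collect_with_deadline(stream: AsyncIterator[bytes], timeout_secs: float) -> List[bytes]:
    """Collect every chunk, but give up once the overall deadline passes."""
    received: List[bytes] = []
    try:
        await asyncio.wait_for(_drain(stream, received), timeout=timeout_secs)
    except asyncio.TimeoutError:
        raise RuntimeError(
            f"stream timed out after {timeout_secs}s with {len(received)} chunk(s) received"
        ) from None
    return received


if __name__ == "__main__":
    try:
        asyncio.run(collect_with_deadline(stalled_stream(), 0.1))
    except RuntimeError as exc:
        print(exc)
```

The deadline covers the whole exchange rather than each individual receive, which matches the intent of bounding greeting preparation as a whole.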
⏱️ Proposed fix: add asyncio.timeout around the WS block
+import asyncio
 ...
 async def _generate_soniox_audio(
     text: str,
     voice: Optional[str] = None,
     model: Optional[str] = None,
     language: Optional[str] = None,
     sample_rate: int = 16000,
+    timeout_secs: float = 30.0,
 ) -> bytes:
     ...
-    async with websocket_connect(SONIOX_TTS_WS_URL) as ws:
-        await ws.send(json.dumps(config_msg))
-        await ws.send(json.dumps(text_msg))
-
-        async for raw in ws:
-            ...
+    try:
+        async with asyncio.timeout(timeout_secs):
+            async with websocket_connect(SONIOX_TTS_WS_URL) as ws:
+                await ws.send(json.dumps(config_msg))
+                await ws.send(json.dumps(text_msg))
+
+                async for raw in ws:
+                    ...
+    except TimeoutError:
+        raise Exception(
+            f"Soniox TTS timed out after {timeout_secs}s waiting for audio"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/tts/soniox.py` around lines 133 - 153, The WebSocket receive
loop opened with websocket_connect(SONIOX_TTS_WS_URL) has no overall receive
timeout; wrap the receive/processing block (the async for raw in ws: loop that
decodes messages, checks error_code, collects audio_chunks, and breaks on
msg.get("terminated")) in an asyncio.timeout(...) context (e.g., configurable
seconds) so a silent Soniox hang raises asyncio.TimeoutError; on timeout
cancel/close the ws and raise an informative exception so callers know the TTS
request failed instead of hanging indefinitely. Ensure the timeout is applied
after sending config_msg and text_msg and that you still handle JSONDecodeError
and existing Soniox error_code logic inside the timeout.
Summary
- Adds Soniox as a TTS provider, selectable via `tts_configuration.provider = "soniox"`.
- New `app/ai/voice/tts/soniox.py`: thin `SonioxTTSConfig` + `build_soniox_tts` wrapping pipecat's `SonioxTTSService` (no subclassing; pipecat handles the WS protocol, multiplexing, keepalives), plus a small `_generate_soniox_audio` one-shot WS synth for greeting prep (pipecat's Soniox client is streaming-only, so the one-shot is implemented directly against the same documented WS protocol).
- Adds `SONIOX` to `TTSProvider` enum and a Soniox example to the `TTSConfig` docstring; wires the `soniox` branch into `get_tts_service` and `generate_audio`; adds hardcoded soniox defaults (`Adrian` / `tts-rt-v1-preview` / `en`) to `BB_SPEECH_PROVIDER_DEFAULTS`. Reuses the existing `SONIOX_API_KEY` from the STT integration.

What's exposed
| `TTSConfig` field | Soniox mapping |
|---|---|
| `voice_id` | `voice` (e.g. `Adrian`, plus the v3 voices listed in `SonioxTTSSpeakerV3`) |
| `model` | `model` (default `tts-rt-v1-preview`) |
| `language` | Parsed by `_parse_language` to a `Language` enum, then converted by pipecat to a Soniox 2-letter code (`en`, `ml`, `hi`, ...) |
| `speed` / `volume` / `emotion` / `pitch` | |

Other Soniox-specific knobs (`sample_rate` 16000, `audio_format` `pcm_s16le`) are set to telephony-friendly defaults inside the builder; pipecat resamples downstream and `convert_to_mulaw` produces 8 kHz mu-law for Twilio/Plivo/Exotel.

Dynamic config
Picked up automatically by existing helpers:
- `BB_VOICE_DEFAULTS_SONIOX` (Redis JSON, optional override of hardcoded defaults via `BB_VOICE_PROVIDER_DEFAULTS`)
- `BB_SONIOX_AGGREGATE_SENTENCES` (Redis bool, via `BB_AGGREGATE_SENTENCES("soniox")`)

Test plan
- `uv run pyrefly check` passes (verified locally, 0 errors)
- `uv run black --check . && uv run isort --check . --profile black` pass
- ... `provider=soniox` and verify the resulting service is `SonioxTTSService` with expected `voice` / `model` / `language` / `audio_format` (verified locally)
- ... `"tts_configuration": {"provider": "soniox", "voice_id": "Adrian", "model": "tts-rt-v1-preview", "language": "en"}` and confirm audio plays correctly
- ... `static_greeting` text) and verify `_generate_soniox_audio` returns playable audio that converts cleanly to mu-law
- ... `"language": "ml"`) and confirm Soniox accepts it and produces audio

🤖 Generated with Claude Code
Summary by CodeRabbit