feat(tts): add Soniox TTS provider end-to-end #748

Open
swaroopvarma1 wants to merge 1 commit into release from feat/soniox-tts

swaroopvarma1 (Collaborator) commented May 7, 2026

Summary

  • Wires Soniox WebSocket TTS into the shared builder/factory pattern alongside ElevenLabs, Cartesia, and Sarvam. Templates can now select Soniox via tts_configuration.provider = "soniox".
  • New app/ai/voice/tts/soniox.py: thin SonioxTTSConfig + build_soniox_tts wrapping pipecat's SonioxTTSService (no subclassing — pipecat handles the WS protocol, multiplexing, keepalives), plus a small _generate_soniox_audio one-shot WS synth for greeting prep (pipecat's Soniox client is streaming-only, so the one-shot is implemented directly against the same documented WS protocol).
  • Adds SONIOX to TTSProvider enum and a Soniox example to the TTSConfig docstring; wires the soniox branch into get_tts_service and generate_audio; adds hardcoded soniox defaults (Adrian / tts-rt-v1-preview / en) to BB_SPEECH_PROVIDER_DEFAULTS. Reuses the existing SONIOX_API_KEY from the STT integration.

What's exposed

| TTSConfig field | Soniox use |
| --- | --- |
| voice_id | maps to Soniox voice (e.g. Adrian, plus the v3 voices listed in SonioxTTSSpeakerV3) |
| model | maps to Soniox model (default tts-rt-v1-preview) |
| language | parsed via _parse_language to a Language enum, then converted by pipecat to a Soniox 2-letter code (en, ml, hi, ...) |
| speed / volume / emotion / pitch | silently ignored; Soniox doesn't accept them |

Other Soniox-specific knobs (sample_rate 16000, audio_format pcm_s16le) are set to telephony-friendly defaults inside the builder; pipecat resamples downstream and convert_to_mulaw produces 8 kHz mu-law for Twilio/Plivo/Exotel.
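
The field mapping above can be sketched roughly as follows. This is only an illustration, assuming a plain dict stands in for the resolved TTSConfig; `to_soniox_params`, `SONIOX_DEFAULTS`, and `SONIOX_FIXED` are hypothetical names and not the code added in this PR, and the `_parse_language` step is left out.

```python
from typing import Any, Dict

# Defaults quoted in this PR description.
SONIOX_DEFAULTS = {"voice": "Adrian", "model": "tts-rt-v1-preview", "language": "en"}
# Telephony-friendly values the description says the builder sets.
SONIOX_FIXED = {"sample_rate": 16000, "audio_format": "pcm_s16le"}


def to_soniox_params(tts_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map TTSConfig-style fields to Soniox parameters."""
    params = {
        "voice": tts_config.get("voice_id") or SONIOX_DEFAULTS["voice"],
        "model": tts_config.get("model") or SONIOX_DEFAULTS["model"],
        "language": tts_config.get("language") or SONIOX_DEFAULTS["language"],
    }
    # speed / volume / emotion / pitch are never copied, so they are dropped.
    params.update(SONIOX_FIXED)
    return params


example = to_soniox_params(
    {"provider": "soniox", "voice_id": "Adrian", "language": "ml", "speed": 1.2}
)
```

Here `example` carries the `ml` language and the default model, and the `speed` value does not appear in it.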

Dynamic config

Picked up automatically by existing helpers:

  • BB_VOICE_DEFAULTS_SONIOX (Redis JSON, optional override of hardcoded defaults via BB_VOICE_PROVIDER_DEFAULTS)
  • BB_SONIOX_AGGREGATE_SENTENCES (Redis bool, via BB_AGGREGATE_SENTENCES("soniox"))
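
As a rough sketch of how an optional Redis JSON override could layer over the hardcoded defaults: the function name `resolve_provider_defaults` and the dictionary key names are assumptions made for illustration, and the repository's existing helpers may behave differently (for example in how they treat malformed JSON).

```python
import json
from typing import Dict, Optional

# Hardcoded values quoted in the PR description; the key names are assumed.
HARDCODED_SONIOX_DEFAULTS: Dict[str, str] = {
    "voice_id": "Adrian",
    "model": "tts-rt-v1-preview",
    "language": "en",
}


def resolve_provider_defaults(redis_json: Optional[str]) -> Dict[str, str]:
    """Hypothetical merge: a Redis JSON value, if present, overrides defaults."""
    merged = dict(HARDCODED_SONIOX_DEFAULTS)
    if not redis_json:
        return merged
    try:
        override = json.loads(redis_json)
    except json.JSONDecodeError:
        return merged  # malformed override: fall back to hardcoded values
    if isinstance(override, dict):
        merged.update({k: v for k, v in override.items() if v is not None})
    return merged
```

With no Redis value the hardcoded defaults come back unchanged; a partial override such as `{"language": "hi"}` replaces only that key.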

Test plan

  • uv run pyrefly check passes (verified locally — 0 errors)
  • uv run black --check . && uv run isort --check . --profile black pass
  • Construct a TTS service via the factory with provider=soniox and verify the resulting service is SonioxTTSService with expected voice / model / language / audio_format (verified locally)
  • Make an outbound call with a template that has "tts_configuration": {"provider": "soniox", "voice_id": "Adrian", "model": "tts-rt-v1-preview", "language": "en"} and confirm audio plays correctly
  • Test a template with a dynamic greeting (variables in static_greeting text) and verify _generate_soniox_audio returns playable audio that converts cleanly to mu-law
  • Try a non-English language (e.g. "language": "ml") and confirm Soniox accepts it and produces audio

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added Soniox as a supported text-to-speech provider. Users can now configure voice synthesis with customizable models, languages, and audio formats for audio generation.

Wires Soniox WebSocket TTS into the existing builder/factory pattern alongside
ElevenLabs, Cartesia, and Sarvam. Templates can now select Soniox via
tts_configuration.provider="soniox" with voice_id, model, and language.

- New app/ai/voice/tts/soniox.py: SonioxTTSConfig + build_soniox_tts thin
  wrapper over pipecat's SonioxTTSService, plus _generate_soniox_audio for
  greeting prep (pipecat's Soniox client is streaming-only, so the one-shot
  is a small WS exchange against the same protocol)
- Add SONIOX to TTSProvider enum and to TTSConfig docstring
- Wire soniox branch in get_tts_service and generate_audio
- Add hardcoded soniox defaults to BB_SPEECH_PROVIDER_DEFAULTS

Reuses the existing SONIOX_API_KEY from STT. BB_VOICE_DEFAULTS_SONIOX and
BB_SONIOX_AGGREGATE_SENTENCES Redis keys are picked up automatically by the
existing dynamic-config helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 09:05

coderabbitai Bot commented May 7, 2026

Review Change Stack

Walkthrough

Adds Soniox as a new Text-to-Speech provider integrated into the Breeze Buddy voice agent. Changes span type definitions, service implementation, public exports, wiring into the audio pipeline, and default configuration.

Changes

Soniox TTS Provider Integration

| Layer / File(s) | Summary |
| --- | --- |
| Data Types and Configuration: app/ai/voice/agents/breeze_buddy/template/types.py, app/ai/voice/tts/soniox.py | Adds SONIOX member to TTSProvider enum and defines SonioxTTSConfig dataclass with API key, voice, model, language, sample rate, audio format, and sentence aggregation settings. TTSConfig docstring updated with Soniox example. |
| Soniox Service and Audio Generation: app/ai/voice/tts/soniox.py | Implements build_soniox_tts to construct a Soniox service with text aggregation mode mapping, and _generate_soniox_audio as an async WebSocket synthesizer that validates API key, parses language with EN fallback, streams base64-encoded PCM audio, and handles Soniox-reported errors. |
| Package Exports: app/ai/voice/tts/__init__.py | Imports and exports SonioxTTSConfig and build_soniox_tts to the public TTS package API. |
| Breeze Buddy Integration: app/ai/voice/agents/breeze_buddy/tts/__init__.py | Wires Soniox into get_tts_service (validates SONIOX_API_KEY, resolves aggregation and parameters, builds service) and generate_audio (routes to _generate_soniox_audio with input_format="raw", converts result to mulaw). |
| Default Configuration: app/core/config/dynamic.py | Adds "soniox" entry to BB_SPEECH_PROVIDER_DEFAULTS with voice, model, and language defaults. |

Sequence Diagram(s)

sequenceDiagram
    participant BBClient as Breeze Buddy Client
    participant Service as get_tts_service
    participant SonioxService as SonioxTTSService
    participant Synthesis as _generate_soniox_audio
    participant WebSocket as Soniox WebSocket
    participant AudioProcessing as Audio Processing
    
    BBClient->>Service: request TTS (provider="soniox")
    Service->>Service: validate SONIOX_API_KEY
    Service->>Service: fetch aggregation settings
    Service->>SonioxService: build with config
    SonioxService-->>Service: constructed service
    Service-->>BBClient: SonioxTTSService ready
    
    BBClient->>Synthesis: generate_audio(text)
    Synthesis->>Synthesis: apply voice/model defaults
    Synthesis->>Synthesis: parse & validate language
    Synthesis->>WebSocket: send config + text JSON
    WebSocket-->>Synthesis: stream base64 audio chunks
    Synthesis->>Synthesis: decode & concatenate PCM
    WebSocket-->>Synthesis: terminated signal
    Synthesis-->>BBClient: pcm_s16le bytes
    
    BBClient->>AudioProcessing: convert_to_mulaw(pcm)
    AudioProcessing-->>BBClient: mulaw audio
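
The receive half of the one-shot flow in the diagram can be approximated without a live connection. The sketch below is an assumption-laden illustration: `collect_pcm` is a hypothetical name, the frames are simulated, and only the response fields that appear in this PR's review excerpts (`audio`, `terminated`, `error_code`, `error_message`) are used.

```python
import base64
import json
from typing import Iterable, List


def collect_pcm(raw_messages: Iterable[str]) -> bytes:
    """Decode and concatenate audio chunks until a terminated message arrives."""
    chunks: List[bytes] = []
    for raw in raw_messages:
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip frames that are not valid JSON
        if msg.get("error_code") is not None:
            raise RuntimeError(
                f"Soniox TTS error {msg['error_code']}: {msg.get('error_message', '')}"
            )
        if msg.get("audio"):
            chunks.append(base64.b64decode(msg["audio"]))
        if msg.get("terminated"):
            break
    return b"".join(chunks)


# Simulated server frames standing in for a live WebSocket.
frames = [
    json.dumps({"audio": base64.b64encode(b"\x01\x00").decode()}),
    "not-json",
    json.dumps({"audio": base64.b64encode(b"\x02\x00").decode(), "terminated": True}),
    json.dumps({"audio": base64.b64encode(b"\xff\xff").decode()}),
]
pcm = collect_pcm(frames)
```

Anything sent after the `terminated` frame is never read, and an `error_code` in any frame aborts the whole synthesis.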

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • juspay/clairvoyance#461: Adds another new TTS provider with similar modifications to TTS initialization and package exports.
  • juspay/clairvoyance#559: Modifies the same TTSProvider enum in template/types.py to add other providers.
  • juspay/clairvoyance#421: Foundational Breeze Buddy TTS integration that this PR extends with Soniox provider support.

Poem

🐰 A new voice joins the choir so sweet,
Soniox brings beats to Breeze's fleet,
From WebSocket streams to mulaw's call,
The rabbit hops through types and all—
Configuration, exports, and more,
Audio magic through every door! 🎵

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main objective: adding Soniox as a new TTS provider end-to-end across the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Copilot AI left a comment


Pull request overview

Adds Soniox as a first-class TTS provider in the Breeze Buddy TTS factory/builder flow, enabling templates to select tts_configuration.provider = "soniox" and supporting both streaming TTS (via pipecat) and one-shot greeting synthesis (direct WebSocket protocol).

Changes:

  • Added Soniox defaults to dynamic per-provider TTS defaults (BB_SPEECH_PROVIDER_DEFAULTS).
  • Introduced app/ai/voice/tts/soniox.py with SonioxTTSConfig, build_soniox_tts, and a one-shot _generate_soniox_audio WebSocket synth helper.
  • Wired soniox into Breeze Buddy TTS service construction and greeting audio generation; extended TTSProvider / TTSConfig docs accordingly.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| app/core/config/dynamic.py | Adds Soniox hardcoded provider defaults for Redis-override merge. |
| app/ai/voice/tts/soniox.py | Implements Soniox TTS builder + one-shot WebSocket greeting synthesis helper. |
| app/ai/voice/tts/__init__.py | Exposes Soniox builder/config via shared TTS package exports. |
| app/ai/voice/agents/breeze_buddy/tts/__init__.py | Adds soniox branches to get_tts_service and generate_audio. |
| app/ai/voice/agents/breeze_buddy/template/types.py | Extends TTSProvider enum and documents Soniox example config. |

Comment on lines +86 to +96
    language: Optional[str] = None,
    sample_rate: int = 16000,
) -> bytes:
    """One-shot synth via Soniox WebSocket for greeting prep.

    Opens a single WebSocket, sends config + text + ``text_end:true``, collects
    base64-encoded audio chunks until ``terminated``, and returns the
    concatenated PCM bytes.

    Returns 16-bit little-endian PCM mono at the requested ``sample_rate``,
    matching ``convert_to_mulaw`` expectations for downstream telephony use.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
app/ai/voice/tts/soniox.py (1)

28-32: 💤 Low value

__all__ is not sorted (Ruff RUF022).

♻️ Proposed fix
 __all__ = [
     "SonioxTTSConfig",
+    "_generate_soniox_audio",
     "build_soniox_tts",
-    "_generate_soniox_audio",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ai/voice/tts/soniox.py` around lines 28 - 32, The __all__ export list in
soniox.py is not alphabetically sorted; update the __all__ list to be sorted
lexicographically (e.g., "SonioxTTSConfig", "build_soniox_tts",
"_generate_soniox_audio" -> order them alphabetically) so it satisfies Ruff
RUF022; edit the __all__ variable declaration to reorder the strings accordingly
while leaving the exact symbol names unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a0ba975a-4f8d-4335-a31a-88b0b7a89768

📥 Commits

Reviewing files that changed from the base of the PR and between a40123c and f8f744d.

📒 Files selected for processing (5)
  • app/ai/voice/agents/breeze_buddy/template/types.py
  • app/ai/voice/agents/breeze_buddy/tts/__init__.py
  • app/ai/voice/tts/__init__.py
  • app/ai/voice/tts/soniox.py
  • app/core/config/dynamic.py

Comment on lines +262 to +269
    elif provider == "soniox":
        audio_data = await _generate_soniox_audio(
            text=text,
            voice=resolved.voice_id,
            model=resolved.model,
            language=resolved.language,
        )
        input_format = "raw"

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Language lookup inconsistency between streaming and one-shot paths.

get_tts_service uses _parse_language (key-based: Language[code.upper().replace("-", "_")]) for robustness, but generate_audio forwards resolved.language as a raw string to _generate_soniox_audio which applies value-based Language(language). These resolve identically for lowercase BCP 47 codes ("en", "hi"), but diverge for uppercase inputs ("EN", "EN_IN"): the value-based path silently falls back to Language.EN, while the streaming path would correctly map to the intended enum member.

🛠️ Proposed fix — pre-parse with _parse_language before forwarding
-    elif provider == "soniox":
-        audio_data = await _generate_soniox_audio(
-            text=text,
-            voice=resolved.voice_id,
-            model=resolved.model,
-            language=resolved.language,
-        )
-        input_format = "raw"
+    elif provider == "soniox":
+        audio_data = await _generate_soniox_audio(
+            text=text,
+            voice=resolved.voice_id,
+            model=resolved.model,
+            language=_parse_language(resolved.language, Language.EN),
+        )
+        input_format = "raw"

This also requires updating _generate_soniox_audio's language parameter type from Optional[str] to Optional[Language] to skip re-parsing internally.
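
The divergence described in this comment can be reproduced with a small stand-in enum. The sketch below assumes a three-member `Language` enum in place of pipecat's much larger one, and `parse_by_key` / `parse_by_value` are hypothetical names that mirror the two lookup styles the comment describes; they are not the functions in this PR.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum; three members for illustration."""

    EN = "en"
    EN_IN = "en-IN"
    HI = "hi"


def parse_by_key(code: Optional[str], default: Language) -> Language:
    """Key-based lookup, as the comment describes for _parse_language."""
    if not code:
        return default
    try:
        return Language[code.upper().replace("-", "_")]
    except KeyError:
        return default


def parse_by_value(code: Optional[str], default: Language) -> Language:
    """Value-based lookup, as the comment describes for the one-shot path."""
    if not code:
        return default
    try:
        return Language(code)
    except ValueError:
        return default
```

Both helpers agree on `"en"` and `"hi"`, but for `"EN_IN"` the key-based helper returns `Language.EN_IN` while the value-based one falls back to the default.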

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ai/voice/agents/breeze_buddy/tts/__init__.py` around lines 262 - 269, The
streaming vs one-shot language handling is inconsistent: update the
generate_audio call site to pre-parse resolved.language using _parse_language
(the same logic get_tts_service uses) before passing it to
_generate_soniox_audio, and change _generate_soniox_audio's language parameter
type from Optional[str] to Optional[Language] so it skips internal value-based
parsing; reference get_tts_service, _parse_language, generate_audio,
_generate_soniox_audio, resolved.language and the Language enum when making the
change.

Comment on lines +128 to +130
    logger.info(
        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}): {text[:50]}..."
    )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

PII exposure risk: greeting text logged verbatim.

By the time _generate_soniox_audio is called, template variables (e.g., {{customer_name}}) are already substituted, so text[:50] can contain customer names. Per project guidelines, logging sensitive data is a major compliance risk.

🛡️ Proposed fix — log metadata only
-    logger.info(
-        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}): {text[:50]}..."
-    )
+    logger.info(
+        f"Synthesizing greeting with Soniox (pcm_s16le {sample_rate}), "
+        f"text_length={len(text)} chars"
+    )
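
If some traceability is still wanted, one option is to log a short digest alongside the length. This is a sketch under assumptions: `safe_greeting_log_fields` is a hypothetical helper, and the choice of a 12-character SHA-256 prefix is arbitrary.

```python
import hashlib
from typing import Dict, Union


def safe_greeting_log_fields(text: str, sample_rate: int) -> Dict[str, Union[int, str]]:
    """Return log fields that describe the request without including the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "sample_rate": sample_rate,
        "text_length": len(text),
        "text_sha256_prefix": digest[:12],  # fingerprint only, not the content
    }


fields = safe_greeting_log_fields("Hello Priya, this is your bank calling.", 16000)
```

A plain hash of a short, predictable greeting can still be matched by guessing likely inputs, so a keyed hash (for example HMAC with a secret) may be preferable where that matters.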
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ai/voice/tts/soniox.py` around lines 128 - 130, The current logger.info
in _generate_soniox_audio exposes substituted template text (PII); remove
logging of text[:50] and instead log only non-PII metadata—e.g., sample_rate,
voice/id, text length, and a redacted or hashed fingerprint if you need
traceability—and update the logger.info call in _generate_soniox_audio to output
those safe fields only so no customer-sensitive content is written to logs.

Comment on lines +133 to +153
    async with websocket_connect(SONIOX_TTS_WS_URL) as ws:
        await ws.send(json.dumps(config_msg))
        await ws.send(json.dumps(text_msg))

        async for raw in ws:
            try:
                msg = json.loads(raw)
            except json.JSONDecodeError:
                continue

            error_code = msg.get("error_code")
            if error_code is not None:
                error_message = msg.get("error_message", "")
                raise Exception(f"Soniox TTS error {error_code}: {error_message}")

            audio_b64 = msg.get("audio")
            if audio_b64:
                audio_chunks.append(base64.b64decode(audio_b64))

            if msg.get("terminated"):
                break

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

No overall timeout on the WebSocket receive loop.

open_timeout=10 handles connection establishment, but once connected, async for raw in ws: blocks until either terminated=True arrives or the keepalive mechanism fires (~40 s at the default ping_interval=20 + ping_timeout=20). A silent Soniox-side hang stalls greeting preparation — and therefore call startup — for up to 40 seconds.

The websockets library itself recommends asyncio.timeout() (Python ≥ 3.11) for per-receive timeouts, and the project requires Python 3.11+.

⏱️ Proposed fix — add asyncio.timeout around the WS block
+import asyncio
 ...
 async def _generate_soniox_audio(
     text: str,
     voice: Optional[str] = None,
     model: Optional[str] = None,
     language: Optional[str] = None,
     sample_rate: int = 16000,
+    timeout_secs: float = 30.0,
 ) -> bytes:
     ...
-    async with websocket_connect(SONIOX_TTS_WS_URL) as ws:
-        await ws.send(json.dumps(config_msg))
-        await ws.send(json.dumps(text_msg))
-
-        async for raw in ws:
-            ...
+    try:
+        async with asyncio.timeout(timeout_secs):
+            async with websocket_connect(SONIOX_TTS_WS_URL) as ws:
+                await ws.send(json.dumps(config_msg))
+                await ws.send(json.dumps(text_msg))
+
+                async for raw in ws:
+                    ...
+    except TimeoutError:
+        raise Exception(
+            f"Soniox TTS timed out after {timeout_secs}s waiting for audio"
+        )
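
The effect of an overall deadline can be tried without a real WebSocket. The sketch below is an illustration only: `silent_server` is a made-up stand-in for a connection that stops sending, and it uses `asyncio.wait_for` rather than the `asyncio.timeout` context manager from the proposed fix, since `wait_for` bounds the same receive loop and also runs on interpreters older than 3.11.

```python
import asyncio
from typing import AsyncIterator, List


async def silent_server() -> AsyncIterator[bytes]:
    """Stand-in for a WebSocket that sends one chunk and then goes quiet."""
    yield b"\x00\x01"
    await asyncio.sleep(3600)  # simulates a hang; no terminated frame arrives
    yield b"unreachable"


async def receive_all(timeout_secs: float) -> bytes:
    """Drain the stream, failing with a clear error if the deadline passes."""
    chunks: List[bytes] = []

    async def drain() -> None:
        async for chunk in silent_server():
            chunks.append(chunk)

    try:
        await asyncio.wait_for(drain(), timeout=timeout_secs)
    except asyncio.TimeoutError as exc:
        raise RuntimeError(
            f"TTS timed out after {timeout_secs}s waiting for audio"
        ) from exc
    return b"".join(chunks)
```

Calling `asyncio.run(receive_all(0.05))` raises `RuntimeError` after roughly 50 ms instead of waiting on the stalled stream.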
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ai/voice/tts/soniox.py` around lines 133 - 153, The WebSocket receive
loop opened with websocket_connect(SONIOX_TTS_WS_URL) has no overall receive
timeout; wrap the receive/processing block (the async for raw in ws: loop that
decodes messages, checks error_code, collects audio_chunks, and breaks on
msg.get("terminated")) in an asyncio.timeout(...) context (e.g., configurable
seconds) so a silent Soniox hang raises asyncio.TimeoutError; on timeout
cancel/close the ws and raise an informative exception so callers know the TTS
request failed instead of hanging indefinitely. Ensure the timeout is applied
after sending config_msg and text_msg and that you still handle JSONDecodeError
and existing Soniox error_code logic inside the timeout.
