feat(breeze-buddy): add STT fallback support#647

Open
Devansh-1218 wants to merge 1 commit into juspay:release from Devansh-1218:feat-stt-fallback-with-circuit-breaker

Conversation

@Devansh-1218
Contributor

@Devansh-1218 Devansh-1218 commented Mar 18, 2026

  • Generic ServiceFallback class with configurable threshold and cooldown
  • STT fallback: Soniox (primary) -> Deepgram (fallback) on repeated failures
  • Configurable threshold, failure window, and cooldown via DevCycle/Redis
  • Proactive routing to fallback provider when fallback is active
  • Background task for automatic reset to primary after cooldown period
  • Slack alerts on failure, activation, and reset with tagged users
  • Consolidated fallback modules (tasks.py -> fallback/__init__.py, constants.py -> stt/fallback.py)
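
For readers new to the pattern, the threshold / failure-window / cooldown behaviour listed above can be sketched in memory as follows. This is only an illustration and not the PR's ServiceFallback class (which is Redis-backed and async): the class name FallbackSketch, its fields, and the default values are assumptions chosen for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FallbackSketch:
    """In-memory stand-in for threshold/window/cooldown logic (hypothetical, not the PR's class)."""

    threshold: int = 2              # failures needed to activate fallback (illustrative)
    window_secs: float = 240.0      # failures older than this are ignored (illustrative)
    cooldown_secs: float = 1800.0   # how long fallback stays active (illustrative)
    clock: Callable[[], float] = time.monotonic
    _failures: List[float] = field(default_factory=list)
    _active_until: Optional[float] = None

    def is_active(self) -> bool:
        # Fallback is active until the cooldown measured from activation has elapsed.
        return self._active_until is not None and self.clock() < self._active_until

    def record_failure(self) -> bool:
        """Record one primary failure; return True if this call activated fallback."""
        now = self.clock()
        # Keep only failures that are still inside the window, then add this one.
        self._failures = [t for t in self._failures if now - t < self.window_secs]
        self._failures.append(now)
        if not self.is_active() and len(self._failures) >= self.threshold:
            self._active_until = now + self.cooldown_secs
            self._failures.clear()
            return True
        return False

    def choose_provider(self, primary: str = "soniox", fallback: str = "deepgram") -> str:
        # Proactive routing: new calls go straight to the fallback while it is active.
        return fallback if self.is_active() else primary
```

Because the cooldown is measured from the moment of activation, the return to the primary provider in this sketch needs no separate reset step; the real implementation may differ.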

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic fallback provider switching for speech-to-text to enhance call reliability
    • Added configuration options for speech-to-text provider selection and ElevenLabs voice settings
  • Bug Fixes

    • Improved handling of mid-call speech-to-text failures with proper call termination
    • Enhanced monitoring with automatic alerts for speech-to-text failures and fallback events
  • Improvements

    • Added reconnect capability for speech-to-text services
    • Extended Slack alerting with customizable user tagging

Copilot AI review requested due to automatic review settings March 18, 2026 09:39
@coderabbitai

coderabbitai Bot commented Mar 18, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5a851f7e-23e7-4108-b598-02c9d0f8753c

Walkthrough

This PR introduces a Redis-backed STT fallback orchestration system for Breeze Buddy that detects initialization and mid-call speech-to-text failures, routes to a fallback provider when failure thresholds are exceeded, dispatches templated Slack alerts, and terminates calls on non-recoverable errors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| STT Fallback Infrastructure: app/services/fallback/__init__.py | Introduces ServiceFallback and ServiceFallbackConfig classes with Redis-backed failure tracking, threshold-based activation, alert callbacks, and STT-specific routines (check_and_reset_stt_fallback, initialize_fallback_tasks) for periodic fallback reset scheduling. |
| Breeze Buddy STT Refactoring: app/ai/voice/agents/breeze_buddy/stt/__init__.py, app/ai/voice/agents/breeze_buddy/agent/pipeline.py | Introduces STTServiceResult dataclass; refactors create_stt_from_config and get_stt_service to return provider metadata alongside service instance; adds fallback routing logic on init failures; updates pipeline integration to consume and propagate provider/service results. |
| STT Fallback Orchestration: app/ai/voice/agents/breeze_buddy/stt/fallback.py | New module implementing templated Slack alerting, fallback lifecycle notifications, and failure handlers (send_templated_alert, record_stt_failure, notify_fallback_active, handle_stt_init_failure) for orchestrating proactive and reactive STT fallback behavior. |
| Agent Error Detection & Termination: app/ai/voice/agents/breeze_buddy/agent/__init__.py | Adds mid-call STT pipeline error detection via on_pipeline_error handler; records failure state; dispatches Slack alerts via fire_and_forget; terminates call by queuing EndFrame() on non-recoverable errors. |
| Dynamic Configuration: app/core/config/dynamic.py | Adds nine new async config accessors for STT fallback settings (enable flag, provider name, thresholds, window/duration parameters) and ElevenLabs TTS voice/model/speed configuration. |
| Infrastructure & Utilities: app/ai/voice/agents/breeze_buddy/utils/common.py, app/services/slack/alert.py, app/main.py, app/ai/voice/stt/soniox/config.py | Adds fire_and_forget() for async task scheduling; extends Slack alert sender with optional tag_users parameter; integrates fallback task initialization into lifespan startup; adds reconnect_on_error flag to Soniox config. |

Sequence Diagram(s)

sequenceDiagram
    participant Agent
    participant Pipeline as Pipeline Handler
    participant Fallback as ServiceFallback
    participant Redis
    participant SlackAlert as Slack Alert
    participant CallControl as Call Control

    Agent->>Pipeline: run() / on_pipeline_error()
    Pipeline->>Pipeline: detect STT error
    Pipeline->>Fallback: record_stt_failure()
    Fallback->>Redis: EVAL (Lua script) INCR failure_count
    Fallback->>Redis: Set TTL on counter
    Fallback-->>Pipeline: failure recorded
    
    Pipeline->>Fallback: is_active()?
    Fallback->>Redis: EXISTS fallback:active
    Fallback-->>Pipeline: false (not yet active)
    
    Pipeline->>Redis: Check failure_count vs threshold
    alt Threshold Exceeded
        Pipeline->>Fallback: (internal) SET active flag
        Pipeline->>SlackAlert: fire_and_forget(alert)
        SlackAlert->>SlackAlert: send_templated_alert()
        SlackAlert-->>SlackAlert: (async, non-blocking)
    end
    
    Pipeline->>CallControl: Queue EndFrame()
    CallControl->>Agent: Terminate call
    Agent-->>Pipeline: Pipeline ends

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Integration of stt for breeze buddy #419: Refactors Breeze Buddy STT service construction and wiring (get_stt_service changes) that form the foundation for this PR's fallback integration.
  • Integration of tts in Breeze Buddy #421: Modifies STT initialization with API-key validation in the same create_stt_from_config and get_stt_service functions that are refactored here to introduce fallback behavior.

Suggested reviewers

  • sharifajahanshaik
  • cmd-err

Poem

🐰 A fallback's grace, when speech goes wrong,
Redis counts the stumbles along,
When threshold's crossed, alerts take flight,
The call concludes, all ends so right!
Slack sings out the tale so bright. 🔔

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat(breeze-buddy): add STT fallback support' directly and accurately reflects the main objective of the changeset, which is to implement Speech-To-Text fallback functionality for the Breeze Buddy agent. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.54% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Copilot AI left a comment

Pull request overview

Adds a Redis-backed circuit breaker to Breeze Buddy’s STT stack to automatically fail over from Soniox to Deepgram across pods, including a HALF_OPEN “probe call” path and mid-call hot swap via Pipecat ServiceSwitcher.

Changes:

  • Introduces Redis-backed STT circuit breaker state/locking and Slack alerting for trip/recovery/probe.
  • Updates Breeze Buddy STT initialization to return an STTServiceResult and route Soniox/Deepgram based on circuit state.
  • Integrates optional ServiceSwitcher wrapping in the Breeze Buddy pipeline and adds mid-call STT swap handling.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| app/core/config/static.py | Adds feature flag + circuit breaker tuning env vars + Breeze Buddy-specific Deepgram env vars. |
| app/ai/voice/stt/soniox/config.py | Adds reconnect_on_error to Soniox config and passes it into the Soniox service builder. |
| app/ai/voice/agents/breeze_buddy/stt/fallback.py | New Redis-backed circuit breaker module (states, Redis keys, Slack alerts, probe lock). |
| app/ai/voice/agents/breeze_buddy/stt/__init__.py | Adds Deepgram fallback building and circuit-breaker-based STT routing, plus result wrapper. |
| app/ai/voice/agents/breeze_buddy/agent/pipeline.py | Adds optional ServiceSwitcher wrapping when a fallback STT is provided. |
| app/ai/voice/agents/breeze_buddy/agent/__init__.py | Records STT failures, attempts mid-call STT hot swap, and finalizes probe outcomes on disconnect. |

Comment on lines +272 to +280
if ENABLE_BREEZE_BUDDY_STT_FALLBACK:
    from app.ai.voice.agents.breeze_buddy.stt.fallback import (
        CircuitState,
        stt_circuit_breaker,
    )

    circuit_state = await stt_circuit_breaker.get_state()
    logger.info(f"STT circuit breaker state: {circuit_state.value}")

Comment on lines +317 to +338
try:
    stt_service = _build_soniox(language_hints, soniox_context)
    fallback = _build_deepgram_fallback(vad_events=False)
    logger.info(
        "Probe call: Soniox primary + Deepgram fallback "
        "(ServiceSwitcher, vad_events=False)"
    )
    return STTServiceResult(
        service=stt_service,
        provider="soniox",
        fallback_service=fallback,
        is_probe_call=True,
    )
except Exception as probe_err:
    logger.error(
        f"Probe call Soniox init failed: {probe_err}. "
        f"Recording failure and falling back to Deepgram."
    )
    # Init-time failure during probe: record + release lock
    _fire_and_forget(_send_soniox_failure_alert(probe_err, "deepgram"))
    await stt_circuit_breaker.record_failure()
    await stt_circuit_breaker.release_probe()

Comment on lines +657 to +663
if self.fallback_stt and not self.stt_switched:
    self.stt_switched = True
    logger.info("Attempting STT hot-swap via ServiceSwitcher")
    try:
        from pipecat.pipeline.service_switcher import (
            ManuallySwitchServiceFrame,
        )

if not self._stt_failure_recorded:
    await stt_circuit_breaker.record_success()
else:
    await stt_circuit_breaker.release_probe()

Comment on lines +87 to +99
async def record_failure(self) -> None:
    """Increment failure count. Trip to OPEN if threshold reached."""
    try:
        redis = await get_redis_service()
        count = await redis.incr(self._KEY_FAILURE_COUNT)
        # Set TTL on first failure (rolling window)
        if count == 1:
            await redis.expire(self._KEY_FAILURE_COUNT, self.failure_window)
        logger.info(
            f"STT circuit breaker: failure {count}/{self.failure_threshold}"
        )
        if count >= self.failure_threshold:
            await self._trip(redis)

Comment on lines +92 to +94
# Set TTL on first failure (rolling window)
if count == 1:
    await redis.expire(self._KEY_FAILURE_COUNT, self.failure_window)

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 698-708: The probe handling currently calls
stt_circuit_breaker.record_success() whenever is_probe_call is true and
_stt_failure_recorded is false, which can close the breaker on non‑STT
teardowns; change the logic in the is_probe_call block to only call
stt_circuit_breaker.record_success() when you have an actual successful Soniox
transcription/frame (e.g., check an existing success indicator like
self._stt_successful_transcription or the variable that flags a valid
transcription event), and otherwise call stt_circuit_breaker.release_probe();
keep the import and use of stt_circuit_breaker and the _stt_failure_recorded
check but gate record_success() on the real Soniox success flag instead of
merely not having recorded a failure.

In `@app/ai/voice/agents/breeze_buddy/stt/__init__.py`:
- Around line 317-346: The current try block wraps both _build_soniox(...) and
_build_deepgram_fallback(...), so a failure building the Deepgram fallback
wrongly triggers Soniox failure handling; refactor so you first call
_build_soniox(language_hints, soniox_context) inside its own try/except and only
on exceptions run _send_soniox_failure_alert(probe_err), await
stt_circuit_breaker.record_failure(), await stt_circuit_breaker.release_probe(),
and fall back to Deepgram; after successfully creating the Soniox service,
separately construct fallback = _build_deepgram_fallback(vad_events=False) (with
its own error handling that does not mutate Soniox circuit state) and then
return the STTServiceResult with provider="soniox" and fallback_service
populated accordingly.
- Around line 171-173: The gate-breaker routing and failure-recording currently
check only ENABLE_BREEZE_BUDDY_STT_FALLBACK, causing routing to
_build_deepgram_fallback() even when DEEPGRAM_API_KEY is absent; make the
routing and failure-recording use the same combined predicate used in
_build_soniox(): (ENABLE_BREEZE_BUDDY_STT_FALLBACK and DEEPGRAM_API_KEY). Update
any gate checks and failure-recording branches that reference the flag alone
(the routing logic that selects between _build_soniox() and
_build_deepgram_fallback() and the code that records fallback failures) to
evaluate the combined predicate so the breaker stays on Soniox when Deepgram
credentials are missing.

In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py`:
- Around line 56-68: The constructor stores retrip_threshold but it’s never used
so HALF_OPEN probes that fail only bump failure_count toward failure_threshold
and never re-open the breaker; update the failure-handling logic in
record_failure (and any code that handles STATE_HALF_OPEN) to compare failures
during HALF_OPEN against self.retrip_threshold instead of self.failure_threshold
and call the same re-trip behavior (set state to OPEN, set open expiry using
self.open_duration, reset probe/failure counters) when the retrip threshold is
reached; ensure references to failure_count, STATE_HALF_OPEN, record_failure,
and retrip_threshold are adjusted so HALF_OPEN failures can immediately re-trip
the circuit using retrip_threshold.
- Around line 9-13: The circuit-breaker Redis keys (strings like
"stt:cb:failure_count", "stt:cb:open", "stt:cb:half_open", "stt:cb:probe_lock")
are currently global; change the code in fallback.py to scope them under a
Breeze Buddy namespace by either adding a BREEZE_BUDDY_NAMESPACE prefix to those
key constants or, preferably, pass a namespace parameter into all
redis_get/redis_set/redis_delete calls that touch these keys (e.g., where the
circuit logic reads/writes failure_count, open, half_open, probe_lock); update
the key constant definitions and every usage in the circuit-breaker functions
(including the calls around the block noted at lines 51–54) to use the
namespaced form or the redis helper's namespace argument so the keys no longer
collide with other services.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f892c0b1-4eaa-47de-b678-09cd51a80d70

📥 Commits

Reviewing files that changed from the base of the PR and between 2e009c4 and 0d6ef1c.

📒 Files selected for processing (6)
  • app/ai/voice/agents/breeze_buddy/agent/__init__.py
  • app/ai/voice/agents/breeze_buddy/agent/pipeline.py
  • app/ai/voice/agents/breeze_buddy/stt/__init__.py
  • app/ai/voice/agents/breeze_buddy/stt/fallback.py
  • app/ai/voice/stt/soniox/config.py
  • app/core/config/static.py

Comment on lines +698 to +708
# Record probe outcome in circuit breaker
if self.is_probe_call:
    try:
        from app.ai.voice.agents.breeze_buddy.stt.fallback import (
            stt_circuit_breaker,
        )

        if not self._stt_failure_recorded:
            await stt_circuit_breaker.record_success()
        else:
            await stt_circuit_breaker.release_probe()

⚠️ Potential issue | 🟠 Major

Only close the breaker after a positive Soniox signal.

Right now any probe call that disconnects without _stt_failure_recorded calls record_success(). A caller hanging up before speaking, or any other non-STT teardown, will close the circuit globally even though Soniox was never validated. Gate record_success() on an actual successful Soniox transcription/frame and otherwise just release the probe lock.
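
A minimal sketch of the gating suggested above, written as a pure decision function so it can be checked in isolation. The names ProbeOutcome, decide_probe_outcome, and the saw_transcription flag are hypothetical; the agent would need to set such a flag itself when Soniox delivers a real transcription, and the PR's actual attribute names may differ.

```python
from enum import Enum


class ProbeOutcome(Enum):
    RECORD_SUCCESS = "record_success"  # close the breaker: Soniox was actually validated
    RELEASE_PROBE = "release_probe"    # only free the probe lock, leave breaker state alone


def decide_probe_outcome(failure_recorded: bool, saw_transcription: bool) -> ProbeOutcome:
    """Decide what a finished probe call should report to the circuit breaker.

    saw_transcription is a hypothetical flag that is True only if the primary
    STT produced at least one transcription during the call.
    """
    if not failure_recorded and saw_transcription:
        return ProbeOutcome.RECORD_SUCCESS
    # Caller hung up early, non-STT teardown, or a failure was already recorded.
    return ProbeOutcome.RELEASE_PROBE
```

With this shape, a caller who hangs up before speaking yields RELEASE_PROBE, so the next call can probe again instead of the circuit closing on no evidence.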

Comment thread app/ai/voice/agents/breeze_buddy/stt/__init__.py Outdated
Comment thread app/ai/voice/agents/breeze_buddy/stt/__init__.py Outdated
Comment on lines +9 to +13
Redis keys (no namespace prefix - each env has its own Redis):
    stt:cb:failure_count - rolling failure counter with TTL window
    stt:cb:open          - present while circuit is OPEN (TTL = open_duration)
    stt:cb:half_open     - present while circuit is HALF-OPEN (auto-set when open expires)
    stt:cb:probe_lock    - mutex so only one call probes at a time

⚠️ Potential issue | 🟠 Major

Namespace these circuit-breaker keys.

The docstring and constants make these global Redis keys. In a shared Redis, another service or another breaker instance can trip or clear Breeze Buddy’s circuit state. Route the access through the namespaced Redis helpers, or at minimum scope the keys under a Breeze Buddy namespace. The coding guidelines say to always use the namespace parameter in redis_get/redis_set calls to prevent key collisions across services.

Also applies to: 51-54
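
One way to scope the keys, sketched with a plain prefix helper. The prefix value "breeze_buddy" and the namespaced() helper are assumptions for illustration; the repository's redis_get/redis_set helpers take their own namespace argument, whose exact signature is not shown in this thread.

```python
# Hypothetical namespace value; the project may already define its own constant.
BREEZE_BUDDY_NAMESPACE = "breeze_buddy"


def namespaced(key: str, namespace: str = BREEZE_BUDDY_NAMESPACE) -> str:
    """Prefix a Redis key so it cannot collide with other services' keys."""
    return f"{namespace}:{key}"


# The four circuit-breaker keys from the docstring, now scoped under the namespace.
KEY_FAILURE_COUNT = namespaced("stt:cb:failure_count")
KEY_OPEN = namespaced("stt:cb:open")
KEY_HALF_OPEN = namespaced("stt:cb:half_open")
KEY_PROBE_LOCK = namespaced("stt:cb:probe_lock")
```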

Comment on lines +56 to +68
def __init__(
    self,
    failure_threshold: int = STT_CIRCUIT_BREAKER_FAILURE_THRESHOLD,
    retrip_threshold: int = STT_CIRCUIT_BREAKER_RETRIP_THRESHOLD,
    open_duration: int = STT_CIRCUIT_BREAKER_OPEN_DURATION_SECS,
    failure_window: int = STT_CIRCUIT_BREAKER_FAILURE_WINDOW_SECS,
    probe_lock_ttl: int = STT_CIRCUIT_BREAKER_PROBE_LOCK_TTL_SECS,
):
    self.failure_threshold = failure_threshold
    self.retrip_threshold = retrip_threshold
    self.open_duration = open_duration
    self.failure_window = failure_window
    self.probe_lock_ttl = probe_lock_ttl

⚠️ Potential issue | 🟠 Major

HALF_OPEN failures never re-open the breaker.

retrip_threshold is stored but never used. In HALF_OPEN, a failed probe only increments failure_count to 1 and leaves the circuit HALF_OPEN because record_failure() still compares against failure_threshold. That lets an unhealthy Soniox keep receiving periodic probe traffic instead of re-tripping immediately.

Suggested fix
     async def record_failure(self) -> None:
         """Increment failure count. Trip to OPEN if threshold reached."""
         try:
             redis = await get_redis_service()
+            threshold = self.failure_threshold
+            if await redis.exists(self._KEY_HALF_OPEN) and not await redis.exists(
+                self._KEY_OPEN
+            ):
+                threshold = self.retrip_threshold
             count = await redis.incr(self._KEY_FAILURE_COUNT)
             # Set TTL on first failure (rolling window)
             if count == 1:
                 await redis.expire(self._KEY_FAILURE_COUNT, self.failure_window)
             logger.info(
-                f"STT circuit breaker: failure {count}/{self.failure_threshold}"
+                f"STT circuit breaker: failure {count}/{threshold}"
             )
-            if count >= self.failure_threshold:
+            if count >= threshold:
                 await self._trip(redis)

Also applies to: 87-100

@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch 2 times, most recently from 259eb66 to c7cafda Compare March 18, 2026 11:38
_background_tasks: set[asyncio.Task] = set()


def _fire_and_forget(coro) -> None:

Collaborator

utils

Comment thread app/core/config/static.py Outdated
# When enabled, uses a circuit breaker to route calls:
# CLOSED -> Soniox only | OPEN -> Deepgram only | HALF-OPEN -> single probe call on Soniox
ENABLE_BREEZE_BUDDY_STT_FALLBACK = (
    os.environ.get("ENABLE_BREEZE_BUDDY_STT_FALLBACK", "false").lower() == "true"

Collaborator

dynamic.py

Comment thread app/ai/voice/agents/breeze_buddy/stt/fallback.py Outdated
Comment thread app/ai/voice/agents/breeze_buddy/stt/fallback.py Outdated
@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch 2 times, most recently from 9a91582 to 5460862 Compare March 20, 2026 08:29
Comment thread app/ai/voice/agents/breeze_buddy/stt/__init__.py Outdated
@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch 4 times, most recently from 567b2a8 to c970db7 Compare April 9, 2026 10:54
@Devansh-1218 Devansh-1218 changed the title from "feat(breeze-buddy): add STT fallback with Redis-backed circuit breaker" to "feat(breeze-buddy): add STT fallback support" Apr 9, 2026
@swaroopvarma1
Collaborator

needs_changes

This PR introduces critical reliability bugs: an undefined name that masks root causes during failures, missing TTLs creating permanent fallback states, and race conditions that poison failure counters. Combined with hardcoded provider checks that silently disable fallback for non-Soniox deployments, these issues pose significant operational risk.

  • Undefined name masks root cause: raise primary_ references an undefined variable (should be primary_err), causing a NameError when fallback fails. app/ai/voice/agents/breeze_buddy/stt/fallback.py:365
  • Permanent fallback state: Redis activation key set without TTL; if reset task crashes, system stays in fallback indefinitely. app/services/fallback/__init__.py:95
  • Flapping risk: Background reset task clears fallback state without verifying fallback_duration_secs elapsed. app/services/fallback/__init__.py:150
  • Redis race condition: Non-atomic INCR/EXPIRE leaves permanent failure counter if process crashes between operations. app/services/fallback/__init__.py:55
  • Silent integration failure: Circuit breaker hardcoded to only record Soniox failures, disabling fallback protection for Deepgram/Google/Sarvam primaries. app/ai/voice/agents/breeze_buddy/agent/__init__.py:625
  • Format string injection: Slack alerts use .format() with unsanitized error messages that may contain curly braces from API responses. app/ai/voice/agents/breeze_buddy/stt/fallback.py:172
  • Swallows shutdown exceptions: Bare except: clause catches KeyboardInterrupt and SystemExit, blocking graceful shutdown. app/ai/voice/agents/breeze_buddy/agent/__init__.py:643
  • Fragile error classification: Relies on substring matching ('processor_name' in str(error)) instead of structured exception types. app/ai/voice/agents/breeze_buddy/agent/__init__.py:612
  • Hardcoded typo: Misspelled Slack group @breeze-sentinals is not configurable via environment. app/ai/voice/agents/breeze_buddy/stt/fallback.py:25
  • Blocking initialization: Fallback service build lacks timeout, can hang call setup indefinitely if provider stalls. app/ai/voice/agents/breeze_buddy/stt/__init__.py:285
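
On the format-string point in the list above, a short sketch of when braces in an error message are and are not a problem with str.format. The template text, escape_braces, and render_alert are hypothetical; whether the PR passes the message as a value or splices it into the template is not visible in this thread.

```python
def escape_braces(text: str) -> str:
    """Double curly braces so str.format treats them as literal characters."""
    return text.replace("{", "{{").replace("}", "}}")


def render_alert(template: str, **fields: str) -> str:
    """Fill a Slack alert template with named fields."""
    return template.format(**fields)


error_msg = 'bad request: {"code": 400}'

# Passing the message as a field value is safe: str.format does not re-parse
# the values it substitutes, so braces inside error_msg are left alone.
safe = render_alert(
    "STT failure on {provider}: {error}", provider="soniox", error=error_msg
)

# The risk appears only if the message is spliced into the template itself.
# In that case, escape it before formatting.
spliced_template = "STT failure on {provider}: " + escape_braces(error_msg)
also_safe = render_alert(spliced_template, provider="soniox")
```

Without the escaping step, the spliced variant would treat {"code": 400} as a replacement field and raise instead of sending the alert.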



async def initialize_fallback_tasks(scheduler: BackgroundTaskScheduler) -> None:
    """Register STT fallback reset task if fallback is enabled."""

Contributor

Bug: Reset task never registered if flag is off at startup — stuck fallback state

If ENABLE_BB_STT_FALLBACK is false at server start, this function returns early and stt_fallback_reset is never scheduled. If the flag is then flipped to true mid-run (Redis/DevCycle), record_failure will activate fallback by setting fallback:stt:active in Redis — but nothing ever calls reset_to_primary(). The key has no TTL; it survives restarts. System is permanently locked on fallback until someone manually deletes the Redis key.

Reproduced: flag=false at startup → flag flips true → STT fails twice → fallback:stt:active set → 30+ minutes pass → key still present, no reset.

Fix: either (a) always register the reset task and let it check the flag at runtime, or (b) give fallback:stt:active a TTL of fallback_duration_secs in _activate() so it self-expires without needing the reset task.


duration_secs = await BB_STT_FALLBACK_DURATION_SECS()
scheduler.register_task(
    name="stt_fallback_reset",

Contributor

Bug: Fallback duration is not guaranteed — reset fires on startup cadence, not activation time

The reset task is registered with interval_seconds=duration_secs (e.g. 1800s). It fires at T+1800, T+3600, ... relative to server start. Fallback activation time is unrelated to this schedule.

Worst case: fallback activates 1 minute before the next reset tick → fallback lasts 1 minute, not 30. Operators configure 30-minute fallback, get 1–60 minute windows unpredictably. On a flaky STT provider, rapid fallback→reset→fail→fallback cycles trigger alert storms.

Root cause: check_and_reset_stt_fallback calls fallback.is_active() and resets unconditionally — it doesn't check when activation occurred.

Fix option 1: Store activation timestamp (fallback:stt:activated_at) and only reset if now - activated_at >= duration_secs.

Fix option 2 (simpler): Use a TTL on fallback:stt:active itself. In _activate(), change redis.set(key, '1', nx=True) to redis.set(key, '1', nx=True, ex=fallback_duration_secs). Key self-expires after exactly duration_secs. No reset task needed — just check is_active().
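
Fix option 2 can be sketched as below. To keep the example runnable without a Redis server, it uses a tiny in-memory FakeStore that mimics only the two calls involved (SET with nx/ex, and EXISTS); FakeStore, activate, and is_active are hypothetical names, and the real code is async and talks to Redis through the project's own service wrapper.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FakeStore:
    """In-memory stand-in for the Redis calls used below (not a real Redis client)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def _live(self, key: str) -> bool:
        entry = self._data.get(key)
        if entry is None:
            return False
        _, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired keys
            return False
        return True

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None) -> bool:
        if nx and self._live(key):
            return False  # NX: refuse to overwrite an existing key
        expires_at = self._clock() + ex if ex is not None else None
        self._data[key] = (value, expires_at)
        return True

    def exists(self, key: str) -> bool:
        return self._live(key)


ACTIVE_KEY = "fallback:stt:active"


def activate(store, duration_secs: int) -> bool:
    """Return True only for the caller that actually switched fallback on."""
    # NX makes activation idempotent; EX makes the flag self-expire after the cooldown.
    return bool(store.set(ACTIVE_KEY, "1", nx=True, ex=duration_secs))


def is_active(store) -> bool:
    return bool(store.exists(ACTIVE_KEY))
```

Since the expiry is attached at activation time, the fallback window no longer depends on when the server started or on a reset task being registered.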

Contributor Author

_activate() was calling redis.set(key, "1", nx=True) with no TTL, making the key permanent and relying entirely on the background reset task to clear it.

Added ex=fallback_duration_secs to the SET in _activate(). The key now self-expires after exactly the configured cooldown, measured from activation time.

    await fb.record_failure(error_msg=str(primary_err)[:200], context="init")
else:
    fire_and_forget(
        send_templated_alert(

Contributor

Nit: Dead code + misleading alert template

This else branch (fallback disabled) is unreachable. handle_stt_init_failure is only called from create_stt_from_config which already returns early when ENABLE_BB_STT_FALLBACK=False (line 120 of stt/__init__.py). So this branch can never execute.

If it ever did (e.g. future refactor adds a caller), it would fire ALERT_STT_INIT_FALLBACK — the template titled "🚨 {service_name} Init Failed — Using {fallback_name}" — even when fallback is disabled, misleading on-call engineers into thinking a fallback is active when the system is actually in a broken state.

Consider removing the dead branch or adding an assertion: assert fallback_enabled, 'handle_stt_init_failure called with fallback disabled'.

Contributor Author

_activate() was calling redis.set(key, "1", nx=True) with no TTL, making the key permanent and relying
entirely on the background reset task to clear it.

Added ex=fallback_duration_secs to the SET in _activate(). The key now self-expires after exactly the configured cooldown, measured from activation time.

Comment thread app/services/fallback/__init__.py Outdated
redis = await get_redis_service()

count = await redis.incr(self._key_failure_count)
if count == 1:
Contributor

Risk: Non-atomic INCR + EXPIRE race in multi-pod deployment

Two ops without a pipeline:

  1. Pod A: INCR failure_count → 1, then sets EXPIRE 240s
  2. Pod B: INCR failure_count → 2 (threshold hit), calls _activate → deletes failure_count
  3. Pod C: INCR failure_count → 1 (fresh counter, no TTL yet)
  4. Pod A's EXPIRE call now lands on Pod C's key → sets 240s TTL on freshly re-created counter

This is mostly benign (TTL on the new counter is correct behavior), but the counter can survive longer than intended if Pod A's expire lands after a reset cycle clears and recreates the key. More critically: if threshold=2 and 10 pods all fail simultaneously, all 10 call _activate — only 1 succeeds (NX guard) but 10 alert callbacks fire. At scale, alert flood.

Fix: use SET key 1 NX EX window_secs + INCR pattern, or Lua script to atomically incr-and-expire.
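
One way to make the increment and the expiry a single step is a short Lua script run through `EVAL`. This is a sketch, assuming a redis-py style client that exposes `eval(script, numkeys, *keys_and_args)`; `incr_with_expire`, `RecordingClient`, and the key name in the usage note are illustrative, not the PR's code.

```python
import asyncio

# INCR the counter and attach the window TTL only when the key was just created.
LUA_INCR_WITH_EXPIRE = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""


async def incr_with_expire(client, key: str, window_secs: int) -> int:
    """Run INCR plus the conditional EXPIRE as one server-side script."""
    return int(await client.eval(LUA_INCR_WITH_EXPIRE, 1, key, window_secs))


class RecordingClient:
    """Hypothetical stub that records EVAL calls; it does not execute Lua."""

    def __init__(self) -> None:
        self.calls = []

    async def eval(self, script, numkeys, *keys_and_args):
        self.calls.append((script, numkeys, keys_and_args))
        return len(self.calls)  # pretend each call bumps the counter by one
```

With a real client, `await incr_with_expire(redis, "fallback:stt:failure_count", 240)` would return the new count, and no other pod can observe the counter between the `INCR` and the `EXPIRE`.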

Contributor Author

Replaced the two-step non-atomic calls with a Lua script (_LUA_INCR_WITH_EXPIRE) that executes INCR
and EXPIRE as a single indivisible Redis operation. Added a non-atomic fallback if run_script returns None (Redis EVAL
failure) so failures are never silently dropped.

Added a fallback:{service}:alerted:{count} NX key before firing on_failure_alert. Only the first pod to
reach count=N wins the NX set and fires the Slack alert — all other pods skip it. TTL is set to failure_window_secs so
the dedup keys clean up with the counter. The trip alert (on_trip_alert) was already safe — it was protected by the NX
guard in _activate().
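
The per-count dedup key described above can be sketched roughly as follows. The key pattern `fallback:{service}:alerted:{count}` comes from the comment; `NxStore`, `should_alert`, and `simulate_pods` are hypothetical names, and the stub records the TTL without enforcing it.

```python
import asyncio


class NxStore:
    """Hypothetical stub: SET with NX semantics, TTL recorded but not enforced."""

    def __init__(self) -> None:
        self.keys = {}

    async def set(self, key, value, nx=False, ex=None):
        if nx and key in self.keys:
            return None  # another pod already claimed this marker
        self.keys[key] = (value, ex)
        return True


async def should_alert(redis, service: str, count: int, window_secs: int) -> bool:
    """Only the first pod to claim the (service, count) marker sends the alert."""
    key = f"fallback:{service}:alerted:{count}"
    return bool(await redis.set(key, "1", nx=True, ex=window_secs))


async def simulate_pods(pods: int) -> int:
    """Count how many of N concurrent pods would fire the alert for count=2."""
    store = NxStore()
    results = await asyncio.gather(
        *(should_alert(store, "stt", 2, 240) for _ in range(pods))
    )
    return sum(results)
```

In this model, ten pods racing on the same count yield a single alert, which matches the behaviour the author describes.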

@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch 2 times, most recently from 5632393 to dac45f1 Compare April 22, 2026 10:58
@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch from dac45f1 to da74118 Compare April 23, 2026 06:10
@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch from da74118 to c309fe9 Compare April 23, 2026 06:36
@swaroopvarma1
Collaborator

Changes requested

The undefined variable in the exception handler will mask production failures with NameErrors, while hardcoded provider checks and breaking API changes undermine the fallback system's reliability. Multiple high-severity issues span correctness, security, and operational safety.

  • Critical: Undefined variable primary_ in exception handler masks root cause failures with NameError, breaking debugging and monitoring. app/ai/voice/agents/breeze_buddy/stt/fallback.py:365-365
  • Circuit breaker failure recording is hardcoded to Soniox only; failures from Deepgram/Google/Sarvam won't trigger fallback protection. app/ai/voice/agents/breeze_buddy/agent/__init__.py:628-628
  • Breaking API change: create_services now returns STTServiceResult wrapper instead of service object; existing callers will fail with AttributeError. app/ai/voice/agents/breeze_buddy/agent/pipeline.py:115-115
  • Format string injection in Slack alerts: .format(**kwargs) on JSON error responses causes KeyError and potential info leakage. app/ai/voice/agents/breeze_buddy/stt/fallback.py:168-168
  • Redis fallback creates permanent keys without TTL; crashes between INCR and cleanup cause permanent fallback activation after restarts. app/services/fallback/__init__.py:55-55
  • Bare except: clause catches SystemExit/KeyboardInterrupt, preventing graceful shutdown during deploys/scaling. app/ai/voice/agents/breeze_buddy/agent/__init__.py:665-665

@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch from c309fe9 to 50f616a Compare April 27, 2026 06:21
@Devansh-1218
Contributor Author

Changes requested

The undefined variable in the exception handler will mask production failures with NameErrors, while hardcoded provider checks and breaking API changes undermine the fallback system's reliability. Multiple high-severity issues span correctness, security, and operational safety.

  • Critical: Undefined variable primary_ in exception handler masks root cause failures with NameError, breaking debugging and monitoring. app/ai/voice/agents/breeze_buddy/stt/fallback.py:365-365
  • Circuit breaker failure recording is hardcoded to Soniox only; failures from Deepgram/Google/Sarvam won't trigger fallback protection. app/ai/voice/agents/breeze_buddy/agent/__init__.py:628-628
  • Breaking API change: create_services now returns STTServiceResult wrapper instead of service object; existing callers will fail with AttributeError. app/ai/voice/agents/breeze_buddy/agent/pipeline.py:115-115
  • Format string injection in Slack alerts: .format(**kwargs) on JSON error responses causes KeyError and potential info leakage. app/ai/voice/agents/breeze_buddy/stt/fallback.py:168-168
  • Redis fallback creates permanent keys without TTL; crashes between INCR and cleanup cause permanent fallback activation after restarts. app/services/fallback/__init__.py:55-55
  • Bare except: clause catches SystemExit/KeyboardInterrupt, preventing graceful shutdown during deploys/scaling. app/ai/voice/agents/breeze_buddy/agent/__init__.py:665-665
  • For phase 1, the fallback circuit breaker is intentionally scoped to Soniox failures only. When fallback is active (e.g., Deepgram is in use), errors from the fallback provider are logged and alerted but do not trigger further fallback escalation or chaining. This design is intentional for operational safety and simplicity in the initial rollout. We can revisit multi-level fallback or broader error tracking in a future phase if needed.

  • Undefined variable primary_ in exception handler
    Response:
    Thanks for flagging this. This appears to be from an older revision. In the current branch, the handler consistently uses primary_provider and primary_err, and re-raises with the original exception chain at fallback.py:370. I do not see an undefined primary_ symbol in the current code path.

  • Breaking API change in create_services return type
    Response:
    Good callout. In the current branch, callers were updated to consume the STT result wrapper. The active call site destructures stt_result and uses stt_result.service/provider at __init__.py:864 and __init__.py:870. So no in-repo AttributeError path remains.

  • Format string injection in Slack templates
    Response:
    I agree we should keep this robust, but this is not a direct format-string injection from user-controlled templates in the current flow. Template formatting is applied on static in-repo templates and KeyError is already handled in fallback.py:157. If useful, I can still harden this further for ValueError/IndexError resilience.

  • Redis fallback keys permanent without TTL
    Response:
    This was true in an earlier version but is fixed now. The active fallback key is written with NX + EX (duration TTL) at __init__.py:170, and failure counter increment/expiry is handled atomically via Lua in __init__.py:36. This prevents stuck fallback due to INCR/EXPIRE race.

  • Bare except prevents graceful shutdown
    Response:
    I cannot find a bare except at the cited location in the current branch. The relevant block currently uses except Exception around EndFrame queueing in __init__.py:665, which does not catch SystemExit/KeyboardInterrupt.

@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch from 50f616a to 48c73ec Compare April 27, 2026 10:08
Comment thread app/core/config/dynamic.py Outdated


# --- Breeze Buddy ElevenLabs TTS Configuration ---
async def BB_ELEVENLABS_VOICE_ID() -> str:
Collaborator

Are these required in this PR?

Comment thread app/core/config/dynamic.py Outdated
return await get_config("ENABLE_BB_STT_FALLBACK", False, bool)


async def BB_STT_FALLBACK_PROVIDER() -> str:
Collaborator

What about merchants who use Deepgram as their STT provider when Deepgram fails?

Contributor Author

@Devansh-1218 Devansh-1218 Apr 29, 2026

For Phase 1, we were asked to enable fallback only for Soniox.

@swaroopvarma1 swaroopvarma1 requested a review from Copilot April 27, 2026 10:48
@swaroopvarma1
Collaborator

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (2)
app/ai/voice/agents/breeze_buddy/stt/fallback.py (1)

304-309: Leaking ServiceFallback's private _key_notified across module boundaries.

notify_fallback_active reaches into fb._key_notified, which couples this caller to ServiceFallback's internal Redis layout. If the generic class ever renames or restructures its keys (it owns this namespace), this call site silently breaks. Add a small public method on ServiceFallback (e.g. try_notify_once(ttl_secs) -> bool) and consume that instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 304 - 309,
notify_fallback_active currently accesses ServiceFallback's private attribute
fb._key_notified, leaking internal Redis key layout; add a public method on
ServiceFallback (e.g. try_notify_once(ttl_secs) -> bool) that performs the NX/EX
Redis set and returns whether the notify succeeded, then update
notify_fallback_active to call
fb.try_notify_once(fb.config.fallback_duration_secs) instead of touching
fb._key_notified directly so the implementation detail stays encapsulated.
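
A rough sketch of the suggested public method, assuming the class holds an async Redis client whose `set` accepts `nx` and `ex`. `ServiceFallbackSketch`, `StubRedis`, `demo`, and the key layout are hypothetical and only illustrate keeping `_key_notified` private.

```python
import asyncio


class ServiceFallbackSketch:
    """Illustrative subset of a fallback class; not the PR's ServiceFallback."""

    def __init__(self, redis, service: str) -> None:
        self._redis = redis
        self._key_notified = f"fallback:{service}:notified"  # stays private

    async def try_notify_once(self, ttl_secs: int) -> bool:
        """Claim the notify slot; True means this caller should send the alert."""
        claimed = await self._redis.set(self._key_notified, "1", nx=True, ex=ttl_secs)
        return bool(claimed)


class StubRedis:
    """Hypothetical stub implementing only SET NX (TTL is ignored)."""

    def __init__(self) -> None:
        self.keys = set()

    async def set(self, key, value, nx=False, ex=None):
        if nx and key in self.keys:
            return None
        self.keys.add(key)
        return True


async def demo() -> list:
    fb = ServiceFallbackSketch(StubRedis(), "stt")
    # The second call loses the NX race and should skip the notification.
    return [await fb.try_notify_once(300), await fb.try_notify_once(300)]
```

Callers such as `notify_fallback_active` would then depend only on the boolean result, so the Redis key layout can change without touching them.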
app/ai/voice/agents/breeze_buddy/agent/__init__.py (1)

870-870: Remove dead commented-out call.

# stt, llm, tts = await create_services(self.configurations) is stale after the STTServiceResult refactor and adds noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` at line 870, Remove the
dead commented-out call to create_services (the line "# stt, llm, tts = await
create_services(self.configurations)") since it is stale after the
STTServiceResult refactor; delete that commented line from the agent __init__
code (referencing create_services and STTServiceResult) to reduce noise and keep
the codebase clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 625-635: The substring check that builds processor_str and
stt_keywords leads to false positives (e.g., "google" matching non‑STT Google
processors) and misses "openai"; replace it by determining STT identity from the
actual STT service instance instead of the error's processor string — use
self._stt_service (or a provider/name attribute on that instance) to classify
STT errors, falling back to a tighter substring match only if the service
instance is unavailable (e.g., require tokens like "_stt" or "google_stt" and
include "openai" in stt_keywords); update the logic that sets is_stt_error (and
references to processor_str and stt_keywords) to use this service-based check so
only true STT providers trigger STT fallback recording, alerting, and
EndFrame().
- Around line 656-659: The current try/except around
task.queue_frames([EndFrame()]) swallows all exceptions; change it to catch
Exception as e and log the error (e.g., using logger.exception or an appropriate
module/class logger) including the exception details and context (mentioning
task and EndFrame) before optionally re-raising or handling; update the block
that calls task.queue_frames in the agent/__init__.py (the
task.queue_frames([EndFrame()]) site) to ensure exceptions are recorded for
post-mortems rather than silently passed.
- Around line 642-652: The code currently gates recording STT failures on
self.stt_provider == "soniox", which prevents fallback tracking for other
providers; remove that provider-specific check and always attempt to call
record_stt_failure once per call (i.e., keep the existing _stt_failure_recorded
boolean guard and the try/except), so replace the if block that checks
self.stt_provider with a simple "if not self._stt_failure_recorded" branch that
sets self._stt_failure_recorded = True and calls await
record_stt_failure(error_msg=str(error_msg)[:200], call_sid=self.call_sid or "",
context="mid-call") inside the existing try/except (record_stt_failure itself
respects ENABLE_BB_STT_FALLBACK), ensuring service-agnostic fallback recording.

In `@app/ai/voice/agents/breeze_buddy/stt/__init__.py`:
- Around line 228-241: The proactive fallback build (when provider_name !=
fallback_provider) must be wrapped in try/except so a misconfigured fallback
doesn't abort create_stt_from_config; around the await
_build_stt_provider(fallback_config) call catch Exception as err, call
handle_stt_init_failure(provider=fallback_provider, err=err) (or equivalent
error/alerting path) and log/notify the failure (e.g., via
notify_fallback_active or Slack alert), then do NOT return a fallback
STTServiceResult so execution falls through to the primary provider
initialization; keep the successful path (notify_fallback_active + return
STTServiceResult(provider=fallback_provider, service=service)) unchanged but
guarded by the try block.

In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py`:
- Around line 34-37: Replace the hardcoded, misspelled _FALLBACK_TAG value and
make the fallback Slack tag operator-configurable: remove the literal
"@breeze-sentinals" and instead derive the fallback from configuration (e.g.,
use SLACK_TAG_USERS or a new config variable like DEFAULT_FALLBACK_TAG set to
the correct "@breeze-sentinels"), then update STT_FALLBACK_SLACK_TAG to combine
the configured fallback and SLACK_TAG_USERS (falling back to the configured
default when SLACK_TAG_USERS is empty) so alerts use the correct, tunable Slack
user-group handle; update references to _FALLBACK_TAG and STT_FALLBACK_SLACK_TAG
accordingly.
- Around line 159-174: The template filler _fill_template (and its inner _fmt)
only catches KeyError so malformed braces or positional format tokens in
upstream error messages can raise ValueError/IndexError and crash the alert
path; update _fill_template/_fmt to either (a) use a safe formatting approach
such as string.Template or str.format_map with a SafeDict to silently ignore
unknown placeholders, or (b) broaden the exception handling to catch ValueError
and IndexError (and/or Exception) and fall back to returning the original
string, and before formatting pre-escape or sanitize free-form fields like
error_msg to replace or double any stray braces; ensure send_templated_alert
continues to receive a safely formatted dict from _fill_template.

In `@app/services/fallback/__init__.py`:
- Around line 251-270: check_and_reset_stt_fallback rarely observes the
ephemeral Redis _key_active because _activate() sets the key with
EX=fallback_duration_secs and the scheduled task ticks on fixed intervals;
change the flow so the reset alert reliably fires by having
ServiceFallback._activate() also record a durable activation marker (e.g., set
an "stt_activated_at" timestamp or a "stt_pending_reset_alert" key with TTL
longer than fallback_duration_secs) and then update check_and_reset_stt_fallback
to look for that marker: if the active key no longer exists but the activation
marker indicates the fallback just expired (timestamp older than
fallback_duration_secs or presence of pending_reset_alert), call
fallback.reset_to_primary() and emit the alert, then remove the activation
marker; keep use of ServiceFallback.is_active(), ServiceFallback._activate(),
ServiceFallback.reset_to_primary(), and the check_and_reset_stt_fallback
function names to locate and modify the logic.
- Around line 110-113: When run_script returns None and the fallback branch uses
await redis.incr(self._key_failure_count), it re-introduces the
INCR-without-EXPIRE race by never assigning a TTL; change the fallback to set an
expiry as part of the fallback path (e.g., use INCR followed by an EXPIRE when
the increment returns 1 or use a single Redis command that sets TTL) or
alternatively raise/propagate an error instead of silently falling back; update
the code surrounding run_script, redis.incr and self._key_failure_count so the
fallback always ensures a bounded TTL on the failure counter (or fails hard) to
avoid persistent stale counters.

---

Nitpick comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Line 870: Remove the dead commented-out call to create_services (the line "#
stt, llm, tts = await create_services(self.configurations)") since it is stale
after the STTServiceResult refactor; delete that commented line from the agent
__init__ code (referencing create_services and STTServiceResult) to reduce noise
and keep the codebase clean.

In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py`:
- Around line 304-309: notify_fallback_active currently accesses
ServiceFallback's private attribute fb._key_notified, leaking internal Redis key
layout; add a public method on ServiceFallback (e.g. try_notify_once(ttl_secs)
-> bool) that performs the NX/EX Redis set and returns whether the notify
succeeded, then update notify_fallback_active to call
fb.try_notify_once(fb.config.fallback_duration_secs) instead of touching
fb._key_notified directly so the implementation detail stays encapsulated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b9fa66fe-5aef-4bd0-bfed-ba26ad60c4c1

📥 Commits

Reviewing files that changed from the base of the PR and between 0d6ef1c and 48c73ec.

📒 Files selected for processing (10)
  • app/ai/voice/agents/breeze_buddy/agent/__init__.py
  • app/ai/voice/agents/breeze_buddy/agent/pipeline.py
  • app/ai/voice/agents/breeze_buddy/stt/__init__.py
  • app/ai/voice/agents/breeze_buddy/stt/fallback.py
  • app/ai/voice/agents/breeze_buddy/utils/common.py
  • app/ai/voice/stt/soniox/config.py
  • app/core/config/dynamic.py
  • app/main.py
  • app/services/fallback/__init__.py
  • app/services/slack/alert.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • app/ai/voice/stt/soniox/config.py
  • app/ai/voice/agents/breeze_buddy/agent/pipeline.py

Comment on lines +625 to +635
# Detect STT errors by processor name keywords
processor_str = str(processor).lower()
stt_keywords = (
    "stt",
    "soniox",
    "deepgram",
    "transcri",
    "google",
    "sarvam",
)
is_stt_error = any(kw in processor_str for kw in stt_keywords)

⚠️ Potential issue | 🟡 Minor

Substring classification has false positives and a missing provider.

"google" matches any google_* processor (Google LLM/TTS/Vertex) — not just STT — so non-STT errors will be misclassified and trigger STT fallback recording + alert + EndFrame(). Conversely, "openai" is missing despite OpenAI being a supported STT provider per the learning above.

Prefer matching on self._stt_service's class identity (or processor name from the service instance) instead of substring sniffing the error's processor attribute. At minimum, anchor on more specific tokens (e.g. "_stt", "google_stt") and add "openai" if you keep this approach.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 625 - 635,
The substring check that builds processor_str and stt_keywords leads to false
positives (e.g., "google" matching non‑STT Google processors) and misses
"openai"; replace it by determining STT identity from the actual STT service
instance instead of the error's processor string — use self._stt_service (or a
provider/name attribute on that instance) to classify STT errors, falling back
to a tighter substring match only if the service instance is unavailable (e.g.,
require tokens like "_stt" or "google_stt" and include "openai" in
stt_keywords); update the logic that sets is_stt_error (and references to
processor_str and stt_keywords) to use this service-based check so only true STT
providers trigger STT fallback recording, alerting, and EndFrame().
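
A small sketch of identity-based classification, as the comment suggests. `FakeProcessor` and the `is_stt_error` signature are hypothetical, and the `"sttservice"` token in the fallback path is an assumption about how processor names are rendered; adjust it to the real naming scheme.

```python
class FakeProcessor:
    """Hypothetical stand-in for a pipeline processor with a printable name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __str__(self) -> str:
        return self.name


def is_stt_error(processor, stt_service) -> bool:
    """Classify by object identity first; use a tight name match as a last resort."""
    if stt_service is not None:
        # The error belongs to STT only if it came from the active STT instance.
        return processor is stt_service
    # Assumed naming convention: STT processors contain "STTService" in their name.
    return "sttservice" in str(processor).lower()


stt = FakeProcessor("SonioxSTTService#0")
llm = FakeProcessor("GoogleLLMService#0")
```

With the identity check, an error raised by a Google LLM or TTS processor is no longer counted as an STT failure, regardless of what its name contains.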

Comment on lines +642 to +652
# Record failure in fallback system (once per call, Soniox only)
if self.stt_provider == "soniox" and not self._stt_failure_recorded:
    self._stt_failure_recorded = True
    try:
        await record_stt_failure(
            error_msg=str(error_msg)[:200],
            call_sid=self.call_sid or "",
            context="mid-call",
        )
    except Exception as fb_err:
        logger.warning(f"STT fallback record_failure failed: {fb_err}")

⚠️ Potential issue | 🔴 Critical

Hardcoded "soniox" disables fallback for every other supported primary.

BB_STT_SERVICE is dynamic and supports soniox, deepgram, sarvam, openai, google (see app/ai/voice/agents/breeze_buddy/stt/__init__.py:286-294). Gating record_stt_failure(...) on self.stt_provider == "soniox" means a merchant whose primary is anything else gets zero failure tracking, the breaker never trips, and fallback is silently inert — exactly the case swaroopvarma1 flagged.

Drop the provider gate. record_stt_failure(...) already checks ENABLE_BB_STT_FALLBACK() internally and the generic ServiceFallback is provider-agnostic.

Based on learnings: "Applies to app/ai/voice/agents/breeze_buddy/stt/**/*.py : STT providers must support native endpoint detection or SmartTurn; Soniox is default with optional Deepgram, Sarvam, OpenAI, Google".

🐛 Suggested fix
-            # Record failure in fallback system (once per call, Soniox only)
-            if self.stt_provider == "soniox" and not self._stt_failure_recorded:
+            # Record failure in fallback system (once per call, any primary)
+            if self.stt_provider and not self._stt_failure_recorded:
                 self._stt_failure_recorded = True
                 try:
                     await record_stt_failure(
                         error_msg=str(error_msg)[:200],
                         call_sid=self.call_sid or "",
                         context="mid-call",
                     )
                 except Exception as fb_err:
                     logger.warning(f"STT fallback record_failure failed: {fb_err}")
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 651-651: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 642 - 652,
The code currently gates recording STT failures on self.stt_provider ==
"soniox", which prevents fallback tracking for other providers; remove that
provider-specific check and always attempt to call record_stt_failure once per
call (i.e., keep the existing _stt_failure_recorded boolean guard and the
try/except), so replace the if block that checks self.stt_provider with a simple
"if not self._stt_failure_recorded" branch that sets self._stt_failure_recorded
= True and calls await record_stt_failure(error_msg=str(error_msg)[:200],
call_sid=self.call_sid or "", context="mid-call") inside the existing try/except
(record_stt_failure itself respects ENABLE_BB_STT_FALLBACK), ensuring
service-agnostic fallback recording.

Comment on lines +656 to +659
try:
    await task.queue_frames([EndFrame()])
except Exception:
    pass

⚠️ Potential issue | 🟡 Minor

Swallowing EndFrame queue errors silently.

If task.queue_frames([EndFrame()]) raises, the call may hang in an undefined state and there will be no breadcrumb in logs (Ruff S110). Log the exception so post-mortems are possible.

🛡️ Suggested fix
             # Alert and end call — no mid-call swap in Phase 1
             fire_and_forget(self._send_mid_call_stt_alert())
             try:
                 await task.queue_frames([EndFrame()])
-            except Exception:
-                pass
+            except Exception as end_err:
+                logger.warning(f"Failed to queue EndFrame after STT error: {end_err}")
🧰 Tools
🪛 Ruff (0.15.12)

[error] 658-659: try-except-pass detected, consider logging the exception

(S110)


[warning] 658-658: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 656 - 659,
The current try/except around task.queue_frames([EndFrame()]) swallows all
exceptions; change it to catch Exception as e and log the error (e.g., using
logger.exception or an appropriate module/class logger) including the exception
details and context (mentioning task and EndFrame) before optionally re-raising
or handling; update the block that calls task.queue_frames in the
agent/__init__.py (the task.queue_frames([EndFrame()]) site) to ensure
exceptions are recorded for post-mortems rather than silently passed.

Comment on lines +228 to +241
if provider_name != fallback_provider:
    fb = await get_stt_fallback()
    if await fb.is_active():
        logger.info(
            f"STT fallback active — using {fallback_provider} "
            f"instead of {provider_name}"
        )
        fallback_config = STTConfiguration(
            provider=STTProvider(fallback_provider),
            language=config.language,
        )
        service = await _build_stt_provider(fallback_config)
        fire_and_forget(notify_fallback_active(fallback_provider))
        return STTServiceResult(provider=fallback_provider, service=service)

⚠️ Potential issue | 🟡 Minor

Proactive fallback route has no error handling — misconfigured fallback provider kills the call.

When the fallback is already active and the configured BB_STT_FALLBACK_PROVIDER is missing its API key (e.g., set to deepgram while DEEPGRAM_API_KEY is empty), _build_stt_provider(fallback_config) will raise ValueError, propagating out of create_stt_from_config and aborting setup. Unlike the primary-build path below (which routes through handle_stt_init_failure), there is no catch here.

Consider wrapping this build in try/except and degrading to the primary provider (with a loud warning/Slack alert) if the fallback build fails — the primary may still partially work, and aborting the call is worse than ignoring an unusable fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/stt/__init__.py` around lines 228 - 241, The
proactive fallback build (when provider_name != fallback_provider) must be
wrapped in try/except so a misconfigured fallback doesn't abort
create_stt_from_config; around the await _build_stt_provider(fallback_config)
call catch Exception as err, call
handle_stt_init_failure(provider=fallback_provider, err=err) (or equivalent
error/alerting path) and log/notify the failure (e.g., via
notify_fallback_active or Slack alert), then do NOT return a fallback
STTServiceResult so execution falls through to the primary provider
initialization; keep the successful path (notify_fallback_active + return
STTServiceResult(provider=fallback_provider, service=service)) unchanged but
guarded by the try block.
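
The guarded fallback route could look roughly like this. It is a sketch under assumptions: `build_with_fallback_guard` and `demo_build` are hypothetical, and the `build` callable stands in for `_build_stt_provider`, whose real signature takes a configuration object rather than a provider name.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple

logger = logging.getLogger(__name__)


async def build_with_fallback_guard(
    primary: str,
    fallback: str,
    fallback_active: bool,
    build: Callable[[str], Awaitable[object]],
) -> Tuple[str, object]:
    """Try the fallback provider when active; degrade to primary if its build fails."""
    if fallback_active and primary != fallback:
        try:
            return fallback, await build(fallback)
        except Exception as err:  # a broken fallback should not abort call setup
            logger.warning("Fallback %s failed to build: %s", fallback, err)
    return primary, await build(primary)


async def demo_build(provider: str) -> object:
    """Hypothetical builder that simulates a missing Deepgram API key."""
    if provider == "deepgram":
        raise ValueError("DEEPGRAM_API_KEY is empty")
    return f"{provider}-service"
```

In the real code the `except` branch is also where an alert would be sent, so an unusable fallback is visible to on-call rather than silently skipped.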

Comment on lines +34 to +37
_FALLBACK_TAG = "@breeze-sentinals"
STT_FALLBACK_SLACK_TAG = (
    f"{_FALLBACK_TAG},{SLACK_TAG_USERS}" if SLACK_TAG_USERS else _FALLBACK_TAG
)

⚠️ Potential issue | 🟡 Minor

Hardcoded, misspelled Slack group.

@breeze-sentinals should be @breeze-sentinels (or whatever the actual Slack user-group handle is). Today this prefix is hardcoded into every STT alert — if Slack rejects the unknown handle, on-call won't be paged. Move this to dynamic config (or SLACK_TAG_USERS) so it's both correct and operator-tunable without a redeploy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 34 - 37,
Replace the hardcoded, misspelled _FALLBACK_TAG value and make the fallback
Slack tag operator-configurable: remove the literal "@breeze-sentinals" and
instead derive the fallback from configuration (e.g., use SLACK_TAG_USERS or a
new config variable like DEFAULT_FALLBACK_TAG set to the correct
"@breeze-sentinels"), then update STT_FALLBACK_SLACK_TAG to combine the
configured fallback and SLACK_TAG_USERS (falling back to the configured default
when SLACK_TAG_USERS is empty) so alerts use the correct, tunable Slack
user-group handle; update references to _FALLBACK_TAG and STT_FALLBACK_SLACK_TAG
accordingly.
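
As a sketch of the configurable variant, the tag can be assembled from values read at runtime instead of a module-level literal. `build_slack_tag` is a hypothetical helper, and the handles used in the usage note are placeholders, since the correct Slack group name is not confirmed in this thread.

```python
def build_slack_tag(default_tag: str, extra_users: str = "") -> str:
    """Join the configured group handle with optional extra users, skipping blanks."""
    parts = [p.strip() for p in (default_tag, extra_users) if p and p.strip()]
    return ",".join(parts)
```

For example, `build_slack_tag("@breeze-sentinels", "@alice,@bob")` gives `"@breeze-sentinels,@alice,@bob"`, and an empty default falls back to the extra users alone.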

Comment on lines +159 to +174
def _fill_template(template: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Recursively applies .format(**kwargs) to all string values."""

    def _fmt(val: Any) -> Any:
        if isinstance(val, str):
            try:
                return val.format(**kwargs)
            except KeyError:
                return val
        if isinstance(val, dict):
            return {k: _fmt(v) for k, v in val.items()}
        if isinstance(val, list):
            return [_fmt(item) for item in val]
        return val

    return _fmt(template)

⚠️ Potential issue | 🟡 Minor

val.format() only catches KeyError — malformed braces in upstream errors will still crash the alert.

str.format raises ValueError for unmatched {/} and IndexError for positional refs like {0}. STT/SDK error messages routinely contain JSON-ish snippets (e.g. "... { 'code': 503 ..."), so a single stray brace from a provider exception will bubble out of _fmt → send_templated_alert → caller, breaking exactly the alert path that should be most reliable.

Broaden the catch and pre-escape error_msg (and any other free-form fields). Even better, switch to string.Template or str.format_map(SafeDict) for the entire template-fill path so unknown placeholders degrade silently and untrusted text cannot break formatting.

🛡️ Minimal hardening
     def _fmt(val: Any) -> Any:
         if isinstance(val, str):
             try:
                 return val.format(**kwargs)
-            except KeyError:
+            except (KeyError, IndexError, ValueError):
                 return val

Based on past PR feedback from swaroopvarma1 flagging format-string fragility for unsanitized error messages.
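
A minimal sketch of the `format_map` variant mentioned above, assuming a hypothetical helper name `fill_template_safe` and a hypothetical `_KeepMissing` mapping (neither is part of this PR). Unknown placeholders are left as written, and a malformed template string is returned unchanged instead of raising:

```python
from typing import Any, Dict


class _KeepMissing(dict):
    """Hypothetical mapping: renders unknown placeholders back as '{name}'."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def fill_template_safe(template: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Recursively fill string values without letting formatting errors escape."""
    values = _KeepMissing(kwargs)

    def _fmt(val: Any) -> Any:
        if isinstance(val, str):
            try:
                return val.format_map(values)
            except (KeyError, IndexError, ValueError, AttributeError):
                # Malformed template (stray brace, positional field, bad
                # attribute lookup): keep the raw string so the alert still sends.
                return val
        if isinstance(val, dict):
            return {k: _fmt(v) for k, v in val.items()}
        if isinstance(val, list):
            return [_fmt(item) for item in val]
        return val

    return _fmt(template)


alert = fill_template_safe(
    {"text": "{service_name} failed: {error_msg}", "blocks": ["{tag_users}"]},
    service_name="Soniox",
    error_msg="{ 'code': 503",
)
print(alert)
```

`str.format` and `str.format_map` do not re-scan substituted values, so braces inside `error_msg` itself pass through verbatim; the broadened `except` mainly guards against malformed template strings.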

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 159 - 174, The
template filler _fill_template (and its inner _fmt) only catches KeyError so
malformed braces or positional format tokens in upstream error messages can
raise ValueError/IndexError and crash the alert path; update _fill_template/_fmt
to either (a) use a safe formatting approach such as string.Template or
str.format_map with a SafeDict to silently ignore unknown placeholders, or (b)
broaden the exception handling to catch ValueError and IndexError (and/or
Exception) and fall back to returning the original string, and before formatting
pre-escape or sanitize free-form fields like error_msg to replace or double any
stray braces; ensure send_templated_alert continues to receive a safely
formatted dict from _fill_template.

Comment on lines +110 to +113
if count is None:
    # Lua script failed (logged inside run_script); fall back to
    # non-atomic path so failures are never silently swallowed.
    count = await redis.incr(self._key_failure_count)

⚠️ Potential issue | 🟡 Minor

Non-atomic Lua-failure fallback re-introduces the INCR-without-EXPIRE race.

When run_script returns None, the code falls back to a plain INCR with no companion EXPIRE. The very race the Lua script was added to eliminate (counter persisting without TTL) re-emerges in this branch — and because the key never gets a TTL set later, a single Lua failure on the first increment can leave the counter alive indefinitely, eventually tripping fallback on stale failure history.

Either set TTL explicitly here (still racy but bounded), or treat Lua failure as a hard error rather than silently using a degraded code path.

🛡️ Suggested fix
             if count is None:
-                # Lua script failed (logged inside run_script); fall back to
-                # non-atomic path so failures are never silently swallowed.
-                count = await redis.incr(self._key_failure_count)
+                # Lua script failed (logged inside run_script); fall back to
+                # non-atomic INCR+EXPIRE so failures are never silently swallowed.
+                count = await redis.incr(self._key_failure_count)
+                if count == 1:
+                    await redis.expire(
+                        self._key_failure_count, self.config.failure_window_secs
+                    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/services/fallback/__init__.py` around lines 110 - 113, When run_script
returns None and the fallback branch uses await
redis.incr(self._key_failure_count), it re-introduces the INCR-without-EXPIRE
race by never assigning a TTL; change the fallback to set an expiry as part of
the fallback path (e.g., use INCR followed by an EXPIRE when the increment
returns 1 or use a single Redis command that sets TTL) or alternatively
raise/propagate an error instead of silently falling back; update the code
surrounding run_script, redis.incr and self._key_failure_count so the fallback
always ensures a bounded TTL on the failure counter (or fails hard) to avoid
persistent stale counters.
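
A rough sketch of the bounded-TTL fallback path, under stated assumptions: `incr_with_bounded_ttl` is a hypothetical helper name, and `InMemoryRedis` is a demo-only stand-in for an async Redis client exposing `incr`, `expire`, and `ttl`. Beyond the suggested `count == 1` check, it also re-applies the expiry when `TTL` reports `-1` (key exists without expiry), so a counter that previously lost its TTL heals on the next failure:

```python
import asyncio
from typing import Dict, Optional


class InMemoryRedis:
    """Tiny stand-in for an async Redis client (demo only, not real Redis)."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}
        self.ttls: Dict[str, Optional[int]] = {}

    async def incr(self, key: str) -> int:
        self.values[key] = self.values.get(key, 0) + 1
        self.ttls.setdefault(key, None)
        return self.values[key]

    async def expire(self, key: str, seconds: int) -> bool:
        if key not in self.values:
            return False
        self.ttls[key] = seconds
        return True

    async def ttl(self, key: str) -> int:
        if key not in self.values:
            return -2  # Redis convention: key does not exist
        ttl = self.ttls[key]
        return -1 if ttl is None else ttl  # -1: key exists, no expiry


async def incr_with_bounded_ttl(redis, key: str, window_secs: int) -> int:
    """Non-atomic INCR that still guarantees the counter carries a TTL."""
    count = await redis.incr(key)
    # First increment, or a key left without expiry by an earlier failure.
    if count == 1 or await redis.ttl(key) == -1:
        await redis.expire(key, window_secs)
    return count


async def _demo() -> None:
    redis = InMemoryRedis()
    print(await incr_with_bounded_ttl(redis, "fallback:stt:failures", 60))
    print(await redis.ttl("fallback:stt:failures"))


asyncio.run(_demo())
```

This remains racy between `INCR` and `EXPIRE`, as noted above; it only bounds how long a stale counter can survive.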

Comment on lines +251 to +270
async def check_and_reset_stt_fallback() -> None:
    """Check if STT fallback is active and reset to primary if so."""
    try:
        fallback_provider = await BB_STT_FALLBACK_PROVIDER()
        fallback = ServiceFallback(
            ServiceFallbackConfig(
                service_name="stt",
                failure_threshold=await BB_STT_FALLBACK_THRESHOLD(),
                failure_window_secs=await BB_STT_FALLBACK_WINDOW_SECS(),
                fallback_duration_secs=await BB_STT_FALLBACK_DURATION_SECS(),
                fallback_provider_name=fallback_provider,
            )
        )
        if not await fallback.is_active():
            return

        logger.info("STT fallback active — resetting to primary provider")
        await fallback.reset_to_primary()
    except Exception as e:
        logger.error(f"STT fallback reset task failed: {e}")

⚠️ Potential issue | 🟡 Minor

on_reset_alert will rarely fire — _key_active self-expires before this task observes it.

_activate() sets _key_active with EX=fallback_duration_secs, so the key disappears via TTL. Meanwhile this reset task is scheduled with interval_seconds=fallback_duration_secs from server start, not from activation time. With a random activation T, the key expires at T+duration, but the next task tick is at startup + N*duration — only a narrow window where the task sees is_active() == True. In most cycles the TTL cleans up first and reset_to_primary() (and thus the "back to primary" Slack alert) is never invoked, leaving operators without a normal-operation-resumed signal.

Consider driving the reset alert off Redis keyspace expiration notifications, or store an activated_at timestamp and have the task fire the alert when it observes the active key has just expired (e.g., a separate "pending_reset_alert" key with longer TTL).

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 269-269: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/services/fallback/__init__.py` around lines 251 - 270,
check_and_reset_stt_fallback rarely observes the ephemeral Redis _key_active
because _activate() sets the key with EX=fallback_duration_secs and the
scheduled task ticks on fixed intervals; change the flow so the reset alert
reliably fires by having ServiceFallback._activate() also record a durable
activation marker (e.g., set an "stt_activated_at" timestamp or a
"stt_pending_reset_alert" key with TTL longer than fallback_duration_secs) and
then update check_and_reset_stt_fallback to look for that marker: if the active
key no longer exists but the activation marker indicates the fallback just
expired (timestamp older than fallback_duration_secs or presence of
pending_reset_alert), call fallback.reset_to_primary() and emit the alert, then
remove the activation marker; keep use of ServiceFallback.is_active(),
ServiceFallback._activate(), ServiceFallback.reset_to_primary(), and the
check_and_reset_stt_fallback function names to locate and modify the logic.


Copilot AI left a comment


Pull request overview

Adds a Redis-backed fallback/circuit-breaker mechanism and wires it into Breeze Buddy STT so the system can proactively route away from a failing primary provider (e.g., Soniox → Deepgram) and emit Slack alerts around failures and fallback state.

Changes:

  • Introduces a generic ServiceFallback (Redis failure counter + “active” flag) and registers a background task for STT fallback resets.
  • Implements Breeze Buddy STT fallback orchestration + templated Slack alerts, and updates STT creation to return the actual provider used.
  • Extends Slack alert sending to support per-alert tagging overrides; adds new dynamic config keys for STT fallback (and also ElevenLabs TTS).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

Show a summary per file
| File | Description |
| --- | --- |
| app/services/slack/alert.py | Adds tag_users override to control per-alert tagging behavior. |
| app/services/fallback/__init__.py | Adds Redis-backed fallback state machine + registers STT fallback reset task. |
| app/main.py | Registers fallback background tasks during app lifespan startup. |
| app/core/config/dynamic.py | Adds dynamic config for STT fallback (and ElevenLabs TTS settings). |
| app/ai/voice/stt/soniox/config.py | Adds reconnect_on_error option when building Soniox STT. |
| app/ai/voice/agents/breeze_buddy/utils/common.py | Adds fire_and_forget() helper to retain background task refs. |
| app/ai/voice/agents/breeze_buddy/stt/fallback.py | STT-specific fallback orchestration + Slack alert templates/dedup. |
| app/ai/voice/agents/breeze_buddy/stt/__init__.py | Implements proactive routing + init-time fallback; returns STTServiceResult. |
| app/ai/voice/agents/breeze_buddy/agent/pipeline.py | Propagates STTServiceResult through service creation. |
| app/ai/voice/agents/breeze_buddy/agent/__init__.py | Tracks actual STT provider; records Soniox failures on STT pipeline errors and ends call. |

if count is None:
    # Lua script failed (logged inside run_script); fall back to
    # non-atomic path so failures are never silently swallowed.
    count = await redis.incr(self._key_failure_count)
Copilot AI Apr 27, 2026

If the Lua script fails and you fall back to incr(), the failure counter key never gets a TTL. That can make failure_count accumulate indefinitely and trip fallback long after the intended window. Consider setting EXPIRE when count == 1 (or always) on the non-Lua path as well.

Suggested change
-    count = await redis.incr(self._key_failure_count)
+    count = await redis.incr(self._key_failure_count)
+    if count == 1:
+        await redis.expire(
+            self._key_failure_count, self.config.failure_window_secs
+        )

Copilot uses AI. Check for mistakes.
Comment thread app/services/fallback/__init__.py Outdated
Comment on lines +264 to +268
if not await fallback.is_active():
    return

logger.info("STT fallback active — resetting to primary provider")
await fallback.reset_to_primary()
Copilot AI Apr 27, 2026

check_and_reset_stt_fallback() resets to primary whenever the fallback flag exists, regardless of how much TTL remains. Because the scheduler runs on fixed intervals unrelated to activation time, this can deactivate fallback early (even almost immediately after activation). The reset task should only reset after the active key’s TTL has expired (or be redesigned to avoid deleting the key at all).


tag_users: Optional[str] = filled.get("tag_users")
if tag_users and "{" in tag_users:
    tag_users = None
Copilot AI Apr 27, 2026

When tag_users is left as an unfilled placeholder (e.g. "{tag_users}"), this code sets tag_users=None, which causes slack_alert.send() to fall back to global SLACK_TAG_USERS and may unexpectedly tag people. Consider overriding with an empty string (or include_tags=False) to avoid accidental tagging when template placeholders aren’t populated.

Suggested change
-    tag_users = None
+    tag_users = ""

Comment on lines +303 to +306
fb = await get_stt_fallback()
notified = await redis.set(
    fb._key_notified,
    "1",
Copilot AI Apr 27, 2026

notify_fallback_active() reaches into fb._key_notified (a private attribute). This couples the STT layer to ServiceFallback internals and makes refactors risky. Consider exposing a small public method/property on ServiceFallback for activation notification dedup (or a mark_notified() helper) instead.

Comment on lines +235 to +237
fallback_config = STTConfiguration(
    provider=STTProvider(fallback_provider),
    language=config.language,
Copilot AI Apr 27, 2026

STTProvider(fallback_provider) will raise ValueError if BB_STT_FALLBACK_PROVIDER is misconfigured (typo/unsupported provider), which can mask the original STT init error and prevent service creation. Consider validating the dynamic config value and falling back to a safe default (or disabling fallback) with a clear log/alert when the provider string is invalid.
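
A small sketch of validating the dynamic value before building the fallback configuration. The `STTProvider` enum below is a stand-in with assumed members and lowercase values, and `parse_fallback_provider` is a hypothetical helper; the real enum lives in the codebase:

```python
import logging
from enum import Enum
from typing import Optional

logger = logging.getLogger(__name__)


class STTProvider(str, Enum):
    """Stand-in enum; member names and values are assumptions."""

    SONIOX = "soniox"
    DEEPGRAM = "deepgram"


def parse_fallback_provider(raw: str) -> Optional[STTProvider]:
    """Return the configured provider, or None when the value is invalid."""
    try:
        return STTProvider(raw.strip().lower())
    except ValueError:
        # Log loudly and disable fallback rather than masking the
        # original STT init error with an enum ValueError.
        logger.error("Invalid STT fallback provider %r; fallback disabled", raw)
        return None


print(parse_fallback_provider("deepgram"))
print(parse_fallback_provider("deepgrm"))
```

A caller can treat `None` as "no fallback available" and re-raise the primary provider's error unchanged.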

Comment on lines +45 to +46
tag_users: Optional comma-separated users to tag. Overrides SLACK_TAG_USERS
when provided. Only used when include_tags is True.
Copilot AI Apr 27, 2026

The tag_users parameter is described as “comma-separated users to tag”, but send() formats entries as <@...> unless they’re already fully formatted mentions. That behavior only works reliably with Slack user IDs (or preformatted user-group mentions). Consider clarifying this in the docstring to prevent passing usernames like @name which won’t mention correctly.

Suggested change
-    tag_users: Optional comma-separated users to tag. Overrides SLACK_TAG_USERS
-        when provided. Only used when include_tags is True.
+    tag_users: Optional comma-separated Slack user IDs (for example,
+        "U12345678") or preformatted Slack mentions. Overrides
+        SLACK_TAG_USERS when provided. Usernames like "@name" are not
+        reliably converted into valid mentions. Only used when
+        include_tags is True.

Comment thread app/services/fallback/__init__.py Outdated
Comment on lines +256 to +260
ServiceFallbackConfig(
    service_name="stt",
    failure_threshold=await BB_STT_FALLBACK_THRESHOLD(),
    failure_window_secs=await BB_STT_FALLBACK_WINDOW_SECS(),
    fallback_duration_secs=await BB_STT_FALLBACK_DURATION_SECS(),
Copilot AI Apr 27, 2026

The ServiceFallbackConfig built for the STT reset task does not set on_reset_alert, so reset_to_primary() won’t emit the intended reset Slack alert even though the module docstring states the background task fires on_reset_alert. Consider wiring the reset callback (or reusing the STT-configured fallback instance) so resets generate alerts consistently.

Comment on lines +34 to +37
_FALLBACK_TAG = "@breeze-sentinals"
STT_FALLBACK_SLACK_TAG = (
    f"{_FALLBACK_TAG},{SLACK_TAG_USERS}" if SLACK_TAG_USERS else _FALLBACK_TAG
)
Copilot AI Apr 27, 2026

_FALLBACK_TAG is set to "@breeze-sentinals" (typo?) and will be wrapped by slack_alert.send() into <@breeze-sentinals>, which is not a valid Slack user/group mention unless it’s a user ID. If this is meant to tag a Slack user group, use the Slack user-group mention format (<!subteam^...|@...>) or store the correct mention token in config.
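
A hedged sketch of how tag formatting could distinguish the cases described here. `format_mentions` is a hypothetical helper, and the ID patterns (user IDs starting with `U` or `W`, user-group IDs starting with `S`) are assumptions about Slack ID shapes; handles such as "@name" are skipped because they cannot be turned into mentions without a lookup:

```python
import re
from typing import List

_USER_ID = re.compile(r"^[UW][A-Z0-9]+$")  # assumed shape of Slack user IDs
_GROUP_ID = re.compile(r"^S[A-Z0-9]+$")  # assumed shape of user-group IDs


def format_mentions(tag_users: str) -> List[str]:
    """Turn a comma-separated tag string into Slack mention tokens."""
    mentions: List[str] = []
    for raw in tag_users.split(","):
        token = raw.strip()
        if not token:
            continue
        if token.startswith("<") and token.endswith(">"):
            mentions.append(token)  # already a formatted mention
        elif _USER_ID.match(token):
            mentions.append(f"<@{token}>")
        elif _GROUP_ID.match(token):
            mentions.append(f"<!subteam^{token}>")
        # Anything else (for example "@breeze-sentinals") is dropped rather
        # than wrapped into an invalid mention.
    return mentions


print(format_mentions("U12345678, S0614TZR7, @breeze-sentinals"))
```

With this shape, the fallback tag would be stored in config as a user-group ID or a preformatted `<!subteam^...>` token rather than a display handle.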

Comment on lines +347 to +356
else:
    fire_and_forget(
        send_templated_alert(
            ALERT_STT_INIT_FALLBACK,
            service_name=primary_provider.capitalize(),
            fallback_name=fallback_provider.capitalize(),
            error_msg=str(primary_err)[:500],
            tag_users=STT_FALLBACK_SLACK_TAG,
        )
    )
Copilot AI Apr 27, 2026

handle_stt_init_failure() only sends the ALERT_STT_INIT_FALLBACK Slack message when ENABLE_BB_STT_FALLBACK is False, but this helper is only called from the fallback-enabled code path. As a result, the init-fallback alert template appears effectively unused. Consider always sending the init-fallback alert (or restructuring this conditional) so primary-init failures that successfully fall back are visible.

Suggested change
-    else:
-        fire_and_forget(
-            send_templated_alert(
-                ALERT_STT_INIT_FALLBACK,
-                service_name=primary_provider.capitalize(),
-                fallback_name=fallback_provider.capitalize(),
-                error_msg=str(primary_err)[:500],
-                tag_users=STT_FALLBACK_SLACK_TAG,
-            )
-        )
+    fire_and_forget(
+        send_templated_alert(
+            ALERT_STT_INIT_FALLBACK,
+            service_name=primary_provider.capitalize(),
+            fallback_name=fallback_provider.capitalize(),
+            error_msg=str(primary_err)[:500],
+            tag_users=STT_FALLBACK_SLACK_TAG,
+        )
+    )

Comment thread app/core/config/dynamic.py Outdated
Comment on lines +299 to +313
# --- Breeze Buddy ElevenLabs TTS Configuration ---
async def BB_ELEVENLABS_VOICE_ID() -> str:
    """Returns BB_ELEVENLABS_VOICE_ID from Redis"""
    return await get_config("BB_ELEVENLABS_VOICE_ID", "fG9s0SXJb213f4UxVHyG", str)


async def BB_ELEVENLABS_MODEL_ID() -> str:
    """Returns BB_ELEVENLABS_MODEL_ID from Redis"""
    return await get_config("BB_ELEVENLABS_MODEL_ID", "eleven_flash_v2_5", str)


async def BB_ELEVENLABS_VOICE_SPEED() -> float:
    """Returns BB_ELEVENLABS_VOICE_SPEED from Redis"""
    return await get_config("BB_ELEVENLABS_VOICE_SPEED", 1.15, float)

Copilot AI Apr 27, 2026

This PR is described as adding STT fallback support, but this diff also introduces Breeze Buddy ElevenLabs TTS dynamic config keys (BB_ELEVENLABS_*). If these are required for the STT fallback feature, it would help to call that out in the PR description; otherwise consider splitting them into a separate PR to keep scope focused.

@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch 2 times, most recently from d7a82ce to af918e4 Compare May 13, 2026 09:38
- Make fallback provider-agnostic (remove soniox hardcode)
- Log EndFrame errors instead of silently swallowing them
- Move FallbackSettings dataclass and _FALLBACK_DEFAULTS to services/fallback
- BB_FALLBACK_CONFIG returns typed FallbackSettings from services/fallback
- BB_FALLBACK_RAW_CONFIG in dynamic.py returns raw dict via json.loads pattern
- Remove no_delay from DeepgramConfig constructor (field not supported by pipecat)
- Deduplicate mid-call STT alert with _mid_call_alert_sent guard
- Fix reset alert timing: poll every 60s via notify_on_expiry() instead of
  deleting active key early; Redis TTL is sole authority on fallback expiry
@Devansh-1218 Devansh-1218 force-pushed the feat-stt-fallback-with-circuit-breaker branch from af918e4 to bee5647 Compare May 13, 2026 09:44


7 participants