feat(breeze-buddy): add STT fallback support#647
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR introduces a Redis-backed STT fallback orchestration system for Breeze Buddy that detects initialization and mid-call speech-to-text failures, routes to a fallback provider when failure thresholds are exceeded, dispatches templated Slack alerts, and terminates calls on non-recoverable errors. Changes
Sequence Diagram(s)sequenceDiagram
participant Agent
participant Pipeline as Pipeline Handler
participant Fallback as ServiceFallback
participant Redis
participant SlackAlert as Slack Alert
participant CallControl as Call Control
Agent->>Pipeline: run() / on_pipeline_error()
Pipeline->>Pipeline: detect STT error
Pipeline->>Fallback: record_stt_failure()
Fallback->>Redis: EVAL (Lua script) INCR failure_count
Fallback->>Redis: Set TTL on counter
Fallback-->>Pipeline: failure recorded
Pipeline->>Fallback: is_active()?
Fallback->>Redis: EXISTS fallback:active
Fallback-->>Pipeline: false (not yet active)
Pipeline->>Redis: Check failure_count vs threshold
alt Threshold Exceeded
Pipeline->>Fallback: (internal) SET active flag
Pipeline->>SlackAlert: fire_and_forget(alert)
SlackAlert->>SlackAlert: send_templated_alert()
SlackAlert-->>SlackAlert: (async, non-blocking)
end
Pipeline->>CallControl: Queue EndFrame()
CallControl->>Agent: Terminate call
Agent-->>Pipeline: Pipeline ends
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Adds a Redis-backed circuit breaker to Breeze Buddy’s STT stack to automatically fail over from Soniox to Deepgram across pods, including a HALF_OPEN “probe call” path and mid-call hot swap via Pipecat ServiceSwitcher.
Changes:
- Introduces Redis-backed STT circuit breaker state/locking and Slack alerting for trip/recovery/probe.
- Updates Breeze Buddy STT initialization to return an
STTServiceResultand route Soniox/Deepgram based on circuit state. - Integrates optional
ServiceSwitcherwrapping in the Breeze Buddy pipeline and adds mid-call STT swap handling.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| app/core/config/static.py | Adds feature flag + circuit breaker tuning env vars + Breeze Buddy-specific Deepgram env vars. |
| app/ai/voice/stt/soniox/config.py | Adds reconnect_on_error to Soniox config and passes it into the Soniox service builder. |
| app/ai/voice/agents/breeze_buddy/stt/fallback.py | New Redis-backed circuit breaker module (states, Redis keys, Slack alerts, probe lock). |
| app/ai/voice/agents/breeze_buddy/stt/init.py | Adds Deepgram fallback building and circuit-breaker-based STT routing, plus result wrapper. |
| app/ai/voice/agents/breeze_buddy/agent/pipeline.py | Adds optional ServiceSwitcher wrapping when a fallback STT is provided. |
| app/ai/voice/agents/breeze_buddy/agent/init.py | Records STT failures, attempts mid-call STT hot swap, and finalizes probe outcomes on disconnect. |
| if ENABLE_BREEZE_BUDDY_STT_FALLBACK: | ||
| from app.ai.voice.agents.breeze_buddy.stt.fallback import ( | ||
| CircuitState, | ||
| stt_circuit_breaker, | ||
| ) | ||
|
|
||
| circuit_state = await stt_circuit_breaker.get_state() | ||
| logger.info(f"STT circuit breaker state: {circuit_state.value}") | ||
|
|
| try: | ||
| stt_service = _build_soniox(language_hints, soniox_context) | ||
| fallback = _build_deepgram_fallback(vad_events=False) | ||
| logger.info( | ||
| "Probe call: Soniox primary + Deepgram fallback " | ||
| "(ServiceSwitcher, vad_events=False)" | ||
| ) | ||
| return STTServiceResult( | ||
| service=stt_service, | ||
| provider="soniox", | ||
| fallback_service=fallback, | ||
| is_probe_call=True, | ||
| ) | ||
| except Exception as probe_err: | ||
| logger.error( | ||
| f"Probe call Soniox init failed: {probe_err}. " | ||
| f"Recording failure and falling back to Deepgram." | ||
| ) | ||
| # Init-time failure during probe: record + release lock | ||
| _fire_and_forget(_send_soniox_failure_alert(probe_err, "deepgram")) | ||
| await stt_circuit_breaker.record_failure() | ||
| await stt_circuit_breaker.release_probe() |
| if self.fallback_stt and not self.stt_switched: | ||
| self.stt_switched = True | ||
| logger.info("Attempting STT hot-swap via ServiceSwitcher") | ||
| try: | ||
| from pipecat.pipeline.service_switcher import ( | ||
| ManuallySwitchServiceFrame, | ||
| ) |
| if not self._stt_failure_recorded: | ||
| await stt_circuit_breaker.record_success() | ||
| else: | ||
| await stt_circuit_breaker.release_probe() |
| async def record_failure(self) -> None: | ||
| """Increment failure count. Trip to OPEN if threshold reached.""" | ||
| try: | ||
| redis = await get_redis_service() | ||
| count = await redis.incr(self._KEY_FAILURE_COUNT) | ||
| # Set TTL on first failure (rolling window) | ||
| if count == 1: | ||
| await redis.expire(self._KEY_FAILURE_COUNT, self.failure_window) | ||
| logger.info( | ||
| f"STT circuit breaker: failure {count}/{self.failure_threshold}" | ||
| ) | ||
| if count >= self.failure_threshold: | ||
| await self._trip(redis) |
| # Set TTL on first failure (rolling window) | ||
| if count == 1: | ||
| await redis.expire(self._KEY_FAILURE_COUNT, self.failure_window) |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 698-708: The probe handling currently calls
stt_circuit_breaker.record_success() whenever is_probe_call is true and
_stt_failure_recorded is false, which can close the breaker on non‑STT
teardowns; change the logic in the is_probe_call block to only call
stt_circuit_breaker.record_success() when you have an actual successful Soniox
transcription/frame (e.g., check an existing success indicator like
self._stt_successful_transcription or the variable that flags a valid
transcription event), and otherwise call stt_circuit_breaker.release_probe();
keep the import and use of stt_circuit_breaker and the _stt_failure_recorded
check but gate record_success() on the real Soniox success flag instead of
merely not having recorded a failure.
In `@app/ai/voice/agents/breeze_buddy/stt/__init__.py`:
- Around line 317-346: The current try block wraps both _build_soniox(...) and
_build_deepgram_fallback(...), so a failure building the Deepgram fallback
wrongly triggers Soniox failure handling; refactor so you first call
_build_soniox(language_hints, soniox_context) inside its own try/except and only
on exceptions run _send_soniox_failure_alert(probe_err), await
stt_circuit_breaker.record_failure(), await stt_circuit_breaker.release_probe(),
and fall back to Deepgram; after successfully creating the Soniox service,
separately construct fallback = _build_deepgram_fallback(vad_events=False) (with
its own error handling that does not mutate Soniox circuit state) and then
return the STTServiceResult with provider="soniox" and fallback_service
populated accordingly.
- Around line 171-173: The gate-breaker routing and failure-recording currently
check only ENABLE_BREEZE_BUDDY_STT_FALLBACK, causing routing to
_build_deepgram_fallback() even when DEEPGRAM_API_KEY is absent; make the
routing and failure-recording use the same combined predicate used in
_build_soniox(): (ENABLE_BREEZE_BUDDY_STT_FALLBACK and DEEPGRAM_API_KEY). Update
any gate checks and failure-recording branches that reference the flag alone
(the routing logic that selects between _build_soniox() and
_build_deepgram_fallback() and the code that records fallback failures) to
evaluate the combined predicate so the breaker stays on Soniox when Deepgram
credentials are missing.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py`:
- Around line 56-68: The constructor stores retrip_threshold but it’s never used
so HALF_OPEN probes that fail only bump failure_count toward failure_threshold
and never re-open the breaker; update the failure-handling logic in
record_failure (and any code that handles STATE_HALF_OPEN) to compare failures
during HALF_OPEN against self.retrip_threshold instead of self.failure_threshold
and call the same re-trip behavior (set state to OPEN, set open expiry using
self.open_duration, reset probe/failure counters) when the retrip threshold is
reached; ensure references to failure_count, STATE_HALF_OPEN, record_failure,
and retrip_threshold are adjusted so HALF_OPEN failures can immediately re-trip
the circuit using retrip_threshold.
- Around line 9-13: The circuit-breaker Redis keys (strings like
"stt:cb:failure_count", "stt:cb:open", "stt:cb:half_open", "stt:cb:probe_lock")
are currently global; change the code in fallback.py to scope them under a
Breeze Buddy namespace by either adding a BREEZE_BUDDY_NAMESPACE prefix to those
key constants or, preferably, pass a namespace parameter into all
redis_get/redis_set/redis_delete calls that touch these keys (e.g., where the
circuit logic reads/writes failure_count, open, half_open, probe_lock); update
the key constant definitions and every usage in the circuit-breaker functions
(including the calls around the block noted at lines 51–54) to use the
namespaced form or the redis helper's namespace argument so the keys no longer
collide with other services.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f892c0b1-4eaa-47de-b678-09cd51a80d70
📒 Files selected for processing (6)
app/ai/voice/agents/breeze_buddy/agent/__init__.pyapp/ai/voice/agents/breeze_buddy/agent/pipeline.pyapp/ai/voice/agents/breeze_buddy/stt/__init__.pyapp/ai/voice/agents/breeze_buddy/stt/fallback.pyapp/ai/voice/stt/soniox/config.pyapp/core/config/static.py
| # Record probe outcome in circuit breaker | ||
| if self.is_probe_call: | ||
| try: | ||
| from app.ai.voice.agents.breeze_buddy.stt.fallback import ( | ||
| stt_circuit_breaker, | ||
| ) | ||
|
|
||
| if not self._stt_failure_recorded: | ||
| await stt_circuit_breaker.record_success() | ||
| else: | ||
| await stt_circuit_breaker.release_probe() |
There was a problem hiding this comment.
Only close the breaker after a positive Soniox signal.
Right now any probe call that disconnects without _stt_failure_recorded calls record_success(). A caller hanging up before speaking, or any other non-STT teardown, will close the circuit globally even though Soniox was never validated. Gate record_success() on an actual successful Soniox transcription/frame and otherwise just release the probe lock.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 698 - 708,
The probe handling currently calls stt_circuit_breaker.record_success() whenever
is_probe_call is true and _stt_failure_recorded is false, which can close the
breaker on non‑STT teardowns; change the logic in the is_probe_call block to
only call stt_circuit_breaker.record_success() when you have an actual
successful Soniox transcription/frame (e.g., check an existing success indicator
like self._stt_successful_transcription or the variable that flags a valid
transcription event), and otherwise call stt_circuit_breaker.release_probe();
keep the import and use of stt_circuit_breaker and the _stt_failure_recorded
check but gate record_success() on the real Soniox success flag instead of
merely not having recorded a failure.
| Redis keys (no namespace prefix - each env has its own Redis): | ||
| stt:cb:failure_count - rolling failure counter with TTL window | ||
| stt:cb:open - present while circuit is OPEN (TTL = open_duration) | ||
| stt:cb:half_open - present while circuit is HALF-OPEN (auto-set when open expires) | ||
| stt:cb:probe_lock - mutex so only one call probes at a time |
There was a problem hiding this comment.
Namespace these circuit-breaker keys.
The docstring and constants make these global Redis keys. In a shared Redis, another service or another breaker instance can trip or clear Breeze Buddy’s circuit state. Route the access through the namespaced Redis helpers, or at minimum scope the keys under a Breeze Buddy namespace. As per coding guidelines, Always use namespace parameter in redis_get/redis_set calls to prevent key collisions across services.
Also applies to: 51-54
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 9 - 13, The
circuit-breaker Redis keys (strings like "stt:cb:failure_count", "stt:cb:open",
"stt:cb:half_open", "stt:cb:probe_lock") are currently global; change the code
in fallback.py to scope them under a Breeze Buddy namespace by either adding a
BREEZE_BUDDY_NAMESPACE prefix to those key constants or, preferably, pass a
namespace parameter into all redis_get/redis_set/redis_delete calls that touch
these keys (e.g., where the circuit logic reads/writes failure_count, open,
half_open, probe_lock); update the key constant definitions and every usage in
the circuit-breaker functions (including the calls around the block noted at
lines 51–54) to use the namespaced form or the redis helper's namespace argument
so the keys no longer collide with other services.
| def __init__( | ||
| self, | ||
| failure_threshold: int = STT_CIRCUIT_BREAKER_FAILURE_THRESHOLD, | ||
| retrip_threshold: int = STT_CIRCUIT_BREAKER_RETRIP_THRESHOLD, | ||
| open_duration: int = STT_CIRCUIT_BREAKER_OPEN_DURATION_SECS, | ||
| failure_window: int = STT_CIRCUIT_BREAKER_FAILURE_WINDOW_SECS, | ||
| probe_lock_ttl: int = STT_CIRCUIT_BREAKER_PROBE_LOCK_TTL_SECS, | ||
| ): | ||
| self.failure_threshold = failure_threshold | ||
| self.retrip_threshold = retrip_threshold | ||
| self.open_duration = open_duration | ||
| self.failure_window = failure_window | ||
| self.probe_lock_ttl = probe_lock_ttl |
There was a problem hiding this comment.
HALF_OPEN failures never re-open the breaker.
retrip_threshold is stored but never used. In HALF_OPEN, a failed probe only increments failure_count to 1 and leaves the circuit HALF_OPEN because record_failure() still compares against failure_threshold. That lets an unhealthy Soniox keep receiving periodic probe traffic instead of re-tripping immediately.
Suggested fix
async def record_failure(self) -> None:
"""Increment failure count. Trip to OPEN if threshold reached."""
try:
redis = await get_redis_service()
+ threshold = self.failure_threshold
+ if await redis.exists(self._KEY_HALF_OPEN) and not await redis.exists(
+ self._KEY_OPEN
+ ):
+ threshold = self.retrip_threshold
count = await redis.incr(self._KEY_FAILURE_COUNT)
# Set TTL on first failure (rolling window)
if count == 1:
await redis.expire(self._KEY_FAILURE_COUNT, self.failure_window)
logger.info(
- f"STT circuit breaker: failure {count}/{self.failure_threshold}"
+ f"STT circuit breaker: failure {count}/{threshold}"
)
- if count >= self.failure_threshold:
+ if count >= threshold:
await self._trip(redis)Also applies to: 87-100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 56 - 68, The
constructor stores retrip_threshold but it’s never used so HALF_OPEN probes that
fail only bump failure_count toward failure_threshold and never re-open the
breaker; update the failure-handling logic in record_failure (and any code that
handles STATE_HALF_OPEN) to compare failures during HALF_OPEN against
self.retrip_threshold instead of self.failure_threshold and call the same
re-trip behavior (set state to OPEN, set open expiry using self.open_duration,
reset probe/failure counters) when the retrip threshold is reached; ensure
references to failure_count, STATE_HALF_OPEN, record_failure, and
retrip_threshold are adjusted so HALF_OPEN failures can immediately re-trip the
circuit using retrip_threshold.
259eb66 to
c7cafda
Compare
| _background_tasks: set[asyncio.Task] = set() | ||
|
|
||
|
|
||
| def _fire_and_forget(coro) -> None: |
| # When enabled, uses a circuit breaker to route calls: | ||
| # CLOSED -> Soniox only | OPEN -> Deepgram only | HALF-OPEN -> single probe call on Soniox | ||
| ENABLE_BREEZE_BUDDY_STT_FALLBACK = ( | ||
| os.environ.get("ENABLE_BREEZE_BUDDY_STT_FALLBACK", "false").lower() == "true" |
9a91582 to
5460862
Compare
567b2a8 to
c970db7
Compare
|
needs_changes This PR introduces critical reliability bugs: a syntax error that masks root causes during failures, missing TTLs creating permanent fallback states, and race conditions that poison failure counters. Combined with hardcoded provider checks that silently disable fallback for non-Soniox deployments, these issues pose significant operational risk.
|
|
|
||
|
|
||
| async def initialize_fallback_tasks(scheduler: BackgroundTaskScheduler) -> None: | ||
| """Register STT fallback reset task if fallback is enabled.""" |
There was a problem hiding this comment.
Bug: Reset task never registered if flag is off at startup — stuck fallback state
If ENABLE_BB_STT_FALLBACK is false at server start, this function returns early and stt_fallback_reset is never scheduled. If the flag is then flipped to true mid-run (Redis/DevCycle), record_failure will activate fallback by setting fallback:stt:active in Redis — but nothing ever calls reset_to_primary(). The key has no TTL; it survives restarts. System is permanently locked on fallback until someone manually deletes the Redis key.
Reproduced: flag=false at startup → flag flips true → STT fails twice → fallback:stt:active set → 30+ minutes pass → key still present, no reset.
Fix: either (a) always register the reset task and let it check the flag at runtime, or (b) give fallback:stt:active a TTL of fallback_duration_secs in _activate() so it self-expires without needing the reset task.
|
|
||
| duration_secs = await BB_STT_FALLBACK_DURATION_SECS() | ||
| scheduler.register_task( | ||
| name="stt_fallback_reset", |
There was a problem hiding this comment.
Bug: Fallback duration is not guaranteed — reset fires on startup cadence, not activation time
The reset task is registered with interval_seconds=duration_secs (e.g. 1800s). It fires at T+1800, T+3600, ... relative to server start. Fallback activation time is unrelated to this schedule.
Worst case: fallback activates 1 minute before the next reset tick → fallback lasts 1 minute, not 30. Operators configure 30-minute fallback, get 1–60 minute windows unpredictably. On a flaky STT provider, rapid fallback→reset→fail→fallback cycles trigger alert storms.
Root cause: check_and_reset_stt_fallback calls fallback.is_active() and resets unconditionally — it doesn't check when activation occurred.
Fix option 1: Store activation timestamp (fallback:stt:activated_at) and only reset if now - activated_at >= duration_secs.
Fix option 2 (simpler): Use a TTL on fallback:stt:active itself. In _activate(), change redis.set(key, '1', nx=True) to redis.set(key, '1', nx=True, ex=fallback_duration_secs). Key self-expires after exactly duration_secs. No reset task needed — just check is_active().
There was a problem hiding this comment.
_activate() was calling redis.set(key, "1", nx=True) with no TTL, making the key permanent and relying
entirely on the background reset task to clear it.
Added ex=fallback_duration_secs to the SET in _activate(). The key now self-expires after exactly the configured
cooldown, measured from activation time.
| await fb.record_failure(error_msg=str(primary_err)[:200], context="init") | ||
| else: | ||
| fire_and_forget( | ||
| send_templated_alert( |
There was a problem hiding this comment.
Nit: Dead code + misleading alert template
This else branch (fallback disabled) is unreachable. handle_stt_init_failure is only called from create_stt_from_config which already returns early when ENABLE_BB_STT_FALLBACK=False (line 120 of stt/init.py). So this branch can never execute.
If it ever did (e.g. future refactor adds a caller), it would fire ALERT_STT_INIT_FALLBACK — the template titled "🚨 {service_name} Init Failed — Using {fallback_name}" — even when fallback is disabled, misleading on-call engineers into thinking a fallback is active when the system is actually in a broken state.
Consider removing the dead branch or adding an assertion: assert fallback_enabled, 'handle_stt_init_failure called with fallback disabled'.
There was a problem hiding this comment.
_activate() was calling redis.set(key, "1", nx=True) with no TTL, making the key permanent and relying
entirely on the background reset task to clear it.
Added ex=fallback_duration_secs to the SET in _activate(). The key now self-expires after exactly the configured
cooldown, measured from activation time.
| redis = await get_redis_service() | ||
|
|
||
| count = await redis.incr(self._key_failure_count) | ||
| if count == 1: |
There was a problem hiding this comment.
Risk: Non-atomic INCR + EXPIRE race in multi-pod deployment
Two ops without a pipeline:
- Pod A:
INCR failure_count→ 1, then setsEXPIRE 240s - Pod B:
INCR failure_count→ 2 (threshold hit), calls_activate→ deletesfailure_count - Pod C:
INCR failure_count→ 1 (fresh counter, no TTL yet) - Pod A's
EXPIREcall now lands on Pod C's key → sets 240s TTL on freshly re-created counter
This is mostly benign (TTL on the new counter is correct behavior), but the counter can survive longer than intended if Pod A's expire lands after a reset cycle clears and recreates the key. More critically: if threshold=2 and 10 pods all fail simultaneously, all 10 call _activate — only 1 succeeds (NX guard) but 10 alert callbacks fire. At scale, alert flood.
Fix: use SET key 1 NX EX window_secs + INCR pattern, or Lua script to atomically incr-and-expire.
There was a problem hiding this comment.
Replaced the two-step non-atomic calls with a Lua script (_LUA_INCR_WITH_EXPIRE) that executes INCR
and EXPIRE as a single indivisible Redis operation. Added a non-atomic fallback if run_script returns None (Redis EVAL
failure) so failures are never silently dropped.
Added a fallback:{service}:alerted:{count} NX key before firing on_failure_alert. Only the first pod to
reach count=N wins the NX set and fires the Slack alert — all other pods skip it. TTL is set to failure_window_secs so
the dedup keys clean up with the counter. The trip alert (on_trip_alert) was already safe — it was protected by the NX
guard in _activate().
5632393 to
dac45f1
Compare
dac45f1 to
da74118
Compare
da74118 to
c309fe9
Compare
|
Changes requested The undefined variable in the exception handler will mask production failures with NameErrors, while hardcoded provider checks and breaking API changes undermine the fallback system's reliability. Multiple high-severity issues span correctness, security, and operational safety.
|
c309fe9 to
50f616a
Compare
|
50f616a to
48c73ec
Compare
|
|
||
|
|
||
| # --- Breeze Buddy ElevenLabs TTS Configuration --- | ||
| async def BB_ELEVENLABS_VOICE_ID() -> str: |
There was a problem hiding this comment.
Are these required in this PR?
| return await get_config("ENABLE_BB_STT_FALLBACK", False, bool) | ||
|
|
||
|
|
||
| async def BB_STT_FALLBACK_PROVIDER() -> str: |
There was a problem hiding this comment.
What about merchants who are having deepgram as STT and deepgram fails?
There was a problem hiding this comment.
for Phase-1 it was told to enable fallback only for Soniox!!
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (2)
app/ai/voice/agents/breeze_buddy/stt/fallback.py (1)
304-309: LeakingServiceFallback's private_key_notifiedacross module boundaries.
notify_fallback_activereaches intofb._key_notified, which couples this caller toServiceFallback's internal Redis layout. If the generic class ever renames or restructures its keys (it owns this namespace), this call site silently breaks. Add a small public method onServiceFallback(e.g.try_notify_once(ttl_secs) -> bool) and consume that instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 304 - 309, notify_fallback_active currently accesses ServiceFallback's private attribute fb._key_notified, leaking internal Redis key layout; add a public method on ServiceFallback (e.g. try_notify_once(ttl_secs) -> bool) that performs the NX/EX Redis set and returns whether the notify succeeded, then update notify_fallback_active to call fb.try_notify_once(fb.config.fallback_duration_secs) instead of touching fb._key_notified directly so the implementation detail stays encapsulated.app/ai/voice/agents/breeze_buddy/agent/__init__.py (1)
870-870: Remove dead commented-out call.
# stt, llm, tts = await create_services(self.configurations)is stale after theSTTServiceResultrefactor and adds noise.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` at line 870, Remove the dead commented-out call to create_services (the line "# stt, llm, tts = await create_services(self.configurations)") since it is stale after the STTServiceResult refactor; delete that commented line from the agent __init__ code (referencing create_services and STTServiceResult) to reduce noise and keep the codebase clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 625-635: The substring check that builds processor_str and
stt_keywords leads to false positives (e.g., "google" matching non‑STT Google
processors) and misses "openai"; replace it by determining STT identity from the
actual STT service instance instead of the error's processor string — use
self._stt_service (or a provider/name attribute on that instance) to classify
STT errors, falling back to a tighter substring match only if the service
instance is unavailable (e.g., require tokens like "_stt" or "google_stt" and
include "openai" in stt_keywords); update the logic that sets is_stt_error (and
references to processor_str and stt_keywords) to use this service-based check so
only true STT providers trigger STT fallback recording, alerting, and
EndFrame().
- Around line 656-659: The current try/except around
task.queue_frames([EndFrame()]) swallows all exceptions; change it to catch
Exception as e and log the error (e.g., using logger.exception or an appropriate
module/class logger) including the exception details and context (mentioning
task and EndFrame) before optionally re-raising or handling; update the block
that calls task.queue_frames in the agent/__init__.py (the
task.queue_frames([EndFrame()]) site) to ensure exceptions are recorded for
post-mortems rather than silently passed.
- Around line 642-652: The code currently gates recording STT failures on
self.stt_provider == "soniox", which prevents fallback tracking for other
providers; remove that provider-specific check and always attempt to call
record_stt_failure once per call (i.e., keep the existing _stt_failure_recorded
boolean guard and the try/except), so replace the if block that checks
self.stt_provider with a simple "if not self._stt_failure_recorded" branch that
sets self._stt_failure_recorded = True and calls await
record_stt_failure(error_msg=str(error_msg)[:200], call_sid=self.call_sid or "",
context="mid-call") inside the existing try/except (record_stt_failure itself
respects ENABLE_BB_STT_FALLBACK), ensuring service-agnostic fallback recording.
In `@app/ai/voice/agents/breeze_buddy/stt/__init__.py`:
- Around line 228-241: The proactive fallback build (when provider_name !=
fallback_provider) must be wrapped in try/except so a misconfigured fallback
doesn't abort create_stt_from_config; around the await
_build_stt_provider(fallback_config) call catch Exception as err, call
handle_stt_init_failure(provider=fallback_provider, err=err) (or equivalent
error/alerting path) and log/notify the failure (e.g., via
notify_fallback_active or Slack alert), then do NOT return a fallback
STTServiceResult so execution falls through to the primary provider
initialization; keep the successful path (notify_fallback_active + return
STTServiceResult(provider=fallback_provider, service=service)) unchanged but
guarded by the try block.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py`:
- Around line 34-37: Replace the hardcoded, misspelled _FALLBACK_TAG value and
make the fallback Slack tag operator-configurable: remove the literal
"@breeze-sentinals" and instead derive the fallback from configuration (e.g.,
use SLACK_TAG_USERS or a new config variable like DEFAULT_FALLBACK_TAG set to
the correct "@breeze-sentinels"), then update STT_FALLBACK_SLACK_TAG to combine
the configured fallback and SLACK_TAG_USERS (falling back to the configured
default when SLACK_TAG_USERS is empty) so alerts use the correct, tunable Slack
user-group handle; update references to _FALLBACK_TAG and STT_FALLBACK_SLACK_TAG
accordingly.
- Around line 159-174: The template filler _fill_template (and its inner _fmt)
only catches KeyError so malformed braces or positional format tokens in
upstream error messages can raise ValueError/IndexError and crash the alert
path; update _fill_template/_fmt to either (a) use a safe formatting approach
such as string.Template or str.format_map with a SafeDict to silently ignore
unknown placeholders, or (b) broaden the exception handling to catch ValueError
and IndexError (and/or Exception) and fall back to returning the original
string, and before formatting pre-escape or sanitize free-form fields like
error_msg to replace or double any stray braces; ensure send_templated_alert
continues to receive a safely formatted dict from _fill_template.
In `@app/services/fallback/__init__.py`:
- Around line 251-270: check_and_reset_stt_fallback rarely observes the
ephemeral Redis _key_active because _activate() sets the key with
EX=fallback_duration_secs and the scheduled task ticks on fixed intervals;
change the flow so the reset alert reliably fires by having
ServiceFallback._activate() also record a durable activation marker (e.g., set
an "stt_activated_at" timestamp or a "stt_pending_reset_alert" key with TTL
longer than fallback_duration_secs) and then update check_and_reset_stt_fallback
to look for that marker: if the active key no longer exists but the activation
marker indicates the fallback just expired (timestamp older than
fallback_duration_secs or presence of pending_reset_alert), call
fallback.reset_to_primary() and emit the alert, then remove the activation
marker; keep use of ServiceFallback.is_active(), ServiceFallback._activate(),
ServiceFallback.reset_to_primary(), and the check_and_reset_stt_fallback
function names to locate and modify the logic.
- Around line 110-113: When run_script returns None and the fallback branch uses
await redis.incr(self._key_failure_count), it re-introduces the
INCR-without-EXPIRE race by never assigning a TTL; change the fallback to set an
expiry as part of the fallback path (e.g., use INCR followed by an EXPIRE when
the increment returns 1 or use a single Redis command that sets TTL) or
alternatively raise/propagate an error instead of silently falling back; update
the code surrounding run_script, redis.incr and self._key_failure_count so the
fallback always ensures a bounded TTL on the failure counter (or fails hard) to
avoid persistent stale counters.
---
Nitpick comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Line 870: Remove the dead commented-out call to create_services (the line "#
stt, llm, tts = await create_services(self.configurations)") since it is stale
after the STTServiceResult refactor; delete that commented line from the agent
__init__ code (referencing create_services and STTServiceResult) to reduce noise
and keep the codebase clean.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py`:
- Around line 304-309: notify_fallback_active currently accesses
ServiceFallback's private attribute fb._key_notified, leaking internal Redis key
layout; add a public method on ServiceFallback (e.g. try_notify_once(ttl_secs)
-> bool) that performs the NX/EX Redis set and returns whether the notify
succeeded, then update notify_fallback_active to call
fb.try_notify_once(fb.config.fallback_duration_secs) instead of touching
fb._key_notified directly so the implementation detail stays encapsulated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b9fa66fe-5aef-4bd0-bfed-ba26ad60c4c1
📒 Files selected for processing (10)
app/ai/voice/agents/breeze_buddy/agent/__init__.pyapp/ai/voice/agents/breeze_buddy/agent/pipeline.pyapp/ai/voice/agents/breeze_buddy/stt/__init__.pyapp/ai/voice/agents/breeze_buddy/stt/fallback.pyapp/ai/voice/agents/breeze_buddy/utils/common.pyapp/ai/voice/stt/soniox/config.pyapp/core/config/dynamic.pyapp/main.pyapp/services/fallback/__init__.pyapp/services/slack/alert.py
🚧 Files skipped from review as they are similar to previous changes (2)
- app/ai/voice/stt/soniox/config.py
- app/ai/voice/agents/breeze_buddy/agent/pipeline.py
| # Detect STT errors by processor name keywords | ||
| processor_str = str(processor).lower() | ||
| stt_keywords = ( | ||
| "stt", | ||
| "soniox", | ||
| "deepgram", | ||
| "transcri", | ||
| "google", | ||
| "sarvam", | ||
| ) | ||
| is_stt_error = any(kw in processor_str for kw in stt_keywords) |
There was a problem hiding this comment.
Substring classification has false positives and a missing provider.
"google" matches any google_* processor (Google LLM/TTS/Vertex) — not just STT — so non-STT errors will be misclassified and trigger STT fallback recording + alert + EndFrame(). Conversely, "openai" is missing despite OpenAI being a supported STT provider per the learning above.
Prefer matching on self._stt_service's class identity (or processor name from the service instance) instead of substring sniffing the error's processor attribute. At minimum, anchor on more specific tokens (e.g. "_stt", "google_stt") and add "openai" if you keep this approach.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 625 - 635,
The substring check that builds processor_str and stt_keywords leads to false
positives (e.g., "google" matching non‑STT Google processors) and misses
"openai"; replace it by determining STT identity from the actual STT service
instance instead of the error's processor string — use self._stt_service (or a
provider/name attribute on that instance) to classify STT errors, falling back
to a tighter substring match only if the service instance is unavailable (e.g.,
require tokens like "_stt" or "google_stt" and include "openai" in
stt_keywords); update the logic that sets is_stt_error (and references to
processor_str and stt_keywords) to use this service-based check so only true STT
providers trigger STT fallback recording, alerting, and EndFrame().
| # Record failure in fallback system (once per call, Soniox only) | ||
| if self.stt_provider == "soniox" and not self._stt_failure_recorded: | ||
| self._stt_failure_recorded = True | ||
| try: | ||
| await record_stt_failure( | ||
| error_msg=str(error_msg)[:200], | ||
| call_sid=self.call_sid or "", | ||
| context="mid-call", | ||
| ) | ||
| except Exception as fb_err: | ||
| logger.warning(f"STT fallback record_failure failed: {fb_err}") |
There was a problem hiding this comment.
Hardcoded "soniox" disables fallback for every other supported primary.
BB_STT_SERVICE is dynamic and supports soniox, deepgram, sarvam, openai, google (see app/ai/voice/agents/breeze_buddy/stt/__init__.py:286-294). Gating record_stt_failure(...) on self.stt_provider == "soniox" means a merchant whose primary is anything else gets zero failure tracking, the breaker never trips, and fallback is silently inert — exactly the case swaroopvarma1 flagged.
Drop the provider gate. record_stt_failure(...) already checks ENABLE_BB_STT_FALLBACK() internally and the generic ServiceFallback is provider-agnostic.
Based on learnings: "Applies to app/ai/voice/agents/breeze_buddy/stt/**/*.py : STT providers must support native endpoint detection or SmartTurn; Soniox is default with optional Deepgram, Sarvam, OpenAI, Google".
🐛 Suggested fix
- # Record failure in fallback system (once per call, Soniox only)
- if self.stt_provider == "soniox" and not self._stt_failure_recorded:
+ # Record failure in fallback system (once per call, any primary)
+ if self.stt_provider and not self._stt_failure_recorded:
self._stt_failure_recorded = True
try:
await record_stt_failure(
error_msg=str(error_msg)[:200],
call_sid=self.call_sid or "",
context="mid-call",
)
except Exception as fb_err:
logger.warning(f"STT fallback record_failure failed: {fb_err}")🧰 Tools
🪛 Ruff (0.15.12)
[warning] 651-651: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 642 - 652,
The code currently gates recording STT failures on self.stt_provider ==
"soniox", which prevents fallback tracking for other providers; remove that
provider-specific check and always attempt to call record_stt_failure once per
call (i.e., keep the existing _stt_failure_recorded boolean guard and the
try/except), so replace the if block that checks self.stt_provider with a simple
"if not self._stt_failure_recorded" branch that sets self._stt_failure_recorded
= True and calls await record_stt_failure(error_msg=str(error_msg)[:200],
call_sid=self.call_sid or "", context="mid-call") inside the existing try/except
(record_stt_failure itself respects ENABLE_BB_STT_FALLBACK), ensuring
service-agnostic fallback recording.
| try: | ||
| await task.queue_frames([EndFrame()]) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Swallowing EndFrame queue errors silently.
If task.queue_frames([EndFrame()]) raises, the call may hang in an undefined state and there will be no breadcrumb in logs (Ruff S110). Log the exception so post-mortems are possible.
🛡️ Suggested fix
# Alert and end call — no mid-call swap in Phase 1
fire_and_forget(self._send_mid_call_stt_alert())
try:
await task.queue_frames([EndFrame()])
- except Exception:
- pass
+ except Exception as end_err:
+ logger.warning(f"Failed to queue EndFrame after STT error: {end_err}")🧰 Tools
🪛 Ruff (0.15.12)
[error] 658-659: try-except-pass detected, consider logging the exception
(S110)
[warning] 658-658: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py` around lines 656 - 659,
The current try/except around task.queue_frames([EndFrame()]) swallows all
exceptions; change it to catch Exception as e and log the error (e.g., using
logger.exception or an appropriate module/class logger) including the exception
details and context (mentioning task and EndFrame) before optionally re-raising
or handling; update the block that calls task.queue_frames in the
agent/__init__.py (the task.queue_frames([EndFrame()]) site) to ensure
exceptions are recorded for post-mortems rather than silently passed.
| if provider_name != fallback_provider: | ||
| fb = await get_stt_fallback() | ||
| if await fb.is_active(): | ||
| logger.info( | ||
| f"STT fallback active — using {fallback_provider} " | ||
| f"instead of {provider_name}" | ||
| ) | ||
| fallback_config = STTConfiguration( | ||
| provider=STTProvider(fallback_provider), | ||
| language=config.language, | ||
| ) | ||
| service = await _build_stt_provider(fallback_config) | ||
| fire_and_forget(notify_fallback_active(fallback_provider)) | ||
| return STTServiceResult(provider=fallback_provider, service=service) |
There was a problem hiding this comment.
Proactive fallback route has no error handling — misconfigured fallback provider kills the call.
When the fallback is already active and the configured BB_STT_FALLBACK_PROVIDER is missing its API key (e.g., set to deepgram while DEEPGRAM_API_KEY is empty), _build_stt_provider(fallback_config) will raise ValueError, propagating out of create_stt_from_config and aborting setup. Unlike the primary-build path below (which routes through handle_stt_init_failure), there is no catch here.
Consider wrapping this build in try/except and degrading to the primary provider (with a loud warning/Slack alert) if the fallback build fails — the primary may still partially work, and aborting the call is worse than ignoring an unusable fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/stt/__init__.py` around lines 228 - 241, The
proactive fallback build (when provider_name != fallback_provider) must be
wrapped in try/except so a misconfigured fallback doesn't abort
create_stt_from_config; around the await _build_stt_provider(fallback_config)
call catch Exception as err, call
handle_stt_init_failure(provider=fallback_provider, err=err) (or equivalent
error/alerting path) and log/notify the failure (e.g., via
notify_fallback_active or Slack alert), then do NOT return a fallback
STTServiceResult so execution falls through to the primary provider
initialization; keep the successful path (notify_fallback_active + return
STTServiceResult(provider=fallback_provider, service=service)) unchanged but
guarded by the try block.
| _FALLBACK_TAG = "@breeze-sentinals" | ||
| STT_FALLBACK_SLACK_TAG = ( | ||
| f"{_FALLBACK_TAG},{SLACK_TAG_USERS}" if SLACK_TAG_USERS else _FALLBACK_TAG | ||
| ) |
There was a problem hiding this comment.
Hardcoded, misspelled Slack group.
@breeze-sentinals should be @breeze-sentinels (or whatever the actual Slack user-group handle is). Today this prefix is hardcoded into every STT alert — if Slack rejects the unknown handle, on-call won't be paged. Move this to dynamic config (or SLACK_TAG_USERS) so it's both correct and operator-tunable without a redeploy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 34 - 37,
Replace the hardcoded, misspelled _FALLBACK_TAG value and make the fallback
Slack tag operator-configurable: remove the literal "@breeze-sentinals" and
instead derive the fallback from configuration (e.g., use SLACK_TAG_USERS or a
new config variable like DEFAULT_FALLBACK_TAG set to the correct
"@breeze-sentinels"), then update STT_FALLBACK_SLACK_TAG to combine the
configured fallback and SLACK_TAG_USERS (falling back to the configured default
when SLACK_TAG_USERS is empty) so alerts use the correct, tunable Slack
user-group handle; update references to _FALLBACK_TAG and STT_FALLBACK_SLACK_TAG
accordingly.
| def _fill_template(template: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]: | ||
| """Recursively applies .format(**kwargs) to all string values.""" | ||
|
|
||
| def _fmt(val: Any) -> Any: | ||
| if isinstance(val, str): | ||
| try: | ||
| return val.format(**kwargs) | ||
| except KeyError: | ||
| return val | ||
| if isinstance(val, dict): | ||
| return {k: _fmt(v) for k, v in val.items()} | ||
| if isinstance(val, list): | ||
| return [_fmt(item) for item in val] | ||
| return val | ||
|
|
||
| return _fmt(template) |
There was a problem hiding this comment.
val.format() only catches KeyError — malformed braces in upstream errors will still crash the alert.
str.format raises ValueError for unmatched {/} and IndexError for positional refs like {0}. STT/SDK error messages routinely contain JSON-ish snippets (e.g. "... { 'code': 503 ..."), so a single stray brace from a provider exception will bubble out of _fmt → send_templated_alert → caller, breaking exactly the alert path that should be most reliable.
Broaden the catch and pre-escape error_msg (and any other free-form fields). Even better, switch to string.Template or str.format_map(SafeDict) for the entire template-fill path so unknown placeholders degrade silently and untrusted text cannot break formatting.
🛡️ Minimal hardening
def _fmt(val: Any) -> Any:
if isinstance(val, str):
try:
return val.format(**kwargs)
- except KeyError:
+ except (KeyError, IndexError, ValueError):
return valBased on past PR feedback from swaroopvarma1 flagging format-string fragility for unsanitized error messages.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/ai/voice/agents/breeze_buddy/stt/fallback.py` around lines 159 - 174, The
template filler _fill_template (and its inner _fmt) only catches KeyError so
malformed braces or positional format tokens in upstream error messages can
raise ValueError/IndexError and crash the alert path; update _fill_template/_fmt
to either (a) use a safe formatting approach such as string.Template or
str.format_map with a SafeDict to silently ignore unknown placeholders, or (b)
broaden the exception handling to catch ValueError and IndexError (and/or
Exception) and fall back to returning the original string, and before formatting
pre-escape or sanitize free-form fields like error_msg to replace or double any
stray braces; ensure send_templated_alert continues to receive a safely
formatted dict from _fill_template.
| if count is None: | ||
| # Lua script failed (logged inside run_script); fall back to | ||
| # non-atomic path so failures are never silently swallowed. | ||
| count = await redis.incr(self._key_failure_count) |
There was a problem hiding this comment.
Non-atomic Lua-failure fallback re-introduces the INCR-without-EXPIRE race.
When run_script returns None, the code falls back to a plain INCR with no companion EXPIRE. The very race the Lua script was added to eliminate (counter persisting without TTL) re-emerges in this branch — and because the key never gets a TTL set later, a single Lua failure on the first increment can leave the counter alive indefinitely, eventually tripping fallback on stale failure history.
Either set TTL explicitly here (still racy but bounded), or treat Lua failure as a hard error rather than silently using a degraded code path.
🛡️ Suggested fix
if count is None:
- # Lua script failed (logged inside run_script); fall back to
- # non-atomic path so failures are never silently swallowed.
- count = await redis.incr(self._key_failure_count)
+ # Lua script failed (logged inside run_script); fall back to
+ # non-atomic INCR+EXPIRE so failures are never silently swallowed.
+ count = await redis.incr(self._key_failure_count)
+ if count == 1:
+ await redis.expire(
+ self._key_failure_count, self.config.failure_window_secs
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/services/fallback/__init__.py` around lines 110 - 113, When run_script
returns None and the fallback branch uses await
redis.incr(self._key_failure_count), it re-introduces the INCR-without-EXPIRE
race by never assigning a TTL; change the fallback to set an expiry as part of
the fallback path (e.g., use INCR followed by an EXPIRE when the increment
returns 1 or use a single Redis command that sets TTL) or alternatively
raise/propagate an error instead of silently falling back; update the code
surrounding run_script, redis.incr and self._key_failure_count so the fallback
always ensures a bounded TTL on the failure counter (or fails hard) to avoid
persistent stale counters.
| async def check_and_reset_stt_fallback() -> None: | ||
| """Check if STT fallback is active and reset to primary if so.""" | ||
| try: | ||
| fallback_provider = await BB_STT_FALLBACK_PROVIDER() | ||
| fallback = ServiceFallback( | ||
| ServiceFallbackConfig( | ||
| service_name="stt", | ||
| failure_threshold=await BB_STT_FALLBACK_THRESHOLD(), | ||
| failure_window_secs=await BB_STT_FALLBACK_WINDOW_SECS(), | ||
| fallback_duration_secs=await BB_STT_FALLBACK_DURATION_SECS(), | ||
| fallback_provider_name=fallback_provider, | ||
| ) | ||
| ) | ||
| if not await fallback.is_active(): | ||
| return | ||
|
|
||
| logger.info("STT fallback active — resetting to primary provider") | ||
| await fallback.reset_to_primary() | ||
| except Exception as e: | ||
| logger.error(f"STT fallback reset task failed: {e}") |
There was a problem hiding this comment.
on_reset_alert will rarely fire — _key_active self-expires before this task observes it.
_activate() sets _key_active with EX=fallback_duration_secs, so the key disappears via TTL. Meanwhile this reset task is scheduled with interval_seconds=fallback_duration_secs from server start, not from activation time. With a random activation T, the key expires at T+duration, but the next task tick is at startup + N*duration — only a narrow window where the task sees is_active() == True. In most cycles the TTL cleans up first and reset_to_primary() (and thus the "back to primary" Slack alert) is never invoked, leaving operators without a normal-operation-resumed signal.
Consider driving the reset alert off Redis keyspace expiration notifications, or store an activated_at timestamp and have the task fire the alert when it observes the active key has just expired (e.g., a separate "pending_reset_alert" key with longer TTL).
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 269-269: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/services/fallback/__init__.py` around lines 251 - 270,
check_and_reset_stt_fallback rarely observes the ephemeral Redis _key_active
because _activate() sets the key with EX=fallback_duration_secs and the
scheduled task ticks on fixed intervals; change the flow so the reset alert
reliably fires by having ServiceFallback._activate() also record a durable
activation marker (e.g., set an "stt_activated_at" timestamp or a
"stt_pending_reset_alert" key with TTL longer than fallback_duration_secs) and
then update check_and_reset_stt_fallback to look for that marker: if the active
key no longer exists but the activation marker indicates the fallback just
expired (timestamp older than fallback_duration_secs or presence of
pending_reset_alert), call fallback.reset_to_primary() and emit the alert, then
remove the activation marker; keep use of ServiceFallback.is_active(),
ServiceFallback._activate(), ServiceFallback.reset_to_primary(), and the
check_and_reset_stt_fallback function names to locate and modify the logic.
There was a problem hiding this comment.
Pull request overview
Adds a Redis-backed fallback/circuit-breaker mechanism and wires it into Breeze Buddy STT so the system can proactively route away from a failing primary provider (e.g., Soniox → Deepgram) and emit Slack alerts around failures and fallback state.
Changes:
- Introduces a generic
ServiceFallback(Redis failure counter + “active” flag) and registers a background task for STT fallback resets. - Implements Breeze Buddy STT fallback orchestration + templated Slack alerts, and updates STT creation to return the actual provider used.
- Extends Slack alert sending to support per-alert tagging overrides; adds new dynamic config keys for STT fallback (and also ElevenLabs TTS).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| app/services/slack/alert.py | Adds tag_users override to control per-alert tagging behavior. |
| app/services/fallback/init.py | Adds Redis-backed fallback state machine + registers STT fallback reset task. |
| app/main.py | Registers fallback background tasks during app lifespan startup. |
| app/core/config/dynamic.py | Adds dynamic config for STT fallback (and ElevenLabs TTS settings). |
| app/ai/voice/stt/soniox/config.py | Adds reconnect_on_error option when building Soniox STT. |
| app/ai/voice/agents/breeze_buddy/utils/common.py | Adds fire_and_forget() helper to retain background task refs. |
| app/ai/voice/agents/breeze_buddy/stt/fallback.py | STT-specific fallback orchestration + Slack alert templates/dedup. |
| app/ai/voice/agents/breeze_buddy/stt/init.py | Implements proactive routing + init-time fallback; returns STTServiceResult. |
| app/ai/voice/agents/breeze_buddy/agent/pipeline.py | Propagates STTServiceResult through service creation. |
| app/ai/voice/agents/breeze_buddy/agent/init.py | Tracks actual STT provider; records Soniox failures on STT pipeline errors and ends call. |
| if count is None: | ||
| # Lua script failed (logged inside run_script); fall back to | ||
| # non-atomic path so failures are never silently swallowed. | ||
| count = await redis.incr(self._key_failure_count) |
There was a problem hiding this comment.
If the Lua script fails and you fall back to incr(), the failure counter key never gets a TTL. That can make failure_count accumulate indefinitely and trip fallback long after the intended window. Consider setting EXPIRE when count == 1 (or always) on the non-Lua path as well.
| count = await redis.incr(self._key_failure_count) | |
| count = await redis.incr(self._key_failure_count) | |
| if count == 1: | |
| await redis.expire( | |
| self._key_failure_count, self.config.failure_window_secs | |
| ) |
| if not await fallback.is_active(): | ||
| return | ||
|
|
||
| logger.info("STT fallback active — resetting to primary provider") | ||
| await fallback.reset_to_primary() |
There was a problem hiding this comment.
check_and_reset_stt_fallback() resets to primary whenever the fallback flag exists, regardless of how much TTL remains. Because the scheduler runs on fixed intervals unrelated to activation time, this can deactivate fallback early (even almost immediately after activation). The reset task should only reset after the active key’s TTL has expired (or be redesigned to avoid deleting the key at all).
|
|
||
| tag_users: Optional[str] = filled.get("tag_users") | ||
| if tag_users and "{" in tag_users: | ||
| tag_users = None |
There was a problem hiding this comment.
When tag_users is left as an unfilled placeholder (e.g. "{tag_users}"), this code sets tag_users=None, which causes slack_alert.send() to fall back to global SLACK_TAG_USERS and may unexpectedly tag people. Consider overriding with an empty string (or include_tags=False) to avoid accidental tagging when template placeholders aren’t populated.
| tag_users = None | |
| tag_users = "" |
| fb = await get_stt_fallback() | ||
| notified = await redis.set( | ||
| fb._key_notified, | ||
| "1", |
There was a problem hiding this comment.
notify_fallback_active() reaches into fb._key_notified (a private attribute). This couples the STT layer to ServiceFallback internals and makes refactors risky. Consider exposing a small public method/property on ServiceFallback for activation notification dedup (or a mark_notified() helper) instead.
| fallback_config = STTConfiguration( | ||
| provider=STTProvider(fallback_provider), | ||
| language=config.language, |
There was a problem hiding this comment.
STTProvider(fallback_provider) will raise ValueError if BB_STT_FALLBACK_PROVIDER is misconfigured (typo/unsupported provider), which can mask the original STT init error and prevent service creation. Consider validating the dynamic config value and falling back to a safe default (or disabling fallback) with a clear log/alert when the provider string is invalid.
| tag_users: Optional comma-separated users to tag. Overrides SLACK_TAG_USERS | ||
| when provided. Only used when include_tags is True. |
There was a problem hiding this comment.
The tag_users parameter is described as “comma-separated users to tag”, but send() formats entries as <@...> unless they’re already fully formatted mentions. That behavior only works reliably with Slack user IDs (or preformatted user-group mentions). Consider clarifying this in the docstring to prevent passing usernames like @name which won’t mention correctly.
| tag_users: Optional comma-separated users to tag. Overrides SLACK_TAG_USERS | |
| when provided. Only used when include_tags is True. | |
| tag_users: Optional comma-separated Slack user IDs (for example, | |
| "U12345678") or preformatted Slack mentions. Overrides | |
| SLACK_TAG_USERS when provided. Usernames like "@name" are not | |
| reliably converted into valid mentions. Only used when | |
| include_tags is True. |
| ServiceFallbackConfig( | ||
| service_name="stt", | ||
| failure_threshold=await BB_STT_FALLBACK_THRESHOLD(), | ||
| failure_window_secs=await BB_STT_FALLBACK_WINDOW_SECS(), | ||
| fallback_duration_secs=await BB_STT_FALLBACK_DURATION_SECS(), |
There was a problem hiding this comment.
The ServiceFallbackConfig built for the STT reset task does not set on_reset_alert, so reset_to_primary() won’t emit the intended reset Slack alert even though the module docstring states the background task fires on_reset_alert. Consider wiring the reset callback (or reusing the STT-configured fallback instance) so resets generate alerts consistently.
| _FALLBACK_TAG = "@breeze-sentinals" | ||
| STT_FALLBACK_SLACK_TAG = ( | ||
| f"{_FALLBACK_TAG},{SLACK_TAG_USERS}" if SLACK_TAG_USERS else _FALLBACK_TAG | ||
| ) |
There was a problem hiding this comment.
_FALLBACK_TAG is set to "@breeze-sentinals" (typo?) and will be wrapped by slack_alert.send() into <@breeze-sentinals>, which is not a valid Slack user/group mention unless it’s a user ID. If this is meant to tag a Slack user group, use the Slack user-group mention format (<!subteam^...|@...>) or store the correct mention token in config.
| else: | ||
| fire_and_forget( | ||
| send_templated_alert( | ||
| ALERT_STT_INIT_FALLBACK, | ||
| service_name=primary_provider.capitalize(), | ||
| fallback_name=fallback_provider.capitalize(), | ||
| error_msg=str(primary_err)[:500], | ||
| tag_users=STT_FALLBACK_SLACK_TAG, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
handle_stt_init_failure() only sends the ALERT_STT_INIT_FALLBACK Slack message when ENABLE_BB_STT_FALLBACK is False, but this helper is only called from the fallback-enabled code path. As a result, the init-fallback alert template appears effectively unused. Consider always sending the init-fallback alert (or restructuring this conditional) so primary-init failures that successfully fall back are visible.
| else: | |
| fire_and_forget( | |
| send_templated_alert( | |
| ALERT_STT_INIT_FALLBACK, | |
| service_name=primary_provider.capitalize(), | |
| fallback_name=fallback_provider.capitalize(), | |
| error_msg=str(primary_err)[:500], | |
| tag_users=STT_FALLBACK_SLACK_TAG, | |
| ) | |
| ) | |
| fire_and_forget( | |
| send_templated_alert( | |
| ALERT_STT_INIT_FALLBACK, | |
| service_name=primary_provider.capitalize(), | |
| fallback_name=fallback_provider.capitalize(), | |
| error_msg=str(primary_err)[:500], | |
| tag_users=STT_FALLBACK_SLACK_TAG, | |
| ) | |
| ) |
| # --- Breeze Buddy ElevenLabs TTS Configuration --- | ||
| async def BB_ELEVENLABS_VOICE_ID() -> str: | ||
| """Returns BB_ELEVENLABS_VOICE_ID from Redis""" | ||
| return await get_config("BB_ELEVENLABS_VOICE_ID", "fG9s0SXJb213f4UxVHyG", str) | ||
|
|
||
|
|
||
| async def BB_ELEVENLABS_MODEL_ID() -> str: | ||
| """Returns BB_ELEVENLABS_MODEL_ID from Redis""" | ||
| return await get_config("BB_ELEVENLABS_MODEL_ID", "eleven_flash_v2_5", str) | ||
|
|
||
|
|
||
| async def BB_ELEVENLABS_VOICE_SPEED() -> float: | ||
| """Returns BB_ELEVENLABS_VOICE_SPEED from Redis""" | ||
| return await get_config("BB_ELEVENLABS_VOICE_SPEED", 1.15, float) | ||
|
|
There was a problem hiding this comment.
This PR is described as adding STT fallback support, but this diff also introduces Breeze Buddy ElevenLabs TTS dynamic config keys (BB_ELEVENLABS_*). If these are required for the STT fallback feature, it would help to call that out in the PR description; otherwise consider splitting them into a separate PR to keep scope focused.
d7a82ce to
af918e4
Compare
- Make fallback provider-agnostic (remove soniox hardcode) - Log EndFrame errors instead of silently swallowing them - Move FallbackSettings dataclass and _FALLBACK_DEFAULTS to services/fallback - BB_FALLBACK_CONFIG returns typed FallbackSettings from services/fallback - BB_FALLBACK_RAW_CONFIG in dynamic.py returns raw dict via json.loads pattern - Remove no_delay from DeepgramConfig constructor (field not supported by pipecat) - Deduplicate mid-call STT alert with _mid_call_alert_sent guard - Fix reset alert timing: poll every 60s via notify_on_expiry() instead of deleting active key early; Redis TTL is sole authority on fallback expiry
af918e4 to
bee5647
Compare
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Improvements