feat: auto pause and system check #731

Open
Swetha-160303 wants to merge 1 commit into juspay:release from Swetha-160303:auto-pause-and-system-health-check

Swetha-160303 commented Apr 30, 2026

Dependency PR: #647

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic service health monitoring with intelligent circuit-breaker protection to pause and resume calls based on system status.
    • Implemented enhanced speech-to-text failure detection and handling with automatic fallback recovery.
  • Improvements

    • Improved system resilience with dynamic configuration controls for service health monitoring and auto-recovery timing.
    • Added automated failure tracking and alerting for better system visibility.

Copilot AI review requested due to automatic review settings April 30, 2026 08:28

coderabbitai Bot commented Apr 30, 2026

Walkthrough

Introduces a comprehensive service health monitoring and circuit breaker system. Detects STT pipeline failures, records failures via Redis-backed counters, activates fallback routing when thresholds are exceeded, auto-resets circuits periodically, and gates call processing based on global pause state. Integrates monitoring across agent pipelines, call managers, dynamic configuration, and startup tasks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| STT Error Classification: app/ai/voice/agents/breeze_buddy/agent/__init__.py | Classifies pipeline errors as STT failures by checking error processor name, logs warnings, records failures to fallback system, sends templated alerts asynchronously, and immediately ends calls via EndFrame queueing without mid-call swap logic. |
| Call Processing Control: app/ai/voice/agents/breeze_buddy/managers/calls.py | Adds global circuit-breaker check in process_backlog_leads that skips leads, releases locks, and logs skip reason when service_health_monitor.is_globally_paused() returns true. |
| Dynamic Configuration: app/core/config/dynamic.py | Adds two async configuration accessors: ENABLE_SERVICE_HEALTH_MONITORING (bool, default True) and SERVICE_HEALTH_AUTO_RESUME_MINUTES (int, default 15), both backed by Redis with type-constrained parsing. |
| Startup Task Registration: app/main.py | Registers STT fallback initialization tasks during lifespan setup by invoking initialize_fallback_tasks(...) after langfuse tasks, integrating fallback scheduling before scheduler start. |
| Service Fallback Layer: app/services/fallback/__init__.py | Implements Redis-backed circuit breaker with atomic failure counting (Lua INCR+EXPIRE), deduplicated alert emission (NX+EX), threshold-triggered activation with TTL, and background reset tasks. Exports ServiceFallback class, ServiceFallbackConfig dataclass, and STT-specific reset logic. |
| Service Health Monitoring: app/services/service_health/__init__.py, app/services/service_health/monitor.py | Creates monitoring module that wraps ServiceFallback for circuit-breaker patterns, loads rules from JSON, maps processor names to rules, gates operations behind feature flag, records pipeline errors with contextual data, emits Slack alerts on circuit state changes, and provides APIs for manual pause/resume and auto-reset scheduling. |
| Health Check Rules Configuration: app/services/service_health/rules.json | Defines per-provider alert thresholds with threshold_count and window_minutes; most rules set to 10 failures per 5 minutes, call_initiation at 20 per 5 minutes, and provider_fallback at 5 per 5 minutes. |
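
The counting and tripping behaviour summarized for the Service Fallback Layer can be illustrated without Redis. The following is a minimal sketch under stated assumptions, not the PR's `ServiceFallback`: `InMemoryCircuit` and all of its fields are hypothetical names, and an in-memory counter with an injected clock stands in for the Redis counter key (INCR + EXPIRE) and the active flag (SET with TTL).

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class InMemoryCircuit:
    """Hypothetical in-memory analogue of a Redis-backed circuit breaker."""

    threshold: int       # failures needed to trip the circuit
    window_secs: float   # lifetime of the failure counter
    open_secs: float     # how long the circuit stays open once tripped
    clock: Callable[[], float] = time.monotonic
    _count: int = 0
    _window_ends: float = 0.0
    _open_until: float = 0.0

    def record_failure(self) -> bool:
        """Count one failure; return True if this failure tripped the circuit."""
        now = self.clock()
        if now >= self._window_ends:
            # Counter expired: start a fresh window (INCR + EXPIRE on a new key).
            self._count = 0
            self._window_ends = now + self.window_secs
        self._count += 1
        if self._count >= self.threshold:
            # Trip: set the active flag with a TTL and clear the counter.
            self._open_until = now + self.open_secs
            self._count = 0
            self._window_ends = 0.0
            return True
        return False

    def is_active(self) -> bool:
        """True while the circuit is open (the active flag has not expired)."""
        return self.clock() < self._open_until
```

With the values quoted in the summary, a rule of 10 failures per 5 minutes and a 15 minute auto-resume would correspond to `InMemoryCircuit(threshold=10, window_secs=300, open_secs=900)`. The real implementation would need the increment and expiry to be atomic across processes, which is what the Lua script in the PR is described as doing.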

Sequence Diagram

sequenceDiagram
    participant Pipeline as Pipeline<br/>(Agent)
    participant HealthMon as ServiceHealthMonitor
    participant Fallback as ServiceFallback
    participant Redis as Redis
    participant Alert as Alert System<br/>(Slack)
    participant CallMgr as Call Manager

    Note over Pipeline,CallMgr: Error Detection & Recording
    Pipeline->>HealthMon: record_pipeline_error(processor, error, call_sid)
    HealthMon->>Fallback: record_failure(error, context)
    Fallback->>Redis: INCR failure_counter<br/>EXPIRE (TTL=window)
    Fallback->>Redis: SET failure_alert (NX+EX)<br/>deduplicate alerts
    Fallback->>Alert: emit failure_alert (async)

    Note over Fallback,Redis: Threshold Check & Activation
    alt Failure Count >= Threshold
        Fallback->>Redis: SET active_flag (TTL=fallback_duration)
        Fallback->>Redis: DEL failure_counter
        Fallback->>Alert: emit circuit_trip_alert
        Fallback-->>HealthMon: return true (activated)
    else Failure Count < Threshold
        Fallback-->>HealthMon: return false
    end

    Note over CallMgr,Redis: Global Pause Check
    CallMgr->>HealthMon: is_globally_paused()
    HealthMon->>Redis: EXISTS active_flag
    alt Circuit Active
        HealthMon-->>CallMgr: true
        CallMgr->>CallMgr: skip_lead + release_lock
    else Circuit Inactive
        HealthMon-->>CallMgr: false
        CallMgr->>CallMgr: proceed_normally
    end

    Note over HealthMon,Redis: Auto-Reset Background Task
    HealthMon->>Redis: check_all_active_circuits
    alt Auto-Reset Triggered
        HealthMon->>Redis: DEL active_flag
        HealthMon->>Alert: emit circuit_reset_alert
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested Reviewers

  • badri-singhal
  • murdore

Poem

🐰 Hops of joy with Redis so bright,
Circuit breakers guard the night,
STT failures now have a home,
Fallbacks dance, no more alone,
Auto-reset makes systems sing,
Health monitoring springs! 🌱

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: auto pause and system check' clearly summarizes the main changes: implementing automatic pause functionality and system health monitoring/checking. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.10% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


Copilot AI left a comment

Pull request overview

Adds a Redis-backed “service health” circuit-breaker layer intended to auto-pause outbound calls when upstream providers (TTS/LLM/telephony) exhibit repeated failures, building on the existing STT fallback work referenced in dependency PR #647.

Changes:

  • Introduces a generic ServiceFallback (Redis + TTL) utility and registers an STT fallback reset background task.
  • Adds service_health package (rules + monitor) and a global “pause outbound calls” check in Breeze Buddy backlog processing.
  • Extends Breeze Buddy on_pipeline_error handling to detect STT failures and end calls (Phase 1 behavior).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.

| File | Description |
|---|---|
| app/services/service_health/rules.json | Defines per-service thresholds/windows for health rules. |
| app/services/service_health/monitor.py | Implements service health monitor (circuit breaker) and task registration. |
| app/services/service_health/__init__.py | Exposes monitor/task initializer as package API. |
| app/services/fallback/__init__.py | Adds generic Redis-backed fallback/circuit utility + STT reset task registration. |
| app/main.py | Wires STT fallback task initialization into app lifespan (service health import added). |
| app/core/config/dynamic.py | Adds Redis-backed flags for enabling service health + auto-resume window. |
| app/ai/voice/agents/breeze_buddy/managers/calls.py | Skips backlog lead processing when a global pause is active. |
| app/ai/voice/agents/breeze_buddy/agent/__init__.py | Adds STT error detection and failure recording/end-call behavior. |

Comment on lines +68 to +74
from app.ai.voice.agents.breeze_buddy.stt.fallback import (
    ALERT_STT_TERMINAL_FAILURE,
    STT_FALLBACK_SLACK_TAG,
    record_stt_failure,
    send_templated_alert,
)
from app.services.service_health import service_health_monitor

Copilot AI Apr 30, 2026

Several newly added imports appear unused in this module (ALERT_STT_TERMINAL_FAILURE, STT_FALLBACK_SLACK_TAG, send_templated_alert, and service_health_monitor unless used elsewhere). If they aren’t used later in the file, they should be removed to avoid confusion and keep imports accurate.

Suggested change
-from app.ai.voice.agents.breeze_buddy.stt.fallback import (
-    ALERT_STT_TERMINAL_FAILURE,
-    STT_FALLBACK_SLACK_TAG,
-    record_stt_failure,
-    send_templated_alert,
-)
-from app.services.service_health import service_health_monitor
+from app.ai.voice.agents.breeze_buddy.stt.fallback import record_stt_failure

Copilot uses AI. Check for mistakes.
Comment on lines +254 to +272
async def check_and_reset_stt_fallback() -> None:
    """Check if STT fallback is active and reset to primary if so."""
    try:
        fallback_provider = await BB_STT_FALLBACK_PROVIDER()
        fallback = ServiceFallback(
            ServiceFallbackConfig(
                service_name="stt",
                failure_threshold=await BB_STT_FALLBACK_THRESHOLD(),
                failure_window_secs=await BB_STT_FALLBACK_WINDOW_SECS(),
                fallback_duration_secs=await BB_STT_FALLBACK_DURATION_SECS(),
                fallback_provider_name=fallback_provider,
            )
        )
        if not await fallback.is_active():
            return

        logger.info("STT fallback active — resetting to primary provider")
        await fallback.reset_to_primary()
    except Exception as e:

Copilot AI Apr 30, 2026

check_and_reset_stt_fallback() calls reset_to_primary() whenever the active key exists. Because _activate() already sets a TTL on the active key, this scheduled task can clear the fallback early depending on scheduler alignment, reducing the intended cooldown and potentially flapping providers. Consider either relying on the TTL alone (and adjusting how reset alerts are emitted), or only resetting when the cooldown has actually elapsed (e.g., by checking the Redis TTL / stored activation timestamp).
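
One way to read the "only reset when the cooldown has actually elapsed" option is to gate the reset on the remaining TTL of the active key. The sketch below is an assumption-laden illustration: `should_reset_now` and `grace_secs` are hypothetical names, and the only external fact relied on is the Redis `TTL` convention of returning -2 for a missing key and -1 for a key with no expiry. Treating a key without expiry as "leave it for a manual resume" is a design choice made here, not something the PR states.

```python
def should_reset_now(ttl_secs: int, grace_secs: int = 0) -> bool:
    """Decide whether a periodic task may clear the active flag.

    ttl_secs follows the Redis TTL convention:
      -2  the key does not exist (already expired or never set)
      -1  the key exists but has no expiry
      >=0 seconds remaining before the key expires
    """
    if ttl_secs == -2:
        # Nothing to reset: the TTL has already cleared the flag.
        return False
    if ttl_secs == -1:
        # No expiry set (assumed to mean a manual pause): do not auto-reset.
        return False
    # Reset only once the cooldown is within the grace period of ending.
    return ttl_secs <= grace_secs
```

A scheduled task would then call the reset path only when this returns true, so a circuit tripped 10 seconds ago with a 900 second TTL is left alone instead of being cleared on the next scheduler tick.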

Comment on lines +460 to +466
# Check global service health pause (circuit breaker pattern)
if await service_health_monitor.is_globally_paused():
    logger.info(
        f"Skipping lead {locked_lead.id} - calls are globally paused due to service health"
    )
    await release_lock_on_lead_by_id(locked_lead.id)
    continue

Copilot AI Apr 30, 2026

The global pause check happens only after acquiring a per-lead DB lock. When calls are paused, this will still lock and unlock every backlog lead on each run, adding avoidable DB load. Consider checking is_globally_paused() once near the start of process_backlog_leads() (before querying/locking leads) and returning early (or sleeping) when paused.
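
The early-return shape suggested here can be sketched as follows. This is a simplified stand-in, not the project's `process_backlog_leads()`: the function name `process_backlog` and the three injected callables are hypothetical, and locking is folded into `handle_lead` for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def process_backlog(
    is_paused: Callable[[], Awaitable[bool]],
    fetch_leads: Callable[[], Awaitable[Iterable[str]]],
    handle_lead: Callable[[str], Awaitable[None]],
) -> int:
    """Process backlog leads, checking the global pause once up front."""
    if await is_paused():
        # Paused: return before any lead is queried or locked.
        return 0
    processed = 0
    for lead_id in await fetch_leads():
        await handle_lead(lead_id)
        processed += 1
    return processed
```

One trade-off worth noting: a single up-front check means a pause that begins mid-run is only observed on the next run, so a per-lead check may still be wanted in addition if runs are long.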

Comment on lines +33 to +37
# LLM providers
"azure": "llm",
"openaillm": "llm",
"googlellm": "llm",
# Telephony transports (Twilio, Exotel, Plivo)
Copilot AI Apr 30, 2026

PROCESSOR_RULE_MAP uses exact string matches like "azure" / "openaillm". In pipeline errors, processor often appears as a class name (e.g. AzureLLMService, OpenAILLMService), so this mapping may never match and failures won’t be recorded. Consider normalizing to processor.__class__.__name__.lower() and/or adding the expected class-name keys (e.g. azurellmservice, openaillmservice) or switching to substring matching with a controlled allowlist.
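
The normalization plus allowlist idea could look roughly like this. It is a sketch only: `PROCESSOR_FRAGMENTS` and `resolve_rule` are hypothetical names, the fragments are guesses derived from the class names mentioned in the comment, and the real processor identifiers in the pipeline may differ.

```python
from typing import Optional, Tuple

# Hypothetical allowlist: (lowercase fragment of the processor name, rule name).
PROCESSOR_FRAGMENTS: Tuple[Tuple[str, str], ...] = (
    ("azurellm", "llm"),
    ("openaillm", "llm"),
    ("googlellm", "llm"),
)


def resolve_rule(processor: object) -> Optional[str]:
    """Map a processor (instance or name string) to a rule, or None if unknown."""
    # Accept either a name string or an object, whose class name is used.
    name = processor if isinstance(processor, str) else type(processor).__name__
    normalized = name.lower()
    for fragment, rule in PROCESSOR_FRAGMENTS:
        if fragment in normalized:
            return rule
    return None
```

Substring matching against a fixed allowlist tolerates suffixes or instance counters appended to the name, while unknown processors still resolve to `None` rather than being recorded under the wrong rule.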

"""

import json
from datetime import datetime, timezone
Copilot AI Apr 30, 2026

datetime and timezone are imported but not used in this module. Removing unused imports will keep the module clean and avoid misleading readers about time-based logic in this file.

Suggested change
-from datetime import datetime, timezone
Comment thread app/main.py

# Initialize STT fallback reset tasks
await initialize_fallback_tasks(_background_scheduler)

Copilot AI Apr 30, 2026

initialize_service_health_tasks is imported but never invoked in the lifespan startup, so the service health monitoring background task will never be registered (and the import is currently unused). If the monitor is intended to run, it should be initialized alongside the fallback/langfuse tasks when the background scheduler is created.

Suggested change
+# Initialize service health monitoring tasks
+await initialize_service_health_tasks(_background_scheduler)
)
is_stt_error = any(kw in processor_str for kw in stt_keywords)

if not is_stt_error:
Copilot AI Apr 30, 2026

service_health_monitor is imported but never used in on_pipeline_error. As a result, non‑STT pipeline errors (TTS/LLM/telephony) aren’t recorded into the service health circuits, so auto‑pause can’t trigger from those failures. Consider calling service_health_monitor.record_pipeline_error(...) for non‑STT errors before returning.

Suggested change
 if not is_stt_error:
+    try:
+        service_health_monitor.record_pipeline_error(
+            processor=str(processor),
+            error_message=str(error_msg),
+        )
+    except Exception as health_err:
+        logger.warning(
+            f"Failed to record non-STT pipeline error in service health monitor: {health_err}"
+        )
Comment on lines +215 to +241
async def reset_to_primary(self) -> None:
    """Reset to primary: clear fallback flag."""
    try:
        redis = await get_redis_service()

        # Clear fallback flag
        await redis.delete(self._key_active)
        # Clear failure counter
        await redis.delete(self._key_failure_count)
        # Clear notification dedup key
        await redis.delete(self._key_notified)

        logger.info(
            f"Service fallback ({self.config.service_name}) " f"reset to primary"
        )

        # Reset alert
        if self.config.on_reset_alert:
            try:
                await self.config.on_reset_alert(
                    service_name=self.config.service_name,
                )
            except Exception as alert_err:
                logger.warning(
                    f"Service fallback ({self.config.service_name}) "
                    f"reset alert failed: {alert_err}"
                )

Copilot AI Apr 30, 2026

reset_to_primary() always triggers the reset alert callback even if the fallback/circuit was not active. This can create noisy or misleading “reset/resumed” alerts when callers invoke reset defensively or when resume_calls() iterates over inactive circuits. Consider checking whether the active flag existed (or was deleted) before emitting a reset alert.
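
Since Redis `DEL` returns the number of keys it removed, the "was it actually active" check can ride on the delete itself. The sketch below assumes the project's Redis wrapper passes that count through from `delete()`, which is not confirmed by the PR; `FakeRedis` and `reset_if_active` are hypothetical names used only to keep the example runnable.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class FakeRedis:
    """Tiny in-memory stand-in used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    async def delete(self, *keys: str) -> int:
        # Like Redis DEL: return how many of the given keys existed.
        removed = 0
        for key in keys:
            if key in self.store:
                del self.store[key]
                removed += 1
        return removed


async def reset_if_active(
    redis: FakeRedis,
    active_key: str,
    on_reset: Optional[Callable[[], Awaitable[None]]] = None,
) -> bool:
    """Clear the active flag; emit the reset alert only if it was set."""
    was_active = await redis.delete(active_key) > 0
    if was_active and on_reset is not None:
        await on_reset()
    return was_active
```

Calling this twice in a row for the same circuit would alert once: the second call finds no key, deletes nothing, and stays silent.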

Comment on lines +187 to +197
    """Evaluate all circuits and auto-reset if clean."""
    if not await ENABLE_SERVICE_HEALTH_MONITORING():
        return

    for rule, circuit in _CIRCUITS.items():
        if not await circuit.is_active():
            continue
        # Check if clean (no recent failures) - ServiceFallback TTL handles this
        await circuit.reset_to_primary()

Copilot AI Apr 30, 2026

run_auto_health_check() resets every active circuit unconditionally. Because ServiceFallback._activate() sets the active flag with a TTL, this will clear the pause almost immediately (on the next 60s check), effectively bypassing SERVICE_HEALTH_AUTO_RESUME_MINUTES. This should only reset when the cooldown has actually elapsed (e.g., based on Redis TTL/activation timestamp), or the task should be removed/changed to only emit a reset alert rather than deleting the active key.

Suggested change
-    """Evaluate all circuits and auto-reset if clean."""
-    if not await ENABLE_SERVICE_HEALTH_MONITORING():
-        return
-    for rule, circuit in _CIRCUITS.items():
-        if not await circuit.is_active():
-            continue
-        # Check if clean (no recent failures) - ServiceFallback TTL handles this
-        await circuit.reset_to_primary()
+    """Evaluate all circuits without clearing active pauses.
+
+    Active circuit state is managed by ServiceFallback via a Redis TTL. Resetting
+    the circuit from this periodic task would bypass the configured
+    SERVICE_HEALTH_AUTO_RESUME_MINUTES cooldown, so this health check is
+    intentionally observational only.
+    """
+    if not await ENABLE_SERVICE_HEALTH_MONITORING():
+        return
+    auto_resume_minutes = await SERVICE_HEALTH_AUTO_RESUME_MINUTES()
+    for rule, circuit in _CIRCUITS.items():
+        if not await circuit.is_active():
+            continue
+        logger.info(
+            "[ServiceHealth] Circuit '%s' remains active; waiting for TTL-based "
+            "auto-resume after %s minute(s)",
+            rule,
+            auto_resume_minutes,
+        )
Comment on lines +168 to +169
await circuit._activate(await get_redis_service())

Copilot AI Apr 30, 2026

pause_calls() calls the private ServiceFallback._activate() method. This makes the service health layer depend on fallback internals and makes it harder to change ServiceFallback safely. Prefer adding a public method on ServiceFallback (e.g., activate() / open_circuit()) and calling that here.

Suggested change
-await circuit._activate(await get_redis_service())
+# Open the circuit through the public ServiceFallback API instead of
+# depending on the private _activate() implementation detail.
+threshold_count = int(config.get("threshold_count", 1))
+for _ in range(max(threshold_count, 1)):
+    await circuit.record_failure(
+        error_msg=reason,
+        call_sid="",
+        context=f"manual-pause:{paused_by}",
+    )
coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
app/services/service_health/monitor.py (2)

191-191: 💤 Low value

Rename unused loop variable rule to _rule.

Per the static analysis hint, the loop control variable rule is not used within the loop body.

♻️ Proposed fix
-        for rule, circuit in _CIRCUITS.items():
+        for _rule, circuit in _CIRCUITS.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/services/service_health/monitor.py` at line 191, The loop in function
iterating over _CIRCUITS.items() uses an unused variable named rule; rename that
loop control variable from rule to _rule to satisfy static analysis and clarify
intent (e.g., change "for rule, circuit in _CIRCUITS.items()" to "for _rule,
circuit in _CIRCUITS.items()" in monitor.py where the loop is defined).

79-82: 💤 Low value

Minor: Comment doesn't match actual key prefix.

The comment says "Use circuit: prefix instead of fallback:" but key_prefix is set to "health", resulting in keys like health:elevenlabs:active.

♻️ Fix the comment
         _CIRCUITS[rule] = ServiceFallback(
             ServiceFallbackConfig(
                 service_name=rule,
-                key_prefix="health",  # Use circuit: prefix instead of fallback:
+                key_prefix="health",  # Use health: prefix instead of fallback:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/services/service_health/monitor.py` around lines 79 - 82, The inline
comment next to the ServiceFallbackConfig key_prefix is incorrect: it currently
says "Use circuit: prefix instead of fallback:" while key_prefix is set to
"health", causing mismatch; update the comment in the _CIRCUITS initialization
(where ServiceFallback and ServiceFallbackConfig are constructed) to accurately
reflect the actual key_prefix value (e.g., "Use health: prefix for keys" or
similar) so the comment matches the key_prefix="health".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 613-647: The block handling STT errors uses undefined symbols and
uninitialized attributes: import or define fire_and_forget and EndFrame (from
pipecat.frames.frames) so calls to
fire_and_forget(self._send_mid_call_stt_alert()) and EndFrame() resolve;
initialize self.stt_provider and self._stt_failure_recorded in the class
__init__ (e.g., set default provider string and False) so checks in that block
won't raise AttributeError; and replace the silent "except: pass" around await
task.queue_frames([EndFrame()]) with a specific exception handler that logs the
error (use logger.warning or logger.exception) to avoid swallowing failures.
Ensure references to record_stt_failure and _send_mid_call_stt_alert remain
unchanged.

In `@app/main.py`:
- Around line 172-175: The import initialize_service_health_tasks is never
invoked, so the service health job (check_and_reset_circuits) isn’t registered;
add a call to await initialize_service_health_tasks(_background_scheduler)
(similar to the existing await initialize_fallback_tasks(_background_scheduler))
in the initialization sequence (e.g., immediately after
initialize_fallback_tasks) so the background scheduler registers the
check_and_reset_circuits task.

In `@app/services/fallback/__init__.py`:
- Around line 23-29: Add the missing async config accessors to
app/core/config/dynamic.py: implement async functions named
ENABLE_BB_STT_FALLBACK(), BB_STT_FALLBACK_PROVIDER(),
BB_STT_FALLBACK_THRESHOLD(), BB_STT_FALLBACK_WINDOW_SECS(), and
BB_STT_FALLBACK_DURATION_SECS() that return the expected typed config values
(bool, str, int/float as appropriate) by reading from your existing settings/env
or configuration store; ensure they are declared async, have correct return type
hints, provide sensible defaults, and are exported (update __all__ if the module
uses it) so importing these symbols from app.services.fallback.__init__ will not
raise ImportError.

In `@app/services/service_health/monitor.py`:
- Around line 186-196: run_auto_health_check currently calls reset_to_primary()
for every active circuit in _CIRCUITS unconditionally, which clears circuits and
fires reset alerts prematurely; instead, fetch and respect the existing Redis
TTL for each circuit's _key_active (or check existence) and only call
reset_to_primary() when the key has expired or its TTL is below a small
threshold (e.g., <= probe window) or when an actual health probe indicates
recovery; update run_auto_health_check to read the circuit's _key_active TTL (or
add a circuit.is_about_to_expire()/has_expired() helper) and only call
reset_to_primary() in those cases so fallback_duration_secs and ServiceFallback
TTL handling govern automatic resets.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58ed53ee-bfc3-4dd1-bd96-bc50f42a4905

📥 Commits

Reviewing files that changed from the base of the PR and between 71e9cb6 and 122e909.

📒 Files selected for processing (8)
  • app/ai/voice/agents/breeze_buddy/agent/__init__.py
  • app/ai/voice/agents/breeze_buddy/managers/calls.py
  • app/core/config/dynamic.py
  • app/main.py
  • app/services/fallback/__init__.py
  • app/services/service_health/__init__.py
  • app/services/service_health/monitor.py
  • app/services/service_health/rules.json

Comment on lines +613 to +647
# Detect STT errors by processor name keywords
processor_str = str(processor).lower()
stt_keywords = (
    "stt",
    "soniox",
    "deepgram",
    "transcri",
    "google",
    "sarvam",
)
is_stt_error = any(kw in processor_str for kw in stt_keywords)

if not is_stt_error:
    return

logger.warning(f"STT error detected from processor: {processor}")

# Record failure in fallback system (once per call, Soniox only)
if self.stt_provider == "soniox" and not self._stt_failure_recorded:
    self._stt_failure_recorded = True
    try:
        await record_stt_failure(
            error_msg=str(error_msg)[:200],
            call_sid=self.call_sid or "",
            context="mid-call",
        )
    except Exception as fb_err:
        logger.warning(f"STT fallback record_failure failed: {fb_err}")

# Alert and end call — no mid-call swap in Phase 1
fire_and_forget(self._send_mid_call_stt_alert())
try:
    await task.queue_frames([EndFrame()])
except Exception:
    pass

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if fire_and_forget is defined or imported anywhere in the codebase
echo "=== Searching for fire_and_forget definition/import ==="
rg -n "fire_and_forget" --type=py

echo ""
echo "=== Checking EndFrame import in agent/__init__.py ==="
rg -n "from pipecat.frames.frames import" app/ai/voice/agents/breeze_buddy/agent/__init__.py

echo ""
echo "=== Checking stt_provider attribute initialization ==="
rg -n "self\.stt_provider" app/ai/voice/agents/breeze_buddy/agent/__init__.py

echo ""
echo "=== Checking _stt_failure_recorded attribute initialization ==="
rg -n "self\._stt_failure_recorded" app/ai/voice/agents/breeze_buddy/agent/__init__.py

Repository: juspay/clairvoyance

Length of output: 2978


🏁 Script executed:

#!/bin/bash
# Get first 100 lines to see all imports
echo "=== First 100 lines (imports section) ==="
head -100 app/ai/voice/agents/breeze_buddy/agent/__init__.py

echo ""
echo "=== Check if fire_and_forget is defined in agent/__init__.py ==="
rg -n "def fire_and_forget" app/ai/voice/agents/breeze_buddy/agent/__init__.py

echo ""
echo "=== Search for __init__ method and surrounding context ==="
rg -n "def __init__" app/ai/voice/agents/breeze_buddy/agent/__init__.py

Repository: juspay/clairvoyance

Length of output: 3709


🏁 Script executed:

#!/bin/bash
# Read the __init__ method body (lines 116-200 to capture most of it)
echo "=== __init__ method body (lines 116-250) ==="
sed -n '116,250p' app/ai/voice/agents/breeze_buddy/agent/__init__.py

echo ""
echo "=== Search for fire_and_forget definition anywhere in file ==="
grep -n "def fire_and_forget\|fire_and_forget.*=" app/ai/voice/agents/breeze_buddy/agent/__init__.py

echo ""
echo "=== Search for any fire_and_forget function or import in entire file ==="
grep -n "fire_and_forget" app/ai/voice/agents/breeze_buddy/agent/__init__.py

Repository: juspay/clairvoyance

Length of output: 5457


🏁 Script executed:

#!/bin/bash
# Verify the exact line range (613-647) matches the code snippet shown in review
echo "=== Lines 613-647 to verify against review snippet ==="
sed -n '613,647p' app/ai/voice/agents/breeze_buddy/agent/__init__.py

Repository: juspay/clairvoyance

Length of output: 1466


Critical: Undefined names and attributes will cause runtime errors.

The code contains multiple undefined names and attributes that will raise NameError and AttributeError at runtime:

  1. fire_and_forget (line 643) — not imported or defined anywhere in the file
  2. EndFrame (line 645) — not imported from pipecat.frames.frames
  3. self.stt_provider (line 631) — not initialized in __init__
  4. self._stt_failure_recorded (line 632) — not initialized in __init__

The silent except: pass (lines 646-647) will also swallow errors without any indication of failure.

Proposed fixes

Add missing import:

-from pipecat.frames.frames import LLMMessagesAppendFrame, TTSSpeakFrame
+from pipecat.frames.frames import EndFrame, LLMMessagesAppendFrame, TTSSpeakFrame

Import or define fire_and_forget:

+import asyncio
+
+def fire_and_forget(coro):
+    """Schedule coroutine without awaiting."""
+    asyncio.create_task(coro)

Initialize attributes in __init__ after line 195 (error tracking section):

         # Error tracking
         self.errors: List[Dict[str, Any]] = []
+
+        # STT fallback tracking
+        self.stt_provider: Optional[str] = None
+        self._stt_failure_recorded: bool = False

Replace silent exception:

-            try:
-                await task.queue_frames([EndFrame()])
-            except Exception:
-                pass
+            try:
+                await task.queue_frames([EndFrame()])
+            except Exception as e:
+                logger.warning(f"Failed to queue EndFrame: {e}")
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 639-639: Do not catch blind exception: Exception

(BLE001)


[error] 643-643: Undefined name fire_and_forget

(F821)


[error] 645-645: Undefined name EndFrame

(F821)


[error] 646-647: try-except-pass detected, consider logging the exception

(S110)


[warning] 646-646: Do not catch blind exception: Exception

(BLE001)


Comment thread app/main.py
Comment on lines +172 to 175
# Initialize STT fallback reset tasks
await initialize_fallback_tasks(_background_scheduler)

### Register new tasks here

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing call to initialize_service_health_tasks.

initialize_service_health_tasks is imported on line 69 but never invoked. The service health background task (check_and_reset_circuits) will not be registered, meaning circuits will never auto-reset based on the configured schedule.

🐛 Proposed fix to register service health tasks
             # Initialize STT fallback reset tasks
             await initialize_fallback_tasks(_background_scheduler)

+            # Initialize service health check tasks
+            await initialize_service_health_tasks(_background_scheduler)
+
             ### Register new tasks here
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/main.py` around lines 172 - 175, The import
initialize_service_health_tasks is never invoked, so the service health job
(check_and_reset_circuits) isn’t registered; add a call to await
initialize_service_health_tasks(_background_scheduler) (similar to the existing
await initialize_fallback_tasks(_background_scheduler)) in the initialization
sequence (e.g., immediately after initialize_fallback_tasks) so the background
scheduler registers the check_and_reset_circuits task.

Comment on lines +23 to +29
from app.core.config.dynamic import (
    BB_STT_FALLBACK_DURATION_SECS,
    BB_STT_FALLBACK_PROVIDER,
    BB_STT_FALLBACK_THRESHOLD,
    BB_STT_FALLBACK_WINDOW_SECS,
    ENABLE_BB_STT_FALLBACK,
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if STT fallback config functions exist in dynamic.py
echo "=== Checking for STT fallback config functions in dynamic.py ==="
rg -n "BB_STT_FALLBACK|ENABLE_BB_STT_FALLBACK" app/core/config/dynamic.py

echo ""
echo "=== Full list of async def functions in dynamic.py ==="
rg -n "^async def " app/core/config/dynamic.py

Repository: juspay/clairvoyance

Length of output: 2803


🏁 Script executed:

# First, read the actual file to confirm the imports shown in the review
head -35 app/services/fallback/__init__.py

Repository: juspay/clairvoyance

Length of output: 1475


🏁 Script executed:

# Search for the missing config functions anywhere in the codebase
rg -r "BB_STT_FALLBACK_DURATION_SECS|BB_STT_FALLBACK_PROVIDER|BB_STT_FALLBACK_THRESHOLD|BB_STT_FALLBACK_WINDOW_SECS|ENABLE_BB_STT_FALLBACK" --type py

Repository: juspay/clairvoyance

Length of output: 460


🏁 Script executed:

# Check if app/services/fallback/__init__.py exists and its size
wc -l app/services/fallback/__init__.py

Repository: juspay/clairvoyance

Length of output: 101


🏁 Script executed:

# Search for where these config functions are defined (not imported)
rg "async def BB_STT_FALLBACK_DURATION_SECS|async def BB_STT_FALLBACK_PROVIDER|async def BB_STT_FALLBACK_THRESHOLD|async def BB_STT_FALLBACK_WINDOW_SECS|async def ENABLE_BB_STT_FALLBACK" --type py

Repository: juspay/clairvoyance

Length of output: 45


🏁 Script executed:

# More comprehensive search - look for these constants/functions anywhere
rg "BB_STT_FALLBACK_DURATION_SECS|ENABLE_BB_STT_FALLBACK" --type py

Repository: juspay/clairvoyance

Length of output: 486


Critical: Missing dynamic configuration functions will cause ImportError.

The imports on lines 23-29 reference configuration functions that do not exist anywhere in the codebase:

  • BB_STT_FALLBACK_DURATION_SECS
  • BB_STT_FALLBACK_PROVIDER
  • BB_STT_FALLBACK_THRESHOLD
  • BB_STT_FALLBACK_WINDOW_SECS
  • ENABLE_BB_STT_FALLBACK

These functions are not defined in app/core/config/dynamic.py (only ENABLE_SERVICE_HEALTH_MONITORING and SERVICE_HEALTH_AUTO_RESUME_MINUTES were added). Add the missing async config functions to app/core/config/dynamic.py before this module can be imported.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/services/fallback/__init__.py` around lines 23 - 29, Add the missing
async config accessors to app/core/config/dynamic.py: implement async functions
named ENABLE_BB_STT_FALLBACK(), BB_STT_FALLBACK_PROVIDER(),
BB_STT_FALLBACK_THRESHOLD(), BB_STT_FALLBACK_WINDOW_SECS(), and
BB_STT_FALLBACK_DURATION_SECS() that return the expected typed config values
(bool, str, int/float as appropriate) by reading from your existing settings/env
or configuration store; ensure they are declared async, have correct return type
hints, provide sensible defaults, and are exported (update __all__ if the module
uses it) so importing these symbols from app.services.fallback.__init__ will not
raise ImportError.
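
As a rough illustration of the accessors described above, here is one possible shape, assuming each value can be read as a string from some key-value source. The `_read_config` helper, the use of environment variables, and every default value are assumptions made for this sketch; the real `app/core/config/dynamic.py` presumably has its own lookup mechanism that should be reused instead.

```python
import asyncio
import os


async def _read_config(key: str, default: str) -> str:
    """Hypothetical lookup; swap in the project's real dynamic-config reader."""
    return os.environ.get(key, default)


async def ENABLE_BB_STT_FALLBACK() -> bool:
    # Assumed default: fallback disabled unless explicitly enabled.
    return (await _read_config("ENABLE_BB_STT_FALLBACK", "false")).lower() == "true"


async def BB_STT_FALLBACK_PROVIDER() -> str:
    # Assumed default: empty string meaning "no fallback provider configured".
    return await _read_config("BB_STT_FALLBACK_PROVIDER", "")


async def BB_STT_FALLBACK_THRESHOLD() -> int:
    # Assumed default: 3 failures inside the window trip the fallback.
    return int(await _read_config("BB_STT_FALLBACK_THRESHOLD", "3"))


async def BB_STT_FALLBACK_WINDOW_SECS() -> int:
    # Assumed default: failures are counted over a 60-second window.
    return int(await _read_config("BB_STT_FALLBACK_WINDOW_SECS", "60"))


async def BB_STT_FALLBACK_DURATION_SECS() -> int:
    # Assumed default: fallback stays active for 300 seconds.
    return int(await _read_config("BB_STT_FALLBACK_DURATION_SECS", "300"))
```

Callers would await each accessor, for example `threshold = await BB_STT_FALLBACK_THRESHOLD()`, matching how `ENABLE_SERVICE_HEALTH_MONITORING()` is awaited elsewhere in this PR.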

Comment on lines +186 to +196
async def run_auto_health_check(self) -> None:
    """Evaluate all circuits and auto-reset if clean."""
    if not await ENABLE_SERVICE_HEALTH_MONITORING():
        return

    for rule, circuit in _CIRCUITS.items():
        if not await circuit.is_active():
            continue
        # Check if clean (no recent failures) - ServiceFallback TTL handles this
        await circuit.reset_to_primary()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Auto health check unconditionally resets all active circuits.

The run_auto_health_check method calls reset_to_primary() on every active circuit without verifying whether failures have actually subsided. The comment mentions "ServiceFallback TTL handles this," but then immediately resets anyway.

Since the _key_active Redis key already has a TTL (fallback_duration_secs) that auto-expires, calling reset_to_primary() every 60 seconds will:

  1. Clear circuits before their configured duration expires
  2. Fire reset alerts prematurely

If the intent is to let TTL handle expiration, this loop should only fire the reset alert when the key is about to expire or has expired, not force-reset every active circuit.

🔧 Suggested approach

Either remove the unconditional reset:

     async def run_auto_health_check(self) -> None:
         """Evaluate all circuits and auto-reset if clean."""
         if not await ENABLE_SERVICE_HEALTH_MONITORING():
             return
 
-        for rule, circuit in _CIRCUITS.items():
-            if not await circuit.is_active():
-                continue
-            # Check if clean (no recent failures) - ServiceFallback TTL handles this
-            await circuit.reset_to_primary()
+        # TTL on the active key handles auto-expiration.
+        # This task now only serves to fire reset alerts when keys expire.
+        for _rule, circuit in _CIRCUITS.items():
+            # Check if circuit was active but has now expired (TTL elapsed)
+            # No action needed - key self-expired
+            pass

Or add actual health-check logic (e.g., probe the service):

     async def run_auto_health_check(self) -> None:
-        """Evaluate all circuits and auto-reset if clean."""
+        """Probe services and reset circuits if healthy."""
         if not await ENABLE_SERVICE_HEALTH_MONITORING():
             return
 
-        for rule, circuit in _CIRCUITS.items():
+        for _rule, circuit in _CIRCUITS.items():
             if not await circuit.is_active():
                 continue
-            # Check if clean (no recent failures) - ServiceFallback TTL handles this
-            await circuit.reset_to_primary()
+            # TODO: Add actual health probe logic here
+            # Only reset if probe succeeds
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 191-191: Loop control variable rule not used within loop body

Rename unused rule to _rule

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/services/service_health/monitor.py` around lines 186 - 196,
run_auto_health_check currently calls reset_to_primary() for every active
circuit in _CIRCUITS unconditionally, which clears circuits and fires reset
alerts prematurely; instead, fetch and respect the existing Redis TTL for each
circuit's _key_active (or check existence) and only call reset_to_primary() when
the key has expired or its TTL is below a small threshold (e.g., <= probe
window) or when an actual health probe indicates recovery; update
run_auto_health_check to read the circuit's _key_active TTL (or add a
circuit.is_about_to_expire()/has_expired() helper) and only call
reset_to_primary() in those cases so fallback_duration_secs and ServiceFallback
TTL handling govern automatic resets.
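
The TTL-aware variant described above could look roughly like the following sketch. It relies on the documented return values of the Redis `TTL` command (`-2` when the key does not exist, `-1` when the key has no expiry). `FakeCircuit`, `active_ttl`, `should_reset`, and the 60-second threshold are hypothetical names and values introduced for illustration; the real circuit class would query Redis for the TTL of its `_key_active` key.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeCircuit:
    """Hypothetical in-memory stand-in for a Redis-backed circuit."""

    ttl_secs: int  # what Redis TTL would return for the active key
    reset_count: int = 0

    async def active_ttl(self) -> int:
        return self.ttl_secs

    async def reset_to_primary(self) -> None:
        self.reset_count += 1


def should_reset(ttl_secs: int, threshold_secs: int) -> bool:
    """Reset only when the active key exists and is about to lapse.

    -2 (key missing) means the circuit already expired, so there is nothing
    to reset; -1 (no expiry) is left for manual handling in this sketch.
    """
    return 0 <= ttl_secs <= threshold_secs


async def run_auto_health_check(
    circuits: Dict[str, FakeCircuit], threshold_secs: int = 60
) -> List[str]:
    """Return the names of circuits that were reset during this pass."""
    reset: List[str] = []
    for name, circuit in circuits.items():
        if should_reset(await circuit.active_ttl(), threshold_secs):
            await circuit.reset_to_primary()
            reset.append(name)
    return reset
```

With this shape, a circuit that still has most of its `fallback_duration_secs` remaining is left untouched, so the periodic task no longer shortens the configured fallback duration.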
