
Cascading Fallback

Problem

Depending on a single LLM provider means a single point of failure. API outages, rate limits, billing issues, or model deprecations take your entire agent system offline.

Even without failures, using the most expensive model for every task wastes money. A simple "restart this service" doesn't need the same model as "architect a new feature."

Solution

Configure a chain of models from premium to economical to local. Route tasks to the cheapest model that can handle them, and fall back down the chain on failures.

Premium API (Claude Opus, GPT-4)
  ↓ (on failure or for simpler tasks)
Mid-tier API (Claude Sonnet, GPT-4o-mini)
  ↓ (on failure)
Local Model (Qwen 8B, Llama 8B)

Implementation

1. Define the Chain

fallback_chain:
  - provider: anthropic
    model: claude-opus-4
    use_for: complex reasoning, architecture, code review
  - provider: anthropic
    model: claude-sonnet-4
    use_for: routine tasks, simple fixes
  - provider: local
    model: qwen3-8b
    use_for: last resort, offline operation
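The chain above can be mirrored in code as an ordered list of tiers. A minimal sketch, assuming a Python host; the `ModelTier` type and constant names are illustrative, not part of any provider SDK:

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    provider: str
    model: str
    use_for: str

# Ordered from premium to economical, mirroring the YAML chain above.
FALLBACK_CHAIN = [
    ModelTier("anthropic", "claude-opus-4", "complex reasoning, architecture, code review"),
    ModelTier("anthropic", "claude-sonnet-4", "routine tasks, simple fixes"),
    ModelTier("local", "qwen3-8b", "last resort, offline operation"),
]
```

Keeping the chain as plain data makes it easy to load from the YAML config and to iterate over during failover.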

2. Automatic Failover

When the primary model returns an error (429, 500, timeout), automatically retry with the next model in the chain:

Request → Model A → 429 Rate Limited
                  → Model B → Success ✓
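The failover loop itself is small. A sketch, assuming providers are passed in as callables and signal failures with a hypothetical `ProviderError` carrying an HTTP status; only retryable errors fall through to the next model:

```python
RETRYABLE = {429, 500, 503}  # rate limits and server errors; timeouts map here too

class ProviderError(Exception):
    """Illustrative error type; real SDKs raise their own exception classes."""
    def __init__(self, status: int):
        super().__init__(f"provider error {status}")
        self.status = status

def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each provider in chain order; fall through on retryable errors."""
    last_err = None
    for call in providers:
        try:
            return call(prompt)
        except ProviderError as e:
            if e.status not in RETRYABLE:
                raise  # non-retryable (e.g. 401 auth failure) should surface immediately
            last_err = e
    raise RuntimeError("all providers in chain failed") from last_err
```

Note that non-retryable errors (bad API key, malformed request) are re-raised rather than masked by a fallback, since retrying them on a cheaper model would only hide a configuration bug.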

3. Task-Based Routing

Not every task needs the top model. Route by complexity:

Task Type                    Suggested Tier
Architecture decisions       Premium
Code review                  Premium
Debugging from stack trace   Premium
Simple service restart       Mid-tier
Log grep and summary         Mid-tier
Status checks                Mid-tier or Local
Heartbeat polls              Mid-tier or Local
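The routing table can be expressed as a simple lookup with a safe default. A sketch; the task-type keys and tier names are hypothetical labels, and tasks the table marks "Mid-tier or Local" are routed to local here as the cheapest adequate option:

```python
# Cheapest adequate tier per task type; keys are illustrative.
TIER_BY_TASK = {
    "architecture": "premium",
    "code_review": "premium",
    "debug_stack_trace": "premium",
    "service_restart": "mid",
    "log_summary": "mid",
    "status_check": "local",
    "heartbeat": "local",
}

def route(task_type: str) -> str:
    # Unknown task types default to premium rather than risk a weak model.
    return TIER_BY_TASK.get(task_type, "premium")
```

Defaulting unknown tasks upward trades cost for safety; the opposite default would quietly send novel, possibly complex tasks to the weakest model.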

4. Monitor Fallback Frequency

Track how often each fallback level is triggered. High fallback rates indicate:

  • Primary provider reliability issues
  • Budget constraints hitting rate limits
  • Tasks being over-routed to the premium tier
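A per-edge counter is enough to surface these signals. A minimal sketch, assuming you instrument the failover loop to call `record_fallback` whenever a request falls through to the next model:

```python
from collections import Counter

fallback_counts: Counter = Counter()

def record_fallback(from_model: str, to_model: str) -> None:
    """Count each fallback edge in the chain (e.g. opus -> sonnet)."""
    fallback_counts[(from_model, to_model)] += 1

def fallback_rates(total_requests: int) -> dict:
    """Share of all requests that traversed each fallback edge."""
    return {edge: n / total_requests for edge, n in fallback_counts.items()}
```

A sustained rise in one edge's rate points at the failure modes listed above: the upstream provider is flaky, you are hitting rate limits, or too many tasks are being routed to the premium tier in the first place.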

Trade-offs

  • Quality variance: Lower-tier models produce lower-quality output. Tasks routed to fallback models may need human review.
  • Complexity: Managing multiple providers means multiple API keys, billing, and configuration.
  • Latency: Failover adds latency (failed request timeout + retry).

When to Skip

  • You have a single provider with excellent uptime and budget isn't a concern.
  • All tasks genuinely require the premium model's capabilities.
  • You're in a regulated environment where model consistency matters more than availability.

Related Patterns