Depending on a single LLM provider creates a single point of failure. API outages, rate limits, billing issues, or model deprecations can take your entire agent system offline.
Even without failures, using the most expensive model for every task wastes money. A simple "restart this service" doesn't need the same model as "architect a new feature."
Configure a chain of models from premium to economical to local. Route tasks to the cheapest model that can handle them, and fall back down the chain on failures.
```
Premium API (Claude Opus, GPT-4)
    ↓ (on failure or for simpler tasks)
Mid-tier API (Claude Sonnet, GPT-4o-mini)
    ↓ (on failure)
Local Model (Qwen 8B, Llama 8B)
```
```yaml
fallback_chain:
  - provider: anthropic
    model: claude-opus-4
    use_for: complex reasoning, architecture, code review
  - provider: anthropic
    model: claude-sonnet-4
    use_for: routine tasks, simple fixes
  - provider: local
    model: qwen3-8b
    use_for: last resort, offline operation
```

When the primary model returns an error (429, 500, timeout), automatically retry with the next model in the chain:
```
Request → Model A → 429 Rate Limited
        → Model B → Success ✓
```
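The cascade can be sketched in a few lines of Python. This is a minimal sketch, not any SDK's actual API: `ProviderError`, `ModelTier`, and the `call` functions are hypothetical stand-ins for whatever client library you use, timeouts are modeled with Python's built-in `TimeoutError`, and treating 502/503 as retryable alongside 429/500 is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Errors worth retrying on the next model. 429 and 500 come from the text above;
# 502 and 503 are an assumption (other transient server errors).
RETRYABLE_STATUS = {429, 500, 502, 503}


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status returned by a provider."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned {status}")
        self.status = status


@dataclass
class ModelTier:
    provider: str
    model: str
    call: Callable[[str], str]  # hypothetical: sends a prompt, returns the reply


def complete_with_fallback(chain: Sequence[ModelTier], prompt: str) -> Tuple[str, str]:
    """Try each tier in order and return (model, reply) from the first success."""
    failures: List[str] = []
    for tier in chain:
        try:
            return tier.model, tier.call(prompt)
        except TimeoutError:
            failures.append(f"{tier.model}: timeout")
        except ProviderError as exc:
            if exc.status not in RETRYABLE_STATUS:
                raise  # e.g. 400: a different model will not fix a malformed request
            failures.append(f"{tier.model}: {exc.status}")
    raise RuntimeError("every model in the chain failed: " + "; ".join(failures))


# Usage: Model A is rate limited, Model B answers.
def rate_limited(prompt: str) -> str:
    raise ProviderError(429)


def answers(prompt: str) -> str:
    return f"ok: {prompt}"


chain = [
    ModelTier("anthropic", "claude-opus-4", rate_limited),
    ModelTier("anthropic", "claude-sonnet-4", answers),
]
print(complete_with_fallback(chain, "restart the service"))
# ('claude-sonnet-4', 'ok: restart the service')
```

Re-raising non-retryable errors instead of cascading is a design choice of this sketch. Which statuses should cascade is a judgment call; the billing issues mentioned earlier, for instance, may surface as other 4xx codes that you might also want in `RETRYABLE_STATUS`.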
Not every task needs the top model. Route by complexity:
| Task Type | Suggested Tier |
|---|---|
| Architecture decisions | Premium |
| Code review | Premium |
| Debugging from stack trace | Premium |
| Simple service restart | Mid-tier |
| Log grep and summary | Mid-tier |
| Status checks | Mid-tier or Local |
| Heartbeat polls | Mid-tier or Local |
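One way to combine this table with the chain is to start each task at its cheapest acceptable tier and fall back only downward from there. The sketch below rests on three assumptions: the task-type labels are hypothetical, the "Mid-tier or Local" rows start at mid-tier so the local model remains available as their fallback, and unrecognized task types start at premium.

```python
from typing import List

# Tiers from premium to economical to local, in fallback order.
TIERS: List[str] = ["premium", "mid-tier", "local"]

# Hypothetical task-type labels mapped to the first tier to try (from the table above).
START_TIER = {
    "architecture": "premium",
    "code_review": "premium",
    "debug_stack_trace": "premium",
    "service_restart": "mid-tier",
    "log_summary": "mid-tier",
    "status_check": "mid-tier",  # table: "Mid-tier or Local"; local stays as fallback
    "heartbeat": "mid-tier",     # table: "Mid-tier or Local"; local stays as fallback
}


def tiers_for(task_type: str) -> List[str]:
    """Return the tiers to try, in order: the starting tier, then everything below it.

    Unknown task types start at premium (an assumed safe default).
    """
    start = TIERS.index(START_TIER.get(task_type, "premium"))
    return TIERS[start:]


print(tiers_for("code_review"))  # ['premium', 'mid-tier', 'local']
print(tiers_for("heartbeat"))    # ['mid-tier', 'local']
```

Defaulting unknown task types to premium trades cost for safety; defaulting them to mid-tier makes the opposite trade.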
Track how often each fallback level is triggered. High fallback rates indicate:
- Primary provider reliability issues
- Account or budget limits that regularly trigger rate limiting
- Tasks being over-routed to the premium tier
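A minimal way to track this is to record which level of the chain served each request (0 for the primary model) and compute the share served by fallbacks. `FallbackStats` is a hypothetical helper, and the 20% alert threshold is an arbitrary example value, not a recommendation from this pattern.

```python
from collections import Counter
from typing import Dict

ALERT_THRESHOLD = 0.20  # example value; tune to your own baseline


class FallbackStats:
    """Counts which chain level served each request (0 = primary model)."""

    def __init__(self) -> None:
        self.served_by: Counter = Counter()

    def record(self, level: int) -> None:
        self.served_by[level] += 1

    def share_by_level(self) -> Dict[int, float]:
        """Fraction of requests served at each level."""
        total = sum(self.served_by.values())
        if total == 0:
            return {}
        return {level: n / total for level, n in sorted(self.served_by.items())}

    def fallback_rate(self) -> float:
        """Fraction of requests that were not served by the primary model."""
        total = sum(self.served_by.values())
        return 1 - self.served_by[0] / total if total else 0.0


stats = FallbackStats()
for level in [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]:
    stats.record(level)

print(stats.share_by_level())  # {0: 0.7, 1: 0.2, 2: 0.1}
if stats.fallback_rate() > ALERT_THRESHOLD:
    print(f"fallback rate {stats.fallback_rate():.0%} exceeds {ALERT_THRESHOLD:.0%}")
```

Tagging each record with its task type (not shown) would make it easier to tell over-routing apart from provider outages.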

The pattern comes with trade-offs:

- Quality variance: Lower-tier models produce lower-quality output, so tasks routed to fallback models may need human review.
- Complexity: Managing multiple providers means multiple API keys, billing, and configuration.
- Latency: Failover adds latency (failed request timeout + retry).
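How much latency failover adds depends mostly on how the upstream model fails: a 429 typically comes back quickly, while a timeout costs the full timeout before the next model is even tried. The sketch below adds up the wall-clock time spent on failed attempts; `failover_delay` is a hypothetical helper, and the 30-second timeout and 0.5-second error response are illustrative numbers, not measurements.

```python
from typing import List


def failover_delay(failures: List[str], timeout_s: float, error_response_s: float) -> float:
    """Seconds spent on failed attempts before the request reaches a working model.

    A "timeout" failure costs the full per-attempt timeout; any other failure
    (e.g. an HTTP 429 or 500) is assumed to return after error_response_s.
    """
    return sum(timeout_s if f == "timeout" else error_response_s for f in failures)


# Primary is rate limited: the fallback starts almost immediately.
print(failover_delay(["429"], timeout_s=30.0, error_response_s=0.5))  # 0.5
# Primary hangs, then mid-tier returns 500: the request waits 30.5 s extra.
print(failover_delay(["timeout", "500"], timeout_s=30.0, error_response_s=0.5))  # 30.5
```

Under these assumptions the per-attempt timeout, more than the number of tiers, dominates worst-case failover latency; shorter timeouts on the upper tiers reduce that cost at the risk of abandoning slow but valid responses.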

Skip this pattern when:

- You have a single provider with excellent uptime and budget isn't a concern.
- All tasks genuinely require the premium model's capabilities.
- You're in a regulated environment where model consistency matters more than availability.

Related patterns:

- Graceful Degradation: cascading fallback is one form of graceful degradation
- Heartbeat Monitoring: heartbeats can use cheaper models