Merged
34 commits
5534b40
fix(team-routing): use deterministic team model group names
Sameerlite Mar 23, 2026
aeb932d
fix(team-routing): keep team model routing on public names
Sameerlite Mar 23, 2026
1835e9a
chore(team-routing): remove temporary candidate pool logs
Sameerlite Mar 23, 2026
7b5e7e0
fix(router): address Greptile review comments
Sameerlite Mar 23, 2026
248fb8b
fix(router): address remaining Greptile P0/P1 issues
Sameerlite Mar 23, 2026
ef9ea1f
fix(router): address Greptile P1/P2 performance issues
Sameerlite Mar 23, 2026
4f302f1
fix(router): prevent cross-team deployment leakage in fallback path
Sameerlite Mar 23, 2026
f5b7298
fix(management): query DB directly for sibling deployments on rename
Sameerlite Mar 23, 2026
298df75
fix(router): guard None model_info and deduplicate team index logic
Sameerlite Mar 23, 2026
8aa58bd
fix(routing): prevent stale model_aliases from interfering with team …
Sameerlite Mar 23, 2026
e8fb776
perf(routing): optimize team model checks and improve test coverage
Sameerlite Mar 23, 2026
8db867c
fix(routing): address state consistency and type safety issues
Sameerlite Mar 23, 2026
173695f
Fix greptile comments
Sameerlite Mar 23, 2026
303072d
Fix greptile comments
Sameerlite Mar 23, 2026
fc6865c
Fix greptile comments
Sameerlite Mar 23, 2026
d02a70a
Fix greptile comments
Sameerlite Mar 23, 2026
316a742
Fix greptile comments
Sameerlite Mar 23, 2026
9a0a216
Fix code qa issues
Sameerlite Mar 23, 2026
c6cc034
Fix greptile reviews and mock test
Sameerlite Mar 23, 2026
fb8d9c2
Fix greptile reviews and mock test
Sameerlite Mar 23, 2026
1a0b30a
Fix greptile reviews and mock test
Sameerlite Mar 23, 2026
592ac98
fix(router): address Greptile P1/P2 review comments
Sameerlite Mar 24, 2026
2321d77
fix(router): address remaining Greptile review comments
Sameerlite Mar 24, 2026
7436f88
fix(router): address final Greptile P1/P2 comments
Sameerlite Mar 24, 2026
1fac58a
fix(tests): reset module-level cache in stale alias bypass tests
Sameerlite Mar 24, 2026
15f5dc3
Fix tests
Sameerlite Mar 26, 2026
d3568ef
Merge pull request #24611 from Sameerlite/Sameerlite/order-fallback2
yuneng-berri Mar 27, 2026
7675488
feat(router): add health-check-driven routing behind opt-in flag
Sameerlite Mar 27, 2026
f784beb
fix: re-attach model_id after endpoint cleaning, bump log level
Sameerlite Mar 27, 2026
8210fd7
fix: revert accidental _litellm_uuid import back to _uuid
Sameerlite Mar 27, 2026
e4a1e52
Merge pull request #2 from Sameerlite/litellm_litellm_health-check-dr…
Sameerlite Mar 27, 2026
09675ef
Fix test
Sameerlite Mar 27, 2026
931c88f
Fix test
Sameerlite Mar 27, 2026
c4159a2
Fix codeql
Sameerlite Mar 27, 2026
1 change: 1 addition & 0 deletions docs/my-website/docs/proxy/config_settings.md
@@ -804,6 +804,7 @@ router_settings:
| LITELLM_OTEL_INTEGRATION_ENABLE_EVENTS | Optionally enable semantic logs for OTEL
| LITELLM_OTEL_INTEGRATION_ENABLE_METRICS | Optionally enable semantic metrics for OTEL
| LITELLM_ENABLE_PYROSCOPE | If true, enables Pyroscope CPU profiling. Profiles are sent to PYROSCOPE_SERVER_ADDRESS. Off by default. See [Pyroscope profiling](/proxy/pyroscope_profiling).
| LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS | When `true`, if a team's legacy `model_aliases` entry maps a public model name to an internal `model_name_<team_id>_<uuid>` deployment, pre-call handling can skip that rewrite when team-scoped sibling deployments exist for the public name—so load balancing / `order` apply across siblings. Default is `false` for backwards compatibility. See [Team-scoped models and legacy aliases](./load_balancing#team-scoped-models-and-legacy-model_aliases). When stale aliases are detected and this flag is off, the proxy may log a one-time warning.
| PYROSCOPE_APP_NAME | Application name reported to Pyroscope. Required when LITELLM_ENABLE_PYROSCOPE is true. No default.
| PYROSCOPE_SERVER_ADDRESS | Pyroscope server URL to send profiles to. Required when LITELLM_ENABLE_PYROSCOPE is true. No default.
| PYROSCOPE_SAMPLE_RATE | Optional. Sample rate for Pyroscope profiling (integer). No default; when unset, the pyroscope-io library default is used.
83 changes: 83 additions & 0 deletions docs/my-website/docs/proxy/health.md
@@ -314,6 +314,89 @@ general_settings:
health_check_details: False
```

## Health Check Driven Routing

By default, background health checks are observability-only — they populate the `/health` endpoint but don't affect routing. Unhealthy deployments still receive traffic until request failures trigger cooldown.

With `enable_health_check_routing: true`, the router **excludes deployments that failed their last background health check** before selecting a candidate. This gives you proactive failover instead of reactive cooldown.

### How it works

1. Background health checks run on their configured interval
2. After each cycle, every deployment is marked healthy or unhealthy
3. On each incoming request, the router filters out unhealthy deployments **before** cooldown filtering and load balancing
4. If all deployments are unhealthy, the filter is bypassed (safety net — never causes a total outage)
5. If health state is stale (older than `health_check_staleness_threshold`), it is ignored
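
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the router's actual code: `filter_by_health`, its parameters, and the treatment of deployments with no recorded state are assumptions made for the example. Only the `is_healthy` and `timestamp` keys mirror the state dict produced by `build_deployment_health_states`.

```python
import time
from typing import Dict, List, Optional


def filter_by_health(
    deployment_ids: List[str],
    health_states: Dict[str, dict],
    staleness_threshold: float,
    now: Optional[float] = None,
) -> List[str]:
    """Hypothetical helper: drop deployments whose last health check failed.

    Stale or missing health state is ignored, so the deployment stays eligible.
    If every deployment would be dropped, the input list is returned unchanged.
    """
    now = time.time() if now is None else now
    eligible = []
    for dep_id in deployment_ids:
        state = health_states.get(dep_id)
        if state is None:
            eligible.append(dep_id)  # assumption: never checked means eligible
            continue
        is_stale = (now - state["timestamp"]) > staleness_threshold
        if is_stale or state["is_healthy"]:
            eligible.append(dep_id)
    # Safety net: never filter down to zero candidates.
    return eligible if eligible else list(deployment_ids)


states = {
    "a": {"is_healthy": True, "timestamp": 1000.0},
    "b": {"is_healthy": False, "timestamp": 1000.0},
    "c": {"is_healthy": False, "timestamp": 800.0},  # stale when now=1010
}
result = filter_by_health(["a", "b", "c"], states, staleness_threshold=120, now=1010.0)
all_down = filter_by_health(["b"], states, staleness_threshold=120, now=1010.0)
```

Here `result` keeps `a` (healthy) and `c` (unhealthy but stale) and drops `b`, while `all_down` returns `["b"]` because filtering would otherwise leave no candidates.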

### Quick start

```yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY_SECONDARY

general_settings:
  background_health_checks: true
  health_check_interval: 60
  enable_health_check_routing: true
```

### Configuration

| Setting | Where | Default | Description |
|---------|-------|---------|-------------|
| `enable_health_check_routing` | `general_settings` | `false` | Enable/disable health-check-driven routing |
| `health_check_staleness_threshold` | `general_settings` | `health_check_interval * 2` | Seconds before health state is considered stale and ignored |
| `background_health_checks` | `general_settings` | `false` | Must be `true` for health check routing to work |
| `health_check_interval` | `general_settings` | `300` | Seconds between health check cycles |

### Interaction with cooldown

Health check filtering and cooldown are **additive**. A deployment can be excluded by either mechanism:

- **Health check filter** — proactive, runs on the configured interval, excludes deployments that failed the last check
- **Cooldown** — reactive, triggered by request failures, excludes deployments for a short TTL

This means request failures still provide fast detection between health check intervals.

### Staleness

If a health check result is older than `health_check_staleness_threshold`, it is ignored and the deployment is treated as eligible. This prevents stale data from permanently excluding a deployment if the health check loop stops or slows down.

The default staleness threshold is `health_check_interval * 2`. For a 60s interval, health state expires after 120s.
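
As a small worked example of that default, the sketch below resolves the effective threshold. The helper name `resolve_staleness_threshold` is hypothetical; the multiplier value of 2 matches `DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER` added to `litellm/constants.py` in this change.

```python
from typing import Optional

# Mirrors DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER in litellm/constants.py
STALENESS_MULTIPLIER = 2


def resolve_staleness_threshold(
    health_check_interval: float,
    configured_threshold: Optional[float] = None,
) -> float:
    """Hypothetical helper: an explicit setting wins, else interval * multiplier."""
    if configured_threshold is not None:
        return configured_threshold
    return health_check_interval * STALENESS_MULTIPLIER


default_threshold = resolve_staleness_threshold(60)     # 60 * 2 = 120 seconds
custom_threshold = resolve_staleness_threshold(30, 90)  # explicit 90 seconds
```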

### Example: custom staleness

```yaml
general_settings:
  background_health_checks: true
  health_check_interval: 30
  enable_health_check_routing: true
  health_check_staleness_threshold: 90 # ignore health state older than 90s
```

### Debugging

Run the proxy with `--detailed_debug` and look for:

```
health_check_routing_state_updated healthy=3 unhealthy=1
```

This is logged after each health check cycle when routing state is written.

If the safety net triggers (all deployments unhealthy), you'll see:

```
All deployments marked unhealthy by health checks, bypassing health filter
```

## Health Check Timeout

The health check timeout is set in `litellm/constants.py` and defaults to 60 seconds.
53 changes: 47 additions & 6 deletions docs/my-website/docs/proxy/load_balancing.md
@@ -324,17 +324,58 @@ model_list:
     litellm_params:
       model: azure/gpt-4-fallback
       api_key: os.environ/AZURE_API_KEY_2
-      order: 2 # 👈 Used when order=1 is unavailable
+      order: 2 # 👈 Used when order=1 fails
```

### How order-based fallback works

When a request to an `order=1` deployment fails (connection error, 404, 429, etc.), the router automatically tries `order=2` deployments, then `order=3`, and so on. Each order level gets its own set of retries before escalating to the next.

If all order levels are exhausted, the router falls through to any configured [model-level fallbacks](#fallbacks).

```yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4-primary
      api_key: os.environ/AZURE_API_KEY
      order: 1

  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4-secondary
      api_key: os.environ/AZURE_API_KEY_2
      order: 2

  - model_name: gpt-4-fallback
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  enable_pre_call_checks: true # 👈 Required for 'order' to work
  fallbacks:
    - gpt-4:
        - gpt-4-fallback # tried after all order levels fail
```

-:::important
-The `order` parameter requires `enable_pre_call_checks: true` in `router_settings`.
-:::
+The fallback chain for the above config: `order=1` → `order=2` → `gpt-4-fallback`.

For 429 (rate limit) errors specifically, the failed deployment is immediately placed on cooldown. If all `order=1` deployments are on cooldown, the router picks `order=2` deployments directly during retries without waiting for the fallback path.
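
The escalation described in this section can be illustrated with a simplified sketch. It is not the router's implementation: `call_with_order_fallback`, `send`, and `retries_per_level` are hypothetical names, the round-robin pick stands in for the configured routing strategy, and cooldown handling is left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def call_with_order_fallback(
    deployments: List[dict],
    send: Callable[[dict], str],
    retries_per_level: int = 2,
) -> str:
    """Hypothetical sketch: try order=1 first, then order=2, and so on."""
    by_order: Dict[int, List[dict]] = defaultdict(list)
    for dep in deployments:
        by_order[dep["order"]].append(dep)

    last_error: Exception = RuntimeError("no deployments configured")
    for order in sorted(by_order):
        level = by_order[order]
        # Each order level gets its own set of attempts before escalating.
        for attempt in range(retries_per_level):
            dep = level[attempt % len(level)]  # stand-in for the routing strategy
            try:
                return send(dep)
            except Exception as err:
                last_error = err
    # All order levels exhausted; a real router would try model-level fallbacks here.
    raise last_error


attempts: List[str] = []


def fake_send(dep: dict) -> str:
    """Simulated provider call: every order=1 deployment fails."""
    attempts.append(dep["model"])
    if dep["order"] == 1:
        raise ConnectionError("primary down")
    return "ok from " + dep["model"]


deployments = [
    {"model": "azure/gpt-4-primary", "order": 1},
    {"model": "azure/gpt-4-secondary", "order": 2},
]
response = call_with_order_fallback(deployments, fake_send)
```

With the simulated failure, the primary is attempted twice before the call succeeds on the `order=2` deployment.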

### Team-scoped models and legacy `model_aliases` {#team-scoped-models-and-legacy-model_aliases}

Team-scoped deployments are identified by `model_info.team_id` and `model_info.team_public_model_name`. Requests should use the **public** model name; the router resolves all sibling deployments (same public name, different `api_base` / `order`, etc.) for routing, failover, and deployment `order`.

For router internals: when a `team_id` is in scope, optimized lookups key off `(team_id, team_public_model_name)`. If code passes an internal deployment id (e.g. `model_name_<team_id>_<uuid>`) instead of the public name, routing still works via the usual deployment-name paths, but the team-specific fast path applies only to the public name.

**Legacy teams:** Older proxy versions could persist `model_aliases` on the team row mapping a public name to a single internal deployment id (`model_name_<team_id>_<uuid>`). On each request, pre-call logic may still rewrite `model` to that internal name **before** routing, which collapses to one deployment and can make newer sibling deployments unreachable.

**Migration options:**

1. **Recommended for upgrades:** Set environment variable `LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS=true` so that when sibling team deployments exist for the public name, the stale alias rewrite is skipped and team-scoped routing (including `order` and failover) applies. See the [Environment variables](./config_settings) table in the proxy settings doc.
2. **Data cleanup:** Remove obsolete `model_aliases` entries for team public names from the team record in the database so only `team_public_model_name` + team model list drive access.

-If `order=1` deployment is unavailable (e.g., rate-limited), the router falls back to `order=2` deployments.
+If a stale alias is detected and the bypass is **not** enabled, the proxy may emit a **one-time** warning in logs explaining that sibling deployments may be unreachable until the flag is set or aliases are cleaned up.

### When You'll See Load Balancing in Action

15 changes: 5 additions & 10 deletions docs/my-website/docs/routing.md
@@ -842,6 +842,8 @@ Traffic mirroring allows you to "mimic" production traffic to a secondary (silen

Set `order` in `litellm_params` to prioritize deployments. Lower values = higher priority. When multiple deployments share the same `order`, the routing strategy picks among them.

When a request to an `order=1` deployment fails (connection error, 404, 429, etc.), the router automatically tries `order=2` deployments, then `order=3`, and so on. Each order level gets its own set of retries before escalating to the next. If all order levels are exhausted, the router falls through to any configured [fallbacks](#fallbacks).

<Tabs>
<TabItem value="sdk" label="SDK">

@@ -862,18 +864,14 @@ model_list = [
         "litellm_params": {
             "model": "azure/gpt-4-fallback",
             "api_key": os.getenv("AZURE_API_KEY_2"),
-            "order": 2, # 👈 Used when order=1 is unavailable
+            "order": 2, # 👈 Tried when order=1 fails
         },
     },
 ]

-router = Router(model_list=model_list, enable_pre_call_checks=True) # 👈 Required for 'order' to work
+router = Router(model_list=model_list)
```

-:::important
-The `order` parameter requires `enable_pre_call_checks=True` to be set on the Router.
-:::

</TabItem>
<TabItem value="proxy" label="PROXY">

@@ -889,10 +887,7 @@ model_list:
     litellm_params:
       model: azure/gpt-4-fallback
       api_key: os.environ/AZURE_API_KEY_2
-      order: 2 # 👈 Used when order=1 is unavailable
-
-router_settings:
-  enable_pre_call_checks: true # 👈 Required for 'order' to work
+      order: 2 # 👈 Tried when order=1 fails
```

</TabItem>
3 changes: 3 additions & 0 deletions litellm/constants.py
@@ -1402,6 +1402,9 @@
 DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL = int(
     os.getenv("DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL", 60)
 )  # 1 minute - TTL for health check lock
+DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER = (
+    2  # health state is stale after interval * this
+)
 PROMETHEUS_FALLBACK_STATS_SEND_TIME_HOURS = int(
     os.getenv("PROMETHEUS_FALLBACK_STATS_SEND_TIME_HOURS", 9)
 )
58 changes: 51 additions & 7 deletions litellm/proxy/health_check.py
@@ -207,21 +207,65 @@ async def _perform_health_check(

     for is_healthy, model in zip(results, model_list):
         litellm_params = model["litellm_params"]
+        _model_id = (model.get("model_info") or {}).get("id")

         if isinstance(is_healthy, dict) and "error" not in is_healthy:
-            healthy_endpoints.append(
-                _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-            )
+            cleaned = _clean_endpoint_data({**litellm_params, **is_healthy}, details)
+            if _model_id:
+                cleaned["model_id"] = _model_id
+            healthy_endpoints.append(cleaned)
         elif isinstance(is_healthy, dict):
-            unhealthy_endpoints.append(
-                _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-            )
+            cleaned = _clean_endpoint_data({**litellm_params, **is_healthy}, details)
+            if _model_id:
+                cleaned["model_id"] = _model_id
+            unhealthy_endpoints.append(cleaned)
         else:
-            unhealthy_endpoints.append(_clean_endpoint_data(litellm_params, details))
+            cleaned = _clean_endpoint_data(litellm_params, details)
+            if _model_id:
+                cleaned["model_id"] = _model_id
+            unhealthy_endpoints.append(cleaned)

     return healthy_endpoints, unhealthy_endpoints


+def build_deployment_health_states(
+    healthy_endpoints: list,
+    unhealthy_endpoints: list,
+) -> dict:
+    """
+    Build a dict mapping deployment_id -> DeploymentHealthStateValue from
+    health check endpoint results.
+
+    Each endpoint dict includes a 'model_id' field (added by _perform_health_check)
+    that maps back to the deployment's model_info.id.
+
+    Used by the background health check loop to feed health state into
+    the router's DeploymentHealthCache for health-check-driven routing.
+    """
+    now = time.time()
+    states: dict = {}
+
+    for ep in healthy_endpoints:
+        model_id = ep.get("model_id")
+        if model_id:
+            states[model_id] = {
+                "is_healthy": True,
+                "timestamp": now,
+                "reason": "",
+            }
+
+    for ep in unhealthy_endpoints:
+        model_id = ep.get("model_id")
+        if model_id:
+            states[model_id] = {
+                "is_healthy": False,
+                "timestamp": now,
+                "reason": "background_health_check_failed",
+            }
+
+    return states
+
+
 def _update_litellm_params_for_health_check(
     model_info: dict, litellm_params: dict
 ) -> dict:
60 changes: 59 additions & 1 deletion litellm/proxy/litellm_pre_call_utils.py
@@ -1,6 +1,7 @@
 import asyncio
 import copy
 import time
+from collections import OrderedDict
 from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

 from fastapi import Request
@@ -26,6 +27,7 @@
     v.value.lower() for v in SpecialHeaders._member_map_.values()
 )
 from litellm.router import Router
+from litellm.secret_managers.main import get_secret_bool
 from litellm.types.llms.anthropic import ANTHROPIC_API_HEADERS
 from litellm.types.services import ServiceTypes
 from litellm.types.utils import (
@@ -36,6 +38,11 @@
 )

 service_logger_obj = ServiceLogging() # used for tracking latency on OTEL
+# Bounded dedup for stale-alias warnings (FIFO eviction when over cap).
+_MAX_STALE_ALIAS_WARNING_KEYS = 10_000
+_STALE_TEAM_ALIAS_WARNING_KEYS: OrderedDict[str, None] = OrderedDict()
+# Cache the stale alias bypass flag at module load to avoid hot-path secret lookups
+_ENABLE_TEAM_STALE_ALIAS_BYPASS: Optional[bool] = None


 if TYPE_CHECKING:
@@ -1296,14 +1303,65 @@ def _update_model_if_team_alias_exists(
         "gpt-4o": "gpt-4o-team-1"
     }
     - requested_model = "gpt-4o-team-1"
+
+    Note: model_aliases for team models are deprecated. This function only applies
+    to legacy non-team-scoped aliases. Team-scoped deployments use team_public_model_name
+    and are resolved via map_team_model in route_llm_request.
     """
     _model = data.get("model")
     if (
         _model
         and user_api_key_dict.team_model_aliases
         and _model in user_api_key_dict.team_model_aliases
     ):
-        data["model"] = user_api_key_dict.team_model_aliases[_model]
+        from litellm.proxy.proxy_server import llm_router
+
+        # Skip alias rewrite if this model resolves to team-specific deployments
+        # (team models use team_public_model_name, not model_aliases)
+        aliased_target = user_api_key_dict.team_model_aliases[_model]
+
+        # Optional bypass for stale aliases from pre-PR deployments:
+        # only enabled via feature flag to preserve backwards compatibility.
+        # Cached at module level to avoid hot-path secret lookups on every request.
+        global _ENABLE_TEAM_STALE_ALIAS_BYPASS
+        if _ENABLE_TEAM_STALE_ALIAS_BYPASS is None:
+            _ENABLE_TEAM_STALE_ALIAS_BYPASS = get_secret_bool(
+                "LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS", False
+            )
+        enable_stale_alias_bypass = _ENABLE_TEAM_STALE_ALIAS_BYPASS
+        # Check if the alias points to a team-scoped UUID name
+        # (format: "model_name_{team_id}_{uuid}")
+        is_stale_team_alias = aliased_target.startswith(
+            f"model_name_{user_api_key_dict.team_id}_"
+        )
+        if is_stale_team_alias and llm_router:
+            # This is a stale alias from pre-PR deployments.
+            # Check if current team deployments exist for the public name.
+            key = (user_api_key_dict.team_id, _model)
+            if key in llm_router.team_model_to_deployment_indices:
+                if enable_stale_alias_bypass:
+                    # Team deployments exist; skip stale alias
+                    return
+                warning_key = f"{user_api_key_dict.team_id}:{_model}:{aliased_target}"
+                if warning_key not in _STALE_TEAM_ALIAS_WARNING_KEYS:
+                    _STALE_TEAM_ALIAS_WARNING_KEYS[warning_key] = None
+                    while (
+                        len(_STALE_TEAM_ALIAS_WARNING_KEYS)
+                        > _MAX_STALE_ALIAS_WARNING_KEYS
+                    ):
+                        _STALE_TEAM_ALIAS_WARNING_KEYS.popitem(last=False)
+                    verbose_proxy_logger.warning(
+                        "Stale team model alias detected for model='%s', team_id='%s'. "
+                        "New sibling deployments may be unreachable. "
+                        "Set LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS=true to enable "
+                        "team-scoped sibling routing.",
+                        str(_model).replace("\n", "").replace("\r", ""),
+                        str(user_api_key_dict.team_id)
+                        .replace("\n", "")
+                        .replace("\r", ""),
+                    )
+
+        data["model"] = aliased_target
     return

