BerriAI · yuneng-berri · Mar 27, 2026 · Mar 23, 2026
diff --git a/docs/my-website/docs/proxy/config_settings.md b/docs/my-website/docs/proxy/config_settings.md
@@ -804,6 +804,7 @@ router_settings:
 | LITELLM_OTEL_INTEGRATION_ENABLE_EVENTS | Optionally enable semantic logs for OTEL
 | LITELLM_OTEL_INTEGRATION_ENABLE_METRICS | Optionally enable semantic metrics for OTEL
 | LITELLM_ENABLE_PYROSCOPE | If true, enables Pyroscope CPU profiling. Profiles are sent to PYROSCOPE_SERVER_ADDRESS. Off by default. See [Pyroscope profiling](/proxy/pyroscope_profiling).
+| LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS | When `true`, pre-call handling skips a team's legacy `model_aliases` rewrite if the alias maps a public model name to an internal `model_name_<team_id>_<uuid>` deployment and team-scoped sibling deployments exist for that public name, so load balancing and `order` apply across all siblings. Default is `false` for backwards compatibility. When a stale alias is detected and this flag is off, the proxy logs a one-time warning per team, model, and alias combination. See [Team-scoped models and legacy aliases](./load_balancing#team-scoped-models-and-legacy-model_aliases).
 | PYROSCOPE_APP_NAME | Application name reported to Pyroscope. Required when LITELLM_ENABLE_PYROSCOPE is true. No default.
 | PYROSCOPE_SERVER_ADDRESS | Pyroscope server URL to send profiles to. Required when LITELLM_ENABLE_PYROSCOPE is true. No default.
 | PYROSCOPE_SAMPLE_RATE | Optional. Sample rate for Pyroscope profiling (integer). No default; when unset, the pyroscope-io library default is used.

diff --git a/docs/my-website/docs/proxy/health.md b/docs/my-website/docs/proxy/health.md
@@ -314,6 +314,89 @@ general_settings:
   health_check_details: False
 ```
 
+## Health Check Driven Routing
+
+By default, background health checks are observability-only — they populate the `/health` endpoint but don't affect routing. Unhealthy deployments still receive traffic until request failures trigger cooldown.
+
+With `enable_health_check_routing: true`, the router **excludes deployments that failed their last background health check** before selecting a candidate. This gives you proactive failover instead of reactive cooldown.
+
+### How it works
+
+1. Background health checks run on their configured interval
+2. After each cycle, every deployment is marked healthy or unhealthy
+3. On each incoming request, the router filters out unhealthy deployments **before** cooldown filtering and load balancing
+4. If all deployments are unhealthy, the filter is bypassed (safety net — never causes a total outage)
+5. If health state is stale (older than `health_check_staleness_threshold`), it is ignored
+
+### Quick start
+
+```yaml
+model_list:
+  - model_name: gpt-4
+    litellm_params:
+      model: openai/gpt-4
+      api_key: os.environ/OPENAI_API_KEY
+  - model_name: gpt-4
+    litellm_params:
+      model: openai/gpt-4
+      api_key: os.environ/OPENAI_API_KEY_SECONDARY
+
+general_settings:
+  background_health_checks: true
+  health_check_interval: 60
+  enable_health_check_routing: true
+```
+
+### Configuration
+
+| Setting | Where | Default | Description |
+|---------|-------|---------|-------------|
+| `enable_health_check_routing` | `general_settings` | `false` | Enable/disable health-check-driven routing |
+| `health_check_staleness_threshold` | `general_settings` | `health_check_interval * 2` | Seconds before health state is considered stale and ignored |
+| `background_health_checks` | `general_settings` | `false` | Must be `true` for health check routing to work |
+| `health_check_interval` | `general_settings` | `300` | Seconds between health check cycles |
+
+### Interaction with cooldown
+
+Health check filtering and cooldown are **additive**. A deployment can be excluded by either mechanism:
+
+- **Health check filter** — proactive, runs on the configured interval, excludes deployments that failed the last check
+- **Cooldown** — reactive, triggered by request failures, excludes deployments for a short TTL
+
+This means request failures still provide fast detection between health check intervals.
+
+### Staleness
+
+If a health check result is older than `health_check_staleness_threshold`, it is ignored and the deployment is treated as eligible. This prevents stale data from permanently excluding a deployment if the health check loop stops or slows down.
+
+The default staleness threshold is `health_check_interval * 2`. For a 60s interval, health state expires after 120s.
+
+### Example: custom staleness
+
+```yaml
+general_settings:
+  background_health_checks: true
+  health_check_interval: 30
+  enable_health_check_routing: true
+  health_check_staleness_threshold: 90  # ignore health state older than 90s
+```
+
+### Debugging
+
+Run the proxy with `--detailed_debug` and look for:
+
+```
+health_check_routing_state_updated healthy=3 unhealthy=1
+```
+
+This is logged after each health check cycle when routing state is written.
+
+If the safety net triggers (all deployments unhealthy), you'll see:
+
+```
+All deployments marked unhealthy by health checks, bypassing health filter
+```
+
 ## Health Check Timeout
 
 The health check timeout is set in `litellm/constants.py` and defaults to 60 seconds.
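
The filtering rules documented under "Health Check Driven Routing" in this hunk can be sketched as a small pure function. This is a minimal illustration, not the router's actual implementation. The name `filter_by_health` and its parameters are hypothetical. The state dict shape (`is_healthy`, `timestamp`) follows `build_deployment_health_states` in this PR. Treating a deployment with no recorded state as eligible is an assumption.

```python
import time
from typing import Dict, List, Optional


def filter_by_health(
    deployment_ids: List[str],
    health_states: Dict[str, dict],
    staleness_threshold: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the deployment ids that stay eligible after health filtering.

    Hypothetical helper: the name and signature are illustrative only.
    """
    now = time.time() if now is None else now
    eligible = []
    for dep_id in deployment_ids:
        state = health_states.get(dep_id)
        if state is None:
            # Assumption: a deployment that was never checked stays eligible.
            eligible.append(dep_id)
            continue
        # State older than the threshold is ignored, so the deployment is eligible.
        is_stale = (now - state["timestamp"]) > staleness_threshold
        if is_stale or state["is_healthy"]:
            eligible.append(dep_id)
    # Safety net: if every deployment was excluded, bypass the filter.
    return eligible if eligible else list(deployment_ids)
```

Passing `now` explicitly keeps the function deterministic, which makes the staleness and safety-net branches easy to exercise in a unit test.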

diff --git a/docs/my-website/docs/proxy/load_balancing.md b/docs/my-website/docs/proxy/load_balancing.md
@@ -324,17 +324,58 @@ model_list:
     litellm_params:
       model: azure/gpt-4-fallback
       api_key: os.environ/AZURE_API_KEY_2
-      order: 2  # 👈 Used when order=1 is unavailable
+      order: 2  # 👈 Used when order=1 fails
+```
+
+### How order-based fallback works
+
+When a request to an `order=1` deployment fails (connection error, 404, 429, etc.), the router automatically tries `order=2` deployments, then `order=3`, and so on. Each order level gets its own set of retries before escalating to the next.
+
+If all order levels are exhausted, the router falls through to any configured [model-level fallbacks](#fallbacks).
+
+```yaml
+model_list:
+  - model_name: gpt-4
+    litellm_params:
+      model: azure/gpt-4-primary
+      api_key: os.environ/AZURE_API_KEY
+      order: 1
+
+  - model_name: gpt-4
+    litellm_params:
+      model: azure/gpt-4-secondary
+      api_key: os.environ/AZURE_API_KEY_2
+      order: 2
+
+  - model_name: gpt-4-fallback
+    litellm_params:
+      model: openai/gpt-4
+      api_key: os.environ/OPENAI_API_KEY
 
 router_settings:
-  enable_pre_call_checks: true  # 👈 Required for 'order' to work
+  fallbacks:
+    - gpt-4:
+        - gpt-4-fallback  # tried after all order levels fail
 ```
 
-:::important
-The `order` parameter requires `enable_pre_call_checks: true` in `router_settings`.
-:::
+The fallback chain for the above config: `order=1` → `order=2` → `gpt-4-fallback`.
+
+For 429 (rate limit) errors specifically, the failed deployment is immediately placed on cooldown. If all `order=1` deployments are on cooldown, the router picks `order=2` deployments directly during retries without waiting for the fallback path.
+
+### Team-scoped models and legacy `model_aliases` {#team-scoped-models-and-legacy-model_aliases}
+
+Team-scoped deployments are identified by `model_info.team_id` and `model_info.team_public_model_name`. Requests should use the **public** model name; the router resolves all sibling deployments (same public name, different `api_base` / `order`, etc.) for routing, failover, and deployment `order`.
+
+For router internals: when a `team_id` is in scope, optimized lookups key off `(team_id, team_public_model_name)`. If code passes an internal deployment id (e.g. `model_name_<team_id>_<uuid>`) instead of the public name, routing still works via the usual deployment-name paths, but the team-specific fast path applies only to the public name.
+
+**Legacy teams:** Older proxy versions could persist `model_aliases` on the team row mapping a public name to a single internal deployment id (`model_name_<team_id>_<uuid>`). On each request, pre-call logic may still rewrite `model` to that internal name **before** routing, which collapses to one deployment and can make newer sibling deployments unreachable.
+
+**Migration options:**
+
+1. **Recommended for upgrades:** Set environment variable `LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS=true` so that when sibling team deployments exist for the public name, the stale alias rewrite is skipped and team-scoped routing (including `order` and failover) applies. See the [Environment variables](./config_settings) table in the proxy settings doc.
+2. **Data cleanup:** Remove obsolete `model_aliases` entries for team public names from the team record in the database, so that only `team_public_model_name` and the team model list drive access.
 
-If `order=1` deployment is unavailable (e.g., rate-limited), the router falls back to `order=2` deployments.
+If a stale alias is detected and the bypass is **not** enabled, the proxy may emit a **one-time** warning in logs explaining that sibling deployments may be unreachable until the flag is set or aliases are cleaned up.
 
 ### When You'll See Load Balancing in Action
 

diff --git a/docs/my-website/docs/routing.md b/docs/my-website/docs/routing.md
@@ -842,6 +842,8 @@ Traffic mirroring allows you to "mimic" production traffic to a secondary (silen
 
 Set `order` in `litellm_params` to prioritize deployments. Lower values = higher priority. When multiple deployments share the same `order`, the routing strategy picks among them.
 
+When a request to an `order=1` deployment fails (connection error, 404, 429, etc.), the router automatically tries `order=2` deployments, then `order=3`, and so on. Each order level gets its own set of retries before escalating to the next. If all order levels are exhausted, the router falls through to any configured [fallbacks](#fallbacks).
+
 <Tabs>
 <TabItem value="sdk" label="SDK">
 
@@ -862,18 +864,14 @@ model_list = [
         "litellm_params": {
             "model": "azure/gpt-4-fallback",
             "api_key": os.getenv("AZURE_API_KEY_2"),
-            "order": 2,  # 👈 Used when order=1 is unavailable
+            "order": 2,  # 👈 Tried when order=1 fails
         },
     },
 ]
 
-router = Router(model_list=model_list, enable_pre_call_checks=True)  # 👈 Required for 'order' to work
+router = Router(model_list=model_list)
 ```
 
-:::important
-The `order` parameter requires `enable_pre_call_checks=True` to be set on the Router.
-:::
-
 </TabItem>
 <TabItem value="proxy" label="PROXY">
 
@@ -889,10 +887,7 @@ model_list:
     litellm_params:
       model: azure/gpt-4-fallback
       api_key: os.environ/AZURE_API_KEY_2
-      order: 2  # 👈 Used when order=1 is unavailable
-
-router_settings:
-  enable_pre_call_checks: true  # 👈 Required for 'order' to work
+      order: 2  # 👈 Tried when order=1 fails
 ```
 
 </TabItem>
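
The order-based escalation described in the docs changes above (retry within one `order` level, then move to the next) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Router's code. `call_with_order_fallback`, `send`, and `retries_per_level` are hypothetical names. Every deployment is assumed to carry an `order` key, and a simple rotation stands in for the configured routing strategy that picks among deployments of the same order.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def call_with_order_fallback(
    deployments: List[dict],
    send: Callable[[dict], str],
    retries_per_level: int = 2,
) -> str:
    """Try each order level in ascending order, retrying within a level first."""
    # Group deployments by their order value (lower value = higher priority).
    levels: Dict[int, List[dict]] = defaultdict(list)
    for dep in deployments:
        levels[dep["order"]].append(dep)

    last_error = None
    for order in sorted(levels):
        candidates = levels[order]
        for attempt in range(retries_per_level):
            # Assumption: rotate through same-order deployments on each retry.
            dep = candidates[attempt % len(candidates)]
            try:
                return send(dep)
            except Exception as exc:  # any failure counts toward this level's retries
                last_error = exc
    # All levels exhausted; a real router would now consult model-level fallbacks.
    raise RuntimeError("all order levels exhausted") from last_error
```

In this sketch a failing `order=1` deployment is attempted `retries_per_level` times before any `order=2` deployment is tried, which matches the "each order level gets its own set of retries" wording.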

diff --git a/litellm/constants.py b/litellm/constants.py
@@ -1402,6 +1402,9 @@
 DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL = int(
     os.getenv("DEFAULT_SHARED_HEALTH_CHECK_LOCK_TTL", 60)
 )  # 1 minute - TTL for health check lock
+DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER = (
+    2  # health state is stale after interval * this
+)
 PROMETHEUS_FALLBACK_STATS_SEND_TIME_HOURS = int(
     os.getenv("PROMETHEUS_FALLBACK_STATS_SEND_TIME_HOURS", 9)
 )
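
For reference, the documented default of `health_check_interval * 2` could be derived from the new constant as sketched below. The helper `resolve_staleness_threshold` is hypothetical and only illustrates the arithmetic. How the proxy actually reads `health_check_staleness_threshold` is not shown in this diff.

```python
from typing import Optional

# Mirrors the constant added in this hunk.
DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER = 2


def resolve_staleness_threshold(
    health_check_interval: float,
    configured_threshold: Optional[float] = None,
) -> float:
    """Return the staleness threshold in seconds (hypothetical helper)."""
    if configured_threshold is not None:
        # An explicit health_check_staleness_threshold takes precedence.
        return configured_threshold
    return health_check_interval * DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER
```

With a 60-second interval and no override this yields 120 seconds. With `health_check_interval: 30` and `health_check_staleness_threshold: 90`, the explicit 90 seconds is used.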

diff --git a/litellm/proxy/health_check.py b/litellm/proxy/health_check.py
@@ -207,21 +207,65 @@ async def _perform_health_check(
 
     for is_healthy, model in zip(results, model_list):
         litellm_params = model["litellm_params"]
+        _model_id = (model.get("model_info") or {}).get("id")
 
         if isinstance(is_healthy, dict) and "error" not in is_healthy:
-            healthy_endpoints.append(
-                _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-            )
+            cleaned = _clean_endpoint_data({**litellm_params, **is_healthy}, details)
+            if _model_id:
+                cleaned["model_id"] = _model_id
+            healthy_endpoints.append(cleaned)
         elif isinstance(is_healthy, dict):
-            unhealthy_endpoints.append(
-                _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-            )
+            cleaned = _clean_endpoint_data({**litellm_params, **is_healthy}, details)
+            if _model_id:
+                cleaned["model_id"] = _model_id
+            unhealthy_endpoints.append(cleaned)
         else:
-            unhealthy_endpoints.append(_clean_endpoint_data(litellm_params, details))
+            cleaned = _clean_endpoint_data(litellm_params, details)
+            if _model_id:
+                cleaned["model_id"] = _model_id
+            unhealthy_endpoints.append(cleaned)
 
     return healthy_endpoints, unhealthy_endpoints
 
 
+def build_deployment_health_states(
+    healthy_endpoints: list,
+    unhealthy_endpoints: list,
+) -> dict:
+    """
+    Build a dict mapping deployment_id -> DeploymentHealthStateValue from
+    health check endpoint results.
+
+    Each endpoint dict includes a 'model_id' field (added by _perform_health_check)
+    that maps back to the deployment's model_info.id.
+
+    Used by the background health check loop to feed health state into
+    the router's DeploymentHealthCache for health-check-driven routing.
+    """
+    now = time.time()
+    states: dict = {}
+
+    for ep in healthy_endpoints:
+        model_id = ep.get("model_id")
+        if model_id:
+            states[model_id] = {
+                "is_healthy": True,
+                "timestamp": now,
+                "reason": "",
+            }
+
+    for ep in unhealthy_endpoints:
+        model_id = ep.get("model_id")
+        if model_id:
+            states[model_id] = {
+                "is_healthy": False,
+                "timestamp": now,
+                "reason": "background_health_check_failed",
+            }
+
+    return states
+
+
 def _update_litellm_params_for_health_check(
     model_info: dict, litellm_params: dict
 ) -> dict:

diff --git a/litellm/proxy/litellm_pre_call_utils.py b/litellm/proxy/litellm_pre_call_utils.py
@@ -1,6 +1,7 @@
 import asyncio
 import copy
 import time
+from collections import OrderedDict
 from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
 
 from fastapi import Request
@@ -26,6 +27,7 @@
     v.value.lower() for v in SpecialHeaders._member_map_.values()
 )
 from litellm.router import Router
+from litellm.secret_managers.main import get_secret_bool
 from litellm.types.llms.anthropic import ANTHROPIC_API_HEADERS
 from litellm.types.services import ServiceTypes
 from litellm.types.utils import (
@@ -36,6 +38,11 @@
 )
 
 service_logger_obj = ServiceLogging()  # used for tracking latency on OTEL
+# Bounded dedup for stale-alias warnings (FIFO eviction when over cap).
+_MAX_STALE_ALIAS_WARNING_KEYS = 10_000
+_STALE_TEAM_ALIAS_WARNING_KEYS: OrderedDict[str, None] = OrderedDict()
+# Cache the stale alias bypass flag on first use (None = not read yet) to avoid hot-path secret lookups
+_ENABLE_TEAM_STALE_ALIAS_BYPASS: Optional[bool] = None
 
 
 if TYPE_CHECKING:
@@ -1296,14 +1303,65 @@ def _update_model_if_team_alias_exists(
             "gpt-4o": "gpt-4o-team-1"
         }
         - requested_model = "gpt-4o-team-1"
+
+    Note: model_aliases for team models are deprecated. This function only applies
+    to legacy non-team-scoped aliases. Team-scoped deployments use team_public_model_name
+    and are resolved via map_team_model in route_llm_request.
     """
     _model = data.get("model")
     if (
         _model
         and user_api_key_dict.team_model_aliases
         and _model in user_api_key_dict.team_model_aliases
     ):
-        data["model"] = user_api_key_dict.team_model_aliases[_model]
+        from litellm.proxy.proxy_server import llm_router
+
+        # Skip alias rewrite if this model resolves to team-specific deployments
+        # (team models use team_public_model_name, not model_aliases)
+        aliased_target = user_api_key_dict.team_model_aliases[_model]
+
+        # Optional bypass for stale aliases persisted by older proxy versions:
+        # only enabled via feature flag to preserve backwards compatibility.
+        # Cached at module level on first use to avoid secret lookups on every request.
+        global _ENABLE_TEAM_STALE_ALIAS_BYPASS
+        if _ENABLE_TEAM_STALE_ALIAS_BYPASS is None:
+            _ENABLE_TEAM_STALE_ALIAS_BYPASS = get_secret_bool(
+                "LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS", False
+            )
+        enable_stale_alias_bypass = _ENABLE_TEAM_STALE_ALIAS_BYPASS
+        # Check if the alias points to a team-scoped UUID name
+        # (format: "model_name_{team_id}_{uuid}")
+        is_stale_team_alias = aliased_target.startswith(
+            f"model_name_{user_api_key_dict.team_id}_"
+        )
+        if is_stale_team_alias and llm_router:
+            # This is a stale alias persisted by an older proxy version.
+            # Check if current team deployments exist for the public name.
+            key = (user_api_key_dict.team_id, _model)
+            if key in llm_router.team_model_to_deployment_indices:
+                if enable_stale_alias_bypass:
+                    # Team deployments exist; skip stale alias
+                    return
+                warning_key = f"{user_api_key_dict.team_id}:{_model}:{aliased_target}"
+                if warning_key not in _STALE_TEAM_ALIAS_WARNING_KEYS:
+                    _STALE_TEAM_ALIAS_WARNING_KEYS[warning_key] = None
+                    while (
+                        len(_STALE_TEAM_ALIAS_WARNING_KEYS)
+                        > _MAX_STALE_ALIAS_WARNING_KEYS
+                    ):
+                        _STALE_TEAM_ALIAS_WARNING_KEYS.popitem(last=False)
+                    verbose_proxy_logger.warning(
+                        "Stale team model alias detected for model='%s', team_id='%s'. "
+                        "New sibling deployments may be unreachable. "
+                        "Set LITELLM_ENABLE_TEAM_STALE_ALIAS_BYPASS=true to enable "
+                        "team-scoped sibling routing.",
+                        str(_model).replace("\n", "").replace("\r", ""),
+                        str(user_api_key_dict.team_id)
+                        .replace("\n", "")
+                        .replace("\r", ""),
+                    )
+
+        data["model"] = aliased_target
     return
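
The warning de-duplication above relies on an `OrderedDict` used as a bounded, insertion-ordered set. The same pattern can be isolated as follows. `BoundedSeenSet` and `add_if_new` are hypothetical names introduced only for this sketch; the PR itself keeps the logic inline.

```python
from collections import OrderedDict


class BoundedSeenSet:
    """Remember up to max_keys keys, evicting the oldest insertion first."""

    def __init__(self, max_keys: int) -> None:
        self._max_keys = max_keys
        self._keys: "OrderedDict[str, None]" = OrderedDict()

    def add_if_new(self, key: str) -> bool:
        """Return True if the key was not present (caller should log), else False."""
        if key in self._keys:
            return False
        self._keys[key] = None
        # FIFO eviction: popitem(last=False) removes the oldest inserted key.
        while len(self._keys) > self._max_keys:
            self._keys.popitem(last=False)
        return True

    def __len__(self) -> int:
        return len(self._keys)
```

One consequence of FIFO eviction is that a key pushed out of the set will trigger its warning again the next time it is seen. The "one-time" guarantee therefore holds only while fewer than the cap of distinct keys are active.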