fix(prisma): add consecutive-failures gate before watchdog reconnect#29502
fix(prisma): add consecutive-failures gate before watchdog reconnect#29502gvisco wants to merge 2 commits into
Conversation
A single failed SELECT 1 probe triggered an immediate engine kill and respawn. A transient probe failure from brief event loop contention or a momentary external condition now causes a full reconnect cycle, turning a self-clearing hiccup into a hard outage. This adds a consecutive probe failure counter to the watchdog loop. Reconnect fires only after K consecutive probe failures, where K is read from PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT (default 1 to preserve current behavior). Any successful probe resets the counter. The counter is also reset when a reconnect is triggered so the gate re-arms cleanly for the next failure window.
Greptile SummaryThis PR adds a consecutive-failures gate to the Prisma DB health watchdog so that transient
Confidence Score: 4/5Safe to merge; the change is backward-compatible and the new threshold logic is well-exercised by the added tests. The implementation is minimal and correct. The only gap is that the new failures_before_reconnect knob is not surfaced in the watchdog startup log, making it harder to confirm the env var was picked up at runtime. No files require special attention beyond the minor startup-log omission in litellm/proxy/utils.py.
|
| Filename | Overview |
|---|---|
| litellm/proxy/utils.py | Adds _consecutive_probe_failures counter and _watchdog_failures_before_reconnect threshold; reconnect is deferred until K consecutive probe failures occur. Default=1 preserves existing behavior. Startup log omits the new knob. |
| tests/test_litellm/proxy/db/test_prisma_self_heal.py | Three new async unit tests covering: deferred reconnect below threshold, triggered reconnect at threshold, and counter reset on a successful probe. All tests are mock-only and cover the key behavioral scenarios cleanly. |
Comments Outside Diff (1)
-
litellm/proxy/utils.py, line 4752-4758 (link)The startup info log lists every other watchdog tunable but omits the new
failures_before_reconnectvalue. When a user setsPRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECTand looks at their logs, there is no confirmation that the value was picked up, making it harder to debug misconfigured grace windows.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "fix(prisma): add consecutive-failures ga..." | Re-trigger Greptile
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
To address the issue with the documentation I have created this PR BerriAI/litellm-docs#285 |
Fixed in 5be7981; the resolved value now appears in the startup log alongside the other watchdog tunables. |
9274b9e to
5be7981
Compare
Relevant issues
Linear ticket
Pre-Submission checklist
make test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewType
🐛 Bug Fix
Changes
The Prisma DB health watchdog triggered a full engine reconnect on every single
probe failure, with no grace window. A transient
SELECT 1timeout (frommomentary event loop contention, a resolver delay, or any other short-lived
condition) immediately killed and respawned the query engine, turning a
self-clearing hiccup into a hard outage.
A new counter tracks consecutive probe failures. A reconnect is triggered only
after K consecutive failures, where K is controlled by the env var
PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT(default:1, preserving currentbehavior). Any successful probe resets the counter to zero. The counter is also
reset to zero when a reconnect is triggered, so the gate re-arms cleanly for
the next failure window.
The watchdog startup log also includes the resolved threshold value (failures_before_reconnect=N) alongside the other tunables, so the setting can be confirmed from logs without inspecting env vars directly.
At the default of
1the code path is identical to before. Setting it to3or
4(with the default 30-second probe interval) gives a ~90–120 secondgrace window before any engine kill is attempted, which is enough to absorb
transient stalls while still detecting genuine engine failures promptly.
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Run a LiteLLM proxy locally with
PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT=3set and observe that the first two watchdog probe timeouts log
"Prisma DB watchdog probe failure 1/3; deferring reconnect."and"Prisma DB watchdog probe failure 2/3; deferring reconnect."rather thanimmediately initiating a reconnect. The reconnect fires only on the third
consecutive failure.