Skip to content

fix(prisma): add consecutive-failures gate before watchdog reconnect#29502

Open
gvisco wants to merge 2 commits into
BerriAI:litellm_internal_stagingfrom
gvisco:litellm_prisma_watchdog_failures_gate
Open

fix(prisma): add consecutive-failures gate before watchdog reconnect#29502
gvisco wants to merge 2 commits into
BerriAI:litellm_internal_stagingfrom
gvisco:litellm_prisma_watchdog_failures_gate

Conversation

@gvisco

@gvisco gvisco commented Jun 2, 2026

Copy link
Copy Markdown

Relevant issues

Linear ticket

Pre-Submission checklist

  • I have added meaningful tests
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

The Prisma DB health watchdog triggered a full engine reconnect on every single
probe failure, with no grace window. A transient SELECT 1 timeout (from
momentary event loop contention, a resolver delay, or any other short-lived
condition) immediately killed and respawned the query engine, turning a
self-clearing hiccup into a hard outage.

A new counter tracks consecutive probe failures. A reconnect is triggered only
after K consecutive failures, where K is controlled by the env var
PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT (default: 1, preserving current
behavior). Any successful probe resets the counter to zero. The counter is also
reset to zero when a reconnect is triggered, so the gate re-arms cleanly for
the next failure window.
The watchdog startup log also includes the resolved threshold value (failures_before_reconnect=N) alongside the other tunables, so the setting can be confirmed from logs without inspecting env vars directly.

At the default of 1 the code path is identical to before. Setting it to 3
or 4 (with the default 30-second probe interval) gives a ~90–120 second
grace window before any engine kill is attempted, which is enough to absorb
transient stalls while still detecting genuine engine failures promptly.

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Run a LiteLLM proxy locally with PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT=3
set and observe that the first two watchdog probe timeouts log
"Prisma DB watchdog probe failure 1/3; deferring reconnect." and
"Prisma DB watchdog probe failure 2/3; deferring reconnect." rather than
immediately initiating a reconnect. The reconnect fires only on the third
consecutive failure.

PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT=3 \
  python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml \
  --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log

A single failed SELECT 1 probe triggered an immediate engine kill and
respawn. A transient probe failure from brief event loop contention or
a momentary external condition now causes a full reconnect cycle,
turning a self-clearing hiccup into a hard outage.

This adds a consecutive probe failure counter to the watchdog loop.
Reconnect fires only after K consecutive probe failures, where K is
read from PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT (default 1 to
preserve current behavior). Any successful probe resets the counter.
The counter is also reset when a reconnect is triggered so the gate
re-arms cleanly for the next failure window.
@greptile-apps

greptile-apps Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a consecutive-failures gate to the Prisma DB health watchdog so that transient SELECT 1 timeouts no longer immediately trigger a full engine reconnect. A new counter tracks back-to-back probe failures and reconnect fires only after reaching a configurable threshold (PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT, default 1 to preserve existing behavior).

  • litellm/proxy/utils.py: Adds _consecutive_probe_failures and _watchdog_failures_before_reconnect; counter increments on each qualifying failure, resets on a successful probe or when reconnect is triggered. At the default of 1 the code path is unchanged.
  • tests/test_litellm/proxy/db/test_prisma_self_heal.py: Three new mock-only tests verify deferred reconnect below threshold, reconnect at threshold, and counter reset on success.

Confidence Score: 4/5

Safe to merge; the change is backward-compatible and the new threshold logic is well-exercised by the added tests.

The implementation is minimal and correct. The only gap is that the new failures_before_reconnect knob is not surfaced in the watchdog startup log, making it harder to confirm the env var was picked up at runtime.

No files require special attention beyond the minor startup-log omission in litellm/proxy/utils.py.

Important Files Changed

Filename Overview
litellm/proxy/utils.py Adds _consecutive_probe_failures counter and _watchdog_failures_before_reconnect threshold; reconnect is deferred until K consecutive probe failures occur. Default=1 preserves existing behavior. Startup log omits the new knob.
tests/test_litellm/proxy/db/test_prisma_self_heal.py Three new async unit tests covering: deferred reconnect below threshold, triggered reconnect at threshold, and counter reset on a successful probe. All tests are mock-only and cover the key behavioral scenarios cleanly.

Comments Outside Diff (1)

  1. litellm/proxy/utils.py, line 4752-4758 (link)

    P2 The startup info log lists every other watchdog tunable but omits the new failures_before_reconnect value. When a user sets PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT and looks at their logs, there is no confirmation that the value was picked up, making it harder to debug misconfigured grace windows.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "fix(prisma): add consecutive-failures ga..." | Re-trigger Greptile

@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@gvisco

gvisco commented Jun 2, 2026

Copy link
Copy Markdown
Author

To address the issue with the documentation I have created this PR BerriAI/litellm-docs#285

@gvisco

gvisco commented Jun 2, 2026

Copy link
Copy Markdown
Author

Greptile Summary

[...]

Comments Outside Diff (1)

  1. litellm/proxy/utils.py, line 4752-4758 (link)
    P2 The startup info log lists every other watchdog tunable but omits the new failures_before_reconnect value. When a user sets PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT and looks at their logs, there is no confirmation that the value was picked up, making it harder to debug misconfigured grace windows.
    [...]

Fixed in 5be7981; the resolved value now appears in the startup log alongside the other watchdog tunables.

@gvisco gvisco force-pushed the litellm_prisma_watchdog_failures_gate branch from 9274b9e to 5be7981 Compare June 2, 2026 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant