fix(prisma): add consecutive-failures gate before watchdog reconnect by gvisco · Pull Request #29502 · BerriAI/litellm

gvisco · 2026-06-02T15:16:30Z

Relevant issues

Linear ticket

Pre-Submission checklist

I have added meaningful tests
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible; it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

The Prisma DB health watchdog triggered a full engine reconnect on every single
probe failure, with no grace window. A transient SELECT 1 timeout (from
momentary event loop contention, a resolver delay, or any other short-lived
condition) immediately killed and respawned the query engine, turning a
self-clearing hiccup into a hard outage.

A new counter tracks consecutive probe failures. A reconnect is triggered only
after K consecutive failures, where K is controlled by the env var
PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT (default: 1, preserving current
behavior). Any successful probe resets the counter to zero. The counter is also
reset to zero when a reconnect is triggered, so the gate re-arms cleanly for
the next failure window.
The watchdog startup log also includes the resolved threshold value (failures_before_reconnect=N) alongside the other tunables, so the setting can be confirmed from logs without inspecting env vars directly.
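The counter behavior described above can be sketched as a small standalone class. This is a minimal illustration, not the code in litellm/proxy/utils.py: the class name `ConsecutiveFailureGate` and its methods are hypothetical, and clamping the threshold to at least 1 is an assumption of the sketch.

```python
class ConsecutiveFailureGate:
    """Hypothetical helper: signal a reconnect only after `threshold` back-to-back failures."""

    def __init__(self, threshold: int = 1) -> None:
        # Assumption of this sketch: values below 1 are clamped to 1.
        self.threshold = max(1, threshold)
        self.consecutive_failures = 0

    def record_success(self) -> None:
        # Any successful probe resets the counter to zero.
        self.consecutive_failures = 0

    def record_failure(self) -> bool:
        """Count one failed probe; return True when a reconnect should fire."""
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return False  # still inside the grace window; defer the reconnect
        # Threshold reached: reset so the gate re-arms for the next failure window.
        self.consecutive_failures = 0
        return True


gate = ConsecutiveFailureGate(threshold=3)
outcomes = [gate.record_failure() for _ in range(3)]
```

With a threshold of 3, only the third consecutive failure returns `True`; with the default of 1, every failure does, which matches the pre-existing behavior.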

At the default of 1 the code path is identical to before. Setting it to 3
or 4 (with the default 30-second probe interval) gives a ~90–120 second
grace window before any engine kill is attempted, which is enough to absorb
transient stalls while still detecting genuine engine failures promptly.
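The grace-window estimate above is simply the threshold multiplied by the probe interval. The helper below is a hypothetical sketch of that arithmetic; it ignores the probe timeout itself and the moment within an interval at which the first failure lands, so the real delay is only approximately this value.

```python
def grace_window_seconds(
    failures_before_reconnect: int,
    probe_interval_seconds: float = 30.0,  # default probe interval stated in this PR
) -> float:
    """Rough time before a reconnect is attempted (hypothetical helper)."""
    return failures_before_reconnect * probe_interval_seconds


window_for_3 = grace_window_seconds(3)
window_for_4 = grace_window_seconds(4)
```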

CI (LiteLLM team)

CI status guideline:

50-55 passing tests: main is stable with minor issues.

45-49 passing tests: acceptable but needs attention

<= 40 passing tests: unstable; be careful with your merges and assess the risk.

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Run a LiteLLM proxy locally with PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT=3
set and observe that the first two watchdog probe timeouts log
"Prisma DB watchdog probe failure 1/3; deferring reconnect." and
"Prisma DB watchdog probe failure 2/3; deferring reconnect." rather than
immediately initiating a reconnect. The reconnect fires only on the third
consecutive failure.

PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT=3 \
  python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml \
  --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log
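One way to check the captured litellm.log is to scan it for the deferral messages quoted above. The snippet below is a small sketch that assumes the message text is exactly as quoted in this description; the function name `deferred_failures` is hypothetical.

```python
import re
from typing import List, Tuple

# Pattern built from the log message quoted in the PR description.
DEFERRED_RE = re.compile(
    r"Prisma DB watchdog probe failure (\d+)/(\d+); deferring reconnect\."
)


def deferred_failures(log_text: str) -> List[Tuple[int, int]]:
    """Return (failure_count, threshold) for each deferral message found."""
    return [(int(m.group(1)), int(m.group(2))) for m in DEFERRED_RE.finditer(log_text)]


sample = (
    "Prisma DB watchdog probe failure 1/3; deferring reconnect.\n"
    "Prisma DB watchdog probe failure 2/3; deferring reconnect.\n"
)
found = deferred_failures(sample)
```

Against a real run, the same function could be applied to the contents of litellm.log; two deferrals followed by a reconnect would be the expected pattern at a threshold of 3.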

Previously, a single failed SELECT 1 probe triggered an immediate engine kill and respawn, so a transient probe failure from brief event loop contention or a momentary external condition caused a full reconnect cycle, turning a self-clearing hiccup into a hard outage. This commit adds a consecutive probe failure counter to the watchdog loop. Reconnect fires only after K consecutive probe failures, where K is read from PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT (default 1 to preserve current behavior). Any successful probe resets the counter. The counter is also reset when a reconnect is triggered, so the gate re-arms cleanly for the next failure window.

greptile-apps · 2026-06-02T15:19:13Z

Greptile Summary

This PR adds a consecutive-failures gate to the Prisma DB health watchdog so that transient SELECT 1 timeouts no longer immediately trigger a full engine reconnect. A new counter tracks back-to-back probe failures and reconnect fires only after reaching a configurable threshold (PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT, default 1 to preserve existing behavior).

litellm/proxy/utils.py: Adds _consecutive_probe_failures and _watchdog_failures_before_reconnect; counter increments on each qualifying failure, resets on a successful probe or when reconnect is triggered. At the default of 1 the code path is unchanged.
tests/test_litellm/proxy/db/test_prisma_self_heal.py: Three new mock-only tests verify deferred reconnect below threshold, reconnect at threshold, and counter reset on success.

Confidence Score: 4/5

Safe to merge; the change is backward-compatible and the new threshold logic is well-exercised by the added tests.

The implementation is minimal and correct. The only gap is that the new failures_before_reconnect knob is not surfaced in the watchdog startup log, making it harder to confirm the env var was picked up at runtime.

No files require special attention beyond the minor startup-log omission in litellm/proxy/utils.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/utils.py | Adds `_consecutive_probe_failures` counter and `_watchdog_failures_before_reconnect` threshold; reconnect is deferred until K consecutive probe failures occur. Default=1 preserves existing behavior. Startup log omits the new knob. |
| tests/test_litellm/proxy/db/test_prisma_self_heal.py | Three new async unit tests covering: deferred reconnect below threshold, triggered reconnect at threshold, and counter reset on a successful probe. All tests are mock-only and cover the key behavioral scenarios cleanly. |

Comments Outside Diff (1)

litellm/proxy/utils.py, line 4752-4758 (link)

The startup info log lists every other watchdog tunable but omits the new failures_before_reconnect value. When a user sets PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT and looks at their logs, there is no confirmation that the value was picked up, making it harder to debug misconfigured grace windows.

Reviews (1): Last reviewed commit: "fix(prisma): add consecutive-failures ga..."

codecov · 2026-06-02T15:26:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

gvisco · 2026-06-02T16:03:09Z

To address the documentation gap, I have opened BerriAI/litellm-docs#285.

gvisco · 2026-06-02T16:12:19Z

Greptile Summary

[...]

Comments Outside Diff (1)

litellm/proxy/utils.py, line 4752-4758 (link)
The startup info log lists every other watchdog tunable but omits the new failures_before_reconnect value. When a user sets PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT and looks at their logs, there is no confirmation that the value was picked up, making it harder to debug misconfigured grace windows.
[...]

Fixed in 5be7981; the resolved value now appears in the startup log alongside the other watchdog tunables.
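For illustration, resolving the env var and surfacing it at startup could look like the sketch below. Only the variable name, its default of 1, and the `failures_before_reconnect=N` fragment come from this PR; the function name, the logger name, the rest of the log message, and the fallback to 1 on invalid or non-positive values are assumptions of this sketch rather than the behavior in 5be7981.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("watchdog_sketch")  # hypothetical logger name


def resolve_failures_before_reconnect(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the threshold from the environment, defaulting to 1 (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT", "1")
    try:
        value = int(raw)
    except ValueError:
        return 1  # assumption: unparsable values fall back to the default
    return max(1, value)  # assumption: values below 1 are clamped to 1


threshold = resolve_failures_before_reconnect(
    {"PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT": "3"}
)
# Including the resolved value lets operators confirm the setting from logs.
logger.info("Prisma DB watchdog started (failures_before_reconnect=%d)", threshold)
```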

gvisco mentioned this pull request Jun 2, 2026

docs: document PRISMA_WATCHDOG_FAILURES_BEFORE_RECONNECT env var BerriAI/litellm-docs#285

Open

fix(prisma): include failures_before_reconnect in watchdog startup log

5be7981

gvisco force-pushed the litellm_prisma_watchdog_failures_gate branch from 9274b9e to 5be7981 on June 2, 2026 16:35

gvisco mentioned this pull request Jun 2, 2026

[Bug]: Prisma reconnection failed #26886

Open

1 task

gvisco wants to merge 2 commits into BerriAI:litellm_internal_staging from gvisco:litellm_prisma_watchdog_failures_gate
