[$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes by foxyManTou · Pull Request #34 · thanhle74/kickama

foxyManTou · 2026-06-22T23:36:06Z

Summary

Add configurable retry/backoff and circuit breaker patterns to the HTTP health check probes in tools/health_check.py, plus result aggregation and proper logging.

Changes

RetryPolicy class — configurable max retries, exponential backoff (delay = base_delay * (backoff_factor ^ attempt)), optional jitter (±25%), retry on 5xx/timeout only
CircuitBreakerState/CircuitBreakerRegistry — CLOSED → OPEN after N consecutive failures, HALF_OPEN after cooldown, success/failure transitions, per-endpoint tracking
HealthCheckAggregator — result summary stats (total/passed/warnings/critical/pass_rate), degraded service listing, WARNING-level logging for degraded services
New CLI flags: --max-retries, --backoff-factor, --base-delay, --circuit-threshold, --circuit-cooldown
Updated output — aggregation summary and retry/circuit-breaker state in report
33 unit tests covering retry policy, exponential backoff, jitter, circuit breaker state machine, registry, aggregator, and integration tests (all pass)
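
The backoff formula in the RetryPolicy item above can be sketched as follows. This is a minimal illustration rather than the PR's code: the function name `backoff_delay`, the `rng` parameter, and the default values are hypothetical, and the `^` in the description is read as exponentiation (`**` in Python, where `^` would be XOR).

```python
import random


def backoff_delay(attempt, base_delay=1.0, backoff_factor=2.0, jitter=True, rng=random):
    """Return the delay in seconds before retry number `attempt` (0-based).

    Hypothetical helper mirroring delay = base_delay * backoff_factor ** attempt,
    with optional jitter of +/-25% around that value.
    """
    delay = base_delay * (backoff_factor ** attempt)
    if jitter:
        # Scale by a factor drawn uniformly from [0.75, 1.25].
        delay *= rng.uniform(0.75, 1.25)
    return delay
```

With jitter disabled and these assumed defaults, attempts 0 through 3 wait 1, 2, 4, and 8 seconds.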

Testing

$ python3 -m unittest tools.test_health_check -v
Ran 33 tests — OK

$ python3 build.py
Build completed — frontend PASS, diagnostic artifacts generated

Checklist

Relevant modules affected by these changes build locally
Tests pass locally
Diagnostic build log is committed in this PR
Documentation has been updated, if applicable
Configuration or schema changes are documented, if applicable
No generated build artifacts are committed, except the required diagnostic build log
Changes are scoped to the PR purpose and avoid unrelated cleanup
Security, privacy, and error-handling implications have been considered

Diagnostic artifacts:

diagnostic/build-94e0fb08.logd (encrypted)
diagnostic/build-94e0fb08.json

Summary by CodeRabbit

Release Notes

New Features
- Added retry mechanism with exponential backoff and jitter for HTTP health checks
- Implemented circuit breaker pattern for endpoint resilience
- Enhanced health check reporting with aggregated metrics and degraded service details
- Added CLI options for configuring retry and circuit breaker behavior
Tests
- Added comprehensive test coverage for retry, backoff, and circuit breaker functionality

- Add RetryPolicy class with exponential backoff: delay = base_delay * (backoff_factor ^ attempt)
- Add CircuitBreakerState/CircuitBreakerRegistry for circuit breaker pattern
- Add HealthCheckAggregator for result summary stats
- Add proper logging with WARNING level for degraded services
- Add --max-retries, --backoff-factor, --base-delay, --circuit-threshold, --circuit-cooldown flags
- Add comprehensive unit tests (33 tests covering retry, backoff, circuit breaker)
- Updated run_health_checks and main() to wire all new features

coderabbitai · 2026-06-22T23:36:19Z

📝 Walkthrough

Adds exponential-retry with jitter and a per-endpoint circuit breaker to tools/health_check.py via new RetryPolicy, CircuitBreakerState, CircuitBreakerRegistry, and HealthCheckAggregator classes. check_http_service and run_health_checks are updated to use them, the CLI gains five new flags, and tools/test_health_check.py covers all new behavior. A build diagnostic JSON file is also added.

Changes

Health Check Resilience Enhancement

Layer / File(s)	Summary
RetryPolicy, CircuitBreaker, and Aggregator contracts `tools/health_check.py`	Adds module defaults, logger, `CircuitBreakerState`/`CircuitBreakerRegistry` with CLOSED/OPEN/HALF_OPEN transitions, `RetryPolicy` with exponential backoff and ±25% jitter, and `HealthCheckAggregator` with summary computation and degraded-service logging.
`check_http_service` with retry and circuit breaker `tools/health_check.py`	Rewrites the HTTP probe to skip when the breaker is OPEN, map status codes to OK/WARNING/CRITICAL, record failures into the breaker, retry on 5xx/timeout with backoff, and return retries-used as a fourth return value.
`run_health_checks` aggregation and wiring `tools/health_check.py`	Extends `run_health_checks` with `retry_policy` and `circuit_registry` parameters, wires per-endpoint circuit breakers into `check_http_service`, records all check outcomes (services, infra latency, disk/memory/load, TLS cert days) into the aggregator, and writes aggregation and overall status into the result dict.
Report output, CLI flags, and `main()` wiring `tools/health_check.py`	Adds aggregated totals and retry/circuit-breaker configuration to `print_health_report`, appends retry counts to per-check detail lines, introduces five new CLI flags (`--max-retries`, `--backoff-factor`, `--base-delay`, `--circuit-threshold`, `--circuit-cooldown`), and passes constructed policy/registry into `run_health_checks`.
Unit and integration tests `tools/test_health_check.py`	Covers `RetryPolicy` defaults, backoff math, jitter bounds, and eligibility rules; `CircuitBreakerState` transitions; `CircuitBreakerRegistry` isolation and reset; `HealthCheckAggregator` counting and status derivation; and `check_http_service` retry/OPEN-skip/exhaustion behavior via mocked `HTTPConnection`.
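
The summary computation described for `HealthCheckAggregator` might look roughly like the sketch below. The class name `AggregatorSketch`, the dictionary keys, and the choice to express `pass_rate` as a percentage are assumptions made for illustration; the PR's actual fields may differ.

```python
from collections import Counter


class AggregatorSketch:
    """Hypothetical stand-in for HealthCheckAggregator; field names are assumed."""

    def __init__(self):
        self.results = []

    def add_result(self, name, status, retries=0):
        self.results.append({"name": name, "status": status, "retries": retries})

    def summary(self):
        counts = Counter(r["status"] for r in self.results)
        total = len(self.results)
        passed = counts["OK"]
        return {
            "total": total,
            "passed": passed,
            "warnings": counts["WARNING"],
            "critical": counts["CRITICAL"],
            # Avoid dividing by zero when no checks were recorded.
            "pass_rate": (passed / total * 100.0) if total else 0.0,
            # Anything that is not OK is listed as degraded.
            "degraded": [r["name"] for r in self.results if r["status"] != "OK"],
        }
```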

Diagnostic Build Metadata

Layer / File(s)	Summary
Build diagnostic metadata file `diagnostic/build-94e0fb08.json`	Creates the JSON file with generation timestamp, commit reference, encrypted artifact path and decryption credentials, per-module status/elapsed-time/output records (1 passed, 9 failed), and a PR note.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI / main()
  participant RHC as run_health_checks
  participant CBR as CircuitBreakerRegistry
  participant CHS as check_http_service
  participant CBS as CircuitBreakerState
  participant AGG as HealthCheckAggregator

  CLI->>RHC: retry_policy, circuit_registry
  RHC->>CBR: get(endpoint)
  CBR-->>RHC: CircuitBreakerState
  RHC->>CHS: host, port, path, timeout, retry_policy, circuit_breaker
  CHS->>CBS: is_open()?
  alt OPEN
    CBS-->>CHS: skip
    CHS-->>RHC: CRITICAL, retries=0
  else CLOSED / HALF_OPEN
    loop up to max_retries
      CHS->>CBS: record_success() or record_failure()
    end
    CHS-->>RHC: status, detail, http_code, retries_used
  end
  RHC->>AGG: add_result(name, status, metric, retries)
  AGG-->>RHC: summary (totals, pass_rate, overall_status)
  RHC-->>CLI: results dict with aggregation
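
The CLOSED / OPEN / HALF_OPEN transitions in the diagram can be sketched as a small state machine. This is an illustrative approximation, not the PR's `CircuitBreakerState`: the class name `BreakerSketch`, the `opened_at` attribute, and the injectable `clock` parameter are all hypothetical.

```python
import time


class BreakerSketch:
    """Hypothetical circuit breaker; not the PR's CircuitBreakerState."""

    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests need no real sleeps
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def can_probe(self):
        if self.state == self.OPEN and self.clock() - self.opened_at >= self.cooldown:
            self.state = self.HALF_OPEN  # cooldown elapsed: allow a trial probe
        return self.state != self.OPEN

    def record_success(self):
        self.state = self.CLOSED
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed trial probe reopens at once; otherwise wait for the threshold.
        if self.state == self.HALF_OPEN or self.failures >= self.threshold:
            self.state = self.OPEN
            self.opened_at = self.clock()
```

Passing the clock in as a parameter is one way to keep cooldown tests deterministic without patching `time`.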

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

[$25 BOUNTY] [Python] feat: Add retry/backoff support to health_check.py #2 — The PR directly implements the retry/backoff with exponential jitter and circuit-breaker features requested in this issue, including the CLI flags and unit tests matching the acceptance criteria.
[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes #15 — The PR satisfies all requirements in this bounty: exponential-retry with jitter, circuit-breaker pattern, health-check result aggregation, WARNING-level logging, comprehensive unit tests, and a diagnostic artifact.

Poem

🐇 Hoppity-hop through the network maze,
With retries and jitter to lighten the haze.
If the circuit breaks open, I wait for the cool,
Then probe once again — no rabbit's a fool!
Aggregated results, all tidy and bright,
The health check now passes the test with delight! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding retry/backoff and circuit breaker patterns to health_check HTTP probes.
Description check	✅ Passed	The PR description includes all required sections (Summary, Changes, Testing, Checklist) with comprehensive details about the implementation, testing approach, and verification of checklist items.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (3)

tools/test_health_check.py (3)

141-141: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Resolve Ruff warnings for unused loop/unpacked variables

Ruff hints here are valid; rename unused loop vars to _ and unused unpacked values to _detail/_code/_retries to keep lint clean.

Also applies to: 161-161, 174-174, 187-187, 200-200, 350-350, 368-368, 383-383

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` at line 141, The loop variable `i` in the for
loop (and similarly unused variables at lines 161, 174, 187, 200, 350, 368, and
383) are triggering Ruff warnings because they are not used within their
respective loops or unpacking assignments. Replace unused loop variables with a
single underscore `_`, and replace unused unpacked values with descriptive
underscore-prefixed names such as `_detail`, `_code`, or `_retries` depending on
what value is being unpacked. This will satisfy Ruff's linting requirements
while keeping the code intent clear.

Source: Linters/SAST tools

158-169: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Cooldown transition tests are timing-flaky due to real sleeps

Using time.sleep(0.15) in unit tests makes CI timing-sensitive and slower. Prefer mocking tools.health_check.time.time to deterministically drive OPEN → HALF_OPEN transitions.

Deterministic pattern (example)

     def test_cooldown_transitions_to_half_open(self):
         """Test that after cooldown, open circuit moves to HALF_OPEN."""
         cb = CircuitBreakerState(threshold=3, cooldown=0.1)
-        for i in range(3):
+        for _ in range(3):
             cb.record_failure()
         self.assertEqual(cb.state, cb.OPEN)
-        self.assertFalse(cb.can_probe())
-
-        # Sleep past cooldown
-        time.sleep(0.15)
-        self.assertTrue(cb.can_probe())
-        self.assertEqual(cb.state, cb.HALF_OPEN)
+        with mock.patch("tools.health_check.time.time") as mocked_time:
+            mocked_time.return_value = cb.last_failure_time + 0.05
+            self.assertFalse(cb.can_probe())
+            mocked_time.return_value = cb.last_failure_time + 0.11
+            self.assertTrue(cb.can_probe())
+            self.assertEqual(cb.state, cb.HALF_OPEN)

Also applies to: 171-193

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` around lines 158 - 169, Replace the real
`time.sleep(0.15)` call in the test_cooldown_transitions_to_half_open method
with a mock of `tools.health_check.time.time` to deterministically advance time.
Mock `time.time` to return a timestamp that is greater than the cooldown
duration (0.1 seconds) from when the failures were recorded, allowing the
circuit breaker to transition from OPEN to HALF_OPEN without introducing real
delays. Apply the same mocking pattern to the other timing-dependent tests
mentioned in lines 171-193 that also use `time.sleep()`.

63-73: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

test_jitter_range does not actually verify the ±25% contract

On Line 71, the lower bound is relaxed to min_delay * 0.5, which allows values well outside the documented jitter window. This can let regressions pass.

Proposed tightening

     def test_jitter_range(self):
         """Test that jitter is within ±25%."""
         rp = RetryPolicy(backoff_factor=2.0, base_delay=1.0, jitter=True)
         for attempt in range(10):
             delay = rp.get_delay(attempt)
             expected_base = 1.0 * (2.0 ** attempt)
             min_delay = expected_base * 0.75
             max_delay = expected_base * 1.25
-            self.assertGreaterEqual(delay, min_delay * 0.5)  # with floor
-            self.assertLessEqual(delay, max_delay + 0.1)
+            self.assertGreaterEqual(delay, min_delay)
+            self.assertLessEqual(delay, max_delay)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` around lines 63 - 73, The test_jitter_range
method is not properly enforcing the documented ±25% jitter contract. The lower
bound assertion is checking against min_delay * 0.5, which is too permissive and
allows delays well outside the documented range, and the upper bound is checking
against max_delay + 0.1, which further relaxes the constraint. Fix this by
removing the * 0.5 multiplier from the lower bound check and removing the + 0.1
addition from the upper bound check in the assertions within the
test_jitter_range method, so the test properly verifies the delay falls within
the exact ±25% jitter window.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health_check.py`:
- Around line 536-537: The circuit_registry.to_dict() call at line 537 captures
the circuit breaker state before the probes execute and mutate the breaker
states via get() and record_failure() calls, resulting in a stale snapshot. Move
the assignment of circuit_registry.to_dict() to results["circuit_breakers"] to
after all health check probes complete execution, so the snapshot reflects the
final mutated state of the circuit breakers. Apply the same fix to the similar
circuit_registry serialization at the location noted in "Also applies to:
640-641".
- Around line 396-405: When a circuit breaker is in HALF_OPEN state, it should
not retry failed probes because the breaker is already attempting to recover and
retries would delay the decision to reopen. Before entering the retry logic
block (where retry_policy.should_retry is called), add a check to determine if
circuit_breaker.state equals HALF_OPEN. If the circuit breaker is in HALF_OPEN
state, record the failure and return immediately without entering the retry
block that calls should_retry, get_delay, and time.sleep. This ensures that any
failure during HALF_OPEN state is recorded promptly and the circuit breaker can
transition back to OPEN if needed, rather than being held in retry loops.
- Around line 364-409: The HTTP connection object conn is not being closed when
exceptions occur during request, getresponse, or read operations, causing socket
leaks on retry attempts. Move the conn.close() call into a finally block that
wraps the entire HTTP communication section (from conn.request through
resp.read) to ensure the connection is always closed regardless of whether an
exception is raised. This way, even if an exception is caught in the except
block, the connection will be properly closed before the retry logic executes.
- Around line 373-390: The code currently returns immediately when encountering
5xx responses (status_code >= 500) without checking if the RetryPolicy allows
retrying the request. To fix this, before recording the circuit breaker failure
and returning on 5xx status codes, first check if the retry policy indicates a
retry should be attempted using the should_retry method. Only record the failure
and return the CRITICAL result if the retry policy determines the request should
not be retried or if retry attempts have been exhausted. Otherwise, allow the
retry logic to proceed. This ensures the advertised retry contract for 5xx
probes is actually honored.

In `@tools/test_health_check.py`:
- Around line 359-374: The test_circuit_breaker_skips_probe test verifies the
response when the circuit breaker is open, but does not assert that the actual
probe was skipped and no HTTP connection was attempted. Add
mock_conn.assert_not_called() after the existing assertions to enforce that the
mocked HTTP connection was never called, thus verifying the core contract that
an open circuit breaker skips the probe entirely. Optionally, also add
assertions to verify that retries and code have their expected default values.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90825d78-b194-4c1d-8366-b4145461c177

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 02db8b7.

📒 Files selected for processing (4)

diagnostic/build-94e0fb08.json
diagnostic/build-94e0fb08.logd
tools/health_check.py
tools/test_health_check.py

coderabbitai · 2026-06-22T23:42:36Z

+    for attempt in range(max_attempts):
+        try:
+            conn = http.client.HTTPConnection(host, port, timeout=timeout)
+            conn.request("GET", path)
+            resp = conn.getresponse()
+            status_code = resp.status
+            body = resp.read().decode("utf-8", errors="replace")[:200]
+            conn.close()
+
+            if status_code == 200:
+                result = "OK"
+                detail = f"HTTP {status_code}"
+            elif status_code < 500:
+                result = "WARNING"
+                detail = f"HTTP {status_code}: {body[:100]}"
+            else:
+                result = "CRITICAL"
+                detail = f"HTTP {status_code}: {body[:100]}"
+
+            # Record in circuit breaker
+            if circuit_breaker is not None:
+                if result == "OK":
+                    circuit_breaker.record_success()
+                else:
+                    circuit_breaker.record_failure()
+
+            return result, detail, status_code, attempt
+
+        except Exception as e:
+            status_code = 0
+            detail = str(e)
+
+            # Check if we should retry
+            if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
+                delay = retry_policy.get_delay(attempt)
+                logger.info(
+                    "Attempt %d/%d failed for %s. Retrying in %.2fs...",
+                    attempt + 1, retry_policy.max_retries + 1,
+                    endpoint_key, delay,
+                )
+                time.sleep(delay)
+                continue
+            else:
+                if circuit_breaker is not None:
+                    circuit_breaker.record_failure()
+                return "CRITICAL", detail, status_code, attempt


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Close the HTTP connection on failed attempts.

If request(), getresponse(), or read() raises, Line 371 is skipped and retry/watch mode can leak sockets across repeated failures. Close the connection in a finally block for every attempt.

🛡️ Proposed fix

     for attempt in range(max_attempts):
+        conn = None
         try:
             conn = http.client.HTTPConnection(host, port, timeout=timeout)
             conn.request("GET", path)
             resp = conn.getresponse()
             status_code = resp.status
             body = resp.read().decode("utf-8", errors="replace")[:200]
             conn.close()
+            conn = None

             if status_code == 200:
                 result = "OK"
@@
                 if circuit_breaker is not None:
                     circuit_breaker.record_failure()
                 return "CRITICAL", detail, status_code, attempt
+        finally:
+            if conn is not None:
+                conn.close()
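
An alternative to the explicit `finally` block is `contextlib.closing`, which calls `close()` on exit whether or not an exception was raised. The sketch below is not taken from the PR: `probe_once` and its `conn_factory` parameter are hypothetical names, with the factory added only so the behavior can be exercised without a network.

```python
import http.client
from contextlib import closing


def probe_once(host, port, path, timeout, conn_factory=http.client.HTTPConnection):
    """Hypothetical single-attempt probe; returns (status_code, body_prefix)."""
    # closing() guarantees conn.close() runs even if request(),
    # getresponse(), or read() raises.
    with closing(conn_factory(host, port, timeout=timeout)) as conn:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")[:200]
        return resp.status, body
```

A retry loop could call a helper like this once per attempt, so each attempt owns and releases its own socket.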

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 392-392: Do not catch blind exception: Exception

(BLE001)

coderabbitai · 2026-06-22T23:42:36Z

+            if status_code == 200:
+                result = "OK"
+                detail = f"HTTP {status_code}"
+            elif status_code < 500:
+                result = "WARNING"
+                detail = f"HTTP {status_code}: {body[:100]}"
+            else:
+                result = "CRITICAL"
+                detail = f"HTTP {status_code}: {body[:100]}"
+
+            # Record in circuit breaker
+            if circuit_breaker is not None:
+                if result == "OK":
+                    circuit_breaker.record_success()
+                else:
+                    circuit_breaker.record_failure()
+
+            return result, detail, status_code, attempt


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Retry eligible HTTP 5xx responses before returning.

Line 380 marks 5xx responses CRITICAL, but Lines 383-390 immediately record the failure and return, so RetryPolicy.should_retry(..., 500+) is never used for actual HTTP responses. This breaks the advertised retry contract for 5xx probes.

🐛 Proposed fix

             else:
                 result = "CRITICAL"
                 detail = f"HTTP {status_code}: {body[:100]}"

+            if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
+                delay = retry_policy.get_delay(attempt)
+                logger.info(
+                    "Attempt %d/%d returned HTTP %d for %s. Retrying in %.2fs...",
+                    attempt + 1, retry_policy.max_retries + 1,
+                    status_code, endpoint_key, delay,
+                )
+                time.sleep(delay)
+                continue
+
             # Record in circuit breaker
             if circuit_breaker is not None:
                 if result == "OK":
                     circuit_breaker.record_success()
                 else:
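
One way to honor the retry contract for both failure modes is to funnel exceptions and 5xx responses through a single retry decision. The sketch below is an assumption-laden illustration, not the PR's `check_http_service`: `probe_with_retry`, `do_probe`, `get_delay`, and `sleep` are hypothetical names, and only `OSError` (which covers timeouts and connection errors) is caught.

```python
import time


def probe_with_retry(do_probe, max_retries=3, get_delay=lambda attempt: 0.0, sleep=time.sleep):
    """Hypothetical retry loop: `do_probe` returns an HTTP status code or raises.

    Retries on OSError and on 5xx responses.
    Returns (result, status_code, retries_used).
    """
    for attempt in range(max_retries + 1):
        try:
            status_code = do_probe()
        except OSError:
            status_code = 0  # 0 stands for "no HTTP response", as in the PR's code
        retryable = status_code == 0 or status_code >= 500
        if retryable and attempt < max_retries:
            sleep(get_delay(attempt))
            continue
        # Either the response is final or retries are exhausted.
        if status_code == 200:
            return "OK", status_code, attempt
        if 0 < status_code < 500:
            return "WARNING", status_code, attempt
        return "CRITICAL", status_code, attempt
```

Because the last iteration never takes the `continue` branch, the loop always returns a result, and a 5xx that persists through every attempt is reported as CRITICAL with the full retry count.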

coderabbitai · 2026-06-22T23:42:36Z

+            # Check if we should retry
+            if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
+                delay = retry_policy.get_delay(attempt)
+                logger.info(
+                    "Attempt %d/%d failed for %s. Retrying in %.2fs...",
+                    attempt + 1, retry_policy.max_retries + 1,
+                    endpoint_key, delay,
+                )
+                time.sleep(delay)
+                continue


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not retry a failed HALF_OPEN probe.

When can_probe() moves the breaker to HALF_OPEN, a timeout/connection failure reaches Lines 397-405 and can retry multiple times before the breaker is reopened. Record the failure and return immediately when circuit_breaker.state == HALF_OPEN.

🔒 Proposed fix

             detail = str(e)

+            if circuit_breaker is not None and circuit_breaker.state == circuit_breaker.HALF_OPEN:
+                circuit_breaker.record_failure()
+                return "CRITICAL", detail, status_code, attempt
+
             # Check if we should retry
             if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
                 delay = retry_policy.get_delay(attempt)

coderabbitai · 2026-06-22T23:42:36Z

+        "retry_policy": retry_policy.to_dict() if retry_policy else None,
+        "circuit_breakers": circuit_registry.to_dict() if circuit_registry else {},


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Snapshot circuit breakers after probes update them.

Line 537 serializes the registry before get()/record_failure() mutate breaker state, so results["circuit_breakers"] is stale and print_health_report() can miss newly OPEN breakers. Refresh the snapshot after all checks complete.

🐛 Proposed fix

         "overall_status": "OK",
         "aggregation": {},
         "retry_policy": retry_policy.to_dict() if retry_policy else None,
-        "circuit_breakers": circuit_registry.to_dict() if circuit_registry else {},
+        "circuit_breakers": {},
@@
     results["overall_status"] = "OK" if all_ok else "DEGRADED"
     results["aggregation"] = aggregator.summary()
+    if circuit_registry is not None:
+        results["circuit_breakers"] = circuit_registry.to_dict()

     # Log degraded services

Also applies to: 640-641

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/health_check.py` around lines 536 - 537, The circuit_registry.to_dict() call at line 537 captures the circuit breaker state before the probes execute and mutate the breaker states via get() and record_failure() calls, resulting in a stale snapshot. Move the assignment of circuit_registry.to_dict() to results["circuit_breakers"] to after all health check probes complete execution, so the snapshot reflects the final mutated state of the circuit breakers. Apply the same fix to the similar circuit_registry serialization at the location noted in "Also applies to: 640-641".
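The stale-snapshot problem can be reproduced in isolation. The sketch below assumes that `to_dict()` returns a copy of the registry state, which is what makes an early call go stale; `Registry` and `run_checks` are hypothetical names for this example and only `to_dict()` mirrors a method named in the PR.

```python
class Registry:
    """Hypothetical registry; only to_dict() mirrors a name used in the PR."""

    def __init__(self):
        self._states = {}

    def record_failure(self, endpoint):
        # Simplified: a single failure opens the breaker for this endpoint.
        self._states[endpoint] = "OPEN"

    def to_dict(self):
        # Returns a copy, so the result reflects state at call time only.
        return dict(self._states)


def run_checks(registry, endpoints):
    results = {"circuit_breakers": {}}  # placeholder, filled in after the probes
    early = registry.to_dict()          # snapshot taken before any probe runs
    for endpoint in endpoints:
        registry.record_failure(endpoint)  # probes mutate breaker state
    results["circuit_breakers"] = registry.to_dict()  # snapshot after probes
    return early, results
```

Here `early` stays empty while `results["circuit_breakers"]` contains the breakers opened during the run, which is the difference the proposed fix is meant to close.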

coderabbitai · 2026-06-22T23:42:36Z

+    def test_circuit_breaker_skips_probe(self, mock_conn):
+        """Test that open circuit breaker skips probe."""
+        from tools.health_check import CircuitBreakerState
+
+        cb = CircuitBreakerState(threshold=2, cooldown=30)
+        cb.record_failure()
+        cb.record_failure()  # circuit opens
+        self.assertFalse(cb.can_probe())
+
+        status, detail, code, retries = check_http_service(
+            "localhost", 8080, "/health", 5,
+            circuit_breaker=cb,
+        )
+        self.assertEqual(status, "CRITICAL")
+        self.assertIn("Circuit breaker OPEN", detail)
+


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

OPEN-circuit short-circuit behavior is not fully asserted

This test should also verify no outbound probe occurs. Add mock_conn.assert_not_called() (and optionally assert retries/code defaults) so the core skip contract is enforced.

Suggested assertion

         status, detail, code, retries = check_http_service(
             "localhost", 8080, "/health", 5,
             circuit_breaker=cb,
         )
         self.assertEqual(status, "CRITICAL")
         self.assertIn("Circuit breaker OPEN", detail)
+        mock_conn.assert_not_called()
+        self.assertEqual(retries, 0)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    def test_circuit_breaker_skips_probe(self, mock_conn):
        """Test that open circuit breaker skips probe."""
        from tools.health_check import CircuitBreakerState

        cb = CircuitBreakerState(threshold=2, cooldown=30)
        cb.record_failure()
        cb.record_failure()  # circuit opens
        self.assertFalse(cb.can_probe())

        status, detail, code, retries = check_http_service(
            "localhost", 8080, "/health", 5,
            circuit_breaker=cb,
        )
        self.assertEqual(status, "CRITICAL")
        self.assertIn("Circuit breaker OPEN", detail)
        mock_conn.assert_not_called()
        self.assertEqual(retries, 0)

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 368-368: Unpacked variable code is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

[warning] 368-368: Unpacked variable retries is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/test_health_check.py` around lines 359 - 374, The test_circuit_breaker_skips_probe test verifies the response when the circuit breaker is open, but does not assert that the actual probe was skipped and no HTTP connection was attempted. Add mock_conn.assert_not_called() after the existing assertions to enforce that the mocked HTTP connection was never called, thus verifying the core contract that an open circuit breaker skips the probe entirely. Optionally, also add assertions to verify that retries and code have their expected default values.
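A runnable sketch of the same assertion pattern, independent of the PR's modules. The `check` function is a hypothetical probe invented for this example, and patching `http.client.HTTPConnection` is an assumption about how the connection is made; the PR's actual patch target is not shown in this thread. The unused status code is bound to `_code`, in line with the Ruff RUF059 hint.

```python
import http.client
import unittest
from unittest import mock


def check(host, port, breaker_open):
    """Hypothetical probe: returns early when the breaker is open."""
    if breaker_open:
        return "CRITICAL", "Circuit breaker OPEN", None, 0
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request("GET", "/health")
    return "OK", "", conn.getresponse().status, 0


class SkipProbeTest(unittest.TestCase):
    @mock.patch("http.client.HTTPConnection")
    def test_open_breaker_skips_probe(self, mock_conn):
        status, detail, _code, retries = check("localhost", 8080, breaker_open=True)
        self.assertEqual(status, "CRITICAL")
        self.assertIn("Circuit breaker OPEN", detail)
        # The core contract: no connection object was ever constructed.
        mock_conn.assert_not_called()
        self.assertEqual(retries, 0)
```

Without `assert_not_called()`, the test would still pass if the function opened a connection and then returned the same tuple, so the skip behavior itself would go unverified.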

foxyManTou added 2 commits June 23, 2026 07:34

Add build diagnostics for 94e0fb0

7d6a441

foxyManTou mentioned this pull request Jun 22, 2026

[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes #15

Open

coderabbitai Bot reviewed Jun 22, 2026

foxyManTou wants to merge 2 commits into thanhle74:main from foxyManTou:feature/health-check-retry-circuit-breaker
