Skip to content

[$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes#34

Open
foxyManTou wants to merge 2 commits into
thanhle74:mainfrom
foxyManTou:feature/health-check-retry-circuit-breaker
Open

[$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes#34
foxyManTou wants to merge 2 commits into
thanhle74:mainfrom
foxyManTou:feature/health-check-retry-circuit-breaker

Conversation

@foxyManTou

@foxyManTou foxyManTou commented Jun 22, 2026

Copy link
Copy Markdown

Summary

Add configurable retry/backoff and circuit breaker patterns to the HTTP health check probes in tools/health_check.py, plus result aggregation and proper logging.

Changes

  • RetryPolicy class — configurable max retries, exponential backoff (delay = base_delay * (backoff_factor ^ attempt)), optional jitter (±25%), retry on 5xx/timeout only
  • CircuitBreakerState/CircuitBreakerRegistry — CLOSED → OPEN after N consecutive failures, HALF_OPEN after cooldown, success/failure transitions, per-endpoint tracking
  • HealthCheckAggregator — result summary stats (total/passed/warnings/critical/pass_rate), degraded service listing, WARNING-level logging for degraded services
  • New CLI flags: --max-retries, --backoff-factor, --base-delay, --circuit-threshold, --circuit-cooldown
  • Updated output — aggregation summary and retry/circuit-breaker state in report
  • 33 unit tests covering retry policy, exponential backoff, jitter, circuit breaker state machine, registry, aggregator, and integration tests (all pass)

Testing

$ python3 -m unittest tools.test_health_check -v
Ran 33 tests — OK

$ python3 build.py
Build completed — frontend PASS, diagnostic artifacts generated

Checklist

  • Relevant modules affected by these changes build locally
  • Tests pass locally
  • Diagnostic build log is committed in this PR
  • Documentation has been updated, if applicable
  • Configuration or schema changes are documented, if applicable
  • No generated build artifacts are committed, except the required diagnostic build log
  • Changes are scoped to the PR purpose and avoid unrelated cleanup
  • Security, privacy, and error-handling implications have been considered

Diagnostic artifacts:

  • diagnostic/build-94e0fb08.logd (encrypted)
  • diagnostic/build-94e0fb08.json

Summary by CodeRabbit

Release Notes

  • New Features

    • Added retry mechanism with exponential backoff and jitter for HTTP health checks
    • Implemented circuit breaker pattern for endpoint resilience
    • Enhanced health check reporting with aggregated metrics and degraded service details
    • Added CLI options for configuring retry and circuit breaker behavior
  • Tests

    • Added comprehensive test coverage for retry, backoff, and circuit breaker functionality

foxyManTou added 2 commits June 23, 2026 07:34
- Add RetryPolicy class with exponential backoff: delay = base_delay * (backoff_factor ^ attempt)
- Add CircuitBreakerState/CircuitBreakerRegistry for circuit breaker pattern
- Add HealthCheckAggregator for result summary stats
- Add proper logging with WARNING level for degraded services
- Add --max-retries, --backoff-factor, --base-delay, --circuit-threshold, --circuit-cooldown flags
- Add comprehensive unit tests (33 tests covering retry, backoff, circuit breaker)
- Updated run_health_checks and main() to wire all new features
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds exponential-retry with jitter and a per-endpoint circuit breaker to tools/health_check.py via new RetryPolicy, CircuitBreakerState, CircuitBreakerRegistry, and HealthCheckAggregator classes. check_http_service and run_health_checks are updated to use them, the CLI gains five new flags, and tools/test_health_check.py covers all new behavior. A build diagnostic JSON file is also added.

Changes

Health Check Resilience Enhancement

Layer / File(s) Summary
RetryPolicy, CircuitBreaker, and Aggregator contracts
tools/health_check.py
Adds module defaults, logger, CircuitBreakerState/CircuitBreakerRegistry with CLOSED/OPEN/HALF_OPEN transitions, RetryPolicy with exponential backoff and ±25% jitter, and HealthCheckAggregator with summary computation and degraded-service logging.
check_http_service with retry and circuit breaker
tools/health_check.py
Rewrites the HTTP probe to skip when the breaker is OPEN, map status codes to OK/WARNING/CRITICAL, record failures into the breaker, retry on 5xx/timeout with backoff, and return retries-used as a fourth return value.
run_health_checks aggregation and wiring
tools/health_check.py
Extends run_health_checks with retry_policy and circuit_registry parameters, wires per-endpoint circuit breakers into check_http_service, records all check outcomes (services, infra latency, disk/memory/load, TLS cert days) into the aggregator, and writes aggregation and overall status into the result dict.
Report output, CLI flags, and main() wiring
tools/health_check.py
Adds aggregated totals and retry/circuit-breaker configuration to print_health_report, appends retry counts to per-check detail lines, introduces five new CLI flags (--max-retries, --backoff-factor, --base-delay, --circuit-threshold, --circuit-cooldown), and passes constructed policy/registry into run_health_checks.
Unit and integration tests
tools/test_health_check.py
Covers RetryPolicy defaults, backoff math, jitter bounds, and eligibility rules; CircuitBreakerState transitions; CircuitBreakerRegistry isolation and reset; HealthCheckAggregator counting and status derivation; and check_http_service retry/OPEN-skip/exhaustion behavior via mocked HTTPConnection.

Diagnostic Build Metadata

Layer / File(s) Summary
Build diagnostic metadata file
diagnostic/build-94e0fb08.json
Creates the JSON file with generation timestamp, commit reference, encrypted artifact path and decryption credentials, per-module status/elapsed-time/output records (1 passed, 9 failed), and a PR note.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI / main()
  participant RHC as run_health_checks
  participant CBR as CircuitBreakerRegistry
  participant CHS as check_http_service
  participant CBS as CircuitBreakerState
  participant AGG as HealthCheckAggregator

  CLI->>RHC: retry_policy, circuit_registry
  RHC->>CBR: get(endpoint)
  CBR-->>RHC: CircuitBreakerState
  RHC->>CHS: host, port, path, timeout, retry_policy, circuit_breaker
  CHS->>CBS: is_open()?
  alt OPEN
    CBS-->>CHS: skip
    CHS-->>RHC: CRITICAL, retries=0
  else CLOSED / HALF_OPEN
    loop up to max_retries
      CHS->>CBS: record_success() or record_failure()
    end
    CHS-->>RHC: status, detail, http_code, retries_used
  end
  RHC->>AGG: add_result(name, status, metric, retries)
  AGG-->>RHC: summary (totals, pass_rate, overall_status)
  RHC-->>CLI: results dict with aggregation
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Poem

🐇 Hoppity-hop through the network maze,
With retries and jitter to lighten the haze.
If the circuit breaks open, I wait for the cool,
Then probe once again — no rabbit's a fool!
Aggregated results, all tidy and bright,
The health check now passes the test with delight! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding retry/backoff and circuit breaker patterns to health_check HTTP probes.
Description check ✅ Passed The PR description includes all required sections (Summary, Changes, Testing, Checklist) with comprehensive details about the implementation, testing approach, and verification of checklist items.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
tools/test_health_check.py (3)

141-141: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Resolve Ruff warnings for unused loop/unpacked variables

Ruff hints here are valid; rename unused loop vars to _ and unused unpacked values to _detail/_code/_retries to keep lint clean.

Also applies to: 161-161, 174-174, 187-187, 200-200, 350-350, 368-368, 383-383

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` at line 141, The loop variable `i` in the for
loop (and similarly unused variables at lines 161, 174, 187, 200, 350, 368, and
383) are triggering Ruff warnings because they are not used within their
respective loops or unpacking assignments. Replace unused loop variables with a
single underscore `_`, and replace unused unpacked values with descriptive
underscore-prefixed names such as `_detail`, `_code`, or `_retries` depending on
what value is being unpacked. This will satisfy Ruff's linting requirements
while keeping the code intent clear.

Source: Linters/SAST tools


158-169: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Cooldown transition tests are timing-flaky due to real sleeps

Using time.sleep(0.15) in unit tests makes CI timing-sensitive and slower. Prefer mocking tools.health_check.time.time to deterministically drive OPEN → HALF_OPEN transitions.

Deterministic pattern (example)
     def test_cooldown_transitions_to_half_open(self):
         """Test that after cooldown, open circuit moves to HALF_OPEN."""
         cb = CircuitBreakerState(threshold=3, cooldown=0.1)
-        for i in range(3):
+        for _ in range(3):
             cb.record_failure()
         self.assertEqual(cb.state, cb.OPEN)
-        self.assertFalse(cb.can_probe())
-
-        # Sleep past cooldown
-        time.sleep(0.15)
-        self.assertTrue(cb.can_probe())
-        self.assertEqual(cb.state, cb.HALF_OPEN)
+        with mock.patch("tools.health_check.time.time") as mocked_time:
+            mocked_time.return_value = cb.last_failure_time + 0.05
+            self.assertFalse(cb.can_probe())
+            mocked_time.return_value = cb.last_failure_time + 0.11
+            self.assertTrue(cb.can_probe())
+            self.assertEqual(cb.state, cb.HALF_OPEN)

Also applies to: 171-193

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` around lines 158 - 169, Replace the real
`time.sleep(0.15)` call in the test_cooldown_transitions_to_half_open method
with a mock of `tools.health_check.time.time` to deterministically advance time.
Mock `time.time` to return a timestamp that is greater than the cooldown
duration (0.1 seconds) from when the failures were recorded, allowing the
circuit breaker to transition from OPEN to HALF_OPEN without introducing real
delays. Apply the same mocking pattern to the other timing-dependent tests
mentioned in lines 171-193 that also use `time.sleep()`.

63-73: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

test_jitter_range does not actually verify the ±25% contract

On Line 71, the lower bound is relaxed to min_delay * 0.5, which allows values well outside the documented jitter window. This can let regressions pass.

Proposed tightening
     def test_jitter_range(self):
         """Test that jitter is within ±25%."""
         rp = RetryPolicy(backoff_factor=2.0, base_delay=1.0, jitter=True)
         for attempt in range(10):
             delay = rp.get_delay(attempt)
             expected_base = 1.0 * (2.0 ** attempt)
             min_delay = expected_base * 0.75
             max_delay = expected_base * 1.25
-            self.assertGreaterEqual(delay, min_delay * 0.5)  # with floor
-            self.assertLessEqual(delay, max_delay + 0.1)
+            self.assertGreaterEqual(delay, min_delay)
+            self.assertLessEqual(delay, max_delay)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` around lines 63 - 73, The test_jitter_range
method is not properly enforcing the documented ±25% jitter contract. The lower
bound assertion is checking against min_delay * 0.5, which is too permissive and
allows delays well outside the documented range, and the upper bound is checking
against max_delay + 0.1, which further relaxes the constraint. Fix this by
removing the * 0.5 multiplier from the lower bound check and removing the + 0.1
addition from the upper bound check in the assertions within the
test_jitter_range method, so the test properly verifies the delay falls within
the exact ±25% jitter window.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health_check.py`:
- Around line 536-537: The circuit_registry.to_dict() call at line 537 captures
the circuit breaker state before the probes execute and mutate the breaker
states via get() and record_failure() calls, resulting in a stale snapshot. Move
the assignment of circuit_registry.to_dict() to results["circuit_breakers"] to
after all health check probes complete execution, so the snapshot reflects the
final mutated state of the circuit breakers. Apply the same fix to the similar
circuit_registry serialization at the location noted in "Also applies to:
640-641".
- Around line 396-405: When a circuit breaker is in HALF_OPEN state, it should
not retry failed probes because the breaker is already attempting to recover and
retries would delay the decision to reopen. Before entering the retry logic
block (where retry_policy.should_retry is called), add a check to determine if
circuit_breaker.state equals HALF_OPEN. If the circuit breaker is in HALF_OPEN
state, record the failure and return immediately without entering the retry
block that calls should_retry, get_delay, and time.sleep. This ensures that any
failure during HALF_OPEN state is recorded promptly and the circuit breaker can
transition back to OPEN if needed, rather than being held in retry loops.
- Around line 364-409: The HTTP connection object conn is not being closed when
exceptions occur during request, getresponse, or read operations, causing socket
leaks on retry attempts. Move the conn.close() call into a finally block that
wraps the entire HTTP communication section (from conn.request through
resp.read) to ensure the connection is always closed regardless of whether an
exception is raised. This way, even if an exception is caught in the except
block, the connection will be properly closed before the retry logic executes.
- Around line 373-390: The code currently returns immediately when encountering
5xx responses (status_code >= 500) without checking if the RetryPolicy allows
retrying the request. To fix this, before recording the circuit breaker failure
and returning on 5xx status codes, first check if the retry policy indicates a
retry should be attempted using the should_retry method. Only record the failure
and return the CRITICAL result if the retry policy determines the request should
not be retried or if retry attempts have been exhausted. Otherwise, allow the
retry logic to proceed. This ensures the advertised retry contract for 5xx
probes is actually honored.

In `@tools/test_health_check.py`:
- Around line 359-374: The test_circuit_breaker_skips_probe test verifies the
response when the circuit breaker is open, but does not assert that the actual
probe was skipped and no HTTP connection was attempted. Add
mock_conn.assert_not_called() after the existing assertions to enforce that the
mocked HTTP connection was never called, thus verifying the core contract that
an open circuit breaker skips the probe entirely. Optionally, also add
assertions to verify that retries and code have their expected default values.

---

Nitpick comments:
In `@tools/test_health_check.py`:
- Line 141: The loop variable `i` in the for loop (and similarly unused
variables at lines 161, 174, 187, 200, 350, 368, and 383) are triggering Ruff
warnings because they are not used within their respective loops or unpacking
assignments. Replace unused loop variables with a single underscore `_`, and
replace unused unpacked values with descriptive underscore-prefixed names such
as `_detail`, `_code`, or `_retries` depending on what value is being unpacked.
This will satisfy Ruff's linting requirements while keeping the code intent
clear.
- Around line 158-169: Replace the real `time.sleep(0.15)` call in the
test_cooldown_transitions_to_half_open method with a mock of
`tools.health_check.time.time` to deterministically advance time. Mock
`time.time` to return a timestamp that is greater than the cooldown duration
(0.1 seconds) from when the failures were recorded, allowing the circuit breaker
to transition from OPEN to HALF_OPEN without introducing real delays. Apply the
same mocking pattern to the other timing-dependent tests mentioned in lines
171-193 that also use `time.sleep()`.
- Around line 63-73: The test_jitter_range method is not properly enforcing the
documented ±25% jitter contract. The lower bound assertion is checking against
min_delay * 0.5, which is too permissive and allows delays well outside the
documented range, and the upper bound is checking against max_delay + 0.1, which
further relaxes the constraint. Fix this by removing the * 0.5 multiplier from
the lower bound check and removing the + 0.1 addition from the upper bound check
in the assertions within the test_jitter_range method, so the test properly
verifies the delay falls within the exact ±25% jitter window.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90825d78-b194-4c1d-8366-b4145461c177

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 02db8b7.

📒 Files selected for processing (4)
  • diagnostic/build-94e0fb08.json
  • diagnostic/build-94e0fb08.logd
  • tools/health_check.py
  • tools/test_health_check.py

Comment thread tools/health_check.py
Comment on lines +364 to +409
for attempt in range(max_attempts):
try:
conn = http.client.HTTPConnection(host, port, timeout=timeout)
conn.request("GET", path)
resp = conn.getresponse()
status_code = resp.status
body = resp.read().decode("utf-8", errors="replace")[:200]
conn.close()

if status_code == 200:
result = "OK"
detail = f"HTTP {status_code}"
elif status_code < 500:
result = "WARNING"
detail = f"HTTP {status_code}: {body[:100]}"
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"

# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:
circuit_breaker.record_failure()

return result, detail, status_code, attempt

except Exception as e:
status_code = 0
detail = str(e)

# Check if we should retry
if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
delay = retry_policy.get_delay(attempt)
logger.info(
"Attempt %d/%d failed for %s. Retrying in %.2fs...",
attempt + 1, retry_policy.max_retries + 1,
endpoint_key, delay,
)
time.sleep(delay)
continue
else:
if circuit_breaker is not None:
circuit_breaker.record_failure()
return "CRITICAL", detail, status_code, attempt

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Close the HTTP connection on failed attempts.

If request(), getresponse(), or read() raises, Line 371 is skipped and retry/watch mode can leak sockets across repeated failures. Close the connection in a finally block for every attempt.

🛡️ Proposed fix
     for attempt in range(max_attempts):
+        conn = None
         try:
             conn = http.client.HTTPConnection(host, port, timeout=timeout)
             conn.request("GET", path)
             resp = conn.getresponse()
             status_code = resp.status
             body = resp.read().decode("utf-8", errors="replace")[:200]
             conn.close()
+            conn = None
 
             if status_code == 200:
                 result = "OK"
@@
                 if circuit_breaker is not None:
                     circuit_breaker.record_failure()
                 return "CRITICAL", detail, status_code, attempt
+        finally:
+            if conn is not None:
+                conn.close()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for attempt in range(max_attempts):
try:
conn = http.client.HTTPConnection(host, port, timeout=timeout)
conn.request("GET", path)
resp = conn.getresponse()
status_code = resp.status
body = resp.read().decode("utf-8", errors="replace")[:200]
conn.close()
if status_code == 200:
result = "OK"
detail = f"HTTP {status_code}"
elif status_code < 500:
result = "WARNING"
detail = f"HTTP {status_code}: {body[:100]}"
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"
# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:
circuit_breaker.record_failure()
return result, detail, status_code, attempt
except Exception as e:
status_code = 0
detail = str(e)
# Check if we should retry
if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
delay = retry_policy.get_delay(attempt)
logger.info(
"Attempt %d/%d failed for %s. Retrying in %.2fs...",
attempt + 1, retry_policy.max_retries + 1,
endpoint_key, delay,
)
time.sleep(delay)
continue
else:
if circuit_breaker is not None:
circuit_breaker.record_failure()
return "CRITICAL", detail, status_code, attempt
for attempt in range(max_attempts):
conn = None
try:
conn = http.client.HTTPConnection(host, port, timeout=timeout)
conn.request("GET", path)
resp = conn.getresponse()
status_code = resp.status
body = resp.read().decode("utf-8", errors="replace")[:200]
conn.close()
conn = None
if status_code == 200:
result = "OK"
detail = f"HTTP {status_code}"
elif status_code < 500:
result = "WARNING"
detail = f"HTTP {status_code}: {body[:100]}"
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"
# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:
circuit_breaker.record_failure()
return result, detail, status_code, attempt
except Exception as e:
status_code = 0
detail = str(e)
# Check if we should retry
if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
delay = retry_policy.get_delay(attempt)
logger.info(
"Attempt %d/%d failed for %s. Retrying in %.2fs...",
attempt + 1, retry_policy.max_retries + 1,
endpoint_key, delay,
)
time.sleep(delay)
continue
else:
if circuit_breaker is not None:
circuit_breaker.record_failure()
return "CRITICAL", detail, status_code, attempt
finally:
if conn is not None:
conn.close()
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 392-392: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 364 - 409, The HTTP connection object
conn is not being closed when exceptions occur during request, getresponse, or
read operations, causing socket leaks on retry attempts. Move the conn.close()
call into a finally block that wraps the entire HTTP communication section (from
conn.request through resp.read) to ensure the connection is always closed
regardless of whether an exception is raised. This way, even if an exception is
caught in the except block, the connection will be properly closed before the
retry logic executes.

Comment thread tools/health_check.py
Comment on lines +373 to +390
if status_code == 200:
result = "OK"
detail = f"HTTP {status_code}"
elif status_code < 500:
result = "WARNING"
detail = f"HTTP {status_code}: {body[:100]}"
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"

# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:
circuit_breaker.record_failure()

return result, detail, status_code, attempt

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Retry eligible HTTP 5xx responses before returning.

Line 380 marks 5xx responses CRITICAL, but Lines 383-390 immediately record the failure and return, so RetryPolicy.should_retry(..., 500+) is never used for actual HTTP responses. This breaks the advertised retry contract for 5xx probes.

🐛 Proposed fix
             else:
                 result = "CRITICAL"
                 detail = f"HTTP {status_code}: {body[:100]}"
 
+            if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
+                delay = retry_policy.get_delay(attempt)
+                logger.info(
+                    "Attempt %d/%d returned HTTP %d for %s. Retrying in %.2fs...",
+                    attempt + 1, retry_policy.max_retries + 1,
+                    status_code, endpoint_key, delay,
+                )
+                time.sleep(delay)
+                continue
+
             # Record in circuit breaker
             if circuit_breaker is not None:
                 if result == "OK":
                     circuit_breaker.record_success()
                 else:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if status_code == 200:
result = "OK"
detail = f"HTTP {status_code}"
elif status_code < 500:
result = "WARNING"
detail = f"HTTP {status_code}: {body[:100]}"
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"
# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:
circuit_breaker.record_failure()
return result, detail, status_code, attempt
if status_code == 200:
result = "OK"
detail = f"HTTP {status_code}"
elif status_code < 500:
result = "WARNING"
detail = f"HTTP {status_code}: {body[:100]}"
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"
if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
delay = retry_policy.get_delay(attempt)
logger.info(
"Attempt %d/%d returned HTTP %d for %s. Retrying in %.2fs...",
attempt + 1, retry_policy.max_retries + 1,
status_code, endpoint_key, delay,
)
time.sleep(delay)
continue
# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:
circuit_breaker.record_failure()
return result, detail, status_code, attempt
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 373 - 390, The code currently returns
immediately when encountering 5xx responses (status_code >= 500) without
checking if the RetryPolicy allows retrying the request. To fix this, before
recording the circuit breaker failure and returning on 5xx status codes, first
check if the retry policy indicates a retry should be attempted using the
should_retry method. Only record the failure and return the CRITICAL result if
the retry policy determines the request should not be retried or if retry
attempts have been exhausted. Otherwise, allow the retry logic to proceed. This
ensures the advertised retry contract for 5xx probes is actually honored.

Comment thread tools/health_check.py
Comment on lines +396 to +405
# Check if we should retry
if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
delay = retry_policy.get_delay(attempt)
logger.info(
"Attempt %d/%d failed for %s. Retrying in %.2fs...",
attempt + 1, retry_policy.max_retries + 1,
endpoint_key, delay,
)
time.sleep(delay)
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not retry a failed HALF_OPEN probe.

When can_probe() moves the breaker to HALF_OPEN, a timeout/connection failure reaches Lines 397-405 and can retry multiple times before the breaker is reopened. Record the failure and return immediately when circuit_breaker.state == HALF_OPEN.

🔒 Proposed fix
             detail = str(e)
 
+            if circuit_breaker is not None and circuit_breaker.state == circuit_breaker.HALF_OPEN:
+                circuit_breaker.record_failure()
+                return "CRITICAL", detail, status_code, attempt
+
             # Check if we should retry
             if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
                 delay = retry_policy.get_delay(attempt)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 396 - 405, When a circuit breaker is in
HALF_OPEN state, it should not retry failed probes because the breaker is
already attempting to recover and retries would delay the decision to reopen.
Before entering the retry logic block (where retry_policy.should_retry is
called), add a check to determine if circuit_breaker.state equals HALF_OPEN. If
the circuit breaker is in HALF_OPEN state, record the failure and return
immediately without entering the retry block that calls should_retry, get_delay,
and time.sleep. This ensures that any failure during HALF_OPEN state is recorded
promptly and the circuit breaker can transition back to OPEN if needed, rather
than being held in retry loops.

Comment thread tools/health_check.py
Comment on lines +536 to +537
"retry_policy": retry_policy.to_dict() if retry_policy else None,
"circuit_breakers": circuit_registry.to_dict() if circuit_registry else {},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Snapshot circuit breakers after probes update them.

Line 537 serializes the registry before get()/record_failure() mutate breaker state, so results["circuit_breakers"] is stale and print_health_report() can miss newly OPEN breakers. Refresh the snapshot after all checks complete.

🐛 Proposed fix
         "overall_status": "OK",
         "aggregation": {},
         "retry_policy": retry_policy.to_dict() if retry_policy else None,
-        "circuit_breakers": circuit_registry.to_dict() if circuit_registry else {},
+        "circuit_breakers": {},
@@
     results["overall_status"] = "OK" if all_ok else "DEGRADED"
     results["aggregation"] = aggregator.summary()
+    if circuit_registry is not None:
+        results["circuit_breakers"] = circuit_registry.to_dict()
 
     # Log degraded services

Also applies to: 640-641

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 536 - 537, The circuit_registry.to_dict()
call at line 537 captures the circuit breaker state before the probes execute
and mutate the breaker states via get() and record_failure() calls, resulting in
a stale snapshot. Move the assignment of circuit_registry.to_dict() to
results["circuit_breakers"] to after all health check probes complete execution,
so the snapshot reflects the final mutated state of the circuit breakers. Apply
the same fix to the similar circuit_registry serialization at the location noted
in "Also applies to: 640-641".

Comment on lines +359 to +374
def test_circuit_breaker_skips_probe(self, mock_conn):
"""Test that open circuit breaker skips probe."""
from tools.health_check import CircuitBreakerState

cb = CircuitBreakerState(threshold=2, cooldown=30)
cb.record_failure()
cb.record_failure() # circuit opens
self.assertFalse(cb.can_probe())

status, detail, code, retries = check_http_service(
"localhost", 8080, "/health", 5,
circuit_breaker=cb,
)
self.assertEqual(status, "CRITICAL")
self.assertIn("Circuit breaker OPEN", detail)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

OPEN-circuit short-circuit behavior is not fully asserted

This test should also verify no outbound probe occurs. Add mock_conn.assert_not_called() (and optionally assert retries/code defaults) so the core skip contract is enforced.

Suggested assertion
         status, detail, code, retries = check_http_service(
             "localhost", 8080, "/health", 5,
             circuit_breaker=cb,
         )
         self.assertEqual(status, "CRITICAL")
         self.assertIn("Circuit breaker OPEN", detail)
+        mock_conn.assert_not_called()
+        self.assertEqual(retries, 0)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_circuit_breaker_skips_probe(self, mock_conn):
"""Test that open circuit breaker skips probe."""
from tools.health_check import CircuitBreakerState
cb = CircuitBreakerState(threshold=2, cooldown=30)
cb.record_failure()
cb.record_failure() # circuit opens
self.assertFalse(cb.can_probe())
status, detail, code, retries = check_http_service(
"localhost", 8080, "/health", 5,
circuit_breaker=cb,
)
self.assertEqual(status, "CRITICAL")
self.assertIn("Circuit breaker OPEN", detail)
def test_circuit_breaker_skips_probe(self, mock_conn):
"""Test that open circuit breaker skips probe."""
from tools.health_check import CircuitBreakerState
cb = CircuitBreakerState(threshold=2, cooldown=30)
cb.record_failure()
cb.record_failure() # circuit opens
self.assertFalse(cb.can_probe())
status, detail, code, retries = check_http_service(
"localhost", 8080, "/health", 5,
circuit_breaker=cb,
)
self.assertEqual(status, "CRITICAL")
self.assertIn("Circuit breaker OPEN", detail)
mock_conn.assert_not_called()
self.assertEqual(retries, 0)
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 368-368: Unpacked variable code is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 368-368: Unpacked variable retries is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check.py` around lines 359 - 374, The
test_circuit_breaker_skips_probe test verifies the response when the circuit
breaker is open, but does not assert that the actual probe was skipped and no
HTTP connection was attempted. Add mock_conn.assert_not_called() after the
existing assertions to enforce that the mocked HTTP connection was never called,
thus verifying the core contract that an open circuit breaker skips the probe
entirely. Optionally, also add assertions to verify that retries and code have
their expected default values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant