[$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes#34
Conversation
- Add RetryPolicy class with exponential backoff: delay = base_delay * (backoff_factor ^ attempt) - Add CircuitBreakerState/CircuitBreakerRegistry for circuit breaker pattern - Add HealthCheckAggregator for result summary stats - Add proper logging with WARNING level for degraded services - Add --max-retries, --backoff-factor, --base-delay, --circuit-threshold, --circuit-cooldown flags - Add comprehensive unit tests (33 tests covering retry, backoff, circuit breaker) - Updated run_health_checks and main() to wire all new features
📝 WalkthroughWalkthroughAdds exponential-retry with jitter and a per-endpoint circuit breaker to ChangesHealth Check Resilience Enhancement
Diagnostic Build Metadata
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI / main()
participant RHC as run_health_checks
participant CBR as CircuitBreakerRegistry
participant CHS as check_http_service
participant CBS as CircuitBreakerState
participant AGG as HealthCheckAggregator
CLI->>RHC: retry_policy, circuit_registry
RHC->>CBR: get(endpoint)
CBR-->>RHC: CircuitBreakerState
RHC->>CHS: host, port, path, timeout, retry_policy, circuit_breaker
CHS->>CBS: is_open()?
alt OPEN
CBS-->>CHS: skip
CHS-->>RHC: CRITICAL, retries=0
else CLOSED / HALF_OPEN
loop up to max_retries
CHS->>CBS: record_success() or record_failure()
end
CHS-->>RHC: status, detail, http_code, retries_used
end
RHC->>AGG: add_result(name, status, metric, retries)
AGG-->>RHC: summary (totals, pass_rate, overall_status)
RHC-->>CLI: results dict with aggregation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
tools/test_health_check.py (3)
141-141: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winResolve Ruff warnings for unused loop/unpacked variables
Ruff hints here are valid; rename unused loop vars to
_and unused unpacked values to_detail/_code/_retriesto keep lint clean.Also applies to: 161-161, 174-174, 187-187, 200-200, 350-350, 368-368, 383-383
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/test_health_check.py` at line 141, The loop variable `i` in the for loop (and similarly unused variables at lines 161, 174, 187, 200, 350, 368, and 383) are triggering Ruff warnings because they are not used within their respective loops or unpacking assignments. Replace unused loop variables with a single underscore `_`, and replace unused unpacked values with descriptive underscore-prefixed names such as `_detail`, `_code`, or `_retries` depending on what value is being unpacked. This will satisfy Ruff's linting requirements while keeping the code intent clear.Source: Linters/SAST tools
158-169: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winCooldown transition tests are timing-flaky due to real sleeps
Using
time.sleep(0.15)in unit tests makes CI timing-sensitive and slower. Prefer mockingtools.health_check.time.timeto deterministically drive OPEN → HALF_OPEN transitions.Deterministic pattern (example)
def test_cooldown_transitions_to_half_open(self): """Test that after cooldown, open circuit moves to HALF_OPEN.""" cb = CircuitBreakerState(threshold=3, cooldown=0.1) - for i in range(3): + for _ in range(3): cb.record_failure() self.assertEqual(cb.state, cb.OPEN) - self.assertFalse(cb.can_probe()) - - # Sleep past cooldown - time.sleep(0.15) - self.assertTrue(cb.can_probe()) - self.assertEqual(cb.state, cb.HALF_OPEN) + with mock.patch("tools.health_check.time.time") as mocked_time: + mocked_time.return_value = cb.last_failure_time + 0.05 + self.assertFalse(cb.can_probe()) + mocked_time.return_value = cb.last_failure_time + 0.11 + self.assertTrue(cb.can_probe()) + self.assertEqual(cb.state, cb.HALF_OPEN)Also applies to: 171-193
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/test_health_check.py` around lines 158 - 169, Replace the real `time.sleep(0.15)` call in the test_cooldown_transitions_to_half_open method with a mock of `tools.health_check.time.time` to deterministically advance time. Mock `time.time` to return a timestamp that is greater than the cooldown duration (0.1 seconds) from when the failures were recorded, allowing the circuit breaker to transition from OPEN to HALF_OPEN without introducing real delays. Apply the same mocking pattern to the other timing-dependent tests mentioned in lines 171-193 that also use `time.sleep()`.
63-73: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win
test_jitter_rangedoes not actually verify the ±25% contractOn Line 71, the lower bound is relaxed to
min_delay * 0.5, which allows values well outside the documented jitter window. This can let regressions pass.Proposed tightening
def test_jitter_range(self): """Test that jitter is within ±25%.""" rp = RetryPolicy(backoff_factor=2.0, base_delay=1.0, jitter=True) for attempt in range(10): delay = rp.get_delay(attempt) expected_base = 1.0 * (2.0 ** attempt) min_delay = expected_base * 0.75 max_delay = expected_base * 1.25 - self.assertGreaterEqual(delay, min_delay * 0.5) # with floor - self.assertLessEqual(delay, max_delay + 0.1) + self.assertGreaterEqual(delay, min_delay) + self.assertLessEqual(delay, max_delay)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/test_health_check.py` around lines 63 - 73, The test_jitter_range method is not properly enforcing the documented ±25% jitter contract. The lower bound assertion is checking against min_delay * 0.5, which is too permissive and allows delays well outside the documented range, and the upper bound is checking against max_delay + 0.1, which further relaxes the constraint. Fix this by removing the * 0.5 multiplier from the lower bound check and removing the + 0.1 addition from the upper bound check in the assertions within the test_jitter_range method, so the test properly verifies the delay falls within the exact ±25% jitter window.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/health_check.py`:
- Around line 536-537: The circuit_registry.to_dict() call at line 537 captures
the circuit breaker state before the probes execute and mutate the breaker
states via get() and record_failure() calls, resulting in a stale snapshot. Move
the assignment of circuit_registry.to_dict() to results["circuit_breakers"] to
after all health check probes complete execution, so the snapshot reflects the
final mutated state of the circuit breakers. Apply the same fix to the similar
circuit_registry serialization at the location noted in "Also applies to:
640-641".
- Around line 396-405: When a circuit breaker is in HALF_OPEN state, it should
not retry failed probes because the breaker is already attempting to recover and
retries would delay the decision to reopen. Before entering the retry logic
block (where retry_policy.should_retry is called), add a check to determine if
circuit_breaker.state equals HALF_OPEN. If the circuit breaker is in HALF_OPEN
state, record the failure and return immediately without entering the retry
block that calls should_retry, get_delay, and time.sleep. This ensures that any
failure during HALF_OPEN state is recorded promptly and the circuit breaker can
transition back to OPEN if needed, rather than being held in retry loops.
- Around line 364-409: The HTTP connection object conn is not being closed when
exceptions occur during request, getresponse, or read operations, causing socket
leaks on retry attempts. Move the conn.close() call into a finally block that
wraps the entire HTTP communication section (from conn.request through
resp.read) to ensure the connection is always closed regardless of whether an
exception is raised. This way, even if an exception is caught in the except
block, the connection will be properly closed before the retry logic executes.
- Around line 373-390: The code currently returns immediately when encountering
5xx responses (status_code >= 500) without checking if the RetryPolicy allows
retrying the request. To fix this, before recording the circuit breaker failure
and returning on 5xx status codes, first check if the retry policy indicates a
retry should be attempted using the should_retry method. Only record the failure
and return the CRITICAL result if the retry policy determines the request should
not be retried or if retry attempts have been exhausted. Otherwise, allow the
retry logic to proceed. This ensures the advertised retry contract for 5xx
probes is actually honored.
In `@tools/test_health_check.py`:
- Around line 359-374: The test_circuit_breaker_skips_probe test verifies the
response when the circuit breaker is open, but does not assert that the actual
probe was skipped and no HTTP connection was attempted. Add
mock_conn.assert_not_called() after the existing assertions to enforce that the
mocked HTTP connection was never called, thus verifying the core contract that
an open circuit breaker skips the probe entirely. Optionally, also add
assertions to verify that retries and code have their expected default values.
---
Nitpick comments:
In `@tools/test_health_check.py`:
- Line 141: The loop variable `i` in the for loop (and similarly unused
variables at lines 161, 174, 187, 200, 350, 368, and 383) are triggering Ruff
warnings because they are not used within their respective loops or unpacking
assignments. Replace unused loop variables with a single underscore `_`, and
replace unused unpacked values with descriptive underscore-prefixed names such
as `_detail`, `_code`, or `_retries` depending on what value is being unpacked.
This will satisfy Ruff's linting requirements while keeping the code intent
clear.
- Around line 158-169: Replace the real `time.sleep(0.15)` call in the
test_cooldown_transitions_to_half_open method with a mock of
`tools.health_check.time.time` to deterministically advance time. Mock
`time.time` to return a timestamp that is greater than the cooldown duration
(0.1 seconds) from when the failures were recorded, allowing the circuit breaker
to transition from OPEN to HALF_OPEN without introducing real delays. Apply the
same mocking pattern to the other timing-dependent tests mentioned in lines
171-193 that also use `time.sleep()`.
- Around line 63-73: The test_jitter_range method is not properly enforcing the
documented ±25% jitter contract. The lower bound assertion is checking against
min_delay * 0.5, which is too permissive and allows delays well outside the
documented range, and the upper bound is checking against max_delay + 0.1, which
further relaxes the constraint. Fix this by removing the * 0.5 multiplier from
the lower bound check and removing the + 0.1 addition from the upper bound check
in the assertions within the test_jitter_range method, so the test properly
verifies the delay falls within the exact ±25% jitter window.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 90825d78-b194-4c1d-8366-b4145461c177
📒 Files selected for processing (4)
diagnostic/build-94e0fb08.jsondiagnostic/build-94e0fb08.logdtools/health_check.pytools/test_health_check.py
| for attempt in range(max_attempts): | ||
| try: | ||
| conn = http.client.HTTPConnection(host, port, timeout=timeout) | ||
| conn.request("GET", path) | ||
| resp = conn.getresponse() | ||
| status_code = resp.status | ||
| body = resp.read().decode("utf-8", errors="replace")[:200] | ||
| conn.close() | ||
|
|
||
| if status_code == 200: | ||
| result = "OK" | ||
| detail = f"HTTP {status_code}" | ||
| elif status_code < 500: | ||
| result = "WARNING" | ||
| detail = f"HTTP {status_code}: {body[:100]}" | ||
| else: | ||
| result = "CRITICAL" | ||
| detail = f"HTTP {status_code}: {body[:100]}" | ||
|
|
||
| # Record in circuit breaker | ||
| if circuit_breaker is not None: | ||
| if result == "OK": | ||
| circuit_breaker.record_success() | ||
| else: | ||
| circuit_breaker.record_failure() | ||
|
|
||
| return result, detail, status_code, attempt | ||
|
|
||
| except Exception as e: | ||
| status_code = 0 | ||
| detail = str(e) | ||
|
|
||
| # Check if we should retry | ||
| if retry_policy is not None and retry_policy.should_retry(attempt, status_code): | ||
| delay = retry_policy.get_delay(attempt) | ||
| logger.info( | ||
| "Attempt %d/%d failed for %s. Retrying in %.2fs...", | ||
| attempt + 1, retry_policy.max_retries + 1, | ||
| endpoint_key, delay, | ||
| ) | ||
| time.sleep(delay) | ||
| continue | ||
| else: | ||
| if circuit_breaker is not None: | ||
| circuit_breaker.record_failure() | ||
| return "CRITICAL", detail, status_code, attempt |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Close the HTTP connection on failed attempts.
If request(), getresponse(), or read() raises, Line 371 is skipped and retry/watch mode can leak sockets across repeated failures. Close the connection in a finally block for every attempt.
🛡️ Proposed fix
for attempt in range(max_attempts):
+ conn = None
try:
conn = http.client.HTTPConnection(host, port, timeout=timeout)
conn.request("GET", path)
resp = conn.getresponse()
status_code = resp.status
body = resp.read().decode("utf-8", errors="replace")[:200]
conn.close()
+ conn = None
if status_code == 200:
result = "OK"
@@
if circuit_breaker is not None:
circuit_breaker.record_failure()
return "CRITICAL", detail, status_code, attempt
+ finally:
+ if conn is not None:
+ conn.close()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for attempt in range(max_attempts): | |
| try: | |
| conn = http.client.HTTPConnection(host, port, timeout=timeout) | |
| conn.request("GET", path) | |
| resp = conn.getresponse() | |
| status_code = resp.status | |
| body = resp.read().decode("utf-8", errors="replace")[:200] | |
| conn.close() | |
| if status_code == 200: | |
| result = "OK" | |
| detail = f"HTTP {status_code}" | |
| elif status_code < 500: | |
| result = "WARNING" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| else: | |
| result = "CRITICAL" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| # Record in circuit breaker | |
| if circuit_breaker is not None: | |
| if result == "OK": | |
| circuit_breaker.record_success() | |
| else: | |
| circuit_breaker.record_failure() | |
| return result, detail, status_code, attempt | |
| except Exception as e: | |
| status_code = 0 | |
| detail = str(e) | |
| # Check if we should retry | |
| if retry_policy is not None and retry_policy.should_retry(attempt, status_code): | |
| delay = retry_policy.get_delay(attempt) | |
| logger.info( | |
| "Attempt %d/%d failed for %s. Retrying in %.2fs...", | |
| attempt + 1, retry_policy.max_retries + 1, | |
| endpoint_key, delay, | |
| ) | |
| time.sleep(delay) | |
| continue | |
| else: | |
| if circuit_breaker is not None: | |
| circuit_breaker.record_failure() | |
| return "CRITICAL", detail, status_code, attempt | |
| for attempt in range(max_attempts): | |
| conn = None | |
| try: | |
| conn = http.client.HTTPConnection(host, port, timeout=timeout) | |
| conn.request("GET", path) | |
| resp = conn.getresponse() | |
| status_code = resp.status | |
| body = resp.read().decode("utf-8", errors="replace")[:200] | |
| conn.close() | |
| conn = None | |
| if status_code == 200: | |
| result = "OK" | |
| detail = f"HTTP {status_code}" | |
| elif status_code < 500: | |
| result = "WARNING" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| else: | |
| result = "CRITICAL" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| # Record in circuit breaker | |
| if circuit_breaker is not None: | |
| if result == "OK": | |
| circuit_breaker.record_success() | |
| else: | |
| circuit_breaker.record_failure() | |
| return result, detail, status_code, attempt | |
| except Exception as e: | |
| status_code = 0 | |
| detail = str(e) | |
| # Check if we should retry | |
| if retry_policy is not None and retry_policy.should_retry(attempt, status_code): | |
| delay = retry_policy.get_delay(attempt) | |
| logger.info( | |
| "Attempt %d/%d failed for %s. Retrying in %.2fs...", | |
| attempt + 1, retry_policy.max_retries + 1, | |
| endpoint_key, delay, | |
| ) | |
| time.sleep(delay) | |
| continue | |
| else: | |
| if circuit_breaker is not None: | |
| circuit_breaker.record_failure() | |
| return "CRITICAL", detail, status_code, attempt | |
| finally: | |
| if conn is not None: | |
| conn.close() |
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 392-392: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` around lines 364 - 409, The HTTP connection object
conn is not being closed when exceptions occur during request, getresponse, or
read operations, causing socket leaks on retry attempts. Move the conn.close()
call into a finally block that wraps the entire HTTP communication section (from
conn.request through resp.read) to ensure the connection is always closed
regardless of whether an exception is raised. This way, even if an exception is
caught in the except block, the connection will be properly closed before the
retry logic executes.
| if status_code == 200: | ||
| result = "OK" | ||
| detail = f"HTTP {status_code}" | ||
| elif status_code < 500: | ||
| result = "WARNING" | ||
| detail = f"HTTP {status_code}: {body[:100]}" | ||
| else: | ||
| result = "CRITICAL" | ||
| detail = f"HTTP {status_code}: {body[:100]}" | ||
|
|
||
| # Record in circuit breaker | ||
| if circuit_breaker is not None: | ||
| if result == "OK": | ||
| circuit_breaker.record_success() | ||
| else: | ||
| circuit_breaker.record_failure() | ||
|
|
||
| return result, detail, status_code, attempt |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Retry eligible HTTP 5xx responses before returning.
Line 380 marks 5xx responses CRITICAL, but Lines 383-390 immediately record the failure and return, so RetryPolicy.should_retry(..., 500+) is never used for actual HTTP responses. This breaks the advertised retry contract for 5xx probes.
🐛 Proposed fix
else:
result = "CRITICAL"
detail = f"HTTP {status_code}: {body[:100]}"
+ if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
+ delay = retry_policy.get_delay(attempt)
+ logger.info(
+ "Attempt %d/%d returned HTTP %d for %s. Retrying in %.2fs...",
+ attempt + 1, retry_policy.max_retries + 1,
+ status_code, endpoint_key, delay,
+ )
+ time.sleep(delay)
+ continue
+
# Record in circuit breaker
if circuit_breaker is not None:
if result == "OK":
circuit_breaker.record_success()
else:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if status_code == 200: | |
| result = "OK" | |
| detail = f"HTTP {status_code}" | |
| elif status_code < 500: | |
| result = "WARNING" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| else: | |
| result = "CRITICAL" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| # Record in circuit breaker | |
| if circuit_breaker is not None: | |
| if result == "OK": | |
| circuit_breaker.record_success() | |
| else: | |
| circuit_breaker.record_failure() | |
| return result, detail, status_code, attempt | |
| if status_code == 200: | |
| result = "OK" | |
| detail = f"HTTP {status_code}" | |
| elif status_code < 500: | |
| result = "WARNING" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| else: | |
| result = "CRITICAL" | |
| detail = f"HTTP {status_code}: {body[:100]}" | |
| if retry_policy is not None and retry_policy.should_retry(attempt, status_code): | |
| delay = retry_policy.get_delay(attempt) | |
| logger.info( | |
| "Attempt %d/%d returned HTTP %d for %s. Retrying in %.2fs...", | |
| attempt + 1, retry_policy.max_retries + 1, | |
| status_code, endpoint_key, delay, | |
| ) | |
| time.sleep(delay) | |
| continue | |
| # Record in circuit breaker | |
| if circuit_breaker is not None: | |
| if result == "OK": | |
| circuit_breaker.record_success() | |
| else: | |
| circuit_breaker.record_failure() | |
| return result, detail, status_code, attempt |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` around lines 373 - 390, The code currently returns
immediately when encountering 5xx responses (status_code >= 500) without
checking if the RetryPolicy allows retrying the request. To fix this, before
recording the circuit breaker failure and returning on 5xx status codes, first
check if the retry policy indicates a retry should be attempted using the
should_retry method. Only record the failure and return the CRITICAL result if
the retry policy determines the request should not be retried or if retry
attempts have been exhausted. Otherwise, allow the retry logic to proceed. This
ensures the advertised retry contract for 5xx probes is actually honored.
| # Check if we should retry | ||
| if retry_policy is not None and retry_policy.should_retry(attempt, status_code): | ||
| delay = retry_policy.get_delay(attempt) | ||
| logger.info( | ||
| "Attempt %d/%d failed for %s. Retrying in %.2fs...", | ||
| attempt + 1, retry_policy.max_retries + 1, | ||
| endpoint_key, delay, | ||
| ) | ||
| time.sleep(delay) | ||
| continue |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Do not retry a failed HALF_OPEN probe.
When can_probe() moves the breaker to HALF_OPEN, a timeout/connection failure reaches Lines 397-405 and can retry multiple times before the breaker is reopened. Record the failure and return immediately when circuit_breaker.state == HALF_OPEN.
🔒 Proposed fix
detail = str(e)
+ if circuit_breaker is not None and circuit_breaker.state == circuit_breaker.HALF_OPEN:
+ circuit_breaker.record_failure()
+ return "CRITICAL", detail, status_code, attempt
+
# Check if we should retry
if retry_policy is not None and retry_policy.should_retry(attempt, status_code):
delay = retry_policy.get_delay(attempt)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` around lines 396 - 405, When a circuit breaker is in
HALF_OPEN state, it should not retry failed probes because the breaker is
already attempting to recover and retries would delay the decision to reopen.
Before entering the retry logic block (where retry_policy.should_retry is
called), add a check to determine if circuit_breaker.state equals HALF_OPEN. If
the circuit breaker is in HALF_OPEN state, record the failure and return
immediately without entering the retry block that calls should_retry, get_delay,
and time.sleep. This ensures that any failure during HALF_OPEN state is recorded
promptly and the circuit breaker can transition back to OPEN if needed, rather
than being held in retry loops.
| "retry_policy": retry_policy.to_dict() if retry_policy else None, | ||
| "circuit_breakers": circuit_registry.to_dict() if circuit_registry else {}, |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Snapshot circuit breakers after probes update them.
Line 537 serializes the registry before get()/record_failure() mutate breaker state, so results["circuit_breakers"] is stale and print_health_report() can miss newly OPEN breakers. Refresh the snapshot after all checks complete.
🐛 Proposed fix
"overall_status": "OK",
"aggregation": {},
"retry_policy": retry_policy.to_dict() if retry_policy else None,
- "circuit_breakers": circuit_registry.to_dict() if circuit_registry else {},
+ "circuit_breakers": {},
@@
results["overall_status"] = "OK" if all_ok else "DEGRADED"
results["aggregation"] = aggregator.summary()
+ if circuit_registry is not None:
+ results["circuit_breakers"] = circuit_registry.to_dict()
# Log degraded servicesAlso applies to: 640-641
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` around lines 536 - 537, The circuit_registry.to_dict()
call at line 537 captures the circuit breaker state before the probes execute
and mutate the breaker states via get() and record_failure() calls, resulting in
a stale snapshot. Move the assignment of circuit_registry.to_dict() to
results["circuit_breakers"] to after all health check probes complete execution,
so the snapshot reflects the final mutated state of the circuit breakers. Apply
the same fix to the similar circuit_registry serialization at the location noted
in "Also applies to: 640-641".
| def test_circuit_breaker_skips_probe(self, mock_conn): | ||
| """Test that open circuit breaker skips probe.""" | ||
| from tools.health_check import CircuitBreakerState | ||
|
|
||
| cb = CircuitBreakerState(threshold=2, cooldown=30) | ||
| cb.record_failure() | ||
| cb.record_failure() # circuit opens | ||
| self.assertFalse(cb.can_probe()) | ||
|
|
||
| status, detail, code, retries = check_http_service( | ||
| "localhost", 8080, "/health", 5, | ||
| circuit_breaker=cb, | ||
| ) | ||
| self.assertEqual(status, "CRITICAL") | ||
| self.assertIn("Circuit breaker OPEN", detail) | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
OPEN-circuit short-circuit behavior is not fully asserted
This test should also verify no outbound probe occurs. Add mock_conn.assert_not_called() (and optionally assert retries/code defaults) so the core skip contract is enforced.
Suggested assertion
status, detail, code, retries = check_http_service(
"localhost", 8080, "/health", 5,
circuit_breaker=cb,
)
self.assertEqual(status, "CRITICAL")
self.assertIn("Circuit breaker OPEN", detail)
+ mock_conn.assert_not_called()
+ self.assertEqual(retries, 0)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_circuit_breaker_skips_probe(self, mock_conn): | |
| """Test that open circuit breaker skips probe.""" | |
| from tools.health_check import CircuitBreakerState | |
| cb = CircuitBreakerState(threshold=2, cooldown=30) | |
| cb.record_failure() | |
| cb.record_failure() # circuit opens | |
| self.assertFalse(cb.can_probe()) | |
| status, detail, code, retries = check_http_service( | |
| "localhost", 8080, "/health", 5, | |
| circuit_breaker=cb, | |
| ) | |
| self.assertEqual(status, "CRITICAL") | |
| self.assertIn("Circuit breaker OPEN", detail) | |
| def test_circuit_breaker_skips_probe(self, mock_conn): | |
| """Test that open circuit breaker skips probe.""" | |
| from tools.health_check import CircuitBreakerState | |
| cb = CircuitBreakerState(threshold=2, cooldown=30) | |
| cb.record_failure() | |
| cb.record_failure() # circuit opens | |
| self.assertFalse(cb.can_probe()) | |
| status, detail, code, retries = check_http_service( | |
| "localhost", 8080, "/health", 5, | |
| circuit_breaker=cb, | |
| ) | |
| self.assertEqual(status, "CRITICAL") | |
| self.assertIn("Circuit breaker OPEN", detail) | |
| mock_conn.assert_not_called() | |
| self.assertEqual(retries, 0) |
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 368-368: Unpacked variable code is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
[warning] 368-368: Unpacked variable retries is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/test_health_check.py` around lines 359 - 374, The
test_circuit_breaker_skips_probe test verifies the response when the circuit
breaker is open, but does not assert that the actual probe was skipped and no
HTTP connection was attempted. Add mock_conn.assert_not_called() after the
existing assertions to enforce that the mocked HTTP connection was never called,
thus verifying the core contract that an open circuit breaker skips the probe
entirely. Optionally, also add assertions to verify that retries and code have
their expected default values.
Summary
Add configurable retry/backoff and circuit breaker patterns to the HTTP health check probes in
tools/health_check.py, plus result aggregation and proper logging.Changes
--max-retries,--backoff-factor,--base-delay,--circuit-threshold,--circuit-cooldownTesting
Checklist
Diagnostic artifacts:
diagnostic/build-94e0fb08.logd(encrypted)diagnostic/build-94e0fb08.jsonSummary by CodeRabbit
Release Notes
New Features
Tests