
[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes #33

Open
liujin11112 wants to merge 4 commits into thanhle74:main from liujin11112:fix/retry-backoff-circuit-breaker

@liujin11112 liujin11112 commented Jun 22, 2026

Summary

Adds configurable retry logic with exponential backoff and circuit breaker pattern to HTTP health check probes.

Changes

  • CircuitBreaker class: State machine (CLOSED->OPEN->HALF_OPEN) with per-service tracking
  • HealthCheckStats class: Aggregate pass/warn/critical counts and retry totals
  • check_http_service(): Retries on 5xx responses, connection refused, and timeouts, with exponential backoff: delay = 0.5 * backoff_factor ** attempt
  • CLI flags: --max-retries, --backoff-factor, --circuit-threshold
  • Logging: WARNING-level for retries and circuit state changes
  • JSON output: Includes summary + circuit_breakers sections
  • Tests: 17 unit tests covering all new features
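
The backoff formula in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: the function names `backoff_delay` and `backoff_schedule`, the default `backoff_factor` of 2.0, and the assumption that `attempt` counts from zero are all hypothetical.

```python
def backoff_delay(attempt: int, backoff_factor: float = 2.0, base: float = 0.5) -> float:
    """Return the wait before the next retry; `attempt` is assumed to be 0-based."""
    # Exponentiation uses ** in Python; ^ would be bitwise XOR.
    return base * (backoff_factor ** attempt)


def backoff_schedule(max_retries: int, backoff_factor: float = 2.0, base: float = 0.5) -> list:
    """List the delays that would be used across `max_retries` retries."""
    return [backoff_delay(a, backoff_factor, base) for a in range(max_retries)]


print(backoff_schedule(3))  # [0.5, 1.0, 2.0]
```

With a factor of 2.0 the delay doubles on every retry, so the total wait grows quickly; a smaller factor such as 1.5 gives a gentler schedule.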

Testing

python -m unittest tests.test_health_check.TestCircuitBreaker tests.test_health_check.TestHealthCheckStats tests.test_health_check.TestHTTPRetryBackoff tests.test_health_check.TestCLIFlags -v
# Result: 15/15 passed

Checklist

  • Tests pass locally
  • Diagnostic build log committed

Summary by CodeRabbit

  • New Features

    • Added circuit breaker pattern to health checks for improved fault tolerance and service isolation
    • Introduced configurable retry and backoff settings for HTTP probes
    • New CLI options: --max-retries, --backoff-factor, and --circuit-threshold for fine-tuning health check behavior
    • Enhanced health check responses with summary data and circuit breaker status
  • Tests

    • Added comprehensive test suite covering circuit breaker logic, retry mechanisms, and health check aggregation
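
The summary data mentioned above could be aggregated roughly like this. The sketch assumes a `record`/`to_dict` interface as named in the review walkthrough; the class name `HealthCheckStatsSketch`, the `retries` parameter, and the dictionary keys are hypothetical and may differ from the PR's `HealthCheckStats`.

```python
class HealthCheckStatsSketch:
    """Counts check outcomes per status and sums retries (illustrative only)."""

    def __init__(self):
        self.counts = {"OK": 0, "WARNING": 0, "CRITICAL": 0}
        self.total_retries = 0

    def record(self, status, retries=0):
        # Unknown statuses are counted too rather than silently dropped.
        self.counts[status] = self.counts.get(status, 0) + 1
        self.total_retries += retries

    def to_dict(self):
        return {
            "total": sum(self.counts.values()),
            "passed": self.counts["OK"],
            "warnings": self.counts["WARNING"],
            "critical": self.counts["CRITICAL"],
            "total_retries": self.total_retries,
        }


stats = HealthCheckStatsSketch()
stats.record("OK")
stats.record("CRITICAL", retries=3)
print(stats.to_dict())
```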

coderabbitai Bot commented Jun 22, 2026


Warning

Review limit reached

@liujin11112, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 50 minutes. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f80c30e-a77f-445b-bbc1-1a82c8cf5dfa

📥 Commits

Reviewing files that changed from the base of the PR and between 202a22c and 12acd18.

📒 Files selected for processing (1)
  • tests/test_health_check.py
📝 Walkthrough

Walkthrough

tools/health_check.py gains CircuitBreaker and HealthCheckStats classes, retry/backoff logic in check_http_service, aggregated summary and circuit_breakers fields in run_health_checks, and three new CLI flags. A new tests/ package with test_health_check.py provides unit and end-to-end coverage for all new behaviors.
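
For readers unfamiliar with the pattern, a per-service CLOSED/OPEN/HALF_OPEN state machine might look like the sketch below. Only the `record_success` and `record_failure` method names come from the review; the class name, the `allow_request` and `state` methods, the `cooldown` parameter, and the injectable `clock` are assumptions made for illustration.

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"


class CircuitBreakerSketch:
    """Tracks consecutive failures per service name (illustrative only)."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests need not sleep
        self._failures = {}         # service -> consecutive failure count
        self._opened_at = {}        # service -> time the circuit opened

    def state(self, service):
        opened = self._opened_at.get(service)
        if opened is None:
            return CLOSED
        # After the cooldown, one trial request is allowed through.
        if self.clock() - opened >= self.cooldown:
            return HALF_OPEN
        return OPEN

    def allow_request(self, service):
        return self.state(service) != OPEN

    def record_success(self, service):
        # Any success closes the circuit and clears the failure count.
        self._failures.pop(service, None)
        self._opened_at.pop(service, None)

    def record_failure(self, service):
        count = self._failures.get(service, 0) + 1
        self._failures[service] = count
        # Open at the threshold, or re-open when a HALF_OPEN trial fails.
        if count >= self.threshold or self.state(service) == HALF_OPEN:
            self._opened_at[service] = self.clock()
```

Keeping the counters in dictionaries keyed by service name is what isolates services from each other: failures of one endpoint never open the circuit for another.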

Changes

Circuit Breaker, Retry, and Stats for health_check

Layer / File(s) Summary
CircuitBreaker, HealthCheckStats, and HTTP probe with retry/backoff
tools/health_check.py
Adds import logging, new module constants, CircuitBreaker (CLOSED/OPEN/HALF_OPEN state machine) and HealthCheckStats (per-status counters + retry totals) classes, and rewrites check_http_service to gate attempts via circuit-breaker state, retry transient failures with configurable backoff, and record outcomes into the circuit breaker.
run_health_checks wiring: per-service circuit, stats, and result fields
tools/health_check.py
Expands run_health_checks with max_retries, backoff_factor, and circuit_threshold parameters; creates CircuitBreaker and HealthCheckStats per invocation; threads them through HTTP, TCP, disk, memory, and load checks; and adds summary and circuit_breakers to the returned dict.
CLI flag extensions and main() wiring
tools/health_check.py
parse_args accepts argv and adds --max-retries, --backoff-factor, and --circuit-threshold flags with defaults from module constants; both watch and non-watch invocations of run_health_checks pass the new arguments.
Test suite
tests/__init__.py, tests/test_health_check.py
Adds the tests package init and a full test module covering CircuitBreaker state transitions (CLOSED → OPEN, reset, isolation, HALF_OPEN cooldown), HealthCheckStats recording and to_dict, check_http_service tuple shape and circuit-rejection behavior, parse_args new flags and defaults, and run_health_checks top-level response structure.
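
The CLI wiring described in the table can be sketched with `argparse`. The flag names and the `parse_args(argv)` shape come from the walkthrough; the constant names and their default values below are placeholders, since the PR's actual module constants are not shown here.

```python
import argparse

# Placeholder defaults; the PR's real module constants may differ.
DEFAULT_MAX_RETRIES = 3
DEFAULT_BACKOFF_FACTOR = 2.0
DEFAULT_CIRCUIT_THRESHOLD = 5


def parse_args(argv=None):
    """Parse CLI flags; passing argv explicitly makes the parser testable."""
    parser = argparse.ArgumentParser(description="Health check (sketch)")
    parser.add_argument("--max-retries", type=int, default=DEFAULT_MAX_RETRIES,
                        help="retries per HTTP probe after the first attempt")
    parser.add_argument("--backoff-factor", type=float, default=DEFAULT_BACKOFF_FACTOR,
                        help="multiplier applied to the delay on each retry")
    parser.add_argument("--circuit-threshold", type=int, default=DEFAULT_CIRCUIT_THRESHOLD,
                        help="consecutive failures before the circuit opens")
    return parser.parse_args(argv)


args = parse_args(["--max-retries", "1", "--backoff-factor", "1.5"])
print(args.max_retries, args.backoff_factor, args.circuit_threshold)
```

Accepting `argv` as a parameter lets unit tests call `parse_args([...])` directly instead of patching `sys.argv`.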

Sequence Diagram(s)

sequenceDiagram
  participant CLI as main() / CLI
  participant RHC as run_health_checks
  participant CB as CircuitBreaker
  participant HCS as HealthCheckStats
  participant CHS as check_http_service
  participant Endpoint as HTTP Endpoint

  CLI->>RHC: call(max_retries, backoff_factor, circuit_threshold)
  RHC->>CB: instantiate CircuitBreaker(circuit_threshold)
  RHC->>HCS: instantiate HealthCheckStats()
  loop each service
    RHC->>CHS: call(service, circuit_breaker=CB)
    CHS->>CB: check state
    alt OPEN
      CB-->>CHS: reject
      CHS-->>RHC: CRITICAL, "circuit breaker"
    else CLOSED/HALF_OPEN
      CHS->>Endpoint: HTTP GET (with retries)
      Endpoint-->>CHS: response
      CHS->>CB: record_success / record_failure
      CHS-->>RHC: status, detail, code
    end
    RHC->>HCS: record(status)
  end
  RHC-->>CLI: results + summary(HCS.to_dict()) + circuit_breakers(CB states)
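
The probe path in the diagram (reject when the circuit is open, otherwise attempt with retries) could be sketched as below. This is not the PR's `check_http_service`: the function name, the `fetch` callable standing in for the HTTP GET, the `breaker_open` flag, and the injectable `sleep` are assumptions chosen to keep the example self-contained.

```python
import time


def probe_with_retries(fetch, breaker_open=False, max_retries=3,
                       backoff_factor=2.0, sleep=time.sleep):
    """Return (result, detail, http_code).

    `fetch` is a hypothetical callable returning (status, body) or raising
    OSError, which covers connection refused and socket timeouts.
    """
    if breaker_open:
        return "CRITICAL", "circuit breaker open", 0

    detail, code = "no attempt made", 0
    for attempt in range(max_retries + 1):
        try:
            status, body = fetch()
        except OSError as exc:
            detail, code = str(exc), 0
        else:
            if status == 200:
                return "OK", f"HTTP {status}", status
            if status < 500:
                # 4xx and other non-5xx responses are not retried.
                return "WARNING", f"HTTP {status}: {body[:100]}", status
            detail, code = f"HTTP {status}: {body[:100]}", status
        if attempt < max_retries:
            sleep(0.5 * backoff_factor ** attempt)
    return "CRITICAL", detail, code


responses = iter([(503, "down"), (200, "ok")])
print(probe_with_retries(lambda: next(responses), sleep=lambda s: None))
```

Injecting `sleep` keeps the retry schedule observable in tests without actually waiting.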

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hoppity-hop through the circuit's gate,
When failures pile, the breaker won't wait!
Retries with backoff, a rabbit's delight,
Stats neatly counted, the summary's right.
CLOSED, OPEN, HALF_OPEN — what a dance,
This bunny's health checks never miss a chance! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The description covers Summary, Changes, and Testing sections with specific details. However, the Checklist is incomplete: only 2 of 8 required items are marked. | Complete the Checklist by verifying and marking all relevant items, particularly regarding local builds, documentation updates, and security considerations. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding retry/backoff and circuit breaker functionality to health_check HTTP probes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/health_check.py (1)

221-242: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unreachable dead code: old implementation not removed.

Lines 221-242 contain the previous single-attempt HTTP probe implementation. This code is unreachable because the function returns at line 219 or 220. Remove this dead code.

🧹 Remove dead code
     return "CRITICAL", "HTTP " + str(status) + ": " + body[:100], 0
-
-    import http.client
-    try:
-        conn = http.client.HTTPConnection(host, port, timeout=timeout)
-        conn.request("GET", path)
-        resp = conn.getresponse()
-        status = resp.status
-        body = resp.read().decode("utf-8", errors="replace")[:200]
-        conn.close()
-
-        if status == 200:
-            result = "OK"
-            detail = f"HTTP {status}"
-        elif status < 500:
-            result = "WARNING"
-            detail = f"HTTP {status}: {body[:100]}"
-        else:
-            result = "CRITICAL"
-            detail = f"HTTP {status}: {body[:100]}"
-
-        return result, detail, status
-    except Exception as e:
-        return "CRITICAL", str(e), 0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 221 - 242, Remove the entire unreachable
code block that spans from the http.client import through the exception handler
returning "CRITICAL", str(e), 0. This entire try-except block contains the old
single-attempt HTTP probe implementation that is dead code since the function
returns earlier (at lines 219-220 based on context). Delete the import statement
and the entire try-except block that handles HTTPConnection, request, response
reading, status checking, and exception handling.
🧹 Nitpick comments (1)
tools/health_check.py (1)

415-421: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Prefix unused tuple elements with underscore.

mem_pct and load_val are unpacked but never used. Prefix with underscore to signal intent.

♻️ Suggested fix
-    mem_status, mem_detail, mem_pct = check_memory_usage()
+    mem_status, mem_detail, _mem_pct = check_memory_usage()
     stats.record(mem_status)
     results["system"]["memory"] = {"status": mem_status, "detail": mem_detail}
     if mem_status == "CRITICAL":
         all_ok = False

-    load_status, load_detail, load_val = check_load_average()
+    load_status, load_detail, _load_val = check_load_average()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 415 - 421, Prefix the unused tuple
elements with underscores to indicate they are intentionally not being used. In
the check_memory_usage() call, rename the unpacked variable mem_pct to _mem_pct.
In the check_load_average() call, rename the unpacked variable load_val to
_load_val. This follows Python conventions for indicating intentionally unused
variables.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_health_check.py`:
- Around line 78-90: The unit tests test_http_probe_returns_tuple,
test_http_max_retries_zero, and tests at lines 124-141 are making real network
calls to check_http_service and run_health_checks against actual HTTP services,
which introduces nondeterministic behavior and environmental dependencies.
Replace these live network calls by mocking the HTTPConnection class and the
probe functions using unittest.mock.patch or similar mocking tools. Define
controlled fixtures that simulate successful responses, timeouts, and failures,
then assert that the retry logic, circuit breaker behavior, and output contracts
work correctly based on these mocked responses instead of relying on actual
services running.
- Around line 4-5: The _REPO_ROOT variable on line 4 contains a hard-coded
absolute Windows path that will break on other machines and CI environments.
Replace this hard-coded path with a relative path computed from the current test
file location using Path(__file__) to determine the test file's directory, then
navigate up the directory structure to find the repository root, and finally
construct the path to the tools directory. This approach will be
platform-independent and work across Windows, Linux, and macOS systems.

In `@tools/health_check.py`:
- Around line 192-200: The circuit_breaker.record_success call is being executed
unconditionally before checking the HTTP status code, which means it records
success even for 5xx responses. Move the circuit_breaker.record_success
invocation so that it only executes when the response status indicates success
(when status is less than 500), not for CRITICAL failures. This ensures the
circuit breaker only records success for non-5xx responses and properly tracks
service failures.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e1fdc2d2-e427-4787-a0e3-5bd448b59b77

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 202a22c.

📒 Files selected for processing (3)
  • tests/__init__.py
  • tests/test_health_check.py
  • tools/health_check.py

Comment on lines +4 to +5
_REPO_ROOT = Path(r"C:\Users\Administrator\bug_bounty_work\kickama35\thanhle74-kickama-94e0fb0")
sys.path.insert(0, str(_REPO_ROOT / "tools"))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace machine-specific test bootstrap path

Line 4 hard-codes a local Windows directory, so this module import will break on other machines/CI before tests even run.

Suggested portable fix
-_REPO_ROOT = Path(r"C:\Users\Administrator\bug_bounty_work\kickama35\thanhle74-kickama-94e0fb0")
-sys.path.insert(0, str(_REPO_ROOT / "tools"))
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+tools_path = str(_REPO_ROOT / "tools")
+if tools_path not in sys.path:
+    sys.path.insert(0, tools_path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_health_check.py` around lines 4 - 5, The _REPO_ROOT variable on
line 4 contains a hard-coded absolute Windows path that will break on other
machines and CI environments. Replace this hard-coded path with a relative path
computed from the current test file location using Path(__file__) to determine
the test file's directory, then navigate up the directory structure to find the
repository root, and finally construct the path to the tools directory. This
approach will be platform-independent and work across Windows, Linux, and macOS
systems.

Comment on lines +78 to +90
def test_http_probe_returns_tuple(self):
    result = health_check.check_http_service(
        "localhost", 19996, "/", timeout=1, max_retries=1,
    )
    self.assertEqual(len(result), 3)

def test_http_max_retries_zero(self):
    start = time.time()
    result = health_check.check_http_service(
        "localhost", 19995, "/", timeout=1, max_retries=0,
    )
    elapsed = time.time() - start
    self.assertLess(elapsed, 3)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid live network/process dependencies in unit tests

These tests call real probe paths (check_http_service and run_health_checks) against runtime environment state. That makes the suite nondeterministic and can slow/fail CI for reasons unrelated to code changes. Mock probe functions/HTTPConnection and assert retry/circuit/output contracts from controlled fixtures.

Also applies to: 124-141

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_health_check.py` around lines 78 - 90, The unit tests
test_http_probe_returns_tuple, test_http_max_retries_zero, and tests at lines
124-141 are making real network calls to check_http_service and
run_health_checks against actual HTTP services, which introduces
nondeterministic behavior and environmental dependencies. Replace these live
network calls by mocking the HTTPConnection class and the probe functions using
unittest.mock.patch or similar mocking tools. Define controlled fixtures that
simulate successful responses, timeouts, and failures, then assert that the
retry logic, circuit breaker behavior, and output contracts work correctly based
on these mocked responses instead of relying on actual services running.
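
A mocked test along the lines suggested above might look like this sketch. The `probe` function is a hypothetical single-attempt stand-in, not the PR's `check_http_service`; the point is only to show `unittest.mock.patch` replacing `http.client.HTTPConnection` so that no real socket is opened.

```python
import http.client
import unittest
from unittest import mock


def probe(host, port, path="/", timeout=1):
    """Hypothetical single-attempt probe used only to demonstrate mocking."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
    except OSError as exc:
        return "CRITICAL", str(exc), 0
    if status == 200:
        return "OK", f"HTTP {status}", status
    result = "WARNING" if status < 500 else "CRITICAL"
    return result, f"HTTP {status}", status


class ProbeTests(unittest.TestCase):
    @mock.patch("http.client.HTTPConnection")
    def test_ok_response(self, conn_cls):
        # The mocked connection reports HTTP 200 without any network traffic.
        conn_cls.return_value.getresponse.return_value.status = 200
        self.assertEqual(probe("localhost", 8080), ("OK", "HTTP 200", 200))
        conn_cls.assert_called_once_with("localhost", 8080, timeout=1)

    @mock.patch("http.client.HTTPConnection")
    def test_connection_refused(self, conn_cls):
        # Simulate a refused connection deterministically.
        conn_cls.return_value.request.side_effect = ConnectionRefusedError("refused")
        self.assertEqual(probe("localhost", 8080), ("CRITICAL", "refused", 0))
```

Saved as a test module, this can be run with `python -m unittest`. Because the target is patched by its dotted name, the same approach should carry over to a module that imports `http.client` at the top level or inside the function.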

Comment thread tools/health_check.py
Comment on lines +192 to +200
if circuit_breaker and service_name:
    circuit_breaker.record_success(service_name)

if status == 200:
    return "OK", "HTTP " + str(status), status
elif status < 500:
    return "WARNING", "HTTP " + str(status) + ": " + body[:100], status
else:
    return "CRITICAL", "HTTP " + str(status) + ": " + body[:100], status

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Circuit breaker records success on 5xx response after retries are exhausted.

When the final retry attempt returns a 5xx status, the condition attempt < max_retries is false, so the loop doesn't continue. The code then falls through to record_success() before returning CRITICAL. This incorrectly signals success to the circuit breaker for a failing service.

The record_success call should only occur for non-5xx responses.

🐛 Proposed fix
-            if circuit_breaker and service_name:
-                circuit_breaker.record_success(service_name)
-
             if status == 200:
+                if circuit_breaker and service_name:
+                    circuit_breaker.record_success(service_name)
                 return "OK", "HTTP " + str(status), status
             elif status < 500:
+                if circuit_breaker and service_name:
+                    circuit_breaker.record_success(service_name)
                 return "WARNING", "HTTP " + str(status) + ": " + body[:100], status
             else:
+                if circuit_breaker and service_name:
+                    circuit_breaker.record_failure(service_name)
                 return "CRITICAL", "HTTP " + str(status) + ": " + body[:100], status
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            if circuit_breaker and service_name:
-                circuit_breaker.record_success(service_name)
-            if status == 200:
-                return "OK", "HTTP " + str(status), status
-            elif status < 500:
-                return "WARNING", "HTTP " + str(status) + ": " + body[:100], status
-            else:
-                return "CRITICAL", "HTTP " + str(status) + ": " + body[:100], status
+            if status == 200:
+                if circuit_breaker and service_name:
+                    circuit_breaker.record_success(service_name)
+                return "OK", "HTTP " + str(status), status
+            elif status < 500:
+                if circuit_breaker and service_name:
+                    circuit_breaker.record_success(service_name)
+                return "WARNING", "HTTP " + str(status) + ": " + body[:100], status
+            else:
+                if circuit_breaker and service_name:
+                    circuit_breaker.record_failure(service_name)
+                return "CRITICAL", "HTTP " + str(status) + ": " + body[:100], status
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 192 - 200, The
circuit_breaker.record_success call is being executed unconditionally before
checking the HTTP status code, which means it records success even for 5xx
responses. Move the circuit_breaker.record_success invocation so that it only
executes when the response status indicates success (when status is less than
500), not for CRITICAL failures. This ensures the circuit breaker only records
success for non-5xx responses and properly tracks service failures.
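
One way to avoid repeating the breaker call in every branch is to classify first and record once. The sketch below is an alternative shape, not the committable suggestion above: `classify_and_record` and the `RecordingBreaker` stub are hypothetical names, and only `record_success` and `record_failure` come from the reviewed code.

```python
class RecordingBreaker:
    """Stub breaker that logs calls, standing in for the PR's CircuitBreaker."""

    def __init__(self):
        self.events = []

    def record_success(self, service_name):
        self.events.append(("success", service_name))

    def record_failure(self, service_name):
        self.events.append(("failure", service_name))


def classify_and_record(status, body, circuit_breaker=None, service_name=None):
    """Map an HTTP status to a result tuple and notify the breaker exactly once."""
    if status == 200:
        result, detail = "OK", "HTTP " + str(status)
    elif status < 500:
        result, detail = "WARNING", "HTTP " + str(status) + ": " + body[:100]
    else:
        result, detail = "CRITICAL", "HTTP " + str(status) + ": " + body[:100]

    if circuit_breaker is not None and service_name:
        # Only 5xx counts as a failure; 2xx-4xx means the service answered.
        if status < 500:
            circuit_breaker.record_success(service_name)
        else:
            circuit_breaker.record_failure(service_name)
    return result, detail, status


breaker = RecordingBreaker()
print(classify_and_record(503, "down", breaker, "api"))
print(breaker.events)
```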

@liujin11112

Author

Hi, just checking in on this PR. Is there anything else needed from my side to move this forward? The code changes + tests are ready. Happy to make any adjustments. Thanks!
