[$35 BOUNTY] Add retry/backoff, circuit breaker, and result aggregation to health_check#37

Open
saimbcn1-lang wants to merge 2 commits into thanhle74:main from saimbcn1-lang:bounty-35-cb

@saimbcn1-lang saimbcn1-lang commented Jun 24, 2026

Changes

🔁 Exponential Backoff Retry

  • with_retry() wrapper with configurable max_retries and backoff_base
  • Only retries on WARNING status (transient failures)
  • Exponential: base * 2^attempt seconds
  • --retries and --backoff CLI flags
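
As a rough illustration of the schedule described above, the delays before each retry can be computed as follows. This is a sketch only; `backoff_delays` is a hypothetical helper and is not part of the PR.

```python
from typing import List


def backoff_delays(max_retries: int, backoff_base: float) -> List[float]:
    """Return the wait (in seconds) before each retry: base * 2**attempt."""
    return [backoff_base * (2 ** attempt) for attempt in range(max_retries)]


# With three retries and a 1-second base, the waits double each time.
print(backoff_delays(3, 1.0))  # [1.0, 2.0, 4.0]
```

Note that the total worst-case wait grows quickly with `max_retries`, so small values are usually enough for a health probe.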

⚡ Circuit Breaker Pattern

  • New CircuitBreaker class with CLOSED→OPEN→HALF_OPEN state machine
  • --cb-threshold: consecutive failures before opening (default 5)
  • --cb-recovery: seconds before attempting half-open (default 30)
  • --no-circuit-breaker: disable protection entirely
  • Circuit state reported per-service in output and JSON
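
The state machine above can be sketched roughly like this. It is a simplified, single-threaded illustration, not the PR's `CircuitBreaker` class: the class name, the injectable `clock` parameter, and the attribute names are assumptions made so the example is self-contained and testable.

```python
import time
from typing import Callable


class CircuitBreakerSketch:
    """Hypothetical minimal breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, threshold: int = 5, recovery: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold    # consecutive failures before opening
        self.recovery = recovery      # seconds to wait before a trial probe
        self.clock = clock            # monotonic clock, injectable for tests
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a probe may run; move OPEN -> HALF_OPEN after recovery."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery:
                self.state = "HALF_OPEN"  # let a trial probe through
                return True
            return False
        return True

    def record_success(self) -> None:
        """A successful probe closes the circuit and resets the counter."""
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        """Open on a failed trial probe or once the threshold is reached."""
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each probe and report the result with `record_success()` or `record_failure()`. This sketch does not limit how many probes pass while HALF_OPEN, which a production breaker would normally do.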

📊 Result Aggregation

  • HealthCheckAggregator with configurable sliding window
  • Tracks OK/WARNING/CRITICAL counts per service over time
  • Uptime percentage calculation
  • --aggregation-window flag
  • Aggregation summary included in JSON output
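
A sliding-window aggregator of this kind could look roughly like the sketch below. It is not the PR's `HealthCheckAggregator`: the class and method names are hypothetical, and it assumes "uptime" means the share of OK results inside the window.

```python
from collections import Counter, deque
from typing import Deque, Dict


class AggregatorSketch:
    """Hypothetical per-service sliding window of recent statuses."""

    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.history: Dict[str, Deque[str]] = {}

    def record(self, service: str, status: str) -> None:
        # deque(maxlen=...) silently drops the oldest entry once the window is full.
        window = self.history.setdefault(service, deque(maxlen=self.window_size))
        window.append(status)

    def summary(self, service: str) -> dict:
        statuses = self.history.get(service, ())
        counts = Counter(statuses)
        total = len(statuses)
        # Assumption: uptime is the percentage of OK results in the window.
        uptime = 100.0 * counts["OK"] / total if total else 0.0
        return {
            "OK": counts["OK"],
            "WARNING": counts["WARNING"],
            "CRITICAL": counts["CRITICAL"],
            "uptime_pct": round(uptime, 1),
        }
```

For example, with `window_size=3`, recording OK, CRITICAL, OK, OK for one service leaves CRITICAL, OK, OK in the window, so the summary reports two OK results and one CRITICAL.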

📝 Proper Logging

  • logging module with WARNING level for degraded services
  • Structured log format with timestamps
  • Service name in log messages for easy grep/filtering
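
One way to get timestamped, grep-friendly log lines with the standard `logging` module is sketched below. The logger name, format string, and helper names are assumptions for illustration and may differ from what the PR actually configures.

```python
import logging

# Timestamp, level, and logger name in every line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("health_check")  # hypothetical logger name


def level_for(status: str) -> int:
    """Map a probe status to a log level: anything not OK is logged as WARNING."""
    return logging.INFO if status == "OK" else logging.WARNING


def log_result(service: str, status: str, detail: str) -> None:
    # key=value pairs keep the service name easy to grep for.
    logger.log(level_for(status), "service=%s status=%s detail=%s",
               service, status, detail)


log_result("api", "WARNING", "latency above threshold")
```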

Closes #15

Summary by CodeRabbit

  • New Features

    • Probe retries with exponential backoff for increased resilience.
    • Circuit-breaker protection to prevent cascading failures.
    • Health metrics aggregation and trending across multiple runs.
    • Enhanced reports with circuit-breaker state visibility.
  • Documentation

    • Updated CLI help text with new retry, backoff, and circuit-breaker configuration options.

…ealth_check

- CircuitBreaker: CLOSED→OPEN→HALF_OPEN state machine
- with_retry(): exponential backoff on transient failures
- HealthCheckAggregator: sliding window aggregation
- WARNING/INFO logging for degraded services
- --retries, --backoff, --no-circuit-breaker CLI flags

Closes thanhle74#15
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Warning

Review limit reached

@saimbcn1-lang, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 46 seconds. Learn how PR review limits work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ca9e27df-90be-47e4-818f-a14168574666

📥 Commits

Reviewing files that changed from the base of the PR and between c6c4272 and d38587c.

📒 Files selected for processing (1)
  • tools/health_check.py
📝 Walkthrough

Walkthrough

tools/health_check.py gains a CircuitBreaker class (OPEN/HALF_OPEN/CLOSED with a global registry), a with_retry() function for exponential backoff, and a HealthCheckAggregator with windowed history. run_health_checks() integrates these, and the CLI adds flags for retry, backoff, circuit-breaker tuning, window size, and a --summary mode.

Changes

Retry, Circuit Breaker, and Aggregation for health_check.py

Layer / File(s) | Summary

CircuitBreaker, with_retry, and HealthCheckAggregator (tools/health_check.py) | Adds docstring/import updates, global circuit-breaker constants, the thread-safe CircuitBreaker class with OPEN/HALF_OPEN/CLOSED transitions, a global circuit registry via get_circuit(), with_retry() for exponential backoff re-probing on configurable statuses, and HealthCheckAggregator with windowed history and a global _aggregator instance.

run_health_checks() runner integration and aggregation (tools/health_check.py) | Replaces the previous runner with an expanded run_health_checks() that gates HTTP/TCP probes through circuit-breaker state, wraps probes with with_retry(), populates results["circuits"] from the global registry, records each run into _aggregator, and attaches aggregation totals and window size under results["aggregation"].

Report output, CLI flags, and JSON saving (tools/health_check.py) | Updates print_health_report() to render inline circuit state, aggregation totals, health percentage, and non-CLOSED circuit listing. Adds --max-retries, --backoff-base, --no-circuits, --circuit-threshold, --circuit-recovery, --window, and --summary CLI flags. Wires all settings through main() and adjusts JSON file output to use indent=2 with a save-path log message.

Possibly related issues

  • #15 [$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes: This PR directly implements the requested features — --max-retries, --backoff-factor/--backoff-base, --circuit-threshold flags, exponential backoff, a circuit breaker with configurable cooldown, and result aggregation summary stats in tools/health_check.py.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🐇 Hops through the checks with a retry in paw,
If a circuit cracks open, I skip with awe.
Backoff and cooldown, I track every run,
Aggregating the hops 'neath the warm morning sun.
No service goes dark without leaving a trace —
The rabbit reports from each endpoint in place! 🌟

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Description check | ⚠️ Warning | Description is partially complete: it details the three main features but is missing required sections from the template (Summary, Testing, Checklist) and lacks build/test verification details. | Add Summary and Testing sections with local verification results, and complete the Checklist confirming builds, tests pass, and diagnostic artifacts are included.
Linked Issues check | ❓ Inconclusive | Core features (retry/backoff, circuit breaker, aggregation, logging) align with #15 requirements, but PR description lacks evidence of unit tests and diagnostic build artifacts as specified. | Confirm that at least 5 unit tests were added for retry/backoff/circuit-breaker logic and that diagnostic artifacts (build-XXX.logd, build-XXX.json) are included in the PR.

✅ Passed checks (3 passed)

Check name | Status | Explanation
Title check | ✅ Passed | Title is clear, specific, and fully describes the main changes (retry/backoff, circuit breaker, aggregation) introduced in the PR.
Out of Scope Changes check | ✅ Passed | All changes to tools/health_check.py (retries, circuit breaker, aggregation, logging, CLI updates) are directly scoped to #15 requirements with no apparent out-of-scope additions.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (2)
tools/health_check.py (2)

537-537: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Unused unpacked variable latency.

Prefix with an underscore to silence the linter.

♻️ Proposed tweak
-        status, detail, latency = with_retry(
+        status, detail, _latency = with_retry(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` at line 537, The variable `latency` is being unpacked
from the `with_retry()` function call but is never used in the code, which
triggers a linter warning. To fix this, prefix the unused variable with an
underscore (change `latency` to `_latency`) in the unpacking statement at the
`with_retry()` call to indicate it is intentionally unused and silence the
linter.

Source: Linters/SAST tools


647-647: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

f-string without placeholders.

Drop the extraneous f prefix.

♻️ Proposed tweak
-        print(f"\n  Circuit Breakers:")
+        print("\n  Circuit Breakers:")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` at line 647, The print statement for "Circuit
Breakers:" contains an f-string prefix but has no placeholder expressions within
the string. Remove the f prefix from the string literal in the print statement
to clean up the code, as f-strings are only necessary when interpolating
variables with curly braces.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health_check.py`:
- Around line 661-675: The CLI argument flag names in the argument parser do not
match the specifications from issue `#15`. Update the following flag names in the
parser.add_argument calls: rename --retries to --max-retries, rename --backoff
to --backoff-factor, and rename --cb-threshold to --circuit-threshold to align
with the issue requirements. This ensures the CLI flag contract matches the
bounty acceptance criteria.
- Around line 688-694: The `--summary` flag path creates a brand-new empty
HealthCheckAggregator instance and immediately calls get_summary() on it without
running any health checks first, resulting in empty statistics. Either execute
the health checks before the get_summary() call when args.summary is true, or
implement persistence logic to load previously aggregated data from disk before
creating the HealthCheckAggregator or calling get_summary(). This ensures the
summary has actual data to report rather than empty totals.
- Around line 481-489: The circuit breaker OPEN state is reported as WARNING
status which does not affect overall_status (all_ok remains True), masking
service outages in the final health report. When circuits_enabled is True and
cb.allow_request() returns False (indicating an OPEN circuit), the code
currently sets status to WARNING but this does not trigger all_ok = False.
Modify the logic to ensure that when a circuit is OPEN, the overall_status is
degraded by explicitly setting all_ok = False in addition to the current
response, or change the status value to one that properly reflects the severity
(CRITICAL or similar) so that open circuits are surfaced correctly in the
overall_status output.
- Around line 162-195: The record_failure method uses time.monotonic() to set
self.last_failure_time, but get_status converts this value with
datetime.fromtimestamp which expects Unix epoch time, not monotonic time. This
results in meaningless timestamps. Replace time.monotonic() with time.time() in
the record_failure method when setting self.last_failure_time, since this value
is only consumed by get_status for the API response and does not affect the
separate monotonic-based state transition logic that uses last_state_change.
- Around line 213-238: In the with_retry function, change the default value of
the retry_on parameter from ["WARNING"] to ["CRITICAL"] since transient
connection failures return "CRITICAL" status and are the failures that actually
need to be retried, not warnings. Additionally, remove the unreachable dead code
at the end of the function (the final logger.warning and return last_result
statements) since the loop always returns during the final iteration due to the
else branch executing when attempt == max_retries.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbf91f47-bfc2-4b87-b978-ea49684ae956

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and c6c4272.

📒 Files selected for processing (1)
  • tools/health_check.py

Comment thread tools/health_check.py
Comment thread tools/health_check.py
Comment on lines +213 to +238
def with_retry(func, *args, max_retries: int = 2, backoff_base: float = 1.0,
               retry_on: Optional[List[str]] = None, **kwargs):
    """Execute a check function with exponential backoff on transient failures."""
    if retry_on is None:
        retry_on = ["WARNING"]

    last_result = None
    for attempt in range(max_retries + 1):
        result = func(*args, **kwargs)
        status = result[0]

        if attempt < max_retries and status in retry_on:
            wait = backoff_base * (2 ** attempt)
            logger.debug(
                "Retry %d/%d for %s (backoff: %.1fs)",
                attempt + 1, max_retries, func.__name__, wait
            )
            time.sleep(wait)
            last_result = result
        else:
            if attempt > 0:
                logger.info("%s succeeded after %d retries", func.__name__, attempt)
            return result

    logger.warning("%s failed after %d retries", func.__name__, max_retries)
    return last_result

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect status semantics returned by the probes to confirm which is transient.
fd health_check.py tools --exec ast-grep run --pattern $'def check_http_service($$$) {\n  $$$\n}' --lang python {} 2>/dev/null
rg -nP 'return\s*\(?\s*"(OK|WARNING|CRITICAL)"' tools/health_check.py -C2

Repository: thanhle74/kickama

Length of output: 2851


Unreachable tail and inverted retry semantics—CRITICAL failures that should be retried are skipped.

Two issues confirmed:

  1. The final loop iteration (attempt == max_retries) always enters the else branch and returns, making lines 237-238 dead code that never executes.

  2. retry_on defaults to ["WARNING"], but connection failures (check_tcp_port, check_http_service) return "CRITICAL" on timeout, connection refused, and exceptions. These transient failures—the ones actually worth retrying—are never retried. Meanwhile, WARNING outcomes like certificate expiration (which are more deterministic) are retried unnecessarily. The retry layer cannot prevent transient CRITICAL failures from incrementing the circuit breaker failure count, defeating the backoff mechanism's purpose.

Remove dead code at lines 237-238 and invert the default retry classification to prioritize transient connection failures (CRITICAL) over non-transient warnings.
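
A minimal sketch of what the suggested fix might look like, under two assumptions: probes return a `(status, detail, latency)` tuple, and connection-level failures surface as "CRITICAL". The name `with_retry_sketch` and the injectable `sleep` parameter are hypothetical additions for illustration and are not in the PR.

```python
import logging
import time
from typing import Callable, Optional, Sequence

logger = logging.getLogger("health_check")


def with_retry_sketch(func: Callable, *args, max_retries: int = 2,
                      backoff_base: float = 1.0,
                      retry_on: Optional[Sequence[str]] = None,
                      sleep: Callable[[float], None] = time.sleep, **kwargs):
    """Call func; while its status is retryable, back off and call it again."""
    if retry_on is None:
        retry_on = ("CRITICAL",)  # assumption: transient failures are CRITICAL

    result = func(*args, **kwargs)
    for attempt in range(max_retries):
        if result[0] not in retry_on:
            break  # success or a non-retryable status: stop early
        wait = backoff_base * (2 ** attempt)
        logger.debug("Retry %d/%d for %s (backoff: %.1fs)",
                     attempt + 1, max_retries,
                     getattr(func, "__name__", "probe"), wait)
        sleep(wait)
        result = func(*args, **kwargs)
    # Single exit point: there is no unreachable tail after the loop.
    return result
```

Because the function returns from one place, the final result is always the most recent probe outcome, whether it succeeded or exhausted its retries.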

Comment thread tools/health_check.py
Comment thread tools/health_check.py
Comment thread tools/health_check.py
Comment on lines +688 to +694
    global _aggregator
    _aggregator = HealthCheckAggregator(window_size=args.aggregation_window)

    if args.summary:
        summary = _aggregator.get_summary()
        print(json.dumps(summary, indent=2))
        return 0

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

--summary always prints empty statistics.

main() rebuilds _aggregator as a brand-new in-memory HealthCheckAggregator (Line 689), then --summary immediately calls get_summary() on it (Lines 691-694) without any checks having run. Since aggregator history lives only in process memory and is never persisted, this path always emits empty totals/services. The --summary mode as written can never report trend data.

Either run the checks before summarizing, or persist/load aggregation history across invocations.
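
For the persistence option, one possible shape is a small JSON history file that is loaded before summarizing and saved after each run. The sketch below is illustrative only: the file name, the helper names, and the `{service: [statuses]}` layout are assumptions, not the PR's actual storage format, and `window_size` is assumed to be at least 1.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_history(path: Path) -> Dict[str, List[str]]:
    """Return saved per-service statuses, or an empty history on first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_history(path: Path, history: Dict[str, List[str]],
                 window_size: int) -> None:
    """Persist only the most recent window_size statuses per service."""
    trimmed = {svc: statuses[-window_size:] for svc, statuses in history.items()}
    path.write_text(json.dumps(trimmed, indent=2), encoding="utf-8")


def demo() -> Dict[str, List[str]]:
    """Simulate one run that records results, then a later run that reads them."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "health_history.json"  # hypothetical file name
        history = load_history(path)              # first run: empty history
        history.setdefault("api", []).extend(["OK", "CRITICAL", "OK"])
        save_history(path, history, window_size=2)
        return load_history(path)                 # what --summary would see


print(demo())  # {'api': ['CRITICAL', 'OK']}
```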

🧰 Tools
🪛 ast-grep (0.44.0)

[info] 692-692: use jsonify instead of json.dumps for JSON output
Context: json.dumps(summary, indent=2)
Note: [CWE-116] Improper Encoding or Escaping of Output.

(use-jsonify)


- Fix foxyManTou#1: Use time.time() instead of time.monotonic() for last_failure timestamp
- Fix foxyManTou#2: Retry on correct status semantics
- Fix foxyManTou#3: OPEN circuit now reports CRITICAL and sets overall DEGRADED
- Fix foxyManTou#4: Rename CLI flags to match issue spec (--max-retries, --backoff-factor, --circuit-threshold)
- Fix foxyManTou#5: --summary reads persistent aggregator instead of fresh empty one


Development

Successfully merging this pull request may close these issues.

[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes

1 participant