[$35 BOUNTY] Add retry/backoff, circuit breaker, and result aggregation to health_check#37
[$35 BOUNTY] Add retry/backoff, circuit breaker, and result aggregation to health_check#37saimbcn1-lang wants to merge 2 commits into
Conversation
…ealth_check - CircuitBreaker: CLOSED→OPEN→HALF_OPEN state machine - with_retry(): exponential backoff on transient failures - HealthCheckAggregator: sliding window aggregation - WARNING/INFO logging for degraded services - --retries, --backoff, --no-circuit-breaker CLI flags Closes thanhle74#15
|
Warning Review limit reached
More reviews will be available in 25 minutes and 46 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. 📝 WalkthroughWalkthrough
ChangesRetry, Circuit Breaker, and Aggregation for health_check.py
Possibly related issues
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (2)
tools/health_check.py (2)
537-537: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueUnused unpacked variable
latency.Prefix with an underscore to silence the linter.
♻️ Proposed tweak
- status, detail, latency = with_retry( + status, detail, _latency = with_retry(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/health_check.py` at line 537, The variable `latency` is being unpacked from the `with_retry()` function call but is never used in the code, which triggers a linter warning. To fix this, prefix the unused variable with an underscore (change `latency` to `_latency`) in the unpacking statement at the `with_retry()` call to indicate it is intentionally unused and silence the linter.Source: Linters/SAST tools
647-647: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
f-string without placeholders.Drop the extraneous
fprefix.♻️ Proposed tweak
- print(f"\n Circuit Breakers:") + print("\n Circuit Breakers:")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/health_check.py` at line 647, The print statement for "Circuit Breakers:" contains an f-string prefix but has no placeholder expressions within the string. Remove the f prefix from the string literal in the print statement to clean up the code, as f-strings are only necessary when interpolating variables with curly braces.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/health_check.py`:
- Around line 661-675: The CLI argument flag names in the argument parser do not
match the specifications from issue `#15`. Update the following flag names in the
parser.add_argument calls: rename --retries to --max-retries, rename --backoff
to --backoff-factor, and rename --cb-threshold to --circuit-threshold to align
with the issue requirements. This ensures the CLI flag contract matches the
bounty acceptance criteria.
- Around line 688-694: The `--summary` flag path creates a brand-new empty
HealthCheckAggregator instance and immediately calls get_summary() on it without
running any health checks first, resulting in empty statistics. Either execute
the health checks before the get_summary() call when args.summary is true, or
implement persistence logic to load previously aggregated data from disk before
creating the HealthCheckAggregator or calling get_summary(). This ensures the
summary has actual data to report rather than empty totals.
- Around line 481-489: The circuit breaker OPEN state is reported as WARNING
status which does not affect overall_status (all_ok remains True), masking
service outages in the final health report. When circuits_enabled is True and
cb.allow_request() returns False (indicating an OPEN circuit), the code
currently sets status to WARNING but this does not trigger all_ok = False.
Modify the logic to ensure that when a circuit is OPEN, the overall_status is
degraded by explicitly setting all_ok = False in addition to the current
response, or change the status value to one that properly reflects the severity
(CRITICAL or similar) so that open circuits are surfaced correctly in the
overall_status output.
- Around line 162-195: The record_failure method uses time.monotonic() to set
self.last_failure_time, but get_status converts this value with
datetime.fromtimestamp which expects Unix epoch time, not monotonic time. This
results in meaningless timestamps. Replace time.monotonic() with time.time() in
the record_failure method when setting self.last_failure_time, since this value
is only consumed by get_status for the API response and does not affect the
separate monotonic-based state transition logic that uses last_state_change.
- Around line 213-238: In the with_retry function, change the default value of
the retry_on parameter from ["WARNING"] to ["CRITICAL"] since transient
connection failures return "CRITICAL" status and are the failures that actually
need to be retried, not warnings. Additionally, remove the unreachable dead code
at the end of the function (the final logger.warning and return last_result
statements) since the loop always returns during the final iteration due to the
else branch executing when attempt == max_retries.
---
Nitpick comments:
In `@tools/health_check.py`:
- Line 537: The variable `latency` is being unpacked from the `with_retry()`
function call but is never used in the code, which triggers a linter warning. To
fix this, prefix the unused variable with an underscore (change `latency` to
`_latency`) in the unpacking statement at the `with_retry()` call to indicate it
is intentionally unused and silence the linter.
- Line 647: The print statement for "Circuit Breakers:" contains an f-string
prefix but has no placeholder expressions within the string. Remove the f prefix
from the string literal in the print statement to clean up the code, as
f-strings are only necessary when interpolating variables with curly braces.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| def with_retry(func, *args, max_retries: int = 2, backoff_base: float = 1.0, | ||
| retry_on: Optional[List[str]] = None, **kwargs): | ||
| """Execute a check function with exponential backoff on transient failures.""" | ||
| if retry_on is None: | ||
| retry_on = ["WARNING"] | ||
|
|
||
| last_result = None | ||
| for attempt in range(max_retries + 1): | ||
| result = func(*args, **kwargs) | ||
| status = result[0] | ||
|
|
||
| if attempt < max_retries and status in retry_on: | ||
| wait = backoff_base * (2 ** attempt) | ||
| logger.debug( | ||
| "Retry %d/%d for %s (backoff: %.1fs)", | ||
| attempt + 1, max_retries, func.__name__, wait | ||
| ) | ||
| time.sleep(wait) | ||
| last_result = result | ||
| else: | ||
| if attempt > 0: | ||
| logger.info("%s succeeded after %d retries", func.__name__, attempt) | ||
| return result | ||
|
|
||
| logger.warning("%s failed after %d retries", func.__name__, max_retries) | ||
| return last_result |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect status semantics returned by the probes to confirm which is transient.
fd health_check.py tools --exec ast-grep run --pattern $'def check_http_service($$$) {\n $$$\n}' --lang python {} 2>/dev/null
rg -nP 'return\s*\(?\s*"(OK|WARNING|CRITICAL)"' tools/health_check.py -C2Repository: thanhle74/kickama
Length of output: 2851
Unreachable tail and inverted retry semantics—CRITICAL failures that should be retried are skipped.
Two issues confirmed:
-
The final loop iteration (
attempt == max_retries) always enters theelsebranch and returns, making lines 237-238 dead code that never executes. -
retry_ondefaults to["WARNING"], but connection failures (check_tcp_port,check_http_service) return"CRITICAL"on timeout, connection refused, and exceptions. These transient failures—the ones actually worth retrying—are never retried. Meanwhile,WARNINGoutcomes like certificate expiration (which are more deterministic) are retried unnecessarily. The retry layer cannot prevent transientCRITICALfailures from incrementing the circuit breaker failure count, defeating the backoff mechanism's purpose.
Remove dead code at lines 237-238 and invert the default retry classification to prioritize transient connection failures (CRITICAL) over non-transient warnings.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` around lines 213 - 238, In the with_retry function,
change the default value of the retry_on parameter from ["WARNING"] to
["CRITICAL"] since transient connection failures return "CRITICAL" status and
are the failures that actually need to be retried, not warnings. Additionally,
remove the unreachable dead code at the end of the function (the final
logger.warning and return last_result statements) since the loop always returns
during the final iteration due to the else branch executing when attempt ==
max_retries.
| global _aggregator | ||
| _aggregator = HealthCheckAggregator(window_size=args.aggregation_window) | ||
|
|
||
| if args.summary: | ||
| summary = _aggregator.get_summary() | ||
| print(json.dumps(summary, indent=2)) | ||
| return 0 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
--summary always prints empty statistics.
main() rebuilds _aggregator as a brand-new in-memory HealthCheckAggregator (Line 689), then --summary immediately calls get_summary() on it (Lines 691-694) without any checks having run. Since aggregator history lives only in process memory and is never persisted, this path always emits empty totals/services. The --summary mode as written can never report trend data.
Either run the checks before summarizing, or persist/load aggregation history across invocations.
🧰 Tools
🪛 ast-grep (0.44.0)
[info] 692-692: use jsonify instead of json.dumps for JSON output
Context: json.dumps(summary, indent=2)
Note: [CWE-116] Improper Encoding or Escaping of Output.
(use-jsonify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` around lines 688 - 694, The `--summary` flag path
creates a brand-new empty HealthCheckAggregator instance and immediately calls
get_summary() on it without running any health checks first, resulting in empty
statistics. Either execute the health checks before the get_summary() call when
args.summary is true, or implement persistence logic to load previously
aggregated data from disk before creating the HealthCheckAggregator or calling
get_summary(). This ensures the summary has actual data to report rather than
empty totals.
- Fix foxyManTou#1: Use time.time() instead of time.monotonic() for last_failure timestamp - Fix foxyManTou#2: Retry on correct status semantics - Fix foxyManTou#3: OPEN circuit now reports CRITICAL and sets overall DEGRADED - Fix foxyManTou#4: Rename CLI flags to match issue spec (--max-retries, --backoff-factor, --circuit-threshold) - Fix foxyManTou#5: --summary reads persistent aggregator instead of fresh empty one
Changes
🔁 Exponential Backoff Retry
with_retry()wrapper with configurable max_retries and backoff_base--retriesand--backoffCLI flags⚡ Circuit Breaker Pattern
CircuitBreakerclass with CLOSED→OPEN→HALF_OPEN state machine--cb-threshold: consecutive failures before opening (default 5)--cb-recovery: seconds before attempting half-open (default 30)--no-circuit-breaker: disable protection entirely📊 Result Aggregation
HealthCheckAggregatorwith configurable sliding window--aggregation-windowflag📝 Proper Logging
loggingmodule with WARNING level for degraded servicesCloses #15
Summary by CodeRabbit
New Features
Documentation