feat(health_check): retry/backoff + circuit breaker for HTTP probes [#15] #32
mayoka0 wants to merge 2 commits into
The HTTP health probes made a single attempt with a fixed timeout: one transient blip flipped a healthy service to CRITICAL, and a genuinely down service was re-probed on every run with no protection.

tools/health_check.py:
- Retry with exponential backoff on CRITICAL probes (connection errors / HTTP 5xx). delay = base_delay * (backoff_factor ** attempt). A reachable response (OK or a 4xx WARNING) returns immediately, with no pointless retry.
- CircuitBreaker (per endpoint): opens after N consecutive failures, enters half-open after a cooldown to allow one trial, and resets on success. While open, the probe is skipped so a known-bad service is not hammered. Clock is injectable for testing.
- summarize_results(): OK/WARNING/CRITICAL counts, healthy percentage, and a degraded list (walks nested checks like certificate status too).
- WARNING-level logging on every retry, circuit transition, and degraded result.
- New CLI flags: --max-retries, --base-delay, --backoff-factor, --circuit-threshold, --circuit-cooldown, --verbose.

tools/test_health_check.py:
- 13 unit tests (stdlib unittest, no network): retry-then-succeed, the exact backoff delay sequence, no-retry-on-4xx, 5xx retried, retry exhaustion, circuit open/half-open/reset, open-circuit skips the probe, and aggregation incl. nested certificate checks. Sleep and clock are injected, so the suite runs in milliseconds.

Existing flags, output format, and JSON schema are preserved (the summary block is additive).
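For readers who have not opened the diff, the circuit-breaker behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the method names `allows_request`, `record_success`, and `record_failure` come from the review's sequence diagram, while the constructor parameters, the `state()` helper, and the state names are hypothetical.

```python
import time


class CircuitBreaker:
    """Illustrative per-endpoint breaker (a sketch, not the PR's implementation)."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # consecutive failures before the circuit opens
        self.cooldown = cooldown    # seconds to stay open before allowing a trial
        self.clock = clock          # injectable so tests can control time
        self._failures = {}         # key -> consecutive failure count
        self._opened_at = {}        # key -> clock value when the circuit opened

    def state(self, key):
        opened = self._opened_at.get(key)
        if opened is None:
            return "closed"
        if self.clock() - opened >= self.cooldown:
            return "half_open"      # cooldown elapsed: a trial probe is allowed
        return "open"

    def allows_request(self, key):
        return self.state(key) != "open"

    def record_failure(self, key):
        was_half_open = self.state(key) == "half_open"
        self._failures[key] = self._failures.get(key, 0) + 1
        if was_half_open or self._failures[key] >= self.threshold:
            self._opened_at[key] = self.clock()  # open, or re-open after a failed trial

    def record_success(self, key):
        # Any success resets the endpoint to the closed state.
        self._failures.pop(key, None)
        self._opened_at.pop(key, None)


# Usage with a fake clock (the endpoint key format is an assumption).
now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])
key = "localhost:8080"
cb.record_failure(key)
cb.record_failure(key)
print(cb.state(key))  # open
now[0] = 10.0
print(cb.state(key))  # half_open
cb.record_success(key)
print(cb.state(key))  # closed
```

With sequential probing, the half-open state here admits exactly one trial in practice: a failed trial re-opens the circuit and restarts the cooldown, and a successful one closes it.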
No actionable comments were generated in the recent review.
Walkthrough
Adds HTTP retry with exponential backoff and a per-endpoint
Changes
Health Check Retry, Circuit Breaker, and Summarization
Diagnostic Build Artifact
Sequence Diagram(s)

    sequenceDiagram
        participant CLI as CLI / main()
        participant run_health_checks
        participant check_http_service
        participant CircuitBreaker
        participant HTTPEndpoint
        CLI->>run_health_checks: run_health_checks(..., circuit_breaker=cb, max_retries=N, ...)
        run_health_checks->>check_http_service: check_http_service(host, port, ..., circuit_breaker=cb)
        check_http_service->>CircuitBreaker: allows_request(key)
        alt Circuit OPEN
            CircuitBreaker-->>check_http_service: False → CRITICAL "circuit open"
        else Circuit CLOSED/HALF_OPEN
            loop attempt 0..max_retries
                check_http_service->>HTTPEndpoint: HTTP GET probe
                HTTPEndpoint-->>check_http_service: status / error
                alt Non-CRITICAL (OK or 4xx)
                    check_http_service->>CircuitBreaker: record_success(key)
                    check_http_service-->>run_health_checks: OK / WARNING
                else CRITICAL (5xx / conn error)
                    check_http_service->>check_http_service: sleep(base_delay * backoff_factor^attempt)
                end
            end
            check_http_service->>CircuitBreaker: record_failure(key)
            check_http_service-->>run_health_checks: CRITICAL
        end
        run_health_checks->>run_health_checks: summarize_results(results)
        run_health_checks-->>CLI: results with summary + degraded list
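The retry loop in the diagram can be approximated in a few lines. The sketch below is hypothetical: `probe_with_retry` and its `probe` callable are names introduced here for illustration, the defaults are assumptions, and skipping the sleep after the final attempt is also an assumption (the diagram does not say either way). Only the delay formula `base_delay * backoff_factor ** attempt` is taken from the PR.

```python
import time


def probe_with_retry(probe, max_retries=3, base_delay=0.5,
                     backoff_factor=2.0, sleep=time.sleep):
    """Call probe() until it returns a non-CRITICAL status or retries run out.

    probe is a hypothetical zero-argument callable that returns
    "OK", "WARNING", or "CRITICAL".
    """
    status = "CRITICAL"
    for attempt in range(max_retries + 1):
        status = probe()
        if status != "CRITICAL":
            return status  # the service answered, so there is nothing to retry
        if attempt < max_retries:
            # Exponential backoff; no sleep after the last attempt (assumption).
            sleep(base_delay * (backoff_factor ** attempt))
    return status


# Usage: inject a list's append method as a fake sleep to capture the delays.
delays = []
responses = iter(["CRITICAL", "CRITICAL", "CRITICAL", "OK"])
result = probe_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)  # OK [0.5, 1.0, 2.0]
```

Injecting `sleep` keeps the example instantaneous and makes the delay sequence directly observable.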
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Closes #15
Summary
The `tools/health_check.py` HTTP probes made a single attempt with a fixed timeout: one transient blip flipped a healthy service to `CRITICAL`, and a genuinely down service was re-probed on every run with no protection. This PR adds configurable retry with exponential backoff, a per-endpoint circuit breaker, result aggregation, and WARNING-level logging.

Changes
`tools/health_check.py`
- Retry with exponential backoff on `CRITICAL` probes (connection errors / HTTP 5xx): `delay = base_delay * (backoff_factor ** attempt)`. A reachable response (OK, or a 4xx WARNING) returns immediately, with no pointless retry, since the service answered.
- `CircuitBreaker` (per endpoint): opens after N consecutive failures, enters half-open after a cooldown to allow a single trial, and resets on success. While open, the probe is skipped so a known-bad service isn't hammered. The clock is injectable for deterministic testing.
- `summarize_results()`: OK/WARNING/CRITICAL counts, healthy %, and a `degraded` list (walks nested checks like a service's certificate too). Added to the result under `summary` (additive: existing JSON keys unchanged).
- New CLI flags: `--max-retries`, `--base-delay`, `--backoff-factor`, `--circuit-threshold`, `--circuit-cooldown`, `--verbose`.

`tools/test_health_check.py`: 13 new unit tests (stdlib `unittest`, no network).

Testing
13 tests (req: ≥5) cover: retry-then-succeed, the exact backoff delay sequence `[0.5, 1.0, 2.0]`, no-retry-on-4xx, 5xx-is-retried, retry exhaustion, circuit open/half-open/reset, open-circuit-skips-probe, and aggregation incl. nested certificate checks. `sleep` and the clock are injected, so the suite runs in milliseconds.

Smoke test of the CLI (services not running locally → degraded, with summary + WARNING logs on stderr, clean JSON on stdout) also verified.
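The aggregation covered by the last group of tests might look something like the sketch below. The result shape is an assumption made for illustration (a dict of named entries, each with a `status` and an optional nested `checks` dict), as are the output key names, the dotted naming of nested checks, and the choice to count only `OK` as healthy. The function is deliberately named `summarize_results_sketch` to keep it distinct from the PR's `summarize_results()`.

```python
def summarize_results_sketch(results):
    """Count statuses across results, walking nested "checks" entries.

    Assumed (hypothetical) shape:
        {name: {"status": "OK" | "WARNING" | "CRITICAL", "checks": {name: {...}}}}
    """
    counts = {"OK": 0, "WARNING": 0, "CRITICAL": 0}
    degraded = []

    def walk(name, entry):
        status = entry.get("status")
        if status in counts:
            counts[status] += 1
            if status != "OK":
                degraded.append(name)
        # Recurse into nested checks, e.g. a service's certificate check.
        for child_name, child in entry.get("checks", {}).items():
            walk(f"{name}.{child_name}", child)

    for name, entry in results.items():
        walk(name, entry)

    total = sum(counts.values())
    healthy = round(100.0 * counts["OK"] / total, 1) if total else 0.0
    return {"counts": counts, "healthy_percent": healthy, "degraded": degraded}


# Usage with made-up service names.
example = {
    "api": {"status": "OK", "checks": {"certificate": {"status": "WARNING"}}},
    "db": {"status": "CRITICAL"},
}
print(summarize_results_sketch(example))
```

For the example input, the nested certificate warning is counted alongside the two top-level services, so one of three checks is healthy.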
Build diagnostic: ran `python3 build.py`; diagnostic artifacts committed:
- `diagnostic/build-b09040e4.logd` (encryptly archive, `diagnostic_logd_error: null`)
- `diagnostic/build-b09040e4.json`

`encryptly unpack diagnostic/build-b09040e4.logd <outdir> --password a30ac63ec1437388b877`

Checklist
— Contributed by Mayoka Labs