feat(health_check): retry/backoff + circuit breaker for HTTP probes [#15] #32

Open
mayoka0 wants to merge 2 commits into thanhle74:main from mayoka0:feat/health-check-retry-circuit-breaker
Conversation

mayoka0 commented Jun 22, 2026

Closes #15

Summary

The tools/health_check.py HTTP probes made a single attempt with a fixed timeout — one transient blip flipped a healthy service to CRITICAL, and a genuinely-down service was re-probed every run with no protection. This PR adds configurable retry with exponential backoff, a per-endpoint circuit breaker, result aggregation, and WARNING-level logging.

Changes

tools/health_check.py

  • Retry + exponential backoff on CRITICAL probes (connection errors / HTTP 5xx): delay = base_delay * (backoff_factor ** attempt). A reachable response (OK, or a 4xx WARNING) returns immediately — no pointless retry, since the service answered.
  • CircuitBreaker (per endpoint): opens after N consecutive failures, enters half-open after a cooldown to allow a single trial, and resets on success. While open, the probe is skipped so a known-bad service isn't hammered. The clock is injectable for deterministic testing.
  • summarize_results(): OK/WARNING/CRITICAL counts, healthy %, and a degraded list (walks nested checks like a service's certificate too). Added to the result under summary (additive — existing JSON keys unchanged).
  • WARNING logging on every retry, circuit transition, and degraded result.
  • New CLI flags: --max-retries, --base-delay, --backoff-factor, --circuit-threshold, --circuit-cooldown, --verbose.
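
A minimal sketch of how the new flags could feed the backoff formula above. The flag names come from this PR; the default values, `build_parser`, and `backoff_delays` are hypothetical and are not taken from tools/health_check.py.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in this PR (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="HTTP health probes (sketch)")
    parser.add_argument("--max-retries", type=int, default=3)
    parser.add_argument("--base-delay", type=float, default=0.5)
    parser.add_argument("--backoff-factor", type=float, default=2.0)
    parser.add_argument("--circuit-threshold", type=int, default=3)
    parser.add_argument("--circuit-cooldown", type=float, default=30.0)
    parser.add_argument("--verbose", action="store_true")
    return parser


def backoff_delays(base_delay: float, backoff_factor: float, max_retries: int) -> List[float]:
    """delay = base_delay * (backoff_factor ** attempt), one delay per retry."""
    return [base_delay * (backoff_factor ** attempt) for attempt in range(max_retries)]


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--max-retries", "3", "--base-delay", "0.5"])
print(backoff_delays(args.base_delay, args.backoff_factor, args.max_retries))
# -> [0.5, 1.0, 2.0]
```

With an assumed base delay of 0.5 s and factor of 2.0, three retries give the delay sequence the tests check for.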

tools/test_health_check.py — 13 new unit tests (stdlib unittest, no network).

Testing

$ cd tools && python3 -m unittest test_health_check
.............
Ran 13 tests in 0.006s
OK

13 tests (req: ≥5) cover: retry-then-succeed, the exact backoff delay sequence [0.5, 1.0, 2.0], no-retry-on-4xx, 5xx-is-retried, retry exhaustion, circuit open/half-open/reset, open-circuit-skips-probe, and aggregation incl. nested certificate checks. sleep and the clock are injected, so the suite runs in milliseconds.
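
The circuit transitions these tests exercise can be sketched with an injected clock, as below. This is an illustration only: the method names `allows_request`, `record_success`, and `record_failure` match the review walkthrough, but the internals, the `state` helper, and `FakeClock` are assumptions rather than the PR's code.

```python
import time
from typing import Callable, Dict


class CircuitBreaker:
    """Per-key breaker: opens after N consecutive failures, half-opens after a cooldown."""

    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if threshold < 1 or cooldown < 0:
            raise ValueError("threshold must be >= 1 and cooldown >= 0")
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def state(self, key: str) -> str:
        if key not in self._opened_at:
            return self.CLOSED
        if self._clock() - self._opened_at[key] >= self.cooldown:
            return self.HALF_OPEN
        return self.OPEN

    def allows_request(self, key: str) -> bool:
        # Simplification: with sequential probes, half-open admits one trial
        # because the next record_* call changes the state.
        return self.state(key) != self.OPEN

    def record_success(self, key: str) -> None:
        self._failures.pop(key, None)
        self._opened_at.pop(key, None)

    def record_failure(self, key: str) -> None:
        if self.state(key) == self.HALF_OPEN:
            self._opened_at[key] = self._clock()  # failed trial: restart the cooldown
            return
        self._failures[key] = self._failures.get(key, 0) + 1
        if self._failures[key] >= self.threshold:
            self._opened_at[key] = self._clock()


class FakeClock:
    """Deterministic clock for tests; advance() moves time forward."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
cb = CircuitBreaker(threshold=2, cooldown=10.0, clock=clock)
key = "localhost:8080"  # hypothetical endpoint key
cb.record_failure(key)
cb.record_failure(key)
print(cb.state(key))  # OPEN
clock.advance(10.0)
print(cb.state(key))  # HALF_OPEN
cb.record_success(key)
print(cb.state(key))  # CLOSED
```

Because the clock is a plain callable, the cooldown elapses instantly in tests, which is consistent with the suite running in milliseconds.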

Smoke test of the CLI (services not running locally → degraded, with summary + WARNING logs on stderr, clean JSON on stdout) also verified.

Build diagnostic: ran python3 build.py — diagnostic artifacts committed:

  • diagnostic/build-b09040e4.logd (encryptly archive, diagnostic_logd_error: null)
  • diagnostic/build-b09040e4.json
  • Decrypt: encryptly unpack diagnostic/build-b09040e4.logd <outdir> --password a30ac63ec1437388b877

Note: 2/10 modules built (frontend, frailbox); the other 8 failed only because their language toolchains (cargo, go, cmake, javac, ruby, luac, ghc) aren't installed in my build environment — unrelated to this change, which is Python-only. The .logd/.json are valid (commit-tagged, not build-00000000).

Checklist

  • Relevant modules affected by these changes build locally
  • Tests pass locally
  • Diagnostic build log is committed in this PR
  • Documentation has been updated, if applicable
  • Configuration or schema changes are documented, if applicable
  • No generated build artifacts are committed, except the required diagnostic build log
  • Changes are scoped to the PR purpose and avoid unrelated cleanup
  • Security, privacy, and error-handling implications have been considered

  • I would like to request that my diagnostic build log is removed before merging

— Contributed by Mayoka Labs

Summary by CodeRabbit

  • New Features

    • Health check tool now supports automatic retries with exponential backoff for failed probes.
    • Added circuit breaker functionality to prevent repeated probing of consistently failing services.
    • Enhanced health check reporting with summary statistics and degradation indicators.
    • New configuration options for tuning retry and circuit breaker behavior.
  • Tests

    • Comprehensive test suite added for health check functionality.

mayoka0 added 2 commits June 22, 2026 11:08
The HTTP health probes made a single attempt with a fixed timeout — one
transient blip flipped a healthy service to CRITICAL, and a genuinely
down service was re-probed on every run with no protection.

tools/health_check.py:
- Retry with exponential backoff on CRITICAL probes (connection errors /
  HTTP 5xx). delay = base_delay * (backoff_factor ** attempt). A reachable
  response (OK or a 4xx WARNING) returns immediately — no pointless retry.
- CircuitBreaker (per endpoint): opens after N consecutive failures,
  enters half-open after a cooldown to allow one trial, and resets on
  success. While open, the probe is skipped so a known-bad service is not
  hammered. Clock is injectable for testing.
- summarize_results(): OK/WARNING/CRITICAL counts, healthy percentage, and
  a degraded list (walks nested checks like certificate status too).
- WARNING-level logging on every retry, circuit transition, and degraded
  result.
- New CLI flags: --max-retries, --base-delay, --backoff-factor,
  --circuit-threshold, --circuit-cooldown, --verbose.

tools/test_health_check.py:
- 13 unit tests (stdlib unittest, no network): retry-then-succeed, the
  exact backoff delay sequence, no-retry-on-4xx, 5xx retried, retry
  exhaustion, circuit open/half-open/reset, open-circuit skips the probe,
  and aggregation incl. nested certificate checks. Sleep and clock are
  injected, so the suite runs in milliseconds.

Existing flags, output format, and JSON schema are preserved (the summary
block is additive).

coderabbitai Bot commented Jun 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 426d9d12-7660-4d02-9320-d70d784d800c

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and cd1b7e3.

📒 Files selected for processing (4)
  • diagnostic/build-b09040e4.json
  • diagnostic/build-b09040e4.logd
  • tools/health_check.py
  • tools/test_health_check.py

📝 Walkthrough

Adds HTTP retry with exponential backoff and a per-endpoint CircuitBreaker class to tools/health_check.py, along with a summarize_results aggregator, extended print_health_report output, new CLI flags for tuning, and a unittest suite. A generated diagnostic JSON artifact recording build results for commit b09040e4 is also included.

Changes

Health Check Retry, Circuit Breaker, and Summarization

| Layer / File(s) | Summary |
| --- | --- |
| Tuning constants and CircuitBreaker class (tools/health_check.py) | Adds DEFAULT_MAX_RETRIES, DEFAULT_BASE_DELAY, DEFAULT_BACKOFF_FACTOR, DEFAULT_CIRCUIT_THRESHOLD, DEFAULT_CIRCUIT_COOLDOWN constants and the CircuitBreaker class with per-key CLOSED/OPEN/HALF_OPEN state tracking, allows_request, record_success, record_failure, and input validation. |
| check_http_service with retry/backoff/circuit-breaker (tools/health_check.py) | Replaces the single-attempt HTTP probe with a retry loop that short-circuits when the breaker is OPEN, retries only CRITICAL outcomes up to max_retries times with exponential backoff (base_delay * backoff_factor**attempt), resets the breaker on OK/WARNING, and records a failure after retry exhaustion. |
| run_health_checks wiring, summarize_results, print_health_report, and CLI (tools/health_check.py) | Updates run_health_checks to forward probe-tuning kwargs; adds summarize_results to aggregate OK/WARNING/CRITICAL counts and build a degraded list; extends print_health_report to print summary and degraded entries; adds --max-retries, --backoff-factor, --circuit-threshold, --circuit-cooldown, and --verbose CLI flags. |
| Unittest suite (tools/test_health_check.py) | Adds _SleepRecorder, _FakeClock, and a probe factory helper, then tests retry delay sequences and exhaustion semantics, circuit-breaker state transitions (open/half-open/closed/reset), circuit-open short-circuiting, and summarize_results aggregation including nested certificate checks. |
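
As an illustration of the summarize_results aggregation described above, here is a minimal sketch. The result shape (a `status` field plus an optional nested `checks` mapping), the output key names, and the definition of healthy percentage as OK results over all results are assumptions; the PR does not show the actual schema.

```python
from typing import Any, Dict, List, Tuple


def _walk(name: str, result: Dict[str, Any], out: List[Tuple[str, str]]) -> None:
    """Collect (dotted name, status) pairs, descending into nested checks."""
    status = result.get("status")
    if status is not None:
        out.append((name, status))
    for child_name, child in result.get("checks", {}).items():
        _walk(f"{name}.{child_name}", child, out)


def summarize_results(results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Count OK/WARNING/CRITICAL results and list every non-OK entry."""
    flat: List[Tuple[str, str]] = []
    for name, result in results.items():
        _walk(name, result, flat)

    counts = {"OK": 0, "WARNING": 0, "CRITICAL": 0}
    degraded: List[str] = []
    for name, status in flat:
        counts[status] = counts.get(status, 0) + 1
        if status != "OK":
            degraded.append(name)

    total = len(flat)
    healthy = round(100.0 * counts["OK"] / total, 1) if total else 0.0
    return {"counts": counts, "total": total,
            "healthy_percent": healthy, "degraded": degraded}


# Hypothetical results: one service with a nested certificate check, one down.
results = {
    "api": {"status": "OK", "checks": {"certificate": {"status": "WARNING"}}},
    "db": {"status": "CRITICAL"},
}
summary = summarize_results(results)
print(summary["counts"])    # {'OK': 1, 'WARNING': 1, 'CRITICAL': 1}
print(summary["degraded"])  # ['api.certificate', 'db']
```

Walking nested entries is what lets a WARNING on a service's certificate appear in the degraded list even when the service itself reports OK.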

Diagnostic Build Artifact

| Layer / File(s) | Summary |
| --- | --- |
| Generated diagnostic JSON (diagnostic/build-b09040e4.json) | Adds the generated metadata file for commit b09040e4 recording generation time, decrypt parameters, pass/fail counts (2 passing, 8 failing), per-module status and captured output, and a pr_note about the encrypted .logd artifact. |

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI / main()
  participant run_health_checks
  participant check_http_service
  participant CircuitBreaker
  participant HTTPEndpoint

  CLI->>run_health_checks: run_health_checks(..., circuit_breaker=cb, max_retries=N, ...)
  run_health_checks->>check_http_service: check_http_service(host, port, ..., circuit_breaker=cb)
  check_http_service->>CircuitBreaker: allows_request(key)
  alt Circuit OPEN
    CircuitBreaker-->>check_http_service: False → CRITICAL "circuit open"
  else Circuit CLOSED/HALF_OPEN
    loop attempt 0..max_retries
      check_http_service->>HTTPEndpoint: HTTP GET probe
      HTTPEndpoint-->>check_http_service: status / error
      alt Non-CRITICAL (OK or 4xx)
        check_http_service->>CircuitBreaker: record_success(key)
        check_http_service-->>run_health_checks: OK / WARNING
      else CRITICAL (5xx / conn error)
        check_http_service->>check_http_service: sleep(base_delay * backoff_factor^attempt)
      end
    end
    check_http_service->>CircuitBreaker: record_failure(key)
    check_http_service-->>run_health_checks: CRITICAL
  end
  run_health_checks->>run_health_checks: summarize_results(results)
  run_health_checks-->>CLI: results with summary + degraded list
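
The control flow in the diagram above can be sketched roughly as follows. `probe_with_retry` and `RecordingBreaker` are hypothetical names used for illustration; the sketch also assumes no sleep after the final attempt, which is consistent with the tested delay sequence but is not confirmed by the PR text.

```python
import time
from typing import Callable, Dict, List


def probe_with_retry(key: str, probe: Callable[[], Dict[str, str]], breaker,
                     max_retries: int = 3, base_delay: float = 0.5,
                     backoff_factor: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Run probe() with exponential backoff, guarded by a circuit breaker."""
    if not breaker.allows_request(key):
        # Circuit is open: skip the probe entirely.
        return {"status": "CRITICAL", "message": "circuit open"}

    result = {"status": "CRITICAL", "message": "no attempt made"}
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        result = probe()
        if result["status"] != "CRITICAL":
            # The service answered (OK, or a 4xx WARNING): no retry needed.
            breaker.record_success(key)
            return result
        if attempt < max_retries:
            sleep(base_delay * (backoff_factor ** attempt))

    breaker.record_failure(key)  # retries exhausted
    return result


class RecordingBreaker:
    """Stand-in breaker that always allows requests and records outcomes."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def allows_request(self, key: str) -> bool:
        return True

    def record_success(self, key: str) -> None:
        self.events.append("success")

    def record_failure(self, key: str) -> None:
        self.events.append("failure")


# Two failures followed by a success; sleep is replaced by a list recorder.
outcomes = iter(["CRITICAL", "CRITICAL", "OK"])
delays: List[float] = []
breaker = RecordingBreaker()
result = probe_with_retry("localhost:8080", lambda: {"status": next(outcomes)},
                          breaker, sleep=delays.append)
print(result["status"], delays, breaker.events)
# -> OK [0.5, 1.0] ['success']
```

Injecting `sleep` keeps the example instantaneous and lets a test assert on the exact delays requested.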

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #15 [$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes — This PR directly implements all acceptance criteria: --max-retries, --backoff-factor, --circuit-threshold CLI flags; exponential backoff formula; circuit breaker with cooldown; summarize_results aggregation; python3 build.py diagnostic artifact; and more than 5 unit tests covering retry/backoff/circuit-breaker logic.

Poem

🐇 Hop, hop — the circuit's closed today,
With retries bouncing foes away,
Backoff blooms like morning dew,
Half-open skies let trials through,
When failures stack, the breaker fires —
But one small win and hope re-inspires!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.67%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main feature addition (retry/backoff + circuit breaker) for HTTP probes, and references the linked issue #15. |
| Description check | ✅ Passed | The PR description is comprehensive, covering summary, detailed changes, testing with actual test counts, CLI flags, and all checklist items from the template. |
| Linked Issues check | ✅ Passed | The PR fully satisfies all acceptance criteria: CLI flags (--max-retries, --backoff-factor, --circuit-threshold plus additional flags), exponential backoff formula, circuit breaker with cooldown, valid diagnostic artifacts, and 13 unit tests (exceeds minimum of 5). |
| Out of Scope Changes check | ✅ Passed | All changes directly support the PR objectives: retry/backoff and circuit breaker enhancements, result aggregation, logging, CLI flags, and corresponding unit tests. The diagnostic build files are required artifacts. |

Development

Successfully merging this pull request may close these issues.

[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes
