fix(service): retry audit database ping with exponential backoff during bootstrap#31
Conversation
…ng bootstrap AuditDatabaseManager.Connect did a single pool.Ping with no retry. Any transient reset during bootstrap — Postgres startup race in testcontainers, RDS failover, planned maintenance restart, network blip — killed the service outright. The fix from #30 hid this in CI by waiting until Postgres was fully ready before tests poll, but the production-side fragility was untouched. Adds pingWithRetry: up to 5 attempts with exponential backoff (200ms → 2s cap, ~6s total). Respects the parent ctx so the existing 30s Connect deadline still bounds everything. Parameterized for microsecond-fast unit tests. Connect now uses pingWithRetry instead of a single Ping. The error message on final failure carries the attempt count, so operators see "after 5 attempts: <last error>" instead of just the lone reset. Out of scope: AuditDatabaseManager.Ping (line 263) used by readyz probes stays single-shot — a runtime health check should report current state, not retry. Disconnect path also untouched.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughReplaces a single DB ping with a retrying ping (pingWithRetry) that uses exponential backoff, respects context cancellation, and is integrated into AuditDatabaseManager.Connect; adds tests covering success, transient recovery, exhaustion, cancellation, and maxAttempts coercion. ChangesDatabase Connection Ping Resilience
Comment |
📊 Unit Test Coverage Report:
|
| Metric | Value |
|---|---|
| Overall Coverage | 40.7% |
| Threshold | 85% |
Coverage by Package
| Package | Coverage |
|---|---|
github.com/LerianStudio/flowker/api |
100.0% |
github.com/LerianStudio/flowker/internal/adapters/http/in/audit |
0.0% |
github.com/LerianStudio/flowker/internal/adapters/http/in/catalog |
60.9% |
github.com/LerianStudio/flowker/internal/adapters/http/in/dashboard |
91.5% |
github.com/LerianStudio/flowker/internal/adapters/http/in/execution |
81.0% |
github.com/LerianStudio/flowker/internal/adapters/http/in/executor_configuration |
72.0% |
github.com/LerianStudio/flowker/internal/adapters/http/in/health |
100.0% |
github.com/LerianStudio/flowker/internal/adapters/http/in/middleware/testutil |
0.0% |
github.com/LerianStudio/flowker/internal/adapters/http/in/middleware |
44.4% |
github.com/LerianStudio/flowker/internal/adapters/http/in/provider_configuration |
21.3% |
github.com/LerianStudio/flowker/internal/adapters/http/in/readyz |
81.3% |
github.com/LerianStudio/flowker/internal/adapters/http/in/webhook |
87.5% |
github.com/LerianStudio/flowker/internal/adapters/http/in/workflow |
69.7% |
github.com/LerianStudio/flowker/internal/adapters/http/in |
33.3% |
github.com/LerianStudio/flowker/internal/adapters/mongodb/dashboard |
43.0% |
github.com/LerianStudio/flowker/internal/adapters/mongodb/execution |
64.8% |
github.com/LerianStudio/flowker/internal/adapters/mongodb/executor_configuration |
36.6% |
github.com/LerianStudio/flowker/internal/adapters/mongodb/provider_configuration |
45.8% |
github.com/LerianStudio/flowker/internal/adapters/mongodb/workflow |
69.2% |
github.com/LerianStudio/flowker/internal/adapters/postgresql/audit |
42.2% |
github.com/LerianStudio/flowker/internal/bootstrap |
40.5% |
github.com/LerianStudio/flowker/internal/services/command |
27.3% |
github.com/LerianStudio/flowker/internal/services/query |
11.2% |
github.com/LerianStudio/flowker/internal/services |
3.2% |
github.com/LerianStudio/flowker/internal/testutil |
63.6% |
github.com/LerianStudio/flowker/pkg/circuitbreaker |
97.6% |
github.com/LerianStudio/flowker/pkg/clock |
0.0% |
github.com/LerianStudio/flowker/pkg/condition |
87.9% |
github.com/LerianStudio/flowker/pkg/contextutil |
0.0% |
github.com/LerianStudio/flowker/pkg/executor/base |
20.0% |
github.com/LerianStudio/flowker/pkg/executor |
39.5% |
github.com/LerianStudio/flowker/pkg/executors/http/auth |
59.5% |
github.com/LerianStudio/flowker/pkg/executors/http |
0.0% |
github.com/LerianStudio/flowker/pkg/executors/midaz |
76.4% |
github.com/LerianStudio/flowker/pkg/executors/s3 |
0.0% |
github.com/LerianStudio/flowker/pkg/executors/tracer |
72.2% |
github.com/LerianStudio/flowker/pkg/executors |
57.1% |
github.com/LerianStudio/flowker/pkg/model |
57.7% |
github.com/LerianStudio/flowker/pkg/net/http |
91.3% |
github.com/LerianStudio/flowker/pkg/pagination |
0.0% |
github.com/LerianStudio/flowker/pkg/templates/tracer_midaz |
81.5% |
github.com/LerianStudio/flowker/pkg/templates |
0.0% |
github.com/LerianStudio/flowker/pkg/transformation |
67.2% |
github.com/LerianStudio/flowker/pkg/triggers/webhook |
0.0% |
github.com/LerianStudio/flowker/pkg/triggers |
0.0% |
github.com/LerianStudio/flowker/pkg/webhook |
100.0% |
github.com/LerianStudio/flowker/pkg |
90.3% |
Generated by Go PR Analysis workflow
🔒 Security Scan Results —
|
| Policy | Status |
|---|---|
| Default non-root user | ✅ Passed |
| No fixable critical/high CVEs | ✅ Passed |
| No high-profile vulnerabilities | ✅ Passed |
| No AGPL v3 licenses | ✅ Passed |
Pre-release Version Check
✅ No unstable version pins found.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/bootstrap/audit_database_test.go`:
- Around line 7-12: Tests in audit_database_test.go should be refactored to use
stretchr/testify assertions and a gomock-generated mock for the pinger interface
instead of the current stubPinger and t.Fatal/t.Fatalf calls; replace manual
asserts with require/expect (e.g., require.NoError, require.Equal) and replace
stubPinger usage with a gomock controller and the generated MockPinger (create
ctrl := gomock.NewController(t); defer ctrl.Finish(); mockPinger :=
NewMockPinger(ctrl)), set EXPECT() on mockPinger for PingContext (and any other
methods) to return the desired results, update imports to include
"github.com/stretchr/testify/require" and "github.com/golang/mock/gomock" (and
the generated mock package), and remove the stubPinger implementation from the
test file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7a4e34a6-2bc0-415e-81e4-d297a8be1020
📒 Files selected for processing (2)
internal/bootstrap/audit_database.gointernal/bootstrap/audit_database_test.go
…ndant else branch Two small follow-ups from PR #31 review: - Tests now use stretchr/testify/require, matching project convention (CLAUDE.md: "Framework: stretchr/testify for assertions"). require.NoError, require.ErrorIs, require.Equal give better failure messages than the hand-rolled t.Fatalf calls and keep the file consistent with the rest of internal/bootstrap. - CodeRabbit's suggestion to swap stubPinger for a gomock-generated mock was intentionally not adopted: pinger is a single-method interface tested in five tight scenarios; a 6-line hand-rolled stub is clearer and adds zero codegen overhead. Added a comment explaining the trade-off so future reviewers see the reasoning without re-litigating it. - Cleaned up the redundant `else` branch after `return nil` in pingWithRetry. Cosmetic, but matches the Go idiom (revive's indent-error-flow).
Summary
AuditDatabaseManager.Connectdoes a singlepool.Pingwith no retry. Any transient reset during bootstrap — Postgres startup race, RDS failover, planned maintenance restart, brief network blip — killed the service outright. The CI testcontainer fix in #30 worked around this for tests by waiting until Postgres was fully ready, but the production-side fragility was untouched and would surface on the next RDS failover.This PR adds
pingWithRetrywith exponential backoff: 5 attempts, 200ms initial → 2s cap (~6s total wall clock worst case). Respects the parentctxso the existing 30-secondConnecttimeout still bounds everything.Why this is a separate concern from #30
#30 fixed the testcontainer race: the test was telling the app to connect before Postgres finished its init script. Correct fix — the test should wait for readiness before driving the app.
This PR fixes the production resilience gap: even with a well-behaved DB, any transient reset during bootstrap drops the service. Two separate problems; #30 hid #2 in CI; this PR addresses #2 directly.
Design
*pgxpool.Poolimplementspingerimplicitly.Connect."failed to ping audit database: after 5 attempts: <last error>"— operator sees the retry count.What stayed untouched
AuditDatabaseManager.Ping(line 263, used byreadyzprobes): single-shot. A runtime health check should report current state, not lie via retry.Disconnectpath.Test plan
TestPingWithRetry_SucceedsOnFirstAttempt— happy path, 1 attempt.TestPingWithRetry_SucceedsAfterTransientFailures— 3 fails + 1 success, 4 attempts total.TestPingWithRetry_GivesUpAfterMaxAttempts— always fail, asserts wrapped sentinel + 5 attempts.TestPingWithRetry_RespectsContextCancellation— pre-canceled ctx, assertscontext.Canceledwrapped + 1 attempt before ctx caught the cancellation between retries.TestPingWithRetry_MaxAttemptsBelowOneIsCoercedToOne— defensive guard.make lint: zero new findings.Reviewer checklist
Connectctx deadline.pingerinterface extraction is the minimal surface we want to expose.AuditDatabaseManager.Ping(runtime probe) is correctly left as single-shot — the intentional asymmetry is the right call, but is worth a second pair of eyes.