fix(service): retry audit database ping with exponential backoff during bootstrap#31

Merged
alexgarzao merged 2 commits into develop from fix/audit-db-connect-retry
May 15, 2026

@alexgarzao
Collaborator

Summary

AuditDatabaseManager.Connect does a single pool.Ping with no retry. Any transient reset during bootstrap — Postgres startup race, RDS failover, planned maintenance restart, brief network blip — killed the service outright. The CI testcontainer fix in #30 worked around this for tests by waiting until Postgres was fully ready, but the production-side fragility was untouched and would surface on the next RDS failover.

This PR adds pingWithRetry with exponential backoff: 5 attempts, 200ms initial → 2s cap (~6s total wall clock worst case). Respects the parent ctx so the existing 30-second Connect timeout still bounds everything.

Why this is a separate concern from #30

#30 fixed the testcontainer race: the test was telling the app to connect before Postgres finished its init script. Correct fix — the test should wait for readiness before driving the app.

This PR fixes the production resilience gap: even with a well-behaved DB, any transient reset during bootstrap drops the service. These are two separate problems: #30 hid the second one in CI, and this PR addresses it directly.

Design

const (
    pingMaxAttempts    = 5
    pingInitialBackoff = 200 * time.Millisecond
    pingMaxBackoff     = 2 * time.Second
)

type pinger interface { Ping(ctx context.Context) error }

func pingWithRetry(ctx, p, initialBackoff, maxBackoff, maxAttempts) error
  • *pgxpool.Pool implements pinger implicitly.
  • Parameters exposed so unit tests run in microseconds; production callers use the package-level defaults via Connect.
  • On final failure: "failed to ping audit database: after 5 attempts: <last error>" — operator sees the retry count.

What stayed untouched

  • AuditDatabaseManager.Ping (line 263, used by readyz probes): single-shot. A runtime health check should report current state, not lie via retry.
  • Migrations: out of scope. Migration failures are usually real bugs, not transient.
  • Disconnect path.

Test plan

  • TestPingWithRetry_SucceedsOnFirstAttempt — happy path, 1 attempt.
  • TestPingWithRetry_SucceedsAfterTransientFailures — 3 fails + 1 success, 4 attempts total.
  • TestPingWithRetry_GivesUpAfterMaxAttempts — always fail, asserts wrapped sentinel + 5 attempts.
  • TestPingWithRetry_RespectsContextCancellation — pre-canceled ctx; asserts that the returned error wraps context.Canceled and that exactly 1 attempt ran before the cancellation was detected between retries.
  • TestPingWithRetry_MaxAttemptsBelowOneIsCoercedToOne — defensive guard.
  • Full bootstrap package: 173 tests pass (no regressions on touched function).
  • make lint: zero new findings.

Reviewer checklist

  1. Verify the retry budget (~6s worst case) fits comfortably under the 30s Connect ctx deadline.
  2. Confirm the pinger interface extraction is the minimal surface we want to expose.
  3. Sanity-check that AuditDatabaseManager.Ping (runtime probe) is correctly left as single-shot — the intentional asymmetry is the right call, but is worth a second pair of eyes.

…ng bootstrap

AuditDatabaseManager.Connect did a single pool.Ping with no retry. Any
transient reset during bootstrap — Postgres startup race in testcontainers,
RDS failover, planned maintenance restart, network blip — killed the
service outright. The fix from #30 hid this in CI by waiting until Postgres
was fully ready before tests poll, but the production-side fragility was
untouched.

Adds pingWithRetry: up to 5 attempts with exponential backoff (200ms → 2s
cap, ~6s total). Respects the parent ctx so the existing 30s Connect
deadline still bounds everything. Parameterized for microsecond-fast
unit tests.

Connect now uses pingWithRetry instead of a single Ping. The error message
on final failure carries the attempt count, so operators see "after 5
attempts: <last error>" instead of just the lone reset.

Out of scope: AuditDatabaseManager.Ping (line 263) used by readyz probes
stays single-shot — a runtime health check should report current state,
not retry. Disconnect path also untouched.
@coderabbitai

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d2bcd83e-cd9b-455c-ac2f-ea966984f18d

📥 Commits

Reviewing files that changed from the base of the PR and between 1f93fa4 and 937ac52.

📒 Files selected for processing (2)
  • internal/bootstrap/audit_database.go
  • internal/bootstrap/audit_database_test.go

Walkthrough

Replaces a single DB ping with a retrying ping (pingWithRetry) that uses exponential backoff, respects context cancellation, and is integrated into AuditDatabaseManager.Connect; adds tests covering success, transient recovery, exhaustion, cancellation, and maxAttempts coercion.

Changes

Database Connection Ping Resilience

Layer: Retry infrastructure and integration
File(s): internal/bootstrap/audit_database.go
Summary: Adds package-level retry/backoff constants (pingMaxAttempts, pingInitialBackoff, pingMaxBackoff), a pinger interface, and pingWithRetry implementing exponential backoff with context-aware cancellation; AuditDatabaseManager.Connect now calls pingWithRetry instead of pool.Ping.

Layer: Ping retry test suite
File(s): internal/bootstrap/audit_database_test.go
Summary: Adds stubPinger test double and tests validating immediate success, recovery after transient failures, giving up after max attempts (error wrapping), respecting context cancellation, and coercion of maxAttempts < 1 to one.

@lerian-studio
Contributor

lerian-studio commented May 15, 2026

📊 Unit Test Coverage Report: flowker

Metric Value
Overall Coverage 40.7% ⚠️ BELOW THRESHOLD
Threshold 85%

Coverage by Package

Package Coverage
github.com/LerianStudio/flowker/api 100.0%
github.com/LerianStudio/flowker/internal/adapters/http/in/audit 0.0%
github.com/LerianStudio/flowker/internal/adapters/http/in/catalog 60.9%
github.com/LerianStudio/flowker/internal/adapters/http/in/dashboard 91.5%
github.com/LerianStudio/flowker/internal/adapters/http/in/execution 81.0%
github.com/LerianStudio/flowker/internal/adapters/http/in/executor_configuration 72.0%
github.com/LerianStudio/flowker/internal/adapters/http/in/health 100.0%
github.com/LerianStudio/flowker/internal/adapters/http/in/middleware/testutil 0.0%
github.com/LerianStudio/flowker/internal/adapters/http/in/middleware 44.4%
github.com/LerianStudio/flowker/internal/adapters/http/in/provider_configuration 21.3%
github.com/LerianStudio/flowker/internal/adapters/http/in/readyz 81.3%
github.com/LerianStudio/flowker/internal/adapters/http/in/webhook 87.5%
github.com/LerianStudio/flowker/internal/adapters/http/in/workflow 69.7%
github.com/LerianStudio/flowker/internal/adapters/http/in 33.3%
github.com/LerianStudio/flowker/internal/adapters/mongodb/dashboard 43.0%
github.com/LerianStudio/flowker/internal/adapters/mongodb/execution 64.8%
github.com/LerianStudio/flowker/internal/adapters/mongodb/executor_configuration 36.6%
github.com/LerianStudio/flowker/internal/adapters/mongodb/provider_configuration 45.8%
github.com/LerianStudio/flowker/internal/adapters/mongodb/workflow 69.2%
github.com/LerianStudio/flowker/internal/adapters/postgresql/audit 42.2%
github.com/LerianStudio/flowker/internal/bootstrap 40.5%
github.com/LerianStudio/flowker/internal/services/command 27.3%
github.com/LerianStudio/flowker/internal/services/query 11.2%
github.com/LerianStudio/flowker/internal/services 3.2%
github.com/LerianStudio/flowker/internal/testutil 63.6%
github.com/LerianStudio/flowker/pkg/circuitbreaker 97.6%
github.com/LerianStudio/flowker/pkg/clock 0.0%
github.com/LerianStudio/flowker/pkg/condition 87.9%
github.com/LerianStudio/flowker/pkg/contextutil 0.0%
github.com/LerianStudio/flowker/pkg/executor/base 20.0%
github.com/LerianStudio/flowker/pkg/executor 39.5%
github.com/LerianStudio/flowker/pkg/executors/http/auth 59.5%
github.com/LerianStudio/flowker/pkg/executors/http 0.0%
github.com/LerianStudio/flowker/pkg/executors/midaz 76.4%
github.com/LerianStudio/flowker/pkg/executors/s3 0.0%
github.com/LerianStudio/flowker/pkg/executors/tracer 72.2%
github.com/LerianStudio/flowker/pkg/executors 57.1%
github.com/LerianStudio/flowker/pkg/model 57.7%
github.com/LerianStudio/flowker/pkg/net/http 91.3%
github.com/LerianStudio/flowker/pkg/pagination 0.0%
github.com/LerianStudio/flowker/pkg/templates/tracer_midaz 81.5%
github.com/LerianStudio/flowker/pkg/templates 0.0%
github.com/LerianStudio/flowker/pkg/transformation 67.2%
github.com/LerianStudio/flowker/pkg/triggers/webhook 0.0%
github.com/LerianStudio/flowker/pkg/triggers 0.0%
github.com/LerianStudio/flowker/pkg/webhook 100.0%
github.com/LerianStudio/flowker/pkg 90.3%

Generated by Go PR Analysis workflow

@lerian-studio
Contributor

lerian-studio commented May 15, 2026

🔒 Security Scan Results — flowker

Trivy

Filesystem Scan

✅ No vulnerabilities or secrets found.

Docker Image Scan

✅ No vulnerabilities found.


Docker Hub Health Score Compliance

✅ Policies — 4/4 met

Policy Status
Default non-root user ✅ Passed
No fixable critical/high CVEs ✅ Passed
No high-profile vulnerabilities ✅ Passed
No AGPL v3 licenses ✅ Passed

Pre-release Version Check

✅ No unstable version pins found.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/bootstrap/audit_database_test.go`:
- Around line 7-12: Tests in audit_database_test.go should be refactored to use
stretchr/testify assertions and a gomock-generated mock for the pinger interface
instead of the current stubPinger and t.Fatal/t.Fatalf calls; replace manual
asserts with require/assert (e.g., require.NoError, require.Equal) and replace
stubPinger usage with a gomock controller and the generated MockPinger (create
ctrl := gomock.NewController(t); defer ctrl.Finish(); mockPinger :=
NewMockPinger(ctrl)), set EXPECT() on mockPinger for Ping (and any other
methods) to return the desired results, update imports to include
"github.com/stretchr/testify/require" and "github.com/golang/mock/gomock" (and
the generated mock package), and remove the stubPinger implementation from the
test file.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a4e34a6-2bc0-415e-81e4-d297a8be1020

📥 Commits

Reviewing files that changed from the base of the PR and between 1ab9715 and 1f93fa4.

📒 Files selected for processing (2)
  • internal/bootstrap/audit_database.go
  • internal/bootstrap/audit_database_test.go

Comment thread internal/bootstrap/audit_database_test.go
…ndant else branch

Two small follow-ups from PR #31 review:

- Tests now use stretchr/testify/require, matching project convention
  (CLAUDE.md: "Framework: stretchr/testify for assertions"). require.NoError,
  require.ErrorIs, require.Equal give better failure messages than the
  hand-rolled t.Fatalf calls and keep the file consistent with the rest of
  internal/bootstrap.

- CodeRabbit's suggestion to swap stubPinger for a gomock-generated mock was
  intentionally not adopted: pinger is a single-method interface tested in
  five tight scenarios; a 6-line hand-rolled stub is clearer and adds zero
  codegen overhead. Added a comment explaining the trade-off so future
  reviewers see the reasoning without re-litigating it.

- Cleaned up the redundant `else` branch after `return nil` in
  pingWithRetry. Cosmetic, but matches the Go idiom (revive's
  indent-error-flow).
@alexgarzao alexgarzao merged commit 6e20844 into develop May 15, 2026
20 checks passed