Skip to content

Bound SqlServerHealthCheck with a per-probe timeout so failures publish before the publisher's outer timeout fires#1426

Closed
jnlycklama wants to merge 1 commit into
microsoft:mainfrom
jnlycklama:users/jnlycklama/sqlserver-healthcheck-probe-timeout
Closed

Bound SqlServerHealthCheck with a per-probe timeout so failures publish before the publisher's outer timeout fires#1426
jnlycklama wants to merge 1 commit into
microsoft:mainfrom
jnlycklama:users/jnlycklama/sqlserver-healthcheck-probe-timeout

Conversation

@jnlycklama

Copy link
Copy Markdown
Member

Summary

Fixes a production scenario where /health/check returns 200 OK even though the SQL database has been deleted, because the publisher pipeline never gets a chance to publish the failure.

Root cause

  1. SqlServerHealthCheck issues a SQL query against the deleted DB.
  2. SqlClient's connect-timeout (30s) plus retry policy (ConnectRetryCount + ConnectRetryInterval) can take 70+ seconds for a single failed connection attempt.
  3. The framework's HealthCheckPublisherHostedService creates a linked CancellationTokenSource bounded by HealthCheckPublisherOptions.Timeout (default 30s).
  4. The SQL retry exhaustion consumes the entire budget; the outer token cancels mid-probe.
  5. The framework distinguishes "cancellation" from "exception": when cancellationToken.IsCancellationRequested == true, the OperationCanceledException is not converted into HealthStatus.Unhealthy. It propagates up to the hosted service, which treats the whole batch as cancelled — PublishAsync is never called.
  6. The previously cached HealthReport (Healthy) stays in ValueCache<HealthReport> indefinitely, so CachedHealthCheckMiddleware keeps returning 200.

Fix

  • New SqlServerDataStoreConfiguration.HealthCheckProbeTimeout (default 10s).
  • SqlServerHealthCheck.CheckStorageHealthAsync now wraps the SQL call in a linked CancellationTokenSource with CancelAfter(HealthCheckProbeTimeout). Failing fast inside the publisher's budget guarantees the result reaches PublishAsync.
  • Any SqlException that escapes the probe is converted to HealthCheckResult.Unhealthy(...) explicitly, with the SqlException attached so consumers see real diagnostics.
  • A probe-timeout OperationCanceledException (where the caller did not cancel) is converted to Unhealthy with a clear description.
  • Caller-initiated cancellation (the framework's outer token) is still allowed to propagate as OperationCanceledException, preserving the hosted service's "cancelled batch vs real failure" distinction.

Test plan

  • 3 new unit tests in SqlServerHealthCheckTests:
    • login-failure (SqlException 18456) → Unhealthy with the exception attached.
    • probe exceeds configured timeout → Unhealthy with timeout description.
    • caller cancels → OperationCanceledException propagates (not swallowed).
  • Existing tests still pass.
  • Build clean (0 warnings, 0 errors).
  • dotnet test src/Microsoft.Health.SqlServer.UnitTests → 100/100 (+2 skipped) on net8/9/10.

Related

Complements #1425 (ValueCache<HealthReport> expiry as a generic safety net for any publisher hang). This PR fixes the SQL-specific root cause; #1425 catches anything else that can wedge the publisher.

…ception to Unhealthy

When the SQL database is unreachable (for example, deleted in production), SqlClient's connect-timeout and retry budget can exceed 70 seconds for a single attempt. That is well beyond the framework's default HealthCheckPublisherOptions.Timeout of 30s, so the publisher's outer token cancels mid-probe, the OperationCanceledException is treated as cancellation (not a health failure), and PublishAsync is never called. The previously cached HealthReport stays cached and /health/check keeps returning 200.

This change fixes the root cause: SqlServerHealthCheck now creates a linked CancellationTokenSource bounded by a new SqlServerDataStoreConfiguration.HealthCheckProbeTimeout (default 10s). If the probe exceeds that window the underlying SQL call is cancelled and the check returns HealthStatus.Unhealthy with diagnostics. Any SqlException that escapes the probe is also converted to Unhealthy explicitly (with the SqlException attached) so the published HealthReport contains real per-check diagnostics instead of relying on framework-level exception conversion which is bypassed when the outer token is also cancelled. Caller cancellation (the framework's outer token) still propagates as OperationCanceledException so the hosted service can distinguish a cancelled batch from a real failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jnlycklama

Copy link
Copy Markdown
Member Author

Closing per design discussion. The per-probe timeout and explicit OCE conversion regress two important behaviors: (1) it hides the real SqlException from logs (the timer fires before SqlClient finishes retrying, so we never see 'Login failed for user'), and (2) it turns transient blips that today recover via SqlClient retries into published Unhealthy results. Going forward we'll rely on PR #1425 (ValueCache expiry as the safety net) plus a per-consumer bump of HealthCheckPublisherOptions.Timeout in dicom-server.

@jnlycklama jnlycklama closed this Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant