Bound SqlServerHealthCheck with a per-probe timeout so failures publish before the publisher's outer timeout fires by jnlycklama · Pull Request #1426 · microsoft/healthcare-shared-components

jnlycklama · 2026-06-15T23:37:11Z

Summary

Fixes a production scenario where /health/check returns 200 OK even though the SQL database has been deleted, because the publisher pipeline never gets a chance to publish the failure.

Root cause

SqlServerHealthCheck issues a SQL query against the deleted DB.
SqlClient's connect-timeout (30s) plus retry policy (ConnectRetryCount + ConnectRetryInterval) can take 70+ seconds for a single failed connection attempt.
The framework's HealthCheckPublisherHostedService creates a linked CancellationTokenSource bounded by HealthCheckPublisherOptions.Timeout (default 30s).
The SQL retry exhaustion consumes the entire budget; the outer token cancels mid-probe.
The framework distinguishes "cancellation" from "exception": when cancellationToken.IsCancellationRequested == true, the OperationCanceledException is not converted into HealthStatus.Unhealthy. It propagates up to the hosted service, which treats the whole batch as cancelled, so PublishAsync is never called.
The previously cached HealthReport (Healthy) stays in ValueCache<HealthReport> indefinitely, so CachedHealthCheckMiddleware keeps returning 200.
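
The sequence above can be modelled in a few lines. This is a simplified sketch using only the BCL, not the framework's actual source: `PublisherSimulation`, `RunBatchAsync`, and the `published` flag are hypothetical names, `Task.Delay` stands in for the slow SQL probe, and setting the flag stands in for PublishAsync.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PublisherSimulation
{
    // Returns true if the publish step ran, false if the outer timeout
    // cancelled the batch before anything could be published.
    public static async Task<bool> RunBatchAsync(TimeSpan probeDuration, TimeSpan publisherTimeout)
    {
        // Stand-in for the publisher's outer token bounded by its Timeout option.
        using var outer = new CancellationTokenSource(publisherTimeout);
        bool published = false;

        try
        {
            // Hypothetical slow probe: takes probeDuration unless cancelled first.
            await Task.Delay(probeDuration, outer.Token).ConfigureAwait(false);

            // Stand-in for PublishAsync: only reached if the probe finished in time.
            published = true;
        }
        catch (OperationCanceledException) when (outer.IsCancellationRequested)
        {
            // The batch is treated as cancelled, so nothing is published and
            // any previously cached result would remain in place.
        }

        return published;
    }
}
```

With a probe that outlasts the publisher budget (for example, 5s against 50ms) the method returns false, which corresponds to the stale Healthy report staying in the cache.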

Fix

New SqlServerDataStoreConfiguration.HealthCheckProbeTimeout (default 10s).
SqlServerHealthCheck.CheckStorageHealthAsync now wraps the SQL call in a linked CancellationTokenSource with CancelAfter(HealthCheckProbeTimeout). Failing fast inside the publisher's budget guarantees the result reaches PublishAsync.
Any SqlException that escapes the probe is converted to HealthCheckResult.Unhealthy(...) explicitly, with the SqlException attached so consumers see real diagnostics.
A probe-timeout OperationCanceledException (where the caller did not cancel) is converted to Unhealthy with a clear description.
Caller-initiated cancellation (the framework's outer token) is still allowed to propagate as OperationCanceledException, preserving the hosted service's "cancelled batch vs real failure" distinction.
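
A minimal sketch of the pattern described in this list, assuming nothing beyond the BCL. It is not the code from this PR: `BoundedProbe`, `ProbeResult`, and `ProbeStatus` are hypothetical stand-ins for SqlServerHealthCheck and HealthCheckResult, and the generic exception branch stands in for the SqlException conversion.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public enum ProbeStatus { Healthy, Unhealthy }

// Hypothetical stand-in for HealthCheckResult.
public sealed record ProbeResult(ProbeStatus Status, string Description, Exception? Error = null);

public static class BoundedProbe
{
    // Runs the probe under a per-probe timeout that is linked to the caller's token.
    public static async Task<ProbeResult> RunAsync(
        Func<CancellationToken, Task> probe,
        TimeSpan probeTimeout,
        CancellationToken callerToken)
    {
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        linked.CancelAfter(probeTimeout);

        try
        {
            await probe(linked.Token).ConfigureAwait(false);
            return new ProbeResult(ProbeStatus.Healthy, "Probe succeeded.");
        }
        catch (OperationCanceledException) when (!callerToken.IsCancellationRequested)
        {
            // Only the probe timer fired: report a failure rather than a cancellation.
            return new ProbeResult(
                ProbeStatus.Unhealthy,
                $"Probe exceeded the {probeTimeout.TotalSeconds:0.#}s timeout.");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Stand-in for the SqlException branch: keep the exception for diagnostics.
            // (The PR describes converting SqlException specifically.)
            return new ProbeResult(ProbeStatus.Unhealthy, "Probe failed.", ex);
        }
        // A caller-initiated OperationCanceledException matches neither filter
        // and propagates to the caller unchanged.
    }
}
```

The two exception filters are what keep the three outcomes apart: checking the caller's token, rather than the linked one, is how a probe timeout is told apart from an outer cancellation.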

Test plan

3 new unit tests in SqlServerHealthCheckTests:
- login-failure (SqlException 18456) → Unhealthy with the exception attached.
- probe exceeds configured timeout → Unhealthy with timeout description.
- caller cancels → OperationCanceledException propagates (not swallowed).
Existing tests still pass.
Build clean (0 warnings, 0 errors).
dotnet test src/Microsoft.Health.SqlServer.UnitTests → 100/100 (+2 skipped) on net8/9/10.

…ception to Unhealthy

When the SQL database is unreachable (for example, deleted in production), SqlClient's connect-timeout and retry budget can exceed 70 seconds for a single attempt. That is well beyond the framework's default HealthCheckPublisherOptions.Timeout of 30s, so the publisher's outer token cancels mid-probe, the OperationCanceledException is treated as cancellation (not a health failure), and PublishAsync is never called. The previously cached HealthReport stays cached and /health/check keeps returning 200.

This change fixes the root cause: SqlServerHealthCheck now creates a linked CancellationTokenSource bounded by a new SqlServerDataStoreConfiguration.HealthCheckProbeTimeout (default 10s). If the probe exceeds that window the underlying SQL call is cancelled and the check returns HealthStatus.Unhealthy with diagnostics.

Any SqlException that escapes the probe is also converted to Unhealthy explicitly (with the SqlException attached) so the published HealthReport contains real per-check diagnostics instead of relying on framework-level exception conversion which is bypassed when the outer token is also cancelled.

Caller cancellation (the framework's outer token) still propagates as OperationCanceledException so the hosted service can distinguish a cancelled batch from a real failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jnlycklama · 2026-06-15T23:47:48Z

Closing per design discussion. The per-probe timeout and explicit OperationCanceledException (OCE) conversion regress two important behaviors: (1) they hide the real SqlException from logs (the timer fires before SqlClient finishes retrying, so we never see 'Login failed for user'), and (2) they turn transient blips that currently recover via SqlClient retries into published Unhealthy results. Going forward, we'll rely on PR #1425 (ValueCache expiry as the safety net) plus a per-consumer bump of HealthCheckPublisherOptions.Timeout in dicom-server.
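
For reference, a per-consumer bump of HealthCheckPublisherOptions.Timeout could look roughly like the sketch below. This is a configuration fragment under assumptions, not the change planned for dicom-server: the extension method name `AddLongerPublisherTimeout` and the 90s value are illustrative only, and the code assumes the Microsoft.Extensions.Options and Microsoft.Extensions.Diagnostics.HealthChecks packages are referenced.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthCheckPublisherSetup
{
    // Hypothetical helper: raises the publisher's outer timeout so a slow,
    // retrying SQL probe can finish and surface its real exception.
    public static IServiceCollection AddLongerPublisherTimeout(this IServiceCollection services)
    {
        return services.Configure<HealthCheckPublisherOptions>(options =>
        {
            // Illustrative value only; it should exceed the worst-case
            // SqlClient connect-timeout plus retry budget for the consumer.
            options.Timeout = TimeSpan.FromSeconds(90);
        });
    }
}
```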

jnlycklama closed this Jun 15, 2026

jnlycklama wants to merge 1 commit into microsoft:main from jnlycklama:users/jnlycklama/sqlserver-healthcheck-probe-timeout