Bound SqlServerHealthCheck with a per-probe timeout so failures publish before the publisher's outer timeout fires#1426
Closed
jnlycklama wants to merge 1 commit into
Conversation
…ception to Unhealthy When the SQL database is unreachable (for example, deleted in production), SqlClient's connect-timeout and retry budget can exceed 70 seconds for a single attempt. That is well beyond the framework's default HealthCheckPublisherOptions.Timeout of 30s, so the publisher's outer token cancels mid-probe, the OperationCanceledException is treated as cancellation (not a health failure), and PublishAsync is never called. The previously cached HealthReport stays cached and /health/check keeps returning 200. This change fixes the root cause: SqlServerHealthCheck now creates a linked CancellationTokenSource bounded by a new SqlServerDataStoreConfiguration.HealthCheckProbeTimeout (default 10s). If the probe exceeds that window the underlying SQL call is cancelled and the check returns HealthStatus.Unhealthy with diagnostics. Any SqlException that escapes the probe is also converted to Unhealthy explicitly (with the SqlException attached) so the published HealthReport contains real per-check diagnostics instead of relying on framework-level exception conversion which is bypassed when the outer token is also cancelled. Caller cancellation (the framework's outer token) still propagates as OperationCanceledException so the hosted service can distinguish a cancelled batch from a real failure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
Closing per design discussion. The per-probe timeout and explicit OCE conversion regress two important behaviors: (1) it hides the real SqlException from logs (the timer fires before SqlClient finishes retrying, so we never see 'Login failed for user'), and (2) it turns transient blips that today recover via SqlClient retries into published Unhealthy results. Going forward we'll rely on PR #1425 (ValueCache expiry as the safety net) plus a per-consumer bump of HealthCheckPublisherOptions.Timeout in dicom-server. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a production scenario where
/health/checkreturns 200 OK even though the SQL database has been deleted, because the publisher pipeline never gets a chance to publish the failure.Root cause
SqlServerHealthCheckissues a SQL query against the deleted DB.ConnectRetryCount+ConnectRetryInterval) can take 70+ seconds for a single failed connection attempt.HealthCheckPublisherHostedServicecreates a linkedCancellationTokenSourcebounded byHealthCheckPublisherOptions.Timeout(default 30s).cancellationToken.IsCancellationRequested == true, theOperationCanceledExceptionis not converted intoHealthStatus.Unhealthy. It propagates up to the hosted service, which treats the whole batch as cancelled —PublishAsyncis never called.HealthReport(Healthy) stays inValueCache<HealthReport>indefinitely, soCachedHealthCheckMiddlewarekeeps returning 200.Fix
SqlServerDataStoreConfiguration.HealthCheckProbeTimeout(default 10s).SqlServerHealthCheck.CheckStorageHealthAsyncnow wraps the SQL call in a linkedCancellationTokenSourcewithCancelAfter(HealthCheckProbeTimeout). Failing fast inside the publisher's budget guarantees the result reachesPublishAsync.SqlExceptionthat escapes the probe is converted toHealthCheckResult.Unhealthy(...)explicitly, with theSqlExceptionattached so consumers see real diagnostics.OperationCanceledException(where the caller did not cancel) is converted toUnhealthywith a clear description.OperationCanceledException, preserving the hosted service's "cancelled batch vs real failure" distinction.Test plan
SqlServerHealthCheckTests:SqlException18456) →Unhealthywith the exception attached.Unhealthywith timeout description.OperationCanceledExceptionpropagates (not swallowed).dotnet test src/Microsoft.Health.SqlServer.UnitTests→ 100/100 (+2 skipped) on net8/9/10.Related
Complements #1425 (
ValueCache<HealthReport>expiry as a generic safety net for any publisher hang). This PR fixes the SQL-specific root cause; #1425 catches anything else that can wedge the publisher.