Commit 4c6f8c7

radical and Copilot authored
CI infrastructure improvements: heartbeat, dump checks, timeouts, logging (#15950)
* Change Heartbeat default interval from 5s to 60s

  The 5-second interval generates excessive log output during CI test runs. A 60-second default reduces noise while still providing periodic status updates for diagnosing runner hangs and disk space issues.

* Pass 60s heartbeat interval explicitly in CI workflow

  Linux test steps were relying on the default interval. Now that the default changed to 60s this is technically a no-op, but passing it explicitly makes the intent clear and matches the Windows steps which already specified 60s.

* Move dump file check from final results to individual test jobs

  The dump file check was in specialized-test-runner.yml and checked artifacts/all-logs after downloading from sub-jobs. Move it to run-tests.yml where it checks testresults/ directly in the same job, matching the actual --crashdump/--hangdump output directory. This catches crashes/timeouts earlier and in the correct job context.

* Include timeout values in specialized test runsheets

  Add testSessionTimeout and testHangTimeout MSBuild properties to the JSON emitted by SpecializedTestRunsheetBuilderBase.targets, allowing per-project timeout overrides to flow through the runsheet to the CI workflow.

* Clarify PR filtering log message for specialized test workflows

  The old message 'filtering to single test project' was ambiguous. Make it clear this is a sanity-check behavior for pull_request events.

* Clamp heartbeat interval to >= 1s to prevent tight loops

  Addresses review feedback: reject zero or negative interval values that would cause a tight loop or crash in Task.Delay.

* Narrow dump file check to hang dumps only

  Crash dumps (e.g. *_crash.dmp) can be produced during process cleanup even when all tests passed; this is benign. Only fail on hang dump files (*hangdump*) which indicate the test runner timed out and tests may not have completed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent a182eb9 commit 4c6f8c7

5 files changed

Lines changed: 37 additions & 27 deletions

.github/workflows/run-tests.yml

Lines changed: 26 additions & 4 deletions
@@ -333,8 +333,8 @@ jobs:
     GITHUB_PR_HEAD_SHA: ${{ inputs.requiresCliArchive && github.event.pull_request.head.sha || '' }}
     GH_TOKEN: ${{ inputs.requiresCliArchive && github.token || '' }}
   run: |
-    # Start heartbeat monitor in background
-    ${{ github.workspace }}/${{ env.DOTNET_SCRIPT }} ${{ github.workspace }}/tools/scripts/Heartbeat.cs &
+    # Start heartbeat monitor in background (60s interval to reduce log noise)
+    ${{ github.workspace }}/${{ env.DOTNET_SCRIPT }} ${{ github.workspace }}/tools/scripts/Heartbeat.cs 60 &
     HEARTBEAT_PID=$!
 
     # Run tests
@@ -425,8 +425,8 @@ jobs:
     GITHUB_PR_HEAD_SHA: ${{ inputs.requiresCliArchive && github.event.pull_request.head.sha || '' }}
     GH_TOKEN: ${{ inputs.requiresCliArchive && github.token || '' }}
   run: |
-    # Start heartbeat monitor in background
-    ${{ env.DOTNET_SCRIPT }} ${{ github.workspace }}/tools/scripts/Heartbeat.cs &
+    # Start heartbeat monitor in background (60s interval to reduce log noise)
+    ${{ env.DOTNET_SCRIPT }} ${{ github.workspace }}/tools/scripts/Heartbeat.cs 60 &
     HEARTBEAT_PID=$!
 
     # Run tests
@@ -561,6 +561,28 @@ jobs:
 
     Write-Host "✓ Test execution completed successfully (found $validFileCount valid .trx file(s))"
 
+- name: Check for hang dump files
+  if: always()
+  shell: pwsh
+  run: |
+    $testResultsDir = "${{ github.workspace }}/testresults"
+    if (-not (Test-Path $testResultsDir)) {
+      Write-Host "No test results directory found, skipping dump file check"
+      return
+    }
+
+    # Only check for hang dump files — these indicate a test timed out.
+    # Crash dumps (e.g. *_crash.dmp) can be produced during process cleanup
+    # even when all tests passed, so we don't fail on those.
+    $hangDumpFiles = Get-ChildItem -Path $testResultsDir -Filter *hangdump* -Recurse -ErrorAction SilentlyContinue
+    if ($hangDumpFiles.Count -gt 0) {
+      Write-Host "::error::Hang dump file(s) detected — the test runner timed out:"
+      foreach ($df in $hangDumpFiles) { Write-Host " - $($df.FullName)" }
+      exit 1
+    }
+
+    Write-Host "✓ No hang dump files detected"
+
 - name: Dump docker info
   if: ${{ always() && runner.os == 'Linux' }}
   run: |
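
Reviewer note: as a rough illustration of what the new `Check for hang dump files` step does, the Python sketch below walks a results directory and reports files whose names contain `hangdump`. It is not part of the commit. The names `find_hang_dumps` and `check_exit_code` and the file names in the demo are hypothetical, the sketch counts files only, and matching details such as case sensitivity may differ from PowerShell's `-Filter`.

```python
import tempfile
from pathlib import Path


def find_hang_dumps(results_dir):
    """Return sorted paths of files under results_dir whose name contains 'hangdump'."""
    root = Path(results_dir)
    if not root.is_dir():
        # Mirrors the step's early return when testresults/ does not exist.
        return []
    return sorted(p for p in root.rglob("*hangdump*") if p.is_file())


def check_exit_code(results_dir):
    """Return 1 if any hang dump is present (the step fails), otherwise 0."""
    return 1 if find_hang_dumps(results_dir) else 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nested = Path(tmp) / "Sample.Tests"  # hypothetical project folder
        nested.mkdir()
        (nested / "testhost_crash.dmp").touch()  # crash dump: ignored by the check
        print(check_exit_code(tmp))
        (nested / "testhost_hangdump.dmp").touch()  # hang dump: fails the check
        print(check_exit_code(tmp))
```

Matching on the `hangdump` name rather than the `.dmp` extension is what lets benign crash dumps from process cleanup pass while a timed-out runner still fails the job.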

.github/workflows/specialized-test-runner.yml

Lines changed: 0 additions & 17 deletions
@@ -201,23 +201,6 @@ jobs:
     ${{ github.workspace }}/testresults
     --combined
 
-- name: Fail if test runner crashed or timed out
-  if: steps.check_tests.outputs.tests_run == 'true'
-  shell: pwsh
-  run: |
-    $logDirectory = "${{ github.workspace }}/artifacts/all-logs"
-
-    # Check for dump files (.dmp) in any of the collected test artifacts.
-    # A .dmp file indicates a crash (--crashdump) or timeout (--hangdump).
-    $dumpFiles = Get-ChildItem -Path $logDirectory -Filter *.dmp -Recurse -ErrorAction SilentlyContinue
-    if ($dumpFiles.Count -gt 0) {
-      Write-Host "::error::Dump file(s) detected — the test runner crashed or timed out in one or more jobs:"
-      foreach ($df in $dumpFiles) { Write-Host " - $($df.FullName)" }
-      exit 1
-    }
-
-    Write-Host "✓ No dump files detected"
-
 - name: Fail if any dependency failed
   # 'skipped' can be when a transitive dependency fails and the dependent job gets 'skipped'.
   # For example, one of setup_* jobs failing and the Integration test jobs getting 'skipped'

eng/AfterSolutionBuild.targets

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@
 $isSpecializedWorkflow = ($testRunnerName -eq 'QuarantinedTestRunsheetBuilder' -or $testRunnerName -eq 'OuterloopTestRunsheetBuilder')
 
 if ($isPullRequest -and $isSpecializedWorkflow -and $combined.Count -gt 0) {
-  Write-Host "Running on pull_request event for specialized tests - filtering to single test project"
+  Write-Host "Running on pull_request event for specialized tests - picking a single test project as a sanity check"
   # Get the first unique project name and keep all OS entries for that project
   $firstProjectName = $combined[0].project
   $combined = @($combined | Where-Object { $_.project -eq $firstProjectName })
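
Reviewer note: the reworded log message describes the filter in the context lines above. A minimal Python sketch of that selection, assuming runsheet entries are dictionaries with a `project` key, might look like the following. The function name `select_projects` and the sample entries are hypothetical, and Python's `==` is case-sensitive where PowerShell's `-eq` is not.

```python
def select_projects(combined, is_pull_request, is_specialized_workflow):
    """Return the runsheet entries to run.

    On pull_request events for specialized workflows, keep only the entries
    that share the first entry's project (one per OS); otherwise keep all.
    """
    if is_pull_request and is_specialized_workflow and combined:
        first_project = combined[0]["project"]
        return [e for e in combined if e["project"] == first_project]
    return list(combined)


if __name__ == "__main__":
    # Hypothetical runsheet entries for illustration only.
    entries = [
        {"project": "A.Tests", "os": "ubuntu-latest"},
        {"project": "A.Tests", "os": "windows-latest"},
        {"project": "B.Tests", "os": "ubuntu-latest"},
    ]
    for entry in select_projects(entries, True, True):
        print(entry["project"], entry["os"])
```

Keeping every OS entry for the first project means a pull request still exercises each runner type once without running the full specialized suite.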

eng/SpecializedTestRunsheetBuilderBase.targets

Lines changed: 3 additions & 3 deletions
@@ -155,9 +155,9 @@
 <!-- Replace \ with /, and then escape " with \", so we have a compliant JSON -->
 <_TestCommand>$([System.String]::Copy($(_TestCommand)).Replace("\", "/").Replace('&quot;', '\&quot;'))</_TestCommand>
 
-<_TestRunsheetWindows>{ "label": "w: $(_TestRunsheet)", "project": "$(_TestRunsheet)", "os": "$(_GithubActionsRunnerWindows)", "command": "./eng/build.ps1 $(_TestCommand)", "requiresNugets": $(_RequiresNugets), "requiresTestSdk": $(_RequiresTestSdk), "requiresCliArchive": $(_RequiresCliArchive) }</_TestRunsheetWindows>
-<_TestRunsheetLinux>{ "label": "l: $(_TestRunsheet)", "project": "$(_TestRunsheet)", "os": "$(_GithubActionsRunnerLinux)", "command": "./eng/build.sh $(_TestCommand)", "requiresNugets": $(_RequiresNugets), "requiresTestSdk": $(_RequiresTestSdk), "requiresCliArchive": $(_RequiresCliArchive) }</_TestRunsheetLinux>
-<_TestRunsheetMacOS>{ "label": "m: $(_TestRunsheet)", "project": "$(_TestRunsheet)", "os": "$(_GithubActionsRunnerMacOS)", "command": "./eng/build.sh $(_TestCommand)", "requiresNugets": $(_RequiresNugets), "requiresTestSdk": $(_RequiresTestSdk), "requiresCliArchive": $(_RequiresCliArchive) }</_TestRunsheetMacOS>
+<_TestRunsheetWindows>{ "label": "w: $(_TestRunsheet)", "project": "$(_TestRunsheet)", "os": "$(_GithubActionsRunnerWindows)", "command": "./eng/build.ps1 $(_TestCommand)", "requiresNugets": $(_RequiresNugets), "requiresTestSdk": $(_RequiresTestSdk), "requiresCliArchive": $(_RequiresCliArchive), "testSessionTimeout": "$(TestSessionTimeout)", "testHangTimeout": "$(TestHangTimeout)" }</_TestRunsheetWindows>
+<_TestRunsheetLinux>{ "label": "l: $(_TestRunsheet)", "project": "$(_TestRunsheet)", "os": "$(_GithubActionsRunnerLinux)", "command": "./eng/build.sh $(_TestCommand)", "requiresNugets": $(_RequiresNugets), "requiresTestSdk": $(_RequiresTestSdk), "requiresCliArchive": $(_RequiresCliArchive), "testSessionTimeout": "$(TestSessionTimeout)", "testHangTimeout": "$(TestHangTimeout)" }</_TestRunsheetLinux>
+<_TestRunsheetMacOS>{ "label": "m: $(_TestRunsheet)", "project": "$(_TestRunsheet)", "os": "$(_GithubActionsRunnerMacOS)", "command": "./eng/build.sh $(_TestCommand)", "requiresNugets": $(_RequiresNugets), "requiresTestSdk": $(_RequiresTestSdk), "requiresCliArchive": $(_RequiresCliArchive), "testSessionTimeout": "$(TestSessionTimeout)", "testHangTimeout": "$(TestHangTimeout)" }</_TestRunsheetMacOS>
 </PropertyGroup>
 
 <WriteLinesToFile
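
Reviewer note: because an unset MSBuild property expands to an empty string, a project that does not set `TestSessionTimeout` or `TestHangTimeout` will emit `""` for the new keys. The diff does not show how the workflow consumes them, so the Python sketch below is only one plausible reading: treat empty or missing values as "use the caller's default". The function name `timeouts_from_entry`, the sample entry, and the timeout strings are all hypothetical.

```python
import json


def timeouts_from_entry(entry_json, default_session, default_hang):
    """Return (session_timeout, hang_timeout) for one runsheet entry.

    Empty or missing values fall back to the supplied defaults; the values
    are treated as opaque strings because their format is not shown here.
    """
    entry = json.loads(entry_json)
    session = entry.get("testSessionTimeout") or default_session
    hang = entry.get("testHangTimeout") or default_hang
    return session, hang


if __name__ == "__main__":
    # Hypothetical entry shaped like the JSON emitted by the targets file.
    sample = json.dumps({
        "label": "l: Sample.Tests",
        "project": "Sample.Tests",
        "os": "ubuntu-latest",
        "command": "./eng/build.sh -test",
        "requiresNugets": False,
        "requiresTestSdk": False,
        "requiresCliArchive": False,
        "testSessionTimeout": "",     # property not set in the project
        "testHangTimeout": "15m",     # per-project override
    })
    print(timeouts_from_entry(sample, "20m", "10m"))
```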

tools/scripts/Heartbeat.cs

Lines changed: 7 additions & 2 deletions
@@ -3,7 +3,7 @@
 // diagnose runner hangs and disk space issues during tests.
 //
 // Usage: dotnet tools/scripts/Heartbeat.cs [interval-seconds]
-// Default interval: 5 seconds
+// Default interval: 60 seconds
 //
 // Example: dotnet tools/scripts/Heartbeat.cs 10
 
@@ -16,7 +16,12 @@
     RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows" :
     throw new NotSupportedException("Unsupported OS platform");
 
-var intervalSeconds = args.Length > 0 && int.TryParse(args[0], out var parsed) ? parsed : 5;
+const int defaultIntervalSeconds = 60;
+var intervalSeconds = args.Length > 0 &&
+    int.TryParse(args[0], out var parsed) &&
+    parsed >= 1
+    ? parsed
+    : defaultIntervalSeconds;
 var cts = new CancellationTokenSource();
 
 Console.CancelKeyPress += (_, e) =>
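
Reviewer note: although the commit message says "clamp", the expression above falls back to the 60-second default when the argument is missing, unparseable, or below 1; it does not raise small values to 1. The Python sketch below restates that logic for illustration only. The function name `parse_interval` is hypothetical, and Python's `int()` accepts some inputs that `int.TryParse` rejects (for example digit separators or values beyond 32 bits), so edge cases may differ.

```python
DEFAULT_INTERVAL_SECONDS = 60


def parse_interval(args):
    """Return the heartbeat interval in seconds from a list of CLI arguments.

    Falls back to the default when no argument is given, when it is not an
    integer, or when it is less than 1.
    """
    if not args:
        return DEFAULT_INTERVAL_SECONDS
    try:
        parsed = int(args[0])
    except ValueError:
        return DEFAULT_INTERVAL_SECONDS
    return parsed if parsed >= 1 else DEFAULT_INTERVAL_SECONDS


if __name__ == "__main__":
    for argv in ([], ["10"], ["0"], ["-5"], ["abc"]):
        print(argv, parse_interval(argv))
```

Falling back to the default rather than to 1 means a mistyped `0` yields a quiet 60-second heartbeat instead of the noisiest allowed one.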
