Detect unexpected VM exit and terminate wslcsession.exe#40158
Detect unexpected VM exit and terminate wslcsession.exe#40158benhillis wants to merge 1 commit intofeature/wsl-for-appsfrom
Conversation
test/windows/WSLCTests.cpp
Outdated
| return ids; | ||
| } | ||
|
|
||
| // Helper: kill the VM backing a session using hcsdiag. |
There was a problem hiding this comment.
I think the VM owner is displayed in the hcsdiag list output, maybe we can use that instead? Maybe we should make the owner WSLC-<session_name> or something? I'm not sure what the limits on the vm owner name are....
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds VM-exit detection so wslcsession.exe can terminate cleanly when the backing HCS VM is killed externally, and introduces test coverage for the new behavior.
Changes:
- Introduces a cross-process VM termination callback (
RegisterTerminationCallback) so the service can notify the session when the VM exits. - Wires VM-exit signaling into
wslcsessionand hardensTerminate()to avoid shutdown hangs when the VM is already dead. - Adds TAEF tests that kill the VM via
hcsdiagand validate the session/process doesn’t hang.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/windows/WSLCTests.cpp | Adds new TAEF tests and helpers to kill the backing VM via hcsdiag. |
| src/windows/wslcsession/WSLCSession.h | Adds VM-exit handler and event member for session termination-on-VM-exit. |
| src/windows/wslcsession/WSLCSession.cpp | Registers termination callback, monitors VM-exit event, triggers async Terminate(), and skips shutdown steps when VM is dead. |
| src/windows/service/inc/wslc.idl | Extends IWSLCVirtualMachine with RegisterTerminationCallback. |
| src/windows/service/exe/HcsVirtualMachine.h | Adds storage/locking for a session-provided termination callback. |
| src/windows/service/exe/HcsVirtualMachine.cpp | Implements callback registration and invokes it on VM exit; clears callback during VM teardown. |
| THROW_IF_FAILED(Vm->RegisterTerminationCallback(vmTermCallback.as<ITerminationCallback>().get())); | ||
|
|
There was a problem hiding this comment.
THROW_IF_FAILED here makes session initialization fail hard if the running service/proxy is older and doesn’t support the new COM method (commonly returning RPC_S_PROCNUM_OUT_OF_RANGE/E_NOTIMPL). Consider treating “method not supported” as a non-fatal best-effort (log + continue without VM-exit detection) so version skew during upgrades doesn’t prevent sessions from starting.
| THROW_IF_FAILED(Vm->RegisterTerminationCallback(vmTermCallback.as<ITerminationCallback>().get())); | |
| const auto registerTerminationCallbackResult = | |
| Vm->RegisterTerminationCallback(vmTermCallback.as<ITerminationCallback>().get()); | |
| if ((registerTerminationCallbackResult == E_NOTIMPL) || | |
| (registerTerminationCallbackResult == HRESULT_FROM_WIN32(RPC_S_PROCNUM_OUT_OF_RANGE))) | |
| { | |
| WSL_LOG( | |
| "RegisterTerminationCallbackUnsupported", | |
| TraceLoggingValue(registerTerminationCallbackResult, "HResult"), | |
| TraceLoggingValue(m_id, "SessionId")); | |
| } | |
| else | |
| { | |
| THROW_IF_FAILED(registerTerminationCallbackResult); | |
| } |
test/windows/WSLCTests.cpp
Outdated
| auto hr = session->GetState(&state); | ||
| if (FAILED(hr)) | ||
| { | ||
| // RPC error means the session process has exited — expected. | ||
| terminated = true; | ||
| break; | ||
| } |
There was a problem hiding this comment.
This treats any failure from GetState as success, which can mask genuine test failures (e.g., access denied, invalid argument, unexpected COM failures). Tighten this to only accept expected RPC-disconnect-style HRESULTs (e.g., HRESULT_FROM_WIN32(RPC_S_SERVER_UNAVAILABLE), HRESULT_FROM_WIN32(RPC_S_CALL_FAILED), RPC_E_DISCONNECTED), and otherwise fail the test.
test/windows/WSLCTests.cpp
Outdated
| static void KillNewVms(const std::set<std::wstring>& preExistingVmIds) | ||
| { | ||
| auto currentIds = GetRunningVmIds(); | ||
|
|
||
| std::vector<std::wstring> newVmIds; | ||
| for (const auto& id : currentIds) | ||
| { | ||
| if (preExistingVmIds.find(id) == preExistingVmIds.end()) | ||
| { | ||
| newVmIds.push_back(id); | ||
| } | ||
| } | ||
|
|
||
| VERIFY_IS_FALSE(newVmIds.empty()); | ||
|
|
||
| for (const auto& vmId : newVmIds) | ||
| { | ||
| auto killCmd = std::format(L"hcsdiag.exe kill {}", vmId); | ||
| wsl::windows::common::SubProcess killProc(nullptr, killCmd.c_str()); | ||
| auto killExitCode = killProc.Run(10000); | ||
| VERIFY_ARE_EQUAL(killExitCode, 0u); | ||
| } | ||
| } |
There was a problem hiding this comment.
Diffing hcsdiag list to find “new VMs” can kill unrelated VMs if other tests (or background system activity) create VMs between snapshots, making the test destructive/flaky in parallel or shared environments. Prefer targeting the specific session’s VM (e.g., expose/query the VM ID from the session/service, or tag/name the VM with a unique session identifier and filter by that) rather than killing all “new” IDs.
test/windows/WSLCTests.cpp
Outdated
| std::wistringstream lineStream(line); | ||
| std::wstring id; | ||
| lineStream >> id; | ||
|
|
||
| if (id.size() >= 36) | ||
| { | ||
| ids.insert(id); | ||
| } |
There was a problem hiding this comment.
Parsing hcsdiag list by taking the first token and checking size() >= 36 is brittle (headers/format changes/extra columns can produce false positives). Prefer validating that the token is actually a GUID (e.g., by attempting GUID parsing) before inserting, to reduce test flakiness across environments.
When the VM is killed externally (e.g. hcsdiag /kill, kernel panic), the wslcsession.exe process now detects it and exits cleanly instead of hanging indefinitely as a zombie. Implementation: - Add GetExitEvent() to IWSLCVirtualMachine IDL. The SYSTEM service duplicates HcsVirtualMachine::m_vmExitEvent via COM system_handle marshaling so the session process can wait on it. - WSLCVirtualMachine calls GetExitEvent() during Initialize() and exposes VmExitedEvent() for consumers to monitor. - WSLCSession monitors the exit event via IORelay. On unexpected VM exit, OnVmExited() spawns a thread to call Terminate() (must be a separate thread to avoid deadlock with IORelay::Stop). - WSLCSessionManager registers a cleanup callback on HcsVirtualMachine to terminate sessions when VM exits. Callback is cleared in ~HcsVirtualMachine to avoid firing during normal shutdown. - Harden Terminate() to skip dockerd signal/wait and unmount when the VM is already dead, avoiding unnecessary 30s+ hangs. - Add tests: VmKillTerminatesSession, VmKillFailsInFlightOperations, CleanShutdownStillWorks.
d8b4914 to
aa851c7
Compare
When the VM is killed externally (e.g. \hcsdiag /kill, kernel panic), the wslcsession.exe process now detects it and exits cleanly instead of hanging indefinitely as a zombie.
Implementation
Testing
Three new TAEF tests:
All pass with zero service crashes.
Repro
\
hcsdiag /list # find WSLC VM GUID
hcsdiag /kill # kill the VM
\
Before: wslcsession.exe hangs forever. After: exits cleanly.