Detect unexpected VM exit and terminate wslcsession.exe #40158

Draft
benhillis wants to merge 1 commit into feature/wsl-for-apps from user/benhill/vm_exit
@benhillis
Member

When the VM is killed externally (e.g. `hcsdiag /kill`, kernel panic), the wslcsession.exe process now detects it and exits cleanly instead of hanging indefinitely as a zombie.

Implementation

  • Add `RegisterTerminationCallback()` to the `IWSLCVirtualMachine` IDL. The session process registers an `ITerminationCallback` during initialization so the SYSTEM service can notify it cross-process when HCS reports VM exit.
  • `VmTerminationCallback` (in wslcsession) signals a local `m_vmExitedEvent` when the callback fires. IORelay monitors this event and calls `OnVmExited()`.
  • `OnVmExited()` spawns a thread (to avoid IORelay deadlock) that calls `Terminate()`, cleaning up the session and exiting the process.
  • `Terminate()` is hardened to skip dockerd signal/wait and unmount when the VM is already dead, avoiding unnecessary 30s+ hangs.
  • Destructor of `HcsVirtualMachine` clears the callback (under a dedicated lock) before VM teardown, preventing redundant COM calls during normal shutdown.
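
For reviewers, a minimal portable sketch of the hand-off described in the list above. It is not the WSLC code: `Relay` is a hypothetical stand-in for IORelay, and `Session`, `NotifyVmExit()`, and the step strings are illustrative names assumed for this example. It shows why the exit handler spawns a thread (because `Terminate()` joins the relay worker) and how `Terminate()` can skip guest-side steps once the VM is known to be dead.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for IORelay: one worker thread waits for an event
// and runs a handler when it fires.
class Relay {
public:
    explicit Relay(std::function<void()> onSignaled)
        : m_onSignaled(std::move(onSignaled)), m_worker([this] { Run(); }) {}
    ~Relay() { Stop(); }

    void Signal() { Set(m_signaled); }

    // Joins the worker, so calling this from the worker itself would deadlock.
    void Stop() {
        std::call_once(m_stopOnce, [this] { Set(m_stop); m_worker.join(); });
    }

private:
    void Set(bool& flag) {
        { std::lock_guard<std::mutex> lock(m_mutex); flag = true; }
        m_cv.notify_all();
    }
    void Run() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_signaled || m_stop; });
        const bool fire = m_signaled && !m_stop;
        lock.unlock();
        if (fire) m_onSignaled();
    }

    std::function<void()> m_onSignaled;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_signaled = false;
    bool m_stop = false;
    std::once_flag m_stopOnce;
    std::thread m_worker;  // declared last so it starts after the state above
};

class Session {
public:
    Session() : m_relay([this] { OnVmExited(); }) {}
    ~Session() {
        m_relay.Stop();  // after this, OnVmExited() can no longer run
        if (m_terminator.joinable()) m_terminator.join();
    }

    // Illustrative entry point for the cross-process termination callback.
    void NotifyVmExit() { m_relay.Signal(); }

    void Terminate() {
        std::call_once(m_terminateOnce, [this] {
            m_relay.Stop();
            if (!m_vmExited) {
                // Guest-side steps only make sense while the VM is alive.
                m_steps.push_back("signal dockerd");
                m_steps.push_back("unmount");
            }
            m_steps.push_back("release host resources");
            m_terminated = true;
        });
    }

    bool Terminated() const { return m_terminated; }
    const std::vector<std::string>& Steps() const { return m_steps; }

private:
    // Runs on the relay worker. Terminate() joins that worker, so it is
    // handed to a separate thread instead of being called inline.
    void OnVmExited() {
        assert(!m_terminator.joinable());  // the relay fires at most once
        m_vmExited = true;
        m_terminator = std::thread([this] { Terminate(); });
    }

    std::atomic<bool> m_vmExited{false};
    std::atomic<bool> m_terminated{false};
    std::once_flag m_terminateOnce;
    std::vector<std::string> m_steps;
    std::thread m_terminator;
    Relay m_relay;  // declared last so the worker starts after the state above
};
```

In this sketch, calling `NotifyVmExit()` on a `Session` leads to a `Terminate()` that records only the host-side step, while calling `Terminate()` directly on a healthy session records all three. The real implementation also exits the process, which is left out here.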

Testing

Three new TAEF tests:

  • VmKillTerminatesSession: creates session, kills VM via `hcsdiag`, verifies session terminates
  • VmKillFailsInFlightOperations: starts long-running process, kills VM, verifies no hang
  • CleanShutdownStillWorks: regression test for normal `Terminate()` flow

All pass with zero service crashes.

Repro

```
hcsdiag /list # find WSLC VM GUID
hcsdiag /kill # kill the VM
```
Before: wslcsession.exe hangs forever. After: exits cleanly.

Copilot AI review requested due to automatic review settings April 10, 2026 19:01
return ids;
}

// Helper: kill the VM backing a session using hcsdiag.
Member Author

I think the VM owner is displayed in the hcsdiag list output; maybe we can use that instead? Maybe we should make the owner WSLC-<session_name> or something? I'm not sure what the limits on the VM owner name are...

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds VM-exit detection so wslcsession.exe can terminate cleanly when the backing HCS VM is killed externally, and introduces test coverage for the new behavior.

Changes:

  • Introduces a cross-process VM termination callback (RegisterTerminationCallback) so the service can notify the session when the VM exits.
  • Wires VM-exit signaling into wslcsession and hardens Terminate() to avoid shutdown hangs when the VM is already dead.
  • Adds TAEF tests that kill the VM via hcsdiag and validate the session/process doesn’t hang.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/windows/WSLCTests.cpp | Adds new TAEF tests and helpers to kill the backing VM via hcsdiag. |
| src/windows/wslcsession/WSLCSession.h | Adds VM-exit handler and event member for session termination-on-VM-exit. |
| src/windows/wslcsession/WSLCSession.cpp | Registers termination callback, monitors VM-exit event, triggers async Terminate(), and skips shutdown steps when VM is dead. |
| src/windows/service/inc/wslc.idl | Extends IWSLCVirtualMachine with RegisterTerminationCallback. |
| src/windows/service/exe/HcsVirtualMachine.h | Adds storage/locking for a session-provided termination callback. |
| src/windows/service/exe/HcsVirtualMachine.cpp | Implements callback registration and invokes it on VM exit; clears callback during VM teardown. |

Comment on lines +312 to +313
THROW_IF_FAILED(Vm->RegisterTerminationCallback(vmTermCallback.as<ITerminationCallback>().get()));

Copilot AI Apr 10, 2026

THROW_IF_FAILED here makes session initialization fail hard if the running service/proxy is older and doesn’t support the new COM method (commonly returning RPC_S_PROCNUM_OUT_OF_RANGE/E_NOTIMPL). Consider treating “method not supported” as a non-fatal best-effort (log + continue without VM-exit detection) so version skew during upgrades doesn’t prevent sessions from starting.

Suggested change
THROW_IF_FAILED(Vm->RegisterTerminationCallback(vmTermCallback.as<ITerminationCallback>().get()));
const auto registerTerminationCallbackResult =
    Vm->RegisterTerminationCallback(vmTermCallback.as<ITerminationCallback>().get());
if ((registerTerminationCallbackResult == E_NOTIMPL) ||
    (registerTerminationCallbackResult == HRESULT_FROM_WIN32(RPC_S_PROCNUM_OUT_OF_RANGE)))
{
    WSL_LOG(
        "RegisterTerminationCallbackUnsupported",
        TraceLoggingValue(registerTerminationCallbackResult, "HResult"),
        TraceLoggingValue(m_id, "SessionId"));
}
else
{
    THROW_IF_FAILED(registerTerminationCallbackResult);
}

Copilot uses AI. Check for mistakes.
Comment on lines +6828 to +6834
auto hr = session->GetState(&state);
if (FAILED(hr))
{
    // RPC error means the session process has exited, which is expected.
    terminated = true;
    break;
}
Copilot AI Apr 10, 2026

This treats any failure from GetState as success, which can mask genuine test failures (e.g., access denied, invalid argument, unexpected COM failures). Tighten this to only accept expected RPC-disconnect-style HRESULTs (e.g., HRESULT_FROM_WIN32(RPC_S_SERVER_UNAVAILABLE), HRESULT_FROM_WIN32(RPC_S_CALL_FAILED), RPC_E_DISCONNECTED), and otherwise fail the test.
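
One way the tightened check could look, as a sketch only. To keep it buildable off Windows it uses a stand-in `HResult` type and numeric literals; real test code would use `HRESULT` and the named constants from `<winerror.h>`. `PollResult` and `ClassifyGetState` are hypothetical names introduced for this example.

```cpp
#include <cstdint>

// Stand-ins so the sketch builds anywhere; Windows code would use HRESULT and
// the named constants from <winerror.h> instead of these literals.
using HResult = std::int32_t;
constexpr HResult Hr(std::uint32_t bits) { return static_cast<HResult>(bits); }

constexpr HResult kRpcEDisconnected     = Hr(0x80010108u);  // RPC_E_DISCONNECTED
constexpr HResult kRpcServerUnavailable = Hr(0x800706BAu);  // HRESULT_FROM_WIN32(RPC_S_SERVER_UNAVAILABLE)
constexpr HResult kRpcCallFailed        = Hr(0x800706BEu);  // HRESULT_FROM_WIN32(RPC_S_CALL_FAILED)
constexpr HResult kEAccessDenied        = Hr(0x80070005u);  // E_ACCESSDENIED

// Hypothetical outcome of one GetState() poll in the test loop.
enum class PollResult { StillRunning, SessionGone, UnexpectedFailure };

inline PollResult ClassifyGetState(HResult hr) {
    if (hr >= 0) {
        return PollResult::StillRunning;  // same condition as SUCCEEDED(hr)
    }
    switch (hr) {
    case kRpcEDisconnected:
    case kRpcServerUnavailable:
    case kRpcCallFailed:
        return PollResult::SessionGone;  // the session process went away
    default:
        return PollResult::UnexpectedFailure;  // should fail the test
    }
}
```

The polling loop would then set `terminated = true` only for `SessionGone` and fail the test on `UnexpectedFailure`, so an access-denied or invalid-argument error no longer counts as a pass.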

Comment on lines +6776 to +6798
static void KillNewVms(const std::set<std::wstring>& preExistingVmIds)
{
    auto currentIds = GetRunningVmIds();

    std::vector<std::wstring> newVmIds;
    for (const auto& id : currentIds)
    {
        if (preExistingVmIds.find(id) == preExistingVmIds.end())
        {
            newVmIds.push_back(id);
        }
    }

    VERIFY_IS_FALSE(newVmIds.empty());

    for (const auto& vmId : newVmIds)
    {
        auto killCmd = std::format(L"hcsdiag.exe kill {}", vmId);
        wsl::windows::common::SubProcess killProc(nullptr, killCmd.c_str());
        auto killExitCode = killProc.Run(10000);
        VERIFY_ARE_EQUAL(killExitCode, 0u);
    }
}
Copilot AI Apr 10, 2026

Diffing hcsdiag list to find “new VMs” can kill unrelated VMs if other tests (or background system activity) create VMs between snapshots, making the test destructive/flaky in parallel or shared environments. Prefer targeting the specific session’s VM (e.g., expose/query the VM ID from the session/service, or tag/name the VM with a unique session identifier and filter by that) rather than killing all “new” IDs.

Comment on lines +6760 to +6767
std::wistringstream lineStream(line);
std::wstring id;
lineStream >> id;

if (id.size() >= 36)
{
    ids.insert(id);
}
Copilot AI Apr 10, 2026

Parsing hcsdiag list by taking the first token and checking size() >= 36 is brittle (headers/format changes/extra columns can produce false positives). Prefer validating that the token is actually a GUID (e.g., by attempting GUID parsing) before inserting, to reduce test flakiness across environments.
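
A portable sketch of a stricter token check, assuming the IDs appear in the usual 8-4-4-4-12 hexadecimal form with optional braces. `LooksLikeGuid` is a hypothetical helper name; on Windows the test could instead lean on an existing parser such as `CLSIDFromString` (which expects the braced form).

```cpp
#include <cstddef>
#include <cwctype>
#include <string_view>

// Returns true only for tokens shaped like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
// (hex digits, dashes at fixed positions), with or without surrounding braces.
inline bool LooksLikeGuid(std::wstring_view token) {
    if (token.size() == 38 && token.front() == L'{' && token.back() == L'}') {
        token = token.substr(1, 36);  // strip the braces
    }
    if (token.size() != 36) {
        return false;
    }
    for (std::size_t i = 0; i < token.size(); ++i) {
        const bool dashPosition = (i == 8 || i == 13 || i == 18 || i == 23);
        if (dashPosition) {
            if (token[i] != L'-') {
                return false;
            }
        } else if (!std::iswxdigit(static_cast<std::wint_t>(token[i]))) {
            return false;
        }
    }
    return true;
}
```

Swapping `id.size() >= 36` for a check like this would reject header rows or separator lines that merely happen to be long enough.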

When the VM is killed externally (e.g. hcsdiag /kill, kernel panic), the
wslcsession.exe process now detects it and exits cleanly instead of
hanging indefinitely as a zombie.

Implementation:
- Add GetExitEvent() to IWSLCVirtualMachine IDL. The SYSTEM service
  duplicates HcsVirtualMachine::m_vmExitEvent via COM system_handle
  marshaling so the session process can wait on it.
- WSLCVirtualMachine calls GetExitEvent() during Initialize() and
  exposes VmExitedEvent() for consumers to monitor.
- WSLCSession monitors the exit event via IORelay. On unexpected VM
  exit, OnVmExited() spawns a thread to call Terminate() (must be a
  separate thread to avoid deadlock with IORelay::Stop).
- WSLCSessionManager registers a cleanup callback on HcsVirtualMachine
  to terminate sessions when VM exits. Callback is cleared in
  ~HcsVirtualMachine to avoid firing during normal shutdown.
- Harden Terminate() to skip dockerd signal/wait and unmount when the
  VM is already dead, avoiding unnecessary 30s+ hangs.
- Add tests: VmKillTerminatesSession, VmKillFailsInFlightOperations,
  CleanShutdownStillWorks.
benhillis force-pushed the user/benhill/vm_exit branch from d8b4914 to aa851c7, April 10, 2026 20:03