Isolate plugins in out-of-process COM host#40120
Conversation
There was a problem hiding this comment.
Pull request overview
This PR moves WSL plugins from in-process LoadLibrary inside wslservice.exe to isolated out-of-process wslpluginhost.exe COM local servers, aiming to keep the service alive even if a plugin crashes.
Changes:
- Introduces
wslpluginhost.exe(COM local server) that loads one plugin DLL and forwards lifecycle events, while proxying plugin API callbacks back to the service. - Adds new COM contracts (
IWslPluginHost,IWslPluginHostCallback) and consolidates proxy/stub generation intowslserviceproxystub.dll. - Updates service-side plugin management and adds a new
shared_mutexpath intended to avoid re-entrancy/deadlock on COM RPC callback threads.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/windows/wslpluginhost/exe/resource.h | Adds resource IDs for new host executable. |
| src/windows/wslpluginhost/exe/PluginHost.h | Declares COM class implementing IWslPluginHost plus API callback stubs. |
| src/windows/wslpluginhost/exe/PluginHost.cpp | Implements plugin DLL loading, lifecycle dispatch, and callback forwarding to service. |
| src/windows/wslpluginhost/exe/main.rc | Adds version/icon resources for wslpluginhost.exe. |
| src/windows/wslpluginhost/exe/main.cpp | Implements COM local server entrypoint and class factory registration. |
| src/windows/wslpluginhost/exe/CMakeLists.txt | Adds build target for wslpluginhost.exe. |
| src/windows/wslpluginhost/CMakeLists.txt | Wires new subdirectory into build. |
| src/windows/service/stub/CMakeLists.txt | Adds MIDL proxy/stub sources for WslPluginHost.idl into wslserviceproxystub. |
| src/windows/service/inc/WslPluginHost.idl | Defines out-of-proc COM interfaces for plugin hosting + callbacks. |
| src/windows/service/inc/CMakeLists.txt | Adds new wslpluginhostidl MIDL generation target. |
| src/windows/service/exe/PluginManager.h | Refactors plugin manager to track out-of-proc hosts and adds callback implementation type. |
| src/windows/service/exe/PluginManager.cpp | Implements COM activation of hosts, job object assignment, and service-side callback handlers. |
| src/windows/service/exe/LxssUserSession.h | Adds shared_mutex and makes plugin-callback methods private/friend-only. |
| src/windows/service/exe/LxssUserSession.cpp | Switches plugin callback locking from m_instanceLock to m_callbackLock and gates VM teardown. |
| src/windows/service/exe/CMakeLists.txt | Adds dependency on wslpluginhostidl. |
| src/windows/common/precomp.h | Adds <shared_mutex> include for new locking. |
| msipackage/package.wix.in | Installs wslpluginhost.exe and registers COM AppID/CLSID/interfaces for activation and proxy/stub. |
| msipackage/CMakeLists.txt | Adds wslpluginhost.exe to packaged binaries and build dependencies. |
| CMakeLists.txt | Adds subdirectory for wslpluginhost and adjusts global include directories. |
Comments suppressed due to low confidence (2)
src/windows/service/exe/LxssUserSession.cpp:3644
- CreateLinuxProcess now only takes m_callbackLock, but it calls _RunningInstance(), which is annotated Requires_lock_held(m_instanceLock) and reads m_runningInstances. This is both a locking-contract violation and can race with writers that still use m_instanceLock only. Refactor so callback code can safely read the running-instance map (e.g., provide a callback-safe lookup guarded by m_callbackLock, and ensure all writes to m_runningInstances/m_utilityVm also take m_callbackLock exclusively after m_instanceLock per the stated lock ordering).
// Shared lock prevents _VmTerminate from destroying the VM or instances
// while we use them. See MountRootNamespaceFolder for rationale.
std::shared_lock lock(m_callbackLock);
RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
src/windows/service/exe/LxssUserSession.cpp:2614
- m_runningInstances is updated here without taking m_callbackLock, but plugin callbacks now read m_runningInstances under m_callbackLock (and intentionally do not take m_instanceLock). Because these are different locks, this doesn’t provide synchronization and can lead to data races/UB when a callback runs concurrently with instance creation/termination. Writers that mutate m_runningInstances (and m_utilityVm if accessed by callbacks) need to also take m_callbackLock (exclusive) in the documented order m_instanceLock → m_callbackLock.
// This needs to be done before plugins are notified because they might try to run a command inside the distribution.
m_runningInstances[registration.Id()] = instance;
if (version == LXSS_WSL_VERSION_2)
{
auto cleanupOnFailure =
wil::scope_exit_log(WI_DIAGNOSTICS_INFO, [&]() { m_runningInstances.erase(registration.Id()); });
m_pluginManager.OnDistributionStarted(&m_session, instance->DistributionInformation());
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3643
- CreateLinuxProcess now only takes m_callbackLock, but it reads m_runningInstances via _RunningInstance(), which is annotated Requires_lock_held(m_instanceLock) and the map itself is Guarded_by(m_instanceLock). This is a real race/contract violation (and can also break static analysis). You’ll need a consistent locking strategy for callback threads (e.g., make all accesses/mutations of m_runningInstances + m_utilityVm also take m_callbackLock in the documented order, or refactor callbacks to avoid touching m_runningInstances without m_instanceLock).
// Shared lock prevents _VmTerminate from destroying the VM or instances
// while we use them. See MountRootNamespaceFolder for rationale.
std::shared_lock lock(m_callbackLock);
RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
3418f7e to
3f03d4f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3653
CreateLinuxProcessnow only takesm_callbackLockbut calls_RunningInstance(Distro)(which is annotated_Requires_lock_held_(m_instanceLock)and touchesm_lockedDistributions). This violates the locking contract and can race with instance state changes or deadlock if someone later adds the required lock. Consider adding a callback-safe lookup that is guarded bym_callbackLockonly (and does not call_EnsureNotLocked), or refactoring_RunningInstance/ the guarded annotations so callback code never needsm_instanceLock.
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
THROW_HR_IF(WSL_E_WSL2_NEEDED, !wsl2Distro);
|
Hey @benhillis 👋 — Following up on this PR. It currently has merge conflicts that need to be resolved, and the CI build didn't run (shows action_required). There are also 20 unresolved review threads, including some security-relevant findings (TOCTOU on DLL signature validation, |
f32e629 to
5641289
Compare
|
CI is all green now and conflicts are resolved. Would love to get some eyes on this when someone has a chance — the main design change is moving plugin loading out of wslservice into separate COM hosts so a bad plugin can't take down the service. |
3492d6b to
57501d9
Compare
|
|
||
| wil::unique_socket sock; | ||
| auto result = session->CreateLinuxProcess(DistributionId, Path, args.data(), &sock); | ||
|
|
There was a problem hiding this comment.
PluginHostCallbackImpl::ExecuteBinaryInDistribution has the same issue as ExecuteBinary: &sock is a wil::unique_socket* but CreateLinuxProcess expects a SOCKET* out parameter. Adjust the call so the out parameter type matches and the socket is correctly owned/closed on failure.
| std::vector<BYTE> PluginManager::SerializeSid(PSID Sid) | ||
| { | ||
| const DWORD sidLength = GetLengthSid(Sid); | ||
| std::vector<BYTE> buffer(sidLength); | ||
| CopySid(sidLength, buffer.data(), Sid); |
There was a problem hiding this comment.
SerializeSid ignores failures/invalid SIDs: GetLengthSid can return 0 and CopySid returns BOOL, but the return value is unchecked. Please validate the SID (or at least THROW_IF_WIN32_BOOL_FALSE(CopySid(...))) so we don’t serialize an empty/invalid SID and later reinterpret it as a PSID in the host process.
| /// Reads guarded by m_instanceLock OR m_callbackLock (shared). | ||
| /// Mutations require BOTH m_instanceLock AND m_callbackLock (exclusive). | ||
| /// </summary> | ||
| _Guarded_by_(m_instanceLock) std::map<GUID, std::shared_ptr<LxssRunningInstance>, wsl::windows::common::helpers::GuidLess> m_runningInstances; |
There was a problem hiding this comment.
The comment says reads of m_runningInstances are guarded by m_instanceLock OR m_callbackLock, but the field is still annotated Guarded_by(m_instanceLock). With callbacks now intentionally avoiding m_instanceLock, this is inconsistent and will cause static analysis/maintenance confusion. Consider updating annotations (or adding separate helper APIs that explicitly require/hold m_callbackLock) to match the new locking model.
| /// Reads guarded by m_instanceLock OR m_callbackLock (shared). | |
| /// Mutations require BOTH m_instanceLock AND m_callbackLock (exclusive). | |
| /// </summary> | |
| _Guarded_by_(m_instanceLock) std::map<GUID, std::shared_ptr<LxssRunningInstance>, wsl::windows::common::helpers::GuidLess> m_runningInstances; | |
| /// Reads require either m_instanceLock or m_callbackLock (shared). | |
| /// Mutations require both m_instanceLock and m_callbackLock (exclusive). | |
| /// This uses a dual-lock access model and therefore is not annotated with a | |
| /// single _Guarded_by_ lock. | |
| /// </summary> | |
| std::map<GUID, std::shared_ptr<LxssRunningInstance>, wsl::windows::common::helpers::GuidLess> m_runningInstances; |
| PluginHost* wsl::windows::pluginhost::g_pluginHost = nullptr; | ||
|
|
||
| // Thread ID of the thread currently dispatching a plugin hook. | ||
| // Only that thread may call PluginError. Using thread ID instead of | ||
| // thread_local to avoid TLS initialization issues across DLL/EXE boundaries. | ||
| static std::atomic<DWORD> g_hookThreadId{0}; |
There was a problem hiding this comment.
g_pluginHost is a process-wide raw pointer that’s read by API stubs from arbitrary plugin threads and written/reset during Initialize/destruction. Without atomic synchronization this is a data race and can lead to use-after-free when the host is shutting down. Consider making g_pluginHost an atomic pointer (or otherwise synchronizing access) and ensuring stub methods can’t observe a partially torn-down instance.
test/windows/PluginTests.cpp
Outdated
| @@ -354,11 +354,10 @@ class PluginTests | |||
| { | |||
| WSL1_TEST_ONLY(); | |||
|
|
|||
| constexpr auto ExpectedOutput = LR"(Plugin loaded. TestMode=1)"; | |||
|
|
|||
| // Plugins are not loaded for WSL1-only sessions (no VM, no plugin hooks). | |||
| // Verify that WSL1 works without plugins. | |||
| ConfigurePlugin(PluginTestType::Success); | |||
| StartWsl(0); | |||
| ValidateLogFile(ExpectedOutput); | |||
| } | |||
There was a problem hiding this comment.
SuccessWSL1 no longer asserts anything about plugin behavior. Since ConfigurePlugin sets up logging, it would be good to assert that the plugin log file is absent/empty (and does not contain "Plugin loaded") to actually verify that plugins are not loaded for WSL1-only sessions.
| // Plugins are not loaded for WSL1-only sessions, so a plugin that | ||
| // would fail to load on WSL2 has no effect on WSL1. | ||
| ConfigurePlugin(PluginTestType::FailToLoad); | ||
| StartWsl(0); |
There was a problem hiding this comment.
LoadFailureNonFatalWSL1 similarly no longer validates that the plugin wasn’t loaded (or that no plugin failure was recorded). Add an assertion that the log file is absent/empty to ensure WSL1 remains unaffected by a plugin that would fail on WSL2.
| StartWsl(0); | |
| StartWsl(0); | |
| VERIFY_IS_TRUE( | |
| !std::filesystem::exists(logFile) || (std::filesystem::file_size(logFile) == 0), | |
| std::format("Expected plugin log file '{}' to be absent or empty for WSL1", logFile).c_str()); |
| HRESULT LxssUserSessionImpl::CreateLinuxProcess(_In_opt_ const GUID* Distro, _In_ LPCSTR Path, _In_ LPCSTR* Arguments, _Out_ SOCKET* Socket) | ||
| { | ||
| std::lock_guard lock(m_instanceLock); | ||
| // Shared lock prevents _VmTerminate from destroying the VM or instances | ||
| // while we use them. See MountRootNamespaceFolder for rationale. | ||
| std::shared_lock lock(m_callbackLock); | ||
| RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm); | ||
|
|
||
| if (Distro == nullptr) |
There was a problem hiding this comment.
CreateLinuxProcess now intentionally avoids m_instanceLock and only takes m_callbackLock (shared). However, the implementation later calls _RunningInstance (which is annotated Requires_lock_held(m_instanceLock) and reads m_runningInstances). Please align the helper/annotations with the new locking model (e.g., require/hold m_callbackLock for _RunningInstance) to avoid lock-contract violations and potential races.
Plugin DLLs are now loaded in isolated wslpluginhost.exe processes instead of directly in wslservice.exe via LoadLibrary. This prevents a buggy or malicious plugin from crashing the WSL service. Architecture: - New IWslPluginHost/IWslPluginHostCallback COM interfaces (WslPluginHost.idl) for cross-process plugin lifecycle management - New wslpluginhost.exe: COM local server (REGCLS_SINGLEUSE), one per plugin, loads the plugin DLL and dispatches notifications - Refactored PluginManager: CoCreateInstance replaces LoadLibrary, PluginError returned via [out] parameter, crash recovery via IsHostCrash() detecting RPC_E_DISCONNECTED/SERVER_DIED Callback safety: - Plugin callbacks (MountFolder, ExecuteBinary) arrive on a different COM RPC thread and use std::shared_lock(m_callbackLock) instead of m_instanceLock to avoid re-entrancy deadlocks - _VmTerminate takes exclusive m_callbackLock before destroying the VM, blocking until in-flight callbacks complete - Lock ordering: m_instanceLock -> m_callbackLock (never reverse) - All writes to m_runningInstances take m_callbackLock exclusive to prevent data races with concurrent callback reads Security: - COM AppID with SYSTEM-only launch/access permissions - Plugin signature validation (ValidateFileSignature) keeps the file handle open until after LoadLibrary to prevent TOCTOU attacks - Plugin host processes use minimal access rights for handles Process lifecycle: - Plugin hosts added to a job object with KILL_ON_JOB_CLOSE for automatic cleanup if wslservice exits - g_pluginHost is process-wide (REGCLS_SINGLEUSE guarantees one plugin per process), nulled on destruction to prevent UAF - std::call_once for thread-safe initialization and job creation Packaging: - WslPluginHost.idl compiled into existing wslserviceproxystub.dll - MSI: COM class/interface registration, AppID security, proxy/stub - wslpluginhost.exe added to build/signing pipeline, WER crash dump list, LSP registration, and test validation Plugins are not loaded for WSL1-only sessions since all plugin hooks require a WSL2 VM. WslPluginApi.h is unchanged - existing plugin DLLs work unmodified.
57501d9 to
ed98200
Compare
Moves plugin DLLs from LoadLibrary in wslservice.exe to isolated wslpluginhost.exe processes via COM. A crashing plugin kills only its host process — the service logs it and continues. Plugin API is unchanged; Defender and existing plugins work unmodified.
Callbacks (MountFolder, ExecuteBinary) arrive on a different COM thread, so they can't re-enter the recursive mutex. Instead of restructuring every call site, callbacks use a separate \shared_mutex\ — shared for reads, exclusive in _VmTerminate\ before destroying the VM.
Plugin hosts are in a job object for automatic cleanup on service exit. COM activation is SYSTEM-only via AppID. Proxy/stub is consolidated into wslserviceproxystub.dll. One new exe, no new DLLs.