Skip to content

fix(shim): optimize monitor to avoid intermittent hangs#38

Open
novahe wants to merge 1 commit intokuasar-io:mainfrom
novahe:optimize-shim-monitor
Open

fix(shim): optimize monitor to avoid intermittent hangs#38
novahe wants to merge 1 commit intokuasar-io:mainfrom
novahe:optimize-shim-monitor

Conversation

@novahe
Copy link
Collaborator

@novahe novahe commented Mar 14, 2026

Description

This PR addresses intermittent hangs and potential deadlocks in the containerd-shim monitor module by optimizing the locking mechanism and event distribution logic.

The Problem

The previous implementation suffered from two main issues:

  1. Async Lock in Destructors: The global MONITOR used tokio::sync::Mutex. When a Subscription was dropped, it attempted to acquire this lock to unsubscribe. If the drop occurred outside an active Tokio
    runtime (e.g., during shim shutdown or from a blocking thread), it caused panics or hung the process.
  2. Reaper Thread Blocking: Using bounded channels for exit events meant that the "Reaper" thread (responsible for SIGCHLD and waitpid) could block if a subscriber was slow to consume events. This prevented
    other process exits from being collected, leading to zombie processes and a total shim hang under heavy workloads.

The Solution

  1. Synchronous Locking: Replaced tokio::sync::Mutex with std::sync::Mutex for the global MONITOR singleton. This ensures that Subscription::drop can safely and synchronously unregister itself in any
    execution context without relying on the async runtime.
  2. Unbounded Channels: Switched to unbounded channels (tokio::sync::mpsc::unbounded_channel for async and std::sync::mpsc::channel for sync). Since PID exit events are critical and must not be lost,
    unbounded channels ensure that the producer (the Reaper thread) is never blocked by subscriber backpressure.
  3. Strict FIFO Ordering: By using unbounded channels, the send operation is now a non-blocking memory operation. This allowed moving the notification logic back inside the lock, ensuring strict FIFO
    ordering of events and providing a guarantee that no events are delivered after a successful unsubscription.

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and reliability of the containerd-shim monitor module by addressing critical concurrency issues. The changes prevent intermittent hangs and potential deadlocks by refining the locking strategy and optimizing event distribution, leading to a more stable and responsive shim operation under various workloads.

Highlights

  • Concurrency Mechanism Update: Replaced tokio::sync::Mutex with std::sync::Mutex for the global MONITOR singleton to ensure safe and synchronous unregistration in any execution context, resolving issues with async locks in destructors.
  • Event Channel Optimization: Switched from bounded channels to unbounded channels (tokio::sync::mpsc::unbounded_channel for async and std::sync::mpsc::channel for sync) for exit event distribution, preventing the 'Reaper' thread from blocking and avoiding zombie processes.
  • Strict FIFO Ordering: Refactored notification logic to occur within the lock, ensuring strict First-In, First-Out (FIFO) ordering of events and guaranteeing that no events are delivered after a successful unsubscription.
  • Subscription Drop Handling: Implemented the Drop trait for Subscription to automatically and safely unsubscribe when a subscription handle is dropped, improving resource management and preventing leaks.
Changelog
  • crates/shim/src/asynchronous/monitor.rs
    • Updated imports to use std::sync::Mutex and tokio::sync::mpsc::unbounded_channel.
    • Modified monitor_subscribe, monitor_unsubscribe, monitor_notify_by_pid, monitor_notify_by_pid_blocking, and monitor_notify_by_exec to use std::sync::Mutex and simplified notification calls.
    • Implemented the Drop trait for Subscription to handle automatic unsubscription.
    • Changed channel creation from bounded channel to unbounded_channel.
    • Refactored the notify_topic method to be synchronous, take a mutable reference, and handle dead subscribers directly.
    • Removed redundant notify_by_pid, notify_by_pid_blocking, and notify_by_exec methods from the Monitor struct.
    • Replaced the single test_monitor with new, more comprehensive tests: test_monitor_table, test_monitor_backpressure, and test_monitor_reliability.
  • crates/shim/src/synchronous/monitor.rs
    • Added std::time::Duration import.
    • Modified monitor_notify_by_pid and monitor_notify_by_exec to directly call monitor.notify_topic.
    • Added a new public function monitor_unsubscribe.
    • Updated the notify_topic method to take a mutable reference and improved error handling for sending events.
    • Modified wait_pid to use recv_timeout for more robust waiting and error handling.
    • Added new test cases: test_monitor_table_sync, test_monitor_backpressure_sync, and test_monitor_reliability_sync.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@novahe novahe force-pushed the optimize-shim-monitor branch from ee587e6 to 5bc9053 Compare March 14, 2026 01:39
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses potential deadlocks and hangs in the monitor module by switching to std::sync::Mutex and unbounded channels. This is a solid improvement for the shim's stability. The addition of a Drop implementation for Subscription to handle automatic unsubscription is a great feature for resource management. The test suite has also been significantly enhanced with table-driven tests, backpressure tests, and reliability tests, which increases confidence in these critical changes. I have a couple of suggestions: one to fix a resource leak in the synchronous monitor path by adding cleanup for disconnected subscribers, and another to improve the efficiency of this cleanup logic in the asynchronous path.

@novahe novahe force-pushed the optimize-shim-monitor branch 2 times, most recently from 8a3cc2b to f58174c Compare March 14, 2026 05:14
This commit addresses several deadlock and hang scenarios in the monitor module:
- Replaced tokio::sync::Mutex with std::sync::Mutex to allow safe Drop implementation.
- Switched to unbounded channels to prevent Reaper thread blocking on subscriber backpressure.
- Restored in-lock notification to ensure strict FIFO ordering and unsubscription consistency.

Signed-off-by: novahe <heqianfly@gmail.com>
@novahe novahe force-pushed the optimize-shim-monitor branch from f58174c to d0d537a Compare March 14, 2026 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant