
Avoid search-thread deadlocks during surface teardown #55

Open
miracle2k wants to merge 1 commit into manaflow-ai:main from miracle2k:fix/search-shutdown-hang

miracle2k commented May 12, 2026

Summary

This is a proposed fix for a freeze I have been hitting frequently in cmux when a terminal is closed while Ghostty search/find state is active or recently active.

The fix itself is entirely AI-suggested. I am opening this as a report plus a proposed patch, not as a claim that I personally proved every detail of the Ghostty-side threading model.

Observed failure

A sampled hung cmux process showed the macOS app main thread blocked in ghostty_surface_free, inside Ghostty Surface.deinit, waiting in pthread_join for the per-surface search thread to exit.

The suspected deadlock is:

  1. cmux closes/frees a terminal surface on the app/main thread.
  2. Surface.deinit asks the search thread to stop and synchronously joins it.
  3. While exiting, the search thread emits final search UI callbacks such as clearing match/highlight state.
  4. Those callbacks used .forever queue pushes into renderer/app mailboxes.
  5. If the relevant mailbox is full while the app thread is already waiting in join, both sides can wait forever.
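To make the failure mode concrete, here is a minimal sketch of a bounded blocking queue whose push accepts the three timeout kinds mentioned in this PR (.instant, .forever, .ns). It is a hypothetical model written for illustration, not Ghostty's BlockingQueue. As the reviews below describe, it assumes that push returns 0 when a message is dropped. A .forever push on a full queue returns only after a consumer pops, and that never happens if the only thread able to drain the mailbox is itself waiting on the pusher.

```zig
const std = @import("std");

/// Hypothetical push timeout, modelled on the `.instant` / `.forever` / `.ns`
/// variants named in this PR (not Ghostty's actual type).
pub const Timeout = union(enum) {
    instant: void,
    forever: void,
    ns: u64,
};

/// Minimal fixed-capacity blocking queue, for illustration only.
pub fn BoundedQueue(comptime T: type, comptime capacity: usize) type {
    return struct {
        const Self = @This();

        mutex: std.Thread.Mutex = .{},
        not_full: std.Thread.Condition = .{},
        items: [capacity]T = undefined,
        head: usize = 0,
        len: usize = 0,

        /// Returns the queue length after the push, or 0 if the value was dropped.
        pub fn push(self: *Self, value: T, timeout: Timeout) usize {
            self.mutex.lock();
            defer self.mutex.unlock();

            if (self.len == capacity) {
                switch (timeout) {
                    // Fail fast: never block the producer.
                    .instant => return 0,
                    // Unbounded wait: returns only after a consumer pops. If no
                    // consumer will ever pop, the producer hangs here.
                    .forever => while (self.len == capacity) self.not_full.wait(&self.mutex),
                    // Bounded wait: give up after `ns`, or on a wakeup that
                    // still finds the queue full.
                    .ns => |ns| {
                        self.not_full.timedWait(&self.mutex, ns) catch return 0;
                        if (self.len == capacity) return 0;
                    },
                }
            }

            self.items[(self.head + self.len) % capacity] = value;
            self.len += 1;
            return self.len;
        }

        /// Removes the oldest value and wakes one blocked producer.
        pub fn pop(self: *Self) ?T {
            self.mutex.lock();
            defer self.mutex.unlock();
            if (self.len == 0) return null;
            const value = self.items[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            self.not_full.signal();
            return value;
        }
    };
}
```

The test leaves out a .forever push on purpose: against a full queue with no consumer, it would never return.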

This matches the user-visible symptom: the cmux window becomes unresponsive and only shows the spinner.

Proposed fix

This patch removes unbounded waits from search-thread callback delivery:

  • normal search callback delivery uses a short bounded queue wait, preserving ordering in ordinary cases;
  • once surface teardown has started, callback delivery becomes instant/best-effort;
  • arena-backed renderer messages deinit their arena if the message could not be enqueued.

Search UI updates are stale/recoverable state. During teardown the surface is closing, so dropping final reset updates is preferable to deadlocking the app.
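The timeout policy in the first two bullets can be sketched as follows. The 50 ms bound and the release/acquire pairing come from the bot summaries below. The SearchState type and its method names are hypothetical, and Timeout is a stand-in for the mailbox timeout type.

```zig
const std = @import("std");

/// Hypothetical stand-in for the mailbox timeout type.
pub const Timeout = union(enum) {
    instant: void,
    forever: void,
    ns: u64,
};

/// Normal-operation bound on a search callback push (50 ms, per the summaries).
pub const search_callback_push_timeout_ns: u64 = 50 * std.time.ns_per_ms;

/// Hypothetical holder for the teardown flag that the PR adds to Surface.
pub const SearchState = struct {
    tearing_down: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),

    /// App thread: call before stopping and joining the search thread.
    pub fn beginTeardown(self: *SearchState) void {
        self.tearing_down.store(true, .release);
    }

    /// Search thread: choose the timeout for the next callback push.
    pub fn pushTimeout(self: *const SearchState) Timeout {
        // Pairs with the .release store in beginTeardown.
        if (self.tearing_down.load(.acquire)) return .instant;
        return .{ .ns = search_callback_push_timeout_ns };
    }
};
```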

Validation done locally

  • zig fmt --check ghostty/src/Surface.zig
  • zig build -Demit-xcframework=true -Demit-macos-app=false -Dxcframework-target=universal -Doptimize=ReleaseFast
  • cmux debug build using this GhosttyKit via ./scripts/reload.sh --tag fix-search-shutdown-hang

Caveat

Again: the diagnosis and fix are AI-assisted. The sample and observed freeze are real, but this should be reviewed carefully by someone familiar with Ghostty's search/thread/mailbox shutdown behavior.


Summary by cubic

Fixes a hang when closing a surface while search is active by removing unbounded waits in search-thread callbacks. Search updates now use short timeouts and switch to instant/best-effort during teardown.

  • Bug Fixes
    • Added search_tearing_down atomic flag set in Surface.deinit before joining the search thread.
    • Replaced .forever mailbox pushes with bounded waits (search_callback_push_timeout_ns = 50ms); use instant timeouts during teardown.
    • Added helper push functions that drop messages if queues are full and deinit arenas; wakeup notify now logs warnings instead of blocking.

Written for commit db98b99. Summary will update on new commits.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced search thread teardown handling to prevent potential deadlocks during application shutdown.
    • Optimized search callback message processing with adaptive timeout logic during shutdown.
    • Improved error handling for search-related notifications.


coderabbitai Bot commented May 12, 2026

📝 Walkthrough


Search thread callback handling is hardened during Surface teardown by introducing an atomic flag and dynamic timeout values that prevent blocking operations when the app thread synchronously joins. Helpers centralize mailbox push logic, deinit sets the teardown flag before thread shutdown, and all search event handlers now use best-effort timeouts during teardown instead of bounded waits.

Changes

Search thread teardown safety

  • Search teardown state initialization (src/Surface.zig): A search_callback_push_timeout_ns constant defines the normal-operation timeout; a new search_tearing_down atomic field tracks teardown state; deinit sets the flag with release ordering before stopping and joining the search thread.
  • Callback refactoring with timeout helpers (src/Surface.zig): Helper functions such as pushSearchRendererMessage and pushSearchSurfaceMessage centralize mailbox push logic. searchCallback_ loads the teardown flag and computes renderer/surface timeout values: instant (best-effort) during teardown, bounded nanoseconds otherwise.
  • Search event handlers using teardown-aware timeouts (src/Surface.zig): All search event handlers (viewport_matches, selected_match, total_matches, quit/reset) now use the helpers with computed timeouts; renderer wakeup notifications switch from try to catch + warn, preventing error propagation and unblocking the teardown path.

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A search thread needs rest, without dread,
So teardown waves a flag bright red—
Helpers catch the messages in flight,
Best-effort pushes make goodbyes light.
🐰 No deadlock here, just clean shutdown tight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Avoid search-thread deadlocks during surface teardown' directly and precisely summarizes the main change: preventing deadlocks that occur when a surface is closed while search is active by improving teardown handling.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/Surface.zig (1)

1499-1513: ⚡ Quick win

Log dropped search messages when not tearing down.

These helpers silently drop timed-out pushes, which makes intermittent stale search UI hard to triage in normal operation. Please emit a debug/warn when push == 0 and teardown is not active.

Suggested patch
 fn pushSearchRendererMessage(
     self: *Surface,
     message: rendererpkg.Message,
     timeout: rendererpkg.Thread.Mailbox.Timeout,
 ) void {
-    _ = self.renderer_thread.mailbox.push(message, timeout);
+    const pushed = self.renderer_thread.mailbox.push(message, timeout);
+    if (pushed == 0 and !self.search_tearing_down.load(.acquire)) {
+        log.warn("dropping search renderer message due to mailbox backpressure", .{});
+    }
 }
@@
 fn pushSearchSurfaceMessage(
     self: *Surface,
     message: Message,
     timeout: App.Mailbox.Queue.Timeout,
 ) void {
-    _ = self.surfaceMailbox().push(message, timeout);
+    const pushed = self.surfaceMailbox().push(message, timeout);
+    if (pushed == 0 and !self.search_tearing_down.load(.acquire)) {
+        log.warn("dropping search surface message due to mailbox backpressure", .{});
+    }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Surface.zig` around lines 1499 - 1513, the pushSearchRendererMessage and
pushSearchSurfaceMessage helpers currently drop timed-out pushes silently;
update both functions to check the return value of mailbox.push and, if it
equals 0 and the surface is not in teardown (check the search_tearing_down
flag), emit a debug or warn log including context (which helper, message
type/id, and timeout) via the file's `log` scope (log.warn/log.debug). Ensure
the log is only emitted when not tearing down so normal shutdowns remain quiet.


📥 Commits

Reviewing files that changed from the base of the PR and between 41ab6c5 and db98b99.

📒 Files selected for processing (1)
  • src/Surface.zig

chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db98b99fc3


Comment thread src/Surface.zig
);
try self.renderer_thread.wakeup.notify();
self.pushSearchRendererMessage(.{ .search_selected_match = null }, renderer_timeout);
self.pushSearchRendererViewportMatches(.init(self.alloc), &.{}, renderer_timeout);

P2 Preserve the final search reset outside teardown

When search is closed normally (not during surface teardown), is_tearing_down is false, so this quit reset is still sent with the 50 ms .ns timeout; if the renderer mailbox remains full longer than that, BlockingQueue.push returns 0 and pushSearchRendererViewportMatches drops the reset. Because .quit is the last search callback, there is no later update to clear renderer.search_matches/search_selected_match, so stale search highlights can remain visible after the user closes search under renderer backpressure. The teardown path can be best-effort, but the non-teardown quit path needs a reliable way to clear this state.
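One possible direction, sketched here as a hypothetical policy rather than anything in the PR: keep every push best-effort during teardown, and outside teardown let only the final quit reset wait without a bound, because no later callback would repair a dropped reset. The SearchEvent enum and timeoutFor function are illustrative names. The sketch assumes the renderer thread keeps draining its mailbox while the surface is alive.

```zig
const std = @import("std");

/// Hypothetical stand-in for the mailbox timeout type.
pub const Timeout = union(enum) {
    instant: void,
    forever: void,
    ns: u64,
};

/// Hypothetical subset of the search events named in the reviews.
pub const SearchEvent = enum { viewport_matches, selected_match, total_matches, quit };

/// Hypothetical policy: best-effort during teardown; outside teardown the
/// final quit reset waits until delivered, because nothing later would fix a
/// dropped reset, while ordinary updates keep the 50 ms bound.
pub fn timeoutFor(event: SearchEvent, tearing_down: bool) Timeout {
    if (tearing_down) return .instant;
    return switch (event) {
        .quit => .forever,
        .viewport_matches, .selected_match, .total_matches => .{ .ns = 50 * std.time.ns_per_ms },
    };
}
```

The trade-off is that an unbounded wait comes back on the ordinary quit path. That is only acceptable if nothing can be blocked waiting on the search thread at that moment.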



greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR fixes a deadlock that occurs when a terminal surface is torn down while the search thread is active: Surface.deinit used to join the search thread synchronously while that thread could be blocked on a full renderer/app mailbox with .forever pushes, causing both threads to wait on each other indefinitely.

  • Introduces a search_tearing_down atomic flag set in deinit before joining the search thread; search callbacks check it on every invocation and switch from a 50 ms bounded wait (normal operation) to an instant/non-blocking push (teardown), so the search thread always terminates quickly.
  • Extracts four push-helper functions for renderer/surface mailbox delivery; the two arena-backed helpers correctly free the arena on a failed push, while the generic helper is intentionally arena-free (though undocumented).
  • Converts try wakeup.notify() calls to catch |err| log.warn(...), narrowing the error surface and removing the pre-existing risk of an erroneously fired errdefer arena.deinit() on notify failure.

Confidence Score: 4/5

Safe to merge for the deadlock fix itself; the introduced wakeup-after-dropped-message and dangling-errdefer patterns are minor issues that won't cause runtime failures today.

The core threading fix is well-reasoned: the atomic store happens-before the join, the two specialised push helpers free their arenas on failure, and the bounded 50 ms wait eliminates the infinite-block path. The two concerns flagged — unconditional wakeup.notify() when a push was dropped, and the errdefer remaining live after value-based arena transfer — are both currently harmless but represent structural fragility that could bite on future changes to the callback body.

src/Surface.zig — specifically the searchCallback_ function and the four new push-helper wrappers; the arena ownership handoff and the post-push wakeup pattern deserve a second look.

Important Files Changed

  • src/Surface.zig: Adds a search_tearing_down atomic flag and four push-helper wrappers that replace all .forever mailbox pushes in search callbacks with bounded timeouts (50 ms in normal operation, instant/best-effort during teardown), directly targeting a deadlock where the search thread could block indefinitely on a full renderer/app mailbox while the app thread was already waiting in pthread_join. Arena cleanup on failed pushes is correctly handled in the two specialised helpers; the generic pushSearchRendererMessage is arena-free and the discard is safe but undocumented. A dangling errdefer after value-based arena transfers is a latent double-free risk, and wakeup.notify() is sent unconditionally even when messages are dropped.

Sequence Diagram

sequenceDiagram
    participant AT as App Thread (deinit)
    participant ST as Search Thread (callback)
    participant RM as Renderer Mailbox
    participant RT as Renderer Thread

    Note over AT,RT: Normal operation
    ST->>ST: searchCallback_() called
    ST->>ST: load search_tearing_down false
    ST->>ST: "timeout = .{ .ns = 50ms }"
    ST->>RM: push(message, 50ms timeout)
    RM-->>ST: "pushed >= 1 (success)"
    ST->>RT: wakeup.notify()

    Note over AT,RT: Teardown path (fix)
    AT->>AT: search_tearing_down.store(true, .release)
    AT->>ST: s.deinit() signals thread to stop
    Note over AT: join() waiting here

    ST->>ST: load search_tearing_down true
    ST->>ST: "timeout = .{ .instant = {} }"
    ST->>RM: push(message, instant)
    alt mailbox full
        RM-->>ST: "pushed = 0 (fail fast)"
        ST->>ST: arena.deinit() (no leak)
    else mailbox not full
        RM-->>ST: "pushed >= 1"
    end
    ST->>RT: wakeup.notify() (unconditional)
    ST->>ST: callback returns, thread exits
    AT->>AT: join() returns
    AT->>RT: renderer_thread.stop.notify()
    AT->>RT: renderer_thr.join()
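The teardown ordering in the diagram can be exercised with a small self-contained model. It has three parts: a capacity-1 mailbox that nobody drains, a search thread that emits one final reset when asked to stop, and an app thread that sets the teardown flag before signalling stop and joining. All types and names here are hypothetical stand-ins for the Ghostty objects in the diagram. With a .forever push the join would never return. With the flag set, the final push fails fast and is counted as dropped.

```zig
const std = @import("std");

/// Hypothetical stand-in for the mailbox timeout type.
const Timeout = union(enum) { instant: void, forever: void, ns: u64 };

/// Capacity-1 mailbox, for illustration only.
const Mailbox = struct {
    mutex: std.Thread.Mutex = .{},
    not_full: std.Thread.Condition = .{},
    full: bool = false,

    /// Returns 1 if the message was stored, 0 if it was dropped.
    fn push(self: *Mailbox, timeout: Timeout) usize {
        self.mutex.lock();
        defer self.mutex.unlock();
        if (self.full) {
            switch (timeout) {
                .instant => return 0,
                .forever => while (self.full) self.not_full.wait(&self.mutex),
                .ns => |ns| {
                    self.not_full.timedWait(&self.mutex, ns) catch return 0;
                    if (self.full) return 0;
                },
            }
        }
        self.full = true;
        return 1;
    }
};

/// Hypothetical model of the per-surface search thread.
const Search = struct {
    mailbox: *Mailbox,
    tearing_down: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    stop: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    dropped: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),

    /// Thread body: idle until asked to stop, then emit one final reset.
    fn run(self: *Search) void {
        while (!self.stop.load(.acquire)) std.Thread.yield() catch {};
        // `tearing_down` was stored before `stop`, so it is visible here.
        const timeout: Timeout = if (self.tearing_down.load(.acquire))
            .instant
        else
            .{ .ns = 50 * std.time.ns_per_ms };
        if (self.mailbox.push(timeout) == 0) _ = self.dropped.fetchAdd(1, .monotonic);
    }
};

/// App-thread teardown in the order shown in the diagram.
fn teardown(search: *Search, thread: std.Thread) void {
    search.tearing_down.store(true, .release); // 1. mark teardown first
    search.stop.store(true, .release); // 2. ask the search thread to stop
    thread.join(); // 3. returns because the final push cannot block
}
```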

Reviews (1): Last reviewed commit: "Avoid search callback deadlocks during s..."

Comment thread src/Surface.zig
Comment on lines +1546 to +1547
self.renderer_thread.wakeup.notify() catch |err|
    log.warn("error notifying renderer thread after search viewport update err={}", .{err});

P2 Unconditional wakeup after a potentially-dropped message

wakeup.notify() is called regardless of whether the preceding push succeeded. When the push times out or returns 0 (message dropped), the renderer thread is woken up with no new message in the mailbox. The renderer will drain its queue and go back to sleep harmlessly, but the wakeup signal is semantically misleading — its contract is "a new message is waiting." Under heavy load this path can fire repeatedly without effect. Consider only calling wakeup.notify() when the push returned a non-zero count.
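Here is a minimal sketch of the suggestion, using hypothetical Mailbox and Wakeup stand-ins rather than the real renderer types. The consumer is notified only when the push actually stored a message. A failed notify is logged instead of propagated, mirroring the PR's catch |err| log.warn(...) change.

```zig
const std = @import("std");

/// Hypothetical stand-in for the renderer wakeup handle.
const Wakeup = struct {
    notified: usize = 0,
    fail: bool = false,

    fn notify(self: *Wakeup) error{NotifyFailed}!void {
        if (self.fail) return error.NotifyFailed;
        self.notified += 1;
    }
};

/// Hypothetical non-blocking mailbox; only the number of queued messages matters.
const Mailbox = struct {
    capacity: usize,
    len: usize = 0,

    /// Returns the new length, or 0 if the message was dropped.
    fn push(self: *Mailbox) usize {
        if (self.len == self.capacity) return 0;
        self.len += 1;
        return self.len;
    }
};

/// Push, and wake the consumer only if a message actually landed.
/// Returns whether the message was enqueued.
fn pushAndWake(mailbox: *Mailbox, wakeup: *Wakeup) bool {
    if (mailbox.push() == 0) return false; // dropped: nothing to wake up for
    wakeup.notify() catch |err| {
        // Log instead of propagating, as the PR does for wakeup failures.
        std.log.warn("error notifying renderer thread err={s}", .{@errorName(err)});
    };
    return true;
}
```

Skipping the wakeup when nothing was queued only avoids a spurious wake. As noted above, the current unconditional notify is harmless.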

Comment thread src/Surface.zig
Comment on lines 1537 to 1548
@@ -1463,14 +1542,9 @@ fn searchCallback_(
 const matches = try alloc.dupe(terminal.highlight.Flattened, matches_unowned);
 for (matches) |*m| m.* = try m.clone(alloc);
 
-_ = self.renderer_thread.mailbox.push(
-    .{ .search_viewport_matches = .{
-        .arena = arena,
-        .matches = matches,
-    } },
-    .forever,
-);
-try self.renderer_thread.wakeup.notify();
+self.pushSearchRendererViewportMatches(arena, matches, renderer_timeout);
+self.renderer_thread.wakeup.notify() catch |err|
+    log.warn("error notifying renderer thread after search viewport update err={}", .{err});
 },

P2 Dangling errdefer after arena ownership transfer by value

arena is passed by value to pushSearchRendererViewportMatches. Inside that helper, when the push fails, the helper's copy calls arena.deinit(), freeing the underlying memory pool. The caller's arena variable (still on the stack with errdefer arena.deinit() attached) now holds dangling internal pointers. If a try-able call were ever added after the push helper — or if the helper's ownership model changes — the errdefer would trigger a double-free. Today it is safe because no errors propagate after the push, but it is a fragile ownership pattern. The same applies to the selected_match branch. Consider removing or cancelling the errdefer after transferring ownership, or returning the arena back from the helper so the caller retains exclusive ownership.
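One way to follow the suggestion is an ownership flag that the errdefer checks, cleared at the moment the arena is handed to the push helper. In this sketch, Message, pushOrFree, and produce are hypothetical stand-ins for the renderer message, the PR's arena-freeing push helpers, and the searchCallback_ branch. std.testing.allocator is a leak-checking allocator, and the test exercises both a dropped and an accepted handoff.

```zig
const std = @import("std");

/// Hypothetical arena-backed message, standing in for search_viewport_matches.
const Message = struct {
    arena: std.heap.ArenaAllocator,
    matches: []u32,
};

/// Stand-in for the PR's push helpers: it always takes ownership of the
/// arena, storing the message on success and freeing the arena on a drop.
fn pushOrFree(msg: Message, accepted: bool, inbox: *?Message) void {
    if (accepted) {
        inbox.* = msg;
    } else {
        msg.arena.deinit();
    }
}

/// Stand-in for the callback branch that builds and hands off the message.
fn produce(gpa: std.mem.Allocator, accepted: bool, fail_late: bool, inbox: *?Message) !void {
    var arena = std.heap.ArenaAllocator.init(gpa);
    // True while this function still owns the arena.
    var arena_owned = true;
    errdefer if (arena_owned) arena.deinit();

    const alloc = arena.allocator();
    const matches = try alloc.alloc(u32, 4);
    @memset(matches, 7);

    // Hand off ownership: from here on the errdefer above is a no-op.
    arena_owned = false;
    pushOrFree(.{ .arena = arena, .matches = matches }, accepted, inbox);

    // A fallible step added after the handoff can no longer double-free.
    if (fail_late) return error.LateFailure;
}
```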

Comment thread src/Surface.zig
Comment on lines +1499 to +1505
fn pushSearchRendererMessage(
    self: *Surface,
    message: rendererpkg.Message,
    timeout: rendererpkg.Thread.Mailbox.Timeout,
) void {
    _ = self.renderer_thread.mailbox.push(message, timeout);
}

P2 pushSearchRendererMessage silently discards non-arena message drops

Unlike pushSearchRendererViewportMatches and pushSearchRendererSelectedMatch, this helper ignores the push return value without comment. Currently it is only called with { .search_selected_match = null } — a message with no owned arena — so there is nothing to free on a failed push. However, the asymmetry with the other two helpers (which explicitly handle the pushed == 0 case) makes it easy to accidentally pass an arena-bearing message here in the future. A brief inline comment explaining that this path is intentionally arena-free would prevent misuse.
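This sketch adds the suggested inline comment, plus a debug-only assertion that goes slightly beyond it: the generic helper documents that it is arena-free and asserts that no arena-backed variant reaches it. Message, Mailbox, and pushArenaFreeMessage are hypothetical simplifications of the renderer types.

```zig
const std = @import("std");

/// Hypothetical simplification of the renderer message: one arena-backed
/// variant and one that owns no memory.
const Message = union(enum) {
    search_viewport_matches: std.heap.ArenaAllocator,
    search_selected_match: ?u32,
};

/// Hypothetical non-blocking mailbox with room for two messages.
const Mailbox = struct {
    buf: [2]Message = undefined,
    len: usize = 0,

    /// Returns the new length, or 0 if the message was dropped.
    fn push(self: *Mailbox, message: Message) usize {
        if (self.len == self.buf.len) return 0;
        self.buf[self.len] = message;
        self.len += 1;
        return self.len;
    }
};

/// Best-effort push for messages that own no heap memory. A dropped push
/// needs no cleanup, so the return value is intentionally discarded.
/// Arena-backed variants must use a helper that frees the arena on failure.
fn pushArenaFreeMessage(mailbox: *Mailbox, message: Message) void {
    // Checked in Debug/ReleaseSafe builds: reject arena-backed messages here.
    std.debug.assert(std.meta.activeTag(message) != .search_viewport_matches);
    _ = mailbox.push(message);
}
```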
