Skip to content

fix: prevent deadlock in donNotifier.Subscribe() on concurrent NotifyDonSet#21963

Closed
Fletch153 wants to merge 3 commits intodevelopfrom
fix/CORE-2378-don-notifier-deadlock
Closed

fix: prevent deadlock in donNotifier.Subscribe() on concurrent NotifyDonSet#21963
Fletch153 wants to merge 3 commits intodevelopfrom
fix/CORE-2378-don-notifier-deadlock

Conversation

@Fletch153
Copy link
Copy Markdown
Collaborator

@Fletch153 Fletch153 commented Apr 10, 2026

Summary

Fixes a deadlock in Subscribe() that occurs when NotifyDonSet() fires concurrently.

The channel returned by Subscribe() has a buffer of 1. In the original code, Subscribe() registered the subscriber in the map before sending the cached DON value with a blocking send. If NotifyDonSet() ran between those two steps, it would fill the buffer via its non-blocking select send, and then the blocking s <- *n.don.Load() in Subscribe() would block forever — the caller has not received the channel yet, so nobody is reading from it.

Root Cause

Race window in the original Subscribe():

  1. n.subscribers.Store(s, struct{}{}) — subscriber visible to NotifyDonSet
  2. NotifyDonSet() runs concurrently: stores new DON, iterates subscribers, non-blocking send fills the 1-capacity buffer
  3. s <- *n.don.Load()blocking send on a now-full buffer, no reader exists yet — deadlock

Note: NotifyDonSet() itself uses a non-blocking select and cannot deadlock. The deadlock is in Subscribe()'s own blocking send, not in NotifyDonSet().

Fix

Change the cached-value send in Subscribe() to a non-blocking select, matching the pattern already used in NotifyDonSet(). Registration stays first to guarantee no missed notifications. If a concurrent NotifyDonSet already filled the buffer, the cached send is safely skipped — the subscriber already has a value queued.

Test plan

  • TestDonNotifier passes with -count=100 -race
  • CI passes

#bugfix

Fixes: CORE-2378

…DonSet

Subscribe() used a blocking channel send to deliver the cached DON
value. If NotifyDonSet() concurrently filled the buffer between
subscriber registration and the send, Subscribe() would deadlock.

Switch to a non-blocking select, matching NotifyDonSet()'s pattern.
The subscriber already has the value from NotifyDonSet if the buffer
is full.

Fixes: CORE-2378
@github-actions
Copy link
Copy Markdown
Contributor

👋 Fletch153, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Copy link
Copy Markdown
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 10, 2026

✅ No conflicts with other open PRs targeting develop

@trunk-io
Copy link
Copy Markdown

trunk-io bot commented Apr 10, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

Failed Test Failure Summary Logs
Test_CCIPZeroGasLimitTokenTransfer_EVM2Sui_BurnMintTokenPool Logs ↗︎

View Full Report ↗︎Docs

@Fletch153
Copy link
Copy Markdown
Collaborator Author

/rerun

…DonSet

Subscribe() registered the subscriber channel in the map BEFORE
sending the cached DON value. If NotifyDonSet fired between
registration and the send, it would fill the 1-buffer, and the
subsequent blocking send in Subscribe() would deadlock.

Reorder: send the cached value first (safe — channel is new, buffer
is empty, nobody else has a reference), then register. Eliminates the
race window entirely without changing the blocking send contract.

Fixes: CORE-2378
mchain0
mchain0 previously approved these changes Apr 13, 2026
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not as described. Does it actually fix the issue?

Register the subscriber first (preserving no-missed-notification
guarantee), but use a non-blocking select when sending the cached
value. If a concurrent NotifyDonSet already filled the buffer, the
cached send is safely skipped since the subscriber already has a
notification.

Also fixes a minor double-Load race in the original code by capturing
the pointer once.
@cl-sonarqube-production
Copy link
Copy Markdown

@Fletch153
Copy link
Copy Markdown
Collaborator Author

Closing issue - required further investigation

@Fletch153 Fletch153 closed this Apr 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants