
refactor(ntx-builder): simplify coordinator-actor messaging with Notify (#1699)

Open
SantiagoPittella wants to merge 25 commits into next from santiagopittella-ntx-simplify-actor-messages

Conversation

@SantiagoPittella
Collaborator

First task of #1694

  • Replace per-actor mpsc<Arc<MempoolEvent>> channels with Arc<Notify>. The DB is already the source of truth (since chore(ntx): replace in memory with sqlite database #1662), so actors only need a "re-check your state" signal and not the event payload. This removes all SendError handling, failed-actor cleanup, and the send() helper from the coordinator.
  • In TransactionInflight mode, actors now query the DB (is_transaction_resolved) to check if their awaited transaction was committed/reverted, since Notify doesn't carry payload.
  • Unify actor -> coordinator communication into a single ActorRequest enum over one mpsc channel. NotesFailed carries a oneshot ack to prevent a race condition where the actor could re-select notes before failure counts are persisted (context). CacheNoteScript remains fire-and-forget.
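The unified channel plus the acked failure path can be sketched roughly like this. This is a std-only stand-in: `std::sync::mpsc` replaces tokio's mpsc/oneshot, and the names (`ActorRequest`, `NotesFailed`, `CacheNoteScript`) mirror the PR description but may not match the real code exactly:

```rust
use std::sync::mpsc;

// Single request type for all actor -> coordinator messages.
// SyncSender<()> stands in for a tokio oneshot ack channel.
enum ActorRequest {
    // Acked: the actor must not re-select notes until failure counts persist.
    NotesFailed { nullifiers: Vec<u64>, block_num: u32, ack_tx: mpsc::SyncSender<()> },
    // Fire-and-forget: no ack channel.
    CacheNoteScript { script_root: u64 },
}

// Coordinator side: persist first, ack second, so the blocked actor only
// resumes once the failure counts are durable.
fn coordinator_handle(request: ActorRequest, store: &mut Vec<(Vec<u64>, u32)>) {
    match request {
        ActorRequest::NotesFailed { nullifiers, block_num, ack_tx } => {
            store.push((nullifiers, block_num)); // "persist"
            let _ = ack_tx.send(());             // release the waiting actor
        }
        ActorRequest::CacheNoteScript { .. } => { /* no ack */ }
    }
}

fn main() {
    let mut store = Vec::new();
    let (ack_tx, ack_rx) = mpsc::sync_channel(1); // oneshot stand-in
    coordinator_handle(
        ActorRequest::NotesFailed { nullifiers: vec![1, 2], block_num: 7, ack_tx },
        &mut store,
    );
    // Actor side: block here until the coordinator has persisted the failure.
    ack_rx.recv().expect("ack after persisting");
    assert_eq!(store.len(), 1);
}
```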

@SantiagoPittella added the no changelog label (This PR does not require an entry in the `CHANGELOG.md` file) on Feb 23, 2026
Collaborator

@sergerad sergerad left a comment


LGTM, just a few questions on errors we don't handle.

Copy link
Collaborator

@igamigo igamigo left a comment


Leaving some comments. It would be good to load test this, even lightly, to see that it performs well.

Comment on lines +296 to +301
```rust
match request {
    ActorRequest::NotesFailed { nullifiers, block_num, ack_tx } => {
        if let Err(err) = self.db.notes_failed(nullifiers, block_num).await {
            tracing::error!(err = %err, "failed to mark notes as failed");
        }
        let _ = ack_tx.send(());
```
Collaborator

Is this supposed to be acked even if the db.notes_failed() call failed?

Collaborator Author

I propagated the error instead, before sending the ack.

Collaborator Author

Now making the builder halt if it fails to write to the database; this behavior is more consistent with all the other processes in the service.
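The halt-on-DB-error behavior could look roughly like this: propagate the error instead of logging it, and ack only after a successful write. A synchronous std-only sketch; `Db` is a hypothetical stand-in for the real async sqlite-backed store:

```rust
use std::sync::mpsc;

// Hypothetical stand-in for the real sqlite-backed store.
struct Db { fail_writes: bool }

impl Db {
    fn notes_failed(&mut self, _nullifiers: Vec<u64>, _block_num: u32) -> Result<(), String> {
        if self.fail_writes { Err("sqlite write failed".into()) } else { Ok(()) }
    }
}

// Propagate the DB error (halting the builder) instead of swallowing it,
// and only ack once the failure counts are durably written.
fn handle_notes_failed(
    db: &mut Db,
    nullifiers: Vec<u64>,
    block_num: u32,
    ack_tx: mpsc::SyncSender<()>,
) -> Result<(), String> {
    db.notes_failed(nullifiers, block_num)?;
    let _ = ack_tx.send(());
    Ok(())
}

fn main() {
    let (ack_tx, ack_rx) = mpsc::sync_channel(1);
    let mut db = Db { fail_writes: true };
    // On failure the error propagates and no ack is sent, so the actor
    // never resumes against un-persisted failure counts.
    assert!(handle_notes_failed(&mut db, vec![1], 10, ack_tx).is_err());
    assert!(ack_rx.try_recv().is_err());
}
```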

Comment on lines +165 to +168
```rust
ActorShutdownReason::DbError(account_id) => {
    tracing::error!(account_id = %account_id, "Account actor shut down due to DB error");
    Ok(())
},
```
Collaborator

Is the idea not to update the registry and remove the actor handle when this happens? Took a quick look at the other branches and could not find it either.

Collaborator Author

We should update it. I replaced it.

Comment on lines 140 to 144
```rust
pub fn broadcast(&self) {
    for handle in self.actor_registry.values() {
        handle.notify.notify_one();
    }
}
```
Collaborator

Will this scale well? Feels like broadcasting to all actors will cost a lot of unnecessary queries. For instance, I saw that select_candidate_from_db queries account state and notes through 2 different queries in the DB. The account state one is probably not trivial though the other one might be.
I know this is what the PR was partially trying to change but wonder if sending the data here to filter in memory is worth it.

Collaborator

Also, I believe there is a low-hanging optimization opportunity with the 2 queries in the DB: you can check for notes first and bail if none are found.

Collaborator Author

This might be wasteful at scale; I'm looking for ways to always send targeted notifications.

Collaborator Author

I sent a new commit addressing this, let me know what you think.
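A targeted scheme could select which actors to wake instead of broadcasting. In this sketch, everything (the registry shape keyed by account id, the `affected` set, the `notify_targets` helper) is a guess at the design, not the PR's actual code:

```rust
use std::collections::{HashMap, HashSet};

type AccountId = u64;

// Given the coordinator's actor registry and the set of accounts touched by a
// mempool change, return only the actors that should be woken. Each returned
// id would then get handle.notify.notify_one(); the rest skip their DB re-query.
fn notify_targets(registry: &HashMap<AccountId, &str>, affected: &HashSet<AccountId>) -> Vec<AccountId> {
    let mut ids: Vec<AccountId> =
        registry.keys().copied().filter(|id| affected.contains(id)).collect();
    ids.sort_unstable();
    ids
}

fn main() {
    let registry: HashMap<AccountId, &str> =
        [(1, "actor-1"), (2, "actor-2"), (3, "actor-3")].into_iter().collect();
    let affected: HashSet<AccountId> = [2].into_iter().collect();
    // Only actor 2 is woken; actors 1 and 3 avoid unnecessary queries.
    assert_eq!(notify_targets(&registry, &affected), vec![2]);
}
```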

@SantiagoPittella requested a review from igamigo on March 3, 2026 17:05
```rust
    Ok(v) => v,
    Err(reason) => return reason,
};
if !exists {
```
Collaborator

I think this is inverted maybe? We're waiting for the tx to arrive, so once it arrives (aka exists) we should exit this mode?

Collaborator Author

If the transaction exists, it means it wasn't committed/reverted yet, so !exists signals that we can take on a new task.

Collaborator

Not quite -- we submitted a tx to the node and now we're waiting for it to be acknowledged via the mempool subscription, and therefore it exists. Once it's committed or reverted, it's already past that stage.

Though that does raise some further questions. This might be quite racy if a tx comes in and is reverted or committed before we handle it.

Collaborator

Hmm. I'm unsure how to handle this tbh. The idea was to wait for the mempool ack event to arrive before attempting another tx. But now with the database + notify it's possible to miss the tx arriving.

Instead of checking for the tx, we could wait for the account commitment changing. But that only works if the tx is committed, and fails if it's reverted.

We could add a last_updated, or update count field, but that seems extreme.
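For what it's worth, the update-count idea could be sketched like this. Purely illustrative; `AccountRow` and the counter field are hypothetical, not part of the PR:

```rust
// A per-account counter bumped on *every* resolution, whether the tx was
// committed or reverted. Unlike comparing the account commitment, this also
// detects reverts; unlike checking `exists`, it cannot miss a tx that
// arrived and resolved before the actor woke up.
struct AccountRow { update_count: u64 }

impl AccountRow {
    // Commit or revert: either outcome advances the counter.
    fn record_outcome(&mut self) { self.update_count += 1; }
}

// Actor side: snapshot the counter at submit time, then after each notify
// ask whether it advanced.
fn tx_resolved(snapshot: u64, row: &AccountRow) -> bool {
    row.update_count > snapshot
}

fn main() {
    let mut row = AccountRow { update_count: 5 };
    let snapshot = row.update_count;       // taken when the tx is submitted
    assert!(!tx_resolved(snapshot, &row)); // still pending
    row.record_outcome();                  // commit or revert, counter moves
    assert!(tx_resolved(snapshot, &row));  // resolution detected
}
```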

Collaborator

But we can punt on this in this PR - we must just ensure we address this before the next release.

Collaborator Author

Ohhh, shoot, you're right. Until the txadded event arrives, exists doesn't hold the value that I expected, so using exists == true fixes it. It was working because it was stalling the actor until the next event, but since I was testing with the network monitor there was always a new event.
