refactor(ntx-builder): simplify coordinator-actor messaging with Notify #1699
SantiagoPittella wants to merge 25 commits into next from
Conversation
sergerad
left a comment
LGTM, just a few questions on errors we don't handle.
igamigo
left a comment
Leaving some comments. Would be good to load test this, even if lightly, to see that it performs well.
match request {
    ActorRequest::NotesFailed { nullifiers, block_num, ack_tx } => {
        if let Err(err) = self.db.notes_failed(nullifiers, block_num).await {
            tracing::error!(err = %err, "failed to mark notes as failed");
        }
        let _ = ack_tx.send(());
Is this supposed to be acked even if the db.notes_failed() call failed?
I propagated the error instead of sending the ack.

The builder now halts if it fails to write to the database; this behavior is more consistent with all the other processes in the service.
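The change discussed above can be sketched as follows: propagate a DB write failure (halting the actor) instead of logging it and acking anyway, so the ack is only reachable on success. The `DbError` type and function names here are illustrative, not the PR's actual API.

```rust
// Hypothetical stand-in for the builder's DB error type.
#[derive(Debug)]
struct DbError(String);

// Simulated DB write; `fail` forces an error for demonstration.
fn notes_failed(fail: bool) -> Result<(), DbError> {
    if fail { Err(DbError("disk full".into())) } else { Ok(()) }
}

// New behavior: `?` propagates the error before the ack is sent,
// so the caller never observes an ack for a write that didn't land.
fn handle_notes_failed(fail: bool, ack: impl FnOnce()) -> Result<(), DbError> {
    notes_failed(fail)?; // propagate instead of swallowing
    ack(); // only reachable after a successful write
    Ok(())
}

fn main() {
    let mut acked = false;
    assert!(handle_notes_failed(false, || acked = true).is_ok());
    assert!(acked); // success path: write persisted, then acked

    let mut acked2 = false;
    assert!(handle_notes_failed(true, || acked2 = true).is_err());
    assert!(!acked2); // failure path: error propagated, no ack sent
    println!("ok");
}
```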
ActorShutdownReason::DbError(account_id) => {
    tracing::error!(account_id = %account_id, "Account actor shut down due to DB error");
    Ok(())
},
Is the idea not to update the registry and remove the actor handle when this happens? Took a quick look at the other branches and could not find it either.

We should update it. I replaced it.
pub fn broadcast(&self) {
    for handle in self.actor_registry.values() {
        handle.notify.notify_one();
    }
}
Will this scale well? Feels like broadcasting to all actors will cost a lot of unnecessary queries. For instance, I saw that select_candidate_from_db queries account state and notes through 2 different queries in the DB. The account-state one is probably not trivial, though the other one might be.
I know this is what the PR was partially trying to change, but I wonder if sending the data here to filter in memory is worth it.

Also, I believe there is a low-hanging optimization opportunity with the 2 queries in the DB. You can check for notes first and bail if none are found.
This might be wasteful at scale; I'm looking for ways to always send targeted notifications.

I sent a new commit addressing this, let me know what you think.
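For readers unfamiliar with the semantics behind this discussion: `tokio::sync::Notify` stores at most one pending "permit", so multiple signals before a wait coalesce into a single wakeup, which is why an actor must re-read its state from the source of truth (the DB) rather than expect one event per notification. Below is a minimal std-only stand-in (a boolean permit under a `Mutex` plus a `Condvar`); it mimics the permit semantics for illustration and is not tokio's implementation.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Std-only stand-in for tokio::sync::Notify: one stored permit.
struct Signal {
    permit: Mutex<bool>,
    cv: Condvar,
}

impl Signal {
    fn new() -> Self {
        Signal { permit: Mutex::new(false), cv: Condvar::new() }
    }
    // Stores a permit; repeated calls before a wait coalesce.
    fn notify_one(&self) {
        *self.permit.lock().unwrap() = true;
        self.cv.notify_one();
    }
    // Waits for a permit, then consumes it.
    fn notified(&self) {
        let mut p = self.permit.lock().unwrap();
        while !*p {
            p = self.cv.wait(p).unwrap();
        }
        *p = false;
    }
}

fn run_demo() -> u32 {
    let sig = Arc::new(Signal::new());
    let db_state = Arc::new(Mutex::new(0u32)); // stand-in for the sqlite DB

    let (s, db) = (Arc::clone(&sig), Arc::clone(&db_state));
    let actor = thread::spawn(move || {
        s.notified();
        // Payload-free wakeup: re-check state from the source of truth.
        *db.lock().unwrap()
    });

    *db_state.lock().unwrap() = 2; // write state first...
    sig.notify_one(); // ...then signal "re-check your state"
    sig.notify_one(); // coalesces into the one pending permit
    actor.join().unwrap()
}

fn main() {
    println!("actor observed state {}", run_demo());
}
```

Because the permit coalesces, the actor sees one wakeup here no matter how many broadcasts fired, and reads the latest state (2) from the shared store.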
Co-authored-by: igamigo <ignacio.amigo@lambdaclass.com>
crates/ntx-builder/src/actor/mod.rs
    Ok(v) => v,
    Err(reason) => return reason,
};
if !exists {
I think this is inverted maybe? We're waiting for the tx to arrive, so once it arrives (aka exists) we should exit this mode?

If the transaction exists, it means it wasn't committed/reverted yet, so !exists signals that we can take on a new task.
Not quite -- we submitted a tx to the node and now we're waiting for it to be acknowledged via the mempool subscription, and therefore exists. Once it's committed or reverted, it's already past that stage.
Though that does raise some further questions. This might be quite racy if a tx comes in and is reverted or committed before we handle it.
Hmm. I'm unsure how to handle this tbh. The idea was to wait for the mempool ack event to arrive before attempting another tx. But now with the database + notify it's possible to miss the tx arriving.
Instead of checking for the tx, we could wait for the account commitment changing. But that only works if the tx is committed, and fails if it's reverted.
We could add a last_updated or update-count field, but that seems extreme.
But we can punt on this in this PR - we must just ensure we address this before the next release.

Ohh, shoot, you're right. Until the TxAdded event arrives, exists doesn't hold the value that I expected, so using exists == true fixes it. It was working because it was stalling the actor until the next event, but since I was testing with the network monitor there was always a new event.
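The corrected wait discussed in this thread can be sketched as a poll-and-recheck loop: the actor stays in TransactionInflight mode until its submitted tx shows up (exists == true), i.e. the mempool ack has been persisted. This is an illustrative sketch only -- a shared bool stands in for the DB query, and a sleep stands in for the Notify wait; none of these names are the PR's actual API.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Stay in "inflight" mode until the awaited tx exists in the DB.
fn wait_for_tx(tx_exists: Arc<Mutex<bool>>) -> bool {
    loop {
        // On each "re-check your state" signal, query the DB.
        if *tx_exists.lock().unwrap() {
            return true; // tx arrived in the mempool; leave inflight mode
        }
        thread::sleep(Duration::from_millis(10)); // stand-in for Notify wait
    }
}

fn main() {
    let tx_exists = Arc::new(Mutex::new(false));
    let flag = Arc::clone(&tx_exists);
    let actor = thread::spawn(move || wait_for_tx(flag));

    thread::sleep(Duration::from_millis(30));
    *tx_exists.lock().unwrap() = true; // mempool TxAdded event persisted

    assert!(actor.join().unwrap());
    println!("actor left TransactionInflight mode");
}
```

Note that this only waits for arrival; as the thread above points out, detecting a tx that is committed or reverted before the actor ever observes it is a separate problem deferred past this PR.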
Co-authored-by: Mirko <48352201+Mirko-von-Leipzig@users.noreply.github.com>
First task of #1694
Replaces the per-actor mpsc<Arc<MempoolEvent>> channels with Arc<Notify>. The DB is already the source of truth (since chore(ntx): replace in memory with sqlite database #1662), so actors only need a "re-check your state" signal and not the event payload. This removes all SendError handling, failed-actor cleanup, and the send() helper from the coordinator.

In TransactionInflight mode, actors now query the DB (is_transaction_resolved) to check if their awaited transaction was committed/reverted, since Notify doesn't carry a payload.

Actors send requests to the coordinator as an ActorRequest enum over one mpsc channel. NotesFailed carries a oneshot ack to prevent a race condition where the actor could re-select notes before failure counts are persisted (context). CacheNoteScript remains fire-and-forget.
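The single-channel request path described above can be sketched as one enum dispatched by the coordinator, where one variant is acked and the other is fire-and-forget. In this sketch std::sync::mpsc stands in for tokio's mpsc and oneshot channels, and everything except the NotesFailed/CacheNoteScript variant names is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// One enum, one channel: NotesFailed is acked, CacheNoteScript is not.
enum ActorRequest {
    NotesFailed { nullifiers: Vec<u64>, ack_tx: mpsc::Sender<()> },
    CacheNoteScript { script_root: u64 },
}

// Coordinator loop: drains requests, acking NotesFailed only after the
// (simulated) write; returns counts so the behavior is observable.
fn run_coordinator(rx: mpsc::Receiver<ActorRequest>) -> (usize, usize) {
    let (mut failed, mut cached) = (0, 0);
    for req in rx {
        match req {
            ActorRequest::NotesFailed { nullifiers, ack_tx } => {
                failed += nullifiers.len(); // persist before acking
                let _ = ack_tx.send(());
            }
            ActorRequest::CacheNoteScript { .. } => cached += 1, // no ack
        }
    }
    (failed, cached)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let coord = thread::spawn(move || run_coordinator(rx));

    // Fire-and-forget: the actor moves on immediately.
    tx.send(ActorRequest::CacheNoteScript { script_root: 7 }).unwrap();

    // Acked: the actor blocks until the failure counts are durable,
    // so it cannot re-select these notes too early.
    let (ack_tx, ack_rx) = mpsc::channel();
    tx.send(ActorRequest::NotesFailed { nullifiers: vec![1, 2], ack_tx }).unwrap();
    ack_rx.recv().unwrap();

    drop(tx); // close the channel so the coordinator loop ends
    let (failed, cached) = coord.join().unwrap();
    println!("failed={} cached={}", failed, cached);
}
```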