Callback for workflow update support #9614
bergundy
left a comment
I think we need just one more round here. For when updates are already completed, let's make sure to generate the new link type we discussed server-side.
    func (l *Library) Components() []*chasm.RegistrableComponent {
        return []*chasm.RegistrableComponent{
            chasm.NewRegistrableComponent[*Workflow](chasm.WorkflowComponentName),
            chasm.NewRegistrableComponent[*WorkflowUpdate](chasm.WorkflowUpdateComponentName),
Given that workflow update is tightly coupled to workflows, it makes total sense to put them in the same library.
    *workflowpb.UpdateState

    // MSPointer is a special in-memory field for accessing the underlying mutable state.
    chasm.MSPointer
This was only supposed to be embedded in the top level Workflow component but I can see why you'd want to access it here. No strong opinion because either way this would be a workaround. I wonder though if you need to embed this or if it'd be better to make it a named field.
It was embedded in the Workflow component, so I embedded it here as well.
If it's not embedded, it would also need to be an exported field; otherwise CHASM tree deserialization will not work. Embedding is probably OK here, to keep a similar convention.
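For anyone following this thread, here is a minimal sketch of why an unexported named field would be a problem for a reflection-based deserializer. It uses only Go's standard `reflect` package, not CHASM's actual tree deserialization. `Pointer`, `embedded`, `namedUnexported`, `namedExported`, and `settableFields` are hypothetical names made up for the illustration.

```go
package main

import "reflect"

// Pointer is a hypothetical stand-in for an exported type such as
// chasm.MSPointer. It is not the real type.
type Pointer struct {
	Ref string
}

// embedded embeds the exported type, so the implicit field name "Pointer"
// is itself exported.
type embedded struct {
	Pointer
}

// namedUnexported holds the same type in an unexported named field.
type namedUnexported struct {
	ptr Pointer
}

// namedExported holds the same type in an exported named field.
type namedExported struct {
	Ptr Pointer
}

// settableFields returns the names of the top-level fields that reflection
// is allowed to set. The argument must be a pointer to a struct, otherwise
// the reflect calls panic.
func settableFields(structPtr any) []string {
	v := reflect.ValueOf(structPtr).Elem()
	t := v.Type()
	var names []string
	for i := 0; i < v.NumField(); i++ {
		// CanSet is false for fields reached through an unexported name.
		if v.Field(i).CanSet() {
			names = append(names, t.Field(i).Name)
		}
	}
	return names
}
```

With these types, `settableFields(&embedded{})` and `settableFields(&namedExported{})` each report one field, while `settableFields(&namedUnexported{})` reports none. That is the general Go behavior behind the comment above, assuming the deserializer populates fields through reflection.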
    )
    MaxCallbacksPerUpdateID = NewNamespaceIntSetting(
        "system.maxCallbacksPerUpdateID",
        32,
I think limiting all of the workflow callbacks, regardless of what component they're attached to, makes more sense than a per-component limit, because the entire tree needs to be loaded into memory when mutable state is accessed today.
I limited all workflow callbacks too. I added this per-update limit in addition, to keep one update from using up the whole callback limit on a workflow.
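A rough sketch of how the two limits discussed here could compose, for readers of the thread. This is not the code in the PR. `callbackLimits` and `checkUpdateCallbackLimits` are hypothetical names, and the workflow-wide cap is passed in as a parameter because its actual value is not shown in this conversation. Only the per-update default of 32 comes from the snippet above.

```go
package main

import "fmt"

// callbackLimits holds two caps: one across every callback on the workflow
// (whatever component it is attached to) and one per update ID.
// Both fields are hypothetical stand-ins for the real dynamic config.
type callbackLimits struct {
	maxPerWorkflow int
	maxPerUpdateID int
}

// checkUpdateCallbackLimits reports whether `adding` more callbacks may be
// attached to one update. workflowTotal counts all callbacks already on the
// workflow, including update-scoped ones; updateTotal counts only the
// callbacks already attached to this update ID.
func checkUpdateCallbackLimits(l callbackLimits, workflowTotal, updateTotal, adding int) error {
	if updateTotal+adding > l.maxPerUpdateID {
		return fmt.Errorf("update callback limit exceeded: %d > %d",
			updateTotal+adding, l.maxPerUpdateID)
	}
	if workflowTotal+adding > l.maxPerWorkflow {
		return fmt.Errorf("workflow callback limit exceeded: %d > %d",
			workflowTotal+adding, l.maxPerWorkflow)
	}
	return nil
}
```

Checking both caps means a single update cannot consume the whole workflow budget, while the workflow-wide cap still bounds how much has to be loaded into memory with the tree.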
stephanos
left a comment
Only made it half-way through so far; but figured I can send my first review comments now.
    links []*commonpb.Link,
    identity string,
    priority *commonpb.Priority,
    workflowUpdateOptions map[string]*historypb.WorkflowExecutionOptionsUpdatedEventAttributes_WorkflowUpdateOptionsUpdate,
I know it's not wrong, but ... WorkflowUpdateOptionsUpdate 😬
(non-blocking; just noticing)
Yeah I agree
a453230 to 09ac27a
    // - The event will be written atomically with acceptance
    // If the Update struct is lost (registry cleared), the abort mechanism fires
    // registryClearedErr on the caller's future, prompting an immediate retry.
    if u.state == stateAdmitted || u.state == stateSent {
Added handling for stateAdmitted. It should be the same as stateSent but returns false, nil since, IIUC, the caller still needs to create the speculative WFT at this stage.
09ac27a to 9de5339
EDIT, leaving comment up for posterity: ignore this, latest state reverts these changes
9de5339 to 4b0915d
8551a4f to 3ae1202
3ae1202 to 2ce7339
72b65be to 4b7757c
## What changed?
Added a `createExternalNexusServer(...)` which sets up an external Nexus endpoint with user-provided handler and listens on a provided address. This is used in nexus_workflow_test.go and will be used more in #9614

Opportunistically did a couple more drive-by refactors/consistency fixes, specifically:
* Force user to provide `ctx` into the endpoint creation functions instead of making a new `ctx`
* Use `env.Context()` instead of `testcore.NewContext()` in all suites that I touched here

## Why?
Pulling changes out of #9614 into targeted PRs to reduce load on reviewers.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
579442b to 8541095
8541095 to d65dcaa
    } else {
        outcome = cevent.GetWorkflowExecutionUpdateCompletedEventAttributes().GetOutcome()
        closeTime = cevent.GetEventTime().AsTime()
    }
Transient errors incorrectly produce permanent failure outcome for callbacks
High Severity
GetNexusUpdateCompletion treats all errors from getUpdateOutcomeEvent the same — including transient I/O errors from the events cache. When the workflow is complete, the fallback path returns AcceptedUpdateCompletedWorkflowFailure as the operation result instead of propagating the transient error. This delivers a permanently incorrect failure to the Nexus caller, even though the update may have succeeded. The fallback logic needs to distinguish "update not found/not completed" errors from transient errors before assuming the update outcome is missing.
Reviewed by Cursor Bugbot for commit d65dcaa.
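One way to make the distinction described in this report, sketched with the standard `errors` package. It is an illustration only and does not reflect how `GetNexusUpdateCompletion` or `getUpdateOutcomeEvent` are actually written. `errUpdateNotFound`, `errCacheUnavailable`, `completion`, and `resolveCompletion` are hypothetical names, and the outcome strings are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpdateNotFound is a hypothetical sentinel meaning "no completion event
// exists for this update". errCacheUnavailable is a hypothetical example of
// a transient failure, such as an events cache read error.
var (
	errUpdateNotFound   = errors.New("update outcome not found")
	errCacheUnavailable = errors.New("events cache unavailable")
)

// completion is a placeholder for whatever is delivered to the Nexus caller.
type completion struct {
	Outcome string
}

// resolveCompletion falls back to the "workflow completed before the update
// did" failure only when the lookup definitively found no outcome and the
// workflow is closed. Every other error is returned so the caller can retry.
func resolveCompletion(lookup func() (string, error), workflowClosed bool) (completion, error) {
	outcome, err := lookup()
	switch {
	case err == nil:
		return completion{Outcome: outcome}, nil
	case errors.Is(err, errUpdateNotFound) && workflowClosed:
		return completion{Outcome: "accepted-update-completed-workflow-failure"}, nil
	default:
		// Transient or unknown: propagate instead of inventing an outcome.
		return completion{}, fmt.Errorf("looking up update outcome: %w", err)
	}
}
```

The key design point is that the permanent failure is chosen by matching a specific error, so an I/O hiccup never gets turned into a final result for the caller.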
93da77d to a187f71
Squashed these commits, left for posterity:
- Add Nexus Workflow Update
- Update from rebase
- Fix sent state
- Cleanup
- Fix lint
- Fix more CI
- fix
- Review clean up
- Try suggestions from the review skill
- Fix some tests
- Add TODO for rejected event
- Remove .omc from gitignore
- Respond to PR comments
- Add NS Capability for this feature
- Respond to PR comments
- Update API
a187f71 to 656086b
Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit 656086b.
    }

    return callback.ScheduleStandbyCallbacks(ctx, wf.Callbacks)
    return wf.ProcessCloseCallbacks(ctx)
cc @bergundy LMK if this is right. I think we need to fire update callbacks here as well. If updates fail to complete before the workflow finishes, we should probably propagate this error back vs. waiting for the completion callbacks to time out:
temporal/service/history/workflow/update/errors_failures.go
Lines 27 to 34 in 593fdba
I tightened up the assertion in the test with assertAcceptedUpdateCompletedWorkflowError(...) to assert that we actually do propagate it back.
Without the tighter assertion, the caller workflow would just time out, since the update completion callbacks never fired.
We need to fire all of the standby update callbacks as soon as the run they are attached to completes. This is slightly different from what we do with workflow close callbacks, which can be reattached to a following run if the workflow retries or continues as new. I didn't re-review the PR, so I trust that this is covered by functional tests and we are good.
replying for posterity from offline discussion: this change is good, we always wanna schedule the update-level callbacks when we schedule workflow-level callbacks
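To make the rule from this thread concrete, here is a small sketch of the split between update-level and workflow-level callbacks when a run closes. It is a simplified model, not the PR's `ProcessCloseCallbacks`. The `callback` and `run` types and the `closeRun` function are hypothetical.

```go
package main

// callback is a placeholder for a registered completion callback.
type callback struct {
	URL string
}

// run models one workflow run with workflow-level callbacks and
// update-level callbacks keyed by update ID. Hypothetical structure.
type run struct {
	workflowCallbacks []callback
	updateCallbacks   map[string][]callback
}

// closeRun returns the callbacks to fire when the run closes and the ones to
// carry over to a following run. Update-level callbacks always fire, because
// they belong to this run. Workflow-level callbacks are carried over when a
// next run exists (retry or continue-as-new) and fire otherwise.
func closeRun(r run, hasNextRun bool) (fire []callback, carry []callback) {
	for _, cbs := range r.updateCallbacks {
		fire = append(fire, cbs...)
	}
	if hasNextRun {
		return fire, r.workflowCallbacks
	}
	return append(fire, r.workflowCallbacks...), nil
}
```

In this model a run with two update callbacks and one workflow callback fires two and carries one when it continues as new, and fires all three when it is the final run.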
yycptt
left a comment
Stamping the chasm NodeBackend change


What changed?
Added support for Nexus workflow update completion callbacks via CHASM. This allows a Nexus caller to be notified when a workflow update completes by attaching completion callbacks to the update request.
Why?
Nexus operations that target workflow updates need a way to receive completion notifications. Without this, a Nexus caller that sends an update has no async mechanism to learn when the update finishes. Completion callbacks enable the same async notification pattern that already exists for workflow-level Nexus operations.
How did you test it?
Potential risks
Touches speculative workflow updates, which are always hard to reason about. Tried to compensate with lots of test coverage.
Note: Needs this API PR https://github.com/temporalio/api/pull/742/changes
Note
High Risk
Touches workflow update state machine and mutable state/history event paths to persist, fire, and describe per-update callbacks, including rejection/continue-as-new/retry handling; mistakes could drop callbacks or affect update lifecycle behavior.
Overview
Adds CHASM-backed completion callbacks for workflow updates so Nexus callers can register callbacks on `UpdateWorkflowExecution` and receive async completion (or rejection) notifications.
This introduces a new `WorkflowUpdate` CHASM component with persisted `UpdateState` (including validator rejection failure), per-update callback storage, and Nexus completion lookup via a new `GetNexusUpdateCompletion` backend path. Callback registration is extended to support update-scoped limits (`MaxCallbacksPerUpdateID`) and gated by `EnableWorkflowUpdateCallbacks`, with DescribeWorkflow now reporting both workflow- and update-triggered callbacks.
Update handling is expanded to persist late-attached callbacks via `WorkflowExecutionOptionsUpdated` events (including per-update options), buffer/flush callbacks while updates are in-flight, fire update callbacks on update completion, and ensure update callbacks are triggered on workflow close/continue-as-new/retry while leaving workflow-level callbacks inheritable.
Reviewed by Cursor Bugbot for commit 656086b. Bugbot is set up for automated code reviews on this repo.