Add support for standalone callbacks #10192

Open
chrsmith wants to merge 50 commits into main from chrsmith/standalone-callbacks_ii

Conversation

@chrsmith
Contributor

@chrsmith chrsmith commented May 7, 2026

ℹ️ This PR is a rebase and cleanup from an earlier prototype, #9805.

⚠️ This depends on changes in the api and api-go repos, and will need some care before merging.

What changed?

Adds support for "standalone" callbacks. Today, the CHASM Callback component is used to deliver an arbitrary payload to a URL (e.g. when a Workflow has completed). As part of supplying the Nexus Connector Foundations, this feature adds CRUD operations on callbacks directly, so callers can invoke StartCallbackExecution(...) and get durability guarantees that the callback actually gets invoked.

Why?

The primary (only?) use case for this is completing Nexus operations. With this capability, a Handler can implement a Nexus operation outside of Temporal and, when that operation completes, simply call StartCallbackExecution(...) with the right callback URL and Token. The CHASM Callback machinery will then attempt to deliver the result for the Nexus operation. (The alternative would be, say, implementing the Nexus operation as a Workflow that separately polls the async or out-of-band process.)

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

TBD. Will need to ask around.

@chrsmith chrsmith force-pushed the chrsmith/standalone-callbacks_ii branch 2 times, most recently from 0d3b36b to ad9f7cb Compare May 7, 2026 21:21
@chrsmith chrsmith changed the title [Draft] Standalone Callbacks - II [Draft] Add support for standalone callbacks May 7, 2026
@stephanos stephanos self-requested a review May 7, 2026 21:37
@chrsmith chrsmith requested a review from Quinn-With-Two-Ns May 7, 2026 21:40
@chrsmith chrsmith force-pushed the chrsmith/standalone-callbacks_ii branch from edb387a to 40f2f7e Compare May 7, 2026 22:01
Comment thread chasm/lib/callback/statemachine.go Outdated
Comment thread chasm/lib/callback/library.go Outdated
Comment thread service/frontend/configs/quotas.go
Comment thread chasm/lib/callback/statemachine.go Outdated
Comment thread chasm/lib/callback/library.go Outdated
Comment thread chasm/lib/callback/frontend_validation.go
Comment thread chasm/lib/callback/frontend_validation.go Outdated
Comment thread chasm/lib/callback/frontend_validation.go
}

// ScheduleToCloseTimeout
if req.GetScheduleToCloseTimeout() == nil || req.GetScheduleToCloseTimeout().AsDuration() <= 0 {

Contributor

I wonder if we want a MaxOperationScheduleToCloseTimeout? SANO has it.

Contributor Author

I don't think we do, assuming that there is a system-wide hard cap on callbacks. (Which IIRC was like 10s.) If it is actually possible to be longer than that, then adding and maybe wiring something in would make sense.

Contributor

@stephanos stephanos May 11, 2026

It's important to distinguish the scope of the various timeouts. If we have/had a 10s timeout on the callback delivery attempt, that would be different from the "schedule to close" timeout which covers the entire lifecycle from creation to terminal state - including all retries.

Contributor

As you can see, MaxOperationScheduleToCloseTimeout is set to zero by default. But we do set a limit in Cloud to ensure that all nexus operations eventually terminate. I think we'd want the same thing here for the same reason.

Comment thread chasm/lib/callback/handler.go Outdated
@chrsmith
Contributor Author

@stephanos thank you for the close look and detailed review, PTAL.

@chrsmith chrsmith requested a review from bergundy May 11, 2026 21:21

Member

@bergundy bergundy left a comment

I got all the way to frontend_validation.go but didn't complete reviewing that file yet.
It's a big PR, submitting my feedback so you can already take action and not be blocked until I complete reviewing the entire change.

Not blocking this PR but we should validate that temporal:// URLs have a token either via the old header format or the new structured Token field.

Member

Would you remove the WorkflowClosed message that is unused here please?


// (standalone only) User-supplied business ID set when StartCallbackExecution() is
// called. Used to identify the callback for operations like Describe- or Terminate-.
string callback_id = 12;

Member

This is available via the chasm context, no need to duplicate this information.


// (standalone only) Schedule-to-close timeout from when StartCallbackExecution()
// is called to when the result gets delivered.
google.protobuf.Duration completion_schedule_to_close_timeout = 13;

Member

Agree with @stephanos the callback lifetime is the completion lifetime, no need to qualify this field.

CALLBACK_STATUS_FAILED = 4;
// Callback has succeeded.
CALLBACK_STATUS_SUCCEEDED = 5;
// Callback was terminated by request.

Member

Suggested change
- // Callback was terminated by request.
+ // Callback was terminated by request. Only relevant for standalone callbacks.


rpc DescribeCallbackExecution(DescribeCallbackExecutionRequest) returns (DescribeCallbackExecutionResponse) {
option (temporal.server.api.routing.v1.routing).business_id = "frontend_request.callback_id";
option (temporal.server.api.common.v1.api_category).category = API_CATEGORY_STANDARD;

Member

This should be a long poll API for the purpose of this annotation (liveness detection).

Contributor

@stephanos stephanos May 13, 2026

hm both SANO and SAA have this as API_CATEGORY_STANDARD - did that slip through 2 reviews before?

Contributor Author

Roey confirmed that all these should be in the long-poll category. cc #10256 for fixing up the remaining ones.

Comment thread chasm/lib/callback/frontend.go Outdated
func (h *frontendHandler) checkFeatureEnabled(requestProto namespacer) error {
// Confirm CHASM is enabled.
targetNamespaceName := requestProto.GetNamespace()
if !h.config.CHASMEnabled(targetNamespaceName) || !h.config.CHASMCallbacksEnabled(targetNamespaceName) {

Member

CHASM callbacks is a feature flag for workflow callbacks, we should name it clearly and not check this here.

Comment thread chasm/lib/callback/frontend.go Outdated
type namespacer interface{ GetNamespace() string }

// Looks up the namespace ID from the user-supplied namespace name in the request proto.
func (h *frontendHandler) getTargetNamespace(requestProto namespacer) (namespace.ID, error) {

Member

Please don't use the get prefix for getters in Go: https://go.dev/doc/effective_go#Getters
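
For reference, the convention from that section of Effective Go looks roughly like this. The `handler` type and its field are hypothetical and only illustrate the naming.

```go
package main

import "fmt"

// handler is a hypothetical type used only to illustrate the naming convention.
type handler struct {
	namespace string
}

// Namespace is the getter: no "Get" prefix.
func (h *handler) Namespace() string { return h.namespace }

// SetNamespace is the setter: the "Set" prefix is conventional.
func (h *handler) SetNamespace(ns string) { h.namespace = ns }

func main() {
	h := &handler{}
	h.SetNamespace("default")
	fmt.Println(h.Namespace())
}
```

Generated protobuf accessors such as `GetNamespace()` are the usual exception, since protoc-gen-go emits the `Get` prefix; the convention applies to hand-written methods.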

return nil, err
}

resp, err := chasm.ListExecutions[*Callback, *callbackpb.CallbackExecutionListInfo](

Member

As mentioned above, there's no need for a structured memo to implement this API.

// requiredField is a tuple of a required field name and its value.
// Used instead of a map[string]string to provide deterministic
// errors if multiple fields aren't set.
type requiredField struct {

Member

We've had different approaches to provide multi-error feedback from our APIs but don't have a standard in the codebase. I'm okay with introducing yet another form in this PR and have ideas for how to improve the validation process in general that I want us to tackle later on.

// requiredField is a tuple of a required field name and its value.
// Used instead of a map[string]string to provide deterministic
// errors if multiple fields aren't set.
type requiredField struct {

Member

nit: this is a bit more accurate for what this struct represents.

Suggested change
- type requiredField struct {
+ type requiredStringField struct {

Comment thread tests/standalone_callbacks_test.go Outdated
)

// Test suite for the Nexus "Standalone Callbacks". Which are Nexus operations corresponding to
// aysynchronous actions that take place outside of Temporal. (e.g. waiting for a payment to

Contributor

Suggested change
- // aysynchronous actions that take place outside of Temporal. (e.g. waiting for a payment to
+ // asynchronous actions that take place outside of Temporal. (e.g. waiting for a payment to

Comment thread tests/standalone_callbacks_test.go Outdated
Comment on lines +1218 to +1219
ctx, cancel := context.WithTimeout(context.Background(), time.Second*30)
defer cancel()

Contributor

nit: you don't really need those (anymore), technically. It's better to use testcore.NewContext() as it gives you a context with a default timeout. We're in the process of making that become s.Context() but that's not merged yet. The only benefit of doing this is using an even tighter timeout than the default. But that can backfire in CI when things take longer. So my recommendation is to remove it.

Contributor Author

Good to know! I updated all the callsites to use ctx := env.Context(). So it's a bit longer than I had with the custom timeout, but I didn't wire in a lower value like the ones I had before, since you have me worried about how short durations would interact with under-powered and over-subscribed CI machines.

Comment on lines +1231 to +1235
s.NotNil(pollResp.GetOutcome(), "expected terminal outcome after timeout")
s.NotNil(pollResp.GetOutcome().GetFailure(), "expected failure outcome after timeout")
s.Contains(pollResp.GetOutcome().GetFailure().GetMessage(), "timed out")
s.NotNil(pollResp.GetOutcome().GetFailure().GetTimeoutFailureInfo())
s.Equal(enumspb.TIMEOUT_TYPE_SCHEDULE_TO_CLOSE, pollResp.GetOutcome().GetFailure().GetTimeoutFailureInfo().GetTimeoutType())

Contributor

(nit) (I haven't gotten around to codifying this in testing.md yet) I often find it useful, where possible, to use protorequire.ProtoEqual to assert the whole object. There's an option to ignore fields that can't be checked, like UUIDs. I find it makes it much easier to see the full picture and also gives a better diff.

Contributor Author

I'll keep that in mind in the future. For this PR, I'm inclined to keep things as-is just to explicitly verify individual fields. But if I need to revisit this code in the future, I'll switch to that and just have a wantProto := ... to compare against.

Comment thread tests/standalone_callbacks_test.go Outdated
defer cancel()

// Short timeout so it fires quickly during the test.
callbackID := "timeout-test-" + uuid.NewString()

Contributor

So you don't actually need the uuid.NewString() part since the testEnv you get will create a unique namespace for you. And as long as that isolation is sufficient and you don't re-use the identifier in the same test, you're good.

Contributor Author

Good to know. I'm inclined to keep this as-is, since it looks wrong to see a hard-coded ID like that. But yes, I see how it is unnecessary.

listenAddr := nexustest.AllocListenAddress()
nexustest.NewNexusServer(s.T(), listenAddr, h)

_, err := env.OperatorClient().CreateNexusEndpoint(ctx, &operatorservice.CreateNexusEndpointRequest{

Contributor

You could consider leveraging NexusTestEnv like a lot of the other Nexus tests; it has a createRandomExternalNexusServer and more already.

Contributor Author

I'll keep that in mind for future tests. For now, I'm inclined to keep this more verbose and explicit. But if I ever need to modify this code, I'll look into simplifying it with existing or new test helpers. (But I'm still trying to build muscle memory re: Nexus registration mechanics, and understanding the parts of the RPC handshake, etc.)

Comment thread chasm/lib/callback/config.go Outdated

var LongPollTimeout = dynamicconfig.NewNamespaceDurationSetting(
"callback.longPollTimeout",
20*time.Second,

Contributor

@stephanos stephanos May 13, 2026

Let's make that a constant in common/constants.go, too, while we're at it.

Contributor Author

@chrsmith chrsmith left a comment

Still working on the feedback (thanks!)

When I'm finished I'll split this all up into a series of smaller PRs applied to a feature branch. That'll make it easier to understand and review the changes.

Comment thread chasm/lib/callback/component.go Outdated
Comment on lines +145 to +147
"RequestID": c.RequestId,
// Only set for standalone callbacks.
"CallbackID": c.CallbackId,

Contributor Author

Good catch, I missed that.

Comment thread chasm/lib/callback/component.go Outdated
}

// Setup the completion's headers.
completion.Header = callback.Header

Contributor Author

It is for one of the cases (internal or outbound, I don't recall off the top of my head). Let's go over this in a separate PR after I've split this across.

This change came from applying a suggestion from Stephan that might need some more thought to do safely.

Comment thread chasm/lib/callback/component.go Outdated
// This shouldn't happen outside of tests, since the Nexus machinery
// would prevent an invalid transition anyways. (e.g. terminating
// an already terminated Callback.)
if c.LifecycleState(ctx).IsClosed() {

Contributor Author

Yup, sorry about that.

Comment thread chasm/lib/callback/component.go Outdated

// completionSource returns the completionSource from the callback, which depends on whether it
// is embedded or is running in standalone mode.
func (c *Callback) completionSource(ctx chasm.Context) CompletionSource {

Contributor Author

Good point! I'll just have Callback implement CompletionSource, and then just call that.

@chrsmith chrsmith requested review from a team as code owners May 14, 2026 20:39
@chrsmith
Contributor Author

@bergundy, @stephanos I've addressed all the feedback so far and rebased.

The only meaningful change is that I moved some of the "outbound request setup logic" from Callback.loadInvocationArgs back to invocableOutbound.Invoke. (The original refactoring is a good one to take, but I don't feel confident enough that it was safe and accounted for all the possible edge cases.)

Will ping folks on Slack if we should just do another review here, or if I should push a series of smaller PRs into a feature branch.

@chrsmith chrsmith changed the title [Draft] Add support for standalone callbacks Add support for standalone callbacks May 14, 2026
stephanos added a commit that referenced this pull request May 15, 2026
## What changed?

Mark `DescribeNexusOperationRequest` as long poll API category.

## Why?

In response to
#10192 (comment)
@chrsmith chrsmith force-pushed the chrsmith/standalone-callbacks_ii branch from bf79b70 to 9a0ee41 Compare May 15, 2026 16:49
@chrsmith chrsmith requested a review from bergundy May 15, 2026 17:05

Member

@bergundy bergundy left a comment

I reviewed everything but standalone_callbacks_test.go. This is getting really close to done.

// requestID (unique per API call) + idx (position within the request) ensures unique, idempotent callback IDs.
id := fmt.Sprintf("%s-%d", requestID, idx)
- callbackObj := callback.NewCallback(requestID, registrationTime, &callbackspb.CallbackState{}, chasmCB)
+ callbackObj := callback.NewEmbeddedCallback(requestID, registrationTime, chasmCB)

Member

I would also accept the constructor taking a chasm context and extracting the time but what you have is fine too.

Contributor Author

Ack. I'll keep that in mind if I need to modify this in the future, but for now I'll just keep it as-is.
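
For context on the ID scheme in the snippet above, a small sketch of why `requestID` plus position yields IDs that are unique within a request and stable across retries of the same request. `callbackIDs` is a hypothetical helper written for illustration.

```go
package main

import "fmt"

// callbackIDs derives one ID per callback from the request ID and the
// callback's position in the request. Replaying the same request produces the
// same IDs, which is what makes the scheme idempotent.
func callbackIDs(requestID string, n int) []string {
	ids := make([]string, n)
	for idx := range ids {
		ids[idx] = fmt.Sprintf("%s-%d", requestID, idx)
	}
	return ids
}

func main() {
	fmt.Println(callbackIDs("req-123", 3))
}
```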

return map[string]string{
"request-id": c.RequestId,
// Only set for standalone callbacks.
"callback-id": ctx.ExecutionKey().BusinessID,

Member

nit: ContextMetadata is only called on the root component, the comment makes it sound like that's not the case.

Contributor Author

I removed that comment, and moved it to the function declaration.

// ContextMetadata is used for root CHASM components, so this is only applicable
// for the standalone callback case.
func (c *Callback) ContextMetadata(ctx chasm.Context) map[string]string {

I know it might be cleaner to just drop the comment altogether, but IMHO it's important to call out any differences between the embedded vs. standalone case. (What do you think? Do you agree, or is this just noise? I'm still norming on what sort of things to call out.)

}

func callbackCompletionToNexusCompleteOperationOpts(
cb *Callback,

Member

nit: this can be a method on Callback.

Contributor Author

I'm inclined to leave this as-is, because I can't think of a good clean way to add it without adding more error handling. (Since the function would need to check if the Callback is in embedded mode or not.)

Comment on lines +275 to +291
case *callbackpb.CallbackExecutionCompletion_Failure:
f, err := commonnexus.TemporalFailureToNexusFailure(completion.GetFailure())
if err != nil {
wrappedErr := fmt.Errorf("failed to convert failure: %w", err)
return nexusrpc.CompleteOperationOptions{}, wrappedErr
}
opErr := &nexus.OperationError{
State: nexus.OperationStateFailed,
Message: "operation failed",
Cause: &nexus.FailureError{Failure: f},
}
if err := nexusrpc.MarkAsWrapperError(nexusrpc.DefaultFailureConverter(), opErr); err != nil {
wrappedErr := fmt.Errorf("failed to mark wrapper error: %w", err)
return nexusrpc.CompleteOperationOptions{}, wrappedErr
}
nexusCompletion.Error = opErr
return nexusCompletion, nil

Member

FYI, I'm tracking this as tech debt. We don't want to have to do all of this conversion every time we extract a completion from a component.

Contributor Author

Thanks, and agreed. This can probably be standardized across all CHASM components.

Comment thread chasm/lib/callback/component.go Outdated
return nexusCompletion, nil

default:
return nexusrpc.CompleteOperationOptions{}, serviceerror.NewInvalidArgument("no completion result provided")

Member

This is an internal error, it's our fault not the user's fault.

Comment thread chasm/lib/callback/statemachine_test.go Outdated
// Assert info object is updated.
require.Equal(t, callbackspb.CALLBACK_STATUS_SUCCEEDED, callback.StateMachineState())
require.Equal(t, int32(2), callback.Attempt)
require.Nil(t, callback.LastAttemptFailure)

Member

This should not reset the field if it existed before. For SAA we use a message that stored the last attempt failure time to preserve as much of the last failure as possible. I would do that here too, can wait for a follow up PR given that it changes existing behavior.

Contributor Author

Makes sense. It's an easy change to make, done.

t.Run(test.Name, func(t *testing.T) {
currentTime := time.Now().UTC()
mctx := &chasm.MockMutableContext{}
mctx.HandleNow = func(chasm.Component) time.Time { return currentTime }

Member

nit: This could be initialized directly as a field on the mutable context struct literal.

Contributor Author

I'll just keep it as is, none of these seem strictly better IMHO.

	mctx := &chasm.MockMutableContext{MockContext: chasm.MockContext{HandleNow: func(chasm.Component) time.Time { return currentTime }}}

	mctx := &chasm.MockMutableContext{MockContext: chasm.MockContext{
		HandleNow: func(chasm.Component) time.Time { return currentTime },
	}}

	mctx := &chasm.MockMutableContext{
		MockContext: chasm.MockContext{
			HandleNow: func(chasm.Component) time.Time { return currentTime },
		},
	}

Comment thread common/nexus/nexusrpc/completion.go Outdated
const (
// Copy of common/nexus.CallbackTokenHeader, to avoid import cycle.
// Header to identify the callback being resolved for callbacks to resolve Nexus operations.
commonnexusCallbackTokenHeader = "Temporal-Callback-Token"

Member

Ah, I forgot that this header was Temporal specific and not generic Nexus. I'd be fine if you moved the logic into the Temporal part of the codebase.

I would also love for us to use a Temporal agnostic header field for this but we should do that as a separate step.

Another thing we should do is make the token structured in the CompletionRequest struct and prevent the need to extract it from header in our handler implementation.

Contributor Author

OK, I'll back out the changes to nexusrpc/ and just set the required header in callback/invocable_outbound.go like before. As far as how to improve this, let's chat about that in a separate PR.

I'm definitely on board with refactoring this, since the "headers x { Callback, Operation } Token x { Caller, Temporal, Handler }" situation is certainly hard to keep straight.

// ErrOperationNotStarted is returned when a completion arrives before the operation has
// started and no operation token is provided. This error is used by the callback invocation
// layer to detect this specific condition and retry without triggering the circuit breaker.
var ErrOperationNotStarted = serviceerror.NewFailedPrecondition("nexus operation not started")

Member

I suggest that you create a new server internal service error with the FailedPrecondition status code and use a type assertion from the frontend handler to translate this error to a non retryable Nexus handler error.

Contributor Author

I think I did that right, but please double check the commit that wired it in.

Specifically, the code is returning a &serviceerror.NexusOperationNotStarted{}. But this isn't a GRPC error wrapping that NexusOperationNotStarted. (So I don't think any sort of "failed precondition" would get surfaced upstream.)

// reject the request. This handles the race where a completion arrives before the start
// handler returns with the operation token. The caller will retry and by then the start
// handler will have returned and recorded the token.
if operationToken == "" {

Member

We need a follow up to delay the completion request using chasm.PollComponent for a second to allow the start request to be recorded. I don't want these errors showing up for users.

Contributor Author

OK. I filed https://temporalio.atlassian.net/browse/NEXUS-369 to track that.

I tried to write a unit test to confirm the current behavior, but inside invocable_outbound.go we only retry these errors IFF the callback URL is temporal://system. So I'd need to spend more time figuring out how we can simulate that within functional or integration tests.

I did add unit tests for invocable_outbound.go to cover that case, but that isn't where we'd want to have that logic.

@chrsmith
Contributor Author

I think I got to everything. However, I believe the way I wired in the serviceerror.NexusOperationNotStartedFailure isn't quite what you were asking for, @bergundy.

That said, I assume the way we are surfacing errors is fine for the time being, until we can address NEXUS-369: Improve "callback invoked before Nexus operation started" situation, which would entail adding some sort of delay or polling to start the callback until the target Nexus operation has actually started. WDYT?

@chrsmith chrsmith requested a review from bergundy May 18, 2026 19:14