Shutdown refactor#2368
Open
yuandrew wants to merge 18 commits into
Open
Conversation
This reverts commit be5c0e4.
jmaeagle99
approved these changes
May 26, 2026
|
|
||
| aw.sendShutdownWorkerRPCForTaskQueue( | ||
| grpcCtx, | ||
| aw.sessionWorker.creationWorker.executionParameters.TaskQueue, |
Contributor
There was a problem hiding this comment.
nit: Consider having the session worker report its task queues via a func so aggregate worker doesn't have to know what the structure of the session worker is.
| aw.nexusWorker.worker.noRepoll.Store(true) | ||
| } | ||
| if !util.IsInterfaceNil(aw.sessionWorker) { | ||
| aw.sessionWorker.creationWorker.worker.noRepoll.Store(true) |
Contributor
There was a problem hiding this comment.
nit: Add a func to session worker to stop polling so that aggregate worker doesn't have to know about the details of the session worker.
| if aw.client.eagerDispatcher != nil { | ||
| aw.client.eagerDispatcher.deregisterWorker(aw.workflowWorker) | ||
| } | ||
| aw.workflowWorker.Stop() |
Contributor
There was a problem hiding this comment.
With more work happening in the stops for each worker, consider parallelizing these calls on each of the workers.
|
|
||
| service.EXPECT().ShutdownWorker(gomock.Any(), gomock.Any(), gomock.Any()). | ||
| Return(&workflowservice.ShutdownWorkerResponse{}, nil).Times(1) | ||
| Return(&workflowservice.ShutdownWorkerResponse{}, nil).Times(3) |
Contributor
There was a problem hiding this comment.
This and the below Times(3) feel like magic constants. Maybe move them to a constant and explain why this value was chosen or (if we can) calculate what it should be.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
non-fork version of #2261, so cloud tests can run
What was changed
Refactored worker shutdown to use a two-stage approach: pollers shut down first, then the task dispatcher drains any remaining tasks before exiting. This ensures tasks polled during shutdown are processed rather than silently dropped.
Key changes:
Why?
PR #2199 changed shutdown to let the server complete in-flight polls, instead of cancelling them. This exposed a pre-existing race when a poller receives a task during shutdown, Go would silently dropping the task. The dispatcher had the same issue — it could exit on stopCh before reading pending tasks from the channel.
This aligns the Go SDK's shutdown with how Core SDK handles it:
Checklist
Closes Drain polled tasks on shutdown #1197
How was this tested:
Any docs updates needed?
No
Note
Medium Risk
Changes core worker stop/poll/dispatch paths and ShutdownWorker behavior for session queues; mitigated by extensive unit/integration tests and preserved legacy shutdown when the capability is disabled.
Overview
Refactors graceful worker shutdown so tasks polled during stop are drained and processed instead of dropped when the server supports
worker_poll_complete_on_shutdown.baseWorkernow shuts down in two stages: pollers exit first (pollerWG), a goroutine closestaskQueueCh, and the task dispatcher ranges the channel in drain mode—still honoring dispatch rate limits via a separatetaskLimiterContext, with a path to release unprocessed tasks if the limiter is canceled. Pollers block-send tasks to the dispatcher during drain (nostopChdrop). Legacy shutdown (capability off/nil) keeps immediate poll cancel and early dispatcher exit.Pollers (
doPoll,ProcessTaskon workflow/activity/nexus) useshouldDrainOnShutdown()to finish in-flight polls without the old 5s wait hack and to keep processing tasks after stop when draining.AggregatedWorkersendsShutdownWorkerfor session creation/activity queues, setsnoRepollon session workers, and removes forcing session workers onto legacy-only shutdown.Tests/CI: broad unit coverage (drain vs legacy, rate limits, stop timeout); integration tests for active timer/activity shutdown and session poll-complete; cloud CI runs
TestShutdownDuringActiveTimerActivityWorkflows; local dev server enablesfrontend.enableCancelWorkerPollsOnShutdown.Reviewed by Cursor Bugbot for commit 428608a. Bugbot is set up for automated code reviews on this repo. Configure here.