Surface repeated activity retry failures in TemporalReportedProblems #10317
serelli wants to merge 6 commits
Add NumConsecutiveActivityRetryProblemsToTriggerSearchAttribute dynamic
config (namespace-scoped, default 0 = disabled). When enabled, the
TemporalReportedProblems search attribute gains a
"category=ActivityRetryFailed" entry whenever any pending activity's
attempt count reaches the configured threshold.
The entry is removed automatically when the activity reaches a terminal
state (completed, failed with no retry, timed out, or canceled).
Activity-retry entries are managed independently of the existing
workflow-task-failure entries: a successful WFT no longer wipes activity
entries, and vice versa.
Implementation details:
- UpdateReportedProblemsSearchAttribute now scans pendingActivityInfoIDs
and merges activity entries with WFT entries into the keyword list.
- RemoveReportedProblemsSearchAttribute clears the WFT state and
delegates to UpdateReportedProblemsSearchAttribute so activity
entries are preserved when only the WFT side is resolved.
- maybeUpdateActivityReportedProblems is called from RetryActivity and
from the Add{Completed,Failed,TimedOut,Canceled}Activity methods to
keep the SA in sync with activity state changes.
- Integration tests cover set+clear behavior and the disabled (0)
default.
Fixes: temporalio#10149
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@serelli new activity-only tests make sense to me for the basic set/clear and disabled-by-default cases. One extra case I was wondering about is the mixed WFT/activity state, since activity retry reporting now calls … Maybe it would be useful to extend the existing …
ms, taskGenerator := newReportedProblemsReproMutableState(t, 0, 2)
ms.executionInfo.LastWorkflowTaskFailure = &persistencespb.WorkflowExecutionInfo_LastWorkflowTaskFailureCause{
	LastWorkflowTaskFailureCause: enumspb.WORKFLOW_TASK_FAILED_CAUSE_WORKFLOW_WORKER_UNHANDLED_FAILURE,
}
ms.pendingActivityInfoIDs[1] = &persistencespb.ActivityInfo{
	ScheduledEventId: 1,
	ActivityId:       "activity",
	Attempt:          2,
}
taskGenerator.EXPECT().GenerateUpsertVisibilityTask().Return(nil)
require.NoError(t, ms.UpdateReportedProblemsSearchAttribute())
problems := reportedProblemsForRepro(t, ms)
require.Contains(t, problems, "category=ActivityRetryFailed")
require.NotContains(t, problems, "category=WorkflowTaskFailed")
@serelli one other edge case I was wondering about is activity reset/unpause-style paths. The test plan covers clearing after the activity succeeds, which makes sense for the normal completion path. But … Is the intended behavior that those operator-driven paths also recompute/clear the activity retry entry, or should …
…vity reset
- `UpdateReportedProblemsSearchAttribute`: gate WFT entries on NumConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute so that stale LastWorkflowTaskFailure state is not surfaced when WFT reporting is disabled (threshold=0).
- `UpdateActivity`: track prevAttempt and call maybeUpdateActivityReportedProblems when the attempt count changes so that ResetActivity / UnpauseWithReset paths immediately clear the category=ActivityRetryFailed SA entry when retries drop below threshold.
- Add TestReportedProblems_MixedWFTAndActivity: documents that with WFT threshold=0 and activity threshold=2, a recompute includes the activity entry but not the WFT entry even when LastWorkflowTaskFailure is set.
- Add TestReportedProblems_ActivityClearedAfterAttemptReset: documents that the SA entry is cleared when the pending activity's attempt count drops below the threshold.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@orange-dot The intended behavior is that WFT entries are gated by their own threshold, so category=ActivityRetryFailed appears but category=WorkflowTaskFailed does not:
wftThreshold := ms.config.NumConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute(...)
This ensures that stale LastWorkflowTaskFailure state (e.g., left over from a prior config value) doesn't bleed into the SA when WFT reporting is disabled.
I also added the newReportedProblemsReproMutableState / reportedProblemsForRepro helpers and TestReportedProblems_MixedWFTAndActivity using your suggested …
@orange-dot The intended behavior is to clear immediately on reset: if an operator resets attempts to 1, the workflow is effectively getting a fresh start and the SA …
To wire this up, I added a prevAttempt check in UpdateActivity that calls maybeUpdateActivityReportedProblems() whenever the attempt count changes:
prevAttempt := ai.Attempt
This covers ResetActivity (sets Attempt = 1) and UnpauseActivityWithReset (same) without needing changes to the MutableState interface. And …
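For readers following along, the prevAttempt pattern might look roughly like the sketch below. It is a toy model, not the PR's code: `activityInfo`, `tracker`, `updateActivity`, and `maybeUpdate` are hypothetical stand-ins for persistencespb.ActivityInfo, the mutable state, UpdateActivity, and maybeUpdateActivityReportedProblems, and the updater-callback shape is an assumption.

```go
package main

import "fmt"

// activityInfo is a hypothetical stand-in for persistencespb.ActivityInfo.
type activityInfo struct {
	Attempt int32
}

// tracker is a hypothetical stand-in for the mutable state.
type tracker struct {
	threshold  int32 // 0 disables reporting
	reported   bool  // whether category=ActivityRetryFailed is currently set
	recomputes int   // how many times the entry was recomputed
}

// maybeUpdate recomputes the activity entry from the current attempt count.
func (tr *tracker) maybeUpdate(ai *activityInfo) {
	tr.recomputes++
	tr.reported = tr.threshold > 0 && ai.Attempt >= tr.threshold
}

// updateActivity applies the updater and recomputes only when the
// attempt count actually changed.
func (tr *tracker) updateActivity(ai *activityInfo, updater func(*activityInfo)) {
	prevAttempt := ai.Attempt
	updater(ai)
	if ai.Attempt != prevAttempt {
		tr.maybeUpdate(ai)
	}
}

func main() {
	tr := &tracker{threshold: 3}
	ai := &activityInfo{Attempt: 1}

	// Retries push the attempt count up to the threshold.
	tr.updateActivity(ai, func(a *activityInfo) { a.Attempt = 3 })
	fmt.Println(tr.reported) // true

	// An operator reset drops the attempt count back to 1.
	tr.updateActivity(ai, func(a *activityInfo) { a.Attempt = 1 })
	fmt.Println(tr.reported) // false

	// An update that leaves Attempt untouched triggers no recompute.
	tr.updateActivity(ai, func(a *activityInfo) {})
	fmt.Println(tr.recomputes) // 2
}
```

Comparing the attempt before and after the updater keeps unrelated activity updates from triggering a recompute.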
Fixes #10149
Summary
- `NumConsecutiveActivityRetryProblemsToTriggerSearchAttribute` dynamic config (namespace-scoped, default 0 = disabled). When set to N, any pending activity whose attempt count reaches N will surface `category=ActivityRetryFailed` in the `TemporalReportedProblems` keyword-list search attribute.
How it works
`UpdateReportedProblemsSearchAttribute` is now the single recomputation point for the entire keyword list. It merges:
- the existing workflow-task-failure (WFT) entries
- a `category=ActivityRetryFailed` entry, if any pending activity's `Attempt` ≥ threshold
`RemoveReportedProblemsSearchAttribute` (called on WFT success) now clears the WFT state and delegates to `UpdateReportedProblemsSearchAttribute` rather than unconditionally wiping the whole SA. Activity entries survive a WFT success.
`maybeUpdateActivityReportedProblems` is called from:
- `RetryActivity` (both paused and normal retry paths), to set the entry when the threshold is crossed
- `AddActivityTask{Completed,Failed,TimedOut,Canceled}Event`, to clear the entry when the activity finishes
Test plan
- `TestActivityReportedProblems_SetAndClear`: verifies `category=ActivityRetryFailed` appears after N retries and is cleared after the activity succeeds
- `TestActivityReportedProblems_DisabledByDefault`: verifies the SA is never set when threshold = 0
- `TestWFTFailureReportedProblems_*` tests continue to pass (no behavior change for the WFT path when activity threshold is 0)