Skip to content

Surface repeated activity retry failures in TemporalReportedProblems#10317

Open
serelli wants to merge 6 commits into
temporalio:mainfrom
serelli:feat/activity-retry-reported-problems
Open

Surface repeated activity retry failures in TemporalReportedProblems#10317
serelli wants to merge 6 commits into
temporalio:mainfrom
serelli:feat/activity-retry-reported-problems

Conversation

@serelli
Copy link
Copy Markdown

@serelli serelli commented May 18, 2026

Fixes #10149

Summary

  • Adds NumConsecutiveActivityRetryProblemsToTriggerSearchAttribute dynamic config (namespace-scoped, default 0 = disabled). When set to N, any pending activity whose attempt count reaches N will surface category=ActivityRetryFailed in the TemporalReportedProblems keyword-list search attribute.
  • Activity-retry entries are managed independently of the existing workflow-task-failure entries: a successful WFT no longer clears activity entries, and vice versa.
  • The entry is automatically removed when the activity reaches a terminal state (completed, failed with no more retries, timed out, or canceled).

How it works

UpdateReportedProblemsSearchAttribute is now the single recomputation point for the entire keyword list. It merges:

  1. Existing WFT failure entries (unchanged logic)
  2. A new category=ActivityRetryFailed entry, if any pending activity's Attempt ≥ threshold

RemoveReportedProblemsSearchAttribute (called on WFT success) now clears the WFT state and delegates to UpdateReportedProblemsSearchAttribute rather than unconditionally wiping the whole SA. Activity entries survive a WFT success.

maybeUpdateActivityReportedProblems is called from:

  • RetryActivity (both paused and normal retry paths) — to set the entry when threshold is crossed
  • AddActivityTask{Completed,Failed,TimedOut,Canceled}Event — to clear the entry when activity finishes

Test plan

  • TestActivityReportedProblems_SetAndClear — verifies category=ActivityRetryFailed appears after N retries and is cleared after the activity succeeds
  • TestActivityReportedProblems_DisabledByDefault — verifies the SA is never set when threshold = 0
  • Existing TestWFTFailureReportedProblems_* tests continue to pass (no behavior change for the WFT path when activity threshold is 0)

Add NumConsecutiveActivityRetryProblemsToTriggerSearchAttribute dynamic
config (namespace-scoped, default 0 = disabled). When enabled, the
TemporalReportedProblems search attribute gains a
"category=ActivityRetryFailed" entry whenever any pending activity's
attempt count reaches the configured threshold.

The entry is removed automatically when the activity reaches a terminal
state (completed, failed with no retry, timed out, or canceled).
Activity-retry entries are managed independently of the existing
workflow-task-failure entries: a successful WFT no longer wipes activity
entries, and vice versa.

Implementation details:
- UpdateReportedProblemsSearchAttribute now scans pendingActivityInfoIDs
  and merges activity entries with WFT entries into the keyword list.
- RemoveReportedProblemsSearchAttribute clears the WFT state and
  delegates to UpdateReportedProblemsSearchAttribute so activity
  entries are preserved when only the WFT side is resolved.
- maybeUpdateActivityReportedProblems is called from RetryActivity and
  from the Add{Completed,Failed,TimedOut,Canceled}Activity methods to
  keep the SA in sync with activity state changes.
- Integration tests cover set+clear behavior and the disabled (0)
  default.

Fixes: temporalio#10149

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@serelli serelli requested review from a team as code owners May 18, 2026 16:42
@orange-dot
Copy link
Copy Markdown
Contributor

@serelli new activity-only tests make sense to me for the basic set/clear and disabled-by-default cases.

One extra case I was wondering about is the mixed WFT/activity state, since activity retry reporting now calls
into the shared reported-problems recompute path. For example, if LastWorkflowTaskFailure exists, but WFT
reporting is disabled or still below its own threshold, and then an activity retry reaches its threshold, should
the recompute include the WFT tokens too, or keep those gated by the WFT threshold?

Maybe it would be useful to extend the existing TestWFTFailureReportedProblems_* coverage with one such mixed
case, just to make the intended behavior explicit. Something roughly like:

ms, taskGenerator := newReportedProblemsReproMutableState(t, 0, 2)
ms.executionInfo.LastWorkflowTaskFailure = &persistencespb.WorkflowExecutionInfo_LastWorkflowTaskFailureCause{
    LastWorkflowTaskFailureCause: enumspb.WORKFLOW_TASK_FAILED_CAUSE_WORKFLOW_WORKER_UNHANDLED_FAILURE,
}
ms.pendingActivityInfoIDs[1] = &persistencespb.ActivityInfo{
    ScheduledEventId: 1,
    ActivityId:       "activity",
    Attempt:          2,
}

taskGenerator.EXPECT().GenerateUpsertVisibilityTask().Return(nil)
require.NoError(t, ms.UpdateReportedProblemsSearchAttribute())

problems := reportedProblemsForRepro(t, ms)
require.Contains(t, problems, "category=ActivityRetryFailed")
require.NotContains(t, problems, "category=WorkflowTaskFailed")

@orange-dot
Copy link
Copy Markdown
Contributor

@serelli one other edge case I was wondering about is activity reset/unpause-style paths.

The test plan covers clearing after the activity succeeds, which makes sense for the normal completion path. But
if TemporalReportedProblems already contains category=ActivityRetryFailed and the activity is reset or
unpaused with attempts reset, the pending activity may no longer meet the retry threshold without going through an
activity completion event.

Is the intended behavior that those operator-driven paths also recompute/clear the activity retry entry, or should
the entry remain until the activity later completes/fails/times out/cancels? A small test around that path might
help document the intended behavior.

serelli and others added 3 commits May 26, 2026 10:59
…vity reset

- `UpdateReportedProblemsSearchAttribute`: gate WFT entries on
  NumConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute so that stale
  LastWorkflowTaskFailure state is not surfaced when WFT reporting is disabled (threshold=0).

- `UpdateActivity`: track prevAttempt and call maybeUpdateActivityReportedProblems when
  the attempt count changes so that ResetActivity / UnpauseWithReset paths immediately
  clear the category=ActivityRetryFailed SA entry when retries drop below threshold.

- Add TestReportedProblems_MixedWFTAndActivity: documents that with WFT threshold=0 and
  activity threshold=2, a recompute includes the activity entry but not the WFT entry even
  when LastWorkflowTaskFailure is set.

- Add TestReportedProblems_ActivityClearedAfterAttemptReset: documents that the SA entry
  is cleared when the pending activity's attempt count drops below the threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@serelli
Copy link
Copy Markdown
Author

serelli commented May 27, 2026

@serelli new activity-only tests make sense to me for the basic set/clear and disabled-by-default cases.

One extra case I was wondering about is the mixed WFT/activity state, since activity retry reporting now calls into the shared reported-problems recompute path. For example, if LastWorkflowTaskFailure exists, but WFT reporting is disabled or still below its own threshold, and then an activity retry reaches its threshold, should the recompute include the WFT tokens too, or keep those gated by the WFT threshold?

Maybe it would be useful to extend the existing TestWFTFailureReportedProblems_* coverage with one such mixed case, just to make the intended behavior explicit. Something roughly like:

ms, taskGenerator := newReportedProblemsReproMutableState(t, 0, 2)
ms.executionInfo.LastWorkflowTaskFailure = &persistencespb.WorkflowExecutionInfo_LastWorkflowTaskFailureCause{
    LastWorkflowTaskFailureCause: enumspb.WORKFLOW_TASK_FAILED_CAUSE_WORKFLOW_WORKER_UNHANDLED_FAILURE,
}
ms.pendingActivityInfoIDs[1] = &persistencespb.ActivityInfo{
    ScheduledEventId: 1,
    ActivityId:       "activity",
    Attempt:          2,
}

taskGenerator.EXPECT().GenerateUpsertVisibilityTask().Return(nil)
require.NoError(t, ms.UpdateReportedProblemsSearchAttribute())

problems := reportedProblemsForRepro(t, ms)
require.Contains(t, problems, "category=ActivityRetryFailed")
require.NotContains(t, problems, "category=WorkflowTaskFailed")

@orange-dot
Good point, thanks for the suggested test skeleton.

The intended behavior is that WFT entries are gated by their own threshold, so category=ActivityRetryFailed appears but category=WorkflowTaskFailed does
not when WFT reporting is disabled. I updated UpdateReportedProblemsSearchAttribute to enforce this:

wftThreshold := ms.config.NumConsecutiveWorkflowTaskProblemsToTriggerSearchAttribute(...)
if wftThreshold > 0 {
switch wftFailure := ms.executionInfo.LastWorkflowTaskFailure.(type) {
// ... append WFT entries
}
}

This ensures that stale LastWorkflowTaskFailure state (e.g., left over from a prior config value) doesn't bleed into the SA when WFT reporting is disabled.

I also added the newReportedProblemsReproMutableState / reportedProblemsForRepro helpers and TestReportedProblems_MixedWFTAndActivity using your suggested
structure — the assertions match exactly what you described.

@serelli
Copy link
Copy Markdown
Author

serelli commented May 27, 2026

@serelli one other edge case I was wondering about is activity reset/unpause-style paths.

The test plan covers clearing after the activity succeeds, which makes sense for the normal completion path. But if TemporalReportedProblems already contains category=ActivityRetryFailed and the activity is reset or unpaused with attempts reset, the pending activity may no longer meet the retry threshold without going through an activity completion event.

Is the intended behavior that those operator-driven paths also recompute/clear the activity retry entry, or should the entry remain until the activity later completes/fails/times out/cancels? A small test around that path might help document the intended behavior.

@orange-dot
Also addressed in the same commit.

The intended behavior is to clear immediately on reset — if an operator resets attempts to 1, the workflow is effectively getting a fresh start and the SA
entry should reflect that right away rather than lingering until the next terminal event.

To wire this up, I added a prevAttempt check in UpdateActivity that calls maybeUpdateActivityReportedProblems() whenever the attempt count changes:

prevAttempt := ai.Attempt
// ...
if err := updater(ai, ms); err != nil {
return err
}
if prevAttempt != ai.Attempt {
if err := ms.maybeUpdateActivityReportedProblems(); err != nil {
return err
}
}

This covers ResetActivity (sets Attempt = 1) and UnpauseActivityWithReset (same) without needing changes to the MutableState interface. And
TestReportedProblems_ActivityClearedAfterAttemptReset documents the behavior: SA entry is set when attempt ≥ threshold, then cleared when attempt drops
back below it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface repeated activity retry failures in TemporalReportedProblems

2 participants