fix: prevent completed step from being written to cancelled run#128

Merged
coji merged 4 commits into main from fix/cancel-race-persistStep
Mar 16, 2026
Conversation

coji (Owner) commented Mar 16, 2026

Summary

Fixes a race condition where persistStep() could write a completed step to an already-cancelled run. The bug existed because persistStep only checked lease_generation, while cancelRun changes status but not lease_generation.

Root Cause

completeRun, failRun, and renewLease all guard with both status = 'leased' AND lease_generation. But persistStep and updateProgress only checked lease_generation, creating a window where cancellation goes unnoticed.

Fix

Add status = 'leased' guard to persistStep (INSERT and UPDATE) and updateProgress, matching the existing pattern. Also properly emit step:cancel event when persistStep is correctly refused due to cancellation.
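To make the race concrete, here is a minimal in-memory sketch of the guard before and after the fix. It is not the project's storage code: the row shapes and the helper names (`persistStepBefore`, `persistStepAfter`, `replayRace`) are assumptions made for illustration, and only the `status = 'leased'` plus `lease_generation` comparison comes from the description above.

```typescript
// In-memory stand-ins for the run and step rows (hypothetical shapes).
type RunStatus = 'leased' | 'cancelled'

interface RunRow {
  status: RunStatus
  leaseGeneration: number
  currentStepIndex: number
}

interface StepRow {
  index: number
  status: 'completed'
}

// Models cancelRun as the PR describes it: the status changes,
// the lease generation does not.
function cancelRun(run: RunRow): void {
  run.status = 'cancelled'
}

// Shared write path: record the step and advance the step index.
function writeStep(run: RunRow, steps: StepRow[], index: number): StepRow {
  const step: StepRow = { index, status: 'completed' }
  steps.push(step)
  run.currentStepIndex = index + 1
  return step
}

// Before the fix: only the lease generation is compared.
function persistStepBefore(
  run: RunRow,
  steps: StepRow[],
  leaseGeneration: number,
  index: number,
): StepRow | null {
  if (run.leaseGeneration !== leaseGeneration) return null
  return writeStep(run, steps, index)
}

// After the fix: status must be 'leased' AND the generation must match.
function persistStepAfter(
  run: RunRow,
  steps: StepRow[],
  leaseGeneration: number,
  index: number,
): StepRow | null {
  if (run.status !== 'leased' || run.leaseGeneration !== leaseGeneration) {
    return null
  }
  return writeStep(run, steps, index)
}

// Replays the race: fn() has finished, cancel lands, then persist runs.
function replayRace(
  persist: typeof persistStepAfter,
): { result: StepRow | null; stepsWritten: number } {
  const run: RunRow = { status: 'leased', leaseGeneration: 1, currentStepIndex: 0 }
  const steps: StepRow[] = []
  cancelRun(run)
  const result = persist(run, steps, 1, 0)
  return { result, stepsWritten: steps.length }
}

console.log(replayRace(persistStepBefore).stepsWritten) // 1
console.log(replayRace(persistStepAfter).stepsWritten) // 0
```

In the sketch, the unguarded variant still writes a completed step after the cancel because the generation is unchanged, while the guarded variant returns null.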

Test

New regression test: creates a step where fn() completes, cancel happens before persistStep, verifies no completed step is written.

Closes #121

🤖 Generated with Claude Code


Found by Codex (GPT-5.4) code review

Summary by CodeRabbit

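  • Bug Fixes

    • Strengthened cancellation detection and handling in the step persistence path, so a step is no longer persisted as completed when its run has been cancelled.
    • Extended the lease guards on database operations to check both lease status and lease generation, keeping the stored state consistent.
  • Tests

    • Added a new test that reproduces the race between step completion and cancellation and verifies that no completed step is saved on cancellation.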

Add status='leased' guard to persistStep and updateProgress, matching
the existing pattern in completeRun/failRun/renewLease. This closes the
race window where cancel() sets status='cancelled' but persistStep only
checked lease_generation.

Changes:
- persistStep INSERT...SELECT: add AND status = 'leased'
- persistStep step index UPDATE: add .where('status', '=', 'leased')
- updateProgress: add .where('status', '=', 'leased')
- context.ts: emit step:cancel event before throwing CancelledError
  when persistStep returns null due to cancellation
- context.ts: update comments to reflect new guard semantics
- Add regression test: cancel between fn() completion and persistStep

Closes #121

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vercel Bot commented Mar 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| durably-demo | Ready | Preview | Mar 16, 2026 11:27am |
| durably-demo-vercel-turso | Ready | Preview | Mar 16, 2026 11:27am |

coderabbitai Bot (Contributor) commented Mar 16, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72b5297c-3304-4e75-9247-b3e51e380a4a

📥 Commits

Reviewing files that changed from the base of the PR and between d24bd0d and 23e7b87.

📒 Files selected for processing (2)
  • packages/durably/src/context.ts
  • packages/durably/tests/shared/step.shared.ts
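Walkthrough

Added a lease check (status='leased') to step persistence so that cancellation races during execution are detected. On cancellation the worker now emits step:cancel and throws CancelledError, and the control flow was adjusted accordingly. Lease guards are strengthened on the progress, insert, and failure paths.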

Changes

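| Cohort / File(s) | Summary |
|---|---|
| Stronger cancellation handling: packages/durably/src/context.ts | Added cancellation detection for the cases where persistence fails or is refused after a step runs. Clarified the flow that emits step:cancel and throws CancelledError on cancellation. Removed the extra abort check inside finally. Updated comments to match the lease guard semantics. |
| Lease status guard added: packages/durably/src/storage.ts | Added the status = 'leased' condition to the WHERE clauses used by persistStep (step insert), the progress update, and the step index advance, preventing erroneous persistence after cancellation. |
| Race condition test added: packages/durably/tests/shared/step.shared.ts | Added a test for the scenario where step completion races with an external cancellation, verifying that no completed step is persisted on cancellation (a duplicate copy of the same test is present). |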

Sequence Diagram(s)

sequenceDiagram
    participant Worker as Worker\n(step execution)
    participant Controller as Controller\n(cancel / lease/status)
    participant Storage as Storage/DB\n(WHERE: id, lease_generation, status='leased')
    participant Events as EventEmitter

    rect rgba(200,220,255,0.5)
    Worker->>Storage: persistStep(insert/update) with lease_generation
    end

    Controller->>Storage: set run.status='cancelled' (no lease_generation bump)
    Controller->>Worker: abort signal
    Worker->>Storage: persistStep continuation (re-check WHERE includes status='leased')
    alt Storage match fails (run not 'leased' or lease lost)
        Storage-->>Worker: 0 rows affected
        Worker->>Events: emit step:cancel
        Worker-->>Controller: throw CancelledError
    else Storage match succeeds
        Storage-->>Worker: success
        Worker->>Events: emit step:fail / step:complete (with explicit type + error)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰✨
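Guard the lease, and running feet feel safe
When a cancel comes, wave a quiet goodbye
No shadow of completion left behind
Everyone hops through the fields of data 🍃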

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: prevent completed step from being written to cancelled run' directly describes the main change - adding status='leased' guards to persistStep to prevent race condition where completed steps are written to cancelled runs. |
| Linked Issues check | ✅ Passed | All requirements from issue #121 are met: persistStep now checks status='leased' in both INSERT and UPDATE paths, updateProgress also includes the guard, proper error handling emits step:cancel before throwing CancelledError, and regression test validates the fix. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the cancel race condition: context.ts updates step persistence logic and cancellation handling, storage.ts adds lease guards to persistStep and updateProgress, and tests verify the fix. |


- persistStep now rejects cancelled runs (returns null), so the
  isCancelled ternary in the savedStep-truthy path is unreachable.
  Simplified to unconditional step:fail emit.
- Remove 'leased' as const casts — string literal is inferred correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/durably/src/context.ts (1)

126-139: ⚠️ Potential issue | 🟠 Major

Deciding why persistStep() failed from the AbortSignal alone is risky.

When cancelRun() is called from another instance or process, persistStep() returns null because status='cancelled', but controller.signal.aborted remains false. In that case this code falls through to the LeaseLostError path and step:cancel is never emitted. When savedStep === null, either re-check the current status with getRun(), or have persistStep() return a result that distinguishes cancellation from lease loss.

Also applies to: 174-198

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/durably/src/context.ts` around lines 126 - 139, When
persistStep(...) returns null you must not assume controller.signal.aborted is
the only reason; modify the code paths around persistStep (the block using
persistStep, abortForLeaseLoss, throwIfAborted and the similar block at 174-198)
to disambiguate cancellation vs lease loss by calling getRun(run.id) and
checking run.status (or extend persistStep to return a discriminated result like
{type: 'cancelled'|'lease-lost'|'ok', step}): if getRun shows status ===
'cancelled' treat it as a cancellation (emit step:cancel and handle
accordingly), otherwise treat it as lease loss (invoke abortForLeaseLoss/throw
LeaseLostError). Ensure both occurrences use the same disambiguation logic so
cancelRun-triggered cancellations are handled correctly.
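The second option in the prompt above, a discriminated result from persistStep, could look roughly like the sketch below. Everything here is an assumption made for illustration: `PersistResult`, `persistStepResult`, and `unwrapPersistResult` are hypothetical names, and the two error classes are local stand-ins for the CancelledError and LeaseLostError named in the review, not the library's own classes.

```typescript
// Local stand-ins for the error classes named in the review.
class CancelledError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'CancelledError'
  }
}

class LeaseLostError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'LeaseLostError'
  }
}

// Hypothetical discriminated result, following the shape the reviewer suggests.
type PersistResult<S> =
  | { type: 'ok'; step: S }
  | { type: 'cancelled' }
  | { type: 'lease-lost' }

interface RunState {
  status: string
  leaseGeneration: number
}

// Classifies the outcome from the run row the storage layer already sees.
function persistStepResult<S>(
  run: RunState,
  leaseGeneration: number,
  step: S,
): PersistResult<S> {
  if (run.status === 'cancelled') return { type: 'cancelled' }
  if (run.status !== 'leased' || run.leaseGeneration !== leaseGeneration) {
    return { type: 'lease-lost' }
  }
  return { type: 'ok', step }
}

// Caller side: one switch covers every outcome, with no AbortSignal involved.
function unwrapPersistResult<S>(
  result: PersistResult<S>,
  emit: (event: string) => void,
): S {
  switch (result.type) {
    case 'ok':
      return result.step
    case 'cancelled':
      emit('step:cancel')
      throw new CancelledError('run was cancelled before the step was persisted')
    case 'lease-lost':
      throw new LeaseLostError('lease was lost before the step was persisted')
  }
}
```

The appeal of this shape is that the caller cannot forget a case: the compiler checks the switch for exhaustiveness, and the decision no longer depends on whether the local abort signal happened to fire.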
packages/durably/src/storage.ts (1)

931-955: ⚠️ Potential issue | 🔴 Critical

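Verify the result of the UPDATE and roll back when the guard condition fails.

If the UPDATE durably_runs ... status = 'leased' that follows the INSERT ... SELECT updates 0 rows, the transaction as a whole still succeeds, so the step can remain while current_step_index does not advance. To keep the data consistent when the guard conditions (status and lease_generation) no longer hold, check numUpdatedRows of the UPDATE and, if it is 0, throw an error so the transaction rolls back.

Suggested fix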
        if (input.status === 'completed') {
-         await trx
+         const advanceResult = await trx
            .updateTable('durably_runs')
            .set({
              current_step_index: input.index + 1,
              updated_at: completedAt,
            })
            .where('id', '=', runId)
            .where('status', '=', 'leased' as const)
            .where('lease_generation', '=', leaseGeneration)
-           .execute()
+           .executeTakeFirst()
+
+         if (Number(advanceResult.numUpdatedRows) === 0) {
+           throw new Error('persist-step-guard-failed')
+         }
        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/durably/src/storage.ts` around lines 931 - 955, After performing the
INSERT ... SELECT into durably_steps (insertResult), inspect the result of the
subsequent trx.updateTable('durably_runs').set(...).where(...).execute() (assign
it to e.g. updateResult) and if Number(updateResult.numUpdatedRows) === 0 throw
an error to force the transaction to rollback; this ensures that if the guard
conditions (status = 'leased' and matching lease_generation) no longer hold and
the run row wasn't advanced, the step insert is not left inconsistent. Use the
same descriptive context in the thrown error so logs show it came from the
current_step_index advance step.
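The rollback behaviour requested above can be illustrated without a database. The sketch below uses a toy copy-on-write transaction over plain objects; `transaction`, `advanceStepIndex`, `persistCompletedStep`, and the `betweenStatements` hook are all hypothetical names for this example. The hook exists only to force the guard to fail between the two statements, and turning the rollback into a null return is an assumption about how a caller might want to see it.

```typescript
interface Run {
  status: 'leased' | 'cancelled'
  leaseGeneration: number
  currentStepIndex: number
}

interface Step {
  runId: string
  index: number
  status: 'completed'
}

interface Db {
  runs: Record<string, Run>
  steps: Step[]
}

const GUARD_FAILED = 'persist-step-guard-failed'

// Toy transaction: work on a copy, publish it only if the body does not throw.
function transaction<T>(db: Db, body: (trx: Db) => T): T {
  const runs: Record<string, Run> = {}
  for (const id of Object.keys(db.runs)) {
    const run = db.runs[id]
    if (run) runs[id] = { ...run }
  }
  const trx: Db = { runs, steps: db.steps.slice() }
  const result = body(trx) // a throw here leaves db untouched (rollback)
  db.runs = trx.runs
  db.steps = trx.steps
  return result
}

// Guarded UPDATE: returns the number of rows it changed (0 or 1).
function advanceStepIndex(
  trx: Db,
  runId: string,
  leaseGeneration: number,
  index: number,
): number {
  const run = trx.runs[runId]
  if (!run || run.status !== 'leased' || run.leaseGeneration !== leaseGeneration) {
    return 0
  }
  run.currentStepIndex = index + 1
  return 1
}

function persistCompletedStep(
  db: Db,
  runId: string,
  leaseGeneration: number,
  index: number,
  betweenStatements?: (trx: Db) => void, // test hook, not part of any real API
): Step | null {
  try {
    return transaction(db, (trx) => {
      const run = trx.runs[runId]
      // Guarded insert: refuse when the run is not leased by this generation.
      if (!run || run.status !== 'leased' || run.leaseGeneration !== leaseGeneration) {
        return null
      }
      const step: Step = { runId, index, status: 'completed' }
      trx.steps.push(step)
      if (betweenStatements) betweenStatements(trx)
      // Zero updated rows means the guard no longer holds: abort everything.
      if (advanceStepIndex(trx, runId, leaseGeneration, index) === 0) {
        throw new Error(GUARD_FAILED)
      }
      return step
    })
  } catch (err) {
    if (err instanceof Error && err.message === GUARD_FAILED) return null
    throw err
  }
}
```

When the hook flips the run to 'cancelled' after the insert, the update reports zero rows, the throw discards the copied state, and neither the step nor the index change reaches `db`.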
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/durably/tests/shared/step.shared.ts`:
- Around line 674-730: The test only exercises the local abort path and does not
verify the branch where persistStep() returns null on the DB side (savedStep ===
null / step:cancel). Suggested fix: keep the existing case and add a new test
that either 1) cancels the same run via cancel() from a separate Durably
instance to create an external-cancel path (symbols: createDurably, d.cancel,
jobs.raceDef.trigger, getRun, storage.getSteps), or 2) spies on persistStep so
that it returns null, forcing the DB guard path (symbols: persistStep, and
throwIfAborted plus the savedStep === null branch in context.ts). Adopt either
one and assert that run.status is 'cancelled' and that the number of completed
steps does not increase.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e1f5b43-76b4-46bd-b281-616088a3c2e7

📥 Commits

Reviewing files that changed from the base of the PR and between 960d7e4 and 70090ef.

📒 Files selected for processing (3)
  • packages/durably/src/context.ts
  • packages/durably/src/storage.ts
  • packages/durably/tests/shared/step.shared.ts

Comment thread packages/durably/tests/shared/step.shared.ts
Use direct DB update to simulate external cancel (bypassing in-process
abort signal), so the test verifies persistStep's status='leased' guard
actually rejects the write — not just the throwIfAborted() fast path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
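The distinction this commit draws, between the in-process abort fast path and the storage guard, can be sketched with a small model. This is not the project's test: `SignalLike` stands in for an AbortSignal, and `runStep`, `cancelInProcess`, `cancelExternally`, and `scenario` are hypothetical helpers. The only idea taken from the commit message is that an external cancel changes the row without setting the local signal.

```typescript
// Stand-in for AbortSignal; only the `aborted` flag matters here.
interface SignalLike {
  aborted: boolean
}

interface RunRow {
  status: 'leased' | 'cancelled'
  leaseGeneration: number
}

type StepOutcome = 'persisted' | 'aborted-fast-path' | 'refused-by-guard'

// Runs the step body, then tries to persist: abort check first,
// guarded write second.
function runStep(
  run: RunRow,
  signal: SignalLike,
  leaseGeneration: number,
  fn: () => void,
): StepOutcome {
  fn()
  if (signal.aborted) return 'aborted-fast-path'
  if (run.status !== 'leased' || run.leaseGeneration !== leaseGeneration) {
    return 'refused-by-guard'
  }
  return 'persisted'
}

// In-process cancel: flips the row and the local signal.
function cancelInProcess(run: RunRow, signal: SignalLike): void {
  run.status = 'cancelled'
  signal.aborted = true
}

// External cancel, simulated as a direct row update: the signal is untouched.
function cancelExternally(run: RunRow): void {
  run.status = 'cancelled'
}

// The cancel fires while fn() is still running, before persistence.
function scenario(cancel: (run: RunRow, signal: SignalLike) => void): StepOutcome {
  const run: RunRow = { status: 'leased', leaseGeneration: 1 }
  const signal: SignalLike = { aborted: false }
  return runStep(run, signal, 1, () => cancel(run, signal))
}

console.log(scenario(cancelInProcess)) // aborted-fast-path
console.log(scenario(cancelExternally)) // refused-by-guard
```

A test that cancels only through the in-process path never reaches the guard, which is why the external variant is the one that exercises the `status = 'leased'` check.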
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/durably/src/context.ts`:
- Around line 126-128: The code treats savedStep === null as lease loss based
only on controller.signal.aborted, which misclassifies DB-side cancellations;
change the logic around storage.persistStep and the LeaseLostError path so that
when savedStep === null you re-query the authoritative DB state (e.g. call
storage.getRun(run.id) or storage.getStep(run.id, stepId)) and if that record is
marked cancelled prefer treating it as a cancellation (emit the
run:cancel/step:cancel flow) instead of throwing LeaseLostError; only fall back
to LeaseLostError when the DB shows the lease was actually reclaimed and not a
cancellation, and keep references to persistStep, savedStep, LeaseLostError,
controller.signal.aborted to locate and update the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c82fd83-ed78-41f2-96db-05e13b794c44

📥 Commits

Reviewing files that changed from the base of the PR and between 70090ef and d24bd0d.

📒 Files selected for processing (2)
  • packages/durably/src/context.ts
  • packages/durably/src/storage.ts

Comment thread packages/durably/src/context.ts
When persistStep returns null, check the DB for the run's actual status
to distinguish cancel from lease loss. This handles external cancellations
(from another process) where the in-process abort signal isn't set.

Extracted throwForRefusedStep() helper to eliminate 3x duplicated logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
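A helper in the spirit of throwForRefusedStep() might look like the sketch below. The PR names the helper but does not show it, so the signature, the event payload, and the synchronous `getRun` callback are all assumptions; a real storage lookup would most likely be asynchronous. The error classes are local stand-ins.

```typescript
// Local stand-ins for the error classes named in the PR.
class CancelledError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'CancelledError'
  }
}

class LeaseLostError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'LeaseLostError'
  }
}

interface RunSnapshot {
  status: string
}

interface StepCancelPayload {
  runId: string
  stepName: string
}

// Hypothetical signature: called whenever persistStep returned null.
// It re-reads the run's stored status instead of trusting the abort signal.
function throwForRefusedStep(
  getRun: (runId: string) => RunSnapshot | null,
  runId: string,
  stepName: string,
  emit: (event: 'step:cancel', payload: StepCancelPayload) => void,
): never {
  const run = getRun(runId)
  if (run !== null && run.status === 'cancelled') {
    // Cancelled, possibly by another process: report it as a cancel.
    emit('step:cancel', { runId, stepName })
    throw new CancelledError(`run ${runId} was cancelled`)
  }
  // Anything else means this worker no longer holds the lease.
  throw new LeaseLostError(`lease lost for run ${runId}`)
}
```

Because the return type is `never`, a call site can write `if (savedStep === null) throwForRefusedStep(...)` and the compiler treats `savedStep` as non-null afterwards, which is one way the three duplicated branches could collapse into a single call.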

Development

Successfully merging this pull request may close these issues.

bug: cancel race allows completed step on cancelled run

1 participant