Skip to content

Retry on ErrRefMoved across mutating commands; fix SetRef snapshot desync#130

Merged
jallum merged 1 commit into
mainfrom
fix-ref-moved-retry
May 21, 2026
Merged

Retry on ErrRefMoved across mutating commands; fix SetRef snapshot desync#130
jallum merged 1 commit into
mainfrom
fix-ref-moved-retry

Conversation

@jallum

@jallum jallum commented May 21, 2026

Copy link
Copy Markdown
Owner

Why we made the change

If you've ever seen this when running bw start or bw create:

error: commit failed: ref moved: ref refs/heads/beadwork (expected f2105ed3, got 3e642ee7)

…this PR fixes it. Here's what was going on, ELI5:

bw stores all your issue state on a hidden git branch (refs/heads/beadwork). When you run a command like bw start abc-123, here's what happens under the hood:

  1. bw takes a snapshot of where the branch is right now (let's call it commit f2105ed3).
  2. It builds up the changes in memory ("mark abc-123 as in_progress, set assignee to alice").
  3. It tries to commit — but only if the branch is still at f2105ed3. This is a safety check (compare-and-swap) so two commands can't silently clobber each other's work.

The problem: if anything else moved that branch between step 1 and step 3 — another bw process in a different terminal, a bw sync finishing in the background, an agent running in a worktree — the safety check fails and the whole command dies with ref moved. You lose the work and have to retry by hand.

Only bw close knew how to retry. start, create, and a bunch of other commands did not.

What we changed

Two things:

1. A shared retry helper. New cmd/bw/commit_retry.go with a small function:

commitWithRetry(store, 3, func() (string, error) {
    // do the mutation; return the commit message
})

If the commit hits ErrRefMoved, it refreshes its view of the branch (picks up whatever the other process did), re-runs the closure against the fresh state, and tries again. Up to 3 times.

Wired into: start, create, update, delete, reopen, attach, label, comment, defer, undefer, dep add, dep remove. The old hand-rolled loop in close got replaced too, so there's now exactly one place that knows how to retry.

2. A latent TreeFS bug. While investigating, TDD turned up something else: TreeFS.SetRef was setting the git ref but not updating its own in-memory snapshot. If anything called SetRef on the ref it was tracking, the next Commit would falsely fail with ref moved. Not currently triggered by any user-facing flow (the one caller reopens TreeFS right after), but it was a footgun. Fixed.

How it works — a worked example

Before this PR, here's the race that caused your error:

Terminal A: bw start abc-123
  └─ snapshot ref @ f2105ed3
  └─ apply "start abc-123"
                              Terminal B: bw close xyz-456
                                └─ snapshot ref @ f2105ed3
                                └─ apply "close xyz-456"
                                └─ commit → ref now 3e642ee7  ✓
  └─ commit → expected f2105ed3, got 3e642ee7  ✗ DIES

After this PR:

Terminal A: bw start abc-123
  └─ snapshot ref @ f2105ed3
  └─ apply "start abc-123"
                              Terminal B: bw close xyz-456 → ref now 3e642ee7
  └─ commit → ref moved, retry…
  └─ refresh: snapshot ref @ 3e642ee7
  └─ re-apply "start abc-123" (on fresh state)
  └─ commit → ref now 9a1bcde4  ✓

The closure-based design matters because the second attempt has to re-derive everything from the fresh state — you can't just retry the bare commit, because the in-memory mutations were built on top of stale data.

Test plan

Tests are TDD-style — written first, drove the implementation:

  • internal/treefs/treefs_test.go: four new TestSnapshotConsistentAfter* tests prove that Reset, Refresh, MergeCommit, and SetRef all keep the in-memory snapshot consistent. The SetRef one failed first, which is how the latent bug was caught.
  • cmd/bw/commit_retry_test.go: three tests for the helper — recovers from a real ref-move race (using a second TreeFS to simulate a concurrent writer), bounds retries, propagates non-CAS errors immediately.
  • Full go test ./... passes.

…sync

Mutating commands (start, create, update, delete, reopen, attach, label,
comment, defer, undefer, dep add/remove) called store.Commit once and
surfaced ErrRefMoved raw when a concurrent writer advanced the ref
between snapshot and commit. Factor the close-command retry loop into
commitWithRetry and apply it everywhere.

Also: TreeFS.SetRef did not advance baseRef when called on the tracked
ref, leaving the in-memory snapshot stale. Latent in current callers
(init reopens TreeFS after) but proven by a new test and fixed.
@jallum jallum requested a review from iautom8things May 21, 2026 08:47

@iautom8things iautom8things left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 🚀 🚀 🚀

@jallum jallum merged commit b41636f into main May 21, 2026
1 check passed
@jallum jallum deleted the fix-ref-moved-retry branch May 21, 2026 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants