Write-lock after restart: in-memory loadedMeta stale, presence-active path denies forever #49

@slempiam

Summary

After any server restart, a doc whose clients reconnect with active collab presence during the ~5–10s Yjs-seeding window becomes permanently write-locked until either every client disconnects or the process restarts again. The doc's DB row stays healthy (document_projections.health = 'healthy', and the projection markdown matches the canonical row markdown). However, the in-memory loadedDocDbMeta cache captured during seeding falls out of sync with the row's updated_at / y_state_version snapshot, and the live mutation gate then denies all writes for as long as any presence exists.

The agent's own presence is sufficient to keep itself locked out.

Reproduce

  1. Start the server with npm run serve (or any equivalent that boots server/index.ts).
  2. Create a doc with a non-trivial markdown body (≥1 KB → ≥~9 KB y_state_blob).
  3. SIGTERM the process and restart it.
  4. Connect a client (browser editor OR an agent via the agent bridge) within ~10s of the new process becoming healthy, while the doc's Yjs baseline is still being re-seeded from the legacy projection row.
  5. Attempt any write (POST /documents/:slug/ops with a comment.add, or /edit/v2, or anything that flows through getMutationReadyDocument).

Expected: write succeeds.
Actual:

  • /ops returns 409 PROJECTION_STALE
  • Retry via /agent/ops-equivalent returns 409 LIVE_DOC_UNAVAILABLE
  • The state endpoint returns mutationReady: false, projectionFresh: false, readSource: "yjs_fallback" indefinitely
  • DB row remains healthy. The reconcile path reconcileStaleProjectionsOnStartup won't help because listDocsWithStaleProjection only returns rows where document_projections.health != 'healthy'.

Root cause

getCanonicalReadableDocumentSync (server/collab.ts around line 6620) downgrades mutation_ready to false when active presence exists AND getLocalLiveAuthorityDecision returns allowed: false:

if (activePresenceCount > 0) {
  const handle = loadCanonicalYDocSync(slug, { allowFragmentRecovery: false });
  const localLiveAuthorityDecision = getLocalLiveAuthorityDecision(slug, handle, breakdown);
  if (!localLiveAuthorityDecision.allowed) {
    return { ...result, mutation_ready: false };
  }
}

getLocalLiveAuthorityDecision requires isFreshResidentLocalDoc: loadedMeta.accessEpoch === row.access_epoch && loadedMeta.updatedAt === row.updated_at && loadedMeta.yStateVersion === getLatestYStateVersion(slug). When a client reconnects mid-seeding, loadedMeta is cached from the seed's snapshot, which is one tick behind whatever the row's updated_at or y_state_version resolves to once the seed completes. The mismatch is never repaired because the only repair path is "evict and re-load," and that path requires either an access_epoch bump (no caller does this) or every client to disconnect (the locked-out agent retries indefinitely).
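To make the mismatch concrete, here is a minimal sketch of the three-way equality described above. The interfaces, the function signature, and the sample timestamps and versions are illustrative assumptions; only the field names come from the check quoted in this issue, and the real code in server/collab.ts may be shaped differently.

```typescript
// Hypothetical shapes: field names follow the issue text, not the real source.
interface LoadedMeta {
  accessEpoch: number;
  updatedAt: string;
  yStateVersion: number;
}

interface DocRow {
  access_epoch: number;
  updated_at: string;
}

// Sketch of the freshness check: all three values must match exactly.
function isFreshResidentLocalDoc(
  loadedMeta: LoadedMeta,
  row: DocRow,
  latestYStateVersion: number,
): boolean {
  return (
    loadedMeta.accessEpoch === row.access_epoch &&
    loadedMeta.updatedAt === row.updated_at &&
    loadedMeta.yStateVersion === latestYStateVersion
  );
}

// Illustrative values only: a snapshot cached while the seed was in flight...
const cachedDuringSeed: LoadedMeta = {
  accessEpoch: 0,
  updatedAt: "2024-01-01T12:20:39.000Z",
  yStateVersion: 1,
};

// ...and the row as it looks once the seed has completed, one tick later.
const rowAfterSeed: DocRow = {
  access_epoch: 0,
  updated_at: "2024-01-01T12:20:40.000Z",
};
const latestVersionAfterSeed = 2;

const fresh = isFreshResidentLocalDoc(cachedDuringSeed, rowAfterSeed, latestVersionAfterSeed);
console.log(fresh); // false: accessEpoch matches, updatedAt and yStateVersion do not
```

Because accessEpoch still matches, nothing signals that an eviction is needed, and the check keeps returning false for as long as the cached snapshot stays resident.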

The noteStaleEpochBypassAdmission branch on the else side (no active presence) would write through fine. The bug is specifically the presence-active path racing against bootstrap.

Why startup-time reconcile (COLLAB_STARTUP_RECONCILE_ENABLED) doesn't address it

reconcileStaleProjectionsOnStartup works at the DB layer (document_projections.health). This bug lives in the in-memory loadedDocDbMeta cache; the DB row is healthy. Setting COLLAB_STARTUP_RECONCILE_ENABLED=true doesn't trigger anything for docs whose DB is fine but memory is stuck.
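The gap can be sketched as a filter over projection rows. This is an assumption-laden illustration: selectReconcileCandidates, the ProjectionRow shape, the "stale" health value, and the second slug are hypothetical stand-ins, and the only behavior taken from this issue is that rows with health = 'healthy' are never returned.

```typescript
// Hypothetical row shape; only the `health` column name comes from the issue.
interface ProjectionRow {
  slug: string;
  health: string;
}

// Stand-in for the selection described above: anything not 'healthy' qualifies.
function selectReconcileCandidates(rows: ProjectionRow[]): string[] {
  return rows.filter((row) => row.health !== "healthy").map((row) => row.slug);
}

const projectionRows: ProjectionRow[] = [
  // The stuck doc from the incident: its DB row is healthy, only memory is stale.
  { slug: "9rchcu7l", health: "healthy" },
  // A made-up doc with a non-healthy projection ("stale" is an assumed value).
  { slug: "other-doc", health: "stale" },
];

const candidates = selectReconcileCandidates(projectionRows);
console.log(candidates.includes("9rchcu7l")); // false: the stuck doc is never selected
```

Any DB-level selection of this kind has no view of the in-memory cache, so the stuck doc cannot become a reconcile candidate.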

Suggested fix candidates

A. Bootstrap-window grace. Treat a loadedMeta cached within N seconds of process start as authoritative even if it disagrees with the live row, mirroring the existing isDocFreshForGlobalCollabAdmissionGuard early-window pattern at line 6629.

B. Reconcile-on-mismatch instead of deny. When isFreshResidentLocalDoc returns false, re-load the in-memory state from DB and re-evaluate, instead of returning allowed: false. The current behavior is "be cautious" — but with no caller able to clear it, that's "permanently stuck."

C. Self-heal on presence churn. When a new client connects to a doc whose loadedMeta is stale, evict and re-attach.

A + B together is probably the right design. C is more invasive.
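A rough sketch of how A and B could combine is shown below. Everything here is hypothetical: decideLiveAuthority, the callback parameters, the cachedAtMs field, and the reason strings are invented for illustration, and trying the reload (B) before the grace window (A) is one possible ordering rather than a settled design.

```typescript
// Hypothetical types; only accessEpoch / updatedAt / yStateVersion come from the issue.
interface MetaSnapshot {
  accessEpoch: number;
  updatedAt: string;
  yStateVersion: number;
}

interface CachedMeta extends MetaSnapshot {
  cachedAtMs: number; // when the snapshot was captured (assumed to be tracked)
}

interface Decision {
  allowed: boolean;
  reason: "fresh" | "reconciled" | "bootstrap_grace" | "stale";
}

function sameSnapshot(a: MetaSnapshot, b: MetaSnapshot): boolean {
  return (
    a.accessEpoch === b.accessEpoch &&
    a.updatedAt === b.updatedAt &&
    a.yStateVersion === b.yStateVersion
  );
}

function decideLiveAuthority(
  cached: CachedMeta,
  readRow: () => MetaSnapshot,
  reloadFromDb: () => CachedMeta,
  processStartMs: number,
  graceMs: number,
): Decision {
  if (sameSnapshot(cached, readRow())) {
    return { allowed: true, reason: "fresh" };
  }
  // Candidate B: reload the in-memory state once and re-evaluate instead of denying.
  const reloaded = reloadFromDb();
  if (sameSnapshot(reloaded, readRow())) {
    return { allowed: true, reason: "reconciled" };
  }
  // Candidate A: trust a snapshot that was cached inside the bootstrap window.
  if (cached.cachedAtMs - processStartMs <= graceMs) {
    return { allowed: true, reason: "bootstrap_grace" };
  }
  return { allowed: false, reason: "stale" };
}

// Illustrative values: the row moved on after the cached snapshot was taken.
const liveRow: MetaSnapshot = { accessEpoch: 0, updatedAt: "t2", yStateVersion: 2 };
const staleCache: CachedMeta = { accessEpoch: 0, updatedAt: "t1", yStateVersion: 1, cachedAtMs: 5_000 };

const decision = decideLiveAuthority(
  staleCache,
  () => liveRow,
  () => ({ ...liveRow, cachedAtMs: 60_000 }),
  0,
  10_000,
);
console.log(decision.reason); // "reconciled"
```

A real implementation would also need to store the reloaded snapshot back into the cache so later calls take the "fresh" path, and to bound how often the reload runs.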

Operational mitigations available today

  • Don't restart while clients have a doc open.
  • For an already-stuck doc, evict via UPDATE documents SET access_epoch = access_epoch + 1 WHERE slug = ?; evictStaleLocalStateForAccessEpoch then rebuilds clean on next read.
  • Or have the affected agent create a fresh doc.

Environment

  • Fork: Studio-Intrinsic/proof-service (we'll PR our fix back here too)
  • Host: Fly, single machine, npm run serve → tsx server/index.ts
  • SQLite on a Fly volume at /data/proof-share.db
  • Node 20, better-sqlite3 12.6.2
  • COLLAB_STARTUP_RECONCILE_ENABLED=true (does not help this case, see above)

Logs from the incident

12:14:00 [collab] seeding missing canonical Yjs baseline from legacy projection row { slug: '9rchcu7l' }
12:20:30 SIGTERM (deploy restart)
12:20:38 [collab] embedded runtime enabled wsUrlBase=...
12:21:27 buildCollabSession lease noted { slug: '9rchcu7l', role: 'editor', accessEpoch: 0 }
12:21:30 authenticated collab presence attached { slug: '9rchcu7l', role: 'editor', accessEpoch: 0 }
12:23–12:40+ writes return 409 PROJECTION_STALE / LIVE_DOC_UNAVAILABLE indefinitely

The agent connected at 12:21:27 — 57s after restart — while seeding was still settling. The doc has been write-locked for 30+ minutes.

Tracker

Fork PR with the fix candidate will be linked here once opened.
