Write-lock after restart: in-memory loadedMeta stale, presence-active path denies forever

## Summary

After any server restart, a doc whose collab clients reconnect during the ~5–10s Yjs-seeding window becomes **permanently write-locked** until either every client disconnects or the process restarts again. The doc's DB row stays healthy (`document_projections.health = 'healthy'`, projection markdown matches canonical row markdown), but the in-memory `loadedDocDbMeta` cache captured during seeding falls out of sync with the row's `updated_at` / `y_state_version` snapshot. The live mutation gate then denies all writes for as long as any presence exists.

The agent's *own* presence is sufficient to keep itself locked out.

## Reproduce

1. Start the server with `npm run serve` (or any equivalent that boots `server/index.ts`).
2. Create a doc with a non-trivial markdown body (≥1 KB → ≥~9 KB `y_state_blob`).
3. SIGTERM the process and restart it.
4. Connect a client (browser editor OR an agent via the agent bridge) **within ~10s of the new process becoming healthy**, while the doc's Yjs baseline is still being re-seeded from the legacy projection row.
5. Attempt any write (`POST /documents/:slug/ops` with a `comment.add`, or `/edit/v2`, or anything that flows through `getMutationReadyDocument`).

**Expected:** write succeeds.
**Actual:**
- `/ops` returns `409 PROJECTION_STALE`
- Retry via `/agent/ops`-equivalent returns `409 LIVE_DOC_UNAVAILABLE`
- The state endpoint returns `mutationReady: false, projectionFresh: false, readSource: "yjs_fallback"` indefinitely
- DB row remains `healthy`. The reconcile path `reconcileStaleProjectionsOnStartup` won't help because `listDocsWithStaleProjection` only returns rows where `document_projections.health != 'healthy'`.

## Root cause

`getCanonicalReadableDocumentSync` (`server/collab.ts` around line 6620) downgrades `mutation_ready` to `false` when active presence exists AND `getLocalLiveAuthorityDecision` returns `allowed: false`:

```ts
if (activePresenceCount > 0) {
  const handle = loadCanonicalYDocSync(slug, { allowFragmentRecovery: false });
  const localLiveAuthorityDecision = getLocalLiveAuthorityDecision(slug, handle, breakdown);
  if (!localLiveAuthorityDecision.allowed) {
    return { ...result, mutation_ready: false };
  }
}
```

`getLocalLiveAuthorityDecision` requires `isFreshResidentLocalDoc`, which holds only when all three of these match:

- `loadedMeta.accessEpoch === row.access_epoch`
- `loadedMeta.updatedAt === row.updated_at`
- `loadedMeta.yStateVersion === getLatestYStateVersion(slug)`

When a client reconnects mid-seeding, `loadedMeta` is cached from the seed's snapshot, which is one tick behind whatever the row's `updated_at` or `y_state_version` resolves to once the seed completes. The mismatch is never repaired because the only repair path is "evict and re-load," and that path requires either an `access_epoch` bump (no caller does this) or every client to disconnect (the locked-out agent retries indefinitely).
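To make the failure mode concrete, here is a minimal sketch of the three-way freshness comparison described above. The interfaces, the standalone function signature, and the timestamp/version values are illustrative assumptions; only the field names and the equality conditions come from this issue, not from the real types in `server/collab.ts`.

```ts
// Hypothetical shapes; field names follow the issue text only.
interface LoadedMeta {
  accessEpoch: number;
  updatedAt: string;
  yStateVersion: number;
}

interface DocRow {
  access_epoch: number;
  updated_at: string;
}

// Mirrors the three equality checks quoted in the root-cause paragraph.
function isFreshResidentLocalDoc(
  loadedMeta: LoadedMeta,
  row: DocRow,
  latestYStateVersion: number,
): boolean {
  return (
    loadedMeta.accessEpoch === row.access_epoch &&
    loadedMeta.updatedAt === row.updated_at &&
    loadedMeta.yStateVersion === latestYStateVersion
  );
}

// Snapshot cached while the seed was still in flight (made-up values).
const cachedDuringSeed: LoadedMeta = {
  accessEpoch: 0,
  updatedAt: "2024-01-01T12:20:39.000Z",
  yStateVersion: 1,
};

// Row once the seed has completed: same epoch, newer timestamp and version.
const rowAfterSeed: DocRow = {
  access_epoch: 0,
  updated_at: "2024-01-01T12:20:40.000Z",
};
const latestVersionAfterSeed = 2;

const freshAtFirstCheck = isFreshResidentLocalDoc(
  cachedDuringSeed,
  rowAfterSeed,
  latestVersionAfterSeed,
);

// Nothing rewrites the cached snapshot while presence stays active,
// so every later check compares the same inputs and gets the same answer.
const freshAtLaterCheck = isFreshResidentLocalDoc(
  cachedDuringSeed,
  rowAfterSeed,
  latestVersionAfterSeed,
);

console.log({ freshAtFirstCheck, freshAtLaterCheck });
```

Because the epoch matches, the epoch-based eviction never fires, yet the other two fields keep the check false on every retry.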

The `noteStaleEpochBypassAdmission` branch on the `else` side (no active presence) would let the write through. The bug is specifically the presence-active path racing against bootstrap.

## Why startup-time reconcile (`COLLAB_STARTUP_RECONCILE_ENABLED`) doesn't address it

`reconcileStaleProjectionsOnStartup` works at the DB layer (`document_projections.health`). This bug lives in the in-memory `loadedDocDbMeta` cache; the DB row is healthy. Setting `COLLAB_STARTUP_RECONCILE_ENABLED=true` doesn't trigger anything for docs whose DB is fine but memory is stuck.
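A small sketch of why the health filter never selects a stuck doc. The function below is a stand-in written for this issue, not the real `listDocsWithStaleProjection`; the row shape and the `'needs_repair'` health value are assumptions chosen only to have one non-healthy row to contrast with.

```ts
// Hypothetical row shape for document_projections.
interface ProjectionRow {
  slug: string;
  health: string;
}

// Stand-in for the selection rule described above: only rows whose
// health is not 'healthy' are candidates for startup reconcile.
function listDocsNeedingReconcile(rows: ProjectionRow[]): ProjectionRow[] {
  return rows.filter((row) => row.health !== "healthy");
}

const projectionRows: ProjectionRow[] = [
  // The write-locked doc from the incident: its DB row is healthy.
  { slug: "9rchcu7l", health: "healthy" },
  // A made-up doc with a made-up non-healthy status, for contrast.
  { slug: "other-doc", health: "needs_repair" },
];

const reconcileCandidates = listDocsNeedingReconcile(projectionRows).map(
  (row) => row.slug,
);

console.log(reconcileCandidates);
```

The stuck doc is filtered out before any reconcile logic runs, which is why the flag has no effect on it.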

## Suggested fix candidates

**A. Bootstrap-window grace.** Treat a `loadedMeta` cached within N seconds of process start as authoritative even if it disagrees with the live row, mirroring the existing `isDocFreshForGlobalCollabAdmissionGuard` early-window pattern at line 6629.

**B. Reconcile-on-mismatch instead of deny.** When `isFreshResidentLocalDoc` returns false, re-load the in-memory state from DB and re-evaluate, instead of returning `allowed: false`. The current behavior is "be cautious" — but with no caller able to clear it, that's "permanently stuck."

**C. Self-heal on presence churn.** When a new client connects to a doc whose `loadedMeta` is stale, evict and re-attach.

A + B together is probably the right design. C is more invasive.
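One possible shape for A + B, as a rough sketch rather than a patch against `server/collab.ts`. Every name here (`decideLocalAuthority`, `CachedMeta`, the `reload` callback, the `clock` parameter, the reason strings) is hypothetical. It assumes B is tried first and A is the fallback when a reload is not possible, and it ignores the harder question of reconciling live Yjs state held by connected clients.

```ts
interface MetaSnapshot {
  accessEpoch: number;
  updatedAt: string;
  yStateVersion: number;
}

interface CachedMeta extends MetaSnapshot {
  cachedAtMs: number; // when this snapshot was captured
}

type Reason = "fresh" | "reloaded" | "bootstrap_grace" | "stale";

interface Decision {
  allowed: boolean;
  reason: Reason;
}

function sameSnapshot(a: MetaSnapshot, b: MetaSnapshot): boolean {
  return (
    a.accessEpoch === b.accessEpoch &&
    a.updatedAt === b.updatedAt &&
    a.yStateVersion === b.yStateVersion
  );
}

function decideLocalAuthority(
  cache: Map<string, CachedMeta>,
  slug: string,
  row: MetaSnapshot,
  reload: (slug: string) => MetaSnapshot | null, // null: reload not possible
  clock: { nowMs: number; processStartMs: number; graceMs: number },
): Decision {
  const cached = cache.get(slug);
  if (cached && sameSnapshot(cached, row)) {
    return { allowed: true, reason: "fresh" };
  }

  // Candidate B: on mismatch, re-load from the DB and re-evaluate.
  const reloaded = reload(slug);
  if (reloaded && sameSnapshot(reloaded, row)) {
    cache.set(slug, { ...reloaded, cachedAtMs: clock.nowMs });
    return { allowed: true, reason: "reloaded" };
  }

  // Candidate A: trust a snapshot captured inside the bootstrap window.
  if (cached && cached.cachedAtMs - clock.processStartMs <= clock.graceMs) {
    return { allowed: true, reason: "bootstrap_grace" };
  }

  return { allowed: false, reason: "stale" };
}

// Usage: a snapshot cached 1s after start, one tick behind the row.
const cache = new Map<string, CachedMeta>();
cache.set("9rchcu7l", {
  accessEpoch: 0,
  updatedAt: "t0",
  yStateVersion: 1,
  cachedAtMs: 1000,
});
const row: MetaSnapshot = { accessEpoch: 0, updatedAt: "t1", yStateVersion: 2 };

const decision = decideLocalAuthority(cache, "9rchcu7l", row, () => row, {
  nowMs: 60000,
  processStartMs: 0,
  graceMs: 10000,
});

console.log(decision);
```

The ordering is a design choice: reloading first keeps the cache truthful, and the grace window only covers the case where a reload cannot be done.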

## Operational mitigations available today

- Don't restart while clients have a doc open.
- For an already-stuck doc, evict via `UPDATE documents SET access_epoch = access_epoch + 1 WHERE slug = ?` — `evictStaleLocalStateForAccessEpoch` then rebuilds clean on next read.
- Or have the affected agent create a fresh doc.

## Environment

- Fork: `Studio-Intrinsic/proof-service` (we'll PR our fix back here too)
- Host: Fly, single machine, `npm run serve` → `tsx server/index.ts`
- SQLite on a Fly volume at `/data/proof-share.db`
- Node 20, `better-sqlite3` 12.6.2
- `COLLAB_STARTUP_RECONCILE_ENABLED=true` (does not help this case, see above)

## Logs from the incident

```
12:14:00 [collab] seeding missing canonical Yjs baseline from legacy projection row { slug: '9rchcu7l' }
12:20:30 SIGTERM (deploy restart)
12:20:38 [collab] embedded runtime enabled wsUrlBase=...
12:21:27 buildCollabSession lease noted { slug: '9rchcu7l', role: 'editor', accessEpoch: 0 }
12:21:30 authenticated collab presence attached { slug: '9rchcu7l', role: 'editor', accessEpoch: 0 }
12:23–12:40+ writes return 409 PROJECTION_STALE / LIVE_DOC_UNAVAILABLE indefinitely
```

The agent connected at 12:21:27, 57s after the SIGTERM and 49s after the new runtime logged startup, while seeding was still settling. The doc has been write-locked for 30+ minutes.

## Tracker

Fork PR with the fix candidate will be linked here once opened.

