Write-lock from ghost recentCollabSessionLease: /edit/v2 returns LIVE_DOC_UNAVAILABLE for 5 min after every browser visit #50

@slempiam

Summary

The hostedRemoteLiveLease check in server/canonical-document.ts:818-836 blocks /edit/v2 writes whenever a time-bucketed recentCollabSessionLease exists for the slug but no current-epoch live attachment is present. The lease persists for 5 minutes after every browser visit (DEFAULT_COLLAB_SESSION_TTL_SECONDS = 5 * 60), so any doc that was opened in a browser tab in the past 5 minutes — even after the tab closed and the WebSocket disconnected — is write-locked for agent edits during that window.

Comments via /ops and /bridge/comments work because they don't enforce strictLiveDoc. Only /edit/v2 (the canonical agent edit endpoint) trips the check.

Reproduce

  1. Start the server (any environment with isHostedRewriteEnvironment() returning true — Fly with PROOF_PUBLIC_BASE_URL set).
  2. Create a doc via POST /documents.
  3. Open /d/<slug>?token=<token> in a browser. The server records a recentCollabSessionLease (5-min TTL).
  4. Close the browser tab. Wait 5–10 seconds (long enough that the WebSocket has disconnected and active_collab_connections rows are gone).
  5. From an agent or curl, call POST /documents/<slug>/edit/v2 with a valid baseRevision + operations payload.

Expected: write succeeds. The browser is gone; no one is actively editing; the persisted state is authoritative.

Actual: 409 LIVE_DOC_UNAVAILABLE with error: "Live canonical document is unavailable on this hosted replica; retry after refreshing state". Retries within the 5-minute window keep failing the same way. The error's accompanying snapshot shows mutationReady: true, readSource: "projection", projectionFresh: true — i.e. the read side is perfectly healthy; only the write-side liveness check is denying.
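On the agent side, the failure signature above can be recognized from the response itself. The following is a minimal sketch, assuming the 409 body is JSON with a top-level `code` and a `snapshot` object holding the fields quoted above. That body layout and the `looksLikeGhostLeaseDeny` helper are hypothetical; only the field names and values come from this report.

```typescript
// Hypothetical shape of the 409 body. Only the field names quoted in this
// issue (code, mutationReady, readSource, projectionFresh) come from the report;
// their nesting under `snapshot` is an assumption.
interface EditV2ErrorBody {
  code?: string;
  snapshot?: {
    mutationReady?: boolean;
    readSource?: string;
    projectionFresh?: boolean;
  };
}

// Returns true when a failed /edit/v2 call matches the signature described
// here: the write is denied for liveness while the read side reports healthy.
function looksLikeGhostLeaseDeny(status: number, body: EditV2ErrorBody): boolean {
  if (status !== 409 || body.code !== 'LIVE_DOC_UNAVAILABLE') return false;
  const snap = body.snapshot;
  return snap?.mutationReady === true && snap?.projectionFresh === true;
}
```

This only flags candidate responses. From the client side a ghost lease and a genuine reconnect window look the same, which is the core of this issue.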

Root cause

server/ws.ts:167 computes breakdown.total = Math.max(exactEpochCount, documentLeaseBreakdown.exactEpochCount, recentLeaseCount). noteRecentCollabSessionLease (see server/collab.ts:10888) fires on every collab-session lease admission, including normal browser visits via /d/:slug, and records a 5-minute bucket entry. For those 5 minutes recentLeaseCount = 1, so breakdown.total stays above zero even when no live connection remains.
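The effect of the Math.max aggregation can be shown with a simplified model. This is a sketch, not the server code: `computeBreakdownTotal` is a hypothetical stand-in whose parameters mirror the identifiers named above.

```typescript
// Simplified model of the aggregation described for server/ws.ts.
// Illustrative only; the real function takes richer inputs.
function computeBreakdownTotal(
  exactEpochCount: number,
  documentLeaseExactEpochCount: number,
  recentLeaseCount: number,
): number {
  return Math.max(exactEpochCount, documentLeaseExactEpochCount, recentLeaseCount);
}

// Tab closed shortly before: no live attachment and no document lease,
// but the 5-minute bucket entry is still present.
const totalAfterTabClose = computeBreakdownTotal(0, 0, 1);

// Bucket entry expired (or doc never visited): nothing holds the total up.
const totalAfterExpiry = computeBreakdownTotal(0, 0, 0);
```

Because the recent-lease bucket feeds the same maximum as the live counts, it alone is enough to make total positive.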

server/canonical-document.ts:826-836:

const hostedRemoteLiveLease = collabRuntimeEnabled
    && hostedRuntime
    && collabClientBreakdown.total > 0
    && collabClientBreakdown.exactEpochCount === 0;
if (strictLiveDocRequested && hostedRemoteLiveLease) {
  return {
    ok: false, status: 409, code: 'LIVE_DOC_UNAVAILABLE',
    error: 'Live canonical document is unavailable on this hosted replica; retry after refreshing state',
    retryWithState: `/api/agent/${args.slug}/state`,
  };
}

Combined: total > 0 (from recentLeaseCount) AND exactEpochCount === 0 (no live presence) → deny. The check cannot distinguish "active reconnect window — there's a live editor briefly disconnected" from "ghost lease — the editor closed their tab 30 seconds ago and isn't coming back."

waitForHostedLiveLeaseMaterialization upstream of the check doesn't help: it polls for exactEpochCount > 0 to materialize, which never happens if no one's actually reconnecting.

Suggested fix

Distinguish the two cases via documentLeaseExactCount (and possibly documentLeaseAnyEpochCount):

  • Real active reconnect window: a documentLease exists (the lease was opened during an actual session). Continue to deny — that lease holder is briefly disconnected.
  • Ghost lease only: recentLeaseCount > 0 but documentLeaseExactCount === 0 AND documentLeaseAnyEpochCount === 0. Treat as no-presence — proceed with the persisted-handle write path.

Concretely, change hostedRemoteLiveLease to also require documentLeaseExactCount > 0 || documentLeaseAnyEpochCount > 0:

const hasRealLeasePresence = collabClientBreakdown.documentLeaseExactCount > 0
  || collabClientBreakdown.documentLeaseAnyEpochCount > 0;
const hostedRemoteLiveLease = collabRuntimeEnabled
    && hostedRuntime
    && collabClientBreakdown.total > 0
    && collabClientBreakdown.exactEpochCount === 0
    && hasRealLeasePresence;

This preserves the original intent (don't write while a real lease holder is mid-reconnect) but removes the false positive when total is being inflated purely by the 5-minute recentLeaseCount bucket.
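To compare the two predicates side by side, here is a self-contained sketch. The `CollabClientBreakdown` interface is a simplified stand-in holding only the fields named in this issue, collabRuntimeEnabled and hostedRuntime are assumed true, and the three sample breakdowns are illustrative values rather than captured data.

```typescript
// Simplified stand-in for the real breakdown type; only the fields
// referenced in this issue are modeled.
interface CollabClientBreakdown {
  total: number;
  exactEpochCount: number;
  documentLeaseExactCount: number;
  documentLeaseAnyEpochCount: number;
}

// Current check, with collabRuntimeEnabled and hostedRuntime assumed true.
function deniesToday(b: CollabClientBreakdown): boolean {
  return b.total > 0 && b.exactEpochCount === 0;
}

// Proposed check: additionally require evidence of a real document lease.
function deniesWithFix(b: CollabClientBreakdown): boolean {
  const hasRealLeasePresence =
    b.documentLeaseExactCount > 0 || b.documentLeaseAnyEpochCount > 0;
  return deniesToday(b) && hasRealLeasePresence;
}

// Ghost lease: total is held up only by the recent-lease bucket.
const ghostLease: CollabClientBreakdown = {
  total: 1, exactEpochCount: 0, documentLeaseExactCount: 0, documentLeaseAnyEpochCount: 0,
};
// Reconnect window: a document lease exists but no current-epoch attachment.
const reconnectWindow: CollabClientBreakdown = {
  total: 1, exactEpochCount: 0, documentLeaseExactCount: 1, documentLeaseAnyEpochCount: 1,
};
// Live editor attached: the check never applies.
const liveEditor: CollabClientBreakdown = {
  total: 1, exactEpochCount: 1, documentLeaseExactCount: 1, documentLeaseAnyEpochCount: 1,
};
```

Under these assumptions deniesToday returns true for both ghostLease and reconnectWindow, while deniesWithFix returns true only for reconnectWindow.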

Operational mitigations available today

  • Set COLLAB_SESSION_TTL_SECONDS=30 (or some value shorter than the typical "agent retry budget"). Trade-off: the "ghost deny" window shrinks, but so does the grace period for genuine reconnects.
  • Keep a browser tab open on the doc while the agent edits. exactEpochCount > 0 defuses the check entirely.
  • Wait 5 minutes after closing all browser tabs before agent-editing.
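When choosing a TTL for the first mitigation, it may help to check whether an agent's retries would outlast the lease. The sketch below assumes a capped exponential backoff, which is a hypothetical agent policy and not something the service prescribes. It also takes the worst case where the lease is recorded just before the first attempt.

```typescript
// Total seconds an agent spends waiting across its retries, assuming the
// delay doubles each attempt and is capped at maxDelaySeconds.
function retryBudgetSeconds(
  attempts: number,
  baseDelaySeconds: number,
  maxDelaySeconds: number,
): number {
  let total = 0;
  for (let i = 0; i < attempts; i++) {
    total += Math.min(baseDelaySeconds * 2 ** i, maxDelaySeconds);
  }
  return total;
}

// True when the agent is still retrying after the lease would have expired.
function budgetOutlastsLease(budgetSeconds: number, ttlSeconds: number): boolean {
  return budgetSeconds > ttlSeconds;
}

// Five retries starting at 2 s, capped at 30 s: 2 + 4 + 8 + 16 + 30 = 60 s.
const budget = retryBudgetSeconds(5, 2, 30);
const coversDefaultTtl = budgetOutlastsLease(budget, 5 * 60); // false
const coversShortTtl = budgetOutlastsLease(budget, 30); // true
```

With these example numbers the agent gives up well inside the default 5-minute window, but would get through with the TTL set to 30 seconds.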

Environment

  • Fork: Studio-Intrinsic/proof-service. Will PR our fix back to upstream.
  • Host: Fly, scout-proof app, single machine, embedded collab runtime.
  • Observed today via POST /documents/9rchcu7l/edit/v2 immediately after closing all browser sessions: deterministic 409 LIVE_DOC_UNAVAILABLE, snapshot in same response shows mutationReady: true, active_collab_connections row count = 0 via direct DB query.

Related

Same-day upstream issue: #49 — different bug (in-memory loadedDocDbMeta stale on read side); fix shipped today. This issue is the next-layer write-side deny that surfaces after the read side is healthy.

Tracker

Fork PR will be linked here once opened.
