Mutation-push returns an opaque 500 and fails the whole project batch on a single empty-content observation (no repairable 400, no quarantine) #503

@forNerzul

📋 Pre-flight Checks

  • I have searched existing issues and this is not a duplicate
  • I understand this issue needs status:approved before a PR can be opened

📝 Bug Description

POST /sync/mutations/push accepts an empty-content observation upsert at the validation gate but then fatally rejects it during chunk canonicalization, surfacing as an opaque HTTP 500 that fails the entire per-project batch.

The push gate is lenient: validateLegacyPayload is a no-op for observation/session/prompt (internal/cloud/cloudserver/mutations.go:340-353), so an empty-content observation passes. The same batch is then canonicalized, where empty content is a hard error, "observation payload content is required for upsert" (internal/cloud/chunkcodec/chunkcodec.go:412), and that error reaches the client as a bare http.Error(..., StatusInternalServerError) (mutations.go:204). The canonicalization error fires before any transaction is opened (materializedMutationBatchChunks, cloudstore.go:734, ahead of BeginTx at :739). Because the client is all-or-nothing (transport.go:347-350; autosync refuses partial acks, autosync/manager.go:532-534), a single poison row blocks the whole per-project batch, and autosync stalls into degraded/backoff (MaxConsecutiveFailures=10, manager.go:144) until the row is removed by hand.

The defect (precise)

Not batch atomicity — that's by design (BW3). The defect is: (1) validation-vs-canonicalization inconsistency (the gate accepts what canonicalization rejects); (2) an opaque 500 instead of an actionable error; (3) no push-side dead-letter/quarantine/repair.

Asymmetry with chunk-push

The chunk-push path handles the identical empty-content case with a repairable 400: validateImportableChunkPayload → writeActionableError(w, StatusBadRequest, ...Repairable, ...PayloadInvalid, ...) (internal/cloud/cloudserver/cloudserver.go:464). So the two ingest paths disagree on the same bad input: chunk-push → repairable 400; mutation-push → opaque 500.
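For illustration, here is a rough sketch of what parity at the push gate could look like. It is not the project's code: the mutation struct, its JSON field names, validateObservationUpsert, validateBatch, writeRepairable400, and the error body keys are all hypothetical stand-ins. Treating whitespace-only content as empty is also an assumption; a real gate would need to mirror whatever chunkcodec actually checks.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// mutation is a hypothetical, simplified stand-in for one pushed entry.
type mutation struct {
	Entity  string          `json:"entity"`
	Op      string          `json:"op"`
	Payload json.RawMessage `json:"payload"`
}

// errEmptyContent reuses the message that canonicalization reports today.
var errEmptyContent = errors.New("observation payload content is required for upsert")

// validateObservationUpsert rejects observation upserts with empty content.
// Every other entity/op combination passes through unchecked.
func validateObservationUpsert(m mutation) error {
	if m.Entity != "observation" || m.Op != "upsert" {
		return nil
	}
	var body struct {
		Content string `json:"content"`
	}
	if err := json.Unmarshal(m.Payload, &body); err != nil {
		return fmt.Errorf("decode observation payload: %w", err)
	}
	// Assumption: whitespace-only content counts as empty.
	if strings.TrimSpace(body.Content) == "" {
		return errEmptyContent
	}
	return nil
}

// validateBatch returns the index and error of the first invalid mutation,
// or (-1, nil) when the whole batch is acceptable.
func validateBatch(batch []mutation) (int, error) {
	for i, m := range batch {
		if err := validateObservationUpsert(m); err != nil {
			return i, err
		}
	}
	return -1, nil
}

// writeRepairable400 emits a structured 400 naming the offending index.
// The JSON keys are illustrative, not the server's real error schema.
func writeRepairable400(w http.ResponseWriter, index int, err error) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusBadRequest)
	_ = json.NewEncoder(w).Encode(map[string]any{
		"error_class": "repairable",
		"error_code":  "payload_invalid",
		"error":       fmt.Sprintf("mutations[%d]: %v", index, err),
	})
}

func main() {
	batch := []mutation{
		{Entity: "observation", Op: "upsert", Payload: json.RawMessage(`{"content":"ok"}`)},
		{Entity: "observation", Op: "upsert", Payload: json.RawMessage(`{"content":""}`)},
	}
	rec := httptest.NewRecorder()
	if i, err := validateBatch(batch); err != nil {
		writeRepairable400(rec, i, err)
	}
	fmt.Println(rec.Code, strings.TrimSpace(rec.Body.String()))
}
```

Running the check before canonicalization would let the client see which entry is bad and why, instead of a bare 500 for the whole batch.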

🔄 Steps to Reproduce

  1. Self-hosted engram-cloud from main; enroll project P.
  2. Ensure one observation in P has empty content (e.g. a session_summary created before #424, "fix(mcp): harden handleSessionSummary with process override and empty-content guard").
  3. Trigger an autosync push (ENGRAM_CLOUD_AUTOSYNC=1 within engram serve/engram mcp).
  4. Observe: HTTP 500 with the canonicalize error; sync_state goes degraded; cloud_mutations for P stays 0 (whole batch blocked).
  5. Remove/repair the empty row → next push succeeds.
  6. Contrast: the same row via chunk-push yields a repairable 400, not an opaque 500.

✅ Expected Behavior

Either the push gate validates observation content up front and returns an actionable, repairable 400 (parity with chunk-push), or the offending row is quarantined/skipped so the rest of the batch syncs — not an opaque 500 that stalls the whole project's cloud sync.
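The quarantine alternative could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the entry and rejected types, their fields, and the partition function are hypothetical and do not correspond to existing engram code.

```go
package main

import "fmt"

// entry is a hypothetical, simplified pushed mutation.
type entry struct {
	Seq     int64
	Entity  string
	Op      string
	Content string
}

// rejected records why one entry was set aside instead of being stored.
type rejected struct {
	Seq    int64
	Reason string
}

// partition splits a batch into entries that are safe to canonicalize and
// entries to quarantine, preserving the original order within each group.
func partition(batch []entry) (ok []entry, bad []rejected) {
	for _, e := range batch {
		if e.Entity == "observation" && e.Op == "upsert" && e.Content == "" {
			bad = append(bad, rejected{
				Seq:    e.Seq,
				Reason: "observation payload content is required for upsert",
			})
			continue
		}
		ok = append(ok, e)
	}
	return ok, bad
}

func main() {
	batch := []entry{
		{Seq: 1, Entity: "observation", Op: "upsert", Content: "first"},
		{Seq: 2, Entity: "observation", Op: "upsert", Content: ""},
		{Seq: 3, Entity: "session", Op: "upsert"},
	}
	ok, bad := partition(batch)
	fmt.Printf("accepted=%d quarantined=%d\n", len(ok), len(bad))
	for _, r := range bad {
		fmt.Printf("seq %d: %s\n", r.Seq, r.Reason)
	}
}
```

Since autosync refuses partial acks, a design along these lines would presumably also need the response to account for quarantined entries explicitly, so the client can treat the batch as fully handled rather than partially acknowledged.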

❌ Actual Behavior

A single empty-content observation makes /sync/mutations/push return an opaque HTTP 500; the entire per-project batch is rejected, cloud_mutations for the project stays 0, and autosync stalls into degraded/backoff with no auto-recovery until the row is removed by hand.

Operating System

macOS

Engram Version

main @ 36c0819, client built from source (1.16.4-dev)

Agent / Client

Claude Code

📋 Relevant Logs

status 500: insert mutations: cloudstore: canonicalize materialized mutation batch chunk: mutations[N]: observation payload content is required for upsert
# sync_state.lifecycle -> degraded; cloud_mutations for the project stays 0 until the empty row is excluded by hand.

💡 Additional Context

Real-world trigger: a session_summary observation with empty content, created before the empty-content guard added to mem_session_summary in #424 (commit 2087c5b, internal/mcp/mcp.go:1853-1854). #424 prevents new empties but there's no server-side tolerance/repair for historical pre-#424 rows (or third-party clients). A single such row blocks the whole project — reproduced live on a real project copy.

Suggested fix (any of, ideally (a)+(b) or (c)): (a) validate observation content at the push gate and return a repairable 400 (parity with chunk-push); (b) quarantine/dead-letter the offending entry and accept the rest of the batch; (c) provide a repair that skips/back-fills historical empty-content rows.
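As a rough illustration of option (c), a repair pass could back-fill historical empty rows before they are pushed. The observation type, the placeholder text, and backfillEmpty below are hypothetical; whether a placeholder is acceptable content, as opposed to skipping the row, is a design decision this sketch does not settle.

```go
package main

import "fmt"

// observation is a hypothetical, simplified local row.
type observation struct {
	ID      int64
	Type    string
	Content string
}

// placeholder is an illustrative marker, not a value defined by engram.
const placeholder = "(empty content; back-filled by repair)"

// backfillEmpty replaces empty content in place and returns the IDs of the
// rows it changed, so the repair can be logged or reviewed afterwards.
func backfillEmpty(rows []observation) []int64 {
	var touched []int64
	for i := range rows {
		if rows[i].Content == "" {
			rows[i].Content = placeholder
			touched = append(touched, rows[i].ID)
		}
	}
	return touched
}

func main() {
	rows := []observation{
		{ID: 10, Type: "session_summary", Content: ""},
		{ID: 11, Type: "note", Content: "kept as-is"},
	}
	fmt.Println("repaired IDs:", backfillEmpty(rows))
}
```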

Severity: medium — full per-project sync outage with no auto-recovery when triggered; trigger is guarded for new rows via #424 but historical/third-party rows remain.
