I have searched existing issues and this is not a duplicate
I understand this issue needs status:approved before a PR can be opened
📝 Bug Description
POST /sync/mutations/push accepts an empty-content observation upsert at the validation gate but then fatally rejects it during chunk canonicalization, surfacing as an opaque HTTP 500 that fails the entire per-project batch.
The push gate is lenient: validateLegacyPayload is a no-op for observation/session/prompt (internal/cloud/cloudserver/mutations.go:340-353), so an empty-content observation passes. The same batch is then canonicalized, where empty content is a hard error, "observation payload content is required for upsert" (internal/cloud/chunkcodec/chunkcodec.go:412), and that error reaches the client as a bare http.Error(..., StatusInternalServerError) (mutations.go:204). The failure occurs before any transaction is opened (materializedMutationBatchChunks, cloudstore.go:734, ahead of BeginTx at :739). Because the client is all-or-nothing (transport.go:347-350; autosync refuses partial acks, autosync/manager.go:532-534), a single poison row blocks the whole per-project batch, and autosync stalls into degraded/backoff (MaxConsecutiveFailures=10, manager.go:144) until the row is removed by hand.
The defect (precise)
Not batch atomicity — that's by design (BW3). The defect is: (1) validation-vs-canonicalization inconsistency (the gate accepts what canonicalization rejects); (2) an opaque 500 instead of an actionable error; (3) no push-side dead-letter/quarantine/repair.
Asymmetry with chunk-push
The chunk-push path handles the identical empty-content case with a repairable 400: validateImportableChunkPayload → writeActionableError(w, StatusBadRequest, ...Repairable, ...PayloadInvalid, ...) (internal/cloud/cloudserver/cloudserver.go:464). So the two ingest paths disagree on the same bad input: chunk-push → repairable 400; mutation-push → opaque 500.
🔄 Steps to Reproduce
Self-hosted engram-cloud from main; enroll project P.
Make sure an observation in P has empty content (e.g. a pre-#424 session_summary).
Trigger an autosync push (ENGRAM_CLOUD_AUTOSYNC=1 within engram serve/engram mcp).
Observe: HTTP 500 with the canonicalize error; sync_state goes degraded; cloud_mutations for P stays 0 (whole batch blocked).
Remove/repair the empty row → next push succeeds.
Contrast: the same row via chunk-push yields a repairable 400, not an opaque 500.
✅ Expected Behavior
Either the push gate validates observation content up front and returns an actionable, repairable 400 (parity with chunk-push), or the offending row is quarantined/skipped so the rest of the batch syncs — not an opaque 500 that stalls the whole project's cloud sync.
❌ Actual Behavior
A single empty-content observation makes /sync/mutations/push return an opaque HTTP 500; the entire per-project batch is rejected, cloud_mutations for the project stays 0, and autosync stalls into degraded/backoff with no auto-recovery until the row is removed by hand.
Operating System
macOS
Engram Version
main @ 36c0819, client built from source (1.16.4-dev)
Agent / Client
Claude Code
📋 Relevant Logs
status 500: insert mutations: cloudstore: canonicalize materialized mutation batch chunk: mutations[N]: observation payload content is required for upsert
# sync_state.lifecycle -> degraded; cloud_mutations for the project stays 0 until the empty row is excluded by hand.
💡 Additional Context
Real-world trigger: a session_summary observation with empty content, created before the empty-content guard was added to mem_session_summary in #424 (commit 2087c5b, internal/mcp/mcp.go:1853-1854). #424 prevents new empties, but there is no server-side tolerance or repair for historical pre-#424 rows (or for third-party clients). A single such row blocks the whole project; this was reproduced live on a real project copy.
Suggested fix (any of, ideally (a)+(b) or (c)): (a) validate observation content at the push gate and return a repairable 400 (parity with chunk-push); (b) quarantine/dead-letter the offending entry and accept the rest of the batch; (c) provide a repair that skips/back-fills historical empty-content rows.
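Option (b) could be sketched roughly as below: split the batch into entries that can proceed and entries that are set aside with a reason. Mutation, Quarantined, and partition are hypothetical names for illustration only. How quarantined entries would be acknowledged to a client that refuses partial acks is left open here and would need its own design.

```go
package main

import "fmt"

// Mutation is a hypothetical, simplified stand-in for a pushed mutation.
type Mutation struct {
	ID      int
	Entity  string
	Op      string
	Content string
}

// Quarantined pairs a rejected mutation with the reason it was set aside.
type Quarantined struct {
	Mutation Mutation
	Reason   string
}

// partition keeps valid mutations in order and diverts empty-content
// observation upserts to a quarantine list instead of failing the batch.
func partition(batch []Mutation) (accepted []Mutation, quarantined []Quarantined) {
	for _, m := range batch {
		if m.Entity == "observation" && m.Op == "upsert" && m.Content == "" {
			quarantined = append(quarantined, Quarantined{
				Mutation: m,
				Reason:   "observation payload content is required for upsert",
			})
			continue
		}
		accepted = append(accepted, m)
	}
	return accepted, quarantined
}

func main() {
	batch := []Mutation{
		{ID: 1, Entity: "observation", Op: "upsert", Content: "notes"},
		{ID: 2, Entity: "observation", Op: "upsert", Content: ""},
		{ID: 3, Entity: "observation", Op: "delete"},
	}
	accepted, quarantined := partition(batch)
	fmt.Println("accepted:", len(accepted), "quarantined:", len(quarantined))
}
```

The same partition step could also serve option (c): a one-off repair pass would run it over historical rows and either skip or back-fill whatever lands in the quarantine list.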
Severity: medium — full per-project sync outage with no auto-recovery when triggered; trigger is guarded for new rows via #424 but historical/third-party rows remain.