Gateway returns stale list data on first request (read-after-write race) #14

Description

@nclsprsn

Symptom

tests/test_e2e.py::test_agent_full_lifecycle is flaky: the first request through agentgateway (port 8080) after an idle period sometimes lists agents missing the row that was just POSTed and 201-ed.

Repro (20 iterations through gateway):

  • Run 0: POST 201 → GET /v1/agents/ returns total=20, items=20, new agent missing.
  • Runs 1–19: all consistent.

Direct registry (port 8081) shows no miss in 5 runs.
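
The repro loop above could be sketched roughly as follows. This is a minimal illustration, not the actual test code: `find_missed_runs`, `create_agent`, and `list_agent_ids` are hypothetical names, and the two callables are assumed to wrap the POST and the GET /v1/agents/ call against whichever port is being tested (8080 for the gateway, 8081 for the registry).

```python
from typing import Callable, Collection, List


def find_missed_runs(
    create_agent: Callable[[], object],
    list_agent_ids: Callable[[], Collection],
    iterations: int = 20,
) -> List[int]:
    """Return the run indices where a just-created agent was missing from the list.

    create_agent   -- hypothetical callable: POSTs an agent, returns its ID
    list_agent_ids -- hypothetical callable: GETs the list, returns the IDs in it
    """
    missed = []
    for run in range(iterations):
        agent_id = create_agent()           # write (expected to have returned 201)
        if agent_id not in list_agent_ids():  # immediate read-after-write
            missed.append(run)
    return missed
```

With the behavior reported here, a call through the gateway would be expected to return `[0]`, and a call against the registry an empty list.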

Notes

  • Pre-existing, not introduced by audit-wave commits (6a9fa09, 23280c1, e6599a4).
  • DB always contains the row by the time the failing GET completes.
  • Behavior suggests gateway-side response caching or connection cold-start serving a snapshot from before the POST committed.

Suspected root cause

agentgateway may have a short response cache for GETs that gets populated on the first read, and the first POST→GET race produces a cache entry from a transient snapshot. Confirm by inspecting infra/agentgateway/*.yaml for any caching directives.
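
One rough way to do that inspection is a plain text scan of the YAML files. The sketch below makes no assumption about agentgateway's actual config schema: the keyword pattern (`cach`, `ttl`, `max-age`) is a guess at what a caching directive might contain, and `find_cache_hints` is a hypothetical helper name.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Assumed keywords only; the real directive names (if any) are unknown.
CACHE_HINT = re.compile(r"cach|ttl|max[-_]?age", re.IGNORECASE)


def find_cache_hints(config_dir: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, stripped line) for every suspicious line."""
    hits = []
    for path in sorted(Path(config_dir).glob("*.yaml")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if CACHE_HINT.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for name, lineno, text in find_cache_hints("infra/agentgateway"):
        print(f"{name}:{lineno}: {text}")
```

An empty result would not rule out caching (it could be a built-in default), so this only narrows the search.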

Workaround

Tests can retry GET once when the just-created ID is missing, but the underlying behavior should be addressed in the gateway config.
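
A minimal sketch of that retry, assuming the test already has some callable that performs the GET and returns the listed IDs (`list_agent_ids` and `agent_listed_with_retry` are hypothetical names, not existing helpers in the test suite):

```python
import time
from typing import Callable, Collection


def agent_listed_with_retry(
    list_agent_ids: Callable[[], Collection],
    agent_id: object,
    retries: int = 1,
    delay_s: float = 0.0,
) -> bool:
    """Check the list for agent_id, re-fetching up to `retries` extra times."""
    for attempt in range(retries + 1):
        if agent_id in list_agent_ids():
            return True
        if attempt < retries:
            time.sleep(delay_s)  # optional pause before the next GET
    return False
```

Keeping `retries` at 1 matches the single stale response seen in the repro, so a longer-lived staleness regression would still fail the test.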

Severity

P2: only affects e2e flow tests run after a restart; production users would see stale /v1/agents/ data on roughly the first request after each deploy.
