Symptom
tests/test_e2e.py::test_agent_full_lifecycle is flaky: the first request through agentgateway (port 8080) after an idle period sometimes lists agents missing the row that was just POSTed and 201-ed.
Repro (20 iterations through gateway):
- Run 0: POST 201 → GET
/v1/agents/ returns total=20, items=20, new agent missing.
- Runs 1–19: all consistent.
Direct registry (port 8081) shows no miss in 5 runs.
Notes
- Pre-existing, not introduced by audit-wave commits (6a9fa09, 23280c1, e6599a4).
- DB always contains the row by the time the failing GET completes.
- Behavior suggests gateway-side response caching or connection cold-start serving a snapshot from before the POST committed.
Suspected root cause
agentgateway may have a short response cache for GETs that gets populated on the first read, and the first POST→GET race produces a cache entry from a transient snapshot. Confirm by inspecting infra/agentgateway/*.yaml for any caching directives.
Workaround
Tests can retry GET once when the just-created ID is missing, but the underlying behavior should be addressed in the gateway config.
Severity
P2 — only affects e2e flow tests run after restart; production users would observe stale /v1/agents/ data for ~1 stale request after deploys.
Symptom
tests/test_e2e.py::test_agent_full_lifecycleis flaky: the first request throughagentgateway(port 8080) after an idle period sometimes lists agents missing the row that was just POSTed and 201-ed.Repro (20 iterations through gateway):
/v1/agents/returnstotal=20, items=20, new agent missing.Direct registry (port 8081) shows no miss in 5 runs.
Notes
Suspected root cause
agentgatewaymay have a short response cache for GETs that gets populated on the first read, and the first POST→GET race produces a cache entry from a transient snapshot. Confirm by inspectinginfra/agentgateway/*.yamlfor any caching directives.Workaround
Tests can retry GET once when the just-created ID is missing, but the underlying behavior should be addressed in the gateway config.
Severity
P2 — only affects e2e flow tests run after restart; production users would observe stale /v1/agents/ data for ~1 stale request after deploys.