test: poll for read-after-write visibility in 4 racy tests #3

Merged
bravo1goingdark merged 2 commits into main from fix/flaky-read-after-write-tests
May 5, 2026

@bravo1goingdark
Owner

Why

Three `keplor-server::pipeline::tests` cases plus one `http_integration` case do POST → GET in the same task and assert the GET sees the just-written event:

  • `pipeline::tests::ingest_stores_and_retrieves`
  • `pipeline::tests::ingest_with_iso_timestamp`
  • `pipeline::tests::authenticated_key_overrides_client_api_key_id`
  • `http_integration::include_archived_with_no_archiver_falls_through_to_live_only`

They relied on calling `store.wal_checkpoint()` between the write and the read to make the new segment visible. With the sharded BatchWriter merged on main, the rotator task also runs `wal_checkpoint` via `spawn_blocking` on its own tick, and the two calls can race on the per-shard mutexes:

  1. Rotator enters `flush_all`, locks shard 0, rotates, releases.
  2. Test thread enters `flush_all`, locks shard 0, sees no data, returns paths=[].
  3. Test calls `store.get_event(id)` — index has no segment yet because the rotator's `add_segment` call runs after its `flush_all` returns, on the blocking pool, after the test thread has already moved on.

This was the cause of the failures we kept hitting on PR #2, even after rebasing.
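
For illustration only, the interleaving in steps 1 to 3 can be replayed deterministically with a toy model. This is a sketch, not the keplor code: `ToyStore`, `write`, and `replay_race` are hypothetical, and only the method names `flush_all`, `add_segment`, and `get_event` are borrowed from the description above. The assumption it encodes is that draining a shard and publishing the segment to the index are two separate steps.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Toy stand-in for one BatchWriter shard plus the segment index.
/// Hypothetical types; the real store is assumed to be far richer.
struct ToyStore {
    shard: Mutex<Vec<u64>>,             // buffered event ids, not yet rotated
    index: Mutex<HashMap<u64, String>>, // event id -> segment path
}

impl ToyStore {
    fn new() -> Self {
        ToyStore {
            shard: Mutex::new(Vec::new()),
            index: Mutex::new(HashMap::new()),
        }
    }

    fn write(&self, id: u64) {
        self.shard.lock().unwrap().push(id);
    }

    /// Drains the shard under its mutex and returns what was rotated out.
    /// The index is not touched here; that gap is what the race exploits.
    fn flush_all(&self) -> Vec<u64> {
        let mut shard = self.shard.lock().unwrap();
        std::mem::take(&mut *shard)
    }

    /// Publishes a rotated segment so reads can find its events.
    fn add_segment(&self, ids: &[u64], path: &str) {
        let mut index = self.index.lock().unwrap();
        for id in ids {
            index.insert(*id, path.to_string());
        }
    }

    fn get_event(&self, id: u64) -> Option<String> {
        self.index.lock().unwrap().get(&id).cloned()
    }
}

/// Replays the racy order on a single thread.
/// Returns (events seen by the test's flush, read before publish, read after).
fn replay_race() -> (usize, Option<String>, Option<String>) {
    let store = ToyStore::new();
    store.write(42);

    let rotated = store.flush_all();      // 1. rotator drains shard 0
    let test_flush = store.flush_all();   // 2. test's flush finds nothing
    let early = store.get_event(42);      // 3. test reads: index still empty
    store.add_segment(&rotated, "seg-0"); // rotator publishes too late
    let late = store.get_event(42);       // a later read would have succeeded

    (test_flush.len(), early, late)
}
```

In this model the test's own flush is a no-op and the early read misses, while a read issued after `add_segment` finds the event. That is why retrying the read, rather than flushing again, closes the window.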

Fix

Replace the unconditional 'checkpoint then read' with a polling helper. On the happy path the first poll returns; on the racy path we recheck every 20 ms for up to 2 s. No fixed sleep in the success case.
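
A minimal sketch of the polling idea, assuming a blocking context and the standard library only. The commit message names the real helper `wait_for_event()`; its actual signature is not shown in this PR, and it is presumably async, so `poll_until` and its parameters below are hypothetical.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` until it yields `Some`, sleeping `interval` between
/// attempts and giving up once `timeout` has elapsed. The first attempt
/// runs immediately, so an already-visible value costs no sleep.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None; // caller decides how to fail the test
        }
        sleep(interval);
    }
}
```

With the values from this PR, a test would call something like `poll_until(Duration::from_secs(2), Duration::from_millis(20), || store.get_event(id))` and assert that the result is `Some`. In an async test the same loop would use the runtime's sleep instead of `std::thread::sleep` so the rotator task is not starved.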

Verified

`cargo test -p keplor-store -p keplor-server` → 116 / 116 pass on this branch.

Three pipeline tests + one http_integration test do POST → GET in the
same task and assert the GET sees the just-written event. They all
relied on a synchronous 'wal_checkpoint after write' to make the new
segment visible — but the rotator that runs alongside the BatchWriter
also calls wal_checkpoint via spawn_blocking, and the two can interleave
on per-shard mutexes such that the rotator wins flush_all (data drained)
while the test's call sees nothing to flush. The rotator's add_segment
then runs on the blocking pool *after* the test's get_event already
returned None, hence the flake.

Replace the unconditional 'checkpoint then read' with a polling helper
wait_for_event() that retries get_event for up to 2 s. On the happy
path it succeeds on the first or second poll (≤ 20 ms); on the slow
path it survives the rotator+test race window without a fixed sleep.

Same approach in the http_integration test, but polling /v1/events
instead.

116/116 tests pass.
@bravo1goingdark bravo1goingdark merged commit 14c2db2 into main May 5, 2026
5 checks passed
@bravo1goingdark bravo1goingdark deleted the fix/flaky-read-after-write-tests branch May 5, 2026 17:40