test: poll for read-after-write visibility in 4 racy tests by bravo1goingdark · Pull Request #3 · bravo1goingdark/keplor

bravo1goingdark · 2026-05-05T17:34:36Z

Why

Three `keplor-server::pipeline::tests` cases plus one `http_integration` case do POST → GET in the same task and assert the GET sees the just-written event:

`pipeline::tests::ingest_stores_and_retrieves`
`pipeline::tests::ingest_with_iso_timestamp`
`pipeline::tests::authenticated_key_overrides_client_api_key_id`
`http_integration::include_archived_with_no_archiver_falls_through_to_live_only`

They relied on `store.wal_checkpoint()` between the write and the read. With the sharded BatchWriter merged on main, the rotator task also runs `wal_checkpoint` via `spawn_blocking` on its own tick — and the two can race on per-shard mutexes:

1. Rotator enters `flush_all`, locks shard 0, rotates, releases.
2. Test thread enters `flush_all`, locks shard 0, sees no data, returns paths=[].
3. Test calls `store.get_event(id)`: the index has no segment yet because the rotator's `add_segment` call runs after its `flush_all` returns, on the blocking pool, after the test thread has already moved on.
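The losing interleaving can be replayed deterministically in a toy model. This is only a sketch: the `Store` type, its fields, and the single shard below are hypothetical stand-ins, not keplor's actual types, and the concurrent steps are run sequentially in the order described above.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical stand-in for one BatchWriter shard plus the segment index.
struct Store {
    shard: Mutex<Vec<u64>>,     // buffered event ids, not yet rotated
    index: Mutex<HashSet<u64>>, // event ids visible to readers
}

impl Store {
    fn new() -> Self {
        Store {
            shard: Mutex::new(Vec::new()),
            index: Mutex::new(HashSet::new()),
        }
    }

    /// Buffers an event in the shard (the write half of POST).
    fn write(&self, id: u64) {
        self.shard.lock().unwrap().push(id);
    }

    /// Drains the shard under its lock and returns the drained "segment".
    /// A second caller finds the shard empty and gets an empty Vec.
    fn flush_all(&self) -> Vec<u64> {
        std::mem::take(&mut *self.shard.lock().unwrap())
    }

    /// Publishes a drained segment so readers can see its events.
    fn add_segment(&self, segment: Vec<u64>) {
        self.index.lock().unwrap().extend(segment);
    }

    /// Looks an event up in the published index only.
    fn get_event(&self, id: u64) -> Option<u64> {
        self.index.lock().unwrap().get(&id).copied()
    }
}

/// Replays the racy ordering; returns what the reader sees before and
/// after the rotator publishes its segment.
fn replay_race() -> (Option<u64>, Option<u64>) {
    let store = Store::new();
    store.write(7);

    let rotator_segment = store.flush_all(); // 1. rotator drains shard 0
    let test_segment = store.flush_all();    // 2. test finds nothing to flush
    assert!(test_segment.is_empty());

    let before = store.get_event(7);         // 3. test reads too early
    store.add_segment(rotator_segment);      // rotator publishes late
    let after = store.get_event(7);
    (before, after)
}
```

In this model `replay_race()` returns `(None, Some(7))`: the event is never lost, it is only invisible during the window between the rotator's drain and its publish.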

This was the cause of the failures we kept hitting on PR #2 even on rebase.

Fix

Replace the unconditional 'checkpoint then read' with a polling helper. On the happy path the first poll returns; on the racy path we recheck every 20 ms for up to 2 s. No fixed sleep in the success case.
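A minimal sketch of such a polling helper follows. It is not the code from this PR: the name `wait_for`, its signature, and the use of a blocking `std::thread::sleep` are assumptions for illustration (an async test would more likely await a timer instead). The probe is tried once before any sleep, so the happy path pays no delay.

```rust
use std::time::{Duration, Instant};

/// Polls `probe` until it yields `Some`, or until `timeout` has elapsed.
/// The probe always runs at least once before the first sleep.
fn wait_for<T>(
    mut probe: impl FnMut() -> Option<T>,
    interval: Duration,
    timeout: Duration,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Happy path: the value is already visible, return immediately.
        if let Some(value) = probe() {
            return Some(value);
        }
        // Give up once the deadline has passed.
        if Instant::now() >= deadline {
            return None;
        }
        // Racy path: wait one interval, then recheck.
        std::thread::sleep(interval);
    }
}
```

With the values from the description, a test would call something like `wait_for(|| store.get_event(id), Duration::from_millis(20), Duration::from_secs(2))` and assert that the result is `Some`; the exact return type of `get_event` is an assumption here.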

Verified

`cargo test -p keplor-store -p keplor-server` → 116 / 116 pass on this branch.

Three pipeline tests + one http_integration test do POST → GET in the same task and assert the GET sees the just-written event. They all relied on a synchronous 'wal_checkpoint after write' to make the new segment visible, but the rotator that runs alongside the BatchWriter also calls wal_checkpoint via spawn_blocking, and the two can interleave on per-shard mutexes such that the rotator wins flush_all (data drained) while the test's call sees nothing to flush. The rotator's add_segment then runs on the blocking pool *after* the test's get_event already returned None, hence the flake.

Replace the unconditional 'checkpoint then read' with a polling helper wait_for_event() that retries get_event for up to 2 s. On the happy path it succeeds on the first or second poll (≤ 20 ms); on the slow path it survives the rotator+test race window without a fixed sleep. Same approach in the http_integration test, but polling /v1/events instead.

116/116 tests pass.

bravo1goingdark added 2 commits May 5, 2026 23:04

test: cargo fmt

dd9f441

bravo1goingdark merged commit 14c2db2 into main May 5, 2026
5 checks passed

bravo1goingdark deleted the fix/flaky-read-after-write-tests branch May 5, 2026 17:40

bravo1goingdark merged 2 commits into main from fix/flaky-read-after-write-tests