test: poll for read-after-write visibility in 4 racy tests #3
Merged
Three pipeline tests + one http_integration test do POST → GET in the same task and assert the GET sees the just-written event. They all relied on a synchronous 'wal_checkpoint after write' to make the new segment visible, but the rotator that runs alongside the BatchWriter also calls wal_checkpoint via spawn_blocking, and the two can interleave on per-shard mutexes such that the rotator wins flush_all (data drained) while the test's call sees nothing to flush. The rotator's add_segment then runs on the blocking pool *after* the test's get_event already returned None, hence the flake.

Replace the unconditional 'checkpoint then read' with a polling helper wait_for_event() that retries get_event for up to 2 s. On the happy path it succeeds on the first or second poll (≤ 20 ms); on the slow path it survives the rotator+test race window without a fixed sleep. Same approach in the http_integration test, but polling /v1/events instead.

116/116 tests pass.
Why
Three `keplor-server::pipeline::tests` cases plus one `http_integration` case do POST → GET in the same task and assert the GET sees the just-written event.
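They relied on `store.wal_checkpoint()` between the write and the read. With the sharded BatchWriter merged on main, the rotator task also runs `wal_checkpoint` via `spawn_blocking` on its own tick, and the two can race on per-shard mutexes: the rotator can win `flush_all` and drain the data, so the test's own call finds nothing to flush. The rotator's `add_segment` then runs on the blocking pool after the test's `get_event` has already returned `None`.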
This was the cause of the failures we kept hitting on PR #2 even on rebase.
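The losing interleaving can be replayed deterministically in a toy model. This is only an illustration, not keplor's code: the `Store` struct, its two fields, `write`, and `replay_race` are hypothetical, and only the names `flush_all`, `add_segment`, and `get_event` are taken from the description above.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Toy stand-in for the store (hypothetical layout): a buffer of
/// not-yet-flushed events plus the events that reads can already see.
struct Store {
    pending: Mutex<Vec<(u64, String)>>,
    visible: Mutex<HashMap<u64, String>>,
}

impl Store {
    fn new() -> Self {
        Store {
            pending: Mutex::new(Vec::new()),
            visible: Mutex::new(HashMap::new()),
        }
    }

    /// Buffer an event; it is not readable yet.
    fn write(&self, id: u64, body: &str) {
        self.pending.lock().unwrap().push((id, body.to_string()));
    }

    /// Drain the buffer. Whoever calls this first owns the data.
    fn flush_all(&self) -> Vec<(u64, String)> {
        std::mem::take(&mut *self.pending.lock().unwrap())
    }

    /// Publish a drained batch so that reads can see it.
    fn add_segment(&self, batch: Vec<(u64, String)>) {
        self.visible.lock().unwrap().extend(batch);
    }

    fn get_event(&self, id: u64) -> Option<String> {
        self.visible.lock().unwrap().get(&id).cloned()
    }
}

/// Replays the racy ordering step by step on a single thread and returns
/// the read results (before the rotator publishes, after it publishes).
fn replay_race() -> (Option<String>, Option<String>) {
    let store = Store::new();
    store.write(1, "event");

    // The rotator's checkpoint wins the flush and drains the buffer.
    let rotator_batch = store.flush_all();

    // The test's own checkpoint now finds nothing to flush or publish.
    let test_batch = store.flush_all();
    store.add_segment(test_batch);
    let early = store.get_event(1);

    // Only afterwards does the rotator publish its segment.
    store.add_segment(rotator_batch);
    let late = store.get_event(1);

    (early, late)
}
```

In this model the first read misses and a later read succeeds, which is why retrying the read is enough and no extra checkpoint is needed.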
Fix
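Replace the unconditional 'checkpoint then read' with a polling helper, `wait_for_event()`, that retries `get_event`. On the happy path the first or second poll returns (≤ 20 ms); on the racy path we recheck every 20 ms for up to 2 s. No fixed sleep in the success case. The `http_integration` test takes the same approach but polls `/v1/events` instead.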
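The polling pattern can be sketched as follows. This is a minimal, std-only approximation and not the helper from this PR: `poll_until` and its parameters are hypothetical names, and it blocks the thread with `std::thread::sleep`, whereas a helper used inside async tests would presumably await a timer instead.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Call `probe` until it yields `Some`, sleeping `interval` between
/// attempts, and give up once `timeout` has elapsed.
/// The first attempt runs immediately, so a hit costs no sleep at all.
fn poll_until<T>(
    mut probe: impl FnMut() -> Option<T>,
    interval: Duration,
    timeout: Duration,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}
```

With the values from the description, a call would pass `Duration::from_millis(20)` as the interval and `Duration::from_secs(2)` as the timeout, with a closure wrapping the read (for example `get_event`) as the probe. A `None` result then means the event never became visible within the window, which the test can turn into a failed assertion.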
Verified
`cargo test -p keplor-store -p keplor-server` → 116 / 116 pass on this branch.