perf(BatchWriter): shard the append path; +122% durable rps by bravo1goingdark · Pull Request #1 · bravo1goingdark/keplor

bravo1goingdark · 2026-05-05T16:04:29Z

Summary

Splits the BatchWriter's single funnel into N parallel append loops + 1 periodic rotator. Each append loop calls append_batch_durable independently — keplordb's internal round-robin spreads the calls across separate WAL shards, so N concurrent flushes fsync N different files with no lock contention. The rotator runs once per flush_interval and rotates WAL → segments for read visibility, replacing the per-batch wal_checkpoint that previously serialised every flush behind a tier-global lock-loop.
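
The shape described above can be sketched with standard-library threads and channels. This is a minimal illustration under stated assumptions, not the code in keplor-store/src/batch.rs: `WalShard`, `flush`, `append_loop`, `rotate`, and `run_sharded` are hypothetical names, an in-memory `Vec` stands in for a WAL file and its fsync, and rotation runs once at the end rather than on a `flush_interval` timer.

```rust
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one WAL shard: holds events that were
/// "durably appended" but not yet rotated into a readable segment.
type WalShard = Arc<Mutex<Vec<u64>>>;

/// Stand-in for a durable append: move the buffered events into the shard.
fn flush(shard: &WalShard, buf: &mut Vec<u64>) {
    if !buf.is_empty() {
        shard.lock().unwrap().append(buf); // `append` leaves `buf` empty
    }
}

/// One append loop: owns its receiver, flushes on threshold or on interval.
fn append_loop(rx: Receiver<u64>, shard: WalShard, threshold: usize, interval: Duration) {
    let mut buf = Vec::with_capacity(threshold);
    let mut deadline = Instant::now() + interval;
    loop {
        let wait = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(wait) {
            Ok(event) => {
                buf.push(event);
                if buf.len() < threshold {
                    continue; // keep filling until threshold or deadline
                }
            }
            Err(RecvTimeoutError::Timeout) => {} // interval tick: flush a partial buffer
            Err(RecvTimeoutError::Disconnected) => {
                flush(&shard, &mut buf); // drain on shutdown
                return;
            }
        }
        flush(&shard, &mut buf);
        deadline = Instant::now() + interval;
    }
}

/// Rotator step: move everything from every WAL shard into the readable list.
fn rotate(shards: &[WalShard], segments: &mut Vec<u64>) {
    for shard in shards {
        let mut wal = shard.lock().unwrap();
        segments.append(&mut wal);
    }
}

/// Feed `events` through `n` append loops (n >= 1) and return what is
/// readable after a final rotation, sorted for easy comparison.
fn run_sharded(events: Vec<u64>, n: usize, threshold: usize) -> Vec<u64> {
    let interval = Duration::from_millis(5);
    let shards: Vec<WalShard> = (0..n).map(|_| WalShard::default()).collect();
    let mut senders = Vec::new();
    let mut workers = Vec::new();
    for shard in &shards {
        let (tx, rx) = sync_channel(1024); // one bounded channel per append loop
        let shard = Arc::clone(shard);
        senders.push(tx);
        workers.push(thread::spawn(move || append_loop(rx, shard, threshold, interval)));
    }
    for (i, event) in events.into_iter().enumerate() {
        senders[i % n].send(event).unwrap(); // round-robin shard selection
    }
    drop(senders); // closing the channels tells every loop to drain and exit
    for worker in workers {
        worker.join().unwrap();
    }
    let mut segments = Vec::new();
    rotate(&shards, &mut segments);
    segments.sort_unstable();
    segments
}
```

Because each loop only ever locks its own shard, flushes on different shards never wait on each other; only the rotator touches all of them. Calling `run_sharded((0..100).collect(), 4, 8)` should return every event exactly once.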

Measured (4-core/8-thread Intel i5-1135G7, NVMe ext4, no PGO, single keplor process)

Metric	Stock (single-funnel)	Sharded (this PR)	Δ
Fire-and-forget zero-error rps	~41 K	~55 K	+34%
Durable peak rps (c=512)	20.2 K	44.9 K	+122%
Durable p99 (c=512)	33.6 ms	21.4 ms	−36%
Batch endpoint zero-error events/s	25.6 K	51.2 K	+100%

Changes

keplor-store/src/batch.rs: rewrite. N append loops, one rotator, round-robin shard selector, per-shard batch threshold scaled as batch_size / flush_shards (floored at 8) so the fill-vs-interval trade-off matches the single-shard baseline.
keplor-server/src/config.rs: new pipeline.flush_shards knob, default 4, range 1–64. Validated at startup.
keplor-cli/src/main.rs: threads the new field through.
keplor.example.toml: ships wal_shard_count = 8 + flush_shards = 8 + channel_capacity = 262144 + batch_size = 256 as recommended defaults; previous values left as inline comments for reference.
site/: docs/configuration + docs/integration document flush_shards; docs/benchmarks gets a new "HTTP tier" section with the legacy-vs-sharded comparison; sidebar anchor added.
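
The threshold scaling and the startup range check listed above can be illustrated as follows. This is a sketch based only on the numbers in this description (floor of 8, range 1–64); the function names, the constant name, and the `String` error type are assumptions, not the signatures in batch.rs or config.rs.

```rust
/// Floor applied to the per-shard threshold, as stated in the PR description.
const MIN_SHARD_BATCH: usize = 8;

/// Per-shard batch threshold: `batch_size / flush_shards`, floored at 8.
/// Callers are assumed to have validated `flush_shards >= 1` first.
fn per_shard_batch_size(batch_size: usize, flush_shards: usize) -> usize {
    (batch_size / flush_shards).max(MIN_SHARD_BATCH)
}

/// Startup check for `pipeline.flush_shards` (range 1..=64 per the PR).
/// The error type and message wording are illustrative assumptions.
fn validate_flush_shards(flush_shards: usize) -> Result<usize, String> {
    if (1..=64).contains(&flush_shards) {
        Ok(flush_shards)
    } else {
        Err(format!(
            "pipeline.flush_shards must be in 1..=64, got {flush_shards}"
        ))
    }
}
```

With the example config (`batch_size = 256`, `flush_shards = 8`) each shard flushes at 32 events; with `batch_size = 64` and 16 shards the raw quotient of 4 is raised to the floor of 8.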

Investigated, not shipped (notes for future work)

Per-shard flush_interval scaling. Helps low-c durable p50 in theory; in practice races with synchronous KdbStore::wal_checkpoint callers — the timing window opens when shards tick faster than the rotator. Two pipeline tests broke on it. Worth revisiting if low-c durable p50 ever becomes binding.
Kick channel from append loops to rotator. Same race class. Reverted. Read-visibility falls back to the rotator's interval, which matches v1's contract.

Tests

117 / 117 pass (keplor-store + keplor-server).
New regression guard: pick_shard_distributes_round_robin asserts every shard receives ≥ N/2 picks per N·shards calls.
Existing concurrency tests (concurrent_writes_8_tasks, graceful_shutdown_drains_events, ingest_batch, backpressure_returns_503) all pass.
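
For readers unfamiliar with the regression guard, the property it checks can be sketched like this. `ShardPicker` and `pick_histogram` are hypothetical; the real `pick_shard` in batch.rs may be structured differently, and an atomic counter is only one plausible way to implement round-robin selection.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector shared by all producers.
struct ShardPicker {
    next: AtomicUsize,
    shards: usize,
}

impl ShardPicker {
    fn new(shards: usize) -> Self {
        assert!(shards > 0, "need at least one shard");
        Self { next: AtomicUsize::new(0), shards }
    }

    /// Returns the next shard index; the atomic counter makes this
    /// callable from many threads without a lock.
    fn pick_shard(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.shards
    }
}

/// Counts how many times each shard is picked over `calls` invocations.
fn pick_histogram(picker: &ShardPicker, calls: usize) -> Vec<usize> {
    let mut counts = vec![0; picker.shards];
    for _ in 0..calls {
        counts[picker.pick_shard()] += 1;
    }
    counts
}
```

A guard in the spirit of pick_shard_distributes_round_robin would call `pick_histogram(&picker, n * shards)` and assert that every entry is at least `n / 2`. The bound is deliberately loose so the test still holds if the selector is later changed to something less strictly even.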

Test plan

cargo test -p keplor-store -p keplor-server
cargo build --release -p keplor-cli
Site builds clean (svelte-check 0 errors)
Manual oha load test on commodity hardware reproduces the table above
Soak under production-like load before merging

- site/docs/benchmarks: criterion numbers from keplordb engine bench: writes (572K ev/s single-thread, 1.2M concurrent), queries (1.3-22 G elem/s SIMD scans), rollups, WAL, compaction (1450x post-merge). Reproducible methodology + caveats on engine vs HTTP layer.
- site/+page.svelte: add storage-layer ceiling caption under Numbers with link to /docs/benchmarks. Existing 4 service-level cards unchanged.
- DocsSidebar: nav entry for Benchmarks with section anchors.
- keplor.example.toml: bump wal_shard_count 4→8, batch_size 64→256, channel_capacity 32K→256K.

Measured on 4-core/8-thread:
- fire-and-forget zero-error: 41K → 50K rps (+23%)
- durable peak: 20.2K → 34.4K rps, p99 33.6 → 21.8 ms (-35%)

Splits the BatchWriter's single funnel into N parallel append loops + 1 periodic rotator. Each append loop owns its own bounded mpsc channel and calls append_batch_durable independently; keplordb's internal round-robin spreads them across separate WAL shards, so N concurrent flushes fsync N different files with no lock contention. The rotator runs once per flush_interval and rotates WAL → segments for read visibility, replacing the per-batch wal_checkpoint that previously serialised every flush behind a tier-global lock-loop.

Per-shard batch_size scales as batch_size / flush_shards (floored at 8) so the fill-vs-interval trade-off matches the single-shard baseline. Without this, low-mid concurrency durable workloads collapsed to flush_interval p50 because each shard's 1/N share of producers never filled the global threshold.

Measured on 4-core/8-thread Intel i5-1135G7, NVMe ext4, no PGO:
- Fire-and-forget zero-error: 41K → 55K rps (+34%)
- Durable peak (c=512): 20.2K → 44.9K rps (+122%), p99 33.6 → 21.4 ms (-36%)
- Batch endpoint zero-error: 25.6K → 51.2K ev/s (+100%)

New config knob: pipeline.flush_shards (default 4, range 1-64). Tune up to but not above storage.wal_shard_count.

116/116 tests pass (keplor-store + keplor-server, including concurrent_writes_8_tasks, graceful_shutdown_drains_events, ingest_batch, backpressure_returns_503).

Two follow-ups to the sharded BatchWriter:

1. Per-shard batch_size scales as batch_size / flush_shards (floored at 8). Without this, each shard sees 1/N of the producers and rarely fills the global threshold; partial buffers fall through to the interval tick and durable p50 collapses to flush_interval. With scaling, c=256 durable now hits 32K rps p99 17ms (was 22.6K p99 39ms unsharded).
2. example.toml gets flush_shards = 8 to match the wal_shard_count = 8 already set; otherwise new deployments copying the example get half the configured fanout.

Plus: regression test guarding pick_shard's round-robin distribution.

Investigated but rejected this round:
- Per-shard flush_interval scaling: races with synchronous callers of KdbStore::wal_checkpoint (the rotator and a sync caller can both enter flush_all on the same shard; whoever loses the lock returns empty, and add_segment ordering becomes observable). Behaviourally fine in production where ingest paths don't synchronously rotate, but breaks two pipeline tests that exercise read-after-write inside a single tokio task. Worth revisiting if low-c durable p50 becomes a real bottleneck.
- Kick channel from append loops to rotator: same race class as above. Rotator-on-interval is the simpler invariant; ≤ flush_interval read-visibility was already the contract.

117/117 tests pass.

- configuration: add flush_shards row to [pipeline] table, bump example wal_shard_count from 4 to 8 with paired note. Sample TOML in [pipeline] now shows flush_shards.
- integration: same flush_shards line in the inline config sample.
- benchmarks: new "HTTP tier (sharded BatchWriter)" section with the measured single-funnel vs sharded comparison (+34% F&F, +122% durable, +100% batch). DocsSidebar gets the matching anchor.

bravo1goingdark added 5 commits May 5, 2026 20:22

fix(clippy): move test mod after non-test items in batch.rs

0fbab51

clippy::items_after_test_module flagged the in-flight #[cfg(test)] mod sitting above append_only and rotator_loop. Moving the test block to end-of-file.

bravo1goingdark merged commit 78713b0 into main May 5, 2026
5 checks passed

bravo1goingdark merged 5 commits into main from perf/sharded-batchwriter
