perf(BatchWriter): shard the append path; +122% durable rps #1
Merged
- site/docs/benchmarks: criterion numbers from keplordb engine bench —
writes (572K ev/s single-thread, 1.2M concurrent), queries (1.3-22 G
elem/s SIMD scans), rollups, WAL, compaction (1450x post-merge).
Reproducible methodology + caveats on engine vs HTTP layer.
- site/+page.svelte: add storage-layer ceiling caption under Numbers
with link to /docs/benchmarks. Existing 4 service-level cards
unchanged.
- DocsSidebar: nav entry for Benchmarks with section anchors.
- keplor.example.toml: bump wal_shard_count 4→8, batch_size 64→256,
channel_capacity 32K→256K. Measured on 4-core/8-thread:
  - fire-and-forget zero-error: 41K → 50K rps (+23%)
  - durable peak: 20.2K → 34.4K rps, p99 33.6 → 21.8 ms (-35%)
Splits the BatchWriter's single funnel into N parallel append loops + 1
periodic rotator. Each append loop owns its own bounded mpsc channel and
calls append_batch_durable independently — keplordb's internal
round-robin spreads them across separate WAL shards, so N concurrent
flushes fsync N different files with no lock contention. The rotator
runs once per flush_interval and rotates WAL → segments for read
visibility, replacing the per-batch wal_checkpoint that previously
serialised every flush behind a tier-global lock-loop.
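The layout described above can be sketched with standard-library threads and bounded channels. This is only an illustration under stated assumptions, not the code in keplor-store/src/batch.rs: the real writer runs on tokio and calls append_batch_durable, whereas here each "WAL shard" and the "segments" store are plain in-memory vectors, and the names append_loop, rotate and run_sharded are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for one WAL shard: an in-memory vector of events.
type Wal = Arc<Mutex<Vec<u64>>>;

// One append loop: owns its channel and flushes on threshold, idle timeout, or shutdown.
fn append_loop(rx: Receiver<u64>, wal: Wal, threshold: usize, interval: Duration) {
    let mut buf: Vec<u64> = Vec::with_capacity(threshold);
    loop {
        match rx.recv_timeout(interval) {
            Ok(ev) => {
                buf.push(ev);
                if buf.len() < threshold {
                    continue; // keep filling the batch
                }
            }
            Err(RecvTimeoutError::Timeout) => {} // idle: flush whatever is buffered
            Err(RecvTimeoutError::Disconnected) => {
                wal.lock().unwrap().append(&mut buf); // drain on shutdown
                return;
            }
        }
        // Stand-in for append_batch_durable (the real call would fsync this shard's file).
        wal.lock().unwrap().append(&mut buf);
    }
}

// Rotator step: move everything written so far from the WAL shards into readable segments.
fn rotate(wals: &[Wal], segments: &Mutex<Vec<u64>>) {
    for wal in wals {
        let mut drained = std::mem::take(&mut *wal.lock().unwrap());
        segments.lock().unwrap().append(&mut drained);
    }
}

// Run `n` append loops plus one rotator, feed `events` round-robin,
// and return how many events ended up read-visible.
fn run_sharded(n: usize, threshold: usize, events: u64) -> usize {
    let interval = Duration::from_millis(5);
    let wals: Vec<Wal> = (0..n).map(|_| Arc::new(Mutex::new(Vec::new()))).collect();
    let segments: Arc<Mutex<Vec<u64>>> = Arc::new(Mutex::new(Vec::new()));
    let stop = Arc::new(AtomicBool::new(false));

    let mut senders = Vec::new();
    let mut loops = Vec::new();
    for wal in &wals {
        let (tx, rx) = sync_channel::<u64>(64); // bounded: a full channel blocks the producer
        let wal = Arc::clone(wal);
        senders.push(tx);
        loops.push(thread::spawn(move || append_loop(rx, wal, threshold, interval)));
    }
    let rotator = {
        let (wals, segments, stop) = (wals.clone(), Arc::clone(&segments), Arc::clone(&stop));
        thread::spawn(move || {
            while !stop.load(Ordering::Acquire) {
                thread::sleep(interval);
                rotate(&wals, &segments);
            }
        })
    };

    for ev in 0..events {
        senders[(ev as usize) % n].send(ev).unwrap();
    }
    drop(senders); // closing the channels lets each append loop drain and exit
    for handle in loops {
        handle.join().unwrap();
    }
    stop.store(true, Ordering::Release);
    rotator.join().unwrap();
    rotate(&wals, &segments); // final rotation so nothing stays WAL-only
    let visible = segments.lock().unwrap().len();
    visible
}
```

Calling run_sharded(4, 8, 1000) should report all 1000 events as visible once the loops have drained. The point of the sketch is the ownership split: append loops never touch the segments, and only the rotator (plus the final drain) moves data into them.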
Per-shard batch_size scales as batch_size / flush_shards (floored at 8)
so the fill-vs-interval trade-off matches the single-shard baseline.
Without this, low-mid concurrency durable workloads collapsed to
flush_interval p50 because each shard's 1/N share of producers never
filled the global threshold.
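The scaling rule above amounts to one line of integer arithmetic. A minimal sketch, assuming integer division and a hypothetical helper name and constant (the actual code may be structured differently):

```rust
// Floor value taken from the commit message; the constant name is hypothetical.
const MIN_SHARD_BATCH: usize = 8;

// Per-shard flush threshold: batch_size / flush_shards, floored at 8.
fn per_shard_batch_size(batch_size: usize, flush_shards: usize) -> usize {
    // `.max(1)` only guards this sketch against division by zero;
    // the commit message says flush_shards is restricted to 1-64.
    (batch_size / flush_shards.max(1)).max(MIN_SHARD_BATCH)
}
```

With the example.toml values, per_shard_batch_size(256, 8) gives 32, so each of the 8 shards flushes after 32 events instead of waiting for 256. At the other extreme, per_shard_batch_size(64, 64) would be 1 without the floor and is raised to 8.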
Measured on 4-core/8-thread Intel i5-1135G7, NVMe ext4, no PGO:
Fire-and-forget zero-error: 41K → 55K rps (+34%)
Durable peak (c=512): 20.2K → 44.9K rps (+122% rps)
p99 33.6 → 21.4 ms (-36%)
Batch endpoint zero-error: 25.6K → 51.2K ev/s (+100%)
New config knob: pipeline.flush_shards (default 4, range 1-64). Tune up
to but not above storage.wal_shard_count.
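The tuning advice above could be checked with something like the following. This is a sketch only: the enum and function names are hypothetical, and whether the real startup validation treats flush_shards above storage.wal_shard_count as an error, a warning, or nothing at all is an assumption here; only the 1-64 range comes from the commit message.

```rust
// Outcome of checking pipeline.flush_shards (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum ShardCheck {
    Ok,
    // Within 1..=64 but larger than storage.wal_shard_count (advisory only).
    AboveWalShards,
    // Outside the documented 1..=64 range.
    OutOfRange,
}

fn check_flush_shards(flush_shards: usize, wal_shard_count: usize) -> ShardCheck {
    if !(1..=64).contains(&flush_shards) {
        ShardCheck::OutOfRange
    } else if flush_shards > wal_shard_count {
        ShardCheck::AboveWalShards
    } else {
        ShardCheck::Ok
    }
}
```

For example, check_flush_shards(8, 8) is Ok, while check_flush_shards(16, 8) is flagged as AboveWalShards.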
116/116 tests pass (keplor-store + keplor-server, including
concurrent_writes_8_tasks, graceful_shutdown_drains_events, ingest_batch,
backpressure_returns_503).
Two follow-ups to the sharded BatchWriter:

1. Per-shard batch_size scales as batch_size / flush_shards (floored at 8). Without this, each shard sees 1/N of the producers and rarely fills the global threshold; partial buffers fall through to the interval tick and durable p50 collapses to flush_interval. With scaling, c=256 durable now hits 32K rps p99 17ms (was 22.6K p99 39ms unsharded).
2. example.toml gets flush_shards = 8 to match the wal_shard_count = 8 already set; otherwise new deployments copying the example get half the configured fanout.

Plus: regression test guarding pick_shard's round-robin distribution.

Investigated but rejected this round:
- Per-shard flush_interval scaling: races with synchronous callers of KdbStore::wal_checkpoint (the rotator and a sync caller can both enter flush_all on the same shard; whoever loses the lock returns empty, and add_segment ordering becomes observable). Behaviourally fine in production where ingest paths don't synchronously rotate, but breaks two pipeline tests that exercise read-after-write inside a single tokio task. Worth revisiting if low-c durable p50 becomes a real bottleneck.
- Kick channel from append loops to rotator: same race class as above. Rotator-on-interval is the simpler invariant; ≤ flush_interval read-visibility was already the contract.

117/117 tests pass.
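For readers who want to picture the round-robin regression test mentioned above, a rough sketch follows. It assumes the selector is an atomic counter taken modulo the shard count; the ShardPicker type and pick_counts helper are hypothetical and only the name pick_shard comes from this PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical round-robin selector: a shared counter taken modulo the shard count.
struct ShardPicker {
    next: AtomicUsize,
    shards: usize,
}

impl ShardPicker {
    fn new(shards: usize) -> Self {
        assert!(shards > 0, "need at least one shard");
        ShardPicker { next: AtomicUsize::new(0), shards }
    }

    // Returns the index of the shard that should receive the next event.
    fn pick_shard(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.shards
    }
}

// Call pick_shard `per_shard * shards` times and count the picks per shard.
fn pick_counts(shards: usize, per_shard: usize) -> Vec<usize> {
    let picker = ShardPicker::new(shards);
    let mut counts = vec![0usize; shards];
    for _ in 0..shards * per_shard {
        counts[picker.pick_shard()] += 1;
    }
    counts
}
```

A single-threaded run such as pick_counts(4, 100) yields exactly 100 picks per shard. A looser bound of at least half the even share per shard leaves room for interleaving when several producers pick concurrently.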
- configuration: add flush_shards row to [pipeline] table, bump example wal_shard_count from 4 to 8 with paired note. Sample TOML in [pipeline] now shows flush_shards.
- integration: same flush_shards line in the inline config sample.
- benchmarks: new "HTTP tier (sharded BatchWriter)" section with the measured single-funnel vs sharded comparison (+34% F&F, +122% durable, +100% batch). DocsSidebar gets the matching anchor.
clippy::items_after_test_module flagged the in-flight #[cfg(test)] mod sitting above append_only and rotator_loop. Moving the test block to end-of-file.
Summary
Splits the BatchWriter's single funnel into N parallel append loops + 1 periodic rotator. Each append loop calls append_batch_durable independently; keplordb's internal round-robin spreads the calls across separate WAL shards, so N concurrent flushes fsync N different files with no lock contention. The rotator runs once per flush_interval and rotates WAL → segments for read visibility, replacing the per-batch wal_checkpoint that previously serialised every flush behind a tier-global lock-loop.
Measured (4-core/8-thread Intel i5-1135G7, NVMe ext4, no PGO, single keplor process)
Changes
- keplor-store/src/batch.rs: rewrite. N append loops, one rotator, round-robin shard selector, per-shard batch threshold scaled as batch_size / flush_shards (floored at 8) so the fill-vs-interval trade-off matches the single-shard baseline.
- keplor-server/src/config.rs: new pipeline.flush_shards knob, default 4, range 1–64. Validated at startup.
- keplor-cli/src/main.rs: threads the new field through.
- keplor.example.toml: ships wal_shard_count = 8 + flush_shards = 8 + channel_capacity = 262144 + batch_size = 256 as recommended defaults; previous values left as inline comments for reference.
- site/: docs/configuration + docs/integration document flush_shards; docs/benchmarks gets a new "HTTP tier" section with the legacy-vs-sharded comparison; sidebar anchor added.
Investigated, not shipped (notes for future work)
- flush_interval scaling. Helps low-c durable p50 in theory; in practice races with synchronous KdbStore::wal_checkpoint callers; the timing window opens when shards tick faster than the rotator. Two pipeline tests broke on it. Worth revisiting if low-c durable p50 ever becomes binding.
Tests
- pick_shard_distributes_round_robin asserts every shard receives ≥ N/2 picks per N·shards calls.
- concurrent_writes_8_tasks, graceful_shutdown_drains_events, ingest_batch, backpressure_returns_503 all pass.
Test plan