
perf(BatchWriter): shard the append path; +122% durable rps #1

Merged
bravo1goingdark merged 5 commits into main from perf/sharded-batchwriter
May 5, 2026

@bravo1goingdark
Owner

Summary

Splits the BatchWriter's single funnel into N parallel append loops + 1 periodic rotator. Each append loop calls append_batch_durable independently — keplordb's internal round-robin spreads the calls across separate WAL shards, so N concurrent flushes fsync N different files with no lock contention. The rotator runs once per flush_interval and rotates WAL → segments for read visibility, replacing the per-batch wal_checkpoint that previously serialised every flush behind a tier-global lock-loop.
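
To make the shape of that split concrete, here is a minimal, dependency-free sketch. It is not the code in keplor-store/src/batch.rs: it uses std threads and std::sync::mpsc instead of an async runtime, and `FakeStore`, `append_loop`, and `run_demo` are hypothetical names. The fake store only counts calls where the real one would fsync a WAL shard or rotate WAL to segments.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError, SyncSender};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the store: counts calls instead of touching disk.
#[derive(Default)]
pub struct FakeStore {
    pub appended: AtomicUsize,
    pub rotations: AtomicUsize,
}

impl FakeStore {
    /// Stand-in for append_batch_durable (the real call fsyncs one WAL shard).
    fn append_batch_durable(&self, batch: &[u64]) {
        self.appended.fetch_add(batch.len(), Ordering::SeqCst);
    }
    /// Stand-in for the WAL -> segments rotation.
    fn rotate(&self) {
        self.rotations.fetch_add(1, Ordering::SeqCst);
    }
}

/// One append loop: flush when the buffer reaches `threshold` or when
/// `interval` has passed since the last flush; drain when the channel closes.
fn append_loop(rx: Receiver<u64>, store: Arc<FakeStore>, threshold: usize, interval: Duration) {
    let mut buf = Vec::with_capacity(threshold);
    let mut deadline = Instant::now() + interval;
    loop {
        let wait = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(wait) {
            Ok(ev) => {
                buf.push(ev);
                if buf.len() < threshold {
                    continue; // keep filling until the threshold or the deadline
                }
            }
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => {
                if !buf.is_empty() {
                    store.append_batch_durable(&buf);
                }
                return;
            }
        }
        if !buf.is_empty() {
            store.append_batch_durable(&buf);
            buf.clear();
        }
        deadline = Instant::now() + interval;
    }
}

/// Spawns `shards` append loops plus one rotator, feeds `events` round-robin,
/// then shuts down and returns (events appended, rotations performed).
/// Assumes `shards > 0`.
pub fn run_demo(shards: usize, events: u64) -> (usize, usize) {
    let store = Arc::new(FakeStore::default());
    let interval = Duration::from_millis(5);
    let mut senders: Vec<SyncSender<u64>> = Vec::new();
    let mut workers = Vec::new();
    for _ in 0..shards {
        let (tx, rx) = sync_channel(1024); // bounded channel, one per loop
        senders.push(tx);
        let store = Arc::clone(&store);
        workers.push(thread::spawn(move || append_loop(rx, store, 8, interval)));
    }
    // Rotator: ticks once per interval, independent of the append loops.
    let (stop_tx, stop_rx) = sync_channel::<()>(1);
    let rot_store = Arc::clone(&store);
    let rotator = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            Err(RecvTimeoutError::Timeout) => rot_store.rotate(),
            _ => {
                rot_store.rotate(); // final rotation on shutdown
                return;
            }
        }
    });
    for i in 0..events {
        senders[(i as usize) % shards].send(i).unwrap();
    }
    drop(senders); // closing the channels makes each loop drain and exit
    for w in workers {
        w.join().unwrap();
    }
    drop(stop_tx);
    rotator.join().unwrap();
    (store.appended.load(Ordering::SeqCst), store.rotations.load(Ordering::SeqCst))
}
```

Calling `run_demo(4, 1000)` should report all 1000 events appended and at least one rotation. The point of the sketch is that no append loop waits on another loop or on the rotator.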

Measured (4-core/8-thread Intel i5-1135G7, NVMe ext4, no PGO, single keplor process)

| Metric | Stock (single-funnel) | Sharded (this PR) | Δ |
|---|---|---|---|
| Fire-and-forget zero-error rps | ~41 K | ~55 K | +34% |
| Durable peak rps (c=512) | 20.2 K | 44.9 K | +122% |
| Durable p99 (c=512) | 33.6 ms | 21.4 ms | −36% |
| Batch endpoint zero-error events/s | 25.6 K | 51.2 K | +100% |
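
For anyone re-running the load test, the Δ column is a plain relative change. A small helper to recompute it from fresh numbers (`pct_change` is a hypothetical name, not part of the repo):

```rust
/// Relative change from `before` to `after`, in percent.
pub fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the rounded table values, `pct_change(20.2, 44.9)` rounds to 122 and `pct_change(33.6, 21.4)` rounds to −36. The fire-and-forget row uses approximate inputs (~41 K, ~55 K), so its delta is approximate too.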

Changes

  • keplor-store/src/batch.rs: rewrite. N append loops, one rotator, round-robin shard selector, per-shard batch threshold scaled as batch_size / flush_shards (floored at 8) so the fill-vs-interval trade-off matches the single-shard baseline.
  • keplor-server/src/config.rs: new pipeline.flush_shards knob, default 4, range 1–64. Validated at startup.
  • keplor-cli/src/main.rs: threads the new field through.
  • keplor.example.toml: ships wal_shard_count = 8, flush_shards = 8, channel_capacity = 262144, and batch_size = 256 as recommended defaults; previous values are left as inline comments for reference.
  • site/: docs/configuration + docs/integration document flush_shards; docs/benchmarks gets a new "HTTP tier" section with the legacy-vs-sharded comparison; sidebar anchor added.
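
The per-shard threshold rule from the batch.rs bullet can be written out as a one-liner. This is a sketch that assumes integer division; `per_shard_threshold` and `MIN_SHARD_BATCH` are hypothetical names, and only the formula and the floor of 8 come from this PR.

```rust
/// Floor stated in the PR description.
const MIN_SHARD_BATCH: usize = 8;

/// Per-shard flush threshold: batch_size / flush_shards, floored at 8.
pub fn per_shard_threshold(batch_size: usize, flush_shards: usize) -> usize {
    // flush_shards is validated to 1..=64 at startup; `.max(1)` only guards
    // this standalone sketch against division by zero.
    (batch_size / flush_shards.max(1)).max(MIN_SHARD_BATCH)
}
```

With the example.toml values (batch_size = 256, flush_shards = 8) each shard flushes at 32 events. With batch_size = 64 and flush_shards = 64 the raw quotient is 1, so the floor of 8 applies.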

Investigated, not shipped (notes for future work)

  • Per-shard flush_interval scaling. Helps low-c durable p50 in theory; in practice races with synchronous KdbStore::wal_checkpoint callers — the timing window opens when shards tick faster than the rotator. Two pipeline tests broke on it. Worth revisiting if low-c durable p50 ever becomes binding.
  • Kick channel from append loops to rotator. Same race class. Reverted. Read-visibility falls back to the rotator's interval, which matches v1's contract.

Tests

  • 117 / 117 pass (keplor-store + keplor-server).
  • New regression guard: pick_shard_distributes_round_robin asserts every shard receives ≥ N/2 picks per N·shards calls.
  • Existing concurrency tests (concurrent_writes_8_tasks, graceful_shutdown_drains_events, ingest_batch, backpressure_returns_503) all pass.
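
For readers who have not opened the diff, the invariant behind pick_shard_distributes_round_robin can be illustrated as follows. This is an assumption-laden sketch, not the test in the repo: `ShardPicker` and `min_picks` are hypothetical, and the selector is assumed to be an atomic counter taken modulo the shard count.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector: an atomic counter modulo the shard count.
pub struct ShardPicker {
    next: AtomicUsize,
    shards: usize,
}

impl ShardPicker {
    pub fn new(shards: usize) -> Self {
        assert!(shards > 0, "need at least one shard");
        Self { next: AtomicUsize::new(0), shards }
    }

    pub fn pick_shard(&self) -> usize {
        // fetch_add wraps on overflow, so the counter never panics.
        self.next.fetch_add(1, Ordering::Relaxed) % self.shards
    }
}

/// Makes n * shards picks and returns the smallest per-shard count.
pub fn min_picks(shards: usize, n: usize) -> usize {
    let picker = ShardPicker::new(shards);
    let mut counts = vec![0usize; shards];
    for _ in 0..n * shards {
        counts[picker.pick_shard()] += 1;
    }
    counts.into_iter().min().unwrap() // non-empty because shards > 0
}
```

Single-threaded, every shard gets exactly N picks per N·shards calls. The guard's looser bound of ≥ N/2 presumably leaves headroom for interleaving under concurrency.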

Test plan

  • cargo test -p keplor-store -p keplor-server
  • cargo build --release -p keplor-cli
  • Site builds clean (svelte-check 0 errors)
  • Manual oha load test on commodity hardware reproduces the table above
  • Soak under production-like load before merging

- site/docs/benchmarks: criterion numbers from keplordb engine bench —
  writes (572K ev/s single-thread, 1.2M concurrent), queries (1.3-22 G
  elem/s SIMD scans), rollups, WAL, compaction (1450x post-merge).
  Reproducible methodology + caveats on engine vs HTTP layer.
- site/+page.svelte: add storage-layer ceiling caption under Numbers
  with link to /docs/benchmarks. Existing 4 service-level cards
  unchanged.
- DocsSidebar: nav entry for Benchmarks with section anchors.
- keplor.example.toml: bump wal_shard_count 4→8, batch_size 64→256,
  channel_capacity 32K→256K. Measured on 4-core/8-thread:
    - fire-and-forget zero-error: 41K → 50K rps (+23%)
    - durable peak: 20.2K → 34.4K rps, p99 33.6 → 21.8 ms (-35%)
Splits the BatchWriter's single funnel into N parallel append loops + 1
periodic rotator. Each append loop owns its own bounded mpsc channel and
calls append_batch_durable independently — keplordb's internal
round-robin spreads them across separate WAL shards, so N concurrent
flushes fsync N different files with no lock contention. The rotator
runs once per flush_interval and rotates WAL → segments for read
visibility, replacing the per-batch wal_checkpoint that previously
serialised every flush behind a tier-global lock-loop.

Per-shard batch_size scales as batch_size / flush_shards (floored at 8)
so the fill-vs-interval trade-off matches the single-shard baseline.
Without this, low-mid concurrency durable workloads collapsed to
flush_interval p50 because each shard's 1/N share of producers never
filled the global threshold.

Measured on 4-core/8-thread Intel i5-1135G7, NVMe ext4, no PGO:

  Fire-and-forget zero-error:  41K → 55K rps    (+34%)
  Durable peak (c=512):        20.2K → 44.9K rps  (+122% rps)
                               p99 33.6 → 21.4 ms (-36%)
  Batch endpoint zero-error:   25.6K → 51.2K ev/s (+100%)

New config knob: pipeline.flush_shards (default 4, range 1-64). Tune up
to but not above storage.wal_shard_count.
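
A sketch of what the startup check for this knob might look like. The 1-64 range is from this PR. Whether keplor rejects or merely tolerates flush_shards > wal_shard_count is not stated here, so the sketch treats that case as an advisory warning. `check_flush_shards` and the constants are hypothetical names.

```rust
pub const FLUSH_SHARDS_MIN: usize = 1;
pub const FLUSH_SHARDS_MAX: usize = 64;

/// Err: value outside the documented range.
/// Ok(Some(msg)): in range, but above wal_shard_count (advisory only; assumption).
/// Ok(None): in range and no larger than wal_shard_count.
pub fn check_flush_shards(
    flush_shards: usize,
    wal_shard_count: usize,
) -> Result<Option<String>, String> {
    if !(FLUSH_SHARDS_MIN..=FLUSH_SHARDS_MAX).contains(&flush_shards) {
        return Err(format!(
            "pipeline.flush_shards = {} is outside {}..={}",
            flush_shards, FLUSH_SHARDS_MIN, FLUSH_SHARDS_MAX
        ));
    }
    if flush_shards > wal_shard_count {
        return Ok(Some(format!(
            "pipeline.flush_shards = {} exceeds storage.wal_shard_count = {}",
            flush_shards, wal_shard_count
        )));
    }
    Ok(None)
}
```

With the example.toml pairing (8 and 8) this returns `Ok(None)`.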

116/116 tests pass (keplor-store + keplor-server, including
concurrent_writes_8_tasks, graceful_shutdown_drains_events, ingest_batch,
backpressure_returns_503).
Two follow-ups to the sharded BatchWriter:

1. Per-shard batch_size scales as batch_size / flush_shards (floored at 8).
   Without this, each shard sees 1/N of the producers and rarely fills
   the global threshold — partial buffers fall through to the interval
   tick and durable p50 collapses to flush_interval. With scaling, c=256
   durable now hits 32K rps at p99 17 ms (was 22.6K rps at p99 39 ms unsharded).

2. example.toml gets flush_shards = 8 to match the wal_shard_count = 8
   already set; otherwise new deployments copying the example get half
   the configured fanout.

Plus: regression test guarding pick_shard's round-robin distribution.

Investigated but rejected this round:
- Per-shard flush_interval scaling: races with synchronous callers of
  KdbStore::wal_checkpoint (the rotator and a sync caller can both
  enter flush_all on the same shard; whoever loses the lock returns
  empty, and add_segment ordering becomes observable). Behaviourally
  fine in production where ingest paths don't synchronously rotate,
  but breaks two pipeline tests that exercise read-after-write inside
  a single tokio task. Worth revisiting if low-c durable p50 becomes
  a real bottleneck.
- Kick channel from append loops to rotator: same race class as above.
  Rotator-on-interval is the simpler invariant; ≤ flush_interval
  read-visibility was already the contract.

117/117 tests pass.
- configuration: add flush_shards row to [pipeline] table, bump
  example wal_shard_count from 4 to 8 with paired note. Sample TOML
  in [pipeline] now shows flush_shards.
- integration: same flush_shards line in the inline config sample.
- benchmarks: new "HTTP tier (sharded BatchWriter)" section with the
  measured single-funnel vs sharded comparison (+34% F&F, +122%
  durable, +100% batch). DocsSidebar gets the matching anchor.
clippy::items_after_test_module flagged the in-flight #[cfg(test)] mod
sitting above append_only and rotator_loop. Moving the test block to
end-of-file.
@bravo1goingdark bravo1goingdark merged commit 78713b0 into main May 5, 2026
5 checks passed