Last Updated: 2026-03-13
Platform: Windows 11 Pro, Desktop (NVMe SSD), 4 threads (benchmark), 8 threads (mixed workload)
Allocator: rpmalloc (release builds)
These baselines are from Justin's dev machine. Production hardware will differ. Numbers vary 2-3x with system load — always compare runs from the same session.
Source: docs/benchmarks/benchmark-comparison-loading-mode.md (Feb 21, 2026). Bound cache enabled.
| Query Type | p50 | p95 | Cache State | Notes |
|---|---|---|---|---|
| Sparse filter (userId Eq) | 0.041ms | — | warm | Essentially free at any scale |
| Dense filter (nsfwLevel Eq 1) | 7.84ms | — | warm | 1.7x slower than pre-ArcSwap baseline; likely session variance |
| Multi-value filter (tagIds In popular) | 7.82ms | — | warm | Similar to dense filter |
| Sort-only (reactionCount Desc) | 3.64ms | — | warm, bounds | 2.5x faster than no-bounds baseline (9.01ms) |
| Sort + filter (nsfwLevel=1, reactionCount Desc) | 1.68ms | — | warm, bounds | 4.6x faster than no-bounds (7.71ms). Common production pattern |
| Sort + filter (commentCount) | 1.87ms | — | warm, bounds | 3.8x faster than no-bounds |
| Sort + filter (id Asc) | 1.61ms | — | warm, bounds | 13.1x faster than no-bounds (21.13ms) |
| Range filter + sort (3 clause sort) | 6.08ms | — | warm, bounds | 1.7x faster than no-bounds (10.56ms) |
| Filter OR 3 tags | 26.20ms | — | warm | Worst filter-only query. Dense bitmap union |
| Prefix cache shared A | 5.90ms | — | warm | Trie cache prefix match |
Source: docs/benchmarks/benchmark-mixed-workload.md (Mar 11, 2026). Unified cache enabled.
| Query Type | Cold Miss | p50 Hit | p95 Hit | Notes |
|---|---|---|---|---|
| nsfwLevel=1, reactionCount desc | 250ms | 4.78ms | 5.88ms | Common gallery query |
| nsfwLevel=1 + type=image, reactionCount desc | 53ms | 12.09ms | 16.17ms | Two-clause filter + sort |
| nsfwLevel=1, reactionCount asc | 152ms | 3.90ms | 7.11ms | Reverse sort direction |
| nsfwLevel=1, sortAt desc | 15ms | 5.16ms | 6.89ms | Time-based sort |
| nsfwLevel=1 + type=image, sortAt desc | 24ms | 12.51ms | 16.04ms | Two-clause filter + time sort |
Source: tests/loadtest/workload.json (2,516 real Civitai traffic queries). Stable build: fat LTO, codegen-units=1. Unified cache warm.
| Concurrency | QPS | p50 | p95 | p99 | max |
|---|---|---|---|---|---|
| 1 | 8,530 | 0.10ms | 0.17ms | 0.20ms | 1.22ms |
| 4 | 25,343 | 0.14ms | 0.23ms | 0.34ms | 24.06ms |
| 8 | 46,915 | 0.16ms | 0.23ms | 0.29ms | 12.61ms |
| 16 | 63,562 | 0.23ms | 0.36ms | 0.47ms | 15.96ms |
| 32 | 71,415 | 0.42ms | 0.69ms | 0.89ms | 22.27ms |
| 64 | 82,104 | 0.73ms | 1.30ms | 1.63ms | 6.80ms |
| 128 | 77,430 | 1.58ms | 2.78ms | 3.46ms | 9.21ms |
Throughput peaks at c=64 (~82K QPS); c=128 drops back to ~77K QPS and only adds latency. Scaling is sublinear from the start: c=4 reaches 3.0x the c=1 throughput, c=8 reaches 5.5x, and c=32 only 8.4x.
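The scaling behaviour in the concurrency table can be checked directly from its QPS column. The following is a minimal sketch, not part of the benchmark harness: the `scaling` helper is a hypothetical name, and "efficiency" here simply means speedup divided by the concurrency level.

```rust
/// Speedup over the single-client run, and parallel efficiency
/// (speedup divided by the concurrency level).
fn scaling(base_qps: f64, concurrency: u32, qps: f64) -> (f64, f64) {
    let speedup = qps / base_qps;
    (speedup, speedup / f64::from(concurrency))
}

fn main() {
    // (concurrency, QPS) rows copied from the table above.
    let rows: [(u32, f64); 7] = [
        (1, 8_530.0),
        (4, 25_343.0),
        (8, 46_915.0),
        (16, 63_562.0),
        (32, 71_415.0),
        (64, 82_104.0),
        (128, 77_430.0),
    ];
    let base = rows[0].1;
    for &(c, qps) in rows.iter() {
        let (speedup, efficiency) = scaling(base, c, qps);
        println!(
            "c={:<3} speedup={:.2}x efficiency={:.0}%",
            c,
            speedup,
            efficiency * 100.0
        );
    }
}
```

Running the same calculation on a fresh loadtest shows where throughput flattens on that machine, which is more portable than comparing absolute QPS.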
Under concurrent load, memory bandwidth is the bottleneck. These numbers include filter resolution on every request (total_matched computation).
| Metric | p50 | p80 | p95 | p99 | max |
|---|---|---|---|---|---|
| Page-1 queries (server elapsed) | 259ms | 527ms | 821ms | 1.16s | 2.09s |
| Pagination page 2+ (server elapsed) | 466ms | 731ms | 1.02s | 1.36s | 2.01s |
| All queries (wall clock incl. HTTP) | 317ms | 599ms | 908ms | 1.21s | 2.09s |
The gap between single-worker cache hits (3-13ms p50) and the 8-worker p50 (259ms) comes almost entirely from concurrent bitmap operations saturating memory bandwidth, not from contention.
Source: docs/benchmarks/benchmark-report.md (Feb 19, 2026). No bound cache. Useful for regression comparison of filter-only queries since there was no session variance concern.
| Query Type | p50 | p95 | p99 |
|---|---|---|---|
| Sparse filter (userId Eq) | 0.034ms | 0.060ms | 0.069ms |
| Dense filter (nsfwLevel Eq 1) | 4.665ms | 5.934ms | 6.761ms |
| Multi-value filter (nsfwLevel In) | 4.749ms | 5.905ms | 6.616ms |
| Sort-only (reactionCount Desc) | 9.010ms | 12.545ms | 16.034ms |
| Sort + filter (nsfwLevel=1, reactionCount Desc) | 7.706ms | 10.542ms | 11.985ms |
| Worst case (filter_sort_id_asc) | 21.126ms | 27.918ms | 31.802ms |
| Filter OR 3 tags | 15.112ms | 21.640ms | 25.794ms |
Source: docs/benchmarks/benchmark-report.md and docs/benchmarks/benchmark-comparison-loading-mode.md.
| Scale | Bitmap Memory | RSS | Key Commit | Date |
|---|---|---|---|---|
| 5M | 328 MB | 1.20 GB | 763a008 | Feb 19 |
| 50M | 2.95 GB | 6.09 GB | 763a008 | Feb 19 |
| 100M | 6.19 GB | 11.66 GB | 763a008 | Feb 19 |
| 104.6M (no bounds) | 6.49 GB | 12.14 GB | 763a008 | Feb 19 |
| 104.6M (with bounds) | 6.51 GB | 14.51 GB | 6fb2b78 | Feb 21 |
| 150M (extrapolated) | ~9.3 GB | ~17.4 GB | — | — |
| Component | Size | % of Bitmap |
|---|---|---|
| Filter bitmaps | 5.63 GB | 86.7% |
| — tagIds | 4.48 GB | 79.6% of filter |
| — modelVersionIds | 738 MB | 13.1% of filter |
| — userId | 263 MB | 4.7% of filter |
| Sort bitmaps | 757 MB | 11.7% |
| Trie cache | 111 MB | 1.7% |
| Bound cache | 3.70 KB | negligible |
| Meta-index | 270 B | negligible |
Scaling is linear: ~62 bytes/record of bitmap memory. At scale, RSS stabilizes at ~1.9x bitmap memory (11.66 GB vs 6.19 GB at 100M), so roughly 47% of RSS is non-bitmap overhead (allocator + OS page cache).
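The 150M extrapolation in the scale table can be reproduced from two numbers. The following is a rough sketch under stated assumptions: GB means 10^9 bytes, the RSS-to-bitmap ratio is taken from the 104.6M no-bounds row (12.14 GB / 6.49 GB), and `estimate_gb` is a hypothetical helper rather than project code.

```rust
/// Bitmap bytes per record, from the linear-scaling figure in the text.
const BYTES_PER_RECORD: f64 = 62.0;
/// RSS / bitmap memory at 104.6M without bounds: 12.14 GB / 6.49 GB.
const RSS_TO_BITMAP: f64 = 1.87;

/// Returns (bitmap GB, RSS GB) for a record count, assuming decimal GB
/// and that both ratios keep holding at the target scale.
fn estimate_gb(records: f64) -> (f64, f64) {
    let bitmap_gb = records * BYTES_PER_RECORD / 1e9;
    (bitmap_gb, bitmap_gb * RSS_TO_BITMAP)
}

fn main() {
    for &records in [100e6, 104.6e6, 150e6].iter() {
        let (bitmap, rss) = estimate_gb(records);
        println!(
            "{:>6.1}M records: bitmap ~{:.2} GB, RSS ~{:.2} GB",
            records / 1e6,
            bitmap,
            rss
        );
    }
}
```

The with-bounds row has a higher RSS (14.51 GB), so this estimate is a lower bound for a deployment with the bound cache enabled.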
Source: docs/benchmarks/benchmark-report.md, docs/benchmarks/benchmark-comparison-loading-mode.md, CLAUDE.md MEMORY.
| Scale | Rate | Wall Time | Commit | Notes |
|---|---|---|---|---|
| 1M (ArcSwap, loading mode) | 70,153/s | 14.25s | 6fb2b78 | Was 82K/s on RwLock baseline |
| 5M (ArcSwap, loading mode) | 56,113/s | 89.11s | 6fb2b78 | 9% below RwLock baseline |
| 104.6M (ArcSwap, loading mode) | 28,316/s | ~70 min | 6fb2b78 | Degradation from growing bitmaps |
| 104.6M (pre-loading-mode, RwLock) | 35,325/s | ~49 min | 763a008 | Original baseline |
Source: CLAUDE.md MEMORY section (various commits, Jan-Feb 2026).
| Optimization Stage | Sustained Rate | Notes |
|---|---|---|
| Fused parse+bitmap (rayon) | 460K/s | Commit dfc977c |
| Direct JSON-to-msgpack encoding | 365K/s | Commit c10c57c, 105M in 5m29s at 320K/s sustained |
| Encode in parse fold | 345K/s | Commit 3702df7 |
| Parallel docstore writes | 290K/s | Commit 8e2137a, per-shard locking |
| put_bulk (benchmark harness) | 641K/s at 104M | Commits 1217f61-61e2032 |
The 641K/s figure is bitmap-path-only throughput (no docstore). The 320-365K/s figures cover the full pipeline, including docstore writes.
Source: docs/benchmarks/benchmark-mixed-workload.md.
| Operation | p50 (wall clock) | mean | Notes |
|---|---|---|---|
| Upsert | 43ms | 134ms | Includes HTTP round-trip. p95=492ms |
| Delete | 27ms | 50ms | Includes HTTP round-trip. p95=162ms |
Source: rebuild_bench --full runs on Justin's dev machine (Mar 13, 2026).
Rebuilds all bitmap indexes (18 filter + 5 sort fields) from the on-disk docstore using packed decode + channel-based merge (rayon workers → bounded channel → single merge thread).
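The worker → bounded channel → single merger shape described above can be sketched with the standard library alone. This is an illustration of the pattern, not the project's code: it uses plain threads instead of rayon, a `BTreeSet<u32>` as a stand-in for a compressed bitmap, merges on the calling thread, and sends one partial per worker. The `Doc`, `Partial`, and `build_index` names are all hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Partial index: field value -> set of document ids.
/// A BTreeSet stands in for a real compressed bitmap.
type Partial = BTreeMap<u32, BTreeSet<u32>>;

/// Hypothetical document: (doc id, value of one filter field).
type Doc = (u32, u32);

fn build_index(docs: Vec<Doc>, workers: usize, channel_cap: usize) -> Partial {
    let workers = workers.max(1);
    let (tx, rx) = sync_channel::<Partial>(channel_cap);
    let chunk_len = ((docs.len() + workers - 1) / workers).max(1);

    // Workers: each builds a partial index for its chunk and sends it.
    // send() blocks while the channel is full, which bounds how many
    // partials can be queued in memory at once.
    let handles: Vec<_> = docs
        .chunks(chunk_len)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            let tx = tx.clone();
            thread::spawn(move || {
                let mut partial = Partial::new();
                for (id, value) in chunk {
                    partial.entry(value).or_default().insert(id);
                }
                tx.send(partial).expect("merger hung up");
            })
        })
        .collect();
    drop(tx); // the channel closes once every worker has dropped its sender

    // Single merger: drain the channel and union the partials.
    let mut merged = Partial::new();
    for partial in rx {
        for (value, ids) in partial {
            merged.entry(value).or_default().extend(ids);
        }
    }
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    merged
}

fn main() {
    let docs: Vec<Doc> = (0..1_000u32).map(|id| (id, id % 3)).collect();
    let index = build_index(docs, 4, 2);
    for (value, ids) in &index {
        println!("value {}: {} docs", value, ids.len());
    }
}
```

The bounded channel is the piece that matters for peak RSS: when the merger falls behind, workers block on `send` instead of piling up partial results.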
| Phase | Time | Rate | Peak RSS | Notes |
|---|---|---|---|---|
| Build (read + merge) | 98-120s | 876K-1.1M docs/s | 20-21 GB | Varies with system load |
| Persist (save_and_unload) | 37-49s | — | +0-2 GB during write | Zero-copy via fused_cow() |
| Total (build + persist) | 149-159s (~2.5 min) | 662K-706K docs/s e2e | 20-22 GB peak | |
| Disk footprint | — | — | 8 GB | 7.2 GB filter + 866 MB sort + 15 MB system |
Usage:

    # Benchmark binary (measures each phase separately)
    cargo run --release --bin rebuild_bench -- --data-dir ./data --index civitai --full

    # Server with --rebuild flag (same pipeline, starts serving after)
    cargo run --release --features server --bin server -- --rebuild --port 3001 --data-dir ./data

Source: docs/benchmarks/benchmark-mixed-workload.md (Mar 11, 2026).
| Metric | Value |
|---|---|
| Cache hit rate | 94.8% |
| Cache entries | 210 (of 5,000 max) |
| Unique query fingerprints | 1,146 across 5,000 requests |
| Cache memory | 21.6 KB total |
| Memory per entry | ~103 bytes |
| Meta-index entries | 210 |
| Meta-index memory | 2.5 KB |
| Query | Cold Miss | Cache Hit p50 | Speedup |
|---|---|---|---|
| nsfwLevel=1, reactionCount desc | 250ms | 4.78ms | 52x |
| nsfwLevel=1 + type, reactionCount desc | 53ms | 12.09ms | 4.4x |
| nsfwLevel=1, reactionCount asc | 152ms | 3.90ms | 39x |
| nsfwLevel=1, sortAt desc | 15ms | 5.16ms | 2.9x |
| nsfwLevel=1 + type, sortAt desc | 24ms | 12.51ms | 1.9x |
Cold miss times vary widely (15-250ms) with filter selectivity and sort field. Cache-hit p50 stays within 3-13ms for single-worker HTTP round-trips.
| Scale | Trie Cache Size | Entries |
|---|---|---|
| 5M | 5.32 MB | 10 |
| 50M | 52.27 MB | 10 |
| 100M | 106.14 MB | 10 |
| 104.6M | 111.07 MB | 10 |
The old trie cache stored full bitmaps per entry (~11 MB/entry at 105M). The unified cache stores only bounded bitmaps (~103 bytes/entry), a >100,000x reduction.
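The per-entry figures and the reduction factor follow from the totals in the tables above. A small worked check, assuming decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes); `bytes_per_entry` is a hypothetical helper:

```rust
/// Average footprint of one cache entry.
fn bytes_per_entry(total_bytes: f64, entries: u32) -> f64 {
    total_bytes / f64::from(entries)
}

fn main() {
    // Old trie cache at 104.6M: 111.07 MB across 10 entries.
    let old = bytes_per_entry(111.07e6, 10);
    // Unified cache in the mixed workload: 21.6 KB across 210 entries.
    let new = bytes_per_entry(21.6e3, 210);

    println!("old trie cache: {:.2} MB/entry", old / 1e6);
    println!("unified cache:  {:.0} bytes/entry", new);
    println!("reduction:      {:.0}x", old / new);
}
```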
Source: docs/benchmarks/benchmark-comparison-loading-mode.md, bound cache cold/warm comparison.
| Query | No Bounds p50 | Warm Bounds p50 | Speedup |
|---|---|---|---|
| sort_reactionCount_desc | 9.01ms | 3.64ms | 2.5x |
| filter_nsfw1_sort_reactions | 7.71ms | 1.68ms | 4.6x |
| filter_tag_sort_reactions | 7.48ms | 2.14ms | 3.5x |
| filter_sort_commentCount | 7.03ms | 1.87ms | 3.8x |
| filter_sort_id_asc | 21.13ms | 1.61ms | 13.1x |
| filter_nsfw1_onSite_sort | 9.03ms | 4.96ms | 1.8x |
| filter_3_clauses_sort | 10.56ms | 6.08ms | 1.7x |
| Query | Cold p50 | Warm p50 | Speedup |
|---|---|---|---|
| all_sort_reactions | 10.05ms | 3.91ms | 2.6x |
| nsfw1_sort_reactions | 7.09ms | 1.52ms | 4.7x |
| nsfw1_onSite_sort_reactions | 9.58ms | 4.09ms | 2.3x |
| tag_sort_reactions | 7.32ms | 1.56ms | 4.7x |
| nsfw1_sort_commentCount | 8.54ms | 1.68ms | 5.1x |
| nsfw1_sort_id_asc | 22.15ms | 1.83ms | 12.1x |
Bound cache overhead: 6 bounds = 2.28 KB. Meta-index: 6 entries = 180 B. Negligible.
These are guidelines, not hard gates. Hardware, OS, background load, and session variance all affect numbers.
| Metric | Baseline | Regression Threshold | Notes |
|---|---|---|---|
| Sparse filter p50 (userId Eq) | 0.034-0.041ms | >0.1ms (>2x) | Should stay sub-100us at any scale |
| Dense filter p50 (nsfwLevel Eq) | 4.7-7.8ms | >15ms (>2x worst) | Session variance is real; compare same session |
| Sort + filter p50 (common case, bounds warm) | 1.7ms | >3.5ms (>2x) | Must keep bounds enabled |
| Worst sort p50 (filter_sort_id_asc, bounds warm) | 1.6ms | >3.2ms (>2x) | Was 21ms without bounds |
| Cache hit rate (mixed workload) | 94.8% | <90% | Hot-pool-driven; real traffic may differ |
| Single-worker cache hit p50 | 3-13ms | >25ms (>2x worst) | E2E HTTP including round-trip |
| 8-worker concurrent p50 | 259ms | >500ms | Memory-bandwidth-bound; hardware-dependent |
| Bitmap memory at 105M | 6.51 GB | >7.8 GB (+20%) | Linear scaling; watch tagIds growth |
| RSS at 105M | 14.51 GB | >17.4 GB (+20%) | Includes ArcSwap dual-snapshot overhead |
| Bulk load rate (5M) | 56K/s | <39K/s (-30%) | Loading mode enabled |
| Bulk load rate (104M) | 28K/s | <20K/s (-30%) | Degrades with bitmap size; expected |
| Fused pipeline rate | 320-365K/s sustained | <225K/s (-30%) | Full pipeline including docstore |
| Upsert under load p50 | 43ms | >100ms (>2x) | 8 concurrent workers |
| Delete under load p50 | 27ms | >60ms (>2x) | 8 concurrent workers |
| Rebuild from docstore (105M) | 149-159s total | >240s (+60%) | Build + persist; system-load-sensitive |
| Rebuild peak RSS (105M) | 20-22 GB | >30 GB (+40%) | Channel merge bounds memory |
- Same-session comparison is essential. Benchmark numbers vary 2-3x with system load. Never compare runs from different sessions as a regression signal.
- Filter-only regressions are hard to distinguish from session variance. The bound cache comparison showed apparent 1.5-1.7x filter slowdowns that were likely system load, not code regression.
- Sort query regressions are reliable because bound cache improvement (2-13x) overwhelms session noise.
- Memory regressions are reliable — bitmap memory is deterministic for the same dataset.
- If a threshold is exceeded, re-run on a quiet system and compare against a known-good commit in the same session before investigating.
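The guideline table and the notes above could be encoded as a simple check. The sketch below is one possible shape, not an existing tool in the repo: every type and function name is hypothetical, the thresholds are copied from the table, and the 620ms measurement is an invented example value. Exceeding a threshold only yields a "re-run" verdict, in line with the guidance that these are not hard gates.

```rust
/// Which direction counts as worse for a metric.
enum Direction {
    HigherIsWorse, // latency, memory
    LowerIsWorse,  // throughput, hit rate
}

/// One row of the guideline table.
struct Guideline {
    name: &'static str,
    threshold: f64,
    direction: Direction,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    WithinGuideline,
    /// Threshold exceeded: re-run on a quiet system against a known-good
    /// commit in the same session before treating it as a regression.
    RerunSameSession,
}

fn check(g: &Guideline, measured: f64) -> Verdict {
    let exceeded = match g.direction {
        Direction::HigherIsWorse => measured > g.threshold,
        Direction::LowerIsWorse => measured < g.threshold,
    };
    if exceeded {
        Verdict::RerunSameSession
    } else {
        Verdict::WithinGuideline
    }
}

fn main() {
    let rows = [
        (Guideline { name: "Dense filter p50 (ms)", threshold: 15.0, direction: Direction::HigherIsWorse }, 7.84),
        (Guideline { name: "Cache hit rate (%)", threshold: 90.0, direction: Direction::LowerIsWorse }, 94.8),
        (Guideline { name: "Bulk load rate 5M (rec/s)", threshold: 39_000.0, direction: Direction::LowerIsWorse }, 56_113.0),
        // Invented measurement, used only to show the exceeded path.
        (Guideline { name: "8-worker concurrent p50 (ms)", threshold: 500.0, direction: Direction::HigherIsWorse }, 620.0),
    ];
    for (g, measured) in rows.iter() {
        println!("{}: {} -> {:?}", g.name, measured, check(g, *measured));
    }
}
```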
| Document | Content | Date |
|---|---|---|
| docs/benchmarks/benchmark-report.md | 5M/50M/100M/104.6M scaling analysis, no bound cache | Feb 19, 2026 |
| docs/benchmarks/benchmark-comparison-loading-mode.md | 104.6M bound cache before/after, write perf | Feb 21, 2026 |
| docs/benchmarks/benchmark-mixed-workload.md | 105.3M mixed workload, unified cache, 8 workers | Mar 11, 2026 |
| CLAUDE.md | Memory tables, loading pipeline throughput | Ongoing |
| Loadtest (real traffic workload) | 105M HTTP throughput, c=1 to c=128, stable build | Mar 13, 2026 |