perf(encoding): precompute FastLanes transpose/iterate index tables#138

Merged
dfa1 merged 1 commit into main from perf/fastlanes-transpose-table
Jun 23, 2026

dfa1 commented Jun 23, 2026

Owner

What

FastLanes.transposeIndex and iterateIndex computed a % and a / plus an ORDER[] indirection for every element. In the delta transpose and (un)delta hot loops, that dependency chain (div → ORDER load → mul) serializes scatter address generation, which limits how many scatter misses can be in flight at once.

Replaced with permutation tables built once in a static initializer:

  • TRANSPOSE[CHUNK] for transposeIndex
  • ITERATE_BASE[64] for iterateIndex (lane added per call)

Public API unchanged.
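
A minimal sketch of what such precomputed tables could look like. The class name FastLanesTablesSketch, the ORDER values, and both index formulas are assumptions modeled on the published FastLanes layout for 64-bit lanes (1024-element chunk, 16 lanes, 64 rows); they are not copied from this PR's diff. The *Computed methods stand in for the old per-element arithmetic, and the accessors show the table-backed replacements with unchanged signatures.

```java
// Hypothetical stand-in for the PR's FastLanes class; names and formulas are assumptions.
final class FastLanesTablesSketch {
    static final int CHUNK = 1024; // elements per chunk (assumed)

    // Assumed FastLanes row-group ordering.
    private static final int[] ORDER = {0, 4, 2, 6, 1, 5, 3, 7};

    // Permutation tables, filled once when the class is initialized.
    private static final int[] TRANSPOSE = new int[CHUNK];
    private static final int[] ITERATE_BASE = new int[64];

    static {
        for (int i = 0; i < CHUNK; i++) {
            TRANSPOSE[i] = transposeIndexComputed(i);
        }
        for (int row = 0; row < ITERATE_BASE.length; row++) {
            // Base index for lane 0; the lane is added at call time.
            ITERATE_BASE[row] = iterateIndexComputed(row, 0);
        }
    }

    // Old style: div/mod, then a dependent ORDER[] load, then a multiply.
    static int transposeIndexComputed(int idx) {
        int lane = idx % 16;
        int order = (idx / 16) % 8;
        int row = idx / 128;
        return lane * 64 + ORDER[order] * 8 + row;
    }

    static int iterateIndexComputed(int row, int lane) {
        int o = row / 8;
        int s = row % 8;
        return ORDER[o] * 16 + s * 128 + lane;
    }

    // New style: a single table load whose address depends only on the argument.
    static int transposeIndex(int idx) {
        return TRANSPOSE[idx];
    }

    static int iterateIndex(int row, int lane) {
        return ITERATE_BASE[row] + lane;
    }
}
```

Because the tables are filled from the same formulas they replace, callers see identical results; only the way each index is produced changes.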

Why it's a real win (not just op-count)

Both loops are gather/scatter permutations (or carry a serial dependency on prev), so they don't auto-vectorize, and C2 already strength-reduces the power-of-two division, so the saving is not simply fewer arithmetic ops. The measured gain is memory-level parallelism: the destination indices are the same, so memory traffic is identical, and faster address generation simply keeps more outstanding scatter misses in flight. That's why the speedup persists even at 256 MB working sets.
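
To make the "same destination indices" point concrete, here is a hedged sketch of a transpose scatter written three ways. TransposeKernelSketch, its method names, and the index formula are illustrative assumptions (modeled on the published FastLanes layout for 64-bit lanes), not the PR's code. All variants write every source element to the same slot; they differ only in how the slot address is produced.

```java
// Illustrative only: hypothetical names, assumed FastLanes index formula.
final class TransposeKernelSketch {
    static final int CHUNK = 1024;
    private static final int[] ORDER = {0, 4, 2, 6, 1, 5, 3, 7};
    private static final int[] TRANSPOSE = buildTable();

    // Variant 1: div/mod, then a dependent ORDER[] load.
    static int divModIndex(int idx) {
        int lane = idx % 16;
        int order = (idx / 16) % 8;
        int row = idx / 128;
        return lane * 64 + ORDER[order] * 8 + row;
    }

    // Variant 2: hand-written shifts and masks (equivalent for idx >= 0),
    // but the dependent ORDER[] load is still on the critical path.
    static int shiftMaskIndex(int idx) {
        int lane = idx & 15;
        int order = (idx >>> 4) & 7;
        int row = idx >>> 7;
        return (lane << 6) + (ORDER[order] << 3) + row;
    }

    private static int[] buildTable() {
        int[] table = new int[CHUNK];
        for (int i = 0; i < CHUNK; i++) {
            table[i] = divModIndex(i);
        }
        return table;
    }

    // Scatter using the computed index.
    static void transposeComputed(long[] src, long[] dst) {
        for (int i = 0; i < CHUNK; i++) {
            dst[divModIndex(i)] = src[i];
        }
    }

    // Variant 3: scatter using the precomputed table; the table is read
    // sequentially, so each store address needs only one independent load.
    static void transposeTable(long[] src, long[] dst) {
        for (int i = 0; i < CHUNK; i++) {
            dst[TRANSPOSE[i]] = src[i];
        }
    }
}
```

Since the stores land in the same places in every variant, any timing difference between them comes from address generation rather than from the amount of data moved.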

Benchmark

New FastLanesTransposeBenchmark (Apple M5, long[], working set 8 KB → 256 MB):

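Speedup of the table version over the computed version (sizes are element counts of long):

| kernel    | L1 (1K) | L2 (256K) | DRAM (32M) |
|-----------|---------|-----------|------------|
| transpose | 3.4×    | 2.4×      | 1.7×       |
| undelta   | 1.6×    | 1.5×      | 1.4×       |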

Run: ./bench FastLanesTransposeBenchmark

Verification

  • Delta unit tests: DeltaEncodingDecoderTest (11), DeltaEncodingEncoderTest (75), DeltaEncodingTest (10), RoundTripPropertyTest (408) — all pass
  • Java→Rust delta i64 round-trip integration test (ground truth) — pass
  • javadoc:javadoc -pl core — clean

🤖 Generated with Claude Code

transposeIndex and iterateIndex computed a per-element % and / plus an ORDER[]
indirection. In the delta transpose and (un)delta hot loops that dependency
chain (div -> ORDER load -> mul) serializes scatter address generation,
throttling how many scatter misses stay in flight.

Replace with permutation tables built once in a static initializer:
- TRANSPOSE[CHUNK] for transposeIndex
- ITERATE_BASE[64] for iterateIndex (lane added per call)

Public API unchanged.

JMH (Apple M5, long[], FastLanesTransposeBenchmark) across L1 -> DRAM working
sets:
- transpose: 3.4x (L1) ... 1.7x (256 MB)
- undelta:   1.6x (L1) ... 1.4x (256 MB)

Win persists when memory-bound: same dst indices = same traffic, so the gain
is memory-level parallelism, not bandwidth. Shift-reduction control variants in
the benchmark show strength reduction alone recovers only part of it
(~1.5x transpose, ~1.08x undelta) - the dominant cost is the dependent ORDER[]
load, which only the table removes.

Also drops the now-completed FastLanes optimization item from TODO.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dfa1 force-pushed the perf/fastlanes-transpose-table branch from 8623487 to ecd34bb June 23, 2026 05:30
dfa1 merged commit 089b6e3 into main Jun 23, 2026
6 checks passed
dfa1 deleted the perf/fastlanes-transpose-table branch June 23, 2026 05:33