perf(encoding): precompute FastLanes transpose/iterate index tables by dfa1 · Pull Request #138 · dfa1/vortex-java

dfa1 · 2026-06-23T05:13:48Z

What

FastLanes.transposeIndex and iterateIndex computed a % and a / plus an ORDER[] indirection per element. In the delta transpose and (un)delta hot loops, that dependency chain (div → ORDER load → mul) serializes scatter address generation, limiting how many scatter misses stay in flight.

Replaced with permutation tables built once in a static initializer:

TRANSPOSE[CHUNK] for transposeIndex
ITERATE_BASE[64] for iterateIndex (lane added per call)

Public API unchanged.
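
For readers who want to see the shape of the change, here is a minimal sketch of table precomputation in a static initializer. It is not the PR's actual code. The chunk size of 1024, the lane count of 16, the ORDER values, the index formulas, and the `iterateIndex(row, lane)` signature are all assumptions modeled on the usual FastLanes 1024-element layout; only the names TRANSPOSE, ITERATE_BASE, transposeIndex and iterateIndex come from the description above.

```java
public final class FastLanesTablesSketch {
    // Assumed values: 1024-element chunk, 16 lanes, FastLanes-style row order.
    static final int CHUNK = 1024;
    static final int LANES = 16;
    private static final int[] ORDER = {0, 4, 2, 6, 1, 5, 3, 7};

    // Permutation tables, filled once when the class is initialized.
    private static final int[] TRANSPOSE = new int[CHUNK];
    private static final int[] ITERATE_BASE = new int[64];

    static {
        for (int i = 0; i < CHUNK; i++) {
            TRANSPOSE[i] = computeTranspose(i);
        }
        for (int row = 0; row < 64; row++) {
            ITERATE_BASE[row] = computeIterateBase(row);
        }
    }

    // Assumed per-element formula: div/mod feeding an ORDER[] load, then a multiply.
    static int computeTranspose(int idx) {
        int lane = idx % LANES;
        int order = (idx / LANES) % 8;
        int row = idx / 128;
        return lane * 64 + ORDER[order] * 8 + row;
    }

    // Assumed lane-independent part of the iterate index.
    static int computeIterateBase(int row) {
        return ORDER[row / 8] * LANES + (row % 8) * 128;
    }

    // Hot-path lookups: a single array load, no division and no ORDER[] hop.
    static int transposeIndex(int idx) {
        return TRANSPOSE[idx];
    }

    static int iterateIndex(int row, int lane) {
        return ITERATE_BASE[row] + lane; // lane added per call
    }

    // Checks that the transpose table hits every slot in [0, CHUNK) exactly once.
    static boolean transposeIsPermutation() {
        boolean[] seen = new boolean[CHUNK];
        for (int v : TRANSPOSE) {
            if (v < 0 || v >= CHUNK || seen[v]) {
                return false;
            }
            seen[v] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        if (!transposeIsPermutation()) {
            throw new AssertionError("TRANSPOSE is not a permutation");
        }
        for (int i = 0; i < CHUNK; i++) {
            if (transposeIndex(i) != computeTranspose(i)) {
                throw new AssertionError("table mismatch at " + i);
            }
        }
        System.out.println("tables match the per-element formulas");
    }
}
```

Because the tables are private and the accessor methods keep their names, callers would see the same API, which matches the "Public API unchanged" note.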

Why it's a real win (not just op-count)

Both loops are gather/scatter permutations (or have a serial prev dependency), so they don't auto-vectorize, and C2 already strength-reduces the power-of-two division, so a lower op count alone would not explain the result. The measured gain is memory-level parallelism: same destination indices ⇒ identical memory traffic, so faster address generation simply keeps more outstanding scatter misses in flight. That's why the speedup persists even at 256 MB working sets.
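
The strength reduction mentioned above can be illustrated with a small sketch. It assumes the same hypothetical index formula and ORDER values as the usual FastLanes layout (neither is taken from this PR's diff) and shows that rewriting the power-of-two `%` and `/` as masks and shifts still leaves the ORDER[] load in the dependency chain.

```java
public final class ShiftReducedIndexSketch {
    // Assumed FastLanes-style row order; not taken from the PR.
    private static final int[] ORDER = {0, 4, 2, 6, 1, 5, 3, 7};

    // Division/modulo form of the (assumed) transpose index.
    static int divModIndex(int idx) {
        int lane = idx % 16;
        int order = (idx / 16) % 8;
        int row = idx / 128;
        return lane * 64 + ORDER[order] * 8 + row;
    }

    // Shift/mask form: equivalent for non-negative idx.
    // The ORDER[] load is still required before the final add.
    static int shiftMaskIndex(int idx) {
        int lane = idx & 15;
        int order = (idx >>> 4) & 7;
        int row = idx >>> 7;
        return (lane << 6) + (ORDER[order] << 3) + row;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1024; i++) {
            if (divModIndex(i) != shiftMaskIndex(i)) {
                throw new AssertionError("forms disagree at " + i);
            }
        }
        System.out.println("div/mod and shift/mask forms agree on all 1024 indices");
    }
}
```

Both forms produce the same destination index, so any difference in speed between them comes from address generation rather than from the memory traffic itself.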

Benchmark

New FastLanesTransposeBenchmark (Apple M5, long[], working set 8 KB → 256 MB):

kernel	L1 (1K longs, 8 KB)	L2 (256K longs, 2 MB)	DRAM (32M longs, 256 MB)
transpose	3.4×	2.4×	1.7×
undelta	1.6×	1.5×	1.4×

Run: ./bench FastLanesTransposeBenchmark

Verification

Delta unit tests: DeltaEncodingDecoderTest (11), DeltaEncodingEncoderTest (75), DeltaEncodingTest (10), RoundTripPropertyTest (408) — all pass
Java→Rust delta i64 round-trip integration test (ground truth) — pass
javadoc:javadoc -pl core — clean

🤖 Generated with Claude Code

transposeIndex and iterateIndex computed per-element % / / plus an ORDER[] indirection. In the delta transpose and (un)delta hot loops that dependency chain (div -> ORDER load -> mul) serializes scatter address generation, throttling how many scatter misses stay in flight.

Replace with permutation tables built once in a static initializer:
- TRANSPOSE[CHUNK] for transposeIndex
- ITERATE_BASE[64] for iterateIndex (lane added per call)

Public API unchanged.

JMH (Apple M5, long[], FastLanesTransposeBenchmark) across L1 -> DRAM working sets:
- transpose: 3.4x (L1) ... 1.7x (256 MB)
- undelta: 1.6x (L1) ... 1.4x (256 MB)

Win persists when memory-bound: same dst indices = same traffic, so the gain is memory-level parallelism, not bandwidth. Shift-reduction control variants in the benchmark show strength reduction alone recovers only part of it (~1.5x transpose, ~1.08x undelta) - the dominant cost is the dependent ORDER[] load, which only the table removes.

Also drops the now-completed FastLanes optimization item from TODO.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dfa1 force-pushed the perf/fastlanes-transpose-table branch from 8623487 to ecd34bb Compare June 23, 2026 05:30

dfa1 merged commit 089b6e3 into main Jun 23, 2026
6 checks passed

dfa1 deleted the perf/fastlanes-transpose-table branch June 23, 2026 05:33

dfa1 merged 1 commit into main from perf/fastlanes-transpose-table
