Cross-cutting findings, failed approaches, and reasoning that don't belong to a single ladder step. Step-by-step ladder records and the forward plan have moved out of this file:
- Ladder Log — per-step ladder records (steps 00–18, the "narrowing chain" decisions, current canonical carries).
- Ladder Plan — forward plan for steps 19–50+ (multilevel PT DI canonical, multilevel + WS-ReSTIR DI, multilevel + PT multibounce, multilevel + ReSTIR PT multibounce, BDPT open).
This file keeps:
- Cross-cutting parity / structural-equivalence story (RTXDI baseline, restir_2d ≡ restir_3d).
- Sampler artefacts that are reusable beyond the ladder (e.g. `EmissivePdfMipmapSampler`).
- Failed approaches with their diagnoses (one paragraph each, anchored to dates / commits).
- Cross-cutting reasoning paragraphs.
Status: Functional + qualitative parity with RTXDI achieved on the seven-scene matrix; structural equivalence (restir_2d ≡ restir_3d) demonstrated within sampling noise.
| Knob | Value | Rationale |
|---|---|---|
| `WS_CELL_POOL_N` | 128 | Matches RTXDI tile-density target. 64→128 won Sponza_x4 −0.24pp; 128→256 diminishing. |
| `wsInitialCandidates` (K_pre) | 32 | Slim pre-pass; K=64 quality cost ~0.1pp avg (acceptable trade). |
| `wsCellPoolDrawK` (K_pool) | 16 | RTXDI K=24 budget. K_pool=24/64 retested with Conv A and B; both regress (over-weights pool's shading-agnostic distribution vs 8 fresh shading-conditional samples). |
| `wsMCap` | 5 | RTXDI default 20 tested: uniformly +0.1-0.3pp worse on multi-light scenes. |
| Pre-pass emissive sampler | PdfMipmap | New EmissivePdfMipmapSampler peer to Power/LightBVH. RTXDI-style hierarchical pdf-mipmap. |
| Main-pass emissive sampler | LightBVH (default) | Shading-conditional, required by BistroInt; mixed-PdfMipmap-main regressed +1.47pp. |
| Pool read convention | Conv B reader-eval | 1/sourcePdf computed at READER's vertex via emissiveSampler.evalPdf() — RTXDI-faithful unbiased. Earlier writer-pdf Conv B caused fireflies (writer's r²/cos baked in). |
| Bayer N×N | 4 (16 subframes) | RTXDI presample-budget alignment: 16K active pixels × K=8 ≈ 131K presamples = RTXDI's 128×1024. |
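The Bayer row can be illustrated with a small sketch. It builds the standard 4×4 ordered-dither index matrix and treats a pixel as active on subframe `f` when its Bayer index equals `f mod 16`. This is a minimal illustration under assumptions: the function names are hypothetical, the repo's actual subframe ordering may differ, and "16K" is read as 16 × 1024.

```python
def bayer_matrix(n):
    """Return the n x n ordered-dither (Bayer) index matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    m = [[0]]
    while len(m) < n:
        size = len(m)
        nxt = [[0] * (2 * size) for _ in range(2 * size)]
        for y in range(size):
            for x in range(size):
                base = 4 * m[y][x]
                nxt[y][x] = base                    # top-left quadrant
                nxt[y][x + size] = base + 2         # top-right quadrant
                nxt[y + size][x] = base + 3         # bottom-left quadrant
                nxt[y + size][x + size] = base + 1  # bottom-right quadrant
        m = nxt
    return m


def active_pixels(width, height, frame, n=4):
    """Pixels whose Bayer index equals frame mod n*n (assumed activation rule)."""
    tile = bayer_matrix(n)
    phase = frame % (n * n)
    return [(x, y) for y in range(height) for x in range(width)
            if tile[y % n][x % n] == phase]


if __name__ == "__main__":
    # Each of the 16 subframes touches exactly 1/16 of the pixels.
    print(len(active_pixels(64, 32, frame=0)), "of", 64 * 32)
    # Budget check from the table, reading 16K as 16 * 1024 active pixels.
    print(16 * 1024 * 8, "presamples vs RTXDI", 128 * 1024)
```

Over one full cycle of 16 subframes every pixel is visited exactly once, which is why a 16-frame warmup corresponds to one complete Bayer cycle.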
| Scene | vanilla | RTXDI | restir (ours) | Δ vs RTXDI |
|---|---|---|---|---|
| CornellBox_1AreaLight | 1.39 | 2.18 | 2.15 | −0.03 win |
| CornellBox_1PointLight | 0.21 | 1.39 | 0.21 | −1.18 win |
| CornellBox_3AreaLights | 2.97 | 2.60 | 3.55 | +0.95 trail |
| CornellBox_32PointLights | 5.36 | 3.73 | 3.31 | −0.42 win |
| BistroExterior | 18.12 | 13.23 | 10.88 | −2.35 win |
| BistroInterior | 16.96 | 10.73 | 9.54 | −1.19 win |
| Sponza | 6.23 | 7.08 | 6.49 | −0.59 win |
Net at x4: 6 wins / 0 parities / 1 trail. Cumulative −4.81pp ahead of RTXDI on aggregate.
The single remaining trail is CornellBox_3AreaLights (+0.95pp). Confirmed structural: per-cell pool architecture vs RTXDI's 1024-tile global structure produces different per-pixel candidate diversity profiles. No within-architecture parameter sweep equalizes them; closing it would require a true global tile structure.
rays_traced_pct per the diagnostic counter (lower is better):
| Scene_x4 | RTXDI | restir | restir / RTXDI |
|---|---|---|---|
| Cornell_1AL | 9.90 | 18.13 | 1.83× |
| Cornell_1PL | 5.15 | 0.38 | 0.07× |
| Cornell_3AL | 9.54 | 22.16 | 2.32× |
| Cornell_32PL | 24.66 | 17.38 | 0.70× |
| BistroExterior | 81.95 | 74.95 | 0.91× |
| BistroInterior | 65.39 | 60.84 | 0.93× |
| Sponza | 59.88 | 60.50 | 1.01× (parity) |
Shadow-ray parity on five scenes; restir uses fewer rays on three. Cornell_3AL/Cornell_1AL fire ~2× because their K-RIS produces valid winners more often (visibility patterns differ from RTXDI's tile fill). Eval-cost gap (pre-pass uses PathTracer instance, ~3-4× more light-evaluations than RTXDI's lean compute presample) is plumbing — addressed by the lean dedicated compute pre-pass when ready (Task #29).
restir_2d (RTXDI's exact data structure: pixel reservoir + screen-space tile pool) and restir_3d (3D-cell pool + per-pixel reservoir) produce identical results within sampling noise on every scene tested:
| Scene_x4 | restir_2d err | restir_3d err | \|2d − 3d\| |
|---|---|---|---|
| Cornell_1AL | 2.15 | 2.16 | 0.01 |
| Cornell_1PL | 0.21 | 0.21 | 0.00 |
| Cornell_3AL | 3.55 | 3.55 | 0.00 |
| Cornell_32PL | 3.31 | 3.31 | 0.00 |
| BistroExt | 10.88 | 10.85 | 0.03 |
| BistroInt | 9.54 | 9.53 | 0.01 |
| Sponza | 6.49 | 6.47 | 0.02 |
The gap |2d − 3d| is ≤ 0.03pp on all scenes, well below the per-frame stochastic noise floor. This is the structural-equivalence claim from paper §3.0 made operational: the 3D-cell pool with footprint-derived entry level is structurally equivalent to RTXDI's 2D-tile pool at matching parameters. The novelty isn't the addressing scheme; it's what lies beyond the matched operating point. Setting the footprint-derived entry level to one screen tile recovers RTXDI's exact pool layout; beyond that operating point, 3D admits cross-tile world-space sharing that 2D cannot express.
A clean Falcor-native peer to EmissiveUniformSampler/EmissiveLightBVHSampler/EmissivePowerSampler, registered as EmissiveLightSamplerType::PdfMipmap = 3 in the existing factory. CPU-side build from MeshLightTriangle.flux placed in z-curve mip-0 layout (using inlined RTXDI_LinearIndexToZCurve); Texture::generateMips builds the chain. Slang side inlines RTXDI_SamplePdfMipmap for hierarchical descent and returns solid-angle pdf via ls.pdf *= mipmapPdf, vanilla-NEE-compatible. Math validated 1.116% on Cornell_3AL vanilla x16 vs LightBVH 1.119% / Power 1.126% — within stochastic noise. RTXDI library files are untouched; the sampler reuses rtxdi/RtxdiMath.hlsli via include only. Reusable by any pass that wants RTXDI-style sampling.
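The hierarchical descent described above can be sketched in a few lines. This is a CPU-side illustration under assumptions, not the Slang code or `RTXDI_SamplePdfMipmap` itself: all function names are hypothetical, the Morton bit order (x in even bits) is assumed, and coarser levels store sums rather than the averages a mip generator would produce (the selection ratios are the same either way). It returns only the discrete selection pdf; the conversion to a solid-angle pdf is omitted.

```python
import random


def zcurve_to_xy(index):
    """Decode a Morton (z-curve) index; x is assumed to live in the even bits."""
    x = y = bit = 0
    while index:
        x |= (index & 1) << bit
        y |= ((index >> 1) & 1) << bit
        index >>= 2
        bit += 1
    return x, y


def flux_to_level0(flux):
    """Place per-triangle flux into a square power-of-two grid in z-curve order."""
    side = 1
    while side * side < len(flux):
        side *= 2
    level0 = [[0.0] * side for _ in range(side)]
    for i, f in enumerate(flux):
        x, y = zcurve_to_xy(i)
        level0[y][x] = f
    return level0


def build_mips(level0):
    """Return [mip0, mip1, ..., 1x1]; each coarser texel sums its 2x2 children."""
    mips = [level0]
    while len(mips[-1]) > 1:
        src = mips[-1]
        half = len(src) // 2
        mips.append([[src[2 * y][2 * x] + src[2 * y][2 * x + 1]
                      + src[2 * y + 1][2 * x] + src[2 * y + 1][2 * x + 1]
                      for x in range(half)] for y in range(half)])
    return mips


def sample_pdf_mipmap(mips, rng):
    """Descend from the 1x1 level to mip 0; return ((x, y), discrete pdf)."""
    x = y = 0
    pdf = 1.0
    for level in range(len(mips) - 2, -1, -1):
        src = mips[level]
        x, y = 2 * x, 2 * y
        children = [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
        weights = [src[cy][cx] for cx, cy in children]
        total = sum(weights)
        if total <= 0.0:
            return None, 0.0  # empty texture: nothing to sample
        u = rng.random() * total
        acc = 0.0
        chosen = None
        for child, w in zip(children, weights):
            if w <= 0.0:
                continue  # never select a zero-weight child
            chosen = (child, w)
            acc += w
            if u < acc:
                break
        (x, y), w = chosen
        pdf *= w / total  # probability of this branch at this level
    return (x, y), pdf


if __name__ == "__main__":
    mips = build_mips(flux_to_level0([1.0, 0.0, 2.0, 5.0]))
    print(sample_pdf_mipmap(mips, random.Random(0)))
```

The product of per-level branch probabilities telescopes to `texel / total`, so the returned pdf is exactly proportional to the mip-0 value regardless of the descent path.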
- Conv B with stored solid-angle pdf: fireflies on Sponza_x4 (+6.18pp regression). The writer's `r²/cos` baked into the stored `1/sourcePdf` amplifies at distant-writer slots. Fix: reader-evaluated pdf.
- Mixed PdfMipmap main + PdfMipmap pool: BistroInt_x4 +1.47pp regression. Main pass needs shading-conditional LightBVH for tight indoor geometry.
- K_pool > 16 (24, 64) — over-weights pool's shading-agnostic distribution vs the 8 fresh shading-conditional samples. Both Conv A and Conv B regress.
- wsMCap = 20 (RTXDI default) — uniformly +0.1-0.3pp worse on multi-light scenes. Stays at 5.
- Bitterli RIS at insert with writer-pHat — biases pool toward writer's shading point, breaks cross-pixel reuse on heterogeneous lighting.
- Drop main-pass fresh K-RIS (pool-only K=24) — regressed Sponza_x1 +9pp; fresh shading-conditional samples are required.
- Spatial-reuse off (wsSpatialPixelsK=0) — confirmed not the Cornell_3AL bias source (Δ < 0.06pp).
- Probabilistic V-aware fill at insert — preserves expected value (only changes variance); Sponza unchanged.
- RTXDI BoilingFilter port: DISABLED 2026-05-05 (`#if 0` in shader, block-commented in C++). Dispatch fires, host-side `clearUAV` on the same buffer moves the metric, but shader-side writes silently no-op. Suspect: locally-redeclared `RWStructuredBuffer<WSReservoir>` vs the working module-imported `gVHFTable` in `VisCacheDecay`. Lesson: silent no-op safety nets are worse than no safety net, because they could mask future regressions. Future fix: split `gWSPixelReservoirs` into a separable include that both `WSReservoirIO` and a fixed BoilingFilter can import.
- `accelDecayDisagreeThresh > 0`: Bistro art5 regresses 3–6× (BiE x16 21.8 → 132.8; BiI x16 29.9 → 93.1). Cause: half-decay-on-disagreement creates runaway oscillation on cells with legitimate mixed visibility. ad ∈ {0.05, 0.10, 0.30} all converge to the same (worse) attractor, an empirically broken mechanism. Default off (BISTRO_ADD sweep, 2026-05-05).
- Trust-gate sweeps at cell4×4 ct=2 on Sponza (vt, se, fd, cwf, posB-quant): all combinations bit-identical (rays=73.48%, art5=23.36; tested in step 17, step 18). The 26.5% rays-savings ceiling at this corner is structural, not gate-tunable: ct=2 itself is the bottleneck. Naive raise-base-ct (SPONZA_CT) breaks the saturation: ct=8 cuts art5 23.4→17.5 at x4. Lesson: trust-gate sweeps stop revealing leverage once the boot threshold itself is too low to accumulate the per-cell N needed to trust μ.
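The writer-pdf failure in the first item comes down to the area-to-solid-angle Jacobian `r²/cos`, which depends on the vertex at which it is evaluated. The sketch below uses the standard conversion for a point on an emissive surface; the function name and the example geometry are hypothetical and only illustrate how far a stored writer-side value can drift from the reader-side one.

```python
import math


def solid_angle_pdf(area_pdf, light_pos, light_normal, shading_pos):
    """Convert an area-measure pdf to solid-angle measure as seen from shading_pos."""
    d = [l - s for l, s in zip(light_pos, shading_pos)]  # shading point -> light
    r2 = sum(c * c for c in d)
    if r2 == 0.0:
        return 0.0
    r = math.sqrt(r2)
    # Cosine at the light between its normal and the direction back to the shading point.
    cos_light = -sum(n * c for n, c in zip(light_normal, d)) / r
    if cos_light <= 0.0:
        return 0.0  # light faces away from this shading point
    return area_pdf * r2 / cos_light


if __name__ == "__main__":
    light_pos, light_normal = (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)
    writer = solid_angle_pdf(1.0, light_pos, light_normal, (0.0, 0.0, 0.0))
    reader = solid_angle_pdf(1.0, light_pos, light_normal, (3.0, 0.0, 0.0))
    # Reusing 1/writer where 1/reader is correct mis-weights the sample by this factor.
    print("writer pdf", writer, "reader pdf", reader, "ratio", reader / writer)
```

In this example the same light sample has a solid-angle pdf about 31.6 times larger at the reader than at the writer, so a stored writer-side `1/sourcePdf` would over-weight it by that factor. Evaluating the pdf at the reader removes the mismatch.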
These didn't fit any single ladder step's narrative; they emerged from the union of multiple sweeps and reframe earlier results.
- Scene-class taxonomy is 4-row, not 2-row. (Class × bounce-depth.) Penumbra-class single-bounce DI (Sponza b=0): vt-tuning helps locally; perceptual-vs-linear metric tradeoff. Penumbra-class multibounce (Sponza b=1/4): cache delivers −74pp rays + OkLab match, PSNR/relmse worsens (linear-space loss is the perceptual cost of the cache's CV+RRR averaging). Firefly-class single-bounce DI (Bistro b=0): cache already at firefly floor — sweeping ct/vt/decay has no leverage but cache is winning −46pp art5 vs vanilla. Firefly-class multibounce (BistroInt b=1/4): cache wins on every metric (relmse 2.4× better, −53pp rays). The "Bistro framework doesn't generalize from Sponza" finding from BISTRO_CT was an artifact of single-bounce DI; multibounce closes the gap — every per-bounce firefly source is a fresh variance the cache amortizes via cell-level mean.
- Bistro firefly-floor reframe. BISTRO_DECAY (decayPeriod sweep) and BISTRO_CT (4-corner ct/vt) both showed bit-identical art5 across all variants on Bistro single-bounce DI. The reframe (BISTRO_DECAY narrative): cache art5 42.87% / 29.93% (x4 / x16) vs vanilla 88.89% / 48.23% means the cache absorbs ~46pp / 18pp of vanilla's variance — the residual is irreducible firefly noise, not cache bias-lock. Bistro DI cache is working as designed, just at its theoretical ceiling. The mechanism that breaks the floor is multibounce, not more DI-level tuning. The "scene-classifier needed" follow-up from BISTRO_CT remains valid as a future direction (per-class auto-tune) but doesn't change the b=0 result.
- vt has anti-correlated optima across metric families. SPONZA_VT at x16: vt=0.001 best art5 (15.21) but RMSE/relmse worse; vt=0.30 best RMSE/PSNR/relmse (relmse 0.09 vs 0.45 at tight vt) but worst art5 (28.42). art5 penalizes LOCAL spikes (firefly-region peaks); RMSE penalizes AVERAGE error. Tight vt kills firefly spots locally; loose vt smooths per-pixel noise globally. Implication: ship per-metric carry tables, not a universal vt — and any paper §11/§12 figure must report multiple metric families honestly.
- vt is SPP-dependent. SPONZA_VT: x4 optimum vt≈0.10, x16 optimum vt≈0.001. Wilson-interval / two-tier ct (LADDER_PLAN improvement A) is the principled fix — Wilson lower-bound > 0.99 OR upper-bound < 0.01 collapses both regimes into one criterion.
- rays_traced_pct ≠ wall-clock saving on ray-trace-cheap scenes. Cornell scenes have tiny geometry (small BVH, sub-millisecond ray cost). The cache's per-pixel infrastructure cost (hash query + atomic decay + cell-state update) is roughly constant per scene, dominated by the lookup machinery rather than the ray itself. So "94% rays saved on Cornell_1PL b=4" is an algorithmic finding, not a wall-clock claim — the saved rays were free to trace in the first place. Wall-clock wins require ray-cost > cache-infrastructure-cost, which holds on Sponza and BistroInterior, but not on Cornell_32PL (2.6 ms vanilla; cache infrastructure already exceeds vanilla's render cost). Pitch implication: report rays-saved as the algorithmic metric, gpu_tracepass_ms as the operational metric, and don't conflate them. Cornell-class scenes are useful as algorithm-validation but not wall-clock benchmarks.
- The cache is designed for 1-SPP-per-frame + frame-accumulation real-time rendering. Every frame is a 1-SPP draw; consecutive frames warm cache state; wins emerge AT STEADY STATE under temporal coherence. A cold-start measurement (render 4-8 warmup frames, average a small window after) under-represents the real-world value because the cache hasn't reached cell-maturity equilibrium yet. Animated scenes benefit naturally: as the camera moves through space, locally-overlapping cells stay warm frame-to-frame; only the leading edge of newly-revealed regions pays cold-start cost, and that's amortized over many subsequent frames where those cells are hit again. Methodology corollary: TIMING measurements need long warmup (64+ frames) to reach the operating regime the cache was built for. Single-shot multi-SPP-per-frame measurements (vanilla x4 in one renderFrame call) are out-of-distribution for this cache and shouldn't be used as the wall-clock benchmark.
- Convention B requires reader-evaluated pdf. Call `emissiveSampler.evalPdf()` at the receiver's vertex; never store the writer's solid-angle pdf, because its `r²/cos` factor amplifies into firefly tails at distant readers.
- Data-structure equivalence is structural. 2D screen tile and 3D world cell are interchangeable at matched density; the mechanism is flat-multilevel-hash + reservoir reuse + RIS pool fill regardless of which one you address.
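The Wilson-interval criterion mentioned for vt (lower bound > 0.99 OR upper bound < 0.01) can be sketched with the standard Wilson score interval. The function names, the 95% z-value and the per-cell visible/trials counters are assumptions for illustration; the actual cell state in the cache may be stored differently.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is an assumed 95% level)."""
    if trials <= 0:
        return 0.0, 1.0  # no evidence yet: the interval is the whole range
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def cell_is_trusted(visible, trials, lo_thresh=0.99, hi_thresh=0.01, z=1.96):
    """Trust a cell only if it is confidently fully lit or confidently fully shadowed."""
    lo, hi = wilson_interval(visible, trials, z)
    return lo > lo_thresh or hi < hi_thresh


if __name__ == "__main__":
    for visible, trials in [(16, 16), (381, 381), (0, 381), (50, 100)]:
        print(visible, trials, wilson_interval(visible, trials),
              cell_is_trusted(visible, trials))
```

With these thresholds a cell needs several hundred unanimous samples before it is trusted, so the criterion tightens automatically as the per-cell sample count grows with SPP instead of requiring a separate vt per SPP level.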
Status across the F17P24 baseline after a multi-iteration sweep:
| Knob | RTXDI default | Our F17P24 default |
|---|---|---|
| localLightCandidateCount | 24 | 24 (pool) ✓ |
| infiniteLightCandidateCount | 8 | ~5.67 (uniform-fresh×selectLightType) |
| envLightCandidateCount | 8 | ~5.67 (uniform-fresh×selectLightType) |
| brdfCandidateCount | 1 | 0 (tried, no win) |
| testCandidateVisibility | true | true ✓ |
| biasCorrection | Basic | Basic ✓ (5be5db0) |
| samplingRadius | 30 | 30 ✓ |
| spatialSampleCount | 1 | 1 ✓ |
| spatialIterations | 5 | 1 ← largest unmatched |
| maxHistoryLength | 20 | mCap=20 ✓ |
| boilingFilterStrength | 0 | 0 ✓ |
| presampledTileCount × Size | 128 × 1024 | N/A (cell-pool architecture) |
Quality status at SPP=4 with the locked F17P24 Basic default:
| Scene | err% vs RTXDI | art5% vs RTXDI | rmse vs RTXDI |
|---|---|---|---|
| Cornell_1PL | beats (90%) | beats (97%) | beats (96%) |
| Cornell_1AL | beats (48%) | beats (38%) | beats (54%) |
| Cornell_3AL | beats (-1%) | matches (-7%) | beats (51%) |
| Cornell_32PL | beats (26%) | beats (60%) | matches (-20%) |
| BistroInterior | beats (12%) | matches | trails (+11%) |
| Sponza | beats (12%) | matches | trails (+19%) |
art5 (local-spike penalty) is at parity or beating RTXDI on every scene. err% (OkLab perceptual) beats RTXDI on every scene. Residual rmse trails on Bistro/Sponza — attributed to RTXDI's 5-iteration spatial cascade (our 1-pass spatial reuse cannot recover the same variance reduction without multi-pass ping-pong infrastructure).
Committed optimizations that don't compromise algorithm or params.
Quality verified identical (same K=41, same biasCorrection=Basic,
same mCap=20, same sampler) on Bistro+Sponza at x64 with
LADDER_TIMING_MODE=1 + N_WARMUP=16 + bayerN=4 (16-frame Bayer
cycle, profiler stats.mean, EMA bypassed).
Commit ladder
| Commit | Optimization | Quality delta |
|---|---|---|
| f8b548e | USE_VISCACHE_NORMAL_ADDR gate | identical |
| b7d1a86 | gNormalAddr removed entirely (–71 LOC) | identical |
| 09cf651 | wsCellPoolPrePass=False canonical (R2dP2d + R3dP3d) | identical |
| 662700b | Cross-variant prepass A/B (RDI00_PrepassAB) | verified 6/6 |
| ccbf5b1 | 3 dead cbuffer fields removed (NormalAFine, DiagAccumWindow, LightSoftness) | identical |
| e338acb | LADDER_TIMING_BREAKDOWN env var bypasses capture-file cache | tool fix |
| 4f78aee | Orphan C++ struct fields dropped (Phase A; followup to ccbf5b1, −53 LOC) | identical |
| 4b32125 | useCellInRIS dropped — collapsed into (spatialNeighbours > 0) (Phase B) | drift −0.20% (RNG floor) |
| 695282e | enableCellPool dropped — collapsed into (cellPoolFootprintPx > 0) (Phase C-light) | drift −0.17% (RNG floor) |
Final verified per-variant numbers, x64
| Variant | Bistro ms | Sponza ms | Bistro rmse | Sponza rmse |
|---|---|---|---|---|
| RTXDI reference | 1.30 | 1.04 | 97.9 | 0.376 |
| F17P24 prepass-off (canonical) | 5.14 | 4.80 | 43.6 | 0.133 |
| PureKRIS F8 (no prepass at all) | 3.75 | 3.69 | 45.1 | 0.147 |
| R3dP3d prepass-off (canonical) | 2.95 | 3.12 | 65.8 | 0.176 |
Clean A/B at current HEAD — cross-variant, cross-scene (RDI00_PrepassAB
ladder, same K=41, same biasCorrection=Basic within each pair, only
wsCellPoolPrePass flipped, x64):
| Scene | Variant | On ms | Off ms | Δ | rmse delta |
|---|---|---|---|---|---|
| Bistro | R2dP2d F17P24 | 7.00 | 5.11 | −27.0% | +0.01% |
| Bistro | R3dP3d F00P24 | 4.70 | 2.76 | −41.1% | −0.00% |
| Sponza | R2dP2d F17P24 | 6.04 | 4.39 | −27.4% | +0.05% |
| Sponza | R3dP3d F00P24 | 4.79 | 3.04 | −36.6% | +0.00% |
| Cornell_32PL | R2dP2d F17P24 | 3.01 | 2.32 | −23.2% | +0.00% |
| Cornell_32PL | R3dP3d F00P24 | 2.51 | 1.80 | −28.3% | +0.00% |
Prepass-off is a universal x64 win across 6/6 scene-variant cells (−23% to −41%). R3dP3d benefits MORE than R2dP2d on every scene because its F00 (no fresh K-RIS) made the prepass a larger fraction of total cost. rmse delta ≤±0.05% everywhere — well within measurement noise, proving algorithm-neutrality.
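The Δ column can be rechecked from the rounded millisecond values in the table. The sketch below assumes Δ is defined as `(off − on) / on`; the helper names are hypothetical. Small differences against the reported column are expected, presumably because the reported percentages were computed from unrounded timings.

```python
# (scene, variant, on_ms, off_ms, reported_delta_pct) copied from the A/B table.
AB_ROWS = [
    ("Bistro", "R2dP2d F17P24", 7.00, 5.11, -27.0),
    ("Bistro", "R3dP3d F00P24", 4.70, 2.76, -41.1),
    ("Sponza", "R2dP2d F17P24", 6.04, 4.39, -27.4),
    ("Sponza", "R3dP3d F00P24", 4.79, 3.04, -36.6),
    ("Cornell_32PL", "R2dP2d F17P24", 3.01, 2.32, -23.2),
    ("Cornell_32PL", "R3dP3d F00P24", 2.51, 1.80, -28.3),
]


def relative_delta_pct(on_ms, off_ms):
    """Relative change of prepass-off against prepass-on, in percent."""
    return (off_ms - on_ms) / on_ms * 100.0


def recheck(rows):
    """Return (scene, variant, recomputed, reported, abs difference in pp) per row."""
    out = []
    for scene, variant, on_ms, off_ms, reported in rows:
        recomputed = relative_delta_pct(on_ms, off_ms)
        out.append((scene, variant, recomputed, reported, abs(recomputed - reported)))
    return out


if __name__ == "__main__":
    for scene, variant, recomputed, reported, diff in recheck(AB_ROWS):
        print(f"{scene:13s} {variant:14s} {recomputed:7.2f}% "
              f"(reported {reported:.1f}%, diff {diff:.2f}pp)")
```

Every row recomputes to within half a percentage point of the reported value, and the R3dP3d rows show the larger relative saving on each scene.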
The earlier "Sponza +20% regression" was a contaminated baseline (different measurement context, EMA vs stats.mean, different warmup state) — corrected by the clean cross-variant A/B above.
At low SPP (x4), prepass-on can win on heavy scenes (Bistro x4: 7.67 vs 11.23 ms) because the prepass IS the pool-warmup mechanism. With N_WARMUP=16 (one full Bayer cycle), x16+ steady-state always favors prepass-off.
Why prepass-off is algorithm-neutral: main-pass cellPoolInsert
(PathTracer.slang:1197) already populates pool slots from K-RIS
winners. The prepass was a redundant pool-fill dispatch. With
N_WARMUP=16 + bayerN=4, pool reaches steady state by frame 16
regardless of prepass; rmse verified identical across the
RDI00 scenes (Cornell_1AL/3AL/32PL, Bistro, Sponza).
Quality wins preserved
| Scene | Ours rmse | RTXDI rmse | Our advantage |
|---|---|---|---|
| Bistro | 43.6 | 97.9 | 2.2× better |
| Sponza | 0.133 | 0.376 | 2.8× better |
Dead variants disabled from default RDI00 ladder (callable in VisCache_LadderCommon.py if needed):
- `NoPrepass`: REDUNDANT (canonical IS prepass-off now)
- `PureKRIS_F04`: K-scaling probe complete, fixed overhead confirmed
- `PoolOnly F00P24`: quality worse than RTXDI, fresh K-RIS irreplaceable
- `K5Spatial`: single-pass K=5 amplifies Bistro fireflies (+18% rmse)
- `BrdfRis`: no rmse improvement (commit log)
Net: ladder runtime ~halved (10 variants → 5 active per scene), and the two RTXDIBaseline variants both inherit the prepass-off win at identical quality. Speed gap to RTXDI now 2.3-4.6× depending on variant, down from 5-6× pre-optimization, while preserving the 2.2-2.8× rmse advantage.
Verified via dedicated A/B: scripts/VisCache_LadderRDI00_PrepassAB.py
isolates the prepass flip at current HEAD with all other optimizations
frozen — confirms −27% on both Bistro AND Sponza at x64 with rmse
identical (algorithm-neutrality preserved).
- Falcor's `events[k]["average"]` is an EMA (σ=0.98) and survives `resetStats()`. Use `events[k]["stats"]["mean"]` for true per-call arithmetic mean.
- `LADDER_TIMING_MODE=1` disables VisCache diagnostic-texture writes (~90% of our per-frame GPU cost). Use for timing benchmarks; default-on for quality plates.
- Honest steady-state at x16, diagnostics off: ~2.5× slower per frame than RTXDI on Bistro/Sponza. Quality EXCEEDS RTXDI (rmse −41% Bistro, −44% Sponza at x16). Trade: quality-per-frame vs quality-per-ms. RTXDI saturates at mCap=20; our metrics improve monotonically with SPP.
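The first pitfall above is easy to reproduce with a toy model. The class below is not Falcor's profiler: it is a hypothetical stand-in that assumes the EMA update `avg = σ·avg + (1 − σ)·x` with σ = 0.98 and a reset that clears only the per-call statistics, which is the behaviour the note describes.

```python
class ToyTimerEvent:
    """Hypothetical profiler event: an EMA that survives resets plus resettable stats."""

    def __init__(self, decay=0.98):
        self.decay = decay
        self.average = None   # EMA; deliberately NOT cleared by reset_stats()
        self.samples = []     # per-call samples; cleared by reset_stats()

    def record(self, ms):
        if self.average is None:
            self.average = ms
        else:
            self.average = self.decay * self.average + (1.0 - self.decay) * ms
        self.samples.append(ms)

    def reset_stats(self):
        self.samples.clear()

    @property
    def mean(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def measure(warmup_ms, steady_ms, n_warmup=16, n_measure=64):
    """Record a slow warmup phase, reset, then record a faster steady phase."""
    ev = ToyTimerEvent()
    for _ in range(n_warmup):
        ev.record(warmup_ms)
    ev.reset_stats()
    for _ in range(n_measure):
        ev.record(steady_ms)
    return ev.average, ev.mean


if __name__ == "__main__":
    ema, mean = measure(warmup_ms=10.0, steady_ms=5.0)
    print(f"EMA {ema:.2f} ms vs arithmetic mean {mean:.2f} ms")
```

In this toy run the EMA still carries roughly a quarter of the warmup offset after 64 measured calls, while the arithmetic mean reports the steady-state cost exactly. A contaminated baseline of that kind is consistent with the corrected "Sponza +20% regression" noted earlier.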
At x16 SPP our F17P24 cell-pool architecture beats RTXDI on every metric across Bistro+Sponza. RTXDI's metrics DEGRADE x4→x16 (M-cap saturation reusing stale fireflies); ours monotonically improve. The Bayer-coordinated cell-pool fill IS our equivalent of RTXDI's per-frame 5-pass cascade — stretched in time across N²=16 subframes. We don't need to port RTXDI's compute-presample-tile or 5-iteration cascade.
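The M-cap saturation argument can be made concrete with a textbook ReSTIR-style temporal combine. This is a sketch under assumptions, not the repo's reservoir code or RTXDI's: the `Reservoir` fields, the `combine_temporal` helper and the rule "history M is clamped to m_cap × current M" are the usual formulation and are used here purely for illustration.

```python
import random
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Reservoir:
    sample: Optional[Any] = None
    w_sum: float = 0.0   # running sum of resampling weights
    M: int = 0           # number of candidates this reservoir represents
    W: float = 0.0       # contribution weight of the kept sample

    def update(self, x, w, count, rng):
        """Stream in one candidate carrying `count` samples' worth of confidence."""
        self.w_sum += w
        self.M += count
        if w > 0.0 and rng.random() * self.w_sum < w:
            self.sample = x


def combine_temporal(current, history, p_hat, m_cap, rng):
    """Merge history into current, clamping history M to m_cap * current M."""
    capped_m = min(history.M, m_cap * max(current.M, 1))
    out = Reservoir()
    for r, m in ((current, current.M), (history, capped_m)):
        w = p_hat(r.sample) * r.W * m if r.sample is not None else 0.0
        out.update(r.sample, w, m, rng)
    if out.sample is not None and out.M > 0:
        target = p_hat(out.sample)
        out.W = out.w_sum / (out.M * target) if target > 0.0 else 0.0
    return out


def stale_fraction(m_cap, trials=20000, seed=1):
    """How often the history sample survives when both samples look equally good."""
    rng = random.Random(seed)
    cur = Reservoir(sample="fresh", w_sum=1.0, M=1, W=1.0)
    hist = Reservoir(sample="stale", w_sum=1.0, M=1000, W=1.0)
    kept = sum(combine_temporal(cur, hist, lambda s: 1.0, m_cap, rng).sample == "stale"
               for _ in range(trials))
    return kept / trials


if __name__ == "__main__":
    for cap in (5, 20):
        print(f"m_cap={cap}: history sample kept in {stale_fraction(cap):.1%} of merges")
```

With equal target values and weights, the history sample survives a merge with probability `m_cap / (m_cap + 1)`: about 83% at a cap of 5 and about 95% at a cap of 20. A stale firefly held in history is therefore displaced more slowly at the higher cap.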
- Category-quota K-RIS (single-stream OR sub-reservoir) — mixed pHat scales across env/analytic/BRDF categories cause variance spikes. RTXDI's category separation works because of presample-tile semantics, not the separation itself.
- BRDF candidate stream with MIS-balance Li damping: the damping zeroes the BRDF candidate on diffuse surfaces AND in env-sun direction. RTXDI uses `RTXDI_LightBrdfMisWeight` (blended source pdf) which requires the presample tile.
- K=5 single-pass spatial: single snapshot amplifies fireflies on Bistro (+18%) even though Sponza wins. Multi-pass spatial reuse with reservoir ping-pong is the right architecture, not bigger K per pass.
- biasCorrection: Pairwise looks safer but the param-parity win is Basic (RTXDI's actual default). Pairwise stays in the codebase only for cross-surface reservoir merges (cell / temporal / spatial; see `project_pairwise_mis_cross_surface_principle`).