VisCache Dev Log

Cross-cutting findings, failed approaches, and reasoning that don't belong to a single ladder step. Step-by-step ladder records and the forward plan have moved out of this file:

Ladder Log — per-step ladder records (steps 00–18, the "narrowing chain" decisions, current canonical carries).
Ladder Plan — forward plan for steps 19–50+ (multilevel PT DI canonical, multilevel + WS-ReSTIR DI, multilevel + PT multibounce, multilevel + ReSTIR PT multibounce, BDPT open).

This file keeps:

Cross-cutting parity / structural-equivalence story (RTXDI baseline, restir_2d ≡ restir_3d).
Sampler artefacts that are reusable beyond the ladder (e.g. EmissivePdfMipmapSampler).
Failed approaches with their diagnoses (one paragraph each, anchored to dates / commits).
Cross-cutting reasoning paragraphs.

RTXDI Baseline — Final Result

Status: Functional + qualitative parity with RTXDI achieved on the seven-scene matrix; structural equivalence (restir_2d ≡ restir_3d) demonstrated within sampling noise.

Final canonical config

Knob	Value	Rationale
`WS_CELL_POOL_N`	128	Matches RTXDI tile-density target. 64→128 won Sponza_x4 −0.24pp; 128→256 diminishing.
`wsInitialCandidates` (K_pre)	32	Slim pre-pass; K=64 quality cost ~0.1pp avg — acceptable trade.
`wsCellPoolDrawK` (K_pool)	16	RTXDI K=24 budget. K_pool=24/64 retested with Conv A and B — both regress (over-weights pool's shading-agnostic distribution vs 8 fresh shading-conditional samples).
`wsMCap`	5	RTXDI default 20 tested — uniformly +0.1-0.3pp worse on multi-light scenes.
Pre-pass emissive sampler	PdfMipmap	New `EmissivePdfMipmapSampler` peer to Power/LightBVH. RTXDI-style hierarchical pdf-mipmap.
Main-pass emissive sampler	LightBVH (default)	Shading-conditional, required by BistroInt; mixed-PdfMipmap-main regressed +1.47pp.
Pool read convention	Conv B reader-eval	`1/sourcePdf` computed at READER's vertex via `emissiveSampler.evalPdf()` — RTXDI-faithful unbiased. Earlier writer-pdf Conv B caused fireflies (writer's r²/cos baked in).
Bayer N×N	4 (16 subframes)	RTXDI presample-budget alignment: 16K active pixels × K=8 ≈ 131K presamples = RTXDI's 128×1024.

Quality parity at x4 SPP vs RTXDI (mean OkLab err, 512²)

Scene	vanilla	RTXDI	restir (ours)	Δ vs RTXDI
CornellBox_1AreaLight	1.39	2.18	2.15	−0.03 win
CornellBox_1PointLight	0.21	1.39	0.21	−1.18 win
CornellBox_3AreaLights	2.97	2.60	3.55	+0.95 trail
CornellBox_32PointLights	5.36	3.73	3.31	−0.42 win
BistroExterior	18.12	13.23	10.88	−2.35 win
BistroInterior	16.96	10.73	9.54	−1.19 win
Sponza	6.23	7.08	6.49	−0.59 win

Net at x4: 6 wins / 0 parities / 1 trail. Cumulative −4.81pp ahead of RTXDI on aggregate.

The single remaining trail is CornellBox_3AreaLights (+0.95pp). Confirmed structural: per-cell pool architecture vs RTXDI's 1024-tile global structure produces different per-pixel candidate diversity profiles. No within-architecture parameter sweep equalizes them; closing it would require a true global tile structure.
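The win/trail tally and the aggregate figure can be re-derived from the table above. A minimal check, with the values transcribed by hand; the rule "any negative Δ is a win, any positive Δ is a trail" is an assumption about how the tally was counted, not a documented threshold:

```python
# Mean OkLab err at x4, transcribed from the table above as (RTXDI, restir).
RESULTS = {
    "CornellBox_1AreaLight": (2.18, 2.15),
    "CornellBox_1PointLight": (1.39, 0.21),
    "CornellBox_3AreaLights": (2.60, 3.55),
    "CornellBox_32PointLights": (3.73, 3.31),
    "BistroExterior": (13.23, 10.88),
    "BistroInterior": (10.73, 9.54),
    "Sponza": (7.08, 6.49),
}


def tally(results):
    """Return per-scene deltas plus (wins, parities, trails, cumulative delta)."""
    deltas = {scene: round(ours - rtxdi, 2) for scene, (rtxdi, ours) in results.items()}
    wins = sum(d < 0 for d in deltas.values())      # assumed rule: negative delta = win
    trails = sum(d > 0 for d in deltas.values())    # assumed rule: positive delta = trail
    parities = len(deltas) - wins - trails
    return deltas, wins, parities, trails, round(sum(deltas.values()), 2)


deltas, wins, parities, trails, cumulative = tally(RESULTS)
print(wins, parities, trails, cumulative)
```

Keeping the tally as a function of the raw table makes it easy to re-run when a scene is re-measured.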

Cost parity (shadow rays)

rays_traced_pct per the diagnostic counter (lower is better):

Scene_x4	RTXDI	restir	restir / RTXDI
Cornell_1AL	9.90	18.13	1.83×
Cornell_1PL	5.15	0.38	0.07×
Cornell_3AL	9.54	22.16	2.32×
Cornell_32PL	24.66	17.38	0.70×
BistroExterior	81.95	74.95	0.91×
BistroInterior	65.39	60.84	0.93×
Sponza	59.88	60.50	1.01× (parity)

Shadow-ray parity or better on five scenes; restir traces fewer rays on four of them (Cornell_1PL, Cornell_32PL, BistroExterior, BistroInterior) and matches on Sponza. Cornell_3AL/Cornell_1AL fire ~2× more rays because their K-RIS produces valid winners more often (visibility patterns differ from RTXDI's tile fill). The eval-cost gap (pre-pass uses a PathTracer instance, ~3-4× more light evaluations than RTXDI's lean compute presample) is plumbing, to be addressed by the lean dedicated compute pre-pass when ready (Task #29).

Structural equivalence — the proving result

restir_2d (RTXDI's exact data structure: pixel reservoir + screen-space tile pool) and restir_3d (3D-cell pool + per-pixel reservoir) produce identical results within sampling noise on every scene tested:

Scene_x4	restir_2d err	restir_3d err	|2d − 3d|
Cornell_1AL	2.15	2.16	0.01
Cornell_1PL	0.21	0.21	0.00
Cornell_3AL	3.55	3.55	0.00
Cornell_32PL	3.31	3.31	0.00
BistroExt	10.88	10.85	0.03
BistroInt	9.54	9.53	0.01
Sponza	6.49	6.47	0.02

|2d − 3d| ≤ 0.03pp on all scenes — well below the per-frame stochastic noise floor. This is the structural-equivalence claim from paper §3.0 made operational: the 3D-cell pool with footprint-derived entry level is structurally equivalent to RTXDI's 2D-tile pool at matching parameters. The novelty isn't the addressing scheme; it's the curve beyond. Setting the footprint-derived entry level to one screen tile recovers RTXDI's exact pool layout; beyond that operating point, 3D admits cross-tile world-space sharing that 2D cannot express.

Sampler artefact: `EmissivePdfMipmapSampler`

A clean Falcor-native peer to EmissiveUniformSampler/EmissiveLightBVHSampler/EmissivePowerSampler, registered as EmissiveLightSamplerType::PdfMipmap = 3 in the existing factory. CPU-side build from MeshLightTriangle.flux placed in z-curve mip-0 layout (using inlined RTXDI_LinearIndexToZCurve); Texture::generateMips builds the chain. Slang side inlines RTXDI_SamplePdfMipmap for hierarchical descent and returns solid-angle pdf via ls.pdf *= mipmapPdf, vanilla-NEE-compatible. Math validated 1.116% on Cornell_3AL vanilla x16 vs LightBVH 1.119% / Power 1.126% — within stochastic noise. RTXDI library files are untouched; the sampler reuses rtxdi/RtxdiMath.hlsli via include only. Reusable by any pass that wants RTXDI-style sampling.
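The build-then-descend idea can be sketched on the CPU. This is a toy model under stated assumptions, not the sampler's code: `zcurve_to_xy` is a generic Morton decode (the exact bit layout of RTXDI_LinearIndexToZCurve is not checked here), the mip chain is assumed to be a 2×2 box average, and all function names are hypothetical. It returns only the discrete selection probability, i.e. the `mipmapPdf` factor.

```python
import random


def zcurve_to_xy(index):
    """Decode a Morton (z-curve) index into (x, y) by de-interleaving its bits."""
    x = y = bit = 0
    while index >> (2 * bit):
        x |= ((index >> (2 * bit)) & 1) << bit
        y |= ((index >> (2 * bit + 1)) & 1) << bit
        bit += 1
    return x, y


def xy_to_zcurve(x, y):
    """Inverse of zcurve_to_xy: interleave the bits of x and y."""
    index = bit = 0
    while (x >> bit) or (y >> bit):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return index


def build_mip_chain(flux):
    """Mip 0 holds per-triangle flux in z-curve order; coarser levels are 2x2 box averages."""
    size = 1
    while size * size < len(flux):
        size *= 2
    level = [[0.0] * size for _ in range(size)]
    for i, f in enumerate(flux):
        x, y = zcurve_to_xy(i)
        level[y][x] = f
    chain = [level]
    while size > 1:
        size //= 2
        p = chain[-1]
        chain.append([[0.25 * (p[2 * y][2 * x] + p[2 * y][2 * x + 1]
                               + p[2 * y + 1][2 * x] + p[2 * y + 1][2 * x + 1])
                       for x in range(size)] for y in range(size)])
    return chain


def sample_triangle(chain, rng):
    """Descend from the 1x1 level, picking one of four children in proportion to its value."""
    x = y = 0
    pdf = 1.0
    for level in reversed(chain[:-1]):
        children = [(2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]
        weights = [level[cy][cx] for cx, cy in children]
        total = sum(weights)
        if total <= 0.0:
            return None, 0.0
        r = rng.random() * total
        chosen = None
        for child, w in zip(children, weights):
            if w <= 0.0:
                continue          # zero-flux (padding) texels are never selected
            chosen = (child, w)
            if r < w:
                break
            r -= w
        (x, y), w = chosen
        pdf *= w / total          # product telescopes to flux[i] / sum(flux)
    return xy_to_zcurve(x, y), pdf
```

Because the per-level ratios telescope, the returned probability equals the triangle's share of total flux, which is why multiplying it into the per-triangle pdf keeps the estimator compatible with vanilla NEE.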

Failed approaches (short list)

Conv B with stored solid-angle pdf — fireflies on Sponza_x4 (+6.18pp regression). Writer's r²/cos baked into stored 1/sourcePdf amplifies at distant-writer slots. Fix: reader-evaluated pdf.
Mixed PdfMipmap main + PdfMipmap pool — BistroInt_x4 +1.47pp regression. Main pass needs shading-conditional LightBVH for tight indoor geometry.
K_pool > 16 (24, 64) — over-weights pool's shading-agnostic distribution vs the 8 fresh shading-conditional samples. Both Conv A and Conv B regress.
wsMCap = 20 (RTXDI default) — uniformly +0.1-0.3pp worse on multi-light scenes. Stays at 5.
Bitterli RIS at insert with writer-pHat — biases pool toward writer's shading point, breaks cross-pixel reuse on heterogeneous lighting.
Drop main-pass fresh K-RIS (pool-only K=24) — regressed Sponza_x1 +9pp; fresh shading-conditional samples are required.
Spatial-reuse off (wsSpatialPixelsK=0) — confirmed not the Cornell_3AL bias source (Δ < 0.06pp).
Probabilistic V-aware fill at insert — preserves expected value (only changes variance); Sponza unchanged.
RTXDI BoilingFilter port — DISABLED 2026-05-05 (#if 0 in shader, block-commented in C++). Dispatch fires, host-side clearUAV on the same buffer moves the metric, but shader-side writes silently no-op. Suspect: locally-redeclared RWStructuredBuffer<WSReservoir> vs the working module-imported gVHFTable in VisCacheDecay. Lesson: silent no-op safety nets are worse than no safety net — they could mask future regressions. Future fix: split gWSPixelReservoirs into a separable include both WSReservoirIO and a fixed BoilingFilter can import.
accelDecayDisagreeThresh > 0 — Bistro art5 regresses 3–6× (BiE x16 21.8 → 132.8; BiI x16 29.9 → 93.1). Cause: half-decay-on-disagreement creates runaway oscillation on cells with legitimate mixed visibility. ad ∈ {0.05, 0.10, 0.30} all converge to the same (worse) attractor — empirically broken mechanism. Default off (BISTRO_ADD sweep, 2026-05-05).
Trust-gate sweeps at cell4×4 ct=2 on Sponza (vt, se, fd, cwf, posB-quant) — all combinations bit-identical (rays=73.48%, art5=23.36 — tested in step 17, step 18). The 26.5% rays-savings ceiling at this corner is structural, not gate-tunable: ct=2 itself is the bottleneck. Naive raise-base-ct (SPONZA_CT) breaks the saturation: ct=8 cuts art5 23.4→17.5 at x4. Lesson: trust-gate sweeps stop revealing leverage once the boot threshold itself is too low to accumulate the per-cell N needed to trust μ.
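Several entries above (K_pool, wsMCap, pool-only K-RIS) turn on how a reservoir mixes candidate streams and how much weight history is allowed to carry. For orientation, here is a generic streaming-RIS reservoir with a history cap, in the textbook ReSTIR form. It is a sketch, not the project's WSReservoir: the class, the function names, and the absolute-cap interpretation of `m_cap` are all assumptions.

```python
import random
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Reservoir:
    sample: Optional[Any] = None
    w_sum: float = 0.0   # running sum of resampling weights
    M: int = 0           # number of candidates this reservoir represents
    W: float = 0.0       # unbiased contribution weight of the kept sample

    def update(self, candidate, weight, rng, count=1):
        """Weighted reservoir step: keep `candidate` with probability weight / w_sum."""
        self.w_sum += weight
        self.M += count
        if weight > 0.0 and rng.random() * self.w_sum < weight:
            self.sample = candidate

    def finalize(self, p_hat):
        p = p_hat(self.sample) if self.sample is not None else 0.0
        self.W = self.w_sum / (self.M * p) if p > 0.0 and self.M > 0 else 0.0


def ris(candidates, source_pdf, p_hat, rng):
    """K-candidate RIS: each candidate is weighted by p_hat / source_pdf."""
    r = Reservoir()
    for x in candidates:
        r.update(x, p_hat(x) / source_pdf(x), rng)
    r.finalize(p_hat)
    return r


def merge_with_history(current, history, p_hat, m_cap, rng):
    """Combine two reservoirs, clamping the history's M to m_cap before it is weighted."""
    merged = Reservoir()
    for r, m in ((current, current.M), (history, min(history.M, m_cap))):
        weight = p_hat(r.sample) * r.W * m if r.sample is not None else 0.0
        merged.update(r.sample, weight, rng, count=m)
    merged.finalize(p_hat)
    return merged
```

In this form a smaller `m_cap` directly shrinks the history's resampling weight relative to the fresh candidates, which is the lever the wsMCap entry is about.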

Cache regime findings (cross-cutting)

These didn't fit any single ladder step's narrative; they emerged from the union of multiple sweeps and reframe earlier results.

Scene-class taxonomy is 4-row, not 2-row. (Class × bounce-depth.) Penumbra-class single-bounce DI (Sponza b=0): vt-tuning helps locally; perceptual-vs-linear metric tradeoff. Penumbra-class multibounce (Sponza b=1/4): cache delivers −74pp rays + OkLab match, PSNR/relmse worsens (linear-space loss is the perceptual cost of the cache's CV+RRR averaging). Firefly-class single-bounce DI (Bistro b=0): cache already at firefly floor — sweeping ct/vt/decay has no leverage but cache is winning −46pp art5 vs vanilla. Firefly-class multibounce (BistroInt b=1/4): cache wins on every metric (relmse 2.4× better, −53pp rays). The "Bistro framework doesn't generalize from Sponza" finding from BISTRO_CT was an artifact of single-bounce DI; multibounce closes the gap — every per-bounce firefly source is a fresh variance the cache amortizes via cell-level mean.
Bistro firefly-floor reframe. BISTRO_DECAY (decayPeriod sweep) and BISTRO_CT (4-corner ct/vt) both showed bit-identical art5 across all variants on Bistro single-bounce DI. The reframe (BISTRO_DECAY narrative): cache art5 42.87% / 29.93% (x4 / x16) vs vanilla 88.89% / 48.23% means the cache absorbs ~46pp / 18pp of vanilla's variance — the residual is irreducible firefly noise, not cache bias-lock. Bistro DI cache is working as designed, just at its theoretical ceiling. The mechanism that breaks the floor is multibounce, not more DI-level tuning. The "scene-classifier needed" follow-up from BISTRO_CT remains valid as a future direction (per-class auto-tune) but doesn't change the b=0 result.
vt has anti-correlated optima across metric families. SPONZA_VT at x16: vt=0.001 best art5 (15.21) but RMSE/relmse worse; vt=0.30 best RMSE/PSNR/relmse (relmse 0.09 vs 0.45 at tight vt) but worst art5 (28.42). art5 penalizes LOCAL spikes (firefly-region peaks); RMSE penalizes AVERAGE error. Tight vt kills firefly spots locally; loose vt smooths per-pixel noise globally. Implication: ship per-metric carry tables, not a universal vt — and any paper §11/§12 figure must report multiple metric families honestly.
vt is SPP-dependent. SPONZA_VT: x4 optimum vt≈0.10, x16 optimum vt≈0.001. Wilson-interval / two-tier ct (LADDER_PLAN improvement A) is the principled fix — Wilson lower-bound > 0.99 OR upper-bound < 0.01 collapses both regimes into one criterion.
rays_traced_pct ≠ wall-clock saving on ray-trace-cheap scenes. Cornell scenes have tiny geometry (small BVH, sub-millisecond ray cost). The cache's per-pixel infrastructure cost (hash query + atomic decay + cell-state update) is roughly constant per scene, dominated by the lookup machinery rather than the ray itself. So "94% rays saved on Cornell_1PL b=4" is an algorithmic finding, not a wall-clock claim — the saved rays were free to trace in the first place. Wall-clock wins require ray-cost > cache-infrastructure-cost, which holds on Sponza and BistroInterior, but not on Cornell_32PL (2.6 ms vanilla; cache infrastructure already exceeds vanilla's render cost). Pitch implication: report rays-saved as the algorithmic metric, gpu_tracepass_ms as the operational metric, and don't conflate them. Cornell-class scenes are useful as algorithm-validation but not wall-clock benchmarks.
The cache is designed for 1-SPP-per-frame + frame-accumulation real-time rendering. Every frame is a 1-SPP draw; consecutive frames warm cache state; wins emerge AT STEADY STATE under temporal coherence. A cold-start measurement (render 4-8 warmup frames, average a small window after) under-represents the real-world value because the cache hasn't reached cell-maturity equilibrium yet. Animated scenes benefit naturally: as the camera moves through space, locally-overlapping cells stay warm frame-to-frame; only the leading edge of newly-revealed regions pays cold-start cost, and that's amortized over many subsequent frames where those cells are hit again. Methodology corollary: TIMING measurements need long warmup (64+ frames) to reach the operating regime the cache was built for. Single-shot multi-SPP-per-frame measurements (vanilla x4 in one renderFrame call) are out-of-distribution for this cache and shouldn't be used as the wall-clock benchmark.
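The Wilson-interval criterion named in the vt entry above can be written down in a few lines. A minimal sketch: the 0.99 / 0.01 thresholds come from the text, while z = 1.96 and the per-cell `(visible, n)` counters are assumptions for illustration.

```python
import math


def wilson_interval(visible, n, z=1.96):
    """Wilson score interval for a binomial proportion visible / n."""
    if n == 0:
        return 0.0, 1.0
    p = visible / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def cell_is_trusted(visible, n, hi=0.99, lo=0.01, z=1.96):
    """Trust a cell only when it is confidently all-visible or confidently all-occluded."""
    lower, upper = wilson_interval(visible, n, z)
    return lower > hi or upper < lo


print(cell_is_trusted(4, 4), cell_is_trusted(400, 400), cell_is_trusted(200, 400))
```

The interval width depends on n, so a cell with a handful of unanimous samples cannot pass while a cell with hundreds can. That sample-count awareness is what would let one criterion cover both the x4 and x16 regimes instead of a per-SPP vt.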

Lessons distilled

Convention B requires reader-evaluated pdf. emissiveSampler.evalPdf() at the receiver's vertex; never store the writer's solid-angle pdf — its r²/cos factor amplifies into firefly tails at distant readers.
Data-structure equivalence is structural. 2D screen tile and 3D world cell are interchangeable at matched density; the mechanism is flat-multilevel-hash + reservoir reuse + RIS pool fill regardless of which one you address.
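The r²/cos problem in the first lesson is easy to see numerically. The sketch below converts an area-measure pdf to solid-angle measure at two different shading points; `solid_angle_pdf` is a hypothetical stand-in for what a reader-side `evalPdf()` call would return, and the geometry is made up for illustration.

```python
import math


def solid_angle_pdf(area_pdf, light_pos, light_normal, shading_pos):
    """Convert an area-measure pdf on the light to solid-angle measure at shading_pos."""
    d = [l - s for l, s in zip(light_pos, shading_pos)]   # shading point -> light
    r2 = sum(c * c for c in d)
    r = math.sqrt(r2)
    # Cosine between the light normal and the direction back toward the shading point.
    cos_light = -sum(n * c for n, c in zip(light_normal, d)) / r
    if cos_light <= 0.0:
        return 0.0
    return area_pdf * r2 / cos_light


light_pos, light_normal = (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)
writer = (0.0, 0.0, 0.0)     # 1 unit from the light
reader = (0.0, -3.0, 0.0)    # 4 units from the light

writer_pdf = solid_angle_pdf(1.0, light_pos, light_normal, writer)
reader_pdf = solid_angle_pdf(1.0, light_pos, light_normal, reader)

# Reusing the writer's stored 1/pdf at the reader over-weights the sample by this factor.
overweight = (1.0 / writer_pdf) / (1.0 / reader_pdf)
print(writer_pdf, reader_pdf, overweight)
```

In this setup the stored writer weight is 16× too large at the reader, purely from the distance term. Re-evaluating the pdf at the reader's vertex removes the mismatch.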

RTXDI param-parity audit (2026-05-15)

Status across the F17P24 baseline after a multi-iteration sweep:

Knob	RTXDI default	Our F17P24 default
localLightCandidateCount	24	24 (pool) ✓
infiniteLightCandidateCount	8	~5.67 (uniform-fresh×selectLightType)
envLightCandidateCount	8	~5.67 (uniform-fresh×selectLightType)
brdfCandidateCount	1	0 (tried, no win)
testCandidateVisibility	true	true ✓
biasCorrection	Basic	Basic ✓ (5be5db0)
samplingRadius	30	30 ✓
spatialSampleCount	1	1 ✓
spatialIterations	5	1 ← largest unmatched
maxHistoryLength	20	mCap=20 ✓
boilingFilterStrength	0	0 ✓
presampledTileCount × Size	128 × 1024	N/A (cell-pool architecture)

Quality status at SPP=4 with the locked F17P24 Basic default:

Scene	err% vs RTXDI	art5% vs RTXDI	rmse vs RTXDI
Cornell_1PL	beats (90%)	beats (97%)	beats (96%)
Cornell_1AL	beats (48%)	beats (38%)	beats (54%)
Cornell_3AL	beats (-1%)	matches (-7%)	beats (51%)
Cornell_32PL	beats (26%)	beats (60%)	matches (-20%)
BistroInterior	beats (12%)	matches	trails (+11%)
Sponza	beats (12%)	matches	trails (+19%)

art5 (local-spike penalty) is at parity or beating RTXDI on every scene. err% (OkLab perceptual) beats RTXDI on every scene. Residual rmse trails on Bistro/Sponza — attributed to RTXDI's 5-iteration spatial cascade (our 1-pass spatial reuse cannot recover the same variance reduction without multi-pass ping-pong infrastructure).

Optimization log — algorithm-preserving wins (2026-05-19)

Committed optimizations that don't compromise algorithm or params. Quality verified identical (same K=41, same biasCorrection=Basic, same mCap=20, same sampler) on Bistro+Sponza at x64 with LADDER_TIMING_MODE=1 + N_WARMUP=16 + bayerN=4 (16-frame Bayer cycle, profiler stats.mean, EMA bypassed).

Commit ladder

Commit	Optimization	Quality delta
f8b548e	USE_VISCACHE_NORMAL_ADDR gate	identical
b7d1a86	gNormalAddr removed entirely (–71 LOC)	identical
09cf651	wsCellPoolPrePass=False canonical (R2dP2d + R3dP3d)	identical
662700b	Cross-variant prepass A/B (RDI00_PrepassAB)	verified 6/6
ccbf5b1	3 dead cbuffer fields removed (NormalAFine, DiagAccumWindow, LightSoftness)	identical
e338acb	LADDER_TIMING_BREAKDOWN env var bypasses capture-file cache	tool fix
4f78aee	Orphan C++ struct fields dropped (Phase A; followup to ccbf5b1, −53 LOC)	identical
4b32125	useCellInRIS dropped — collapsed into (spatialNeighbours > 0) (Phase B)	drift −0.20% (RNG floor)
695282e	enableCellPool dropped — collapsed into (cellPoolFootprintPx > 0) (Phase C-light)	drift −0.17% (RNG floor)

Final verified per-variant numbers, x64

Variant	Bistro ms	Sponza ms	Bistro rmse	Sponza rmse
RTXDI reference	1.30	1.04	97.9	0.376
F17P24 prepass-off (canonical)	5.14	4.80	43.6	0.133
PureKRIS F8 (no prepass at all)	3.75	3.69	45.1	0.147
R3dP3d prepass-off (canonical)	2.95	3.12	65.8	0.176

Clean A/B at current HEAD — cross-variant, cross-scene (RDI00_PrepassAB ladder, same K=41, same biasCorrection=Basic within each pair, only wsCellPoolPrePass flipped, x64):

Scene	Variant	On ms	Off ms	Δ	rmse delta
Bistro	R2dP2d F17P24	7.00	5.11	−27.0%	+0.01%
Bistro	R3dP3d F00P24	4.70	2.76	−41.1%	−0.00%
Sponza	R2dP2d F17P24	6.04	4.39	−27.4%	+0.05%
Sponza	R3dP3d F00P24	4.79	3.04	−36.6%	+0.00%
Cornell_32PL	R2dP2d F17P24	3.01	2.32	−23.2%	+0.00%
Cornell_32PL	R3dP3d F00P24	2.51	1.80	−28.3%	+0.00%

Prepass-off is a universal x64 win across 6/6 scene-variant cells (−23% to −41%). R3dP3d benefits MORE than R2dP2d on every scene because its F00 (no fresh K-RIS) made the prepass a larger fraction of total cost. |rmse delta| ≤ 0.05% everywhere, well within measurement noise, which supports algorithm-neutrality.

The earlier "Sponza +20% regression" was a contaminated baseline (different measurement context, EMA vs stats.mean, different warmup state) — corrected by the clean cross-variant A/B above.

At low SPP (x4), prepass-on can win on heavy scenes (Bistro x4: 7.67 vs 11.23 ms) because the prepass IS the pool-warmup mechanism. With N_WARMUP=16 (one full Bayer cycle), x16+ steady-state always favors prepass-off.

Why prepass-off is algorithm-neutral: main-pass cellPoolInsert (PathTracer.slang:1197) already populates pool slots from K-RIS winners. The prepass was a redundant pool-fill dispatch. With N_WARMUP=16 + bayerN=4, the pool reaches steady state by frame 16 regardless of prepass; rmse verified identical across all five RDI00 scenes (Cornell_1AL/3AL/32PL, Bistro, Sponza).

Quality wins preserved

Scene	Ours rmse	RTXDI rmse	Our advantage
Bistro	43.6	97.9	2.2× better
Sponza	0.133	0.376	2.8× better

Dead variants disabled from default RDI00 ladder (callable in VisCache_LadderCommon.py if needed):

NoPrepass — REDUNDANT (canonical IS prepass-off now)
PureKRIS_F04 — K-scaling probe complete, fixed overhead confirmed
PoolOnly F00P24 — quality worse than RTXDI, fresh K-RIS irreplaceable
K5Spatial — single-pass K=5 amplifies Bistro fireflies (+18% rmse)
BrdfRis — no rmse improvement (commit log)

Net: ladder runtime ~halved (10 variants → 5 active per scene), and the two RTXDIBaseline variants both inherit the prepass-off win at identical quality. Speed gap to RTXDI now 2.3-4.6× depending on variant, down from 5-6× pre-optimization, while preserving the 2.2-2.8× rmse advantage.

Verified via dedicated A/B: scripts/VisCache_LadderRDI00_PrepassAB.py isolates the prepass flip at current HEAD with all other optimizations frozen — confirms −27% on both Bistro AND Sponza at x64 with rmse identical (algorithm-neutrality preserved).

Timing + RTXDI cost comparison (caveats)

Falcor's events[k]["average"] is an EMA (σ=0.98) and survives resetStats(). Use events[k]["stats"]["mean"] for true per-call arithmetic mean.
LADDER_TIMING_MODE=1 disables VisCache diagnostic-texture writes (~90% of our per-frame GPU cost). Use for timing benchmarks; default-on for quality plates.
Honest steady-state at x16, diagnostics off: ~2.5× slower per frame than RTXDI on Bistro/Sponza. Quality EXCEEDS RTXDI (rmse −41% Bistro, −44% Sponza at x16). Trade: quality-per-frame vs quality-per-ms. RTXDI saturates at mCap=20; our metrics improve monotonically with SPP.
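Why the EMA misleads after a reset can be shown with a toy timer. This is not Falcor's profiler class; the update rule `ema = σ·ema + (1 − σ)·x`, the first-sample initialisation, and every name below are assumptions. Only σ = 0.98 and the "EMA survives the reset" behaviour come from the note above.

```python
class ToyTimer:
    """Stand-in for one profiler event: keeps an EMA and a resettable sample list."""

    def __init__(self, sigma=0.98):
        self.sigma = sigma
        self.ema = None
        self.samples = []

    def add(self, ms):
        if self.ema is None:
            self.ema = ms                      # assumed: first sample seeds the EMA
        else:
            self.ema = self.sigma * self.ema + (1.0 - self.sigma) * ms
        self.samples.append(ms)

    def reset_stats(self):
        self.samples.clear()                   # the EMA deliberately survives

    def mean(self):
        return sum(self.samples) / len(self.samples)


timer = ToyTimer()
for _ in range(16):        # warmup frames at 10 ms
    timer.add(10.0)
timer.reset_stats()
for _ in range(64):        # measured frames at 5 ms
    timer.add(5.0)

print(timer.mean(), timer.ema)
```

After 64 measured frames the arithmetic mean reports the true 5 ms, while the EMA still carries roughly a quarter of the warmup offset. A contaminated baseline of this kind is consistent with the "Sponza +20% regression" misreading described earlier.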

Bayer-cascade convergence

At x16 SPP our F17P24 cell-pool architecture beats RTXDI on every metric across Bistro+Sponza. RTXDI's metrics DEGRADE x4→x16 (M-cap saturation reusing stale fireflies); ours monotonically improve. The Bayer-coordinated cell-pool fill IS our equivalent of RTXDI's per-frame 5-pass cascade — stretched in time across N²=16 subframes. We don't need to port RTXDI's compute-presample-tile or 5-iteration cascade.
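The N² = 16 subframe cycle and the presample-budget arithmetic from the config table can be checked with a small sketch. The standard ordered-dither Bayer index matrix is used here as an assumption; the project's actual subframe ordering may differ, and the function names are hypothetical.

```python
def bayer_matrix(n):
    """Recursive ordered-dither (Bayer) index matrix; n must be a power of two."""
    if n == 1:
        return [[0]]
    h = n // 2
    half = bayer_matrix(h)
    out = [[0] * n for _ in range(n)]
    for y in range(h):
        for x in range(h):
            v = 4 * half[y][x]
            out[y][x] = v              # top-left
            out[y][x + h] = v + 2      # top-right
            out[y + h][x] = v + 3      # bottom-left
            out[y + h][x + h] = v + 1  # bottom-right
    return out


def count_active(width, height, frame, n=4):
    """Number of pixels whose Bayer index matches this subframe of the n*n cycle."""
    m = bayer_matrix(n)
    slot = frame % (n * n)
    return sum(1 for y in range(height) for x in range(width) if m[y % n][x % n] == slot)


active = count_active(512, 512, frame=0)
presamples = active * 8            # K=8 per active pixel, from the config table
print(active, presamples, 128 * 1024)
```

Each pixel is active exactly once per 16-frame cycle, so the pool fill is spread over time rather than repeated within a frame, which is the sense in which the Bayer cycle stands in for a per-frame cascade.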

RTXDI-parity attempts that did NOT help (pitfalls)

Category-quota K-RIS (single-stream OR sub-reservoir) — mixed pHat scales across env/analytic/BRDF categories cause variance spikes. RTXDI's category separation works because of presample-tile semantics, not the separation itself.
BRDF candidate stream with MIS-balance Li damping — the damping zeroes the BRDF candidate on diffuse surfaces AND in env-sun direction. RTXDI uses RTXDI_LightBrdfMisWeight (blended source pdf) which requires the presample tile.
K=5 single-pass spatial — single snapshot amplifies fireflies on Bistro (+18%) even though Sponza wins. Multi-pass spatial reuse with reservoir ping-pong is the right architecture, not bigger K per pass.
biasCorrection — Pairwise looks safer but the param-parity win is Basic (RTXDI's actual default). Pairwise stays in the codebase only for cross-surface reservoir merges (cell / temporal / spatial — see project_pairwise_mis_cross_surface_principle).
