Notes and lessons learned during development.
- Always run validation before committing code. Use scripts/validate_before_commit.sh (8 CPU tests) and the Colab notebook (14 tests including GPU) to verify output matches analytical solutions. Even "structural only" changes can introduce bugs.
- Colab T4 timing varies between sessions — shared hardware means 20-50% variation is normal. Compare trends, not individual numbers.
- Stale binaries on Colab — if you %cd into a directory from a previous clone without rebuilding, you'll run old code. Always rebuild after a fresh clone.
abs() from <stdlib.h> takes an int, so floating-point arguments are silently truncated to int. At nanometer scales (radius = 8E-6), abs(0.000003) becomes 0. Wall collision detection was completely broken. Use fabsf() for float or fabs() for double.
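A minimal Python sketch of the failure mode, mimicking C's integer abs() (which converts its argument to int before taking the magnitude):

```python
# Hedged Python mimic of C's <stdlib.h> abs(): the argument is converted
# to int (truncated toward zero) before the magnitude is taken.
def c_abs(x):
    return abs(int(x))

radius = 8e-6
dx = 3e-6                      # the 0.000003 from the bug above
assert c_abs(dx) == 0          # magnitude silently destroyed
assert abs(dx) == 3e-6         # fabs()-style floating-point abs keeps it
assert c_abs(dx) < radius      # so the wall check never fires
```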
ceil(iter/blockSize) where both are int — the division truncates before ceil sees it. ceil(7) is still 7. Particles in the last partial block were never simulated. Fix: (iter + blockSize - 1) / blockSize.
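The truncation can be sketched in Python, using integer division (//) to mirror C's int/int division (the particle count and block size here are illustrative):

```python
import math

def grid_size_buggy(n: int, block: int) -> int:
    # ceil() receives an already-truncated integer, so it has nothing to round.
    return math.ceil(n // block)

def grid_size_fixed(n: int, block: int) -> int:
    # Integer round-up idiom: (n + block - 1) / block
    return (n + block - 1) // block

# 1000 particles with 128-thread blocks need 8 blocks, not 7 —
# the buggy version leaves the last 104 particles unsimulated.
assert grid_size_buggy(1000, 128) == 7
assert grid_size_fixed(1000, 128) == 8
```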
x_new and y_new in wall reflection were single device-allocated doubles shared across all threads. Multiple threads writing simultaneously produces garbage. Fixed with per-thread stack-local variables.
d_reflection returns absolute world-space coordinates (wall intersection point + reflected overshoot). But the caller did d_x[idx] = x_last + refl_x, double-counting x_last which is already baked into the result. Proof: particle at (8,0) reflecting off radius-10 wall correctly returns (8,0), but 8 + 8 = 16 is outside the wall. Fix: d_x[idx] = refl_x. This bug existed in the original Fast_wide.reflect thesis code (the printf("uhhhh\n") debug line was likely related). Wall reflection was discussed in the thesis but never successfully validated — all thesis validation figures are free-space tests.
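A 1D analogue (a flat wall at x = R rather than the actual circular geometry) reproduces the double counting with the same numbers as the proof above:

```python
R = 10.0                       # wall position (1D stand-in for the radius-10 wall)

def reflect(x_candidate: float) -> float:
    """Return the reflected position as an ABSOLUTE coordinate,
    like d_reflection: wall point minus the overshoot."""
    return 2.0 * R - x_candidate

x_last = 8.0
x_candidate = 12.0             # diffusion step carried the particle outside
refl_x = reflect(x_candidate)  # 8.0 — already a world-space position

buggy = x_last + refl_x        # 16.0: x_last double-counted, outside the wall
fixed = refl_x                 # 8.0: correct

assert buggy == 16.0 and buggy > R
assert fixed == 8.0 and fixed <= R
```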
Using powf, sqrtf, sinf, cosf (float precision) but storing in double variables. Gives false confidence in precision — intermediate results are truncated to float. Use matching precision throughout.
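A NumPy analogue of the mixed-precision trap, with np.float32 standing in for the float intermediates produced by sqrtf and friends:

```python
import numpy as np

x = np.float64(1.0) + np.float64(1e-12)   # representable in double, not in float

y_mixed = np.float64(np.sqrt(np.float32(x)))  # sqrtf-style: float intermediate
y_double = np.sqrt(x)                          # sqrt-style: double throughout

assert y_mixed == 1.0      # the 1e-12 perturbation vanished in the float step
assert y_double > 1.0      # double precision preserves it
```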
The thesis approach of curand_init(clock64(), idx, 0, &state) per kernel call was deliberately chosen and is 4-5x faster than global memory RNG state on a T4. A/B test results (10k particles × 10k steps, same session):
- Per-call clock64(): ~27ms
- Global memory state: ~117ms
Why: Philox state created on the stack lives in registers (~1 cycle access). State in global memory costs ~400-800 cycles per read/write. The curand_init cost for Philox is cheap (counter-based, just sets a few integers) — much less than the global memory round-trip of 64 bytes × 2 (read+write) × 10k threads per timestep.
The thesis breakthrough was two things working together:
- Choosing Philox (cheap init) over XORWOW (expensive init)
- Using clock64() for fresh seeds without global memory state
Trade-off: clock64() can produce correlated streams when threads launch at the same clock tick, but thesis validation (cross-correlation vs MATLAB randn) confirmed acceptable quality for the physics. When --seed is provided, we use the fixed seed instead of clock64() for reproducibility — same performance, deterministic output.
NRMSE on histograms is bin-dependent (change binning, change result), has no known null distribution, and is not a formal statistical test. The Kolmogorov-Smirnov test compares empirical CDF against theoretical CDF with no binning, provides a rigorous p-value, and is consistent. NRMSE kept as supplementary human-readable metric. References: NIST Handbook, Law (2015), Devroye (1986).
The 3D H_Diff PDF integrates to Pr(hit) = a/d < 1, not to 1. Observed hitting times are samples from the conditional distribution (given that the particle hit). KS test must compare against H_Diff normalized to integrate to 1, not the raw PDF. Hit probability validated separately with a binomial test. This is standard practice: Noel et al. (2014), Yilmaz et al. (2014).
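The separate hit-probability check can be sketched with scipy.stats.binomtest; the radius/distance values and hit count below are illustrative, not the project's test parameters:

```python
from scipy.stats import binomtest

a, d = 1e-6, 4e-6        # receiver radius and transmitter distance (made up)
p_hit = a / d            # theoretical Pr(hit) = a/d = 0.25

# Suppose 2,480 of 10,000 simulated particles hit the receiver:
result = binomtest(k=2480, n=10_000, p=p_hit)
assert result.pvalue > 0.05   # consistent with the theoretical hit probability
```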
At alpha = 0.05, ~5% of CI runs would false-fail even with correct code; at alpha = 0.001, the false-positive rate drops to 0.1%. With n = 10,000 the KS test still detects CDF shifts of ~0.019, small enough to catch any physically meaningful bug.
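The ~0.019 figure follows from the standard asymptotic two-sided KS critical value, sqrt(-ln(alpha/2) / (2n)):

```python
import math

def ks_critical(alpha: float, n: int) -> float:
    # Asymptotic two-sided Kolmogorov-Smirnov critical value
    return math.sqrt(-math.log(alpha / 2.0) / (2.0 * n))

d_crit = ks_critical(alpha=0.001, n=10_000)
assert 0.019 < d_crit < 0.020   # ~0.0195: smallest reliably detectable CDF shift
```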
No numerical integration needed. scipy.stats.invgauss with mu = 2*D/(b*v), scale = b^2/(2*D) gives exact CDF. Derivation: our PDF p(t) = b/sqrt(4*pi*D*t^3) * exp(-(v*t-b)^2/(4*D*t)) is a standard inverse Gaussian with mean mu = b/v and shape lambda = b^2/(2*D). scipy's parameterization absorbs lambda into scale.
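The parameter mapping can be verified numerically against scipy; the D, v, b values below are illustrative, not the project's defaults:

```python
import numpy as np
from scipy.stats import invgauss

D, v, b = 1e-9, 1e-3, 1e-6     # diffusion coefficient, drift, distance (made up)

def our_pdf(t):
    # p(t) = b / sqrt(4*pi*D*t^3) * exp(-(v*t - b)^2 / (4*D*t))
    return b / np.sqrt(4 * np.pi * D * t**3) * np.exp(-((v * t - b) ** 2) / (4 * D * t))

mu_scipy = 2 * D / (b * v)      # scipy's mu = mean/shape = (b/v) / (b^2/(2*D))
scale = b**2 / (2 * D)          # scipy's scale = shape lambda = b^2/(2*D)

t = np.linspace(1e-4, 5e-3, 200)   # grid around the mean hitting time b/v
assert np.allclose(our_pdf(t), invgauss.pdf(t, mu_scipy, scale=scale), rtol=1e-9)
```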
The 3D H_Diff PDF integrates to Pr(hit) < 1, but our samples are only from particles that hit. We must compare against H_Diff normalized to 1 (conditional distribution). CDF built by cumulative_trapezoid on a 500k-point grid, then normalized. interp1d creates a callable CDF for scipy.stats.kstest. This is standard: see Noel et al. (2014), Jamali et al. (2019).
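The normalize-then-KS pipeline, sketched with a stand-in density (0.3 × an exponential PDF plays the role of the unnormalized H_Diff, so Pr(hit) = 0.3):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.interpolate import interp1d
from scipy.stats import expon, kstest

t = np.linspace(1e-6, 50.0, 500_000)
pdf = 0.3 * expon.pdf(t)                 # integrates to Pr(hit) = 0.3, not 1

cdf = cumulative_trapezoid(pdf, t, initial=0.0)
cdf /= cdf[-1]                           # normalize: conditional-on-hit CDF

cond_cdf = interp1d(t, cdf, bounds_error=False, fill_value=(0.0, 1.0))

# Hitting times conditioned on a hit follow the normalized distribution,
# so the KS test against cond_cdf should not reject.
samples = expon.rvs(size=5_000, random_state=np.random.default_rng(0))
stat, p = kstest(samples, cond_cdf)
assert p > 0.001
```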
The PDF as written shows -(v*t - b^2) but should be -(v*t - b)^2: the square applies to the full expression (v*t - b), not just b. All code (MATLAB and Python) implements the correct form. Verified against Kadloor et al. 2012 Eq. 12.
See detailed entry under Bugs Found.
Items identified during deep review, to be addressed:
- KS test as formal pass/fail (done — 1D uses scipy.stats.invgauss, 3D uses numerical CDF + binomial hit rate test)
- CSV NaN filtering — scripts don't skip headers, MATLAB skips 2 rows
- np.trapezoid compatibility (done — pinned numpy>=2.0, scipy>=1.11 in requirements.txt)
- Dead QQ plot code in validate_rng.py (done — removed dead code, dropped try/except for scipy since it's pinned)
- 3D test at 1k paths (done — NRMSE gated to >= 1000 hits per Wasserman 2004 Theorem 20.9: histogram RMSE converges O(n^{-1/3}); below 1000 hits, shot noise dominates)
- Wall reflection validation (done — ported TobyThesisTest_walls.m, added --start-x/y/z params for transmitter positioning, GPU-only test)
- Multi-parameter 1D test (done — added Test 1b with dist=1E-7, vel=3E-4, dt=1E-8 matching TobyThesisTest_dist.m; added --velocity CLI flag)
- Timestep convergence test (skipped — Euler-Maruyama is exact for constant-coefficient SDEs; the only dt-dependent error is hit-detection boundary crossing, O(sqrt(dt)) per Gobet 2000, which would require ~25M samples to distinguish dt=1E-7 from 1E-8; redundant with existing KS tests at two dt values)
- Inter-stream RNG correlation (skipped — Philox4_32_10 is counter-based, so inter-stream correlation is eliminated by construction per Salmon et al. SC11; our usage pattern (same seed, distinct subsequences) is the canonical cuRAND recommendation; the KS test against analytical solutions is a stronger end-to-end check)
- Edge cases (done — KS/binomial tests skip with a message when n < 20 per Conover 1999; low-power warning at n < 50; empty CSV already handled by load_hit_times)
- Device-side hit detection was the biggest win — moving hit checks into the kernel and eliminating per-timestep cudaMemcpy gave 25-224x GPU speedup.
- GPU speedup scales with simulation length: more timesteps = more amortized kernel launch overhead.
- CPU cleanup (stack vars, reduced redundant sqrt calls) gave ~2x CPU speedup.
- Apple Silicon M-series is ~2x faster than Colab Xeon for single-threaded CPU work.
- Per-call Philox RNG with clock64() is 4-5x faster than global memory state — register-local stack state beats global memory round-trips. Don't assume "initialize once" is always faster; memory hierarchy matters more than init cost for lightweight RNGs.