Notes and lessons learned during development.
- Always run validation before committing code. Use scripts/validate_before_commit.sh (8 CPU tests) and the Colab notebook (14 tests including GPU) to verify output matches analytical solutions. Even "structural only" changes can introduce bugs.
- Colab T4 timing varies between sessions — shared hardware means 20-50% variation is normal. Compare trends, not individual numbers.
- Stale binaries on Colab — if you %cd into a directory from a previous clone without rebuilding, you'll run old code. Always rebuild after a fresh clone.
abs() from <stdlib.h> takes an int, so floating-point arguments are silently truncated to int. At nanometer scales (radius = 8E-6), abs(0.000003) becomes 0. Wall collision detection was completely broken. Use fabsf() for float or fabs() for double.
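A minimal Python sketch of the failure mode, mimicking C's integer abs() (which converts its argument to int before taking the magnitude):

```python
# Hedged Python mimic of C's <stdlib.h> abs(): the argument is converted
# to int (truncated toward zero) before the magnitude is taken.
def c_abs(x):
    return abs(int(x))

radius = 8e-6
dx = 3e-6                      # the 0.000003 from the bug above
assert c_abs(dx) == 0          # magnitude silently destroyed
assert abs(dx) == 3e-6         # fabs()-style floating-point abs keeps it
assert c_abs(dx) < radius      # so the wall check never fires
```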
ceil(iter/blockSize) where both are int — the division truncates before ceil sees it. ceil(7) is still 7. Particles in the last partial block were never simulated. Fix: (iter + blockSize - 1) / blockSize.
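The truncation can be sketched in Python, using integer division (//) to mirror C's int/int division (the particle count and block size here are illustrative):

```python
import math

def grid_size_buggy(n: int, block: int) -> int:
    # ceil() receives an already-truncated integer, so it has nothing to round.
    return math.ceil(n // block)

def grid_size_fixed(n: int, block: int) -> int:
    # Integer round-up idiom: (n + block - 1) / block
    return (n + block - 1) // block

# 1000 particles with 128-thread blocks need 8 blocks, not 7 —
# the buggy version leaves the last 104 particles unsimulated.
assert grid_size_buggy(1000, 128) == 7
assert grid_size_fixed(1000, 128) == 8
```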
x_new and y_new in wall reflection were single device-allocated doubles shared across all threads. Multiple threads writing simultaneously produces garbage. Fixed with per-thread stack-local variables.
d_reflection returns absolute world-space coordinates (wall intersection point + reflected overshoot). But the caller did d_x[idx] = x_last + refl_x, double-counting x_last which is already baked into the result. Proof: particle at (8,0) reflecting off radius-10 wall correctly returns (8,0), but 8 + 8 = 16 is outside the wall. Fix: d_x[idx] = refl_x. This bug existed in the original Fast_wide.reflect thesis code (the printf("uhhhh\n") debug line was likely related). Wall reflection was discussed in the thesis but never successfully validated — all thesis validation figures are free-space tests.
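A 1D analogue (a flat wall at x = R rather than the actual circular geometry) reproduces the double counting with the same numbers as the proof above:

```python
R = 10.0                       # wall position (1D stand-in for the radius-10 wall)

def reflect(x_candidate: float) -> float:
    """Return the reflected position as an ABSOLUTE coordinate,
    like d_reflection: wall point minus the overshoot."""
    return 2.0 * R - x_candidate

x_last = 8.0
x_candidate = 12.0             # diffusion step carried the particle outside
refl_x = reflect(x_candidate)  # 8.0 — already a world-space position

buggy = x_last + refl_x        # 16.0: x_last double-counted, outside the wall
fixed = refl_x                 # 8.0: correct

assert buggy == 16.0 and buggy > R
assert fixed == 8.0 and fixed <= R
```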
Using powf, sqrtf, sinf, cosf (float precision) but storing in double variables. Gives false confidence in precision — intermediate results are truncated to float. Use matching precision throughout.
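A NumPy analogue of the mixed-precision trap, with np.float32 standing in for the float intermediates produced by sqrtf and friends:

```python
import numpy as np

x = np.float64(1.0) + np.float64(1e-12)   # representable in double, not in float

y_mixed = np.float64(np.sqrt(np.float32(x)))  # sqrtf-style: float intermediate
y_double = np.sqrt(x)                          # sqrt-style: double throughout

assert y_mixed == 1.0      # the 1e-12 perturbation vanished in the float step
assert y_double > 1.0      # double precision preserves it
```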
The thesis approach of curand_init(clock64(), idx, 0, &state) per kernel call was deliberately chosen and is 4-5x faster than global memory RNG state on a T4. A/B test results (10k particles × 10k steps, same session):
- Per-call clock64(): ~27ms
- Global memory state: ~117ms
Why: Philox state created on the stack lives in registers (~1 cycle access). State in global memory costs ~400-800 cycles per read/write. The curand_init cost for Philox is cheap (counter-based, just sets a few integers) — much less than the global memory round-trip of 64 bytes × 2 (read+write) × 10k threads per timestep.
The thesis breakthrough was two things working together:
- Choosing Philox (cheap init) over XORWOW (expensive init)
- Using clock64() for fresh seeds without global memory state
Trade-off: clock64() can produce correlated streams when threads launch at the same clock tick, but thesis validation (cross-correlation vs MATLAB randn) confirmed acceptable quality for the physics. When --seed is provided, we use the fixed seed instead of clock64() for reproducibility — same performance, deterministic output.
NRMSE on histograms is bin-dependent (change binning, change result), has no known null distribution, and is not a formal statistical test. The Kolmogorov-Smirnov test compares empirical CDF against theoretical CDF with no binning, provides a rigorous p-value, and is consistent. NRMSE kept as supplementary human-readable metric. References: NIST Handbook, Law (2015), Devroye (1986).
The 3D H_Diff PDF integrates to Pr(hit) = a/d < 1, not to 1. Observed hitting times are samples from the conditional distribution (given that the particle hit). KS test must compare against H_Diff normalized to integrate to 1, not the raw PDF. Hit probability validated separately with a binomial test. This is standard practice: Noel et al. (2014), Yilmaz et al. (2014).
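The separate hit-probability check can be sketched with scipy.stats.binomtest; the radius/distance values and hit count below are illustrative, not the project's test parameters:

```python
from scipy.stats import binomtest

a, d = 1e-6, 4e-6        # receiver radius and transmitter distance (made up)
p_hit = a / d            # theoretical Pr(hit) = a/d = 0.25

# Suppose 2,480 of 10,000 simulated particles hit the receiver:
result = binomtest(k=2480, n=10_000, p=p_hit)
assert result.pvalue > 0.05   # consistent with the theoretical hit probability
```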
At alpha = 0.05, ~5% of CI runs would false-fail even with correct code; at alpha = 0.001, the false-positive rate drops to 0.1%. With n = 10,000 the KS test still detects CDF shifts of ~0.019, small enough to catch any physically meaningful bug.
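The ~0.019 figure follows from the standard asymptotic two-sided KS critical value, sqrt(-ln(alpha/2) / (2n)):

```python
import math

def ks_critical(alpha: float, n: int) -> float:
    # Asymptotic two-sided Kolmogorov-Smirnov critical value
    return math.sqrt(-math.log(alpha / 2.0) / (2.0 * n))

d_crit = ks_critical(alpha=0.001, n=10_000)
assert 0.019 < d_crit < 0.020   # ~0.0195: smallest reliably detectable CDF shift
```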
No numerical integration needed. scipy.stats.invgauss with mu = 2*D/(b*v), scale = b^2/(2*D) gives exact CDF. Derivation: our PDF p(t) = b/sqrt(4*pi*D*t^3) * exp(-(v*t-b)^2/(4*D*t)) is a standard inverse Gaussian with mean mu = b/v and shape lambda = b^2/(2*D). scipy's parameterization absorbs lambda into scale.
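The parameter mapping can be verified numerically against scipy; the D, v, b values below are illustrative, not the project's defaults:

```python
import numpy as np
from scipy.stats import invgauss

D, v, b = 1e-9, 1e-3, 1e-6     # diffusion coefficient, drift, distance (made up)

def our_pdf(t):
    # p(t) = b / sqrt(4*pi*D*t^3) * exp(-(v*t - b)^2 / (4*D*t))
    return b / np.sqrt(4 * np.pi * D * t**3) * np.exp(-((v * t - b) ** 2) / (4 * D * t))

mu_scipy = 2 * D / (b * v)      # scipy's mu = mean/shape = (b/v) / (b^2/(2*D))
scale = b**2 / (2 * D)          # scipy's scale = shape lambda = b^2/(2*D)

t = np.linspace(1e-4, 5e-3, 200)   # grid around the mean hitting time b/v
assert np.allclose(our_pdf(t), invgauss.pdf(t, mu_scipy, scale=scale), rtol=1e-9)
```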
The 3D H_Diff PDF integrates to Pr(hit) < 1, but our samples are only from particles that hit. We must compare against H_Diff normalized to 1 (conditional distribution). CDF built by cumulative_trapezoid on a 500k-point grid, then normalized. interp1d creates a callable CDF for scipy.stats.kstest. This is standard: see Noel et al. (2014), Jamali et al. (2019).
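The normalize-then-KS pipeline, sketched with a stand-in density (0.3 × an exponential PDF plays the role of the unnormalized H_Diff, so Pr(hit) = 0.3):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.interpolate import interp1d
from scipy.stats import expon, kstest

t = np.linspace(1e-6, 50.0, 500_000)
pdf = 0.3 * expon.pdf(t)                 # integrates to Pr(hit) = 0.3, not 1

cdf = cumulative_trapezoid(pdf, t, initial=0.0)
cdf /= cdf[-1]                           # normalize: conditional-on-hit CDF

cond_cdf = interp1d(t, cdf, bounds_error=False, fill_value=(0.0, 1.0))

# Hitting times conditioned on a hit follow the normalized distribution,
# so the KS test against cond_cdf should not reject.
samples = expon.rvs(size=5_000, random_state=np.random.default_rng(0))
stat, p = kstest(samples, cond_cdf)
assert p > 0.001
```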
The PDF as written shows -(v*t - b^2) but should be -(v*t - b)^2: the square applies to the full expression (v*t - b), not just b. All code (MATLAB and Python) implements the correct form. Verified against Kadloor et al. 2012 Eq. 12.
See detailed entry under Bugs Found.
Items identified during deep review, to be addressed:
- KS test as formal pass/fail (done — 1D uses scipy.stats.invgauss, 3D uses numerical CDF + binomial hit rate test)
- CSV NaN filtering — scripts don't skip headers, MATLAB skips 2 rows
- np.trapezoid compatibility (done — pinned numpy>=2.0, scipy>=1.11 in requirements.txt)
- Dead QQ plot code in validate_rng.py (done — removed dead code, dropped try/except for scipy since it's pinned)
- 3D test at 1k paths (done — NRMSE gated to >= 1000 hits per Wasserman 2004 Theorem 20.9: histogram RMSE converges O(n^{-1/3}); below 1000 hits, shot noise dominates)
- Wall reflection validation (done — ported TobyThesisTest_walls.m, added --start-x/y/z params for transmitter positioning, GPU-only test)
- Multi-parameter 1D test (done — added Test 1b with dist=1E-7, vel=3E-4, dt=1E-8 matching TobyThesisTest_dist.m; added --velocity CLI flag)
- Timestep convergence test (skipped — Euler-Maruyama is exact for constant-coefficient SDEs; the only dt-dependent error is hit-detection boundary crossing, O(sqrt(dt)) per Gobet 2000, which would require ~25M samples to distinguish dt=1E-7 from 1E-8; redundant with existing KS tests at two dt values)
- Inter-stream RNG correlation (skipped — Philox4_32_10 is counter-based, so inter-stream correlation is eliminated by construction per Salmon et al. SC11; our usage pattern (same seed, distinct subsequences) is the canonical cuRAND recommendation; the KS test against analytical solutions is a stronger end-to-end check)
- Edge cases (done — KS/binomial tests skip with a message when n < 20 per Conover 1999; low-power warning at n < 50; empty CSV already handled by load_hit_times)
- Device-side hit detection was the biggest win — moving hit checks into the kernel and eliminating per-timestep cudaMemcpy gave 25-224x GPU speedup.
- GPU speedup scales with simulation length: more timesteps = more amortized kernel launch overhead.
- CPU cleanup (stack vars, reduced redundant sqrt calls) gave ~2x CPU speedup.
- Apple Silicon M-series is ~2x faster than Colab Xeon for single-threaded CPU work.
- Per-call Philox RNG with clock64() is 4-5x faster than global memory state — register-local stack state beats global memory round-trips. Don't assume "initialize once" is always faster; memory hierarchy matters more than init cost for lightweight RNGs.