This project demonstrates empirically that Gaussian Processes bypass the double descent phenomenon exhibited by finite-parameter models, and investigates the convergence of neural networks towards GP behaviour as width increases.
The double descent curve is a non-monotonic pattern observed when plotting test error against model complexity: first a descent in the underparameterised regime, then a peak at the interpolation threshold (where the number of parameters roughly equals the number of training samples), and finally a second descent in the overparameterised regime.
My intuition was that GPs skip directly to the far side of this curve, operating immediately in the benign overfitting regime. After studying Jacot et al.'s "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS, 2018), this turned out to be a well-established result.
The key insight: in the overparameterised regime, infinitely many functions achieve zero training error. GPs explicitly select the smoothest one (minimum RKHS norm). Neural networks, in the infinite-width limit, implicitly converge toward the same choice. GPs are the explicit regularisation that wide NNs only achieve implicitly.
Both experiments use the Concrete Compressive Strength dataset (UCI, 1030 samples, 8 features). A GP with a learned RBF kernel serves as the baseline throughout.
Random Fourier Features (Rahimi & Recht, 2007) approximate the RBF kernel with D random features. This is the random features model from Jacot et al. Section 3.1, where the GP is the D -> infinity limit.
We sweep D from 10 to 20,000 (with n_train ~ 824) and solve via least-squares. The RFF gamma is matched to the GP's learned lengthscale so the two models share the same kernel.
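A minimal sketch of this sweep, using synthetic stand-in data (the real experiment uses the standardised Concrete split, with gamma set from the GP's learned lengthscale as gamma = 1/(2 * lengthscale**2); the sizes, target function, gamma value, and D grid below are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the standardised Concrete split (n_train ~ 824, 8 features)
n_train, n_test, d = 824, 206, 8
X_train = rng.normal(size=(n_train, d))
y_train = np.sin(X_train.sum(axis=1)) + 0.1 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = np.sin(X_test.sum(axis=1))

gamma = 0.0625  # illustrative; matched to the GP's learned lengthscale in the experiment

def rff_features(X, W, b):
    # z(x) = sqrt(2/D) cos(Wx + b) with W ~ N(0, 2*gamma*I) approximates
    # the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)  (Rahimi & Recht)
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

test_mse = {}
for D in [10, 100, 824, 2000, 20000]:  # sweep crosses the D ~ n_train threshold
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z_train = rff_features(X_train, W, b)
    Z_test = rff_features(X_test, W, b)
    # lstsq returns the minimum-norm solution when D > n_train
    w = np.linalg.lstsq(Z_train, y_train, rcond=None)[0]
    test_mse[D] = float(np.mean((Z_test @ w - y_test) ** 2))
```

Plotting `test_mse` against `D` on a log scale traces out the double descent curve, with the peak near D ~ n_train.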
What this shows: Textbook double descent. Test MSE descends in the underparameterised regime, explodes at D ~ n_train, then descends again in the overparameterised regime. The GP baseline remains stable throughout -- it has already selected the principled minimum-norm interpolant.
A single-hidden-layer ReLU network is trained to convergence (loss < 1e-4 or 10,000 epochs) at widths from 2 to 500, with 5 random seeds per width.
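A numpy sketch of one step of the width sweep (the actual SimpleNN lives in `experiment.py`; the data, hyperparameters, and the NTK-style 1/sqrt(width) output scaling here are illustrative assumptions, the last chosen so a single learning rate stays stable across widths):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for the standardised Concrete split
X = rng.normal(size=(200, 8))
y = np.sin(X.sum(axis=1, keepdims=True))

def train_relu_net(X, y, width, lr=1e-2, epochs=2000, tol=1e-4, seed=0):
    """Full-batch gradient descent on a one-hidden-layer ReLU net (MSE loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(size=(d, width))
    b1 = np.zeros(width)
    W2 = rng.normal(size=(width, 1))
    scale = 1.0 / np.sqrt(width)  # NTK parameterisation of the output layer
    for _ in range(epochs):
        H = np.maximum(X @ W1 + b1, 0.0)      # hidden activations
        err = H @ W2 * scale - y
        loss = float(np.mean(err ** 2))
        if loss < tol:                        # "trained to convergence" criterion
            break
        g_out = 2.0 * err / n                 # dLoss/dPrediction
        gW2 = H.T @ g_out * scale
        gH = g_out @ W2.T * scale
        gH[H <= 0.0] = 0.0                    # ReLU gate
        W1 -= lr * (X.T @ gH)
        b1 -= lr * gH.sum(axis=0)
        W2 -= lr * gW2
    return loss

# Final training loss at a few widths (one seed; the experiment averages over 5)
losses = {m: train_relu_net(X, y, m) for m in (2, 16, 128)}
```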
What this shows: NNs improve smoothly with width. However, the NTK for a ReLU network is an arc-cosine kernel, not the RBF kernel used by the GP baseline. This means the NN does not converge to the RBF-GP solution -- it converges to a different GP (the NTK-kernel GP). On this dataset the NNs outperform the RBF-kernel GP at large widths, which is consistent: the RBF kernel is not optimal for Concrete.
The connection between the two experiments is the NTK paper:
- Section 3.1 shows that gradient descent on a random features model is equivalent to kernel gradient descent. The RFF experiment is a direct implementation of this: the GP is the infinite-features limit.
- Theorem 1 shows that the NTK converges to a deterministic kernel in the infinite-width limit.
- Theorem 2 shows that this kernel stays constant during training, so the network function follows a linear ODE -- it is doing kernel regression.
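Concretely, for squared loss these two theorems reduce training to kernel gradient flow; a sketch of the standard consequence, with Theta the limiting NTK, eta the learning rate, and f_0 the network at initialisation:

```latex
% Kernel gradient flow on the training inputs X (squared loss):
\frac{\mathrm{d} f_t(X)}{\mathrm{d} t} = -\eta\, \Theta(X, X)\,\bigl(f_t(X) - y\bigr)
% Closed-form solution: exponential convergence to the labels
f_t(X) = y + e^{-\eta\, \Theta(X, X)\, t}\,\bigl(f_0(X) - y\bigr)
% As t -> infinity, the prediction at a new point x is kernel regression
f_\infty(x) = f_0(x) + \Theta(x, X)\, \Theta(X, X)^{-1}\,\bigl(y - f_0(X)\bigr)
```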
GPs bypass double descent because they are already at the infinite-parameter limit. They solve the kernel regression problem in closed form, selecting the minimum-norm interpolant without navigating the unstable interpolation threshold.
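The closed form in question, sketched on synthetic stand-in data (the kernel parameter and jitter are illustrative assumptions; with noise-free observations the GP posterior mean is exactly the minimum-RKHS-norm interpolant):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # training inputs (stand-in data)
y = np.sin(X.sum(axis=1))           # noise-free targets
X_new = rng.normal(size=(5, 8))     # query points

gamma = 0.0625  # RBF kernel parameter; gamma = 1 / (2 * lengthscale**2)

def rbf_kernel(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Minimum-RKHS-norm interpolant = noise-free GP posterior mean:
#   f(x) = k(x, X) K^{-1} y   (tiny jitter added for numerical stability)
K = rbf_kernel(X, X) + 1e-10 * np.eye(len(X))
alpha = np.linalg.solve(K, y)
f_new = rbf_kernel(X_new, X) @ alpha

# No interpolation threshold to cross: f reproduces the training targets exactly
f_train = rbf_kernel(X, X) @ alpha
```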
- `data_mgmt.py` -- Data loading and preprocessing (Concrete, MNIST, synthetic GP data)
- `experiment.py` -- Model definitions: SimpleNN, ExactGPModel, training and evaluation functions
- `implement.py` -- Main experiment runner: RFF sweep, NN width sweep, GP baseline
- RFF double descent experiment (clear success, textbook curve)
- GP baseline (stable throughout)
- NN width sweep (clean improvement with width)
- Gamma matching between RFF and GP kernel [PENDING]
- Plotting (log-scale test MSE vs D; test MSE vs width) [PENDING]
- Write-up of results with figures [PENDING]
- Jacot, Gabriel, Hongler. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS, 2018.
- Belkin, Hsu, Ma, Mandal. "Reconciling modern machine learning practice and the bias-variance trade-off." PNAS, 2019.
- Rahimi, Recht. "Random Features for Large-Scale Kernel Machines." NIPS, 2007.