This project demonstrates empirically that Gaussian Processes bypass the double descent phenomenon exhibited by finite-parameter models, and investigates the convergence of neural networks towards GP behaviour as width increases.
The double descent curve is a non-monotonic pattern observed when plotting test error against model complexity: first a descent in the underparameterised regime, then a peak at the interpolation threshold (where the number of parameters roughly equals the number of training samples), and finally a second descent in the overparameterised regime.
My intuition was that GPs skip directly to the far side of this curve, operating immediately in the benign overfitting regime. After studying Jacot et al.'s "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS, 2018), this turned out to be a well-established result.
The key insight: in the overparameterised regime, infinitely many functions achieve zero training error. GPs explicitly select the smoothest one (minimum RKHS norm). Neural networks, in the infinite-width limit, implicitly converge toward the same choice. GPs are the explicit regularisation that wide NNs only achieve implicitly.
Both experiments use the Concrete Compressive Strength dataset (UCI, 1030 samples, 8 features). A GP with a learned RBF kernel serves as the baseline throughout.
Random Fourier Features (Rahimi & Recht, 2007) approximate the RBF kernel with D random features. This is the random features model from Jacot et al. Section 3.1, where the GP is the D -> infinity limit.
We sweep D from 10 to 20,000 (with n_train ~ 824) and solve via least-squares. The RFF gamma is matched to the GP's learned lengthscale so the two models share the same kernel.
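A minimal sketch of this sweep, using synthetic stand-in data (the real experiment uses the standardised Concrete split, with gamma set from the GP's learned lengthscale as gamma = 1/(2 * lengthscale**2); the sizes, target function, gamma value, and D grid below are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the standardised Concrete split (n_train ~ 824, 8 features)
n_train, n_test, d = 824, 206, 8
X_train = rng.normal(size=(n_train, d))
y_train = np.sin(X_train.sum(axis=1)) + 0.1 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = np.sin(X_test.sum(axis=1))

gamma = 0.0625  # illustrative; matched to the GP's learned lengthscale in the experiment

def rff_features(X, W, b):
    # z(x) = sqrt(2/D) cos(Wx + b) with W ~ N(0, 2*gamma*I) approximates
    # the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)  (Rahimi & Recht)
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

test_mse = {}
for D in [10, 100, 824, 2000, 20000]:  # sweep crosses the D ~ n_train threshold
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z_train = rff_features(X_train, W, b)
    Z_test = rff_features(X_test, W, b)
    # lstsq returns the minimum-norm solution when D > n_train
    w = np.linalg.lstsq(Z_train, y_train, rcond=None)[0]
    test_mse[D] = float(np.mean((Z_test @ w - y_test) ** 2))
```

Plotting `test_mse` against `D` on a log scale traces out the double descent curve, with the peak near D ~ n_train.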
What this shows: Textbook double descent. Test MSE descends in the underparameterised regime, explodes at D ~ n_train, then descends again in the overparameterised regime. The GP baseline remains stable throughout -- it has already selected the principled minimum-norm interpolant.
A single-hidden-layer ReLU network is trained to convergence (loss < 1e-4 or 10,000 epochs) at widths from 2 to 500, with 5 random seeds per width.
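A numpy sketch of one step of the width sweep (the actual SimpleNN lives in `experiment.py`; the data, hyperparameters, and the NTK-style 1/sqrt(width) output scaling here are illustrative assumptions, the last chosen so a single learning rate stays stable across widths):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for the standardised Concrete split
X = rng.normal(size=(200, 8))
y = np.sin(X.sum(axis=1, keepdims=True))

def train_relu_net(X, y, width, lr=1e-2, epochs=2000, tol=1e-4, seed=0):
    """Full-batch gradient descent on a one-hidden-layer ReLU net (MSE loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(size=(d, width))
    b1 = np.zeros(width)
    W2 = rng.normal(size=(width, 1))
    scale = 1.0 / np.sqrt(width)  # NTK parameterisation of the output layer
    for _ in range(epochs):
        H = np.maximum(X @ W1 + b1, 0.0)      # hidden activations
        err = H @ W2 * scale - y
        loss = float(np.mean(err ** 2))
        if loss < tol:                        # "trained to convergence" criterion
            break
        g_out = 2.0 * err / n                 # dLoss/dPrediction
        gW2 = H.T @ g_out * scale
        gH = g_out @ W2.T * scale
        gH[H <= 0.0] = 0.0                    # ReLU gate
        W1 -= lr * (X.T @ gH)
        b1 -= lr * gH.sum(axis=0)
        W2 -= lr * gW2
    return loss

# Final training loss at a few widths (one seed; the experiment averages over 5)
losses = {m: train_relu_net(X, y, m) for m in (2, 16, 128)}
```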
What this shows: NNs improve smoothly with width. However, the NTK for a ReLU network is an arc-cosine kernel, not the RBF kernel used by the GP baseline. This means the NN does not converge to the RBF-GP solution -- it converges to a different GP (the NTK-kernel GP). On this dataset the NNs outperform the RBF-kernel GP at large widths, which is consistent: the RBF kernel is not optimal for Concrete.
The connection between the two experiments is the NTK paper:
- Section 3.1 shows that gradient descent on a random features model is equivalent to kernel gradient descent. The RFF experiment is a direct implementation of this: the GP is the infinite-features limit.
- Theorem 1 shows that the NTK converges to a deterministic kernel in the infinite-width limit.
- Theorem 2 shows that this kernel stays constant during training, so the network function follows a linear ODE -- it is doing kernel regression.
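Concretely, for squared loss these two theorems reduce training to kernel gradient flow; a sketch of the standard consequence, with Theta the limiting NTK, eta the learning rate, and f_0 the network at initialisation:

```latex
% Kernel gradient flow on the training inputs X (squared loss):
\frac{\mathrm{d} f_t(X)}{\mathrm{d} t} = -\eta\, \Theta(X, X)\,\bigl(f_t(X) - y\bigr)
% Closed-form solution: exponential convergence to the labels
f_t(X) = y + e^{-\eta\, \Theta(X, X)\, t}\,\bigl(f_0(X) - y\bigr)
% As t -> infinity, the prediction at a new point x is kernel regression
f_\infty(x) = f_0(x) + \Theta(x, X)\, \Theta(X, X)^{-1}\,\bigl(y - f_0(X)\bigr)
```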
GPs bypass double descent because they are already at the infinite-parameter limit. They solve the kernel regression problem in closed form, selecting the minimum-norm interpolant without navigating the unstable interpolation threshold.
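The closed form in question, sketched on synthetic stand-in data (the kernel parameter and jitter are illustrative assumptions; with noise-free observations the GP posterior mean is exactly the minimum-RKHS-norm interpolant):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # training inputs (stand-in data)
y = np.sin(X.sum(axis=1))           # noise-free targets
X_new = rng.normal(size=(5, 8))     # query points

gamma = 0.0625  # RBF kernel parameter; gamma = 1 / (2 * lengthscale**2)

def rbf_kernel(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Minimum-RKHS-norm interpolant = noise-free GP posterior mean:
#   f(x) = k(x, X) K^{-1} y   (tiny jitter added for numerical stability)
K = rbf_kernel(X, X) + 1e-10 * np.eye(len(X))
alpha = np.linalg.solve(K, y)
f_new = rbf_kernel(X_new, X) @ alpha

# No interpolation threshold to cross: f reproduces the training targets exactly
f_train = rbf_kernel(X, X) @ alpha
```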
- `data_mgmt.py` -- Data loading and preprocessing (Concrete, MNIST, synthetic GP data)
- `experiment.py` -- Model definitions: SimpleNN, ExactGPModel, training and evaluation functions
- `implement.py` -- Main experiment runner: RFF sweep, NN width sweep, GP baseline
- RFF double descent experiment (clear success, textbook curve)
- GP baseline (stable throughout)
- NN width sweep (clean improvement with width)
- Gamma matching between RFF and GP kernel [PENDING]
- Plotting (log-scale test MSE vs D; test MSE vs width) [PENDING]
- Write-up of results with figures [PENDING]
- Jacot, Gabriel, Hongler. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS, 2018.
- Belkin, Hsu, Ma, Mandal. "Reconciling modern machine learning practice and the bias-variance trade-off." PNAS, 2019.
- Rahimi, Recht. "Random Features for Large-Scale Kernel Machines." NIPS, 2007.