<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Understanding and Mitigating Errors</title>
<style>
body {
font-family: sans-serif;
max-width: 900px;
margin: 0 auto;
padding: 2rem;
line-height: 1.6;
color: #333;
}
h1, h2, h3, h4 {
color: #2c3e50; /* Darker shade for better contrast */
margin-top: 1.5em;
margin-bottom: 0.5em;
}
h1 { font-size: 2.5em; border-bottom: 2px solid #3498db; padding-bottom: 0.3em;}
h2 { font-size: 2em; border-bottom: 1px solid #bdc3c7; padding-bottom: 0.2em;}
h3 { font-size: 1.5em; }
h4 { font-size: 1.2em; color: #555;}
nav { margin-bottom: 30px; padding: 10px; background: #ecf0f1; border: 1px solid #bdc3c7; border-radius: 4px;}
nav ul { list-style: none; padding: 0; }
nav li { display: inline-block; margin-right: 15px; }
nav a { text-decoration: none; color: #3498db; font-weight: bold;}
nav a:hover { text-decoration: underline; color: #2980b9;}
pre {
background: #f8f9f9; /* Lighter background for code blocks */
padding: 1rem;
overflow-x: auto;
border: 1px solid #e1e4e8; /* Softer border */
border-left: 4px solid #3498db; /* Accent border */
border-radius: 4px;
font-size: 0.9em;
}
code {
font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
}
/* For inline code */
p > code, li > code, table td > code {
background: #e8eaed;
padding: 0.2em 0.4em;
border-radius: 3px;
font-size: 0.85em;
}
pre code { /* Reset for code inside pre, already handled by pre styling */
background: none;
padding: 0;
font-size: 1em; /* Ensure pre's font size is inherited */
}
ul, ol {
padding-left: 20px;
}
li {
margin-bottom: 0.5em;
}
strong {
color: #2980b9;
}
hr {
border: 0;
height: 1px;
background: #bdc3c7;
margin-top: 2em;
margin-bottom: 2em;
}
</style>
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']],
displayMath: [['$$','$$'], ['\\[','\\]']],
processEscapes: true
}
});
</script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/github-dark.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
<script>hljs.highlightAll();</script>
</head>
<body>
<nav>
<ul>
<li><a href="#errors">Model Errors</a></li>
<li><a href="#evaluation">Evaluation</a></li>
<li><a href="#generalization">Generalization</a></li>
<li><a href="#regularization">Regularization & Augmentation</a></li>
<li><a href="#optimizers">Optimizers</a></li>
<li><a href="#scaling-laws">Scaling Laws</a></li>
</ul>
</nav>
<h1 id="errors">Understanding and Diagnosing Model Errors</h1>
<p>Even with lots of training data, models don’t perform perfectly. There are three main sources of error:</p>
<p><code>True Error = Approximation Error + Estimation Error + Optimization Error</code></p>
<h3>Approximation Error (Bias from Model Class)</h3>
<p>This is error due to the model class not being able to represent the true underlying function. The model is too weak. You need a stronger model.</p>
<h3>Estimation Error (Generalization Gap, Variance)</h3>
<p>This is error from not having enough data. With a small dataset, you’re overfitting to noise, so you can’t find the right parameters. You need more data, regularization, or data augmentation.</p>
<h3>Optimization Error</h3>
<p>This is error due to not finding the best parameters within your model class. This can be caused by a non-convex loss function, bad local minima, or saddle points. You need better optimizers or good learning rate schedules.</p>
<p>The first two are the classic <strong>bias-variance trade-off</strong>. Here, we’re focusing on the optimization error. But how do you know which error is the cause of poor performance?</p>
<ul>
<li>If the model is too simple, you have <strong>high training error and high test error</strong>. Your bias is too high; the model is underfitting. You need a better model.</li>
<li>If your model is too complex relative to the dataset size, you <strong>overfit</strong>. You'll see low training error but high test error (a large generalization gap). You should add more data, regularize, or even go with a simpler model.</li>
<li>When <strong>training error decreases too slowly</strong> or seems to plateau early, that’s an optimization error. You should adjust the optimizer, as you’ll see in this tutorial.</li>
<li>If <strong>test error improves with more data</strong>, that’s an estimation error. It indicates that variance dominates.</li>
<li>If the <strong>test error plateaus even with more data</strong>, that’s an approximation error; the model is too limited.</li>
<li>If <strong>training error drops with more expressive models</strong>, that’s an approximation error. If the training error stays high, that’s an optimization error.</li>
<li>If the test error is much higher than the training error and adding more data improves the test error, that’s variance (estimation error). If the test error doesn’t improve, that may be the <strong>irreducible error</strong> in the data, also known as the Bayes error.</li>
</ul>
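<p>The triage rules above can be sketched as a rough heuristic. The thresholds (<code>gap_tol</code>, <code>high_err</code>) are illustrative assumptions, not universal constants:</p>

```python
def diagnose(train_err, test_err, gap_tol=0.05, high_err=0.2):
    """Rough triage of which error source dominates (illustrative thresholds)."""
    gap = test_err - train_err
    if train_err > high_err and gap < gap_tol:
        # Can't even fit the training set: approximation (or optimization) error.
        return "underfitting"
    if gap > gap_tol:
        # Large generalization gap: estimation error (variance) dominates.
        return "overfitting"
    # Low, matching errors: you may be near the irreducible (Bayes) floor.
    return "near noise floor"
```

In practice you would also check how these numbers move as you add data or capacity, as described in the bullets above.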
<h3>The Irreducible Error (Bayes Error)</h3>
<p>Irreducible error is something you cannot remove, no matter what model, data, or optimizer you use! It comes from the randomness or noise inherent in the data.</p>
<p>Suppose the true relationship between an input $x$ and a target $y$ is:</p>
$$ y = f(x) + \epsilon $$
<p>Where $\epsilon$ is noise from natural randomness. The error due to this noise is the Bayes error, which represents the lower bound on test error. If you keep improving the model and the test error plateaus, you might be hitting this floor.</p>
<p>If your training error is low (optimization succeeded), your test error is close to your training error (low variance), and increasing the model size or data doesn’t dramatically reduce the error, you’re probably as close as you can be to the irreducible noise floor.</p>
<p>So, again, how do you know you have hit this limit? You try bigger models, and the training error doesn’t go down much more. You try a better optimizer or learning rate scheduler, and the optimization error is already tiny. You try more data, and the test error doesn’t improve. Both training and test error are low and close to each other. That’s when you go to your boss and say, "I've hit the theoretical minimum error achievable on this dataset, and nothing I do will decrease this inherent noise." No model can capture this because it’s not a learnable signal. That’s when you ask for data from another modality or from additional sensors that give the model more information about the true latent state. Combining these complementary signals can reduce the noise floor.</p>
<hr>
<h2 id="evaluation">Evaluation Strategies</h2>
<p>When it comes to evaluating a model, what comes to mind?</p>
<ul>
<li><strong>Accuracy:</strong> Does it classify correctly?</li>
<li><strong>Latency:</strong> Is it fast enough for real-time use?</li>
<li><strong>Robustness:</strong> Does it handle new users and noisy sensors?</li>
<li><strong>Usability:</strong> Does it work for daily scenarios, not just on lab data?</li>
</ul>
<p>There are three main testing strategies:</p>
<ol>
<li><strong>Offline Testing:</strong> Cross-validation within and across subjects and measuring metrics.</li>
<li><strong>Online Testing:</strong> Test the system in streaming mode, measuring latency, stability, and false positives. What is the throughput? How many errors per hour?</li>
<li><strong>Stress Testing:</strong> Vary conditions like placement and noise level, and augment with synthetic perturbations. Is the system able to recover quickly from an error?</li>
</ol>
<hr>
<h2 id="generalization">Overfitting and Generalization</h2>
<p><strong>Overfitting</strong> is what happens when the model learns the training data too well, including noise and irrelevant patterns, so it performs poorly on new, unseen data. A model can be too complex, there might not be enough data, there could be noise in the data, etc. As we said earlier, when there is a large gap between training and test error, it is likely overfitting.</p>
<h3>When can you claim model generalization?</h3>
<p>If you think about wearables, training and testing on seen subjects is a much easier problem because the model can memorize subject idiosyncrasies, so the test accuracy would look much better than it would on a new person.</p>
<ul>
<li><strong>Cross-Subject Generalization:</strong> You train on some people and test on entirely new people. You can do a form of cross-validation by testing on a held-out subject and repeating this across all subjects, then averaging the performance.</li>
<li><strong>Temporal Generalization:</strong> This is session-to-session robustness. Does a model trained on a subject’s data from day 1 work on that same subject’s data from day 2? Is it robust to time variance?</li>
<li><strong>Domain Shift:</strong> Does varying the placement or recording conditions impact the model’s performance? Or does the model collapse under realistic shifts?</li>
</ul>
<p>So, before you claim generalization or start looking at power laws, make sure you have a large enough dataset, have run enough experiments to support the claim, and have a model good enough to study performance across dataset sizes.</p>
<h3>What if your model doesn't work on new users?</h3>
<p>That’s a sign that your model has overfit to subject-specific patterns. The solution lies in data, the model, and adaptation strategies.</p>
<p>Obviously, you can <strong>collect more diverse training data</strong>; that’s the most important step. With wearables, you want to adapt efficiently to new subjects and avoid <strong>negative transfer</strong>, which is overwriting useful general features with subject-specific noise. Commonly, you freeze most of the network and fine-tune the final layers. Or you can introduce adapter modules like LoRA and bottleneck layers for fine-tuning. There is also layer-wise unfreezing, where you start with head-only fine-tuning, and if performance plateaus, you unfreeze higher encoder layers one by one to prevent catastrophic forgetting of general features. You can use weight regularization to keep personalized weights close to the pre-trained ones. Or you can keep a smaller buffer of original subjects and train jointly on all of them to avoid drifting too far toward subject-specific patterns.</p>
<p>You can try different augmentation techniques and regularization, some of which are discussed later here. And you might want to spend some time on <strong>normalizing and pre-processing the data per subject</strong> so you can align the amplitude of the signals across users. You can do this in an unsupervised fashion where you use the stream of data from new users to adapt normalization layers.</p>
<h3>Preprocessing Sensory Data</h3>
<p>Let’s talk a little bit about preprocessing sensory data, like from wearables. Sensory data can be noisy; there can be powerline interference, baseline drifts, and issues with the impedance of the sensors. Signal processing can clean the signal and extract the most informative, least noisy features to feed to your model.</p>
<ul>
<li><strong>Band-pass filtering:</strong> Removes low-frequency drift and high-frequency noise and keeps the most important frequencies.</li>
<li><strong>Notch filtering:</strong> Suppresses powerline interference.</li>
<li><strong>Rectification and enveloping:</strong> Take the absolute value of the signal, then extract magnitude envelopes by low-pass filtering the rectified signal.</li>
<li><strong>Z-scoring:</strong> Standardize per channel, or scale signals, to reduce inter-subject and inter-session amplitude variability.</li>
</ul>
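<p>Two of the steps above, per-channel z-scoring and rectification with an envelope, can be sketched as follows. The moving-average envelope here is a crude stand-in for a proper low-pass filter, and a real pipeline would use designed band-pass and notch filters (e.g. from <code>scipy.signal</code>):</p>

```python
import math

def zscore(channel):
    """Standardize one channel to zero mean and unit variance."""
    n = len(channel)
    mean = sum(channel) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in channel) / n) or 1.0  # guard flat channels
    return [(x - mean) / std for x in channel]

def envelope(channel, window=5):
    """Rectify (absolute value), then smooth with a causal moving average."""
    rect = [abs(x) for x in channel]
    out = []
    for i in range(len(rect)):
        lo = max(0, i - window + 1)           # causal window [lo, i]
        out.append(sum(rect[lo:i + 1]) / (i + 1 - lo))
    return out
```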
<p>There are also a variety of time, frequency, and time-frequency domain features you can extract from the signal, or you can design CNN and LSTM layers in your model that perform feature extraction.</p>
<h3>Handling Missing Data and Fusing Modalities</h3>
<p>Additionally, there can be missing signals due to Bluetooth packet loss or sweat increasing impedance, which can lead to dropped channels or unreliable signals. For noisy or missing channels, you can detect those that have flatlined or saturated and interpolate them from neighbors. Depending on the sensors, the neighbors can be spatial or temporal. You can also try a wide range of augmentations. You can make the model aware of such noises by providing a binary mask of which sensors are valid, and the model learns to ignore the missing ones. You can use an attention mechanism to weight sensors that are more reliable. The model learns end-to-end which channels correlate better with the task and which ones don’t.</p>
<p>You can <strong>combine modalities</strong>; if one modality is missing, maybe another can still provide information. Depending on the signals, you could concatenate after processing and use the same encoder, but that would assume the modalities align well in time and scale. More commonly, you would use separate encoders and concatenate their latent features and feed that to a joint decoder. Instead of concatenating, you can have <strong>cross-modal attention</strong>. Modalities that can complement each other don’t necessarily align. For example, one modality might detect some activity sooner than another. Attention can learn to synchronize these modalities. The model can learn how much to trust each modality depending on the context. Features from one of the sensors can form the query, and features from the other sensors can form the keys and values. Attention will decide how strongly to fuse information. And if one modality is missing or noisy, attention can down-weight it.</p>
<p>This is more flexible as it lets each encoder specialize in encoding its signal. You can also have very late fusion, where you train entirely separate models for each modality and fuse their outputs, but that would not really solve the missing signal problem.</p>
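<p>The simplest mask-aware fusion, averaging only the modalities whose signals are valid, can be sketched like this (a hypothetical illustration; real systems would learn attention weights instead of a plain mean):</p>

```python
def fuse(features, valid):
    """Fuse per-modality feature vectors by averaging, skipping invalid ones.

    features: list of equal-length feature vectors, one per modality
    valid:    binary mask, 1 if that modality's signal is usable
    """
    kept = [f for f, v in zip(features, valid) if v]
    if not kept:
        raise ValueError("all modalities missing")
    return [sum(f[i] for f in kept) / len(kept) for i in range(len(kept[0]))]
```

If one modality drops out, the fused feature is still well defined; this is the hand-coded analogue of attention down-weighting a missing signal.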
<p>You can try foundation-style models where you train on a large cohort of users and fine-tune on new users using calibration sets. You can pair a shared, subject-agnostic encoder with specific, lightweight heads that are fine-tuned for each new user.</p>
<hr>
<h2 id="regularization">Regularization & Augmentation</h2>
<h3>Regularization: Dropout</h3>
<p><strong>Dropout</strong> randomly drops some activations with probability $p$. This prevents the network from relying too heavily on any single neuron. Each forward pass effectively trains a slightly different, thinned sub-network. Say you have a hidden layer output vector $[h_1, h_2, \dots, h_n]$. Dropout generates a binary mask using a Bernoulli distribution with probability $(1-p)$. Then it applies the mask:</p>
$$ h' = \frac{1}{1-p} \cdot (h \odot m) $$
<p>The factor $1/(1-p)$ rescales the activations so the expected activation value stays the same, which keeps the scale consistent between training and testing. Dropout also acts like noise injection in the hidden layers. The noise forces the model to learn robust features that work under perturbation. Plain dropout is less common in modern CNNs, which lean on batch normalization instead; transformers typically apply it to attention weights and residual branches. There are fancier techniques like DropPath, Stochastic Depth, and DropConnect that generalize the same idea.</p>
<p>At inference time, we don’t drop anything. We use the full network. And because during training we adjusted the expected activation, it matches the test time.</p>
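<p>Inverted dropout as described is only a few lines; a sketch:</p>

```python
import random

def dropout(h, p, training=True):
    """Inverted dropout: zero each unit with probability p and rescale
    survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return list(h)  # inference: use the full network, no rescaling needed
    return [x / (1.0 - p) if random.random() >= p else 0.0 for x in h]
```

With $p = 0.5$ every surviving activation is doubled, so the expected value of each unit matches the undropped layer.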
<h3>Data Augmentation for Wearables</h3>
<p>So far, we have discussed improving the estimation error by adjusting the model with regularization. What about the data? Let’s think about how we can synthetically increase our dataset. We know about adding noise, scaling values, and rotations. But what are some realistic ways to augment data collected from wearables?</p>
<ul>
<li><strong>Channel Dropout:</strong> If you have multiple sensors on a wearable, you can try randomly dropping them during training. This could mimic losing contact or being slightly misaligned and forces the model to be robust across subsets of sensors.</li>
<li><strong>Channel Shuffling:</strong> You can permute sensors within a local neighborhood to mimic the band being shifted along the arm. It helps the model learn that activity can move across channels.</li>
<li><strong>Spatial Mixing:</strong> You can replace a sensor (if there are multiple of the same type) with a weighted average of its neighbors. This would simulate the displacement blur when the band shifts.</li>
<li><strong>Rotation Augmentation:</strong> Apply a 3D rotation to sensors to mimic the band being worn at different angles, twisted, or rotated. This is an interesting one that needs a little bit more discussion. A wristband would have three axes: the X-axis points along your arm toward the hand, the Y-axis points sideways out of your wrist, and the Z-axis points up out of the skin. If you twist the wristband around your arm, the X-axis stays the same, but Y and Z rotate relative to the device, and relative to the forearm axes, all three change. Sliding doesn’t change the band’s axes themselves; they just capture slightly different signals. That’s why we rotate the device into what it would look like if it had been twisted on the arm.</li>
</ul>
<h4>A Bit More on Rotations</h4>
<p>Rotation around the x-axis is <strong>roll</strong>. It rotates the y-z plane around x, like turning a doorknob.</p>
$$ R_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix} $$
<p>Rotation around the y-axis is <strong>pitch</strong>. It rotates the x-z plane around y, like pointing your hand up or down while keeping the wrist straight.</p>
$$ R_y(\theta) = \begin{bmatrix} \cos(\theta) & 0 & \sin(\theta) \\ 0 & 1 & 0 \\ -\sin(\theta) & 0 & \cos(\theta) \end{bmatrix} $$
<p>Rotation around the z-axis is <strong>yaw</strong>. It rotates the x-y plane around z, like turning your palm up vs. palm down.</p>
$$ R_z(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & 0 \\ \sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{bmatrix} $$
<p>The general rotation is then $R(\alpha, \beta, \gamma) = R_x \cdot R_y \cdot R_z$. Then you multiply this with each sensor vector you want to rotate: $v' = \text{sensor} \cdot R^T$.</p>
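<p>These rotation matrices and the augmentation step can be sketched in plain Python (using the column-vector convention $v' = R v$, equivalent to the row form $v' = v \cdot R^T$ above):</p>

```python
import math

def rot_x(t):  # roll
    c, s = math.cos(t), math.sin(t)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(t):  # pitch
    c, s = math.cos(t), math.sin(t)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(t):  # yaw
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate(v, alpha, beta, gamma):
    """Apply the general rotation R = Rx * Ry * Rz to a sensor vector v."""
    r = matmul(matmul(rot_x(alpha), rot_y(beta)), rot_z(gamma))
    return [sum(r[i][k] * v[k] for k in range(3)) for i in range(3)]
```

For example, a 90° yaw sends the x-axis onto the y-axis, mimicking a quarter twist of the band.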
<h4>Other Augmentations</h4>
<ul>
<li><strong>Random Scaling:</strong> Multiply the signals by random factors to mimic changes in strap tightness or skin-electrode impedance.</li>
<li><strong>Temporal Jitter:</strong> Shift or stretch signals slightly in time. This mimics latency introduced by altered placement.</li>
<li><strong>Cross-modal Masking:</strong> Temporarily drop one modality to encourage the model to generalize across partial signal availability.</li>
</ul>
<h3>Handling Class Imbalance</h3>
<p>Class imbalance can also cause issues by making the model overfit to easy, frequent samples. There are a number of ways to treat this. You can <strong>oversample</strong> minority classes or <strong>undersample</strong> majority classes. You can augment the signals more for minority classes. <strong>SMOTE</strong> is a classic technique. It oversamples minority classes without duplicating examples by interpolating between existing minority samples. You can use k-NN to find samples close to the existing samples and create a new sample along the line by randomly interpolating between the data and its neighbor. There are also loss-level techniques like <strong>class-weighted loss</strong>, where you weight the minority class higher, or <strong>focal loss</strong>, where the model focuses training on hard-to-classify examples. You can also try <strong>balanced batch sampling</strong> to have roughly the same class representation in each batch. <strong>Curriculum learning</strong> also allows you to start with balanced, easy examples and gradually add hard or imbalanced ones.</p>
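<p>The SMOTE interpolation step itself is simple; a sketch (a full implementation, such as the one in <code>imbalanced-learn</code>, also handles the k-NN search):</p>

```python
import random

def smote_sample(x, neighbor):
    """Synthesize a minority-class sample on the segment between a sample x
    and one of its k-NN minority neighbors: new = x + u * (neighbor - x)."""
    u = random.random()  # u ~ Uniform(0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]
```

The synthetic point always lies on the line segment between the two real minority samples, so it stays inside the minority region rather than duplicating an example.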
<hr>
<h2 id="optimizers">Optimizers and Schedulers</h2>
<h3>Gradient Descent and SGD</h3>
<p>Gradient descent updates the model parameters in the direction that reduces the loss function. <strong>Stochastic Gradient Descent (SGD)</strong> takes a random subset of data instead of the entire dataset and performs the update rule based on that gradient.</p>
$$ \theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta_t} L $$
<p>This makes it noisy but much faster and allows for escaping local minima. It can be sensitive to the learning rate; too large, and the optimization path can zigzag or diverge, and too small would make it very slow. Its learning rate is the same across all parameters; there’s no adaptation. With a constant learning rate, SGD will bounce around near a local minimum. The stochasticity helps escape saddle points and poor local minima. A saddle point is a point in the parameter space where the gradient is zero, so the optimizer sees a stationary point, but it’s not a good minimum. Along one direction, the surface curves up like a minimum, and along another, it curves down like a maximum, which makes it a saddle. High-dimensional loss surfaces are full of saddle points. SGD can shake itself loose from saddles, while full-batch GD may get stuck because there’s no noise.</p>
<h3>SGD with Momentum</h3>
<p>Momentum can smooth the zigzag path of SGD by accumulating a running average of past gradients so updates have a more consistent direction. The update rule with momentum introduces a velocity term:</p>
$$ v_{t+1} = \mu \cdot v_t + \eta \cdot \nabla_{\theta_t} L $$
$$ \theta_{t+1} = \theta_t - v_{t+1} $$
<p>Where $v_t$ is the running average of the gradients and $\mu$ is the momentum coefficient (typically 0.9). A higher $\mu$ smooths out the noise by averaging the velocity over more steps. It gives more weight to past gradients. A higher $\mu$ means you carry more velocity from the past, like rolling downhill with less friction. The velocity builds up, and the magnitude of the step can actually be bigger than with plain SGD. It’s slower to react but faster to converge in a consistent direction.</p>
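<p>The two update equations above, run on a toy 1-D quadratic $L(\theta) = \theta^2$ (so $\nabla L = 2\theta$), can be sketched as:</p>

```python
def sgd_momentum(grad, theta, steps, eta=0.1, mu=0.9):
    """SGD with momentum: v <- mu*v + eta*g, then theta <- theta - v."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + eta * grad(theta)  # velocity: running average of gradients
        theta = theta - v
    return theta
```

On this quadratic the path overshoots and oscillates (the velocity carries it past the minimum) before settling near $\theta = 0$.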
<h3>Nesterov Accelerated Gradient (NAG)</h3>
<p>With NAG, you look ahead in the direction of the velocity and compute the gradient there, which helps prevent overshooting. You “pretend” the parameters have already moved in the direction of your current velocity and take the gradient at that point.</p>
$$ \theta_{\text{lookahead}} = \theta_t - \mu \cdot v_t $$
$$ g_t = \nabla_{\theta_{\text{lookahead}}} L $$
<p>So the gradient is not at your position $\theta_t$ but at a future-leaning position where momentum would have taken you. Then the update rule becomes:</p>
$$ v_{t+1} = \mu \cdot v_t + \eta \cdot g_t $$
$$ \theta_{t+1} = \theta_t - v_{t+1} $$
<p>It basically glances in the direction you’re already moving, then adjusts based on the slope there. It’s anticipating where the momentum is taking you. It’s like an early warning if the slope is flattening or turning so you can correct sooner. With vanilla momentum, you say, “I’m here, what’s the slope?” With Nesterov, you say, “I’m about to be over there, what’s the slope? Let me adjust my push so I don’t overshoot.” This makes for a more stable, less oscillatory, and faster adjustment when the optimum is close.</p>
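<p>Changing one line of the momentum update gives NAG: the gradient is evaluated at the lookahead point instead of the current position. A sketch:</p>

```python
def nag(grad, theta, steps, eta=0.1, mu=0.9):
    """Nesterov momentum: glance ahead to theta - mu*v, take the slope there."""
    v = 0.0
    for _ in range(steps):
        g = grad(theta - mu * v)  # gradient at the future-leaning position
        v = mu * v + eta * g
        theta = theta - v
    return theta
```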
<h3>AdaGrad</h3>
<p>These next optimizers introduce **adaptive learning rates**, where each parameter can have its own step size. Some parameters might need bigger steps (rare features), while others might need smaller steps (frequent features). AdaGrad adapts the learning rate of each parameter based on the history of its gradients. For each parameter, you maintain its own accumulator, a cumulative sum of squared gradients.</p>
$$ G_{t,i} = G_{t-1,i} + (\nabla_{\theta_i} L_t)^2 $$
<p>The update scales the gradients by this accumulated value. Over time, $G_{t,i}$ just keeps growing; it never shrinks.</p>
$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \cdot \nabla_{\theta_i} L_t $$
<p>Where $\epsilon$ is just a small number for numerical stability. It basically uses the L2-norm of the gradient history to normalize the gradient update for that parameter. For parameters that consistently see large gradients (steep directions), the denominator grows, and the steps shrink, preventing runaway updates. Parameters that see small gradients will have a larger effective learning rate so they can still learn. However, there’s a downside: the denominator grows monotonically, so the learning rate shrinks forever. AdaGrad remembers everything forever and because of this can suffer from a diminishing learning rate.</p>
<h3>RMSProp</h3>
<p>Instead of a running sum, RMSProp uses an exponentially decaying average of squared gradients.</p>
$$ E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho) \cdot (\nabla_{\theta_i} L_t)^2 $$
$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \cdot \nabla_{\theta_i} L_t $$
<p>Where $\rho$ is the decay rate (commonly 0.9) and $E[g^2]_t$ is the exponentially weighted average of squared gradients. Older gradients decay away exponentially, so they don’t dominate forever. Now the denominator tracks the more recent scale of the gradients, not the entire history. If the problem shifts and we get bigger or smaller gradients, the optimizer can adapt more quickly. This can help with both exploding and vanishing gradients.</p>
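<p>One step of each rule, side by side, highlights the difference: AdaGrad's accumulator only grows, while RMSProp's decays. A sketch with scalar parameters:</p>

```python
def adagrad_step(theta, g, G, eta=0.5, eps=1e-8):
    """AdaGrad: accumulate squared gradients forever; the step shrinks monotonically."""
    G = G + g * g
    return theta - eta * g / (G ** 0.5 + eps), G

def rmsprop_step(theta, g, Eg2, eta=0.1, rho=0.9, eps=1e-8):
    """RMSProp: exponentially decayed average, so old gradients fade away."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    return theta - eta * g / (Eg2 ** 0.5 + eps), Eg2
```

Feed both a constant gradient and AdaGrad's step size decays toward zero, while RMSProp's settles at roughly $\eta$.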
<h3>Adam (Adaptive Moment Estimation)</h3>
<p>Adam combines the momentum from SGD with the adaptive learning rates from RMSProp and also adds a bias correction. For each parameter $\theta$ at step $t$, the gradient is $g_t = \nabla_{\theta_t} L$. Then we compute the first moment (the mean of gradients, like momentum):</p>
$$ m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t $$
<p>And the second moment (the mean of squared gradients, like RMSProp):</p>
$$ v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 $$
<p>There’s a problem with the exponential moving average (EMA): the moments are initialized to zero at the very first step, so the running stats start biased toward zero and take a while to build up. A bias correction fixes this, ensuring a stable update even at early steps. Adam rescales the moments by their “expected underestimation factor”:</p>
$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad \text{and} \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
<p>When $t$ is small, the denominator is small, so the correction inflates the moments back to realistic values, and as $t$ grows, the correction goes away. Then we get the update rule:</p>
$$ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$
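<p>Putting the four equations together, an Adam sketch on scalar parameters:</p>

```python
def adam(grad, theta, steps, eta=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum (m) + RMSProp-style scaling (v) + bias correction."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g        # first moment: mean of gradients
        v = b2 * v + (1 - b2) * g * g    # second moment: mean of squared gradients
        m_hat = m / (1 - b1 ** t)        # bias correction: undo the zero init
        v_hat = v / (1 - b2 ** t)
        theta = theta - eta * m_hat / (v_hat ** 0.5 + eps)
    return theta
```

Note the very first update: $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the step is roughly $\eta \cdot \mathrm{sign}(g_1)$ regardless of the gradient's scale.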
<h3>L1 vs. L2 Regularization</h3>
<p>Because we want a model to generalize, we don’t want it to overfit, so we regularize it. One way to do that is to discourage very large weights. So, we add a penalty to the loss. <strong>L2 regularization</strong> adds the squared L2-norm of the weights:</p>
$$ L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2} \|\theta\|_2^2 $$
<p>Where $\|\theta\|_2^2 = \sum_i \theta_i^2$. If a parameter grows, the penalty term grows quadratically. The optimizer is then pushed to keep the weights small and spread out instead of letting one parameter dominate. L2 regularization makes the optimization prefer solutions within a ball around the origin. Inside this ball, the optimal solution is the one where the energy is spread across coordinates, not spiked in one direction. It moves the weights closer to the origin. L2 produces <strong>shrinkage</strong>; weights spread out but remain non-zero.</p>
<p><strong>L1 regularization</strong> uses absolute values:</p>
$$ L_{\text{reg}}(\theta) = L(\theta) + \lambda \|\theta\|_1 $$
<p>Where $\|\theta\|_1 = \sum_i |\theta_i|$. Its derivative is either +1, -1, or undefined at zero. Its constraint looks like a diamond shape. Even small weights feel a push toward zero, and because the gradient is undefined at zero, the optimizer might choose to stop there. Because the gradient is either +1 or -1, the force is the same regardless of the weight's size, so small weights eventually get dragged to zero. L1 produces <strong>sparsity</strong>.</p>
<h3>AdamW (Adam with Weight Decay)</h3>
<p>Adam and most other optimizers implement L2 regularization by adding the gradient of the L2 norm to the loss gradient: $g_t = \nabla_{\theta_t} L + \lambda \cdot \theta_t$. This is then used in all the first and second moment updates. However, because it was added inside the gradient calculation, the second moment, which contributes to the scaling factor, also includes the L2 norm, distorting the effect of weight decay because it’s being adaptively rescaled.</p>
<p>AdamW keeps a moving average of only the true gradients (without the L2 term) and applies the regularization term to the update separately:</p>
$$ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \cdot \lambda \cdot \theta_t $$
<p>Where $\lambda$ is the weight decay. So, remember that <strong>weight decay is not L2 regularization</strong> for adaptive methods; that’s why there’s a separate AdamW optimizer.</p>
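<p>The difference from Adam is just where the decay term enters; a sketch of AdamW, with the same moment updates but the decay applied directly to the weights:</p>

```python
def adamw(grad, theta, steps, eta=0.1, b1=0.9, b2=0.999, lam=0.01, eps=1e-8):
    """AdamW: moments see only the true gradient; weight decay is decoupled."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)                  # no lambda*theta mixed in here
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        # adaptive step from the loss, plus a plain multiplicative decay
        theta = theta - eta * m_hat / (v_hat ** 0.5 + eps) - eta * lam * theta
    return theta
```

With a zero loss gradient, the weights simply decay geometrically by $(1 - \eta\lambda)$ per step, independent of the adaptive scaling, which is exactly how weight decay should behave.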
<h3>Nadam (Nesterov-accelerated Adam)</h3>
<p>Nadam combines the lookahead moment from Nesterov with the Adam optimizer. The update is:</p>
$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1-\beta_1) g_t}{1 - \beta_1^t} \right) $$
<p>This isn’t much like Nesterov, where we actually took a step in a direction and used that slope to update our parameters; it’s just blending the current gradient (again) with the smoothed momentum term. It’s like short-circuiting the dilution in the smoothed mean by adding a stronger dose of the current slope so it can react more quickly.</p>
<h3>SAM (Sharpness-Aware Minimization)</h3>
<p>SAM is more of a wrapper that can be layered on top of an optimizer. Most optimizers minimize the training loss, which can lead them to find sharp minima that might have poor generalization. SAM looks for flat minima.</p>
$$ \min_{\theta} \max_{\|\epsilon\|_p \le \rho} L(\theta + \epsilon) $$
<p>Where $\epsilon = \rho \frac{\nabla L(\theta)}{\|\nabla L(\theta)\|_2}$ and $\rho$ is a hyperparameter. It’s minimizing the worst-case loss in a small neighborhood around $\theta$. The inner maximization is looking for a worst-case perturbation $\epsilon$ within a radius $\rho$. The outer minimization updates $\theta$ to reduce that worst-case loss. The optimizer prefers parameters that sit in a flat valley. It’s asking, "Would the loss remain low if I wiggle my weights a little bit?" SAM computes the raw gradient, uses that to form the perturbation, then temporarily shifts the weights and computes the gradient there. SAM is usually used with AdamW. AdamW runs as usual but instead of the original gradient $g$, it uses $g_{\text{sam}}$ to update its moments.</p>
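<p>The two gradient evaluations at the heart of SAM can be sketched as a wrapper (scalar version for clarity; for vectors, normalize by the L2 norm):</p>

```python
def sam_gradient(grad, theta, rho=0.05):
    """SAM: perturb theta toward the (approximate) worst point in a rho-ball,
    then return the gradient there for the base optimizer to consume."""
    g = grad(theta)                # step 1: raw gradient
    norm = abs(g) or 1e-12         # guard against a zero gradient
    eps = rho * g / norm           # first-order worst-case perturbation
    return grad(theta + eps)       # step 2: g_sam at the perturbed weights
```

For $L(\theta)=\theta^2$ at $\theta=1$ with $\rho=0.05$, the perturbed point is $1.05$ and $g_{\text{sam}} = 2.1$; the base optimizer (e.g. AdamW) then uses this in place of the raw gradient.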
<h3>Learning Rate Schedulers</h3>
<p>A constant learning rate is rarely optimal: too high and training is unstable; too low and it gets stuck. Schedulers adapt the learning rate over time.</p>
<ul>
<li><strong>Step Decay:</strong> Drops the learning rate by a factor every few epochs. $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$</li>
<li><strong>Exponential Decay:</strong> Instead of sudden drops, the learning rate decays smoothly every step/epoch. $\eta_t = \eta_0 \cdot e^{-\lambda t}$</li>
<li><strong>Cosine Annealing:</strong> Instead of decreasing stepwise or exponentially, cosine annealing makes the learning rate follow a cosine curve.
$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min}) \left(1 + \cos\left(\frac{t}{T} \pi\right)\right) $$
The exponential decay can shrink the learning rate too early and cause the model to settle down too soon, so it gets harder to escape suboptimal minima. With cosine annealing, the learning rate stays higher for longer and encourages exploration. It can be combined with <strong>warm restarts</strong>, where after it reaches the minimum learning rate, it resets back to the maximum and decays again to explore multiple basins.</li>
<li><strong>One-Cycle Policy:</strong> Instead of just decreasing the learning rate, it increases it first and then decreases it in a single cycle. This helps the model escape bad minima early and then settle into flat minima later. It’s typically paired with a momentum schedule that does the opposite (decreases momentum while the learning rate increases). When we have a larger learning rate and lower momentum, it encourages the model to take large gradient steps, leading to lots of exploration and rapid movements.</li>
</ul>
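<p>Step decay and cosine annealing are short enough to write out directly (a sketch; the default values are made up):</p>

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, s=10):
    # Drop the learning rate by a factor gamma every s epochs.
    return eta0 * gamma ** (t // s)

def cosine_lr(t, T, eta_min=1e-5, eta_max=1e-2):
    # Cosine annealing: eta_max at t=0, gliding down to eta_min at t=T.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```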
<h3>Gradient Clipping</h3>
<p>Gradient clipping is another technique to keep gradients under control by capping their size: gradients that fall outside a range are clamped back into it. Most commonly, it rescales the whole gradient vector so its norm doesn’t exceed a threshold $c$.</p>
$$ g \leftarrow g \cdot \frac{c}{\|g\|} \quad \text{if } \|g\| > c $$
<p>This keeps the direction of the gradient and only shrinks its magnitude: if the norm exceeds $c$, the gradient is rescaled so its new norm is exactly $c$; otherwise it is left untouched.</p>
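<p>A minimal norm-clipping sketch in NumPy:</p>

```python
import numpy as np

def clip_by_norm(g, c):
    # Rescale g so its L2 norm never exceeds c; the direction is preserved.
    norm = np.linalg.norm(g)
    if norm > c:
        g = g * (c / norm)
    return g
```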
<hr>
<h2 id="scaling-laws">Scaling Laws</h2>
<p>This brings us to the final topic: <strong>scaling laws</strong>. How do you know which error you need to optimize? Where do you need to invest more resources?</p>
<p>Scaling laws describe how performance (usually test loss) improves as you increase data size, model size, or compute. You want to know if you should invest in more data collection, a bigger model, or if the bottleneck is the optimization error. Scaling laws help you decide where the marginal gain is coming from.</p>
<p>The following equation is a common form known as a power law; it has been found to hold empirically across many domains.</p>
$$ L(N) = L_{\infty} + K \cdot N^{-\alpha} $$
<p>Where $L(N)$ is the loss when trained on a dataset of size $N$, $L_{\infty}$ is the irreducible error, and $K$ and $\alpha$ are constants; $\alpha$ is the scaling exponent. You estimate these by fitting a line in log-log space.</p>
<p>But wait, how do you estimate $L_{\infty}$? You can train your model on multiple dataset sizes and record the test loss. Then plot $L(N) - L_{\infty}$ vs. $N$ in a log-log space. You adjust $L_{\infty}$ until the curve is a straight line, then fit the slope, which will be $-\alpha$. If adding more data improves the test loss, you haven’t hit $L_{\infty}$ yet. If the curve starts to flatten, you’re approaching $L_{\infty}$. If you guess wrong (too small), it will flatten out prematurely. If you guess too large, it may go negative at large $N$ or bend the other way.</p>
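<p>The procedure above can be sketched as a grid search over candidate floors, picking the one that makes the log-log curve straightest. The data here is synthetic, generated from a known floor of 0.02 and exponent 0.4, so the recovered values can be checked:</p>

```python
import numpy as np

# Synthetic losses from a known law: L(N) = 0.02 + 3.0 * N**-0.4
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6])
L = 0.02 + 3.0 * N ** -0.4

best = None
for floor in np.linspace(0.0, 0.025, 251):    # candidate L_inf values
    x, y = np.log(N), np.log(L - floor)       # straight line iff floor is right
    slope, intercept = np.polyfit(x, y, 1)
    resid = np.sum((y - (slope * x + intercept)) ** 2)
    if best is None or resid < best[0]:       # keep the straightest fit
        best = (resid, floor, -slope, np.exp(intercept))

_, L_inf, alpha, K = best
# Recovers L_inf ~ 0.02, alpha ~ 0.4, K ~ 3.0
```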
<p>Once you have fit the line correctly, you know three things: the irreducible error, $\alpha$ (which is the slope in log-log space and tells you how fast the error falls when you add more data, parameters, or compute), and a constant factor. If the irreducible error is high, that means your modality has a limit. If $\alpha$ is large (like 0.5), then doubling the data would give you a noticeable gain, and if it’s small, not so much. You can also predict performance as you scale the data.</p>
<p>$G(N) = L(N) - L_{\infty} = K \cdot N^{-\alpha}$. $G$ is the gap above the floor.</p>
<p>$G(2N) = G(N) \cdot 2^{-\alpha}$ is how much of the gap remains after doubling the data. When you scale the data by a factor $c$ (e.g., $c=2$ for doubling), then $G(cN)/G(N) = c^{-\alpha}$. If $\alpha$ was 0.5, this ratio is $2^{-0.5} \approx 0.71$. This means 71% of the gap remains. So if the gap was 5% (0.05), by doubling the data, you get a new gap of $0.05 \cdot 0.71 = 0.035$. The new loss would be $L(2N) = L_{\infty} + G(2N)$.</p>
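<p>The arithmetic above can be checked directly:</p>

```python
alpha = 0.5
gap = 0.05                 # current gap above the irreducible floor
ratio = 2 ** -alpha        # fraction of the gap left after doubling N
new_gap = gap * ratio
# ratio ~ 0.707 (71% of the gap remains), new_gap ~ 0.0354
```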
<p>You can apply the same equation to model size (number of parameters) or compute. So when you plot loss vs. dataset size on a log-log plot, you often get a straight line, and that line’s slope tells you how much benefit you would get from scaling the data. Applied to model size, it tells you whether a bigger model is worth it. And if you’re building a model for a wearable with a hard compute limit, the fitted law tells you the best loss you can reach within that cap, and how far short of it your budget leaves you.</p>
<p>There are also joint scaling laws:</p>
$$ L(N,P) \approx L_{\infty} + K \cdot N^{-\alpha} + C \cdot P^{-\beta} $$
<p>Where $P$ is the number of parameters. This lets you trade off between being data-limited and model-limited. Now you would want to see a smooth surface over the $(N, P)$ plane. If you hold one variable fixed and vary the other, you recover either $-\alpha$ or $-\beta$ as the slope. You estimate $L_{\infty}$ by ensuring that both dimensions straighten out simultaneously.</p>
<h3>Fitting a Line with Least Squares</h3>
<p>This might be obvious, but how do you fit a line or a surface? This is actually a nice gateway back to optimizers. Say you have a list of data points. You want to fit a line with parameters $\theta$. The residual error is $r_i = y_i - f(x_i, \theta)$. <strong>Least squares</strong> chooses the parameters $\theta$ to minimize the sum of squared errors.</p>
$$ \min_{\theta} \sum_{i=1}^{n} (y_i - f(x_i, \theta))^2 $$
<p>It squares the error so it penalizes larger errors more. It gives a nice, smooth, convex function, and for linear models, there is a closed-form solution. Say $y = ax + b$. The cost function is:</p>
$$ J(a,b) = \sum_{i=1}^{n} (y_i - (ax_i + b))^2 $$
<p>The solution to this would give the line of best fit. You take the partial derivatives with respect to $a$ and $b$ and solve the normal equations:</p>
$$ \frac{\partial J}{\partial a} = -2 \sum_{i=1}^{n} x_i(y_i - ax_i - b) = 0 $$
$$ \frac{\partial J}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (ax_i + b)) = 0 $$
<p>If you do some math and solve for $a$ and $b$, you can find a closed-form solution for them:</p>
$$ a = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\text{cov}(x,y)}{\text{var}(x)} $$
$$ b = \bar{y} - a\bar{x} $$
<p>Where $\bar{y}$ is the mean of all $y$'s and $\bar{x}$ is the mean of all $x$'s.</p>
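<p>The closed-form solution is a few lines of NumPy. The points below are made up and lie exactly on $y = 2x + 1$, so the formulas should recover the slope and intercept exactly:</p>

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                      # points exactly on y = 2x + 1

# a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
# a = 2.0, b = 1.0
```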
</body>
</html>