<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Transformers Explained: Self-Attention and Masking</title>
<style>
body {
font-family: sans-serif;
max-width: 900px;
margin: 0 auto;
padding: 2rem;
line-height: 1.6;
color: #333;
}
h1, h2, h3, h4 {
color: #2c3e50; /* Darker shade for better contrast */
margin-top: 1.5em;
margin-bottom: 0.5em;
}
h1 { font-size: 2.5em; border-bottom: 2px solid #3498db; padding-bottom: 0.3em;}
h2 { font-size: 2em; border-bottom: 1px solid #bdc3c7; padding-bottom: 0.2em;}
h3 { font-size: 1.5em; }
h4 { font-size: 1.2em; color: #555;}
nav { margin-bottom: 30px; padding: 10px; background: #ecf0f1; border: 1px solid #bdc3c7; border-radius: 4px;}
nav ul { list-style: none; padding: 0; }
nav li { display: inline-block; margin-right: 15px; }
nav a { text-decoration: none; color: #3498db; font-weight: bold;}
nav a:hover { text-decoration: underline; color: #2980b9;}
pre {
background: #f8f9f9; /* Lighter background for code blocks */
padding: 1rem;
overflow-x: auto;
border: 1px solid #e1e4e8; /* Softer border */
border-left: 4px solid #3498db; /* Accent border */
border-radius: 4px;
font-size: 0.9em;
}
code {
font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
}
/* For inline code */
p > code, li > code, table td > code {
background: #e8eaed;
padding: 0.2em 0.4em;
border-radius: 3px;
font-size: 0.85em;
}
pre code { /* Reset for code inside pre, already handled by pre styling */
background: none;
padding: 0;
font-size: 1em; /* Ensure pre's font size is inherited */
}
ul, ol {
padding-left: 20px;
}
li {
margin-bottom: 0.5em;
}
strong {
color: #2980b9;
}
hr {
border: 0;
height: 1px;
background: #bdc3c7;
margin-top: 2em;
margin-bottom: 2em;
}
</style>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']],
displayMath: [['$$','$$'], ['\\[','\\]']],
processEscapes: true
}
});
</script>
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/github-dark.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
<script>document.addEventListener('DOMContentLoaded', function () { hljs.highlightAll(); });</script>
</head>
<body>
<nav>
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#self-attention">Self-Attention</a></li>
<li><a href="#mha">Multi-Head Attention</a></li>
<li><a href="#masking">Masking</a></li>
<li><a href="#encoder-decoder">Encoder-Decoder</a></li>
<li><a href="#variants">Variants</a></li>
</ul>
</nav>
<h1 id="intro">Introduction to Transformers</h1>
<p>
Older sequence models like RNNs process one token at a time, so they struggle with long dependencies and cannot parallelize well. CNNs can look at a local neighborhood, but stacking many layers is needed to capture long-range relations.
</p>
<p>
<strong>Transformers</strong> replace recurrence and convolutions with <strong>self-attention</strong>, a mechanism where every token can directly interact with every other token in a single step. A Transformer has two main parts: an <strong>encoder</strong> and a <strong>decoder</strong>. Each encoder layer contains two primary sub-layers (the decoder adds a third, cross-attention, covered later):
</p>
<ul>
<li>A <strong>self-attention</strong> layer that allows each token to attend to all others.</li>
<li>A <strong>feed-forward</strong> block that applies a position-wise MLP to enrich each token's representation.</li>
</ul>
<p>
These parts also employ residual connections, layer normalization, and positional encodings to function effectively.
</p>
<hr>
<h2 id="self-attention">The Self-Attention Mechanism</h2>
<p>
Self-attention is the heart of the Transformer. Suppose you have an input sequence of length $n$ with an embedding of dimension $d$. We typically map each discrete token to a continuous vector using a learned <strong>embedding matrix</strong>, which is essentially a big lookup table.
</p>
<p>
For example, if you have a vocabulary of 30,000 tokens and you choose an embedding size of 512, then the embedding matrix will have a shape of $30,000 \times 512$. When you feed a word to the model, it looks up the corresponding row in this matrix, and that becomes the embedding vector of size 512 for that word.
</p>
<p>
The input sequence has a length of $n$ (the number of tokens), and each token has an embedding dimension of $d_{model}$. After the embedding lookup, the input tensor $X$ has a shape of $n \times d_{model}$.
</p>
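<p>As a quick sanity check, here is a minimal NumPy sketch of the lookup, using the sizes from the example above (the matrix is random here; in a real model it is learned):</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 30_000, 512                # sizes from the example above
E = rng.standard_normal((vocab_size, d_model))   # stand-in for a learned embedding matrix

token_ids = np.array([17, 4242, 9])              # a 3-token input sequence
X = E[token_ids]                                 # embedding lookup is just row indexing
assert X.shape == (3, d_model)                   # n x d_model
</code></pre>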
<h3>Query, Key, and Value</h3>
<p>First, you project each input embedding into three distinct vectors:</p>
<ul>
<li><strong>Query (Q)</strong>: Represents what a token is looking for. Each word effectively asks a "question."</li>
<li><strong>Key (K)</strong>: Represents what a token contains or offers. It "answers" the questions posed by other words.</li>
<li><strong>Value (V)</strong>: Represents the actual information or content of a token. This is the information that gets pulled in by other tokens that find it relevant.</li>
</ul>
<p>We obtain these vectors by multiplying the input embedding matrix $X$ by learned weight matrices:</p>
$$ Q = X W_Q $$
$$ K = X W_K $$
$$ V = X W_V $$
<p>
These weight matrices project the input embeddings from dimension $d_{model}$ into a dimension $d_k$. The <strong>Query</strong> and <strong>Key</strong> vectors must always have the same dimension, $d_k$, because their dot product is computed later. The <strong>Value</strong> vector typically has the same dimension, but some implementations allow it to be different. At this point, our $Q$, $K$, and $V$ matrices each have a shape of $n \times d_k$.
</p>
<h3>Attention Score Calculation</h3>
<p>
Next, we compute the attention scores. This is done by taking the dot product of every query with every key. The resulting score between token $i$ and token $j$ determines how much token $i$ should attend to token $j$.
</p>
$$ \text{score}(i,j) = Q_i \cdot K_j $$
<p>This is vectorized to compute the entire $n \times n$ score matrix at once:</p>
$$ \text{Scores} = Q K^T $$
<p>
The dot product is used as a measure of similarity. Since $Q$ and $K$ are in the same vector space, their dot product measures both magnitude and directional alignment. If a query and a key point in the same direction, the model considers them relevant.
</p>
<p>
We then scale these scores by dividing by $\sqrt{d_k}$. Let’s understand why. If you assume the components of $Q$ and $K$ are normally distributed with a mean of 0 and a variance of 1, then their dot product will have a mean of 0 and a variance of $d_k$. The magnitude of this dot product would then be on the order of $\sqrt{d_k}$. By dividing by $\sqrt{d_k}$, we are normalizing the scores to have a variance of 1, which helps stabilize training. Without this scaling, the scores would grow with the dimension $d_k$.
</p>
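<p>We can verify this variance argument empirically with a small NumPy experiment (the sample count and $d_k$ are arbitrary):</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((100_000, d_k))   # components with mean 0, variance 1
k = rng.standard_normal((100_000, d_k))

scores = (q * k).sum(axis=1)              # 100k independent dot products
print(scores.var())                        # close to d_k = 64
print((scores / np.sqrt(d_k)).var())       # close to 1 after scaling
</code></pre>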
<p>
We want these scores to behave like attention weights—they should be non-negative and sum to 1 to form a probability distribution. That’s what the <strong>softmax</strong> function does. For very large scores (i.e., if we did not normalize them), the softmax output would collapse to a one-hot distribution where one weight is 1 and all others are 0. This would cause the query to "latch on" to a single token, ignoring all other context. This not only kills the attention mechanism's ability to softly mix information but also introduces <strong>vanishing gradients</strong>, as small changes in extreme values barely change the output, preventing the network from adjusting the attention weights. The normalization by $\sqrt{d_k}$ is therefore crucial for keeping the softmax in a smooth, trainable zone.
</p>
<p>The final attention weights, which we can call $\alpha$, are calculated as:</p>
$$ \text{Attention Weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) $$
<p>This gives us an $n \times n$ matrix.</p>
<h3>Weighted Sum and Final Projection</h3>
<p>Now, we use our attention weights to combine the <strong>Value</strong> vectors. For each token $i$, the output is a weighted sum of all other value vectors:</p>
$$ \text{Output}_i = \sum_j \alpha_{ij} V_j $$
<p>The entire output matrix is computed as $(\text{Attention Weights}) \times V$. The output has the shape $n \times d_k$.</p>
<p>
Finally, a learned linear projection matrix, $W_O$, is applied. This projection re-mixes the features after the attention step. Without it, the output would just be a weighted average of the values.
</p>
$$ \text{Final Output} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V W_O $$
<p>
The matrix $W_O$ (with shape $d_k \times d_{model}$) adapts the output back to the model’s main feature space, $d_{model}$, so that it can be passed to subsequent feed-forward blocks consistently.
</p>
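<p>Putting the whole pipeline together, here is a minimal NumPy sketch of single-head self-attention, with toy sizes and random matrices standing in for learned weights:</p>
<pre><code class="language-python">
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V, W_O):
    """Scaled dot-product self-attention, shapes as in the text."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # each n x d_k
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # n x n, each row sums to 1
    return (weights @ V) @ W_O                 # n x d_model

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
W_O = rng.standard_normal((d_k, d_model)) * 0.1
out = self_attention(X, W_Q, W_K, W_V, W_O)
assert out.shape == (n, d_model)               # same shape as the input
</code></pre>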
<hr>
<h2 id="mha">Multi-Head Attention</h2>
<p>
So far, for each token, we have learned a single representation of size $d_k$, meaning the model learns one type of relationship at a time. We can introduce <strong>multiple heads</strong> to allow the model to learn multiple types of relationships in parallel. Each head projects $Q$, $K$, and $V$ into a different learned subspace, enabling the model to look at the sequence in different ways simultaneously.
</p>
<p>
We split the model dimension, $d_{model}$, into $h$ smaller chunks (the heads), each with dimension $d_k = d_{model} / h$. For instance, we can take a 512-dimensional embedding and split it into 8 heads, each with 64 dimensions. Each head then runs its own scaled dot-product attention on its respective chunk.
</p>
<p>
Instead of one set of $W_Q, W_K, W_V$, we now learn separate projection matrices for each head $i$:
</p>
$$ Q^i = X W_Q^i $$
$$ K^i = X W_K^i $$
$$ V^i = X W_V^i $$
<p>Each head $i$ computes its own attention output, $Z^i$, with shape $n \times d_k$. After each head has independently analyzed the sequence through its own "lens," we concatenate the outputs:</p>
$$ Z = \text{concat}(Z^1, Z^2, \dots, Z^h) $$
<p>The resulting matrix $Z$ has a shape of $n \times (h \times d_k)$, which is equivalent to $n \times d_{model}$. We then apply a final learned projection matrix $W_O$ (with shape $d_{model} \times d_{model}$) to re-mix the information across all the heads, allowing the model to combine the different views each head produced.</p>
$$ O = Z W_O $$
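<p>A minimal NumPy sketch of multi-head attention, looping over heads for clarity (real implementations batch all heads into a single tensor operation):</p>
<pre><code class="language-python">
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
d_k = d_model // h                              # 64 dimensions per head
X = rng.standard_normal((n, d_model))

heads = []
for _ in range(h):                              # separate projections per head
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.05 for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # Z^i: n x d_k

Z = np.concatenate(heads, axis=-1)              # n x (h * d_k) = n x d_model
W_O = rng.standard_normal((d_model, d_model)) * 0.05
O = Z @ W_O                                     # re-mix information across heads
assert O.shape == (n, d_model)
</code></pre>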
<hr>
<h2 id="masking">Masking in Attention</h2>
<p>
Masking prevents the attention mechanism from attending to certain tokens. There are two main types of masking.
</p>
<h3>Padding Mask</h3>
<p>
A <strong>padding mask</strong> is typically applied when batching sequences where some are shorter than others. We pad these sequences with dummy tokens to make them the same shape, but we don’t want the model to attend to these dummy tokens. So, we mask these positions by setting their attention scores to $-\infty$ before applying the softmax.
</p>
<p>After computing the attention scores $S = QK^T / \sqrt{d_k}$, we apply a mask matrix $M$:</p>
$$ S' = S + M $$
<p>where $M_{ij} = 0$ for tokens that are allowed to attend and $-\infty$ for those that are blocked.</p>
<p>
For the padding mask, we want to mask out some <strong>keys</strong>, not queries: if key $j$ is a padding token, every query $i$ should ignore it. The interplay between attention masking and loss masking is treated in detail below.
</p>
<h3>Causal (Look-Ahead) Mask</h3>
<p>
There is also a <strong>causal</strong>, or <strong>look-ahead mask</strong>. In decoder models, when generating the next token, we don’t want the model to peek at future tokens. For a position $t$, we block attention to all positions $> t$ to enforce autoregressive behavior. This is achieved by adding an upper-triangular matrix of $-\infty$ values (where the diagonal is 0) to the score matrix.
</p>
$$ M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases} $$
<p>At position 1, a token can only attend to itself. At position 2, it can attend to token 1 and itself. At position 3, it can attend to tokens 1, 2, and 3, and so on.</p>
<p>
This mask is an <code>n x n</code> matrix with $-\infty$ strictly above the diagonal, broadcast across the batch. Each row corresponds to a query at position <code>i</code>, which can only attend to columns <code>j &lt;= i</code>.
</p>
<p>
So, after we compute the attention score and before applying the softmax:
</p>
$$ S = \frac{QK^T}{\sqrt{d_k}} \quad (\text{shape: } n \times n) $$
<p>We want to apply a mask matrix M:</p>
$$ S' = S + M $$
<p>Where $M_{ij} = 0$ allows tokens to attend and $-\infty$ blocks them.</p>
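<p>Here is a tiny NumPy sketch of the causal mask and its effect after the softmax, using uniform scores as a stand-in for $QK^T/\sqrt{d_k}$:</p>
<pre><code class="language-python">
import numpy as np

n = 4
M = np.triu(np.full((n, n), -np.inf), k=1)   # -inf strictly above the diagonal
scores = np.zeros((n, n))                     # stand-in for the raw score matrix
weights = np.exp(scores + M)                  # exp(-inf) = 0 kills future positions
weights = weights / weights.sum(axis=1, keepdims=True)
print(weights[2])                             # uniform over positions 1-3, zero on the future
</code></pre>
<p>Row 1 attends only to itself, row 2 splits its weight over the first two positions, and so on: the autoregressive pattern described above falls out of the mask automatically.</p>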
<h3>Loss vs. Attention Mask</h3>
<p>
For a <strong>padding mask</strong> (a sequence-level mask), suppose we have a batch size of $B$. Some tokens are real and some are padding. We want to mask out some <strong>keys</strong>, not queries. That means if key $j$ is a padding token, all queries $i$ should ignore it. So, the masking needs to be along the key dimension.
</p>
$$ M_{bij} = \begin{cases} 0 & \text{if key position } j \text{ is real} \\ -\infty & \text{if key position } j \text{ is padding} \end{cases} \quad (\text{shape: } B \times 1 \times n) $$
<p>
The <code>1</code> in the middle dimension means the same mask applies to all queries $i$. It gets broadcast during addition: <code>(B x 1 x n) + (B x n x n) -> (B x n x n)</code>.
</p>
<p>
Broadcasting always confuses me, so let’s discuss it again. Broadcasting compares shapes element-wise from the right. If the dimensions match or one of them is 1, they’re compatible. The dimension of size 1 is stretched to match the other. The mask only told us which keys (columns) are real or padding. So, we apply the same restriction to every query row by repeating the mask along that dimension.
</p>
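<p>A small NumPy sketch of the padding mask and its broadcast; the batch size and sequence lengths are made up:</p>
<pre><code class="language-python">
import numpy as np

B, n = 2, 4
lengths = [4, 2]                        # real (unpadded) length of each sequence
pad_mask = np.zeros((B, 1, n))
for b, L in enumerate(lengths):
    pad_mask[b, 0, L:] = -np.inf        # block padded key positions

scores = np.zeros((B, n, n))            # stand-in attention scores
masked = scores + pad_mask              # (B, 1, n) broadcasts over the query dim
assert masked.shape == (B, n, n)
print(masked[1])                        # every query row blocks keys 3 and 4
</code></pre>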
<p>
What about a padding query token? Say we use categorical cross-entropy for next-token prediction. We don’t want to penalize the model for its predictions at <code>&lt;pad&gt;</code> positions, so we zero out their contribution to the loss. We have to mask padded keys because they get integrated into every token’s context vector. Padded queries, on the other hand, are mostly harmless: whatever the model produces at those positions is already ignored, because padded positions are masked in the loss function as well. We just don’t want real tokens wasting attention on padding keys.
</p>
<p>
It’s kind of like queries are the "listeners" and keys are the "speakers." If a padding query hears nonsense, that’s fine, because we never ask for its opinion anyway. But if a padding token is a "speaker," that’s bad, because a real listener might listen to it and get confused. We need both <strong>attention masking</strong> and <strong>loss masking</strong>.
</p>
<ul>
<li><strong>Loss masking</strong> doesn't penalize the model for its outputs at padded positions. The output from a padded query position doesn't affect training.</li>
<li><strong>Attention masking</strong> makes sure real tokens don’t get confused by attending to padded inputs.</li>
</ul>
<hr>
<h2 id="encoder-decoder">The Encoder-Decoder Architecture</h2>
<p>
So far, we had our input sequence of <code>n x d_model</code> with some positional encoding added. We compute multi-head attention (MHA) and output a tensor of shape <code>n x d_model</code>, which is the same shape as the input.
</p>
<h3>The Encoder Block</h3>
<p>
Then in our encoder block, we have an "add and norm" block, which uses a <strong>residual connection</strong>. It adds the input <code>X</code> to the attention output and applies a <strong>LayerNorm</strong> to stabilize it.
</p>
$$ H_1 = \text{LayerNorm}(X + \text{MHA}(X)) \quad (\text{shape: } n \times d_{model}) $$
<p>
A 2-layer MLP, called a Position-wise Feed-Forward Network (FFN), is then applied independently to each position.
</p>
$$ \text{FFN}(x) = \text{max}(0, xW_1 + b_1)W_2 + b_2 $$
<p>
This expands the dimension from $d_{model}$ to maybe $4 \times d_{model}$ and then projects it back down to $d_{model}$ while preserving the sequence length. Then we add and norm again with another residual connection.
</p>
$$ H_2 = \text{LayerNorm}(H_1 + \text{FFN}(H_1)) $$
<p>
Why do this? Deep networks suffer from vanishing gradients. The residual path provides a shortcut for gradients to flow backward, making optimization easier and allowing the whole block to safely act like an identity function at the start of training. The non-linear transformation is more expressive in a higher dimension, so the expansion allows the model to process richer combinations of features before compressing them back. We always try to hold on to the original information using residual connections.
</p>
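<p>The Post-LN encoder block above can be sketched in a few lines of NumPy. Here a zero function stands in for MHA, the FFN weights are random, and the learnable $\gamma$ and $\beta$ of LayerNorm are omitted for brevity:</p>
<pre><code class="language-python">
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # per-token statistics
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)            # gamma = 1, beta = 0 for brevity

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # position-wise ReLU MLP

def encoder_block(X, mha, ffn_params):
    H1 = layer_norm(X + mha(X))                # Add and Norm (Post-LN, as above)
    H2 = layer_norm(H1 + ffn(H1, *ffn_params)) # second residual + norm
    return H2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 16, 64                   # d_ff = 4 x d_model
X = rng.standard_normal((n, d_model))
params = (rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff),
          rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model))
zero_attn = lambda x: np.zeros_like(x)         # stand-in for a real MHA sublayer
out = encoder_block(X, zero_attn, params)
assert out.shape == (n, d_model)               # shape is preserved through the block
</code></pre>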
<h4>Pre-LN vs. Post-LN</h4>
<p>
Now, you might have heard of <strong>Pre-LN</strong> and <strong>Post-LN</strong>. In the original Vaswani paper, LayerNorm was applied <em>after</em> the residual connection was added (Post-LN, as shown above). In modern practice, LayerNorm is often applied <em>before</em> the sublayer, and the residual is added afterward (Pre-LN). This keeps the residual path clean and helps gradients remain stable.
</p>
$$ H = X + \text{MHA}(\text{LayerNorm}(X)) $$
<p>
This ordering is important. If the input <code>X</code> is too large, we don’t want it to drown out the MHA output, so we apply LayerNorm to the input, not the output. This keeps the relative balance between the residual and the sublayer stable. So, Pre-Layer Norm addresses two things: a clean residual path and a strong, stable gradient flow.
</p>
<p>
With a clean residual path $H = X + f(X)$, the gradient is:
</p>
$$ \frac{dH}{dX} = I + \frac{df(X)}{dX} $$
<p>
Where $I$ is the identity matrix. This is the magic of residuals. With Post-Layer Norm, $H = \text{LayerNorm}(X + f(X))$, the gradient has to go through the LayerNorm's Jacobian, which complicates the gradient flow.
</p>
<h4>Why LayerNorm?</h4>
<p>
Here, <code>X</code> is our <code>n x d_model</code> tensor, and LayerNorm is applied to the feature dimension ($d_{model}$) for each token. So, why not use something like <strong>BatchNorm</strong>? Why <strong>LayerNorm</strong>?
</p>
<p>
In most sequential problems, sequences have varying lengths. BatchNorm assumes consistent dimensions to normalize over, while LayerNorm doesn’t care: it works token by token. With big Transformer models, BatchNorm statistics are only accurate when batch sizes are large, but memory constraints typically force smaller batches, making the statistics noisy. Furthermore, autoregressive models generate one token at a time at inference, so a batch size of 1 would make BatchNorm undefined.
</p>
<p>LayerNorm computes the mean and standard deviation of a single token's feature vector. There is no mixing across tokens or the batch.</p>
$$ \text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta $$
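<p>A quick NumPy check that the statistics really are per token (two made-up feature vectors, $\gamma = 1$, $\beta = 0$):</p>
<pre><code class="language-python">
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 10.0]])   # two tokens, d_model = 4
mu = x.mean(axis=-1, keepdims=True)         # one mean per token, no batch mixing
sigma = x.std(axis=-1, keepdims=True)
y = (x - mu) / (sigma + 1e-5)               # eps keeps the constant row finite
print(y.mean(axis=-1))                      # each token is now centered at 0
</code></pre>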
<h3>The Decoder Block</h3>
<p>
The decoder block is similar to the encoder block but uses causal masking in its first attention layer. It ensures each position can only attend to itself and earlier tokens. Crucially, we also need to add information from the encoder block to the decoder block using <strong>cross-attention</strong>.
</p>
<p>
The output of the encoder is a sequence of contextualized embeddings of shape <code>n_source x d_model</code>. The input to the decoder is the target sequence (e.g., the Farsi words in a translation task). At training time, we shift the ground-truth target sequence one step to the right. At inference, we autoregressively feed the tokens generated so far.
</p>
<p>
Inside the decoder block, we first apply masked self-attention on the target sequence input, <code>Y_in</code>.
</p>
$$ H_1 = Y_{in} + \text{MHA}_{\text{causal}}(\text{LayerNorm}(Y_{in})) $$
<p>
This hidden state $H_1$ is now contextualized using only past target tokens. Now comes cross-attention. The <strong>queries</strong> come from the decoder, representing what the target tokens need. The <strong>keys and values</strong> come from the encoder, representing the information saved from the source sentence.
</p>
<p>
Say the encoder processes an English sentence. The decoder query for the word "pishi" (Farsi for kitten) asks which English words are relevant to this translation step. The model compares the query with the encoder keys and pulls in the value for the word "cat" into the decoder’s hidden state.
</p>
<p>
We take $H_1$ from the decoder and use it to form the queries for the cross-attention layer. We take the encoder output, $H_{enc}$, which is fixed across all decoder layers, to form the keys and values.
</p>
<p>
The score matrix $QK^T$ will have the shape <code>n_target x n_source</code>. Basically, for each decoder position (query), the attention weights across all encoder positions (keys) sum to 1. Each target query decides how to distribute its attention over the encoder output.
</p>
<p>
Each decoder query pulls information from all encoder values, weighted by relevance. This is followed by another Add & Norm and the FFN.
</p>
$$ H_2 = H_1 + \text{MHA}_{\text{cross}}(Q=\text{LayerNorm}(H_1), K=H_{enc}, V=H_{enc}) $$
$$ H_3 = H_2 + \text{FFN}(\text{LayerNorm}(H_2)) $$
<p>
This is the output of one decoder block. If we had multiple decoder blocks, they would use the previous block’s output as input, but all decoder blocks would use the same encoder output $H_{enc}$ as their keys and values.
</p>
<p>
The encoder output is the memory of the source sequence. The decoder needs consistent access to this memory. What changes are the decoder's queries, which are progressively refined.
</p>
<p>
If you’re like me, you might ask why we use $H_1$ as the decoder’s queries and not wait until after the FFN. One explanation is that $H_1$ has already integrated information about the past target tokens via self-attention. The FFN only adds a position-wise MLP; it doesn’t add any new context across tokens. Therefore, it doesn’t make the queries any more informative about what to attend to in the encoder. That’s why $H_1$ is the right place to ask questions of our source. Once this context is added, we can then refine the token’s representation locally with the FFN.
</p>
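<p>A NumPy sketch of the cross-attention step with made-up sizes and random states; note that the queries come from $H_1$ while the keys and values come from $H_{enc}$:</p>
<pre><code class="language-python">
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d = 7, 4, 16
H_enc = rng.standard_normal((n_src, d))      # encoder memory, fixed for all layers
H1 = rng.standard_normal((n_tgt, d))         # decoder state after masked self-attn

W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q = H1 @ W_Q                                  # queries from the decoder
K, V = H_enc @ W_K, H_enc @ W_V               # keys and values from the encoder

weights = softmax(Q @ K.T / np.sqrt(d))       # n_target x n_source
context = weights @ V                         # each target row mixes source values
assert weights.shape == (n_tgt, n_src)
assert np.allclose(weights.sum(axis=1), 1.0)  # each query distributes weight over the source
</code></pre>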
<hr>
<h3>Positional Encoding</h3>
<p>
So, we skipped a pretty important step: <strong>positional encoding</strong>. Attention takes tokens as a set, not a sequence. The model is <strong>permutation invariant</strong>. But the whole point of Transformers is that sequential ordering matters.
</p>
$$ \text{Attention}(X) = \text{Attention}(\text{shuffle}(X)) $$
<p>
Once we get our token embeddings of size <code>d_model</code>, we add a positional vector <code>P(t)</code> to each token that encodes its position <code>t</code>. This way, the token knows both its semantic meaning and its place in the sequence. In the original paper, they used fixed sinusoidal encodings. For each position <code>t</code> and dimension <code>i</code>:
</p>
$$ P_{t,2i} = \sin(t / 10000^{2i/d_{model}}) $$
$$ P_{t,2i+1} = \cos(t / 10000^{2i/d_{model}}) $$
<p>
This preserves the relative position between <code>t</code> and <code>t + k</code>. Let’s unpack this. We want each position <code>t</code> to be represented as a unique vector but also in a way that captures relative offsets. Let’s review some trigonometry.
</p>
<p>
The term $\sin(\omega t)$ means the function repeats every $2\pi$. The periodicity is inversely proportional to the frequency $\omega$.
</p>
$$ T = \frac{2\pi}{\omega} $$
<p>
A larger $\omega$ means shorter periods (faster oscillation), and a smaller $\omega$ means slower oscillation. In the original paper, they defined the frequency for each dimension <code>i</code> as:
</p>
$$ \omega_i = \frac{1}{10000^{2i / d_{model}}} $$
<p>
When <code>i</code> is 0, $\omega_0 = 1$ and the period is $2\pi$, the fastest oscillation. As <code>i</code> gets larger, $\omega_i$ gets smaller, resulting in a longer period. Each dimension of a token's positional vector at position <code>t</code> gets a sinusoid of a different frequency, so every position is encoded in multiple frequency bands and receives a unique, multi-frequency signature. We use sine and cosine pairs to encode each position as a point on a circle; this avoids ambiguity and lets us derive relative positions using trigonometric identities. In the original paper, high-frequency dimensions capture local order and low-frequency dimensions capture long-range order.
</p>
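<p>The sinusoidal encoding can be sketched directly from the formulas above (assuming an even $d_{model}$; the sizes here are arbitrary):</p>
<pre><code class="language-python">
import numpy as np

def positional_encoding(n_pos, d_model):
    t = np.arange(n_pos)[:, None]                  # positions, column vector
    i = np.arange(d_model // 2)[None, :]           # frequency index per sin/cos pair
    angles = t / (10000 ** (2 * i / d_model))      # t * omega_i
    P = np.zeros((n_pos, d_model))
    P[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    P[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return P

P = positional_encoding(50, 16)
assert P.shape == (50, 16)                         # one d_model vector per position
</code></pre>
<p>This <code>P</code> is simply added to the token embeddings before the first encoder or decoder layer.</p>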
<h3>Output Layer and Weight Tying</h3>
<p>
So, let’s take a look at the input and output of Transformers. At the input, we took discrete tokens and mapped them to $d_{model}$ using an embedding matrix of shape <code>|V| x d_model</code>, where <code>|V|</code> is the vocabulary size.
</p>
<p>
Now, at the decoder’s output, after the last block, we have hidden states of shape <code>n_target x d_model</code>. To predict the next token, we need logits over the vocabulary.
</p>
$$ Z = H_{dec} W_{out} + b $$
<p>
Where $W_{out}$ is the output projection matrix of shape <code>d_model x |V|</code>. The input embedding matrix and our output projection can be the transpose of each other. So, instead of learning two separate matrices, we can set $W_{out} = E^T$. This technique, called <strong>weight tying</strong>, has consistently improved the perplexity of language models.
</p>
<p>
In a translation task, you don’t want to tie the weights to the encoder's embedding matrix, but you could tie it to the decoder’s input embedding, as both are in the same language.
</p>
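<p>Weight tying then amounts to reusing the transposed embedding matrix as the output projection. A NumPy sketch with made-up sizes (in PyTorch the same effect is commonly achieved by assigning the output layer's weight tensor to the embedding's):</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
V, d_model, n_tgt = 1000, 32, 5                 # toy vocabulary and model sizes
E = rng.standard_normal((V, d_model)) * 0.02    # decoder input embedding, |V| x d_model
H_dec = rng.standard_normal((n_tgt, d_model))   # final decoder hidden states

W_out = E.T                                     # weight tying: W_out = E^T, no new parameters
logits = H_dec @ W_out                          # n_target x |V| logits over the vocabulary
assert logits.shape == (n_tgt, V)
</code></pre>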
<h3>Computational Complexity</h3>
<p>Let’s talk complexity.</p>
<h4>For RNNs</h4>
<p>The core computation is:</p>
$$ h_t = f(W_h h_{t-1} + W_x x_t) $$
<p>Where:</p>
<ul>
<li><code>x_t</code> is the input embedding at time $t$, with dimension $d_{in}$</li>
<li><code>h_t</code> is the hidden state at time $t$, with dimension $d$</li>
<li><code>W_h</code> is the hidden-to-hidden weight matrix, with dimensions $d \times d$</li>
<li><code>W_x</code> is the input-to-hidden weight matrix, with dimensions $d \times d_{in}$</li>
</ul>
<p>
The complexity of the matrix multiplications are:
</p>
<ul>
<li>$W_h h_{t-1}$: $O(d^2)$</li>
<li>$W_x x_t$: $O(d \cdot d_{in})$, which is about $O(d^2)$ when $d_{in} \approx d$</li>
</ul>
<p>
The cost for each hidden state is $O(d^2)$. For each time step across a sequence of length $n$, the total cost is $O(n \cdot d^2)$, and memory for storing hidden states is $O(n \cdot d)$. This is linear over the sequence length, but it doesn’t parallelize.
</p>
<h4>For CNNs</h4>
<p>
In a 1D convolution over sequences with a kernel size of $k$ and hidden dimension $d$, for each of the $n$ positions, you have to compute the weighted sum of $k$ neighboring positions. Each token has $d_{in}$ features, so one window has $k \cdot d_{in}$ features in total. The filter is a weight vector of size $k \cdot d_{in}$. To produce $d_{out}$ channels at position $t$, we need $d_{out}$ such filters. So, the weight matrix is $d_{out} \times (k \cdot d_{in})$. The output vector at position $t$ is calculated as $W \times \text{window}(t)$, which has a complexity of $O(k \cdot d_{in} \cdot d_{out})$.
</p>
<p>
So the cost per position is $O(k \cdot d_{in} \cdot d_{out})$ and the cost per layer is $O(n \cdot k \cdot d^2)$. This is limited by the receptive field; you need depth for a global view.
</p>
<h4>Now, the complexity of Transformers</h4>
<p>
In attention, the expensive step is computing the score matrix.
</p>
<p>
Let’s look at matrix multiplication complexity quickly. If A is $m \times d$ and B is $d \times n$, then C = AB is $m \times n$. For every element $C_{ij}$, we take the dot product between row $i$ of A and column $j$ of B, where $d$ is the length of each dot product. There are $m \times n$ elements, so the total cost is $O(m \cdot n \cdot d)$.
</p>
<p>
In attention, computing Q, K, and V has a complexity of roughly $O(n \cdot d_{model} \cdot d_k)$ for each, since X is $n \times d_{model}$ and the weight matrices are $d_{model} \times d_k$.
</p>
<p>
The score matrix is $S = Q K^T$. Multiplying Q ($n \times d_k$) by $K^T$ ($d_k \times n$) results in an $n \times n$ matrix. Each of the $n$ queries must dot product with all $n$ keys. For $h$ heads, the complexity is $O(h \cdot n^2 \cdot d_k)$, with $O(n^2)$ memory for storing the score matrix for each head.
</p>
<p>
Applying the $n \times n$ score matrix to V (an $n \times d_v$ matrix) is $O(n^2 \cdot d_v)$. So for short sequences and large hidden sizes, the complexity is dominated by the feed-forward layers ($O(n \cdot d^2)$), but for long sequences, it’s quadratic in the sequence length ($O(n^2 \cdot d)$). That’s why attention is generally considered expensive; $n^2$ grows quickly. This global receptive field comes with a cost, and it’s a fundamental bottleneck.
</p>
<hr>
<h2 id="variants">Architectural Variants and Streaming Models</h2>
<h3>Decoder-Only Transformers</h3>
<p>
The <strong>decoder-only Transformer</strong>, now the standard architecture behind modern large language models like GPT and LLaMA, throws away the encoder and cross-attention entirely. This is a perfect fit for autoregressive next-token prediction. For conditional generation across modalities you may still want the encoder-decoder, but for text generation, decoder-only models typically outperform encoder-decoder models because they can be scaled bigger.
</p>
<p>
Some practitioners even use a decoder-only model for language translation: simply place a <code>&lt;SEP&gt;</code> token between the prompt/context and the target. In a decoder-only model, everything is just context; the source/target distinction doesn’t matter to the architecture. As long as you mark the boundary, the model learns what’s source and what’s target.
</p>
<h3>Streaming Transformers with Windowed Attention</h3>
<p>
Let’s now switch gears a bit. Consider <strong>streamable models</strong> like online ASR or real-time translation. They can’t see the entire sequence at once, so instead of a full self-attention, you use a restricted window. In a normal attention model, attention is global—each query token attends to all past and future tokens. With <strong>windowed attention</strong>, a token only attends to nearby tokens in a fixed range, and it is also causal because we can’t look into the future.
</p>
<p>
Say you have 10 ms frames and you want the model to make predictions in streaming chunks every 500 ms. We might want each query frame to look back at the last 50 frames. The <strong>attention window</strong> is how many past frames each frame is allowed to see. So, if we feed our attention mechanism 100 frames but only want it to look back 50 frames, we restrict how many keys a query can see using a sliding window. This is done by masking the attention matrix, setting the scores outside the window to $-\infty$, and forcing the model to ignore keys/values beyond the window.
</p>
<pre><code class="language-python">
import torch

n, window_size = 8, 4  # sequence length; frames each query may look back on

# Create window mask (causal + limited look-back)
mask = torch.full((n, n), float("-inf"))
for i in range(n):
    start = max(0, i - window_size + 1)
    mask[i, start:i + 1] = 0  # allow only the last `window_size` frames
</code></pre>
<p>
You might ask, why not feed the model overlapping windows? Why use masking with a sliding window? With overlapping windows, tokens are recomputed multiple times, and the model has no way of knowing the previous context.
</p>
<p>
This notion of context might seem a bit fuzzy. Context is the information a token carries from other tokens it has attended to; it’s the weighted sum of the value vectors. Imagine at layer 1, you look 10 frames back, so each output token is a blend of 10 input tokens. Now, in layer 2, the enriched tokens from layer 1 can add their context. Your token at position 19 isn’t just token 19 anymore; it’s a representation of tokens 10 through 19 from the previous layer. And since token 10 from the previous layer saw tokens 0 through 9, the context has propagated. It’s kind of like the receptive field growing in each layer; by stacking enough layers, you grow the attention span.
</p>
<p>
If you were to just send 10 frames at a time in an overlapping fashion, the issue is that the new chunk has no memory of the past. There’s no connection between the chunks because the context gets reset at each chunk boundary. The sliding window with masking allows for the context to propagate forward through the overlap, even if the window size itself is small.
</p>
<p>
The choice of window size is a balance between accuracy, latency, and compute. You want a large enough window to disambiguate context, but you have to wait longer to fill up a larger window, and a longer window means more memory and FLOPs.
</p>
<p>
Say you’re feeding your model 1 second of audio data (100 frames at 10 ms per frame), and you care about phonemes, which are about 20 to 50 ms long (about 2 to 5 frames per phoneme). How do you choose the attention window? We would want to integrate the whole phoneme span and maybe a little extra because the boundaries can be blurry, so you might choose 10 frames as your context window. You don’t need the entire 100 frames, because then you would be modeling at the word-level context, which increases computation and latency. So, think about the natural dependencies in your data when you select the attention window in a streaming fashion. How long is the smallest unit you need to capture? A phoneme, a gesture, a word, a spike? Convert that to milliseconds and figure out how many frames/tokens it needs to be. You can increase the window if validation accuracy keeps improving and stop if latency and compute become unacceptable.
</p>
<p>And that's it! Bye bye.</p>
</body>
</html>