002: Positional information, multi-head attention, and why one attention head is not enough¶
§ Positional information: why position matters¶
Self-attention by itself does not know token order.
If you only give embeddings, then these two sequences contain the same token set:
dog bites man
man bites dog
A plain attention mechanism without positional information only sees vectors and similarities. It does not automatically know which token came first.
That is a huge problem, because meaning depends on order.
So transformers need a way to inject:
token identity + token position
into the model.
§ The core idea¶
Instead of using only token embeddings, we use
$$ \text{input representation} = \text{token embedding} + \text{positional information} $$
So if
- $E_i$ is the embedding of token at position $i$
- $P_i$ is the positional vector for position $i$
then the model input becomes:
$$ X_i = E_i + P_i $$
This means two identical words in different positions get different final vectors.
That is enough for attention to start noticing order.
§ An example¶
Suppose our sentence is
I like tea
and token embeddings are
$$ E_1 = [1,0], \quad E_2 = [0,1], \quad E_3 = [1,1] $$
Now invent positional vectors
$$ P_1 = [0.1,0.0], \quad P_2 = [0.0,0.1], \quad P_3 = [0.1,0.1] $$
Then the actual inputs become
$$ X_1 = E_1 + P_1 = [1.1,0.0] $$
$$ X_2 = E_2 + P_2 = [0.0,1.1] $$
$$ X_3 = E_3 + P_3 = [1.1,1.1] $$
So now position changes the vectors the attention layer sees.
Without that, the model would struggle to distinguish many reorderings.
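The arithmetic above can be checked directly; here is a minimal NumPy sketch of the same toy example:

```python
import numpy as np

# Token embeddings from the example ("I", "like", "tea")
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Invented positional vectors for positions 1..3
P = np.array([[0.1, 0.0],
              [0.0, 0.1],
              [0.1, 0.1]])

X = E + P  # the vectors the attention layer actually sees
```

Note that two identical rows in `E` would still end up different in `X`, because their rows of `P` differ.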
§ Different ways to represent position¶
There are several approaches.
Learned positional embeddings¶
The model learns a vector for position 1, position 2, position 3, and so on.
So position behaves like a trainable lookup table.
Sinusoidal positional encodings¶
Used in the original transformer paper.
These are deterministic vectors built from sine and cosine waves of different frequencies.
The original paper defines them as
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
The nice property is that relative offsets can be represented systematically, i.e. if one knows the encoding for position pos, one can relate it predictably to the encoding for position pos + k.
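These formulas take only a few lines of NumPy to sketch (the function name `sinusoidal_encoding` is my own, not from the paper):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Build the sin/cos positional encodings from the original transformer paper."""
    pos = np.arange(max_len)[:, None]            # positions, shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # pair index, shape (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=16)  # pe[pos] is added at position pos
```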
Rotary position embeddings¶
Modern decoder LLMs often use RoPE.
Instead of adding a position vector directly, RoPE rotates query and key vectors in a position-dependent way. That lets attention depend on relative position more naturally.
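A minimal sketch of the rotation idea (a simplified single-vector version for intuition, not a production RoPE implementation):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of vector x by position-dependent angles."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = pos / (base ** (2 * i / d))  # one angle per dimension pair
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

# Key property: the dot product of a rotated query and rotated key
# depends only on the relative offset between their positions.
q = np.array([1.0, 0.5, -0.3, 0.8])
k = np.array([0.2, -1.0, 0.7, 0.1])
s1 = rope_rotate(q, pos=3) @ rope_rotate(k, pos=5)    # offset 2
s2 = rope_rotate(q, pos=10) @ rope_rotate(k, pos=12)  # also offset 2
```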
For now, the important idea is
the model must know both what the token is and where it is
§ Why one attention head is not enough¶
In 001, we used one attention calculation
$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{mask}\right)V $$
That gives one attention pattern.
But language usually contains multiple relationships at once.
For example,
The cat that chased the mouse was hungry
the model may need to track
- subject-verb relationship: The cat ↔ was
- local modifier relationship: the → cat, the → mouse
- long-distance dependency: cat ↔ was hungry, across that chased the mouse
- phrase structure: main clause The cat was hungry, embedded clause that chased the mouse
- boundary information: recognizing that that chased the mouse is one unit, even without punctuation
One attention map is a single “view” of the sequence.
That is too limiting.
So transformers use multiple heads.
Each head gets its own learned projections
$$ Q_h = XW_Q^{(h)}, \quad K_h = XW_K^{(h)}, \quad V_h = XW_V^{(h)} $$
Each head computes its own attention result
$$ \text{head}_h = \text{Attention}(Q_h, K_h, V_h) $$
Then the heads are concatenated and projected again
$$ \text{MultiHead}(X) = \text{Concat}(\text{head}_1,\dots,\text{head}_H)W_O $$
So instead of one attention pattern, the model gets several.
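The three formulas above can be sketched end-to-end in NumPy (masking is omitted here for brevity; the weight shapes are one common convention, not the only one):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (T, d_model); Wq/Wk/Wv: per-head projection lists; Wo: (H*d_k, d_model)."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q, K, V = X @ wq, X @ wk, X @ wv           # each head gets its own projections
        A = softmax(Q @ K.T / np.sqrt(wq.shape[-1]))
        heads.append(A @ V)                        # one attention result per head
    return np.concatenate(heads, axis=-1) @ Wo     # concat, then output projection

rng = np.random.default_rng(0)
T, d_model, H = 3, 4, 2
d_k = d_model // H
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(H)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(H)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(H)]
Wo = rng.normal(size=(H * d_k, d_model))
out = multi_head_attention(rng.normal(size=(T, d_model)), Wq, Wk, Wv, Wo)
```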
§ Intuition for multi-head attention¶
Think of each head as learning a different question.
A head may learn to ask questions like:
- "Which earlier word is syntactically related to me?"
- "Which nearby token helps clarify my phrase?"
- "Where is the delimiter or structure boundary?"
A real head is not manually labeled like that, but this intuition is useful.
So multi-head attention gives the model
several different ways to look at the same sequence at the same time.
§ An example of two heads¶
Suppose our sequence is:
I like tea
and suppose its token representation matrix is
$$ X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} $$
Here, each row is the vector representation of one token in the sequence:
- row 1 corresponds to I
- row 2 corresponds to like
- row 3 corresponds to tea
So $X$ is not a single vector representing the whole sentence. It is the matrix of token representations for all tokens in the sequence.
In a real transformer, each attention head starts with the same input matrix $X$, but applies different learned projection matrices to it. These learned matrices were acquired during training: the model starts with random weights, makes prediction errors on large amounts of text, and then updates the weights through gradient-based learning. Over time, the model learns projection matrices that make attention useful.
To keep it simple, let's say
- Head 1 attends more based on one learned pattern in the token vectors
- Head 2 attends more based on a different learned pattern
After computing attention, head 1 might produce attention weights like
$$ A^{(1)} = \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 0.7 & 0 \\ 0.2 & 0.2 & 0.6 \end{bmatrix} $$
and head 2 might produce
$$ A^{(2)} = \begin{bmatrix} 1 & 0 & 0 \\ 0.6 & 0.4 & 0 \\ 0.1 & 0.7 & 0.2 \end{bmatrix} $$
These two heads are attending differently to the same sequence.
That is the key point.
Each head then uses its attention weights to mix its value vectors and produce its own output:
$$ O^{(1)} = A^{(1)}V^{(1)}, \quad O^{(2)} = A^{(2)}V^{(2)} $$
Then the model concatenates the head outputs:
$$ [O^{(1)} \; || \; O^{(2)}] $$
and applies a final output projection.
So the final representation contains multiple complementary attention views of the same sequence.
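With invented value matrices (the V matrices below are hypothetical, chosen only to make the arithmetic concrete), the mixing and concatenation look like:

```python
import numpy as np

# Attention weights from the example above
A1 = np.array([[1.0, 0.0, 0.0],
               [0.3, 0.7, 0.0],
               [0.2, 0.2, 0.6]])
A2 = np.array([[1.0, 0.0, 0.0],
               [0.6, 0.4, 0.0],
               [0.1, 0.7, 0.2]])

# Hypothetical per-head value matrices (one 2-d value vector per token)
V1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V2 = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])

O1, O2 = A1 @ V1, A2 @ V2                     # each head mixes its own values
combined = np.concatenate([O1, O2], axis=-1)  # (3, 4), ready for the output projection
```

For instance, the second row of `O1` is 0.3·V1[0] + 0.7·V1[1] = [0.3, 0.7]: the token like mostly keeps its own value, blended with a little of I.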
§ Why separate projections matter¶
Each attention head starts with the same input matrix $X$.
But each head has its own learned projection matrices:
$$ W_Q^{(1)}, \; W_K^{(1)}, \; W_V^{(1)} $$
$$ W_Q^{(2)}, \; W_K^{(2)}, \; W_V^{(2)} $$
and so on.
A projection matrix is a learned weight matrix that transforms the input token representations into the form needed for attention.
These matrices transform the same input $X$ into different queries, keys, and values for each head:
$$ Q_1 = XW_Q^{(1)}, \quad K_1 = XW_K^{(1)}, \quad V_1 = XW_V^{(1)} $$
$$ Q_2 = XW_Q^{(2)}, \quad K_2 = XW_K^{(2)}, \quad V_2 = XW_V^{(2)} $$
Here:
- $Q$ (query) represents what a token is looking for
- $K$ (key) represents what a token offers for matching
- $V$ (value) represents the information a token contributes if it is attended to
These are internal attention roles inside the model. They do not refer to the user’s question or prompt in the everyday sense. Instead, each token in the sequence produces its own query, key, and value vectors.
If all heads used the same projection matrices, they would produce the same or very similar queries, keys, and values, and would therefore tend to learn very similar attention patterns.
Separate projections allow different heads to specialize in different relationships.
Attention then works in two stages
- compare queries with keys to compute attention weights
- use those weights to mix the values
So for each head, we compute attention weights such as:
$$ A^{(1)} = \text{softmax}\left(\frac{Q_1K_1^T}{\sqrt{d_k}}+\text{mask}\right) $$
$$ A^{(2)} = \text{softmax}\left(\frac{Q_2K_2^T}{\sqrt{d_k}}+\text{mask}\right) $$
Then each head multiplies its attention weights by its value matrix to produce its output:
$$ O^{(1)} = A^{(1)}V^{(1)}, \quad O^{(2)} = A^{(2)}V^{(2)} $$
Then the model concatenates the head outputs:
$$ [O^{(1)} \; || \; O^{(2)}] $$
and applies a final output projection.
So the final representation contains multiple complementary attention views.
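To see that separate projections really do produce different attention patterns from the same $X$, here is a small sketch in which random weights stand in for learned ones:

```python
import numpy as np

def attn_weights(X, Wq, Wk):
    """Stage 1: compare queries with keys to get attention weights."""
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(Wq.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
T, d_model, d_k = 3, 4, 2
X = rng.normal(size=(T, d_model))  # the same input goes to both heads

A1 = attn_weights(X, rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k)))
A2 = attn_weights(X, rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k)))
# Different projections -> different attention patterns over the same tokens
```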
§ Shape intuition¶
Shapes help answer questions like:
- how many token vectors are we processing?
- how wide is each token representation?
- how is that width divided across heads?
- after combining heads, what size is the result?
So shapes are important because they tell you:
what kind of object each matrix is, and how the computations fit together
To begin, define:
- Sequence length = $T$, number of tokens in the input sequence, controls the length of the text being processed.
- Model dimension = $d_{\text{model}}$, width of the token representation space, say each token is represented by a vector of length 768
- Number of heads = $H$, how many different attention "views" the model uses at the same time
- Per-head key dimension = $d_k$, the internal vector width used by one attention head
$X$ is the input to the attention layer: the model is processing $T$ token vectors, each of width $d_{\text{model}}$.
$$ X \in \mathbb{R}^{T \times d_{\text{model}}} $$
Each head computes query, key, and value vectors for each token:
$$ Q_h, K_h, V_h \in \mathbb{R}^{T \times d_k} $$
The number of rows stays the same because we still have the same tokens. The number of columns changes because each head uses projected representations.
Each head produces
$$ O_h \in \mathbb{R}^{T \times d_k} $$
- one output vector per token
- at the same internal width it used for attention
Concatenating all head outputs combines the separate attention views from different heads into one larger representation:
$$ \text{Concat}(O_1,\dots,O_H) \in \mathbb{R}^{T \times (H d_k)} $$
which has the same total width as the model dimension
$$ H d_k = d_{\text{model}} $$
That makes it easier to pass the result into the next layer.
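The shape bookkeeping above can be verified mechanically; a sketch with small numbers:

```python
import numpy as np

T, d_model, H = 5, 8, 2
d_k = d_model // H                 # chosen so that H * d_k == d_model

rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))  # (T, d_model): one row per token

head_outputs = []
for _ in range(H):
    Wv = rng.normal(size=(d_model, d_k))
    head_outputs.append(X @ Wv)    # stand-in for one head's (T, d_k) output

concat = np.concatenate(head_outputs, axis=-1)
# concat has shape (T, H * d_k) == (T, d_model), ready for the next layer
```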
§ Why multi-head attention helps more than one big head¶
A single large head has to compress all relationships into one attention pattern.
Multi-head attention lets the model distribute work:
- one head for local structure
- one for longer dependencies
- one for punctuation or formatting cues
- one for semantic similarity
Not every head ends up useful, but in general the architecture gives the model more representational flexibility.
This is one reason transformers work so well.
§ The role of causality is still the same¶
Even with many heads, decoder-only models still apply a causal mask.
So for every head
$$ \text{head}_h = \text{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}} + \text{causal mask}\right)V_h $$
So multi-head attention does not remove the left-to-right constraint.
Each head still cannot see future tokens.
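A small sketch of how the causal mask enforces this; zero scores are used so the surviving positions get uniform weight:

```python
import numpy as np

T = 4
# -inf above the diagonal: position i may not attend to positions j > i
mask = np.triu(np.full((T, T), -np.inf), k=1)

scores = np.zeros((T, T))  # stand-in for raw Q K^T / sqrt(d_k) scores
masked = scores + mask
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)
# Row i spreads its weight over positions 0..i and gives zero weight to the future
```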
§ What this means for real LLMs¶
In a real decoder LLM
- token embeddings are higher-dimensional
- positional information is more sophisticated
- there are many heads
- each head is learned
- there are many stacked layers
But the core idea remains the same
- embed tokens
- inject position
- build $Q$, $K$, and $V$ for each head
- compute masked attention per head
- combine heads
- pass to the next layer
§ Common misunderstandings¶
A common misunderstanding is
"attention already sees all tokens, so position is unnecessary"
That is false.
Attention sees a set of vectors and computes interactions among them. Without positional information, order is not inherently represented.
Another common misunderstanding is
"multi-head attention is just repeated redundant attention"
Also false.
The heads are useful because they use different learned projections and can therefore learn different relations.
§ Summarising¶
- Positional information - lets the model know where each token is
- Multi-head attention - lets the model look at the same sequence in several different learned ways at once
- One head is not enough - because language contains multiple overlapping relationships that a single attention map cannot represent well.
§ Takeaway formulas¶
Single head
$$ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+\text{mask}\right)V $$
Multi-head
$$ \text{MultiHead}(X)=\text{Concat}(\text{head}_1,\dots,\text{head}_H)W_O $$
where each head has its own projections.
§ Closing intuition¶
001 showed how one attention computation works.
002 adds the two big ideas that make transformers much more expressive
- position tells the model where tokens are
- multiple heads let the model capture several relationships at the same time
That is why real transformers are much more powerful than a single attention pass.
Published: Mar 20, 2026