001: Tokenization, Causal Masking, and Attention Pass by Hand¶
This notebook is a first-principles introduction to three foundational ideas behind decoder-only LLMs such as Mistral 7B.
- Tokenization
- Causal masking
- One self-attention pass by hand
The goal is to understand the pipeline:
text → tokens → embeddings → attention scores → causal mask → attention weights → mixed output
The math has been kept tiny and intentionally simplified so the mechanism is easy to see.
§ Learning goals¶
- why an LLM reads token IDs, not raw text
- how tokens become embedding vectors
- how Q, K, V are used to compute attention
- what the causal mask does and why decoder-only models need it
- how attention creates a weighted blend of value vectors
§ Tokenization¶
A language model does not read raw text directly.
It reads a sequence of tokens.
A token is usually not exactly the same thing as a word. It can be:
- a whole word
- part of a word
- punctuation
- whitespace-related pieces
- bytes for unusual text
In this notebook we will use a trivial tokenizer where each word is one token.
Why tokenization matters¶
Because the model's actual input is not English. It is token IDs.
This affects:
- context length
- cost
- handling of rare words
- spelling mistakes
- code and numbers
- multilingual text
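To see why rare words and spelling variants matter, here is a quick sketch (reusing the same toy vocabulary defined below; `tokenize_words` is a hypothetical helper, not part of any real tokenizer API). A word-level tokenizer simply has no ID for any word it has never seen, which is one motivation for the subword tokenizers real models use:

```python
toy_vocab = {"I": 1, "like": 2, "tea": 3}

def tokenize_words(sentence, vocab):
    # vocab.get returns None for any out-of-vocabulary word
    return [vocab.get(word) for word in sentence.split()]

print(tokenize_words("I like tea", toy_vocab))   # [1, 2, 3]
print(tokenize_words("I like teas", toy_vocab))  # [1, 2, None] -- "teas" is unknown
```

A subword tokenizer would instead split "teas" into known pieces (for example "tea" + "s"), so it never produces an unknown token.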
A useful sentence to remember:
The model never sees
I like tea as letters.
It sees a sequence of token IDs.
sentence = "I like tea"
toy_vocab = {"I": 1, "like": 2, "tea": 3}
tokens = [toy_vocab[word] for word in sentence.split()]
print("Sentence:", sentence)
print("Token IDs:", tokens)
Sentence: I like tea
Token IDs: [1, 2, 3]
§ Turn tokens into vectors¶
Models work with vectors, not IDs.
So each token gets mapped to an embedding vector.
We choose 2-dimensional embeddings to keep the math small.
Let's assume, for the sake of simplicity:
- I → [1, 0]
- like → [0, 1]
- tea → [1, 1]
So our input matrix $X$ is
$$ X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} $$
Each row is one token:
- row 1 = I
- row 2 = like
- row 3 = tea
import numpy as np
X = np.array([
    [1, 0],  # I
    [0, 1],  # like
    [1, 1],  # tea
], dtype=float)
print("Input embedding matrix X:")
print(X)
Input embedding matrix X:
[[1. 0.]
 [0. 1.]
 [1. 1.]]
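The matrix above was written out by hand, but in a real model $X$ is produced by an embedding lookup: the token IDs index rows of a learned embedding table. A minimal sketch of that lookup, with a hypothetical table whose rows are the toy vectors chosen above:

```python
import numpy as np

# Hypothetical embedding table: row i holds the vector for token ID i.
# Row 0 is left unused so the toy IDs 1..3 index directly into the table.
embedding_table = np.array([
    [0, 0],  # id 0: unused
    [1, 0],  # id 1: "I"
    [0, 1],  # id 2: "like"
    [1, 1],  # id 3: "tea"
], dtype=float)

token_ids = [1, 2, 3]
X_lookup = embedding_table[token_ids]  # fancy indexing: one row per token
print(X_lookup)
```

In a trained model the table entries are learned parameters, not hand-picked values.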
§ Create Q, K, V¶
Attention uses three versions of the input $X$:
- $Q$ (query) representing what a token is looking for
- $K$ (key) representing what a token offers for matching
- $V$ (value) representing the information a token contributes if it is attended to
These are internal attention roles inside the model.
They do not mean:
- the human user’s query to the LLM
- a database query
- a search query
Instead, at a given layer, each token in the sequence produces its own query, key, and value vectors.
A helpful way to think about it is:
- the query says: “what kind of information would help me right now?”
- the key says: “what kind of information do I contain?”
- the value says: “if another token attends to me, this is the information I pass along”
So if token $i$ is being processed, its query is compared against the keys of other tokens. Stronger matches lead to higher attention weights, and then those weights are used to mix the value vectors.
Normally,
$$ Q = XW_Q,\quad K = XW_K,\quad V = XW_V $$
where $W_Q, W_K, W_V$ are projection matrices.
§ Projection matrices¶
A projection matrix is a learned weight matrix that transforms the input token representations into the form needed for attention.
They start as randomly initialized weights.
During training on a large corpus, the model repeatedly
- reads sequences of tokens
- makes next-token predictions
- measures the prediction error using a loss function
- uses backpropagation and gradient descent to update its weights
That update process changes all the model’s learnable parameters, including the projection matrices.
Over many training steps, the model learns projection matrices that make attention useful for predicting the next token.
So the training timeline is
- initialize $W_Q, W_K, W_V$ randomly
- train on many sequences of text
- compute prediction loss
- update the matrices through backpropagation
- end up with learned projections that work well
After training, these learned matrices are fixed for inference. When a user sends a prompt, the model uses the learned $W_Q, W_K, W_V$ to process the prompt tokens, but it is no longer updating them.
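Before simplifying, here is a hedged sketch of what the general projection step looks like. The matrices below are random stand-ins for the learned $W_Q, W_K, W_V$ (in a real model their values come out of training, not a random number generator):

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
d_model = X.shape[1]

# Hypothetical stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# Q = X W_Q, K = X W_K, V = X W_V
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)  # each is (3, 2): one row per token
```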
To keep it simple, let's choose
$$ W_Q = W_K = W_V = I $$
So,
$$ Q = K = V = X $$
This is not realistic, but it makes the first attention pass easy to inspect.
During training:
- the model computes $Q$, $K$, and $V$ for tokens in training sequences
- the projection matrices are updated through learning
During inference:
- the model still computes $Q$, $K$, and $V$ in the same way
- but the projection matrices are no longer being learned, only used
Q = X.copy()
K = X.copy()
V = X.copy()
print("Q =")
print(Q)
print("\nK =")
print(K)
print("\nV =")
print(V)
Q =
[[1. 0.]
 [0. 1.]
 [1. 1.]]

K =
[[1. 0.]
 [0. 1.]
 [1. 1.]]

V =
[[1. 0.]
 [0. 1.]
 [1. 1.]]
§ Compute raw attention scores¶
The word self in self-attention means that queries, keys, and values all come from the same input sequence, so each token can look at other tokens in that same sequence.
The attention score matrix is
$$ \text{scores} = \frac{QK^T}{\sqrt{d_k}} $$
Here $d_k = 2$, so we divide by
$$ \sqrt{2} \approx 1.414 $$
Because $Q = K = X$, we first compute
$$ QK^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix} $$
Then divide by $\sqrt{2}$
$$ S = \frac{QK^T}{\sqrt{2}} \approx \begin{bmatrix} 0.707 & 0 & 0.707 \\ 0 & 0.707 & 0.707 \\ 0.707 & 0.707 & 1.414 \end{bmatrix} $$
Interpretation:
- row 1 = what token 1 wants to look at
- row 2 = what token 2 wants to look at
- row 3 = what token 3 wants to look at
d_k = Q.shape[1]
raw_scores = (Q @ K.T) / np.sqrt(d_k)
print("Raw attention scores S = QK^T / sqrt(d_k):")
print(np.round(raw_scores, 3))
Raw attention scores S = QK^T / sqrt(d_k):
[[0.707 0.    0.707]
 [0.    0.707 0.707]
 [0.707 0.707 1.414]]
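Why divide by $\sqrt{d_k}$ at all? The dot product of two random unit-variance vectors has variance $d_k$, so its typical magnitude grows like $\sqrt{d_k}$; without scaling, large dimensions would push the softmax into saturation. A small sketch (assuming Gaussian query/key entries, which is an illustration rather than a property of trained models):

```python
import numpy as np

rng = np.random.default_rng(0)

# For each dimension, sample many random query/key pairs and compare the
# spread of raw dot products with the spread after dividing by sqrt(d_k).
for d_k in (2, 64, 1024):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per pair
    print(d_k, round(dots.std(), 2), round((dots / np.sqrt(d_k)).std(), 2))
```

The raw spread grows with $d_k$ while the scaled spread stays near 1, which keeps the scores in a range where softmax still produces usefully graded weights.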
§ Apply the causal mask¶
This is the key idea in decoder-only LLMs.
When predicting the token at position $t$, the model is allowed to see:
- earlier tokens
- itself
It is not allowed to see future tokens.
So:
- token 1 cannot look at tokens 2 or 3
- token 2 cannot look at token 3
The causal mask for length 3 is
$$ M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix} $$
Add this mask to the score matrix
$$ S_{\text{masked}} = \begin{bmatrix} 0.707 & -\infty & -\infty \\ 0 & 0.707 & -\infty \\ 0.707 & 0.707 & 1.414 \end{bmatrix} $$
Intuition¶
Without this mask, token 1 could "cheat" by seeing token 3.
That would break next-token prediction training.
seq_len = raw_scores.shape[0]
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
masked_scores = raw_scores + mask
print("Causal mask:")
print(mask)
print("\nMasked scores:")
print(masked_scores)
Causal mask:
[[  0. -inf -inf]
 [  0.   0. -inf]
 [  0.   0.   0.]]

Masked scores:
[[0.70710678       -inf       -inf]
 [0.         0.70710678       -inf]
 [0.70710678 0.70710678 1.41421356]]
§ Softmax each row¶
Now convert masked scores into probabilities:
$$ A = \text{softmax}(S_{\text{masked}}) $$
Row 1¶
$$ [0.707, -\infty, -\infty] \rightarrow [1, 0, 0] $$
Token 1 can only attend to itself.
Row 2¶
$$ [0, 0.707, -\infty] $$
Exponentials
- $e^0 = 1$
- $e^{0.707} \approx 2.03$
Sum
$$ 1 + 2.03 = 3.03 $$
Normalize
$$ \left[\frac{1}{3.03}, \frac{2.03}{3.03}, 0\right] \approx [0.33, 0.67, 0] $$
Row 3¶
$$ [0.707, 0.707, 1.414] $$
Exponentials
- $e^{0.707} \approx 2.03$
- $e^{0.707} \approx 2.03$
- $e^{1.414} \approx 4.11$
Sum
$$ 2.03 + 2.03 + 4.11 = 8.17 $$
Normalize
$$ \left[ \frac{2.03}{8.17}, \frac{2.03}{8.17}, \frac{4.11}{8.17} \right] \approx [0.25, 0.25, 0.50] $$
So the attention matrix is approximately
$$ A \approx \begin{bmatrix} 1 & 0 & 0 \\ 0.33 & 0.67 & 0 \\ 0.25 & 0.25 & 0.50 \end{bmatrix} $$
def row_softmax(x):
    # stable softmax, safe with -inf in masked positions
    x = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
A = row_softmax(masked_scores)
print("Attention weights A:")
print(np.round(A, 3))
Attention weights A:
[[1.    0.    0.   ]
 [0.33  0.67  0.   ]
 [0.248 0.248 0.503]]
§ Multiply attention weights by $V$¶
Computing the output
$$ O = AV $$
Recall
$$ V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} $$
Output for row 1
$$ 1[1,0] + 0[0,1] + 0[1,1] = [1,0] $$
Output for row 2
$$ 0.33[1,0] + 0.67[0,1] + 0[1,1] = [0.33, 0.67] $$
Output for row 3
$$ 0.25[1,0] + 0.25[0,1] + 0.50[1,1] $$
$$ = [0.25,0] + [0,0.25] + [0.50,0.50] = [0.75, 0.75] $$
So,
$$ O \approx \begin{bmatrix} 1 & 0 \\ 0.33 & 0.67 \\ 0.75 & 0.75 \end{bmatrix} $$
This is the attention output.
O = A @ V
print("Attention output O = A @ V:")
print(np.round(O, 3))
Attention output O = A @ V:
[[1.    0.   ]
 [0.33  0.67 ]
 [0.752 0.752]]
§ Summary¶
Each token started as its own vector.
After attention,
- token 1 stayed the same
- token 2 became a mix of token 1 and token 2
- token 3 became a mix of all three tokens
That is the heart of self-attention: each token builds a new representation by mixing information from allowed tokens.
And the causal mask ensures that only the past and present can contribute, never the future.
§ Intuition¶
The process so far in plain English:
- Tokenization turns text into discrete units.
- Embeddings turn those units into vectors.
- Q and K decide who should pay attention to whom.
- The causal mask blocks future information.
- Softmax turns raw scores into attention weights.
- Those weights mix the value vectors into a contextual representation.
That contextual representation then goes through more layers, and eventually a final linear layer produces logits for the next token.
§ Learnings so far¶
- tokenization
- decoder-only causal masking
- self-attention
- contextual mixing
Later, when we look into Mistral specifically, we can read about:
- multi-head attention
- grouped-query attention
- rotary position embeddings
- feed-forward layers
- normalization
- sliding-window attention
§ Main concept¶
For a decoder-only LLM:
$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{causal mask}\right)V $$
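The whole pipeline above collapses into one function. This is a minimal single-head sketch of the formula (no batching, no multiple heads, none of the numerical refinements real implementations add), and it reproduces the output $O$ computed step by step earlier:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head)."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    # upper-triangular -inf mask blocks attention to future positions
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)
    scores = scores + mask
    # row-wise stable softmax; exp(-inf) = 0 zeroes out masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
print(np.round(causal_attention(X, X, X), 3))
```

With $Q = K = V = X$ this returns the same matrix as the hand calculation: token 1 unchanged, token 2 a blend of tokens 1 and 2, token 3 a blend of all three.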
§ Upshot¶
A decoder-only language model predicts the next token by building contextual representations of the tokens it has already seen.
The key mechanism is self-attention, and the key restriction is the causal mask.
This notebook explains the forward-pass attention mechanics used in both training and inference.
During training, these computations are used while learning next-token prediction.
During inference, the same computations are used to process the prompt and generate the next token.
The causal mask enforces left-to-right behavior in both cases.
Published: Mar 17, 2026