How LLMs actually work
The transformer machinery, from a string of text to the next-token loop — built to be poked at.
An interactive walkthrough of the transformer architecture inside modern large language models, covering tokenization, embeddings, positional encoding, attention, multi-head attention, the feed-forward network, the residual stream, normalization, and next-token prediction.
Almost every modern LLM is the same idea stacked many times over: a transformer block, repeated. Learn the block and you understand most of the model. This is a walkthrough of that machinery — the moving parts, not the math — with each part rebuilt as something you can drag, step through, and inspect.
Different families train on different data at different scales and post-train differently afterward. But the skeleton is shared. By the end you should be able to open a model card or a paper and place each section against one of the stages below.
The whole thing at a glance
The pipeline
Text goes in, an integer comes out, and that integer becomes the next piece of text. Everything in between is the transformer stack. Step through it once before the detail: the same block of attention plus a feed-forward network repeats dozens of times, and only the very last step turns into a word.
Tap step to walk a token through the model.
Stage 1
Tokens
Models never see your letters. They see integer IDs. A tokenizer turns a string into a sequence of integers, each pointing at one entry in a fixed vocabulary of typically tens of thousands to a few hundred thousand pieces.
The pieces are usually not whole words — they are subword fragments. Whole-word vocabularies are too large and break on anything new; character-level is too small and forces the model to relearn everything from scratch. Subwords sit in the middle: common pieces get their own token, rare words get assembled from smaller ones.
This is why models used to miscount the R's in "strawberry." It was never a counting failure — the model was handed a couple of integers, not the letters a human would tick off one by one.
GPT-family models use byte-level Byte Pair Encoding variants; LLaMA-style models lean on SentencePiece, a tokenizer library that can itself run BPE. The choice changes compute cost and multilingual coverage, but the shape is identical: text in, integers out.
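To make "text in, integers out" concrete, here is a minimal sketch of a subword tokenizer. It uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification: real BPE and SentencePiece tokenizers learn their vocabularies and merge rules from data. `TOY_VOCAB`, its IDs, and `tokenize` are made up for illustration.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# Real tokenizers learn their pieces from data; this table is purely illustrative.
TOY_VOCAB = {
    "straw": 0, "berry": 1, "the": 2, " ": 3,
    "s": 4, "t": 5, "r": 6, "a": 7, "w": 8, "b": 9, "e": 10, "y": 11, "h": 12,
}

def tokenize(text: str) -> list[int]:
    """Split text into the longest vocabulary pieces available, left to right."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return ids

print(tokenize("strawberry"))  # [0, 1]
print(tokenize("stray"))       # [4, 5, 6, 7, 11]
```

The common word arrives as two integers, while the rarer one is assembled from single-letter pieces. Neither sequence exposes the letters directly, which is the point of the "strawberry" example above.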
Stage 2
Embeddings
An ID like 1024 is just a row index — meaningless on its own. Meaning comes from the embedding matrix: a giant lookup table with one row per vocabulary entry, each row a long vector. The row length is the model's hidden size — 4,096 numbers per token is common in 7B-class models; bigger models go wider.
The useful property is that semantically similar tokens land near each other, and none of it is hand-coded — it emerges because those positions help the model predict text. The geometry even carries analogies you can do arithmetic with. Drag through the classic one below.
At this point each token is a vector — but the vector for "dog" is identical whether "dog" is first or fifth in your prompt. The embedding knows what, not where. That gap is the next stage's job.
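The lookup and the analogy arithmetic can be sketched in a few lines. Everything here is hypothetical: a five-word vocabulary and hand-picked 3-number vectors chosen so the well-known king, man, woman, queen example works out exactly. Real embeddings are learned, have thousands of dimensions, and only approximately satisfy such analogies.

```python
import math

# Hypothetical vocabulary and embedding matrix: one row per token.
VOCAB = ["king", "queen", "man", "woman", "dog"]
EMBEDDING_MATRIX = [
    [1.0, 1.0, 0.0],  # king
    [1.0, 0.0, 1.0],  # queen
    [0.0, 1.0, 0.0],  # man
    [0.0, 0.0, 1.0],  # woman
    [0.0, 0.5, 0.5],  # dog
]

def embed(token: str) -> list[float]:
    """Turn a token into its ID (row index), then look up that row."""
    return EMBEDDING_MATRIX[VOCAB.index(token)]

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(vector, exclude):
    """Return the vocabulary token whose embedding is most similar to vector."""
    candidates = [t for t in VOCAB if t not in exclude]
    return max(candidates, key=lambda t: cosine(vector, embed(t)))

# king - man + woman, computed element by element.
target = [k - m + w for k, m, w in zip(embed("king"), embed("man"), embed("woman"))]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Note that `embed("dog")` returns the same row no matter where "dog" sits in the prompt, which is the gap the next stage fills.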
Stage 3
Positional encoding
Plain self-attention has no built-in sense of order. Nothing tells it "dog" came before "bites" rather than after. Since word order changes meaning, the position of each token has to be injected into the math.
The 2017 transformer added a fixed sine-and-cosine pattern to each embedding before anything else, so "dog at position 1" and "dog at position 5" differed by the pattern stacked on top. It worked, but it forced one set of numbers to carry both meaning and position, and absolute position schemes, especially the learned position tables used in models like BERT and GPT-2, did not extend cleanly past the lengths seen in training.
Modern models mostly use Rotary Position Embeddings (RoPE). Instead of adding a position vector, RoPE rotates the Query and Key vectors by an angle that grows with position. When two tokens are later compared, what matters is the difference between their rotations — which is exactly the distance between them.
RoPE encodes relative position for free, extends to longer contexts better, and adds no parameters, which is why LLaMA, Mistral, Gemma and Qwen all use it. Even so, long-context models show a documented "lost in the middle" effect: information at the start and end of a long prompt gets used more reliably than information buried in the middle. That is a large part of why advice like "put the important context first" helps.
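The claim that only the difference between rotations matters can be checked numerically. This is a minimal sketch on a single 2-number vector with one made-up rotation rate `theta`; real RoPE splits each Query and Key into many 2-number pairs and rotates each pair at a different rate.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-number vector by an angle that grows with position."""
    angle = position * theta
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

def score(query, q_pos, key, k_pos):
    """Dot product of a rotated Query with a rotated Key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return q[0] * k[0] + q[1] * k[1]

query, key = [1.0, 0.5], [0.3, 0.8]

near_start = score(query, 5, key, 2)      # tokens 3 apart, early in the prompt
far_along = score(query, 105, key, 102)   # tokens 3 apart, 100 positions later
other_gap = score(query, 5, key, 4)       # tokens 1 apart

print(abs(near_start - far_along) < 1e-9)  # True: same distance, same score
print(abs(near_start - other_gap) > 1e-6)  # True: different distance, different score
```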
Stage 4
Attention
This is the mechanism the architecture is named for. Inside every layer, attention lets each token look at the tokens it is allowed to see and decide which ones matter. It does this by giving each token three roles, produced by three learned projections: Query, Key, and Value.
Each Query is scored against every visible Key with a scaled dot product — how well the two line up. Softmax turns those scores into weights that sum to one, and the result is a weighted average of the Value vectors. Strong matches dominate.
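Those three steps (score, softmax, weighted average) fit in a short function. The sketch below is plain Python for one attention head, and it assumes a causal mask, so each token sees only itself and earlier tokens, as in decoder-only models. The Query, Key, and Value lists are invented toy numbers; in a real model they come from the three learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees positions 0..i."""
    head_dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        visible = range(i + 1)
        # Score this Query against every visible Key.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(head_dim)
                  for j in visible]
        weights = softmax(scores)
        # Weighted average of the visible Value vectors.
        out = [sum(w * values[j][d] for j, w in zip(visible, weights))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs for three tokens (hypothetical numbers).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(queries, keys, values)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's attention pattern. The first row has a single weight of 1.0 because the first token can only see itself.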
Take "The cat that I saw yesterday was sleeping." When the model processes "was," its Query lines up strongly with the Key for "cat" and weakly with "yesterday," so the value of "cat" dominates the new representation of "was." A subject several positions back becomes the referent. Tap a word below to see what it attends to.
One of the cleaner interpretability findings is the induction head (Anthropic, 2022): a head that spots a pattern of the form "A B … A" and predicts B comes next. When the model meets "A" the second time, the head looks back to where "A" appeared, sees what followed, and copies it. It's one of the clearest known mechanisms behind in-context learning.
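The behaviour an induction head implements can be written as an ordinary function. To be clear about what this is: a sketch of the input-output rule only, not of how attention heads compute it. The function name is made up.

```python
def induction_prediction(tokens):
    """Find the most recent earlier copy of the last token and return what followed it."""
    current = tokens[-1]
    # Walk backwards through everything before the last token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the last token has not appeared before

print(induction_prediction(["A", "B", "C", "A"]))  # B
print(induction_prediction(["A", "B", "C"]))       # None
```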
Attention has one big cost: every token compares against every visible token, so doubling the prompt roughly quadruples the work. That is why long prompts are expensive, and why FlashAttention, sparse attention, and linear attention exist.
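The "roughly quadruples" figure is simple counting. A rough sketch, assuming causal attention and counting one Query-Key comparison per visible pair, per head, per layer:

```python
def causal_comparisons(n_tokens):
    """Token i compares against tokens 0..i, so the total is 1 + 2 + ... + n."""
    return n_tokens * (n_tokens + 1) // 2

short = causal_comparisons(1000)
long = causal_comparisons(2000)
print(short, long, round(long / short, 2))  # 500500 2001000 4.0
```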
Stage 5
Multi-head attention
One attention pass gives one view of which tokens matter. Language has many relationships running at once — subject–verb agreement, pronouns and their referents, long-range links, local phrasing. Multi-head attention runs attention many times in parallel, each pass in its own smaller space.
The part tutorials often get wrong: a head does not get a fixed slice of the token vector. Each head learns its own projection that maps the full vector down to its own small Q, K, V. With 4,096 numbers and 32 heads, each head works in 128 dimensions — but those 128 are a learned projection of all 4,096, a different view, not a different chunk.
Specialisation emerges on its own. Below are four heads of the kind researchers actually find — never instructed, just learned. Shared scale, so the patterns are comparable at a glance.
Heads are concatenated and mixed by a final learned layer. A layer might hold 32 heads; a frontier model stacks dozens of layers — so a model carries thousands of heads, each its own learned view.
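The "different view, not a different chunk" point can be demonstrated at toy scale. This sketch assumes a hidden size of 8 and 2 heads instead of 4,096 and 32, and uses random numbers in place of learned weights. It shows that changing only the last number of the token vector changes every head's Query, which could not happen if each head owned a fixed slice.

```python
import random

D_MODEL, N_HEADS = 8, 2
HEAD_DIM = D_MODEL // N_HEADS  # 4 here; at the scale in the text, 4096 // 32 == 128

rng = random.Random(0)  # random stand-ins for learned weights

def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(vector, matrix):
    """Multiply a vector (length rows) by a rows x cols matrix."""
    return [sum(vector[r] * matrix[r][c] for r in range(len(vector)))
            for c in range(len(matrix[0]))]

# One Query projection per head; every one of them reads all D_MODEL inputs.
query_projections = [random_matrix(D_MODEL, HEAD_DIM) for _ in range(N_HEADS)]

token = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
per_head_queries = [project(token, w) for w in query_projections]

# Nudge only the last input number and recompute each head's Query.
changed = token[:-1] + [token[-1] + 1.0]
moved = [project(changed, w) != q
         for w, q in zip(query_projections, per_head_queries)]
print([len(q) for q in per_head_queries], moved)
```

Implementations usually fuse the per-head matrices into one large projection and reshape the result, which computes the same thing.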
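During generation, every head keeps the keys and values it computed for earlier tokens so they are not recomputed at each step. This store is the KV cache, and it grows with both head count and prompt length. That cache is why a recent change stuck. In Grouped-Query Attention, groups of query heads share key/value heads: LLaMA-2 70B has 64 query heads but 8 key/value heads; Mistral 7B has 32 and 8. Almost the same accuracy, much less memory pressure.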
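A back-of-the-envelope calculation shows the size of the saving. The head counts are the ones quoted above; the layer count (80), head dimension (128), and 16-bit storage are assumptions taken from the published LLaMA-2 70B configuration, not from this article. The function name is made up.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes cached per token: a key and a value vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Assumed LLaMA-2 70B shape: 80 layers, head_dim 128, 16-bit values.
full = kv_cache_bytes_per_token(layers=80, kv_heads=64, head_dim=128)    # one KV head per query head
grouped = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # grouped-query attention

print(full // 1024, "KiB per token without sharing")  # 2560
print(grouped // 1024, "KiB per token with GQA")      # 320
print(full // grouped, "x smaller")                   # 8
```

Under these assumptions a 4,096-token prompt needs about 10 GiB of cache without sharing and about 1.25 GiB with it.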
Stage 6
The feed-forward network
After attention mixes information between tokens, every layer has a second step that gets far less attention itself. The feed-forward network runs on each token's vector on its own — no cross-token mixing — and does three things in order: expand to a larger width, apply a non-linearity, compress back down.
The original transformer used ReLU; GPT and BERT moved to GELU; LLaMA, Mistral and PaLM use SwiGLU. The expand-then-compress shape stayed fixed — only the bend in the middle got iterated on.
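Expand, bend, compress looks like this for a SwiGLU-style network. It is a minimal sketch in plain Python with assumed toy sizes (hidden size 4, inner width 16) and random numbers standing in for trained weights. The helper names are hypothetical.

```python
import math
import random

def silu(x):
    """The smooth non-linearity used inside SwiGLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]  # expand, then apply the bend
    up = matvec(w_up, x)                         # a second, linear expansion
    hidden = [g * u for g, u in zip(gate, up)]   # the gate scales the expansion
    return matvec(w_down, hidden)                # compress back to the input width

rng = random.Random(0)
HIDDEN, INNER = 4, 16  # toy sizes; real models are thousands wide

def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

w_gate = random_matrix(INNER, HIDDEN)
w_up = random_matrix(INNER, HIDDEN)
w_down = random_matrix(HIDDEN, INNER)

token_vector = [0.5, -1.0, 0.25, 2.0]
print(len(swiglu_ffn(token_vector, w_gate, w_up, w_down)))  # 4
```

The function takes one token's vector and never looks at another token, which is the "no cross-token mixing" property described above.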
Most of a dense model's parameters live here, not in attention, and they are not generic. This is where much of the stored factual structure sits. Some FFN neurons fire on specific concepts: one on Eiffel-Tower text, another on programming languages, another on past-tense verbs. "Paris is the capital of France" is represented across these weights, which is why methods like ROME (Rank-One Model Editing) can edit a single fact, such as moving the Eiffel Tower to Rome, with a targeted rank-one change to one FFN matrix and no retraining.
The largest models increasingly replace the dense FFN with Mixture of Experts: Mixtral 8×7B has 8 experts per layer but activates 2 per token, so it carries 46.7B parameters yet spends about 12.9B per token. That is how you grow parameter count without growing inference cost in step.
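Routing "8 experts, 2 active" can be sketched as follows. This assumes the common top-k scheme: keep the two highest router scores, softmax over just those two, and mix the two expert outputs. The experts here are hypothetical toy functions and the router scores are passed in by hand; in a real model both are learned, and the scores are computed from the token's vector.

```python
import math

def top_k_route(router_scores, k=2):
    """Return (expert_index, weight) pairs for the k best-scoring experts."""
    chosen = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs by router weight."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(router_scores, k):
        expert_out = experts[index](x)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Eight toy experts: expert i just multiplies its input by i.
experts = [lambda x, scale=i: [scale * v for v in x] for i in range(8)]
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

print(top_k_route(scores))                    # [(1, 0.5), (4, 0.5)]
print(moe_layer([1.0, 2.0], experts, scores)) # [2.5, 5.0]
```

Six of the eight experts never run for this token. The active share is larger than 2/8 of the total in Mixtral's numbers because attention and embedding weights are shared by every token.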
Stage 7
The residual stream and normalization
Each layer adds to a token's vector instead of replacing it. After attention runs, and after the feed-forward network runs, the result is added to the token's vector rather than overwriting it. Across thirty or a hundred layers those contributions accumulate into a running sum, the residual stream, and the original embedding keeps a direct additive path all the way to the late layers.
The trick came from ResNet (2015), for image recognition: deep networks were untrainable because the learning signal weakened travelling back through many layers, and the shortcut path let it flow directly from output to input. Transformers inherited it. In interpretability work the residual stream is now the central object: every head and every FFN reads from it and writes back to it, and the final unembedding reads the prediction out of it.
The second piece, normalization, is plumbing: without it the running sum would explode or collapse and training would fail. The 2017 design normalized after each block (post-norm); modern models normalize before (pre-norm), which made very deep stacks trainable. Many also swapped full layer norm for RMSNorm — drop the mean-shift, keep the rescaling, most of the benefit at lower cost.
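Both pieces fit in a few lines. This is a minimal sketch for a single token's vector, assuming the pre-norm layout described above. The `attention` and `ffn` arguments are stand-in callables (real attention would also need the other tokens), and `eps` is a small constant commonly added for numerical safety.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale x so its root-mean-square is about 1. No mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)  # the learned per-dimension scale in a real model
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, attention, ffn):
    """One transformer block: normalize, run the sublayer, ADD the result back."""
    x = [a + b for a, b in zip(x, attention(rms_norm(x)))]
    x = [a + b for a, b in zip(x, ffn(rms_norm(x)))]
    return x

# Stand-in sublayers for illustration only.
def nothing(v):
    return [0.0] * len(v)

def small_update(v):
    return [0.1 * t for t in v]

stream = [3.0, 4.0]
print(pre_norm_block(stream, nothing, nothing))            # [3.0, 4.0]
print(pre_norm_block(stream, small_update, small_update))
```

With sublayers that contribute nothing, the input passes through unchanged. That is the direct additive path from the embedding to the late layers.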
Stage 8
Predicting the next token
After all the layers, the model has a vector for every token. To predict what comes next it takes the final vector of the last token only, and converts it into one number per vocabulary entry — 100,000 numbers for a 100,000-token vocabulary. Those are the logits: raw scores, any size, not yet probabilities.
Softmax turns the logits into a probability distribution. The model usually does not just grab the top token: temperature sharpens or flattens the distribution, and top-k / top-p trim the candidates. Drag the temperature and watch the same logits reshape.
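Temperature, top-k, and top-p are each a few lines. The sketch below assumes the usual definitions: divide logits by the temperature before softmax, keep the k most likely tokens, and keep the smallest set of tokens whose probabilities add up to at least p. The function names and logits are made up.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p(probs, k=None, p=None):
    """Zero out the unlikely tail, then renormalise what is left."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k is not None:
        order = order[:k]
    if p is not None:
        kept, running = [], 0.0
        for i in order:
            kept.append(i)
            running += probs[i]
            if running >= p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    keep = set(order)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(logits, temperature=1.0, k=None, p=None, rng=random):
    """Pick one token ID at random according to the trimmed distribution."""
    probs = top_k_top_p(softmax(logits, temperature), k, p)
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.0, -1.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(x, 2) for x in softmax(logits, t)])
print(sample(logits, temperature=0.8, k=3, p=0.9, rng=random.Random(0)))
```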
Once a token is chosen it gets appended, the KV cache is reused, and the loop runs again: new attention, new feed-forward, new final vector, new prediction — until an end-of-sequence token or a length limit. A whole paragraph is just this loop, one token at a time.
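The loop itself is short. In this sketch `toy_model` is a hypothetical stand-in for the whole forward pass plus sampling, and `EOS` is an assumed end-of-sequence ID. The KV cache is left out; it changes how much work each step costs, not what the loop does.

```python
EOS = 0  # assumed end-of-sequence token ID

def toy_model(tokens):
    """Stand-in for a forward pass: 'predicts' the previous ID minus one."""
    return max(tokens[-1] - 1, EOS)

def generate(model, prompt, max_new_tokens=16):
    """Append one predicted token at a time until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = model(tokens)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

print(generate(toy_model, [5]))                    # [5, 4, 3, 2, 1, 0]
print(generate(toy_model, [5], max_new_tokens=2))  # [5, 4, 3]
```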
This single objective — predict the next token — is the entire training signal for a base model. It is never trained directly on accuracy, reasoning, or conversation; those come from post-training on top. One efficiency trick worth knowing is speculative decoding: a small fast model drafts several tokens ahead and the big model verifies them in parallel, matching the big model's output distribution while running the loop faster.
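Speculative decoding is easiest to see in its greedy form, sketched below. This is the special case where both models always pick their top token; the general method uses a probabilistic accept/reject rule to preserve the sampling distribution. `target_next` and `draft_next` are toy stand-ins for the big and small models, and the verification loop stands in for what would be one batched forward pass.

```python
def speculative_step(target_next, draft_next, tokens, draft_len=4):
    """One round: the draft proposes draft_len tokens, the target checks them."""
    # 1. The small model drafts several tokens ahead.
    proposal = []
    for _ in range(draft_len):
        proposal.append(draft_next(tokens + proposal))

    # 2. The big model verifies. Its own token is always the one kept,
    #    so the result matches what it would have produced alone.
    accepted = []
    for guess in proposal:
        correct = target_next(tokens + accepted)
        accepted.append(correct)
        if guess != correct:
            break  # first disagreement ends the round
    else:
        # Every guess matched, so the verification pass yields one extra token.
        accepted.append(target_next(tokens + accepted))
    return tokens + accepted

def target_next(tokens):
    """Toy 'big model': counts upward modulo 10."""
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    """Toy 'small model': agrees with the target except right after a 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10

print(speculative_step(target_next, draft_next, [1]))  # [1, 2, 3, 4]
```

Two draft tokens were accepted and the third was corrected, so one round produced three tokens where the plain loop would have needed three sequential passes of the big model.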
Stage 9
Architecture vs trained weights
So what actually differs between GPT, Claude, Gemini and LLaMA? At the level covered here, surprisingly little — they sit in the same transformer-family design space. Almost all of them share tokenization, embeddings, positional encoding, stacked layers of multi-head attention plus a feed-forward network, residual streams, normalization, and next-token prediction.
| Shared across modern LLMs | What actually varies |
|---|---|
| The skeleton: tokenizer → embeddings → +position → N×(attention + FFN) → norm → logits | The trained weights: learned from different data, at different scale |
| The mechanisms: Q/K/V attention, residual stream, softmax output | The configuration: layer count, vocab size, head count, parameters, dense vs MoE |
| The converged choices: pre-norm, RMSNorm, RoPE, SwiGLU, GQA | The post-training: instruction tuning, human-feedback learning, safety |
A few worked configurations, all from the same skeleton:
| Model | Query heads | KV heads | FFN | Params |
|---|---|---|---|---|
| Mistral 7B | 32 | 8 | dense, SwiGLU | 7B |
| LLaMA-2 70B | 64 | 8 | dense, SwiGLU | 70B |
| Mixtral 8×7B | 32 | 8 | MoE, 8 experts / 2 active | 46.7B total · ~12.9B / token |
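One way to see "same skeleton, different configuration" is to write the table as data. The numbers come straight from the rows above; the `ModelConfig` class and its field names are made up for this sketch and do not match any particular library's config format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    name: str
    query_heads: int
    kv_heads: int
    ffn: str
    total_params_b: float   # billions of parameters stored
    active_params_b: float  # billions of parameters used per token

    @property
    def queries_per_kv_head(self) -> int:
        """How many query heads share each key/value head under GQA."""
        return self.query_heads // self.kv_heads

CONFIGS = [
    ModelConfig("Mistral 7B", 32, 8, "dense, SwiGLU", 7.0, 7.0),
    ModelConfig("LLaMA-2 70B", 64, 8, "dense, SwiGLU", 70.0, 70.0),
    ModelConfig("Mixtral 8x7B", 32, 8, "MoE, 8 experts / 2 active", 46.7, 12.9),
]

for config in CONFIGS:
    active_share = config.active_params_b / config.total_params_b
    print(config.name, config.queries_per_kv_head, round(active_share, 2))
```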
The 2023–2025 stack — pre-norm, RMSNorm, RoPE, SwiGLU, GQA, and MoE in the largest models — converged across teams that arrived at it independently. None of it was invented at once; it accumulated over five years on top of the 2017 design.
It could still change. Mamba and other state-space models are credible for very long sequences, and hybrids are being explored. But the problems in this walkthrough — turning text into integers, giving them meaning and order, letting tokens share information, processing each one, keeping a deep stack trainable, and emitting one token at a time — are the durable parts. Any sequence model has to solve them in some form.
If you've made it here, you can open a modern transformer paper or model card and place each section against one of these nine stages. That was the whole goal.