How LLMs actually work

The transformer machinery, from a string of text to the next-token loop — built to be poked at.

Sunday, 7 June 2026

Almost every modern LLM is the same idea stacked many times over: a transformer block, repeated. Learn the block and you understand most of the model. This is a walkthrough of that machinery — the moving parts, not the math — with each part rebuilt as something you can drag, step through, and inspect.

Different families train on different data at different scales and post-train differently afterward. But the skeleton is shared. By the end you should be able to open a model card or a paper and place each section against one of the stages below.

An interactive rebuild of 0xkato's essay "How LLMs actually work" (June 2026). The structure and the worked examples follow the original; the words and every diagram here are reworked, not reproduced.

Contents: The pipeline · 1 Tokens · 2 Embeddings · 3 Position · 4 Attention · 5 Multi-head · 6 Feed-forward · 7 Residual stream · 8 Next token · 9 Architecture vs weights

The whole thing at a glance

The pipeline

Text goes in, an integer comes out, and that integer becomes the next piece of text. Everything in between is the transformer stack. Step through it once before the detail: the same block of attention plus a feed-forward network repeats dozens of times, and only the very last step turns into a word.

From prompt to next token

Step through the stages, or play the whole pass.


Stage 1

Tokens

Models never see your letters. They see integer IDs. A tokenizer turns a string into a sequence of integers, each pointing at one entry in a fixed vocabulary of typically tens of thousands to a few hundred thousand pieces.

The pieces are usually not whole words — they are subword fragments. Whole-word vocabularies are too large and break on anything new; character-level is too small and forces the model to relearn everything from scratch. Subwords sit in the middle: common pieces get their own token, rare words get assembled from smaller ones.

Token ID. The integer the model uses for one vocabulary entry. The model computes on the number, never on the written word. · Vocabulary. The tokenizer's fixed list of pieces; the model can only receive IDs drawn from it.
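To make "text in, integers out" concrete, here is a minimal sketch of a subword tokenizer. It is an illustration under loud assumptions: the vocabulary is invented, and the greedy longest-match rule is a simple stand-in for real algorithms such as Byte Pair Encoding, not a reimplementation of any of them.

```python
# Hypothetical toy vocabulary: piece -> token ID. Real vocabularies are learned
# from data and hold tens of thousands of entries or more.
VOCAB = {"straw": 0, "berry": 1, "un": 2, "believ": 3, "able": 4,
         "s": 5, "t": 6, "r": 7, "a": 8, "w": 9, "b": 10, "e": 11, "y": 12}

def encode(text: str) -> list[int]:
    """Greedy longest-match split into vocabulary pieces (a stand-in for BPE)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return ids

print(encode("strawberry"))    # [0, 1]
print(encode("unbelievable"))  # [2, 3, 4]
```

Note what the model would receive for "strawberry" here: two integers, with no trace of the individual letters.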

Watch text become integers

Illustrative subword splits — the exact pieces depend on the tokenizer.

This is why models used to miscount the R's in "strawberry." It was never a counting failure — the model was handed a couple of integers, not the letters a human would tick off one by one.

GPT-family models use Byte Pair Encoding variants; LLaMA 1 and 2 use the SentencePiece library, which can itself run BPE. The choice changes compute cost and multilingual coverage, but the shape is identical: text in, integers out.

Stage 2

Embeddings

An ID like 1024 is just a row index — meaningless on its own. Meaning comes from the embedding matrix: a giant lookup table with one row per vocabulary entry, each row a long vector. The row length is the model's hidden size — 4,096 numbers per token is common in 7B-class models; bigger models go wider.

Vector. A list of numbers. Each token becomes a vector so the model can do arithmetic on it. · Embedding matrix. A lookup table — token ID in, learned vector out.

The useful property is that semantically similar tokens land near each other, and none of it is hand-coded — it emerges because those positions help the model predict text. The geometry even carries analogies you can do arithmetic with. Drag through the classic one below.

Embedding space, projected to two dimensions

Hover any word. The gender displacement from "man" to "woman" is the same move that takes "king" to "queen".
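The lookup and the analogy arithmetic can be sketched in a few lines. Everything numeric here is an assumption: the 2-D vectors are chosen by hand so the analogy works exactly, whereas real embeddings are learned, have thousands of dimensions, and only satisfy such analogies approximately.

```python
import math

# Hypothetical 2-D embeddings picked by hand for illustration.
TOKEN_IDS = {"man": 0, "woman": 1, "king": 2, "queen": 3, "dog": 4}
EMBEDDING_MATRIX = [
    [1.0, 0.0],   # man
    [1.0, 1.0],   # woman
    [3.0, 0.2],   # king
    [3.0, 1.2],   # queen
    [-2.0, 0.5],  # dog
]

def embed(word):
    """Embedding lookup: token ID in, one row of the matrix out."""
    return EMBEDDING_MATRIX[TOKEN_IDS[word]]

def nearest(vector, exclude=()):
    """Return the word whose embedding lies closest to `vector`."""
    candidates = [w for w in TOKEN_IDS if w not in exclude]
    return min(candidates, key=lambda w: math.dist(embed(w), vector))

# king - man + woman, computed component by component.
target = [k - m + w
          for k, m, w in zip(embed("king"), embed("man"), embed("woman"))]
print(nearest(target, exclude=("king", "man", "woman")))  # queen
```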

At this point each token is a vector — but the vector for "dog" is identical whether "dog" is first or fifth in your prompt. The embedding knows what, not where. That gap is the next stage's job.

Stage 3

Positional encoding

Plain self-attention has no built-in sense of order. Nothing tells it "dog" came before "bites" rather than after. Since word order changes meaning, the position of each token has to be injected into the math.

Positional encoding. How the model is told where each token sits in the sequence.

The 2017 transformer added a fixed sine-and-cosine pattern to each embedding before anything else, so "dog at position 1" and "dog at position 5" differed by the pattern stacked on top. It worked, but it forced one set of numbers to carry both meaning and position. The learned absolute position embeddings that followed in models such as BERT and GPT-2 had a further limit: they did not extend cleanly past the lengths seen in training.
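A rough sketch of the additive sine-and-cosine scheme, assuming the usual formulation with a base of 10000 and interleaved sine/cosine pairs. The 4-D "dog" embedding is made up for the example.

```python
import math

def sinusoidal_position(pos, dim):
    """Sine/cosine position pattern in the style of the 2017 design."""
    pattern = []
    for i in range(0, dim, 2):
        # Each pair of dimensions oscillates at its own frequency.
        angle = pos / (10000 ** (i / dim))
        pattern.append(math.sin(angle))
        pattern.append(math.cos(angle))
    return pattern[:dim]

dog = [0.5, -0.1, 0.3, 0.8]  # hypothetical embedding for "dog"

# The same token at two positions: identical meaning, different pattern added.
dog_at_1 = [e + p for e, p in zip(dog, sinusoidal_position(1, len(dog)))]
dog_at_5 = [e + p for e, p in zip(dog, sinusoidal_position(5, len(dog)))]
print(dog_at_1 != dog_at_5)  # True
```

The sum is the point of the criticism above: after the addition, one vector has to carry both what the token is and where it sits.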

Modern models mostly use Rotary Position Embeddings (RoPE). Instead of adding a position vector, RoPE rotates the Query and Key vectors by an angle that grows with position. When two tokens are later compared, what matters is the difference between their rotations, which depends only on the distance between them.

RoPE. Rotary Position Embeddings. Rather than adding a position vector, it rotates Query and Key vectors so that relative distance shows up naturally during attention.
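The relative-distance property is easy to check numerically. This sketch uses a single 2-D pair and one rotation rate (`theta=0.5`, an arbitrary assumption); real RoPE splits each head into many such pairs, each with its own frequency. The query and key vectors are hypothetical.

```python
import math

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector by pos * theta radians (one RoPE-style pair)."""
    angle = pos * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(query, q_pos, key, k_pos):
    """Dot product of the rotated query and the rotated key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return q[0] * k[0] + q[1] * k[1]

query, key = (1.0, 0.5), (0.8, -0.3)

near = score(query, 6, key, 2)         # positions 6 and 2: offset 4
shifted = score(query, 106, key, 102)  # far later, but still offset 4
closer = score(query, 6, key, 5)       # offset 1

print(math.isclose(near, shifted, abs_tol=1e-9))  # True: same offset, same score
print(math.isclose(near, closer, abs_tol=1e-9))   # False: different offset
```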

RoPE rotates by position

Set two token positions. Attention reads the angle between them — the relative distance, not the absolute spots.


RoPE encodes relative position for free, extends to longer contexts better, and adds no parameters — which is why LLaMA, Mistral, Gemma and Qwen all use it. Even so, long-context models show a documented "lost in the middle" effect: information at the start and end of a long prompt gets used more reliably than information buried in the middle. That is the real reason "put the important context first" helps.

Stage 4

Attention

This is the mechanism the architecture is named for. Inside every layer, attention lets each token look at the tokens it is allowed to see and decide which ones matter. It does this by giving each token three roles, produced by three learned projections: Query, Key, and Value.

Q, K, V. Query is "what am I looking for"; Key is "what do I match against"; Value is "what gets copied when the match is strong." The same token plays all three roles at once.

Each Query is scored against every visible Key with a scaled dot product — how well the two line up. Softmax turns those scores into weights that sum to one, and the result is a weighted average of the Value vectors. Strong matches dominate.

Dot product. A quick score of how aligned two vectors are. · Softmax. Turns raw scores into weights that add to one — big scores get big weights.
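Put together, one token's attention step looks roughly like the sketch below: score the Query against each Key, scale, softmax, then average the Values. The vectors are invented for illustration; in a real model they come from the learned Q, K, V projections.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to one."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical vectors: the query lines up best with the first key.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])  # the first weight is the largest
print([round(x, 3) for x in output])   # so the output leans toward values[0]
```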

Take "The cat that I saw yesterday was sleeping." When the model processes "was," its Query lines up strongly with the Key for "cat" and weakly with "yesterday," so the Value of "cat" dominates the new representation of "was." A subject several positions back becomes the referent. Tap a word below to see what it attends to.

What each word attends to

A word can only look leftward — at itself and earlier tokens. That left-only rule is causal masking.

Causal masking. Hides future tokens so a left-to-right model can't peek ahead while predicting the next one. Future positions get scores so low they vanish after softmax.
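A minimal sketch of the masking step, assuming the common trick of setting future scores to negative infinity before softmax. The score matrix is a placeholder with every entry equal, so the effect of the mask is easy to see.

```python
import math

def causal_weights(scores):
    """Row i of `scores` holds token i's raw scores against every token."""
    rows = []
    for i, row in enumerate(scores):
        # Positions to the right of i are hidden: their score becomes -inf.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

scores = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]
for row in causal_weights(scores):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```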

One of the cleaner interpretability findings is the induction head (Anthropic, 2022): a head that spots a pattern of the form "A B … A" and predicts B comes next. When the model meets "A" the second time, the head looks back to where "A" appeared, sees what followed, and copies it. It's one of the clearest known mechanisms behind in-context learning.

Attention has one big cost: every token compares against every visible token, so doubling the prompt roughly quadruples the work. That is why long prompts are expensive, and why FlashAttention, sparse attention, and linear attention exist.
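The quadratic growth is simple arithmetic. As a back-of-envelope sketch, count the query-key comparisons in one causal attention pass, ignoring heads, layers, and any of the optimisations named above:

```python
def visible_pairs(n_tokens: int) -> int:
    """Query-key comparisons in one causal pass: token i sees i tokens."""
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, visible_pairs(n))

# Doubling the prompt roughly quadruples the comparisons.
print(visible_pairs(2_000) / visible_pairs(1_000))
```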

Stage 5

Multi-head attention

One attention pass gives one view of which tokens matter. Language has many relationships running at once — subject–verb agreement, pronouns and their referents, long-range links, local phrasing. Multi-head attention runs attention many times in parallel, each pass in its own smaller space.

Attention head. One independent attention pass with its own learned projections.

The part tutorials often get wrong: a head does not get a fixed slice of the token vector. Each head learns its own projection that maps the full vector down to its own small Q, K, V. With 4,096 numbers and 32 heads, each head works in 128 dimensions — but those 128 are a learned projection of all 4,096, a different view, not a different chunk.
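One way to see "a different view, not a different chunk" is to give each head its own projection matrix over the full vector and then change a single input coordinate. The sizes and random weights below are toy assumptions (8 dimensions, 2 heads); only the shape of the computation is the point.

```python
import random

D_MODEL, N_HEADS = 8, 2
HEAD_DIM = D_MODEL // N_HEADS  # 4 here; 4,096 with 32 heads would give 128

random.seed(0)
# One hypothetical Query projection per head, each HEAD_DIM x D_MODEL, so every
# output number is a weighted sum over ALL input coordinates.
W_Q = [[[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(HEAD_DIM)]
       for _ in range(N_HEADS)]

def project(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

token = [1.0] * D_MODEL
queries = [project(W_Q[h], token) for h in range(N_HEADS)]

# Nudge only the last coordinate. If heads owned fixed slices, head 0 would not
# notice; with learned projections every head's Query moves.
nudged = token[:-1] + [2.0]
for h in range(N_HEADS):
    print(h, project(W_Q[h], nudged) != queries[h])  # True for both heads
```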

Specialisation emerges on its own. Below are four heads of the kind researchers actually find — never instructed, just learned. Shared scale, so the patterns are comparable at a glance.

Four heads, one shared scale

Rows attend to columns. Darker = more weight. Tokens: the · cat · sat · on · the · mat.

Heads are concatenated and mixed by a final learned layer. A layer might hold 32 heads; a frontier model stacks dozens of layers — so a model carries thousands of heads, each its own learned view.

KV cache. Stores old Key and Value vectors during generation so the model doesn't recompute the whole prompt for every new token. It is the main memory cost at long context. · GQA. Grouped-Query Attention lets several query heads share fewer key/value heads — nearly the same accuracy, far less cache.

That cache is why a recent change stuck. In Grouped-Query Attention, groups of query heads share key/value heads: LLaMA-2 70B has 64 query heads but 8 key/value heads; Mistral 7B has 32 and 8. Almost the same accuracy, much less memory pressure.
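The memory saving can be estimated with a short calculation. The head counts come from the text; the layer count (80), head size (128), 16-bit values, and 4,096-token sequence are assumptions added for the example, and the "full" case is a hypothetical variant with one key/value head per query head.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    keys_and_values = 2  # one Key vector and one Value vector per position
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value)

# Assumed shape for a LLaMA-2-70B-like model (not taken from the essay).
full = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

print(full / 2**30, grouped / 2**30)  # 10.0 1.25  (GiB)
print(full // grouped)                # 8
```

Under these assumptions, sharing 8 key/value heads across 64 query heads cuts the cache by the same factor of 8.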

Stage 6

The feed-forward network

After attention mixes information between tokens, every layer has a second step that gets far less attention itself. The feed-forward network runs on each token's vector on its own — no cross-token mixing — and does three things in order: expand to a larger width, apply a non-linearity, compress back down.

Expand · bend · compress

Run it. The non-linearity zeroes the negative units (ReLU shown) — that bend is what keeps the network from collapsing.

Non-linearity. A function that bends its input, stopping the whole network from collapsing into a single linear transform. Two linear layers in a row equal one; a hundred still equal one. The bend is what lets depth do anything.
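A tiny worked case shows both the expand-bend-compress shape and why the bend matters. The weights are hand-picked assumptions: with ReLU this block computes the absolute value of each coordinate, something no single matrix can do; with the bend removed, the two matrices multiply out to zero.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Zero out the negative units."""
    return [max(0.0, x) for x in vector]

# Hypothetical weights: expand 2 -> 4, then compress 4 -> 2.
W_UP = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W_DOWN = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

def feed_forward(token, bend=relu):
    """Expand, bend, compress. Runs on one token's vector at a time."""
    return matvec(W_DOWN, bend(matvec(W_UP, token)))

token = [3.0, -2.0]
print(feed_forward(token))                    # [3.0, 2.0]
print(feed_forward(token, bend=lambda v: v))  # [0.0, 0.0]
```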

The original transformer used ReLU; GPT and BERT moved to GELU; LLaMA, Mistral and PaLM use SwiGLU. The expand-then-compress shape stayed fixed — only the bend in the middle got iterated on.

Most of a dense model's parameters live here, not in attention, and they are not generic. This is where much of the stored factual structure sits. Some FFN neurons fire on specific concepts: one on Eiffel-Tower text, another on programming languages, another on past-tense verbs. "Paris is the capital of France" is represented across these weights, which is why methods like ROME can edit a single fact (relocating the Eiffel Tower to Rome, say) with a targeted rank-one change to one FFN matrix, no retraining.

MoE. Mixture of Experts — many parallel feed-forward networks plus a tiny router that sends each token through only a few of them.

The largest models increasingly replace the dense FFN with Mixture of Experts: Mixtral 8×7B has 8 experts per layer but activates 2 per token, so it carries 46.7B parameters yet spends about 12.9B per token. That is how you grow parameter count without growing inference cost in step.
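Routing can be sketched as "pick the top two experts, weight them, sum their outputs." This is an illustration under assumptions: the experts are trivial stand-ins that scale the token, the router scores are invented, and taking a softmax over just the chosen scores is one common gating choice rather than the only one.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(chosen, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run the token through only the routed experts and mix their outputs."""
    output = [0.0] * len(token)
    for expert_index, gate in top_k_route(router_logits, k):
        expert_out = experts[expert_index](token)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output

# Eight hypothetical "experts": expert i just multiplies the token by i.
experts = [lambda v, s=i: [s * x for x in v] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

print(top_k_route(router_logits))                     # [(1, 0.5), (4, 0.5)]
print(moe_layer([1.0, 2.0], experts, router_logits))  # [2.5, 5.0]
```

Six of the eight experts never run for this token, which is where the gap between total and per-token parameters comes from.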

Stage 7

The residual stream and normalization

The model is additive, not replacing. After attention runs, and after the feed-forward network runs, the result is added to the token's vector rather than overwriting it. Across thirty or a hundred layers those contributions accumulate into a running sum — the residual stream — and the original embedding keeps a direct additive path all the way to the late layers.

Residual connection. Adds a block's output back to the vector it started from, giving information and gradients a shortcut straight through the network.
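In code, the additive rule is a single `+` applied twice per layer. The sketch below assumes placeholder attention and feed-forward functions that return fixed updates, which makes the running sum easy to follow; normalization is left out here.

```python
def run_stack(embedding, layers):
    """Each layer is an (attention, feed_forward) pair of functions."""
    stream = list(embedding)
    for attention, feed_forward in layers:
        # Add each block's output to the stream instead of overwriting it.
        stream = [s + a for s, a in zip(stream, attention(stream))]
        stream = [s + f for s, f in zip(stream, feed_forward(stream))]
    return stream

# Hypothetical stand-in blocks that always write the same small update.
def fake_attention(vector):
    return [0.5] * len(vector)

def fake_feed_forward(vector):
    return [0.25] * len(vector)

embedding = [1.0, -2.0]
final = run_stack(embedding, [(fake_attention, fake_feed_forward)] * 3)
print(final)  # [3.25, 0.25]

# Subtract everything the layers wrote and the original embedding is still there.
print([f - 3 * (0.5 + 0.25) for f in final])  # [1.0, -2.0]
```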

The stream accumulates

Each layer adds its attention output and its feed-forward output. The input never gets erased — it just gets added to.

The trick came from ResNet (2015), for image recognition: deep networks were untrainable because the learning signal weakened travelling back through many layers, and the shortcut path let it flow directly from output to input. Transformers inherited it. In interpretability work the residual stream is now the central object: every head and every FFN reads from it and writes back to it, and the final unembedding reads its last state to produce the logits.

Layer normalization. Rescales a token vector so its numbers stay in a stable range as the model trains. · RMSNorm. A cheaper variant that rescales the size of the vector without first subtracting the mean.

The second piece, normalization, is plumbing: without it the running sum would explode or collapse and training would fail. The 2017 design normalized after each block (post-norm); modern models normalize before (pre-norm), which made very deep stacks trainable. Many also swapped full layer norm for RMSNorm — drop the mean-shift, keep the rescaling, most of the benefit at lower cost.
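The difference between the two normalizers fits in a few lines. This sketch leaves out the learned gain (and, for layer norm, the learned bias) that real implementations apply afterward; the `eps` value is a typical small constant, assumed here.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def rms_norm(vector, eps=1e-5):
    """Skip the mean shift; only rescale by the root mean square."""
    mean_square = sum(x * x for x in vector) / len(vector)
    return [x / math.sqrt(mean_square + eps) for x in vector]

stream = [1.0, 2.0, 3.0, 4.0]
print([round(x, 3) for x in layer_norm(stream)])  # centred on zero
print([round(x, 3) for x in rms_norm(stream)])    # same signs and ratios as the input
```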

Stage 8

Predicting the next token

After all the layers, the model has a vector for every token. To predict what comes next it takes the final vector of the last token only, and converts it into one number per vocabulary entry — 100,000 numbers for a 100,000-token vocabulary. Those are the logits: raw scores, any size, not yet probabilities.

Logits. Raw per-token scores. They become probabilities only after softmax. · Temperature. Controls randomness during sampling — low makes the model conservative, high makes it varied.

Softmax turns the logits into a probability distribution. The model usually does not just grab the top token: temperature sharpens or flattens the distribution, and top-k / top-p trim the candidates. Drag the temperature and watch the same logits reshape.

Same logits, different temperature

Context: "The cat ___". Low temperature concentrates on the favourite; high temperature spreads the mass.
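The same reshaping can be reproduced offline. The candidate words and logits below are invented for the "The cat ___" context; the sketch divides the logits by the temperature before softmax and then keeps only the top-k candidates before sampling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Zero out everything but the k most likely tokens, then renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

words = ["sat", "is", "ran", "the"]   # hypothetical candidates
logits = [3.0, 2.0, 1.0, -1.0]        # hypothetical raw scores

for temperature in (0.2, 0.8, 2.0):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample one token from the two most likely candidates at temperature 0.8.
choice = random.choices(words, weights=top_k(softmax(logits, 0.8), k=2))[0]
print(choice)
```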


Once a token is chosen it gets appended, the KV cache is reused, and the loop runs again: new attention, new feed-forward, new final vector, new prediction — until an end-of-sequence token or a length limit. A whole paragraph is just this loop, one token at a time.
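Stripped of everything else, the loop looks roughly like this. The "model" is a hypothetical stand-in that always favours the next integer, decoding is greedy rather than sampled, and there is no KV cache: the sketch re-reads the whole sequence each step, which is exactly the waste the cache avoids.

```python
def generate(prompt_ids, next_token_logits, eos_id, max_new_tokens=20):
    """Append one token at a time until end-of-sequence or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        ids.append(token)
        if token == eos_id:
            break
    return ids

def toy_model(ids, vocab_size=5):
    """Hypothetical model: always favours (last token + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 1.0
    return logits

print(generate([0], toy_model, eos_id=4))                     # [0, 1, 2, 3, 4]
print(generate([0], toy_model, eos_id=9, max_new_tokens=3))   # [0, 1, 2, 3]
```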

This single objective — predict the next token — is the entire training signal for a base model. It is never trained directly on accuracy, reasoning, or conversation; those come from post-training on top. One efficiency trick worth knowing is speculative decoding: a small fast model drafts several tokens ahead and the big model verifies them in parallel, matching the big model's output distribution while running the loop faster.

Speculative decoding. A small draft model guesses ahead; the larger model verifies several guesses at once, keeping its own output distribution but moving faster.
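A simplified sketch of one draft-and-verify round, assuming greedy decoding on both models. Under that assumption, "keeping the output distribution" reduces to "producing exactly the tokens the big model would have picked"; sampled decoding needs a probabilistic accept/reject rule that is not shown. Both models here are hypothetical functions from a token list to the next token.

```python
def speculative_step(ids, draft_next, target_next, n_draft=4):
    """One round of greedy speculative decoding; returns the longer sequence."""
    # 1. The small model guesses n_draft tokens, one at a time.
    drafted = []
    for _ in range(n_draft):
        drafted.append(draft_next(ids + drafted))

    # 2. The big model checks each guess. A real system scores all positions
    #    in one batched forward pass; this loop only mimics the outcome.
    accepted = []
    for guess in drafted:
        wanted = target_next(ids + accepted)
        accepted.append(wanted)
        if wanted != guess:
            break  # first mismatch: keep the big model's token, drop the rest
    else:
        # Every guess matched, so the same pass yields one extra token.
        accepted.append(target_next(ids + accepted))
    return ids + accepted

def target_next(ids):
    """Hypothetical big model: counts upward modulo 10."""
    return (ids[-1] + 1) % 10

def draft_next(ids):
    """Hypothetical small model: agrees, except it fumbles after a 2."""
    return 9 if ids[-1] == 2 else (ids[-1] + 1) % 10

print(speculative_step([0], draft_next, target_next))   # [0, 1, 2, 3]
print(speculative_step([0], target_next, target_next))  # [0, 1, 2, 3, 4, 5]
```

In the first call three tokens come out of one verification round even though the draft was wrong once; in the second, a perfect draft yields five.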

Stage 9

Architecture vs trained weights

So what actually differs between GPT, Claude, Gemini and LLaMA? At the level covered here, surprisingly little — they sit in the same transformer-family design space. Almost all of them share tokenization, embeddings, positional encoding, stacked layers of multi-head attention plus a feed-forward network, residual streams, normalization, and next-token prediction.

Shared across modern LLMs	What actually varies
The skeleton: tokenizer → embeddings → +position → N×(attention + FFN) → norm → logits	The trained weights: learned from different data, at different scale
The mechanisms: Q/K/V attention, residual stream, softmax output	The configuration: layer count, vocab size, head count, parameters, dense vs MoE
The converged choices: pre-norm, RMSNorm, RoPE, SwiGLU, GQA	The post-training: instruction tuning, human-feedback learning, safety

Weights. The learned numbers inside the model. Training adjusts them until the model predicts text well. The architecture is the wiring; the weights are what the wiring learned.

A few worked configurations, all from the same skeleton:

Model	Query heads	KV heads	FFN	Params
Mistral 7B	32	8	dense, SwiGLU	7B
LLaMA-2 70B	64	8	dense, SwiGLU	70B
Mixtral 8×7B	32	8	MoE, 8 experts / 2 active	46.7B total · ~12.9B / token

The 2023–2025 stack — pre-norm, RMSNorm, RoPE, SwiGLU, GQA, and MoE in the largest models — converged across teams that arrived at it independently. None of it was invented at once; it accumulated over five years on top of the 2017 design.

It could still change. Mamba and other state-space models are credible for very long sequences, and hybrids are being explored. But the problems in this walkthrough — turning text into integers, giving them meaning and order, letting tokens share information, processing each one, keeping a deep stack trainable, and emitting one token at a time — are the durable parts. Any sequence model has to solve them in some form.

If you've made it here, you can open a modern transformer paper or model card and place each section against one of these nine stages. That was the whole goal.