Looking Inside Large Language Models
I'm a mathematician who loves applying mathematics in all spheres of life, especially in the field of machine learning. I write Java, Python, and Android applications.
Introduction
Now that we have a solid understanding of how text is broken into tokens and represented as numerical embeddings, it's time to go a level deeper — inside the language model itself. In this chapter, we'll explore the core intuitions behind how Transformer-based language models actually work, using the GPT (Generative Pre-trained Transformer) family of models — the same family that powers tools like ChatGPT — as our primary point of reference.
We'll be exploring both the underlying concepts and hands-on code examples to bring those concepts to life.
To get started, we load our model and tokenizer — the same Microsoft Phi-3 model we worked with earlier — and set up a pipeline (a convenient wrapper that handles the full process of tokenizing input, running it through the model, and decoding the output) so we can interact with it easily throughout this chapter:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)
The Inputs and Outputs of a Trained Transformer LLM
The simplest way to think about a Transformer LLM is as a system that takes in text and produces text in response. This capability is the result of being trained on a large, high-quality dataset — giving the model enough exposure to language to learn how to respond meaningfully to almost any input.
As we established in the previous chapter, the model doesn't produce its full response in one go. Instead, it generates its output one token at a time — each token being produced through what is called a forward pass (the process by which an input enters the neural network, flows through all its internal computations, and emerges as an output on the other side — think of it like the simple equation y = mx + c, where you put a value in and get a result out). The image below illustrates four steps of this token-by-token generation process in response to an input prompt.
After each token is generated, the model doesn't simply move on with the original input unchanged. Instead, the newly generated token is appended to the end of the input prompt, and this extended sequence becomes the new input for the next generation step. This process repeats with each token, with the input growing longer and longer until the full response is complete — as illustrated in the image below.
This gives us a clearer and more accurate picture of what the model is really doing at its core — it is simply predicting the next most likely token based on everything in the input prompt so far. The software wrapped around the neural network runs this prediction in a loop, continuously appending each new token to the growing input and feeding it back in, until the response is complete.
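The loop described above can be sketched in a few lines. This is only a toy illustration: the lookup table NEXT_TOKEN and the functions predict_next_token and generate are hypothetical stand-ins invented for this example, and a real LLM would run a full forward pass over the whole sequence instead of consulting a table.

```python
# Toy stand-in for a language model: a hypothetical lookup table that maps the
# last token to the "most likely" next token. A real LLM would run a forward
# pass over the entire sequence here.
NEXT_TOKEN = {
    "Dear": "Sarah",
    "Sarah": ",",
    ",": "I",
    "I": "am",
    "am": "sorry",
    "sorry": "<eos>",
}


def predict_next_token(tokens):
    """Pretend forward pass: looks only at the last token (illustrative)."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict, append, feed the longer input back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)
        if next_token == "<eos>":  # stop when the model signals it is done
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(generate(["Dear"]))
```

Note that the loop lives outside the "model": the prediction function only ever answers the question of which token comes next, and the surrounding code decides when to stop.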
Models that work this way, where each prediction is fed back in as input to make the next prediction, are called autoregressive models. Text generation LLMs such as the GPT family are autoregressive. This is one of the key characteristics that sets them apart from representation models like BERT (which we covered in Chapter 1), which process the entire input at once rather than generating output token by token.
Here is a look at how this process plays out under the hood when the model is given the following prompt:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
output = generator(prompt)
print(output[0]['generated_text'])
Running this prompt through our model produces the output below — the model generates its response token by token, starting from the subject line and continuing until it reaches the maximum token limit we set earlier (max_new_tokens=50).
The Components of the Forward Pass
Beyond the autoregressive loop that drives token-by-token generation, it's worth understanding what actually happens inside a single forward pass. Between the prompt going in and a new token coming out, three components do the work: the Tokenizer, which converts the raw text into token IDs as we covered in Chapter 2; the stack of Transformer blocks, which performs the bulk of the computation; and the Language Modeling Head (or LM Head), which takes the model's internal representations and translates them into an actual predicted token. The Tokenizer and the LM Head bookend the process: one prepares the input, and the other produces the output.
As we covered in the previous chapter, the Tokenizer breaks the input prompt down into a sequence of token IDs, which then become the input to the model. From there, the input flows through the neural network — a stack of Transformer blocks (individual processing layers stacked on top of one another) that carry out all the heavy computational work. Finally, the output of that stack is passed to the LM Head, which converts those internal representations into probability scores — essentially ranking every token in the vocabulary by how likely it is to be the best next token.
It's also worth recalling that the Tokenizer holds a vocabulary — a complete table of all the tokens it knows. For every single token in that vocabulary, the model maintains a corresponding embedding vector (the numerical representation that captures that token's meaning, as we explored in Chapter 2). This relationship between the vocabulary and its embeddings is illustrated in the image below.
As illustrated in the image below, the computation flows from top to bottom, following the direction of the arrows. For each token the model generates, the process moves sequentially through each Transformer block in the stack — one after another, in order — before finally reaching the LM Head, which takes the output of the last block and converts it into a probability distribution over the entire vocabulary, indicating which token is most likely to come next.
The LM Head is itself a relatively simple neural network layer — but it is a crucial one. What makes it interesting is that it is just one of several possible "heads" that can be attached to the same stack of Transformer blocks, depending on what kind of task you want the model to perform. Think of the Transformer block stack as a powerful, general-purpose engine, and the head as the attachment you swap in to direct that power toward a specific goal. For example, attaching a Sequence Classification Head turns the model into a text classifier — useful for tasks like sentiment analysis. Attaching a Token Classification Head allows the model to label individual tokens — useful for tasks like named entity recognition. The LM Head we've been discussing simply directs that power toward text generation.
Choosing a Single Token from the Probability Distribution (Sampling/Decoding)
At the end of each forward pass, the model produces a probability score for every single token in its vocabulary — essentially a ranked list of candidates for what the next token should be. But having a list of probabilities is not the same as making a decision. The model still needs a way to pick one token from that distribution to actually output. The method used to make this selection is called a decoding strategy.
Different decoding strategies make this choice in different ways — some always pick the highest probability token, while others introduce an element of randomness to make the output more varied and creative. The image below shows a simple example of this process in action, where the decoding strategy selects "Dear" as the next token from the probability distribution.
The simplest decoding strategy would be to always select the token with the highest probability score — but in practice, this rarely produces the best results for most use cases. This approach, known as greedy decoding, is too rigid — it always makes the "safest" choice, which can lead to repetitive and predictable output. It is what happens when you set the temperature parameter (a setting that controls how much randomness the model applies when making its selection) to zero in an LLM.
A better approach is to introduce some randomness into the selection process — allowing the model to occasionally pick the second or third highest scoring token rather than always defaulting to the top one. This is called sampling from the probability distribution. What this means in practice is that each token's probability score directly reflects its chance of being selected. Using the example from the image above — if the token "Dear" has a 40% probability score, it has a 40% chance of being picked. Every other token in the vocabulary also gets a chance of being selected according to its own score, making the output more natural, varied, and creative.
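To make the difference between greedy decoding and sampling concrete, here is a minimal sketch using only the Python standard library. The candidate tokens and their raw scores are made-up values for illustration, and the helper names softmax, greedy_pick, and sample_pick are assumptions of this example rather than part of any library.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    max_s = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(tokens, scores):
    """Greedy decoding: always take the highest-scoring token."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return tokens[best_index]


def sample_pick(tokens, scores, temperature=1.0, rng=random):
    """Sampling: pick each token with probability equal to its softmax weight."""
    probs = softmax(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and raw scores for the next token.
tokens = ["Dear", "Hello", "Hi", "Sorry"]
scores = [2.0, 1.5, 1.0, 0.2]

print(greedy_pick(tokens, scores))  # always the top-scoring token, "Dear"

rng = random.Random(0)  # seeded so repeated runs draw the same samples
print([sample_pick(tokens, scores, rng=rng) for _ in range(5)])
```

As the temperature approaches zero, the softmax concentrates almost all probability on the top-scoring token, which is why a very low temperature behaves like greedy decoding. A temperature of exactly zero cannot be plugged into this sketch because of the division, so implementations typically treat it as a special case.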
Let's look at this process more closely in code. In the block below, we pass the input tokens through the model and then through the LM Head:
prompt = "The capital of France is"
# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Move the input IDs to the GPU
input_ids = input_ids.to("cuda")
# Get the output of the model before the lm_head
model_output = model.model(input_ids)
# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])
The lm_head_output produced here has a shape of [1, 6, 32064], which we can read as: 1 item in the batch, 6 tokens in the input sequence, and a score for each of the 32,064 tokens in the model's vocabulary. Strictly speaking, these are raw scores (logits) rather than probabilities; applying a softmax would turn them into probabilities, but the highest-scoring token is the same either way. To find out what the model predicts as the next token, we only need the scores for the last token in the sequence. We access this using lm_head_output[0, -1], where 0 selects the first (and only) item in the batch, and -1 retrieves the last token in the sequence. This gives us a list of 32,064 scores, one for every token in the vocabulary. From there, we can grab the token ID with the highest score and decode it back into readable text to reveal what the model predicts as the next token:
token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)
Parallel Token Processing and Context Size
One of the defining strengths of the Transformer architecture — and a key reason it overtook earlier neural network architectures like RNNs — is its ability to perform parallel computing. Rather than processing one token at a time in sequence, the Transformer processes all input tokens simultaneously, making it significantly faster and more efficient.
As we established in previous chapters, the Tokenizer first breaks the input text down into individual tokens. Once tokenized, each of those input tokens is assigned its own independent computation path through the model — flowing through the Transformer blocks in parallel rather than waiting for the token before it to finish processing.
Every Transformer model has a limit on how many tokens it can process at once. This limit is known as the model's context length. A model with a 4K context length, for example, can process only about 4,000 tokens at a time (4K usually means 4,096), meaning it can maintain at most that many parallel computation streams simultaneously.
Each of these token streams begins with an input vector — a combination of the token's embedding vector and some positional information (data that tells the model where in the sequence each token sits). At the end of its journey through the Transformer blocks, each stream produces an output vector as the result of all the model's processing.
For text generation, however, only the output vector of the last token stream is actually used to predict the next token — it is the only vector passed into the LM Head to calculate the probability distribution.
You might wonder why the model bothers computing all the other token streams if their final output vectors are discarded. The answer lies in the attention mechanism inside each Transformer block. While we don't use the final output vectors of the earlier streams, their intermediate outputs at each block along the way are actively used in the calculations of the final stream. In other words, every token contributes to the outcome — just not through its final vector.
Returning to our earlier code example — the output of the LM Head had a shape of [1, 6, 32064] because its input was of shape [1, 6, 3072]. This represents one input string in the batch, containing six tokens, each represented by an output vector of 3,072 values produced after passing through the full stack of Transformer blocks.
We can inspect these matrices and their dimensions by printing:
model_output[0].shape
Similarly, we can print the output of the LM Head to inspect its dimensions:
lm_head_output.shape
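To see where these shapes come from without loading the model, the sketch below treats the LM Head as a single matrix multiplication with no bias term, which is an assumption made for simplicity. The sizes are shrunk from 3,072 hidden values and 32,064 vocabulary tokens to 4 and 10, and all numbers are arbitrary illustrative values.

```python
def lm_head(hidden_states, weights):
    """Project each position's hidden vector onto one score per vocabulary token.

    hidden_states: [seq_len][hidden_size]
    weights:       [vocab_size][hidden_size], one row per vocabulary token
    returns:       [seq_len][vocab_size]
    """
    return [
        [sum(h * w for h, w in zip(hidden, row)) for row in weights]
        for hidden in hidden_states
    ]


# Toy sizes standing in for 6 tokens, 3,072 hidden values, and 32,064 tokens.
seq_len, hidden_size, vocab_size = 6, 4, 10
hidden_states = [
    [0.1 * (pos + i) for i in range(hidden_size)] for pos in range(seq_len)
]
weights = [
    [0.01 * (tok - i) for i in range(hidden_size)] for tok in range(vocab_size)
]

scores = lm_head(hidden_states, weights)
print(len(scores), len(scores[0]))  # 6 10

# Only the last position's scores are needed to choose the next token.
last_position_scores = scores[-1]
next_token_id = max(range(vocab_size), key=lambda t: last_position_scores[t])
print(next_token_id)
```

The first dimension of the output follows the sequence length and the second follows the vocabulary size, mirroring the [1, 6, 32064] shape above once the batch dimension is added.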
Speeding Up Generation by Caching Keys and Values
Recall that when generating each new token, the model appends the previously generated token to the input and runs another complete forward pass. Without any optimization, this means recalculating the computations for every token in the growing input sequence at every single step — which becomes increasingly expensive as the sequence gets longer.
A smarter approach is to give the model the ability to cache the results of previous calculations — specifically, certain vectors inside the attention mechanism known as keys and values (two of the central components of how attention works, which we will explore in more detail shortly). By storing these from earlier steps, the model no longer needs to recompute them on every forward pass. Instead, only the calculations for the last token stream need to be performed, since everything before it has already been computed and saved. This optimization technique is called the KV Cache (Keys and Values Cache), and it provides a significant speedup to the generation process.
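A back-of-the-envelope count shows why this matters. The sketch below does not run a model; it only counts how many per-token key and value computations a single attention layer would need during generation. The function name count_kv_computations and the prompt length of 20 tokens are assumptions made for illustration.

```python
def count_kv_computations(prompt_len, new_tokens, use_cache):
    """Count per-token key/value computations needed to generate new_tokens."""
    computed = 0
    cached_positions = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        if use_cache:
            # Only positions not yet in the cache need keys and values.
            computed += seq_len - cached_positions
            cached_positions = seq_len
        else:
            # Every position is recomputed on every forward pass.
            computed += seq_len
        seq_len += 1  # the generated token is appended to the input
    return computed


print(count_kv_computations(20, 100, use_cache=True))   # 119
print(count_kv_computations(20, 100, use_cache=False))  # 6950
```

With the cache, the work grows linearly with the number of generated tokens; without it, the work grows roughly quadratically. Real-world speedups are smaller than this ratio suggests because a forward pass involves more than computing keys and values.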
In the Hugging Face Transformers library, caching is enabled by default. We can disable it by setting use_cache to False. To see the difference this makes, we can time the generation of a long output both with and without caching enabled:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")
We then time how long it takes to generate 100 tokens with caching enabled. We use the %%timeit magic command — a built-in tool in Jupyter and Google Colab that runs the code multiple times and returns the average execution time:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True
)
On a Google Colab instance running a T4 GPU, this completes in approximately 4.5 seconds. Now let's see what happens when we disable the cache:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False
)
Without caching, the same generation takes 21.8 seconds — nearly five times longer. The difference is dramatic. And even the 4.5-second result with caching enabled is a long time from a user experience standpoint — nobody enjoys staring at a blank screen waiting for a response. This is precisely why most LLM APIs stream their output, sending tokens to the user as they are generated rather than waiting for the entire response to be completed before displaying anything.
Inside the Transformer Block
Now that we understand how the forward pass works end to end, let's zoom in on the component sitting at the heart of every modern Large Language Model — the Transformer block. As illustrated in the image below, a Transformer LLM is not made up of just one block, but a series of Transformer blocks stacked sequentially on top of one another — ranging from as few as six blocks in the original Transformer paper to well over a hundred in many of today's large language models. Each block takes in the output of the block before it, performs its own set of computations, and passes the result forward to the next block in the stack.
Each Transformer block is made up of two successive components that work together:
The Attention Layer — This layer is responsible for allowing the model to look across the entire input sequence and pull in relevant information from other tokens and their positions. It is what gives the model its ability to understand context — recognising how each word relates to every other word in the sequence.
The Feed-Forward Layer — This layer follows the Attention Layer and houses the majority of the model's raw processing capability. It takes the context-enriched output from the Attention Layer and passes it through a series of computations that allow the model to learn and apply deeper, more complex patterns in the data.
The feedforward neural network at a glance
To build an intuition for what the Feed-Forward layer does, consider this simple example: if you pass the input "The Shawshank" to a language model, the most probable next word it should generate is "Redemption" — a reference to the iconic 1994 film.
The model's ability to make this prediction comes largely from the Feed-Forward layers, collectively across all the model's blocks. When the model was trained on a massive archive of text, which naturally included countless mentions of "The Shawshank Redemption", it learned and stored that association, along with countless others, within the parameters of these layers.
However, it is important to understand that an LLM is not simply a large database. Yes, memorisation is part of what makes it work — but it is only one ingredient. What makes LLMs truly impressive is their ability to go beyond what they have seen before. The same machinery that allows the model to memorise facts also allows it to generalise — to identify and apply complex patterns in ways that let it handle inputs it has never encountered during training. This ability to interpolate between data points and reason about new situations is what separates a language model from a simple lookup table.
The attention layer at a glance
Memorisation and pattern matching alone can only take a language model so far. To truly model language well, a model needs something more fundamental — context. In fact, before neural networks became the dominant approach, one of the leading methods for building language models relied purely on predicting the next word based on the few words immediately before it (known as N-gram models). The limitation of that approach was clear: language meaning often depends on words that appeared much earlier in a sentence, not just the ones immediately preceding.
This is exactly the problem the Attention mechanism was designed to solve. As the model processes each token, Attention allows it to reach across the entire input sequence and pull in relevant context from other tokens — no matter how far back they appear.
Consider this example: "The dog chased the squirrel because it"
For the model to predict what comes after "it", it needs to resolve what "it" actually refers to — the dog or the squirrel? This is a classic case of pronoun ambiguity, and it is precisely the kind of challenge that Attention handles. By looking back at the full context of the sentence, the Attention mechanism determines which earlier token "it" most likely refers to, and incorporates that information into the representation of the "it" token before making a prediction.
The model makes this determination based on patterns it learned during training. And if earlier sentences in the input provided additional clues — for example, referring to the dog as "she" — the model can use those too, making it even clearer that "it" refers to the squirrel.
Attention is all you need
The image below shows a simplified view of how the Attention mechanism operates. Multiple token positions feed into the Attention layer simultaneously, but the one currently being processed is highlighted — let's call it the focus token. The Attention mechanism takes the input vector at that position, looks across all the other token positions in the sequence for relevant context, and incorporates that context into the output vector it produces for that same position. In other words, what goes in is a vector that only represents that single token in isolation — and what comes out is a richer, context-aware vector that now carries information gathered from the surrounding tokens as well.
The Attention mechanism achieves this in two main steps:
Scoring Relevance — For each token currently being processed, the model calculates a relevance score for every other token in the input sequence. These scores determine how much attention the model should pay to each surrounding token — essentially asking: "How relevant is this token to understanding the one I am currently processing?"
Combining Information — Using those relevance scores as weights, the model then blends the information from the various token positions into a single output vector. Tokens with higher relevance scores contribute more to the final output, while less relevant tokens contribute less — producing a context-enriched representation for the token being processed.
To give the Transformer even greater capacity to understand complex language patterns, the Attention mechanism is not run just once — it is duplicated and executed multiple times in parallel. Each of these parallel instances is called an Attention Head, and together they form what is known as Multi-Head Attention.
The reason for this design is straightforward: a single Attention mechanism can only focus on one type of relationship at a time. By running multiple heads simultaneously, the model can pay attention to several different patterns at once — for example, one head might focus on grammatical relationships between words, while another tracks pronoun references, and another picks up on thematic connections. This dramatically increases the model's ability to capture the full complexity of language in a single pass.
How Attention Is Calculated
Let's take a closer look at what actually happens inside a single Attention Head. Before walking through the calculation, it helps to establish a clear starting picture:
The Attention layer of a generative LLM is processing attention for one token position at a time — specifically, the current token being generated.
The inputs to the layer are two things:
The vector representation of the current token being processed
The vector representations of all the previous tokens in the sequence
The goal is to produce a new, enriched representation of the current token that incorporates relevant information drawn from those previous tokens. For example, if we are processing the last position in the sentence "Sarah fed the cat because it" — we want the model to understand that "it" refers to "the cat". Attention achieves this by pulling in relevant information from the "cat" token and baking it into the representation of "it".
To make this calculation possible, the training process produces three special projection matrices — learnable sets of weights that transform the input vectors into three distinct components that interact with one another during the attention calculation:
A Query projection matrix
A Key projection matrix
A Value projection matrix
The attention calculation begins by multiplying the input vectors by each of the three projection matrices, producing three new matrices — the Queries, Keys, and Values. Each of these matrices represents the input tokens projected into a different mathematical space, and each plays a specific role in carrying out the two steps of attention:
Queries and Keys work together to handle relevance scoring — determining how much attention each previous token deserves relative to the current one.
Values are used in combining information — they hold the actual content that gets blended together based on those relevance scores.
As illustrated in the image below, each row in these three matrices corresponds to a specific token position in the sequence. The bottom row across all three matrices is associated with the current token being processed, while the rows above it correspond to the previous tokens in the sequence.
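As a rough illustration of the projection step, the sketch below multiplies three toy input vectors by three small projection matrices. The numbers, the 4-to-2 dimension sizes, and the names w_query, w_key, and w_value are all made up for this example; in a trained model these matrices are learned and far larger.

```python
def matmul(a, b):
    """Multiply matrix a [n][k] by matrix b [k][m] using plain lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Three token positions, each a 4-dimensional input vector (illustrative values).
inputs = [
    [1.0, 0.0, 1.0, 0.0],  # earlier token
    [0.0, 2.0, 0.0, 2.0],  # earlier token
    [1.0, 1.0, 1.0, 1.0],  # current token (bottom row)
]

# Hypothetical projection matrices mapping 4 dimensions down to 2.
w_query = [[1, 0], [0, 1], [1, 0], [0, 1]]
w_key = [[0, 1], [1, 0], [0, 1], [1, 0]]
w_value = [[1, 1], [0, 0], [0, 0], [1, -1]]

queries = matmul(inputs, w_query)  # one query row per token position
keys = matmul(inputs, w_key)       # one key row per token position
values = matmul(inputs, w_value)   # one value row per token position

print(queries)
print(keys)
print(values)
```

Each output matrix keeps one row per token position, so the bottom row of queries, keys, and values still belongs to the current token.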
Self-attention: Relevance scoring
In a generative Transformer, since we are generating one token at a time, the Attention mechanism is only concerned with one position at a time — the current token being processed. The question it is trying to answer is: "Which of the previous tokens in the sequence are most relevant to the token I am currently processing, and how much should I draw from each of them?"
To answer this, the relevance scoring step works as follows: the Query vector of the current token position is multiplied against the Keys matrix, which contains the key vectors of the previous token positions along with that of the current one. This multiplication produces a set of relevance scores, one for each token, indicating how closely related each one is to the current position.
These raw scores are then passed through a softmax operation — a mathematical function that converts the scores into a set of values that all add up to 1 — effectively turning them into a clean probability-like distribution of relevance weights. The higher a token's score, the more attention the model will pay to it. The figure below shows the relevance scores produced from this calculation.
Self-attention: Combining information
With the relevance scores in hand, the second step of attention can now take place. Each token's Value vector is multiplied by that token's relevance score — meaning tokens that scored highly contribute more strongly, while tokens with low scores contribute very little. These weighted Value vectors are then summed together into a single output vector, which becomes the final output of this attention step — a rich, context-aware representation of the current token that now carries blended information from all the relevant positions that came before it, as illustrated below.
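Both steps fit into a short function. The sketch below is a minimal single-head version for one token position, with made-up query, key, and value vectors standing in for the tokens "Sarah", "cat", and "it". It divides the scores by the square root of the vector dimension, as in the scaled dot-product attention of the original Transformer paper; that scaling is not discussed in the text above and is included here as an assumption about common practice.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    max_s = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single attention head for one token position."""
    d = len(query)
    # Step 1: relevance scoring (query dotted with every key, then softmax).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    # Step 2: combining information (weighted sum of the value vectors).
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, output


# Hypothetical vectors for the positions "Sarah", "cat", and "it".
keys = [[1.0, 0.0], [0.0, 3.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [0.0, 2.0]  # query of the current token, "it"

weights, output = attend(query_for_it, keys, values)
print(weights)  # the weight for "cat" dominates
print(output)   # mostly the "cat" value vector
```

Because the query for "it" lines up best with the key for "cat", most of the weight lands there and the output vector is dominated by the "cat" value.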
The Transformer Block
Recall that every Transformer block is built around two core components — the Attention Layer and the Feed-Forward Neural Network. However, a more detailed look inside the block reveals two additional operations that play a crucial supporting role: Residual Connections and Layer Normalisation, both of which are visible in the diagram below.
While the core components of the Transformer block have remained consistent, the latest models have introduced a number of refinements that improve performance and training efficiency, as shown in the image below.
One notable change is the position of normalisation — in newer architectures, normalisation is applied before the Attention and Feed-Forward layers, rather than after. This simple reordering has been shown to reduce the time required to train the model (for further reading: "On Layer Normalization in the Transformer Architecture").
Another improvement is the type of normalisation used. Newer models have moved from the original LayerNorm to RMSNorm (Root Mean Square Normalisation) — a simpler and more computationally efficient alternative (for further reading: "Root Mean Square Layer Normalization").
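The difference between the two normalisation schemes is easy to see side by side. The sketch below leaves out the learned scale and shift parameters that real implementations include, and the epsilon value is an arbitrary choice, so treat it as an illustration of the core arithmetic only.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm core: subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm core: divide by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


vector = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(vector))  # centred around zero
print(rms_norm(vector))    # rescaled only, every value stays positive
```

RMSNorm skips computing and subtracting the mean, which is where the savings in computation come from.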
Finally, the original ReLU activation function — a mathematical operation used inside the Feed-Forward layer to introduce non-linearity into the model's computations — has largely been replaced by newer variants like SwiGLU, which have been shown to improve model performance (for further reading: "GLU Variants Improve Transformer").
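For a sense of how a SwiGLU feed-forward layer differs from a ReLU one, here is a minimal sketch following the formulation in "GLU Variants Improve Transformer": one projection is passed through the SiLU (Swish) function and used as a gate that multiplies a second projection. Bias terms are omitted, and the tiny weight matrices and the names ffn_relu, ffn_swiglu, w_gate, w_up, and w_down are illustrative assumptions.

```python
import math


def relu(x):
    return max(0.0, x)


def silu(x):
    """SiLU, also called Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(vector, matrix):
    """Multiply a vector [k] by a matrix [k][m]."""
    return [
        sum(vector[t] * matrix[t][j] for t in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


def ffn_relu(x, w_in, w_out):
    """Original-style feed-forward layer: project, apply ReLU, project back."""
    hidden = [relu(h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)


def ffn_swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward layer: a SiLU-activated gate multiplies a second projection."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)


x = [1.0, -2.0]
w_in = [[1, 0, -1], [0, 1, 1]]      # 2 -> 3
w_out = [[1, 0], [0, 1], [1, 1]]    # 3 -> 2
w_up = [[1, 1, 1], [0, 0, 0]]       # second projection used by SwiGLU

print(ffn_relu(x, w_in, w_out))
print(ffn_swiglu(x, w_in, w_up, w_out))
```

Unlike ReLU, SiLU does not cut negative inputs off to exactly zero, and the gating means the SwiGLU layer needs three weight matrices instead of two.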
Positional Embeddings (RoPE)
Since the very first Transformer, positional embeddings have been a fundamental component. Without them, the model would have no way of knowing the order of tokens in a sequence — and in language, order is everything. The meaning of "the dog bit the man" is entirely different from "the man bit the dog", even though both sentences contain exactly the same words.
Over the years, many positional encoding schemes have been proposed. The original Transformer used absolute positional embeddings, which essentially assign each token a fixed position number: the first token gets position 1, the second gets position 2, and so on. These could either be static (where position vectors are generated using trigonometric functions like sine and cosine) or learned (where the model figures out the best positional values during training). While effective for smaller models, these approaches run into challenges as models are scaled up.
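The static variant can be written down directly. The sketch below follows the sine and cosine formulation from "Attention is All You Need", where even dimensions use sine and odd dimensions use cosine with wavelengths that grow across the vector. Positions are counted from 0 here, and the tiny dimension of 4 is chosen purely for readability.

```python
import math


def sinusoidal_position(pos, dim, base=10000.0):
    """Static positional vector: sine on even dimensions, cosine on odd ones."""
    vector = []
    for i in range(0, dim, 2):
        angle = pos / (base ** (i / dim))
        vector.append(math.sin(angle))
        if i + 1 < dim:
            vector.append(math.cos(angle))
    return vector


for position in range(3):
    print(position, sinusoidal_position(position, dim=4))
```

In the original Transformer, a vector like this is added to the token's embedding before the first block, so the same token receives a slightly different input vector at each position.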
One such challenge involves training models with large context windows efficiently. In practice, many documents in a training dataset are far shorter than the model's full context length. Allocating, say, the entire 4K context window to a single 10-word sentence would be extremely wasteful. To address this, multiple shorter documents are packed together into a single context during training — filling the context window more efficiently, as illustrated in the image below.
A positional embedding method also has to adapt to this kind of practical challenge. If Document 50, for example, starts at position 50 in the packed context, telling the model that its first token is position 50 would actually mislead it — the model would assume there is prior context belonging to the same document, when in reality those earlier positions belong to a completely separate, unrelated document that should be ignored.
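One way to picture the bookkeeping involved is to track, for every slot in the packed context, which document it belongs to and where it sits within that document. The sketch below is purely illustrative: the function names and document lengths are made up, and it is not a description of how any particular model's training pipeline handles packing.

```python
def packed_position_ids(document_lengths):
    """Position IDs that restart at 0 at the start of every packed document."""
    position_ids = []
    for length in document_lengths:
        position_ids.extend(range(length))
    return position_ids


def packed_document_ids(document_lengths):
    """Document index for every slot, so unrelated documents can be told apart."""
    document_ids = []
    for doc_index, length in enumerate(document_lengths):
        document_ids.extend([doc_index] * length)
    return document_ids


lengths = [3, 2, 4]  # three short documents packed into one context
print(packed_position_ids(lengths))  # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(packed_document_ids(lengths))  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```

With per-document positions, the first token of the second document is position 0 rather than position 3, so nothing suggests it has earlier context of its own.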
This is one of the key reasons why Rotary Positional Embeddings (RoPE), introduced in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding", have become one of the most widely adopted positional encoding methods in modern large language models. Rather than adding fixed, absolute position numbers at the beginning of the forward pass, RoPE encodes positional information by rotating vectors in their embedding space, a method that naturally captures both the absolute position of each token and the relative distance between tokens. In the forward pass, these rotary embeddings are applied directly inside the Attention step, as shown below.
During the Attention calculation, the rotary positional information is injected at a very specific point — it is applied directly to the Queries and Keys matrices just before they are multiplied together for relevance scoring. This ensures that when the model calculates how relevant each token is to the current position, it is doing so with full awareness of where each token sits in the sequence, as illustrated in the image below.
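The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions by an angle proportional to the token's position, following the formulation in the RoFormer paper, and then compares the query-key score for two pairs of positions that are the same distance apart. It assumes an even vector dimension, the vectors are arbitrary illustrative values, and library implementations may pair the dimensions differently.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    dim = len(vector)  # assumed to be even
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        x, y = vector[i], vector[i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.5, -0.3, 2.0]
key = [0.2, -1.0, 0.7, 0.4]

# Same relative distance (2 positions apart), very different absolute positions.
score_near = dot(rope(query, 5), rope(key, 3))
score_far = dot(rope(query, 105), rope(key, 103))
print(abs(score_near - score_far) < 1e-9)  # True
```

Because the rotation is applied to queries and keys before they are multiplied, the resulting relevance score depends on how far apart two tokens are rather than on where they sit in the context, which suits the packed-document setting described earlier.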
Conclusion
Peering inside a Large Language Model reveals something both elegant and intricate. What appears on the surface as a simple text-in, text-out system is in reality a sophisticated stack of carefully designed components — each one solving a specific problem, each one building on the work of the one before it.
From the way tokens flow through parallel computation paths, to the attention mechanism quietly resolving pronoun references and contextual meaning, to the feed-forward layers storing and applying vast amounts of learned knowledge — every part of the Transformer has a purpose. And understanding those purposes doesn't just satisfy curiosity — it gives you the foundation to make smarter decisions when building, fine-tuning, and working with these models.
In the next chapter, we will take this understanding further and begin exploring how these models are actually trained — and how that training process shapes everything we have discussed here.
🗺️ The Story at a Glance
Here's a quick overview of everything covered in this chapter:
The Big Picture — A Transformer LLM takes in text, processes it through a series of components, and generates output one token at a time
Inputs and Outputs
The model predicts the next most probable token based on the input prompt
Each new token is appended to the input and fed back in — this is called autoregressive generation
The forward pass is the full journey from input to output for each token
Key Components of the Forward Pass
Tokenizer — converts raw text into token IDs
Transformer Block Stack — performs all the heavy processing
LM Head — converts the final output into probability scores for the next token
Decoding Strategies
Greedy Decoding — always picks the highest probability token; predictable but repetitive
Sampling — introduces randomness; produces more natural and varied output
Temperature — controls how much randomness is applied; set to zero for greedy decoding
Parallel Processing and Context Size
All input tokens are processed simultaneously, each through its own computation path
The model's context length determines the maximum number of tokens it can process at once
Only the last token stream's output is passed to the LM Head — but all streams contribute through the attention mechanism
KV Cache
Caches Keys and Values from previous steps to avoid recomputing them
Dramatically speeds up generation — 4.5 seconds vs 21.8 seconds in our example
Enabled by default in Hugging Face Transformers
Inside the Transformer Block
Each block has two core components: Attention Layer and Feed-Forward Layer
Blocks also include Residual Connections and Layer Normalisation
Latest improvements: pre-normalisation, RMSNorm, and SwiGLU activation
The Feed-Forward Layer
Stores the majority of the model's learned knowledge and patterns
Enables both memorisation and generalisation to unseen inputs
The Attention Mechanism
Allows the model to pull in relevant context from other tokens when processing the current one
Works in two steps: Relevance Scoring (Query × Keys) and Combining Information (weighted Values)
Multi-Head Attention runs multiple attention heads in parallel to capture different patterns simultaneously
Positional Embeddings (RoPE)
Enables the model to track the order and position of tokens in a sequence
Rotary Positional Embeddings (RoPE) capture both absolute and relative position information
Applied directly to the Queries and Keys matrices inside the Attention step
References
Cohere, "What Are Transformer Models and How Do They Work?," LLM University. [Online]. Available: https://cohere.com/llmu/what-are-transformer-models
Vaswani et al., "Attention is All You Need," arXiv, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762
J. Alammar and M. Grootendorst, Hands-On Large Language Models. O'Reilly Media, 2024.