Tokens and Embeddings


I'm a mathematician that loves its applications in all spheres of life, especially in the field of machine learning. I write Java, Python and Android applications

Introduction

Before you can truly understand how Large Language Models work — how they are built, how they process language, and where they are headed — you need to get comfortable with two fundamental building blocks: tokens and embeddings. These two concepts sit at the very core of every LLM, and having a solid grasp of them will give you the foundation needed to make sense of everything that follows in this course.

LLM Tokenization

One thing you may have noticed when using a Large Language Model is that its response doesn't appear all at once — instead, it streams out word by word, almost as if the model is thinking in real time. But this isn't just a design choice for the output; it reflects something fundamental about how LLMs process language at every level.

When you send a text prompt to an LLM, the model doesn't read it the way a human would. Instead, it first breaks the input down into smaller units called tokens — which can be individual words, parts of words, or even single characters, depending on the model. This process of breaking text into tokens is called tokenization, and it applies to both the input the model receives and the output it generates — one token at a time.

How Tokenizers Prepare the Inputs to the Language Model

Before your text prompt can be understood and processed by a Large Language Model, it has to go through an important first step — the Tokenizer. Think of the Tokenizer as a gatekeeper that sits between you and the model, preparing your input before it ever reaches the LLM.

The Tokenizer takes your raw text and breaks it down into smaller pieces — either whole words or subwords — converting your prompt into a sequence of tokens that the model can work with. Only after this tokenization step is complete does the model begin using those tokens to predict the next output.

You can explore this process yourself using OpenAI's tokenizer tool — [link] — where you can paste any text and see exactly how it gets broken down into tokens in real time.

Downloading and Running LLM

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(input_ids=input_ids, max_new_tokens=20)

# Print the output
print(tokenizer.decode(generation_output[0]))

The code above loads two things: the language model itself and its corresponding tokenizer. The process follows a clear sequence. We first define the instruction or prompt we want to give the model, then pass it through the tokenizer, which breaks it down into tokens before it is handed off to the model to generate a response. In this example, we cap the output at 20 new tokens with the max_new_tokens=20 argument.

What's particularly important to note here is that the model never receives your original text prompt directly. Instead, the tokenizer processes the raw text first and stores the resulting token sequences in the variable input_ids — and it is this variable, not the original prompt, that gets passed to the model as input for generating the output.

To make this even clearer, let's print the input_ids and see exactly what the tokenized version of our prompt looks like before the model uses it:

Printing input_ids (for example with print(input_ids)) shows the actual input the LLM uses to process and generate its response: not human-readable text, but a sequence of integers. Each integer is a unique ID that corresponds to a specific token in the tokenizer's vocabulary, whether that token is a single character, a full word, or a fragment of a word. In essence, this is how the model "reads": not in letters and words, but in numbers.

These integer IDs are not just arbitrary numbers — each one maps directly to a specific token that the LLM reads and works with. To make this visible, we can translate these token IDs back into the words or subwords they represent using the tokenizer's decode method, as shown below:

# Decode each ID on its own line (token_id avoids shadowing Python's built-in id)
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

From the output above, we can draw a few interesting observations about how the tokenizer works:

  • The very first token ID (1) maps to a special token — <s> — which signals the beginning of the text. This is one of several special tokens tokenizers use to help the model understand the structure of the input.

  • Some tokens represent complete words — for example, "email" and "Sarah" each appear as a single token, meaning the tokenizer recognised them as whole units.

  • Other tokens represent only fragments of words — for example, the word "apologizing" gets split into "apolog" and "izing". This is the tokenizer breaking down less common or longer words into smaller, more manageable subword units.

This last point is especially worth noting — tokenizers don't always split text at clean word boundaries. When a word is uncommon or complex, it gets broken into subwords, which is why the total number of tokens in a prompt is often higher than the total number of words.
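The relationship between tokens and IDs can be pictured as a plain lookup table. The sketch below is a toy illustration only: the vocabulary is made up, every ID except <s> = 1 is an assumption, and the pieces are split by hand rather than produced by Phi-3's real tokenizer.

```python
# Toy vocabulary. The IDs are hypothetical; only <s> -> 1 mirrors the prompt above.
vocab = {"<s>": 1, "Write": 2, "an": 3, "email": 4,
         "apolog": 5, "izing": 6, "to": 7, "Sarah": 8}
id_to_token = {token_id: token for token, token_id in vocab.items()}

prompt_words = "Write an email apologizing to Sarah".split()

# Pieces split by hand to mimic subword behaviour: "apologizing" becomes two tokens.
pieces = ["<s>", "Write", "an", "email", "apolog", "izing", "to", "Sarah"]

input_ids = [vocab[piece] for piece in pieces]                # what the model receives
decoded = [id_to_token[token_id] for token_id in input_ids]   # IDs mapped back to tokens

print(input_ids)
print(len(prompt_words), "words ->", len(input_ids), "tokens")
```

Six words turn into eight tokens here: one extra for the special start token and one extra because "apologizing" is split in two.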

We can also examine the tokens generated by the model using the generation_output variable. When decoded, this will display both the original input tokens that were passed to the model and the new tokens the model generated as its response.

To convert this back into readable text, we simply apply the tokenizer's decode method on the generated output like this:

print(tokenizer.decode(generation_output[0]))

How does the Tokenizer Break Down Text?

The way a tokenizer breaks down an input prompt isn't random — it is the result of deliberate design decisions made during the creation of the model. There are three main factors that determine how a tokenizer works:

  • Tokenization Method — The model's creator first decides which tokenization algorithm to use. Two of the most widely adopted methods are Byte Pair Encoding (BPE), which is the approach used by GPT models, and WordPiece, which is the method used by BERT. Each method has its own strategy for deciding how to split text into tokens.

  • Tokenizer Design Choices — Once the method is chosen, the creator then configures key parameters, such as the vocabulary size — which determines how many unique tokens the tokenizer can recognise — as well as any special tokens that mark the beginning or end of a sequence or sentence.

  • Training Data — Finally, the tokenizer is trained on a specific dataset. This training process allows it to derive the most effective vocabulary for representing the language patterns found in that dataset.

Word, Subword, Character, and Byte Tokens

The tokenization approach we just explored — where text is broken down into whole words or fragments of words — is known as subword tokenization, and it is by far the most commonly used scheme in Large Language Models today. However, it is not the only way to tokenize text. There are several other tokenization schemes, each with its own approach to breaking down input, and understanding the differences between them gives us a clearer picture of the design choices that go into building an LLM.

Word Tokens

One of the earliest tokenization methods, used by models like Word2Vec, word tokenization treats each complete word as a single token. Its major drawback is its inability to handle words it has never seen during training. It also produces a vocabulary where many tokens are only slightly different from one another: "apology", "apologize", and "apologetic" would each require their own separate entry despite sharing the same root. This inefficiency highlighted the need to break words down further, which gave rise to subword tokenization.

Subword Tokens

Subword tokenization addresses the limitation of word tokens by breaking unfamiliar or complex words down into smaller, recognisable pieces. This means that even if the model encounters a word it has never seen before, it can still represent it by combining smaller subword units already present in its vocabulary, which makes the model far more flexible and adaptable.

Character Tokens

Character tokenization takes things a step further by breaking text down into its individual characters. While this approach can technically handle any word, it comes at a cost: token sequences become significantly longer. For instance, subword tokenization can represent the word "play" as a single token, whereas character tokenization would require four separate tokens: "p", "l", "a", and "y". This matters because a model's context window holds a fixed number of tokens, so subword tokens fit more text into it than character tokens do.

Byte Tokens

Byte tokenization breaks text down into the individual bytes that encode each Unicode character (in UTF-8, for example, an ASCII letter takes one byte and an emoji takes four). Because every possible string is just a sequence of bytes, this approach can represent virtually any language or symbol.
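To get a feel for how the schemes differ in sequence length, here is a minimal sketch that splits the same short string at the word, character, and byte level using nothing but built-in Python. The sample string is our own choice, and real tokenizers add vocabularies and special tokens on top of this.

```python
text = "play 🎵"

word_tokens = text.split()                 # split on whitespace
char_tokens = list(text)                   # one token per character
byte_tokens = list(text.encode("utf-8"))   # one token per UTF-8 byte (integers 0-255)

print(len(word_tokens), len(char_tokens), len(byte_tokens))  # 2 6 9
```

The emoji counts as one character but four bytes, which is why the byte-level sequence is the longest of the three.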

Comparing Trained LLM Tokenizers

Now that we understand the factors that shape how a tokenizer is designed, let's see those differences in action. In this section, we will compare several real tokenizers — each trained with different design choices — to observe how those choices influence the way they break down the same piece of text, and what that means for model performance.

To keep the comparison fair and revealing, we will run each tokenizer against the following text — a deliberately varied sample that includes regular words, capitalisation, emojis, non-English characters, code-like syntax, symbols, and numbers:

text = """
English and CAPITALIZATION
🎵鸟
show_tokens False None elif == >= else: two tabs:" " Three tabs: " "
12.0*50=600
"""

To make the comparison visual and easy to follow, we will tokenize the same text using several different models — each trained with a different tokenization method — and print the resulting tokens with alternating colour backgrounds. This way, each token is clearly distinguishable, making it easy to see exactly where each tokenizer decides to make its splits.

We first define a list of colours to cycle through:

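colors_list = [ '102;194;165', '252;141;98', '141;160;203', '231;138;195', '166;216;84', '255;217;47' ]

Then we define a helper that tokenizes a sentence and prints each token on its own background colour:

def show_tokens(sentence, tokenizer_name):
    # Load the tokenizer and convert the sentence into token IDs
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Print each token on a background colour that cycles through colors_list
    for idx, t in enumerate(token_ids):
        print(f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m'
              + tokenizer.decode(t) + '\x1b[0m', end=' ')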

BERT Base Model (Uncased) — 2018

  • Tokenization Method: WordPiece

  • Vocabulary Size: 30,522

The first tokenizer we'll examine is from BERT Base (Uncased) — one of the pioneering Transformer models released by Google in 2018. Being "uncased" means that this version of BERT does not distinguish between uppercase and lowercase letters — all text is converted to lowercase before tokenization begins.

BERT's tokenizer comes with five special tokens, each serving a distinct purpose:

  • [UNK] — Unknown Token: Used when the tokenizer encounters a word or character it has no specific encoding for in its vocabulary.

  • [SEP] — Separator Token: Used to separate two pieces of text, enabling tasks that require the model to process two inputs at once — such as question answering or sentence pair classification.

  • [PAD] — Padding Token: Used to pad shorter inputs to a fixed length, since models expect inputs of a consistent size.

  • [CLS] — Classification Token: A special token placed at the beginning of every input, whose final representation is used as the basis for classification tasks.

  • [MASK] — Mask Token: Used during training to hide certain tokens from the model, forcing it to predict the masked words and learn richer language representations in the process.
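As a rough illustration of how [CLS], [SEP], and [PAD] shape a batch, the sketch below wraps two toy token lists and pads them to the same length. The function name prepare_batch is hypothetical, and the companion mask (1 for real tokens, 0 for padding) is an assumption about how libraries typically tell a model to ignore padding. It is not BERT's actual preprocessing code.

```python
def prepare_batch(token_lists):
    """Wrap each sequence in [CLS] ... [SEP] and pad to the longest length."""
    wrapped = [["[CLS]"] + tokens + ["[SEP]"] for tokens in token_lists]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + ["[PAD]"] * (max_len - len(seq)) for seq in wrapped]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[0 if tok == "[PAD]" else 1 for tok in seq] for seq in padded]
    return padded, mask

batch, mask = prepare_batch([["hello", "world"], ["hi"]])
for seq, m in zip(batch, mask):
    print(seq, m)
```

Both sequences end up four tokens long, with the shorter one carrying a single [PAD] at the end.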

From the tokenized output above, we can make several interesting observations about how BERT's tokenizer handles our sample text:

  • Newline breaks are lost — BERT's tokenizer strips out newline characters entirely, meaning any information or structure encoded through line breaks is invisible to the model.

  • All text is lowercased — True to its "uncased" nature, every capital letter in the original text has been converted to lowercase before tokenization.

  • Subword splitting with "##" — The word "capitalization" gets broken into two subtokens: "capital" and "##ization". The ## prefix is BERT's way of signalling that a token is a continuation of the token before it — it is attached, not a separate word. Conversely, any token without the ## prefix is assumed to have a space before it.

  • Emojis and Chinese characters are replaced with [UNK] — Since these characters fall outside BERT's vocabulary, the tokenizer has no encoding for them and simply replaces them with the [UNK] special token, marking them as unknown. This means the model loses any information those characters carried.
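The behaviour described in these observations can be imitated with a greedy longest-match split, which is the general idea behind WordPiece-style tokenization at inference time. The sketch below is a simplified toy: the function name wordpiece_like and the four-entry vocabulary are hypothetical, and BERT's real tokenizer also handles punctuation, accents, and a vocabulary of 30,522 entries.

```python
def wordpiece_like(word, vocab):
    """Greedy longest-match-first split of one word, WordPiece style."""
    word = word.lower()                      # "uncased": lowercase before matching
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                   # try the longest remaining piece first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                 # nothing matched: the word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"english", "and", "capital", "##ization"}
result = [wordpiece_like(w, toy_vocab) for w in ["English", "CAPITALIZATION", "鸟"]]
print(result)  # [['english'], ['capital', '##ization'], ['[UNK]']]
```

With this toy vocabulary, the capitalised word is lowercased and split into two pieces, while the Chinese character falls back to [UNK] because no piece of it is in the vocabulary.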

BERT Base Model (Cased) — 2018

  • Tokenization Method: WordPiece

  • Vocabulary Size: 28,996

  • Special Tokens: Same as the uncased version — [UNK], [SEP], [PAD], [CLS], [MASK]

The next tokenizer is from BERT Base (Cased) — the sister model to the uncased version we just examined. The key difference here is in the name: this version is "cased", meaning it preserves the original capitalisation of the input text rather than converting everything to lowercase. This makes it better suited for tasks where capitalisation carries meaningful information — such as named entity recognition, where distinguishing "Apple" the company from "apple" the fruit actually matters.

You may also notice that the vocabulary size is slightly smaller than the uncased version — 28,996 compared to 30,522. This might seem counterintuitive at first, since handling both uppercase and lowercase letters would appear to require a larger vocabulary. However, this difference comes down to the specific dataset and design choices made during training.

The most notable difference between the cased and uncased versions becomes immediately clear when we look at how capitalised text is handled. Rather than converting everything to lowercase, the cased tokenizer preserves the original case — but this comes at a cost. The word "CAPITALIZATION", written entirely in uppercase, is now broken down into eight separate subtokens: "CA", "##PI", "##TA", "##L", "##I", "##Z", "##AT", and "##ION". This is significantly more fragmented than the uncased version, illustrating how capitalisation can increase the number of tokens required to represent the same word.

It is also worth pointing out that both the cased and uncased BERT tokenizers follow the same structural pattern — every input is wrapped with a [CLS] token at the very beginning and a [SEP] token at the end. As we covered earlier, [CLS] serves as the classification token that summarises the entire input, while [SEP] acts as a separator — particularly useful in tasks that require passing two separate sentences to the model at once.

GPT-2 — 2019

  • Tokenization Method: Byte Pair Encoding (BPE)

  • Vocabulary Size: 50,257

  • Special Tokens: <|endoftext|>

Moving away from BERT, we now look at the tokenizer used by GPT-2, OpenAI's influential language model released in 2019. Unlike BERT, which was built for language understanding, GPT-2 was designed purely for text generation — and this difference in purpose is reflected in its tokenizer design as well.

GPT-2 uses Byte Pair Encoding (BPE) as its tokenization method. BPE starts from the smallest units (individual characters, or raw bytes in GPT-2's byte-level variant) and repeatedly merges the most frequently occurring adjacent pair into a new token until the vocabulary reaches the desired size. For GPT-2 that target is 50,257 tokens, considerably larger than BERT's vocabulary, which gives GPT-2 a broader and more expressive token set to work with.
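The merge loop at the heart of BPE is short enough to sketch. The version below is a character-level toy under our own assumptions: the function name learn_bpe and the four-word corpus are hypothetical, and GPT-2's real tokenizer works on bytes with extra pre-splitting rules that are left out here.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE training)."""
    corpus = Counter(tuple(word) for word in words)   # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["low", "low", "lower", "lot"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lot" are still made of several pieces. Running more merges, or training on different text, would yield a different vocabulary.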

GPT-2 also takes a much simpler approach to special tokens, using just one — <|endoftext|> — which is placed at the end of a piece of text to signal where one document ends and another begins.

Comparing GPT-2's tokenized output to BERT's, two differences stand out immediately:

  • Newline breaks are preserved — Unlike BERT, which stripped out newline characters entirely, GPT-2's tokenizer retains them. This means the model can actually "see" the structure of the input text, making it more aware of formatting and layout information encoded through line breaks.

  • Capitalisation is preserved and handled more efficiently: GPT-2 keeps the original casing of the text, just like BERT Cased. However, thanks to its larger vocabulary, it handles capitalised words far more efficiently. The word "CAPITALIZATION", which required eight subtokens in BERT Cased, is represented in just four tokens in GPT-2. This highlights a key advantage of a larger, more expressive vocabulary: fewer token splits, which in turn means more text can fit within the model's context window.

GPT-4 — 2023

  • Tokenization Method: Byte Pair Encoding (BPE)

  • Vocabulary Size: Just over 100,000 tokens

  • Special Tokens:

    • <|endoftext|> — Same as GPT-2, signals the end of a text document

    • Fill in the Middle (FIM) Tokens — A set of three special tokens that enable the model to generate a completion by considering not just the text before a given point, but also the text after it. This gives the model a much richer understanding of context when generating responses:

      • <|fim_prefix|> — Marks the text that comes before the section to be completed

      • <|fim_middle|> — Marks the position where the generated completion should be inserted

      • <|fim_suffix|> — Marks the text that comes after the section to be completed
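To make the role of the three FIM tokens more concrete, here is a minimal sketch of how a prompt with a gap might be assembled. The prefix-suffix-middle ordering is an assumption borrowed from how fill-in-the-middle prompts are commonly laid out; this article does not document how GPT-4 itself is prompted, and the helper name build_fim_prompt is hypothetical.

```python
def build_fim_prompt(prefix, suffix):
    """Arrange the text around a gap in prefix-suffix-middle order (an assumed layout)."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(2, 3))",
)
print(prompt)
```

Under this layout the model sees both the code before and after the gap, and whatever it generates after <|fim_middle|> is the piece that belongs in between.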

Compared to GPT-2's vocabulary of 50,257 tokens, GPT-4's tokenizer is nearly twice as large — allowing it to represent text more efficiently, handle a wider range of languages and symbols, and fit significantly more content within its context window.

The GPT-4 tokenizer behaves similarly to the GPT-2 tokenizer, but a few differences stand out:

  • The GPT-4 tokenizer represents the four spaces as a single token. In fact, it has a dedicated token for every run of whitespace characters up to a run of 83.

  • The Python keyword elif has its own token in GPT-4. Both this and the previous point stem from the model's focus on code in addition to natural language.

  • The GPT-4 tokenizer uses fewer tokens to represent most words. Examples here include “CAPITALIZATION” (two tokens versus four) and “tokens” (one token versus three).
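The effect of dedicated whitespace tokens on indented code can be sketched with two toy splitting rules. Neither rule is the real GPT-2 or GPT-4 algorithm; they only contrast "one token per space" with "one token per run of spaces", and both function names are hypothetical.

```python
import re

def split_every_space(line):
    """Toy scheme A: every space is its own token; other runs stay together."""
    return re.findall(r" |[^ ]+", line)

def split_space_runs(line):
    """Toy scheme B: a whole run of spaces is a single token."""
    return re.findall(r" +|[^ ]+", line)

line = "        return x"   # eight spaces of indentation
a = split_every_space(line)
b = split_space_runs(line)
print(len(a), len(b))  # 11 4
```

For deeply indented code the difference adds up quickly, which is one reason a code-oriented tokenizer benefits from whitespace-run tokens.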

Token Embeddings

Our exploration of tokenization has made one thing clear — language models don't work with raw text. They work with tokens. And if a model is trained on a large enough collection of tokens, it begins to pick up on the underlying patterns of language — things like grammar, syntax, and the relationships between concepts that naturally emerge from the data.

The nature of what the model learns is directly shaped by what it is trained on:

  • If the training data contains a large amount of English text, the patterns that emerge will manifest as a model capable of understanding and generating fluent English.

  • If the training data contains factual information — such as Wikipedia — the model will develop the ability to recall and generate factual content.

But for the model to actually learn and work with these patterns computationally, it needs something more than just tokens — it needs a way to represent those tokens numerically, in a form that allows it to perform calculations and model the relationships between them.

This is exactly what embeddings are. Embeddings are the numerical representation space that the model uses to capture the meaning and patterns hidden within language. The richer and more accurate these representations are, the more capable the model becomes — whether that shows up as coherent language generation, coding ability, reasoning, or any of the other growing list of capabilities we expect from modern language models.

A Language Model Holds Embeddings for the Vocabulary of Its Tokenizer

Every pretrained language model is inseparably linked to the tokenizer it was trained with. This is not a flexible relationship — a model cannot simply swap in a different tokenizer after training, because the entire model has been built around the specific vocabulary that its tokenizer produces. Using a different tokenizer would result in a completely different set of tokens, making the model's learned representations meaningless.

This tight coupling exists because the language model holds a dedicated embedding vector for every single token in its tokenizer's vocabulary. When you load a pretrained model, one of the core components you are loading is its embedding matrix — a large table that maps every token in the vocabulary to its corresponding numerical vector. These vectors are not assigned randomly; they are learned during training and encode the meaning, context, and relationships of each token as the model encountered them across vast amounts of text.
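Conceptually, the embedding matrix is a table with one row per token ID. The sketch below builds a tiny one with random values to show the lookup. The vocabulary, the dimension of 4, and the function name embed are all made up for illustration, and a real model learns its values during training.

```python
import random

random.seed(0)
vocab = {"<s>": 0, "hello": 1, "world": 2, "!": 3}   # hypothetical tiny vocabulary
embedding_dim = 4

# One row per token ID. Real models learn these values; here they are random.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in range(len(vocab))]

def embed(token_ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]

vectors = embed([vocab["hello"], vocab["world"]])
print(len(vectors), "vectors of size", len(vectors[0]))
```

The lookup also hints at why tokenizers cannot be swapped: an ID produced by a different tokenizer would either select a row learned for some unrelated token or, as with embed([10]) here, fall outside the table entirely.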

Creating Contextualized Word Embeddings with Language Models

Static embeddings, like those produced by Word2Vec, assign a single fixed vector to every word, regardless of how it is used. But for more advanced tasks, such as named entity recognition or extractive text summarisation, a model needs to go further. It needs to understand the surrounding words and the broader context of a sentence well enough to produce token embeddings that actually shift based on how a word is being used. These are called contextualized word embeddings.

The key idea is straightforward: the same word can carry different meanings in different contexts, and a truly capable model should produce a different embedding for each of those contexts rather than always defaulting to the same fixed representation.
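A toy example makes the contrast visible. In the sketch below, a static lookup always returns the same vector for "bank", while a crude context-mixing step (averaging each word's vector with the sentence mean) yields a different vector depending on the neighbouring word. The 2-dimensional vectors and the mixing rule are our own stand-ins; real models use learned attention layers rather than a simple average.

```python
static = {                      # hypothetical 2-d static vectors
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def contextualize(words):
    """Toy context mixing: average each word's vector with the sentence mean."""
    vectors = [static[w] for w in words]
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vectors]

bank_by_river = contextualize(["river", "bank"])[1]
bank_by_money = contextualize(["money", "bank"])[1]
print(static["bank"], bank_by_river, bank_by_money)
# [0.5, 0.5] [0.625, 0.375] [0.375, 0.625]
```

The static vector for "bank" never changes, but its contextualized version leans towards "river" in one sentence and towards "money" in the other.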

Here's how we can generate contextualized word embeddings in practice:

from transformers import AutoModel, AutoTokenizer

# Load the tokenizer that belongs to the model loaded below
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")

#load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

#Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

#Process the tokens
output = model(**tokens)[0]

The model we are using here, DeBERTa V3, is one of the best performing language models for generating high quality token embeddings. The code above downloads the pretrained tokenizer and its corresponding model, processes the input string "Hello world", and saves the resulting output into the output variable.

Now let's take a closer look at what that output actually contains. The first thing we want to do is inspect its dimensions by printing its shape:

output.shape

This prints out:

torch.Size([1, 4, 384])

Setting aside the first dimension, which is the batch dimension (it lets the model process several inputs at once, and here we passed only one), we can read this output as 4 tokens, each represented by an embedding vector of 384 values. In other words, every token in our input has been mapped to a 384-dimensional numerical vector that captures its meaning in context.

But this raises an interesting question — we only passed in two words: "Hello world". So why do we have four tokens? Did the tokenizer split those two words into four pieces, or is something else going on behind the scenes?

Let's inspect the tokens to find out:

for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

This prints out:

[CLS]
Hello
world
[SEP]

The mystery of the four tokens is solved. The two words "Hello" and "world" each produce one token as expected, but this particular tokenizer automatically wraps every input with a [CLS] token at the beginning and a [SEP] token at the end — the same structural pattern we saw earlier with BERT. These two special tokens account for the additional two tokens, bringing the total to four.

Text Embeddings (for Sentences and Whole Documents)

While token embeddings are fundamental to how language models process text, many real-world LLM applications need to operate at a much higher level — working with entire sentences, paragraphs, or even full documents rather than individual tokens. This need gave rise to a specialised class of models that produce text embeddings — a single vector that captures the meaning of an entire piece of text, no matter how long.

Think of a text embedding model as a compressor: it takes any piece of text — a sentence, a paragraph, or a full document — and distills it down into one single vector that represents its overall meaning in a numerically useful form. There are several ways to produce this single vector, but one of the most common approaches is to simply average the values of all the individual token embeddings produced by the model, combining them into one unified representation.
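The averaging approach, often called mean pooling, fits in a few lines. The sketch below uses three made-up 3-dimensional token vectors. Real models produce hundreds of dimensions and may pool differently, so treat this as an illustration of the idea rather than any particular model's method.

```python
def mean_pool(token_embeddings):
    """Average token vectors dimension by dimension into one text vector."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / count for d in range(dim)]

# Hypothetical embeddings for a three-token sentence.
token_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
text_embedding = mean_pool(token_embeddings)
print(text_embedding)  # [3.0, 4.0, 5.0]
```

However many tokens go in, the result always has the same number of dimensions as a single token embedding.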

To generate text embeddings in practice, we can use sentence-transformers — one of the most widely used Python packages for working with pretrained embedding models:

from sentence_transformers import SentenceTransformer

#Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

#Convert text to text embeddings
vector = model.encode("Best movie ever!")

Here we load a pretrained sentence transformer model and pass in the text "Best movie ever!" — the model processes the entire sentence and compresses it down into a single embedding vector that captures its overall meaning.

The size of that vector — that is, the number of dimensions it contains — varies depending on which embedding model is being used. Different models produce vectors of different lengths, and the size directly affects how much information the embedding can capture. Let's inspect the dimensions of the vector our model produced:

vector.shape

The entire sentence "Best movie ever!" has been compressed and encoded into a single vector containing 768 numerical values. That one vector — 768 numbers — is how the model captures and represents the complete meaning of that sentence in a form that can be used for downstream tasks like semantic search, text classification, or similarity comparisons.
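For similarity comparisons, a common way to compare two text embeddings is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below applies it to three made-up 3-dimensional vectors that stand in for real 768-dimensional embeddings. The variable names and values are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model would produce these from the texts.
best_movie = [0.9, 0.1, 0.0]   # stands in for "Best movie ever!"
great_film = [0.8, 0.2, 0.1]   # stands in for a similar review
tax_form = [0.0, 0.1, 0.9]     # stands in for an unrelated text

sim_close = cosine_similarity(best_movie, great_film)
sim_far = cosine_similarity(best_movie, tax_form)
print(sim_close > sim_far)  # True
```

In a semantic search setting, the same comparison would be run between a query embedding and every document embedding, keeping the highest-scoring documents.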

Conclusion

Tokens and embeddings are not just technical details — they are the very language that LLMs speak. Before a model can understand or generate a single word, that word must first be broken into tokens, converted into numbers, and represented as a meaningful vector in space. Every capability we admire in modern language models — from answering questions to writing code to translating languages — traces back to how well these two foundational steps are executed.

Understanding tokenization tells us how models read. Understanding embeddings tells us how models think. And together, they give us the clearest window yet into what is really happening inside a Large Language Model.

In the next chapter, we will begin building on this foundation and explore how these representations are actually used by the model to generate language.

🗺️ The Story at a Glance

Here's a quick overview of everything covered in this chapter:

  • The Foundation — LLMs do not process raw text; they work with tokens and numerical representations called embeddings

  • Tokenization

    • Text prompts are broken into words or subwords called tokens before reaching the model

    • The model never sees your original text — it receives token IDs (integers) instead

    • Output is also generated token by token, not all at once

  • How Tokenizers Are Designed

    • Tokenization method (e.g. BPE or WordPiece)

    • Design choices — vocabulary size and special tokens

    • Trained on a specific dataset to derive the best vocabulary

  • Types of Tokenization Schemes

    • Word Tokens — whole words; struggles with unseen words

    • Subword Tokens — most widely used; handles new words by splitting into smaller pieces

    • Character Tokens — individual characters; flexible but inefficient

    • Byte Tokens — individual bytes; handles any language or symbol

  • Comparing Real Tokenizers

    • BERT Base Uncased (2018) — WordPiece, 30,522 tokens, strips capitalisation and newlines, unknown characters replaced with [UNK]

    • BERT Base Cased (2018) — WordPiece, 28,996 tokens, preserves capitalisation, wraps input with [CLS] and [SEP]

    • GPT-2 (2019) — BPE, 50,257 tokens, preserves capitalisation and newlines

    • GPT-4 (2023) — BPE, 100,000+ tokens, handles whitespace and code efficiently, fewer tokens per word

  • Token Embeddings

    • Every token in a model's vocabulary has a corresponding embedding vector

    • A model cannot be used with a tokenizer it was not trained with

    • Embeddings are stored in an embedding matrix loaded with the pretrained model

  • Contextualized Word Embeddings

    • Unlike static embeddings (Word2Vec), contextualized embeddings shift based on surrounding words

    • Best for tasks like named entity recognition and extractive summarisation

    • Models like DeBERTa V3 excel at producing high quality contextualized token embeddings

  • Text Embeddings

    • Operate at a higher level — representing entire sentences, paragraphs, or documents as a single vector

    • Commonly produced by averaging all token embedding values

    • Libraries like sentence-transformers make this easy to implement

    • The size of the vector (e.g. 768 dimensions) depends on the underlying model


