Tokenization

The key to how language models understand and process text.

What is Tokenization?

Tokenization is the process of breaking down text into smaller units called tokens. These are the building blocks language models use to understand and generate text.

Most modern tokenizers use subword algorithms: common words stay whole, while rare words are split into smaller, reusable pieces. This keeps the vocabulary at a manageable size while still covering rare words and multiple languages.

Why it Matters

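• Context windows and output limits are measured in tokens, not characters or words
• Hosted model APIs commonly bill per token
• How text is split affects model quality, especially for numbers, code, and languages that are sparsely covered by the vocabulary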
• Understanding tokens helps optimize prompts
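
The points above can be sketched as simple arithmetic. This is only an illustration: `count_tokens` is a crude whitespace stand-in rather than a real tokenizer, and the context limit and price are made-up numbers. Real values come from the model's own tokenizer and the provider's documentation.

```python
# Hypothetical numbers for illustration only; not tied to any real model.
HYPOTHETICAL_CONTEXT_LIMIT = 4096
HYPOTHETICAL_PRICE_PER_1K_TOKENS = 0.002


def count_tokens(text: str) -> int:
    """Crude stand-in: one token per whitespace-separated chunk.
    Real subword tokenizers usually produce more tokens than this."""
    return len(text.split())


def fits_context(prompt: str, max_output_tokens: int, context_limit: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the limit."""
    return count_tokens(prompt) + max_output_tokens <= context_limit


def estimate_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of n_tokens at a price quoted per 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k_tokens


prompt = "Summarize the following report in three bullet points."
n_prompt_tokens = count_tokens(prompt)  # 8 under the whitespace stand-in
ok = fits_context(prompt, max_output_tokens=256,
                  context_limit=HYPOTHETICAL_CONTEXT_LIMIT)
cost = estimate_cost(n_prompt_tokens + 256, HYPOTHETICAL_PRICE_PER_1K_TOKENS)
```

Because both the limit and the bill scale with token count, trimming a prompt by a few hundred tokens helps on both fronts.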

Common Tokenization Methods

Byte Pair Encoding (BPE)

Starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair into a new token.

Used in: GPT-2, GPT-3, GPT-4
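
A minimal sketch of the BPE merge loop, assuming a toy word list and working on characters for readability (GPT-style tokenizers work on bytes and add pre-tokenization rules that are omitted here). The function names are illustrative and not taken from any library.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of symbols with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen
        merges.append(best)
        # Rewrite the corpus with the winning pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word: str, merges: list[tuple]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(words, num_merges=4)
# merges == [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
tokens = apply_merges("lowest", merges)  # ["low", "est"]
```

Note that "lowest" never appears in the training words, yet it splits cleanly into two learned pieces. That is how subword vocabularies cover unseen words.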

WordPiece

Splits each word into subword units by greedy longest-match against a fixed vocabulary. Pieces that continue a word are marked with a ## prefix.

Used in: BERT, DistilBERT
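
A minimal sketch of the greedy longest-match-first segmentation that BERT-style WordPiece applies at tokenization time. The vocabulary below is a tiny made-up example, and the training procedure that builds a real WordPiece vocabulary is not shown.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, assumed for illustration.
toy_vocab = {"un", "##aff", "##able", "token", "##ization"}

wordpiece_tokenize("unaffable", toy_vocab)     # ["un", "##aff", "##able"]
wordpiece_tokenize("tokenization", toy_vocab)  # ["token", "##ization"]
wordpiece_tokenize("xyz", toy_vocab)           # ["[UNK]"]
```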

SentencePiece

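A language-agnostic library that trains and applies subword models (BPE or unigram) directly on raw text. Whitespace is treated as an ordinary symbol, so no language-specific word splitting is needed beforehand.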

Used in: T5, ALBERT, LLaMA
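
The "works on raw text" idea can be illustrated without the library itself. SentencePiece replaces each space with the meta symbol "▁" (U+2581) before segmenting, so joining the pieces back together recovers the original text exactly. The sketch below imitates only that escaping step, assuming the input contains no literal "▁"; `toy_pieces` is a hypothetical helper, not the real segmentation model, and it omits SentencePiece's default leading "▁".

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses for a space


def toy_pieces(text: str) -> list[str]:
    """Escape spaces, then cut before each marker so it stays
    attached to the chunk that follows it."""
    pieces, current = [], ""
    for ch in text.replace(" ", WS):
        if ch == WS and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def pieces_to_text(pieces: list[str]) -> str:
    """Detokenize: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")


pieces = toy_pieces("Hello world")  # ["Hello", "▁world"]
restored = pieces_to_text(pieces)   # "Hello world"
```

Since the space lives inside the pieces, detokenization needs no language-specific rules about where spaces belong.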

Tiktoken

OpenAI's fast tokenizer library. It implements the BPE encodings used by OpenAI models rather than a separate algorithm.

Used in: GPT-3.5, GPT-4 models

Key Concepts

Token
A unit of text (word, subword, or character).
Vocabulary
The complete set of tokens a model knows.
Subword
A fragment of a word, such as a prefix, stem, or suffix, that the tokenizer treats as a single token.
Special Tokens
Reserved tokens with a structural role rather than text content, such as BERT's [CLS] and [SEP].
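
These concepts fit together as a lookup table. The sketch below uses a tiny made-up vocabulary with BERT-style special tokens; the IDs are arbitrary and do not match any real model.

```python
# Toy vocabulary: token string -> integer ID (IDs are invented).
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "token": 4, "##ization": 5, "matters": 6,
}
id_to_token = {i: t for t, i in vocab.items()}


def encode(tokens: list[str]) -> list[int]:
    """Wrap the tokens in [CLS] ... [SEP] and map each one to its ID.
    Tokens missing from the vocabulary map to [UNK]."""
    wrapped = ["[CLS]", *tokens, "[SEP]"]
    return [vocab.get(t, vocab["[UNK]"]) for t in wrapped]


def decode(ids: list[int]) -> list[str]:
    """Map IDs back to their token strings."""
    return [id_to_token[i] for i in ids]


ids = encode(["token", "##ization", "rocks"])  # [2, 4, 5, 1, 3]
back = decode(ids)  # ["[CLS]", "token", "##ization", "[UNK]", "[SEP]"]
```

The word "rocks" is not in the vocabulary, so it comes back as [UNK]. Subword tokenizers exist largely to make this case rare.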

Pro Tip

Use the interactive tokenizer to see how your text is split into tokens in real time!