Tokenization
The key to how language models understand and process text.
What is Tokenization?
Tokenization is the process of breaking down text into smaller units called tokens. These are the building blocks language models use to understand and generate text.
Modern tokenizers use subword algorithms to balance vocabulary size and meaning, handling rare words and multiple languages effectively.
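To make the rare-word point concrete, here is a minimal sketch of greedy longest-match subword splitting. The function name `subword_split` and the toy vocabulary are hypothetical, chosen only for illustration; real tokenizers learn vocabularies of tens of thousands of pieces and differ in their matching rules.

```python
def subword_split(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matches at this position, so give up on the whole word.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"token", "ization", "un", "believ", "able", "s"}

print(subword_split("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(subword_split("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
```

A word that never appeared as a whole in the vocabulary can still be represented, as long as its pieces are known.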
Why it Matters
- Token limits define input/output length
- API costs are calculated per token
- Model performance depends on tokenization
- Understanding tokens helps optimize prompts
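The first two points can be turned into a quick back-of-the-envelope check. The sketch below assumes a single context window shared by prompt and output, and separate per-1,000-token prices for input and output; the function name, the limit, and the prices are all hypothetical placeholders, not values from any real provider.

```python
def estimate_request(prompt_tokens, max_output_tokens, context_limit,
                     price_in_per_1k, price_out_per_1k):
    """Return (fits, worst_case_cost) for one request.

    Assumes prompt and output share one context window and that
    billing is linear in the number of tokens.
    """
    fits = prompt_tokens + max_output_tokens <= context_limit
    cost = (prompt_tokens / 1000) * price_in_per_1k \
        + (max_output_tokens / 1000) * price_out_per_1k
    return fits, cost


# Hypothetical numbers: a 4096-token window and made-up prices.
fits, cost = estimate_request(
    prompt_tokens=1500,
    max_output_tokens=500,
    context_limit=4096,
    price_in_per_1k=0.001,
    price_out_per_1k=0.002,
)
print(fits, f"{cost:.4f}")
```

The cost is a worst case, since a model may stop before reaching the output cap.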
Common Tokenization Methods
Byte Pair Encoding (BPE)
Starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair into a new token.
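The merge procedure can be sketched in a few lines. This is a toy, character-level version under simplifying assumptions: it works on a list of words rather than raw bytes, and it breaks ties by first occurrence. The names `learn_bpe_merges` and `merge_pair` are illustrative, not from any library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the first one seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


print(learn_bpe_merges(["low", "low", "lower", "lowest"], 2))
# [('l', 'o'), ('lo', 'w')]
```

Each learned merge adds one entry to the vocabulary, so the number of merges directly controls vocabulary size.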
Used in: GPT-2, GPT-3, GPT-4
WordPiece
Splits words into subword units drawn from a learned vocabulary. In BERT, pieces that continue a word are marked with a ## prefix.
Used in: BERT, DistilBERT
SentencePiece
Language-agnostic: trains directly on raw text without first splitting it into words, treating whitespace as an ordinary symbol.
Used in: T5, ALBERT, LLaMA
Tiktoken
OpenAI's fast, open-source BPE tokenizer library. It is an implementation of BPE rather than a separate algorithm.
Used in: GPT-3.5, GPT-4 models
Key Concepts
- Token
A unit of text (word, subword, or character).
- Vocabulary
The complete set of tokens a model knows.
- Subword
A fragment of a word, such as a prefix, stem, or suffix, that is treated as a single token.
- Special Tokens
Reserved tokens with control roles, such as [CLS] and [SEP] in BERT.
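These concepts fit together as a simple lookup table. The sketch below is a toy illustration: the helper names, the token IDs, and the choice to reserve the lowest IDs for special tokens are assumptions, though [PAD], [UNK], [CLS], and [SEP] are real BERT-style special tokens.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab(tokens):
    """Map each token to an integer ID, reserving the first IDs for special tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Wrap tokens in [CLS] ... [SEP] and map them to IDs, using [UNK] for unknowns."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in wrapped]


def decode(ids, vocab):
    """Map IDs back to token strings."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    return [id_to_token[i] for i in ids]


vocab = build_vocab(["token", "ization", "matters"])
ids = encode(["token", "ization", "rocks"], vocab)
print(ids)                 # [2, 4, 5, 1, 3]
print(decode(ids, vocab))  # ['[CLS]', 'token', 'ization', '[UNK]', '[SEP]']
```

The token "rocks" is outside the vocabulary, so it maps to the [UNK] ID and cannot be recovered on decoding.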