AI Knowledge Hub

Tokenization

The key to how language models understand and process text.
What is Tokenization?

Tokenization is the process of breaking down text into smaller units called tokens. These are the building blocks language models use to understand and generate text.

Modern tokenizers use subword algorithms to keep the vocabulary small while still covering rare words and multiple languages: common words stay whole, and rare words are split into smaller, reusable pieces.
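To see why subword units are a compromise, it can help to look at the two extremes they sit between. The snippet below is only an illustration with a made-up sentence: it splits the same text by whitespace and by character and compares the sequence lengths.

```python
text = "Tokenizers handle unbelievably rare words"

# Word-level: short sequence, but every distinct word needs its own vocabulary entry.
word_tokens = text.split()

# Character-level: tiny vocabulary, but a much longer sequence.
char_tokens = list(text)

print(len(word_tokens), word_tokens)
print(len(char_tokens))
```

A subword tokenizer would typically land somewhere in between, for example keeping "rare" whole while splitting "unbelievably" into a few pieces.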

Why it Matters
  • Token limits define input/output length
  • API costs are calculated per token
  • Model performance depends on tokenization
  • Understanding tokens helps optimize prompts
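The first two points above come down to simple arithmetic on token counts. The sketch below is a minimal illustration, not any provider's billing logic: the function names, the price, and the context limit are all hypothetical, and it assumes a model whose limit covers prompt and output together.

```python
def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of processing num_tokens at a given price per 1,000 tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_limit: int) -> bool:
    """True if the prompt plus the requested output stays within the limit."""
    return prompt_tokens + max_output_tokens <= context_limit


# Hypothetical numbers, chosen only for illustration.
PRICE_PER_1K = 0.002
CONTEXT_LIMIT = 4096

print(estimate_cost(1500, PRICE_PER_1K))
print(fits_in_context(3000, 1000, CONTEXT_LIMIT))
print(fits_in_context(3500, 1000, CONTEXT_LIMIT))
```

Shortening a prompt by a few hundred tokens lowers the cost and leaves more room for the model's output.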
Common Tokenization Methods
Byte Pair Encoding (BPE)

Starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair into a new token.

Used in: GPT-2, GPT-3, GPT-4
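The merge loop behind BPE can be sketched in a few lines. This is a toy, character-level version for illustration only (the GPT tokenizers operate on bytes and add pre-tokenization rules); the function names and the sample corpus are assumptions, not taken from any library.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Ties go to the pair seen first; real implementations may differ.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges on this corpus, "low" has become a single symbol, while "lower" and "lowest" are still "low" followed by single characters.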
WordPiece

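Splits words into subword units by greedily matching the longest piece found in the vocabulary. In BERT, pieces that continue a word carry a ## prefix.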

Used in: BERT, DistilBERT
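At inference time, WordPiece as used in BERT tokenizes each word with a greedy longest-match-first search. The sketch below shows that search on a tiny hand-made vocabulary; the function name and the vocabulary are hypothetical, and a real tokenizer would also handle casing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```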
SentencePiece

Language-agnostic: works directly on raw text without pre-splitting on whitespace, treating spaces as ordinary symbols.

Used in: T5, ALBERT, LLaMA
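One idea that makes this possible is escaping whitespace as a visible meta symbol, "▁" (U+2581), so that the original text can be recovered exactly from the pieces. The sketch below imitates only that escaping step in plain Python; it does not call the SentencePiece library, the helper names are hypothetical, and the pieces are hand-picked rather than produced by a trained model.

```python
WS = "\u2581"  # the "▁" meta symbol that stands in for a space


def escape_whitespace(text):
    """Mark the start of the text and every space with the meta symbol."""
    return WS + text.replace(" ", WS)


def detokenize(pieces):
    """Join pieces, restore spaces, and drop the leading marker."""
    return "".join(pieces).replace(WS, " ")[1:]


escaped = escape_whitespace("Hello world")
print(escaped)

# Hand-picked segmentation; a trained model would choose its own pieces.
pieces = [WS + "Hello", WS + "wor", "ld"]
print(detokenize(pieces))
```

Because spaces live inside the pieces, detokenizing is plain string concatenation and needs no language-specific rules.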
Tiktoken

OpenAI's fast tokenizer library. It implements BPE rather than a separate algorithm.

Used in: GPT-3.5, GPT-4 models
Key Concepts
  • Token

    A unit of text (word, subword, or character).

  • Vocabulary

    The complete set of tokens a model knows.

  • Subword

    A piece of a word, such as a prefix, stem, or suffix. It is not always meaningful on its own.

  • Special Tokens

    Reserved tokens with a structural role, such as BERT's [CLS] (start of input) and [SEP] (separator).
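The concepts above fit together when text is converted to model input: tokens are looked up in the vocabulary to get integer IDs, and special tokens frame the sequence. The sketch below follows BERT's single-sentence layout ([CLS] ... [SEP]); the function names are hypothetical and the IDs are arbitrary, not those of any real model.

```python
def build_vocab(tokens, special_tokens=("[PAD]", "[UNK]", "[CLS]", "[SEP]")):
    """Assign an integer ID to each token, reserving the lowest IDs for special tokens."""
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Frame the tokens with [CLS] and [SEP], then map each one to its ID."""
    framed = ["[CLS]", *tokens, "[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in framed]


vocab = build_vocab(["token", "##ization", "matters"])
ids = encode(["token", "##ization", "rocks"], vocab)
print(vocab)
print(ids)
```

Here "rocks" is not in the toy vocabulary, so it maps to the [UNK] ID.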
