AI Glossary
Tokenization
The process of breaking text into smaller units called tokens before it is processed by an AI model.
Overview
Before an AI model can understand language, it must first break that language into smaller pieces.
Humans naturally read words, sentences, and paragraphs. Computers, however, require information to be structured in ways that can be processed mathematically.
This preparation step is called tokenization.
Tokenization is the process of dividing text into smaller units called tokens. Depending on the model, a token might represent a word, part of a word, punctuation, or another piece of text.
For example, the sentence:
“Artificial intelligence is transforming business.”
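might be divided into tokens such as “Artificial”, “intelligence”, “is”, “transforming”, “business”, and “.”, each of which the model processes as a separate unit. Some tokenizers would go further and split a longer word such as “transforming” into smaller pieces.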
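As a rough illustration of how such a split might work, the sketch below separates the example sentence into words and punctuation marks using a regular expression. The function name simple_tokenize is made up for this example, and the approach is an assumption chosen for simplicity: production language models typically use learned subword tokenizers rather than a rule like this one.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Illustrative only: real model tokenizers usually rely on learned
    subword vocabularies, not a fixed pattern like this one.
    """
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Artificial intelligence is transforming business."
tokens = simple_tokenize(sentence)
print(tokens)
# ['Artificial', 'intelligence', 'is', 'transforming', 'business', '.']
```

Here the sentence becomes six tokens: five words and the final period. A subword tokenizer could produce a different count for the same sentence.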
Although tokenization happens behind the scenes, it is one of the first and most important steps in language processing.
Everything that follows, including embeddings, attention mechanisms, and language generation, depends on the text first being converted into tokens.
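One way to picture why later steps depend on tokens: each token is usually mapped to an integer ID, and those IDs are what the rest of the system works with. The sketch below builds a toy vocabulary from a short token list. The names build_vocab and encode are hypothetical, and real systems use a fixed vocabulary learned in advance rather than one built on the fly.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its integer ID from the vocabulary."""
    return [vocab[token] for token in tokens]

# A toy token list; "the" appears twice and so reuses the same ID.
tokens = ["the", "model", "reads", "the", "text", "."]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)
# [0, 1, 2, 0, 3, 4]
```

A repeated token always receives the same ID, which is what lets a model treat every occurrence of a token consistently.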
Understanding tokenization helps explain how AI systems transform human language into a format that machines can analyze and understand.
Why It Matters
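Tokenization serves as the foundation for how language models process text: the model does not work with raw sentences directly, only with the sequence of tokens produced from them.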
Real-World Example
When you type a prompt into a chatbot, the text is tokenized before the model begins generating a response.