AI Glossary

Tokenization

The process of breaking text into smaller units called tokens before it is processed by an AI model.

Overview

Before an AI model can understand language, it must first break that language into smaller pieces.

Humans naturally read words, sentences, and paragraphs. Computers, however, require information to be structured in ways that can be processed mathematically.

This preparation step is called tokenization.

Tokenization is the process of dividing text into smaller units called tokens. Depending on the model, a token might represent a word, part of a word, punctuation, or another piece of text.

For example, the sentence:

“Artificial intelligence is transforming business.”

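might be divided into tokens such as “Artificial”, “intelligence”, “is”, “transforming”, “business”, and “.”, which the model then processes as a sequence. A tokenizer that works on parts of words may split longer words into smaller pieces.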
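As a rough illustration of the idea, the sketch below splits the example sentence into word and punctuation tokens using a regular expression. The function name simple_tokenize is hypothetical, and this is a deliberately simplified approach; production language models generally use learned subword tokenizers rather than a rule like this.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens.

    Illustrative only: real models typically use learned subword vocabularies.
    """
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Artificial intelligence is transforming business.")
print(tokens)
# ['Artificial', 'intelligence', 'is', 'transforming', 'business', '.']
```

Here the sentence becomes six tokens: five words and the final period.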

Although tokenization happens behind the scenes, it is one of the first and most important steps in language processing.

Everything that follows, including embeddings, attention mechanisms, and language generation, depends on the text first being converted into tokens.
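Later stages such as embeddings work with numbers rather than text, so each token is usually mapped to an integer ID from a vocabulary. The sketch below shows that mapping with a toy vocabulary built from a single sentence. The names build_vocab, encode, and unk_id are hypothetical, and using -1 for unknown tokens is an assumption made for this example; real tokenizers ship with a fixed vocabulary learned in advance.

```python
def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens: list[str], vocab: dict[str, int], unk_id: int = -1) -> list[int]:
    """Replace each token with its ID; tokens not in the vocabulary get unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]


tokens = ["Artificial", "intelligence", "is", "transforming", "business", "."]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)
# [0, 1, 2, 3, 4, 5]
```

It is this sequence of IDs, not the original characters, that is passed on to the embedding step.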

Understanding tokenization helps explain how AI systems transform human language into a format that machines can analyze and understand.

Why It Matters

Tokenization is the foundation for how language models process text: a model typically works with the sequence of tokens produced from the text, not with the raw text itself.

Real-World Example

When you type a prompt into a chatbot, the text is tokenized before the model begins generating a response.