What Is a Token in a Language Model?
Large Language Models (LLMs) don't "read" text the way humans do. They neither process individual letters nor whole words — instead, they use intermediate units called tokens. A token is the smallest text fragment to which input data (a prompt) is reduced during natural language processing, and from which the model constructs its responses.
A single token can represent:
- A single character — a letter, digit, punctuation mark, or space (a space before a word creates a different token than the word itself),
- A word fragment (subword) — rare or complex words are split into smaller parts; e.g., "unbelievable" → "un", "believ", "able",
- An entire word — common words like "the", "dog", "fast" typically constitute a single token.
It's important to distinguish language tokens from crypto AI tokens — although both classes share the same name, they are fundamentally different technologies. Crypto AI tokens operate on blockchain platforms and serve as payment instruments or voting rights. This analysis focuses exclusively on tokens in the context of language models.
How Are Tokens Created? The Byte-Pair Encoding Algorithm
Specialized algorithms called tokenizers are responsible for converting text into tokens. The dominant approach in modern models — used by OpenAI, Anthropic, and Google — is Byte-Pair Encoding (BPE).
BPE is an iterative statistical algorithm that works in four steps:
- Initially, every single character in the training corpus is treated as a separate token.
- The system scans the text and identifies the most frequently co-occurring character pairs.
- The most frequent pair is merged into a single new token (e.g., "t" + "h" → "th").
- Steps 2–3 repeat until the vocabulary reaches its target size — typically 50,000 to 100,000 unique tokens.
The result is a vocabulary where common English words form a single token, while rare or unknown words are reconstructed from smaller known fragments. BPE solves the classic OOV (Out-Of-Vocabulary) problem — even unknown words can be composed from fragments already in the vocabulary.
Different Tokenizers, Different Results
Each company uses its own tokenizer implementation — the same text can yield a different token count depending on the model. Take the 7-word test sentence “Artificial intelligence is transforming software development today.” and see how each tokenizer breaks it down:
| Model / Tokenizer | Vocabulary | Token count | Notes |
|---|---|---|---|
| GPT-4 (cl100k) | 100 000 | 9 | “Artificial” → “Art” + “ificial” |
| GPT-4o / o1 (o200k) | 200 000 | 8 | “Artificial” as a single token |
| Claude (Anthropic) | proprietary | ~9 | Comparable to GPT-4 cl100k |
| Llama 2 (SentencePiece) | 32 000 | ~11 | Smaller vocab = more fragments |
How Many Words Per Token? Rules of Thumb
Calculating an exact token count requires running the model’s tokenizer. In everyday practice, however, rough “rules of thumb” are sufficient — quick heuristics that let you estimate how many tokens a given piece of text will consume without running any tools. The values below apply to standard English text.
| Measure | Approximate value |
|---|---|
| 1 token | ≈ 4 characters |
| 1 token | ≈ 0.75 words |
| 100 tokens | ≈ 75 words |
| 1 paragraph | ≈ 100 tokens |
| A4 page (500 words) | ≈ 650–700 tokens |
| 1,000 words | ≈ 1,333 tokens |
Why Non-English Languages Are More Expensive
Tokenizers were trained primarily on English corpora, causing a multilingual deficit — languages with richer morphology or different writing systems are split into more, smaller fragments. English averages 1 word per 1.3 tokens. Romance languages (Spanish, French) average 1 word per 2 tokens. Polish, Hindi, and Chinese require a multiplier of 1.5×–3× compared to English.
In practice, this means non-English prompts consume significantly more space in the context window and generate higher API costs than equivalent English text.
Context Window — The Model's Working Memory
Every generative model operates within a context window — a hard limit expressed in tokens that determines the maximum volume of data processed in a single session. The limit covers input tokens (the entire prompt: instructions, conversation history, attached context retrieved via Retrieval-Augmented Generation architecture) and output tokens — the space reserved for the generated response.
If the limit is exceeded, the model either rejects the request with an API error, or "forgets" the oldest part of the conversation, leading to incoherence and hallucinations.
| Model | Context window (tokens) | Approximate word equivalent |
|---|---|---|
| GPT-3 (Davinci) | 4,096 | ≈ 3,000 |
| GPT-3.5 Turbo | 16,385 | ≈ 12,000 |
| GPT-4o | 128,000 | ≈ 96,000 |
| OpenAI o3 | 200,000 | ≈ 150,000 |
| Llama 3.1 (Meta) | 128,000 | ≈ 96,000 |
| Claude 3.7 Sonnet | 200,000 | ≈ 150,000 |
| Gemini 2.0 Flash | 1,000,000 | ≈ 750,000 |
Reasoning Tokens — The Hidden Context Consumer
In reasoning models, an additional factor emerges: Reasoning Tokens. Before generating a visible 200-token response, the model may internally produce up to 10,000 "thinking" working tokens — all of which count against the context window limit and are billed to the user. Practitioners recommend reserving up to 25,000 tokens of headroom for such internal operations.
Token Budget — Controlling Reasoning Depth
Unconstrained internal reasoning can consume a large share of the token budget and multiply the cost of a request many times over. Modern reasoning models therefore expose an explicit control mechanism called the token budget — the ability to set a hard upper limit on reasoning tokens before the visible response is generated. In Anthropic’s API, the thinking parameter with a budget_tokens field specifies the maximum number of tokens Claude 3.7 Sonnet may spend on its internal “chain of thought.” With budget_tokens = 5,000 the model reasons more shallowly but faster and cheaper; with 16,000 it is allowed extended, multi-step reasoning. In OpenAI’s API the analogous control is the reasoning_effort parameter (low / medium / high) for o1 and o3 — it sets reasoning intensity in a way that indirectly maps to token consumption. For o1-pro and o3-pro a direct max_completion_tokens parameter is also available, covering both reasoning tokens and the visible response. The practical implication: production applications should always set an explicit token budget — leaving it unconstrained risks unexpected cost spikes, especially when the model encounters an ambiguous or multi-step query.
The difference between reasoning_effort levels has measurable financial consequences. Example: analysing a 500-word document with o3-mini (OpenAI pricing, May 2025: $1.10 / 1M input tokens, $1.10 / 1M reasoning, $4.40 / 1M output), assuming ≈667 input tokens and 300 output tokens:
| reasoning_effort | Reasoning tokens | Reasoning cost | Total cost | Multiplier vs low |
|---|---|---|---|---|
| low | ≈ 500 | $0.0006 | $0.0021 | 1× (baseline) |
| medium | ≈ 3,000 | $0.0033 | $0.0048 | 2.3× |
| high | ≈ 15,000 | $0.0165 | $0.0180 | 8.6× |
Per request the difference looks marginal. At 10,000 requests per day, the high level generates a reasoning cost of approximately $165 / day — compared to $6 at low. Best practice is to reserve high exclusively for genuinely multi-step tasks (e.g., legal analysis, mathematical proofs), and use low or medium for routine classification and data extraction.
| Provider / Model | Parameter | Range / Values | Notes |
|---|---|---|---|
| Anthropic / Claude 3.7 Sonnet | thinking.budget_tokens | 1,024–100,000 | Min. 1,024; must be less than max_tokens |
| OpenAI / o1, o3 | reasoning_effort | low / medium / high | Indirectly controls reasoning token count |
| OpenAI / o1-pro, o3-pro | max_completion_tokens | integer | Covers reasoning + visible response |
| Recommended headroom (practice) | N/A | ≥25,000 tokens | For complex, multi-step tasks |
Token Economics — How Models Price Usage
Commercial APIs use a pricing per token model, where costs are asymmetric: output tokens are typically 3–5 times more expensive than input tokens. This stems from the autoregressive nature of generation — each successive token must be computed sequentially, limiting parallel processing and consuming more GPU/TPU resources.
Prompt Caching — Reducing Costs for Repeated Queries
When an application repeatedly sends the same large block of text (e.g., a policy document, system prompt), providers offer Prompt Caching: a once-processed prefix is stored in cache and reused with discounts of up to 90%. Other optimization techniques include request batching and Retrieval-Augmented Generation architecture, which allows injecting only relevant document fragments into context rather than entire documents.
The Importance of Tokens for AI Engineering
Understanding tokens is not academic knowledge — it is a practical engineering skill with direct implications for system costs and quality. Four areas where this knowledge has real consequences:
- Prompt design — compact, English-language prompts are cheaper and fit more content into the context window. Every redundant sentence in a system prompt adds cost to every single API call.
- Application architecture — every request requires an explicit token budget: how many tokens are allocated for the system prompt, conversation history, RAG data, and response space. Exceeding the context window limit results in an API error or silent loss of the oldest context.
- Model evaluation — comparing models without standardizing tokenization produces misleading results. The same text yields a different token count in GPT-4 vs Llama 3, directly affecting measured costs and output length.
- Multilingual systems — systems targeting Polish, Arabic, or Chinese users must account for 2–3× higher token consumption than equivalent English content. Ignoring this leads to cost underestimation and incorrect context window sizing.
The Future of Tokenization
BPE-based tokenization dominates today’s models, but it is not the final answer. Researchers are actively exploring three directions that could fundamentally change how text — and not only text — is represented in generative models.
Byte-Level Tokenization
Instead of learning a vocabulary of text fragments, byte-level models operate directly on raw UTF-8 bytes. This approach eliminates the Out-Of-Vocabulary (OOV) problem entirely — any possible string can be expressed as a sequence of 256 possible byte values. Google’s ByT5 model (2021) demonstrated that byte-level tokenization achieves quality comparable to BPE while performing better on low-resource languages and noisy text. The main drawback is sequence length: the same text occupies 3–4 times more positions than with BPE, dramatically increasing the computational cost of the attention mechanism.
Tokenization-Free Models — SSMs and Mamba
The Transformer architecture requires tokenization because it processes data in discrete steps through its attention mechanism. Alternative architectures — State Space Models (SSMs), particularly Mamba (2023) and RWKV — model sequences continuously, without needing to split input into discrete units. In theory, this opens the door to models operating directly on raw signal streams — audio, sensor data, time series — bypassing the tokenization step entirely. In practice, SSM models still lag behind Transformers on language tasks, and hybrid architectures (Mamba + attention) appear to be the most promising direction.
Multimodal Tokenization — Images and Audio as Tokens
Modern multimodal models extend the concept of a token far beyond text. DALL-E 3 and GPT-4o use discrete visual tokens — an image is divided into a grid of patches (e.g., 16×16 pixels each), with each patch encoded by a visual encoder into a single vector, then treated like a text token by the Transformer layers. Gemini 1.5 and Claude 3 apply the same approach to images, while audio models (Whisper, Gemini Audio) tokenize sound signals as sequences of spectrograms. This means the context window becomes a shared space for text, visual, and audio tokens — with a unified budget that all of them consume equally.
The token was, and will remain, the central unit of the language model ecosystem — a measure of work, a billing currency, and the boundary of working memory. Even if future architectures reduce its role at the computational layer, the question “how many tokens does this cost?” will remain one that every AI engineer must be able to answer precisely for years to come. Understanding tokenization is no longer academic knowledge — it is a foundational competency for anyone who designs, deploys, and optimizes systems built on large language models.
Sources
OpenAI Documentation, Anthropic Claude API Reference, Google Gemini Technical Reports, Tiktoken GitHub Repository, aioutlooks.com, moreonlinetools.com, medium.com
