1. Model architecture defines the maximum position length: the positional encoding (sinusoidal in "Attention Is All You Need", learned in GPT-2, RoPE in Llama/Mistral/Qwen, ALiBi in MPT) is designed or trained for a specific maximum sequence length n, measured in token positions.
2. KV-cache: during autoregressive inference every processed position, prompt and generated alike, writes its key and value (K, V) into accelerator memory. The cache grows linearly with sequence length and can dominate VRAM consumption at long windows.
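3. Attention cost: standard self-attention computes an n×n dot-product matrix, which costs O(n²) FLOPs and memory. For n=1M that matrix has a trillion entries per head and layer, which is infeasible without optimizations. The mitigations work differently: FlashAttention avoids materializing the matrix (memory drops to O(n), FLOPs stay O(n²)); sliding-window attention restricts each token to a fixed-size neighborhood; GQA/MQA shrink the KV-cache rather than the matrix; state-space models such as Mamba replace attention altogether.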
4. Post-hoc window extension: techniques such as RoPE scaling (linear position interpolation, dynamic NTK, YaRN) or long-context fine-tuning extend a model's window after pretraining, without retraining from scratch, with varying effect on quality.
5. In the client / API: the system prompt, conversation history, tool definitions, retrieved (RAG) documents, and the user prompt are all tokenized and together must fit the model's window. Above the limit, strategies differ: error (OpenAI), automatic truncation (some interfaces), sliding window over history (chatbots), or RAG (retrieval instead of stuffing).
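The KV-cache and attention-matrix points above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: the layer count, KV-head count, and head dimension are hypothetical values chosen for illustration, and 2 bytes per element assumes fp16/bf16 storage.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one K tensor and one V tensor in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def attention_matrix_entries(n_tokens):
    """Entries in one full n x n score matrix (a single head in a single layer)."""
    return n_tokens * n_tokens


# Hypothetical configuration for illustration, not a specific model.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (8_192, 131_072, 1_000_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    entries = attention_matrix_entries(n)
    print(f"n={n:>9,}  KV-cache ~ {gib:7.2f} GiB  score matrix = {entries:.1e} entries")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly while the score matrix grows quadratically. Raising `n_kv_heads` from 8 to 32 (no key/value sharing) would quadruple the cache, which is the saving that GQA targets.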
An autoregressive model needs a clearly defined input domain: how many token positions the attention layers, positional encoding, and KV-cache support. The context window provides that contract: a deterministic limit that lets memory be allocated up front and operation cost be predicted. In practice, the window decides whether a single call can hold an entire book, a code monorepo, a long conversation with history, a legal document, or multimodal content (1 hour of audio ≈ several hundred thousand tokens). This directly determines the range of tasks executable in a single call, without RAG or agentic decomposition.
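Because the window is a hard limit, a client can check the fit before sending a request. The sketch below shows one possible sliding-window policy over conversation history. The names `fit_history`, `reserve_output`, and `word_count` are hypothetical, the word-splitting counter is only a stand-in for the model's real tokenizer, and reserving part of the window for the generated answer is an assumption about how the limit is shared.

```python
from typing import Callable, List


def fit_history(
    system: str,
    history: List[str],
    user_prompt: str,
    window: int,
    reserve_output: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Return the most recent history turns that still fit in the window.

    Raises ValueError when even an empty history does not fit.
    """
    budget = window - reserve_output
    used = count_tokens(system) + count_tokens(user_prompt)
    if used > budget:
        raise ValueError("system + user prompt exceed the window")

    kept: List[str] = []
    # Walk from newest to oldest so the most recent turns survive.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def word_count(text: str) -> int:
    """Crude stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


turns = ["a b c", "d e", "f g h i"]
print(fit_history("sys", turns, "hello there", window=12,
                  reserve_output=3, count_tokens=word_count))
```

With a real tokenizer in place of `word_count`, the same loop drops the oldest turns first. Whether to fail instead of truncating is a product decision, not something this sketch settles.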
Models advertising 1M+ tokens often degrade in quality as early as 32k–128k (RULER). Verify the effective window empirically for your domain; do not trust marketing figures.
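One simple way to probe the effective window is a "needle in a haystack" test: hide a fact at varying depths in filler text of varying length and check whether the model retrieves it. The harness below is a minimal sketch. `ask_model` is a hypothetical callable that you would wire to your own model, and `truncating_model` is a toy stand-in that only reads the last 2,000 characters, included so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Tuple


def build_prompt(filler: str, needle: str, n_filler: int, depth: float) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences) + "\nWhat is the secret code?"


def probe(
    ask_model: Callable[[str], str],
    lengths: List[int],
    depths: List[float],
) -> Dict[Tuple[int, float], bool]:
    """Map (n_filler, depth) to whether the answer was recovered."""
    needle, answer = "The secret code is 7341.", "7341"
    results = {}
    for n in lengths:
        for d in depths:
            prompt = build_prompt("The sky was grey that day.", needle, n, d)
            results[(n, d)] = answer in ask_model(prompt)
    return results


def truncating_model(prompt: str) -> str:
    """Toy stand-in (assumption): a 'model' that sees only the last 2,000 characters."""
    return "7341" if "7341" in prompt[-2000:] else "unknown"


for key, ok in probe(truncating_model, [10, 1000], [0.0, 0.5, 1.0]).items():
    print(key, "found" if ok else "missed")
```

A single-fact lookup is the easiest case. A model that passes it may still degrade on multi-hop or aggregation tasks, so domain-specific questions are a better final check.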
Information placed in the middle of a long context is used less reliably than information at the start or end (Liu et al. 2023). This affects RAG strategy (chunk ordering) and prompt layout.
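One chunk-ordering heuristic that follows from this observation is to put the highest-ranked chunks at the two edges of the context and let the weakest ones fall in the middle. The function below is an illustrative sketch, assuming the retriever returns chunks sorted best-first. The name `edges_first` is hypothetical, and the heuristic is one option rather than a method prescribed by the cited paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def edges_first(ranked: Sequence[T]) -> List[T]:
    """Reorder best-first items so the strongest sit at the start and the end."""
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        # Even ranks (0, 2, 4, ...) fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


print(edges_first(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Here "A" is the best chunk and "E" the weakest: "A" opens the context, "B" closes it, and "E" lands in the middle.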
Token-billed cost and attention compute both grow with context length: under per-token pricing, filling a 1M window costs about 100× as much as a 10k prompt, and the quadratic attention term grows by about 10,000×. Prefill latency grows linearly or worse.
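A back-of-envelope ratio makes the scaling concrete. This sketch assumes an idealized split: token-billed price and KV-cache size scale linearly with n, while the attention score computation scales with n². Real systems blend both terms, so the two numbers are bounds rather than a forecast. The function name `relative_cost` is hypothetical.

```python
def relative_cost(n_long: int, n_short: int) -> dict:
    """How much more a long prompt costs than a short one, per scaling regime."""
    ratio = n_long / n_short
    return {
        "linear_terms": ratio,               # token-billed price, KV-cache size
        "quadratic_attention": ratio ** 2,   # n x n score computation
    }


print(relative_cost(1_000_000, 10_000))
# {'linear_terms': 100.0, 'quadratic_attention': 10000.0}
```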
Polish, Japanese, or Arabic text often uses 2–3× more tokens than equivalent English text, so the effective window for these languages is proportionally smaller. "128k tokens" ≠ "128k characters" or "128k words".
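The shrinkage is easy to estimate once a token multiplier is known. The sketch below applies the 2–3× range quoted above to a 128k window. The multiplier is an input you would measure with the model's actual tokenizer on your own text, and `english_equivalent_window` is a hypothetical helper name.

```python
def english_equivalent_window(window_tokens: int, token_multiplier: float) -> int:
    """Window size expressed as the English-text token count it can actually hold."""
    if token_multiplier <= 0:
        raise ValueError("token_multiplier must be positive")
    return int(window_tokens / token_multiplier)


for multiplier in (1.0, 2.0, 3.0):
    print(f"{multiplier:.0f}x tokens -> {english_equivalent_window(128_000, multiplier):,} "
          "English-equivalent tokens")
```

At a 3× multiplier, a nominal 128k window holds roughly what a 42k window would hold for English text.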
Stuffing the whole corpus into a long window is tempting but often worse, in both quality and cost, than a well-built RAG pipeline. A long window should be a tool, not an excuse to skip retrieval.