The MTP architecture consists of a shared transformer backbone and n independent output heads. Given the current context at position t, head k predicts the token at position t+k, for k = 1, ..., n. The training loss is the sum of the cross-entropy losses of all n heads. Heads typically share the input embedding layer but have separate output projections. At inference, one can use only the first head, which preserves compatibility with ordinary next-token sampling, or use all n heads as a native drafter for speculative decoding: head 1 emits the next token, heads 2..n propose continuations, and the model verifies those proposals in a single forward pass. Because drafting and verification share one backbone and one KV cache, the usual pitfalls of coordinating a separate draft model and target model do not arise.
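The verify step of this draft-then-verify loop can be sketched for the greedy case as follows. This is a minimal illustration, not the method of any particular implementation: `accept_drafts` is a hypothetical helper, and it assumes greedy decoding, where a drafted token is accepted only if it equals the token the first head would have produced at that position. Sampling-based decoding would need a probabilistic acceptance rule instead.

```python
from typing import List


def accept_drafts(drafts: List[int], verified: List[int]) -> List[int]:
    """Greedy verification of drafted tokens (hypothetical helper).

    drafts[i]   : i-th token proposed by the MTP heads.
    verified[i] : token the first (next-token) head predicts when the
                  context is extended by drafts[:i]. It has one more
                  entry than drafts: the prediction after all drafts.
    Returns the tokens that can be appended to the output this step.
    """
    if len(verified) != len(drafts) + 1:
        raise ValueError("verified must have exactly len(drafts) + 1 entries")

    accepted: List[int] = []
    for draft, target in zip(drafts, verified):
        if draft != target:
            # First mismatch: keep the first head's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(draft)

    # Every draft matched, so the extra prediction comes for free.
    accepted.append(verified[-1])
    return accepted


if __name__ == "__main__":
    print(accept_drafts([5, 7, 9], [5, 7, 2, 4]))  # third draft rejected
    print(accept_drafts([5, 7, 9], [5, 7, 9, 4]))  # all drafts accepted
```

Every call returns at least one token, so under these assumptions the loop is never slower in tokens per verification pass than plain next-token decoding.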
The standard next-token prediction loss trains the model on short-sighted, local dependencies. This lowers sample efficiency, and it leaves the model without a built-in drafter, so speculative decoding requires a separate draft model, with the burden of coordinating two sets of weights, two KV caches, and two tokenizers. MTP addresses both problems at once: it provides a richer training signal and a native drafter inside the model.
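MTP assumes that token t+k can be predicted from the context alone, independently of the intermediate tokens t+1, ..., t+k-1. This approximation degrades when the upcoming tokens depend strongly on one another, because each head commits to its prediction without seeing what the earlier heads chose.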
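The independence assumption can be written out explicitly. The following is a sketch in notation that the text does not define: t is the current position, x_{\le t} is the context, and P_k is the distribution produced by head k.

```latex
% Exact autoregressive factorization of the next n tokens:
P(x_{t+1}, \dots, x_{t+n} \mid x_{\le t})
  = \prod_{k=1}^{n} P\left(x_{t+k} \mid x_{\le t+k-1}\right)

% Approximation made when the n heads are used jointly:
% every factor conditions only on the shared context x_{\le t}.
P(x_{t+1}, \dots, x_{t+n} \mid x_{\le t})
  \approx \prod_{k=1}^{n} P_k\left(x_{t+k} \mid x_{\le t}\right)
```

The two expressions coincide for k = 1 and diverge more as the dropped conditioning on x_{t+1}, ..., x_{t+k-1} carries more information.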
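Each prediction head with its own output projection adds about d_model × vocab_size parameters. For a 70B-class model with a 128k vocabulary (assuming d_model = 8192), that is roughly 1B parameters per head, or about 2 GB in 16-bit precision, and the cost grows linearly with the number of heads. Keeping this overhead manageable requires selective application or head compression.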
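A back-of-the-envelope check of the per-head cost is easy to script. The sketch below assumes values the text does not state: d_model = 8192 (typical of a 70B-class model), a vocabulary of exactly 128,000 entries, and 2 bytes per parameter (16-bit weights). `head_cost` is a hypothetical helper name.

```python
from typing import Tuple


def head_cost(d_model: int, vocab_size: int, bytes_per_param: int = 2) -> Tuple[int, int]:
    """Return (parameter count, size in bytes) of one d_model x vocab_size projection."""
    params = d_model * vocab_size
    return params, params * bytes_per_param


if __name__ == "__main__":
    # Assumed values, for illustration only.
    params, size_bytes = head_cost(d_model=8192, vocab_size=128_000)
    print(f"{params / 1e9:.2f}B parameters, {size_bytes / 1e9:.2f} GB per head")
    # prints: 1.05B parameters, 2.10 GB per head
```

Multiplying by the number of extra heads gives the total overhead, which is why sharing or compressing the output projection is attractive as n grows.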
Gloeckle et al. formalize the training objective and show that 13B models trained with 4-token prediction solve 12% more HumanEval problems and 17% more MBPP problems than comparable next-token-only models. Inference is up to 3x faster, even at large batch sizes.
DeepSeek-V3 (671B-parameter MoE, 37B activated) adopts MTP as an auxiliary training objective to improve model quality. It is an open-weight model, trained with 2.788M H800 GPU-hours.
On May 6, 2026, Google released experimental MTP drafter models for the Gemma 4 family under Apache 2.0: 74M-parameter drafters for multi-billion-parameter targets. They are supported by MLX, vLLM, SGLang, and Ollama. Reported speedups are 2.8x and 3.1x on Pixel (E2B/E4B), 2.5x on Apple M4 (31B), and 2x on RTX PRO 6000 (26B), with no quality loss.
Number of future tokens the model learns to predict in parallel. Increasing n past some point yields diminishing quality gains and raises training cost.
Weight of the MTP loss (the additional heads) relative to the next-token loss. Set too high, it degrades main-head quality; set too low, it weakens the extra training signal that MTP is meant to provide.
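How the two hyperparameters enter the objective can be sketched in plain Python. This is an illustrative formulation, not the exact loss of any cited model: `mtp_loss` and `mtp_weight` are hypothetical names, head k (1-based) at position t is assumed to predict the token at t+k, and the extra heads are summed before weighting. With `mtp_weight=1.0` the result reduces to a plain sum over all n heads.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mtp_loss(
    head_logits: List[List[Sequence[float]]],
    tokens: Sequence[int],
    mtp_weight: float = 1.0,
) -> float:
    """Next-token loss plus weighted loss of the additional heads.

    head_logits[k][t] holds the logits of head k+1 at position t,
    which is scored against tokens[t + k + 1].
    """
    per_head: List[float] = []
    for k, logits_k in enumerate(head_logits):
        offset = k + 1
        n_valid = len(tokens) - offset  # positions that still have a target
        if n_valid <= 0:
            raise ValueError("sequence too short for this many heads")
        losses = [cross_entropy(logits_k[t], tokens[t + offset]) for t in range(n_valid)]
        per_head.append(sum(losses) / n_valid)
    # Head 1 is the ordinary next-token loss; the others are auxiliary.
    return per_head[0] + mtp_weight * sum(per_head[1:])
```

Raising the number of heads adds one more term per head to the auxiliary sum, while `mtp_weight` scales how strongly those terms compete with the next-token term.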
Gemma 4 MTP drafters run on consumer GPUs (RTX PRO 6000) with 2x speedup, and on Pixel mobile GPUs with 2.8x–3.1x.
Apple Silicon (M4) with unified memory achieves 2.5x speedup on Gemma 4 31B via MLX.