Training

MTP

2024 · Active · Published

Key innovation

Training a language model to predict n future tokens at once (instead of one) using n independent output heads on a shared backbone — yielding better model quality, higher sample efficiency, and native support for speculative decoding without a separate drafter model.

How it works

The MTP architecture consists of a shared transformer backbone and n independent output heads. At each position t, head k predicts the token at position t+k (k = 1, ..., n) from the current context. The training loss is the sum of the cross-entropy losses of all n heads. Heads share the input embedding layer; depending on the design, each head has its own output projection or, as in Gloeckle et al., its own transformer layer feeding a shared unembedding matrix. At inference, one can use only the first head (preserving compatibility with next-token sampling) or all n heads as a native drafter for speculative decoding: head 1 emits the next token, heads 2..n propose continuations, and the next forward pass verifies all proposals at once. Because drafting and verification share one backbone and one KV-cache, the usual pitfalls of coordinating separate draft and target models do not arise.
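The training objective and the draft-then-verify step can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any of the referenced models: the function names `mtp_loss` and `accept_drafts` are hypothetical, the per-head averaging over positions is an assumption, and the acceptance rule shown is the simple greedy variant (keep drafted tokens only while they match what the next-token head picks on verification).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mtp_loss(head_logits, tokens):
    """Sum over heads of the mean cross-entropy of each head.

    head_logits[k][t] is the logit vector of head k (0-based) at position t.
    Head k is trained to predict tokens[t + k + 1], i.e. k + 1 steps ahead.
    Averaging over positions within a head is an assumption of this sketch.
    """
    total = 0.0
    for k, logits_k in enumerate(head_logits):
        offset = k + 1
        # Only positions that still have a target `offset` steps ahead count.
        valid_positions = range(len(tokens) - offset)
        losses = [
            -log_softmax(logits_k[t])[tokens[t + offset]]
            for t in valid_positions
        ]
        total += sum(losses) / len(losses)
    return total


def accept_drafts(draft_tokens, verified_tokens):
    """Greedy verification: keep the longest prefix of the drafted tokens
    that matches the tokens chosen by the next-token head when the drafts
    are fed back through the model."""
    accepted = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted
```

With uniform logits over a vocabulary of size V, every head contributes log V, so `mtp_loss` returns n · log V; this is a convenient sanity check when wiring up a real implementation.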

Problem solved

The standard next-token prediction loss trains the model on short-sighted, local dependencies. This causes weaker sample efficiency and forces a separate drafter model for speculative decoding (with the burden of coordinating two weight sets, KV-caches, and tokenizers). MTP addresses both at once: better training signal plus a native drafter inside the model.

Implementation

Reference implementations

Meta MTP (arXiv 2404.19737)

Python · Meta AI / FAIR

Official

DeepSeek-V3 (open weights)

Python · DeepSeek AI

Official

Gemma 4 MTP drafters (MLX, vLLM, SGLang, Ollama)

Python · Google

Official

Implementation pitfalls

Independent prediction heads assume locality conditions · Medium

MTP assumes token T+k can be predicted independently of T+1...T+k-1 given context. This approximation degrades for sequences with strong long-range dependencies.
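Written out, the assumption replaces the exact chain-rule factorization of the next n tokens with a product of per-head predictions that each see only the context up to position t. The notation below (p for the true conditional, p_k for head k) is introduced here for illustration and is not taken from the referenced papers.

```latex
\begin{align*}
p(x_{t+1}, \dots, x_{t+n} \mid x_{\le t})
  &= \prod_{k=1}^{n} p\bigl(x_{t+k} \mid x_{\le t},\, x_{t+1}, \dots, x_{t+k-1}\bigr)
  && \text{(exact chain rule)} \\
  &\approx \prod_{k=1}^{n} p_k\bigl(x_{t+k} \mid x_{\le t}\bigr)
  && \text{(independent heads)}
\end{align*}
```

Each head therefore models a marginal over the intermediate tokens; the approximation is loosest where the token at t+k depends strongly on the specific tokens chosen at t+1, ..., t+k-1.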

Additional heads increase model size · Medium

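A prediction head with its own output projection adds about d_model × vocab_size parameters. For a 70B-class model (for example d_model = 8192) with a 128k vocabulary, that is roughly 1B parameters, or about 2 GB in 16-bit precision, per head. At scale this calls for selective application, a shared output projection, or head compression.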
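As a rough sizing check for the d_model × vocab_size figure, the arithmetic can be sketched as follows. The dimensions are assumptions chosen for illustration (a hidden size of 8192, 16-bit weights, and three heads beyond the next-token head); the helper names are hypothetical.

```python
def output_head_params(d_model: int, vocab_size: int) -> int:
    """Parameters in one dense output projection (no bias term)."""
    return d_model * vocab_size


def size_gb(params: int, bytes_per_param: int = 2) -> float:
    """Storage in decimal gigabytes; 2 bytes per parameter assumes 16-bit weights."""
    return params * bytes_per_param / 1e9


D_MODEL = 8192        # assumed hidden size for a 70B-class model
VOCAB_SIZE = 128_000  # the "128k" vocabulary from the text
EXTRA_HEADS = 3       # assumed: n = 4 heads, 3 beyond the next-token head

per_head = output_head_params(D_MODEL, VOCAB_SIZE)
print(f"per head: {per_head / 1e9:.2f}B params, {size_gb(per_head):.2f} GB")
# prints "per head: 1.05B params, 2.10 GB"
print(f"{EXTRA_HEADS} extra heads: {size_gb(per_head * EXTRA_HEADS):.2f} GB")
# prints "3 extra heads: 6.29 GB"
```

Under these assumptions each separate output projection costs on the order of 2 GB, so the overhead scales with the number of heads that do not share an output matrix.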

Evolution

Original paper · 2024 · arXiv preprint; Meta AI / FAIR · Fabian Gloeckle

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve

2024

MTP introduced (Meta AI)

Inflection point

Gloeckle et al. formalize the training objective and show that 13B models trained with 4-token prediction solve 12% more HumanEval problems and 17% more MBPP problems than comparable next-token-only models. Inference is up to 3x faster, even at large batch sizes.

Better & Faster Large Language Models via Multi-token Prediction (paper)

2024

DeepSeek-V3 uses MTP at 671B scale

DeepSeek-V3 (671B MoE, 37B activated) adopts MTP as an auxiliary training objective to strengthen quality. Open-weight model, trained with 2.788M H800 GPU-hours.

DeepSeek-V3 Technical Report (paper)

2026

Gemma 4 MTP drafter models (Google)

On May 6, 2026, Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0: 74M-parameter drafters for multi-billion-parameter targets. Supported by MLX, vLLM, SGLang, and Ollama. Reported speedups are 2.8x and 3.1x on Pixel (E2B/E4B), 2.5x on Apple M4 (31B), and 2x on RTX PRO 6000 (26B), with no quality loss.