Speculative Decoding
Accelerating autoregressive decoding without altering the model's output distribution: a lightweight draft model proposes multiple candidate tokens, and the target model verifies them in a single parallel forward pass, keeping the tokens that pass a probabilistic acceptance check and discarding the rest.
At each step, the draft model autoregressively generates a sequence of k candidate tokens. The target model then computes probabilities for all of those positions in a single parallel forward pass. A modified rejection sampling rule compares the drafter and target distributions at each position: if the drafted token has at least as much probability mass under the target as under the drafter, it is accepted; otherwise it is accepted with probability equal to the ratio of target to drafter probability. At the first rejection, the remainder of the draft is discarded and the target samples a replacement token from the corrected residual distribution (the normalized positive part of target minus drafter). Each step therefore emits between 1 and k+1 tokens, namely 0 to k accepted drafter tokens plus one token from the target, with zero change in the output distribution.
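A minimal NumPy sketch of this acceptance rule; the function name, array shapes, and sampling helper are illustrative assumptions rather than code from the papers:

```python
import numpy as np

def verify_draft(draft_tokens, q_draft, p_target, rng=None):
    """Accept or reject k drafted tokens against the target's probabilities.

    draft_tokens: list of k token ids proposed by the drafter
    q_draft:      (k, vocab) drafter probabilities at the drafted positions
    p_target:     (k + 1, vocab) target probabilities for the same positions,
                  plus one extra position used when every draft is accepted
    Returns the tokens emitted this step (between 1 and k + 1 of them).
    """
    rng = rng or np.random.default_rng()
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i, tok], q_draft[i, tok]
        # Accept with probability min(1, p/q): always when the target puts at
        # least as much mass on the token as the drafter did.
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)
        else:
            # First rejection: discard the rest of the draft and sample from the
            # corrected residual distribution max(p_target - q_draft, 0), renormalized.
            residual = np.maximum(p_target[i] - q_draft[i], 0.0)
            residual /= residual.sum()
            emitted.append(int(rng.choice(len(residual), p=residual)))
            return emitted
    # Every draft token accepted: the extra target position yields one bonus token.
    emitted.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return emitted
```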
Standard autoregressive decoding generates tokens one at a time, and each step must stream the full model weights from HBM/VRAM. On consumer hardware, where memory bandwidth is low relative to compute, the processing units spend most of each step waiting for data. Speculative decoding turns that idle compute into useful work by verifying several tokens per weight read.
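As a rough illustration of the bandwidth bound (the GPU figure below is an assumed round number, not a measurement):

```python
# Rough ceiling on plain autoregressive decoding speed for a 7B model in FP16:
# every generated token streams all weights from VRAM once.
params = 7e9
bytes_per_param = 2                       # FP16
weight_bytes = params * bytes_per_param   # ~14 GB

bandwidth_bytes_per_s = 1.0e12            # assumed ~1 TB/s consumer-GPU figure

time_per_token = weight_bytes / bandwidth_bytes_per_s       # ~14 ms
print(f"upper bound: ~{1 / time_per_token:.0f} tokens/s")   # ~71 tokens/s
# Verifying k drafted tokens in one pass reads the weights once instead of k
# times, so the otherwise idle compute checks several tokens per weight read.
```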
Draft length (k)
- 4: typical value in the Leviathan and Chen papers.
- 5–8: common in production implementations (vLLM, SGLang).
Number of tokens generated by the drafter in one step before verification by the target. Higher k means potentially more tokens per verification step, but later draft positions are less likely to all be accepted, so more drafter compute is wasted.
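For orientation, this is roughly how k is exposed in an engine such as vLLM; the keyword arguments have shifted between vLLM releases, so treat the exact names and model choices below as assumptions to verify against your installed version:

```python
from vllm import LLM, SamplingParams

# Assumed keyword names (they differ across vLLM versions); the idea is simply
# to pair a small same-tokenizer drafter with the target and cap the draft length k.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",              # target
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",   # drafter
    num_speculative_tokens=5,                               # k
)
outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```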
Drafter size
- 74M drafter for a multi-billion-parameter target: Gemma 4 MTP 2026 configuration.
- 4B Chinchilla drafter for a 70B target: DeepMind 2023 configuration.
Size of the drafter model relative to the target. A smaller drafter makes candidate generation cheaper, but it agrees with the target less often, lowering the acceptance rate. Typically 1–10% of the target's size.
Acceptance rate
Measured average fraction of drafter tokens accepted by the target. It directly determines the achieved speedup: a high acceptance rate (>70%) approaches the theoretical maximum, while a low one (<30%) yields only marginal gains.
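Under the i.i.d. per-token acceptance assumption used in the Leviathan et al. analysis, the expected tokens per verification step and the resulting speedup can be estimated as below; the alpha and cost-ratio values are illustrative:

```python
def expected_tokens_per_step(alpha, k):
    # Expected tokens emitted per target forward pass when each drafted token
    # is accepted independently with probability alpha (Leviathan et al., 2023).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, c):
    # c = drafter cost per token relative to one target forward pass
    # (e.g. 0.05 for a drafter ~5% of the target's cost). One verification
    # cycle costs k drafter passes plus 1 target pass.
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

print(f"{estimated_speedup(alpha=0.8, k=5, c=0.05):.2f}x")  # ~2.95x
print(f"{estimated_speedup(alpha=0.3, k=5, c=0.05):.2f}x")  # ~1.14x, marginal
```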
Common pitfalls
Low acceptance rate destroys gains (HIGH)
If the drafter is poorly matched to the target (different tokenizer, different distribution, different RLHF), acceptance rate falls below 30% and the cost of the drafter forward pass eats up the savings.
Train the drafter on the same distribution as the target or use a drafter distilled from the target.
KV-cache management (MEDIUM)
The drafter and target must coordinate the KV-cache. A naive implementation recomputes the context from scratch each step, negating the gain. Sharing the KV-cache requires care when drafter tokens are rejected, since the cache must be rolled back (see the sketch after this pitfall).
Use proven frameworks (vLLM, SGLang) or reference implementations instead of writing your own.
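A toy, framework-agnostic sketch of the rollback bookkeeping; the class and method names are hypothetical and not tied to any particular engine's cache layout:

```python
import torch

class ToyKVCache:
    # Hypothetical layout: one K and one V tensor per layer,
    # shaped (seq_len, num_heads, head_dim), grown as decoding proceeds.
    def __init__(self, num_layers, num_heads, head_dim):
        self.k = [torch.empty(0, num_heads, head_dim) for _ in range(num_layers)]
        self.v = [torch.empty(0, num_heads, head_dim) for _ in range(num_layers)]

    def seq_len(self):
        return self.k[0].shape[0]

    def append(self, layer, k_new, v_new):
        # Filled in while the target verifies the drafted positions in parallel.
        self.k[layer] = torch.cat([self.k[layer], k_new], dim=0)
        self.v[layer] = torch.cat([self.v[layer], v_new], dim=0)

    def rollback(self, keep_len):
        # After verification, drop the entries written for rejected draft
        # positions so the next step continues from the last accepted token.
        self.k = [k[:keep_len] for k in self.k]
        self.v = [v[:keep_len] for v in self.v]

# Usage: if 2 of k=5 drafted tokens were accepted this step,
#   cache.rollback(prefix_len + num_accepted)
# instead of recomputing the whole context from scratch.
```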
Diminishing gains on fast hardware (LOW)
When the inference bottleneck is compute rather than memory bandwidth (e.g., large models on H100/H200), parallel verification adds significant compute overhead and the speedup drops below 2x.
Speculative decoding yields the largest gains on consumer GPUs, mobile devices, and smaller data-center models. On frontier hardware, measure ROI carefully.
GENESIS
Speculative Decoding (Google)
Source paper (breakthrough): in "Fast Inference from Transformers via Speculative Decoding", Leviathan, Kalman, and Matias formalize the algorithm and prove distribution preservation. They demonstrate 2–3x speedup on T5-XXL.
Speculative Sampling (DeepMind)
Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper independently present the same idea as "speculative sampling," achieving 2–2.5x speedup on 70B Chinchilla in a distributed setup.
Medusa: drafter-free speculative decoding
Cai et al. propose attaching multiple decoding heads directly to the target model, eliminating the need for a separate drafter. Tree attention enables parallel verification of multiple branches. 2.2–3.6x speedup.
DeepSeek-V3 combines MTP training with speculative decoding
DeepSeek-V3 (671B MoE, 37B activated) adopts Multi-Token Prediction as an auxiliary training objective, improving model quality and simultaneously enabling speculative decoding without a separate drafter.
Gemma 4 MTP drafters (Google)
Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0, with MLX, vLLM, SGLang, and Ollama support. 74M-parameter drafters achieve 2.5x–3.1x speedups on local hardware (Pixel, Apple M4, RTX PRO 6000) with no quality degradation.
Consumer and prosumer GPUs with limited VRAM bandwidth gain the most — underutilized tensor cores perform parallel verification almost for free.
On CPU, memory bandwidth is even more constraining — speculative decoding yields significant speedup in llama.cpp and similar implementations.
The source paper by Leviathan et al. demonstrates speedup on TPU (T5-XXL).
BUILT ON
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so the model can learn long-range dependencies with a constant number of sequential operations (versus linear in RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), remains an active research direction (FlashAttention, sliding window, linear attention, SSM).
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language and reasoning tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
| Title | Publisher | Type |
|---|---|---|
| Fast Inference from Transformers via Speculative Decoding (source paper by Leviathan et al., Google Research, ICML 2023 Oral) | arXiv | scientific article |
| Accelerating Large Language Model Decoding with Speculative Sampling (concurrent DeepMind publication, Chen et al.) | arXiv | scientific article |
| Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | arXiv | scientific article |