At each step, the draft model autoregressively generates a sequence of k candidate tokens. The target model then computes probabilities for all k positions in a single parallel forward pass. A modified rejection sampling algorithm compares the drafter and target distributions at each position: if the drafted token has at least as much probability under the target as under the drafter, it is accepted; otherwise it is accepted with probability equal to the ratio of its target probability to its drafter probability. After the first rejection, the remainder of the draft is discarded, and at the rejected position a replacement token is sampled from the corrected residual distribution (the positive part of the target distribution minus the drafter distribution, renormalized). If all k drafts are accepted, the target's final position supplies one additional token. Each step thus produces between 1 and k + 1 tokens, with no change in the output distribution.
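The verification step described above can be sketched in a few lines. This is a minimal illustration rather than any library's implementation: the function name `verify_draft` and its argument layout are hypothetical, and distributions are plain Python lists of probabilities instead of model outputs.

```python
import random

def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """One verification step of speculative sampling (illustrative sketch).

    draft_tokens: k token ids proposed by the drafter.
    q_dists: k drafter distributions, one per draft position.
    p_dists: k + 1 target distributions; the last one supplies the bonus token.
    Returns the tokens emitted this step (between 1 and k + 1).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 is assumed because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: sample a replacement from the residual max(0, p - q).
        # random.choices normalizes the weights internally.
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out  # everything after the first rejection is discarded
    # All k drafts accepted: take one extra token from the target.
    last = p_dists[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

When the drafter and target distributions are identical, every draft is accepted and the step returns k + 1 tokens. When the target assigns zero probability to a drafted token, that token is always rejected and replaced.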
Standard autoregressive decoding generates tokens sequentially, one at a time, and each step requires reloading the full model parameters from HBM/VRAM. On consumer hardware, where memory bandwidth is low relative to compute, processing units spend most of their time waiting for data. Speculative decoding puts that idle time to productive use.
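A rough back-of-the-envelope estimate shows why decoding is bandwidth-bound. The sketch below assumes batch size 1, a dense model whose weights are all read once per token, and ignores KV-cache traffic. The function name and the example figures are hypothetical and chosen only for illustration.

```python
def max_tokens_per_second(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each token requires reading all weights once."""
    bytes_per_token = num_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

# Hypothetical example: a 7B-parameter model in 16-bit precision (2 bytes per
# parameter) on a device with 700 GB/s of memory bandwidth.
bound = max_tokens_per_second(7e9, 2, 700e9)
print(f"bandwidth-bound ceiling: {bound:.0f} tokens/s")
```

Under these assumptions the ceiling depends only on bandwidth and model size, not on arithmetic throughput, which is the slack that parallel verification uses.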
If the drafter is poorly matched to the target (different tokenizer, different output distribution, different post-training such as RLHF), the acceptance rate can fall below 30%, at which point the cost of the drafter forward passes eats up the savings.
The drafter and target must coordinate their KV-cache state. A naive implementation recomputes the context from scratch at each step, negating the gain. Reusing the KV-cache across steps requires care when drafter tokens are rejected: the entries written for rejected positions must be rolled back.
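Cache rollback can be illustrated with a toy structure. This is a sketch under simplifying assumptions: the `KVCache` class and `commit_step` helper are hypothetical, and each entry stands in for the per-layer key and value tensors that a real inference engine would store.

```python
class KVCache:
    """Toy cache holding one (key, value) entry per processed token."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)

    def rollback(self, length):
        """Drop every entry past `length`, i.e. those of rejected draft tokens."""
        if not 0 <= length <= len(self.entries):
            raise ValueError("rollback length out of range")
        del self.entries[length:]


def commit_step(cache, prefix_len, num_accepted):
    """Keep the verified prefix plus the accepted drafts; discard the rest."""
    cache.rollback(prefix_len + num_accepted)
    return len(cache)


# Usage: 5 context tokens, 4 drafted tokens, only the first 2 accepted.
cache = KVCache()
for t in range(5 + 4):
    cache.append(f"k{t}", f"v{t}")
new_len = commit_step(cache, prefix_len=5, num_accepted=2)
```

Truncating instead of rebuilding keeps the accepted positions' entries valid, so the next step only has to process the newly sampled token and the next batch of drafts.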
When the inference bottleneck is compute rather than memory bandwidth (e.g., large models on H100/H200), parallel verification adds significant compute overhead and the speedup drops below 2x.
Leviathan, Kalman, and Matias formalize the algorithm and prove distribution preservation. They demonstrate 2–3x speedup on T5-XXL.
Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper independently present the same idea as "speculative sampling," achieving 2–2.5x speedup on 70B Chinchilla in a distributed setup.
Cai et al. propose attaching multiple decoding heads directly to the target model, eliminating the need for a separate drafter. Tree attention enables parallel verification of multiple candidate branches. They report a 2.2–3.6x speedup.
DeepSeek-V3 (671B MoE, 37B activated) adopts Multi-Token Prediction as an auxiliary training objective, improving model quality and simultaneously enabling speculative decoding without a separate drafter.
Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0, with MLX, vLLM, SGLang, and Ollama support. 74M-parameter drafters achieve 2.5x–3.1x speedups on local hardware (Pixel, Apple M4, RTX PRO 6000) with no quality degradation.
Number of tokens generated by the drafter in one step before verification by the target. A higher k allows a potentially larger speedup, but a smaller fraction of the drafted tokens survives verification, and more compute is wasted when an early token is rejected.
Size of the drafter model relative to the target. The smaller the drafter, the cheaper candidate generation becomes, but the lower its agreement with the target and hence the acceptance rate. Typically 1–10% of target size.
Measured average fraction of drafter tokens accepted by the target. It directly determines the achieved speedup: a high acceptance rate (>70%) approaches the theoretical maximum, while a low one (<30%) yields only marginal gains.
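The interaction between these three parameters can be estimated with the closed-form expressions from Leviathan et al., which hold under the simplifying assumption that each drafted token is accepted independently with the same probability. In the sketch below, `alpha` is the acceptance rate, `k` is the draft length, and `c` is the cost of one drafter pass relative to one target pass. The function names are illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per step: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1.0  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, c):
    """Expected wall-time improvement over plain autoregressive decoding.

    Each step costs k drafter passes (k * c) plus one target pass (1).
    """
    return expected_tokens_per_step(alpha, k) / (k * c + 1.0)


# Compare a well-matched and a poorly matched drafter at k = 4, c = 0.05.
for alpha in (0.8, 0.3):
    print(alpha, round(expected_speedup(alpha, 4, 0.05), 2))
```

With these illustrative inputs, an acceptance rate of 0.8 gives an estimated speedup of about 2.8x, while 0.3 gives roughly 1.2x. Real acceptance rates vary by position and prompt, so these figures are estimates rather than predictions.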
Consumer and prosumer GPUs with limited VRAM bandwidth gain the most: their otherwise underutilized tensor cores perform parallel verification almost for free.
On CPU, memory bandwidth is even more constraining, so speculative decoding can yield a significant speedup in llama.cpp and similar implementations.
The original paper by Leviathan et al. demonstrates the speedup on TPU (T5-XXL).