Inference

Speculative Decoding

2022 · Active · Published

Key innovation

Accelerates autoregressive decoding without altering the model's output distribution: a lightweight draft model proposes multiple candidate tokens, and the target model verifies them in a single parallel forward pass, accepting the tokens that pass a rejection-sampling test and discarding the rest.

How it works

At each step, the draft model autoregressively generates a sequence of k candidate tokens. The target model then computes probabilities for all of those positions in a single parallel forward pass. A modified rejection sampling algorithm compares the drafter distribution q with the target distribution p at each position: if the drafted token x has at least as much probability under the target as under the drafter (p(x) ≥ q(x)), it is accepted; otherwise it is accepted with probability p(x)/q(x). At the first rejection, the rest of the draft is discarded and the target samples a replacement token from the corrected residual distribution, which is proportional to max(0, p − q). If all k drafted tokens are accepted, the target samples one additional token from its own distribution at the next position. Each step thus produces between 1 and k + 1 tokens, with zero change in the output distribution.
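The accept/reject rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not any framework's implementation: the function name `speculative_step` and its argument layout are assumptions, and the distributions are passed in as plain arrays rather than produced by real models.

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_tokens, rng):
    """Verify k drafted tokens against the target (illustrative sketch).

    p_target:     array of shape (k + 1, V), target distributions at the k
                  drafted positions plus the position after them.
    q_draft:      array of shape (k, V), drafter distributions.
    draft_tokens: the k token ids the drafter sampled from q_draft.
    Returns the list of emitted token ids (length between 1 and k + 1).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # First rejection: resample from the normalized residual max(0, p - q)
        # and drop the rest of the draft.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # Every draft was accepted: sample one extra token from the target.
    out.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return out

rng = np.random.default_rng(0)
q = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]])
p_same = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
p_diff = np.array([[0.0, 0.0, 1.0], [0.5, 0.5, 0.0], [1.0, 0.0, 0.0]])
all_accepted = speculative_step(p_same, q, [0, 1], rng)    # k + 1 tokens
first_rejected = speculative_step(p_diff, q, [0, 1], rng)  # 1 token
```

When drafter and target agree exactly, every draft is accepted and the step emits k + 1 tokens; when the target assigns zero probability to the first drafted token, the step emits a single token drawn from the residual.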

Problem solved

Standard autoregressive decoding generates tokens sequentially, one at a time, and each step requires reloading the full model parameters from HBM/VRAM. On consumer hardware, where memory bandwidth is low relative to compute, processing units spend most of their time waiting for data. Speculative decoding puts that otherwise idle compute to productive use.
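A back-of-the-envelope calculation shows why bandwidth sets the ceiling. The sketch below assumes batch size 1 and that every decode step streams all weights from memory exactly once, ignoring the KV-cache and activations; the model size and bandwidth figures are hypothetical, not measurements.

```python
def max_tokens_per_second(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each step reads every weight once."""
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes

# Hypothetical example: 7e9 parameters in 16-bit precision (2 bytes each)
# on a device with 1e12 bytes/s of memory bandwidth.
rate = max_tokens_per_second(7e9, 2, 1e12)  # about 71 tokens/s
```

Under these assumptions the bound does not depend on compute throughput at all, which is the slack that verifying several tokens per weight load takes advantage of.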

Implementation

Reference implementations

vLLM (speculative decoding)

Python · vLLM project

Medusa

Python · FasterDecoding

Official

SGLang

Python · SGLang project

llama.cpp speculative

C/C++ · Georgi Gerganov / community

Implementation pitfalls

Low acceptance rate destroys gains · High

If the drafter is poorly matched to the target (different tokenizer, different output distribution, different RLHF tuning), the acceptance rate can fall below 30%, at which point the cost of the drafter forward passes eats up the savings.

Fix: Train the drafter on the same distribution as the target or use a drafter distilled from the target.
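The trade-off can be made concrete with the simplified cost model analyzed in the original speculative decoding paper. The sketch assumes each drafted token is accepted independently with the same probability `alpha`, and that one drafter pass costs a fraction `c` of one target pass; the function names and the example values of `alpha`, `k`, and `c` are illustrative assumptions.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification step with k drafted tokens."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, c):
    """Expected wall-time improvement over plain decoding.

    Each step costs k drafter passes (k * c) plus one target pass (1).
    """
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

poor_match = expected_speedup(alpha=0.3, k=4, c=0.2)   # about 0.79: slower than baseline
good_match = expected_speedup(alpha=0.8, k=4, c=0.05)  # about 2.80
```

With these hypothetical numbers, a 30% acceptance rate paired with a relatively expensive drafter ends up slower than not speculating at all, while a well-matched, cheap drafter approaches a 3x gain.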

KV-cache management · Medium

The drafter and target must coordinate the KV-cache. A naive implementation recomputes context from scratch each step, negating the gain. Sharing the KV-cache requires care when drafter tokens are rejected (cache rollback).

Fix: Use proven frameworks (vLLM, SGLang) or reference implementations instead of writing your own.
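For intuition only, cache rollback amounts to truncating the cache back to the last accepted position. The toy class below is hypothetical (`ToyKVCache` is not an API of vLLM, SGLang, or llama.cpp) and stores one opaque entry per token position instead of real key/value tensors.

```python
class ToyKVCache:
    """Illustrative per-sequence cache: one entry per token position."""

    def __init__(self):
        self.entries = []

    def __len__(self):
        return len(self.entries)

    def append(self, kv):
        self.entries.append(kv)

    def rollback(self, length):
        """Discard every entry beyond the first `length` positions."""
        if not 0 <= length <= len(self.entries):
            raise ValueError("rollback length out of range")
        del self.entries[length:]

cache = ToyKVCache()
for tok in ["the", "cat"]:        # committed prefix
    cache.append(tok)
committed = len(cache)

for tok in ["sat", "on", "a"]:    # speculative entries for k = 3 drafted tokens
    cache.append(tok)

n_accepted = 1                    # suppose verification accepted only "sat"
cache.rollback(committed + n_accepted)
```

After the rollback the cache holds the committed prefix plus the accepted token; the entry for the resampled token is added by the next forward pass. Real implementations do the same thing on paged or batched tensors, which is where most of the complexity lives.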

Diminishing gains on fast hardware · Low

When the inference bottleneck is compute rather than memory bandwidth (e.g., large models on H100/H200, typically at large batch sizes), parallel verification adds significant compute overhead and the speedup can drop below 2x.

Fix: Speculative decoding yields the largest gains on consumer GPUs, mobile devices, and smaller data-center models. On frontier hardware, measure ROI carefully.
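A rough roofline-style estimate shows when verification stops being free. The sketch assumes a dense model, about 2 FLOPs per parameter per token for a forward pass, and a pass time equal to the larger of weight-streaming time and compute time; attention and KV-cache traffic are ignored, and all hardware numbers are hypothetical.

```python
def pass_time(tokens_in_pass, param_count, bytes_per_param, bandwidth, peak_flops):
    """Time for one forward pass: limited by memory traffic or by compute."""
    memory_time = param_count * bytes_per_param / bandwidth
    compute_time = 2 * param_count * tokens_in_pass / peak_flops
    return max(memory_time, compute_time)

def verify_overhead(batch, k, **hw):
    """Cost of verifying k drafts per sequence relative to a plain decode step."""
    return pass_time(batch * (k + 1), **hw) / pass_time(batch, **hw)

# Hypothetical device: 7e9 parameters at 2 bytes each, 1e12 bytes/s, 1e14 FLOP/s.
HW = dict(param_count=7e9, bytes_per_param=2, bandwidth=1e12, peak_flops=1e14)
small_batch = verify_overhead(batch=1, k=4, **HW)    # memory-bound: verification is free
large_batch = verify_overhead(batch=64, k=4, **HW)   # compute-bound: verification costs extra
```

In the memory-bound case the verification pass takes as long as a normal decode step, so every accepted token is pure gain. Once the pass becomes compute-bound, each verification pass is several times more expensive and the acceptance rate has to pay for it.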

Evolution

Original paper · 2022 · ICML 2023 (Oral); Google Research · Yaniv Leviathan

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

2022

Speculative Decoding (Google)

Inflection point

Leviathan, Kalman, and Matias formalize the algorithm and prove distribution preservation. They demonstrate 2–3x speedup on T5-XXL.

Fast Inference from Transformers via Speculative Decoding (paper)

2023

Speculative Sampling (DeepMind)

Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper independently present the same idea as "speculative sampling," achieving 2–2.5x speedup on 70B Chinchilla in a distributed setup.

Accelerating Large Language Model Decoding with Speculative Sampling (paper)

2024

Medusa: drafter-free speculative decoding

Cai et al. propose attaching multiple decoding heads directly to the target model, eliminating the need for a separate drafter. Tree attention enables parallel verification of multiple candidate branches. Reported speedups are 2.2–3.6x.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (paper)

2024

DeepSeek-V3 combines MTP training with speculative decoding

DeepSeek-V3 (671B MoE, 37B activated) adopts Multi-Token Prediction as an auxiliary training objective, improving model quality and simultaneously enabling speculative decoding without a separate drafter.

DeepSeek-V3 Technical Report (paper)

2026

Gemma 4 MTP drafters (Google)

Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0, with MLX, vLLM, SGLang, and Ollama support. 74M-parameter drafters achieve 2.5x–3.1x speedups on local hardware (Pixel, Apple M4, RTX PRO 6000) with no quality degradation.