Speculative Decoding
Accelerating autoregressive decoding without altering the model's output distribution: a lightweight draft model proposes multiple candidate tokens, and the target model verifies them in a single parallel forward pass, keeping the tokens that pass a probabilistic acceptance check and discarding the rest.
At each step, the draft model autoregressively generates a sequence of k candidate tokens. The target model then computes probabilities for all of those positions in a single parallel forward pass. A modified rejection sampling rule compares the drafter and target distributions at each position: if the drafted token has at least as much probability mass under the target as under the drafter, it is accepted; otherwise it is accepted with probability equal to the ratio of target to drafter probability. At the first rejection, the remainder of the draft is discarded and the target samples a replacement token from the corrected residual distribution (the normalized positive part of target minus drafter). Each step therefore emits between 1 and k+1 tokens, namely 0 to k accepted drafter tokens plus one token from the target, with zero change in the output distribution.
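A minimal NumPy sketch of this acceptance rule; the function name, array shapes, and sampling helper are illustrative assumptions rather than code from the papers:

```python
import numpy as np

def verify_draft(draft_tokens, q_draft, p_target, rng=None):
    """Accept or reject k drafted tokens against the target's probabilities.

    draft_tokens: list of k token ids proposed by the drafter
    q_draft:      (k, vocab) drafter probabilities at the drafted positions
    p_target:     (k + 1, vocab) target probabilities for the same positions,
                  plus one extra position used when every draft is accepted
    Returns the tokens emitted this step (between 1 and k + 1 of them).
    """
    rng = rng or np.random.default_rng()
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i, tok], q_draft[i, tok]
        # Accept with probability min(1, p/q): always when the target puts at
        # least as much mass on the token as the drafter did.
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)
        else:
            # First rejection: discard the rest of the draft and sample from the
            # corrected residual distribution max(p_target - q_draft, 0), renormalized.
            residual = np.maximum(p_target[i] - q_draft[i], 0.0)
            residual /= residual.sum()
            emitted.append(int(rng.choice(len(residual), p=residual)))
            return emitted
    # Every draft token accepted: the extra target position yields one bonus token.
    emitted.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return emitted
```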
Standard autoregressive decoding generates tokens one at a time, and each step must stream the full model weights from HBM/VRAM. On consumer hardware, where memory bandwidth is low relative to compute, the processing units spend most of each step waiting for data. Speculative decoding turns that idle compute into useful work by verifying several tokens per weight read.
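As a rough illustration of the bandwidth bound (the GPU figure below is an assumed round number, not a measurement):

```python
# Rough ceiling on plain autoregressive decoding speed for a 7B model in FP16:
# every generated token streams all weights from VRAM once.
params = 7e9
bytes_per_param = 2                       # FP16
weight_bytes = params * bytes_per_param   # ~14 GB

bandwidth_bytes_per_s = 1.0e12            # assumed ~1 TB/s consumer-GPU figure

time_per_token = weight_bytes / bandwidth_bytes_per_s       # ~14 ms
print(f"upper bound: ~{1 / time_per_token:.0f} tokens/s")   # ~71 tokens/s
# Verifying k drafted tokens in one pass reads the weights once instead of k
# times, so the otherwise idle compute checks several tokens per weight read.
```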
Draft length (k)
- 4: typical value in the Leviathan and Chen papers.
- 5–8: common in production implementations (vLLM, SGLang).
Number of tokens generated by the drafter in one step before verification by the target. Higher k means potentially more tokens per verification step, but later draft positions are less likely to all be accepted, so more drafter compute is wasted.
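For orientation, this is roughly how k is exposed in an engine such as vLLM; the keyword arguments have shifted between vLLM releases, so treat the exact names and model choices below as assumptions to verify against your installed version:

```python
from vllm import LLM, SamplingParams

# Assumed keyword names (they differ across vLLM versions); the idea is simply
# to pair a small same-tokenizer drafter with the target and cap the draft length k.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",              # target
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",   # drafter
    num_speculative_tokens=5,                               # k
)
outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```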
Drafter size
- 74M drafter for a multi-billion-parameter target: Gemma 4 MTP 2026 configuration.
- 4B Chinchilla drafter for a 70B target: DeepMind 2023 configuration.
Size of the drafter model relative to the target. A smaller drafter makes candidate generation cheaper, but it agrees with the target less often, lowering the acceptance rate. Typically 1–10% of the target's size.
Acceptance rate
Measured average fraction of drafter tokens accepted by the target. It directly determines the achieved speedup: a high acceptance rate (>70%) approaches the theoretical maximum, while a low one (<30%) yields only marginal gains.
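Under the i.i.d. per-token acceptance assumption used in the Leviathan et al. analysis, the expected tokens per verification step and the resulting speedup can be estimated as below; the alpha and cost-ratio values are illustrative:

```python
def expected_tokens_per_step(alpha, k):
    # Expected tokens emitted per target forward pass when each drafted token
    # is accepted independently with probability alpha (Leviathan et al., 2023).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, c):
    # c = drafter cost per token relative to one target forward pass
    # (e.g. 0.05 for a drafter ~5% of the target's cost). One verification
    # cycle costs k drafter passes plus 1 target pass.
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

print(f"{estimated_speedup(alpha=0.8, k=5, c=0.05):.2f}x")  # ~2.95x
print(f"{estimated_speedup(alpha=0.3, k=5, c=0.05):.2f}x")  # ~1.14x, marginal
```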
Common pitfalls
Low acceptance rate destroys gains (HIGH)
If the drafter is poorly matched to the target (different tokenizer, different distribution, different RLHF), acceptance rate falls below 30% and the cost of the drafter forward pass eats up the savings.
Train the drafter on the same distribution as the target or use a drafter distilled from the target.
KV-cache management (MEDIUM)
The drafter and target must coordinate the KV-cache. A naive implementation recomputes the context from scratch each step, negating the gain. Sharing the KV-cache requires care when drafter tokens are rejected, since the cache must be rolled back (see the sketch after this pitfall).
Use proven frameworks (vLLM, SGLang) or reference implementations instead of writing your own.
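A toy, framework-agnostic sketch of the rollback bookkeeping; the class and method names are hypothetical and not tied to any particular engine's cache layout:

```python
import torch

class ToyKVCache:
    # Hypothetical layout: one K and one V tensor per layer,
    # shaped (seq_len, num_heads, head_dim), grown as decoding proceeds.
    def __init__(self, num_layers, num_heads, head_dim):
        self.k = [torch.empty(0, num_heads, head_dim) for _ in range(num_layers)]
        self.v = [torch.empty(0, num_heads, head_dim) for _ in range(num_layers)]

    def seq_len(self):
        return self.k[0].shape[0]

    def append(self, layer, k_new, v_new):
        # Filled in while the target verifies the drafted positions in parallel.
        self.k[layer] = torch.cat([self.k[layer], k_new], dim=0)
        self.v[layer] = torch.cat([self.v[layer], v_new], dim=0)

    def rollback(self, keep_len):
        # After verification, drop the entries written for rejected draft
        # positions so the next step continues from the last accepted token.
        self.k = [k[:keep_len] for k in self.k]
        self.v = [v[:keep_len] for v in self.v]

# Usage: if 2 of k=5 drafted tokens were accepted this step,
#   cache.rollback(prefix_len + num_accepted)
# instead of recomputing the whole context from scratch.
```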
Diminishing gains on fast hardware (LOW)
When the inference bottleneck is compute rather than memory bandwidth (e.g., large models on H100/H200), parallel verification adds significant compute overhead and the speedup drops below 2x.
Speculative decoding yields the largest gains on consumer GPUs, mobile devices, and smaller data-center models. On frontier hardware, measure ROI carefully.
GENESIS
Speculative Decoding (Google)
Source paper (breakthrough): in "Fast Inference from Transformers via Speculative Decoding", Leviathan, Kalman, and Matias formalize the algorithm and prove distribution preservation. They demonstrate 2–3x speedup on T5-XXL.
Speculative Sampling (DeepMind)
Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper independently present the same idea as "speculative sampling," achieving 2–2.5x speedup on 70B Chinchilla in a distributed setup.
Medusa: drafter-free speculative decoding
Cai et al. propose attaching multiple decoding heads directly to the target model, eliminating the need for a separate drafter. Tree attention enables parallel verification of multiple branches. 2.2–3.6x speedup.
DeepSeek-V3 combines MTP training with speculative decoding
DeepSeek-V3 (671B MoE, 37B activated) adopts Multi-Token Prediction as an auxiliary training objective, improving model quality and simultaneously enabling speculative decoding without a separate drafter.
Gemma 4 MTP drafters (Google)
Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0, with MLX, vLLM, SGLang, and Ollama support. 74M-parameter drafters achieve 2.5x–3.1x speedups on local hardware (Pixel, Apple M4, RTX PRO 6000) with no quality degradation.
Consumer and prosumer GPUs with limited VRAM bandwidth gain the most — underutilized tensor cores perform parallel verification almost for free.
On CPU, memory bandwidth is even more constraining — speculative decoding yields significant speedup in llama.cpp and similar implementations.
The source paper by Leviathan et al. demonstrates speedup on TPU (T5-XXL).
BUILT ON
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so the model can learn long-range dependencies with a constant number of sequential operations (versus linear in RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), remains an active research direction (FlashAttention, sliding window, linear attention, SSM).
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language and reasoning tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
| Title | Publisher | Type |
|---|---|---|
| Fast Inference from Transformers via Speculative Decoding (source paper by Leviathan et al., Google Research, ICML 2023 Oral) | arXiv | scientific article |
| Accelerating Large Language Model Decoding with Speculative Sampling (concurrent DeepMind publication, Chen et al.) | arXiv | scientific article |
| Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | arXiv | scientific article |