On May 6, 2026, Google released experimental Multi-Token Prediction (MTP) drafter models for the Gemma 4 family, accelerating local inference by up to 3x with no loss of output quality. The technique builds on speculative decoding: a lightweight draft model predicts several future tokens, which the main model then verifies in parallel.
Key highlights
- Inference speedups range from roughly 2x to 3.1x depending on hardware, with no degradation in output quality
- E2B and E4B models on Pixel phones generate 2.8x and 3.1x as many tokens per second, respectively
- Gemma 4 31B on Apple M4 silicon achieves a 2.5x speedup
- MTP drafter models have just 74 million parameters, compared with billions in the target models
- Drafters are available under the Apache 2.0 license and supported via MLX, vLLM, SGLang, and Ollama
How speculative decoding works in Gemma 4
Standard language models generate tokens autoregressively: one at a time, with each new token requiring a full pass through the model, and therefore a full read of its parameters from memory. This bottleneck is especially pronounced on consumer hardware, whose memory bandwidth is significantly lower than that of the HBM used in data centers. Compute units spend much of their time waiting for data rather than processing it.
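To make the bottleneck concrete, here is a minimal sketch of the standard decode loop; `model` is a hypothetical stand-in for any causal language model that returns per-position next-token logits, not a real Gemma API.

```python
# Minimal sketch of standard autoregressive decoding (greedy, for
# simplicity). Every iteration runs a full forward pass, which means
# streaming the entire parameter set from memory to emit ONE token.
def generate(model, tokens, n_new):
    for _ in range(n_new):
        logits = model.forward(tokens)        # full weight read per step
        tokens.append(int(logits[-1].argmax()))
    return tokens
```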
Multi-Token Prediction puts that waiting time to productive use. A lightweight drafter model (74 million parameters in the E2B variant) speculatively generates several future tokens, sharing the key-value cache already built by the main model to avoid recomputing the context. The target model then verifies the drafted tokens in a single parallel forward pass: every token that matches the target's own prediction is accepted, and the same pass also yields one additional token from the main model itself, so even a full rejection still makes progress.
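Sketched below is one such verification step under greedy decoding; `drafter` and `target` are hypothetical objects with the same placeholder `forward` interface as above, and the KV-cache sharing Google describes is omitted for brevity.

```python
# One speculative decoding step, assuming greedy decoding throughout.
def speculative_step(target, drafter, tokens, k=4):
    # 1. The cheap drafter proposes k future tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = int(drafter.forward(ctx)[-1].argmax())
        draft.append(t)
        ctx.append(t)

    # 2. The target verifies all k proposals in ONE parallel forward pass;
    #    logits[i] is the target's prediction for the token after position i.
    logits = target.forward(tokens + draft)

    # 3. Accept the longest prefix on which the target agrees with the drafter.
    n = 0
    while n < k and int(logits[len(tokens) - 1 + n].argmax()) == draft[n]:
        n += 1

    # 4. The same pass yields one extra token from the target itself, so each
    #    step emits n + 1 tokens even when every drafted token is rejected.
    bonus = int(logits[len(tokens) - 1 + n].argmax())
    return tokens + draft[:n] + [bonus]
```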
The result: in the time it previously took to generate one token, the system can now produce several, namely the accepted drafter sequence plus the bonus token from the main model. Google describes the gain as "zero quality degradation": the speculative process introduces no new errors, because whenever the drafter is wrong the main model rejects its token and continues generating on its own, leaving the final output exactly what the main model would have produced by itself.
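A rough back-of-envelope shows how this adds up to the reported speedups; the draft length and acceptance rate below are assumed for illustration, not taken from Google's post.

```python
# Expected tokens emitted per target forward pass when the drafter proposes
# k tokens and each independently matches the target with probability a:
# the accepted prefix has expected length sum(a^i for i in 1..k), and the
# verification pass always contributes one bonus token on top.
k, a = 4, 0.8  # assumed values for illustration
expected = sum(a**i for i in range(1, k + 1)) + 1
print(f"~{expected:.2f} tokens per pass")  # ~3.36, i.e. roughly a 3x speedup
```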
Benchmark results by hardware
Google provided measurements for several hardware configurations:
- E2B and E4B on Pixel phones: 2.8x and 3.1x speedups respectively, with the added benefit of improved battery life
- Gemma 4 31B on Apple M4: 2.5x speedup
- Gemma 4 26B on an NVIDIA RTX PRO 6000: approximately 2x as many tokens per second
Peak gains (up to 3.1x) come from the mobile models running on Pixel. The larger the model and the faster the memory subsystem, the smaller the relative gain from MTP, because fast memory makes the bandwidth bottleneck less dominant. Even so, a 2.5x speedup on M4 meaningfully improves the practical usability of local AI models.
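A simple model makes the bandwidth argument concrete; all figures below are assumptions for illustration, not measured values.

```python
# For a memory-bound decoder, peak tokens/s is roughly memory bandwidth
# divided by the bytes read per token (approximately the model's size).
def max_tokens_per_s(params_billions, bytes_per_param, bandwidth_gb_s):
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

# A 31B model quantized to ~4 bits (0.5 bytes/param):
print(max_tokens_per_s(31, 0.5, 120))    # ~7.7 tok/s on ~120 GB/s mobile-class memory
print(max_tokens_per_s(31, 0.5, 1800))   # ~116 tok/s on ~1.8 TB/s datacenter HBM
```

MTP helps precisely because each expensive weight read then pays for several tokens instead of one; where the baseline is already fast, the relative win shrinks.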
Context: Gemma 4 and Google's strategy
Gemma 4 is an open-weight model family released by Google in spring 2026 under the Apache 2.0 license — more permissive than the custom license Google used for previous Gemma releases. The models are optimized for local inference but are based on the same underlying technology that powers Google's frontier Gemini models. Google offers them as an alternative for users who prefer to process data locally rather than sending it to the cloud.
The family spans Mixture-of-Experts (MoE) models, including Gemma 4 26B, and dense models such as Gemma 4 31B. The largest dense model runs on a single high-end accelerator at full precision; quantization brings it within reach of consumer GPUs. Strategically, the MTP release is well timed: the announcement comes just weeks after Gemma 4's initial launch, a sign that Google is systematically building an inference tooling layer around its open model family.
Why this matters
Local inference is to AI what compilers were to programming: the faster and more efficient it becomes, the more developers and enthusiasts can work without depending on external services. A threefold speedup on mobile hardware and a two-to-three-fold improvement on desktop GPUs cross the threshold where interacting with the model stops feeling noticeably slow to the user.
Speculative decoding is not a new technique — it has been explored previously by companies including Meta AI with Llama and DeepSeek AI. What is new is that Google is integrating it directly into the Gemma ecosystem as a ready-to-use tool available through four popular frameworks (MLX, vLLM, SGLang, Ollama) and under the same Apache 2.0 license. This reduces the barrier to adoption to a minimum. The limitation is the experimental status: Google explicitly describes the MTP models as "experimental," meaning it does not guarantee API stability or long-term support under current terms.
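As an illustration of what adoption might look like, here is a hypothetical vLLM invocation: the checkpoint names are placeholders, and the speculative-decoding argument names have varied across vLLM releases, so treat this as a sketch of the configuration's shape rather than verified usage.

```python
from vllm import LLM, SamplingParams

# Placeholder model ids; the actual Gemma 4 and MTP drafter checkpoint
# names are assumptions. `speculative_config` follows recent vLLM
# releases; check your installed version's docs for the exact interface.
llm = LLM(
    model="google/gemma-4-31b",
    speculative_config={
        "model": "google/gemma-4-mtp-drafter",
        "num_speculative_tokens": 4,   # tokens drafted per verification pass
    },
)
outputs = llm.generate(["Explain speculative decoding in one paragraph."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```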
What's next
- Google has not provided a timeline for graduating MTP from experimental to production status — the "experimental" label should be taken as a signal that interfaces may change
- Support via Ollama suggests Gemma 4 with MTP will become one of the standard recommended models for local inference in the coming months
- Results on Pixel indicate potential for on-device AI in mobile use cases; Google is likely to promote this approach with future Android AI updates
Sources
- Ars Technica — Google's Gemma 4 AI models get 3x speed boost by predicting future tokens
- Google Blog — Multi-Token Prediction for Gemma 4