Introduction

note · March 2, 2026 · 3 min

ai inference

This note is part of a series.

Speculative decoding is a popular technique in autoregressive LLM inference that aims to reduce the number of forward passes of a target model by using a draft model. It was introduced in "Accelerating Large Language Model Decoding with Speculative Sampling" (DeepMind, 2023) and improved in many subsequent works. (Coincidentally, other research teams published the idea around the same time as DeepMind.)

The intuition behind speculative decoding is that not all predictions require a full pass through the target model. When generating text, many tokens are obvious from context and can be guessed faster and more cheaply by a smaller model.

Speculative Decoding

At each step, given a sequence x of length t, a drafting algorithm (usually a small neural network) guesses the values of the next n tokens, denoted draft, where n is usually small (between 2 and 8).

A sequence of context tokens followed by draft tokens
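
The drafting step can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: `propose_draft`, the `DraftModel` type, and `toy_draft_model` are hypothetical names, and the toy rule (next token is the last token plus one, modulo 10) stands in for a small neural network.

```python
from typing import Callable, List

# Hypothetical interface: a draft model maps a token sequence to one next token.
DraftModel = Callable[[List[int]], int]


def propose_draft(draft_model: DraftModel, x: List[int], n: int) -> List[int]:
    """Greedily guess n future tokens, feeding each guess back as context."""
    draft: List[int] = []
    for _ in range(n):
        draft.append(draft_model(x + draft))
    return draft


def toy_draft_model(tokens: List[int]) -> int:
    # Toy rule for illustration only: last token + 1, modulo 10.
    return (tokens[-1] + 1) % 10


x = [3, 4, 5]                                   # context of length t = 3
draft = propose_draft(toy_draft_model, x, n=4)  # n = 4 guessed tokens
candidate = x + draft                           # sequence handed to the target model
```

The drafter still runs autoregressively, one token at a time; the saving comes from it being much cheaper per step than the target model.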

Next, the target model is run once on the concatenated sequence x + draft. Its output probabilities at each draft position tell us whether the draft is correct, partially correct, or incorrect.

The target model verifies draft tokens in parallel, accepting, rejecting, or discarding each one
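
A minimal sketch of the verification step, under simplifying assumptions: it uses greedy matching (a draft token is accepted only if it equals the target model's most likely token), whereas the speculative sampling paper accepts or rejects tokens probabilistically. The names `verify_greedy`, `TargetModel`, and `toy_target_model` are hypothetical, and the toy target stands in for a single forward pass of a large model.

```python
from typing import Callable, List

# Hypothetical interface: one forward pass over a sequence returns, for every
# position i, the target model's greedy prediction for the token at i + 1.
TargetModel = Callable[[List[int]], List[int]]


def verify_greedy(target_model: TargetModel, x: List[int], draft: List[int]) -> List[int]:
    """Return the tokens to append to x after checking the draft in one pass."""
    preds = target_model(x + draft)  # a single pass scores all draft positions
    t = len(x)
    out: List[int] = []
    for i, tok in enumerate(draft):
        expected = preds[t - 1 + i]  # target's choice for position t + i
        if tok != expected:
            # Rejected: keep the target's token; later draft tokens are discarded.
            out.append(expected)
            return out
        out.append(tok)  # accepted
    # Whole draft accepted: the same pass also yields one extra token for free.
    out.append(preds[t - 1 + len(draft)])
    return out


def toy_target_model(tokens: List[int]) -> List[int]:
    # Toy rule for illustration only: predicts token + 1 (mod 10), except 0 after a 7.
    return [0 if tok == 7 else (tok + 1) % 10 for tok in tokens]


partial = verify_greedy(toy_target_model, [3, 4, 5], [6, 7, 8, 9])
full = verify_greedy(toy_target_model, [1, 2], [3, 4])
```

In the first call the draft is partially correct: 6 and 7 are accepted, 8 is rejected and replaced by the target's token, and 9 is discarded. In the second call every draft token is accepted, so the pass produces one more token than was drafted. Either way, at least one new token comes out of each target pass.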