Architecture

Temperature (sampling parameter)

mature

How it works

1. The model returns logits z_i for each token in the vocabulary.
2. Scaling: divide the logits by T: z_i / T.
3. Softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
4. Sampling: draw a token from distribution p (or apply greedy argmax for T→0).

Effect of T:
• T < 1 → sharpening: probability concentrates on the highest-logit token. T→0 is greedy decoding.
• T = 1 → raw model distribution.
• T > 1 → flattening: differences between tokens shrink, randomness increases. T→∞ is a uniform distribution.

5. Temperature is applied before optional top-k / top-p: scale logits first, then truncate.
6. Temperature scaling (calibration): learn a single scalar T on a validation set by minimising negative log-likelihood, so that posterior probabilities better match actual accuracy.
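The steps above can be sketched in NumPy. This is a minimal illustration rather than any particular library's implementation: the function names `temperature_softmax` and `sample_token`, the `top_k` handling, and the example logits are all assumptions made for this sketch.

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    """Return softmax(logits / T) as a probability vector (requires T > 0)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()  # subtracting the max avoids overflow and leaves the result unchanged
    p = np.exp(z)
    return p / p.sum()

def sample_token(logits, T=1.0, top_k=None, rng=None):
    """Scale logits by T, optionally keep the top_k largest, then sample an index."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if T == 0:
        # Greedy limit, handled separately because z / 0 is undefined
        return int(np.argmax(logits))
    scaled = logits / T  # temperature first ...
    if top_k is not None:
        # ... then truncation: mask everything below the k-th largest value
        # (tokens tied with the cutoff are all kept in this sketch)
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    p = temperature_softmax(scaled, T=1.0)  # logits are already scaled
    return int(rng.choice(len(p), p=p))

# Hypothetical logits for a three-token vocabulary
example_logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, temperature_softmax(example_logits, T).round(3))
```

Printing the three distributions side by side shows the probability of the top token shrinking as T grows, which is the sharpening and flattening behaviour described in the list.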

Problem solved

Language model logits must be converted into a probability distribution from which the next token is sampled. Used as-is, the raw distribution (T=1) offers no control: sampling from it can degenerate when unreliable low-probability tokens are drawn, while always taking the most likely token tends toward repetition. Without control over randomness, generation cannot be adapted to the task: code needs determinism, creative writing needs diversity. Temperature addresses both problems with a single scalar parameter.

Key mechanisms

Scaling logits before softmax: p_i = exp(z_i / T) / Σ exp(z_j / T)

T < 1 — sharpens the distribution (less randomness, higher confidence)

T > 1 — flattens the distribution (more diversity, risk of degeneration)

T → 0 — greedy decoding limit (argmax over logits)

T → ∞ — uniform-distribution limit

Composes with top-k and top-p: temperature scales logits before truncation

Temperature scaling (calibration): a single T fitted on a validation set by minimizing NLL, with ECE typically used to measure the resulting calibration

High T in distillation — generates "soft targets" carrying inter-class relations
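The calibration use of temperature from the list above can be sketched as a one-parameter fit. The sketch below assumes a simple grid search over T in place of the gradient-based optimiser usually used; the function names `nll` and `fit_temperature`, the grid bounds, and the synthetic "overconfident" logits are all assumptions for illustration.

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of labels under softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def fit_temperature(logits, labels, grid=None):
    """Pick the T on a grid that minimises validation NLL (hypothetical helper)."""
    grid = np.linspace(0.05, 5.0, 100) if grid is None else np.asarray(grid)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set: labels follow softmax(true_logits), but the
# "model" reports logits three times too large, i.e. it is overconfident.
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(2000, 5))
probs = np.exp(true_logits)
probs /= probs.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(5, p=p) for p in probs])
val_logits = 3.0 * true_logits

fitted_T = fit_temperature(val_logits, labels)
print("fitted T:", fitted_T)
```

Because dividing every logit by the same positive T never changes which logit is largest, the fitted T adjusts confidence without changing the predicted class.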

Strengths & limitations

Strengths

✓Very simple mechanism — a single scalar with a clear interpretation

✓Negligible computational overhead: one extra division per logit

✓Widely supported across inference libraries and APIs

✓Controls the certainty–diversity balance without modifying the model

✓Works post-hoc — no retraining required

✓In calibration it does not change classification accuracy (preserves argmax)

Limitations

✗Global effect — operates on the entire distribution and cannot selectively suppress wrong tokens

✗High T can lead to text degeneration (Holtzman et al. 2020)

✗Very low T yields repetitive, dull outputs (loops, "I am an AI...")

✗No context adaptivity — requires manual tuning per task

✗Little to gain on models that are already well calibrated: raising T then over-smooths the distribution

✗Impact on generation quality depends heavily on architecture, data and training stage

Implementation

Implementation pitfalls

T=0 is not truly deterministic in float16 · Medium

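Temperature 0 (argmax) should give deterministic results, but float16 rounding errors and differing GPU operation orders can flip the selected token when the top logits are tied or nearly tied. T=0 also cannot be substituted into z_i / T directly, so implementations usually special-case it as argmax. Keeping logits in float32 and using deterministic kernels reduces the problem; a seeded PRNG only matters once T > 0 and tokens are actually sampled.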
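The tie effect in this pitfall can be reproduced on the CPU. The sketch below uses two made-up logits that are distinct in float32 but round to the same float16 value; the specific numbers are assumptions chosen to trigger the rounding, not values from any model.

```python
import numpy as np

# Two hypothetical logits that differ only in the third decimal place
logits32 = np.array([10.001, 10.002], dtype=np.float32)

# float16 has a spacing of about 0.0078 near 10, so both values collapse
logits16 = logits32.astype(np.float16)

pick32 = int(np.argmax(logits32))  # float32 still sees index 1 as larger
pick16 = int(np.argmax(logits16))  # exact tie: np.argmax returns the first index

print("float32 pick:", pick32)
print("float16 pick:", pick16)
print("tie in float16:", bool(logits16[0] == logits16[1]))
```

NumPy resolves the tie toward the lowest index; other backends may resolve ties differently, which is one way the same prompt can yield different greedy tokens across setups.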

High temperature does not improve creativity — it increases randomness · Medium

T>1.5 often leads to incoherent or nonsensical responses. What reads as LLM "creativity" comes mainly from the diversity of the training data, not from high temperature. T=0.7–1.0 is a common starting range for many applications, but the best value depends on the model and task.
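One way to see that temperature only adds randomness is to track the Shannon entropy of the sampling distribution as T grows. The sketch below is illustrative; the function name `entropy_at_temperature` and the example logits are assumptions.

```python
import numpy as np

def entropy_at_temperature(logits, T):
    """Shannon entropy (in nats) of softmax(logits / T), for T > 0."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    p = p / p.sum()
    nonzero = p[p > 0]  # skip zero entries to avoid log(0)
    return float(-(nonzero * np.log(nonzero)).sum())

# Hypothetical logits for a five-token vocabulary
example_logits = [4.0, 2.0, 1.0, 0.5, 0.0]
temperatures = [0.2, 0.7, 1.0, 1.5, 3.0]
entropies = [entropy_at_temperature(example_logits, T) for T in temperatures]

for T, h in zip(temperatures, entropies):
    print(f"T={T}: entropy={h:.3f} nats (uniform limit {np.log(5):.3f})")
```

Entropy rises toward the uniform limit log(5) as T increases: the ranking of tokens never changes, only how often the lower-ranked ones get drawn.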

Evolution

Original paper · 1985 · David H. Ackley

A Learning Algorithm for Boltzmann Machines (softmax with temperature)

David H. Ackley, Geoffrey E. Hinton, Terrence J. Sejnowski

1985

Ackley, Hinton and Sejnowski use a temperature parameter in Boltzmann machines, where it is borrowed directly from statistical physics (the Gibbs distribution).

2015

Hinton, Vinyals and Dean ("Distilling the Knowledge in a Neural Network") use a high softmax temperature to create "soft targets" for knowledge distillation.

2017

Guo, Pleiss, Sun and Weinberger publish "On Calibration of Modern Neural Networks" — temperature scaling becomes a standard calibration technique for classification models.

2019

GPT-2 (Radford et al.) popularizes temperature as a primary text-generation parameter in large language models.

2020

Holtzman et al. ("The Curious Case of Neural Text Degeneration") introduce nucleus sampling (top-p) and analyze how temperature affects text degeneration.

2023

The OpenAI ChatGPT API exposes a 0–2 temperature range as a public developer parameter; many LLM providers expose a similar parameter, though the permitted range varies by provider.