Temperature (sampling parameter)

Maturity: mature
Category: Architecture
Abstraction level: Primitive
Operation level: Inference, Training
Use cases
Controlling generation diversity in LLMs (chat, creative writing, code)
Greedy / deterministic decoding at T → 0 (factual tasks, code, classification)
High temperature for creative writing, brainstorming, variant generation
Temperature scaling: post-hoc calibration of classification models
Distillation: high T yields "soft targets" that transfer "dark knowledge"
Combined with top-k and top-p in modern sampling pipelines
Tuning the exploration–exploitation balance in RL with soft policies

How it works

1. The model returns a logit z_i for each token in the vocabulary.
2. Scaling: divide the logits by T, giving z_i / T.
3. Softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
4. Sampling: draw a token from the distribution p (or take the greedy argmax in the T → 0 limit).
Effect of T:
• T < 1: sharpening. Probability concentrates on the highest-logit token; T → 0 is greedy decoding.
• T = 1: the raw model distribution.
• T > 1: flattening. Differences between tokens shrink and randomness increases; T → ∞ gives a uniform distribution.
5. Temperature is applied before optional top-k / top-p truncation: scale the logits first, then truncate. The order matters for top-p, whose cutoff depends on the scaled probabilities; top-k keeps the same tokens either way, because dividing by a positive T preserves their ranking.
6. Temperature scaling (calibration): learn a single scalar T on a validation set by minimising negative log-likelihood, so that predicted probabilities better match actual accuracy.
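The scaling, softmax and sampling steps can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names `softmax_with_temperature` and `sample_token` and the three example logits are assumptions made for this sketch, and T = 0 is handled as a special case (argmax) because the division is undefined there.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Return p_i = exp(z_i / T) / sum_j exp(z_j / T) as a list."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for the T -> 0 limit")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow and leaves p unchanged
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Sample a token index; temperature == 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for a 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    # lower T concentrates mass on index 0, higher T spreads it out
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Passing a seeded `random.Random` instance as `rng` makes the sampled sequence repeatable for a given set of logits.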

Problem solved

Language model logits must be converted into a probability distribution from which the next token is sampled. Sampling from the raw distribution (T=1) can drift into incoherent text, because unlikely tail tokens are drawn too often, while always picking the most likely token tends toward repetition. Without a way to control randomness, generation also cannot be adapted to the task: code needs determinism, creative writing needs diversity. Temperature addresses both problems with a single scalar parameter.

Key mechanisms

Scaling logits before softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T)
T < 1 — sharpens the distribution (less randomness, higher confidence)
T > 1 — flattens the distribution (more diversity, risk of degeneration)
T → 0 — greedy decoding limit (argmax over logits)
T → ∞ — uniform-distribution limit
Compositional with top-k and top-p — temperature scales logits before truncation
Temperature scaling: a single T fitted on a validation set by minimizing NLL, which typically also lowers ECE
High T in distillation — generates "soft targets" carrying inter-class relations
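How temperature composes with top-k and top-p can be sketched as follows. This is an illustrative pipeline under stated assumptions, not the code of any particular inference library: the function name `filtered_distribution`, its parameters and the example logits are hypothetical, and top-p is computed over the tokens that survive top-k.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature scaling, then optional top-k and top-p truncation.

    Returns a dict {token_index: probability} over the surviving tokens.
    """
    scaled = [z / temperature for z in logits]  # 1. scale the logits first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]                   # 2. keep the k best tokens
    probs = softmax([scaled[i] for i in order])
    if top_p is not None:                       # 3. nucleus (top-p) cut
        kept, cumulative = 0, 0.0
        for p in probs:
            kept += 1
            cumulative += p
            if cumulative >= top_p:
                break
        order, probs = order[:kept], probs[:kept]
        total = sum(probs)
        probs = [p / total for p in probs]      # renormalise the survivors
    return dict(zip(order, probs))

logits = [3.0, 2.0, 1.0, 0.0]  # hypothetical logits
print(filtered_distribution(logits, temperature=0.5, top_p=0.9))
print(filtered_distribution(logits, temperature=2.0, top_p=0.9))
```

With the same `top_p`, the higher temperature keeps more tokens in the nucleus, because the flattened distribution needs more tokens to reach the cumulative threshold.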

Strengths & limitations

Strengths
Very simple mechanism — a single scalar with a clear interpretation
Negligible computational overhead: one division per logit
Widely supported across inference libraries and APIs
Controls the certainty–diversity balance without modifying the model
Works post-hoc — no retraining required
In calibration it does not change classification accuracy (preserves argmax)
Limitations
Global effect — operates on the entire distribution and cannot selectively suppress wrong tokens
High T can lead to text degeneration (Holtzman et al. 2020)
Very low T yields repetitive, dull outputs (loops, "I am an AI...")
No context adaptivity — requires manual tuning per task
Composes poorly with strongly calibrated models (excessive smoothing)
Impact on generation quality depends heavily on architecture, data and training stage

Implementation

Implementation pitfalls
T=0 is not truly deterministic in float16 (severity: Medium)

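Temperature 0 (argmax) should give deterministic results, but float16 rounding and differing GPU operation orders can select different tokens when logits are tied or nearly tied. Using float32 makes such ties rarer but does not remove them; reproducible output also depends on deterministic kernels and an unchanged batch and hardware setup. A seeded PRNG matters only when T > 0, since argmax draws no random numbers.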
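The tie problem can be reproduced without a GPU. The sketch below rounds values to half precision with the standard `struct` module (format code `"e"`); the helper name `to_float16` and the logit values are made up for illustration. Two logits that differ at full precision collapse to the same float16 value, so which index argmax returns then depends on the implementation's tie-breaking.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

logits = [10.001, 10.002, 3.0]  # hypothetical logits; index 1 is the true maximum
half = [to_float16(z) for z in logits]

# At full precision the maximum is unique.
argmax_full = max(range(len(logits)), key=lambda i: logits[i])

# After rounding to float16, several indices can share the maximum value.
tied = [i for i, z in enumerate(half) if z == max(half)]

print("full-precision argmax:", argmax_full)
print("float16 values:", half)
print("indices tied for the maximum in float16:", tied)
```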

High temperature does not improve creativity; it increases randomness (severity: Medium)

T > 1.5 often leads to incoherent or nonsensical responses. The variety in LLM output comes mainly from what the model learned from diverse training data, not from high temperature; T = 0.7–1.0 is a common starting range for most applications.
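One way to see that raising T only adds randomness is to track the Shannon entropy of the sampling distribution. The sketch below is illustrative: `entropy_at_temperature` and the example logits are assumptions for this example. Entropy grows with T and approaches log(V), the entropy of a uniform distribution over V tokens, while the ranking of the tokens never changes.

```python
import math

def entropy_at_temperature(logits, temperature):
    """Shannon entropy (in nats) of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical logits for a 5-token vocabulary
upper_bound = math.log(len(logits))  # entropy of the uniform distribution
for t in (0.5, 1.0, 1.5, 2.0, 10.0):
    print(t, round(entropy_at_temperature(logits, t), 3), "of max", round(upper_bound, 3))
```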

Evolution

Original paper · 1985 · David H. Ackley
A Learning Algorithm for Boltzmann Machines (softmax with temperature)
David H. Ackley, Geoffrey E. Hinton, Terrence J. Sejnowski
1985
Ackley, Hinton and Sejnowski use a temperature parameter in Boltzmann machines, carrying over its physical motivation from the Gibbs (Boltzmann) distribution.
2015
Hinton, Vinyals and Dean ("Distilling the Knowledge in a Neural Network") use a high softmax temperature to create "soft targets" for knowledge distillation.
2017
Guo, Pleiss, Sun and Weinberger publish "On Calibration of Modern Neural Networks" — temperature scaling becomes a standard calibration technique for classification models.
2019
GPT-2 (Radford et al.) popularizes temperature as a primary text-generation parameter in large language models.
2020
Holtzman et al. ("The Curious Case of Neural Text Degeneration") introduce nucleus sampling (top-p) and analyze how temperature affects text degeneration.
2023
The OpenAI ChatGPT API exposes temperature as a public developer parameter with a 0–2 range; most LLM providers offer a comparable parameter, though the allowed range varies by provider.

Computational complexity

Computational characteristics
Computational cost: O(V) per decoding step (V = vocab size) — negligible
Memory: a single scalar — O(1)
No GPU/CPU memory overhead
Trivially composes with top-k, top-p, repetition penalty in the sampling pipeline
One-line implementation: logits / T before softmax
In calibration: a one-dimensional optimization over the validation set
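The one-dimensional calibration fit can be sketched with a golden-section search over T. This is a minimal illustration and not the procedure of any specific paper or library: the function names, the search interval [0.05, 10] and the toy validation set are assumptions. The search relies on the NLL having a single minimum in T, which holds because it is convex in 1/T.

```python
import math

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)

def fit_temperature(logits_batch, labels, lo=0.05, hi=10.0, iters=60):
    """Golden-section search for the T in [lo, hi] that minimises the NLL."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    for _ in range(iters):
        if nll(logits_batch, labels, c) < nll(logits_batch, labels, d):
            b, d = d, c                  # minimum lies in [a, d]
            c = b - ratio * (b - a)
        else:
            a, c = c, d                  # minimum lies in [c, b]
            d = a + ratio * (b - a)
    return (a + b) / 2

# Toy validation set: the model is about 98% confident but only 75% accurate.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
print("fitted T:", round(T, 3))
```

On this toy set the fitted T is greater than 1, which softens the overconfident predictions; the argmax of each row, and therefore the accuracy, is unchanged.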
Benchmark notes

Typical ranges: T = 0.0 (greedy; code and factual tasks), T = 0.2–0.5 (precise answers), T = 0.7–1.0 (creative writing, chat), T > 1.2 (experimental, high diversity). Guo et al. (2017) showed that temperature scaling reduces the Expected Calibration Error (ECE) of ResNets on CIFAR-100 from ~16% to <2%. Holtzman et al. (2020) showed that pure sampling at T = 1.0 produces high-perplexity text that tends to degenerate into incoherence, which motivated nucleus sampling as an alternative.
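For readers unfamiliar with ECE, the metric can be sketched as below. The function name, the default of 15 equal-width bins and the toy inputs are assumptions for illustration; published results depend on the exact binning scheme, so this sketch is not expected to reproduce the figures quoted above.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Toy example: four predictions, three of them correct.
outcomes = [True, True, False, True]
overconfident = expected_calibration_error([0.98] * 4, outcomes)
calibrated = expected_calibration_error([0.75] * 4, outcomes)
print(overconfident, calibrated)
```

Here `confidences` holds the probability assigned to each predicted class and `correct` records whether that prediction was right; a lower value means confidence tracks accuracy more closely.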