Temperature (sampling parameter)
How it works
1. The model returns logits z_i for each token in the vocabulary.
2. Scaling: divide the logits by T: z_i / T.
3. Softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
4. Sampling: draw a token from distribution p (or apply greedy argmax for T→0).
Effect of T:
• T < 1 → sharpening: probability concentrates on the highest-logit token. T→0 is greedy decoding.
• T = 1 → raw model distribution.
• T > 1 → flattening: differences between tokens shrink, randomness increases. T→∞ is a uniform distribution.
5. Temperature is applied BEFORE optional top-k / top-p: scale logits first, then truncate.
6. Temperature scaling (calibration): learn a single scalar T on a validation set by minimising negative log-likelihood, so that predicted probabilities better match actual accuracy.
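The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `temperature_softmax` and `sample_token`, the `top_k` and `rng` parameters, and the example logits are all assumptions made for this sketch. Only top-k truncation is shown; top-p would slot in at the same point, after scaling.

```python
import math
import random

def temperature_softmax(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [z / temperature for z in logits]   # step 2: scale the logits
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]     # step 3: softmax
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index; temperature == 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = temperature_softmax(logits, temperature)
    if top_k is not None:
        # Truncate AFTER temperature scaling, then renormalise the survivors.
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        kept = set(order[:top_k])
        probs = [p if i in kept else 0.0 for i, p in enumerate(probs)]
        total = sum(probs)
        probs = [p / total for p in probs]
    # Step 4: draw one index according to the (possibly truncated) distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a three-token vocabulary.
example_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in temperature_softmax(example_logits, t)])
```

Passing a seeded `random.Random` instance as `rng` makes the draws repeatable, which is convenient when comparing temperatures side by side.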
Problem solved
Language model logits must be converted into a probability distribution from which the next token is sampled. Used as is, the raw distribution (T=1) reproduces the model's statistical habits and can lead to repetition or degeneration. Without control over randomness, generation also cannot be adapted to the task: code needs determinism, creative writing needs diversity. Temperature addresses both problems with a single scalar parameter.
Key mechanisms
Strengths & limitations
Implementation
In principle, temperature 0 (argmax) gives deterministic results. In practice, float16 rounding errors and differing orders of GPU operations can change which token wins when logits are tied or nearly tied. Reproducible output typically requires float32 and, for any T > 0, a seeded PRNG.
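The rounding half of this effect can be reproduced on a CPU with the standard library alone. The sketch below is only an illustration under stated assumptions: the logit values are made up, `round_to` and `argmax` are hypothetical helpers, and the tie-breaking rule (lowest index wins) is a property of this code, not of any particular inference stack. It does not model GPU operation ordering.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to an IEEE 754 format via struct ('<e' = float16, '<f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def argmax(values):
    # Ties resolve to the lowest index here; other implementations may differ.
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits: tokens 0 and 1 are almost tied, token 1 is slightly ahead.
logits = [10.001, 10.002, 3.0]

logits_f32 = [round_to("<f", z) for z in logits]
logits_f16 = [round_to("<e", z) for z in logits]

greedy_f32 = argmax(logits_f32)  # float32 keeps the two values distinct
greedy_f16 = argmax(logits_f16)  # float16 rounds both to the same value, so the tie rule decides

print("float32:", logits_f32, "-> token", greedy_f32)
print("float16:", logits_f16, "-> token", greedy_f16)
```

Near 10.0 the spacing between adjacent float16 values is 2^-7 (about 0.0078), so a gap of 0.001 between two logits disappears and greedy decoding depends on how ties are broken.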
T > 1.5 often leads to incoherent or nonsensical responses. An LLM's "creativity" comes mainly from the diversity of its training data, not from high temperature; T = 0.7–1.0 is a reasonable default range for most applications.
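One way to see why high temperatures drift toward noise is to measure the entropy of the sampling distribution as T grows. The following sketch uses made-up logits for a five-token vocabulary; `softmax_t` and `entropy_bits` are hypothetical helper names, and real vocabularies are far larger, so the numbers are illustrative only.

```python
import math

def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits: one clear favourite and a tail of weaker candidates.
logits = [4.0, 2.0, 1.0, 0.5, 0.0]

entropies = {t: entropy_bits(softmax_t(logits, t)) for t in (0.2, 0.7, 1.0, 1.5, 5.0)}
for t, h in entropies.items():
    print(f"T={t}: entropy={h:.3f} bits (uniform limit: {math.log2(len(logits)):.3f})")
```

Entropy rises monotonically with T and approaches the uniform limit log2(vocabulary size), which means more and more probability mass moves onto tokens the model itself rated as unlikely.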
Evolution
Computational complexity
Typical ranges:
• T = 0.0: greedy decoding, code / factual tasks.
• T = 0.2–0.5: precise answers.
• T = 0.7–1.0: creative writing, chat.
• T > 1.2: experimental / high creativity.
Guo et al. (2017) showed that temperature scaling reduces the Expected Calibration Error (ECE) of ResNets on CIFAR-100 from ~16% to <2%. Holtzman et al. (2020) showed that pure sampling at T = 1.0 yields high-perplexity text that degenerates into incoherence, which motivated nucleus sampling as an alternative.
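Temperature scaling for calibration, as referenced in the Guo et al. result, amounts to fitting one scalar T on held-out logits by minimising negative log-likelihood. The sketch below is a toy version under several assumptions: `nll` and `fit_temperature` are hypothetical names, the validation data is invented, and a log-spaced grid search stands in for the gradient-based optimiser that would normally be used.

```python
import math

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels under softmax(logits / T)."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels, lo=0.05, hi=20.0, steps=400):
    """Pick the T on a log-spaced grid in [lo, hi] that minimises validation NLL."""
    best_t, best_loss = None, float("inf")
    for i in range(steps + 1):
        t = lo * (hi / lo) ** (i / steps)
        loss = nll(logits_batch, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t

# Invented validation set: the model is very confident in class 0 every time,
# but is right only three times out of four, i.e. it is overconfident.
val_logits = [[8.0, 0.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

t_star = fit_temperature(val_logits, val_labels)
print("fitted T:", round(t_star, 2))
print("NLL at T=1:", round(nll(val_logits, val_labels, 1.0), 3))
print("NLL at fitted T:", round(nll(val_logits, val_labels, t_star), 3))
```

Because the toy model is overconfident, the fitted T comes out above 1, which softens the predicted probabilities without changing which class is ranked first.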