Self-Consistency
Replaces greedy decoding in Chain-of-Thought with sampling multiple diverse reasoning paths and selecting the most frequent answer, improving reasoning reliability without additional training.
Algorithm: (1) sample k different CoT paths using temperature T > 0, (2) extract the final answer from each path, (3) select the answer by majority vote (most frequently occurring). A typical range is k = 5–40 paths. The method requires no additional training or model modification.
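The three steps above can be sketched as follows. `sample_fn` is a hypothetical wrapper around an LLM call that returns one final answer per sampled reasoning path; it is not part of the original paper, only a stand-in so the voting logic is runnable.

```python
from collections import Counter

def majority_vote(answers):
    """Step (3): pick the most frequently occurring final answer."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency(question, sample_fn, k=20):
    """Steps (1)-(2): sample k independent CoT paths (sample_fn is assumed
    to decode with temperature T > 0) and extract each final answer."""
    answers = [sample_fn(question) for _ in range(k)]
    return majority_vote(answers)
```

In practice `sample_fn` would issue k separate generation requests with sampling enabled; only the answer-extraction and voting shown here are specific to Self-Consistency.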
Greedy decoding in Chain-of-Thought is sensitive to errors in a single reasoning path: one wrong step propagates to the final answer.
Number of samples (k)
- 5: Minimum for a noticeable improvement.
- 40: Value used in the experiments of Wang et al. (2022).
Number of independently sampled CoT paths. Increasing k improves answer stability but linearly increases inference cost.
Sampling temperature
- 0.5–0.7: Range recommended in the original paper.
Temperature T controls reasoning-path diversity. T = 0 makes the method useless (no diversity).
Aggregation method
How path outputs are combined: majority vote (classic), probability-weighted vote, or semantic clustering (Universal Self-Consistency).
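A probability-weighted variant can be sketched as below: each path's vote is weighted by a score the model assigned to that path (the example weights are made up). Note that the original paper reported plain unweighted majority voting performing comparably.

```python
from collections import defaultdict

def weighted_vote(answers_with_scores):
    # Each entry is (final_answer, path_score), where path_score is a
    # model-assigned weight for the whole reasoning path (e.g. a
    # normalized path probability). Votes accumulate per answer.
    totals = defaultdict(float)
    for answer, score in answers_with_scores:
        totals[answer] += score
    return max(totals, key=totals.get)
```

For example, two paths reaching "17" with weight 0.3 each outvote one path reaching "18" with weight 0.5, even though the single "18" path is individually the most probable.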
Common pitfalls
Cost grows linearly with k (MEDIUM)
Sampling k paths multiplies inference cost by k, which can be expensive for large models and long reasoning chains.
Choose k adaptively (e.g. early stopping when most paths already agree) or use smaller k for easier tasks.
Majority voting fails for open-ended answers (MEDIUM)
When answers are not discrete and not exact-match comparable (e.g. prose, code, longer explanations), standard majority voting is unusable.
Use Universal Self-Consistency or an LLM-as-judge to aggregate semantically similar answers.
Requires non-zero temperature (LOW)
Without sampling diversity (T = 0) all paths are identical and voting adds no information. Non-zero temperature T > 0 or top-p < 1 is required.
Use T in the 0.5–0.7 range and verify that the generated reasoning paths actually differ.
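The diversity check can be as simple as measuring the fraction of distinct paths among the k samples; a value near zero means sampling is effectively deterministic and voting adds no information. (Exact string comparison is a crude but cheap proxy for path diversity.)

```python
def path_diversity(paths):
    # Fraction of distinct reasoning paths among the sampled ones.
    # Near 1.0: paths differ as intended; near 1/len(paths): sampling
    # collapsed to (almost) identical chains, so raise T or top-p.
    return len(set(paths)) / len(paths)
```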
GENESIS · Source paper
Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., ICLR 2023)
Breakthrough: Wang et al. propose majority vote over multiple CoT paths, showing +17.9 points on GSM8K over vanilla CoT.
Universal Self-Consistency and extensions
Follow-up work extends Self-Consistency to open-ended tasks where exact-match voting is inapplicable (Universal Self-Consistency, Chen et al., 2023).
Self-Consistency is a layer on top of LLM inference, agnostic to specific hardware. All calls are standard autoregressive generation, which parallelizes well on GPU and TPU.
Sampling k paths can be parallelized within a batch, which exploits accelerator throughput well.
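Since the k samples are independent, they can be issued concurrently. A minimal sketch using Python's standard thread pool, again with a hypothetical `sample_fn` wrapper (a real serving stack would batch these requests on the accelerator):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def parallel_self_consistency(question, sample_fn, k=20, workers=8):
    # Issue the k independent samples concurrently. With an HTTP-based
    # LLM API the calls overlap in flight; wall-clock time approaches
    # that of a single call rather than k sequential calls.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(sample_fn, [question] * k))
    return Counter(answers).most_common(1)[0][0]
```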
BUILT ON
CoT
Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
EXTENDS
CoT
ALTERNATIVE TO
ToT
Tree of Thoughts (ToT) is a reasoning framework for language models proposed by Yao et al. (NeurIPS 2023). It generalizes the Chain-of-Thought approach, allowing the model to explore multiple different reasoning paths instead of a single linear sequence. Each intermediate reasoning state ('thought') is assigned an evaluation of its promise by the LLM. The framework endows the model with the ability for deliberate decision-making, consideration of alternative reasoning branches, backtracking when a path is a dead end, and lookahead before making global choices. Experiments showed that ToT significantly improves LLM problem-solving capabilities on tasks requiring non-trivial planning or search (Game of 24: GPT-4+CoT 4% vs ToT 74%).
Reflexion
Reflexion (Shinn et al., NeurIPS 2023) is a framework for reinforcing LLM agents not through weight updates, but through linguistic feedback. Reflexion agents verbally reflect on task feedback signals, then maintain these verbal reflections in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible regarding types of feedback signals (scalar values or free-form language) and sources (external or internally simulated). The method achieved 91% pass@1 on the HumanEval benchmark (coding), surpassing the then-SOTA GPT-4 at 80%. The work aligns with the paradigm of agents learning from experience without expensive RL.
Commonly used with
CoT
| Title | Publisher | Type |
|---|---|---|
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | arXiv / Google Research | scientific article |
| Universal Self-Consistency for Large Language Model Generation | arXiv / Google Research | scientific article |