Self-Consistency
Replaces greedy decoding in Chain-of-Thought with sampling multiple diverse reasoning paths and selecting the most frequent answer, improving reasoning reliability without additional training.
Algorithm: (1) sample k different CoT paths using temperature T > 0, (2) extract the final answer from each path, (3) select the answer by majority vote (most frequently occurring). A typical range is k = 5–40 paths. The method requires no additional training or model modification.
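The three steps above can be sketched as follows. `sample_fn` is a hypothetical wrapper around an LLM call that returns one final answer per sampled reasoning path; it is not part of the original paper, only a stand-in so the voting logic is runnable.

```python
from collections import Counter

def majority_vote(answers):
    """Step (3): pick the most frequently occurring final answer."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency(question, sample_fn, k=20):
    """Steps (1)-(2): sample k independent CoT paths (sample_fn is assumed
    to decode with temperature T > 0) and extract each final answer."""
    answers = [sample_fn(question) for _ in range(k)]
    return majority_vote(answers)
```

In practice `sample_fn` would issue k separate generation requests with sampling enabled; only the answer-extraction and voting shown here are specific to Self-Consistency.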
Greedy decoding in Chain-of-Thought is sensitive to errors in a single reasoning path: one wrong step propagates to the final answer.
Number of samples (k)
- 5: Minimum for a noticeable improvement.
- 40: Value used in the experiments of Wang et al. (2022).
Number of independently sampled CoT paths. Increasing k improves answer stability but linearly increases inference cost.
Sampling temperature
- 0.5–0.7: Range recommended in the original paper.
Temperature T controls reasoning-path diversity. T = 0 makes the method useless (no diversity).
Aggregation method
How path outputs are combined: majority vote (classic), probability-weighted vote, or semantic clustering (Universal Self-Consistency).
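A probability-weighted variant can be sketched as below: each path's vote is weighted by a score the model assigned to that path (the example weights are made up). Note that the original paper reported plain unweighted majority voting performing comparably.

```python
from collections import defaultdict

def weighted_vote(answers_with_scores):
    # Each entry is (final_answer, path_score), where path_score is a
    # model-assigned weight for the whole reasoning path (e.g. a
    # normalized path probability). Votes accumulate per answer.
    totals = defaultdict(float)
    for answer, score in answers_with_scores:
        totals[answer] += score
    return max(totals, key=totals.get)
```

For example, two paths reaching "17" with weight 0.3 each outvote one path reaching "18" with weight 0.5, even though the single "18" path is individually the most probable.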
Common pitfalls
Cost grows linearly with k (MEDIUM)
Sampling k paths multiplies inference cost by k, which can be expensive for large models and long reasoning chains.
Choose k adaptively (e.g. early stopping when most paths already agree) or use smaller k for easier tasks.
Majority voting fails for open-ended answers (MEDIUM)
When answers are not discrete and not exact-match comparable (e.g. prose, code, longer explanations), standard majority voting is unusable.
Use Universal Self-Consistency or an LLM-as-judge to aggregate semantically similar answers.
Requires non-zero temperature (LOW)
Without sampling diversity (T = 0) all paths are identical and voting adds no information. Non-zero temperature T > 0 or top-p < 1 is required.
Use T in the 0.5–0.7 range and verify that the generated reasoning paths actually differ.
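The diversity check can be as simple as measuring the fraction of distinct paths among the k samples; a value near zero means sampling is effectively deterministic and voting adds no information. (Exact string comparison is a crude but cheap proxy for path diversity.)

```python
def path_diversity(paths):
    # Fraction of distinct reasoning paths among the sampled ones.
    # Near 1.0: paths differ as intended; near 1/len(paths): sampling
    # collapsed to (almost) identical chains, so raise T or top-p.
    return len(set(paths)) / len(paths)
```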
GENESIS · Source paper
Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., ICLR 2023)
Breakthrough: Wang et al. propose majority vote over multiple CoT paths, showing +17.9 points on GSM8K over vanilla CoT.
Universal Self-Consistency and extensions
Follow-up work extends Self-Consistency to open-ended tasks where exact-match voting is inapplicable (Universal Self-Consistency, Chen et al., 2023).
Self-Consistency is a layer on top of LLM inference, agnostic to specific hardware. All calls are standard autoregressive generation, which parallelizes well on GPU and TPU.
Sampling k paths can be parallelized within a batch, which exploits accelerator throughput well.
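Since the k samples are independent, they can be issued concurrently. A minimal sketch using Python's standard thread pool, again with a hypothetical `sample_fn` wrapper (a real serving stack would batch these requests on the accelerator):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def parallel_self_consistency(question, sample_fn, k=20, workers=8):
    # Issue the k independent samples concurrently. With an HTTP-based
    # LLM API the calls overlap in flight; wall-clock time approaches
    # that of a single call rather than k sequential calls.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(sample_fn, [question] * k))
    return Counter(answers).most_common(1)[0][0]
```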
BUILT ON
CoT
Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
EXTENDS
CoT
ALTERNATIVE TO
ToT
Tree of Thoughts (ToT) is a reasoning framework for language models proposed by Yao et al. (NeurIPS 2023). It generalizes the Chain-of-Thought approach, allowing the model to explore multiple different reasoning paths instead of a single linear sequence. Each intermediate reasoning state ('thought') is assigned an evaluation of its promise by the LLM. The framework endows the model with the ability for deliberate decision-making, consideration of alternative reasoning branches, backtracking when a path is a dead end, and lookahead before making global choices. Experiments showed that ToT significantly improves LLM problem-solving capabilities on tasks requiring non-trivial planning or search (Game of 24: GPT-4+CoT 4% vs ToT 74%).
Reflexion
Reflexion (Shinn et al., NeurIPS 2023) is a framework for reinforcing LLM agents not through weight updates, but through linguistic feedback. Reflexion agents verbally reflect on task feedback signals, then maintain these verbal reflections in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible regarding types of feedback signals (scalar values or free-form language) and sources (external or internally simulated). The method achieved 91% pass@1 on the HumanEval benchmark (coding), surpassing the then-SOTA GPT-4 at 80%. The work aligns with the paradigm of agents learning from experience without expensive RL.
Commonly used with
CoT
| Title | Publisher | Type |
|---|---|---|
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | arXiv / Google Research | scientific article |
| Universal Self-Consistency for Large Language Model Generation | arXiv / Google Research | scientific article |