ToT (Tree of Thoughts)
How it works
ToT decomposes reasoning into three components: (1) Thought Generator - LLM produces candidates for the next reasoning step, (2) State Evaluator - LLM or a separate function evaluates the promise of each state, (3) Search Algorithm - BFS or DFS (or beam search) explores the state tree. The model can backtrack to a previous state and try a different branch.
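The loop that ties the three components together can be sketched as follows. This is a minimal beam-style BFS; `generate` and `evaluate` are stand-ins for LLM-backed calls, and the toy task (reach a target sum by appending digits) replaces a real benchmark:

```python
def tree_of_thoughts(root, generate, evaluate, beam_width, max_steps):
    """Beam-style BFS over reasoning states.

    generate(state) -> list of successor states   (Thought Generator)
    evaluate(state) -> float score                (State Evaluator)
    """
    frontier = [root]
    for _ in range(max_steps):
        # Expand every kept state into its candidate successors.
        candidates = [s for state in frontier for s in generate(state)]
        if not candidates:
            break
        # Keep only the b most promising states per level (pruning).
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=evaluate)

# Toy task: reach a target sum by appending digits 1-3.
target = 7
generate = lambda s: [s + d for d in (1, 2, 3) if s + d <= target]
evaluate = lambda s: -(target - s)  # closer to the target = better
best = tree_of_thoughts(0, generate, evaluate, beam_width=2, max_steps=5)
print(best)  # → 7
```

Backtracking falls out of the frontier mechanics: a state that scores poorly is simply never expanded again, and the search continues from its better-scoring siblings.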
Problem solved
Language models with CoT are confined to left-to-right, token-by-token decisions during inference, which fails on tasks requiring exploration, strategic planning, or where initial decisions are pivotal.
Components
LLM-based component generating k candidate thoughts. Two strategies: 'sample' (independent sampling from the same prompt - for rich thought spaces, e.g. Creative Writing) or 'propose' (sequential proposal of multiple thoughts in one prompt - for constrained spaces, e.g. Game of 24).
Official
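The two generation strategies can be sketched as below, assuming a hypothetical `llm(prompt) -> str` callable; the prompt wording is illustrative, not the paper's:

```python
def generate_thoughts(llm, prompt, k, strategy="sample"):
    """Produce k candidate next thoughts (`llm` is a stand-in for any
    text-completion callable)."""
    if strategy == "sample":
        # Independent samples from the same prompt: suits rich, open-ended
        # thought spaces (e.g. Creative Writing), where duplicates are unlikely.
        return [llm(prompt) for _ in range(k)]
    elif strategy == "propose":
        # One call that proposes k distinct thoughts, split line by line:
        # suits constrained spaces (e.g. Game of 24), where independent
        # samples would waste budget on duplicates.
        proposal = llm(f"{prompt}\nPropose {k} distinct next steps, one per line.")
        return proposal.splitlines()[:k]
    raise ValueError(strategy)

# Demo with a deterministic fake LLM.
fake_llm = lambda p: "a\nb\nc"
print(generate_thoughts(fake_llm, "state", 2, "propose"))  # → ['a', 'b']
```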
LLM evaluates each state in two ways: 'value' (assigns numeric or categorical value to a single state, used in Game of 24) or 'vote' (compares multiple states and votes for the best, used in Creative Writing). The heuristic is soft and depends on LLM quality.
Official
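The two evaluation strategies can be sketched similarly; again `llm` is a stand-in callable and the prompts and score scale are illustrative:

```python
from collections import Counter

def evaluate_states(llm, states, strategy="value", n_votes=5):
    """Score candidate states (`llm` is a stand-in for an LLM call)."""
    if strategy == "value":
        # Score each state independently, e.g. map categorical judgments
        # like sure/maybe/impossible onto numbers.
        scale = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}
        return [scale.get(llm(f"Rate this state: {s}"), 0.0) for s in states]
    elif strategy == "vote":
        # Compare all states jointly; each vote call names the best index,
        # and repeated votes reduce variance of the soft heuristic.
        votes = Counter(llm(f"Vote for the best of: {states}") for _ in range(n_votes))
        return [votes.get(str(i), 0) / n_votes for i in range(len(states))]
    raise ValueError(strategy)

# Demo with a deterministic fake LLM.
fake_llm = lambda p: "sure" if "Rate" in p else "1"
print(evaluate_states(fake_llm, ["s0", "s1"], "value"))  # → [1.0, 1.0]
print(evaluate_states(fake_llm, ["s0", "s1"], "vote"))   # → [0.0, 1.0]
```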
ToT uses classical algorithms: BFS (Game of 24, Creative Writing - maintains b most promising states per level) or DFS (Mini Crosswords - explores deeply with backtracking when a state is judged unpromising). The algorithm is LLM-independent.
Official
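The DFS variant with evaluator-driven pruning and backtracking can be sketched as follows; all callables are stand-ins for LLM-backed components, and the toy task (building a target string letter by letter) replaces a real crossword:

```python
def dfs(state, generate, is_promising, is_solution, depth, max_depth):
    """Depth-first search with pruning: abandon (backtrack from) any state
    the evaluator judges unpromising."""
    if is_solution(state):
        return state
    if depth == max_depth:
        return None
    for child in generate(state):
        if not is_promising(child):
            continue  # prune: this branch is judged a dead end
        found = dfs(child, generate, is_promising, is_solution, depth + 1, max_depth)
        if found is not None:
            return found
    return None  # backtrack: every branch below this state failed

# Toy task: build the string "abc" one letter at a time.
generate = lambda s: [s + c for c in "abc"]
result = dfs("", generate,
             is_promising=lambda s: "abc".startswith(s),
             is_solution=lambda s: s == "abc",
             depth=0, max_depth=3)
print(result)  # → 'abc'
```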
Implementation
ToT requires multiple LLM calls per step (generation + evaluation × branches), making it 10-100× more expensive than CoT.
Using the same LLM as both generator and evaluator creates circular reasoning risks; evaluation quality depends on model capability.
Evolution
Wei et al. establish CoT as the baseline linear reasoning approach that ToT extends.
Wang et al. propose sampling multiple CoT paths and voting, a precursor to ToT.
Yao et al. formalize tree-structured deliberate reasoning with evaluation and backtracking.
Besta et al. extend ToT to arbitrary graph structures for more flexible reasoning topologies.
Technical details
Hyperparameters (configurable axes)
n_generate_sample (k): number of thoughts generated at each step by the generator. Larger k = broader exploration but linearly higher cost.
n_select_sample (b): number of states kept per BFS level. Classical beam-search parameter - tradeoff between exploration and cost.
n_evaluate_sample: number of LLM calls used to evaluate each state (scores are averaged to reduce variance).
Maximum number of reasoning steps in the tree. Task-dependent (Game of 24: 3, Crosswords: much deeper).
Choice of thought generation strategy: sample (independent sampling) or propose (sequential proposals in one prompt).
Choice of state evaluation strategy: value (independent scoring) or vote (voting over a set).
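These axes can be bundled into a single config object. A sketch, with field names modeled on the official repo's argument naming and defaults that are illustrative (roughly the Game of 24 setup):

```python
from dataclasses import dataclass

@dataclass
class ToTConfig:
    """The configurable axes above as one object; defaults are illustrative."""
    n_generate_sample: int = 1        # k: thoughts generated per step
    n_select_sample: int = 5          # b: states kept per BFS level
    n_evaluate_sample: int = 3        # evaluation calls per state (averaged)
    max_steps: int = 3                # tree depth (task-dependent)
    method_generate: str = "propose"  # 'sample' or 'propose'
    method_evaluate: str = "value"    # 'value' or 'vote'

cfg = ToTConfig()
# Rough LLM-call count per BFS level, assuming each generation call
# yields one candidate: b states × (k generation calls + k·n evaluations).
calls = cfg.n_select_sample * cfg.n_generate_sample * (1 + cfg.n_evaluate_sample)
print(calls)  # → 20
```

Even this small configuration costs ~20 calls per level, which makes the 10-100× overhead over CoT concrete.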
Computational complexity
Time complexity: an exhaustive tree has O(k^T) nodes, giving O(k^T · c_LLM) in the worst case (c_LLM = cost of one LLM call); with beam width b, BFS reduces this to O(b · k · T) LLM calls. Space complexity: O(b · T).
Compute bottleneck
Each ToT step requires many LLM calls: generation of k candidates + evaluation n_evaluate_sample times for each of b states. API latency and cost grow multiplicatively with search parameters.
Execution paradigm
Computation trajectory is conditional - depends on state evaluations. Different inputs produce different tree shapes and different paths.
Search algorithm routes computation to the most promising tree branches based on evaluator scores. Unpromising branches are pruned (not expanded further).
Parallelism
Within a BFS level, generation of k candidates and their evaluation are fully parallel. Between levels the search remains sequential.
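This within-level parallelism maps naturally onto a thread pool, since API-backed LLM calls are I/O-bound and threads overlap their latency. A sketch, where `evaluate` stands in for a per-state LLM scoring call:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_level(evaluate, candidates, max_workers=8):
    """Evaluate all candidates of one BFS level concurrently.

    `evaluate` is a stand-in for an I/O-bound LLM/API scoring call;
    the thread pool overlaps the per-call latency.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so scores stay aligned with candidates.
        return list(pool.map(evaluate, candidates))

# Demo with a cheap stand-in evaluator.
scores = evaluate_level(lambda s: len(s), ["a", "bb", "ccc"])
print(scores)  # → [1, 2, 3]
```

The sequential dependency between levels remains: level t+1 cannot be expanded until level t has been scored and pruned.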
Hardware requirements
ToT is an inference framework - it runs on any backend capable of LLM execution (API, local GPU, TPU). The search algorithm itself is CPU-bound and lightweight.
Local ToT execution benefits from GPUs for parallel batching of k candidates and b states in one LLM inference call.