ToT (Tree of Thoughts)
How it works
ToT decomposes reasoning into three components: (1) Thought Generator - LLM produces candidates for the next reasoning step, (2) State Evaluator - LLM or a separate function evaluates the promise of each state, (3) Search Algorithm - BFS or DFS (or beam search) explores the state tree. The model can backtrack to a previous state and try a different branch.
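The loop that ties the three components together can be sketched as follows. This is a minimal beam-style BFS; `generate` and `evaluate` are stand-ins for LLM-backed calls, and the toy task (reach a target sum by appending digits) replaces a real benchmark:

```python
def tree_of_thoughts(root, generate, evaluate, beam_width, max_steps):
    """Beam-style BFS over reasoning states.

    generate(state) -> list of successor states   (Thought Generator)
    evaluate(state) -> float score                (State Evaluator)
    """
    frontier = [root]
    for _ in range(max_steps):
        # Expand every kept state into its candidate successors.
        candidates = [s for state in frontier for s in generate(state)]
        if not candidates:
            break
        # Keep only the b most promising states per level (pruning).
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=evaluate)

# Toy task: reach a target sum by appending digits 1-3.
target = 7
generate = lambda s: [s + d for d in (1, 2, 3) if s + d <= target]
evaluate = lambda s: -(target - s)  # closer to the target = better
best = tree_of_thoughts(0, generate, evaluate, beam_width=2, max_steps=5)
print(best)  # → 7
```

Backtracking falls out of the frontier mechanics: a state that scores poorly is simply never expanded again, and the search continues from its better-scoring siblings.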
Problem solved
Language models with CoT are confined to left-to-right, token-by-token decisions during inference, which fails on tasks requiring exploration, strategic planning, or where initial decisions are pivotal.
Components
LLM-based component generating k candidate thoughts. Two strategies: 'sample' (independent sampling from the same prompt - for rich thought spaces, e.g. Creative Writing) or 'propose' (sequential proposal of multiple thoughts in one prompt - for constrained spaces, e.g. Game of 24).
Official
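The two generation strategies can be sketched as below, assuming a hypothetical `llm(prompt) -> str` callable; the prompt wording is illustrative, not the paper's:

```python
def generate_thoughts(llm, prompt, k, strategy="sample"):
    """Produce k candidate next thoughts (`llm` is a stand-in for any
    text-completion callable)."""
    if strategy == "sample":
        # Independent samples from the same prompt: suits rich, open-ended
        # thought spaces (e.g. Creative Writing), where duplicates are unlikely.
        return [llm(prompt) for _ in range(k)]
    elif strategy == "propose":
        # One call that proposes k distinct thoughts, split line by line:
        # suits constrained spaces (e.g. Game of 24), where independent
        # samples would waste budget on duplicates.
        proposal = llm(f"{prompt}\nPropose {k} distinct next steps, one per line.")
        return proposal.splitlines()[:k]
    raise ValueError(strategy)

# Demo with a deterministic fake LLM.
fake_llm = lambda p: "a\nb\nc"
print(generate_thoughts(fake_llm, "state", 2, "propose"))  # → ['a', 'b']
```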
LLM evaluates each state in two ways: 'value' (assigns numeric or categorical value to a single state, used in Game of 24) or 'vote' (compares multiple states and votes for the best, used in Creative Writing). The heuristic is soft and depends on LLM quality.
Official
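The two evaluation strategies can be sketched similarly; again `llm` is a stand-in callable and the prompts and score scale are illustrative:

```python
from collections import Counter

def evaluate_states(llm, states, strategy="value", n_votes=5):
    """Score candidate states (`llm` is a stand-in for an LLM call)."""
    if strategy == "value":
        # Score each state independently, e.g. map categorical judgments
        # like sure/maybe/impossible onto numbers.
        scale = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}
        return [scale.get(llm(f"Rate this state: {s}"), 0.0) for s in states]
    elif strategy == "vote":
        # Compare all states jointly; each vote call names the best index,
        # and repeated votes reduce variance of the soft heuristic.
        votes = Counter(llm(f"Vote for the best of: {states}") for _ in range(n_votes))
        return [votes.get(str(i), 0) / n_votes for i in range(len(states))]
    raise ValueError(strategy)

# Demo with a deterministic fake LLM.
fake_llm = lambda p: "sure" if "Rate" in p else "1"
print(evaluate_states(fake_llm, ["s0", "s1"], "value"))  # → [1.0, 1.0]
print(evaluate_states(fake_llm, ["s0", "s1"], "vote"))   # → [0.0, 1.0]
```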
ToT uses classical algorithms: BFS (Game of 24, Creative Writing - maintains b most promising states per level) or DFS (Mini Crosswords - explores deeply with backtracking when a state is judged unpromising). The algorithm is LLM-independent.
Official
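The DFS variant with evaluator-driven pruning and backtracking can be sketched as follows; all callables are stand-ins for LLM-backed components, and the toy task (building a target string letter by letter) replaces a real crossword:

```python
def dfs(state, generate, is_promising, is_solution, depth, max_depth):
    """Depth-first search with pruning: abandon (backtrack from) any state
    the evaluator judges unpromising."""
    if is_solution(state):
        return state
    if depth == max_depth:
        return None
    for child in generate(state):
        if not is_promising(child):
            continue  # prune: this branch is judged a dead end
        found = dfs(child, generate, is_promising, is_solution, depth + 1, max_depth)
        if found is not None:
            return found
    return None  # backtrack: every branch below this state failed

# Toy task: build the string "abc" one letter at a time.
generate = lambda s: [s + c for c in "abc"]
result = dfs("", generate,
             is_promising=lambda s: "abc".startswith(s),
             is_solution=lambda s: s == "abc",
             depth=0, max_depth=3)
print(result)  # → 'abc'
```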
Implementation
ToT requires multiple LLM calls per step (generation + evaluation × branches), making it 10-100× more expensive than CoT.
Using the same LLM as both generator and evaluator creates circular reasoning risks; evaluation quality depends on model capability.
Evolution
Wei et al. establish CoT as the baseline linear reasoning approach that ToT extends.
Wang et al. propose sampling multiple CoT paths and voting, a precursor to ToT.
Yao et al. formalize tree-structured deliberate reasoning with evaluation and backtracking.
Besta et al. extend ToT to arbitrary graph structures for more flexible reasoning topologies.
Technical details
Hyperparameters (configurable axes)
n_generate_sample (k): number of thoughts generated at each step by the generator. Larger k = broader exploration but linearly higher cost.
n_select_sample (b): number of states kept per BFS level. Classical beam-search parameter - tradeoff between exploration and cost.
n_evaluate_sample: number of LLM calls used to evaluate each state (scores are averaged to reduce variance).
Maximum number of reasoning steps in the tree. Task-dependent (Game of 24: 3, Crosswords: much deeper).
Choice of thought generation strategy: sample (independent sampling) or propose (sequential proposals in one prompt).
Choice of state evaluation strategy: value (independent scoring) or vote (voting over a set).
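These axes can be bundled into a single config object. A sketch, with field names modeled on the official repo's argument naming and defaults that are illustrative (roughly the Game of 24 setup):

```python
from dataclasses import dataclass

@dataclass
class ToTConfig:
    """The configurable axes above as one object; defaults are illustrative."""
    n_generate_sample: int = 1        # k: thoughts generated per step
    n_select_sample: int = 5          # b: states kept per BFS level
    n_evaluate_sample: int = 3        # evaluation calls per state (averaged)
    max_steps: int = 3                # tree depth (task-dependent)
    method_generate: str = "propose"  # 'sample' or 'propose'
    method_evaluate: str = "value"    # 'value' or 'vote'

cfg = ToTConfig()
# Rough LLM-call count per BFS level, assuming each generation call
# yields one candidate: b states × (k generation calls + k·n evaluations).
calls = cfg.n_select_sample * cfg.n_generate_sample * (1 + cfg.n_evaluate_sample)
print(calls)  # → 20
```

Even this small configuration costs ~20 calls per level, which makes the 10-100× overhead over CoT concrete.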
Computational complexity
Time complexity: an exhaustive tree has O(k^T) nodes, giving O(k^T · c_LLM) in the worst case (c_LLM = cost of one LLM call); with beam width b, BFS reduces this to O(b · k · T) LLM calls. Space complexity: O(b · T).
Compute bottleneck
Each ToT step requires many LLM calls: generation of k candidates + evaluation n_evaluate_sample times for each of b states. API latency and cost grow multiplicatively with search parameters.
Execution paradigm
Computation trajectory is conditional - depends on state evaluations. Different inputs produce different tree shapes and different paths.
Search algorithm routes computation to the most promising tree branches based on evaluator scores. Unpromising branches are pruned (not expanded further).
Parallelism
Within a BFS level, generation of k candidates and their evaluation are fully parallel. Between levels the search remains sequential.
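This within-level parallelism maps naturally onto a thread pool, since API-backed LLM calls are I/O-bound and threads overlap their latency. A sketch, where `evaluate` stands in for a per-state LLM scoring call:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_level(evaluate, candidates, max_workers=8):
    """Evaluate all candidates of one BFS level concurrently.

    `evaluate` is a stand-in for an I/O-bound LLM/API scoring call;
    the thread pool overlaps the per-call latency.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so scores stay aligned with candidates.
        return list(pool.map(evaluate, candidates))

# Demo with a cheap stand-in evaluator.
scores = evaluate_level(lambda s: len(s), ["a", "bb", "ccc"])
print(scores)  # → [1, 2, 3]
```

The sequential dependency between levels remains: level t+1 cannot be expanded until level t has been scored and pruned.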
Hardware requirements
ToT is an inference framework - it runs on any backend capable of LLM execution (API, local GPU, TPU). The search algorithm itself is CPU-bound and lightweight.
Local ToT execution benefits from GPUs for parallel batching of k candidates and b states in one LLM inference call.