Tree of Thoughts: Teaching LLMs to Backtrack Through Reasoning Paths
Hook
What if your LLM could say “wait, that approach isn’t working” and try a different path—just like you do when solving a hard problem? Tree of Thoughts makes exactly that possible by turning token generation into tree search.
Context
Standard LLM prompting generates tokens left-to-right with no ability to reconsider. Chain-of-thought prompting improved this by making reasoning explicit, but it still follows a single linear path. If the LLM starts down the wrong reasoning trajectory, it’s stuck—there’s no backtracking mechanism. This fundamental limitation becomes painfully obvious on tasks requiring exploration and planning.
Princeton NLP’s Tree of Thoughts (ToT) framework, published at NeurIPS 2023, tackles this by treating reasoning as a search problem. Instead of generating one long chain of tokens, ToT generates intermediate “thoughts” as nodes in a tree, evaluates their promise, and uses search algorithms to systematically explore the solution space. The results are striking: on the Game of 24 puzzle (make 24 from four numbers using basic operations), the paper reported 74% success with GPT-4 (a reproduction in the repository achieved 69%, attributed to randomness in decoding). The framework essentially gives LLMs the ability to “think out loud, try different approaches, and change their mind.”
Technical Insight
ToT’s architecture separates reasoning into four distinct components: thought generation, state evaluation, search strategy, and task definition. This modularity is what makes the framework adaptable across different problem types.
Thought generation operates in two modes. The “sample” mode generates multiple independent thoughts through repeated LLM calls—useful for creative tasks with many valid continuations. The “propose” mode generates sequential thoughts as deliberate next steps—better for mathematical or logical tasks. Here’s how you’d set up a Game of 24 solver (requires Python 3.7+ and OpenAI API key):
```python
import argparse
from tot.methods.bfs import solve
from tot.tasks.game24 import Game24Task

args = argparse.Namespace(
    backend='gpt-4',
    temperature=0.7,
    task='game24',
    naive_run=False,             # run full ToT search, not the naive baseline
    prompt_sample=None,
    method_generate='propose',   # sequential thought generation
    method_evaluate='value',     # independent state evaluation
    method_select='greedy',
    n_generate_sample=1,         # thoughts generated per step
    n_evaluate_sample=3,         # evaluation samples per state
    n_select_sample=5            # top states to keep (beam width)
)

task = Game24Task()
ys, infos = solve(args, task, 900)  # 900 = index of the puzzle instance
print(ys[0])
```
The output shows the step-by-step reasoning:

```
10 - 4 = 6 (left: 5 6 6)
5 * 6 = 30 (left: 6 30)
30 - 6 = 24 (left: 24)
Answer: (5 * (10 - 4)) - 6 = 24
```
State evaluation is where ToT gets clever—the LLM evaluates its own thoughts. The “value” method prompts the LLM to rate each state independently (“sure/likely/impossible”), while the “vote” method has the LLM compare multiple states and select the most promising. This self-evaluation mechanism eliminates the need for external reward models or human feedback during search.
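The “value” mechanism can be sketched in a few lines. The score weights and prompt wording below are illustrative assumptions (the repo maps labels like “sure/likely/impossible” to numeric scores, but the exact values and prompts may differ), and the LLM is stubbed out so the sketch runs standalone:

```python
# Hypothetical sketch of "value" evaluation: the LLM labels a state
# "sure", "likely", or "impossible"; labels map to scores that are summed
# across several samples. Weights and prompt text are illustrative.
SCORE_MAP = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def value_state(llm, state: str, n_evaluate_sample: int = 3) -> float:
    """Score a state by sampling self-evaluations and summing their scores."""
    labels = [llm(f"Can these numbers still reach 24? {state}")
              for _ in range(n_evaluate_sample)]
    return sum(SCORE_MAP.get(label, 0.0) for label in labels)

# Stub LLM standing in for a real API call: always answers "likely".
fake_llm = lambda prompt: "likely"
print(value_state(fake_llm, "5 6 6"))  # 3 samples x 1.0 = 3.0
```

Summing over multiple samples is a cheap way to smooth out the noise of any single self-evaluation; the “vote” method instead asks one comparative question across all candidate states.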
The search strategy uses BFS for most tasks (Game of 24, text generation) and DFS specifically for crosswords. The n_select_sample parameter controls beam width—how many promising states to keep at each level. Smaller values (like 3) make search faster but risk pruning good paths. Larger values (like 10) explore more thoroughly but multiply API costs.
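The BFS-with-beam-width loop described above reduces to a short expand–score–prune cycle. This is a simplified sketch, not the repo’s implementation: `generate` and `evaluate` stand in for the LLM calls, and the toy demo uses integers instead of reasoning states:

```python
# Minimal sketch of BFS with a beam: expand every kept state, score the
# candidates, keep only the top n_select_sample at each level.
def bfs_solve(root, generate, evaluate, steps=3, n_select_sample=5):
    frontier = [root]
    for _ in range(steps):
        candidates = [s for state in frontier for s in generate(state)]
        scored = sorted(candidates, key=evaluate, reverse=True)
        frontier = scored[:n_select_sample]  # prune to the beam width
    return frontier

# Toy demo: states are numbers, "thoughts" add 1 or 2, higher scores better.
best = bfs_solve(0, generate=lambda s: [s + 1, s + 2],
                 evaluate=lambda s: s, steps=3, n_select_sample=2)
print(best)  # [6, 5]
```

The trade-off is visible in the loop: each unit of beam width multiplies the number of candidates generated and evaluated at every level, which is exactly why `n_select_sample` drives both search quality and API cost.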
Adding new tasks requires implementing a task class in tot/tasks/ and defining task-specific prompts in tot/prompts/. The framework’s modularity shines here—you choose whether thoughts should be sampled or proposed, whether states should be valued or voted on, based on your task’s nature. Creative writing benefits from sampling diverse continuations and voting on coherence. Mathematical puzzles need proposed next steps and independent value judgments. The separation of concerns means you’re configuring a reasoning strategy, not rewriting the search algorithm.
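As a rough illustration of that task interface, here is a hypothetical skeleton for a toy custom task. The method names follow the pattern of the repo’s built-in tasks but are approximations that may differ from the actual base class, and the task itself (sorting digits) is invented for the example:

```python
# Hypothetical custom task skeleton, approximating the interface pattern
# the repo's built-in tasks follow. Method names and prompt text are
# illustrative, not the repo's exact API.
class SortDigitsTask:
    """Toy task: sort the digits of a number (illustrative only)."""
    def __init__(self):
        self.data = ["4132", "9071"]  # inputs, indexed like the repo's tasks

    def __len__(self):
        return len(self.data)

    def get_input(self, idx: int) -> str:
        return self.data[idx]

    def test_output(self, idx: int, output: str) -> dict:
        # Return a reward dict scoring the final answer.
        return {"r": int(output == "".join(sorted(self.data[idx])))}

    @staticmethod
    def propose_prompt_wrap(x: str, y: str = "") -> str:
        # Wrap the current state into a "propose the next step" prompt.
        return f"Input: {x}\nSteps so far:\n{y}\nPropose the next swap:"

task = SortDigitsTask()
print(task.test_output(0, "1234"))  # {'r': 1}
```

The prompt-wrapping methods are where the real engineering effort goes; the class structure itself is mostly boilerplate.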
Gotcha
ToT’s biggest limitation is cost and latency. Each search step requires multiple LLM calls—one per thought generation, several for evaluation, multiplied by the beam width. A single Game of 24 problem can easily require many GPT-4 API calls, making the approach computationally expensive for applications with tight budget constraints.
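A back-of-the-envelope estimate makes the multiplier concrete. Under the assumption that each step makes one proposal call per kept state plus `n_evaluate_sample` valuation calls per candidate (the exact call pattern depends on configuration, so treat the numbers as illustrative):

```python
# Rough API-call estimate for one BFS run. Assumes one proposal call per
# kept state and n_evaluate_sample valuation calls per candidate thought;
# all numbers are illustrative, not measured from the repo.
def estimate_calls(steps, beam, candidates_per_state, n_evaluate_sample):
    gen_calls = steps * beam
    eval_calls = steps * beam * candidates_per_state * n_evaluate_sample
    return gen_calls + eval_calls

# e.g. 3 steps, beam width 5, ~8 candidate thoughts per state, 3 eval samples:
print(estimate_calls(3, 5, 8, 3))  # 15 + 360 = 375 calls
```

Even with modest settings, the evaluation calls dominate, which is why tuning `n_evaluate_sample` down is usually the first lever for cutting cost.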
Non-determinism is another real issue. The repository’s own logs note that reproducing the Game of 24 experiment achieved 69% success instead of the paper’s original 74%, solely due to randomness in GPT decoding. This variance is inherent to LLM sampling and means consistent performance isn’t guaranteed across runs.
Task setup requires substantial prompt engineering expertise. The framework isn’t plug-and-play. You need to craft effective prompts for thought generation, design evaluation criteria that help the LLM distinguish good states from bad, and choose appropriate hyperparameters (n_generate_sample, n_evaluate_sample, n_select_sample). The repository provides examples for three specific tasks (Game of 24, creative writing, crossword puzzles), but extending to new domains demands careful experimentation. If your task doesn’t naturally decompose into intermediate “thought” states, forcing it into the ToT framework will feel awkward and won’t deliver the promised gains.
Verdict
Use Tree of Thoughts if you’re tackling complex reasoning tasks where solution quality justifies computational cost—strategic planning, mathematical problem-solving, creative generation with coherence constraints, or research applications where you’re benchmarking reasoning capabilities. It’s particularly valuable when simpler prompting approaches prove insufficient and you need systematic exploration of the solution space. The codebase also serves as an excellent reference implementation for understanding how to combine LLMs with search algorithms.

Consider the computational costs carefully for applications where API expenses or latency matter. Also be aware that the framework assumes strong instruction-following and self-evaluation capabilities from the underlying model (demonstrated with GPT-4 in the paper). For straightforward tasks where simpler prompting strategies already perform well, the added complexity and cost of ToT may not be justified. The framework shines on problems that genuinely need deliberate, exploratory reasoning with the ability to backtrack and try alternative approaches.