Tree of Thoughts: Teaching Language Models to Think by Exploring, Not Just Reasoning
Hook
GPT-4 can write essays and debug code, but ask it to solve the Game of 24 with numbers 4, 9, 10, 13, and it fails 96% of the time with standard prompting. Tree of Thoughts gets it right 74% of the time by doing something deceptively simple: letting the model change its mind.
Context
Chain-of-Thought (CoT) prompting revolutionized LLM capabilities in 2022 by encouraging models to show their work—breaking down complex problems into step-by-step reasoning. But CoT has a fundamental limitation: it's a one-way street. Once the model commits to a reasoning path ("let's try dividing 10 by 4 first"), it can't backtrack if that approach leads nowhere. The model generates tokens sequentially, and each decision narrows the solution space until it either stumbles onto the answer or fails completely.
Tree of Thoughts (ToT), developed by researchers at Princeton NLP and Google DeepMind, treats reasoning as a search problem rather than a generation problem. Instead of following a single chain of thought, ToT explicitly constructs a tree of possible reasoning steps, evaluates which branches look promising, and explores multiple paths simultaneously. When one approach hits a dead end, the system backtracks and tries alternatives—mimicking how humans actually solve hard problems. The framework achieved near-perfect scores on tasks where CoT barely functions, but it comes with a catch: solving a single Game of 24 puzzle might require 50+ LLM API calls instead of one.
Technical Insight
The ToT architecture separates problem-solving into three orthogonal components: thought generation, state evaluation, and search strategy. This modularity is its genius—you can mix and match approaches depending on your task's characteristics.
Thought generation happens through two methods. The "sample" approach asks the LLM to generate multiple independent thoughts ("give me 5 different next steps"), useful when you want diversity. The "propose" approach generates thoughts sequentially with awareness of previous candidates, better when thoughts need to be distinct and systematic. Here's how thought generation looks in practice:
def get_proposals(state, task, k=5):
"""Generate k possible next reasoning steps"""
prompt = f"""
Current state: {state}
Propose {k} possible next steps to solve: {task}
Each step should be a distinct approach.
"""
# For Game of 24: state might be "4 9 10 13"
# Proposals could be:
# 1. "(10 - 4) * 9 - 13"
# 2. "13 * 9 - 10 * 4"
# 3. "(13 - 9) * (10 - 4)"
# etc.
response = llm_call(prompt)
return parse_proposals(response)
def get_samples(state, task, num_samples=5):
"""Sample multiple independent thoughts"""
thoughts = []
for _ in range(num_samples):
prompt = f"Given {state}, suggest one next step for {task}"
thoughts.append(llm_call(prompt))
return thoughts
State evaluation determines which branches to explore. The "value" method asks the LLM to score each thought's promise ("rate this step 1-10 for likelihood of reaching the solution"), returning scalar values for easy comparison. The "vote" method generates multiple evaluations and takes the majority—more robust but more expensive. For Game of 24, voting works better because numerical evaluation is hard; for creative writing, values work well because coherence is gradual.
def evaluate_states_vote(states, task, n_evaluators=3):
"""Use voting to select best state"""
votes = {state: 0 for state in states}
for _ in range(n_evaluators):
prompt = f"""
Which of these partial solutions is most promising?
Task: {task}
Options:
{enumerate_states(states)}
Answer with just the option number.
"""
winner_idx = int(llm_call(prompt))
votes[states[winner_idx]] += 1
return sorted(states, key=lambda s: votes[s], reverse=True)
The search strategy ties everything together. Breadth-First Search (BFS) with beam width explores the most promising branches at each level before going deeper—essential for tasks where early mistakes compound. Depth-First Search tries paths to completion before backtracking, more efficient when you can quickly identify dead ends. The framework implements a clean search abstraction:
def bfs_search(initial_state, task, beam_width=5, max_depth=4):
"""BFS with beam search pruning"""
current_level = [initial_state]
for depth in range(max_depth):
next_level = []
for state in current_level:
if is_solution(state, task):
return state
# Generate thought candidates
proposals = get_proposals(state, task, k=10)
# Create child states
children = [apply_thought(state, p) for p in proposals]
next_level.extend(children)
# Prune to beam_width best states
current_level = evaluate_states_vote(next_level, task)[:beam_width]
return max(current_level, key=lambda s: final_value(s, task))
The real insight is how this framework transforms LLM limitations into strengths. LLMs are excellent at local evaluation ("is this step reasonable?") but poor at global planning. ToT offloads the planning to classical search algorithms while keeping the LLM focused on what it does well: generating creative candidates and judging immediate promise. On Game of 24, this hybrid approach improved accuracy from 4% (CoT) to 74% (ToT with BFS, beam=5). On creative writing with constraints (writing a passage about a topic that ends with specific words), ToT achieved 100% constraint satisfaction versus 50% for vanilla prompting.
The cost structure is transparent: if you generate b proposals per state, evaluate them with v votes, and maintain a beam width of w for d depth levels, you're looking at roughly (b + v) * w * d LLM calls. For the Game of 24, typical runs use 5 proposals, 3 votes, beam width 5, depth 4 = (5+3)54 = 160 API calls per puzzle. At GPT-4 pricing, this turns a $0.01 query into a $1.60 query—acceptable for research, prohibitive for production scale.
Gotcha
The elephant in the room is API cost and latency. While the paper shows dramatic accuracy improvements, running ToT at scale requires serious budget planning. A single Game of 24 solution might need 100+ API calls, and the method's effectiveness degrades significantly with cheaper models. The researchers used GPT-4 for most experiments; when you try this with GPT-3.5-turbo or open-source alternatives, the evaluation quality drops because the model struggles to accurately judge which reasoning paths are promising. You end up with garbage-in-garbage-out: the search explores confidently, but in wrong directions.
Prompt engineering becomes critical and brittle. The paper's impressive results depend heavily on carefully crafted prompts for each task type. The Game of 24 prompts include specific formatting instructions ("write equations using these numbers exactly once") and examples that took iteration to refine. Change the task slightly—say, Game of 25 or different constraints—and you'll need to re-tune prompts extensively. The framework doesn't automatically generalize. Additionally, results show high variance: the paper reports 74% accuracy on Game of 24, but community reproductions range from 65-75% depending on random seeds and exact GPT-4 version. This isn't a deterministic algorithm; it's a stochastic search through LLM output space, and variance is inherent. For production systems requiring reliability, this uncertainty is problematic.
Verdict
Use if: You're tackling genuinely complex reasoning tasks where exploration and backtracking provide clear value—mathematical puzzles, planning problems with constraints, creative generation with hard requirements, or research into LLM reasoning capabilities. The framework excels when solution paths aren't obvious and when you can afford 10-100x more API calls for 2-10x better accuracy. It's also invaluable as a learning tool for understanding advanced prompting techniques and the gap between next-token prediction and deliberate reasoning. Skip if: You need low-latency or cost-effective solutions, your tasks work adequately with Chain-of-Thought (most real-world problems do), you're using models weaker than GPT-3.5, or you need deterministic outputs. For production systems serving users at scale, the cost-accuracy tradeoff rarely justifies ToT over simpler alternatives. This is a research implementation that reveals what's possible when you treat LLMs as search oracles rather than answer generators—use it to understand the frontier, but think twice before deploying it in production.