Tree of Thoughts: Adding Deliberate Search to LLM Reasoning
Hook
What if your LLM could think like a chess player—exploring multiple moves ahead, abandoning dead ends, and deliberately searching for the best solution rather than committing to its first instinct?
Context
Traditional language models are one-shot thinkers. You send a prompt, the model generates a response token by token, and whatever comes out is what you get. Even with techniques like chain-of-thought prompting, the model follows a single linear path from problem to solution. If that path leads to a dead end, tough luck—you’re starting over from scratch.
The Tree of Thoughts paper from researchers at Princeton and Google DeepMind proposed a different approach: treat problem-solving as a search problem. Instead of generating one reasoning chain, generate multiple ‘thoughts’ at each step, evaluate their quality, prune the bad ones, and recursively explore the promising branches. The kyegomez/tree-of-thoughts repository implements this as a plug-and-play Python library, letting you experiment with the technique using your own models.
Technical Insight
The implementation centers on the ToTDFSAgent class, which wraps any language model in a depth-first search algorithm. Here’s how it works in practice:
from tree_of_thoughts import TotAgent, ToTDFSAgent
from dotenv import load_dotenv
load_dotenv()
# The TotAgent wraps your LLM (OpenAI by default)
tot_agent = TotAgent(use_openai_caller=False)
# The DFS agent coordinates the tree search
dfs_agent = ToTDFSAgent(
    agent=tot_agent,
    threshold=0.8,        # Only accept thoughts scoring above 0.8
    max_loops=1,
    prune_threshold=0.5,  # Cut branches scoring below 0.5
    number_of_agents=4,   # Generate 4 candidate thoughts per step
)
initial_state = """
Your task: use 4 numbers and basic arithmetic operations
(+-*/) to obtain 24 in 1 equation
"""
final_thought = dfs_agent.run(initial_state)
print(final_thought)
At each node in the search tree, the system prompts the LLM to generate multiple candidate ‘thoughts’ (controlled by number_of_agents). It then evaluates each thought’s quality; judging from the architecture, this likely involves another LLM call that scores the reasoning from 0 to 1. Thoughts scoring below prune_threshold get discarded immediately. The remaining thoughts become child nodes, and the algorithm recursively explores each branch until it finds a solution scoring above threshold or exhausts the search space.
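The control flow just described can be sketched in a few lines. Note that generate_thoughts and evaluate below are hypothetical stand-ins for the library's LLM calls, not its actual API; this is a simplified illustration of the search loop, not the implementation.

```python
def generate_thoughts(state, n):
    # Stand-in for an LLM call that proposes n candidate next thoughts.
    return [f"{state} -> step{i}" for i in range(n)]

def evaluate(thought):
    # Stand-in for an LLM call that scores a thought in [0, 1].
    return 0.9 if thought.endswith("step0") else 0.3

def dfs(state, threshold=0.8, prune_threshold=0.5,
        number_of_agents=4, depth=0, max_depth=3):
    if depth == max_depth:
        return None
    candidates = generate_thoughts(state, number_of_agents)
    # Prune weak branches immediately, then explore the rest best-first.
    scored = [(evaluate(t), t) for t in candidates]
    scored = [(s, t) for s, t in scored if s >= prune_threshold]
    for score, thought in sorted(scored, reverse=True):
        if score >= threshold:
            return thought          # good enough: stop searching
        result = dfs(thought, threshold, prune_threshold,
                     number_of_agents, depth + 1, max_depth)
        if result is not None:
            return result           # a deeper branch succeeded
    return None                     # dead end: caller backtracks

print(dfs("task"))
```

The key property is the final return None: when every surviving branch under a node fails, control falls back to the parent, which tries its next candidate. That is the backtracking a single-pass LLM lacks.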
The prompt engineering is where things get interesting. Rather than asking the LLM to solve the problem directly, the system uses a meta-prompt that simulates multiple experts collaborating:
“Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they’re wrong at any point then they leave.”
This framing encourages the model to generate diverse reasoning paths and self-correct, effectively using prompt engineering to simulate the multi-agent exploration that the search algorithm requires. The repository includes several variations of this prompt formatted as markdown tables, which structure the output and make it easier to parse multiple expert ‘opinions’ from a single completion.
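To see why the table formatting helps, consider parsing one completion into separate thoughts. The README does not document the exact table layout, so the format below (one "| Expert N | thought |" row per expert) is a hypothetical example, and parse_expert_rows is an illustrative helper, not part of the library.

```python
completion = """
| Expert | Step 1 |
|--------|--------|
| Expert 1 | Try 4 * 6 = 24 |
| Expert 2 | Try (8 - 4) * 6 |
| Expert 3 | This path fails, I leave |
"""

def parse_expert_rows(text):
    thoughts = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header and separator rows; keep one thought per expert row.
        if len(cells) == 2 and cells[0].startswith("Expert "):
            thoughts.append(cells[1])
    return thoughts

print(parse_expert_rows(completion))
```

A structured table like this turns "extract each expert's step" into a trivial string split, whereas free-form prose would need fragile heuristics to separate the voices.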
What makes this a ‘plug-and-play’ implementation is the abstraction. The search parameters (threshold, prune_threshold, number_of_agents) give you knobs to tune the exploration-exploitation tradeoff: higher number_of_agents means broader exploration but more API calls, while stricter thresholds prune more aggressively but risk missing creative solutions.
The claim of ‘70% improvement’ in the repository description refers to results from the original paper on specific benchmarks. The gains appear to come from the algorithm’s ability to backtrack from mistakes—something a single-pass LLM can’t do. When a reasoning path hits a dead end, the search simply explores a different branch rather than failing completely.
Gotcha
The repository’s feature set is incomplete. The README’s TODO section explicitly notes that the max_loops feature needs to be finished in the DFS class, that the BFS search algorithm is incomplete, and that Monte Carlo tree search isn’t implemented yet. The visualization tools mentioned in the TODOs (“Make a function that can intake json and make a tree out of it visually”) don’t exist. This is clearly an early-stage implementation of the paper’s ideas rather than a production-ready library.
Cost is the elephant in the room. The architecture suggests that every node in the search tree likely requires multiple LLM calls: at least one to generate candidate thoughts and potentially one to evaluate them. With number_of_agents=4 and even a shallow tree, you’re making many API calls for a single problem. Based on the documentation, the repository has no built-in caching, no token budgets, and no cost tracking.
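A back-of-envelope estimate makes the scaling concrete. The cost model here is an assumption, not the library's measured behavior: one generation call per expanded node plus one evaluation call per candidate thought, over a full tree of the given depth.

```python
def estimated_calls(number_of_agents, depth):
    # Nodes expanded in a full tree with branching factor b:
    # 1 + b + b^2 + ... + b^(depth - 1)
    nodes = sum(number_of_agents ** d for d in range(depth))
    # Assumed: 1 generation call + number_of_agents evaluation calls per node.
    return nodes * (1 + number_of_agents)

# With 4 candidate thoughts per step and a tree only 3 levels deep:
print(estimated_calls(4, 3))  # 21 nodes * 5 calls = 105 calls
```

Pruning keeps real runs well below this worst case, but the exponential term in number_of_agents and depth is why the knobs in ToTDFSAgent matter so much for your API bill.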
The ‘70% improvement’ claim needs context. That number comes from the repo description (“Elevates Model Reasoning by atleast 70%”) and likely refers to specific tasks in the original paper where tree search helps. The repository provides no benchmarks, no guidance on which problem types benefit from this approach, and no baseline comparisons. You’ll need to run your own experiments to see if the gains justify the cost for your use case.
Verdict
Use if: you’re building research prototypes or experimental applications where complex reasoning matters more than cost, you have a specific problem type (mathematical reasoning, multi-step planning, strategic game-playing) where backtracking helps, and you want to quickly test whether tree search improves your results without implementing the algorithm from scratch. This library succeeds at making an academic paper concrete and runnable.
Skip if: you need production-ready code with complete features and battle-tested reliability (the README’s TODO section shows several incomplete features), you’re cost-sensitive and can’t afford significantly more API calls than a single inference pass, or you’re working on tasks where the baseline LLM already performs well.
For alternative implementations, the README links to the original princeton-nlp/tree-of-thought-llm implementation from the paper’s authors.