Back to Articles

Tree of Thoughts: Solving the LLM's Linear Thinking Problem with Search Algorithms

[ View on GitHub ]

Tree of Thoughts: Solving the LLM's Linear Thinking Problem with Search Algorithms

Hook

Your ChatGPT query costs $0.002, but what if spending $0.20 and 30 seconds could turn a wrong answer into a correct one? That's the trade-off behind Tree of Thoughts, a reasoning framework that treats LLM problem-solving like a chess engine evaluates moves.

Context

Large language models have a fundamental limitation: they generate text autoregressively, token by token, left to right. This sequential generation works brilliantly for creative writing or summarization, but it struggles with problems that require exploration, backtracking, and deliberation. When GPT-4 fails at a logic puzzle, it's not because it lacks the knowledge—it's because it committed to a reasoning path too early and couldn't course-correct.

Researchers at Princeton and Google DeepMind published "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" to address this limitation. Their insight was elegantly simple: instead of forcing the model to generate one answer linearly, ask it to propose multiple reasoning steps, evaluate each option, and explore the most promising paths like a search algorithm traversing a decision tree. The kyegomez/tree-of-thoughts repository translates this academic concept into a practical Python library that wraps OpenAI's API with tree search logic, making the technique accessible to developers without requiring deep RL expertise or model fine-tuning.

Technical Insight

At its core, Tree of Thoughts transforms prompt engineering into a graph search problem. The library's architecture centers on the TotAgent and ToTDFSAgent classes, which orchestrate the generate-evaluate-search cycle. Instead of sending one prompt and receiving one response, these agents make multiple LLM calls: one to generate candidate "thoughts" (partial solutions), another to evaluate their quality, and iterative calls to explore the search tree.

Here's how you'd use it to solve a logic problem:

from tree_of_thoughts import TotAgent

# Initialize with your problem
agent = TotAgent(
    model="gpt-4",
    api_key="your-key",
    strategy="dfs",
    evaluation_threshold=0.6,
    max_depth=4
)

# Define your problem
problem = """
You have four numbers: 2, 8, 8, 14.
Use basic arithmetic (+, -, *, /) to make 24.
"""

# Let the tree search explore solution paths
solution = agent.solve(problem)

Under the hood, the DFS implementation maintains a stack of partial solutions. At each node, it prompts the LLM with a specific template: "Given the current state, propose 3 possible next steps." The model might suggest "(8 / 2) = 4", "(14 - 8) = 6", and "(8 * 2) = 16" as branches. Then comes the critical evaluation phase—the agent sends another prompt asking the model to rate each thought's promise on a scale of 0-1. Thoughts scoring above the threshold (default 0.6) get explored further; others are pruned.

The prompt engineering is where this library shines. Rather than generic queries, it frames problems using a multi-expert pattern:

evaluation_prompt = f"""
As three expert problem solvers, evaluate this reasoning step:

Problem: {original_problem}
Current state: {current_state}
Proposed step: {thought}

Expert 1 (Mathematician): [evaluation]
Expert 2 (Logician): [evaluation] 
Expert 3 (Strategist): [evaluation]

Consensus score (0.0-1.0): 
"""

This multi-persona framing leverages the model's training to simulate diverse perspectives, reducing the risk of confident but wrong evaluations. The consensus mechanism helps filter out low-quality branches early, preventing the tree from exploding exponentially.

The backtracking mechanism is where tree search earns its keep. When a branch dead-ends—say, the math produces 25 instead of 24—the DFS agent doesn't fail entirely. It pops the stack, returns to the previous decision point, and tries the next-highest rated alternative. This is fundamentally different from chain-of-thought prompting, where the model would need to recognize its own error and self-correct (which LLMs do poorly).

One clever architectural choice is the separation between the thought generator and evaluator. By using distinct prompts for generation versus evaluation, the library prevents the model from falling into confirmation bias—it can't just generate thoughts it already believes are good. The evaluation step acts as an independent quality filter, though it still suffers from the model's inherent biases and hallucination risks.

The library also implements a basic memoization strategy to avoid re-exploring identical states. If the search reaches "current numbers: 4, 6, 8" via different paths, it recognizes the duplicate and prunes one branch. This dramatically reduces redundant API calls in problems with commutative operations.

Gotcha

The elephant in the room is cost. A single Tree of Thoughts query can easily make 20-50 LLM API calls for a moderately complex problem. At GPT-4's pricing ($0.03 per 1K input tokens), a problem that takes 30 tree nodes to solve could cost $0.50-2.00 versus $0.002 for a standard prompt. The latency compounds too—expect 30-60 seconds for solutions that a human could verify in seconds. This makes ToT impractical for user-facing applications where responsiveness matters.

The implementation is also incomplete. Critical features like breadth-first search and Monte Carlo tree search are marked as TODO in the codebase. The DFS implementation lacks depth limiting (listed as a TODO), which means poorly-specified problems could spiral into infinite loops. Tree visualization, which would be incredibly valuable for debugging why the search chose certain paths, is entirely absent. You're flying blind unless you add custom logging.

More fundamentally, the "70% improvement" claim is misleading without context. The original research paper tested on specific benchmark tasks like the Game of 24, creative writing, and crossword puzzles—domains where exploration genuinely helps. But for many real-world tasks like question answering, summarization, or code generation, Tree of Thoughts adds overhead without meaningful accuracy gains. The benefit is highly problem-dependent, and the library provides no guidance on which problem types justify the cost. You'll need to run your own A/B tests to validate whether it's worth it for your domain.

Verdict

Use if: You're solving combinatorial puzzles, multi-step mathematical reasoning, strategic planning problems, or anything where backtracking and exploration genuinely improve solution quality, and you have budget for 10-50x API cost increases. This is excellent for research prototypes, competitive programming assistance, or internal tools where correctness matters more than speed. Skip if: You need production-grade performance, have tight latency requirements (under 5 seconds), are working on straightforward Q&A or classification tasks, or lack budget to experiment with high API costs. The incomplete implementation and sparse documentation make this unsuitable for mission-critical deployments. For most applications, well-crafted chain-of-thought prompting or ReAct patterns will give you 80% of the benefit at 5% of the cost.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/ai-dev-tools/kyegomez-tree-of-thoughts.svg)](https://starlog.is/api/badge-click/ai-dev-tools/kyegomez-tree-of-thoughts)