Chain-of-Thought Reasoning for Any LLM in 150 Lines of PHP

Hook

A 150-line PHP script can make a basic LLM correctly count the letter 'r' in 'strawberry'—a task that stumps even GPT-4 without structured reasoning. The secret isn't in the model; it's in the prompt.

Context

When OpenAI released their o1 model with enhanced reasoning capabilities in September 2024, developers assumed these abilities required specialized model architectures with reinforcement learning-trained reasoning processes. The model's ability to 'think' through problems step-by-step seemed like magic—until projects like ReflectionAnyLLM demonstrated that similar chain-of-thought (CoT) behavior could be coaxed from any LLM through nothing more than strategic prompting.

The 'strawberry problem' became the viral benchmark: asking an LLM to count how many times the letter 'r' appears in the word 'strawberry.' Most models confidently answer 'two' when the correct answer is three. They fail because they tokenize text in chunks rather than processing individual characters, and without being prompted to reason explicitly, they rely on pattern matching rather than systematic analysis. ReflectionAnyLLM tackles this class of problems by forcing the model into a structured thinking process before delivering an answer—no special model training required.

Technical Insight

The architecture is deceptively simple: an HTML frontend sends messages to a PHP backend (chat.php) that injects a chain-of-thought system prompt before forwarding requests to any OpenAI-compatible API. The magic happens in that system prompt, which transforms a casual query into a structured reasoning task.

Here's the core prompt structure from chat.php:

$systemMessage = [
    'role' => 'system',
    'content' => 'You are a helpful assistant. When answering, break down your thinking into clear steps. '
               . 'Use the following format:\n\n'
               . '<thinking>\n'
               . 'Step 1: [First consideration]\n'
               . 'Step 2: [Second consideration]\n'
               . '...up to 10 steps as needed\n'
               . '</thinking>\n\n'
               . '<summary>\n'
               . '[Your final answer]\n'
               . '</summary>'
];

This prompt engineering approach works because it exploits a fundamental characteristic of transformer-based LLMs: they generate text token-by-token, and each token influences the probability distribution of subsequent tokens. By forcing the model to output <thinking> tags first, we ensure it commits to explicit reasoning steps before generating the final answer. The model can't skip ahead—it must write out its thought process, which paradoxically improves the quality of that reasoning.

The PHP backend manages conversation history with a rolling window approach:

if (count($history) > 30) {
    $history = array_slice($history, -30);
}

$messages = array_merge(
    [$systemMessage],
    $history,
    [['role' => 'user', 'content' => $userMessage]]
);

This 30-message limit serves two purposes: it prevents token count from exploding with long conversations, and it forces the model to work within a bounded context window. The entire state lives client-side in JavaScript's chatHistory array, eliminating database requirements at the cost of persistence across page refreshes.

The frontend implements a thoughtful UX detail for handling verbose reasoning outputs. Each response gets parsed into collapsible sections:

if (text.includes('<thinking>') && text.includes('</thinking>')) {
    const thinkingContent = text.match(/<thinking>([\s\S]*?)<\/thinking>/)[1];
    const summaryContent = text.match(/<summary>([\s\S]*?)<\/summary>/)[1];
    
    messageDiv.innerHTML = `
        <details>
            <summary>💭 View reasoning steps</summary>
            <pre>${thinkingContent}</pre>
        </details>
        <div class="summary">${summaryContent}</div>
    `;
}

This collapsible design solves a critical usability problem with CoT implementations: reasoning traces can be verbose and clutter the interface, but hiding them entirely defeats the transparency purpose. The <details> element provides progressive disclosure—casual users see clean answers while power users can audit the reasoning.

The API integration code demonstrates provider-agnostic design:

$apiUrl = getenv('API_URL') ?: 'http://localhost:1234/v1/chat/completions';
$apiKey = getenv('API_KEY') ?: 'lm-studio';

$ch = curl_init($apiUrl);
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'Content-Type: application/json',
    'Authorization: Bearer ' . $apiKey
]);

By adhering to OpenAI's API specification, ReflectionAnyLLM works interchangeably with LM Studio, Groq, OpenRouter, Ollama (with appropriate adapters), or actual OpenAI endpoints. This standardization is the project's superpower—it's not tied to any specific vendor or model architecture.

What makes this implementation educational is what it doesn't do. There's no complex agent framework, no retrieval-augmented generation, no vector databases. It proves that structured reasoning is fundamentally a prompting problem, not an architecture problem. The 8B+ parameter recommendation in the documentation reflects an empirical finding: smaller models struggle to maintain coherent multi-step reasoning, not because of the prompting pattern, but because they lack the raw capacity to hold complex chains of logic in their attention mechanisms.

Gotcha

The repository's README contains a critical warning that many developers will overlook: 'This code is for demonstration purposes only and lacks security features.' This isn't boilerplate legal language—it's a genuine limitation. The PHP script performs zero input sanitization, has no rate limiting, and exposes your API key in server-side environment variables without encryption. Deploy this to a public server and you're creating an open proxy that anyone can use to drain your API credits.

The more subtle limitation is that ReflectionAnyLLM doesn't actually make models smarter—it just makes their existing capabilities more accessible through structured prompting. If your underlying model can't solve a problem with unlimited tokens and perfect prompting, this wrapper won't magically enable it. The strawberry example works on 8B+ models because those models already possess the capability to count characters when prompted carefully; the CoT structure just prevents them from taking shortcuts. Expecting this pattern to unlock PhD-level mathematical reasoning from a 7B parameter model will lead to disappointment. You're optimizing for consistency and transparency, not raw intelligence.

Verdict

Use if: You're experimenting with different LLM providers and want to quickly test their reasoning capabilities, building a proof-of-concept that demonstrates CoT prompting to stakeholders, or learning how chain-of-thought patterns work at a fundamental level without framework abstraction getting in the way. This is also ideal if you're running local models through LM Studio and want a simple interface to compare how different quantizations handle structured reasoning. Skip if: You need production-ready code with security, authentication, and rate limiting; you want advanced CoT capabilities like self-correction loops or dynamic step adjustment; or you're working in an environment without traditional PHP hosting (serverless, static sites, containerized microservices). For production applications, migrate the prompting patterns into LangChain or Guidance and implement proper security boundaries.

Chain-of-Thought Reasoning for Any LLM in 150 Lines of PHP

Chain-of-Thought Reasoning for Any LLM in 150 Lines of PHP

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

Chain-of-Thought Reasoning for Any LLM in 150 Lines of PHP

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// RELATED

ds4: The SSD-Streaming Inference Engine That Treats Your Mac's NVMe Like RAM

Nanocoder: The Terminal Coding Agent That Lets You Switch Models Mid-Conversation

Shard: Proving LLM Inference Can Work Across Scattered GPUs and Terrible Internet

Harness-1: Training Search Agents with State Externalization

ds4: The SSD-Streaming Inference Engine That Treats Your Mac's NVMe Like RAM

Nanocoder: The Terminal Coding Agent That Lets You Switch Models Mid-Conversation

Shard: Proving LLM Inference Can Work Across Scattered GPUs and Terrible Internet

// CODEBASE INTELLIGENCE

Best for

Skip when