Browser-Use: Teaching LLMs to See and Click Like Humans

Hook

While you've been writing XPath selectors that break with every UI update, browser-use agents have been looking at websites the same way you do, with vision, and clicking elements they understand rather than merely locate.

Context

Web automation has been stuck in a paradox for two decades. Tools like Selenium and Playwright are powerful but brittle: change a CSS class, restructure your DOM, or add a loading animation, and your carefully crafted selectors shatter. You spend more time maintaining automation scripts than writing new ones. The promise was "automate once, run forever." The reality is "automate once, debug weekly."

The fundamental problem is that traditional automation treats web pages as data structures—traversable trees of elements with selectors as addresses. But modern web applications are visual interfaces built for human eyes, not DOM parsers. They use shadow DOM, dynamically generated IDs, lazy-loaded content, and JavaScript frameworks that rewrite the page on every interaction. Browser-use flips the paradigm: instead of teaching your scripts to navigate increasingly complex HTML structures, it gives language models eyes and hands to interact with browsers the way humans do.

Technical Insight

At its core, browser-use is an orchestration layer that connects three components: Playwright for browser control, vision-capable LLMs for decision-making, and a custom agent loop that translates natural language goals into executable browser actions. The architecture is deceptively simple but reveals careful design choices when you examine the code.

Here's what a basic agent looks like in practice:

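import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI  # reads OPENAI_API_KEY from the environment


async def main():
    # Describe the goal in plain language; the agent plans the browser steps itself.
    agent = Agent(
        task="Find the cheapest flight from SFO to Tokyo in March",
        llm=ChatOpenAI(model="gpt-4o"),  # a vision-capable chat model
    )
    # run() executes the agent loop and returns its result.
    result = await agent.run()
    print(result)


asyncio.run(main())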

Under the hood, the agent breaks this request into a loop: capture the current browser state (screenshot + accessibility tree), send it to the LLM with available actions (click, type, scroll, navigate), parse the LLM's response into Playwright commands, execute them, then repeat. The magic is in how browser-use constructs the context window for the LLM.
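The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not browser-use's actual implementation: `FakeBrowser`, `scripted_llm`, and `run_agent` are hypothetical stand-ins invented here so the example runs without a browser or an API key.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a real browser session."""
    url: str = "about:blank"
    log: list = field(default_factory=list)

    def observe(self) -> dict:
        # A real agent would return a screenshot plus a filtered element tree.
        return {"url": self.url}

    def execute(self, action: dict) -> None:
        self.log.append(action)
        if action["type"] == "navigate":
            self.url = action["url"]


def scripted_llm(task: str, state: dict, history: list) -> dict:
    """Hypothetical policy standing in for the LLM call."""
    if state["url"] == "about:blank":
        return {"type": "navigate", "url": "https://example.com"}
    return {"type": "done", "text": f"Reached {state['url']}"}


def run_agent(task: str, browser: FakeBrowser, llm: Callable,
              max_steps: int = 10) -> Optional[str]:
    history: list = []
    for _ in range(max_steps):
        state = browser.observe()            # 1. capture the current state
        action = llm(task, state, history)   # 2. ask the model for the next action
        history.append(action)
        if action["type"] == "done":         # 3. stop when the model says so
            return action["text"]
        browser.execute(action)              # 4. act, then repeat
    return None  # step budget exhausted without a "done" action


browser = FakeBrowser()
result = run_agent("open example.com", browser, scripted_llm)
print(result)
```

The `max_steps` budget is the important design choice in a loop like this: without it, a model that never emits a terminating action would keep spending tokens indefinitely.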

Rather than sending raw HTML (which would overwhelm token limits and bury relevant content in boilerplate), browser-use creates a hybrid representation. It takes a screenshot for visual understanding and pairs it with a filtered accessibility tree—semantic labels for interactive elements, stripped of styling noise. The LLM sees both what a human would see (visual layout, colors, images) and what a screen reader would announce (button labels, form fields, landmarks).
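To make the "filtered accessibility tree" idea concrete, here is a rough sketch of how such a tree could be reduced to a short, indexed list of interactive elements. The node format, the `INTERACTIVE_ROLES` set, and both function names are assumptions made for illustration; browser-use's real extraction logic may differ.

```python
from typing import Optional

# Assumed set of roles worth showing to the model; purely illustrative.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def flatten_interactive(node: dict, out: Optional[list] = None) -> list:
    """Depth-first walk that keeps only nodes with an interactive role."""
    if out is None:
        out = []
    if node.get("role") in INTERACTIVE_ROLES:
        out.append({
            "index": len(out),  # stable ID the model can refer back to
            "role": node["role"],
            "name": node.get("name", ""),
        })
    for child in node.get("children", []):
        flatten_interactive(child, out)
    return out


def render_for_prompt(elements: list) -> str:
    """Render one compact line per element for the LLM prompt."""
    return "\n".join(f'[{e["index"]}] {e["role"]} "{e["name"]}"' for e in elements)


page = {
    "role": "document",
    "children": [
        {"role": "heading", "name": "Search flights"},
        {"role": "textbox", "name": "From"},
        {"role": "generic", "children": [
            {"role": "textbox", "name": "To"},
            {"role": "button", "name": "Search"},
        ]},
    ],
}

elements = flatten_interactive(page)
print(render_for_prompt(elements))
```

The heading and the wrapper node are dropped, leaving three lines the model can act on by index instead of a full HTML dump.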

The action space is carefully constrained. Instead of giving the LLM free-form Python or JavaScript execution, browser-use defines a small set of high-level primitives:

# Illustrative shape of the structured actions an agent step can return
actions = [
    {"type": "click", "element": 42},  # element ID from the accessibility tree
    {"type": "input_text", "element": 15, "text": "San Francisco"},
    {"type": "scroll", "direction": "down"},
    {"type": "navigate", "url": "https://example.com"},
    {"type": "done", "text": "Task completed: found flight for $847"},  # ends the loop
]

This constraint serves two purposes: it makes the agent's behavior predictable and debuggable (you can log every action), and it forces the LLM to think in terms of user interactions rather than implementation details. The agent doesn't inject JavaScript or manipulate the DOM directly—it clicks buttons and fills forms like a human would.
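One way to enforce a constrained action space is to validate every model response against a whitelist before anything touches the browser. The sketch below assumes the illustrative action shapes shown above; `ALLOWED_ACTIONS`, `validate_action`, and `is_valid` are hypothetical names, not part of the browser-use API.

```python
# Assumed schema: action type -> required fields and their Python types.
ALLOWED_ACTIONS = {
    "click": {"element": int},
    "input_text": {"element": int, "text": str},
    "scroll": {"direction": str},
    "navigate": {"url": str},
    "done": {"text": str},
}


def validate_action(action: dict) -> dict:
    """Return the action unchanged if it matches the schema, else raise ValueError."""
    kind = action.get("type")
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action type: {kind!r}")
    schema = ALLOWED_ACTIONS[kind]
    fields = {k: v for k, v in action.items() if k != "type"}
    if set(fields) != set(schema):
        raise ValueError(f"{kind} expects fields {sorted(schema)}, got {sorted(fields)}")
    for name, expected in schema.items():
        if not isinstance(fields[name], expected):
            raise ValueError(f"{kind}.{name} must be {expected.__name__}")
    return action


def is_valid(action: dict) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_action(action)
    except ValueError:
        return False
    return True


print(is_valid({"type": "click", "element": 42}))
print(is_valid({"type": "eval_js", "code": "document.body.remove()"}))
```

Anything outside the whitelist, such as the hypothetical `eval_js` action here, is rejected before execution, which is what keeps the logged action history a complete record of what the agent did.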

The vision-first approach shines when dealing with modern web patterns that confound traditional automation. Consider a modal dialog that appears after clicking a button. A Selenium script needs explicit waits, checks for element visibility, and fragile selectors for the modal's close button. Browser-use simply shows the LLM a screenshot with a modal in the center of the screen and asks, "What do you see? What should we do next?" The LLM recognizes the modal pattern and decides whether to interact with it or dismiss it based on the task context.

Browser-use also includes a sophisticated state management system. The agent maintains a history of actions, allowing the LLM to learn from recent interactions and avoid loops (clicking the same element repeatedly). It implements retry logic with backoff and can recover from common failures—if a click doesn't change the page state, it tries alternative elements or waits longer for content to load.
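The two behaviors mentioned here, loop avoidance and waiting longer with backoff, can each be expressed as a small helper. This is a sketch of the general technique only; the function names, the three-action window, and the delay values are assumptions rather than browser-use internals.

```python
import time
from typing import Callable


def is_stuck(history: list, window: int = 3) -> bool:
    """True if the last `window` actions are identical (a likely loop)."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(action == recent[0] for action in recent)


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def wait_for_change(get_state: Callable, before, delays: list,
                    sleep: Callable = time.sleep) -> bool:
    """Poll the page state, sleeping longer each time, until it differs from `before`."""
    for delay in delays:
        if get_state() != before:
            return True
        sleep(delay)
    return get_state() != before


# Simulated page that changes on the third poll; sleeps are recorded, not performed.
states = iter(["a", "a", "b"])
slept = []
changed = wait_for_change(lambda: next(states), "a", backoff_delays(3), sleep=slept.append)
print(changed, slept)
```

Passing `sleep` as a parameter keeps the helper testable: the demo records the delays instead of actually waiting.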

For production deployments, browser-use offers a CLI that spawns persistent browser sessions:

browser-use run --task "Monitor this dashboard and alert if CPU exceeds 80%" \
                --model claude-3-5-sonnet-20241022 \
                --headless false \
                --save-session

The --save-session flag is particularly clever: it persists cookies, local storage, and authentication state between runs, so your agent doesn't have to log in every time. This is crucial for tasks that require authenticated access to internal tools or SaaS platforms.

Gotcha

The vision-based approach that makes browser-use powerful also introduces real costs and constraints. Each agent step sends a screenshot and an accessibility tree to your LLM provider, so every action consumes thousands of tokens. Image pricing varies by model and resolution, but at the roughly $0.01 per image that GPT-4 Vision charges, a typical multi-step task needing 10-30 screenshots costs about $0.10-$0.30 in images alone, before any text tokens are counted. A complex automation that runs hourly can rack up meaningful API bills. If you're scraping prices across hundreds of products daily, traditional Playwright with CSS selectors will cost you pennies, while browser-use could cost hundreds of dollars a month.
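A quick back-of-the-envelope calculation shows how the per-image figure compounds. The sketch below uses only the numbers quoted above ($0.01 per image, 10-30 screenshots per task, one run per hour) and counts image costs only; text tokens for the accessibility tree and the model's output would come on top. The function name and the 30-day month are assumptions.

```python
def monthly_image_cost(screenshots_per_task: int, runs_per_day: int,
                       cost_per_image: float = 0.01, days: int = 30) -> float:
    """Estimated monthly spend on screenshots alone, in dollars."""
    return screenshots_per_task * runs_per_day * days * cost_per_image


# One task per hour, at the low and high end of the screenshot range.
low = monthly_image_cost(screenshots_per_task=10, runs_per_day=24)
high = monthly_image_cost(screenshots_per_task=30, runs_per_day=24)
print(f"${low:.2f} to ${high:.2f} per month")
```

Under these assumptions a single hourly task lands at roughly $72 to $216 a month for images, which is consistent with the "hundreds of dollars monthly" estimate once several tasks are running.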

Latency is another trade-off. Each step in the agent loop includes network round-trips to your LLM provider, model inference time, and browser rendering. Where a traditional script clicks through a form in 2-3 seconds, a browser-use agent might take 15-30 seconds for the same task because it's thinking about each field. This isn't a deal-breaker for human-replacing automation ("I need to fill out this vendor form once a week"), but it rules out latency-sensitive use cases like real-time data extraction or high-frequency monitoring.

The open-source version has accuracy limitations that the documentation is refreshingly honest about. Their benchmark shows the self-hosted agent with Claude Sonnet achieving a 22% success rate on complex tasks, while their managed cloud service hits 45%. That's a massive gap driven by cloud-specific features: residential proxy rotation to avoid blocks, CAPTCHA solving, stealth browser fingerprinting, and proprietary prompt engineering. For production workloads where success rate matters, you're likely paying for the cloud service, which shifts browser-use from a free open-source tool to a paid platform with vendor lock-in implications.

Verdict

Use browser-use if you're automating complex workflows on third-party websites you don't control (competitor monitoring, data enrichment from multiple sources, filling forms on legacy portals), where the flexibility to handle UI changes without rewriting scripts justifies the LLM costs. It's ideal when you're replacing human labor that costs $20-50/hour; burning $2-5 in API calls per task is excellent ROI. The vision-based approach excels with dynamic SPAs, sites that aggressively block traditional bots, and workflows requiring nuanced decisions ("click the blue button only if the price is under $100").

Skip browser-use if you're automating sites you control (use APIs instead), need millisecond response times, operate under strict budget constraints where LLM costs are prohibitive, or have deterministic workflows that succeed with traditional Playwright selectors. For high-volume, simple, repetitive tasks on stable websites, classical automation is faster, cheaper, and more reliable.

Browser-use is for the messy middle ground where human-like adaptability is worth paying for.
