Natural Language Playwright Tests with Anthropic’s Computer Use API

Hook

What if you could write browser tests by describing what you want in plain English and have an AI carry them out? tests-ai aims to make that sci-fi premise work today with a single function call.

Context

Browser test automation has a persistent challenge: tests break constantly not because functionality changed, but because a developer renamed a CSS class or restructured the DOM. Traditional Playwright tests rely on selectors like page.click('button.submit-form') that become brittle the moment your design system evolves. You end up spending more time maintaining test selectors than actually testing user flows.

tests-ai takes a radically different approach by wrapping Anthropic’s Computer Use API, the same technology that lets Claude control computers through vision and reasoning. Instead of writing explicit selector chains, you describe user actions in natural language: “click on the counter button and verify that the count goes up.” The library appears to capture screenshots of your page, send them to Claude along with your instruction, receive back concrete mouse and keyboard actions, then execute them through Playwright. It’s a thin but powerful abstraction: you pay for API calls and get tests that need less selector maintenance in return.
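
The screenshot, decide, act cycle described above can be sketched in a few lines. This is an illustration of the general pattern, not the library's actual implementation: the Action type, the PageLike interface, the decide callback (a stand-in for the model call), and the maxSteps limit are all assumptions introduced here. PageLike only names the Playwright page methods the loop relies on.

```typescript
// Hypothetical shapes: none of these names come from tests-ai or the Anthropic SDK.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done"; passed: boolean };

// Structural subset of Playwright's Page that the loop needs.
interface PageLike {
  screenshot(): Promise<Uint8Array>;
  mouse: { click(x: number, y: number): Promise<void> };
  keyboard: { type(text: string): Promise<void> };
}

// Stand-in for the model call: instruction and screenshot in, one action out.
type Decide = (instruction: string, screenshot: Uint8Array) => Promise<Action>;

async function runInstruction(
  instruction: string,
  page: PageLike,
  decide: Decide,
  maxSteps = 10,
): Promise<boolean> {
  for (let step = 0; step < maxSteps; step++) {
    // 1. Capture the current browser state.
    const screenshot = await page.screenshot();
    // 2. Ask the decision function what to do next.
    const action = await decide(instruction, screenshot);
    // 3. Either stop with a verdict or execute the action and loop again.
    if (action.kind === "done") return action.passed;
    if (action.kind === "click") await page.mouse.click(action.x, action.y);
    else await page.keyboard.type(action.text);
  }
  throw new Error(`No verdict after ${maxSteps} steps: "${instruction}"`);
}
```

The step limit is a design choice for the sketch: without one, a model that never reports a verdict would keep the test running until the runner's own timeout fires.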

Technical Insight

The architecture is deliberately minimal—tests-ai provides essentially one function that does all the heavy lifting. The ai() function accepts a natural language instruction and a context object containing your Playwright page and test objects. Under the hood, it orchestrates interaction with Anthropic’s Computer Use API, which analyzes your browser state and returns structured commands that Playwright can execute.

Here’s what a complete test looks like:

import { test } from "@playwright/test";
import { ai } from "tests-ai";

test("click on counter button", async ({ page }) => {
  await page.goto("/");
  await ai("click on the counter button and verify that the count goes up", {
    page,
    test,
  });
});

Notice there are no selectors, no waiting strategies, no explicit assertions. You’re describing user intent, not implementation details. The AI handles the visual recognition of what a “counter button” looks like on your page, where it’s positioned, and how to verify the state change. This works because Anthropic’s Computer Use API is built on Claude’s vision capabilities—it sees your page the way a user would, identifies interactive elements visually, and determines appropriate actions.

The integration surface is intentionally small. You pass both the page and test objects: page is what actions are executed against, and test gives the library context from the Playwright test runner. Setup requires only an environment variable for your Anthropic API key:

export ANTHROPIC_API_KEY=your-api-key

Or in your .env file:

ANTHROPIC_API_KEY=your-api-key
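
How the library reports a missing key isn't covered here, so a small guard that fails fast with a clear message can save a confusing first run. The helper below is hypothetical and not part of tests-ai; it takes the environment as a parameter so it can be checked without touching real process state.

```typescript
// Hypothetical helper, not part of tests-ai: fail fast when the key is absent.
type Env = Record<string, string | undefined>;

function requireAnthropicKey(env: Env): string {
  // Treat an empty or whitespace-only value the same as a missing one.
  const key = env["ANTHROPIC_API_KEY"]?.trim();
  if (!key) {
    throw new Error("ANTHROPIC_API_KEY is not set; export it or add it to .env");
  }
  return key;
}
```

In a Playwright project this could be called with process.env from a global setup file. If nothing in your setup loads the .env file, the variable has to be exported in the shell or loaded explicitly before the guard runs.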

This approach solves a real pain point in modern frontend development where component libraries, CSS-in-JS solutions, and frequent redesigns make selector-based tests a maintenance burden. When your button’s class changes from btn-primary to button--primary or gets replaced by a design system component, traditional tests break. Vision-based tests should continue working because they identify elements by appearance and spatial relationships, not fragile DOM queries.

The trade-off is that you’re outsourcing test determinism to an AI model. Each test execution makes API calls to Anthropic, which adds network latency and makes the run depend on model availability. The Computer Use API has to analyze your browser state, reason about element locations, and generate actions, so each step likely takes longer than a direct selector lookup.

Gotcha

The elephant in the room is non-determinism. Traditional Playwright tests execute the same way every time—page.click('#submit') either finds the element or throws a predictable error. Vision-based AI testing introduces uncertainty. Claude might interpret “click the submit button” differently across runs if your UI has multiple buttons or ambiguous layouts. A test that passes five times might fail on the sixth because the model made a slightly different inference about which element matches your natural language description.
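
One way to put a number on that uncertainty is to repeat the same step several times and count the outcomes before trusting it in a suite. The helper below is a hypothetical sketch, not part of tests-ai or Playwright; step stands for whatever you want to measure, such as a page load followed by an ai() call.

```typescript
// Hypothetical helper: repeat one async step and tally how often it succeeds.
interface FlakeReport {
  runs: number;
  passes: number;
  failures: string[]; // one message per failed run
}

async function measurePassRate(
  step: () => Promise<void>,
  runs: number,
): Promise<FlakeReport> {
  const report: FlakeReport = { runs, passes: 0, failures: [] };
  for (let i = 0; i < runs; i++) {
    try {
      await step();
      report.passes++;
    } catch (err) {
      // Keep the message so differing failure modes stay visible.
      report.failures.push(err instanceof Error ? err.message : String(err));
    }
  }
  return report;
}
```

Playwright's own runner offers a similar check without extra code: npx playwright test --repeat-each=10 runs every test ten times. Keep in mind that each repetition of an AI-driven step is billed like any other call.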

Cost and latency likely make this approach challenging for large test suites in CI/CD pipelines. API calls for screenshot analysis and action generation add both time and cost to each test execution. A suite that runs quickly with traditional selectors will take longer with API round-trips for each ai() call, and costs will accumulate with usage. Teams running tests on every commit should factor in both execution time and API expenses.
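
A back-of-the-envelope estimate makes that budgeting concrete. Every field below is a placeholder to fill in from your own measurements and billing data; none of the names or numbers come from tests-ai or from published pricing, and the time figure assumes the calls run one after another.

```typescript
// All inputs are placeholders to be measured; none are published figures.
interface SuiteProfile {
  aiCalls: number;             // ai() invocations across the suite
  roundTripsPerCall: number;   // screenshot/action exchanges per ai() call
  secondsPerRoundTrip: number; // observed latency per exchange
  usdPerRoundTrip: number;     // observed API cost per exchange
  runsPerDay: number;          // CI executions per day
}

interface Overhead {
  secondsPerRun: number; // added wall-clock time if calls run serially
  usdPerRun: number;
  usdPerDay: number;
}

function estimateOverhead(p: SuiteProfile): Overhead {
  const roundTrips = p.aiCalls * p.roundTripsPerCall;
  const usdPerRun = roundTrips * p.usdPerRoundTrip;
  return {
    secondsPerRun: roundTrips * p.secondsPerRoundTrip,
    usdPerRun,
    usdPerDay: usdPerRun * p.runsPerDay,
  };
}
```

The useful part is the multiplication itself: the number of ai() calls, the exchanges each one needs, and the number of CI runs per day all compound, so a small per-call figure can still add up quickly on a per-commit pipeline.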

The library’s focused scope means some operational details aren’t documented. Configuration options for retries, timeouts, or debugging when the AI misinterprets instructions aren’t mentioned in the README. When a test fails, determining whether it’s a genuine bug or a misunderstood prompt becomes a harder diagnostic challenge than with traditional selector-based tests.
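
Until such options are documented, one workaround is to wrap the call yourself so a failure always leaves evidence behind. The wrapper below is a hypothetical sketch, not a tests-ai feature: run, takeScreenshot, and report are all supplied by the caller, and the original error is rethrown unchanged so the test still fails.

```typescript
// Hypothetical wrapper: on failure, record the prompt and a screenshot, then rethrow.
interface Diagnostics {
  instruction: string;
  screenshot: Uint8Array;
  cause: unknown;
}

async function withDiagnostics<T>(
  instruction: string,
  run: (instruction: string) => Promise<T>,
  takeScreenshot: () => Promise<Uint8Array>,
  report: (d: Diagnostics) => Promise<void>,
): Promise<T> {
  try {
    return await run(instruction);
  } catch (cause) {
    // Capture what the page looked like at the moment of failure.
    await report({ instruction, screenshot: await takeScreenshot(), cause });
    throw cause;
  }
}
```

In a real test, run could forward the text to ai() with your page and test objects, takeScreenshot could call page.screenshot(), and report could attach the image and the prompt to the Playwright report. Seeing the exact instruction next to the final screenshot helps separate a genuine regression from a misread prompt.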

Verdict

Use if: You’re prototyping rapidly and need tests that won’t break during heavy UI iteration, you’re testing applications where adding test selectors is impractical, or you’re exploring AI-driven testing approaches. This could work well for exploratory testing sessions or teams that genuinely struggle with selector maintenance across frequent design changes.

Skip if: You’re building production CI/CD pipelines where determinism, speed, and cost predictability matter most. Traditional Playwright with well-structured selectors gives you reliability at zero marginal cost per test run.

Consider carefully if your test suite has more than a few dozen cases: the API costs and execution time may outweigh the maintenance benefits. Also evaluate whether sending page screenshots to third-party APIs aligns with your data governance requirements.
