Shortest: Write End-to-End Tests in Plain English Using AI

Hook

What if you could write await shortest('user logs in, adds item to cart, and checks out') instead of 50 lines of Playwright selectors? That's exactly what Shortest does—and it's more production-ready than you'd think.

Context

End-to-end testing has always been a tax on developer productivity. You write a beautiful feature in hours, then spend another hour hunting down CSS selectors, waiting for elements to load, and handling race conditions in your Playwright or Cypress tests. The maintenance burden is even worse: rename one button, and you're updating selectors across a dozen test files. This friction means teams either skimp on E2E coverage or hire dedicated QA engineers just to maintain brittle test suites.

Shortest attacks this problem from a radical angle: what if tests didn't need selectors at all? Instead of telling the computer exactly how to click through your app, you describe what the user should do in plain English. An AI agent—specifically Anthropic's Claude—figures out the implementation details. The promise is compelling: test coverage that's as easy to write as a feature specification, with no selector maintenance when your UI changes. The risk is equally obvious: delegating test execution to a non-deterministic AI service sounds like a recipe for flaky tests and surprise API bills.

Technical Insight

Shortest is architecturally straightforward, which is part of its genius. It's a thin wrapper around Playwright that intercepts your natural language test descriptions, sends them to Claude via Anthropic's API, and translates Claude's responses into Playwright commands. Here's what a test looks like:

// Assumes Playwright's test runner supplies `test`, `expect`, and the `page` fixture
import { test, expect } from '@playwright/test';
import { shortest } from '@antiwork/shortest';

test('complete checkout flow', async ({ page }) => {
  // Each call hands one plain-English step to the AI agent
  await shortest('navigate to the product page and add item to cart');
  await shortest('proceed to checkout and enter shipping details');
  await shortest('complete payment with test credit card');

  // Mix AI tests with traditional assertions
  const orderConfirmation = page.locator('.order-confirmation');
  await expect(orderConfirmation).toContainText('Thank you for your order');
});

Under the hood, each shortest() call sends the page's DOM snapshot and the natural language instruction to Claude. The AI returns a series of Playwright actions—clicks, fills, navigations—which Shortest executes. The framework maintains conversation context across chained calls, so Claude remembers what happened in previous steps. This is critical for multi-step flows where later instructions reference earlier actions.
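
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Shortest's actual implementation: the `Action`, `Turn`, `Model`, and `Browser` names are hypothetical, the model is a plain function rather than a real Anthropic API call, and the browser is an interface that Playwright would sit behind in practice.

```typescript
// Hypothetical types for illustration; none of these names come from Shortest.
type Action =
  | { kind: 'click'; target: string }
  | { kind: 'fill'; target: string; value: string }
  | { kind: 'done' };

// One exchange with the model, kept so later steps can see earlier ones.
interface Turn {
  instruction: string;
  snapshot: string;
  action: Action;
}

// Stand-in for the LLM call: instruction + page snapshot + history in, next action out.
type Model = (instruction: string, snapshot: string, history: Turn[]) => Promise<Action>;

// Stand-in for the browser driver.
interface Browser {
  snapshot(): Promise<string>;
  perform(action: Action): Promise<void>;
}

// Runs one plain-English step: ask the model for an action, execute it,
// and repeat with a fresh snapshot until the model reports it is done.
async function runStep(
  instruction: string,
  model: Model,
  browser: Browser,
  history: Turn[],
  maxActions = 10,
): Promise<Action[]> {
  const performed: Action[] = [];
  for (let i = 0; i < maxActions; i++) {
    const snapshot = await browser.snapshot();
    const action = await model(instruction, snapshot, history);
    history.push({ instruction, snapshot, action });
    if (action.kind === 'done') return performed;
    await browser.perform(action);
    performed.push(action);
  }
  // Guard against a model that never decides the step is complete.
  throw new Error(`"${instruction}" did not finish within ${maxActions} actions`);
}
```

Passing the same `history` array to consecutive `runStep` calls is what gives later instructions access to earlier ones in this sketch. The `maxActions` cap is a design choice of the sketch, added so a confused model cannot loop forever.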

The real sophistication emerges in how Shortest handles state and context. The framework supports lifecycle hooks (beforeAll, afterEach) where you can inject programmatic logic:

import { config } from '@antiwork/shortest';

export default config({
  baseUrl: 'https://app.example.com',

  beforeAll: async ({ page }) => {
    // Programmatic setup that's too complex for AI
    await page.route('**/analytics/**', route => route.abort());
    // injectTestData is an app-specific helper, assumed to be defined elsewhere
    await injectTestData(page);
  },

  afterEach: async ({ page }, testInfo) => {
    // Custom screenshot logic
    if (testInfo.status === 'failed') {
      // Replace spaces, slashes, and other unsafe characters in the file name
      const safeTitle = testInfo.title.replace(/[^\w-]+/g, '_');
      await page.screenshot({ path: `failure-${safeTitle}.png` });
    }
  }
});

Shortest also supports callback functions for complex assertions that AI shouldn't handle. After an AI-driven action completes, you can pass a callback to verify state:

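await shortest(
  'add premium subscription to account',
  async ({ page }) => {
    // Locators are lazy, so no await here; the assertion is what waits
    const badge = page.locator('.premium-badge');
    await expect(badge).toBeVisible();

    // Verify API-level changes using the browser context's cookies;
    // the relative path is resolved against the current page URL
    const url = new URL('/api/user/subscription', page.url()).toString();
    const response = await page.request.get(url);
    const data = await response.json();
    expect(data.tier).toBe('premium');
  }
);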

This hybrid approach—AI for the tedious navigation, programmatic code for precise verification—is what makes Shortest practical rather than just a demo. The framework recognizes that not every test concern should be delegated to an LLM.

Integration testing gets interesting with Shortest's support for external services. The repository includes examples for GitHub 2FA authentication and Mailosaur email verification, both notoriously annoying to automate. For email testing, you can write:

await shortest(
  'sign up with email test@mailosaur.io and verify the confirmation link',
  async ({ page }) => {
    // After AI completes signup, verify email arrived
    const email = await mailosaurClient.messages.get(serverId, {
      sentTo: 'test@mailosaur.io'
    });
    expect(email.subject).toContain('Confirm your account');
  }
);

The AI handles navigating the signup form and entering credentials, while your callback verifies the email service integration. This division of labor—AI for UI choreography, code for system verification—is the pattern that makes Shortest effective.

One architectural decision worth noting: Shortest doesn't try to hide Playwright. You still configure browsers, viewport sizes, and test parallelization exactly as you would in vanilla Playwright. The natural language layer is additive, not a replacement. This means you can incrementally adopt Shortest by converting flaky selector-heavy tests one at a time, while keeping deterministic tests in traditional Playwright syntax.

Gotcha

The elephant in the room is non-determinism. AI agents don't always interpret instructions identically across runs. "Click the submit button" might work 95% of the time, but if your page has multiple buttons with similar labels, Claude might occasionally click the wrong one. Debugging these failures is frustrating because you can't just inspect the code—you're debugging Claude's interpretation. The framework doesn't provide a dry-run mode to see what actions the AI plans to take before executing them, which would help catch ambiguous instructions.
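
One way to put a number on that non-determinism is to run the same step repeatedly and record how often it passes. The helper below is a hypothetical sketch, not part of Shortest: `measurePassRate` and `FlakeReport` are names invented here, and `step` is assumed to be any async function that throws on failure, such as a wrapper around a single shortest() call.

```typescript
// Hypothetical helper for illustration; not provided by Shortest.
interface FlakeReport {
  runs: number;
  passes: number;
  passRate: number;
  failures: string[]; // error messages from the failed runs
}

// Runs `step` sequentially `runs` times and tallies the outcomes.
// Sequential execution avoids firing many API requests in parallel.
async function measurePassRate(
  step: () => Promise<void>,
  runs: number,
): Promise<FlakeReport> {
  if (!Number.isInteger(runs) || runs <= 0) {
    throw new RangeError('runs must be a positive integer');
  }
  let passes = 0;
  const failures: string[] = [];
  for (let i = 0; i < runs; i++) {
    try {
      await step();
      passes++;
    } catch (err) {
      failures.push(err instanceof Error ? err.message : String(err));
    }
  }
  return { runs, passes, passRate: passes / runs, failures };
}
```

An instruction whose pass rate drops after a UI change is a candidate for rewording (for example, naming the button's section as well as its label) or for conversion to an explicit selector. Each run still costs an API call, so this is better suited to occasional audits than to every CI run.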

Latency and cost are the other landmines. Each shortest() call makes an API request to Anthropic, adding 1-3 seconds per step. A test suite with 50 AI-driven steps therefore spends roughly 50 to 150 seconds on API latency alone, before accounting for actual page loads and interactions. At an assumed blended rate of $0.025 per 1K tokens (Anthropic's actual rates vary by model and differ for input and output tokens), a modest test suite might cost $5-10 per full run. That's sustainable for nightly CI runs but painful if you're running tests on every commit. The framework doesn't currently support caching or local LLM fallbacks, so you're locked into Anthropic's pricing and availability. If their API has an outage, your entire test suite is down.
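
The arithmetic behind these figures is easy to reproduce for your own suite. The sketch below is a back-of-the-envelope estimator, not a billing tool: `estimateSuite` is a hypothetical name, the 1-3 second latency range and the $0.025 per 1K tokens rate are the article's figures, and the 4,000 tokens per step is an assumption chosen here so that 50 steps land at the low end of the $5-10 range.

```typescript
interface SuiteEstimate {
  minLatencySeconds: number;
  maxLatencySeconds: number;
  totalTokens: number;
  costUsd: number;
}

// Rough per-run estimate of API latency and token cost for an AI-driven suite.
function estimateSuite(
  steps: number,
  tokensPerStep: number, // assumption: the article does not state this
  usdPer1kTokens: number,
  latencySeconds: { min: number; max: number } = { min: 1, max: 3 },
): SuiteEstimate {
  const totalTokens = steps * tokensPerStep;
  return {
    minLatencySeconds: steps * latencySeconds.min,
    maxLatencySeconds: steps * latencySeconds.max,
    totalTokens,
    costUsd: (totalTokens / 1000) * usdPer1kTokens,
  };
}

// 50 steps at an assumed 4,000 tokens each, priced at the article's rate.
const estimate = estimateSuite(50, 4000, 0.025);
console.log(estimate);
```

With these inputs the suite spends 50 to 150 seconds waiting on the API and uses 200,000 tokens, which comes to about $5 per run. Running it on every commit at, say, 20 commits a day would multiply that figure accordingly, which is why the per-commit case is the painful one.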

Verdict

Use if: You're prototyping a new application and want to establish smoke test coverage fast, your team includes non-technical stakeholders who could write acceptance tests in natural language, or you're maintaining a legacy app where the DOM structure changes frequently and selector maintenance is a time sink. Shortest shines for happy-path user flows and integration testing where small variations in execution don't matter.

Skip if: You need deterministic, millisecond-precise tests (like performance regression testing), your CI budget is tight, your application has complex state machines where ambiguous AI actions could cause cascading failures, or you're testing in an air-gapped environment without internet access.

Best used as a complement to traditional E2E tests: let AI handle the tedious form-filling and navigation, while keeping critical assertions and edge cases in explicit Playwright code.
