
LLM Scraper: When AI Replaces CSS Selectors in Web Scraping


Hook

What if you could scrape a website by describing what you want in plain English instead of hunting through DevTools for the right CSS selector? That’s exactly what happens when you point an LLM at a webpage’s DOM.

Context

Traditional web scraping has always been a game of whack-a-mole. You spend hours crafting the perfect CSS selectors or XPath expressions, deploy your scraper, and three weeks later the site redesigns and everything breaks. Maintaining scrapers is like maintaining a house of cards—one structural change to the target site and you’re back in the Chrome inspector, hunting for new selectors.

The brittleness gets worse with dynamic content. Modern web applications render everything client-side, hide meaningful data behind user interactions, and structure their markup inconsistently across pages. You end up writing increasingly complex Playwright scripts with hard-coded waits, click sequences, and fragile element queries. LLM Scraper takes a fundamentally different approach: instead of instructing the computer exactly where to find data, you describe what data you want and let a language model figure out the extraction logic. Built on Playwright for browser automation and Vercel AI SDK 6 for LLM orchestration, it transforms any webpage into structured JSON by having GPT-4, Claude, or Llama read the page content and extract fields matching your Zod schema.

Technical Insight

System architecture (auto-generated diagram): user code supplies a Playwright page and a Zod schema to the LLM Scraper core, which drives Playwright for browser automation, formats the raw page (HTML, markdown, text, or screenshot), and sends the formatted content plus schema through the Vercel AI SDK to an LLM provider (OpenAI/Anthropic/Google), returning validated structured JSON. An optional code generator emits a standalone Playwright script instead.

The architecture is deceptively simple but reveals clever design decisions when you dig into the implementation. At its core, LLM Scraper is a thin orchestration layer between Playwright’s browser automation and Vercel AI SDK’s language model abstraction. You give it a Playwright page object, a Zod schema defining your desired output structure, and optionally specify a formatting mode. The library converts the page content into LLM-friendly formats—HTML, markdown, plain text via Mozilla’s Readability.js, or even screenshots for vision models—then sends that representation to your chosen LLM with instructions to extract data matching your schema.

Here’s a real example from the README that extracts Hacker News stories. Notice how the schema definition directly maps to the structured output:

import { chromium } from 'playwright'
import { z } from 'zod'
import { Output } from 'ai'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'

const browser = await chromium.launch()
const llm = openai('gpt-4o')
const scraper = new LLMScraper(llm)

const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')

const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

const { data } = await scraper.run(page, Output.object({ schema }), {
  format: 'html',
})

console.log(data.top)

The Zod schema serves double duty: it defines the TypeScript types for compile-time safety and provides runtime validation of the LLM’s output. The .describe() method isn’t just documentation—it becomes part of the prompt sent to the LLM, giving it semantic context about what you’re extracting. This is prompt engineering hidden behind a schema definition API.
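To make the "prompt engineering hidden behind a schema API" idea concrete, here is a minimal sketch of how field descriptions could be folded into extraction instructions. This is a hypothetical helper for illustration, not llm-scraper's actual internals; `buildExtractionPrompt` and `FieldSpec` are invented names:

```typescript
// Hypothetical sketch: fold schema field descriptions into a prompt.
// NOT the library's real implementation — an illustration of the idea.
type FieldSpec = { type: string; description?: string }

function buildExtractionPrompt(fields: Record<string, FieldSpec>): string {
  const lines = Object.entries(fields).map(([name, spec]) => {
    const hint = spec.description ? ` (${spec.description})` : ''
    return `- ${name}: ${spec.type}${hint}`
  })
  return [
    'Extract the following fields from the page content as JSON:',
    ...lines,
  ].join('\n')
}

console.log(
  buildExtractionPrompt({
    top: { type: 'array of 5 objects', description: 'Top 5 stories on Hacker News' },
  })
)
```

The point is that the `.describe()` string travels with the schema into the model's context, so a well-chosen description can meaningfully change extraction quality.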

The six formatting modes reveal performance and cost trade-offs. The html mode preprocesses the DOM to remove scripts, styles, and irrelevant elements before sending it to the LLM. The text mode goes further, using Readability.js to extract just the article content, dramatically reducing token usage for content-heavy pages. The image mode takes a screenshot and uses vision models like GPT-4o or Claude 3.5 Sonnet, useful when the visual layout contains semantic meaning that HTML structure doesn’t capture. You can even define custom formatters—pass a function that takes the Playwright page and returns whatever string representation makes sense for your use case.
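As a rough illustration of the kind of cleanup the html mode performs, here is a standalone sketch that strips scripts, styles, and comments from raw markup before it would be sent to a model. The real formatter operates on the live DOM through Playwright; this regex-based `stripNoise` is a simplified plain-string approximation:

```typescript
// Simplified sketch of pre-LLM markup cleanup — a plain-string
// approximation of what a DOM-aware formatter would do.
function stripNoise(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop stylesheets
    .replace(/<!--[\s\S]*?-->/g, '')            // drop comments
    .replace(/\s{2,}/g, ' ')                    // collapse whitespace
    .trim()
}

const raw =
  '<html><head><style>body{margin:0}</style></head>' +
  '<body><script>track()</script><h1>Title</h1></body></html>'
console.log(stripNoise(raw))
// Scripts, styles, and comments are gone; only content markup remains,
// which directly cuts the token count billed per extraction.
```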

The streaming API (scraper.stream()) deserves attention because it exposes the incremental nature of LLM generation. Instead of blocking until the entire object is complete, you get partial updates as the model generates each field. For a schema with an array of items, you’d see the array populate item-by-item. This turns scraping into a progressive operation where you can show users partial results or start processing data before the full extraction completes.
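The consumption pattern looks roughly like this. The exact stream shape is an assumption on my part, so a mock async generator (`fakeStream`) stands in for `scraper.stream()` to show the partial-object flow:

```typescript
// Mock of the incremental flow a streaming extraction API exposes.
// `fakeStream` stands in for scraper.stream(); each yield is a fuller
// partial object than the last.
type Story = { title: string; points: number }

async function* fakeStream(): AsyncGenerator<{ top: Story[] }> {
  yield { top: [{ title: 'First story', points: 120 }] }
  yield {
    top: [
      { title: 'First story', points: 120 },
      { title: 'Second story', points: 87 },
    ],
  }
}

async function main() {
  let latest: { top: Story[] } | undefined
  for await (const partial of fakeStream()) {
    latest = partial
    // Render or process partial results as they arrive.
    console.log(`received ${partial.top.length} item(s) so far`)
  }
  console.log('final titles:', latest?.top.map((s) => s.title))
}

main()
```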

The code generation feature is the most interesting architectural choice. Instead of using an LLM for every scrape, you can call scraper.generate() once to produce a standalone Playwright script that extracts data according to your schema. The LLM analyzes the page structure and writes actual JavaScript selector code that Playwright can execute:

const { code } = await scraper.generate(page, Output.object({ schema }))
const result = await page.evaluate(code)
const data = schema.parse(result)

This generated code doesn’t make LLM calls—it’s a traditional selector-based scraper that the LLM wrote for you. You get the best of both worlds: AI-powered scraper creation without ongoing API costs or latency. The trade-off is brittleness—the generated code will break when the site structure changes, just like hand-written selectors. But for stable sites or when you’re scraping thousands of pages with identical structure, this approach makes economic sense.
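Back-of-the-envelope, the economics look like this. All figures are illustrative assumptions (a mid-range per-call cost, and the guess that one generate() call costs about the same as one run() call), not measured numbers:

```typescript
// Illustrative break-even: one-time generated scraper vs. per-page LLM calls.
// Both dollar figures below are assumptions for the sake of the arithmetic.
const perPageLLM = 0.05      // assumed cost of one scraper.run() call
const oneTimeGenerate = 0.05 // assumed cost of one scraper.generate() call

function breakEvenPages(perPage: number, oneTime: number): number {
  // Smallest page count at which generate-once beats calling the LLM per page.
  return Math.ceil(oneTime / perPage) + 1
}

console.log(breakEvenPages(perPageLLM, oneTimeGenerate)) // 2

// At scale on identically structured pages, the gap is dramatic:
console.log(`per-call total for 10,000 pages: $${(10_000 * perPageLLM).toFixed(2)}`) // $500.00
console.log(`generate-once total: $${oneTimeGenerate.toFixed(2)}`)
```

Under these assumptions the generated scraper pays for itself by the second page, which is why the feature matters for bulk scraping of identical layouts.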

Gotcha

The elephant in the room is cost. Every call to scraper.run() sends the page content to an LLM API, incurring both latency and charges. For GPT-4, you’re looking at roughly $0.01-0.10 per page depending on content size and output complexity. That’s acceptable for one-off data extraction or prototyping, but it makes high-volume production scraping prohibitively expensive. If you’re scraping 10,000 product pages daily, you could rack up $1,000+ in API costs—far more than the server resources for running traditional scrapers.

Latency compounds the cost problem. Traditional CSS selectors execute in milliseconds. LLM Scraper requires a full LLM inference cycle, taking 2-10 seconds per page even with fast models like GPT-4o. The streaming mode helps perceived performance but doesn’t change the fundamental throughput ceiling. And you’re running a full Chromium instance through Playwright for every scrape, consuming 200-500MB of memory per browser context. This makes serverless deployments challenging—you’ll blow through memory limits quickly, and cold starts become painful with Playwright’s initialization overhead.

Then there’s accuracy unpredictability. LLMs are probabilistic, so you might get different extraction results for the same page across runs. Complex schemas with nested objects, ambiguous field names, or pages where the desired data isn’t clearly delineated can produce inconsistent results. The Zod validation catches type errors, but it won’t catch semantic errors like extracting the wrong price or confusing author names with usernames. You lose the determinism of selector-based scraping—for better and worse. When it works, it adapts to layout changes gracefully. When it fails, debugging why the LLM misinterpreted the page structure is harder than debugging a broken CSS selector.

Verdict

Use LLM Scraper if you’re building a research tool that scrapes dozens or hundreds of diverse websites where writing maintainable selectors for each would be prohibitive, if you’re prototyping a data extraction pipeline and need results in hours instead of days, if the target sites change layouts frequently and you’d rather pay API costs than engineering time, or if you’re doing one-off extractions where setup time matters more than per-page cost. The code generation feature specifically makes sense when you need to scrape large numbers of pages with identical structure—use the LLM once to generate the scraper, then run it deterministically.

Skip LLM Scraper if you’re building high-volume production scrapers where cost and speed matter (stick with Cheerio or raw Playwright), if you’re scraping simple static sites where CSS selectors are trivial to write and maintain, if you need guaranteed deterministic results for compliance or audit purposes, or if you’re working in serverless environments with tight memory constraints.

For most teams, the sweet spot is hybrid: use LLM Scraper for the long tail of infrequently-scraped sites, and write traditional selectors for your high-volume targets.
