Teracrawl: Why the #1 Scraping Benchmark Winner Depends on External Services

Hook

A web scraper just claimed the #1 spot on independent benchmarks with 82.1% coverage—yet its core search functionality requires running a separate external service. This architectural choice reveals competing philosophies in the LLM-focused web scraping space.

Context

Large Language Models are only as good as the data you feed them. RAG systems, AI agents, and search-augmented chat interfaces all share a common bottleneck: turning messy HTML into clean text without losing semantic structure. The problem isn’t just scraping—it’s scraping at scale, handling JavaScript-heavy SPAs, bypassing anti-bot measures, and producing output that doesn’t waste precious context tokens on navigation bars and cookie banners.

Firecrawl pioneered the “LLM-optimized scraper” category, but the space has fragmented into two camps: managed services that handle browser infrastructure for you, and self-hosted solutions that give you control at the cost of complexity. Teracrawl positions itself as a Firecrawl alternative, achieving top placement on the scrape-evals benchmark (the README shows 82.1% in the benchmark image, though the description text mentions 84.2%). Built on TypeScript and Browser.cash’s managed Chrome infrastructure, it promises production-ready scraping with intelligent fallback strategies. The architectural trade-off: it’s not a self-contained tool—it’s an orchestration layer that requires Browser.cash for all scraping and a separate SERP service for search functionality.

Technical Insight

Teracrawl’s architecture makes an opinionated bet: don’t reinvent browser management, focus on intelligent content extraction. The system delegates browser lifecycle management entirely to Browser.cash, a managed Chrome service that handles sessions, anti-bot evasion, and infrastructure scaling. This design choice permeates the codebase—there’s no Puppeteer or Playwright code here, just API calls to remote browsers.

The real innovation lies in the two-phase scraping strategy. For static or server-side rendered pages, Teracrawl runs a “fast mode” that reuses browser contexts and aggressively blocks heavy assets (images, fonts, videos). When fast mode fails to extract sufficient content—detected through heuristics on DOM structure and content length—the system automatically falls back to “dynamic mode,” which spawns a fresh browser context, waits for JavaScript hydration, and allows full rendering. This adaptive approach appears designed to handle both simple static sites and complex JavaScript-heavy applications efficiently.
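The fallback trigger can be sketched in shell. This is a hypothetical illustration, not Teracrawl's code: the README says the heuristic inspects DOM structure and content length, and only the content-length half is shown here; the 200-character threshold is an assumption.

```shell
# Hypothetical sketch of the fast -> dynamic fallback decision.
# MIN_CONTENT_CHARS is an assumed threshold, not documented in the README.
MIN_CONTENT_CHARS=200

needs_dynamic_mode() {
  local markdown="$1"
  if [ "${#markdown}" -lt "$MIN_CONTENT_CHARS" ]; then
    echo "dynamic"   # too little extracted text: fall back to full rendering
  else
    echo "fast"      # fast mode produced enough content to keep
  fi
}
```

A client-side SPA whose initial HTML is an empty application shell yields almost no markdown in fast mode, so a check like this would route it to the dynamic pass.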

The /crawl endpoint demonstrates the orchestration philosophy. A single API call triggers a Google search via the external browser-serp service (which must be running separately on port 8080), then scrapes the top N results in parallel:

curl -X POST http://localhost:8085/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "q": "What is the capital of France?",
    "count": 3
  }'

This returns an array of results, each containing the original URL, page title, and cleaned Markdown. The parallel execution model uses a session pool—you configure how many concurrent browsers to maintain via the POOL_SIZE environment variable (default: 1), and Teracrawl distributes scraping tasks across them. For AI agent workflows where you need “search then read top 5 results,” this eliminates multiple round-trips.
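The README does not reproduce a response body. A plausible shape, with field names inferred from the description (the exact keys are assumptions), and a crude way to pull the titles out without extra tooling:

```shell
# Hypothetical /crawl response; "url", "title", and "markdown" key names
# are assumptions based on the README's description of each result.
response='[
  {"url": "https://en.wikipedia.org/wiki/Paris", "title": "Paris", "markdown": "# Paris..."},
  {"url": "https://example.test/france", "title": "France facts", "markdown": "..."}
]'

# List the titles (a JSON-aware tool like jq would be more robust).
echo "$response" | grep -o '"title": "[^"]*"'
```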

Content extraction relies on semantic HTML heuristics rather than configurable selectors. The system preferentially extracts from <article> and <main> tags, then strips known noise patterns: navigation elements, scripts, styles, ads, and trackers. The Markdown conversion pipeline has specific LLM optimizations—it removes base64-encoded images (which bloat token counts massively) while preserving image references and alt text. Analytics scripts and tracking pixels get stripped entirely. The goal isn’t perfect HTML-to-Markdown conversion; it’s maximizing signal-to-noise ratio for context windows.

The /scrape endpoint handles single-URL conversions without search:

curl -X POST http://localhost:8085/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog/post-1"}'

Configuration happens entirely through environment variables. You set BROWSER_API_KEY for Browser.cash authentication, and optionally configure session pool size, timeout values, and Datalab integration (an optional PDF-to-Markdown service). The Docker deployment is straightforward—copy .env.example to .env, set your API key, and run. No database, no complex infrastructure, just a stateless API that delegates heavy lifting to external services.
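A minimal configuration sketch: the variable names come from the README, but the values are illustrative placeholders.

```shell
# Assumed .env contents for a Teracrawl deployment.
BROWSER_API_KEY=your-browser-cash-key   # required: Browser.cash authentication
POOL_SIZE=4                             # optional: concurrent sessions (default 1)
```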

This architecture has scaling implications. Because browsers run remotely on Browser.cash infrastructure, your deployment is lightweight—you’re not managing Chrome processes, GPU drivers, or memory leaks. However, you depend on external service availability, and every request pays the latency of a network round-trip to a remote browser. The economic model shifts from infrastructure costs to per-request API costs.

Gotcha

Teracrawl’s external dependencies create deployment complexity that may not be immediately apparent. The /crawl endpoint—which combines search and scraping—requires running a separate browser-serp instance. The README warns about this requirement but doesn’t provide detailed deployment guidance for that dependency. Browser-serp is another GitHub repository you need to clone, configure, and maintain on port 8080. For production deployments, you’re managing two services instead of one.

The Browser.cash dependency creates an ongoing service relationship. While the README doesn’t detail pricing, you need a Browser.cash API key for all functionality—this appears to be a commercial service rather than self-hosted infrastructure. For teams evaluating total cost of ownership, this represents a different economic model than self-hosted solutions.

Content extraction heuristics are opinionated and non-configurable. Unlike some alternatives that let you specify CSS selectors or custom extraction rules, Teracrawl’s approach focuses on semantic HTML patterns. It targets <article> and <main> tags while stripping navigation, scripts, and ads. This works well for standard blog posts, news articles, and documentation sites—content that follows semantic HTML conventions. But e-commerce product pages, social media threads, or custom web apps with non-standard layouts might have their main content missed if it’s not wrapped in expected semantic tags. The benchmark shows strong overall performance, but the methodology and edge case handling aren’t detailed in the README.
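The failure mode is easy to demonstrate. The sketch below is not Teracrawl's extractor, just a caricature of semantic-tag targeting: content inside <article> is found, while the same content in a bare <div> is invisible to the heuristic.

```shell
# Illustrative only: extract the text inside an <article> element.
# A page that keeps its content in a plain <div> yields nothing here.
extract_article() {
  sed -n 's/.*<article>\(.*\)<\/article>.*/\1/p'
}

echo '<body><article>Real content</article></body>' | extract_article
# prints: Real content
echo '<body><div>Real content</div></body>' | extract_article
# prints nothing
```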

The project’s maturity level should be considered. It has 242 GitHub stars and positions itself as “production-ready”; teams should evaluate whether the community size and ecosystem maturity meet their risk tolerance. The README provides good API documentation with working curl examples, but doesn’t detail error handling strategies, rate limiting, retry logic, or how the system handles browser crashes. For production use, you’d likely need to add your own circuit breakers and monitoring.
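A hypothetical client-side retry wrapper, the kind of resilience logic you would layer around /scrape calls yourself (none of this is part of Teracrawl):

```shell
# Retry an arbitrary command with exponential backoff.
retry_with_backoff() {
  local max_attempts="$1"; shift
  local attempt=1 delay=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"                # back off before the next try
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  return 1                          # all attempts failed
}

# Usage sketch (endpoint from the README; -sf makes curl fail on HTTP errors):
# retry_with_backoff 3 curl -sf -X POST http://localhost:8085/scrape \
#   -H "Content-Type: application/json" -d '{"url": "https://example.com"}'
```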

There’s a discrepancy in the benchmark numbers: the README image caption states “Teracrawl achieves #1 coverage at 82.1%” while the descriptive text claims “84.2% coverage.” This documentation inconsistency warrants clarification from the maintainers.

Verdict

Use Teracrawl if you’re building AI agents or RAG systems where scraping success rate directly impacts user experience, you’re comfortable with managed service dependencies, and your use case fits the “search then scrape top results” pattern. The top benchmark placement on scrape-evals (whether 82.1% or 84.2%) suggests meaningfully better handling of JavaScript-heavy sites compared to basic scraping approaches. The two-phase rendering strategy and parallel execution model are thoughtfully designed for AI workloads where you need reliable real-time web data.

Skip it if you need a fully self-contained solution without external service dependencies, require fine-grained control over content extraction (custom selectors, data attribute targeting), or prefer infrastructure you can self-host entirely. For simple static sites, a Puppeteer script with a Markdown converter gives you basic functionality without service dependencies. For high-volume scraping where you process thousands of pages daily, evaluate the economics of managed browser services versus self-hosted infrastructure.

Teracrawl appears designed for teams who value developer simplicity and best-in-class success rates over minimizing external dependencies: a reasonable trade-off for many production AI applications, but one that should be made deliberately rather than discovered after integration.
