
Crawlee: Building Production-Grade Web Scrapers Without Reinventing Infrastructure


Hook

Most web scraping tutorials end where production begins—right when you need request queuing, proxy rotation, retry logic, and session management. Crawlee ships with all of it.

Context

Web scraping in Node.js has historically meant choosing between simplicity and capability. Need to scrape static HTML? Reach for axios and Cheerio—fast and lightweight, but you’re building your own queue system, retry logic, and error handling. Need JavaScript rendering? Install Puppeteer or Playwright, then spend days implementing request management, proxy rotation, and anti-bot evasion. There’s no middle ground.

Crawlee emerged from Apify’s experience running millions of scraping jobs in production. The team recognized that every serious scraping project rebuilds the same infrastructure: persistent queues with breadth-first or depth-first traversal, automatic retry mechanisms, proxy and session rotation, resource scaling, and storage abstractions. Rather than forcing developers to reinvent these wheels, Crawlee provides a battle-tested foundation that works identically whether you’re using raw HTTP requests, JSDOM, Cheerio, Puppeteer, or Playwright. It’s designed for the reality that scraping projects evolve—today’s static site becomes tomorrow’s JavaScript-heavy application, and your crawler architecture shouldn’t require a rewrite when that happens.

Technical Insight

System architecture (auto-generated diagram): user code configures and runs the Crawler Layer (HttpCrawler, CheerioCrawler, JsdomCrawler, PuppeteerCrawler, PlaywrightCrawler), which executes through the Core Engine (Request Queue, Lifecycle Hooks, Auto Scaling). The engine applies the Anti-Detection components (Session Manager, Proxy Rotation, Browser Fingerprints), fetches requests, and persists data to the Storage Layer (Dataset, Key-Value Store).

Crawlee’s architecture centers on a unified crawler interface with swappable execution engines. Every crawler type—HttpCrawler, CheerioCrawler, JsdomCrawler, PuppeteerCrawler, PlaywrightCrawler—shares the same request queue, storage, and lifecycle hooks. This means you can start with lightweight HTTP crawling and upgrade to browser automation by changing one line of code.

Here’s a practical example showing this flexibility. A Cheerio-based crawler for static HTML:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);

If the target site later requires JavaScript rendering, you swap in PlaywrightCrawler with identical request handling logic:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);
        
        await Dataset.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);

The requestHandler signature changes ($ becomes page), but queue management, storage, retries, and proxy rotation remain identical. This API consistency is Crawlee’s killer feature for maintaining scrapers long-term.
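Both examples call enqueueLinks() with its defaults, which enqueues links on the same hostname. It also accepts filters to scope the crawl; a sketch using options from Crawlee's documented API (the glob pattern is illustrative):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Visiting ${request.url}`);
        // Only follow documentation pages, and tag them so a
        // labeled handler or router can pick them up later.
        await enqueueLinks({
            globs: ['https://crawlee.dev/docs/**'],
            label: 'DOCS',
        });
    },
});
```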

Under the hood, Crawlee implements persistent request queuing with configurable strategies. Requests are stored locally by default (in the ./storage directory) with automatic deduplication based on URL normalization. The queue supports both breadth-first and depth-first traversal, and crucially, survives crashes—restart your crawler and it resumes where it left off. For distributed deployments, the storage layer is pluggable; you can swap local filesystem storage for Apify’s cloud storage or implement custom backends.
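The deduplication idea can be illustrated with a simplified sketch—this is a conceptual illustration, not Crawlee's exact algorithm: normalize each URL into a unique key, then skip any request whose key has already been seen.

```typescript
// Simplified illustration of URL-based deduplication (NOT Crawlee's
// actual implementation): normalize, then check against seen keys.
function uniqueKey(rawUrl: string): string {
    const url = new URL(rawUrl);
    url.hash = '';                    // fragments never reach the server
    url.hostname = url.hostname.toLowerCase();
    // Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collide.
    url.searchParams.sort();
    return url.toString();
}

const seen = new Set<string>();
function shouldEnqueue(rawUrl: string): boolean {
    const key = uniqueKey(rawUrl);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
}
```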

The framework’s anti-detection capabilities deserve attention. Crawlee automatically generates browser-like headers for HTTP requests, and replicates browser TLS fingerprints. For browser crawlers, it generates human-like fingerprints with zero configuration, though the README notably avoids claiming this defeats all bot protection—a refreshing dose of honesty in a space full of oversold promises.
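For browser crawlers, fingerprint generation can also be tuned through the browser pool. A sketch based on the documented browserPoolOptions—exact option names may vary by Crawlee version:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // enabled by default
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // Constrain generated fingerprints to a plausible cohort.
                browsers: ['chrome'],
                operatingSystems: ['windows'],
                devices: ['desktop'],
            },
        },
    },
    async requestHandler({ page }) {
        // The page already carries the generated fingerprint.
    },
});
```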

Proxy rotation is built-in and appears to support session management. Configure a proxy pool, and Crawlee handles rotation:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({
        proxyUrls: [
            'http://proxy-1.com',
            'http://proxy-2.com',
        ],
    }),
    async requestHandler({ request, page }) {
        // Each request is routed through a proxy rotated from the pool above.
    },
});
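Session management pairs with proxy rotation through Crawlee's session pool. A sketch using the documented crawler options:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({
        proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
    }),
    useSessionPool: true,           // tie requests to rotating sessions
    persistCookiesPerSession: true, // keep cookies stable within a session
    sessionPoolOptions: {
        maxPoolSize: 20,            // cap the number of concurrent identities
    },
    async requestHandler({ session, page }) {
        // session.id identifies the proxy-and-cookie identity in use.
    },
});
```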

The TypeScript-first design shines in data extraction workflows. Generic types flow through the entire pipeline, giving you autocomplete and type safety from extraction to storage:

interface ProductData {
    title: string;
    price: number;
    inStock: boolean;
}

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const product: ProductData = {
            title: $('h1.product-title').text(),
            price: parseFloat($('.price').text().replace('$', '')),
            inStock: $('.availability').text().includes('In Stock'),
        };
        
        await Dataset.pushData<ProductData>(product);
    },
});

Crawlee’s automatic scaling adjusts concurrency based on available system resources. This prevents the common pitfall of spawning too many browser instances and crashing. The framework monitors system metrics and dynamically adjusts the number of concurrent requests, though you can override with explicit concurrency settings when you know your infrastructure’s limits better than the heuristics do.
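When you do want explicit control, concurrency limits can be set directly on the crawler. A sketch using Crawlee's documented options:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    minConcurrency: 5,         // never drop below this many parallel requests
    maxConcurrency: 50,        // hard ceiling regardless of free resources
    maxRequestsPerMinute: 120, // politeness throttle for the target site
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```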

Gotcha

Crawlee requires Node.js 16 or higher, per the documentation; if you’re stuck on an older runtime, you’re out of luck. Browser crawlers also require installing Playwright or Puppeteer separately, since neither is bundled with Crawlee to keep install size down. Even so, expect the initial npm install to take a while if you were anticipating a lightweight scraping library.

While Crawlee states that “crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration,” this needs context. Yes, the browser-like headers and fingerprint generation help against basic bot detection. No, this won’t defeat Cloudflare Turnstile, DataDome, PerimeterX, or other sophisticated anti-bot systems that analyze behavioral patterns and other advanced signatures. The README’s careful wording—“appear human-like” rather than “are undetectable”—is the honest take. For sites with serious bot protection, you’ll still need residential proxies, CAPTCHA solving services, or specialized scraping APIs regardless of which framework you use.

The default storage abstraction uses the local filesystem, which creates problems for containerized or serverless deployments. A scraper running in AWS Lambda or Google Cloud Functions can’t persist queue state to disk across invocations. The documentation mentions configurable storage backends and pluggable storage, but implementing cloud-native storage requires either adopting Apify’s platform (vendor lock-in) or building custom storage adapters (extra work). There’s no official AWS S3 or Google Cloud Storage adapter mentioned in the README.

Verdict

Use Crawlee if you’re building scrapers that need to scale beyond one-off scripts—especially if you anticipate switching between HTTP and browser-based crawling as target sites evolve, or if you’re deploying to Apify’s platform, where the integration is seamless. It’s particularly valuable for data pipelines feeding AI/LLM systems (the README explicitly mentions “Extract data for AI, LLMs, RAG, or GPTs”), where reliable, repeatable extraction matters more than quick prototypes. The TypeScript support and unified API eliminate entire categories of bugs that plague hand-rolled scraping infrastructure.

Skip it if you’re doing simple one-off scraping (axios + Cheerio is faster to set up), if you’re committed to Python (use Scrapy or Crawlee’s Python port, which the README mentions), if you need guaranteed success against advanced bot protection (that takes specialized services, not a framework), or if you’re in a serverless environment without custom storage adapters ready. The learning curve is steeper than basic Puppeteer scripts, but it pays dividends the moment your scraper needs retries, proxy rotation, or persistent queues—which is roughly 30 minutes into any real project.
