Crawlee: The Web Scraping Framework That Lets You Start Simple and Scale Complex
Hook
Most web scraping projects start with simple HTTP requests and end with frantically Googling 'how to bypass Cloudflare with Puppeteer.' Crawlee is the rare framework that anticipates this journey from day one.
Context
Web scraping has always involved an awkward choice: start with lightweight HTTP libraries like Axios and Cheerio for speed and simplicity, knowing you'll need to rewrite everything when you hit JavaScript-heavy sites, or go straight to Puppeteer/Playwright with all their overhead, even for sites that don't need a full browser. This forced architects to either over-engineer from the start or accept technical debt.
The problem compounds in production. Beyond just fetching pages, you need request queues that survive crashes, proxy rotation that handles failures gracefully, rate limiting that respects robots.txt, storage that doesn't corrupt under concurrent writes, and anti-detection measures that evolve with bot protection services. Every team ends up building the same infrastructure, and most get it wrong the first few times. Crawlee emerged from Apify's experience running millions of web scraping jobs, packaging patterns that previously required senior engineers into a batteries-included framework that works standalone or on their platform.
Technical Insight
Crawlee's architecture centers on a crawler abstraction with swappable backends. You start by choosing a crawler class—CheerioCrawler for raw HTTP with CSS selectors, PlaywrightCrawler for full browser automation, or PuppeteerCrawler if you prefer that ecosystem. The critical insight is that they all share nearly identical APIs.
Here's a basic CheerioCrawler that extracts product data:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Safety cap so a misconfigured crawl cannot run forever
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);

        // Extract data using jQuery-like syntax
        const title = $('h1.product-title').text().trim();
        const price = $('.price-tag').text().trim();

        // Follow pagination
        await enqueueLinks({
            selector: 'a.next-page',
            label: 'LISTING',
        });

        // Save structured data
        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });
    },
});

await crawler.run(['https://example.com/products']);
When you inevitably encounter a site that renders content client-side, you swap CheerioCrawler for PlaywrightCrawler. The queueing and storage calls stay identical; only the extraction lines change, from $ selectors to page locators:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, page, enqueueLinks }) {
        // Same URL processing logic
        console.log(`Processing: ${request.url}`);

        // Wait for dynamic content
        await page.waitForSelector('h1.product-title');

        // textContent() resolves to string | null, so guard before trimming
        const title = (await page.locator('h1.product-title').textContent())?.trim();
        const price = (await page.locator('.price-tag').textContent())?.trim();

        // Same link following logic
        await enqueueLinks({
            selector: 'a.next-page',
            label: 'LISTING',
        });

        await Dataset.pushData({ url: request.url, title, price });
    },
});

await crawler.run(['https://example.com/products']);
The framework handles the iceberg beneath this simple API. The RequestQueue persists URLs to disk by default, so a crawl can resume after a crash, and with a shared storage backend the same queue can feed distributed crawling. When you call enqueueLinks(), Crawlee resolves and normalizes URLs, deduplicates them, and filters them by the enqueue strategy, which by default keeps only links on the same hostname. The queue itself is processed breadth-first by default. The Dataset class similarly abstracts storage: it writes JSON files to a local storage directory in development but can swap to cloud storage in production without code changes.
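To make the normalize-then-deduplicate step concrete, here is a minimal sketch of the idea in plain TypeScript. It is not Crawlee's internal implementation: normalizeUrl and DedupQueue are hypothetical names, and the normalization rules (drop the fragment, sort query parameters, same-hostname filter) are simplifying assumptions.

```typescript
// Hypothetical helper, for illustration only (not a Crawlee API).
function normalizeUrl(raw: string, base: string): string | null {
  let url: URL;
  try {
    url = new URL(raw, base); // resolves relative links against the page URL
  } catch {
    return null; // skip malformed hrefs
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  url.hash = '';           // fragments are never sent to the server
  url.searchParams.sort(); // ?b=2&a=1 and ?a=1&b=2 map to the same key
  return url.toString();
}

// Hypothetical in-memory queue; a real one would also persist to disk.
class DedupQueue {
  private seen = new Set<string>();
  private pending: string[] = [];
  private allowedHostname: string;

  constructor(allowedHostname: string) {
    this.allowedHostname = allowedHostname;
  }

  /** Returns true only if the URL was newly enqueued. */
  add(raw: string, base: string): boolean {
    const key = normalizeUrl(raw, base);
    if (key === null) return false;
    if (new URL(key).hostname !== this.allowedHostname) return false;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  /** First-in, first-out order yields a breadth-first traversal. */
  next(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

const queue = new DedupQueue('example.com');
const pageUrl = 'https://example.com/products';
queue.add('/products?page=2&sort=asc', pageUrl);
queue.add('https://example.com/products?sort=asc&page=2#reviews', pageUrl); // duplicate once normalized
queue.add('https://other.example.org/', pageUrl); // rejected: different hostname
queue.add('mailto:sales@example.com', pageUrl);   // rejected: not http(s)
console.log(queue.size);
```

Only the first link survives here, which is why a handler can enqueue every pagination link it sees without tracking what has already been visited.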
Proxy rotation integrates at the crawler level rather than requiring manual request instrumentation. Configure a ProxyConfiguration and Crawlee handles session management automatically:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        maxPoolSize: 100,
        sessionOptions: {
            maxUsageCount: 50, // Retire a session after 50 uses
        },
    },
    // Sessions automatically manage cookies and track errors
    async requestHandler({ request }) {
        console.log(`Processing: ${request.url}`);
    },
});
Sessions are more sophisticated than simple round-robin proxies. They maintain cookie state across requests, track error rates per session, and automatically retire sessions that accumulate blocks. If a session gets a 403 or times out repeatedly, Crawlee marks it bad and tries another. This session-aware proxy rotation is what separates production scrapers from naive implementations.
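The retire-on-errors behavior can be sketched in a few lines. This is an illustrative model rather than Crawlee's SessionPool: the class name, the field names, and the thresholds (an error score limit of 3, a 0.5 decrement on success) are assumptions chosen for the example.

```typescript
// Hypothetical model of a session; not Crawlee's Session class.
interface Session {
  id: number;
  proxyUrl: string;
  usageCount: number;
  errorScore: number;
  cookies: Map<string, string>;
}

class SimpleSessionPool {
  private sessions: Session[] = [];
  private nextId = 0;
  private proxyUrls: string[];
  private maxUsageCount: number;
  private maxErrorScore: number;

  constructor(proxyUrls: string[], maxUsageCount = 50, maxErrorScore = 3) {
    this.proxyUrls = proxyUrls;
    this.maxUsageCount = maxUsageCount;
    this.maxErrorScore = maxErrorScore;
  }

  private isUsable(s: Session): boolean {
    return s.usageCount < this.maxUsageCount && s.errorScore < this.maxErrorScore;
  }

  /** Reuse a healthy session if one exists, otherwise create a fresh one. */
  acquire(): Session {
    let session = this.sessions.find((s) => this.isUsable(s));
    if (!session) {
      const id = this.nextId++;
      session = {
        id,
        proxyUrl: this.proxyUrls[id % this.proxyUrls.length], // each session sticks to one proxy
        usageCount: 0,
        errorScore: 0,
        cookies: new Map<string, string>(),
      };
      this.sessions.push(session);
    }
    session.usageCount++;
    return session;
  }

  /** Blocks (e.g. a 403) or timeouts raise the score; successes lower it slightly. */
  report(session: Session, ok: boolean): void {
    session.errorScore = ok
      ? Math.max(0, session.errorScore - 0.5)
      : session.errorScore + 1;
  }
}

const pool = new SimpleSessionPool([
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
]);
const first = pool.acquire();
first.cookies.set('sid', 'abc');
for (let i = 0; i < 3; i++) pool.report(first, false); // three blocks in a row
const second = pool.acquire(); // first is retired, so a new identity is created
console.log(first.proxyUrl, second.proxyUrl, second.cookies.size);
```

The point of tying cookies and the proxy to one session object is that a retired identity takes its cookie jar with it, so the replacement starts clean on a different exit IP.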
The AutoscaledPool powers resource management underneath. It monitors system memory and CPU, automatically adjusting concurrency to maximize throughput without crashing. You can crawl with maxConcurrency: 50 on a laptop and maxConcurrency: 1000 on a server using identical code—the pool prevents resource exhaustion either way.
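A rough sketch of the scaling decision, assuming a simple step-up, step-down rule: the function name nextConcurrency, the 0.9 overload threshold, and the 5% step are illustrative choices, not values taken from Crawlee.

```typescript
// Hypothetical load sample; a real pool would measure these periodically.
interface LoadSnapshot {
  memoryUsedRatio: number; // 0..1
  cpuUsedRatio: number;    // 0..1
}

interface ScalingOptions {
  minConcurrency: number;
  maxConcurrency: number;
  overloadThreshold: number; // ratio above which the system counts as overloaded
  stepRatio: number;         // fraction of current concurrency to add or remove
}

/** Decide the next concurrency level from the latest load sample. */
function nextConcurrency(current: number, load: LoadSnapshot, opts: ScalingOptions): number {
  const overloaded =
    load.memoryUsedRatio > opts.overloadThreshold ||
    load.cpuUsedRatio > opts.overloadThreshold;
  const step = Math.max(1, Math.ceil(current * opts.stepRatio));
  const desired = overloaded ? current - step : current + step;
  // Clamp so the configured limits always win over the heuristic.
  return Math.min(opts.maxConcurrency, Math.max(opts.minConcurrency, desired));
}

const opts: ScalingOptions = {
  minConcurrency: 1,
  maxConcurrency: 50,
  overloadThreshold: 0.9,
  stepRatio: 0.05,
};
const scaledUp = nextConcurrency(10, { memoryUsedRatio: 0.4, cpuUsedRatio: 0.3 }, opts);
const scaledDown = nextConcurrency(10, { memoryUsedRatio: 0.95, cpuUsedRatio: 0.3 }, opts);
const capped = nextConcurrency(50, { memoryUsedRatio: 0.2, cpuUsedRatio: 0.2 }, opts);
console.log(scaledUp, scaledDown, capped);
```

Under this rule maxConcurrency acts as a ceiling rather than a target, which is why the same setting can be safe on machines of very different sizes.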
For anti-detection, Crawlee integrates the fingerprint-suite and header-generator libraries, which originated as standalone Apify projects. Playwright and Puppeteer crawlers inject generated browser fingerprints that mask automation signals such as navigator.webdriver, vary canvas fingerprints, and send realistic HTTP headers based on browser usage statistics. HTTP crawlers use header-generator to produce headers that resemble those of a real browser, including header ordering and values consistent with the claimed browser version.
Gotcha
The unified API works until you need crawler-specific features. PlaywrightCrawler gives you a page object with full Playwright capabilities, but if you start using page.route() for request interception or page.addInitScript() for injection, switching back to CheerioCrawler later requires refactoring. The abstraction leaks when you need it most—when sites are difficult and you're reaching for advanced browser capabilities. You can mitigate this by keeping crawler-specific code in separate route handlers, but the temptation to mix concerns is strong.
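One way to keep that separation is to confine backend objects to thin adapters and put everything else in a function that never sees $ or page. The sketch below assumes hypothetical RawProduct and ProductRecord shapes and a made-up toRecord helper; the two sample inputs stand in for what a Cheerio handler and a Playwright handler might hand over.

```typescript
// Hypothetical shapes, for illustration only.
interface RawProduct {
  url: string;
  title: string | null; // Playwright's textContent() can resolve to null
  price: string | null;
}

interface ProductRecord {
  url: string;
  title: string;
  priceCents: number | null;
}

/** Shared, backend-agnostic logic: no `$` and no `page` in here. */
function toRecord(raw: RawProduct): ProductRecord {
  const title = (raw.title ?? '').trim();
  const digits = (raw.price ?? '').replace(/[^0-9.]/g, ''); // strip currency symbols and commas
  const value = Number.parseFloat(digits);
  return {
    url: raw.url,
    title,
    priceCents: Number.isNaN(value) ? null : Math.round(value * 100),
  };
}

// What a Cheerio-based handler might pass along after calling $(...).text().
const fromCheerio: RawProduct = {
  url: 'https://example.com/products/1',
  title: '  Widget Pro ',
  price: '$1,299.50',
};

// What a Playwright-based handler might pass along when an element is missing.
const fromPlaywright: RawProduct = {
  url: 'https://example.com/products/2',
  title: 'Widget Lite',
  price: null,
};

const recordA = toRecord(fromCheerio);
const recordB = toRecord(fromPlaywright);
console.log(recordA, recordB);
```

With this split, moving a route between crawler classes means rewriting only the few lines that build the RawProduct, while parsing and validation stay untouched.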
Anti-detection features help with casual bot protection but sophisticated systems still catch automated traffic. Cloudflare's Turnstile, PerimeterX, and DataDome analyze behavior patterns that no header spoofing fixes—mouse movement entropy, timing patterns, WebGL rendering inconsistencies. Crawlee's stealth plugins provide a baseline that works against basic checks, but they're not invisibility cloaks. You'll still need CAPTCHA solving services or residential proxies for hardened targets. The framework's value is making the 80% case trivial, not solving the 20% of sites with advanced protection.
Dependency weight is significant. Installing Crawlee with Playwright pulls down Chromium, Firefox, and WebKit binaries totaling over 1GB. The crawlee package alone is 50+ dependencies deep. For Docker deployments, this means multi-gigabyte images and slower CI/CD pipelines. The framework provides separate packages like @crawlee/cheerio if you want to avoid browser dependencies, but then you lose the unified API advantage—you might as well use Cheerio directly at that point. This is a framework for teams committed to crawling as a core competency, not for adding one-off scraping to an existing service.
Verdict
Use if: You're building a production web scraping system that will evolve over time, need to handle both static and dynamic sites, want built-in queue management and proxy rotation instead of building it yourself, or plan to scale from prototype to production without architectural rewrites. It's particularly valuable if you might deploy to Apify's platform later but want to start self-hosted, or if your team is already comfortable with TypeScript and Node.js async patterns.
Skip if: You're working in Python (check out Crawlee's Python port instead), need an ultra-lightweight single-purpose scraper where the framework overhead isn't justified, are constrained to older Node.js versions or can't accommodate large dependencies in your deployment pipeline, or specifically need language ecosystems beyond JavaScript where Scrapy or Selenium provide better options.