Flyscrape: Web Scraping With JavaScript Configuration Instead of Python Boilerplate

Hook

Most web scrapers force you to choose between a simple CLI tool that can't handle complexity or a full programming framework that requires 15 dependencies. Flyscrape rejects this dichotomy entirely.

Context

The web scraping landscape has long been polarized. On one end, you have lightweight tools like curl and wget—great for quick HTTP requests but useless when you need to extract structured data from HTML. On the other, you have full-fledged frameworks like Scrapy or Puppeteer that offer immense power but require managing runtime environments, dependencies, and writing substantial boilerplate before you extract your first piece of data.

For developers who need to scrape data occasionally—not build a distributed crawling infrastructure—this choice feels broken. You might just need to pull product listings from a few pages, monitor changes to a documentation site, or extract tabular data for analysis. Do you really need to set up a Python virtual environment, install a dozen packages, and structure a project with spiders, middlewares, and pipelines? Flyscrape emerged from this frustration: a single binary that gives you configuration-driven scraping with JavaScript for the extraction logic, handling all the complex infrastructure (rate limiting, caching, cookie management, concurrent requests) through simple config options.

Technical Insight

Flyscrape's architecture revolves around a clever division of responsibilities. The heavy lifting (HTTP client management, rate limiting, caching, concurrency control) is implemented in Go and compiled into a standalone binary. Users write scraping logic in JavaScript, which runs in an embedded interpreter. This isn't a Node.js process; it's goja, a JavaScript engine written in pure Go, so no external runtime has to be installed.

A basic Flyscrape script looks deceptively simple:

export const config = {
  url: "https://news.ycombinator.com",
  cache: "file",
  follow: ["a.morelink"],
  depth: 2,
  rate: 100  // maximum requests per minute
};

export default function({ doc }) {
  // Each story row has the class "athing"; map() hands every row to the
  // callback as an element that supports find(), text(), attr() and next().
  return doc.find(".athing").map((story) => {
    const link = story.find(".titleline > a");
    return {
      title: link.text(),
      link: link.attr("href"),
      // The score lives in the sibling row that follows the story row.
      points: story.next().find(".score").text()
    };
  });
}

This script exports two things: a configuration object and a default function. The config tells Flyscrape's Go engine how to behave—which URL to start from, whether to cache responses, which links to follow automatically, how deep to crawl, and rate limiting. The default function receives a jQuery-like document object and returns the extracted data. That's it. No class definitions, no inheritance, no middleware pipeline.
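Because the extraction function is ordinary JavaScript, scraped strings can be cleaned up before they are returned. The following is a minimal sketch, assuming the score text looks like "123 points"; parsePoints and cleanStory are hypothetical helpers for illustration, not part of Flyscrape.

```javascript
// Hypothetical helper: turn a score label such as "123 points" into a number.
// Returns null when the label contains no digits (for example, an empty string).
function parsePoints(label) {
  const match = /\d+/.exec(label ?? "");
  return match ? Number(match[0]) : null;
}

// Hypothetical helper: normalize one scraped record before returning it.
function cleanStory({ title, link, points }) {
  return { title: title.trim(), link, points: parsePoints(points) };
}
```

Inside a script, each scraped object could be passed through cleanStory before it is returned, so the output holds numeric scores instead of raw label text.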

What makes this architecture interesting is the dual-mode operation. By default, Flyscrape makes standard HTTP requests and parses the returned HTML, which is fast and lightweight. Set browser: true in your config, and it renders pages in a headless browser instead:

export const config = {
  url: "https://example-spa.com",
  browser: true  // render pages in a headless browser before extraction
};

export default function({ doc }) {
  // Same extraction API works for both modes
  return doc.find(".dynamic-content").text();
}

This mode delegation happens transparently. Your extraction code doesn't change between HTTP and browser modes—the same jQuery-style API works regardless. Under the hood, Flyscrape manages browser lifecycle, waits for JavaScript execution, and feeds you the rendered DOM.

The nested scraping capability showcases how configuration replaces code. Instead of manually managing a queue of URLs to visit, you declare link-following patterns:

export const config = {
  url: "https://example.com/products",
  follow: [
    ".pagination a",  // Follow pagination links
    ".product-link"   // Follow individual product pages
  ],
  depth: 3,
  concurrency: 5
};

export default function({ doc, url }) {
  // This function runs on every page matched
  if (url.includes("/product/")) {
    return {
      name: doc.find("h1.title").text(),
      price: doc.find(".price").text()
    };
  }
  // Pagination pages return nothing, just followed
}

Flyscrape's engine handles the graph traversal, respects depth limits, manages concurrent requests (up to the specified limit), and deduplicates URLs automatically. The config-driven approach means common scraping patterns—pagination, link following, rate limiting—become declarative rather than imperative.
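To make concrete what the declarative follow and depth options save you from writing, here is a rough sketch of the imperative equivalent: a breadth-first traversal with a depth limit and a visited set. It is not Flyscrape's implementation. The in-memory site object, the crawl function, and the way depth is counted are all assumptions made for illustration.

```javascript
// Hypothetical in-memory site: each URL maps to the links found on that page.
const site = {
  "/products": ["/products?page=2", "/product/1"],
  "/products?page=2": ["/products", "/product/2"],
  "/product/1": [],
  "/product/2": ["/product/1"],
};

// Breadth-first traversal with a depth limit and a visited set.
// linksOf(url) stands in for "fetch the page and apply the follow selectors".
function crawl(startUrl, maxDepth, linksOf) {
  const visited = new Set([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      for (const link of linksOf(url)) {
        if (!visited.has(link)) {
          visited.add(link); // deduplicate: each URL is queued at most once
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}

const pages = crawl("/products", 2, (url) => site[url] ?? []);
```

Even this toy version needs a queue, a visited set, and depth bookkeeping, and it leaves out concurrency and rate limiting entirely.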

One particularly clever feature is browser cookie integration. Set cookies to the name of a locally installed browser, and Flyscrape reuses that browser's cookie store:

export const config = {
  url: "https://authenticated-site.com/dashboard",
  cookies: "chrome"  // or "firefox" / "edge"
};

This is invaluable for scraping sites where you're already logged in. No need to extract authentication tokens, manage sessions, or handle complex login flows: just use your existing browser session. Flyscrape reads the cookie store of Chrome, Firefox, or Edge, whichever one you name.

The caching system deserves attention too. Setting cache: "file" makes Flyscrape store HTTP responses locally. During development, this means you hit the target site once, then iterate on your extraction logic against cached responses, which is both faster and more respectful to the target server. Responses are stored per request, so different URLs don't collide, and you can clear the cache when you need fresh data.
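The idea behind response caching can be sketched in a few lines. This is an in-memory illustration keyed by URL, not Flyscrape's file-backed cache; withCache, fakeFetch, and the counter are hypothetical names used only for this example.

```javascript
// Wrap a fetch-like function so each URL is fetched at most once.
// fetchPage is a hypothetical function: (url) => response body as a string.
function withCache(fetchPage, store = new Map()) {
  return (url) => {
    if (!store.has(url)) {
      store.set(url, fetchPage(url)); // first request goes to the "network"
    }
    return store.get(url); // repeat requests are served from the store
  };
}

// Stand-in for a real HTTP request that counts how often it is called.
let networkHits = 0;
const fakeFetch = (url) => {
  networkHits += 1;
  return `<html>${url}</html>`;
};

const cachedFetch = withCache(fakeFetch);
cachedFetch("https://example.com/a");
cachedFetch("https://example.com/a"); // served from the cache
cachedFetch("https://example.com/b");
```

After these three calls the stand-in fetcher has run only twice, which is the same effect a persistent cache gives you while you rework selectors against pages you have already downloaded.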

Gotcha

The proxy-browser incompatibility is a significant constraint. Flyscrape's documentation explicitly states that proxy configuration doesn't work in browser mode. This is likely a limitation of the underlying headless browser library, but it's a painful trade-off: you can have JavaScript rendering or proxy rotation, but not both. If you're scraping sites that require JavaScript execution and aggressive IP rotation, you'll need to look elsewhere or handle proxy management externally through a separate proxy service.

The JavaScript-only scripting is a double-edged sword. While JavaScript is widely known, the Python scraping community is enormous, and many data scientists and analysts are far more comfortable with Python than JavaScript. If your team's expertise is Python-based, introducing a JavaScript-based tool—even with a simple API—adds cognitive overhead. The jQuery-style API, while familiar to many, also feels dated compared to modern approaches and lacks advanced features like XPath selectors or sophisticated CSS pseudo-selectors. Complex extraction logic might require more verbose JavaScript than equivalent Python with libraries like BeautifulSoup or lxml. Additionally, being a relatively young project (compared to Scrapy's decade-plus maturity), Flyscrape lacks the ecosystem of plugins, extensions, and community patterns that make established frameworks so powerful for complex scenarios.

Verdict

Use Flyscrape if you need a portable, zero-dependency scraper for occasional data extraction tasks, especially in environments where installing Python/Node runtimes is friction (CI/CD pipelines, minimal containers, shared servers). It's ideal when you want more power than curl but less complexity than Scrapy: think extracting data from a dozen pages, not crawling millions. The single-binary distribution and configuration-driven approach make it perfect for ops teams, data analysts, or developers who scrape occasionally rather than building scraping infrastructure.

Skip it if you're already invested in Python's scraping ecosystem with its rich libraries and extensions, need advanced anti-detection measures or distributed crawling, or require proxy rotation combined with JavaScript rendering. Also skip it if your team doesn't know JavaScript or if you're building scrapers that need sophisticated middleware pipelines, item processing, or integration with data pipelines that already exist in Python/Node ecosystems.
