
Flyscrape: A Command-Line Web Scraper That Speaks JavaScript, Ships as Go


Hook

Most web scraping tools force you to choose between ease of use and deployment simplicity. Flyscrape refuses to make that choice, embedding a full JavaScript runtime inside a single Go binary.

Context

Web scraping sits in an awkward middle ground between one-off data extraction and production-grade data pipelines. Python developers reach for BeautifulSoup or Scrapy, which work brilliantly until you need to share your scraper with someone who doesn’t have the exact Python version, lxml compiled correctly, or the patience to debug virtual environments. Node.js developers face similar dependency nightmares, plus the added complexity of managing Chromium installations for headless browsing.

Flyscrape emerged from a simple observation: most web scraping scripts are disposable. You write them to grab some data, run them a few times, then abandon them. But the tooling around scraping assumes you’re building a long-lived system. You don’t need Scrapy’s middleware architecture to extract product prices from ten pages. You don’t need Playwright’s full Chrome DevTools Protocol access to click a “Load More” button. What you need is something you can write quickly, run immediately, and share as a single file. Flyscrape delivers exactly that by compiling Go’s networking primitives and concurrency model into a binary that executes JavaScript scraping logic, giving you jQuery-like selectors without the npm install hell.

Technical Insight

[System architecture — auto-generated diagram: the CLI entry point loads and parses the .js script, which the embedded JavaScript engine (goja) executes to read export const config and the export default scraper function. URLs flow through a queue with a rate limiter; each is fetched either by the Go net/http client or, in optional browser mode, by a headless browser that returns rendered HTML. Responses pass through a file or memory cache layer and are parsed with golang.org/x/net/html into a jQuery-like doc. Extracted data is emitted as JSON, and followed links feed back into the URL queue.]

Flyscrape’s architecture makes an unusual trade-off: it accepts the overhead of embedding a JavaScript interpreter to avoid the overhead of language runtime distribution. Under the hood, it uses a JavaScript engine (likely goja, the pure-Go ECMAScript implementation) to execute your scraping scripts, while the Go runtime handles HTTP requests, concurrency, and system integration.

A basic Flyscrape script demonstrates this hybrid approach:

export const config = {
  url: "https://news.ycombinator.com",
  cache: "file",
  follow: ["a.storylink"],
  depth: 2,
};

export default function({ doc, url }) {
  const title = doc.find("title").text();
  const links = doc.find("a.storylink");
  
  return {
    url: url,
    title: title,
    stories: links.map((el) => ({
      text: el.text(),
      href: el.attr("href"),
    })).get(),
  };
}

The config object is pure declarative configuration—no classes to instantiate, no middleware to configure. You specify URLs, caching behavior, pagination rules, and traversal depth in a plain JavaScript object. The follow array tells Flyscrape which links to crawl recursively, and depth limits how deep it goes. This is sophisticated functionality (recursive crawling with depth limits) expressed in a handful of declarative lines.
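Conceptually, follow plus depth amounts to a breadth-first traversal with a cutoff. A minimal Go sketch of that idea, using a hard-coded in-memory site map in place of real HTTP fetches (the pages map and crawl function are illustrative, not Flyscrape's actual internals):

```go
package main

import "fmt"

// pages stands in for the web: each URL maps to the links found on it.
var pages = map[string][]string{
	"/":    {"/a", "/b"},
	"/a":   {"/a/1"},
	"/b":   {"/b/1"},
	"/a/1": {"/deep"},
}

// crawl visits start and follows links breadth-first, stopping at maxDepth.
func crawl(start string, maxDepth int) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var visited []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)
		if cur.depth == maxDepth {
			continue // depth limit reached: extract, but do not follow further
		}
		for _, link := range pages[cur.url] {
			if !seen[link] {
				seen[link] = true
				queue = append(queue, item{link, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	fmt.Println(crawl("/", 2)) // visits /, /a, /b, /a/1, /b/1; never reaches /deep
}
```

The seen set doubles as deduplication, which any real crawler needs so link cycles don't loop forever.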

The extraction function receives a doc object that mimics jQuery’s API. Behind the scenes, Flyscrape parses HTML with Go’s golang.org/x/net/html package and exposes it through a JavaScript-friendly interface. The map() and get() pattern is pure jQuery, making the learning curve nearly flat for anyone who’s touched frontend code in the last decade.

Flyscrape offers two execution modes: static HTTP fetching and headless browser rendering. For static content, it makes standard HTTP requests and parses HTML directly—fast and lightweight. When you need JavaScript execution, you enable browser mode:

export const config = {
  url: "https://example.com/spa-app",
  browser: true,
  wait: "#dynamic-content",
};

export default function({ doc }) {
  // Page has been rendered, DOM is fully populated
  return {
    content: doc.find("#dynamic-content").text(),
  };
}

The browser: true flag spins up a headless Chrome instance, and wait tells it which selector to watch for before considering the page loaded. This is where Flyscrape’s architecture shines: the same JavaScript function works in both modes. Your extraction logic doesn’t change whether you’re parsing static HTML or waiting for React to hydrate.
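One way that mode-independence could be structured is by hiding the fetch step behind an interface, so extraction never knows which mode produced the HTML. A hedged Go sketch of the pattern (the Fetcher types and canned HTML are illustrative stand-ins, not Flyscrape's real code):

```go
package main

import (
	"fmt"
	"strings"
)

// Fetcher abstracts how HTML is obtained; extraction never sees the difference.
type Fetcher interface {
	Fetch(url string) (html string, err error)
}

// StaticFetcher would wrap net/http and return the raw response body.
type StaticFetcher struct{}

func (StaticFetcher) Fetch(url string) (string, error) {
	return `<div id="dynamic-content"></div>`, nil // raw HTML: scripts never ran
}

// BrowserFetcher would drive headless Chrome and return the rendered DOM.
type BrowserFetcher struct{}

func (BrowserFetcher) Fetch(url string) (string, error) {
	return `<div id="dynamic-content">hydrated by React</div>`, nil
}

// extract plays the role of the script's export default: identical in both modes.
func extract(f Fetcher, url string) (string, error) {
	html, err := f.Fetch(url)
	if err != nil {
		return "", err
	}
	// Crude stand-in for doc.find("#dynamic-content").text():
	// grab the text between the opening and closing tag.
	start := strings.Index(html, ">") + 1
	end := strings.LastIndex(html, "<")
	return html[start:end], nil
}

func main() {
	s, _ := extract(StaticFetcher{}, "https://example.com/spa-app")
	b, _ := extract(BrowserFetcher{}, "https://example.com/spa-app")
	fmt.Printf("static: %q, browser: %q\n", s, b)
}
```

The payoff is exactly what the article describes: swapping StaticFetcher for BrowserFetcher changes nothing in the extraction logic.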

Rate limiting and concurrency are handled at the configuration level, not in your code:

export const config = {
  url: "https://api.example.com/products?page=1",
  depth: 5,
  follow: ["a.next-page"],
  rate: 5,  // 5 requests per second
  concurrency: 3,  // 3 parallel requests
};

This is deceptively powerful. Flyscrape manages a request queue, enforces rate limits across all concurrent workers, and handles backpressure automatically. In Python’s Scrapy, you’d configure this through settings files and middleware. In custom Node.js scripts, you’d manually implement promise pools and rate limiters. Flyscrape bakes it into the runtime.

The tool also supports proxy rotation and cookie persistence, though with limitations. Proxies are configured via command-line flags, not in scripts, which keeps scripts portable but limits per-request proxy selection. Browser mode cannot use proxies at all—a significant constraint if you’re rotating IPs to avoid rate limits on JavaScript-heavy sites.

Gotcha

Flyscrape’s biggest limitation is the incompatibility between browser mode and proxy support. You can render JavaScript or you can rotate IPs, but you cannot do both. This makes it unsuitable for scraping modern single-page applications that also implement aggressive rate limiting based on IP addresses. Many e-commerce sites and social media platforms fall into this category.

The embedded JavaScript runtime, while convenient, is not a full Node.js environment. You can’t import npm packages, which means no access to libraries like date-fns for parsing, csv-parse for data transformation, or cheerio for additional DOM manipulation. Your extraction logic must use only JavaScript’s built-in capabilities plus Flyscrape’s provided APIs. For simple extraction tasks, this is fine. For complex data transformation pipelines, you’ll be writing more verbose code than you would in a full scripting environment.

Additionally, error handling patterns are underdocumented—the examples show happy paths but provide little guidance on catching failed requests, handling timeouts, or implementing retry logic. The tool appears to fail fast on errors rather than providing sophisticated error recovery mechanisms, which means your scrapers may be more brittle than equivalent Scrapy implementations with built-in retry middleware.

Verdict

Use if: You need to scrape data quickly without setting up Python virtual environments or Node.js projects, you’re deploying scrapers to machines where you don’t control the runtime environment (cron jobs on minimal servers, shared hosting), you want to share scrapers as single executable files with non-technical colleagues, or you’re scraping mostly static sites with occasional JavaScript rendering needs. Also excellent for prototyping scraping logic before committing to a full framework.

Skip if: You’re scraping sophisticated single-page applications that require both JavaScript rendering and proxy rotation, you need deep integration with existing Python or Node.js data pipelines, your extraction logic depends on npm packages or complex date/string parsing libraries, or you’re building a long-lived scraping infrastructure with sophisticated error handling, monitoring, and anti-bot evasion. In those cases, Scrapy’s maturity or Playwright’s comprehensive browser control will serve you better despite their heavier dependencies.
