
htmlq: CSS Selectors Meet Unix Pipelines for HTML Parsing


Hook

While developers have relied on jq for JSON wrangling, HTML—the web’s most ubiquitous format—has languished without an equally elegant command-line companion. Until now.

Context

Command-line HTML parsing has traditionally been a mess of awk scripts, regex nightmares, or heavyweight tools like xmlstarlet with their arcane XPath syntax. Web developers debug browser issues with CSS selectors daily, yet when they need to extract content from HTML in shell scripts or CI/CD pipelines, they’re forced to either write throwaway Python scripts with Beautiful Soup or resort to fragile regex patterns. The cognitive dissonance is jarring: why should extracting an element with id="content" require learning XPath or spinning up a full programming environment?

htmlq bridges this gap by bringing the CSS selector syntax developers already know into the Unix pipeline philosophy. The tool treats HTML extraction as a filter operation—read from stdin, apply CSS selectors, write to stdout—making it naturally composable with curl, grep, and every other Unix tool. The project has gained 7,509 GitHub stars precisely because it solves a problem every developer has faced: quickly extracting structured data from HTML without leaving the terminal.

Technical Insight

[Diagram: system architecture. HTML input arrives on stdin or from a file; an HTML parser builds a DOM tree; a CSS selector matcher picks nodes; CLI flags (--text, --attribute, --pretty, --remove-nodes, --ignore-whitespace) choose the output mode (full HTML, text only, attributes, or formatted HTML); results go to stdout.]

At its core, htmlq uses CSS selectors to extract bits of content from HTML files. The tool defaults to stdin/stdout, making it trivially composable. Here’s how you’d extract all navigation links from a website:

curl --silent https://www.rust-lang.org/ | htmlq --attribute href a

This pipeline fetches HTML with curl and pipes it to htmlq, which selects all anchor tags and extracts only their href attributes. The output is a clean list of URLs, one per line—perfect for further processing with grep, sort, or xargs. Compare this to the equivalent xmllint command, which requires remembering XPath syntax like //a/@href and often chokes on real-world HTML.
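That "further processing" step composes naturally. As a sketch, the sample hrefs below stand in for real htmlq output; the pipeline keeps only absolute links and counts unique hosts:

```shell
# Sample hrefs standing in for `curl … | htmlq --attribute href a` output.
printf '/learn\nhttps://foundation.rust-lang.org/\nhttps://blog.rust-lang.org/\nhttps://blog.rust-lang.org/feed\n' \
  | grep '^https://' \
  | awk -F/ '{ print $3 }' \
  | sort | uniq -c | sort -rn
# counts hosts, most frequent first
```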

The tool supports multiple output modes through flags. By default, htmlq returns the full HTML of matched elements, but --text extracts only text nodes, stripping tags entirely. This is invaluable for content analysis:

curl --silent https://nixos.org/nixos/about.html | htmlq --text .main

This extracts the text content of elements with class “main”, ignoring all markup. The --ignore-whitespace flag further cleans output by skipping text nodes that consist entirely of whitespace—a common annoyance when parsing formatted HTML.

One of htmlq’s most powerful features is --remove-nodes, which allows you to prune unwanted elements before output. When scraping documentation pages, you might want the main content but not the navigation sidebar:

curl --silent https://example.com/docs | htmlq '.content' --remove-nodes 'nav' --remove-nodes '.sidebar'

You can specify multiple --remove-nodes selectors, and htmlq will strip all matching elements from the DOM before serializing your selected content. This preprocessing capability transforms htmlq from a simple extractor into a DOM manipulation tool.
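A quick offline check of the pruning behavior (the inline HTML is made up; assumes htmlq is installed):

```shell
# The <nav> child is removed before the selected .content is serialized.
printf '<div class="content"><nav>menu</nav><p>body text</p></div>' \
  | htmlq '.content' --remove-nodes nav
# output keeps <p>body text</p> but no <nav>
```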

The --detect-base and --base flags handle relative URLs intelligently. When extracting links, you often need absolute URLs rather than relative paths. If the HTML contains a <base> tag, --detect-base will use it to resolve relative URLs; otherwise, you can provide --base https://example.com explicitly. This attention to real-world scraping needs distinguishes htmlq from simplistic regex-based tools.
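If you've already extracted root-relative hrefs without those flags, a rough shell-only approximation is to prefix them yourself. This is a sketch, not what htmlq does internally: BASE and the sample paths are assumptions, and it handles only root-relative paths, not ../ or protocol-relative URLs:

```shell
BASE=https://example.com
# Prefix root-relative paths with the base; absolute URLs pass through.
printf '/docs\nhttps://other.example/x\n/about\n' \
  | sed "s|^/|$BASE/|"
# prints:
# https://example.com/docs
# https://other.example/x
# https://example.com/about
```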

For debugging and presentation, htmlq includes a --pretty flag that reformats HTML with proper indentation, though the README notes this is “a bit of a work in progress.” The tool even pairs elegantly with syntax highlighters like bat:

curl --silent example.com | htmlq 'body' | bat --language html

This pipeline extracts the body element and pipes it through bat for color-coded syntax highlighting in your terminal—a developer experience polish that acknowledges how these tools actually get used.

htmlq is written in Rust, which likely gives it a performance edge over interpreted alternatives, and it’s distributed through Cargo, Homebrew, FreeBSD pkg, and Scoop, a packaging footprint that signals serious cross-platform commitment and community trust.

Gotcha

htmlq’s simplicity is both its strength and limitation. Unlike jq, which can transform JSON with filters, map operations, and arithmetic, htmlq is strictly an extraction tool. You can select elements and pull out attributes or text, but you can’t perform computations, conditional logic, or complex transformations. If you need to extract a price, remove currency symbols, and calculate a total, you’ll need to pipe htmlq’s output to awk or another tool.
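For example, a hypothetical price-totaling pipeline would hand off to tr and awk after extraction. The htmlq stage is shown as a comment, and the sample prices stand in for its output:

```shell
# The full pipeline might look like:
#   curl --silent https://example.com/cart | htmlq --text '.price' | ...
# Sample htmlq output, one price per line:
printf '$19.99\n$5.00\n$0.01\n' \
  | tr -d '$' \
  | awk '{ total += $1 } END { printf "%.2f\n", total }'
# prints 25.00
```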

The “work in progress” caveat on pretty-printing is significant. If you need consistently formatted HTML output for version control or human review, you may find the current implementation insufficient. More critically, htmlq has no concept of JavaScript rendering. Modern single-page applications that load content dynamically will appear mostly empty to htmlq, since it only sees the initial HTML shell. For SPAs, you need a headless browser like Puppeteer or Playwright—htmlq is fundamentally a static HTML parser. It also won’t follow links or handle sessions, so multi-page scraping requires orchestrating multiple curl calls yourself.

Verdict

Use htmlq if you’re writing shell scripts that need to extract content from HTML, building CI/CD pipelines that parse API documentation or test reports, or doing quick one-off web scraping in your terminal. It’s ideal when you’re already comfortable with CSS selectors and want to stay in a Unix pipeline workflow. Skip it if you need to scrape JavaScript-heavy SPAs (use Puppeteer), require complex HTML transformations beyond extraction (reach for a full programming language with Beautiful Soup or Nokogiri), need to follow links and maintain session state (try xidel or a proper scraping framework), or want jq-level data transformation capabilities. For static HTML and straightforward extraction tasks, htmlq is exactly the tool that should have existed a decade ago.
