htmlq: Bringing jq's Pipeline Philosophy to HTML Extraction
Hook
While developers have enjoyed jq's elegant JSON manipulation for years, HTML has remained stuck in the world of verbose scripting libraries and clunky DOM APIs—until now.
Context
HTML parsing on the command line has traditionally been a frustrating experience. Need to extract links from a webpage? You'd reach for a Python script with BeautifulSoup, a Node.js file with Cheerio, or resort to fragile regex patterns that break on the first malformed tag. These solutions work, but they violate the Unix philosophy: they're heavy, require context switching between languages, and don't compose naturally with other command-line tools.
The problem became more glaring after jq's release demonstrated how powerful a domain-specific query tool could be. Developers could pipe JSON through transformations, extract nested values, and reshape data structures—all without leaving the shell. Meanwhile, HTML extraction still required writing throwaway scripts. htmlq emerged to fix this gap, bringing CSS selector-based querying to the command line with the performance of Rust and the composability of traditional Unix utilities.
Technical Insight
htmlq's architecture is deceptively simple but carefully designed. At its core, it uses the html5ever parser (through the kuchiki crate, which supplies the DOM tree and CSS selector matching) to build a standards-compliant DOM tree from input HTML. Unlike regex-based approaches that fail on nested structures or malformed markup, html5ever implements the HTML5 parsing algorithm, so it recovers from broken markup the way browsers do.
The tool operates as a classic Unix filter: HTML goes in through stdin or a file named with --filename, CSS selectors define what to extract, and matching elements are written to stdout. Here's a practical example that reads counter values, such as the star count, from a GitHub repository page (the .Counter class depends on GitHub's current markup):
curl -s https://github.com/mgdm/htmlq | \
htmlq '.Counter' --attribute title
This pipeline fetches the page, parses it, finds elements matching the CSS selector .Counter, and extracts their title attribute—all without temporary files or scripting glue. The --attribute flag demonstrates htmlq's output flexibility: you can extract whole elements (default), specific attributes, or just text content with --text.
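As a rough sketch of the three output modes side by side, the helpers below apply the same selector in each mode. The function names and page.html are hypothetical; the flags are the ones listed in htmlq's --help output.

```shell
#!/bin/sh
# Hypothetical helpers: each reads HTML on stdin and applies one output mode.

# Default mode: print every matching element as serialized HTML.
links_html() { htmlq 'a'; }

# Attribute mode: print only the href value of every matching element.
links_href() { htmlq --attribute href 'a'; }

# Text mode: print only the text nodes inside every matching element.
links_text() { htmlq --text 'a'; }

# Example usage (page.html is a placeholder file name):
#   links_href < page.html
```

Wrapping each mode in a function keeps the selector in one place, which helps when the same extraction is reused across scripts.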
Selector support is where the design pays off. htmlq accepts the CSS selector syntax implemented by its underlying selector engine, including pseudo-classes, attribute selectors, and combinators:
# Extract absolute links (href values starting with "http")
curl -s https://example.com | \
htmlq 'a[href^="http"]' --attribute href
# Get text from paragraphs that are direct children of an article
curl -s https://blog.example.com | \
htmlq 'article > p' --text
# Find email inputs in a local file
htmlq --filename form.html 'input[type="email"]'
One particularly useful design decision is the --base flag. Scraped links are often relative, and resolving them normally takes an extra step. htmlq accepts a base URL through --base, or reads the document's own <base> tag when given --detect-base, and resolves relative link URLs against it:
curl -s https://example.com/blog/ | \
htmlq --base https://example.com/blog/ 'a' --attribute href
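This transforms a relative link such as posts/article.html into https://example.com/blog/posts/article.html. Under standard URL resolution rules, a root-relative link such as /posts/article.html resolves against the host instead, giving https://example.com/posts/article.html. It is a small feature, but it eliminates a common post-processing step.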
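The two ways of supplying a base URL can be sketched as a pair of helpers. The function names and the example URL are hypothetical; --base and --detect-base are the flags listed in htmlq's --help output, where --detect-base is described as falling back to --base when the document has no <base> tag.

```shell
#!/bin/sh
# Hypothetical helper: print the link targets found in stdin HTML,
# resolved against the base URL passed as $1.
abs_links() {
    htmlq --base "$1" --attribute href 'a'
}

# Hypothetical variant: prefer the document's own <base> tag and
# fall back to the URL passed as $1 when no tag is present.
abs_links_detect() {
    htmlq --detect-base --base "$1" --attribute href 'a'
}

# Example usage (the URL is a placeholder):
#   curl -s https://example.com/blog/ | abs_links https://example.com/blog/
```

Passing the same URL to curl and to the helper keeps the resolved links consistent with the page that was actually fetched.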
The tool also supports negative selection through --remove-nodes, which deletes matching elements before processing the main query. This is invaluable for cleaning markup before extraction:
# Extract article text while removing ads and sidebars
# (the main selector comes first so it is not mistaken for a --remove-nodes value)
curl -s https://news.example.com/article | \
htmlq 'article' --text --remove-nodes '.ad,.sidebar'
Under the hood, htmlq is a compiled Rust binary, so it starts quickly and avoids the interpreter and import overhead that a scripting-language alternative pays on every invocation. That makes it practical for batch jobs over thousands of documents, although actual timings depend on document size and selector complexity. The single-binary distribution (via cargo, Homebrew, or direct download) means no runtime dependencies to manage: copy the executable and it works.
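A minimal sketch of such a batch job, assuming a directory of saved pages: the function name and directory layout are hypothetical, while --filename and --text are flags from htmlq's --help output.

```shell
#!/bin/sh
# Hypothetical helper: for every .html file in the directory $1,
# print the file path and its <title> text, separated by a tab.
titles_in_dir() {
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue    # the glob matched nothing; skip it
        title=$(htmlq --filename "$f" --text 'title')
        printf '%s\t%s\n' "$f" "$title"
    done
}

# Example usage (pages/ is a placeholder directory):
#   titles_in_dir pages > titles.tsv
```

Each file gets its own htmlq process here, which is simple but not the only option; for very large batches, feeding file names through xargs would allow running several in parallel.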
The output modes deserve special attention because they determine composability. Default output returns serialized HTML fragments, which you can pipe to other htmlq invocations for multi-stage extraction. Text mode strips all tags and returns just content. Attribute mode returns raw values, one per line—perfect for feeding into other Unix tools:
# Download all images from a page (assumes the src values are absolute URLs)
curl -s https://example.com | \
htmlq 'img' --attribute src | \
xargs -n1 curl -O
# Count unique domains linked from a page
curl -s https://example.com | \
htmlq 'a' --attribute href | \
grep -oE 'https?://[^/]+' | \
sort -u | \
wc -l
This composability, where htmlq's output becomes another tool's input, embodies the Unix philosophy and makes htmlq a better fit for shell workflows than a standalone scraping library.
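Multi-stage extraction, with one htmlq invocation feeding another, might look like the sketch below. The function name and the .ad class are hypothetical; the second stage re-parses the fragment printed by the first.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the link targets
# found inside <article> elements, ignoring anything marked as an ad.
article_links() {
    # Stage 1: keep only <article> elements, with .ad nodes removed.
    # Stage 2: parse that fragment again and print each link's href.
    htmlq 'article' --remove-nodes '.ad' | htmlq --attribute href 'a'
}

# Example usage (the URL is a placeholder):
#   curl -s https://news.example.com/article | article_links
```

A single descendant selector such as 'article a' would often do the same job; chaining is mainly useful when the stages need different options.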
Gotcha
htmlq's simplicity is both its strength and its limitation. Unlike jq, which can transform, compute, and restructure JSON, htmlq is strictly an extraction tool: it selects elements and prints their markup, text, or attributes, and nothing more. If you need to convert HTML tables to JSON, calculate values, or reshape extracted data, you'll need to pipe htmlq's output to jq or other tools.
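One way to hand the reshaping step to jq is sketched below, assuming jq is installed. The function name is hypothetical; -R (read each input line as a raw string) and -s (slurp all inputs into one array) are standard jq options.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the href values
# of all links as a JSON array of strings.
links_json() {
    # htmlq prints one value per line; the first jq turns each line
    # into a JSON string, the second collects them into an array.
    htmlq --attribute href 'a' | jq -R . | jq -s .
}

# Example usage (the URL is a placeholder):
#   curl -s https://example.com | links_json > links.json
```

The division of labor stays clean: htmlq does the HTML parsing, and everything that involves structure or computation happens in jq.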
Pretty-printing remains incomplete. htmlq can output HTML fragments, but it doesn't reliably format or indent them. The --pretty flag exists but is described as a work in progress, so don't depend on it in production scripts; if you need to clean up malformed HTML or produce human-readable output, use a separate formatter.
htmlq also supports only CSS selectors. If your workflow relies on XPath queries (common in XML processing and in legacy scraping code), you'll need to translate them or use a different tool such as xidel. CSS selectors are powerful and more familiar to web developers, but XPath can select parent nodes and navigate other axes in ways that the CSS selectors htmlq supports cannot.
Verdict
Use htmlq if you're building shell scripts that need HTML extraction, processing web scraping tasks from the command line, or want a fast, zero-dependency tool for ad-hoc HTML querying. It's ideal when you already know CSS selectors, need to integrate HTML parsing into existing Unix pipelines, or want the performance of a compiled language without the overhead of a full scripting environment. Skip it if you need complex HTML transformations beyond extraction, require XPath query support, depend on reliable pretty-printing for malformed HTML, or are building programmatic applications where a library like scraper (Rust), Cheerio (Node.js), or BeautifulSoup (Python) would provide more flexibility. htmlq excels at doing one thing well: extracting HTML content via CSS selectors in a composable, pipeline-friendly way.