urlwatch: A Filtering Pipeline Approach to Web Change Detection


Hook

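Many web monitoring tools raise false alarms when a site tweaks its layout. urlwatch users often don't notice such changes: their CSS and XPath selectors target specific elements in the HTML structure rather than the page's visual presentation, so they keep working until the targeted markup itself changes.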

Context

Before structured APIs became ubiquitous, developers needed creative ways to track changes on websites that didn't offer feeds or notifications. You might want to know when a product comes back in stock, when job postings appear, or when documentation updates. Screen scraping was the obvious answer, but the devil was in the details: how do you store snapshots efficiently? How do you avoid false positives from timestamps and ads? How do you get notified without building your own notification infrastructure?

Thomas Perl created urlwatch in 2008 to solve this problem with a Unix philosophy approach: do one thing well, be composable, and work with existing tools. Rather than building a comprehensive monitoring platform with dashboards and user management, urlwatch focuses on the core challenge—detecting meaningful changes in noisy web content—and delegates scheduling to cron and notifications to your preferred channels. Fifteen years later, it remains relevant because it solves the fundamental problem elegantly: transforming messy HTML into comparable text through a pipeline of filters.

Technical Insight

urlwatch's architecture centers on three components that operate in sequence: jobs define what to watch, filters transform content before comparison, and reporters send notifications. This separation of concerns makes the system remarkably flexible. A job might fetch a webpage, extract a specific element with CSS selectors, convert it to text, strip out noise with regex, and then let urlwatch's diff engine detect changes.

The filtering pipeline is where urlwatch shines. Instead of comparing raw HTML—which changes constantly due to ads, timestamps, and tracking parameters—you build a preprocessing chain that extracts signal from noise. Here's a practical example monitoring a product availability page:

name: "Monitor GPU Stock"
url: https://example.com/products/gpu-3080
filter:
  - css: 'div.product-status'
  - html2text:
      method: pyhtml2text
  - strip:
  - grep: 'In Stock|Out of Stock'

This configuration fetches the page, extracts only the product status div, converts the HTML to plain text, strips leading and trailing whitespace, and keeps only the lines that mention stock status. urlwatch stores a snapshot of the filtered output and, on subsequent runs, compares the new output against it as a unified diff. Only when that diff is non-empty do the reporters activate.
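
The chain above can be pictured as plain function composition. The following is a minimal Python sketch of that idea, not urlwatch's actual implementation: the names strip_filter, grep_filter, and run_pipeline are hypothetical, and the input string stands in for text that has already passed through the css and html2text stages.

```python
import re
from typing import Callable, List

Filter = Callable[[str], str]


def strip_filter(text: str) -> str:
    """Remove leading and trailing whitespace from the whole text."""
    return text.strip()


def grep_filter(pattern: str) -> Filter:
    """Build a filter that keeps only the lines matching the regex."""
    regex = re.compile(pattern)

    def apply(text: str) -> str:
        return "\n".join(line for line in text.splitlines() if regex.search(line))

    return apply


def run_pipeline(text: str, filters: List[Filter]) -> str:
    """Feed the output of each filter into the next one, in order."""
    for current_filter in filters:
        text = current_filter(text)
    return text


# Hypothetical text as it might look after the css and html2text stages.
extracted = "\n  Availability\nOut of Stock\nUpdated 12:04:31\n"
result = run_pipeline(extracted, [strip_filter, grep_filter(r"In Stock|Out of Stock")])
print(result)
```

The timestamp line never reaches the comparison step, which is why a page that only updates its clock produces no diff.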

The job system supports more than HTTP requests through a plugin architecture. The browser job type integrates with headless browsers for JavaScript-heavy sites that render content dynamically:

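name: "Monitor SPA Application"
# Browser jobs are declared with "navigate" instead of "url"
navigate: https://example.com/dashboard
wait_until: load
filter:
  - css: '#main-content'
  - html2text:

This renders the page in a headless browser before extracting content: Playwright in current releases, pyppeteer in older ones. The wait_until option controls which page-load event urlwatch waits for before capturing; the accepted values depend on the browser backend, so check the documentation for your installed version. For authenticated content, URL jobs accept cookies and headers directly in the job definition. Browser jobs only load and render a single page, so multi-step login flows generally have to be handled outside urlwatch.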

The shell job type monitors command output, making urlwatch useful beyond web scraping:

name: "Monitor open ports"
# The "command" key is what makes this a shell job
command: "nmap -p 80,443 localhost | grep open"
filter:
  - strip:
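
Conceptually, a shell job runs a command and treats its standard output as the content to filter and diff. A minimal Python sketch of that idea follows; run_shell_job is a hypothetical helper rather than urlwatch's own code, and the echo pipeline stands in for the nmap command so the example has no external dependencies.

```python
import subprocess


def run_shell_job(command: str, timeout: float = 30.0) -> str:
    """Run a command through the shell and return its standard output."""
    completed = subprocess.run(
        command,
        shell=True,           # the command string may contain pipes
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


output = run_shell_job("echo 'open port 443' | grep open")
print(output.strip())
```

Because check=True raises on a non-zero exit status, a pipeline ending in grep fails in this sketch when nothing matches. Whether that should count as an error or as empty output is a design choice to make deliberately.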

The storage system uses a simple key-value model: each job is identified by a key, and its filtered output is stored as the value. urlwatch ships with multiple storage backends (plain text files, minidb, Redis) and includes migration tools. The default minidb backend keeps snapshots in a single SQLite database file, so history can be inspected or manipulated with standard SQLite tooling.
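
To make the key-value idea concrete, here is a toy snapshot store in Python. It is a sketch under stated assumptions: the SnapshotStore class, its table layout, and the choice of a SHA-1 hash of the job location as the key are all illustrative, and it uses Python's built-in sqlite3 module purely for convenience rather than reproducing any urlwatch backend.

```python
import hashlib
import sqlite3
import time
from typing import Optional


class SnapshotStore:
    """Toy key-value snapshot store; the schema here is hypothetical."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS snapshots (key TEXT, taken_at REAL, data TEXT)"
        )

    @staticmethod
    def key_for(location: str) -> str:
        """Derive a stable key from the job's URL or command."""
        return hashlib.sha1(location.encode("utf-8")).hexdigest()

    def save(self, location: str, data: str) -> None:
        """Append a new snapshot for the job."""
        self.db.execute(
            "INSERT INTO snapshots VALUES (?, ?, ?)",
            (self.key_for(location), time.time(), data),
        )
        self.db.commit()

    def latest(self, location: str) -> Optional[str]:
        """Return the most recently saved snapshot, or None if there is none."""
        row = self.db.execute(
            "SELECT data FROM snapshots WHERE key = ? ORDER BY rowid DESC LIMIT 1",
            (self.key_for(location),),
        ).fetchone()
        return row[0] if row else None


store = SnapshotStore()
url = "https://example.com/products/gpu-3080"
store.save(url, "Out of Stock")
store.save(url, "In Stock")
print(store.latest(url))
```

Keeping every snapshot rather than overwriting the previous one is what makes it possible to look back through a job's history.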

Reporters implement a simple interface: receive job results with diff information, format them, and send them somewhere. The built-in reporters cover email (SMTP, sendmail, Mailgun), messaging platforms (Telegram, Discord, Slack, Matrix), mobile push (Pushover, Pushbullet), and webhooks. You can enable several reporters at once, so a single change can trigger both an email and a Telegram message. Reporters are configured in urlwatch's configuration file, separate from the job list:

report:
  telegram:
    enabled: true
    bot_token: 'YOUR_BOT_TOKEN'
    chat_id: 'YOUR_CHAT_ID'
  email:
    enabled: true
    method: smtp
    from: 'urlwatch@example.com'
    to: 'you@example.com'
    smtp:
      host: localhost
      port: 25
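
The reporter interface described above (receive a result, format it, send it) can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: JobResult, Reporter, ListReporter, and dispatch are not urlwatch classes, and the in-memory reporter stands in for a real channel such as email or Telegram.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class JobResult:
    """Hypothetical container for what a reporter receives."""
    name: str
    diff: str


class Reporter(Protocol):
    def submit(self, result: JobResult) -> None: ...


class ListReporter:
    """Stand-in for a real channel: collects formatted messages in memory."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.sent: List[str] = []

    def submit(self, result: JobResult) -> None:
        self.sent.append(f"[{self.prefix}] {result.name} changed:\n{result.diff}")


def dispatch(result: JobResult, reporters: List[Reporter]) -> None:
    """Send a result to every enabled reporter, skipping empty diffs."""
    if not result.diff:
        return
    for reporter in reporters:
        reporter.submit(result)


email = ListReporter("email")
telegram = ListReporter("telegram")
dispatch(JobResult("Monitor GPU Stock", "-Out of Stock\n+In Stock"), [email, telegram])
dispatch(JobResult("Monitor GPU Stock", ""), [email, telegram])  # no change, no message
print(len(email.sent), len(telegram.sent))
```

Because each reporter only has to implement one method, adding a new channel does not touch the fetching or filtering code.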

The diff engine uses Python's difflib to generate unified diffs by default, and the comparison can be customized per job. The diff_tool option hands the comparison to an external program such as wdiff for word-level diffs, and a diff_filter chain post-processes the diff itself, for example to keep only added lines. For structured data, filters such as format-json or jq can normalize JSON before comparison so that formatting differences don't register as changes.
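
The core comparison step can be reproduced with difflib directly. This is a minimal sketch, assuming snapshots are plain strings; snapshot_diff is a hypothetical helper and the file labels are arbitrary.

```python
import difflib


def snapshot_diff(old: str, new: str) -> str:
    """Return a unified diff between two snapshots, or '' if they match."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="old snapshot",
        tofile="new snapshot",
        lineterm="",
    )
    return "\n".join(lines)


previous = "Out of Stock"
current = "In Stock"

unchanged = snapshot_diff(previous, previous)
changed = snapshot_diff(previous, current)

if changed:  # only a non-empty diff is worth reporting
    print(changed)
```

Identical snapshots yield an empty string, which is the condition under which no report would be sent.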

Gotcha

urlwatch is a command-line tool that expects you to provide the scheduling infrastructure. There's no daemon mode by default—you run urlwatch --urls urls.yaml manually or via cron. For users comfortable with Unix tools, this is liberating. For those expecting a web dashboard or automatic scheduling, it's a hurdle. The documentation provides systemd timer and cron examples, but you're responsible for ensuring they run and handling failures.

The filtering system's flexibility becomes a debugging challenge when things don't work as expected. The --test-filter option runs a single job and prints its filtered output, but it shows only the end of the chain, so isolating a faulty stage means adding filters back one at a time. Complex XPath or CSS selectors can break when sites change structure, and there's no visual feedback showing which elements matched. You'll find yourself iterating: run urlwatch, check the output, adjust filters, repeat. The --test-diff-filter option does the same for diff_filter chains, using snapshots already in the job's history, but expect some trial and error.
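
When a chain produces unexpected output, it helps to see what each stage emitted. The sketch below illustrates that stage-by-stage inspection in Python; trace_pipeline and the two stand-in stages are hypothetical and are not a urlwatch feature.

```python
import re
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def trace_pipeline(text: str, stages: List[Stage]) -> List[Tuple[str, str]]:
    """Apply each named stage in order and record its output."""
    trace = []
    for name, func in stages:
        text = func(text)
        trace.append((name, text))
    return trace


def keep_in_stock(text: str) -> str:
    """Keep only lines mentioning 'In Stock' (a deliberately narrow pattern)."""
    return "\n".join(line for line in text.splitlines() if re.search(r"In Stock", line))


stages: List[Stage] = [("strip", str.strip), ("grep", keep_in_stock)]

trace = trace_pipeline("  Availability\nOut of Stock\n", stages)
for name, output in trace:
    print(f"after {name}: {output!r}")
```

Here the trace shows that the grep stage is the one that empties the output, because its pattern does not cover the "Out of Stock" case.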

Browser-based jobs add significant overhead. Launching a headless browser for every check is slow and memory-intensive compared to simple HTTP requests. If you're monitoring dozens of JavaScript-heavy sites every few minutes, you'll need considerable resources. The browser integration also requires installing additional dependencies (Playwright or pyppeteer) and managing browser binaries. For simple sites that server-render HTML, the HTTP job type is vastly more efficient.

Verdict

Use if: You need fine-grained control over what parts of web pages to monitor and can invest time crafting filter pipelines. You're comfortable with YAML configuration and command-line tools. You want monitoring that runs on your infrastructure with complete ownership of data and logic. You need to monitor diverse sources (web pages, command output, APIs) with a single tool and notification system.

Skip if: You want a GUI for visual element selection and monitoring setup; browser extensions or changedetection.io provide better user experiences. You need real-time webhooks or sub-minute monitoring intervals, since urlwatch's polling model introduces inherent latency. You're monitoring APIs where you care about structured data semantics rather than rendered output; dedicated API monitoring tools offer better assertion frameworks. You need collaborative monitoring where non-technical users configure checks, because urlwatch's YAML configuration requires developer comfort.
