urlwatch: Building a Personal Web Surveillance System with Python Filters
Hook
Every price tracker, job board monitor, and RSS reader you’ve ever used is just urlwatch with a prettier UI and a monthly subscription fee.
Context
The web changes constantly, but most changes happen silently. That apartment listing you’ve been watching gets marked as rented. A product you want goes on sale. A government RFP opens for bidding. By the time you manually check these pages, the opportunity has passed.
Traditional solutions fall into two camps: browser extensions that are fragile and platform-locked, or SaaS monitoring services that charge monthly fees and can’t handle complex extraction logic. urlwatch emerged as a different approach—a Unix-philosophy tool that does one thing well: fetch content, compare it to a previous snapshot, and notify you of changes. Created by Thomas Perl in 2008 and actively maintained with 3,000+ GitHub stars, it treats web monitoring as a composable pipeline problem rather than a monolithic application.
Technical Insight
The genius of urlwatch lies in its filter chain architecture. Every monitoring job is a YAML configuration that defines a URL, a sequence of filters to extract and transform content, and reporters to handle notifications. This composability means you can monitor anything from simple text changes to complex JSON API responses with surgical precision.
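Conceptually, the filter chain is plain function composition: each filter takes the previous filter's output as its input. A minimal sketch (not urlwatch's actual implementation; the two stand-in filters here are hypothetical toys):

```python
def apply_filters(content, filter_chain):
    """Run content through each filter in order, like urlwatch's pipeline."""
    for f in filter_chain:
        content = f(content)
    return content

html = "<span class='price'>  $1,299  </span>"
chain = [
    lambda c: c.split(">", 1)[1].rsplit("<", 1)[0],  # toy stand-in for a CSS/html2text step
    str.strip,                                        # toy stand-in for the strip filter
]
print(apply_filters(html, chain))  # $1,299
```

Each stage is ignorant of the others, which is why filters can be recombined freely per job.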
A basic configuration might monitor a product page for price changes:
name: "Monitor laptop price"
url: https://example.com/products/laptop-x1
filter:
  - css: '.price-container'
  - html2text:
  - strip:
But the real power emerges when you chain filters: the css filter extracts DOM elements, html2text converts the result to plain text, and strip trims surrounding whitespace. Need to monitor a JSON API and react only when a specific field changes? Chain the jq filter (which uses jq query syntax) with grep to isolate exactly the value you care about. The filter system includes 40+ built-in filters covering CSS/XPath selection, regex operations, JSON/XML parsing, sorting, deduplication, and even OCR for image-based content.
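As a sketch of that JSON pattern (the endpoint and jq query here are hypothetical), a job can reduce an API response to a single line that only changes when the interesting field does:

```yaml
name: "Watch widget stock level"
url: https://api.example.com/v1/stock/widget-42   # hypothetical endpoint
filter:
  - jq: '.inventory.count'     # jq query syntax; extracts one field
  - grep: '^[0-9]{1,2}$'       # keep output only while the count is below 100
```

Because urlwatch diffs filter output, a notification fires when this line appears, disappears, or changes, which approximates a threshold alert without any custom code.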
For complex scenarios, urlwatch supports custom Python filters through hooks. Create a hooks.py file and define filter classes whose filter() method receives the job's content and returns the transformed output:
import re

from urlwatch import filters


class FilterExtractJobSalary(filters.FilterBase):
    __kind__ = 'extract-salary'

    def filter(self, data, subfilter):
        # Extract the first salary range from job posting text
        pattern = r'\$([0-9,]+)\s*-\s*\$([0-9,]+)'
        matches = re.findall(pattern, data)
        if matches:
            return f"Salary range: ${matches[0][0]} - ${matches[0][1]}"
        return "Salary not listed"
This filter can now be used in any job's filter list as - extract-salary:. The hook system turns urlwatch from a fixed toolset into a programmable monitoring platform.
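The regular expression is the part most likely to need tuning, and it can be exercised outside urlwatch entirely; this standalone sketch mirrors the hook's logic:

```python
import re

def extract_salary(text: str) -> str:
    """Mirror of the hook's logic: pull the first $low - $high range."""
    pattern = r'\$([0-9,]+)\s*-\s*\$([0-9,]+)'
    matches = re.findall(pattern, text)
    if matches:
        return f"Salary range: ${matches[0][0]} - ${matches[0][1]}"
    return "Salary not listed"

posting = "Senior Engineer. Compensation: $120,000 - $150,000 plus equity."
print(extract_salary(posting))  # Salary range: $120,000 - $150,000
```

Testing the pattern this way before wiring it into hooks.py saves a round trip through the full urlwatch job cycle.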
The storage layer uses a simple but effective approach: each job's latest content is stored locally via minidb in an SQLite database (or optionally as plain text files), and diffs are computed using Python's difflib. When changes are detected, urlwatch generates unified diffs and dispatches them through the reporter chain. You might configure email for important alerts, Pushover for mobile notifications, and a webhook for integration with home automation:
report:
  email:
    enabled: true
    from: urlwatch@example.com
    to: alerts@example.com
    smtp:
      host: smtp.gmail.com
      port: 587
      starttls: true
  pushover:
    enabled: true
    user: user_key_here
    device: iphone
  webhook:
    enabled: true
    url: https://homeassistant.local/api/webhook/urlwatch
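The diff step feeding those reporters is straightforward to picture; a sketch using difflib, the same standard-library module urlwatch relies on:

```python
import difflib

old = "Laptop X1\nPrice: $1,299\nIn stock\n"
new = "Laptop X1\nPrice: $1,199\nIn stock\n"

# Unified diff between the stored snapshot and the fresh fetch --
# the same format that lands in your email or webhook payload.
diff = "\n".join(
    difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    )
)
print(diff)
```

Lines prefixed with - and + mark the before and after states, so a price change shows up as exactly one removed line and one added line.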
For JavaScript-heavy sites that require browser rendering, urlwatch integrates Pyppeteer (a Python port of Puppeteer) through the navigate job type. This spins up a headless Chromium instance, executes JavaScript, and captures the rendered DOM. While resource-intensive, it handles single-page applications that traditional HTTP requests can’t:
name: "Monitor React app content"
navigate: https://spa-example.com/dashboard
wait_until: networkidle0
filter:
  - css: '#data-container'
  - html2text:
The entire workflow is orchestrated by a command-line interface designed for cron. Running urlwatch executes all configured jobs, computes diffs, and sends notifications. Set up a cron job like */30 * * * * urlwatch and you have a monitoring system that runs every 30 minutes, consuming minimal resources between executions.
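Stripped of reporting and storage details, each cron invocation amounts to a fetch-compare-store cycle per job; a minimal sketch, with an in-memory dict standing in for urlwatch's snapshot database:

```python
def changed(job_name, fetch, cache):
    """One monitoring cycle: fetch content, compare to the last snapshot."""
    new = fetch()
    old = cache.get(job_name)
    cache[job_name] = new                  # store the fresh snapshot
    return old is not None and old != new  # report only on a real change

cache = {}
changed("laptop", lambda: "Price: $1,299", cache)         # first run: baseline only
print(changed("laptop", lambda: "Price: $1,199", cache))  # True
```

Because all state lives in the snapshot store, the process can exit completely between runs, which is what makes cron scheduling cheap.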
Gotcha
urlwatch’s simplicity is also its constraint. It’s a command-line tool that runs and exits—there’s no persistent daemon, no web interface, no real-time monitoring. You’re responsible for scheduling via cron or systemd timers, which means minimum check intervals are typically measured in minutes, not seconds. If you need sub-minute alerting or true real-time monitoring, urlwatch’s architecture fundamentally doesn’t support it.
The browser automation feature, while powerful, introduces significant complexity. Installing Pyppeteer means downloading a full Chromium binary (100MB+), and each browser-based job can consume 200-300MB of RAM during execution. On resource-constrained VPS instances or when monitoring dozens of JavaScript-heavy sites, this overhead becomes prohibitive. There’s also no built-in rate limiting or parallelization control—all jobs run sequentially, so 50 jobs with browser automation might take 10+ minutes to complete. The single-machine, local-storage architecture means you can’t distribute monitoring across multiple nodes or achieve high availability. If your monitoring host goes down, your monitoring stops. Period.
Verdict
Use urlwatch if you’re monitoring a handful to a few hundred URLs with cron-based scheduling, need powerful content extraction through filter chains, and want notifications delivered to email, mobile apps, or webhooks. It’s perfect for tracking price changes, job postings, RSS feeds, API responses, or government procurement sites where you need precise filtering logic and don’t want to pay monthly SaaS fees. The Python hooks system makes it infinitely extensible for developers comfortable with scripting. Skip if you need a web UI for non-technical users, real-time monitoring with sub-minute intervals, distributed architecture across multiple servers, team collaboration features, or built-in visualization/dashboard capabilities. In those cases, look at changedetection.io for a Docker-based GUI experience, Huginn for workflow automation beyond simple monitoring, or commercial services like visualping.io if budget permits and you value managed infrastructure.