urlwatch: Building a Personal Web Surveillance System with Python Filters

Hook

Every price tracker, job board monitor, and RSS reader you’ve ever used is just urlwatch with a prettier UI and a monthly subscription fee.

Context

The web changes constantly, but most changes happen silently. That apartment listing you’ve been watching gets marked as rented. A product you want goes on sale. A government RFP opens for bidding. By the time you manually check these pages, the opportunity has passed.

Traditional solutions fall into two camps: browser extensions that are fragile and platform-locked, or SaaS monitoring services that charge monthly fees and can’t handle complex extraction logic. urlwatch emerged as a different approach—a Unix-philosophy tool that does one thing well: fetch content, compare it to a previous snapshot, and notify you of changes. Created by Thomas Perl in 2008 and actively maintained with 3,000+ GitHub stars, it treats web monitoring as a composable pipeline problem rather than a monolithic application.

Technical Insight

System architecture (diagram): YAML job definitions feed a content retriever (HTTP client, browser automation, or shell commands). Raw content flows through the filter chain engine (CSS/XPath, html2text, JSON/jq, custom Python) to produce processed content, which is compared against the previous snapshot in the cache database. A new snapshot is stored, and when changes are detected the diff calculator hands results to the reporter dispatcher, which delivers notifications via email, Telegram, or webhooks.

The genius of urlwatch lies in its filter chain architecture. Every monitoring job is a YAML configuration that defines a URL, a sequence of filters to extract and transform content, and reporters to handle notifications. This composability means you can monitor anything from simple text changes to complex JSON API responses with surgical precision.

A basic configuration might monitor a product page for price changes:

name: "Monitor laptop price"
url: https://example.com/products/laptop-x1
filter:
  - css: '.price-container'
  - html2text:
  - strip:

But the real power emerges when you chain filters. CSS selectors extract DOM elements, html2text converts to plain text, and strip removes whitespace. Need to monitor a JSON API and alert only when a specific field exceeds a threshold? Chain json, jq (using jq query syntax), and grep filters. The filter system includes 40+ built-in filters covering CSS/XPath selection, regex operations, JSON/XML parsing, sorting, deduplication, and even OCR for image-based content.
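Chaining these for a JSON API might look like the following sketch. The endpoint and field names are hypothetical, and the jq filter requires the optional jq dependency to be installed:

```yaml
name: "Alert when the price field crosses $50,000"
url: https://api.example.com/ticker   # hypothetical endpoint
filter:
  - jq: '.data.price'                 # extract one field using jq query syntax
  - grep: '^5[0-9]{4}'                # keep only values in the 50000-59999 range
```

Because grep discards non-matching lines, the job's content is empty until the threshold is crossed, at which point the transition shows up as a change and triggers a notification.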

For complex scenarios, urlwatch supports custom Python filters through hooks. Create a hooks.py file and define filter functions that receive content and return transformed output:

import re
from urlwatch import filters

class FilterExtractJobSalary(filters.FilterBase):
    __kind__ = 'extract-salary'
    
    def filter(self, data, subfilter):
        # Extract salary from job posting text
        pattern = r'\$([0-9,]+)\s*-\s*\$([0-9,]+)'
        matches = re.findall(pattern, data)
        if matches:
            return f"Salary range: ${matches[0][0]} - ${matches[0][1]}"
        return "Salary not listed"

This filter can now be used in any job configuration with - extract-salary:. The hook system transforms urlwatch from a static tool into a programmable monitoring platform.
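Once defined in hooks.py, the custom filter drops into a job definition like any built-in (the URL here is illustrative):

```yaml
name: "Watch a job posting's salary range"
url: https://example.com/jobs/backend-engineer   # illustrative URL
filter:
  - html2text:
  - extract-salary:
```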

The storage layer uses a simple but effective approach: each job’s content snapshot is stored in a local SQLite database via the minidb wrapper (or optionally as plain text files), and diffs are computed using Python’s difflib. When changes are detected, urlwatch generates unified diffs and dispatches them through the reporter chain. You might configure email for important alerts, Pushover for mobile notifications, and a webhook for integration with home automation:

report:
  email:
    enabled: true
    from: urlwatch@example.com
    to: alerts@example.com
    smtp:
      host: smtp.gmail.com
      port: 587
      starttls: true
  pushover:
    enabled: true
    app: app_token_here
    user: user_key_here
    device: iphone
  webhook:
    enabled: true
    webhook_url: https://homeassistant.local/api/webhook/urlwatch

For JavaScript-heavy sites that require browser rendering, urlwatch integrates Pyppeteer (a Python port of Puppeteer) through the navigate job type. This spins up a headless Chromium instance, executes JavaScript, and captures the rendered DOM. While resource-intensive, it handles single-page applications that traditional HTTP requests can’t:

name: "Monitor React app content"
navigate: https://spa-example.com/dashboard
wait_until: networkidle0
filter:
  - css: '#data-container'
  - html2text:

The entire workflow is orchestrated by a command-line interface designed for cron. Running urlwatch executes all configured jobs, computes diffs, and sends notifications. Set up a cron job like */30 * * * * urlwatch and you have a monitoring system that runs every 30 minutes, consuming minimal resources between executions.
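A typical deployment pairs that command with a crontab entry that also captures output for later auditing (a sketch; the binary path and log location are illustrative):

```shell
# crontab -e: run all jobs every 30 minutes, append output to a log
*/30 * * * * /usr/bin/urlwatch >> "$HOME/.cache/urlwatch.log" 2>&1
```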

Gotcha

urlwatch’s simplicity is also its constraint. It’s a command-line tool that runs and exits—there’s no persistent daemon, no web interface, no real-time monitoring. You’re responsible for scheduling via cron or systemd timers, which means minimum check intervals are typically measured in minutes, not seconds. If you need sub-minute alerting or true real-time monitoring, urlwatch’s architecture fundamentally doesn’t support it.

The browser automation feature, while powerful, introduces significant complexity. Installing Pyppeteer means downloading a full Chromium binary (100MB+), and each browser-based job can consume 200-300MB of RAM during execution. On resource-constrained VPS instances or when monitoring dozens of JavaScript-heavy sites, this overhead becomes prohibitive. There’s also no built-in rate limiting or parallelization control—all jobs run sequentially, so 50 jobs with browser automation might take 10+ minutes to complete. The single-machine, local-storage architecture means you can’t distribute monitoring across multiple nodes or achieve high availability. If your monitoring host goes down, your monitoring stops. Period.

Verdict

Use urlwatch if you’re monitoring a handful to a few hundred URLs with cron-based scheduling, need powerful content extraction through filter chains, and want notifications delivered to email, mobile apps, or webhooks. It’s perfect for tracking price changes, job postings, RSS feeds, API responses, or government procurement sites where you need precise filtering logic and don’t want to pay monthly SaaS fees. The Python hooks system makes it infinitely extensible for developers comfortable with scripting. Skip if you need a web UI for non-technical users, real-time monitoring with sub-minute intervals, distributed architecture across multiple servers, team collaboration features, or built-in visualization/dashboard capabilities. In those cases, look at changedetection.io for a Docker-based GUI experience, Huginn for workflow automation beyond simple monitoring, or commercial services like visualping.io if budget permits and you value managed infrastructure.
