Inside CitizenLab's Test Lists: The Crowdsourced Dataset Powering Global Censorship Research

Hook

Over 530,000 censorship measurements run daily against a dataset maintained by volunteers who know which local news sites, political blogs, and messaging apps their governments want to silence.

Context

Before the CitizenLab test-lists repository, censorship measurement tools faced a fundamental problem: how do you test for website blocking when you don't know which websites to test? Early tools either used generic top-1000 lists from Alexa (missing locally relevant sites) or required researchers to manually compile URLs for each study (making systematic global monitoring impossible). A blocked news site in Thailand matters more for understanding Thai censorship than whether Facebook is accessible, but automated approaches couldn't capture this nuance.

CitizenLab created test-lists in 2014 to solve this expert knowledge problem through crowdsourcing. The repository maintains country-specific CSV files with URLs chosen by regional contributors who understand local politics, languages, and censorship patterns. A human rights activist in Egypt knows which opposition websites matter; a digital rights researcher in Russia understands which VPN sites get targeted. This expert curation, combined with a standardized categorization framework, transformed censorship research from ad-hoc studies into systematic global monitoring. Today, the Open Observatory of Network Interference (OONI) runs millions of measurements monthly using these lists, creating the world's largest censorship measurement dataset.

Technical Insight

The architecture is deceptively simple: country-specific CSV files (like cn.csv, ir.csv, us.csv) plus a global.csv containing universally relevant sites. Each row contains a URL, category code, category description, date added, source, and notes. The categories follow a shared taxonomy of short codes, including POLR (political criticism), HUMR (human rights), CULTR (culture), XED (sex education), NEWS, HATE, REL (religion), and, critically, ANON (circumvention tools themselves).

Here's what a typical entry looks like in the CSV structure:

url,category_code,category_description,date_added,source,notes
https://www.bbc.com/persian,NEWS,News Media,2014-04-15,Citizenlab,Persian language BBC service
https://telegram.org,ANON,Anonymization and circumvention tools,2016-03-22,OONI,Messaging app frequently blocked
https://iranhumanrights.org,HUMR,Human Rights Issues,2014-04-15,Community,Iranian human rights documentation
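As a minimal sketch of how rows in this shape can be consumed, the snippet below reads the three sample rows above from an in-memory string and counts URLs per category code. The helper name count_by_category is made up for illustration; a real script would read a file from the repository instead.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the example above; a real list would be read from a file.
SAMPLE_CSV = """url,category_code,category_description,date_added,source,notes
https://www.bbc.com/persian,NEWS,News Media,2014-04-15,Citizenlab,Persian language BBC service
https://telegram.org,ANON,Anonymization and circumvention tools,2016-03-22,OONI,Messaging app frequently blocked
https://iranhumanrights.org,HUMR,Human Rights Issues,2014-04-15,Community,Iranian human rights documentation
"""

def count_by_category(csv_text):
    """Return a Counter mapping each category_code to its number of URLs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["category_code"].strip() for row in reader)

counts = count_by_category(SAMPLE_CSV)
for code, n in sorted(counts.items()):
    print(code, n)
```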

The Python tooling around these lists handles validation and conversion. The validate.py script ensures URLs are properly formatted, categories are valid, and there are no duplicates. Here's how you might validate and parse these lists programmatically:

import csv
from urllib.parse import urlparse

# Category codes accepted by this example validator
VALID_CATEGORIES = {'POLR', 'HUMR', 'ENV', 'MILX', 'HATE', 'NEWS', 'XED',
                    'PORN', 'PROV', 'PUBH', 'GMB', 'ANON', 'DATE',
                    'GRP', 'LGBT', 'FILE', 'HACK', 'COMT', 'MMED',
                    'HOST', 'SRCH', 'GAME', 'CULTR', 'ECON', 'GOVT',
                    'COMM', 'REL', 'ALDR', 'CTRL'}

def validate_test_list(csv_path):
    errors = []
    urls_seen = set()
    
    with open(csv_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader, start=2):
            url = row.get('url', '').strip()
            category = row.get('category_code', '').strip()
            
            # Validate URL format
            try:
                parsed = urlparse(url)
                if not parsed.scheme or not parsed.netloc:
                    errors.append(f"Line {i}: Invalid URL format: {url}")
            except Exception as e:
                errors.append(f"Line {i}: URL parse error: {e}")
            
            # Check for duplicates
            if url in urls_seen:
                errors.append(f"Line {i}: Duplicate URL: {url}")
            urls_seen.add(url)
            
            # Validate category
            if category not in VALID_CATEGORIES:
                errors.append(f"Line {i}: Invalid category '{category}' for {url}")
    
    return errors

# Usage for testing Iran's list
errors = validate_test_list('lists/ir.csv')
if errors:
    for error in errors:
        print(error)
else:
    print("List validation passed")

The repository's real power lies in its integration with measurement platforms. OONI Probe, the most widely deployed censorship measurement tool, downloads these lists and tests each URL using multiple techniques: DNS lookup consistency, HTTP header examination, and TCP connection attempts. When OONI detects anomalies—DNS responses pointing to block pages, TCP connections timing out, or HTTP responses with government warning messages—it logs the measurement along with the URL's category from test-lists.
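The comparison logic can be illustrated with a deliberately simplified sketch. This is not OONI's actual algorithm: the Observation shape, the classify function, and the label strings are all hypothetical, and real measurement tools handle cases this ignores (for example, CDNs legitimately returning different IPs in different regions). It only shows the general idea of comparing what a probe sees against a control vantage point.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Observation:
    """What one vantage point saw for a URL (hypothetical, simplified shape)."""
    dns_ips: frozenset           # IPs returned by DNS; empty if resolution failed
    tcp_connected: bool          # whether a TCP connection was established
    http_status: Optional[int]   # None if no HTTP response arrived

def classify(probe: Observation, control: Observation) -> str:
    """Compare a probe's view with a control's view and label the result."""
    # If the control also fails, the site is probably down rather than blocked.
    if not control.dns_ips or not control.tcp_connected or control.http_status is None:
        return "control_failure"
    # No overlap between probe and control DNS answers is treated as suspicious here.
    if not (probe.dns_ips & control.dns_ips):
        return "dns_anomaly"
    if not probe.tcp_connected:
        return "tcp_anomaly"
    if probe.http_status is None or probe.http_status != control.http_status:
        return "http_anomaly"
    return "ok"

control = Observation(frozenset({"192.0.2.1"}), True, 200)
probe = Observation(frozenset({"198.51.100.7"}), True, 200)
print(classify(probe, control))
```

Checking the control first matters: without it, every dead site in a list would look like a block.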

This creates a feedback loop: researchers analyze OONI data to see which categories get blocked most (often ANON and POLR in authoritarian countries), then update test-lists with new URLs in those categories. For example, after Telegram got blocked in Iran, contributors added dozens of Telegram proxy domains to ir.csv in the ANON category, enabling ongoing monitoring of the block's effectiveness.
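The analysis half of that loop amounts to grouping measurements by category and ranking the anomaly rates. Here is a rough sketch under stated assumptions: the record fields (url, category_code, anomaly) and the example.org URLs are invented for illustration and do not reflect OONI's real measurement schema.

```python
from collections import defaultdict

def anomaly_rate_by_category(measurements):
    """Return {category_code: fraction of measurements flagged as anomalous}."""
    totals = defaultdict(int)
    anomalies = defaultdict(int)
    for m in measurements:
        totals[m["category_code"]] += 1
        if m["anomaly"]:
            anomalies[m["category_code"]] += 1
    return {code: anomalies[code] / totals[code] for code in totals}

# Hypothetical records; field names are assumptions, not OONI's schema.
sample = [
    {"url": "https://example.org/a", "category_code": "ANON", "anomaly": True},
    {"url": "https://example.org/b", "category_code": "ANON", "anomaly": True},
    {"url": "https://example.org/c", "category_code": "NEWS", "anomaly": False},
    {"url": "https://example.org/d", "category_code": "NEWS", "anomaly": True},
]

rates = anomaly_rate_by_category(sample)
# Print categories from most to least frequently anomalous.
for code, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{code}: {rate:.0%}")
```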

The JSON format mirrors the CSV structure but enables easier programmatic consumption:

[
  {
    "url": "https://www.hrw.org",
    "category_code": "HUMR",
    "category_description": "Human Rights Issues",
    "date_added": "2014-04-15",
    "source": "Citizenlab",
    "notes": "International human rights monitoring"
  },
  {
    "url": "https://www.amnesty.org",
    "category_code": "HUMR",
    "category_description": "Human Rights Issues",
    "date_added": "2014-04-15",
    "source": "Citizenlab",
    "notes": ""
  }
]
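A conversion along these lines needs only the standard library. The sketch below is one way to do it, not the repository's own tooling; csv_to_json is a hypothetical helper, and the input rows are the two entries shown above.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert test-list CSV text into a JSON array of row objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2, ensure_ascii=False)

sample = (
    "url,category_code,category_description,date_added,source,notes\n"
    "https://www.hrw.org,HUMR,Human Rights Issues,2014-04-15,Citizenlab,"
    "International human rights monitoring\n"
    "https://www.amnesty.org,HUMR,Human Rights Issues,2014-04-15,Citizenlab,\n"
)

converted = csv_to_json(sample)
print(converted)
```

Every value stays a string, and an empty notes column becomes an empty string rather than null, which matches the second entry above.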

Contributors follow documented guidelines about URL selection: prefer HTTPS, include popular local sites, represent diverse viewpoints (not just opposition), and prioritize sites actually blocked or likely targets. The review process happens through GitHub pull requests where regional experts debate whether specific URLs belong. This transforms censorship research from closed academic work into transparent collaborative infrastructure.

Gotcha

The repository's biggest limitation is the static snapshot problem. Websites disappear, URLs change, and new censorship targets emerge constantly. A political blog blocked in 2015 might be offline by 2023, so failed measurements against it say nothing about censorship and only add noise to the data. OONI's measurements show roughly 15-20% of URLs in some country lists return errors unrelated to censorship (DNS failures, dead sites, hosting changes). There's no automated mechanism to detect and remove dead URLs, so the lists gradually decay without active maintenance.
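A maintainer could at least shortlist candidates for review with a small script. The sketch below is an assumption about how such a check might look, not something the repository ships: it flags URLs whose hostname no longer resolves. It would need to run from a network that is not itself censored, and a failed lookup is only a hint, so a human should confirm before removing anything. The resolver is injectable so the logic can be exercised without network access.

```python
import socket
from urllib.parse import urlparse

def resolves(hostname):
    """Return True if the hostname currently resolves via the local resolver."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False

def dead_url_candidates(urls, resolver=resolves):
    """Return URLs whose hostname is missing or does not resolve.

    These are candidates for manual review, not confirmed dead sites.
    """
    candidates = []
    for url in urls:
        host = urlparse(url).hostname
        if host is None or not resolver(host):
            candidates.append(url)
    return candidates
```

Calling dead_url_candidates(urls) with the default resolver performs live DNS lookups; passing a stub such as lambda host: host != "gone.example" keeps it offline for testing.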

Coverage varies wildly by country based on contributor availability. Germany's list has 1,000+ well-maintained URLs spanning 25+ categories. Smaller countries or those with fewer digital rights researchers might have 50 URLs added years ago. This creates measurement bias: we have detailed data about censorship in countries with strong digital rights communities but sparse data from countries where censorship is likely more severe. The non-commercial license (CC BY-NC-SA 4.0) also limits practical applications—commercial VPN providers or security companies can't integrate these lists without licensing concerns, reducing the dataset's impact on tools consumers actually use.
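Coverage gaps like these are easy to quantify once the CSV files are checked out locally. The following sketch assumes a directory of per-country CSV files with a category_code column; coverage_summary and the directory argument are illustrative names, not part of the repository.

```python
import csv
from pathlib import Path

def coverage_summary(lists_dir):
    """Return {list name: {"urls": row count, "categories": distinct codes}}."""
    summary = {}
    for path in sorted(Path(lists_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        summary[path.stem] = {
            "urls": len(rows),
            "categories": len({row["category_code"] for row in rows}),
        }
    return summary
```

Sorting the result by the "urls" value, for example with sorted(summary.items(), key=lambda kv: kv[1]["urls"]), surfaces the thinnest lists first.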

Verdict

Use if: You're building censorship measurement tools, conducting Internet freedom research, need curated lists of politically/socially relevant websites by country, or want to contribute to global censorship monitoring. The repository is the de facto standard for this domain and integrates seamlessly with OONI, Censored Planet, and similar platforms. Also use if you're studying regional Internet ecosystems and need expert-selected URLs representing local online spaces.

Skip if: You need commercial licensing for a product, require real-time always-current URL lists, want comprehensive coverage of all websites rather than censorship-relevant targets, or need actual measurement data rather than test targets. Also skip if you're looking for executable code rather than curated data files: this is infrastructure for other tools, not a standalone measurement platform.
