Inside Citizen Lab's Test Lists: The Crowdsourced Dataset Powering Global Censorship Research

Hook

While tech giants debate content moderation, a 519-star GitHub repository of CSV files has quietly become a widely used dataset for measuring what governments block on the internet.

Context

Internet censorship research faces a fundamental chicken-and-egg problem: you can’t measure blocking without knowing what to test, but you can’t know what to test without local expertise in hundreds of countries and dozens of languages. Before Citizen Lab’s test-lists repository, censorship measurement was fragmented—researchers maintained private lists, duplicated effort, and lacked the regional knowledge to identify culturally relevant blocked content. A Saudi activist knows which opposition news sites matter in Riyadh; a researcher in Toronto doesn’t.

The test-lists repository solves this through radical simplicity: country-coded CSV files, each curated by regional experts who understand local politics, language, and censorship patterns. Combined with a global list of internationally significant sites (mostly in English), this creates a community-maintained foundation that tools like OONI use to detect blocking. It’s not software—it’s structured knowledge about what matters enough to censor, encoded in a format that any measurement tool can consume.

Technical Insight

[Figure: System architecture (auto-generated) — the data repository holds country lists and a global list as CSV/JSON files of validated URLs organized by category, alongside Python validation scripts and a category legend (00-LEGEND); OONI Probe and other network measurement tools consume these lists to produce censorship test results.]

The repository’s architecture is deliberately minimal: CSV and JSON files organized by ISO country codes, with a global list supplementing country-specific lists. This isn’t accidental—it’s a design decision prioritizing interoperability over features. The README indicates lists are available in both CSV and JSON formats, though it doesn’t specify the exact schema.
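Because the README doesn't pin down an exact schema, any consumer has to treat the column layout as an assumption. Here's a minimal sketch of loading a list with Python's standard `csv` module; the column names (`url`, `category_code`, and so on) and the sample rows are illustrative guesses, not a documented contract:

```python
import csv
import io

# Hypothetical sample rows. The column names are an assumption about the
# schema for illustration only; the README does not specify them.
SAMPLE = """url,category_code,category_description,date_added,source,notes
https://example-news.org/,POLR,Political,2020-01-15,community,opposition outlet
https://example-vpn.com/,ANON,Internet Tools,2019-06-02,community,circumvention
"""

def load_test_list(fileobj):
    """Parse a test-list CSV into a list of dicts, one per URL."""
    return list(csv.DictReader(fileobj))

entries = load_test_list(io.StringIO(SAMPLE))
print(len(entries))        # 2
print(entries[0]["url"])   # https://example-news.org/
```

The payoff of the flat-file design is exactly this: a dozen lines of standard-library code and any tool can consume the data.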

The category taxonomy is the critical piece. The repository uses a standardized classification system across four broad themes: Political (focused on opposition viewpoints, human rights, freedom of expression, minority rights, religious movements), Social (sexuality, gambling, illegal drugs and alcohol, other sensitive topics), Conflict/Security (armed conflicts, border disputes, separatist movements, militant groups), and Internet Tools (email, hosting, search, translation, VoIP, circumvention methods). A legend file (00-LEGEND-new_category_codes.csv) defines the category codes. This standardization lets researchers aggregate findings globally using consistent taxonomy.
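Consistent codes make cross-country aggregation a one-liner. The sketch below assumes a hypothetical mapping from category codes to the four broad themes; the real codes live in the legend file, so this table is purely illustrative:

```python
from collections import Counter

# Hypothetical code-to-theme mapping for illustration; the authoritative
# definitions are in the repository's 00-LEGEND file.
THEMES = {
    "POLR": "Political", "HUMR": "Political",
    "GMB": "Social", "DATE": "Social",
    "MILX": "Conflict/Security",
    "ANON": "Internet Tools", "VOIP": "Internet Tools",
}

def theme_histogram(entries):
    """Count tested URLs per broad theme, given rows with a category_code."""
    return Counter(THEMES.get(e["category_code"], "Other") for e in entries)

rows = [
    {"url": "https://a.example", "category_code": "POLR"},
    {"url": "https://b.example", "category_code": "ANON"},
    {"url": "https://c.example", "category_code": "POLR"},
]
hist = theme_histogram(rows)
print(hist["Political"])  # 2
```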

For tool builders, integration appears straightforward based on the repository structure. The separation between data (this repository) and measurement (tools like OONI) is clean—you could build different measurement tools using these same lists. The regional approach also scales better than centralized curation: country-specific lists contain URLs with local language content and culturally relevant sites that international researchers might miss—opposition news, religious content, VPN services, and other material specific to each region’s censorship landscape.
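A measurement tool would typically combine the global list with a country-specific list before probing. Here's a minimal sketch of that merge step, assuming rows keyed by URL; the precedence rule (country entries win, so locally curated metadata is kept) is a design choice of this sketch, not something the repository prescribes:

```python
def build_measurement_queue(global_list, country_list):
    """Merge the global list with a country-specific list, deduplicating
    by URL. Country entries overwrite global ones so that local
    categorization takes precedence (an assumption of this sketch)."""
    by_url = {e["url"]: e for e in global_list}
    by_url.update({e["url"]: e for e in country_list})
    return list(by_url.values())

global_rows = [
    {"url": "https://news.example", "category_code": "NEWS"},
    {"url": "https://vpn.example", "category_code": "ANON"},
]
country_rows = [
    {"url": "https://news.example", "category_code": "POLR"},  # local re-categorization
    {"url": "https://local.example", "category_code": "POLR"},
]
queue = build_measurement_queue(global_rows, country_rows)
print(len(queue))  # 3
```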

The CIS (Commonwealth of Independent States) list is notable as the only regional list applying to multiple countries, intended for testing across former Soviet nations. Each list represents accumulated local knowledge curated by regional experts who understand what content is relevant or allegedly blocked in their areas.

Contributing follows a pull request workflow as indicated in the README’s pointer to OONI’s contribution guide. The README notes that the repository contains the newest list for every unique country code, suggesting ongoing maintenance, though update frequency varies by community involvement.
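The repository ships Python validation scripts, though their exact checks aren't documented in the README. A plausible sketch of what format-level validation might look like—malformed URLs, duplicates, unknown category codes—follows; the function and the sample codes are hypothetical, not the repository's actual validator:

```python
from urllib.parse import urlparse

def validate_rows(rows, known_codes):
    """Flag basic format problems in test-list rows. Returns a list of
    (row_index, problem) pairs. This is an illustrative sketch, not the
    repository's real validation logic."""
    problems, seen = [], set()
    for i, row in enumerate(rows):
        url = row.get("url", "")
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((i, "malformed URL"))
        elif url in seen:
            problems.append((i, "duplicate URL"))
        seen.add(url)
        if row.get("category_code") not in known_codes:
            problems.append((i, "unknown category code"))
    return problems

rows = [
    {"url": "https://ok.example/", "category_code": "POLR"},
    {"url": "not-a-url", "category_code": "POLR"},
    {"url": "https://ok.example/", "category_code": "XXXX"},
]
issues = validate_rows(rows, known_codes={"POLR", "ANON"})
print(issues)
```

Note that checks like these verify only structure—whether a URL still resolves, or still hosts relevant content, is a separate (and much harder) problem.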

Gotcha

The repository’s fundamental limitation is stated in its README: these lists ‘are not the entirety of testing lists but rather just the newest list for every unique country code.’ They’re curated samples, not comprehensive catalogs. If a government blocks thousands of sites, a test list might include hundreds. This creates selection bias—lists over-represent known cases and under-represent newly targeted content. There’s no indication of automated discovery; if censorship patterns change, lists won’t reflect this until someone submits updates. This lag could be substantial.

Data quality and coverage likely vary by country based on community involvement. The repository is community-maintained (described as ‘Citizen Lab and Others’ in the citation), so coverage depends on whether local activists, researchers, or organizations have bandwidth to contribute. The README provides no guarantees about list freshness, comprehensiveness, or uniform quality across countries. Lists may contain outdated URLs or sites that have changed since addition. The validation mentioned in the contribution process likely checks format rather than whether destinations still exist or remain relevant.

For automated tools, this means you cannot rely on uniform global coverage—you’re working with a community-curated dataset of varying quality and maintenance levels across different countries.

Verdict

Use test-lists if you’re building censorship measurement tools, conducting academic research on internet freedom, or need structured test data representing culturally relevant content across specific countries and regions. It’s designed as infrastructure for network measurement platforms and investigations where understanding what gets blocked matters. The community curation and regional expertise—local language content, culturally significant sites, regionally relevant categories—provide value that’s difficult to replicate independently. The Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license permits research and non-commercial use.

Skip it if you need comprehensive URL coverage, real-time censorship detection, or are working on general web testing unrelated to censorship research. This is a specialized dataset for internet freedom measurement. Also skip if you need guaranteed data freshness, uniform quality across all countries, or commercial use cases (the license is non-commercial). The README makes clear these are samples (‘not the entirety’), not exhaustive catalogs. Consider other sources if you need SLAs, consistent update schedules, or commercial licensing, though you’ll lose the regional expertise and open collaboration model that characterize this repository.
