Building a Local Mirror of 70,000+ WordPress Plugins with SVN Revision Tracking
Hook
What if you could grep through every plugin in the WordPress.org directory, all 70,000+ of them, as easily as searching your local codebase? That is what this tool enables: it turns the WordPress plugin ecosystem into a searchable local archive.
Context
The WordPress plugin directory is one of the largest open-source software repositories in existence, containing over 60,000 active plugins with millions of lines of code. For security researchers hunting for vulnerable patterns, plugin developers researching API usage, or WordPress core contributors analyzing ecosystem impact, accessing this data has traditionally meant either writing custom scrapers or manually downloading hundreds of individual plugins.
Mark Jaquith’s WordPress Plugin Directory Slurper emerged from a simple need: efficient bulk analysis of WordPress plugins. Before tools like this, developers attempting comprehensive security audits or compatibility research faced a painful choice: spend days writing a custom downloader, fetch plugins by hand, or maintain a full SVN checkout of the plugin repository, which can take days and consumes enormous disk space because it pulls in every tag and branch of every plugin. The Slurper took a different approach: download just the stable releases, track changes incrementally, and make the entire ecosystem searchable locally.
Technical Insight
The architecture of WordPress Plugin Directory Slurper is deceptively simple, but its efficiency comes from three key design decisions: SVN revision tracking, ZIP file downloads instead of SVN checkouts, and parallelized fetching.
At its core, the tool maintains a .last-revision file that stores the SVN revision number of the last successful sync. On each run, it asks the WordPress.org plugin SVN repository for a log of every change since that revision, then re-downloads only the plugins that changed. This incremental approach turns what would be a multi-hour operation into one that completes in minutes:
// Pseudocode representation of the core logic
$last_revision = file_exists('.last-revision')
    ? (int) trim(file_get_contents('.last-revision'))
    : 0;
// Ask SVN for every change since the last synced revision
$svn_log = shell_exec(
    'svn log https://plugins.svn.wordpress.org/ --verbose --xml -r '
    . ($last_revision + 1) . ':HEAD'
);
// Parse the XML to extract the changed plugin slugs and the newest revision
// (parse_svn_log_xml() and download_plugin_zip() are placeholder helpers)
[$changed_plugins, $latest_revision] = parse_svn_log_xml($svn_log);
foreach ($changed_plugins as $plugin_slug) {
    download_plugin_zip($plugin_slug);
}
// Store the latest revision number only after the downloads have run
file_put_contents('.last-revision', $latest_revision);
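The pseudocode above leaves parse_svn_log_xml() undefined. As a rough sketch of what that step has to do, the shell functions below pull the changed plugin slugs and the newest revision number out of a saved log. The function names, the log.xml file name, and the assumption that every changed path starts with /<slug>/ are illustrative and not taken from the Slurper's source. The line-based sed matching is a shortcut; a real XML parser would be sturdier.

```shell
# Assumption: log.xml was produced by something like
#   svn log https://plugins.svn.wordpress.org/ -r 1234:HEAD --verbose --xml > log.xml
# and every changed path has the form /<slug>/... (helper names are hypothetical).

# Print each changed plugin slug once, sorted.
changed_slugs() {
  sed -n 's|.*>/\([^/<][^/<]*\).*</path>.*|\1|p' "$1" | sort -u
}

# Print the highest revision number found in the log.
latest_revision() {
  sed -n 's/.*revision="\([0-9][0-9]*\)".*/\1/p' "$1" | sort -n | tail -n 1
}

# Usage:
#   changed_slugs log.xml > changed.txt
#   latest_revision log.xml > .last-revision
```

Writing .last-revision only after the downloads succeed means an interrupted run is simply retried from the old revision on the next sync.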
The second crucial decision is downloading pre-packaged ZIP files rather than performing SVN checkouts. When you request a plugin from the WordPress.org download endpoint (https://downloads.wordpress.org/plugin/{slug}.zip), you receive only the current stable version: no commit history, no branches, no SVN metadata. For a mirror intended for code analysis rather than version control, this is ideal. A full SVN checkout of the plugin repository, which includes every tag and branch of every plugin, can run to hundreds of gigabytes; the stable ZIPs alone required only about 45GB as of 2017 (likely 60-80GB today).
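To make the URL pattern concrete, here is a small sketch that turns a list of slugs into download URLs, one per line, ready to be saved as a URL list. The function name slugs_to_urls is hypothetical, and the sketch assumes slugs contain only characters that are safe in a URL path.

```shell
# Read plugin slugs on stdin (one per line) and print the matching
# stable-release ZIP URL for each. Blank lines are skipped.
slugs_to_urls() {
  while IFS= read -r slug; do
    [ -n "$slug" ] || continue
    printf 'https://downloads.wordpress.org/plugin/%s.zip\n' "$slug"
  done
}

# Usage:
#   slugs_to_urls < changed.txt > plugin-urls.txt
```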
The download process itself leverages shell parallelization rather than implementing threading in PHP. The script generates a list of plugin URLs and pipes them to wget via xargs with parallel execution:
# Generated command structure (simplified)
cat plugin-urls.txt | xargs -n 1 -P 10 wget -q -nc -P ./plugins/
# Then extract all ZIPs in parallel (-I {} already runs one unzip per file, so -n 1 is dropped)
find ./plugins -name '*.zip' | xargs -P 10 -I {} unzip -q -o {} -d ./extracted/
The -P 10 flag tells xargs to run up to 10 processes at once, dramatically reducing sync time. On a decent connection, this parallelization can turn a 12-hour sequential download into a 2-3 hour operation.
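Because wget runs with -q here, a failed download leaves no trace in the output. One possible follow-up, sketched below under the assumption that each ZIP is saved under the last path segment of its URL, is to list the URLs whose file is missing or empty so they can be retried. The function name missing_zips is hypothetical and not part of the Slurper.

```shell
# missing_zips URL_FILE ZIP_DIR
# Print every URL from URL_FILE whose ZIP is absent or empty in ZIP_DIR.
missing_zips() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    name=${url##*/}                       # e.g. akismet.zip
    [ -s "$2/$name" ] || printf '%s\n' "$url"
  done < "$1"
}

# Usage:
#   missing_zips plugin-urls.txt ./plugins > retry-urls.txt
```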
The repository also includes a clever scan summarization tool. After downloading the entire plugin directory, you can search for patterns using tools like ag (The Silver Searcher) or ack, then pipe the results through the summarizer to enrich them with plugin metadata:
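# Search for direct database queries (potential SQL injection vectors)
# Single quotes pass \$ through to ag unchanged, so the regex matches a literal dollar sign
ag '\$wpdb->query\(' ./extracted/ -l | \
php summarize-scan.php
# Output includes plugin name, active installations, last updated, etc.
# Illustrative output (exact format may differ):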
# Plugin: contact-form-7 (5M+ installs)
# File: wp-content/plugins/contact-form-7/includes/contact-form.php
# Last Updated: 2024-01-15
This summarization layer transforms raw grep results into actionable intelligence, automatically fetching metadata like active installation counts from the WordPress.org API. A security researcher can immediately prioritize findings based on impact—a vulnerability in a plugin with 5 million active installs deserves immediate attention, while one in a plugin with 100 installs might be lower priority.
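The prioritization step can be sketched as a simple sort. The input format below is an assumption made for illustration, not the summarizer's actual output: one finding per line, tab-separated, with the plugin slug, the active install count as a plain integer, and the matching file. The function name rank_findings is likewise hypothetical.

```shell
# Read tab-separated findings (slug, installs, file) on stdin and print
# them with the most widely installed plugins first; ties sort by slug.
rank_findings() {
  tab=$(printf '\t')
  sort -t "$tab" -k2,2nr -k1,1
}

# Usage:
#   rank_findings < findings.tsv | head -n 20
```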
Gotcha
The Unix-only nature of this tool is its most significant limitation. It’s hardcoded to use wget, svn, and standard Unix shell utilities. Windows developers are completely out of luck unless they install WSL (Windows Subsystem for Linux) or use a fork like chriscct7’s version that reimplements the functionality using cURL and PHP pThreads. Even on macOS, you’ll need to install dependencies via Homebrew—svn isn’t included by default in modern macOS versions, and wget has never been.
Disk space requirements are substantial and growing. The 45GB figure in the 2017 README is likely outdated, since the plugin directory has kept growing. Budget at least 80-100GB for the extracted plugins, plus additional space for the ZIP files if you don’t delete them after extraction. The initial sync also consumes significant bandwidth (roughly the combined size of the ZIPs, so likely tens of gigabytes) and time. On a fast connection, expect 2-4 hours for the initial download; on slower connections, or at peak hours when the WordPress.org servers are busy, it could run overnight.
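Given those numbers, it may be worth checking free space before starting a sync. The sketch below is one way to do that with df. The function name has_free_gb is hypothetical, and the 100GB threshold in the usage line simply mirrors the estimate above.

```shell
# has_free_gb DIR GB
# Succeed if the filesystem holding DIR has at least GB gigabytes free.
has_free_gb() {
  # df -Pk prints a header plus one line; column 4 is available 1024-byte blocks
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Usage:
#   has_free_gb . 100 || { echo "Not enough disk space for a full mirror" >&2; exit 1; }
```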
There’s also a maintenance consideration: this tool downloads stable versions only. If a plugin has vulnerabilities in beta or trunk versions that haven’t been released to stable, you won’t catch them. Additionally, plugins that have been removed from the directory for security reasons may disappear from your local mirror on the next sync, potentially destroying evidence if you’re conducting security research and haven’t backed up your findings.
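One way to guard against losing evidence is to archive a plugin's directory as soon as a finding is confirmed, before the next sync runs. The sketch below assumes the mirror keeps each plugin in its own directory named after the slug (for example ./extracted/<slug>/). The function name snapshot_plugin and the evidence directory are hypothetical.

```shell
# snapshot_plugin MIRROR_DIR SLUG EVIDENCE_DIR
# Archive MIRROR_DIR/SLUG into a dated tarball and print the archive path.
snapshot_plugin() {
  [ -d "$1/$2" ] || return 1
  mkdir -p "$3" || return 1
  out="$3/$2-$(date +%Y%m%d).tar.gz"
  tar -czf "$out" -C "$1" "$2" && printf '%s\n' "$out"
}

# Usage:
#   snapshot_plugin ./extracted some-plugin ./evidence
```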
Verdict
Use if: you’re conducting large-scale security research across the WordPress ecosystem, analyzing API usage patterns across thousands of plugins, performing compliance audits that require scanning every plugin for specific code patterns, or you’re a WordPress core contributor studying the impact of proposed API changes. The ability to run tools like ag or ripgrep across the entire plugin directory is genuinely transformative for these use cases: queries that would take hours through web interfaces complete in seconds locally.
Skip if: you only need to analyze a handful of specific plugins (just download them individually), you’re on Windows without WSL, you lack 100GB+ of free disk space, or you need version history rather than just current stable releases. For targeted analysis or small-scale research, the WordPress.org REST API or individual downloads will be faster and more practical than maintaining a complete local mirror.