Building a 45GB Mirror: How WordPress Plugin Directory Slurper Tames 70,000 Plugins
Hook
Want to search for security vulnerabilities across every WordPress plugin ever published? You'll need 45GB of disk space and a clever way to avoid downloading the entire repository every time something changes.
Context
The WordPress plugin directory hosts over 70,000 plugins—a massive ecosystem that security researchers, plugin developers, and core contributors frequently need to analyze in bulk. Need to find every plugin using a deprecated function? Want to audit how many plugins properly sanitize user input? Curious which plugins depend on a specific API that's about to change? The WordPress.org website doesn't offer bulk download, and checking out individual plugins from SVN one at a time is prohibitively slow.
Mark Jaquith created WordPress Plugin Directory Slurper to solve this exact problem. Before this tool, researchers had limited options: manually download thousands of plugins through the web interface, write custom scrapers that might miss plugins or violate terms of service, or attempt to clone the entire SVN repository (which includes every revision of every file ever committed—an impossibly large dataset). The Slurper takes a different approach: it downloads only the stable release ZIP files that WordPress.org generates, tracks what's changed using SVN revision numbers, and parallelizes downloads to make the process reasonably fast. It's essentially rsync for the WordPress plugin ecosystem, purpose-built for developers who need local access to everything.
Technical Insight
The architecture of WordPress Plugin Directory Slurper reveals several clever decisions that make managing 70,000+ plugins tractable. At its core, it's a command-line PHP script that shells out to SVN, wget, and standard Unix pipelines, turning them into an efficient synchronization engine.
The tool avoids the biggest trap: trying to use SVN checkouts. WordPress plugins live in a Subversion repository where each plugin can have trunk, branches, and tags directories. The stable version might be in trunk, or in a tag like /tags/2.4.1, or somewhere else entirely. Instead of parsing this structure, the Slurper downloads the pre-built ZIP files that WordPress.org generates at URLs like https://downloads.wordpress.org/plugin/plugin-name.zip. This is the same file users download through the admin interface, and WordPress.org has already figured out which version is stable.
The synchronization intelligence comes from tracking SVN revision numbers. Here's the core logic:
// Get the list of plugins that changed since the last sync
$last_revision = file_exists('.last-revision')
    ? (int) file_get_contents('.last-revision')
    : 0;

// Ask SVN for the repository's current head revision
$current_revision = (int) shell_exec(
    'svn info https://plugins.svn.wordpress.org/ | grep "^Revision:" | awk "{print \$2}"'
);

// Query SVN for every path changed since the last synced revision (-q drops
// commit messages), then reduce each path to its first component: the plugin slug
$changed = shell_exec(
    "svn log https://plugins.svn.wordpress.org/ -v -q -r" .
    ($last_revision + 1) . ":" . $current_revision .
    " | grep -E '^ +[A-Z]+ /' | awk '{print \$2}' | cut -d'/' -f2 | sort -u"
);

$plugins_to_update = array_filter(explode("\n", trim((string) $changed)));
This approach is efficient: svn log with -v (verbose) lists every path touched by each commit. Reducing each changed path to its top-level directory, which is the plugin slug, and querying only the revisions newer than the one stored in .last-revision identifies just the plugins that have changed. On subsequent runs, instead of downloading 70,000 plugins, you might only download 200.
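The path-to-slug reduction can be tried in isolation. The following is a minimal sketch, not the Slurper's own code: changed_slugs is a hypothetical helper name, and it assumes the usual svn log -v layout in which changed-path lines are indented and start with an action letter.

```shell
# Sketch: reduce `svn log -v -q` output on stdin to a unique list of plugin slugs.
# changed_slugs is a hypothetical helper name, not part of the Slurper.
changed_slugs() {
    # Changed-path lines look like "   M /akismet/trunk/akismet.php".
    # Keep only those lines, take the path, and cut out its first component.
    grep -E '^ +[A-Z]+ /' \
        | awk '{print $2}' \
        | cut -d'/' -f2 \
        | sort -u
}

# Example usage against the live repository (assumes svn is installed and
# that .last-revision holds the last synced revision number):
#   last=$(cat .last-revision 2>/dev/null || echo 0)
#   svn log -v -q -r "$((last + 1)):HEAD" https://plugins.svn.wordpress.org/ | changed_slugs
```

Keeping the parsing in a function that reads standard input makes it easy to check against a saved log excerpt before pointing it at the real repository.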
The parallelization strategy leverages Unix pipelines and xargs. After generating a list of plugin ZIP URLs to download, the script pipes them through xargs with the -P flag to spawn multiple wget processes:
# Download plugins in parallel (default 10 concurrent)
xargs -P 10 -I {} wget -q -nc \
    https://downloads.wordpress.org/plugin/{}.zip \
    -P ./plugins/ < plugins-to-download.txt
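The -P 10 flag lets xargs run up to 10 wget processes at once. On a fast connection, that can turn a batch that would take about 3 hours sequentially into one that finishes in roughly 20 minutes. The -nc (no-clobber) flag skips any file that already exists on disk, which avoids redundant downloads when a run is restarted. The flip side is that -nc will also skip a changed plugin whose old ZIP is still present, so stale archives have to be removed before an update run.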
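Because wget's -nc skips any archive that is already on disk, an update run may prefer to overwrite instead. The sketch below is one way to do that under stated assumptions: fetch_plugins, its argument order, and the FETCH_CMD override are illustrative names rather than anything the Slurper defines, and -O is swapped in for -nc so that an existing ZIP is replaced.

```shell
# Sketch: fetch a list of plugin slugs in parallel with xargs.
# fetch_plugins and FETCH_CMD are hypothetical names used for illustration.
fetch_plugins() {
    list_file=$1          # file with one plugin slug per line
    dest_dir=$2           # directory that receives the ZIP archives
    jobs=${3:-10}         # number of parallel downloads
    # FETCH_CMD defaults to wget; set it to "echo wget" to print the
    # commands instead of running them.
    fetch_cmd=${FETCH_CMD:-wget}
    mkdir -p "$dest_dir"
    # -O writes to an explicit path, so an existing archive is replaced
    # rather than skipped. Caveat: a failed request can leave an empty file.
    # $fetch_cmd is left unquoted on purpose so "echo wget" splits in two.
    xargs -P "$jobs" -I {} $fetch_cmd -q -O "$dest_dir/{}.zip" \
        "https://downloads.wordpress.org/plugin/{}.zip" < "$list_file"
}

# Example usage:
#   fetch_plugins plugins-to-download.txt ./zips 10
```

Lowering the job count is the simplest way to be gentler on both your connection and the download servers.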
After downloading, the script extracts each ZIP into its own directory, maintaining a clean structure: plugins/akismet/, plugins/jetpack/, etc. This structure integrates beautifully with code search tools. The repository includes helper scripts for common searches:
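# List files that call wp_remote_get with no "timeout" on the same line
# (a line-based heuristic: multi-line calls will show up as false positives)
ag "wp_remote_get\([^)]*\)" --php | \
    grep -v "timeout" | \
    cut -d: -f1 | \
    sort -u

# Flag candidate SQL injection: superglobals passed straight into $wpdb->query
# (single quotes keep the shell from stripping the backslashes before each $)
ag '\$wpdb->query.*\$_(GET|POST|REQUEST)' --php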
These patterns are gold for security research. The tool transforms an impossibly large manual audit into a greppable filesystem.
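For triage it often helps to collapse file-level hits into a list of affected plugins. Here is a minimal sketch of that idea, assuming the plugins/slug/ layout described above; plugins_matching is a hypothetical helper, and it uses find and grep rather than ag so that it runs on a stock system.

```shell
# Sketch: list the plugin slugs whose PHP files match an extended regex.
# plugins_matching is a hypothetical helper name.
plugins_matching() {
    pattern=$1
    root=${2:-plugins}    # mirror root containing one directory per plugin
    (
        cd "$root" || exit 1
        # grep -l prints only the names of matching files, e.g.
        # "./akismet/inc/db.php"; field 2 of that path is the plugin slug.
        find . -type f -name '*.php' -exec grep -lE -e "$pattern" {} + \
            | cut -d'/' -f2 \
            | sort -u
    )
}

# Example usage:
#   plugins_matching '\$wpdb->query.*\$_(GET|POST|REQUEST)' ./plugins
```

A match is only a lead, not a confirmed vulnerability: every hit still needs a manual read of the surrounding code.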
One architectural decision worth noting: the script uses shell commands rather than PHP's native SVN or HTTP libraries. This makes it more portable (works anywhere svn and wget are installed) and leverages decades of Unix tool optimization. The PHP portions handle logic and orchestration, while battle-tested Unix tools handle the heavy lifting.
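The extraction step mentioned earlier is another place where a standard tool can do the work. The sketch below is an illustration under assumptions, not the Slurper's implementation: extract_all and UNZIP_CMD are hypothetical names, and it assumes each WordPress.org ZIP contains a single top-level directory named after the plugin slug.

```shell
# Sketch: unpack every ZIP in one directory into a per-plugin tree.
# extract_all and UNZIP_CMD are hypothetical names used for illustration.
extract_all() {
    zip_dir=$1            # directory holding slug.zip archives
    out_dir=$2            # mirror root, e.g. ./plugins
    # UNZIP_CMD lets you substitute another extractor; defaults to unzip.
    unzip_cmd=${UNZIP_CMD:-unzip}
    mkdir -p "$out_dir"
    for archive in "$zip_dir"/*.zip; do
        [ -e "$archive" ] || continue   # no archives: the glob stayed literal
        slug=$(basename "$archive" .zip)
        # Remove the previous copy so files deleted upstream do not linger.
        rm -rf "${out_dir:?}/$slug"
        # -q quiet, -o overwrite without prompting, -d target directory.
        # Assumes the archive unpacks into its own "$slug/" directory.
        $unzip_cmd -q -o "$archive" -d "$out_dir"
    done
}

# Example usage:
#   extract_all ./zips ./plugins
```

Deleting the old directory before unpacking is a design choice: it keeps the mirror an exact copy of the current release, at the cost of a brief window in which the plugin is absent.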
Gotcha
WordPress Plugin Directory Slurper is decidedly Unix-centric, and this creates real limitations. The script assumes bash, SVN command-line tools, wget, and standard Unix utilities like awk and grep. Windows users need WSL (Windows Subsystem for Linux) or must use one of the community forks like chriscct7's version that replaces wget with cURL and adds Windows compatibility. Even then, path handling and permissions can be quirky.
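Given those dependencies, a quick preflight check can save a failed run. This is a small sketch rather than anything shipped with the tool; missing_tools is a hypothetical helper, and the tool list in the usage line is an assumption based on the commands named in this article.

```shell
# Sketch: print the name of each required command that is not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (tool list assumed from the commands this article mentions):
#   missing_tools php svn wget unzip awk grep
```

An empty result means everything was found; any names printed are what you need to install first.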
The initial sync is genuinely heavyweight. Expect 45-50GB of disk space and anywhere from several hours to overnight for the first run, depending on your connection speed. The parallel downloads are aggressive (10 concurrent wget processes by default), which can saturate bandwidth, trigger ISP rate limiting, or even look suspicious to network monitoring systems. On cloud instances with usage-based bandwidth billing, that initial sync might cost real money. The incremental updates are reasonable (usually under 1GB and a matter of minutes), but there's no escaping that first big pull.
You also need to run it regularly: if you let the mirror go stale for months, catching up means downloading thousands of changed plugins again. The tool doesn't keep old plugin versions either. It overwrites each plugin's directory on update, so you only ever have the latest stable release, not a historical archive.
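Since the first run needs tens of gigabytes, it is worth checking free space before starting. The following is a minimal sketch under assumptions: has_free_gb is a hypothetical helper, and the 50GB threshold in the usage line simply mirrors the estimate above.

```shell
# Sketch: succeed only if the filesystem holding a directory has enough room.
# has_free_gb is a hypothetical helper name.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks; field 4 of the
    # second line is the space available on the filesystem holding $dir.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example usage:
#   has_free_gb . 50 || echo "Less than 50GB free; the initial sync may not fit."
```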
Verdict
Use if: You're conducting WordPress security research, performing bulk code analysis across the plugin ecosystem (finding deprecated function usage, API patterns, security vulnerabilities), contributing to WordPress core and testing backward compatibility, or building tools that analyze plugin code at scale. The smart incremental updates make ongoing maintenance tractable, and the local filesystem makes sophisticated searches possible.
Skip if: You only need a handful of specific plugins (just download them directly), you're on Windows without WSL and want native tools, you have limited disk space (<50GB free), your use case is one-time analysis rather than ongoing research (the maintenance burden isn't worth it), or you need historical versions rather than just current stable releases. For narrow use cases, the WordPress.org Plugin API or targeted downloads are more appropriate than maintaining a 45GB mirror.