Probable-Wordlists: The 2 Billion Password Dataset That Ranks Human Behavior Over Alphabets

Hook

The password '123456' appears over 24 million times in public breaches. Traditional alphabetically-sorted wordlists would put it after '000000'—but in reality, you should try it first.

Context

Password cracking and security testing have relied on wordlists for decades, but many suffer from a fundamental inefficiency: alphabetical ordering. When you're running a dictionary attack against a login endpoint with rate limiting, or testing your organization's password policy, trying 'aardvark' before '123456' is statistically absurd. What was largely missing was a systematic ranking of passwords by their actual occurrence frequency across many breach datasets rather than within a single one.

Berzerk0's Probable-Wordlists emerged to solve this ordering problem at scale. By aggregating over 1,600 password dumps from public breaches—representing roughly 350GB of raw data and approximately 13 billion password instances—the project deduplicates and ranks nearly 2 billion unique passwords by cross-source frequency. The core insight is treating occurrence count as a proxy for probability: if 'password123' appears 50,000 times across breaches and 'xK9#mP2qL' appears 6 times, humans are vastly more likely to choose the former. For penetration testers working against time constraints, security researchers analyzing password patterns, or developers validating password policies, this probability-based approach transforms efficiency.

Technical Insight

[Figure: System architecture (auto-generated). Flow: 1600+ breach sources (~350GB raw data) → deduplication engine (count occurrences) → frequency map (~2B unique passwords) → filter & threshold (remove <5 occurrences) → probability sorting → Real-Passwords (size-based lists), WPA-Length (8-40 char filtered), Dictionary-Style (clean variations) → security tools (hashcat/John/hydra). Analysis outputs: PACK rules (generate patterns) and HashCat masks (generate masks).]

The architecture of Probable-Wordlists isn't code—it's a data pipeline methodology that transforms chaos into actionable intelligence. The process starts with 1,600+ breach sources totaling ~350GB, where each password instance gets counted across sources. Duplicates are collapsed, creating a frequency map of approximately 2 billion unique entries. The critical filtering step removes passwords appearing fewer than 5 times, establishing a statistical threshold that balances comprehensiveness with noise reduction. The final output: pre-sorted wordlists where position correlates with real-world probability.
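The counting, thresholding, and sorting steps described above can be sketched with standard Unix tools. This is a minimal illustration under stated assumptions, not the project's actual tooling: the function name `rank_by_frequency` and the one-password-per-line input format are hypothetical.

```shell
# rank_by_frequency [MIN_COUNT]
# Hypothetical helper: reads one password per line on stdin and prints
# "count<TAB>password" for every password seen at least MIN_COUNT times
# (default 5), most frequent first. Ties are broken by byte order.
rank_by_frequency() {
  min_count="${1:-5}"
  awk -v min="$min_count" '
    { count[$0]++ }                      # tally every instance
    END {
      for (pw in count)
        if (count[pw] >= min)            # drop entries below the threshold
          printf "%d\t%s\n", count[pw], pw
    }
  ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2
}
```

Piping the result through `cut -f2-` drops the counts and leaves a plain wordlist whose line order follows frequency. At the scale the article describes, an in-memory tally like this would likely need to be replaced by an external sort.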

The repository structure reflects different use cases. The Real-Passwords directory contains probability-sorted lists organized by size (Top 12 Thousand, Top 304 Thousand, etc.), while the WPA-Length subdirectories keep only 8-40 character passwords, since WPA/WPA2 passphrases must be at least 8 characters long. The Dictionary-Style lists provide cleaned variations without special characters for different testing scenarios. These are plain text files, ordered so that tools like hashcat and John the Ripper can consume them directly.
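A length filter of the kind the WPA-Length lists apply can be approximated in one line of awk. The sketch below assumes one password per line; the function name `wpa_length_filter` is hypothetical and this is not necessarily how the published lists were produced.

```shell
# wpa_length_filter
# Hypothetical helper: keeps only lines that are 8 to 40 characters long.
# Input order is preserved, so a frequency-ranked list stays ranked.
# Note: some awk implementations count bytes rather than characters,
# which matters for non-ASCII passwords.
wpa_length_filter() {
  awk 'length($0) >= 8 && length($0) <= 40'
}
```

Because the filter never reorders lines, it can be applied to any of the ranked lists without losing the probability ordering.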

For practical application, consider a typical penetration test scenario. You're testing a web application's login endpoint with aggressive rate limiting—maybe 50 attempts before lockout. Using Probable-Wordlists with hydra or a custom script looks like this:

# Baseline: rockyou.txt, a list drawn from a single breach
hydra -l admin -P /usr/share/wordlists/rockyou.txt \
  -t 4 -f example.com https-post-form \
  "/login:username=^USER^&password=^PASS^:Invalid"

# Probable-Wordlists approach: ranked by frequency across many breaches
hydra -l admin -P ./Real-Passwords/Top-304Thousand-probable.txt \
  -t 4 -f example.com https-post-form \
  "/login:username=^USER^&password=^PASS^:Invalid"

The difference isn't the tool—it's the ordering. With Probable-Wordlists, your first 50 attempts target the passwords that appeared most frequently across billions of real choices. You're testing '123456', 'password', 'qwerty', and '123456789' in your first dozen attempts, not after sorting through entries starting with 'a' or special characters.
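One way to put a number on this is to ask what share of all observed password instances the first N entries cover. The sketch below assumes a tally file of `count<TAB>password` lines, sorted most frequent first; that format and the function name `coverage_at` are assumptions for illustration, and the published lists may not ship with counts.

```shell
# coverage_at N
# Hypothetical helper: stdin is "count<TAB>password", most frequent first.
# Prints the percentage of all counted instances that fall within the
# first N entries, to one decimal place.
coverage_at() {
  awk -F '\t' -v n="$1" '
    { total += $1; if (NR <= n) top += $1 }
    END { if (total > 0) printf "%.1f\n", 100 * top / total }
  '
}
```

To keep a run under a lockout threshold, the wordlist itself can be truncated first, for example with `head -n 50` on the ranked list, and the shortened file passed to the attack tool.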

For security researchers analyzing password composition patterns, the repository includes HashCat rules and masks generated via PACK (Password Analysis and Cracking Kit). These files codify statistical patterns extracted from the corpus:

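# Mask attack (-a 3) against NTLM hashes (-m 1000) with one structural pattern:
# any letter, then five lowercase letters, then two digits.
# A .hcmask file can be given in place of the inline mask.
hashcat -a 3 -m 1000 hashes.txt \
  -1 '?l?u' '?1?l?l?l?l?l?d?d'

# Applying probability-derived rules to a base wordlist
hashcat -a 0 -m 1000 hashes.txt \
  base_words.txt -r generated_rules.rule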

The masks represent common structural patterns (lowercase letters followed by digits, capitalized first letter with special character suffix, etc.) weighted by occurrence frequency. This transforms guessing from random permutation exploration into targeted human behavior modeling.
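To make the idea of a structural pattern concrete, the sketch below maps each password to a mask using hashcat's built-in charset placeholders (`?l`, `?u`, `?d`, `?s`). It is a simplified stand-in for what PACK does, not PACK itself; the function name `to_mask` is hypothetical.

```shell
# to_mask
# Hypothetical helper: converts each input line to a hashcat-style mask.
#   ?l = lowercase ASCII letter   ?u = uppercase ASCII letter
#   ?d = digit                    ?s = anything else
# Simplification: non-ASCII bytes are lumped into ?s here, which is
# broader than hashcat's own ?s charset.
to_mask() {
  LC_ALL=C awk '{
    mask = ""
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if      (c ~ /[a-z]/) mask = mask "?l"
      else if (c ~ /[A-Z]/) mask = mask "?u"
      else if (c ~ /[0-9]/) mask = mask "?d"
      else                  mask = mask "?s"
    }
    print mask
  }'
}
```

Feeding a wordlist through `to_mask | sort | uniq -c | sort -rn | head` then shows which structures occur most often, which is the kind of ranking a frequency-weighted mask file encodes.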

The methodology's power comes from cross-source aggregation rather than single-breach analysis. A password appearing in multiple unrelated breaches signals widespread human selection patterns, not coincidence. The 5-occurrence threshold filters out typos, randomly generated passwords, and other noise while preserving statistically significant patterns. This is dataset design as applied probability theory.
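The distinction between raw count and cross-source spread can be illustrated with a small tally that tracks both. The input format (`source<TAB>password`) and the function name `source_spread` are assumptions made for this sketch; the project's stated threshold is on occurrence count, and nothing here should be read as its actual method.

```shell
# source_spread
# Hypothetical helper: stdin is "source<TAB>password" lines.
# Prints "distinct_sources<TAB>total_count<TAB>password", ordered by
# number of distinct sources, then by total count, both descending.
source_spread() {
  awk -F '\t' '
    {
      pw = substr($0, length($1) + 2)    # everything after the first tab
      total[pw]++
      if (!(($1, pw) in seen)) {         # first time this source has this password
        seen[$1, pw] = 1
        spread[pw]++
      }
    }
    END {
      for (pw in total)
        printf "%d\t%d\t%s\n", spread[pw], total[pw], pw
    }
  ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2nr -k3
}
```

A password repeated many times inside one dump and a password seen once in each of many dumps can have the same total count; the first column separates the two cases.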

Gotcha

The elephant in the room: Probable-Wordlists is massive, unwieldy, and ethically fraught. The repository is large enough that users are warned against cloning it directly and are pointed to a downloads page for specific files instead. Even the 'small' Top 12 Thousand list requires careful consideration of where you store it and who has access. The full collection runs to multiple gigabytes of sensitive data: real passwords that real people actually used (and likely reused across services). If you're working in a regulated environment or handling this data without proper security controls, you're creating liability.

More fundamentally, this is a static dataset, not a tool. You can't run Probable-Wordlists—you use it with other tools. If you're expecting a GUI, automated cracking suite, or even basic filtering utilities, you'll be disappointed. It's raw data that assumes you already know what hashcat is, understand dictionary attacks, and have a legitimate reason to be testing password security. The learning curve isn't about the repository itself; it's about the entire password cracking ecosystem it feeds into. New security professionals might download these lists and have no idea how to actually apply them effectively, while experienced practitioners already have their own curated datasets and may not need another massive collection.

Verdict

Use Probable-Wordlists if you're conducting professional penetration testing against rate-limited authentication systems where attempt efficiency matters more than coverage, researching password psychology and human behavior patterns at scale, validating organizational password policies against real-world weakness data, or building security awareness training that demonstrates why common passwords fail. The probability-based ordering delivers maximum value in minimum attempts, making it ideal when time or attempt budgets are constrained.

Skip it if you need lightweight, portable wordlists for basic testing (rockyou.txt serves most casual needs), lack proper security controls for storing breach data (compliance and ethics matter), want an all-in-one cracking tool rather than raw inputs, or don't have existing expertise with password cracking frameworks like hashcat or John the Ripper. This is professional-grade security research infrastructure, not a beginner's toolkit.
