Statistically Likely Usernames: Why Smart Pentesters Count Names Like Casinos Count Cards

Hook

A horizontal password attack that tries ‘Password123’ once against each of 10,000 usernames will often compromise more accounts than trying 10,000 passwords against a single account. Because each account sees only one failed login, it also stays under typical lockout thresholds.

Context

Traditional online password guessing follows a vertical pattern: take one username and throw thousands of password guesses at it until the account locks or you break in. This approach dominated the early 2000s, when account lockout policies were inconsistent and three-strike rules were rare. Modern enterprise systems flipped the script. Now a handful of failed logins (five is a common threshold) locks the account, alerts the security team, and burns your chance at that user.
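The trade-off can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic: the lockout threshold, the list size, and the fraction of users who chose the sprayed password are all assumed values, not measurements.

```python
# Back-of-the-envelope comparison of vertical vs. horizontal guessing.
# Every number here is an illustrative assumption, not a measurement.

def vertical_guesses_allowed(lockout_threshold: int) -> int:
    """Guesses against ONE account that still leave it unlocked."""
    return max(lockout_threshold - 1, 0)

def horizontal_expected_hits(num_accounts: int, p_password: float) -> float:
    """Expected accounts opened when one password is tried once per account.

    p_password is the assumed fraction of accounts using that password.
    """
    return num_accounts * p_password

threshold = 5       # assumed policy: the account locks on the 5th failure
accounts = 10_000   # size of the username list
p = 0.001           # assumption: 0.1% of users chose the sprayed password

print("vertical: guesses before lockout:", vertical_guesses_allowed(threshold))
print("horizontal: failed logins per account: 1")
print("horizontal: expected hits:", horizontal_expected_hits(accounts, p))
```

Under these assumptions the vertical attacker gets a handful of guesses at one user, while the horizontal attacker gets one guess at every user and never approaches the threshold.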

The statistically-likely-usernames repository emerged from a simple observation: organizations follow predictable patterns when creating user accounts. HR departments don’t get creative; they follow formulaic rules. John Smith becomes jsmith, j.smith, or john.smith based on whatever naming convention IT established in 1997. If you know the most common first names and surnames (thanks, US Census Bureau and Facebook’s 171 million name index), you can generate username lists ordered by statistical likelihood. Test ‘Password123’ against the 1,000 most probable usernames before trying the 10,000th most probable. You’re not trying to crack one specific account. You’re fishing where you already know the biggest fish swim.
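As a small illustration of those formulaic rules, the sketch below turns one person's name into the three formats mentioned above. The function names and the normalization rules (strip accents, drop punctuation) are assumptions made for this example, not something taken from the repository.

```python
import unicodedata

def normalize(name: str) -> str:
    """Lowercase, strip accents, and drop anything that is not a-z."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if "a" <= ch <= "z")

def username_candidates(first: str, last: str) -> list[str]:
    """Return jsmith, j.smith and john.smith style usernames for one person."""
    f, l = normalize(first), normalize(last)
    if not f or not l:
        return []  # nothing usable left after normalization
    return [f"{f[0]}{l}", f"{f[0]}.{l}", f"{f}.{l}"]

print(username_candidates("John", "Smith"))
print(username_candidates("Zoë", "O'Brien"))
```

Real directories handle apostrophes, hyphens, and duplicate names in their own ways, so treat the normalization here as one plausible convention among several.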

Technical Insight

[Figure: System architecture (auto-generated). Components shown: US Census Data (name frequencies); Facebook Index (171M names); Name Combination Engine; Format Generator (john.smith, jsmith, etc.); Pareto Probability Ranking Algorithm; Interleaving Strategy; Awesome Mix Wordlists; Format-Specific Lists; Age Distribution Assumptions; DOBer Tool; Date-of-Birth Sequences; Username Enumeration; Password Reset Attacks.]

The core methodology applies the Pareto principle to onomastics (the study of names). As a rule of thumb, roughly 20% of names account for about 80% of the population, so a short list of common names covers most people. If ‘James’ is the most common male first name and ‘Smith’ is the most common surname, then ‘jsmith’ should be your first guess, not your 50,000th. The repository provides pre-generated lists combining census-derived name frequency with common corporate username formats.
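The coverage idea is easy to check against any frequency table. The sketch below computes how many of the top-ranked names are needed to cover a target share of people. The counts are made-up toy numbers with a long tail, chosen only to show the calculation.

```python
from itertools import accumulate

def coverage_curve(counts: list[int]) -> list[float]:
    """Cumulative share of the population covered by the top-k names."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    return [running / total for running in accumulate(ranked)]

def names_needed(counts: list[int], target: float) -> int:
    """Smallest k such that the top-k names cover at least `target` of people."""
    for k, share in enumerate(coverage_curve(counts), start=1):
        if share >= target:
            return k
    return len(counts)

# Hypothetical surname counts (invented for illustration, not census data)
toy_counts = [500, 300, 100, 40, 20, 15, 10, 8, 4, 3]
print(names_needed(toy_counts, 0.80), "of", len(toy_counts), "names cover 80%")
```

Running the same two functions over real census counts would show how closely the 80/20 rule of thumb actually holds for a given population.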

The real intelligence lives in the ordering algorithm and format interleaving. Rather than generating all ‘firstname.lastname’ combinations, then all ‘firstinitiallastname’ combinations, the Awesome Mix volumes interleave formats based on combined probability:

# Naive approach (inefficient)
john.smith
james.johnson
michael.williams
# ... 10,000 more firstname.lastname ...
jsmith
jjohnson
mwilliams

# Awesome Mix approach (statistically optimized)
john.smith
jsmith
j.smith
james.johnson
jjohnson
j.johnson
michael.williams
mwilliams
m.williams

This interleaving means you’re testing the most statistically likely username across multiple possible format conventions early, rather than exhausting one format before moving to the next. If an organization uses ‘firstinitiallastname’ format, you don’t waste 50,000 guesses on ‘firstname.lastname’ before discovering that.
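The ordering in the listing above can be reproduced with a few lines. This is a minimal sketch that emits every format for the most likely person before moving to the next. The repository's actual ranking may weight formats and names differently, and `FORMATS` and `interleave` are names invented for this example.

```python
from typing import Callable, Iterable, Iterator

Format = Callable[[str, str], str]

# Assumed set of formats; names are expected to be non-empty strings.
FORMATS: list[Format] = [
    lambda f, l: f"{f}.{l}",      # john.smith
    lambda f, l: f"{f[0]}{l}",    # jsmith
    lambda f, l: f"{f[0]}.{l}",   # j.smith
]

def interleave(ranked_people: Iterable[tuple[str, str]],
               formats: list[Format] = FORMATS) -> Iterator[str]:
    """Yield every format for the most likely person before moving on."""
    seen: set[str] = set()
    for first, last in ranked_people:
        for fmt in formats:
            username = fmt(first.lower(), last.lower())
            if username not in seen:  # john smith and james smith share jsmith
                seen.add(username)
                yield username

ranked = [("john", "smith"), ("james", "johnson"), ("michael", "williams")]
for username in interleave(ranked):
    print(username)
```

The `seen` set matters more than it looks: initial-based formats collapse many people onto one username, so without it the list would spend guesses on duplicates.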

The DOBer companion tool adds another dimension: date-of-birth enumeration for password reset attacks. It orders candidate dates by assuming employee ages are roughly normally distributed, starting at the most likely age and working outward:

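# Simplified concept from DOBer methodology (illustrative, not DOBer's own code)
import datetime
from dateutil.relativedelta import relativedelta  # third-party: python-dateutil

def generate_dob_list(avg_age=38, std_dev=12, min_age=16):
    """Return one candidate date per birth year, most likely age first.

    Ages radiate outward from avg_age through three standard deviations.
    Candidates younger than min_age (an assumed working-age floor) are skipped.
    """
    today = datetime.date.today()

    # Start with the most likely age (mean of the assumed distribution)
    center_dob = today - relativedelta(years=avg_age)
    dobs = [center_dob]

    # Radiate outward one year at a time, alternating older and younger
    for offset in range(1, std_dev * 3 + 1):
        dobs.append(center_dob - relativedelta(years=offset))      # older
        if avg_age - offset >= min_age:
            dobs.append(center_dob + relativedelta(years=offset))  # younger

    return dobs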

This approach recognizes that corporate workforces cluster around certain age ranges. You’re more likely to find employees born in 1985 than 1945 in most tech companies.

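The Unix pipeline examples in the repository show how the lists slot into everyday workflows. Need to test only users with specific surname patterns? Pipe through grep. Need to add domain suffixes for email-based authentication? Pipe through sed. In the example below, awk crosses every first name with each surname, so the output is grouped by surname and head keeps only the most common surnames:

# Generate a custom format from two name lists (file names are illustrative)
awk 'NR==FNR { first[++n] = $1; next }   # first file: remember each first name
     { for (i = 1; i <= n; i++) print tolower(first[i] "." $1) }' \
  top-1000-first-names.txt top-1000-surnames.txt | \
  sed 's/$/@target-corp.com/' | \
  head -n 5000 > custom-email-list.txt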

The repository doesn’t provide attack tooling because it doesn’t need to. These lists integrate seamlessly with existing frameworks like Hydra, Burp Intruder, or custom Python scripts using requests. The value proposition is pure intelligence—knowing which usernames to try first.

One underappreciated aspect: the lists are sorted in descending order of frequency. This means you can truncate at any point and still hold the most likely subset. Need only 1,000 guesses due to time constraints? Use head -n 1000. The first 1,000 lines of a ranked list should match far more real accounts than 1,000 lines drawn from an unsorted one.
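For custom Python scripts, the same grep, sed, and head steps can be expressed lazily so the ranking is preserved and large lists never need to fit in memory. This is a sketch under assumptions: `shape_list` and its parameters are hypothetical names, and the sample usernames and domain are placeholders.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, Optional

def shape_list(ranked: Iterable[str], pattern: Optional[str] = None,
               suffix: str = "", limit: Optional[int] = None) -> Iterator[str]:
    """Filter (grep), decorate (sed), and truncate (head) a ranked list."""
    regex = re.compile(pattern) if pattern else None
    kept = (line.strip() for line in ranked if line.strip())  # skip blank lines
    if regex:
        kept = (u for u in kept if regex.search(u))
    decorated = (u + suffix for u in kept)
    return islice(decorated, limit)  # limit=None keeps everything

ranked = ["jsmith", "jjohnson", "mwilliams", "msmith", "dsmith"]
print(list(shape_list(ranked, pattern=r"smith$",
                      suffix="@target-corp.com", limit=2)))
```

Because the input is only iterated, an open file handle for one of the repository's lists can be passed in place of the sample list.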

Gotcha

The Western naming bias is severe and non-negotiable. These lists derive from US Census data and Facebook’s primarily Western user base. If you’re pentesting a company in China, India, or Nigeria, the statistical distributions collapse. ‘Wang Wei’ is vastly more common than ‘John Smith’ in Chinese populations, but you won’t find that intelligence here. Even within Western contexts, the lists skew toward older naming conventions—millennial and Gen-Z name popularity shifts (more Aidens, fewer Johns) aren’t fully reflected in static census snapshots.

The repository also assumes organizations use predictable, human-readable username formats. Modern identity providers increasingly use UUID-style usernames (a3f8d9c2-4b6e-4f2a-8c9d-1e5f6a7b8c9d) or email addresses from non-corporate domains. SaaS platforms with social login bypass username enumeration entirely. Account enumeration protections—timing attack mitigations, generic error messages, CAPTCHA after N failed attempts—can render even perfect username lists operationally useless. You might know ‘jsmith’ exists, but if the login page returns identical responses for valid and invalid users, you can’t confirm it without triggering other defenses. The repository provides ammunition, but won’t help you bypass the armor protecting the target.

Verdict

Use if you’re conducting authorized penetration tests against traditional enterprise environments (Active Directory, corporate VPNs, legacy web apps) where account lockout is a concern and username formats follow HR conventions. The Awesome Mix volumes are exceptional starting points when you know the organization is Western-focused but haven’t identified the exact username format. Use this for horizontal password spraying attacks where stealth and efficiency matter more than brute force.

Skip if you’re targeting non-Western organizations, modern SaaS platforms with UUID usernames or federated authentication, or environments with robust account enumeration protections. Skip if you need an all-in-one attack framework: this is intelligence, not weaponry, and you’ll need to pair these lists with Hydra, Burp, or custom scripts to operationalize them. Skip if you’re targeting younger demographic organizations where name popularity has shifted significantly from census baselines.
