Big List of Naughty Strings: The Test Dataset That Breaks Your Input Validation

Hook

A text file with 47,000+ GitHub stars has probably prevented more production bugs than most testing frameworks—and it's just a list of weird strings that break things.

Context

Every application that accepts user input eventually encounters "that one string" that crashes the parser, corrupts the database, or renders the UI unusable. Maybe it's a name like "Patrick O'Brien" that breaks your SQL query, or "𝕿𝖍𝖊 𝕼𝖚𝖎𝖈𝖐 𝕭𝖗𝖔𝖜𝖓 𝕱𝖔𝖝" in mathematical alphanumeric symbols that your font renderer can't handle. Perhaps it's a "<script>" tag that somehow made it past your sanitizer, or "../../../etc/passwd" lurking in a filename parameter.

Before the Big List of Naughty Strings (BLNS), developers discovered these edge cases the hard way—through bug reports, security incidents, and angry customers whose legitimate names contained apostrophes. Testing frameworks offered fuzzing capabilities, but they generated random noise rather than the specific pathological inputs that reliably break real-world systems. Security professionals maintained their own collections of injection payloads, but these weren't designed for general QA use. Max Woolf created BLNS in 2015 to consolidate the "greatest hits" of problematic strings into a single, accessible resource that any developer could integrate into their test suite. The repository's massive adoption reflects a simple truth: curated test data is infrastructure, and most teams were building it from scratch.

Technical Insight

Figure: System architecture (auto-generated). In the data repository, blns.txt (the master string list) is parsed section by section by a Python script (scripts/blns.py) and converted to blns.json (the programmatic format). On the consumer side, a test framework in any language imports the JSON, iterates the strings against the application under test, and validates the responses into pass/fail test results.

The genius of BLNS lies in its organization and accessibility. The core artifact is blns.txt, a plain text file containing over 500 strings organized into commented sections: Reserved Strings (like "undefined" and "null"), Numeric Edge Cases (like "999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999"), Script Injection patterns, SQL Injection attempts, Server Code Injection, Special Characters, Unicode symbols, two-byte characters, Japanese emoticons, Emoji, Right-to-Left text, Quotation Marks, and more. Each section targets specific failure modes in modern software.
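To make the section layout concrete, here is a minimal sketch that groups strings under their section title. It assumes each section opens with a block of `#` comment lines whose first line carries the title, followed by a blank line and then the strings. The `parse_sections` helper and the inline `SAMPLE` text are hypothetical and only imitate that layout; they are not part of the repository.

```python
# Hypothetical excerpt imitating the assumed layout of blns.txt.
SAMPLE = (
    "#\tReserved Strings\n"
    "#\n"
    "#\tStrings which may be used elsewhere in code\n"
    "\n"
    "undefined\n"
    "null\n"
    "\n"
    "#\tNumeric Strings\n"
    "#\n"
    "#\tStrings which can be interpreted as numeric\n"
    "\n"
    "0\n"
    "1E100\n"
)


def parse_sections(text):
    """Group non-comment lines under the first line of the preceding comment block."""
    sections = {}
    current = None
    in_comment_block = False
    # Split on "\n" only: str.splitlines() would also split on Unicode line
    # separators, which can legitimately appear inside a test string.
    for line in text.split("\n"):
        if line.startswith("#"):
            title = line.lstrip("#").strip()
            if not in_comment_block and title:
                # First non-empty comment line of a block is taken as the title.
                current = title
                sections.setdefault(current, [])
            in_comment_block = True
        elif line == "":
            in_comment_block = False
        else:
            in_comment_block = False
            if current is None:
                current = "Uncategorized"
                sections.setdefault(current, [])
            sections[current].append(line)
    return sections


if __name__ == "__main__":
    for title, strings in parse_sections(SAMPLE).items():
        print(title, strings)
```

Keeping the section title next to each string makes a failing test easier to triage, since the title already hints at the failure mode being probed.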

Integrating BLNS into a test suite is deliberately trivial. Here's how you'd use it in a Python test to verify that your user registration endpoint handles edge cases:

import json

import requests

# blns.json is a flat JSON array of strings, so there are no comment lines to filter
with open('blns.json', 'r', encoding='utf-8') as f:
    naughty_strings = json.load(f)

def test_user_registration_handles_edge_cases():
    for index, test_string in enumerate(naughty_strings):
        if not test_string:
            continue  # Skip empty entries

        response = requests.post('https://api.example.com/users', json={
            'username': test_string,
            # Unique address per iteration so duplicate-email checks don't mask results
            'email': f'test{index}@example.com',
            'bio': test_string
        }, timeout=10)

        # Your API should either accept the input cleanly
        # or reject it with a proper 400 error, never 500
        assert response.status_code in [200, 201, 400], \
            f"Server error on input: {test_string[:50]!r}"

        # If accepted, ensure it round-trips correctly
        if response.status_code in [200, 201]:
            user_id = response.json()['id']
            get_response = requests.get(
                f'https://api.example.com/users/{user_id}', timeout=10)
            retrieved_username = get_response.json()['username']
            assert retrieved_username == test_string, \
                f"Data corruption: {test_string[:50]!r} became {retrieved_username[:50]!r}"

This simple loop tests 500+ edge cases automatically. You'll quickly discover which strings cause crashes (500 errors), data corruption (strings that don't round-trip), or validation problems. The commented sections in blns.txt help you understand what broke: if "<img src=x onerror=alert('XSS') />" passes through unsanitized, you have an XSS vulnerability. If "1;DROP TABLE users" crashes your database layer, your SQL parameterization is broken.
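The three outcomes named above (crashes, corruption, validation problems) can be tallied rather than asserted one at a time, so a single failure does not hide the rest. The following is a sketch under stated assumptions: `classify_outcome` and its labels are hypothetical, the observations are made-up examples, and treating every 4xx status as a rejection is a simplification.

```python
from collections import Counter
from typing import Optional


def classify_outcome(status_code: int, sent: str,
                     returned: Optional[str] = None) -> str:
    """Map one request/response pair to a coarse, hypothetical outcome label."""
    if status_code >= 500:
        return "crash"          # server error: the input broke something
    if 400 <= status_code < 500:
        return "rejected"       # validation refused the input
    if returned is None:
        return "unverified"     # accepted, but nothing was read back
    # Accepted: the stored value must equal what was sent.
    return "ok" if returned == sent else "corrupted"


if __name__ == "__main__":
    # Made-up observations: (status code, string sent, string read back).
    observations = [
        (201, "null", "null"),
        (201, "caf\u00e9", "caf?"),
        (400, "<script>", None),
        (500, "1;DROP TABLE users", None),
    ]
    summary = Counter(classify_outcome(*obs) for obs in observations)
    print(dict(summary))
```

A summary like this can be printed at the end of a run to show which failure mode dominates before anyone digs into individual strings.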

The repository also provides language-specific integrations. For JavaScript/Node.js projects, you can install it via npm and import it directly:

const blns = require('blns');
const sanitizeInput = require('./sanitizer');

describe('Input sanitizer', () => {
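  blns.forEach((naughtyString, index) => {
    // JSON.stringify keeps control characters and newlines out of the test title
    it(`should handle #${index}: ${JSON.stringify(naughtyString.substring(0, 30))}`, () => {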
      // Should not throw
      const result = sanitizeInput(naughtyString);
      
      // Should return a string
      expect(typeof result).toBe('string');
      
      // Should not contain unescaped HTML
      expect(result).not.toMatch(/<script/i);
    });
  });
});

The architectural decision to maintain both .txt and .json formats is deliberate: the text format enables manual exploratory testing (copy-paste strings into forms), while the JSON format enables automation. Language-specific packages (blns-python, blns-dotnet, blns-php) wrap the core data asset, proving that well-curated test data can become genuine infrastructure with an ecosystem built around it.
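The relationship between the two formats can be sketched in a few lines. This is an illustrative conversion, not the repository's own script: `txt_to_json` is a hypothetical helper, and it assumes that comment lines start with `#` and that blank lines only separate sections.

```python
import json


def txt_to_json(text: str) -> str:
    """Drop comment and blank lines, then serialize the rest as a JSON array."""
    strings = [
        line for line in text.split("\n")
        if line and not line.startswith("#")
    ]
    # ensure_ascii=False keeps non-ASCII strings readable in the output file.
    return json.dumps(strings, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    sample = "#\tReserved Strings\n\nundefined\nnull\n\n#\tEmoji\n\n\U0001F44D\n"
    print(txt_to_json(sample))
```

One consequence of this kind of conversion is that the section comments are lost, so the JSON suits automation while the text file remains the better reference for reading.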

One underappreciated aspect is how BLNS organizes strings by failure mode rather than by character encoding or attack type alone. The "Unicode Symbols" section includes Emoji, zalgo text (c̸̰̹̼͖o̴͙̊͛͋m̷͔̞̟̌b̷̰͎͋i̸̲̐̓n̸̡͔͆̓͠i̶̜̭̊̚n̴̰̥̊̎g̸̱̳͌̚ ̵̣̫̈́̑͝c̴̰̙̿h̴̝̄ā̶̰̎r̵͙̙̆ā̸̹̍̕c̷̰̙̏t̷̩̾e̴͉̓͠r̸̢̤̓s̶̢̛), and zero-width characters—all of which break rendering or length calculations differently. This organization guides developers toward understanding their system's specific vulnerabilities rather than just throwing random strings at it.
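To see why these inputs upset length calculations, the sketch below measures the same text four ways. The `length_report` helper and the chosen samples are illustrative assumptions; only standard-library calls are used.

```python
import unicodedata


def length_report(text: str) -> dict:
    """Return several 'lengths' of the same string, which often disagree."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units, as counted by JavaScript's String.length.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Length after NFC normalization, which may compose combining marks.
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
    }


if __name__ == "__main__":
    samples = {
        "combining accent": "e\u0301",   # 'e' followed by a combining acute accent
        "emoji": "\U0001F44D",           # thumbs-up, outside the Basic Multilingual Plane
        "zero-width space": "a\u200bb",  # renders like 'ab'
    }
    for name, text in samples.items():
        print(name, length_report(text))
```

A field limited to N "characters" can therefore accept or reject the same input depending on which layer does the counting, which is the kind of mismatch these strings tend to expose.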

Gotcha

BLNS is explicitly not a comprehensive security testing solution, and the README warns against treating it as one. It contains example injection payloads but lacks the context and variations that professional penetration testing requires. If you're looking for exhaustive SQL injection testing, tools like SQLMap or dedicated security-focused datasets like SecLists provide thousands of variants with context about which database engines they target. BLNS gives you enough to catch obvious vulnerabilities in development, but won't replace proper security audits.

The self-imposed 255-character limit means certain real-world edge cases are excluded. Many database VARCHAR fields and legacy systems have length limits around this boundary, but some attacks (polyglot files, billion laughs attacks) require longer payloads. The repository also intentionally excludes null bytes (\x00) because they cause issues in the testing tools themselves, and it omits the EICAR test string to avoid false positives from antivirus software. These practical compromises make BLNS usable, but mean you'll need supplementary testing for systems that process arbitrary-length input or binary data. Finally, because it's a static curated list, it doesn't automatically update with new Unicode standards or emerging attack patterns—you're dependent on community contributions and maintainer availability for new edge cases.
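The gaps described above can be covered with a small supplementary list generated alongside BLNS. The sketch below is one possible starting point, not an exhaustive set: `supplementary_strings` is a hypothetical helper, and the specific lengths are assumptions chosen around the 255-character boundary.

```python
def supplementary_strings(limit: int = 255) -> list:
    """Build inputs for two cases described as out of scope: long strings and null bytes."""
    cases = []
    # Lengths just below, at, just above, and well beyond the limit.
    for length in (limit - 1, limit, limit + 1, 10 * limit):
        cases.append("a" * length)
    # Embedded and standalone null bytes.
    cases.append("before\x00after")
    cases.append("\x00")
    # Exactly `limit` characters, but twice as many bytes in UTF-8.
    cases.append("\u00e9" * limit)
    return cases


if __name__ == "__main__":
    for case in supplementary_strings():
        print(len(case), len(case.encode("utf-8")))
```

The last case matters when a column limit is enforced in bytes rather than characters, since the string passes a character-count check and still overflows storage.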

Verdict

Use if: You're building any application that accepts user input and want to catch a large share of common edge-case bugs before they reach production. This is essential baseline testing for web forms, REST APIs, GraphQL endpoints, file upload handlers, search functionality, and database layers. Integrate it into your CI pipeline for automated regression testing, or keep it open during exploratory testing sessions to quickly probe new features. It's particularly valuable if your application serves international users, since the Unicode and emoji sections catch rendering and encoding issues that English-only testing misses.

Skip if: You need specialized security testing beyond basic injection prevention; use dedicated security scanners and fuzzing frameworks like AFL or libFuzzer for that. Also skip it if your input domain is highly constrained (numeric-only fields, date pickers with UI constraints) or you're working with binary protocols where string-based testing doesn't apply. BLNS is a force multiplier for general-purpose input validation testing, not a replacement for domain-specific or security-focused tools.
