Testing DLP Systems with Real Source Code: Inside brian-dlptest/dlptest

Hook

Your company just spent six figures on a Data Loss Prevention solution, but have you actually verified it catches sensitive data in the messy, real-world code your developers write every day?

Context

Data Loss Prevention systems are supposed to be the last line of defense against data leaks: they should catch developers who accidentally commit AWS keys, embed production database credentials in test files, or leave customer credit card numbers in debug logs. But validating these systems runs into a basic problem: most DLP vendors provide test cases as plain text files or simple CSV datasets. Real data leaks don't happen in pristine .txt files. They happen in Python scripts with convoluted string formatting, in JavaScript files with obfuscated variable names, in configuration files with nested structures, and in log outputs buried among thousands of legitimate entries.

The dlptest repository emerged as a companion to dlptest.com, addressing this testing gap by providing actual source code files—primarily Python—that contain sensitive data patterns embedded in realistic programming contexts. Instead of testing whether your DLP can find '4532-1234-5678-9010' in a text file, you test whether it catches that same credit card number when it's assigned to a variable, passed through a function, or logged with surrounding context. This distinction matters because many DLP solutions rely on pattern matching that can break when sensitive data appears within code syntax, comments, or complex string operations.
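As a minimal sketch of how a rule can pass a plain-text test and still miss the same value in code, compare two regexes below. Both rules are hypothetical and are not taken from any DLP product: `strict_rule` assumes one value per line, as in a vendor sample file, while `embedded_rule` allows the number to appear anywhere on the line.

```python
import re

CARD = r"\d{4}-\d{4}-\d{4}-\d{4}"

# Hypothetical strict rule: the whole line must be the card number,
# as it would be in a one-value-per-line test file.
strict_rule = re.compile(rf"^{CARD}$")

# Looser rule: the number may sit anywhere on the line, as long as it
# is not part of a longer run of digits.
embedded_rule = re.compile(rf"(?<!\d){CARD}(?!\d)")

plain_text_line = "4532-1234-5678-9010"
code_line = "card_number = '4532-1234-5678-9010'  # test customer"

for line in (plain_text_line, code_line):
    print(bool(strict_rule.search(line)), bool(embedded_rule.search(line)))
```

The strict rule matches only the plain-text line; the looser rule matches both.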

Technical Insight

The repository's architecture is deliberately simple: it's a collection of Python source files, each demonstrating different ways sensitive data appears in real code. Unlike a traditional testing framework with assertions and test runners, dlptest functions as a static dataset—you download the files and run them through your DLP system to see what gets flagged.

A typical test file might look like this:

# api_keys_test.py
import requests

class APIClient:
    def __init__(self):
        # AWS Access Key embedded in initialization
        self.aws_key = 'AKIAIOSFODNN7EXAMPLE'
        self.aws_secret = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
        
    def make_request(self):
        headers = {
            'Authorization': f'Bearer {self.aws_key}'
        }
        # GitHub Personal Access Token in comment
        # ghp_1234567890abcdefghijklmnopqrstuvwxyz
        return requests.get('https://api.example.com', headers=headers)

def legacy_function():
    # Slack webhook URL
    webhook = 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
    api_key = "sk-proj-1234567890abcdefghijklmnopqrstuvwxyz"  # OpenAI API key
    return webhook, api_key

This approach tests multiple DLP detection scenarios simultaneously: keys in class attributes, keys in string interpolation, keys in comments, and keys in different assignment patterns. A robust DLP solution should flag all of these, but many only catch the most obvious cases.
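To make those scenarios concrete, here is a minimal sketch of the kind of regex pass a basic scanner might run over a file's raw text. The rule set and the `scan_text` helper are illustrative assumptions, not part of dlptest or of any particular DLP product; the patterns only follow the token shapes shown in the sample above.

```python
import re

# Illustrative rules only; real scanners ship far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/T[A-Z0-9]+/B[A-Z0-9]+/[A-Za-z0-9]+"
    ),
}


def scan_text(text):
    """Return (line_number, rule_name, matched_text) for every hit."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((line_number, rule_name, match.group(0)))
    return findings


sample = """\
self.aws_key = 'AKIAIOSFODNN7EXAMPLE'
# ghp_1234567890abcdefghijklmnopqrstuvwxyz
webhook = 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
"""

for finding in scan_text(sample):
    print(finding)
```

Because the pass runs over raw text, the token inside the comment is caught the same way as the one in the assignment. The AWS secret access key is left out of the rule set on purpose: it has no fixed prefix, so scanners usually need surrounding context or entropy heuristics to flag it.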

The repository also includes files testing different obfuscation levels. Developers don't always commit secrets in plain text—sometimes they're base64 encoded, split across multiple lines, or constructed dynamically:

import base64

def get_database_connection():
    # Base64 encoded connection string
    encoded = 'UG9zdGdyZVNRTDovL3VzZXI6cGFzc3dvcmRAaG9zdDo1NDMyL2Ri'
    conn_string = base64.b64decode(encoded).decode('utf-8')
    
    # SSN split across variables
    ssn_parts = ['123', '45', '6789']
    full_ssn = '-'.join(ssn_parts)
    
    # Credit card with spaces
    cc_number = '4532 1234 5678 9010'
    
    return conn_string, full_ssn, cc_number

Ideally, a DLP system would catch the base64-encoded connection string after decoding it, recognize that the split SSN parts form a complete Social Security number when joined, and identify the credit card number despite the spacing. Testing these edge cases reveals the difference between basic regex-based DLP and more sophisticated solutions that use semantic analysis.
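Two of these cases can be handled by transforming the text before matching. The sketch below decodes base64-looking runs and rescans the result, and strips separators from card candidates before applying the standard Luhn checksum. All helper names and regexes here are assumptions made for illustration, not code from dlptest or from a DLP vendor.

```python
import base64
import binascii
import re

# Hypothetical helpers; the rules are deliberately simplistic.
BASE64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")
CONNECTION_URI = re.compile(r"\b[A-Za-z][A-Za-z0-9+.-]*://[^\s:/@]+:[^\s@]+@[^\s/]+")
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def decoded_candidates(text):
    """Yield UTF-8 text recovered from base64-looking runs in `text`."""
    for match in BASE64_CANDIDATE.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            yield raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded


def find_encoded_credentials(text):
    """Return credential-bearing URIs hidden inside base64 strings."""
    return [
        uri.group(0)
        for decoded in decoded_candidates(text)
        for uri in CONNECTION_URI.finditer(decoded)
    ]


def luhn_valid(digits):
    """Standard Luhn checksum over a string of digits."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_cards(text):
    """Return (normalized_number, passes_luhn) for 16-digit candidates."""
    results = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        results.append((digits, luhn_valid(digits)))
    return results


print(find_encoded_credentials(
    "encoded = 'UG9zdGdyZVNRTDovL3VzZXI6cGFzc3dvcmRAaG9zdDo1NDMyL2Ri'"
))
print(find_cards("cc_number = '4532 1234 5678 9010'"))
```

The sample card number normalizes to 16 digits but does not pass the Luhn check, so a detector that validates checksums may legitimately skip it. A test number that passes, such as 4111 1111 1111 1111, separates "missed because of spacing" from "rejected as invalid". The split SSN is not covered here: no pattern over the raw text can see a value that only exists once `'-'.join(...)` runs.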

The practical workflow is to clone the repository locally and run your DLP scanning tool against the directory. With a Git secret scanner, you point it at the dlptest checkout; with Amazon Macie, which inspects objects stored in S3, you would first upload the files to a bucket. Either way, you then review which sensitive patterns were detected and which were missed. This gives you a baseline understanding of your DLP coverage before deploying it across actual codebases.
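The clone-and-scan loop can be sketched with a stand-in scanner. The `scan_directory` function below is hypothetical and uses a single illustrative rule; in practice the real DLP tool replaces it. It shows the shape of the report you want: every file listed, with the lines that were flagged.

```python
from pathlib import Path
import re

# Stand-in for a real scanner: one illustrative rule.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan_directory(root, pattern=AWS_KEY_ID, suffix=".py"):
    """Map each file under `root` (relative path) to the line numbers that hit."""
    root = Path(root)
    report = {}
    for path in sorted(root.rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = [
            number
            for number, line in enumerate(text.splitlines(), start=1)
            if pattern.search(line)
        ]
        # Files with no hits are kept in the report with an empty list.
        report[path.relative_to(root).as_posix()] = hits
    return report
```

Calling `scan_directory("dlptest")` on a local checkout (the directory name is an assumption) returns a dictionary keyed by file. Files mapped to an empty list are the ones to open by hand, since every file in the repository is supposed to contain something sensitive.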

What makes this repository valuable isn't sophisticated tooling—it's the curation of realistic test cases that mirror how developers actually introduce sensitive data into code. The Python-centric approach reflects that Python is widely used in data processing, API integrations, and scripting—all contexts where sensitive data frequently appears.

Gotcha

The repository's greatest strength, its simplicity, is also its most significant limitation. There's no documentation explaining which specific sensitive data patterns are included, no index of test cases, and no guidance on interpreting results. You're expected to browse through the Python files manually to understand what's being tested. A security team evaluating a DLP solution against specific compliance requirements (PCI DSS, HIPAA, GDPR) can't quickly verify whether the test suite covers the data types it needs.

The repository also lacks any programmatic interface or automation framework. If you want to integrate these tests into a CI/CD pipeline or run them as part of automated DLP validation, you'll need to build that infrastructure yourself. There's no test runner, no scoring mechanism, and no way to track which patterns your DLP successfully detected versus missed except through manual review. For enterprise environments where DLP testing needs to be repeatable, auditable, and integrated into security workflows, this creates significant friction. Additionally, the repository appears to focus primarily on API keys, credit cards, and SSNs—common patterns, but not comprehensive for organizations dealing with PII variations across international markets, healthcare identifiers, or industry-specific sensitive data formats.
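The missing scoring step is small enough to sketch. The example below assumes you have written your own manifest of planted secrets, since the repository ships none, and that your scanner's output can be reduced to the same (file, kind) pairs. The `Expectation` type, the `score` function, and the kind labels are all hypothetical.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class Expectation:
    """One planted secret: which file it lives in and what kind it is."""
    file: str
    kind: str


def score(expected, detected):
    """Compare planted secrets against scanner findings.

    Both arguments are iterables of Expectation. Returns recall plus the
    sorted lists of misses and unexpected findings.
    """
    expected_set, detected_set = set(expected), set(detected)
    caught = expected_set & detected_set
    recall = len(caught) / len(expected_set) if expected_set else 1.0
    order = attrgetter("file", "kind")
    return {
        "recall": recall,
        "missed": sorted(expected_set - detected_set, key=order),
        "unexpected": sorted(detected_set - expected_set, key=order),
    }


# Hand-written manifest for the sample file shown earlier (an assumption).
manifest = [
    Expectation("api_keys_test.py", "aws_access_key_id"),
    Expectation("api_keys_test.py", "github_pat"),
    Expectation("api_keys_test.py", "slack_webhook"),
    Expectation("api_keys_test.py", "openai_key"),
]
# Findings as a scanner might report them, reduced to the same shape.
findings = [
    Expectation("api_keys_test.py", "aws_access_key_id"),
    Expectation("api_keys_test.py", "slack_webhook"),
]

result = score(manifest, findings)
print(f"recall: {result['recall']:.0%}")
for miss in result["missed"]:
    print("missed:", miss.file, miss.kind)
```

In a CI job, the same result could gate the build, for example by exiting non-zero when recall drops below a threshold your team has agreed on.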

Verdict

Use if: you need a quick, no-setup collection of realistic test files to validate a new DLP solution's effectiveness with actual source code contexts, especially if you're evaluating vendor claims before purchase. This is perfect for proof-of-concept testing or getting a baseline understanding of detection capabilities. Also valuable if you're building internal security awareness training and want real examples of how sensitive data leaks into code.

Skip if: you need comprehensive DLP testing with documentation, automated validation, programmatic integration, or coverage of modern secrets management patterns and international data formats.

For ongoing enterprise DLP validation, you'll want to supplement this with custom test cases matching your specific data types and compliance requirements, or invest in a proper testing framework that provides metrics and reporting. The repository serves as a starting point, not a complete solution.
