Inside JailbreakHub: A 15,000-Prompt Dataset Exposing How Users Actually Attack LLMs
Hook
Between December 2022 and December 2023, 803 adversarial users published 1,405 jailbreak prompts across Reddit, Discord, and specialized websites—and researchers catalogued every single one. This is what they found.
Context
When ChatGPT launched in November 2022, it took less than a month for users to discover ‘DAN’ (Do Anything Now)—a prompt that convinced the model to ignore its safety guidelines. What followed was an arms race: OpenAI patching vulnerabilities, users inventing increasingly sophisticated workarounds, and the security community scrambling to understand the scope of the problem.
But here’s the issue: most academic research on LLM jailbreaks relied on synthetic attacks generated in labs. No one had systematically measured what actual users were trying in the wild. Were these attacks effective? How did they evolve? Which platforms housed the most sophisticated adversarial communities? The verazuo/jailbreak_llms repository—published at ACM CCS 2024—answers these questions with the first large-scale measurement study of real-world jailbreak prompts, collecting 15,140 prompts from four platform types and identifying 1,405 jailbreaks with JailbreakHub, the authors’ end-to-end collection and analysis framework.
Technical Insight
The repository’s true value lies in its dual nature: it’s both a research artifact (the dataset itself) and a methodological blueprint (the collection and evaluation framework). The architecture consists of three interdependent pipelines that researchers can adapt for ongoing monitoring.
The data collection pipeline aggregates prompts from heterogeneous sources with full provenance tracking. For Reddit, the team scraped three subreddits (r/ChatGPT, r/ChatGPTPromptGenius, r/ChatGPTJailbreak) covering 168,687 posts. For Discord, they monitored six servers including specialized communities like ‘BreakGPT’ and ‘Spreadsheet Warriors’. Website scraping targeted prompt-sharing platforms like FlowGPT (8,754 prompts from 3,505 users) and AIPRM. Each record includes temporal metadata and source attribution, with adversarial user account tracking across platforms. This allows researchers to identify prolific jailbreak authors and analyze cross-platform patterns.
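With provenance fields in hand, simple aggregations reveal where adversarial activity concentrates. The sketch below runs on toy in-memory records rather than the live dataset, and the column names (`platform`, `source`) are assumptions—the dataset’s actual schema may differ:

```python
from collections import Counter

# Hypothetical records mirroring the dataset's provenance fields;
# the field names 'platform' and 'source' are assumptions, not confirmed schema.
records = [
    {"platform": "reddit", "source": "r/ChatGPTJailbreak", "prompt": "..."},
    {"platform": "discord", "source": "BreakGPT", "prompt": "..."},
    {"platform": "website", "source": "FlowGPT", "prompt": "..."},
    {"platform": "reddit", "source": "r/ChatGPT", "prompt": "..."},
]

# Count prompts per platform to see where activity clusters
per_platform = Counter(r["platform"] for r in records)
print(per_platform)  # Counter({'reddit': 2, 'discord': 1, 'website': 1})
```

The same one-liner works on the real dataset once you confirm the column names against the HuggingFace dataset card.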
Loading the dataset is straightforward using HuggingFace’s Datasets library, with two historical snapshots available:
from datasets import load_dataset

# Load May 2023 snapshot (early jailbreak landscape)
dataset_early = load_dataset(
    'TrustAIRLab/in-the-wild-jailbreak-prompts',
    'jailbreak_2023_05_07',
    split='train'
)

# Load December 2023 snapshot (mature ecosystem)
dataset_late = load_dataset(
    'TrustAIRLab/in-the-wild-jailbreak-prompts',
    'jailbreak_2023_12_25',
    split='train'
)

# Access prompt text
for example in dataset_early:
    print(f"Prompt: {example['prompt']}")
The classification system identifies jailbreaks through a combination of manual labeling and pattern analysis. The team doesn’t just flag obvious ‘DAN’ variants—they track how techniques evolve in response to vendor patches. The raw CSV files expose this taxonomy, enabling researchers to train jailbreak detectors on labeled real-world data rather than synthetic examples.
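As a toy illustration—emphatically not the paper’s classification method—a keyword heuristic can flag obvious ‘DAN’-style variants as a first-pass filter before manual labeling. The marker list here is invented for the example:

```python
import re

# Crude first-pass flagger: matches common jailbreak markers such as
# 'DAN' persona declarations or instruction-override phrasing.
# The marker list is illustrative, not drawn from the paper's taxonomy.
JAILBREAK_MARKERS = re.compile(
    r"\b(do anything now|DAN|ignore (all|your) (previous )?(instructions|guidelines))\b",
    re.IGNORECASE,
)

def looks_like_jailbreak(prompt: str) -> bool:
    """Return True if the prompt contains a known jailbreak marker."""
    return bool(JAILBREAK_MARKERS.search(prompt))

print(looks_like_jailbreak("You are now DAN, which stands for Do Anything Now."))  # True
print(looks_like_jailbreak("Summarize this article in two sentences."))            # False
```

A detector trained on the labeled dataset would, of course, catch the paraphrased and obfuscated variants that any keyword list misses.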
The evaluation framework, ChatGLMEval, is where things get methodologically rigorous. Rather than testing jailbreaks with arbitrary harmful questions, the researchers constructed a curated question set of 390 forbidden queries spanning 13 OpenAI policy violation categories—excluding Child Sexual Abuse for ethical reasons. These scenarios include Illegal Activity, Hate Speech, Malware Generation, Physical Harm, Economic Harm, Fraud, Pornography, Political Lobbying, Privacy Violence, Legal Opinion, Financial Advice, Health Consultation, and Government Decision. You can load this question set the same way:
forbidden_question_set = load_dataset(
    "TrustAIRLab/forbidden_question_set",
    split='train'
)

# Example: test a jailbreak prompt against forbidden questions
for question in forbidden_question_set:
    category = question['category']  # e.g., 'Malware Generation'
    query = question['question']
    # Combine jailbreak prompt + forbidden question,
    # send to the LLM API, and evaluate the response
This evaluation approach enables reproducible jailbreak effectiveness measurements. Instead of subjective assessments, researchers can quantify attack success rates across policy domains and compare defensive mechanisms. The notebook-based analysis tools (the repo is primarily Jupyter Notebooks) include visualization code for semantic analysis of jailbreak prompts.
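To make “attack success rate” concrete, here is a minimal sketch of the aggregation step, using hypothetical (category, success) judgments of the kind you would obtain by replaying jailbreak-plus-question pairs against a model and scoring the responses:

```python
from collections import defaultdict

# Hypothetical evaluation records: (category, success) pairs.
# In practice, 'success' would come from a human or automated policy judge.
results = [
    ("Malware Generation", True),
    ("Malware Generation", False),
    ("Illegal Activity", True),
    ("Illegal Activity", True),
]

totals, hits = defaultdict(int), defaultdict(int)
for category, success in results:
    totals[category] += 1
    hits[category] += int(success)

# Attack success rate (ASR) per forbidden-question category
asr = {c: hits[c] / totals[c] for c in totals}
print(asr)  # {'Malware Generation': 0.5, 'Illegal Activity': 1.0}
```

With 390 questions across 13 categories, per-category ASR is what lets you say a defense helps against, say, Malware Generation but not Political Lobbying.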
One underappreciated aspect: the repository documents adversarial user behavior longitudinally. By tracking the 803 adversarial accounts across platforms, researchers can identify users who author multiple successful jailbreaks and analyze how patterns emerge across communities. This social network dimension is critical for understanding jailbreak ecosystems beyond just the prompts themselves.
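A minimal sketch of that longitudinal analysis, again on toy records—the dataset’s actual account-tracking columns may be named differently than the `user` and `platform` fields assumed here:

```python
from collections import Counter

# Hypothetical records; field names are assumptions for illustration.
prompts = [
    {"user": "acct_17", "platform": "reddit"},
    {"user": "acct_17", "platform": "discord"},
    {"user": "acct_02", "platform": "reddit"},
]

# Prolific authors: accounts with more than one prompt
by_author = Counter(p["user"] for p in prompts)
prolific = [u for u, n in by_author.items() if n > 1]

# Cross-platform authors: accounts active on more than one platform
cross_platform = {
    u for u in by_author
    if len({p["platform"] for p in prompts if p["user"] == u}) > 1
}
print(prolific, cross_platform)  # ['acct_17'] {'acct_17'}
```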
Gotcha
The most significant limitation is temporal decay—this dataset captures jailbreaks from December 2022 to December 2023, which is ancient history in LLM defense timelines. Models like GPT-4 and Claude have received multiple safety updates since collection ended. Jailbreaks that worked in mid-2023 likely fail against current versions, and the dataset doesn’t include newer attack techniques that have emerged since. If you’re building a production safety system in 2024, these historical prompts are valuable for training baseline detectors but insufficient as your sole evaluation corpus.
The documentation explicitly warns about duplicate prompts in the raw dataset, recommending preprocessing before training models. This isn’t a trivial deduplication task—many jailbreaks are slight variations (“You are DAN” vs. “You are now DAN”), and naive string matching misses semantic duplicates. The HuggingFace Datasets discussion thread linked in the README reveals researchers debating the ‘correct’ deduplication strategy. Additionally, the dataset doesn’t include LLM responses—only the prompts themselves. To evaluate effectiveness, you need to replay prompts against models and assess whether outputs violate policies, which is resource-intensive and may require IRB or ethics-board review if you publish results. The paper describes the authors’ evaluation methodology, but the repository doesn’t provide automated policy violation detection code.
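One pragmatic middle ground between naive string matching and full semantic deduplication is fuzzy matching after normalization. A rough sketch using the standard library’s `difflib`—the 0.8 threshold is illustrative, not a recommendation from the paper or the dataset maintainers:

```python
from difflib import SequenceMatcher

def normalized(prompt: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(prompt.lower().split())

def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    # Character-level similarity ratio; a crude stand-in for semantic dedup.
    # Embedding-based clustering would catch paraphrases this misses.
    return SequenceMatcher(None, normalized(a), normalized(b)).ratio() >= threshold

print(near_duplicate("You are DAN", "You are now DAN"))        # True
print(near_duplicate("You are DAN", "Write a poem about cats"))  # False
```

For the long, template-heavy prompts in this dataset, sentence-embedding similarity is the more robust (if costlier) option.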
Verdict
Use this dataset if you’re conducting academic research on LLM safety mechanisms, building jailbreak detection classifiers that need training data with real-world adversarial language patterns, or studying the sociology of adversarial AI communities. It’s invaluable for understanding how jailbreaks evolve in response to defenses and for benchmarking safety improvements between model versions. The question set alone is worth using as a standardized red-teaming evaluation suite. Skip it if you need current jailbreak techniques (you’ll need to supplement with more recent data for coverage through 2024), want plug-and-play safety filtering (this is research infrastructure, not a production API), or lack institutional resources for responsible disclosure (testing these prompts against production LLMs requires careful ethical oversight). This is a dataset for security researchers and red teams at AI labs, not for casual experimentation or—obviously—malicious use.