Best-of-N Jailbreaking: How Sampling Multiple Attack Variants Breaks LLM Safety Guardrails

Hook

Asking an AI chatbot a harmful question once might fail. But ask it many slightly different ways—especially through audio transcription—and you may significantly increase your success rate. That’s the premise of Best-of-N jailbreaking.

Context

As large language models have become more capable, their safety guardrails have become more sophisticated. Companies like OpenAI invest heavily in alignment techniques to prevent their models from generating harmful content. But these defenses are tested against a growing arsenal of adversarial techniques—jailbreaking methods designed to bypass safety filters.

Traditional jailbreaking research focused on single, carefully crafted prompts. Techniques like PAIR (Prompt Automatic Iterative Refinement) and TAP (Tree of Attacks with Pruning) would generate adversarial prompts through iterative optimization. The jplhughes/bon-jailbreaking repository takes a different approach: instead of optimizing for the perfect attack, it generates many variants and selects the most effective one. Combined with multimodal transformations—converting text to speech and back—this Best-of-N (BoN) approach appears to exploit transcription errors and modality gaps that single-shot attacks miss. The result is a comprehensive testbed for understanding just how vulnerable LLMs remain to systematic adversarial testing.

Technical Insight

The repository supports multiple attack generation strategies, including PAIR, TAP, Circuit Breaking, and PrePAIR variants. Rather than executing an attack once, the system generates N perturbed variants of each request and evaluates all of them, so the probability that at least one variant slips through grows with N.
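
The core loop is simple to sketch. Below is a minimal, illustrative version of the Best-of-N idea; `augment` is a hypothetical stand-in for the repository's actual text perturbations (which include things like random capitalization changes), and the success classifier is supplied by the caller:

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the repo's text augmentations:
    here we just flip the case of one randomly chosen character."""
    chars = list(prompt)
    i = rng.randrange(len(chars))
    chars[i] = chars[i].swapcase()
    return "".join(chars)

def best_of_n(prompt: str, n: int, attack_succeeds, seed: int = 0):
    """Sample up to n augmented variants; return the first one the
    caller-supplied success classifier accepts, else None."""
    rng = random.Random(seed)
    for _ in range(n):
        variant = augment(prompt, rng)
        if attack_succeeds(variant):
            return variant
    return None
```

In the real pipeline, the success check would wrap a model query plus a harmfulness classifier, which is exactly where the per-variant API costs come from.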

The modality conversion pipeline is where things get interesting. By converting text prompts to audio using text-to-speech services, then transcribing them back to text using speech recognition systems, the repository appears to exploit transcription errors as a form of adversarial noise. A prompt that fails in text form might succeed after an audio round-trip introduces subtle changes that slip past safety filters. The repository integrates multiple services including ElevenLabs for TTS and includes Kaldi installation for speech processing capabilities.
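
Conceptually the round-trip is a two-stage pipeline. The sketch below simulates it with a toy noise model rather than real TTS/ASR calls (the repository wires in actual services such as ElevenLabs); the error probabilities here are illustrative, not measured:

```python
import random

def simulated_audio_roundtrip(prompt: str, rng: random.Random) -> str:
    """Toy stand-in for TTS followed by ASR: real transcription
    introduces word deletions, repetitions, and substitutions that
    act as adversarial noise. Error rates below are arbitrary."""
    out = []
    for word in prompt.split():
        r = rng.random()
        if r < 0.05:
            continue            # simulated deletion error
        out.append(word)
        if r > 0.95:
            out.append(word)    # simulated repetition error
    return " ".join(out)
```

Each pass through the round-trip yields a slightly different transcript, and each transcript can be treated as one Best-of-N variant.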

Setting up the environment requires careful attention to dependencies. The repository uses micromamba for environment management and includes WavAugment for audio perturbations:

micromamba env create -n bon python=3.11.7
micromamba activate bon
pip install -r requirements.txt
pip install -e .
git clone git@github.com:facebookresearch/WavAugment.git
cd WavAugment && python setup.py develop

The Kaldi installation adds complexity—it’s a full speech recognition toolkit that requires compilation via the provided script: ./scripts/install_kaldi.sh. This isn’t a lightweight dependency; you’re pulling in serious audio processing infrastructure.

The system connects to multiple LLM and service providers through their APIs. Looking at the required secrets file reveals the scope:

OPENAI_API_KEY=<your-key>
OPENAI_ORG=<openai-org-id>
GOOGLE_API_KEY=<your-key>
GOOGLE_PROJECT_ID=<GCP-project-name>
GOOGLE_PROJECT_REGION=<GCP-project-region>
HF_API_KEY=<your-key>
GRAYSWAN_API_KEY=<your-key>
ELEVENLABS_API_KEY=<your-key>

This isn’t casual experimentation—you need paid accounts across multiple platforms. The repository is designed for well-resourced research teams, not hobbyist security researchers.

The human-verbalized jailbreak dataset is one of the most valuable research artifacts. It contains 774 total items from HarmBench (308 PAIR attacks, 307 TAP attacks, 159 direct requests), each with audio files showing how humans actually verbalize adversarial prompts. Loading the dataset is straightforward:

import pandas as pd
df = pd.read_json('bon_human_data/verbalized_requests.jsonl', lines=True)
# Access jailbreak text in 'rewrite' column
# Audio files and metadata included per row
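
Once loaded, the per-source split can be sanity-checked with a group-by. The column name used here (attack_type) is an assumption for illustration, and the toy frame stands in for the real JSONL file:

```python
import pandas as pd

# Toy stand-in for bon_human_data/verbalized_requests.jsonl; the
# 'attack_type' column name is an assumption, not confirmed.
df = pd.DataFrame({
    "rewrite": ["variant a", "variant b", "variant c"],
    "attack_type": ["pair", "tap", "direct_request"],
})

counts = df["attack_type"].value_counts()
print(counts.to_dict())
assert counts.sum() == len(df)  # per-source counts add up to the total
```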

This dataset bridges a crucial gap in jailbreaking research: understanding how written adversarial prompts translate to spoken language. Humans don’t read PAIR-generated text verbatim—they paraphrase, add natural hesitations, change inflection. These human variations become part of the attack surface when testing multimodal systems.

The experiment scripts in the repository are designed to replicate specific paper figures. Running ./experiments/1_run_text_bon.sh executes the text-based Best-of-N experiments. The repository is fundamentally a research artifact for reproducibility rather than a library with stable APIs. You’re meant to modify the experiment scripts directly, not import modules into your own projects.

Gotcha

The cost barrier is the first wall you’ll hit. Running comprehensive experiments across multiple LLM providers with text-to-speech services isn’t cheap. ElevenLabs charges per character for TTS, OpenAI and Google have per-token pricing, and generating N variants of jailbreaks for statistical testing multiplies your costs by N. Budget hundreds or thousands of dollars for serious experimentation, not pocket change.
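
A quick back-of-envelope calculation makes the scaling obvious. All rates below are illustrative placeholders, not actual vendor pricing:

```python
# Rough cost model for a Best-of-N audio run; every rate here is an
# illustrative placeholder, not a quoted vendor price.
N = 1000                      # variants per behavior
behaviors = 159               # e.g., the direct-request subset
chars_per_prompt = 400
tts_cost_per_char = 0.00003   # illustrative $/character for TTS
llm_cost_per_call = 0.01      # illustrative $/query to the target model

tts_total = N * behaviors * chars_per_prompt * tts_cost_per_char
llm_total = N * behaviors * llm_cost_per_call
print(f"TTS: ${tts_total:,.0f}  LLM: ${llm_total:,.0f}")
# → TTS: $1,908  LLM: $1,590
```

Every extra factor of N multiplies both lines, which is why statistical-sampling attacks are expensive to evaluate at scale.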

The installation complexity is non-trivial. Kaldi is a research-grade speech recognition toolkit that expects a Linux environment and proper compilation toolchains. The WavAugment dependency requires building from source. If you’re on macOS or Windows, expect platform-specific issues. The micromamba setup is clean, but the audio processing dependencies have sharp edges. This isn’t pip install bon-jailbreak and you’re done.

Most importantly, this repository is ethically sensitive research infrastructure. It’s designed for red-teaming and safety research in controlled environments, not for malicious use. The techniques it implements are powerful specifically because they work—they find real vulnerabilities in production LLM systems. Using this tool requires understanding responsible disclosure practices and having proper authorization for security testing. The repository documentation doesn’t include explicit guardrails or ethical guidelines, assuming users already operate within appropriate research or security contexts.

Verdict

Use this repository if you’re an AI safety researcher, academic studying adversarial robustness, or part of a red team with budget for API costs and authorization to test LLM vulnerabilities. The Best-of-N approach combined with multimodal transformations represents cutting-edge jailbreaking methodology, and the human-verbalized dataset is uniquely valuable for understanding real-world attack vectors. The experiment scripts provide a reproducible starting point for extending this research. Skip it if you need production-ready security tooling with stable APIs, lack access to the required paid services across multiple vendors, or want lightweight jailbreaking detection rather than attack generation. This is a specialized research artifact that assumes significant technical sophistication and proper ethical framing—powerful in the right hands, but not a tool for casual experimentation.
