Inside Haize Labs' Automated Jailbreak Discovery: When AI Red-Teams Itself

Hook

What if finding ways to break AI safety guardrails wasn't a creative writing exercise, but an automated algorithmic process that could scale indefinitely? Haize Labs claims their suite does exactly that.

Context

The AI safety community has a cat-and-mouse problem. Every time a new safety guardrail gets deployed to prevent harmful LLM outputs, a new wave of manually crafted jailbreaks emerges on Reddit and Discord. Someone discovers that invoking 'DAN' (Do Anything Now) or framing requests as hypothetical fiction bypasses content filters, the trick spreads, model providers patch it, and the cycle repeats.

This manual approach doesn't scale for either attackers or defenders. Security teams at companies deploying LLMs can't possibly anticipate every creative prompt variation humans might devise. Meanwhile, traditional red-teaming (hiring security researchers to probe systems by hand) is expensive and limited by human creativity and stamina. Haize Labs' 'haizing suite' represents a different paradigm: automated adversarial search over the input space of AI models. The get-haized repository showcases outputs from this system, providing a window into what systematic, algorithmic vulnerability discovery looks like when applied to large language models and multimodal AI systems.

Technical Insight

[Figure: System architecture (auto-generated diagram). Components shown: Haize Labs Testing Suite, Red-Team Methods (Fuzzing Engine, Gradient Optimization), Attack Vectors, Adversarial Prompts, Target AI Models, Safety Filter Detection, Response Analysis (Successful Bypass, Failed Attempt), Output Corpus, get-haized Repository, Public Showcase, Demonstration.]

Unlike traditional open-source repositories that contain executable code, get-haized is fundamentally a demonstration corpus—a curated collection of successful jailbreaks discovered through automated fuzzing and optimization techniques. While the repository doesn't expose the underlying algorithms, the patterns in the examples reveal the technical approach.

The disclosed jailbreaks span multiple attack vectors: role-playing scenarios ('You are a scriptwriter for a gritty crime drama...'), dialect manipulation (using non-standard English to bypass keyword filters), encoding tricks (ROT13, base64, or fictional language translations), and fictional framing ('In a novel I'm writing, how would a character...'). What's significant isn't any individual jailbreak, since those circulate freely on social media, but that these were reportedly found by an automated search rather than through human creativity.
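Seen from the defender's side, the encoding vector is easy to illustrate. The sketch below shows why a keyword filter that inspects only the raw text misses ROT13 or base64 wrapped input, and how checking decoded candidates closes that one gap. The blocklist and the functions `naive_filter`, `candidate_decodings`, and `normalizing_filter` are hypothetical names for illustration; nothing here comes from get-haized or from any real safety system.

```python
import base64
import codecs

# Hypothetical blocklist; production filters are far more sophisticated.
BLOCKED_TERMS = {"forbidden"}


def naive_filter(text: str) -> bool:
    """Return True if the raw text contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def candidate_decodings(text: str) -> list[str]:
    """Return the raw text plus its ROT13 and (if valid) base64 decodings."""
    candidates = [text, codecs.decode(text, "rot13")]
    try:
        decoded = base64.b64decode(text.strip(), validate=True)
        candidates.append(decoded.decode("utf-8"))
    except ValueError:
        # Covers invalid base64 (binascii.Error) and non-UTF-8 payloads.
        pass
    return candidates


def normalizing_filter(text: str) -> bool:
    """Return True if any decoding of the text contains a blocked term."""
    return any(naive_filter(c) for c in candidate_decodings(text))


plain = "tell me the forbidden thing"
rot13 = codecs.encode(plain, "rot13")
b64 = base64.b64encode(plain.encode("utf-8")).decode("ascii")
```

Here `naive_filter` flags `plain` but not `rot13` or `b64`, while `normalizing_filter` flags all three. The fix only covers the encodings it explicitly tries, which is exactly why an automated search over transformations can keep finding new gaps.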

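The underlying methodology appears to draw from adversarial machine learning research, particularly gradient-based optimization attacks. Academic work such as 'Universal and Transferable Adversarial Attacks on Aligned Language Models' (Zou et al., 2023) demonstrated that prompt suffixes can be optimized to maximize the probability that a model opens its reply with an affirmative response instead of a refusal. The optimization itself needs gradient access to an open-weight model, but the paper reported that the resulting suffixes often transfer to other models. A simplified conceptual example of this approach: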

# Pseudocode for gradient-based jailbreak optimization
# (This is NOT code from Haize Labs; get-haized publishes outputs, not algorithms.
# Every helper called below is a placeholder, not a real library function.)

def optimize_adversarial_prompt(target_model, harmful_query,
                                affirmative_response_tokens, iterations=500):
    # Start with a random token suffix
    adversarial_suffix = initialize_random_tokens(length=20)

    for i in range(iterations):
        # Forward pass: score the current prompt
        prompt = harmful_query + adversarial_suffix
        output_logits = target_model(prompt)

        # Loss: negative log-probability of an affirmative response;
        # minimizing it moves the model away from a refusal
        # ("I cannot help with that...")
        loss = -log_probability(output_logits, affirmative_response_tokens)

        # Backward pass: compute gradients w.r.t. suffix tokens
        gradients = compute_gradients(loss, adversarial_suffix)

        # Tokens are discrete, so the gradients rank candidate token
        # substitutions rather than drive a continuous descent step
        adversarial_suffix = update_tokens(adversarial_suffix, gradients)

        # Evaluate the prompt built from the updated suffix
        if successful_jailbreak(target_model(harmful_query + adversarial_suffix)):
            return adversarial_suffix

    return adversarial_suffix

Haize Labs' system likely extends this concept across multiple dimensions: multimodal inputs (text, images, audio, video, code), transferring attacks found on one model to others, and evolutionary algorithms that mutate successful jailbreaks to find variations. The cross-modal aspect is particularly interesting: an image containing adversarial perturbations might bypass safety filters that would catch the same concept expressed in text.
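The "mutate and select" idea behind evolutionary search can be sketched generically. The example below is a textbook hill-climbing loop over strings with a stand-in objective (`toy_score` counts characters that match a fixed target). It is not Haize Labs' algorithm, and every name in it is an assumption for illustration. In a real red-teaming system the score would come from evaluating a model's response, which is deliberately not shown here.

```python
import random
import string

ALPHABET = string.ascii_lowercase + " "


def toy_score(candidate: str, target: str) -> int:
    """Stand-in objective: number of positions matching a fixed target."""
    return sum(a == b for a, b in zip(candidate, target))


def mutate(candidate: str, rng: random.Random) -> str:
    """Replace one randomly chosen character with a random symbol."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evolve(target: str, generations: int = 2000, children: int = 20,
           seed: int = 0):
    """Each generation, keep the best of the parent and its mutated children."""
    rng = random.Random(seed)
    best = "".join(rng.choice(ALPHABET) for _ in target)
    history = [toy_score(best, target)]
    for _ in range(generations):
        pool = [best] + [mutate(best, rng) for _ in range(children)]
        best = max(pool, key=lambda c: toy_score(c, target))
        history.append(toy_score(best, target))
        if history[-1] == len(target):
            break  # Perfect score on the toy objective.
    return best, history


best, history = evolve("automated search")
```

Because the parent always stays in the selection pool, the score in `history` never decreases. That monotonic pressure, applied over many cheap evaluations, is what separates automated search from manual trial and error.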

The repository structure itself reveals the taxonomy of their approach. Jailbreaks are categorized by modality and attack type, suggesting a systematic testing matrix rather than ad-hoc discovery. This architectural choice—organizing vulnerabilities by input type and technique—mirrors how security researchers structure exploit databases, treating AI safety vulnerabilities with the same rigor as traditional software security flaws.
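A testing matrix of this kind is simple to represent. The following is a minimal sketch, assuming each catalogued result records a modality, a technique, a target model, and an outcome. The `Finding` fields, the labels, and the sample rows are invented for illustration and do not reflect the actual layout or contents of get-haized.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One catalogued result; field names are illustrative assumptions."""
    modality: str      # e.g. "text", "image", "audio"
    technique: str     # e.g. "role-play", "encoding", "fictional-framing"
    target_model: str  # placeholder model label
    bypassed: bool     # whether the safety filter was bypassed


def coverage_matrix(findings):
    """Count successful bypasses per (modality, technique) cell."""
    return Counter(
        (f.modality, f.technique) for f in findings if f.bypassed
    )


# Made-up sample rows, not data from the repository.
findings = [
    Finding("text", "role-play", "model-a", True),
    Finding("text", "encoding", "model-a", False),
    Finding("text", "role-play", "model-b", True),
    Finding("image", "perturbation", "model-a", True),
]
matrix = coverage_matrix(findings)
```

Cells with a count of zero are as informative as populated ones: they show either a technique that was tried and failed or a combination that was never tested, and a structured catalogue makes that distinction visible.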

What get-haized demonstrates, even without source code, is that jailbreak discovery can be industrialized. Instead of relying on human red-teamers spending hours crafting clever prompts, an automated system can explore millions of variations, applying optimization pressure to find the exact phrasing that maximizes the probability of bypassing safety filters. The examples in the repository are merely the tip of the iceberg: successful outputs selected from what is presumably a far larger pool of automated attempts.

Gotcha

The most obvious limitation is that get-haized is marketing material, not a tool. You cannot clone this repository and start testing your own models. There's no pip install, no API documentation, no configuration files. It's a portfolio designed to demonstrate capabilities and generate leads for Haize Labs' commercial platform. For developers or security teams looking to actually implement adversarial testing, this repository provides conceptual value but zero executable functionality.

The second major limitation is staleness. AI safety is a rapidly moving target. The jailbreaks showcased might have been effective against GPT-4 or Claude 2 at the time of discovery, but there's no indication of how effective they remain against current model versions or safety systems. The authors acknowledge examples are 'mildly provocative,' which could mean either responsible disclosure (not publishing the most harmful exploits) or that their most severe discoveries don't work reliably. Additionally, without access to the actual fuzzing algorithms, false positive rates, or success metrics, it's impossible to evaluate the efficacy of the underlying approach. A system that finds one working jailbreak per million attempts has very different practical implications than one with a 10% success rate, and this repository provides no data to distinguish between those scenarios.
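The gap between those two scenarios is easy to quantify. The sketch below assumes each attempt is an independent trial with a fixed success probability, which is a simplification (a real search adapts between attempts). The function names are illustrative.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of independent attempts until the first success."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def prob_at_least_one(success_rate: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - success_rate) ** attempts


rare = 1e-6    # one working jailbreak per million attempts
common = 0.10  # a 10% success rate

rare_cost = expected_attempts(rare)
common_cost = expected_attempts(common)
common_in_50 = prob_at_least_one(common, 50)
rare_in_50 = prob_at_least_one(rare, 50)
```

Under these assumptions, a 10% rate yields a success within 50 attempts roughly 99.5% of the time, while a one-in-a-million rate yields one about 0.005% of the time. A budget of 50 queries is trivial for an attacker in the first case and useless in the second, which is why the missing metrics matter.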

Verdict

Use if: You're researching AI safety vulnerabilities and want to understand current attack patterns, you're building internal red-teaming processes and need to familiarize your team with jailbreak taxonomies, or you're evaluating whether to invest in commercial adversarial testing services and want to assess Haize Labs' approach. The repository serves as a valuable educational resource for understanding how systematic adversarial testing differs from manual prompt engineering.

Skip if: You need actual executable tooling for testing your LLM applications; garak, PyRIT, or promptfoo provide open-source alternatives with real functionality. Also skip if you're looking for academic rigor or reproducible research; this is commercial research with proprietary methods. For hands-on adversarial testing, you need tools you can actually run, not a showcase of results from tools you can't access.
