Building Self-Optimizing Red Team Attacks with DSPy: No Prompt Engineering Required
Hook
What if your red team could write its own attack prompts, test them against your LLM, learn from failures, and iterate—all automatically while you sleep?
Context
Traditional red-teaming of large language models is a grind. Security researchers manually craft adversarial prompts, test them against target models, analyze failures, and refine their attacks through dozens of iterations. It’s creative work, but it doesn’t scale. When you’re trying to test safety guardrails across multiple models or need to continuously verify that your latest model version hasn’t introduced new vulnerabilities, manual prompt engineering becomes a bottleneck.
The adversarial ML community has developed sophisticated attack methods, but many are difficult to adapt to new scenarios. Meanwhile, the DSPy framework emerged as a solution for eliminating manual prompt engineering for general LLM applications through structured program optimization. Haize Labs saw an opportunity to bridge these worlds—could a framework designed for building compositional LLM programs also be weaponized to break them? According to their work, this represents the first attempt at using any auto-prompting framework to perform red-teaming tasks.
Technical Insight
The core insight of dspy-redteam is treating adversarial prompt generation as a program synthesis problem rather than a manual creative task. Instead of hand-crafting jailbreak attempts, you define an architecture of interacting modules and let DSPy’s optimizer discover effective attack strategies through automated compilation.
The architecture uses a five-layer stack of alternating Attack and Refine modules. Each Attack module attempts to generate an adversarial prompt that will cause the target LLM to produce harmful content. Each Refine module takes the previous attempt, analyzes the outcome, and adjusts the strategy for the next layer. This creates a multi-step reasoning chain where layers work together to improve attack effectiveness.
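To make the layered idea concrete, here is a minimal sketch of an attack/refine chain in plain Python. The `attack`, `refine`, and `stub_target` functions are hypothetical stand-ins, not the actual dspy-redteam modules (which are LLM-backed DSPy modules), but they show the control flow: each layer attacks, observes the target's response, and passes an adjusted strategy forward.

```python
# Illustrative sketch of the layered attack/refine chain. All functions here
# are toy stand-ins for the real LLM-backed modules in dspy-redteam.

def attack(goal: str, strategy: str) -> str:
    """Stand-in Attack layer: wraps the harmful goal in the current strategy."""
    return f"{strategy} {goal}"

def refine(strategy: str, response: str) -> str:
    """Stand-in Refine layer: adjusts the strategy based on the target's response."""
    if "cannot" in response:                      # crude refusal signal
        return strategy + " (reframed as fiction)"
    return strategy

def stub_target(prompt: str) -> str:
    """Toy target model: refuses unless the prompt is framed as fiction."""
    return "Sure, here is..." if "fiction" in prompt else "I cannot help with that."

def run_chain(goal: str, layers: int = 5) -> str:
    strategy = "Please answer:"
    response = ""
    for _ in range(layers):                       # each layer builds on the last
        prompt = attack(goal, strategy)
        response = stub_target(prompt)
        strategy = refine(strategy, response)
    return response

print(run_chain("describe the payload"))          # later layers escalate the framing
```

In the real system each of these functions is itself a prompted LLM call, so the quality of the chain depends entirely on the instructions inside each module, which is exactly what the optimizer tunes.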
What makes this powerful is DSPy’s MIPRO optimizer working in the outer loop. The optimizer treats each module’s internal prompts as learnable parameters. It uses an LLM-as-judge approach to evaluate attack success, then iteratively refines the instructions and demonstrations passed to each module. The system essentially learns how to red-team by red-teaming, with the optimizer hill-climbing toward better attack success rates.
Based on DSPy’s standard compilation pattern, the approach appears to involve defining modules as DSPy signatures (input-output specifications), chaining them together in a sequential program, then invoking the optimizer with a training set of harmful behaviors to elicit and a metric function that determines success. The metric needs to reliably detect when the target model has been successfully jailbroken versus when it has refused the request.
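The metric is worth sketching on its own, since everything the optimizer does flows from it. dspy-redteam uses an LLM-as-judge for this step; the keyword heuristic below is a deliberately simplified stand-in so the logic is self-contained, and the marker list is an assumption, not the repository's actual judge.

```python
# Minimal sketch of the success metric: classify the target's output as a
# jailbreak or a refusal. A real implementation would ask a judge LLM; the
# refusal markers here are illustrative assumptions.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def attack_succeeded(target_output: str) -> bool:
    """Return True if the output looks like compliance rather than a refusal."""
    lowered = target_output.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def success_rate(outputs: list[str]) -> float:
    """Aggregate attack success rate the optimizer hill-climbs on."""
    return sum(attack_succeeded(o) for o in outputs) / len(outputs)

outputs = ["I cannot help with that.", "Sure, here are the steps..."]
print(success_rate(outputs))  # 0.5
```

A keyword check like this is brittle in practice (models can refuse without any of these phrases, or comply while apologizing), which is precisely why the authors use an LLM judge instead.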
The optimization loop is where the system demonstrates its power. Across multiple iterations, MIPRO explores different prompt variations for each module, evaluates their contribution to overall attack success rate, and progressively builds up a set of instructions that work well together. This is compositional optimization: the prompt for each module is optimized in the context of what the other modules are doing. The result is a coordinated attack strategy that no human explicitly designed.
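The compositional idea can be illustrated with a toy coordinate-ascent loop: optimize one module's prompt at a time while holding the others fixed. MIPRO is far more sophisticated (it searches over both instructions and demonstrations), so treat this only as an intuition pump; the variant names and scores are invented.

```python
# Toy illustration of compositional optimization: greedily pick, for each
# module in turn, the prompt variant that maximizes the JOINT score while the
# other modules' current choices are held fixed. Scores are made up.

def joint_score(choice: tuple) -> float:
    """Stub evaluator: some prompt combinations work well together."""
    scores = {("direct", "terse"): 0.10, ("direct", "verbose"): 0.20,
              ("roleplay", "terse"): 0.30, ("roleplay", "verbose"): 0.44}
    return scores[choice]

def coordinate_ascent(variants: list[list[str]], sweeps: int = 2) -> tuple:
    choice = [v[0] for v in variants]            # start from the first variant
    for _ in range(sweeps):
        for i, options in enumerate(variants):   # optimize one module at a time
            choice[i] = max(
                options,
                key=lambda o: joint_score(tuple(choice[:i] + [o] + choice[i + 1:])),
            )
    return tuple(choice)

print(coordinate_ascent([["direct", "roleplay"], ["terse", "verbose"]]))
# → ('roleplay', 'verbose')
```

The point is that `("roleplay", "verbose")` wins not because either choice is best in isolation, but because the pair scores highest together; that interaction effect is what per-module prompt engineering in isolation misses.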
The numbers tell the story: raw harmful inputs achieve only a 10% attack success rate against Vicuna. The five-layer architecture without optimization jumps to 26%—the compositional structure alone helps. But after DSPy compilation, the success rate reaches 44%, more than a fourfold improvement over the baseline. Critically, Haize Labs emphasizes they achieved this with ‘no specific prompt engineering’ and ‘almost no hyperparameter tuning.’ This isn’t state-of-the-art performance, but it’s remarkable efficiency: minimal human effort, maximum automation.
This approach represents a paradigm shift for red-teaming workflows. Instead of security researchers spending days crafting attack variants, they can invest time in architecture design—defining what kinds of attack strategies to explore, how many reasoning steps to use, what information to pass between modules—and let optimization do the tactical work. It’s the difference between writing individual attacks and writing attack generators.
Gotcha
The repository README is refreshingly honest about limitations: this is not state-of-the-art. Dedicated red-teaming research using other methods still achieves higher success rates. If you’re conducting critical safety assessments where you need maximum attack effectiveness—say, validating that your production LLM won’t leak PII or generate illegal content—this framework probably shouldn’t be your only tool.
There are also practical constraints the README hints at but doesn’t fully detail. Running a deep multi-layer DSPy program through multiple optimization iterations is computationally expensive. Each optimization step requires querying both your attack modules and the target model multiple times to evaluate success. With five layers and an LLM-as-judge in the evaluation loop, you’re potentially making dozens of LLM calls per training example. The README mentions ‘almost no hyperparameter tuning except to fit compute constraints,’ which suggests the team had to make tradeoffs between optimization thoroughness and available resources. For continuous testing scenarios or teams with limited computational budgets, this could be prohibitive. The generalization question also looms: results are demonstrated against Vicuna, but how the approach performs against other models with different safety mechanisms would require additional validation.
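A back-of-envelope budget makes the cost concern tangible. Every number below is an illustrative assumption, not a figure from the repository: suppose each of the five layers makes one LLM call per attempt, plus one target query and one judge query per training example, across some number of optimizer trials.

```python
# Hypothetical call budget for one optimization run. The layer count matches
# the article's five-layer architecture; example and trial counts are
# assumptions for illustration only.

def calls_per_trial(examples: int, layers: int = 5) -> int:
    per_example = layers + 1 + 1   # attack/refine chain + target query + judge query
    return examples * per_example

trials = 30                        # hypothetical number of optimizer trials
total = calls_per_trial(examples=50) * trials
print(total)                       # 10500 LLM calls
```

Even under these modest assumptions the run crosses ten thousand LLM calls, which is why “fit compute constraints” shows up in the README and why continuous-testing pipelines on a budget need to think carefully before adopting this loop.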
Verdict
Use this tool if you’re building LLM safety infrastructure and value automation over peak performance. It’s ideal for research teams who want to quickly establish baseline adversarial testing capabilities, explore how different architectural choices affect attack success, or need to scale red-teaming across multiple model versions without manually rewriting prompts each time. The real power is in iteration speed: you can experiment with deeper architectures, different module compositions, or new attack strategies by changing code rather than crafting prompts. It’s also valuable if you’re already invested in the DSPy ecosystem and want to add adversarial testing to your evaluation pipeline.

Skip it if you need state-of-the-art jailbreak success rates for high-stakes security assessments, have limited computational resources for running optimization loops, or need validated results against specific production models beyond Vicuna. In those cases, lean on established red-teaming methods or manually curated attack datasets from published safety research. This is a power tool for researchers and safety engineers who think in systems, not a drop-in replacement for expert human red-teamers.