Back to Articles

AgentDojo: Testing Prompt Injection Vulnerabilities in LLM Agents Before They Cost You

[ View on GitHub ]

AgentDojo: Testing Prompt Injection Vulnerabilities in LLM Agents Before They Cost You

Hook

While companies rush to deploy LLM agents with access to emails, databases, and internal tools, there’s no standardized way to test if a cleverly crafted message could trick your agent into leaking customer data or executing unauthorized commands—until now.

Context

LLM agents represent a fundamental shift from simple chatbots to autonomous systems that can call APIs, query databases, and interact with external tools. This power comes with serious security implications: prompt injection attacks, where malicious instructions are hidden in user inputs or retrieved documents, can manipulate agents into bypassing restrictions, leaking sensitive data, or executing unintended actions. Unlike traditional software vulnerabilities that can be patched, prompt injections exploit the fundamental way language models process text—there’s no clear boundary between instructions and data.

Researchers and security engineers have struggled with a chicken-and-egg problem: how do you systematically evaluate defenses against prompt injection when there’s no standardized benchmark? Ad-hoc testing is inconsistent, academic papers use incompatible methodologies, and real-world incidents go unreported. AgentDojo, developed by ETH Zurich’s Security, Privacy, and Machine Learning Lab and presented at NeurIPS 2024, provides the first comprehensive framework specifically designed to benchmark both attacks and defenses for LLM agents in a reproducible, systematic way.

Technical Insight

injects malicious instructions

filters/validates

invokes

returns results

Suite

(workspace)

User Tasks

(user_task_0, ...)

Agent Executor

Attack Strategy

(tool_knowledge)

Defense Mechanism

(tool_filter)

LLM Tools

Benchmark Results

Metrics:

Utility & Security

Web Interface

(Visualization)

System architecture — auto-generated

AgentDojo’s architecture appears to center on a modular pipeline where tasks, attacks, and defenses are pluggable components based on the benchmark script’s command-line interface. The framework includes suites—collections of related tasks—with at least one documented suite called workspace that the README describes as simulating agent interactions. Each suite contains user tasks (legitimate objectives) and adversarial scenarios where attackers attempt to inject malicious instructions.

The framework appears to separate three concerns: the agent’s execution logic, the attack vector, and the defense mechanism. This separation enables systematic testing across multiple dimensions. Here’s how you’d run a basic benchmark comparing different models against a tool knowledge attack with a tool filter defense:

python -m agentdojo.scripts.benchmark \
    -s workspace \
    -ut user_task_0 -ut user_task_1 \
    --model gpt-4o-2024-05-13 \
    --defense tool_filter \
    --attack tool_knowledge

The tool knowledge attack is described in the README as an attack type, though the specific mechanism isn’t detailed. The tool filter is listed as a defense option available in the benchmark script.

What makes AgentDojo particularly valuable based on its documentation is its ability to measure both utility (whether the agent completed the legitimate task) and security (whether it resisted the attack). The benchmark script appears designed to track these metrics, allowing evaluation of the security-utility trade-off where defensive measures might impact legitimate functionality.

The benchmark script outputs structured results that can be inspected through AgentDojo’s dedicated results page or viewed in the Invariant Benchmark Registry. This promotes transparency—researchers can compare results across papers, practitioners can validate defense claims, and the community can track progress over time. The framework appears to support different LLM models through the —model parameter.

For organizations building agents, the modular design suggests you could implement custom suites that mirror actual use cases, though specific documentation on creating custom suites isn’t provided in the README. The framework handles the orchestration, measurement, and reporting through its benchmark script.

Gotcha

AgentDojo’s README prominently warns that the API is under active development and subject to breaking changes. This isn’t a stable foundation for long-term projects or production systems. If you’re building automated security testing into a CI/CD pipeline, expect to pin versions carefully and budget time for updates as the API evolves.

The framework is explicitly research-oriented, not a turnkey security solution. It won’t secure your production LLM agent—it’s designed to evaluate and compare security approaches in a controlled environment. The built-in attacks and defenses are reference implementations for benchmarking, not battle-tested security controls. If you need to actually defend a live agent, you’ll need to implement robust security measures separately (and then use AgentDojo to test their effectiveness). Additionally, the prompt injection detector defense requires extra dependencies via the transformers extra (pip install "agentdojo[transformers]"), increasing installation complexity and potential version conflicts in environments with existing ML dependencies.

Verdict

Use AgentDojo if you’re researching LLM security, building agents with sensitive capabilities, or need to systematically compare defense strategies before deployment. It’s ideal for security teams that want empirical data about prompt injection risks, academics studying AI safety, or engineering teams evaluating whether a defense actually works or just adds latency. The reproducible benchmark format makes it invaluable for validating vendor claims about “prompt injection protection.” Skip it if you need production-ready security tooling, want a general-purpose agent framework without security focus, or require API stability for long-term automation. This is a measurement tool, not a security product—essential for understanding the problem, but not sufficient for solving it alone.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/llm-engineering/ethz-spylab-agentdojo.svg)](https://starlog.is/api/badge-click/llm-engineering/ethz-spylab-agentdojo)