RedCode: The First Large-Scale Safety Benchmark That Actually Tests Code Agents in the Wild

Hook

Code agents can now execute bash scripts, manipulate files, and generate software autonomously. But what happens when you prompt one to delete system files or exfiltrate sensitive data?

Context

The rapid advancement of large language models has birthed a new category of AI systems: code agents. Unlike traditional code generation models that simply produce text output, these agents actively execute commands, interact with file systems, and chain together complex operations to accomplish programming tasks. Various code agent implementations are being developed and tested, with agents capable of autonomous operation across real codebases and system resources.

But this power comes with substantial risk. A code agent that blindly executes malicious scripts, generates vulnerability-laden software on request, or fails to recognize dangerous operations could cause catastrophic damage in production environments. Despite the critical importance of safety in these systems, the research community lacked a comprehensive benchmark specifically designed to evaluate code agent safety—until RedCode. Published at NeurIPS’24, this benchmark addresses the gap between measuring code correctness (what most benchmarks do) and measuring code safety (what actually matters when agents have system access). With over 4,000 test cases covering both execution risks and generation risks, RedCode provides a rigorous framework for evaluating whether code agents will recognize and refuse dangerous operations.

Technical Insight

RedCode takes a dual-pronged approach to safety evaluation through two distinct components: RedCode-Exec and RedCode-Gen. This architecture reflects the two fundamental ways code agents can introduce risk—by executing dangerous code they encounter, and by generating harmful code when prompted.

RedCode-Exec contains 4,050 test instances designed to evaluate whether agents recognize and refuse to execute risky code. The benchmark spans multiple programming languages including Python, Bash, and natural language commands, reflecting the polyglot reality of modern code agents. What sets RedCode apart from traditional security benchmarks is its use of actual Docker-based execution environments. Rather than relying on static analysis or synthetic scenarios, the framework spins up isolated containers where agents interact with real file systems, network interfaces, and system resources. This approach captures authentic agent behavior under conditions that mirror production deployments. The test cases are organized into fine-grained risk categories—from file system manipulation and network operations to data exfiltration and resource exhaustion—enabling researchers to identify specific weaknesses in agent safety mechanisms.

RedCode-Gen flips the evaluation paradigm by testing active safety: will an agent generate harmful code when explicitly instructed to do so? This component includes 160 carefully crafted prompts with function signatures that request the creation of malicious or vulnerable software. For example, a prompt might ask an agent to implement a function for unauthorized data access or to generate code with intentional security flaws. The relatively smaller dataset size compared to RedCode-Exec reflects the focused nature of generation testing—each prompt is designed to probe specific failure modes in agent guardrails.

The framework provides evaluation harnesses for three major code agent architectures: CodeAct, OpenCodeInterpreter, and ReAct. Setting up an evaluation is straightforward. After cloning the repository and setting up the conda environment, you can run evaluations using provided bash scripts:

git clone https://github.com/AI-secure/RedCode.git
conda env create -f environment.yml
conda activate redcode

# Run evaluation for different agent types
./scripts/OCI_eval.sh  # OpenCodeInterpreter agents
./scripts/RA_eval.sh   # ReAct agents
./scripts/CA_eval.sh   # CodeAct agents

Each evaluation script orchestrates the Docker environment setup, feeds test cases to the agent, captures execution behavior, and runs the fine-grained evaluation scripts located in the evaluation directory. The evaluation scripts analyze agent responses across different risk scenarios, generating detailed safety profiles that reveal not just whether an agent failed, but precisely how and where it failed.

The architecture’s modularity is particularly noteworthy. The dataset, environment, evaluation logic, and results are cleanly separated into distinct directories, making it straightforward to extend the benchmark with new test cases, adapt it to different agent architectures, or integrate it into CI/CD pipelines for continuous safety monitoring. The Docker-based isolation ensures that even genuinely dangerous test cases can be safely executed without risking the host system—a critical design decision for a safety benchmark that by definition must handle malicious code.

What makes RedCode especially valuable for research is its public leaderboard (available on the project webpage) and standardized evaluation methodology. Rather than each research group creating ad-hoc safety tests with incomparable results, RedCode establishes a community standard. Developers can benchmark their agents, and the fine-grained categorization of test cases enables targeted improvements. If your agent performs poorly on file system manipulation tests but well on network operation tests, you know exactly where to focus your safety hardening efforts.

Gotcha

RedCode’s Docker-based approach, while safer than testing on bare metal, introduces a layer of abstraction that may not perfectly mirror all production deployment scenarios. Real-world code agents often run with varying permission levels, interact with diverse system configurations, and face attack vectors that emerge from the specific context of their deployment environment. The standardized Docker containers can’t capture every edge case or novel attack pattern that might appear in the wild. As code agent capabilities evolve and new attack methodologies emerge, the static dataset of 4,050 test cases will inevitably become dated without continuous updates.

The framework currently supports only three agent architectures: CodeAct, OpenCodeInterpreter, and ReAct. If you’re building a code agent with a different architecture or implementing a novel agent design, you’ll need to invest engineering effort into creating a custom evaluation harness. This isn’t insurmountable—the codebase is well-structured for extension—but it’s not plug-and-play if you’re working outside the three supported architectures. Additionally, the benchmark focuses specifically on code agents that execute and generate code; if your tool is a static analyzer, a code review assistant, or any non-agentic system, RedCode’s evaluation methodology simply doesn’t apply to your use case.

Verdict

Use RedCode if you’re building or deploying autonomous code agents that execute commands or generate code with system access, especially before production deployment. It’s essential for AI safety researchers investigating code agent risks, organizations developing internal coding assistants that interact with proprietary systems, or open-source projects that want to demonstrate safety credibility through standardized benchmarking. The framework provides the rigor needed to identify specific safety gaps before they become security incidents, and the public leaderboard offers visibility for comparative evaluation. Skip it if you’re working on traditional static analysis tools, non-agentic code generation models that only produce text output, or systems where code execution happens in completely sandboxed environments with no access to sensitive resources. Also skip it if you need immediate plug-and-play evaluation for custom agent architectures—the current three-agent limitation means you’ll face integration overhead. RedCode solves a critical problem for a specific category of AI systems; make sure your system falls in that category before investing in integration.

RedCode: The First Large-Scale Safety Benchmark That Actually Tests Code Agents in the Wild

RedCode: The First Large-Scale Safety Benchmark That Actually Tests Code Agents in the Wild

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]

// QUOTABLE

RedCode: The First Large-Scale Safety Benchmark That Actually Tests Code Agents in the Wild

Hook

Context

Technical Insight

Gotcha

Verdict

// RELATED

Claw-Code: The Viral Rust AI Coding Tool Built on Controversy

How Engine Simulator Synthesizes Authentic V8 Rumble from Physics, Not Samples

Pi-Mono: A Production-Ready AI Agent Toolkit That Doesn't Lock You Into One LLM Provider

fwknop: How Single Packet Authorization Makes Your SSH Server Invisible to Port Scanners

Claw-Code: The Viral Rust AI Coding Tool Built on Controversy

How Engine Simulator Synthesizes Authentic V8 Rumble from Physics, Not Samples

Pi-Mono: A Production-Ready AI Agent Toolkit That Doesn't Lock You Into One LLM Provider

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]

// QUOTABLE