
ps-fuzz: Red-Teaming Your LLM System Prompts Before Attackers Do


Hook

Your carefully crafted system prompt that enforces access controls? An attacker might extract it with the right approach. And that’s just one of multiple ways your GenAI application could be compromised.

Context

As organizations rush to deploy LLM-powered applications, system prompts have become the primary mechanism for enforcing business logic, safety guardrails, and access controls. These prompts tell the AI how to behave—what data it can access, what topics to avoid, what tone to use. But here’s the problem: system prompts are fundamentally vulnerable. Unlike traditional application code hidden on a server, system prompts sit in the context window of a model that’s specifically designed to be helpful and follow instructions. An attacker doesn’t need to exploit a buffer overflow or SQL injection—they just need to ask nicely, or trick the model into ignoring its instructions.

The security community has documented dozens of attack patterns: jailbreaks that bypass safety filters, prompt injections that override instructions, RAG poisoning that corrupts retrieval systems. But testing for these vulnerabilities has been manual, time-consuming, and inconsistent. Static lists of attack prompts become stale as models evolve. Generic security scanners miss domain-specific vulnerabilities. What developers needed was a tool that could think like an attacker—dynamically generating adversarial prompts tailored to their specific application, testing systematically, and providing a workflow for iterative hardening. That’s what ps-fuzz delivers.

Technical Insight

System architecture (auto-generated diagram): a User/CI Pipeline supplies the system prompt and attack configuration to a Configuration Layer, which passes attack parameters to the Attack Engine. The engine directs an Attacker LLM to generate adversarial prompts, sends each one to the Target LLM, and hands the response to a Response Evaluator for success/failure detection. Results flow into Results & Reports, and feedback drives the next attack iteration.

ps-fuzz implements a red-team-as-code architecture where one LLM attacks another. The core concept is elegant: configure an “attacker” LLM to generate adversarial prompts against your “target” LLM, then evaluate whether the attack succeeded. This creates a feedback loop that mimics real-world adversarial testing without requiring security expertise.

The tool appears to consist of three functional layers. First, a provider abstraction layer supports 16 different LLM providers through environment variables: OpenAI, Anthropic, Azure OpenAI, Cohere, and more. This means you can test cross-provider scenarios, like using GPT-4 as the attacker against Claude as the target. Second, an attack engine implements 16 different attack categories: jailbreaks, prompt injection, system prompt extraction, and RAG poisoning among others. Third, an evaluation system analyzes responses to determine whether an attack succeeded.

Here’s how you’d run a basic security assessment:

# Set your API keys
export OPENAI_API_KEY=sk-your-key-here

# Install and launch
pip install prompt-security-fuzzer
prompt-security-fuzzer

# The tool prompts you interactively:
# 1. Enter your system prompt
# 2. Select attack types to test
# 3. Configure number of attempts per attack
# 4. Review results

For CI/CD integration, ps-fuzz supports non-interactive batch mode with configurable attack intensity:

prompt-security-fuzzer -b ./system_prompt.examples/medium_system_prompt.txt \
  --attack-provider openai \
  --attack-model gpt-4 \
  --target-provider anthropic \
  --target-model claude-3-opus \
  --num-attempts 10 \
  --num-threads 4 \
  --attack-temperature 0.8

The --num-threads flag enables parallel testing—critical for comprehensive assessments. If you’re testing 16 attack types with 10 attempts each, that’s 160 API calls. Multi-threading reduces wall-clock time from hours to minutes.
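The back-of-envelope math is easy to check. A minimal sketch, assuming a round trip of roughly 5 seconds per attacker-plus-target call pair (an illustrative figure, not a measured one):

```shell
# Estimate total API calls and rough wall-clock time for a full run.
attacks=16        # attack categories tested
attempts=10       # attempts per attack (--num-attempts)
threads=4         # parallel workers (--num-threads)
secs_per_call=5   # assumed seconds per attack round trip (illustrative)

total=$((attacks * attempts))
serial=$((total * secs_per_call))
parallel=$((serial / threads))

echo "calls=$total serial=${serial}s parallel~${parallel}s"
```

With these assumptions, 160 calls run serially take about 13 minutes; four threads cut that to roughly a quarter. Real timings depend heavily on model latency and rate limits.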

What makes ps-fuzz particularly powerful is its dynamic approach. According to the README, the tool “dynamically tailors its tests to your application’s unique configuration and domain.” Unlike static attack databases, it examines your system prompt to understand the application’s purpose, then generates attacks tailored to that context. Testing a customer service bot? It’ll try to extract PII handling rules. Testing a code generation assistant? It’ll attempt to inject malicious code patterns. This dynamic adaptation means you’re testing realistic threats, not generic exploits.

The Playground feature deserves special attention. After running automated tests, you can enter an interactive chat session with your system prompt. This lets you manually probe edge cases the automated fuzzer might have missed, then immediately re-run the full test suite to verify your hardening worked. It’s a tight iteration loop: test, identify weakness, harden prompt, re-test.
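That iteration loop lends itself to automation. A hypothetical sketch, assuming the fuzzer exits non-zero when any attack succeeds (verify that behavior for your version before relying on it; `system_prompt.txt` and the editor fallback are illustrative):

```shell
# Harden-and-retest loop: keep editing the prompt until the suite passes.
while ! prompt-security-fuzzer -b ./system_prompt.txt --num-attempts 5; do
  echo "An attack succeeded: edit system_prompt.txt, then re-test."
  ${EDITOR:-vi} ./system_prompt.txt
done
echo "All configured attacks failed against the current prompt."
```

In CI you would drop the interactive editing step and simply fail the build on a non-zero exit.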

For RAG systems, ps-fuzz includes specific support with embedding configuration:

prompt-security-fuzzer -b ./system_prompt.examples/medium_system_prompt.txt \
  --embedding-provider openai \
  --embedding-model text-embedding-ada-002

The tool also supports custom benchmarks and selective testing:

# Run only specific attack types
prompt-security-fuzzer -b ./system_prompt.examples/medium_system_prompt.txt \
  --custom-benchmark=ps_fuzz/attack_data/custom_benchmark1.csv \
  --tests='["ucar","amnesia"]'

Gotcha

The elephant in the room is cost. Every test run consumes tokens on both the attacker and target sides. A comprehensive assessment with 10 attempts across 16 attack types means 160 API calls. The README explicitly warns: “Using the Prompt Fuzzer will lead to the consumption of tokens.” Depending on your chosen models and prompt complexity, costs can accumulate quickly, especially for continuous testing in CI/CD pipelines. Budget accordingly, and consider using cheaper models for preliminary testing before running expensive comprehensive assessments with frontier models.
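To budget concretely, you can estimate token spend before running. A rough sketch, assuming about 1,500 tokens per attack round trip and $3.00 per million tokens; both numbers are illustrative, so substitute your provider's actual rates:

```shell
# Back-of-envelope token cost for a full assessment.
calls=160                  # 16 attack types x 10 attempts
tokens_per_call=1500       # assumed attacker prompt + target response
price_cents_per_mtok=300   # assumed $3.00 per 1M tokens

total_tokens=$((calls * tokens_per_call))
cost_cents=$((total_tokens * price_cents_per_mtok / 1000000))

printf 'tokens=%d est_cost=$%d.%02d\n' "$total_tokens" \
  $((cost_cents / 100)) $((cost_cents % 100))
```

Under these assumptions a single full run costs well under a dollar, but frontier attacker models, longer prompts, and per-commit CI runs multiply that quickly.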

ps-fuzz is strictly a detection tool—it identifies vulnerabilities but provides no automated remediation. When it discovers that your system prompt leaks under a certain attack pattern, you’re on your own to figure out how to harden it. The Playground helps with iteration, but there’s no “auto-fix” button or library of hardened prompt templates. You’ll need prompt engineering expertise to translate test results into actual security improvements. Additionally, the tool’s effectiveness is bounded by the attacker model’s capabilities. If you use a weak model as the attacker, you’ll miss sophisticated exploits. But using cutting-edge models as attackers amplifies the cost problem. Finally, ps-fuzz doesn’t provide runtime protection—it’s a development and testing tool, not a WAF for your LLM. Once deployed, your application remains vulnerable unless you implement separate input/output filtering.

Verdict

Use ps-fuzz if you’re deploying customer-facing GenAI applications where system prompt compromise has real consequences: leaked business logic, bypassed content filters, unauthorized data access, or brand reputation damage. It’s invaluable during the prompt engineering phase when you’re still iterating on system prompt design, and it integrates cleanly into CI/CD for regression testing as you update prompts. Organizations with compliance requirements (healthcare, finance, legal) should absolutely run adversarial testing before production deployment. The multi-provider support (16 LLM providers) makes it ideal if you’re evaluating different vendors or running multi-model architectures, and the tool supports both interactive mode for exploration and batch mode for automation.

Skip it if you’re building internal prototypes where prompt leakage poses minimal risk, or if your GenAI application doesn’t rely on system prompts for security enforcement. Also skip if token budget is severely constrained, since comprehensive testing consumes significant API credits. If you need automated defense rather than just testing, you’ll need to supplement ps-fuzz with runtime protection mechanisms.
