ps-fuzz: Testing GenAI Security with LLM-Against-LLM Combat
Hook
Your carefully crafted system prompt that prevents data leakage? A sufficiently motivated attacker LLM can probably bypass it in under 10 attempts. The question isn't whether your GenAI app has prompt vulnerabilities—it's how many you haven't discovered yet.
Context
The explosion of GenAI applications has created a new vulnerability surface that traditional security tools can't address. Unlike SQL injection or XSS attacks with well-established patterns, prompt injection and jailbreaking are fundamentally different: they exploit the semantic understanding of language models rather than parsing bugs. A simple 'Ignore previous instructions' might seem trivial, but when embedded in legitimate user input or poisoned RAG documents, it becomes a vector for data exfiltration, unauthorized actions, or system prompt extraction.
Traditional security testing approaches fall short here. Static analysis can't evaluate natural language semantics. Penetration testers familiar with web vulnerabilities lack frameworks for systematically testing LLM-specific attack vectors. Meanwhile, the OWASP LLM Top 10 provides excellent taxonomy but leaves implementation of actual test cases to individual teams. ps-fuzz emerged to fill this gap: a systematic, automated framework that uses the very technology under attack—LLMs themselves—to generate contextually relevant adversarial prompts and evaluate whether your defenses hold.
Technical Insight
The architectural brilliance of ps-fuzz lies in its LLM-against-LLM methodology. Rather than maintaining static attack databases that quickly become stale, it employs one LLM as the 'attacker' to dynamically generate adversarial prompts tailored to your specific system prompt, then tests these against your 'target' LLM. This creates an arms race in miniature: the attacker model reasons about your defenses and crafts contextually appropriate attacks.
The framework supports 16+ LLM providers through a unified interface, meaning you can use GPT-4 to attack a Claude-based application, or vice versa. This cross-pollination is valuable because different model families have different failure modes. Here's how you'd configure a basic test scenario:
from psfuzz import Fuzzer
# Define your system prompt that needs hardening
system_prompt = """
You are a customer service agent for FinanceApp.
You can access user account balances and transaction history.
NEVER reveal information about users other than the authenticated user.
NEVER execute commands or run code.
"""
# Configure the fuzzer with attacker and target LLMs
fuzzer = Fuzzer(
target_provider="openai",
target_model="gpt-4",
target_system_prompt=system_prompt,
attacker_provider="anthropic",
attacker_model="claude-3-opus-20240229"
)
# Run specific attack categories
results = fuzzer.test(
attack_modes=[
"jailbreak",
"prompt_injection",
"context_manipulation",
"system_prompt_extraction"
],
num_attempts=10
)
# Evaluate results
for result in results:
print(f"Attack: {result.attack_type}")
print(f"Prompt: {result.adversarial_prompt}")
print(f"Response: {result.target_response}")
print(f"Bypassed: {result.security_bypassed}")
print(f"Score: {result.vulnerability_score}/10")
The real power emerges in ps-fuzz's evaluation framework. It doesn't just send attacks; it intelligently assesses whether the target's response indicates a security bypass. This uses configurable evaluation strategies—from simple keyword matching to LLM-based semantic analysis. For a financial services application, you might configure custom evaluation criteria that specifically detect account number disclosure or unauthorized transaction details.
Particularly innovative is the RAG poisoning module. Retrieval-augmented generation systems pull context from vector databases, creating a unique attack surface: malicious content in embedded documents can influence model behavior. ps-fuzz can test this by generating poisoned documents and evaluating whether they successfully manipulate the model's outputs:
# Test RAG-specific vulnerabilities
rag_fuzzer = Fuzzer(
target_provider="openai",
target_model="gpt-4",
target_system_prompt=system_prompt,
rag_config={
"embedding_model": "text-embedding-ada-002",
"vector_db": "pinecone",
"namespace": "customer_docs"
}
)
rag_results = rag_fuzzer.test_rag_poisoning(
attack_types=["context_injection", "semantic_manipulation"],
num_poisoned_docs=5
)
The framework's multi-threaded execution model runs multiple attack vectors concurrently, dramatically reducing total test time. For a comprehensive security audit covering all 16 attack categories with 10 attempts each, you might expect 160 sequential LLM calls to take hours. ps-fuzz's parallelization brings this down to minutes, though at the cost of significantly higher token consumption during the test window.
The interactive playground mode deserves special mention. Rather than batch testing, you can iteratively harden your system prompt: run an attack, see what bypassed your defenses, refine your prompt's security instructions, and immediately re-test. This tight feedback loop is invaluable during the development phase, transforming prompt engineering from an art into a measurable, improvable process.
Gotcha
The elephant in the room is token consumption costs. Since ps-fuzz uses LLM calls for both attack generation and target evaluation, a comprehensive test suite can burn through tokens rapidly. A full security audit with 10 attempts across 16 attack categories means 320+ LLM API calls (attack generation + target testing + evaluation). If you're using GPT-4 or Claude Opus on both sides, you could easily rack up $50-100 per complete test run. For teams with tight budgets or those wanting to run tests frequently in CI/CD pipelines, this becomes prohibitively expensive. The documentation doesn't provide clear guidance on cost optimization strategies like using cheaper models for attack generation or caching attack prompts.
The second limitation is more subtle: your security testing is fundamentally limited by the capabilities of your chosen attacker model. If you use a weak or heavily safety-tuned model as your attacker, it may refuse to generate certain adversarial prompts or produce attacks that lack sophistication. Conversely, the most capable models (which make the best attackers) are also the most expensive. There's also an evaluation accuracy problem: determining whether a response represents a 'security bypass' requires nuanced judgment. The automated evaluation can produce false positives (flagging benign responses as vulnerabilities) or false negatives (missing subtle information leakage). Teams need manual review processes to validate findings, which reduces the automation value proposition.
Verdict
Use if: You're deploying customer-facing GenAI applications where prompt injection could lead to data leakage, unauthorized access, or compliance violations. This is essential for multi-tenant SaaS products, financial services chatbots, healthcare applications, or any RAG system processing sensitive documents. It's particularly valuable when you need repeatable security testing across multiple LLM providers or want to integrate adversarial testing into your development workflow. Teams with security compliance requirements (SOC 2, HIPAA, GDPR) will find the automated reporting invaluable for demonstrating due diligence. Skip if: You're building internal tools with trusted users, working on low-stakes prototypes, or operating under severe budget constraints where $50-100 per test run isn't feasible. If your application doesn't use system prompts or has no sensitive data exposure risk, manual testing with OWASP LLM Top 10 guidelines might suffice. Also skip if you expect a fully automated solution that requires zero manual validation—you'll still need security expertise to interpret results and properly harden your prompts based on findings.