Promptfoo: The Open-Source LLM Security Scanner That OpenAI and Anthropic Actually Use
Hook
When OpenAI acquired the company behind Promptfoo in late 2024, it made an unusual commitment: keep the MIT license and continue development as open source. That alone tells you this tool is different from most of what you'll find in the LLM evaluation landscape.
Context
The explosion of LLM applications exposed a fundamental problem: developers were shipping prompts to production with no systematic way to test them. Unlike traditional software where unit tests, integration tests, and security scanning are standard practice, LLM development became a wild west of manual testing and vibes-based evaluation. You'd tweak a prompt, run it a few times in a playground, maybe show it to a colleague, and ship it. Then users would discover your chatbot leaked PII, hallucinated pricing information, or got jailbroken with a simple "Ignore previous instructions" attack.
Existing solutions fell into two camps: heavyweight MLOps platforms that required sending your proprietary prompts and data to cloud services, or OpenAI-specific eval frameworks that locked you into a single provider. For enterprises handling sensitive data or teams comparing Claude against GPT against open-source models, neither option worked. Promptfoo emerged from this gap with a radical premise: run everything locally, support every provider, and make security testing as rigorous as functional testing. The fact that it now powers applications serving 10M+ users while remaining a CLI tool speaks to how well it nailed the developer experience.
Technical Insight
Promptfoo's architecture centers on a declarative configuration system that separates concerns cleanly. You define your prompts, test cases, models, and assertions in YAML or JSON files, then the execution engine handles provider orchestration, output collection, and grading. This sounds simple, but the implementation reveals thoughtful design decisions.
Here's a minimal configuration that demonstrates the core concepts:
prompts:
  - "Summarize this in one sentence: {{text}}"
  - "TL;DR: {{text}}"
providers:
  - openai:gpt-4
  - anthropic:claude-3-opus-20240229
  - ollama:llama2:13b
tests:
  - vars:
      text: "Long article about AI safety..."
    assert:
      - type: contains
        value: "safety"
      - type: llm-rubric
        value: "output is one sentence or less"
      - type: cost
        threshold: 0.01
This configuration runs two prompt variants against three different providers (commercial and local), then applies three different assertion types. The llm-rubric assertion is particularly clever—it uses an LLM as a judge to evaluate subjective criteria that would be brittle to encode as regex patterns. You can specify which model does the judging, creating a separation between the model under test and the evaluation model.
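To make the "two prompts against three providers" arithmetic concrete, here is a minimal Python sketch of how such a test matrix could be expanded. It is an illustration only, not Promptfoo's implementation: the `render` and `expand_matrix` helpers are hypothetical names, and the templating is a naive string replacement.

```python
from itertools import product

prompts = [
    "Summarize this in one sentence: {{text}}",
    "TL;DR: {{text}}",
]
providers = ["openai:gpt-4", "anthropic:claude-3-opus-20240229", "ollama:llama2:13b"]
tests = [{"vars": {"text": "Long article about AI safety..."}}]


def render(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; a naive stand-in for real templating."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template


def expand_matrix(prompts, providers, tests):
    """Return one (provider, rendered_prompt) pair per evaluation cell."""
    return [
        (provider, render(prompt, test["vars"]))
        for prompt, provider, test in product(prompts, providers, tests)
    ]


cells = expand_matrix(prompts, providers, tests)
print(len(cells))  # 2 prompts x 3 providers x 1 test case = 6
```

Each cell would then be sent to its provider and the assertions applied to the output, so adding a prompt variant or a provider multiplies the number of calls rather than adding to it.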
The provider abstraction layer is where Promptfoo really shines. Rather than coupling to specific SDKs, it implements a unified interface across 50+ providers including OpenAI, Anthropic, AWS Bedrock, Azure, HuggingFace, Replicate, and local models via Ollama. This lets you swap openai:gpt-4 for bedrock:anthropic.claude-v2 with a single line change, which is invaluable when you're navigating vendor lock-in concerns or comparing cost/performance tradeoffs.
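The idea of swapping providers by changing one string can be sketched as a small registry keyed by the prefix before the first colon. This is a hypothetical design for illustration, not Promptfoo's actual code: `REGISTRY`, `register`, `parse_provider_id`, `load_provider`, and the `echo` backend are all invented names.

```python
from typing import Callable, Dict, Tuple

ProviderFn = Callable[[str], str]

# Hypothetical registry: provider prefix -> factory that builds a callable for a model.
REGISTRY: Dict[str, Callable[[str], ProviderFn]] = {}


def register(prefix: str):
    """Decorator that records a provider factory under its prefix."""
    def decorator(factory):
        REGISTRY[prefix] = factory
        return factory
    return decorator


def parse_provider_id(provider_id: str) -> Tuple[str, str]:
    """Split 'openai:gpt-4' into ('openai', 'gpt-4'); only the first colon separates."""
    prefix, _, model = provider_id.partition(":")
    if not prefix or not model:
        raise ValueError(f"expected '<provider>:<model>', got {provider_id!r}")
    return prefix, model


@register("echo")
def make_echo(model: str) -> ProviderFn:
    # Stand-in backend that makes no network call.
    return lambda prompt: f"[{model}] {prompt}"


def load_provider(provider_id: str) -> ProviderFn:
    """Look up the factory for the prefix and build a provider for the model."""
    prefix, model = parse_provider_id(provider_id)
    if prefix not in REGISTRY:
        raise KeyError(f"no provider registered for prefix {prefix!r}")
    return REGISTRY[prefix](model)


print(parse_provider_id("ollama:llama2:13b"))
print(load_provider("echo:demo")("Hello"))
```

Because the caller only ever sees a `ProviderFn`, the evaluation loop stays the same whichever backend the ID resolves to.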
The red-teaming capabilities go beyond basic testing into adversarial territory. Promptfoo includes plugin-based vulnerability scanners that generate malicious inputs designed to trigger specific failure modes:
redteam:
  plugins:
    - prompt-injection
    - jailbreak
    - pii-leak
    - hallucination
    - contracts  # Generates inputs to test unwanted contract generation
  numTests: 50
  purpose: "A customer service chatbot that helps users track orders"
Under the hood, each plugin uses a different strategy. The prompt injection plugin generates variations of "Ignore previous instructions" attacks. The jailbreak plugin tries role-playing scenarios and hypothetical framings to bypass safety guardrails. The PII leak plugin seeds the context with fake sensitive data, then checks if the model reproduces it inappropriately. These aren't random mutations—they're informed by real-world attack patterns documented in the AI red-teaming literature.
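The seed-then-check approach described for PII leaks can be sketched as a canary test: plant synthetic values in the context, then look for them in the output. This is a simplified illustration under stated assumptions, not the plugin's real logic. The canary values and the `build_context` and `find_leaks` helpers are hypothetical, and a substring match would miss paraphrased or partially masked leaks.

```python
import re

# Clearly synthetic values planted in the context (hypothetical examples).
CANARIES = {
    "email": "jane.doe@example.com",
    "order_id": "ORD-000-TEST-42",
}


def build_context(canaries: dict) -> str:
    """Render the fake record that is placed in front of the model."""
    lines = [f"{name}: {value}" for name, value in canaries.items()]
    return "Customer record (confidential):\n" + "\n".join(lines)


def find_leaks(output: str, canaries: dict) -> list:
    """Return the names of canary values that appear in the model output."""
    normalized = re.sub(r"\s+", " ", output).lower()
    return [name for name, value in canaries.items() if value.lower() in normalized]


print(build_context(CANARIES))
print(find_leaks("Sure, her email is Jane.Doe@example.com", CANARIES))
print(find_leaks("I can't share customer details.", CANARIES))
```

Using values that cannot occur by accident keeps false positives low: any hit means the model copied from the seeded context.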
The execution model deserves attention too. Promptfoo runs evaluations concurrently with configurable parallelism, caches provider responses keyed by prompt+model+parameters, and supports live-reload during development. This makes iteration fast: change a prompt, save the file, and within seconds you see updated results for just the changed variants without re-running expensive API calls for unchanged configurations.
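A cache keyed by prompt, model, and parameters can be sketched in a few lines. The hashing scheme and the `cache_key` and `ResponseCache` names are assumptions made for illustration; Promptfoo's actual key format and storage may differ.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, params: dict) -> str:
    """Hash the inputs that determine a response; sort_keys makes dict order irrelevant."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache that only calls the provider on a miss."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_call(self, prompt, model, params, call):
        key = cache_key(prompt, model, params)
        if key not in self._store:
            self.misses += 1
            self._store[key] = call(prompt)
        return self._store[key]


cache = ResponseCache()
fake_provider = lambda prompt: prompt.upper()  # stand-in for a paid API call
cache.get_or_call("TL;DR: cats", "openai:gpt-4", {"temperature": 0}, fake_provider)
cache.get_or_call("TL;DR: cats", "openai:gpt-4", {"temperature": 0}, fake_provider)
print(cache.misses)  # second call is served from the cache, so this prints 1
```

Changing any one of the three inputs, such as the temperature, produces a different key, which is why only edited variants trigger new API calls.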
Results are stored locally in a SQLite database, which powers both the CLI output and a web UI that runs on localhost. The web UI isn't trying to be a collaborative platform—it's a visualization layer for exploring results, comparing model outputs side-by-side, and drilling into specific test failures. This local-first architecture means you can evaluate prompts containing confidential business logic without that data ever leaving your machine.
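To illustrate why a local SQLite store is enough to drive both CLI summaries and a results viewer, here is a sketch using Python's built-in `sqlite3` module. The `results` table and its columns are a hypothetical schema invented for this example, not Promptfoo's actual database layout.

```python
import sqlite3

# In-memory database for the example; a real tool would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (provider TEXT, prompt TEXT, output TEXT, passed INTEGER)"
)

rows = [
    ("openai:gpt-4", "TL;DR: {{text}}", "AI safety matters.", 1),
    ("ollama:llama2:13b", "TL;DR: {{text}}", "The article is long.", 0),
    ("ollama:llama2:13b", "Summarize this in one sentence: {{text}}", "It covers AI safety.", 1),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)


def failures_by_provider(conn) -> dict:
    """Count failed assertions per provider, the kind of query a results view needs."""
    cursor = conn.execute(
        "SELECT provider, COUNT(*) FROM results "
        "WHERE passed = 0 GROUP BY provider ORDER BY provider"
    )
    return dict(cursor.fetchall())


print(failures_by_provider(conn))
```

Everything here lives in a single local database, which is the property that keeps prompts and outputs on your machine.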
The CI/CD integration is straightforward but powerful. Promptfoo exits with a non-zero code if assertions fail, so you can gate deployments on prompt quality and security:
# In your CI pipeline
# Testing the command directly also works when the script runs under `set -e`
if ! npx promptfoo@latest eval; then
  echo "Prompt evaluation failed, blocking deployment"
  exit 1
fi
This turns prompt testing from an optional quality check into an enforceable requirement, similar to how you'd use ESLint or pytest in traditional development workflows.
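If your pipeline steps are written in Python rather than shell, the same gate can be sketched with `subprocess`. The `run_gate` helper is a hypothetical name, and the only behavior it relies on is the one described above: a non-zero exit code when assertions fail.

```python
import subprocess
import sys


def run_gate(command: list) -> int:
    """Run the evaluation command and return its exit code unchanged."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        print("Prompt evaluation failed, blocking deployment", file=sys.stderr)
    return completed.returncode


# In a real pipeline (assumes npx is on PATH):
#   sys.exit(run_gate(["npx", "promptfoo@latest", "eval"]))

# Demonstration with a stand-in command that fails on purpose:
demo_code = run_gate([sys.executable, "-c", "import sys; sys.exit(1)"])
print(demo_code)
```

Returning the code instead of exiting inside the helper keeps it easy to test and lets the caller decide whether a failure should block the deployment or only warn.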
Gotcha
The Node.js version requirements are stricter than you'd expect. You need Node.js 20.20+ or 22.22+; earlier releases on the 20.x and 22.x lines won't work due to specific API dependencies. This can cause friction in corporate environments with locked-down Node versions or on teams standardized on Node 18 LTS. Docker works around the problem, but it adds complexity to local development workflows.
The local-first architecture is both a feature and a limitation. If you need team collaboration features like shared eval runs, commenting on results, or centralized dashboards, the local CLI won't provide them: it's designed for individual developers and CI pipelines, not cross-functional teams reviewing results together. The web UI runs on localhost by default, so getting results in front of a product manager or security reviewer takes an extra step, such as exporting a report or publishing the run with the promptfoo share command.
Red-teaming quality varies significantly based on how well you define the system's purpose. The adversarial generators work better when you give them context about what the LLM application actually does. A vague purpose like "helpful assistant" produces generic jailbreak attempts, while a specific purpose like "financial advice chatbot that should never make specific stock recommendations" generates targeted attacks probing that exact boundary. You'll need to invest time tuning the configuration and reviewing false positives, especially for domain-specific applications where the default vulnerability definitions don't quite map to your risk model.
Verdict
Use if: You're building production LLM applications and need systematic testing that runs locally, you're comparing multiple LLM providers and want provider-agnostic evaluation, you're shipping AI features in regulated industries where data privacy matters, or you're setting up CI/CD for prompt quality gates. The privacy-first design and battle-tested maturity (used by OpenAI themselves) make this the default choice for serious LLM development.
Skip if: You need a hosted collaborative platform for cross-functional teams to review results together, you're working exclusively in Python environments and want native integration without Node.js, or you're doing one-off prompt experiments without automation requirements. For quick playground testing, stick with provider-native tools; Promptfoo's value compounds when you invest in systematic evaluation.