garak: Red-Teaming LLMs with a Security Scanner’s Mindset
Hook
If you’re deploying an LLM without security testing, you’re essentially running a production API without ever checking for SQL injection. garak treats AI models like attack surfaces—because they are.
Context
Large language models have moved from research curiosities to production infrastructure powering customer support, code generation, and decision-making systems. But unlike traditional software where we’ve spent decades developing security testing methodologies, LLM security remains ad-hoc. Developers test for accuracy and latency, but rarely systematically probe for adversarial failures—prompt injection attacks that leak system prompts, jailbreaks that bypass safety guardrails, or subtle biases that generate toxic content.
NVIDIA’s garak fills this gap by bringing penetration testing discipline to generative AI. Named after a cunning character from Star Trek: Deep Space Nine, it mirrors the philosophy of security tools like nmap and Metasploit: systematically enumerate vulnerabilities before attackers do. The project was presented at DEF CON and has accumulated over 7,325 GitHub stars from security researchers and ML engineers who recognize that pre-deployment scanning is cheaper than post-breach damage control.
Technical Insight
garak’s architecture follows a probe-detector pattern borrowed from security testing frameworks. Probes generate adversarial inputs targeting specific vulnerability classes—prompt injection, jailbreaks, hallucination, data leakage, toxicity. Detectors analyze model outputs to classify failures. This separation of concerns allows the system to mix and match attack strategies with detection heuristics.
The tool abstracts LLM interaction through a generator layer that supports virtually any model interface. You can test OpenAI’s API, Hugging Face models, AWS Bedrock endpoints, or even local GGUF models via llama.cpp (version >= 1046). Here’s how you’d scan GPT-3.5 for encoding-based prompt injection attempts:
export OPENAI_API_KEY="sk-your-key-here"
python3 -m garak --target_type openai --target_name gpt-3.5-turbo --probes encoding
Behind the scenes, garak loads the encoding probe family, which crafts prompts using techniques like base64 encoding, ROT13 ciphers, and Unicode smuggling to bypass input filters. Each probe runs multiple generations (default 10) per prompt to account for temperature-induced variability in LLM outputs. Results stream to JSONL files, creating an audit trail of every prompt-response pair and detection verdict.
The probe taxonomy is extensive. The promptinject family implements attacks from the PromptInject research framework. The dan probes test resistance to “Do Anything Now” jailbreaks—carefully crafted personas that trick models into ignoring safety training. The lmrc (Language Model Risk Cards) probes check for slur generation and biased outputs. You can target specific vulnerabilities:
# Test only for DAN 11.0 jailbreak variant
python3 -m garak --target_type huggingface --target_name gpt2 --probes dan.Dan_11_0
# Scan for specific weakness categories
python3 -m garak --target_type replicate --target_name meta/llama-2-70b --probes promptinject
The plugin architecture allows extending garak with custom probes and detectors. This design makes it straightforward to encode organization-specific red-team knowledge—perhaps you’ve discovered a novel injection technique or need to test compliance with domain-specific content policies.
What distinguishes garak from simple prompt libraries is its systematic coverage and statistical rigor. Running all probes against a model can exercise hundreds of attack vectors across multiple turns of conversation. The default 10 generations per prompt helps distinguish genuine vulnerabilities from statistical flukes caused by sampling randomness. The JSONL output format integrates cleanly into CI/CD pipelines, allowing you to fail builds when vulnerability thresholds are exceeded.
Gotcha
garak’s thoroughness comes with resource costs. A full scan generates thousands of LLM completions, which translates to significant API expenses for commercial models or hours of GPU time for self-hosted ones. The default 10 generations per prompt multiplies costs further—necessary for statistical confidence, but painful for large-scale testing. Budget accordingly, or narrow your scope with specific probe selections.
Detector reliability varies across vulnerability categories. Prompt injection and jailbreak detection can produce false positives, especially with models that creatively interpret instructions. A detector might flag a benign response as a safety failure, or miss a subtle data leak. Some vulnerability types require domain expertise to validate—automated detectors catch obvious fabrications but struggle with plausible-sounding misinformation. You’ll need human review of flagged outputs, particularly for high-stakes deployments.
Finally, remember that garak performs pre-deployment assessment, not runtime protection. It tells you where your model is vulnerable but doesn’t prevent attacks in production. Think of it as a security audit, not a firewall. If a garak scan reveals your model is susceptible to prompt injection, you’ll need separate mitigations—input validation, output filtering, or runtime defense systems. The tool is developed primarily for Linux and OSX, though Windows support exists via active test workflows.
Verdict
Use garak if you’re deploying LLMs in customer-facing applications, handling sensitive data, or operating in regulated industries where model failures carry real consequences. Security teams evaluating AI vendors should require garak scan results as part of procurement. AI safety researchers will find it invaluable for systematically exploring model boundaries. Skip it if you’re prototyping low-stakes applications, working with heavily resource-constrained environments where you can’t afford thousands of test generations, or if you need real-time guardrails rather than offline testing—pair garak scans with runtime protection systems for defense-in-depth. For serious LLM deployments, running garak before launch should be as automatic as running a vulnerability scanner before deploying a web application.