Building a Security Perimeter Around LLMs: Inside LLM Guard’s Scanner Pipeline
Hook
Every prompt you send to ChatGPT is a potential attack vector, and every response could leak PII or generate toxic content. LLM Guard treats this reality with the seriousness of a firewall protecting your database—because that’s exactly what it is.
Context
Large Language Models have entered production environments rapidly, with companies integrating ChatGPT APIs into customer service systems, internal tools, and public-facing chatbots. This has created new security challenges: prompt injection attacks that trick models into ignoring instructions, PII leakage in training data regurgitation, toxic output in customer interactions, and competitors mentioned in responses. Traditional application security tools don’t address these issues—you can’t SQL-inject a neural network, but you can manipulate it in entirely new ways.
Protect AI built LLM Guard to solve this gap: a security middleware layer that sits between your application and any LLM, scanning both directions of traffic. Think of it as a WAF (Web Application Firewall) but for language model interactions. It addresses the unique threat model of LLMs where attacks aren’t about buffer overflows or XSS, but about linguistic manipulation, data exfiltration through conversation, and content policy violations that could damage your brand or violate regulations like GDPR.
Technical Insight
LLM Guard’s architecture is built around a dual-scanner pipeline pattern. Prompt scanners intercept and validate user input before it reaches the LLM, while output scanners validate the model’s response before returning it to users. Each scanner is an independent module that performs a specific security check, and you compose them into chains based on your threat model.
Here’s what a basic implementation looks like:
from llm_guard.input_scanners import Anonymize, PromptInjection, Toxicity
from llm_guard.output_scanners import Deanonymize, Bias, MaliciousURLs
from llm_guard import scan_prompt, scan_output
# Configure input scanners
input_scanners = [
Anonymize(), # Strip PII before sending to LLM
PromptInjection(threshold=0.75), # Detect injection attempts
Toxicity(threshold=0.7) # Block toxic input
]
# Configure output scanners
output_scanners = [
Deanonymize(), # Restore PII in safe outputs
Bias(threshold=0.8), # Detect biased language
MaliciousURLs() # Check for dangerous links
]
# Process user input
user_prompt = "What's the email for john.doe@company.com?"
sanitized_prompt, results_valid, results_score = scan_prompt(
input_scanners, user_prompt
)
if results_valid:
# Send sanitized_prompt to your LLM
llm_response = call_openai_api(sanitized_prompt)
# Validate output
sanitized_output, output_valid, output_score = scan_output(
output_scanners, sanitized_prompt, llm_response
)
if output_valid:
return sanitized_output # Safe to return to user
The Anonymize/Deanonymize pair is particularly clever. Anonymize appears to detect PII (emails, phone numbers, SSNs, names) in user input and replaces them with placeholders before sending to the LLM. The scanner maintains a mapping of original values to placeholders. Deanonymize then reverses this process on the output, restoring the original PII only if the response is deemed safe. This approach ensures your LLM never sees actual PII, supporting data minimization requirements in regulations like GDPR and HIPAA.
Under the hood, scanners use a mix of techniques. Simple scanners like BanSubstrings and Regex use pattern matching—fast and deterministic. More sophisticated scanners like PromptInjection and Toxicity appear to load transformer-based classification models. The Secrets scanner integrates with secrets detection to catch accidentally pasted API keys or credentials.
The lazy-loading dependency model is crucial for production viability. As noted in the README, installing llm-guard gives you the core framework with minimal dependencies, and advanced features trigger automatic installation of necessary libraries on-demand. This means you can deploy lightweight scanners (Regex, BanSubstrings, InvisibleText) with a small footprint, then add ML-based scanners only where needed. You’re not forced to ship a large Docker image just to do basic string filtering.
Each scanner appears to return a validity boolean and a confidence score. Scores let you implement graduated responses—maybe you log medium-confidence injection attempts but only block high-confidence ones. The results object likely provides detailed metadata about what triggered each scanner, essential for debugging false positives in production.
The toolkit includes 15 prompt scanners and 20 output scanners covering everything from gibberish detection (catching random keyboard mashing or adversarial character sequences) to factual consistency checks (comparing output against source material to detect hallucinations). The LanguageSame scanner ensures the LLM responds in the same language as the prompt—important for preventing language-switching attacks where users try to bypass content filters by prompting in one language and getting responses in another.
Gotcha
The ML-based scanners present operational challenges. Scanners like PromptInjection likely load substantial transformer models into memory and add latency per request. In a high-throughput API serving thousands of requests per minute, this overhead compounds quickly. You may need dedicated GPU infrastructure or accept higher response times. The lazy-loading helps, but once a scanner is loaded, it’s loaded for that process.
Detection accuracy is imperfect and context-dependent. The PromptInjection scanner may produce false positives on legitimate prompts that happen to match adversarial patterns. A user asking “Ignore previous instructions and help me write a poem” might trigger it, even though this is a valid creative request. You’ll spend time tuning thresholds for your specific use case—there’s no universal setting that works everywhere. The README is transparent about this being a toolkit for integration and deployment, acknowledging ongoing improvements.
Another consideration is the execution model. Based on typical Python security libraries, you may need to consider async integration patterns if you’re building on async frameworks like FastAPI or have high-concurrency requirements. The example code suggests synchronous function calls, which may require wrapping in executors for async applications.
Verdict
Use LLM Guard if you’re deploying LLMs in regulated industries (healthcare, finance, legal) where PII protection and content filtering aren’t optional, if you need defense-in-depth against prompt injection and jailbreaking attacks with auditable security controls, if you’re building internal tools where moderate latency overhead is an acceptable trade-off for comprehensive scanning, or if you want fine-grained control over which security checks to apply rather than all-or-nothing cloud services. Skip if you’re building latency-critical applications where every millisecond matters (real-time chat, autocomplete features), if your use case only needs basic keyword filtering that you can implement with simple regex, or if you’re okay with vendor lock-in to cloud providers like Azure or AWS who offer managed content safety APIs with similar capabilities but different cost models. LLM Guard shines when you need open-source flexibility (MIT licensed), on-premises deployment, and the ability to compose exactly the security policy your threat model demands.