Building a Serverless Prompt Injection Detector with Cascading Similarity Metrics

Hook

At high message volumes, a naive prompt injection detector that runs an expensive check on every message against a database of known attacks can cost hundreds of dollars per day in Lambda execution time. A cascading filter that reserves the expensive check for suspicious traffic can cut that cost by as much as 80% while maintaining detection accuracy on known attack patterns.

Context

As LLM-powered applications move from experimental demos to production systems handling real user input, prompt injection has emerged as a critical security vulnerability. Unlike traditional injection attacks that exploit parsing logic, prompt injections manipulate the model's instruction-following behavior itself. Convincing a model such as GPT-4 to ignore its system prompt and follow attacker instructions can be as simple as typing "Ignore previous instructions and..."

Most developers approach this problem by implementing expensive request-time validations: sending every user message to another LLM for safety classification, running regex patterns against thousands of attack signatures, or using commercial APIs that charge per request. In serverless environments where you pay for execution time, these approaches become prohibitively expensive at scale. Denzel Crocker, the repository reviewed here (named after the Fairly OddParents character obsessed with detecting fairy godparents), takes a different approach: use cheap operations to filter out obviously benign traffic, then apply expensive validation only to suspicious messages.

Technical Insight

The architecture implements a two-stage filtering pipeline that exploits the cost difference between vector arithmetic and string-comparison algorithms. Stage one calculates cosine similarity between the incoming message and a corpus of known malicious prompts drawn from labeled datasets hosted on Hugging Face. Once the message has been embedded, this comparison is computationally cheap: it is a single matrix-vector product against pre-computed embeddings. Messages that score above a similarity threshold get flagged for deeper inspection.

Stage two applies ROUGE-L (Longest Common Subsequence) scoring to these flagged messages, but with a clever inversion: instead of comparing against bad prompts, it measures similarity to known benign prompts. A message that looks like a known attack AND doesn't look like legitimate user input gets routed to a separate SQS queue for additional scrutiny or rejection. Because a message must fail both tests to be flagged, this two-dimensional comparison can reduce false positives compared with single-metric approaches.
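To make the stage-two metric concrete, here is a minimal pure-Python sketch of ROUGE-L as an F-measure over the longest common subsequence of tokens. It is an illustration only: the function names `lcs_length` and `rouge_l_f1` are hypothetical, and the sketch uses plain lowercase whitespace tokens rather than the tokenization and stemming applied by the rouge_score package, so its numbers will not match that library exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Classic dynamic program, kept to two rows: O(len(a) * len(b)) time.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings, using lowercase whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)
```

The nested loop is the reason this stage is the expensive one: each comparison costs time proportional to the product of the two token counts, and it is repeated for every benign prompt in the corpus.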

Here's how the core detection logic works in practice:

import numpy as np
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer

class PromptInjectionDetector:
    def __init__(self, malicious_embeddings, benign_prompts, 
                 cosine_threshold=0.75, rouge_threshold=0.3):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.malicious_embeddings = malicious_embeddings
        self.benign_prompts = benign_prompts
        self.cosine_threshold = cosine_threshold
        self.rouge_threshold = rouge_threshold
        self.scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    
    def detect(self, incoming_message):
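        # Stage 1: Fast cosine similarity against known attacks.
        # Normalize the query embedding so the dot product below is a true
        # cosine similarity; this assumes the rows of malicious_embeddings
        # were L2-normalized when they were pre-computed.
        message_embedding = self.model.encode(
            incoming_message, normalize_embeddings=True
        )
        similarities = np.dot(self.malicious_embeddings, message_embedding)
        # Convert to a plain float so the result dict is JSON-serializable.
        max_similarity = float(np.max(similarities))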
        
        if max_similarity < self.cosine_threshold:
            return {"classification": "benign", "confidence": 1 - max_similarity}
        
        # Stage 2: ROUGE-L validation against benign corpus
        rouge_scores = [
            self.scorer.score(benign, incoming_message)['rougeL'].fmeasure
            for benign in self.benign_prompts
        ]
        # default=0.0 avoids a ValueError when the benign corpus is empty
        max_benign_similarity = max(rouge_scores, default=0.0)
        
        if max_benign_similarity > self.rouge_threshold:
            return {"classification": "benign", "confidence": max_benign_similarity}
        
        return {
            "classification": "suspicious",
            "cosine_score": float(max_similarity),
            "rouge_score": float(max_benign_similarity)
        }

The Lambda function implementation loads pre-computed embeddings from S3 at cold start, processes batches of messages from an input SQS queue, and routes results to classification-specific queues. The cascading approach means most messages never reach the ROUGE-L calculation step, which is significantly more expensive computationally.
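The batch-and-route flow described above could look roughly like the following sketch. The queue URLs, the `route_batch` helper, and the injected `detect` and `send` callables are all assumptions made for illustration, not code taken from the repository. Passing the callables in keeps the routing logic independent of AWS so it can be exercised locally.

```python
import json

# Hypothetical queue URLs; real values would come from configuration.
QUEUE_URLS = {
    "benign": "https://sqs.example.invalid/benign-queue",
    "suspicious": "https://sqs.example.invalid/suspicious-queue",
}


def route_batch(event, detect, send):
    """Classify each SQS record in a Lambda event and forward it.

    detect: callable taking message text and returning a dict with a
            "classification" key (for example PromptInjectionDetector.detect).
    send:   callable (queue_url, body_str) that delivers one message.
    Returns a count of messages routed per classification.
    """
    counts = {}
    for record in event.get("Records", []):
        text = record["body"]  # SQS-triggered Lambdas receive the payload here
        result = detect(text)
        label = result["classification"]
        send(QUEUE_URLS[label], json.dumps({"message": text, "result": result}))
        counts[label] = counts.get(label, 0) + 1
    return counts
```

In a deployed function, `send` might wrap boto3's SQS client, for example `lambda url, body: sqs.send_message(QueueUrl=url, MessageBody=body)`, with the client and the detector both created outside the handler so they are reused across warm invocations.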

What makes this architecture particularly well suited to serverless is its cost profile. The cosine similarity step completes in single-digit milliseconds even for large corpora because it reduces to dot products on fixed-size vectors. ROUGE-L scoring computes a longest common subsequence between the message and each benign prompt, so its cost grows with message length and with the size of the benign corpus. By filtering out 70-80% of messages in the cheap stage one, you avoid the expensive stage two computation for the majority of your traffic.
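The saving can be estimated with simple arithmetic: every message pays for stage one, and only the unfiltered share pays for stage two. The sketch below is a back-of-the-envelope model, and the function names and any timings fed into it are hypothetical rather than measurements from the project. The stage-one figure should include the time to embed the message, not only the dot product.

```python
def expected_cost_ms(stage1_ms, stage2_ms, filtered_fraction):
    """Average per-message compute time for the two-stage cascade."""
    if not 0.0 <= filtered_fraction <= 1.0:
        raise ValueError("filtered_fraction must be between 0 and 1")
    # Every message pays for stage one; only the unfiltered share pays for stage two.
    return stage1_ms + (1.0 - filtered_fraction) * stage2_ms


def savings_vs_always_stage2(stage1_ms, stage2_ms, filtered_fraction):
    """Fractional saving relative to running stage two on every message."""
    cascade = expected_cost_ms(stage1_ms, stage2_ms, filtered_fraction)
    return 1.0 - cascade / stage2_ms
```

With assumed timings of 5 ms for stage one and 200 ms for stage two, filtering 80% of traffic gives an average of about 45 ms per message, a saving of a little under 80% compared with running stage two on everything. The model also shows the limit of the approach: the saving can never exceed the filtered fraction, and it shrinks as stage one gets slower.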

The routing to separate SQS queues is architecturally significant because it decouples detection from enforcement. Your downstream systems can implement different policies: immediately reject suspicious messages, send them for human review, apply additional LLM-based validation, or log them for analysis while still allowing the request to proceed. This flexibility is crucial in production where security requirements vary by use case—a customer support chatbot might accept more risk than a code generation assistant with database access.

One subtle but important implementation detail: the system uses Hugging Face datasets as its source of truth for prompt corpora rather than hardcoding examples. This means the detection signatures can be updated by pointing to new dataset versions without code changes. The downside is dependency on external data sources, but the upside is that as the security research community discovers new injection techniques and publishes them to shared datasets, your detector can pick them up by refreshing its corpus and recomputing embeddings.
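As a rough illustration of treating a dataset as the source of truth, the sketch below splits labeled rows into malicious and benign corpora. The helper names, the `text` and `label` column names, and the convention that label 1 means malicious are all assumptions; the actual dataset schema used by the project may differ. The `load_corpus` wrapper relies on the `revision` argument of `datasets.load_dataset` to pin a specific dataset version.

```python
def split_corpus(rows, text_key="text", label_key="label", malicious_label=1):
    """Split labeled rows into (malicious_texts, benign_texts)."""
    malicious, benign = [], []
    for row in rows:
        target = malicious if row[label_key] == malicious_label else benign
        target.append(row[text_key])
    return malicious, benign


def load_corpus(dataset_name, revision=None, split="train"):
    """Fetch a labeled dataset from the Hugging Face Hub and split it.

    Pinning `revision` (a branch, tag, or commit hash) keeps the
    detection signatures reproducible between deployments.
    """
    from datasets import load_dataset  # lazy import; requires the datasets package

    rows = load_dataset(dataset_name, split=split, revision=revision)
    return split_corpus(rows)
```

Pinning a revision trades automatic updates for reproducibility: the corpus only changes when the pinned value is changed, at which point the malicious embeddings need to be recomputed and re-uploaded.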

Gotcha

The repository documentation explicitly calls out its conceptual nature—this is reference architecture, not production-ready code. There's minimal error handling, no structured logging, and critical configuration values like similarity thresholds appear to be hardcoded rather than externalized as environment variables. You'll need to wrap this in significant operational scaffolding before deploying it to handle real traffic.
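One small piece of that scaffolding is moving the thresholds out of code. The following is a minimal sketch, assuming hypothetical environment variable names `COSINE_THRESHOLD` and `ROUGE_THRESHOLD` and reusing the default values from the constructor shown earlier.

```python
import os


def load_thresholds(env=None):
    """Read detector thresholds from environment variables with defaults."""
    env = os.environ if env is None else env
    thresholds = {
        "cosine_threshold": float(env.get("COSINE_THRESHOLD", "0.75")),
        "rouge_threshold": float(env.get("ROUGE_THRESHOLD", "0.3")),
    }
    # Fail fast at cold start instead of silently misclassifying traffic.
    for name, value in thresholds.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return thresholds
```

Because the keys match the constructor's keyword arguments, the result can be passed straight through, for example `PromptInjectionDetector(embeddings, benign_prompts, **load_thresholds())`.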

More fundamentally, this approach inherits all the limitations of signature-based detection systems. It will catch known prompt injection patterns and variations that are semantically similar to training examples, but novel attack techniques will sail right through. The security research community regularly publishes new injection methods—token smuggling, context confusion, Unicode obfuscation—and there's necessarily a lag between discovery and incorporation into public datasets. If an attacker crafts a prompt injection that doesn't resemble your training corpus, cosine similarity won't flag it. This is a first line of defense, not a complete solution. You'd want to layer this with input sanitization, output validation, and privilege separation in your LLM system architecture.

Verdict

Use if: You're building an LLMOps pipeline on AWS with budget constraints, need a cost-effective filter for known injection patterns, want a reference implementation to understand cascading detection strategies, or need something that integrates cleanly with existing SQS-based message processing. This is excellent as a learning resource or starting point for custom detection systems.

Skip if: You need production-grade code with comprehensive error handling and observability, require protection against zero-day prompt attacks, have strict latency SLAs under 100ms, operate outside AWS infrastructure, or need a vendor-supported solution with guaranteed updates.

For production deployments, consider this a proof-of-concept that demonstrates valuable architectural patterns but requires significant hardening before handling real user traffic.
