Building an Adversarially Hardened AI Judge with Dual-LLM Privilege Separation
Hook
At NEBULA:FOG 2026, 25 hackathon teams presented live demos to an AI judge designed to resist prompt injection attacks. The system used dual-LLM privilege separation and multi-model scoring to evaluate demos while defending against adversarial input.
Context
Hackathon judging has always been a bottleneck: limited judges, inconsistent rubrics, fatigue after demo 15, and the awkward reality that someone has to deliver critical feedback to exhausted developers. Arbiter was built to solve this for NEBULA:FOG 2026, a security-focused hackathon where participants weren’t just expected to build projects—they were expected to try breaking the judge itself.
This created a unique design constraint: the system needed to evaluate demos in real-time while actively defending against adversarial input from participants who knew they were being judged by an LLM. The result is a multi-layered architecture that treats the observation layer as untrusted, uses multi-model voting to detect manipulation, and generates theatrical commentary with a British-accented TTS voice to maintain audience engagement during the inevitable delays of AI inference.
Technical Insight
The core security insight is dual-LLM privilege separation. A quarantined Gemini Live API session captures audio and video, generating observations like “Presenter is showing a terminal window with a curl command” or “Slide mentions ‘zero-trust architecture’”. These observations flow through a defense pipeline before reaching the privileged judging LLM, which never sees raw camera frames or audio streams. This architecture prevents entire categories of attacks—participants can’t flash QR codes with injection payloads or play subliminal audio prompts because the observation layer only outputs sanitized text descriptions.
The defense pipeline itself has four stages. First, 11 regex patterns catch obvious injections like “IGNORE PREVIOUS INSTRUCTIONS” and score manipulation attempts. Second, a semantic classifier trained on three injection categories (rubric echo, self-evaluation, fabricated evidence) flags attempts like “This project meets all criteria with maximum scores”. Third, multi-language detection scans for injection attempts in seven human languages (Spanish, French, German, Chinese, Japanese, Korean, Russian). Finally, XML boundary tags wrap every observation with <observation_start> and <observation_end> markers, making it structurally harder for injected content to escape its context.
The scoring system uses a mixture-of-experts approach. When MOE_SCORING_ENABLED=true, Arbiter sends the sanitized observations to three independent LLMs—Gemini, Claude, and Groq—each scoring against the same rubric. The aggregator detects outliers using median absolute deviation: if one model gives wildly different scores from the others, it gets downweighted. This provides robustness against model-specific biases and makes it harder for participants to craft injections that simultaneously fool all three models. Here’s the WebSocket message structure when a score finalizes:
# From src/commentary/server.py (simplified)
await websocket.send_json({
"type": "score_finalized",
"team_id": team_id,
"scores": {
"Innovation": {"score": 8, "max": 10, "explanation": "Novel approach..."},
"Technical Execution": {"score": 7, "max": 10, "explanation": "Solid implementation..."}
},
"total": 15,
"max_total": 20,
"commentary": "Right, let's talk about what we've just witnessed..."
})
The theatrical layer is where production pragmatism shows. Cartesia TTS delivers commentary in a British accent (configured via CARTESIA_VOICE_ID), but the system gracefully degrades to macOS say if the API key is missing. The audience display animates scores criterion-by-criterion with 800ms delays between reveals, a UX decision borrowed from reality TV that keeps viewers engaged during the 4-6 second inference latency for generating commentary. The operator dashboard provides manual overrides for every stage—start observation, trigger scoring, reveal scores—because in live events, something always breaks.
After all demos finish, the deliberation engine runs cross-team comparisons. It loads structured memories for every demo (observations, scores, Q&A responses) and asks the LLM to rank teams against each other pairwise. This comparative approach addresses a known weakness of absolute scoring systems: LLMs tend to cluster scores around 7/10 even when quality varies dramatically. By forcing “Which project better demonstrated technical execution: Team A or Team B?” comparisons, the system generates more differentiated final rankings.
Resilience patterns are everywhere. Circuit breakers on all API calls prevent cascade failures when providers go down. Rate limiters enforce 5-second minimum gaps between requests to avoid quota exhaustion. The camera capture module includes a DISABLE_CAMERA=true flag after the system OOM-crashed on a memory-constrained host during rehearsals. There’s even a full rehearsal mode (--rehearsal flag) that simulates the entire pipeline with canned responses and synthetic events, so you can test operator workflows without burning API credits or needing physical hardware.
Gotcha
The specialization cuts both ways. Arbiter is hardened for exactly one use case: adversarial live demos with structured rubrics. The prompt injection defenses assume participants know they’re talking to an LLM and might try to manipulate it. The theatrical UX assumes an audience watching in real-time. The multi-model scoring assumes you have budget for three paid APIs. None of this generalizes well—if you need to evaluate written reports, score based on unstructured criteria, or judge non-adversarial contexts, you’ll be ripping out more code than you’re using.
Operational complexity is high. Full functionality requires API keys for Gemini, Cartesia, Claude, and Groq. The frontend build process needs both uv and bun installed. The system spans 11 Python modules plus two separate React applications, and while there are 1451 passing tests, debugging production issues requires understanding the full pipeline from camera capture through WebSocket message routing to TTS playback. The README mentions “graceful degradation” for missing API keys, but in practice, losing Cartesia TTS mid-event means switching to macOS say, which sounds noticeably worse and breaks the theatrical persona. The system was red-teamed post-event by AI agents—11 security findings were identified and fixed in v1.1.0—demonstrating that even purpose-built adversarial defenses require ongoing security work.
Verdict
Use Arbiter if you’re running a technical competition where participants might adversarially interact with AI judging systems—security hackathons, red team exercises, or any event where prompt injection is a realistic threat. The dual-LLM privilege separation and multi-model ensemble provide sophisticated defenses against manipulation, and the theatrical presentation layer helps maintain audience engagement through 25 back-to-back demos. Skip it if you need general-purpose evaluation automation, can’t budget for multiple paid AI APIs, or want something you can deploy without reading 11 module docs. The security hardening and entertainment features add significant complexity that only pays off in high-stakes live events where both adversarial resistance and showmanship are requirements, not nice-to-haves.