
Yolonda: Teaching AI Coding Agents to Ask Permission Before They Break Production

Hook

Your AI coding assistant just attempted to recursively delete your home directory while 'refactoring build scripts.' When autonomous agents have filesystem access and API keys, milliseconds matter for security decisions.

Context

AI coding agents like Claude Code, GitHub Copilot Workspace, and Aider have crossed a threshold: they're no longer autocomplete tools but autonomous executors with shell access, file system permissions, and the ability to chain dozens of actions without human intervention. They can spawn subprocesses, modify config files, and commit code, all while you grab coffee. Yet the security model for these tools remains largely binary: either trust them completely or don't use them at all.

Traditional application security doesn't map well to AI agents. Static sandboxes break legitimate workflows (agents need network access for package installs). Rate limiting is too coarse (100 file edits might be normal refactoring or a supply chain attack). Log review is too late (the damage happened 30 seconds ago). What's missing is a middle layer that understands intent—comparing what the user asked for against what the agent is actually doing, with sub-millisecond latency for obvious decisions and human escalation for suspicious drift. Yolonda is a hackathon-born prototype exploring this three-tier security model: instant policy enforcement for clear cases, LLM-powered analysis for ambiguity, and human review for high-risk actions.

Technical Insight

Yolonda's architecture splits security decisions across three evaluation tiers, each optimized for different latency-risk profiles. The first tier is a Go CLI that hooks into Claude Code's tool execution layer through MCP (Model Context Protocol), intercepting every filesystem operation, shell command, and API call before execution. Each intercepted action hits an embedded OPA (Open Policy Agent) engine running compiled Rego policies:

# Example OPA policy auto-generated from user approval
package yolonda.filesystem

allow {
    input.action == "write_file"
    startswith(input.path, "/workspace/src")
    not contains(input.path, "..")
    input.size_bytes < 1048576  # 1MB limit
}

deny {
    input.action == "delete_file"
    regex.match("\\.(env|pem|key)$", input.path)
}

This fast path runs in under 500 microseconds and settles roughly 80% of actions with an allow or deny verdict before the user can perceive any delay. The policies are versioned, auditable, and, critically, generated automatically from human override decisions, which creates a feedback loop from the slow evaluation path to the fast one.
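
To make the three possible outcomes of the fast path concrete (allow, deny, or no verdict), here is a minimal Python sketch that mirrors the two Rego rules above. It is an illustration only: the real fast path runs inside OPA in the Go CLI, and the function name `fast_path_verdict` and the choice to check deny rules before allow rules are assumptions, not details taken from Yolonda.

```python
import re
from typing import Optional

# Mirrors the Rego deny rule's pattern for credential-like files.
SENSITIVE_SUFFIX = re.compile(r"\.(env|pem|key)$")
MAX_WRITE_BYTES = 1_048_576  # same 1MB limit as the Rego allow rule


def fast_path_verdict(action: dict) -> Optional[str]:
    """Return "deny", "allow", or None when no rule matches (hypothetical helper)."""
    kind = action.get("action")
    path = action.get("path", "")

    # Deny rule: never delete credential-like files.
    # Checking deny first is an assumption about rule precedence.
    if kind == "delete_file" and SENSITIVE_SUFFIX.search(path):
        return "deny"

    # Allow rule: small writes inside the workspace source tree.
    if (
        kind == "write_file"
        and path.startswith("/workspace/src")
        and ".." not in path
        and action.get("size_bytes", MAX_WRITE_BYTES) < MAX_WRITE_BYTES
    ):
        return "allow"

    # No matching rule: the caller escalates to the slower analysis path.
    return None
```

A `None` result is the interesting case: it is neither approval nor rejection, only a signal that the cheap rules have nothing to say about this action.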

When OPA returns no definitive verdict (no matching allow or deny rule), the CLI forwards the action to a Python FastAPI backend that performs intent drift analysis. The system extracts structured goals from the original user prompt using Claude's structured output, then scores the current tool call against 13 behavioral signals. These include out-of-scope file access (editing files unrelated to the stated goal), command risk patterns (regex matching for dd, rm -rf, curl | bash), velocity anomalies (a sudden spike in filesystem writes), credential proximity (actions near .env files), and time-of-day deviations. Here's the drift scoring logic:

# Simplified from yolonda's drift detection engine
class IntentDriftAnalyzer:
    def score_action(self, user_goal: Goal, action: ToolCall, context: SessionContext) -> DriftScore:
        signals = []
        
        # Signal 1: Scope drift - file path relevance
        relevant_paths = self.extract_paths_from_goal(user_goal)
        if action.path not in relevant_paths:
            path_distance = self.semantic_distance(action.path, relevant_paths)
            signals.append(Signal("scope_drift", severity=path_distance))
        
        # Signal 2: Command risk pattern matching
        if action.type == "execute_command":
            risk_score = self.risk_classifier.score(action.command)
            signals.append(Signal("command_risk", severity=risk_score))
        
        # Signal 3: Velocity anomaly detection
        recent_actions = context.get_actions_in_window(seconds=30)
        if len(recent_actions) > context.baseline_velocity * 3:
            signals.append(Signal("velocity_spike", severity=0.8))
        
        # Aggregate signals into one of four risk tiers (distinct from the three evaluation tiers)
        aggregate_score = self.weighted_sum(signals)
        return DriftScore(
            value=aggregate_score,
            tier=self.score_to_tier(aggregate_score),  # Green/Yellow/Orange/Red
            contributing_signals=signals
        )
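
The excerpt does not show what `risk_classifier` does internally. As a rough sketch of regex-based command risk scoring for the patterns the article names (dd, rm -rf, curl | bash), something like the following would work. The pattern table, the severity values, and the `score_command` name are all assumptions made for illustration.

```python
import re

# Hypothetical pattern table; severities are illustrative, not Yolonda's.
RISK_PATTERNS = [
    # rm with both -r and -f in either order (e.g. -rf, -fr)
    (re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"), 0.9),
    # dd writing directly to a device node
    (re.compile(r"\bdd\s+.*\bof=/dev/"), 0.9),
    # downloading a script and piping it straight into a shell
    (re.compile(r"\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba)?sh\b"), 0.8),
]


def score_command(command: str) -> float:
    """Return the highest severity among matching patterns, or 0.0 if none match."""
    return max(
        (severity for pattern, severity in RISK_PATTERNS if pattern.search(command)),
        default=0.0,
    )
```

Pattern matching like this is cheap but easy to evade (aliases, encoded payloads, split commands), which is one reason it works better as one signal among several than as a standalone gate.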

Actions scoring Green or Yellow get auto-approved with logging. Orange-tier actions trigger a Slack notification with one-click approve/deny buttons, including the drift analysis breakdown and session context. Red-tier actions are blocked by default and require explicit override with a justification that feeds into the policy-generation system.
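
The routing described above can be sketched as a threshold lookup followed by a dispatch table. The numeric cut-offs below are placeholders, since the article does not state the real ones, and the route names are hypothetical labels for the three behaviors (auto-approve with logging, Slack review, block).

```python
# Assumed cut-offs for illustration; the source does not state the real values.
TIER_THRESHOLDS = [(0.25, "green"), (0.5, "yellow"), (0.75, "orange")]

# Hypothetical route names for the behaviors described in the text.
ROUTES = {
    "green": "auto_approve",
    "yellow": "auto_approve",
    "orange": "slack_review",
    "red": "block",
}


def score_to_tier(score: float) -> str:
    """Map an aggregate drift score in [0, 1] to a risk tier."""
    for upper_bound, tier in TIER_THRESHOLDS:
        if score < upper_bound:
            return tier
    return "red"  # anything at or above the last cut-off


def route(score: float) -> str:
    """Return the handling route for a given aggregate drift score."""
    return ROUTES[score_to_tier(score)]
```

Keeping the thresholds in a single table makes them easy to tune, which matters here because cut-offs that are too low are a direct source of the alert fatigue discussed later.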

The third tier, human review, also feeds the self-improving policy loop. When a human overrides a decision (approves a blocked action or denies an allowed one), the backend clusters similar overrides using vector embeddings of the action context. Once a cluster reaches the confidence threshold (typically 3-5 similar overrides), the backend auto-generates an OPA policy:

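# Backend policy synthesis from human decisions
# Override, RegoPolicy, extract_commonalities, and CONFIDENCE_THRESHOLD are
# assumed to be defined elsewhere in the backend.
from typing import List

def generate_policy_from_overrides(overrides: List[Override]) -> RegoPolicy: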
    # Extract common patterns from clustered overrides
    common_attributes = extract_commonalities(overrides)
    
    policy_template = '''
package yolonda.learned

{allow_or_deny} {{
    input.action == "{action_type}"
    {conditions}
}}
'''
    
    conditions = []
    if common_attributes.path_pattern:
        # Rego raw string (backticks) so regex backslashes need no extra escaping
        conditions.append(f'regex.match(`{common_attributes.path_pattern}`, input.path)')
    if common_attributes.size_range:
        conditions.append(f'input.size_bytes < {common_attributes.size_range[1]}')
    
    return RegoPolicy(
        code=policy_template.format(
            allow_or_deny="allow" if overrides[0].decision == "approve" else "deny",
            action_type=common_attributes.action_type,
            conditions="\n    ".join(conditions)
        ),
        confidence=len(overrides) / CONFIDENCE_THRESHOLD
    )

These synthesized policies deploy to the CLI's local OPA engine, migrating learned behaviors from the slow LLM-based path (100-300ms latency) to the instant fast path. Over time, the ratio of fast-path to slow-path decisions improves as the system learns your team's patterns.
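
The clustering step that precedes policy synthesis is not shown in the article. A minimal sketch, assuming each override has already been turned into an embedding vector, is a greedy single-pass grouping by cosine similarity. The function names, the 0.9 similarity cut-off, and the greedy strategy itself are assumptions; a real system might use a proper clustering library instead.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_overrides(
    embeddings: List[Sequence[float]], min_similarity: float = 0.9
) -> List[List[int]]:
    """Greedy clustering: each override joins the first cluster whose seed is similar enough."""
    clusters: List[List[int]] = []
    for idx, vec in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine(vec, seed) >= min_similarity:
                cluster.append(idx)
                break
        else:
            # No existing cluster matched, so this override seeds a new one.
            clusters.append([idx])
    return clusters


def ready_for_synthesis(clusters: List[List[int]], threshold: int = 3) -> List[List[int]]:
    """Keep only clusters with enough overrides to justify generating a policy."""
    return [c for c in clusters if len(c) >= threshold]
```

Each returned cluster is a list of indices into the original override list, so the members of a qualifying cluster can be passed straight to a synthesis function such as `generate_policy_from_overrides`.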

The architecture's offline-first design deserves special attention. The Go CLI bundles a policy snapshot and can operate indefinitely without the backend, which is critical for developers on flaky connections or in security-paranoid environments that block external API calls. When the backend is unreachable, the CLI falls back to a local Ollama instance running Llama 3.1, performing intent analysis entirely on-device with degraded accuracy but maintained functionality.

The Next.js dashboard provides a SOC-style view of all verdicts, with drill-down timelines showing the sequence of actions per coding session, drift score trends, and policy effectiveness metrics (false positive rate, human override frequency).
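
The offline fallback chain (remote backend first, then a local model) can be sketched as an ordered list of analyzers tried in turn. This is written in Python for consistency with the other examples even though the real CLI is Go, and every name here is hypothetical. The two stand-in analyzers simulate an unreachable backend and a local model; they do not call any real service.

```python
from typing import Callable, Sequence

# An analyzer takes an action and returns a verdict string (hypothetical interface).
Analyzer = Callable[[dict], str]


def analyze_with_fallback(
    action: dict, analyzers: Sequence[Analyzer], default: str = "escalate"
) -> str:
    """Try each analyzer in order, skipping any that cannot be reached."""
    for analyzer in analyzers:
        try:
            return analyzer(action)
        except (ConnectionError, TimeoutError):
            continue  # this analyzer is unavailable, try the next one
    # Nothing answered: fail safe by asking for human review.
    return default


def unreachable_backend(action: dict) -> str:
    """Stand-in for the remote backend when the network is down."""
    raise ConnectionError("backend unreachable")


def local_model(action: dict) -> str:
    """Stand-in for an on-device model; always returns a cautious verdict."""
    return "yellow"
```

Defaulting to escalation rather than approval when every analyzer fails is a design choice in this sketch: an outage should never silently widen what the agent is allowed to do.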

Gotcha

Yolonda wears its hackathon origins openly—the README explicitly warns it's not production-hardened. There's no security audit, the FastAPI backend has no authentication layer, and the Slack integration stores webhook URLs in plaintext config files. The policy synthesis system can theoretically be poisoned by an attacker who generates strategic overrides to gradually widen permission boundaries (imagine approving progressively larger file deletions until rm -rf passes through). The LLM-based drift detection inherits prompt injection vulnerabilities: a carefully crafted commit message or filename could potentially manipulate the intent analysis to misclassify malicious actions as benign.

The Claude Code-only support is a significant constraint. The MCP hooking mechanism is specific to Anthropic's architecture, and extending to GitHub Copilot or Cursor would require rewriting the interception layer for each tool's plugin API. There's no abstraction layer for multi-agent support, so teams using multiple AI coding tools would need separate Yolonda deployments that cannot share learned policies. The intent drift detection also struggles with legitimate exploratory workflows: developers investigating an unfamiliar codebase trigger constant Yellow/Orange alerts as they hop between unrelated files, creating alert fatigue that trains users to blindly approve Slack notifications.

Verdict

Use if: You're researching AI agent security patterns, need a prototype for exploring guardrail architectures at a hackathon, or want a concrete example of intent drift detection to inform building your own system. The three-tier evaluation strategy and self-improving policy loop are genuinely novel ideas worth studying, and the project suits education or research contexts where a 'working proof of concept' is the goal, not production reliability.

Skip if: You need production-grade security for AI agents in actual development workflows, require support for AI coding tools beyond Claude Code, or expect the LLM-based security layer to be adversarially robust. Also skip if you can't tolerate the operational overhead of running three separate services (CLI, FastAPI backend, Next.js dashboard) plus a Slack workspace integration just to add guardrails to your coding assistant. For production needs, wait for mature commercial offerings or treat this as a reference architecture to build upon with proper hardening, multi-agent abstractions, and security audits.
