Inside Visa's Nine-Stage LLM Pipeline for Automated Vulnerability Discovery

Hook

Most AI security tools prompt an LLM to find bugs in your code. Visa's harness inverts this: it builds a threat model first, then aims LLM compute only at high-risk attack surfaces—because the bottleneck isn't finding vulnerabilities, it's triaging them.

Context

Traditional static analysis tools like Semgrep and CodeQL are deterministic but noisy. They flag thousands of potential issues across large codebases, forcing security teams to manually triage findings that rarely align with actual business risk. Human security audits focus on what matters—authentication flows, payment processing, admin panels—but don't scale beyond a few high-value repos per quarter.

Large language models promised a middle ground: the contextual reasoning of a human auditor with the scalability of automated scanning. Early experiments pointed GPT-4 or Claude at codebases and asked 'find vulnerabilities,' but produced overwhelming false positives and missed critical issues because models lacked organizational context. They'd flag theoretical SQL injection in a deprecated script while ignoring broken access control in the billing API.

Visa's Vulnerability Agentic Harness (VVAH) attempts to solve this by structuring LLM-based vulnerability discovery as a nine-stage pipeline that mirrors how senior security engineers actually work: understand the system's purpose, identify what's worth attacking, then systematically analyze those surfaces with specialized lenses. It's not trying to replace Semgrep; it's trying to replicate the judgment calls a staff security engineer makes before even opening a code editor.

Technical Insight

[Figure: System architecture (auto-generated). Pipeline nodes: Repository Input, S1-S3: Context Building, S4: Crypto Analysis, S5: Logic Bug Analysis, S6: Adversarial Verification, S7: Deduplication, S8: Chain Discovery, S9: SARIF Export. Artifacts passed between stages: Attack Surface Map, Threat Model JSON, Findings JSON, Verified Findings, Unique Findings, Chained Exploits. Supporting components: YAML Config (Skills + Roles), Backend (Anthropic SDK, Claude CLI + OAuth, OpenAI-Compatible) for Backend Execution, Model Selection, and Vote-based FP Reduction, and Run Manifest (Config Hash + Git SHA).]

VVAH's architecture decouples 'skills' (reusable prompt templates for specific analysis tasks) from 'roles' (model configurations and backends). A YAML config file maps stages to skills and roles, allowing you to run Stage 2 threat modeling with Claude Opus via SDK, Stage 4 cryptographic analysis with GPT-4 via OpenAI, and Stage 6 adversarial verification with a local Llama model through an OpenAI-compatible endpoint—all in the same scan.

The nine stages split into three phases. Stages 1-3 build context: S1 ingests repository metadata (GitLab CI files, GitHub Security insights, optionally CMDB records mapping repos to business functions), S2 generates a threat model identifying high-value attack surfaces, and S3 maps components to risk levels. Stages 4-6 execute specialized analysis: S4 hunts cryptographic vulnerabilities (weak RNG, ECB mode, hardcoded keys), S5 targets logic bugs (TOCTOU races, state machine violations), and S6 runs adversarial verification where a separate model challenges findings from S4-S5. Stages 7-9 post-process: S7 deduplicates findings across stages, S8 attempts to chain low-severity issues into exploitable sequences, and S9 exports SARIF for integration with GitHub Advanced Security or SIEM platforms.
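To make the S9 step concrete, here is a minimal sketch of turning findings into a SARIF 2.1.0 log. It is not VVAH's exporter: the `to_sarif` helper, the finding keys (`title`, `severity`, `cwe_id`, `file`, `line`), the severity-to-level mapping, and the sample finding are all assumptions made for illustration.

```python
import json

# Assumed mapping from a finding's severity to SARIF result levels.
SEVERITY_TO_LEVEL = {"critical": "error", "high": "error", "medium": "warning", "low": "note"}

def to_sarif(findings, tool_name="vvah-sketch"):
    """Build a minimal SARIF 2.1.0 log from a list of finding dicts."""
    results = []
    for f in findings:
        results.append({
            "ruleId": f["cwe_id"],
            "level": SEVERITY_TO_LEVEL.get(f["severity"], "warning"),
            "message": {"text": f["title"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["file"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }

# Hypothetical finding, invented for this example.
sarif_log = to_sarif([{
    "title": "AES used in ECB mode",
    "severity": "high",
    "cwe_id": "CWE-327",
    "file": "src/crypto/cipher.py",
    "line": 42,
}])
print(json.dumps(sarif_log, indent=2))
```

Because SARIF is plain JSON, a downstream consumer only needs the `runs[].results[]` array to list findings, which is what makes the format convenient for code-scanning dashboards.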

Here's how a skill definition looks for Stage 4 crypto analysis:

skills:
  crypto_analysis:
    name: "Cryptographic Vulnerability Research"
    prompt_template: |
      You are analyzing {component_name} which the threat model identified as handling {data_classification} data.
      
      Focus exclusively on cryptographic issues:
      - Weak random number generation (Math.random, time-based seeds)
      - ECB mode or other non-authenticated encryption
      - Hardcoded keys or passwords in source
      - Custom crypto implementations
      - Certificate validation bypass
      
      Repository context:
      {repo_metadata}
      
      Threat model excerpt:
      {threat_model_surface}
      
      Code to analyze:
      {component_code}
      
      Return findings as JSON array with: title, severity, cwe_id, evidence_snippet, exploitation_steps.
    max_tokens: 8000
    temperature: 0.3

The orchestrator loads this skill, hydrates the template with artifacts from S1-S3 (repository metadata, threat model excerpts, component code), and sends it to whichever role you've assigned to Stage 4. If you've configured voting for false positive reduction, it runs this same prompt three times with temperature > 0 and only promotes findings that appear in at least two of three runs.
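The hydrate-then-vote flow described above can be sketched in a few lines. This is a hedged illustration, not the orchestrator's code: `hydrate`, `finding_key`, and `vote` are hypothetical names, and how VVAH decides that two runs reported "the same" finding is not documented here, so the sketch assumes a match on CWE ID plus normalized title.

```python
from collections import Counter

def hydrate(template: str, artifacts: dict) -> str:
    """Fill {placeholders} in a skill template with artifacts from earlier stages."""
    # str.format works here because the template contains no literal braces.
    return template.format(**artifacts)

def finding_key(finding: dict) -> tuple:
    # Assumed identity for "the same finding" across runs.
    return (finding["cwe_id"], finding["title"].strip().lower())

def vote(runs: list[list[dict]], min_votes: int = 2) -> list[dict]:
    """Keep findings that appear in at least min_votes of the runs."""
    counts = Counter()
    first_seen = {}
    for run in runs:
        # Count each finding at most once per run.
        for key in {finding_key(f) for f in run}:
            counts[key] += 1
        for f in run:
            first_seen.setdefault(finding_key(f), f)
    return [f for key, f in first_seen.items() if counts[key] >= min_votes]

template = "You are analyzing {component_name} which handles {data_classification} data."
prompt = hydrate(template, {"component_name": "billing-api", "data_classification": "PCI"})

# Three hypothetical runs of the same prompt.
ecb = {"cwe_id": "CWE-327", "title": "AES in ECB mode"}
rng = {"cwe_id": "CWE-338", "title": "Math.random used for token"}
promoted = vote([[ecb, rng], [ecb], [ecb]])
print(prompt)
print(promoted)
```

In this example the ECB finding appears in all three runs and is promoted, while the RNG finding appears once and is dropped. The matching key is the fragile part: a model that rewords a title between runs would defeat an exact-match key like this one.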

The multi-backend design is where VVAH gets interesting. It supports three execution paths:

  1. Anthropic SDK: Directly calls Claude via API key. Supports temperature control, so voting works. Standard rate limits apply.
  2. Claude CLI: Uses the claude command-line tool that ships with VSCode's Anthropic extension. This authenticates via OAuth to your Claude account and gets access to the extended context window (200K tokens). No temperature control—single-pass only.
  3. OpenAI-compatible endpoints: Any API matching OpenAI's spec (OpenAI itself, Azure OpenAI, local llama.cpp servers, Ollama). Temperature works if the backend supports it.

Your config might look like:

roles:
  threat_modeler:
    backend: anthropic_sdk
    model: claude-opus-4-20250514
    temperature: 0.1
  
  crypto_researcher:
    backend: openai_compatible
    base_url: http://localhost:8080/v1
    model: deepseek-coder-33b-instruct
    temperature: 0.3
  
  adversarial_verifier:
    backend: claude_cli
    # No temperature control—CLI doesn't expose it

stages:
  stage_2_threat_model:
    skill: threat_modeling
    role: threat_modeler
  
  stage_4_crypto:
    skill: crypto_analysis
    role: crypto_researcher
    enable_voting: true  # Runs 3x, requires temperature support
  
  stage_6_verification:
    skill: adversarial_review
    role: adversarial_verifier

This setup uses paid Claude Opus for threat modeling (where quality matters most), a local DeepSeek model for crypto analysis (cheaper, runs voting to compensate for lower quality), and Claude CLI for verification (free via your existing subscription, but no voting because temperature isn't exposed).
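Because voting depends on temperature support, a config like this is worth checking before a scan starts. The sketch below is an assumption about how such a check could look, not something VVAH is known to ship; `check_voting` and `TEMPERATURE_CAPABLE` are hypothetical names, and the config is a plain-dict mirror of the YAML above rather than a parsed file.

```python
# Backends that expose temperature, per the three execution paths listed earlier.
TEMPERATURE_CAPABLE = {"anthropic_sdk", "openai_compatible"}

def check_voting(config: dict) -> list[str]:
    """Return one message per voting stage whose role cannot vary its samples."""
    problems = []
    for stage_name, stage in config["stages"].items():
        if not stage.get("enable_voting"):
            continue
        role = config["roles"][stage["role"]]
        if role["backend"] not in TEMPERATURE_CAPABLE:
            problems.append(f"{stage_name}: backend {role['backend']} has no temperature control")
        elif role.get("temperature", 0) <= 0:
            problems.append(f"{stage_name}: voting needs temperature > 0")
    return problems

# Only the fields this check reads are mirrored here.
config = {
    "roles": {
        "crypto_researcher": {"backend": "openai_compatible", "temperature": 0.3},
        "adversarial_verifier": {"backend": "claude_cli"},
    },
    "stages": {
        "stage_4_crypto": {"role": "crypto_researcher", "enable_voting": True},
        "stage_6_verification": {"role": "adversarial_verifier"},
    },
}
ok = check_voting(config)

# Turning voting on for the CLI-backed stage should be flagged.
config["stages"]["stage_6_verification"]["enable_voting"] = True
flagged = check_voting(config)
print(ok, flagged)
```

The config as written passes, and enabling voting on the CLI-backed verification stage produces one warning.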

Token budget enforcement is per-stage and per-finding rather than global. If Stage 4 exceeds max_tokens analyzing a single component, it truncates context (code snippets first, then threat model excerpts) and logs a warning. There's no mechanism to abort a scan mid-flight if aggregate costs exceed a threshold: you can estimate costs beforehand with vvah estimate, but the estimate is advisory only.
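The truncation order (code snippets first, then threat model excerpts) can be illustrated with a small sketch. Everything here is an assumption for illustration: `fit_to_budget` is a hypothetical helper, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the part names simply reuse the placeholders from the skill template.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return len(text) // 4

def fit_to_budget(parts: dict, budget: int,
                  order=("component_code", "threat_model_surface")):
    """Trim parts in the given order until the estimated total fits the budget."""
    trimmed = dict(parts)
    warnings = []
    for name in order:
        total = sum(estimate_tokens(v) for v in trimmed.values())
        excess = total - budget
        if excess <= 0:
            break
        # Drop just enough characters from this part, or all of it if needed.
        keep_chars = max(0, len(trimmed[name]) - excess * 4)
        trimmed[name] = trimmed[name][:keep_chars]
        warnings.append(f"truncated {name} to {keep_chars} chars")
    return trimmed, warnings

parts = {
    "repo_metadata": "m" * 40,          # ~10 tokens
    "threat_model_surface": "t" * 80,   # ~20 tokens
    "component_code": "c" * 400,        # ~100 tokens
}
trimmed, warnings = fit_to_budget(parts, budget=60)
print(warnings)
```

With a budget of 60 estimated tokens, only the code is trimmed (to 120 characters) and the threat model excerpt survives intact. A tighter budget would exhaust the code first and then start cutting the excerpt.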

Each run produces a manifest JSON capturing model roles, config file hash, and Git commit SHA. This enables reproducibility tracking: if a finding appears in run A but not run B, you can verify that both runs used the same models and config. The manifest does not log the actual prompt text sent to models, so if you've edited a skill's prompt template between runs, it won't surface that edit as the cause of the divergence.
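A manifest of this shape is straightforward to sketch. The key names, the `build_manifest` helper, and the sample values below are assumptions, not VVAH's actual schema; the point is only to show what is and is not an input to the comparison.

```python
import hashlib

def build_manifest(config_bytes: bytes, git_sha: str, roles: dict) -> dict:
    """Record the three things the manifest is described as capturing."""
    return {
        "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
        "git_sha": git_sha,
        "roles": roles,
    }

# Hypothetical inputs for two runs against the same commit.
config = b"roles:\n  threat_modeler:\n    backend: anthropic_sdk\n"
roles = {"threat_modeler": "claude-opus-4-20250514"}

run_a = build_manifest(config, "0123abc", roles)
run_b = build_manifest(config, "0123abc", roles)
run_c = build_manifest(config + b"    temperature: 0.1\n", "0123abc", roles)

print(run_a == run_b)  # same config, commit, and roles
print(run_a == run_c)  # config bytes changed, so the hash differs
```

Anything that is not hashed or recorded, such as prompt text that lives outside the hashed config, cannot show up as a difference between two manifests.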

The adversarial verification stage (S6) is conceptually clever but mechanically unfinished. A separate model receives findings from S4-S5 plus the original code and is prompted to argue why each finding is a false positive. The output is logged, but there's no tiebreaker logic. If the researcher model says 'SQL injection in login handler' and the verifier model says 'parameterized query is safe,' both opinions land in the SARIF output as separate findings—you still triage manually.
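Since the tool leaves the claim and its rebuttal as separate entries, a team might pair them up before triage. The sketch below is a hypothetical post-processing step, not part of VVAH; the entry fields (`source`, `file`, `line`, `text`) and the `pair_opinions` helper are assumptions, and it matches on exact file and line only.

```python
from collections import defaultdict

def pair_opinions(entries: list[dict]) -> list[dict]:
    """Group researcher claims and verifier rebuttals that point at the same location."""
    by_location = defaultdict(lambda: {"claims": [], "rebuttals": []})
    for e in entries:
        bucket = "rebuttals" if e["source"] == "verifier" else "claims"
        by_location[(e["file"], e["line"])][bucket].append(e["text"])
    return [
        {
            "file": file,
            "line": line,
            **opinions,
            # Contested means both sides said something about this location.
            "contested": bool(opinions["claims"] and opinions["rebuttals"]),
        }
        for (file, line), opinions in by_location.items()
    ]

# Hypothetical entries modeled on the example in the text.
entries = [
    {"source": "researcher", "file": "auth/login.py", "line": 88,
     "text": "SQL injection in login handler"},
    {"source": "verifier", "file": "auth/login.py", "line": 88,
     "text": "parameterized query is safe"},
    {"source": "researcher", "file": "crypto/keys.py", "line": 12,
     "text": "Hardcoded key in source"},
]
paired = pair_opinions(entries)
for item in paired:
    print(item)
```

This does not resolve the disagreement. It only puts both opinions in one record so a human reviewer sees them together and can sort contested findings separately from unchallenged ones.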

Gotcha

VVAH requires organizational context to function as designed. Stage 1 ingests CMDB mappings (which repos handle PII, PCI data, admin functions) and control inventories (which systems enforce MFA, which deploy to prod without review). The README's truncated snippet shows CSV/JSON schemas for these feeds, but if your org doesn't maintain a CMDB or you can't export it in VVAH's expected format, you're building parsers before you analyze a single repo. For teams without this infrastructure—most startups, many mid-size companies—you're either stubbing out S1 with manual annotations or skipping the threat-modeling benefit entirely.

The tool explicitly warns it runs with elevated privileges and must only analyze trusted repositories. There's no sandboxing, no capability dropping, no filesystem isolation. If you point VVAH at a malicious repo with a poisoned dependency that exfiltrates data during import, you're compromised. This isn't negligence—the team is honest that they built this for internal use where Visa controls the repository sources—but it means you can't safely add this to a public GitHub App or run it on community contributions. You need dedicated, hardened analysis infrastructure, which most teams lack.

Reproducibility stops at 'same models and config were used.' Two scans of identical code with identical settings can produce different findings because LLM sampling is non-deterministic. The voting mechanism reduces variance for backends that support it, but the CLI backend (which is the default profile, since it requires no API keys) runs single-pass. A finding might appear in Monday's scan and vanish from Tuesday's scan of the same commit. The manifest doesn't log enough (no prompt versions, no sampled reasoning chains, no token-level generation parameters) to debug why. For compliance regimes requiring auditable, deterministic results, this is disqualifying.

Finally, there are no published precision or recall metrics. The README frames findings as 'triage candidates' rather than confirmed vulnerabilities, which is honest, but without ground truth comparisons to manual audits or commercial SAST tools, you can't calibrate trust. Is a 10% true positive rate acceptable? 50%? You won't know until you've burned hours validating findings. The team open-sourced their internal tool, not a productized offering with SLAs.
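If you do validate findings by hand, calibration itself is simple arithmetic. The sketch below computes precision and recall against a ground-truth set. The finding IDs and counts are invented purely to show the calculation and are not measurements of VVAH.

```python
def precision_recall(reported: set, confirmed: set) -> tuple[float, float]:
    """Precision and recall of reported findings against a ground-truth set."""
    true_positives = len(reported & confirmed)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall

# Made-up example: 10 reported findings, 4 real issues from a manual audit,
# 2 of which the scan caught.
reported = {f"F{i}" for i in range(1, 11)}
confirmed = {"F1", "F2", "M1", "M2"}
precision, recall = precision_recall(reported, confirmed)
print(precision, recall)
```

In this invented case precision is 0.2 and recall is 0.5: eight of ten reported findings cost triage time for nothing, and half of the real issues were missed. Both numbers require a ground-truth audit, which is the comparison the project has not published.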

Verdict

Use if: You're a security engineering team at an enterprise with existing CMDB infrastructure, you already budget for LLM API costs in five figures annually, and your bottleneck is triaging findings across dozens of repositories rather than analyzing a handful deeply. The threat-modeling-first workflow and SARIF output make sense if you're feeding results into Jira, ServiceNow, or Splunk SOAR. The multi-backend flexibility is real—you can mix proprietary and open-source models, run local inference for sensitive code, and avoid vendor lock-in. Treat VVAH as a reference architecture: even if you don't deploy it verbatim, the skills-and-roles abstraction and nine-stage pipeline are worth studying if you're building agentic security tooling.

Skip if: You lack organizational context feeds, can't run code analysis tools with elevated privileges in isolated environments, or need deterministic results for compliance. Most teams will get more value from Semgrep Pro (deterministic, established false positive rates, scales to monorepos) or GitHub Copilot Autofix (LLM assistance on proven CodeQL findings) than wrestling with VVAH's non-determinism and setup complexity. If you're a solo developer or small team, the nine-stage orchestration is over-engineered—just prompt Claude or GPT-4 directly in a CI script. This is a research artifact from a sophisticated security org open-sourcing their internal tooling, not a drop-in SaaS. You're committing to ongoing maintenance, prompt engineering, and model evaluation without published benchmarks to guide you.