Vibe Security Radar: Tracking CVEs Caused by AI-Generated Code

Hook

As AI coding assistants write increasing amounts of production code, a team at Georgia Tech asked a question no one had systematically answered: how many real CVEs were introduced by AI-generated code?

Context

The explosion of AI coding tools has fundamentally changed how developers write software. But unlike traditionally reviewed code, where decisions can be traced back to a human author, AI-assisted code often enters repositories with minimal scrutiny. When vulnerabilities emerge later, we’ve had no systematic way to determine whether AI tools contributed to the problem.

Vibe Security Radar, a research project from Georgia Tech’s Systems Software & Security Lab (School of Cybersecurity and Privacy), addresses this gap. Instead of speculating about AI safety risks, the team built an automated pipeline that analyzes CVEs from multiple databases, traces each vulnerability back through git history to the introducing commit, detects AI tool signatures in commit metadata, and uses LLM-based verification to establish causality. The result is a public research effort tracking which CVEs can be attributed to AI-generated code, providing empirical ground truth for discussions about AI coding tool safety.

Technical Insight

The architecture is a multi-stage pipeline that solves several challenging problems in automated vulnerability attribution. First, the system aggregates CVE data from multiple sources: OSV (Open Source Vulnerabilities), GitHub Advisory Database (GHSA), and NVD. For each CVE, it attempts to identify the fix commit from structured metadata. When that fails, it falls back to an LLM-assisted git log search—a pragmatic acknowledgment that CVE databases are inconsistently structured.
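To make the "structured metadata" step concrete, here is a minimal sketch of extracting fix commits from an OSV-format record. OSV records list fix commits in two places: `references` entries with type "FIX", and "fixed" events inside GIT version ranges. The function name and the abbreviated record are illustrative, not taken from the project's code:

```python
def extract_fix_commits(osv_record: dict) -> list[str]:
    """Collect commit URLs from an OSV record's FIX references
    and from 'fixed' events in GIT-type version ranges."""
    commits = []
    for ref in osv_record.get("references", []):
        if ref.get("type") == "FIX" and "/commit/" in ref.get("url", ""):
            commits.append(ref["url"])
    for affected in osv_record.get("affected", []):
        for rng in affected.get("ranges", []):
            if rng.get("type") != "GIT":
                continue
            repo = rng.get("repo", "")
            for event in rng.get("events", []):
                if "fixed" in event:
                    commits.append(f"{repo}/commit/{event['fixed']}")
    return commits

# Abbreviated OSV-style record for illustration.
record = {
    "id": "CVE-2024-0000",
    "references": [
        {"type": "FIX", "url": "https://github.com/example/proj/commit/deadbeef"},
        {"type": "REPORT", "url": "https://example.com/issue/1"},
    ],
    "affected": [
        {"ranges": [{"type": "GIT",
                     "repo": "https://github.com/example/proj",
                     "events": [{"introduced": "0"}, {"fixed": "cafef00d"}]}]}
    ],
}
print(extract_fix_commits(record))  # both commit URLs, FIX reference first
```

When neither field is populated, there is nothing to extract, which is exactly the situation where the project's LLM-assisted git log search takes over.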

The second stage is where things get technically interesting. Once you have a fix commit, you need to trace back to the code that introduced the vulnerability. Vibe uses SZZ-style git blame analysis, a technique from mining software repositories research that identifies “inducing changes” by blaming lines touched in a fix. But there’s a catch: squash merges destroy the commit history you need for attribution. The researchers solved this with “squash-merge decomposition”—using the GitHub API to reconstruct the original PR commits even when they’ve been squashed into a single merge commit. This is clever git forensics that works around a common barrier in repository analysis.
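The core SZZ step is mechanical: blame the pre-fix revision, then treat the commits that last touched the lines the fix deleted as candidate "inducing changes." A small illustrative sketch (these helpers are mine, not the project's) parsing `git blame --porcelain` output, where each line group starts with a header of the form `<sha> <orig-line> <final-line> [<num-lines>]`:

```python
import re

def blame_to_commits(porcelain: str) -> dict[int, str]:
    """Map final line numbers to the commit that last touched them,
    parsed from `git blame --porcelain` output."""
    header = re.compile(r"^([0-9a-f]{40}) (\d+) (\d+)(?: (\d+))?$")
    mapping = {}
    for line in porcelain.splitlines():
        m = header.match(line)
        if m:
            mapping[int(m.group(3))] = m.group(1)
    return mapping

def suspect_commits(porcelain: str, deleted_lines: set[int]) -> set[str]:
    """SZZ step: commits blamed for lines the fix deleted are
    candidate vulnerability-introducing commits."""
    blame = blame_to_commits(porcelain)
    return {sha for ln, sha in blame.items() if ln in deleted_lines}

# Two-line blame output: line 1 by commit aaa..., line 2 by commit bbb...
porcelain = (
    "a" * 40 + " 1 1 1\nauthor Alice\n\tbuf = malloc(n)\n"
    + "b" * 40 + " 2 2 1\nauthor Bob\n\tmemcpy(buf, src, n)\n"
)
print(suspect_commits(porcelain, {2}))  # the commit that wrote line 2
```

This is also where squash merges bite: if line 2 was last touched by a squash commit, the blame points at the merge artifact rather than the original author, which is why the project's squash-merge decomposition matters.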

Here’s how you’d run a focused analysis on a specific ecosystem to avoid the substantial disk requirements of a full scan:

cd cve-analyzer && uv sync
export GITHUB_TOKEN="ghp_your_token_here"

# Analyze only Python ecosystem CVEs since May 2025
uv run cve-analyzer batch --ecosystem pypi --since 2025-05-01 --llm-verify

# Or test specific CVEs
uv run cve-analyzer batch --cve-list CVE-2024-1234,CVE-2024-5678 --llm-verify

The AI detection layer operates on commit metadata signatures. The system searches for co-author trailers like “Co-authored-by: GitHub Copilot”, bot email addresses from AI services, and commit message markers from over 15 AI tools. This heuristic approach is fast but incomplete—any AI-assisted code that doesn’t leave metadata traces goes undetected. The researchers are explicit about this: their results represent a strict lower bound on AI-caused vulnerabilities.
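A sketch of what that signature scan might look like, with a deliberately tiny pattern list (the project covers over 15 tools; these patterns are illustrative examples, not its actual detection rules):

```python
import re

# Illustrative signatures only; the real tool maintains a much larger list.
CO_AUTHOR_PATTERNS = [
    r"Co-authored-by:\s*GitHub Copilot",
    r"Co-authored-by:.*noreply@anthropic\.com",
]
MESSAGE_MARKERS = [
    r"Generated with .*Claude Code",
    r"\(aider\)",
]

def detect_ai_signature(commit_message: str) -> bool:
    """Heuristic check for AI-tool traces in commit metadata.
    Returns False for AI-assisted code that left no trace, which
    is exactly why the results are a strict lower bound."""
    for pat in CO_AUTHOR_PATTERNS + MESSAGE_MARKERS:
        if re.search(pat, commit_message, re.IGNORECASE):
            return True
    return False

msg = "Fix overflow\n\nCo-authored-by: GitHub Copilot <copilot@github.com>"
print(detect_ai_signature(msg))                    # True
print(detect_ai_signature("Fix overflow in parser"))  # False
```

The second call is the whole limitation in miniature: a commit whose AI assistance left no metadata trailer looks identical to a human-authored one.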

What elevates this beyond simple pattern matching is the dual-LLM verification system. After heuristic detection flags potential AI involvement, a screening LLM performs triage to filter out false positives—cases where an AI-touched commit exists in the repository but clearly didn’t contribute to the vulnerability. The README notes this achieves approximately 80% precision. For remaining candidates, a deep investigation phase deploys an LLM agent with git tool access, capable of making multiple tool calls (up to 50 per CVE) to answer the core question: did the AI-authored code actually help cause this vulnerability? When the primary model fails, a fallback mechanism using the Claude Agent SDK retries the analysis.

This multi-layered approach—heuristic detection, LLM screening, agent-based deep investigation, and fallback retry—reflects the complexity of real-world vulnerability attribution. There’s no clean API for “was this bug caused by AI?” You need git forensics, metadata analysis, and reasoning about code causality. The pipeline acknowledges these challenges rather than pretending they don’t exist.

Gotcha

The most fundamental limitation is right in the README: “Detection relies on commit metadata — not all AI-assisted code leaves traces.” If a developer uses an AI assistant without it adding co-author trailers or bot signatures, the code is invisible to this analysis. This means the dataset captures only AI tool usage that explicitly marks itself in version control, likely missing casual usage or manual copy-paste from AI tools. The numbers are a floor, not a ceiling.

The infrastructure requirements are genuinely prohibitive for most users. A full --all analysis clones many repositories (~10k according to the README) and requires over 2TB of disk space. This isn’t something you run on a laptop or even a typical development workstation. The researchers designed this for substantial research infrastructure, and while the --ecosystem and --cve-list flags let you analyze smaller subsets, reproducing the complete findings requires serious compute resources.

The project is also in active development, with a warning that “results may contain errors.” The LLM-based verification introduces non-determinism: running the same analysis twice may yield different conclusions depending on model behavior. For academic research exploring the problem space, this is acceptable. For production security tooling, where you need audit trails and deterministic results, it’s a significant limitation.

Verdict

Use Vibe Security Radar if you’re conducting academic security research on AI coding tools, need empirical data for policy discussions about AI-generated code risks, or want to audit specific high-profile CVEs for potential AI involvement. It’s a public tool systematically attempting this analysis at scale, and the methodology is sophisticated enough to produce credible research findings. The Georgia Tech team has done the hard work of building infrastructure that actually works across many repositories.

Skip it if you need production-ready vulnerability scanning: this is explicitly a research prototype with non-deterministic behavior and substantial infrastructure requirements. Skip it if you have limited compute resources, or need results you can present in an audit without caveats about LLM verification uncertainty. Also skip it if you’re looking for comprehensive coverage of AI-caused bugs; the metadata-based detection means you’re only seeing the tip of the iceberg. This is a tool for researchers and security academics studying an emerging problem, not a drop-in solution for enterprise security teams.
