Tracking AI-Generated Code in Real-World Security Vulnerabilities: Inside Vibe Security Radar
Hook
What if the vulnerability in your dependency wasn’t written by a human at all? Vibe Security Radar attempts to answer a question nobody was tracking: how many real-world CVEs originate from AI-generated code.
Context
The explosion of AI coding assistants—GitHub Copilot, Cursor, Cody, Amazon Q, and dozens more—has fundamentally altered how code gets written. Developers now generate entire functions, classes, and modules through natural language prompts. But while we’ve debated AI code quality in the abstract, we’ve lacked empirical data on a critical question: are AI coding assistants introducing exploitable vulnerabilities into production systems?
Vibe Security Radar emerged from this knowledge gap. Unlike traditional vulnerability scanners that detect what’s broken, this tool performs forensic archaeology—tracing published CVEs backward through git history to identify the exact commit that introduced each vulnerability, then analyzing whether that commit bears the fingerprints of AI authorship. It’s not about preventing future bugs; it’s about quantifying a phenomenon we’ve suspected but never measured: the security debt being accumulated through AI-assisted development at scale.
Technical Insight
The architecture operates as a four-stage pipeline that combines traditional software forensics with modern LLM capabilities. Stage one aggregates vulnerability data from OSV, GitHub Advisory Database, and NVD, filtering for entries that include git commit references to fixes. Stage two applies the SZZ algorithm (named for its authors Śliwerski, Zimmermann, and Zeller), a git-blame technique that walks backward from a fix commit to identify the original bug-introducing commit. This isn’t trivial; the tool includes squash-merge decomposition to handle repositories where multiple commits get collapsed, requiring reconstruction of the original change sequence.
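The blame walk-back at the heart of stage two can be sketched as follows. This is a minimal illustration, not the project’s implementation: the helper names `deleted_lines` and `blame_command` are hypothetical, and a real run would feed the blame output through further heuristics. The core SZZ idea is that lines deleted by a fix commit, when blamed against the fix’s parent, point at the commit that last touched them, which becomes the candidate bug-introducing commit.

```python
import re

def deleted_lines(fix_diff: str) -> dict:
    """Map each file to the old-file line numbers removed by the fix.
    These are the lines SZZ traces backward via git blame."""
    out, path, old_ln = {}, None, 0
    for line in fix_diff.splitlines():
        if line.startswith('--- a/'):
            path = line[6:]                 # file path on the pre-fix side
            out.setdefault(path, [])
        elif line.startswith('@@'):
            old_ln = int(re.match(r'@@ -(\d+)', line).group(1))
        elif line.startswith('-') and not line.startswith('---'):
            out[path].append(old_ln)        # line the fix deleted
            old_ln += 1
        elif not line.startswith('+'):
            old_ln += 1                     # context line advances old side

    return out

def blame_command(fix_sha: str, path: str, lines: list) -> list:
    """Blame the fix's parent (fix^) so results predate the fix itself;
    the blamed commit is the SZZ candidate for introducing the bug."""
    ranges = [arg for n in lines for arg in ('-L', f'{n},{n}')]
    return ['git', 'blame', '--porcelain', *ranges, f'{fix_sha}^', '--', path]
```

Running `blame_command` output through `subprocess` and parsing the porcelain header yields the candidate commit SHA for each deleted line.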
The third stage is where things get novel: AI authorship detection through metadata fingerprinting. The system scans commit metadata for 15+ distinct AI tool signatures. GitHub Copilot leaves Co-authored-by: GitHub <noreply@github.com> trailers. Cursor adds specific commit message patterns. GitLab Duo, Cody, and Amazon Q each have identifiable markers in author emails or commit metadata. Here’s a simplified version of the detection logic:
```python
import re

# Regex fingerprints for 15+ AI tools; three shown here.
AI_SIGNATURES = {
    'copilot': [
        r'Co-authored-by: GitHub <noreply@github\.com>',
        r'Co-authored-by: github-advanced-security',
    ],
    'cursor': [
        r'cursor\.sh',
        r'@cursor\.com',
    ],
    'cody': [
        r'sourcegraph\.com/cody',
        r'@sourcegraph\.com.*bot',
    ],
    # ... 12 more tool signatures
}

def detect_ai_authorship(commit_metadata):
    """Scan commit message, author, and trailers for AI tool fingerprints."""
    signals = []
    fields = (
        commit_metadata['message'],
        commit_metadata['author'],
        commit_metadata.get('trailers', ''),
    )
    for tool, patterns in AI_SIGNATURES.items():
        for pattern in patterns:
            if any(re.search(pattern, field) for field in fields):
                signals.append({
                    'tool': tool,
                    'confidence': 'high',
                    'evidence': pattern,
                })
    return signals
```
But metadata detection has obvious limits—developers can remove co-author trailers or copy-paste AI output without attribution. This is where stage four becomes crucial: LLM-augmented verification. The system employs Claude with tool-calling capabilities, giving it access to git operations (diff inspection, commit traversal, author pattern analysis) and instructing it to make causality determinations. The LLM can make up to 50 tool calls per CVE, examining code patterns, commit timing, authorship changes, and contextual clues that suggest AI involvement beyond explicit metadata.
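The shape of that capped tool-calling loop can be sketched in a few lines. Everything here is an assumption for illustration: the tool names, the message format, and the stub dispatch stand in for the real Claude tool-use protocol, and real tool implementations would shell out to git. What the sketch does capture is the article’s stated constraint: at most 50 tool calls per CVE before the investigation is abandoned.

```python
from typing import Callable

MAX_TOOL_CALLS = 50  # per-CVE budget described in the article

def git_tools() -> dict:
    """Hypothetical git operations exposed to the model as tools;
    real versions would invoke git and return actual output."""
    return {
        'show_diff': lambda sha: f'(diff for {sha})',
        'list_parents': lambda sha: f'(parents of {sha})',
        'author_stats': lambda email: f'(commit cadence for {email})',
    }

def investigate(cve_id: str, ask_model: Callable, tools: dict = None) -> dict:
    """Drive a tool-calling loop until the model returns a verdict
    or the per-CVE call budget is exhausted."""
    tools = tools or git_tools()
    transcript = [{'role': 'user',
                   'content': f'Did AI-generated code introduce {cve_id}?'}]
    calls = 0
    while calls < MAX_TOOL_CALLS:
        reply = ask_model(transcript)
        if 'verdict' in reply:  # the model reached a determination
            return {'cve': cve_id, 'verdict': reply['verdict'],
                    'tool_calls': calls}
        # Otherwise the model asked for a tool; run it, feed back the result.
        result = tools[reply['tool']](*reply.get('args', ()))
        transcript.append({'role': 'tool', 'name': reply['tool'],
                           'content': result})
        calls += 1
    return {'cve': cve_id, 'verdict': 'inconclusive', 'tool_calls': calls}
```

Swapping `ask_model` for a real API client turns this skeleton into an agent; the budget cap is what keeps per-CVE cost bounded.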
The two-tier LLM approach is particularly clever. A fast triage pass filters candidates using pattern matching and basic heuristics, achieving roughly 80% precision. Only candidates that pass triage enter the expensive deep investigation phase where the LLM agent performs git archaeology. This dramatically reduces API costs while maintaining coverage. The system explicitly acknowledges that results represent a ‘strict lower bound’—AI-introduced vulnerabilities without detectable signatures won’t appear in the dataset.
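A minimal sketch of that two-tier gating, with the caveat that the substring hints below are illustrative stand-ins, not the project’s actual heuristics: the cheap tier touches only strings, and the expensive tier runs solely on commits that pass it.

```python
# Cheap first-tier signals checked before any LLM spend.
# These hints are illustrative, not the project's actual heuristics.
TRIAGE_HINTS = ('co-authored-by', 'generated', 'copilot', 'cursor', 'gpt')

def triage(commit_message: str, author_email: str) -> bool:
    """Fast tier: substring heuristics only. Only commits that pass
    proceed to the expensive deep-investigation tier."""
    haystack = f'{commit_message} {author_email}'.lower()
    return any(hint in haystack for hint in TRIAGE_HINTS)

def analyze(candidates: list, deep_investigate) -> list:
    """Run the costly second tier only on triaged candidates."""
    return [deep_investigate(c) for c in candidates
            if triage(c['message'], c['email'])]
```

The economics follow directly: if triage rejects most candidates, the deep phase (and its API bill) scales with the survivors rather than with the full CVE corpus.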
The final output feeds into a static web dashboard that visualizes CVE distributions across repositories, AI tools, and vulnerability types. Researchers can filter by programming language, trace individual vulnerabilities to their introducing commits, and examine the git forensic evidence that led to each AI authorship determination.
Gotcha
The storage and compute requirements are genuinely prohibitive for casual use. A full analysis run on approximately 10,000 repositories requires 2TB+ of disk space just for git repository clones, plus significant memory for LLM context windows when analyzing complex commit histories. The authors don’t provide cost estimates, but running Claude API calls with tool augmentation across thousands of CVEs would easily reach hundreds or thousands of dollars per complete analysis cycle.
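To see how a cost in the hundreds-to-thousands range pencils out, here is a back-of-envelope sketch. Every input is an assumption chosen for illustration; the project publishes no token counts and prices vary by model and date.

```python
# Back-of-envelope only: every number below is an assumption for
# illustration, not a figure from the project or a real price sheet.
cves            = 5_000    # CVEs reaching the deep-investigation phase
calls_per_cve   = 20       # average tool calls (the cap is 50)
tokens_per_call = 3_000    # prompt + tool output + completion
usd_per_mtok    = 5.0      # blended per-million-token rate, illustrative

total_tokens = cves * calls_per_cve * tokens_per_call
cost_usd = total_tokens / 1_000_000 * usd_per_mtok
print(f'{total_tokens:,} tokens -> ~${cost_usd:,.0f} per full cycle')
```

Under these assumptions a single complete cycle lands around $1,500, and the figure moves linearly with each input, so doubling the average tool calls or the token volume doubles the bill.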
More fundamentally, the detection methodology has an unfixable blind spot: stealth AI usage. If a developer generates code with ChatGPT, reviews it, makes modifications, and commits without any tool-specific metadata, it’s invisible to this system. The metadata fingerprinting approach only catches developers who use integrated IDE assistants or who leave co-author trailers intact. This isn’t a bug—the authors explicitly state findings are lower bounds—but it means the tool answers ‘how many detectable AI-introduced CVEs exist’ rather than ‘how many AI-introduced CVEs exist.’ That distinction matters enormously for drawing conclusions about AI coding assistant safety. The tool also lacks precision/recall metrics beyond the 80% triage filter, making it difficult to assess how many false positives slip through the deep investigation phase.
Verdict
Use if: you’re a security researcher studying AI code generation risks empirically, need a dataset of real-world AI-introduced vulnerabilities for academic analysis, or want to audit large open-source ecosystems for AI-assisted security issues at scale. This tool excels at forensic investigation across thousands of repositories when you have the computational budget and storage capacity. Skip if: you need production vulnerability scanning (this is a research prototype with acknowledged error rates), want to audit individual repositories in real-time (the pipeline is designed for batch processing), have limited infrastructure (2TB+ storage and significant LLM API budgets required), or need to detect AI involvement that doesn’t leave metadata traces. For standard vulnerability detection without AI attribution, stick with Semgrep or CodeQL. For tracking Copilot usage without security correlation, use GitHub’s native analytics.