Building an AI Bug Hunter: Inside VSCode's AutoK Security Extension

Hook

Press F12 on a function, and within seconds an AI tells you if it's exploitable. No rules to write, no database to maintain—just a conversation with a language model that reads code like a security researcher.

Context

Traditional static analysis security testing (SAST) tools have a fundamental problem: they're only as good as their rules. Miss a pattern, and you miss vulnerabilities. Keep rules too strict, and developers drown in false positives. This creates a maintenance burden—security teams constantly updating rule sets as new vulnerability classes emerge, new frameworks launch, and old patterns evolve.

Large language models changed the equation. They've read millions of lines of vulnerable code during training and can recognize security antipatterns without explicit rules. The autok-extension brings this capability directly into VSCode, implementing the Autokaker algorithm as an interactive extension. Instead of running nightly scans that developers ignore, it puts AI-powered security analysis at your fingertips—literally bound to a hotkey. It's an experiment in making security analysis conversational rather than declarative.

Technical Insight

The extension's architecture is surprisingly minimal for what it accomplishes. At its core, it's a TypeScript wrapper around three operations: extract code context, prompt an LLM, and parse the response into VSCode decorations. The interesting decisions happen in how it handles each step.

Code extraction starts with cursor position. When you trigger analysis (F12 by default), the extension identifies the function containing your cursor using regex patterns that match common function signatures across languages. For whole-file analysis, it attempts to split the file into discrete functions. This approach is intentionally loose—rather than building AST parsers for every language, it relies on the LLM to understand context even from imperfectly extracted snippets.
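To make the loose extraction concrete, here is a minimal sketch of finding the function around a cursor line. The two regexes, the brace-counting helper, and the names `extractEnclosingFunction` and `findClosingBraceLine` are assumptions made for illustration and are not taken from the extension's source. The sketch also only handles brace-delimited languages.

```typescript
// Hypothetical sketch: locate the brace-delimited function around a cursor line.
// Both regexes are illustrative guesses, not the extension's actual patterns.

interface ExtractedFunction {
  startLine: number; // 0-based, inclusive
  endLine: number;   // 0-based, inclusive
  text: string;
}

// Lines that open control-flow blocks or return values are never signatures.
const CONTROL_KEYWORD = /^\s*(?:if|else|for|while|do|switch|catch|return)\b/;
// "identifier(" with no ';' afterwards on the same line, e.g. "int add(int a) {".
const SIGNATURE_GUESS = /[A-Za-z_$][\w$]*\s*\([^;]*$/;

// Returns the line holding the brace that closes the first block opened at or
// after `start`, or -1 if none is found. Braces inside strings and comments
// are counted too, which is one way a loose extractor can go wrong.
function findClosingBraceLine(lines: string[], start: number): number {
  let depth = 0;
  let opened = false;
  for (let i = start; i < lines.length; i++) {
    const line = lines[i];
    for (let j = 0; j < line.length; j++) {
      const ch = line.charAt(j);
      if (ch === "{") {
        depth++;
        opened = true;
      } else if (ch === "}") {
        depth--;
        if (opened && depth === 0) return i;
      }
    }
  }
  return -1;
}

function extractEnclosingFunction(source: string, cursorLine: number): ExtractedFunction | null {
  const lines = source.split("\n");
  // Walk upward from the cursor until a candidate signature encloses it.
  for (let start = Math.min(cursorLine, lines.length - 1); start >= 0; start--) {
    const line = lines[start];
    if (CONTROL_KEYWORD.test(line) || !SIGNATURE_GUESS.test(line)) continue;
    const end = findClosingBraceLine(lines, start);
    if (end !== -1 && end >= cursorLine) {
      return { startLine: start, endLine: end, text: lines.slice(start, end + 1).join("\n") };
    }
  }
  return null;
}
```

A heuristic like this will misfire on braces inside string literals or on indentation-based languages such as Python, which is consistent with the article's point that the LLM is expected to cope with imperfect snippets.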

The LLM integration supports three backend types through a unified interface: the free Neuroengine.ai service, OpenAI-compatible APIs, or local llama.cpp servers. Here's the configuration structure:

{
  "autok.mode": "openai",  // or "free" or "llamacpp"
  "autok.apiKey": "sk-...",
  "autok.endpoint": "https://api.openai.com/v1/chat/completions",
  "autok.model": "gpt-4",
  "autok.maxTokens": 2000,
  "autok.multiShot": 3,
  "autok.verificationMode": true
}
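One way a unified interface over these modes could look is a single function that turns settings plus a prompt into an HTTP request description. The sketch below is an assumption-laden illustration: `BackendSettings`, `LlmHttpCall`, and `buildLlmCall` are hypothetical names, and the request bodies follow the public OpenAI chat-completions format and llama.cpp's bundled server (`prompt` and `n_predict` on its completion route). The Neuroengine.ai request format is not described in the article, so the sketch deliberately leaves that mode out.

```typescript
// Hypothetical settings shape, mirroring the autok.* keys shown above.
interface BackendSettings {
  mode: "free" | "openai" | "llamacpp";
  endpoint: string;
  apiKey?: string;
  model?: string;
  maxTokens: number;
}

// A transport-neutral description of one HTTP POST.
interface LlmHttpCall {
  url: string;
  headers: { [header: string]: string };
  body: string;
}

function buildLlmCall(settings: BackendSettings, prompt: string): LlmHttpCall {
  const headers: { [header: string]: string } = { "Content-Type": "application/json" };

  if (settings.mode === "openai") {
    // OpenAI-compatible servers expect a bearer token and a chat message list.
    if (settings.apiKey) headers["Authorization"] = "Bearer " + settings.apiKey;
    const body = JSON.stringify({
      model: settings.model,
      max_tokens: settings.maxTokens,
      messages: [{ role: "user", content: prompt }],
    });
    return { url: settings.endpoint, headers, body };
  }

  if (settings.mode === "llamacpp") {
    // llama.cpp's bundled HTTP server accepts a raw prompt plus n_predict.
    const body = JSON.stringify({ prompt: prompt, n_predict: settings.maxTokens });
    return { url: settings.endpoint, headers, body };
  }

  // The free service's wire format is unknown here, so it is not guessed at.
  throw new Error("request format for mode '" + settings.mode + "' is not covered by this sketch");
}
```

Keeping the builder free of network code means the rest of the pipeline only ever sees a prompt going in and text coming out, regardless of which backend is configured.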

The multi-shot parameter is where things get interesting. Set to 1, the extension sends a single prompt asking the LLM to identify vulnerabilities. Set to 3, it sends the same prompt three times and aggregates results—a crude but effective ensemble method. The LLM might hallucinate different false positives each run, but real vulnerabilities tend to appear consistently across shots.
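The article does not say how results are aggregated, so the following is only a sketch of one plausible scheme: simple voting. It assumes two findings are "the same" when their type (case-insensitive) and line number agree, and it keeps those reported by at least `minVotes` shots. The names `ShotFinding` and `aggregateShots` are hypothetical.

```typescript
interface ShotFinding {
  type: string;
  line: number;
  severity: string;
}

// Keeps findings reported in at least `minVotes` of the shots. Two findings
// count as the same when their type (case-insensitive) and line agree;
// that matching rule is an assumption of this sketch.
function aggregateShots(shots: ShotFinding[][], minVotes: number): ShotFinding[] {
  const votes: { [key: string]: { finding: ShotFinding; count: number } } = {};
  const order: string[] = []; // remembers first-seen order of keys

  for (const shot of shots) {
    const seenInShot: { [key: string]: boolean } = {};
    for (const finding of shot) {
      const key = finding.type.trim().toLowerCase() + "@" + finding.line;
      if (seenInShot[key]) continue; // one vote per shot, even if repeated
      seenInShot[key] = true;
      if (votes[key]) {
        votes[key].count++;
      } else {
        votes[key] = { finding: finding, count: 1 };
        order.push(key);
      }
    }
  }

  // The first-seen copy of each surviving finding is returned.
  return order.filter((key) => votes[key].count >= minVotes).map((key) => votes[key].finding);
}
```

With three shots and `minVotes` set to 2, a finding that a single run hallucinated is dropped, while one that recurs survives. Exact matching on line numbers is brittle, since models often disagree by a line or two, so a real implementation might need a tolerance.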

Verification mode adds a second prompt phase. After initial detection, the extension sends findings back to the LLM with skeptical prompts: "Are you sure this is exploitable? Could this be a false positive?" The responses tag each result as LIKELY or UNLIKELY. Decorations are colored by severity (black for info-level findings, yellow for medium, red for critical), and findings tagged UNLIKELY are dimmed.
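The verification step can be sketched as three small pure functions: build the skeptical prompt, read the LIKELY or UNLIKELY tag out of the reply, and derive a display style. Everything here is an assumption for illustration. The prompt template beyond the two quoted questions, the names `buildVerificationPrompt`, `parseVerdict`, and `styleFor`, and the choice to treat a contradictory reply as "no verdict" are not from the extension. Only the three colors named in the article are mapped; other severities return `null` rather than a guessed color.

```typescript
type Verdict = "LIKELY" | "UNLIKELY";

// Builds the skeptical follow-up prompt. The two questions are quoted from the
// article; the rest of the template is an assumption.
function buildVerificationPrompt(findingSummary: string, code: string): string {
  return [
    "You previously reported: " + findingSummary,
    "Are you sure this is exploitable? Could this be a false positive?",
    "Answer with exactly one word, LIKELY or UNLIKELY, then explain.",
    "",
    code,
  ].join("\n");
}

function parseVerdict(reply: string): Verdict | null {
  const upper = reply.toUpperCase();
  // \b keeps LIKELY from matching inside UNLIKELY.
  const saysUnlikely = /\bUNLIKELY\b/.test(upper);
  const saysLikely = /\bLIKELY\b/.test(upper);
  if (saysLikely === saysUnlikely) return null; // tag missing or contradictory
  return saysUnlikely ? "UNLIKELY" : "LIKELY";
}

// Only the colors stated in the article are mapped here.
function colorFor(severity: string): string | null {
  switch (severity) {
    case "INFO": return "black";
    case "MEDIUM": return "yellow";
    case "CRITICAL": return "red";
    default: return null;
  }
}

interface FindingStyle {
  color: string | null;
  dimmed: boolean;
}

function styleFor(severity: string, verdict: Verdict | null): FindingStyle {
  return { color: colorFor(severity), dimmed: verdict === "UNLIKELY" };
}
```

For example, `styleFor("CRITICAL", parseVerdict("UNLIKELY: the length is checked earlier"))` yields a red but dimmed style, matching the behavior described above.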

The prompt engineering is deliberately minimal. Rather than building elaborate few-shot examples, the extension sends straightforward instructions:

Analyze this code for security vulnerabilities.
Respond in this exact format:
VULNERABILITY: [type]
SEVERITY: [INFO|LOW|MEDIUM|HIGH|CRITICAL]
LINE: [number]
DESCRIPTION: [explanation]

This structured format lets the extension parse responses with simple string matching rather than trying to interpret free-form text. When the LLM responds with multiple vulnerabilities, each one is converted into its own VSCode decoration, the colored label attached to the flagged line.
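A parser for the format above really can be a few lines of string matching. The sketch below is an assumption about how such parsing might be done, not the extension's code. The names `ParsedFinding` and `parseFindings` are hypothetical, and the choice to discard incomplete or malformed entries is mine.

```typescript
interface ParsedFinding {
  vulnerability: string;
  severity: string;
  line: number;
  description: string;
}

const KNOWN_SEVERITIES = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"];

function parseFindings(reply: string): ParsedFinding[] {
  const partials: Partial<ParsedFinding>[] = [];

  for (const rawLine of reply.split("\n")) {
    const sep = rawLine.indexOf(":");
    if (sep === -1) continue; // not a "KEY: value" line
    const key = rawLine.slice(0, sep).trim().toUpperCase();
    const value = rawLine.slice(sep + 1).trim();

    if (key === "VULNERABILITY") {
      partials.push({ vulnerability: value }); // each VULNERABILITY line starts a new finding
      continue;
    }
    if (partials.length === 0) continue; // ignore fields before the first finding
    const current = partials[partials.length - 1];
    if (key === "SEVERITY") current.severity = value.toUpperCase();
    else if (key === "LINE") current.line = parseInt(value, 10);
    else if (key === "DESCRIPTION") current.description = value;
  }

  // Keep only entries where every field is present and well-formed.
  const complete: ParsedFinding[] = [];
  for (const p of partials) {
    if (
      p.vulnerability &&
      p.severity !== undefined && KNOWN_SEVERITIES.indexOf(p.severity) !== -1 &&
      p.line !== undefined && !isNaN(p.line) &&
      p.description !== undefined
    ) {
      complete.push({
        vulnerability: p.vulnerability,
        severity: p.severity,
        line: p.line,
        description: p.description,
      });
    }
  }
  return complete;
}
```

Chatty preambles and unknown keys are skipped silently, which matters because models rarely follow an "exact format" instruction perfectly. Multi-line descriptions are truncated to their first line in this sketch.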

The report generation feature writes findings to external files with full context: the vulnerable code snippet, the LLM's reasoning, remediation suggestions, and severity ratings. These reports are markdown-formatted, making them easy to share in pull requests or security tickets.

What's clever about this architecture is its backend agnosticism. Because it uses standard OpenAI-compatible endpoints, you can point it at local Llama models running on consumer GPUs, avoiding the cost and privacy concerns of cloud APIs. A developer with a decent gaming rig can run security scans without sending proprietary code to external services. The tradeoff is speed and quality: local models are typically slower and less capable than GPT-4, but for sensitive codebases, that's often acceptable.

The language-agnostic approach deserves emphasis. By delegating language understanding to the LLM, the extension works across C, C++, JavaScript, Solidity, Python, and anything else the model was trained on. There's no plugin architecture for adding language support, no grammar files to maintain. The LLM just... figures it out. This is both the extension's superpower and its Achilles heel.

Gotcha

The false positive rate will frustrate you. The developer openly acknowledges this, and it's not fixable through better prompting or settings tweaking—it's fundamental to using LLMs for security analysis. Language models are pattern matchers, not theorem provers. They'll flag a SQL query construction as vulnerable even when you're using parameterized statements correctly. They'll worry about integer overflows in contexts where bounds are guaranteed by earlier checks. Even with verification mode enabled and GPT-4 on the backend, expect to dismiss half the findings.

Performance becomes a problem quickly. With verification mode and multi-shot set to 3, you're making six LLM API calls per function analysis: three detection calls plus a verification call for each. At ~2-5 seconds per call, run one after another, that's 12-30 seconds of waiting. The extension blocks during analysis, freezing your editor. For quick spot-checks on single functions, this is tolerable. For whole-file analysis on anything larger than a few hundred lines, it's productivity-killing. The free tier using smaller models is even slower.
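The arithmetic behind those numbers is easy to reproduce. This sketch assumes, as the six-call figure implies, that verification doubles the call count and that calls run sequentially. `estimateAnalysis` and `AnalysisEstimate` are hypothetical names.

```typescript
interface AnalysisEstimate {
  calls: number;
  minSeconds: number;
  maxSeconds: number;
}

// Assumes one verification call per detection shot and strictly sequential calls.
function estimateAnalysis(
  multiShot: number,
  verification: boolean,
  minSecondsPerCall: number,
  maxSecondsPerCall: number
): AnalysisEstimate {
  const calls = multiShot * (verification ? 2 : 1);
  return {
    calls,
    minSeconds: calls * minSecondsPerCall,
    maxSeconds: calls * maxSecondsPerCall,
  };
}
```

Calling `estimateAnalysis(3, true, 2, 5)` gives 6 calls and a 12 to 30 second wait per function. Multiply that by the number of functions in a file to see why whole-file analysis gets painful.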

The manual installation process (downloading .vsix files and installing via command line) signals this is a research prototype, not production software. There are no automatic updates, no telemetry to help the developer understand failure modes, and no crash reporting. You're on your own for troubleshooting, and the documentation is minimal. If you're not comfortable reading TypeScript source to understand behavior, you'll struggle.

Context window limitations hit hard on larger functions. Send a 500-line function to the LLM, and it might focus on the first 100 lines while ignoring later vulnerabilities. The extension doesn't implement any smart chunking strategies—it just sends whatever it extracted and hopes the model's context window is large enough.
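For contrast, here is what a very simple chunking strategy could look like. The extension does not do this; the sketch is purely illustrative. It splits source into overlapping line windows and records each window's starting line so that a `LINE:` number reported for a chunk can be mapped back to the original file. `CodeChunk` and `chunkLines` are hypothetical names.

```typescript
interface CodeChunk {
  startLine: number; // 0-based line of the chunk's first line in the original source
  text: string;
}

// Splits source into windows of at most `maxLines` lines, where consecutive
// windows share `overlap` lines so a bug spanning a boundary is still seen whole.
function chunkLines(source: string, maxLines: number, overlap: number): CodeChunk[] {
  if (maxLines <= 0 || overlap < 0 || overlap >= maxLines) {
    throw new Error("need maxLines > 0 and 0 <= overlap < maxLines");
  }
  const lines = source.split("\n");
  const step = maxLines - overlap;
  const chunks: CodeChunk[] = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + maxLines, lines.length);
    chunks.push({ startLine: start, text: lines.slice(start, end).join("\n") });
    if (end === lines.length) break; // last window reached the end of the source
  }
  return chunks;
}
```

A finding reported at line `n` of a chunk (1-based within the chunk) would correspond to 0-based line `chunk.startLine + n - 1` in the original. Fixed windows still cut through logical blocks, and each extra chunk adds more LLM calls, so this trades the attention problem for more latency.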

Verdict

Use if: You're actively reviewing security-sensitive code (crypto implementations, authentication flows, input handlers) and want a second pair of AI eyes to catch things you might miss. The F12 hotkey makes it frictionless to spot-check suspicious functions during code review. It's especially valuable if you have GPT-4 or Claude API access and can tolerate the latency, or if you're already running local LLMs for other tasks and can reuse that infrastructure. It's also an excellent learning tool: reading the LLM's explanations of why code is vulnerable teaches security concepts interactively.

Skip if: You need low-latency feedback in your development loop, require high precision for compliance or release gating, or work in languages with established SAST tools (Java, C#). The false positive rate makes it unsuitable as a CI/CD gate, and the lack of organizational features (no dashboards, no trend tracking, no team-wide policy configuration) means it won't replace enterprise security platforms. If you're already happy with Semgrep or CodeQL and have invested in rule development, the marginal benefit here is small.