How NVIDIA Built an AI Agent to Triage Container Vulnerabilities in Seconds Instead of Days

Hook

Security teams waste 80% of their CVE triage time investigating vulnerabilities that don't actually affect their containers. NVIDIA built an AI agent that does this contextual analysis in seconds.

Context

If you've ever run a container vulnerability scanner like Trivy or Grype against a production image, you know the problem: you get back a list of 200+ CVEs and spend the next three days researching which ones actually matter. A critical vulnerability in libssl sounds terrifying until you discover your application never actually calls the affected function. Or you find a Python package CVE, but you're only using that package as a transitive dependency for a feature you disabled via environment variables.

This is the false positive problem that plagues modern DevSecOps. Traditional scanners detect presence, not exploitability. They tell you a vulnerable library exists in your container, but they can't tell you if that vulnerability is reachable given your specific configuration, runtime behavior, and deployment context. Security teams end up doing manual research: reading CVE descriptions, checking NVD and vendor advisories, examining application code to trace dependency usage, and consulting with developers who may or may not remember why a particular package was included. NVIDIA's vulnerability-analysis blueprint attacks this problem by building an agentic AI workflow that automates the entire research and triage process using large language models and retrieval-augmented generation.

Technical Insight

The architecture orchestrates four distinct components into a unified pipeline: SBOM generation, vulnerability database indexing, RAG-powered retrieval, and parallel LLM analysis. The workflow starts with standard tooling: Syft extracts a Software Bill of Materials (SBOM) from your container image, producing a structured list of every package and library, which a scanner such as Grype can then match against known CVEs. This SBOM feeds into the core innovation: a RAG pipeline built on NVIDIA's NeMo Agent Toolkit that indexes multiple vulnerability databases (NVD, OSV, vendor advisories) using embedding models, then retrieves relevant context for each detected CVE.
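
To make the first stage concrete, here is a minimal sketch of turning an SBOM into the package list the later stages consume. It assumes the SBOM was exported as CycloneDX JSON (for example with `syft <image> -o cyclonedx-json`); the helper name `packages_from_cyclonedx` and the sample data are hypothetical and not part of the blueprint.

```python
import json


def packages_from_cyclonedx(sbom_text: str) -> list[dict]:
    """Return name/version/purl for each component in a CycloneDX JSON SBOM."""
    sbom = json.loads(sbom_text)
    packages = []
    # CycloneDX lists packages under the top-level "components" array.
    for component in sbom.get("components", []):
        packages.append({
            "name": component.get("name", ""),
            "version": component.get("version", ""),
            "purl": component.get("purl", ""),
        })
    return packages


# Tiny hand-written SBOM, for illustration only (not real scan output).
SAMPLE_SBOM = json.dumps({
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {"type": "library", "name": "openssl", "version": "1.1.1k",
         "purl": "pkg:deb/debian/openssl@1.1.1k"},
        {"type": "library", "name": "requests", "version": "2.31.0",
         "purl": "pkg:pypi/requests@2.31.0"},
    ],
})

if __name__ == "__main__":
    for pkg in packages_from_cyclonedx(SAMPLE_SBOM):
        print(f"{pkg['name']}=={pkg['version']}")
```

Each entry's name and version is what a scanner matches against vulnerability databases, and the package URL (`purl`) gives downstream steps an unambiguous identifier for the ecosystem the package came from.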

The breakthrough is how the system uses LLMs for contextual reasoning rather than just pattern matching. When a CVE is detected, the agent doesn't just flag it—it asks questions. Is this library actually used at runtime? Does the vulnerable code path get executed given the container's environment variables? Are there compensating controls in the deployment configuration? The LLM analyzes the CVE description, the application's dependency graph, container metadata, and even Kubernetes deployment specs to make an informed risk assessment.

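Here's a simplified sketch of what the core agentic workflow looks like. Treat the `nemo_agent` module and class names as illustrative rather than a verified NeMo Agent Toolkit API; `search_vulnerability_database`, `analyze_sbom_context`, `sbom_json`, and `container_metadata` are assumed to be defined elsewhere: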

from nemo_agent.toolkit import Agent, Tool
from nemo_agent.llm import NIM

# Define tools the agent can use
vulnerability_search = Tool(
    name="search_vulnerability_db",
    description="Search NVD and OSV databases for CVE details",
    func=search_vulnerability_database
)

sbom_analyzer = Tool(
    name="analyze_sbom",
    description="Analyze SBOM to determine package usage and dependencies",
    func=analyze_sbom_context
)

# Initialize agent with Llama 3.1 70B
llm = NIM(model="meta/llama-3.1-70b-instruct")
agent = Agent(
    llm=llm,
    tools=[vulnerability_search, sbom_analyzer],
    max_iterations=5
)

# Run analysis on detected CVE
result = agent.run(
    f"""Analyze CVE-2024-1234 in the context of this container.
    SBOM: {sbom_json}
    Container config: {container_metadata}
    
    Determine:
    1. Is the vulnerable code path actually used?
    2. What's the exploitability given this deployment?
    3. Recommended mitigation priority (Critical/High/Low/False Positive)
    """
)
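
The prompt above asks for a free-text answer that ends in a priority label, so something downstream has to turn that text into a structured record. The following is one possible sketch using only the standard library; `TriageVerdict`, `parse_verdict`, and the assumption that the model writes a line such as "Priority: High" are all hypothetical, not part of the blueprint.

```python
import re
from dataclasses import dataclass


@dataclass
class TriageVerdict:
    cve_id: str
    priority: str  # "Critical", "High", "Low", "False Positive", or "Unknown"
    raw_text: str


def parse_verdict(cve_id: str, agent_output: str) -> TriageVerdict:
    """Extract the priority label from free-text agent output.

    Assumes the model emits something like 'Priority: High'. If no label
    is found, the verdict is marked 'Unknown' instead of guessing.
    """
    pattern = r"priority\s*[:\-]\s*(false positive|critical|high|low)"
    match = re.search(pattern, agent_output, flags=re.IGNORECASE)
    if match is None:
        return TriageVerdict(cve_id, "Unknown", agent_output)
    # Normalize casing so "HIGH" and "high" both become "High".
    return TriageVerdict(cve_id, match.group(1).title(), agent_output)


if __name__ == "__main__":
    text = "The affected function is never called. Priority: False Positive"
    print(parse_verdict("CVE-2024-1234", text).priority)
```

Falling back to "Unknown" keeps an unparseable answer visible for human review rather than silently dropping it into a default bucket.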

The system leverages parallel inference to analyze dozens of CVEs simultaneously. Instead of sequential API calls that would take minutes, the blueprint issues concurrent LLM requests, one per CVE, and aggregates the results. This is where the hardware recommendations become clear: NVIDIA suggests 8+ H100 GPUs for production workloads because you're serving many simultaneous requests to a 70B-parameter model. Spreading those analyses across GPUs turns what would be a 10-minute sequential task into a 15-second parallel operation.
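
The fan-out pattern described here can be sketched with `asyncio`. This is not the blueprint's code: `analyze_cve` is a hypothetical stand-in that sleeps instead of calling a model, and `max_concurrency` is an assumed knob for capping in-flight requests.

```python
import asyncio


async def analyze_cve(cve_id: str, delay_s: float = 0.05) -> dict:
    """Stand-in for one LLM-backed analysis; sleeps instead of calling a model."""
    await asyncio.sleep(delay_s)
    return {"cve_id": cve_id, "status": "analyzed"}


async def analyze_all(cve_ids: list[str], max_concurrency: int = 8) -> list[dict]:
    """Run one analysis per CVE, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(cve_id: str) -> dict:
        async with semaphore:
            return await analyze_cve(cve_id)

    # gather() returns results in the same order as the input list.
    return await asyncio.gather(*(bounded(c) for c in cve_ids))


if __name__ == "__main__":
    ids = [f"CVE-2024-{n:04d}" for n in range(1, 41)]
    results = asyncio.run(analyze_all(ids))
    print(len(results), "CVEs analyzed")
```

The semaphore matters in practice because an unbounded fan-out can overwhelm an inference endpoint. For scale, the figures quoted above (10 minutes sequential versus 15 seconds parallel) imply a speedup of about 40x, since 600 s / 15 s = 40.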

The RAG component is particularly sophisticated. Rather than chunking documents generically, the system indexes vulnerability data with domain-specific chunking strategies. CVE descriptions are split so that technical details about affected versions, attack vectors, and CVSS metrics stay together. When the LLM needs context, embedding-based retrieval returns not just the CVE itself, but related vulnerabilities in the same library, historical exploit data, and vendor patch notes:

# RAG indexing with domain-aware chunking
from nemo_agent.rag import RAGPipeline, Chunker

chunker = Chunker(
    strategy="semantic",
    metadata_fields=["cve_id", "affected_versions", "cvss_score"],
    preserve_technical_terms=True
)

rag = RAGPipeline(
    embedding_model="nvidia/nv-embed-v1",
    chunker=chunker,
    retrieval_top_k=10
)

# Index multiple vulnerability sources
rag.index_documents([
    {"source": "nvd", "data": nvd_database},
    {"source": "osv", "data": osv_database},
    {"source": "vendor_advisories", "data": vendor_data}
])
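
To show what top-k retrieval over indexed chunks amounts to, here is a self-contained toy version. It swaps the real embedding model for a bag-of-words vector so it runs without any service; the `embed`, `cosine`, and `retrieve` helpers and the placeholder `CVE-EXAMPLE-*` records are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:top_k]


# Placeholder records; the IDs and texts are invented for illustration.
CHUNKS = [
    {"cve_id": "CVE-EXAMPLE-1", "source": "nvd",
     "text": "buffer overflow in openssl certificate parsing"},
    {"cve_id": "CVE-EXAMPLE-2", "source": "osv",
     "text": "denial of service in http header parsing"},
    {"cve_id": "CVE-EXAMPLE-3", "source": "vendor_advisories",
     "text": "openssl patch notes for certificate handling"},
]

if __name__ == "__main__":
    for hit in retrieve("openssl certificate overflow", CHUNKS):
        print(hit["cve_id"], hit["source"])
```

Keeping metadata such as `cve_id` and `source` on each chunk is what lets the retrieved context be traced back to a specific advisory when the verdict is audited.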

Optional NGINX caching sits in front of the LLM APIs to avoid re-analyzing the same CVEs across multiple containers. If 50 of your containers all use OpenSSL 1.1.1k with the same CVE, the system caches the first analysis and returns instant results for subsequent queries. This caching layer is transparent to the agent workflow but critical for production efficiency.
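
The effect of that cache can be illustrated with an application-level analogue. The blueprint does this at the HTTP layer with NGINX; the sketch below is only an in-memory stand-in, and `AnalysisCache`, `fake_analysis`, and the choice of cache key are assumptions made for illustration.

```python
from typing import Callable


class AnalysisCache:
    """Memoizes analysis results keyed by (package, version, CVE ID)."""

    def __init__(self, analyze: Callable[[str, str, str], str]) -> None:
        self._analyze = analyze
        self._store: dict[tuple[str, str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, package: str, version: str, cve_id: str) -> str:
        key = (package, version, cve_id)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # First time this combination is seen: run the expensive analysis.
        self.misses += 1
        result = self._analyze(package, version, cve_id)
        self._store[key] = result
        return result


def fake_analysis(package: str, version: str, cve_id: str) -> str:
    """Stand-in for an expensive LLM call."""
    return f"{cve_id} in {package} {version}: needs review"


if __name__ == "__main__":
    cache = AnalysisCache(fake_analysis)
    for _ in range(50):  # 50 containers sharing the same package and CVE
        cache.get("openssl", "1.1.1k", "CVE-2024-1234")
    print(f"misses={cache.misses} hits={cache.hits}")
```

One design caution: a verdict that depends on container-specific context (environment variables, deployment specs) is only safe to reuse if that context is part of the cache key, so the key used here is the minimum, not necessarily the right one for every deployment.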

The evaluation framework is equally impressive. The blueprint includes built-in metrics to assess agent accuracy—comparing LLM recommendations against ground truth data from security researchers. You can define custom evaluators that check whether the agent correctly identified false positives or properly prioritized critical vulnerabilities. This feedback loop enables iterative prompt engineering and workflow refinement:

from nemo_agent.evaluation import Evaluator, Metric

evaluator = Evaluator(
    metrics=[
        Metric("accuracy", compare_to_ground_truth),
        Metric("false_positive_rate", measure_fp_rate),
        Metric("consistency", check_consistency_across_runs)
    ]
)

results = evaluator.evaluate(agent, test_dataset)
print(f"Accuracy: {results['accuracy']}")
print(f"FP Rate: {results['false_positive_rate']}")
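
The two headline metrics can be computed from labeled results with a few lines of plain Python. This sketch assumes each verdict is reduced to a boolean ("exploitable" or not) and compared against researcher ground truth; the function names and sample data are hypothetical, and the blueprint's own evaluators may define these metrics differently.

```python
def confusion_counts(predicted: list[bool], actual: list[bool]) -> dict[str, int]:
    """Count TP/FP/TN/FN, treating True ('exploitable') as the positive class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif not p and a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(c: dict[str, int]) -> float:
    """Fraction of all verdicts that match ground truth."""
    total = sum(c.values())
    return (c["tp"] + c["tn"]) / total if total else 0.0


def false_positive_rate(c: dict[str, int]) -> float:
    """Fraction of truly non-exploitable CVEs that were flagged as exploitable."""
    negatives = c["fp"] + c["tn"]
    return c["fp"] / negatives if negatives else 0.0


if __name__ == "__main__":
    agent_says = [True, True, False, False, True]      # invented sample verdicts
    researchers = [True, False, False, False, True]    # invented ground truth
    c = confusion_counts(agent_says, researchers)
    print(f"accuracy={accuracy(c):.2f} fp_rate={false_positive_rate(c):.2f}")
```

For triage, false negatives (`fn`, exploitable CVEs the agent dismissed) are usually the costlier error, so it is worth tracking that count alongside the false positive rate.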

Gotcha

The licensing and infrastructure requirements create significant adoption friction. You need an NVIDIA AI Enterprise developer license, API keys for multiple vulnerability databases, and either NVIDIA API credits for hosted NIMs or your own GPU infrastructure for self-hosted models. The blueprint assumes you're running on Linux with NVIDIA GPUs—macOS support is explicitly limited, and the self-hosted NIM option requires hardware that small teams simply don't have.

The production hardware recommendations reveal the real cost: 8+ H100 GPUs for optimal parallel performance. Even if you use NVIDIA's hosted API endpoints instead of self-hosting, you're paying per token for dozens of concurrent calls to a 70B-parameter model. For a small startup scanning a handful of containers, this is massive overkill. The blueprint is architected for enterprises dealing with hundreds of container images and thousands of CVEs per week, organizations where the cost of manual triage exceeds the cost of GPU infrastructure. If you're just trying to add some AI-powered prioritization to your personal project's vulnerability scanning, the authentication setup alone will take longer than manually researching your CVEs.

Verdict

Use if: You're an enterprise security team drowning in CVE alerts across dozens or hundreds of containerized applications, you already have NVIDIA GPU infrastructure or budget for hosted AI services, and you need to defensibly triage vulnerabilities with contextual analysis that's auditable and reproducible. This blueprint transforms vulnerability management from reactive firefighting into systematic, AI-accelerated risk assessment.

Skip if: You're a small team, lack NVIDIA hardware access, or manage fewer than 20-30 containers. In that case the licensing complexity, API costs, and infrastructure requirements far outweigh the benefits. For smaller scopes, you're better off with Grype plus manual review or commercial solutions like Snyk that handle the infrastructure complexity for you. This is a power tool for scale problems, not a lightweight developer utility.