Cog: How Replicate Solves CUDA Hell for ML Model Deployment


Hook

The Docker Compose creator and a former Spotify ML infrastructure engineer teamed up to solve a problem so painful that companies like Uber and Spotify built internal versions of the same tool independently.

Context

Shipping machine learning models to production remains one of the most friction-filled handoffs in software engineering. A researcher trains a model with specific CUDA, cuDNN, and Python versions. An engineer tries to containerize it and discovers incompatibilities between framework versions, CUDA versions, and base images. What should take an afternoon becomes a multi-day process of navigating compatibility matrices and Docker optimization strategies.

This pain is so universal that multiple companies built internal solutions. Uber’s Michelangelo PyML, Spotify’s custom ML deployment tools, and others all converged on Docker-based packaging systems. Cog is the open-source distillation of that pattern, created by Ben Firshman (Docker Compose) and Andreas Jansson (Spotify ML infrastructure). Rather than making researchers learn Dockerfile syntax and CUDA compatibility tables, Cog offers a declarative YAML configuration that generates optimized containers with correct dependencies, then wraps models in a production HTTP server.

Technical Insight

System architecture (auto-generated diagram, summarized): at build time, the cog.yaml config feeds a Go-based build system whose dependency resolver handles CUDA/Python compatibility and whose Docker image builder layers Nvidia base images into an optimized ML container. At runtime, an HTTP request reaches the Rust/Axum HTTP server, which performs type validation, invokes Predictor.predict in predict.py, and returns the inference result as a JSON response through an OpenAPI REST API.

Cog’s architecture splits cleanly into build-time dependency resolution and runtime prediction serving. At build time, you define your environment in cog.yaml with high-level constraints like GPU requirements, Python version, and dependencies. Cog translates this into a Docker image that uses Nvidia CUDA base images, installs the correct Python version, and efficiently caches dependencies.

The elegance shows in the configuration format. Instead of wrestling with Dockerfile syntax, you declare what you need:

build:
  gpu: true
  system_packages:
    - "libgl1-mesa-glx"
    - "libglib2.0-0"
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"

Cog reads this and selects compatible CUDA/cuDNN/PyTorch/TensorFlow/Python combinations based on your framework requirements. The tool knows which combinations are compatible and configures the base image accordingly, eliminating the “works on my machine” problem where local environments use different versions than production containers.
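To make the resolution step concrete, here is a minimal sketch of the kind of compatibility lookup described above. The table values and function names are illustrative assumptions, not Cog's actual internal data or code:

```python
# Hypothetical sketch of a framework-to-CUDA compatibility lookup.
# The version pairings below are illustrative, not Cog's real table.
TORCH_CUDA_COMPAT = {
    "2.1.0": ["11.8", "12.1"],
    "2.0.1": ["11.7", "11.8"],
    "1.13.1": ["11.6", "11.7"],
}

def pick_cuda_base_image(torch_version: str) -> str:
    """Choose the newest CUDA version known to pair with this torch build."""
    cudas = TORCH_CUDA_COMPAT.get(torch_version)
    if cudas is None:
        raise ValueError(f"no known CUDA pairing for torch {torch_version}")
    cuda = max(cudas)  # lexicographic max is fine for these two-part versions
    return f"nvidia/cuda:{cuda}-cudnn8-devel-ubuntu22.04"

print(pick_cuda_base_image("2.1.0"))
```

The point is that the user states only the framework requirement; the tool owns the mapping from that requirement to a known-good base image.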

The runtime architecture is where Cog differentiates itself from typical Flask-based ML serving. User models implement a simple Python interface with a Predictor class:

from cog import BasePredictor, Input, Path
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.model = torch.load("./weights.pth")

    def predict(self,
          image: Path = Input(description="Grayscale input image")
    ) -> Path:
        """Run a single prediction on the model"""
        # preprocess/postprocess are user-defined helpers, omitted here
        processed_image = preprocess(image)
        output = self.model(processed_image)
        return postprocess(output)

Cog wraps this Python class with a high-performance Rust/Axum HTTP server, not a Python web framework. The Rust server handles HTTP parsing, request queuing, and I/O, only crossing the language boundary to invoke your Python prediction function.

The type hints in the predict method signature aren’t just documentation—Cog uses them to automatically generate OpenAPI schemas and validate inputs. When you declare image: Path = Input(description="Grayscale input image"), Cog creates an API endpoint that accepts file uploads or URLs, validates the input is a path, and generates OpenAPI documentation describing that parameter. This means your API is self-documenting without writing separate schema files.
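As a rough sketch, the request body the generated endpoint accepts mirrors the predict signature: inputs are keyed by argument name under an "input" object (following the Replicate prediction API conventions noted later). The helper below is hypothetical, not part of Cog:

```python
import json

def make_prediction_request(image_url: str) -> dict:
    # Inputs are keyed by the predict() argument names ("image" here);
    # for a Path input, a URL or data URI stands in for the file.
    return {"input": {"image": image_url}}

payload = make_prediction_request("https://example.com/cat.png")
print(json.dumps(payload))
```

If the payload omits "image" or sends the wrong type, the server rejects it against the generated schema before your Python code ever runs.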

Deployment follows standard Docker workflows. Running cog build -t my-model produces a Docker image you can push to any registry and deploy anywhere containers run. The built image includes the Rust HTTP server listening on port 5000, so docker run -p 5000:5000 --gpus all my-model immediately exposes a production-ready prediction API. For local development, cog serve combines build and run steps, rebuilding the container and starting the server in one command—useful for rapid iteration without managing Docker commands manually.
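A minimal client sketch against such a container, assuming it was started with the docker run command above; the /predictions route and response shape follow the Replicate-style prediction API described here, so treat the details as assumptions rather than a definitive client:

```python
import json
import urllib.request

SERVER = "http://localhost:5000"  # matches `docker run -p 5000:5000` above

def predict(image_url: str, server: str = SERVER) -> dict:
    """POST one prediction to a running Cog container and return the JSON reply."""
    body = json.dumps({"input": {"image": image_url}}).encode()
    req = urllib.request.Request(
        f"{server}/predictions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the container running):
#   result = predict("https://example.com/input.png")
```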

Cog also provides cog predict for testing predictions locally without writing HTTP client code. You pass inputs as command-line flags, and Cog builds the container, runs the prediction, and returns outputs. This tight feedback loop during development mirrors production behavior since both use the same containerized environment, eliminating surprises when deploying.

Gotcha

Cog’s abstraction layer is its primary strength and its primary limitation. The tool optimizes for the common case: a single Python-based model with standard dependencies deployed as a Docker container. If your deployment pattern diverges from this, you’ll fight the abstractions. Complex custom builds or non-Python runtimes may require dropping down to raw Dockerfiles, at which point Cog provides less value.

Cog's integration with the Replicate ecosystem also subtly orients the tool toward that platform. Cog-built containers run anywhere Docker does, but the tool was created by the Replicate team, the documentation focuses heavily on Replicate deployment, and the HTTP API schema matches Replicate's prediction API conventions. If you're deploying to other platforms, you may need to write integration code. The tool isn't hostile to other destinations, but it's clearly optimized for one in particular.

Finally, Cog handles prediction serving well but doesn’t address broader MLOps concerns like model versioning, A/B testing, or monitoring. The tool makes a deliberate trade-off: simplicity and fast time-to-deployment over comprehensive feature coverage. For teams with sophisticated ML infrastructure needs, Cog becomes one component in a larger system rather than a complete solution.

Verdict

Use Cog if you’re a researcher or small ML team that needs to containerize models quickly without Docker expertise, especially when dealing with GPU dependencies or deploying to Replicate. It excels at eliminating the CUDA compatibility nightmare and provides production-ready containers with minimal configuration. The automatic OpenAPI schema generation and input validation save significant boilerplate compared to writing Flask servers manually. Skip it if you need fine-grained Docker control for complex builds, aren’t using Python-based models, or have invested heavily in alternative ML platforms. Also skip if your MLOps requirements extend beyond prediction serving to include experiment tracking, model registries, or advanced deployment patterns—in those cases, evaluate more comprehensive ML deployment platforms.
