Codel: Building a Self-Hosted Autonomous AI Agent with Docker Sandboxing
Hook
Most AI coding assistants wait for your next prompt. Codel decides what to do next, spins up Docker containers to execute it, and keeps going until your project is done—or until it breaks something in a sandbox you can easily discard.
Context
The release of Devin in early 2024 sparked a wave of interest in fully autonomous AI agents that could handle entire software projects end-to-end. Unlike traditional code completion tools that operate at the function or file level, autonomous agents decompose high-level goals into concrete tasks, execute them using real developer tools, and iterate based on outcomes. But Devin remained closed-source and expensive, which left developers who wanted to experiment with autonomous agents, or to keep their code and data private, without good options.
Codel emerged as an open-source answer to this gap. Built by semanser, it provides a self-hosted autonomous agent that can operate a terminal, browse the web for documentation, and edit files—all while running every action in isolated Docker containers. This architectural choice addresses one of the biggest concerns with autonomous AI: giving LLM-generated commands direct access to your system. When an agent decides to run rm -rf / or accidentally executes a malicious script it found online, you want that happening in a container you can destroy, not on your host machine.
Technical Insight
Codel's architecture centers on a decision loop that coordinates between an LLM reasoning engine and three containerized execution environments. The system is split between a TypeScript frontend that displays the agent's stream-of-consciousness decision-making and a Go backend that manages container orchestration, LLM interactions, and state persistence.
The core workflow starts when a user submits a task like "Build a REST API for a todo app with PostgreSQL." The agent enters a reasoning loop where it queries the configured LLM (OpenAI GPT-4 or a local Ollama model) with the task description, available tools (terminal, browser, editor), and the complete history of previous actions and their outputs. The LLM responds with a JSON structure specifying which tool to use and what command to execute:
{
  "tool": "terminal",
  "action": "execute",
  "command": "mkdir todo-api && cd todo-api && npm init -y",
  "reasoning": "First, I need to create a project directory and initialize a Node.js project"
}
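On the backend side, a reply like this has to be decoded and checked before anything runs. The following is a minimal Go sketch of that step, not Codel's actual code: the ToolCall type, the ParseToolCall function, and the validation rules are assumptions made for illustration. Only the four JSON field names and the three tool names come from the article.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ToolCall mirrors the JSON shape shown above. The type and its Go
// field names are illustrative, not taken from Codel's source.
type ToolCall struct {
	Tool      string `json:"tool"`
	Action    string `json:"action"`
	Command   string `json:"command"`
	Reasoning string `json:"reasoning"`
}

// knownTools lists the three tools the article describes.
var knownTools = map[string]bool{
	"terminal": true,
	"browser":  true,
	"editor":   true,
}

// ParseToolCall decodes one LLM reply and rejects replies the loop
// could not dispatch: malformed JSON, unknown tools, empty commands.
func ParseToolCall(raw string) (ToolCall, error) {
	var call ToolCall
	if err := json.Unmarshal([]byte(raw), &call); err != nil {
		return ToolCall{}, fmt.Errorf("decode tool call: %w", err)
	}
	if !knownTools[call.Tool] {
		return ToolCall{}, fmt.Errorf("unknown tool %q", call.Tool)
	}
	if strings.TrimSpace(call.Command) == "" {
		return ToolCall{}, fmt.Errorf("empty command for tool %q", call.Tool)
	}
	return call, nil
}
```

Validating at this boundary means a malformed or hallucinated reply turns into an error the loop can feed back to the model, rather than a command that reaches a container.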
When the agent selects the terminal tool, Codel's Go backend spins up a Docker container with an appropriate base image. The system doesn't use a single fixed container; it picks an image that suits the task at hand. For the todo API example, it might start with node:18-alpine, and if the agent later decides it needs Python for a data processing script, it can switch to a python:3.11 container. Each container mounts a shared workspace volume, so files persist across tool switches.
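To make the shared-workspace idea concrete, here is a hedged Go sketch that builds the arguments for a docker run invocation. It uses the Docker CLI only for brevity and is not how Codel is known to talk to Docker; SandboxSpec, DockerRunArgs, and the /workspace mount point are hypothetical names chosen for this example.

```go
package main

// SandboxSpec describes one tool invocation. All names are hypothetical.
type SandboxSpec struct {
	Image     string // base image, e.g. "node:18-alpine" or "python:3.11"
	Workspace string // named Docker volume shared by every container
	Command   string // shell command chosen by the agent
}

// DockerRunArgs builds the argument list for `docker run`. The container
// is removed on exit (--rm); only the workspace volume outlives it.
func DockerRunArgs(s SandboxSpec) []string {
	return []string{
		"run", "--rm",
		"-v", s.Workspace + ":/workspace", // same volume for every image
		"-w", "/workspace", // start in the shared directory
		s.Image,
		"sh", "-c", s.Command,
	}
}
```

The resulting slice could be passed to exec.Command("docker", args...). Because only the volume persists, switching from a Node image to a Python image keeps the project files but not shell state such as the working directory set by an earlier cd.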
The browser tool uses go-rod, a Go library that drives a browser over the Chrome DevTools Protocol, to perform actual web automation. When the agent needs to research how to set up PostgreSQL with Node.js, it can navigate to documentation sites, extract code examples, and parse tutorial content. Compared with embedding-based retrieval over a pre-built index, this gives the agent access to current information and lets it interact with dynamic content:
// Simplified browser tool implementation
import "github.com/go-rod/rod"

func (b *BrowserTool) Navigate(url string) (content string, err error) {
	// rod's Must* helpers panic on failure; rod.Try turns that panic into an error.
	err = rod.Try(func() {
		page := b.browser.MustPage(url)
		page.MustWaitLoad()
		// Extract text content for LLM processing
		content = page.MustElement("body").MustText()
		// Take screenshot for visual verification
		b.saveScreenshot(page.MustScreenshot())
	})
	return content, err
}
All command executions and their outputs are stored in PostgreSQL with full provenance tracking. This creates an audit trail showing exactly how the agent approached the problem, which is invaluable when debugging why it made specific decisions or when trying to reproduce results. The database schema captures the timestamp, tool used, input command, stdout/stderr output, and exit codes.
The editor tool operates by mounting the workspace directory and using standard file I/O operations. When the agent needs to create or modify files, it constructs the complete file content and writes it atomically. This is more reliable than trying to apply diffs or patches, which can fail if the LLM hallucinates the current file state.
Codel's security model revolves around the Docker sandbox barrier. However, it requires mounting the Docker socket (/var/run/docker.sock) into the Codel container, which effectively grants that container root-equivalent access to the host system. The threat model assumes that the LLM itself isn't malicious, but that it might generate commands with unintended consequences. Containerization limits that kind of accidental damage, but it is not a hard security boundary: anyone who gains control of the Codel backend, which holds the socket, could start a privileged container or bind-mount the host filesystem and reach the host that way. This is acceptable for local development but problematic for multi-tenant deployments.
Gotcha
The Docker socket requirement is Codel's biggest operational constraint. Mounting /var/run/docker.sock lets the Codel container spawn sibling containers through the host's Docker daemon, which is equivalent to root access on the host. This works fine on a developer's local machine but creates serious security concerns in shared environments, CI/CD pipelines, or Kubernetes clusters, where socket mounts and Docker-in-Docker patterns are often restricted by policy. There's no easy workaround, because the entire architecture depends on dynamic container creation.
The autonomous execution loop also lacks sophisticated cost controls or safety guardrails. While running locally with Ollama eliminates per-token costs, using OpenAI's API for a complex task can rack up significant charges if the agent gets stuck in an unproductive loop. The system doesn't have built-in circuit breakers for maximum iterations, token budget limits, or human-in-the-loop approval for destructive commands. You're trusting the LLM to eventually reach a conclusion or explicitly tell you it's stuck. In practice, this means you need to monitor the web UI actively rather than kicking off a task and walking away. The documentation doesn't provide guidance on typical task completion times or token consumption patterns, making it difficult to budget for production use.
Verdict
Use if: You want to experiment with autonomous AI agents in a controlled local environment, need full data privacy with self-hosted LLMs via Ollama, or are researching agent architectures and need complete observability into decision-making processes. It's excellent for personal projects where you can monitor execution and don't mind occasionally nuking containers that went sideways.
Skip if: You need production-grade reliability, are deploying in environments with Docker socket restrictions (most corporate Kubernetes clusters), require strict cost controls for LLM API usage, or want enterprise compliance features. The active supervision requirement also makes it unsuitable for truly hands-off automation scenarios where you expect the agent to complete multi-hour tasks independently.