Inside HGM: A Self-Rewriting AI That Improves Its Own Code

Hook

What happens when an AI agent can modify not just its outputs, but the source code that defines how it thinks? HGM doesn't just solve coding problems—it rewrites itself to get better at solving them.

Context

Jürgen Schmidhuber proposed the Gödel Machine in 2003 as a theoretical framework for optimal self-improvement: an AI that can formally prove modifications to its own code will lead to better performance, then apply those modifications. For two decades, this remained largely theoretical—the computational complexity of proving arbitrary code improvements was intractable, and the framework predated modern large language models.

The Huxley-Gödel Machine (HGM) bridges this gap by combining Schmidhuber's theoretical foundation with contemporary AI capabilities. Instead of formal proofs, it uses LLMs to generate plausible code modifications and tree search with empirical evaluation to discover improvements. The system tackles real benchmarks like SWE-bench (solving GitHub issues) and Polyglot (multi-language coding challenges), where the agent doesn't just generate solutions but rewrites the very algorithms it uses to generate solutions. This meta-level approach has already shown promise: the work was accepted as an oral presentation at ICLR 2026, which suggests reviewers found it competitive with traditional coding agents that lack self-modification capabilities.

Technical Insight

HGM's architecture centers on a self-improvement loop that treats code modifications as a tree search problem. The agent starts with base implementation code for solving tasks (like generating patches for GitHub issues). It then uses an LLM to propose modifications to this code—not the task solutions, but the meta-level code that produces those solutions. Each modification creates a branch in the search tree, and the system must decide which branches are worth exploring.
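The search tree described above can be sketched as a small data structure. This is a minimal illustration and not HGM's actual code: the `AgentNode` class, its fields, and the `expand` helper are hypothetical names, and `fake_propose` stands in for the LLM call that would produce a real rewrite.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AgentNode:
    """One agent version in the search tree (hypothetical structure)."""
    source: str                                  # full source of this agent version
    parent: Optional["AgentNode"] = None
    children: List["AgentNode"] = field(default_factory=list)
    score: Optional[float] = None                # filled in after evaluation

    def depth(self) -> int:
        """Number of self-modifications separating this node from the root."""
        return 0 if self.parent is None else self.parent.depth() + 1


def expand(node: AgentNode, propose: Callable[[str], str]) -> AgentNode:
    """Create a child whose source is the proposed rewrite of the parent's source."""
    child = AgentNode(source=propose(node.source), parent=node)
    node.children.append(child)
    return child


def fake_propose(source: str) -> str:
    """Stand-in for an LLM call: appends a comment instead of a real rewrite."""
    return source + "\n# modified"


root = AgentNode(source="def solve(task): ...")
child = expand(root, fake_propose)
grandchild = expand(child, fake_propose)
print(grandchild.depth())
```

Each call to `expand` adds one branch, so deciding which node to pass in is the "which branches are worth exploring" question.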

The key innovation is clade-based evaluation. Rather than testing every individual code modification, HGM groups related modifications into "clades" (borrowing terminology from evolutionary biology, where a clade is an ancestor together with all of its descendants). It samples a subset of modifications within a clade, evaluates their performance on benchmark tasks, and uses these results to estimate the promise of the entire clade. This sharply reduces the number of evaluations the search requires:

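# Simplified clade evaluation sketch.
# sample_modifications, apply_modification, and run_in_docker are placeholder
# callables supplied by the caller; they are not defined here.
from statistics import mean, pvariance

def evaluate_clade(clade_root, base_agent, benchmark_tasks,
                   sample_modifications, apply_modification, run_in_docker,
                   sample_size=5):
    # Sample a subset of modifications in this clade
    modifications = sample_modifications(clade_root, sample_size)
    if not modifications:
        raise ValueError("clade has no modifications to evaluate")

    performances = []
    for mod in modifications:
        # Apply the modification, then execute the modified agent in Docker isolation
        agent_code = apply_modification(base_agent, mod)
        score = run_in_docker(agent_code, benchmark_tasks)
        performances.append(score)

    # Estimate clade promise from the samples: mean score and spread
    clade_score = mean(performances)
    clade_variance = pvariance(performances)

    # The caller expands high-scoring, low-variance clades
    return clade_score, clade_variance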
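The text above does not say how a clade's score and variance are combined when choosing what to expand. One simple possibility, assumed here purely for illustration, is to rank clades by mean score minus a multiple of the standard deviation. The `pick_clade` function, the `penalty` weight, and the example numbers are all hypothetical.

```python
import math
from typing import Dict, Tuple


def pick_clade(stats: Dict[str, Tuple[float, float]], penalty: float = 1.0) -> str:
    """Return the clade id with the best mean-minus-spread value.

    stats maps clade id -> (mean score, variance). The rule and the penalty
    weight are assumptions, not HGM's documented selection policy.
    """
    def lower_bound(item):
        _, (mean_score, variance) = item
        return mean_score - penalty * math.sqrt(variance)

    best_id, _ = max(stats.items(), key=lower_bound)
    return best_id


clades = {
    "A": (0.40, 0.0004),   # 0.40 - 0.02 = 0.38
    "B": (0.45, 0.0400),   # 0.45 - 0.20 = 0.25
    "C": (0.30, 0.0001),   # 0.30 - 0.01 = 0.29
}
choice = pick_clade(clades)
print(choice)
```

With the penalty set to zero the rule reduces to picking the highest mean, which would favor the noisier clade B in this toy data.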

The system inherits Docker-based isolation from its Darwin-Gödel Machine predecessor, executing all self-modified code in containers to prevent destructive behavior. When the agent rewrites its own logic—say, modifying how it parses test cases or generates code patches—that new version runs in a fresh Docker environment with limited filesystem access and network restrictions.
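To make the restrictions above concrete, here is a sketch of how a restricted `docker run` invocation could be assembled. The flags are standard Docker CLI options, but the image name, paths, and limit values are placeholders chosen for illustration, and nothing here is taken from HGM's own container setup.

```python
from typing import List


def docker_run_args(image: str, workdir: str, command: List[str]) -> List[str]:
    """Build a `docker run` argument list with conservative limits."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                 # no network access
        "--read-only",                       # root filesystem is read-only
        "--memory", "2g",                    # cap RAM (example value)
        "--cpus", "1",                       # cap CPU usage (example value)
        "--pids-limit", "256",               # limit the number of processes
        "--volume", f"{workdir}:/work:rw",   # only this directory is writable
        "--workdir", "/work",
        image, *command,
    ]


# Hypothetical image name and host path.
args = docker_run_args("hgm-agent:latest", "/tmp/agent_v2", ["python", "agent.py"])
print(" ".join(args))
```

The list can be passed to `subprocess.run(args, timeout=...)` on a machine with Docker installed. Building the argument list separately keeps the limits easy to inspect and test.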

The modification generation process itself is fascinating. HGM prompts an LLM (typically GPT-4) with the current agent code and performance metrics on recent tasks, then asks for specific improvements. The prompts are structured to encourage algorithmic changes rather than just parameter tuning:

# Example modification prompt structure (illustrative)
def build_modification_prompt(agent_source_code, performance_metrics, failure_modes):
    """Assemble the prompt; failure_modes is the output of a failure analysis step."""
    return f"""
Current agent code:
{agent_source_code}

Performance on last 10 tasks: {performance_metrics}
Common failure modes: {failure_modes}

Propose a modification to improve the agent's ability to:
1. Parse complex test requirements
2. Generate correct patches on first attempt
3. Handle edge cases in multi-file changes

Provide the modified code section with explanation.
"""

The system then applies these modifications, runs the new agent version on validation tasks, and updates its search tree accordingly. Successful modifications become the new baseline for future iterations. Over multiple rounds of self-improvement, HGM can discover non-obvious algorithmic enhancements—like better heuristics for selecting which files to modify first, or improved strategies for breaking down complex requirements.
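The "successful modifications become the new baseline" step can be illustrated with a deliberately simplified greedy loop. This is a sketch under strong assumptions: HGM searches a tree rather than following a single greedy path, and `improve`, `toy_propose`, and `toy_evaluate` are hypothetical stand-ins for LLM-driven rewriting and benchmark evaluation.

```python
import random
from typing import Callable, List, Tuple


def improve(baseline: str,
            propose: Callable[[str], str],
            evaluate: Callable[[str], float],
            rounds: int) -> Tuple[str, List[float]]:
    """Greedy loop: adopt a candidate only if it beats the current baseline."""
    best_score = evaluate(baseline)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose(baseline)
        score = evaluate(candidate)
        if score > best_score:
            # A successful modification becomes the new baseline.
            baseline, best_score = candidate, score
        history.append(best_score)
    return baseline, history


rng = random.Random(0)


def toy_propose(source: str) -> str:
    """Stand-in for an LLM rewrite: appends a random character."""
    return source + rng.choice("xy")


def toy_evaluate(source: str) -> float:
    """Stand-in for a benchmark run: rewards each 'x' in the source."""
    return source.count("x") / 10


final, history = improve("", toy_propose, toy_evaluate, rounds=20)
print(final, history[-1])
```

Because rejected candidates are discarded, the best score never decreases from one round to the next. A tree search relaxes this by keeping weaker branches available for later expansion.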

Crucially, HGM operates at the level of Python source code, not neural network weights. This makes modifications interpretable: you can diff the agent before and after improvement to see exactly what changed. It also means improvements can be discrete algorithmic insights ("try simpler solutions before complex ones") rather than continuous optimization over millions of weights. The trade-off is that the search space is discrete and vast, making the clade-based pruning essential for tractability.
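Diffing two agent versions needs nothing beyond Python's standard `difflib` module. The two source snippets and the file names below are invented for illustration; they mimic a "try simpler solutions before complex ones" change and are not taken from HGM.

```python
import difflib

before = """def solve(task):
    return complex_strategy(task)
"""

after = """def solve(task):
    result = simple_strategy(task)
    if result is None:
        result = complex_strategy(task)
    return result
"""

# unified_diff yields the header lines first, then the hunks.
diff = list(difflib.unified_diff(
    before.splitlines(keepends=True),
    after.splitlines(keepends=True),
    fromfile="agent_v1.py",
    tofile="agent_v2.py",
))
print("".join(diff))
```

The output is an ordinary unified diff, so it can be reviewed with the same tools used for any code change.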

The benchmark integration deserves attention. For SWE-bench, HGM doesn't just run once per issue—it uses its self-modified code to generate solutions, evaluates success rates, and feeds that performance back into the modification search. The agent might discover, for instance, that issues with failing CI tests require a different patch generation strategy than issues with feature requests, then rewrite itself to handle that distinction.
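One way such a distinction could surface is by breaking success rates down per issue category. The sketch below assumes each result is labeled with a category, which is an assumption rather than something the benchmark is described as providing; the category names and outcomes are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of solved issues per category."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        solved[category] += int(ok)
    return {category: solved[category] / total[category] for category in total}


# Hypothetical outcomes: (issue category, solved?)
results = [
    ("failing_ci", True), ("failing_ci", False),
    ("failing_ci", False), ("failing_ci", False),
    ("feature_request", True), ("feature_request", True),
]
rates = success_by_category(results)
weakest = min(rates, key=rates.get)
print(rates, weakest)
```

A summary like this could be fed into the next modification prompt as the "common failure modes" input, pointing the LLM at the weakest category.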

Gotcha

The README's safety warnings aren't hyperbole—running HGM means executing arbitrary AI-generated code that modifies itself. Even with Docker isolation, the system can waste resources, enter infinite loops, or produce subtly broken code that passes initial validation but fails on edge cases. The Docker setup is mandatory and non-trivial: you need proper container configuration, volume mounts for benchmark data, and careful resource limits to prevent runaway processes. If you're not comfortable debugging containerized Python environments and monitoring resource usage, you'll struggle.
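Infinite loops are the easiest of these failure modes to guard against. The sketch below uses a plain child process with a wall-clock timeout; it is not a sandbox and does not replace Docker isolation. `run_with_timeout` is a hypothetical helper, not part of HGM.

```python
import subprocess
import sys


def run_with_timeout(code: str, seconds: float) -> str:
    """Run Python code in a child process; report 'timeout' if it overruns."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "timeout"
    return "ok" if done.returncode == 0 else "error"


print(run_with_timeout("print('hello')", 10))
print(run_with_timeout("while True: pass", 1))
```

The same `timeout` argument works when the command being launched is a `docker run` invocation. Container-level CPU and memory limits are still needed on top of it.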

The infrastructure requirements extend beyond Docker. SWE-bench requires cloning and setting up numerous open-source repositories with their specific Python environments and dependencies. Polyglot needs multiple language runtimes configured. OpenAI API costs add up quickly: each modification generation and evaluation cycle hits the API multiple times, and tree search means exploring many branches. The documentation provides setup scripts but assumes significant familiarity with these systems. There's no simple "pip install hgm" experience here.

Moreover, the self-improvement process is opaque and non-deterministic. You might run HGM for hours without meaningful improvements, then suddenly get a breakthrough modification. Debugging why certain clades underperform requires understanding both the LLM's code generation patterns and the specific benchmark failure modes.
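A back-of-the-envelope estimate shows how API costs scale with the size of the search. Every number below is a placeholder chosen to make the arithmetic easy to follow: none is a measured HGM figure or a real provider price, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(nodes: int, calls_per_node: int, tokens_per_call: int,
                  usd_per_million_tokens: float) -> float:
    """Rough API spend: total tokens across all calls times the per-token price."""
    total_tokens = nodes * calls_per_node * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: 200 tree nodes, 50 API calls per node,
# 8,000 tokens per call, 10 USD per million tokens.
cost = estimate_cost(nodes=200, calls_per_node=50,
                     tokens_per_call=8_000, usd_per_million_tokens=10.0)
print(f"${cost:,.2f}")  # 80 million tokens at the placeholder price
```

The estimate is linear in every input, so doubling the number of explored branches doubles the bill. That makes a per-run budget worth setting before starting a long search.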

Verdict

Use HGM if: you're researching meta-learning and self-improving AI systems, have access to substantial compute resources and OpenAI API credits, are comfortable with Docker-based isolation for untrusted code execution, and want to explore the frontier of AI agents that modify their own algorithms rather than just their outputs. This is a research artifact that pushes theoretical concepts into practice: valuable for understanding how self-improvement might work at scale, and potentially competitive on coding benchmarks if you can tune it properly.

Skip if: you need a production-ready coding assistant, lack the infrastructure for safe execution of self-modifying code, want something you can integrate into existing workflows without extensive setup, or prefer predictable, debuggable tool behavior. HGM is explicitly experimental software with meaningful safety considerations. For actual development tasks, traditional coding agents like Aider or Cursor provide better user experience without the complexity of self-modification.
