OpenMythos: Recurrent Transformers and the Quest for Depth-Variable Reasoning

Hook

What if a language model could 'think' for 10 iterations on a simple question and 100 iterations on a complex proof—all within a single forward pass, without generating a single intermediate token?

Context

Chain-of-thought prompting transformed LLM capabilities by forcing models to articulate reasoning steps as tokens. But this approach has inherent inefficiencies: every reasoning step consumes context window space, generates billable tokens, and exposes intermediate logic that may not need external representation. The standard transformer architecture processes inputs with fixed computational depth—a 32-layer model always applies exactly 32 layers, regardless of whether you're asking for a capital city or requesting a mathematical proof.

OpenMythos emerges from speculation about Anthropic's rumored 'Mythos' architecture, which allegedly enables Claude models to perform variable-depth reasoning within latent space. Rather than generating explicit chain-of-thought tokens, the theory suggests Claude might loop internal representations through transformer blocks multiple times, effectively 'thinking harder' on difficult problems without external token generation. While Anthropic has never confirmed this architecture, the concept is compelling enough that kyegomez built a complete theoretical reconstruction from first principles, combining research on Universal Transformers, recurrent neural networks, and modern efficiency techniques into a single experimental codebase.

Technical Insight

The core innovation in OpenMythos is the Recurrent-Depth Transformer (RDT) architecture, which splits the model into three distinct stages. The Prelude consists of standard transformer layers that run exactly once, establishing initial representations. The Recurrent Block contains transformer layers that loop up to max_loop_iters times, with each iteration refining the hidden state through a learned recurrence formula. Finally, the Coda applies final transformer layers to produce output tokens. This structure allows the model to invest different amounts of computation based on the problem difficulty, conceptually similar to how humans might ponder a question for varying durations.

The recurrent mechanism uses a critical stability formula: h_{t+1} = A·h_t + B·e + Transformer(h_t, e), where h_t represents the hidden state at iteration t, e is the original input embedding, and A and B are learned parameter matrices. The A·h_t term carries forward the refined representation, while B·e continuously injects the original input to prevent the recurrence from drifting away from the problem context. This design addresses a fundamental challenge in recurrent architectures: maintaining gradient flow and preventing either vanishing or exploding gradients across loop iterations.
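
As a rough illustration of the update rule above, the following sketch iterates h_{t+1} = A·h_t + B·e + f(h_t, e) on a two-dimensional toy state. Everything concrete here is an assumption made for illustration: `toy_block` is a hypothetical bounded stand-in for the transformer term, the values of `A` and `B` are hand-picked rather than learned, and the loop starts from h_0 = e where the real model would start from the Prelude output. It is not taken from the OpenMythos codebase.

```python
import math

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def toy_block(h, e):
    """Hypothetical stand-in for Transformer(h_t, e): a small bounded nonlinearity."""
    return [0.1 * math.tanh(hi + ei) for hi, ei in zip(h, e)]

def recur(A, B, e, n_iters):
    """Iterate h_{t+1} = A·h_t + B·e + block(h_t, e), starting from h_0 = e (assumption)."""
    h = list(e)
    for _ in range(n_iters):
        carried = matvec(A, h)    # A·h_t: carry the refined state forward
        injected = matvec(B, e)   # B·e: re-inject the original input every step
        update = toy_block(h, e)  # nonlinear refinement term
        h = [c + i + u for c, i, u in zip(carried, injected, update)]
    return h

# Hand-picked (not learned) matrices; A is contractive, so the state settles.
A = [[0.5, 0.0], [0.0, 0.5]]
B = [[0.5, 0.0], [0.0, 0.5]]
e = [1.0, -2.0]

h16 = recur(A, B, e, 16)
h64 = recur(A, B, e, 64)
print(h16, h64)  # the two states should be nearly identical
```

With a contractive A, running 16 or 64 iterations lands on almost the same state, which is the behavior the stability constraint is meant to encourage; the B·e term keeps that settled state tied to the input rather than decaying to zero.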

Here's how the recurrent block is instantiated in OpenMythos:

from openmythos import RecurrentDepthTransformer

model = RecurrentDepthTransformer(
    vocab_size=50304,
    d_model=2048,
    n_heads=16,
    prelude_layers=8,      # Initial processing depth
    recurrent_layers=12,   # Layers that loop
    coda_layers=8,         # Final processing depth
    max_loop_iters=16,     # Maximum recurrence depth
    num_experts=32,        # MoE expert count
    num_shared_experts=2,  # Always-active experts
    attention_type="mla",  # Multi-Latent Attention
    spectral_radius_target=0.95  # Stability constraint
)

# The model automatically determines loop iterations
# based on input complexity (in theory).
# input_ids is a tensor of token IDs prepared by the caller.
output = model(input_ids, max_new_tokens=100)

The spectral radius monitoring (spectral_radius_target=0.95) is particularly clever. The spectral radius ρ(A) is the largest absolute value among the eigenvalues of A, and keeping ρ(A) < 1 ensures that the linear part of the recurrence contracts rather than grows. If ρ(A) > 1, repeated application of A·h_t can make representations grow exponentially; at ρ(A) = 1 the carried state no longer decays, so a target of 0.95 leaves a safety margin. OpenMythos includes monitoring hooks that track this value during training, allowing researchers to observe whether the learned recurrence maintains stability or requires regularization.
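
To make the stability condition concrete, here is a small sketch that computes the spectral radius of a 2x2 matrix in closed form and applies the matrix repeatedly. It looks only at the linear A·h_t term and ignores the transformer and input-injection terms. The helper names and the scaled-rotation test matrices are assumptions chosen for illustration (a scaled rotation has complex eigenvalues whose magnitude equals the scale factor); the actual OpenMythos monitoring hooks are not shown here.

```python
import cmath
import math

def spectral_radius_2x2(m):
    """Largest eigenvalue magnitude of a 2x2 matrix, via the characteristic polynomial."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)  # may be complex
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))

def scaled_rotation(scale, angle):
    """Rotation by `angle` radians, scaled by `scale`; eigenvalues have magnitude `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * c, -scale * s], [scale * s, scale * c]]

def norm_after(m, v, n_iters):
    """Euclidean norm of v after applying m to it n_iters times."""
    for _ in range(n_iters):
        v = [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]
    return math.hypot(v[0], v[1])

stable = scaled_rotation(0.95, 0.3)    # rho = 0.95, below 1
unstable = scaled_rotation(1.05, 0.3)  # rho = 1.05, above 1

rho_stable = spectral_radius_2x2(stable)
rho_unstable = spectral_radius_2x2(unstable)
shrunk = norm_after(stable, [1.0, 0.0], 100)   # decays toward zero
grown = norm_after(unstable, [1.0, 0.0], 100)  # grows by orders of magnitude
print(rho_stable, rho_unstable, shrunk, grown)
```

Neither test matrix has a real eigenvalue, which is why the definition has to use eigenvalue magnitudes rather than "the largest eigenvalue".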

The architecture also integrates Multi-Latent Attention (MLA, more commonly written Multi-head Latent Attention), an alternative to standard multi-head attention that compresses key-value representations using LoRA-style low-rank projections. Instead of caching full keys and values for each attention head, MLA caches a lower-dimensional latent vector per token and reconstructs keys and values from it when needed. This becomes crucial in the recurrent setting: when you're looping 16 times through 12 transformer layers, KV cache size multiplies rapidly. MLA provides a memory-efficient alternative while maintaining representational capacity.
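
A back-of-the-envelope calculation shows why the compression matters. The sketch below counts cached values per token for the configuration shown earlier (d_model=2048, n_heads=16, 12 recurrent layers, 16 loops). Two things are assumptions, not facts about OpenMythos: the latent width of 512 is a hypothetical value, and the calculation assumes each loop iteration keeps its own cache entries. It also simplifies latent attention to one cached vector per layer, leaving out any extra positional components a real implementation may store.

```python
def full_kv_values_per_token(n_layer_passes, n_heads, head_dim):
    """Standard attention caches one key and one value vector per head per layer pass."""
    return 2 * n_layer_passes * n_heads * head_dim

def latent_values_per_token(n_layer_passes, latent_dim):
    """Latent-compressed attention caches one shared low-rank vector per layer pass."""
    return n_layer_passes * latent_dim

d_model, n_heads = 2048, 16
head_dim = d_model // n_heads             # 128
recurrent_layers, max_loop_iters = 12, 16
latent_dim = 512                          # hypothetical latent width

# Assumption: every loop iteration keeps separate cache entries.
layer_passes = recurrent_layers * max_loop_iters  # 192

full_cache = full_kv_values_per_token(layer_passes, n_heads, head_dim)
latent_cache = latent_values_per_token(layer_passes, latent_dim)
ratio = full_cache / latent_cache

print(full_cache, latent_cache, ratio)  # 786432 98304 8.0
```

Under these assumptions the latent cache is 8x smaller per token; the actual saving depends entirely on the latent width the model uses.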

The Mixture-of-Experts (MoE) implementation distinguishes between routed experts (selected dynamically per token) and shared experts (always active). This hybrid approach ensures that critical general knowledge remains accessible to every token while allowing specialized experts to activate for domain-specific processing. In the context of recurrent loops, this means each iteration can route to different experts as the representation refines, theoretically allowing the model to apply different 'cognitive strategies' at different reasoning depths.
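
The routed-plus-shared pattern can be sketched for a single token as follows. This is a minimal scalar illustration, not the OpenMythos implementation: the experts are hypothetical one-line functions, `top_k=2` is an assumed setting, and renormalizing the weights of the selected experts is one common convention rather than a confirmed detail of this codebase.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def moe_forward(x, router_logits, routed_experts, shared_experts, top_k=2):
    """Combine always-active shared experts with the top-k routed experts for one token."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    routed = sum(probs[i] / weight_sum * routed_experts[i](x) for i in chosen)
    shared = sum(expert(x) for expert in shared_experts)  # no routing decision involved
    return shared + routed, sorted(chosen)

# Hypothetical scalar "experts": four routed multipliers and one shared offset.
routed_experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]
shared_experts = [lambda x: x + 1.0]

# The same token with different router logits, as might happen on different loop iterations.
y_first, picked_first = moe_forward(2.0, [0.0, 2.0, 0.0, 2.0], routed_experts, shared_experts)
y_later, picked_later = moe_forward(2.0, [3.0, 0.0, 3.0, 0.0], routed_experts, shared_experts)
print(picked_first, y_first, picked_later, y_later)
```

The shared expert contributes on both calls, while the routed contribution changes as the router logits change, which mirrors the idea that successive iterations could send the same token to different experts.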

OpenMythos provides pre-configured model variants scaling from 1B to 1T parameters:

from openmythos.configs import get_model_config

# 7B parameter variant optimized for research
config_7b = get_model_config("7B")
print(f"Recurrent layers: {config_7b.recurrent_layers}")
print(f"Max loop iterations: {config_7b.max_loop_iters}")
print(f"Experts: {config_7b.num_experts}")

# 70B parameter variant with extended context
config_70b = get_model_config("70B")
print(f"Context length: {config_70b.max_seq_len}")
# Output: Context length: 32768

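These configurations are presented as following Chinchilla-style scaling, adjusting not just parameter counts but also loop iterations, expert counts, and context lengths as model size grows. Chinchilla's result itself concerns the balance between parameters and training tokens, so the loop and expert settings are better read as design choices than as derived optima. The 1B model loops up to 8 times, while the 1T variant loops up to 32 times, reflecting the hypothesis that larger models can benefit from deeper recurrent reasoning.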

The implementation quality is production-grade in structure: clean abstractions, type hints, and modular attention mechanisms that swap between MLA and Grouped Query Attention with a single parameter change. However, this is explicitly a research artifact. There are no training scripts, no published benchmarks, and no evidence that training this architecture converges to useful behavior. It's a beautifully engineered hypothesis waiting for empirical validation.

Gotcha

The elephant in the room: this is theoretical reverse-engineering with zero affiliation to Anthropic, and the actual Claude architecture may bear no resemblance to this implementation. The name 'Mythos' itself comes from unconfirmed speculation, and Anthropic has never publicly validated any details of this design. You're working with an educated guess, not a blueprint.

Computational costs multiply brutally with loop iterations. A model configured for 16 recurrent loops through 12 layers applies 192 layer operations in the recurrent block per forward pass, plus the prelude and coda layers. For the recurrent block alone, that is roughly 16x the training compute of a single pass through the same 12 layers. Inference latency suffers proportionally, because no token can be emitted until all loop iterations complete. The theoretical benefit of 'thinking harder' on difficult problems only materializes if the architecture actually learns to use that capacity effectively, which remains unproven. Without adaptive halting mechanisms (where the model learns to exit loops early for simple inputs), you pay the maximum compute cost on every forward pass regardless of input complexity.
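
The arithmetic above, and what early exit would save, can be sketched in a few lines. The layer counts come from the example configuration earlier in the article. The halting rule is an assumption swapped in for illustration: it stops when the state changes by less than a tolerance, which is a simple convergence heuristic and not a learned halting mechanism; the toy `step` function is likewise hypothetical.

```python
def layer_applications(prelude, recurrent, coda, loops):
    """Total layer applications in one forward pass: prelude once, recurrent block looped, coda once."""
    return prelude + recurrent * loops + coda

def loops_until_settled(step, h, max_loops, tol):
    """Run `step` until the state changes by less than `tol`, or max_loops is reached."""
    for i in range(1, max_loops + 1):
        new_h = step(h)
        if abs(new_h - h) < tol:
            return i
        h = new_h
    return max_loops

# Example configuration: 8 prelude, 12 recurrent, 8 coda layers, up to 16 loops.
full_cost = layer_applications(8, 12, 8, loops=16)  # 8 + 192 + 8

# Hypothetical scalar recurrence that settles quickly (fixed point at 2.0).
used = loops_until_settled(lambda h: 0.5 * h + 1.0, 0.0, max_loops=16, tol=0.1)
early_cost = layer_applications(8, 12, 8, loops=used)

print(full_cost, used, early_cost)  # 208 5 76
```

In this toy case an input that settles after 5 loops would cost 76 layer applications instead of 208, which is the kind of saving an adaptive halting mechanism would need to deliver in practice.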

Documentation is sparse beyond the architecture description. There are no published training runs, no benchmark comparisons against standard transformers, no ablation studies showing whether the recurrent mechanism actually improves reasoning tasks. The repository provides the model architecture but not the empirical validation needed to recommend it for serious research investments. You're adopting an architecture pattern based on theoretical appeal rather than demonstrated results.

Verdict

Use if: You're a researcher exploring alternatives to chain-of-thought token generation, investigating recurrent transformer architectures, or building experimental models where variable computational depth is theoretically valuable. The clean PyTorch implementation and pre-configured scaling variants make this an excellent foundation for academic experiments on depth-variable reasoning. The codebase quality is high enough to fork and extend with your own training infrastructure.

Skip if: You need production-ready models with proven performance characteristics, are operating under compute constraints (the recurrent loops are expensive), expect this to faithfully reproduce Claude's actual architecture (it's explicitly speculative), or want battle-tested training recipes and empirical benchmarks. For practical applications, use established open-source models like Llama or Mistral, or access Claude directly through Anthropic's API. OpenMythos is an architecture playground, not a deployment-ready solution.
