Claude Octopus: Orchestrating Three AI Models Through Design Thinking Workflows

Hook

What if the answer to AI’s hallucination problem isn’t better models, but treating AI agents like a design committee that must reach consensus before shipping code?

Context

The AI coding assistant landscape has evolved rapidly from single-model autocomplete to conversational pair programming, but a fundamental problem remains: every AI model has blind spots. Claude excels at reasoning but can be verbose. OpenAI's Codex generates implementation-focused code but sometimes misses security implications. Gemini brings Google's research depth but may over-engineer solutions. Developers typically pick one model and live with its weaknesses, or manually copy-paste prompts between different AI chat interfaces to compare outputs—a tedious, context-losing process.

Claude Octopus takes a different approach: it treats AI models as specialized team members in a structured design workflow. Built as a shell-based orchestration layer for Claude Code, it implements the Double Diamond methodology—a design thinking framework that moves through Discover, Define, Develop, and Deliver phases with enforced quality gates. Instead of asking one AI for code and hoping it’s correct, Octopus routes your request through multiple models simultaneously, assigns domain-specific expert personas based on intent detection, and synthesizes their responses only after they reach consensus thresholds. It’s not AI pair programming; it’s AI committee programming with Robert’s Rules of Order.

Technical Insight

[Diagram: system architecture — auto-generated. User input flows through an intent parser and a confidence check: above 0.80 the request auto-routes, between 0.70 and 0.80 the route is confirmed with the user, and below 0.70 the system asks for clarification. A workflow selector and Diamond phase router then fan the enriched request out in parallel to the Codex, Gemini, and Claude providers. A response synthesizer computes a consensus score, draws on the persona library, and returns a unified response to the user.]

At its core, Claude Octopus is a command parser and workflow orchestrator written in shell scripts that intercepts requests to Claude Code and fans them out to three AI providers. The architecture reveals itself in the session initialization:

# Simplified orchestration flow
parse_intent() {
  local user_input="$1"
  local intent confidence
  intent=$(claude_analyze_intent "$user_input")
  confidence=$(echo "$intent" | jq -r '.confidence')

  if (( $(echo "$confidence > 0.80" | bc -l) )); then
    route_workflow "$intent" "auto"
  elif (( $(echo "$confidence > 0.70" | bc -l) )); then
    confirm_route "$intent"
  else
    clarify_intent "$user_input"
  fi
}

route_workflow() {
  local intent="$1"
  local workflow tmpdir
  workflow=$(select_diamond_phase "$intent")

  # Parallel execution across providers. Note: `var=$(cmd) &` would run
  # the assignment in a subshell, so the parent would never see the value;
  # write to temp files instead and read them back after `wait`.
  tmpdir=$(mktemp -d)
  execute_codex  "$intent" "$workflow" > "$tmpdir/codex"  &
  execute_gemini "$intent" "$workflow" > "$tmpdir/gemini" &
  execute_claude "$intent" "$workflow" > "$tmpdir/claude" &
  wait

  synthesize_responses "$(cat "$tmpdir/codex")" "$(cat "$tmpdir/gemini")" "$(cat "$tmpdir/claude")"
  rm -rf "$tmpdir"
}

The intent detection system is where things get interesting. Rather than simple keyword matching, Octopus uses Claude’s reasoning capabilities to classify requests into workflow categories and assign appropriate expert personas from its library of 29 specialists. Ask about database indexing, and the Database Architect persona activates. Request a security review, and the Security Auditor joins the conversation. These personas aren’t just system prompts—they maintain context across the session and proactively contribute when their domain is relevant, even if you didn’t explicitly invoke them.
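The dispatch from a detected intent category to a persona can be sketched as a simple lookup; the actual persona library is not published in this form, so the category and persona names below are illustrative stand-ins, not the tool's real set of 29 specialists:

```shell
# Hypothetical mapping from an intent category to a persona name.
# Both the categories and the persona titles are illustrative guesses.
select_persona() {
  case "$1" in
    database) echo "Database Architect" ;;
    security) echo "Security Auditor" ;;
    frontend) echo "Frontend Engineer" ;;
    *)        echo "Generalist" ;;
  esac
}

select_persona "database"   # prints: Database Architect
select_persona "deploy"     # prints: Generalist (no specialist match)
```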

The Double Diamond implementation adds structure that most AI tools lack. In the Discover phase, all three models explore the problem space—Gemini pulls from its broad knowledge base, Claude reasons about requirements, Codex examines existing code patterns. The system calculates a consensus score by comparing semantic similarity of responses using embedding vectors. Only when consensus exceeds 75% does the workflow advance to Define, where models agree on the specific problem to solve. This gating mechanism prevents the common AI pitfall of rushing to implementation before fully understanding requirements.

The Develop phase showcases the multi-provider strength distribution. Codex generates implementation code with deep API knowledge. Gemini performs security analysis and checks against vulnerability databases. Claude synthesizes both perspectives and identifies integration concerns. Here’s a simplified quality gate check:

check_quality_gate() {
  local phase="$1"
  local responses="$2"
  local consensus threshold

  # Calculate consensus using semantic similarity. The JSON is piped on
  # stdin rather than interpolated into the Python source, where a quote
  # in any response would break the script.
  consensus=$(printf '%s' "$responses" | python3 -c "
import json, sys
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
responses = json.load(sys.stdin)
embeddings = model.encode([r['content'] for r in responses])
similarity = cosine_similarity(embeddings)
print(similarity.mean())
")

  threshold=$(get_phase_threshold "$phase")

  if (( $(echo "$consensus >= $threshold" | bc -l) )); then
    log_gate_pass "$phase" "$consensus"
    return 0
  else
    log_gate_fail "$phase" "$consensus" "$threshold"
    trigger_reconciliation "$responses"
    return 1
  fi
}

When quality gates fail, Octopus enters a reconciliation mode where it highlights the disagreements between models and asks the user to provide tiebreaker guidance. This is actually more valuable than automatic consensus—divergent AI opinions often reveal edge cases or architectural tradeoffs that a single model would silently choose for you.
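A minimal sketch of surfacing that divergence, assuming plain-text responses and using `diff -y` purely for illustration (the tool's actual reconciliation format is not documented):

```shell
# Illustrative reconciliation: show two providers' answers side by side
# so the user can act as tiebreaker. The output format is an assumption.
reconcile() {
  local a b
  a=$(mktemp); b=$(mktemp)
  printf '%s\n' "$1" > "$a"
  printf '%s\n' "$2" > "$b"
  echo "--- divergence between providers ---"
  diff -y "$a" "$b" || true   # diff exits 1 on differences; not an error here
  rm -f "$a" "$b"
}

reconcile "Add a covering index on user_id" "Denormalize the users table"
```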

The graceful degradation architecture deserves attention. Octopus detects available providers at runtime and adjusts workflows accordingly. With all three providers active, it runs full parallel execution. If Gemini’s API is down, it falls back to Codex+Claude synthesis. If only Claude is available (the host environment), it runs single-model workflows with the persona system still active. The API abstraction layer handles this transparently:

execute_provider() {
  local provider="$1"
  local prompt="$2"
  local context="$3"

  case "$provider" in
    codex)
      if check_provider_health "openai"; then
        # Build the request body with jq so quotes in the prompt survive,
        # then call the Chat Completions endpoint
        jq -n --arg p "$prompt" --arg c "$context" \
          '{model: "gpt-4", messages: [{role: "system", content: $c}, {role: "user", content: $p}]}' |
        curl -s "https://api.openai.com/v1/chat/completions" \
          -H "Authorization: Bearer $OPENAI_API_KEY" \
          -H "Content-Type: application/json" \
          -d @-
      else
        echo '{"status": "unavailable", "fallback": true}'
      fi
      ;;
    gemini)
      if check_provider_health "google"; then
        jq -n --arg p "$prompt" '{contents: [{parts: [{text: $p}]}]}' |
        curl -s "https://generativelanguage.googleapis.com/v1/models/gemini-pro:generateContent" \
          -H "Content-Type: application/json" \
          -H "x-goog-api-key: $GEMINI_API_KEY" \
          -d @-
      else
        echo '{"status": "unavailable", "fallback": true}'
      fi
      ;;
  esac
}

The session context management is surprisingly sophisticated for a shell implementation. Octopus maintains a conversation history that includes not just user prompts and AI responses, but also metadata about which personas were active, what consensus scores were achieved, and which quality gates passed or failed. This creates an audit trail useful for understanding how the system arrived at recommendations, and allows the AI models to reference earlier decision points when evaluating new requests.
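As a sketch of what one entry in such an audit trail might look like, assuming a JSON-lines log file (the path and field names here are guesses, not the tool's actual schema):

```shell
# Append one turn's metadata to a JSON-lines session log.
# SESSION_LOG's default path and the field names are illustrative.
SESSION_LOG="${SESSION_LOG:-/tmp/octopus_session.jsonl}"

log_turn() {
  local persona="$1" consensus="$2" gate="$3"
  printf '{"ts":"%s","persona":"%s","consensus":%s,"gate":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$persona" "$consensus" "$gate" \
    >> "$SESSION_LOG"
}

log_turn "Security Auditor" 0.82 "pass"
tail -n 1 "$SESSION_LOG"   # the most recent decision point
```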

Gotcha

The biggest limitation is Claude Code lock-in. Octopus isn’t a standalone tool—it requires Claude Code v2.1.34 or later as the host environment, which means you can’t use it in VS Code, JetBrains IDEs, or any other editor. This architectural decision makes sense (it needs a base AI provider to orchestrate), but it severely limits adoption. If you’re happily working in Cursor or using Aider in your terminal, migrating to Claude Code just for Octopus is a substantial commitment.

The shell-based implementation also raises concerns about reliability at scale. Shell scripts excel at gluing commands together, but error handling is notoriously fragile. What happens when one of three parallel API calls hangs? How does timeout handling work when you’re waiting on multiple providers? The repository doesn’t show extensive error recovery logic, and shell’s signal handling isn’t as robust as what you’d get in Go or Rust. Running three API calls in parallel could also introduce significant latency—if each takes 3-5 seconds and they occasionally retry, you might wait 10-15 seconds for responses that a single model would return in 4 seconds. The parallelization is only beneficial if the synthesized multi-model output is substantially better than single-model output, and that’s context-dependent.
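One defensive pattern the repository could adopt, sketched here with coreutils `timeout` (this wrapper is my illustration, not code from the project):

```shell
# Bound each provider call so one hung API request cannot stall the fan-out.
# The time budget and the fallback payload are illustrative choices.
call_with_timeout() {
  local limit="$1"; shift
  if timeout "$limit" "$@"; then
    return 0
  else
    echo '{"status": "timeout", "fallback": true}'
    return 1
  fi
}

# Simulate a hung provider with sleep; a 1-second budget trips the guard.
call_with_timeout 1 sleep 5 || true   # prints the fallback JSON
```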

The consensus scoring and quality gate thresholds appear to be hardcoded or at least not exposed in the documentation. A 75% consensus threshold might be perfect for some workflows and paralyzing for others. Fast prototyping benefits from lower bars; security-critical work needs higher thresholds. Without tunability, you’re stuck with the author’s opinionated defaults. The 317 stars suggest this is still early-stage—impressive for a niche orchestration tool, but not battle-tested at enterprise scale.
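The repository's `get_phase_threshold` is not shown, but a tunable version is straightforward to sketch: environment variables (names and defaults invented here) overriding per-phase baselines:

```shell
# Per-phase consensus thresholds with env-var overrides.
# Variable names and default values are assumptions for illustration.
get_phase_threshold() {
  case "$1" in
    discover) echo "${OCTOPUS_DISCOVER_THRESHOLD:-0.75}" ;;
    define)   echo "${OCTOPUS_DEFINE_THRESHOLD:-0.75}" ;;
    develop)  echo "${OCTOPUS_DEVELOP_THRESHOLD:-0.80}" ;;
    deliver)  echo "${OCTOPUS_DELIVER_THRESHOLD:-0.85}" ;;
    *)        echo "0.75" ;;
  esac
}

OCTOPUS_DISCOVER_THRESHOLD=0.60   # fast prototyping: lower the bar
get_phase_threshold discover      # prints: 0.60
get_phase_threshold deliver       # prints: 0.85
```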

Verdict

Use if: You’re already invested in Claude Code and working on complex projects where different AI perspectives add genuine value—architectural decisions, security reviews, research synthesis, or any scenario where you’ve been burned by single-model blind spots and want enforced design rigor. The Double Diamond methodology is particularly valuable for teams that struggle with AI assistants encouraging rushed implementation.

Skip if: You’re committed to a different IDE ecosystem, need standalone tooling, want faster single-model iteration cycles, or prefer exploratory coding over structured workflows. The orchestration overhead only pays dividends on sufficiently complex tasks; for typical CRUD features or bug fixes, a single well-prompted AI is faster and simpler. Also skip if you need production-grade reliability or fine-grained control over AI interaction patterns: the shell implementation and opaque scoring make this more experimental framework than hardened tool.
