Super Agent Party: Building Self-Evolving AI Companions with Desktop Vision and Multi-Platform Reach

Hook

What if your AI assistant could watch your screen, control your mouse, write its own plugins, and simultaneously manage your Discord server, Twitch chat, and smart home—all while appearing as an anime avatar?

Context

The rise of AI VTubers like Neuro-sama captured imaginations in 2023, demonstrating that audiences would engage with AI personalities that could play games, respond to chat, and maintain coherent long-term interactions. But Neuro-sama remained a closed system accessible only to its creator. Meanwhile, Anthropic released Computer Use for Claude, showing that AI agents could autonomously operate desktop applications through visual feedback. Super Agent Party emerges at the intersection of these trends, aiming to democratize both the AI VTuber experience and desktop automation capabilities in a single self-hosted package.

The project targets a specific pain point: developers and content creators want AI companions that can interact across multiple platforms (streaming services, IM apps, smart home systems) while also performing meaningful work through desktop automation. Existing solutions forced users to cobble together separate tools—VTube Studio for avatars, custom bot frameworks for chat, standalone automation tools for desktop control. Super Agent Party bundles these capabilities with a crucial addition: Model Context Protocol (MCP) integration that lets AI agents coordinate complex tasks and even generate new extensions for themselves. It's an ambitious attempt to create the 'all-in-one AI companion' that previous projects approached piecemeal.

Technical Insight

Super Agent Party's architecture revolves around a Node.js core that orchestrates multiple subsystems through an event-driven plugin model. The application maintains persistent WebSocket connections to AI providers (OpenAI, Claude, local models via OpenAI-compatible endpoints) while exposing a task queue for background operations. What makes the architecture interesting is how it bridges the gap between high-level AI reasoning and low-level system control.
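To make the event-driven plugin model and task queue concrete, here is a minimal sketch built on Node's standard `EventEmitter`. The `AgentCore` class, the `use`/`enqueue`/`drain` methods, and the `echoPlugin` object are all hypothetical names chosen for illustration; they are not taken from the project's source.

```javascript
// Hypothetical sketch: these names are illustrative, not Super Agent Party's API.
const { EventEmitter } = require('node:events');

class AgentCore extends EventEmitter {
  constructor() {
    super();
    this.queue = []; // pending background tasks, run in FIFO order
  }

  // A plugin is any object with a register(core) method.
  use(plugin) {
    plugin.register(this);
    return this;
  }

  // Queue a background task instead of running it inline.
  enqueue(task) {
    this.queue.push(task);
  }

  // Run every queued task in order and collect the results.
  drain() {
    const results = [];
    while (this.queue.length > 0) {
      const task = this.queue.shift();
      results.push(task());
    }
    return results;
  }
}

// Example plugin: reacts to 'message' events by queueing a reply task.
const echoPlugin = {
  name: 'echo',
  register(core) {
    core.on('message', (msg) => {
      core.enqueue(() => `echo: ${msg.text}`);
    });
  },
};

const core = new AgentCore().use(echoPlugin);
core.emit('message', { text: 'hello' });
const drained = core.drain();
console.log(drained);
```

The point of the split is that event handlers stay cheap: they only schedule work, and the queue decides when it runs.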

The desktop vision system uses native OS APIs to capture screenshots, process them through vision-capable models, and translate AI decisions into mouse movements and keyboard events. Unlike web automation frameworks such as Puppeteer, which rely on DOM manipulation, Super Agent Party performs true visual recognition: it sees your desktop the way you do. This means it can interact with any application, not just web browsers. The implementation uses platform-specific bindings (RobotJS for cross-platform input simulation, native Windows/macOS APIs for screen capture) wrapped in a unified abstraction layer.
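A unified abstraction layer of this kind could look roughly like the sketch below. Everything here is an assumption for illustration: `createInput` and `createRecordingBackend` are hypothetical, and the backend method names (`moveMouse`, `mouseClick`, `typeString`) are only loosely modeled on RobotJS. A recording backend stands in for the native binding so the example runs anywhere.

```javascript
// Hypothetical sketch: the backend interface is an assumption for illustration.
function createInput(backend, screen) {
  // Reject coordinates that are not finite or fall outside the screen.
  const inBounds = (x, y) =>
    Number.isFinite(x) && Number.isFinite(y) &&
    x >= 0 && y >= 0 && x < screen.width && y < screen.height;

  return {
    click(x, y, button = 'left') {
      if (!inBounds(x, y)) {
        throw new RangeError(`click (${x}, ${y}) is outside the screen`);
      }
      // Models may return fractional coordinates; backends expect integers.
      backend.moveMouse(Math.round(x), Math.round(y));
      backend.mouseClick(button);
    },
    type(text) {
      backend.typeString(String(text));
    },
  };
}

// Recording backend: logs calls instead of touching the real mouse or keyboard.
function createRecordingBackend() {
  const log = [];
  return {
    log,
    moveMouse: (x, y) => log.push(['move', x, y]),
    mouseClick: (button) => log.push(['click', button]),
    typeString: (text) => log.push(['type', text]),
  };
}

const backend = createRecordingBackend();
const input = createInput(backend, { width: 1920, height: 1080 });
input.click(100.4, 200.6);
input.type('hi');
console.log(backend.log);
```

Swapping the recording backend for a real one changes nothing in the calling code, which is the main benefit of putting the bounds check and rounding in the wrapper.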

Here's how the plugin system enables desktop automation through MCP-style tool definitions:

// Example SAP skill for desktop automation
class ScreenAnalysisSkill {
  constructor(agent) {
    this.agent = agent;
    this.vision = agent.getService('vision');
    this.input = agent.getService('input');
  }

  async execute(params) {
    // Capture current screen state
    const screenshot = await this.vision.captureScreen({
      region: params.region || 'full'
    });

    // Send to a vision-capable model with task context
    const analysis = await this.agent.query({
      model: 'gpt-4o', // any vision-capable model
      messages: [
        {
          role: 'user',
          content: [
            { type: 'text', text: params.instruction },
            // OpenAI-style image parts wrap the URL (or data URL) in an object
            { type: 'image_url', image_url: { url: screenshot } }
          ]
        }
      ],
      tools: this.getInputTools()
    });

    // Execute tool calls from AI response
    if (analysis.tool_calls) {
      for (const call of analysis.tool_calls) {
        await this.executeToolCall(call);
      }
    }

    return analysis;
  }

  getInputTools() {
    return [
      {
        type: 'function',
        function: {
          name: 'click_element',
          description: 'Click at specific screen coordinates',
          parameters: {
            type: 'object',
            properties: {
              x: { type: 'number' },
              y: { type: 'number' },
              button: { type: 'string', enum: ['left', 'right'] }
            },
            required: ['x', 'y']
          }
        }
      },
      {
        type: 'function',
        function: {
          name: 'type_text',
          description: 'Type text using keyboard',
          parameters: {
            type: 'object',
            properties: {
              text: { type: 'string' }
            },
            required: ['text']
          }
        }
      }
    ];
  }

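  async executeToolCall(call) {
    // OpenAI-style tool calls carry their arguments as a JSON string under
    // call.function.arguments, so parse them before use
    const raw = call.function.arguments;
    const args = typeof raw === 'string' ? JSON.parse(raw) : raw;

    switch (call.function.name) {
      case 'click_element':
        await this.input.moveMouse(args.x, args.y);
        await this.input.click(args.button || 'left');
        break;
      case 'type_text':
        await this.input.typeText(args.text);
        break;
      default:
        throw new Error(`Unknown tool: ${call.function.name}`);
    }
  }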
}

The MCP integration is where Super Agent Party differentiates itself from simpler bot frameworks. It implements the Model Context Protocol as a coordination layer, allowing multiple AI agents to share context and delegate tasks. When you ask the companion to "check my calendar and order dinner if I have a meeting tonight," MCP orchestrates the calendar API call, decision logic, and food delivery API interaction across separate skill modules. Each skill registers its capabilities as MCP tools, and the central orchestrator maintains conversation state across these boundaries.
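The registration-and-dispatch idea can be sketched in a few lines. MCP messages are JSON-RPC 2.0, and tools are discovered and invoked through the `tools/list` and `tools/call` methods. The `ToolRegistry` class and the `has_meeting_tonight` tool below are hypothetical, and this in-process version skips the transport, initialization handshake, and schema validation that a real MCP server would need.

```javascript
// Hypothetical in-process sketch; a real deployment would use an MCP SDK and transport.
class ToolRegistry {
  constructor() {
    this.tools = new Map();
  }

  // Each skill registers a name, a description, a JSON Schema, and a handler.
  register(name, description, inputSchema, handler) {
    this.tools.set(name, { name, description, inputSchema, handler });
  }

  // Handle a JSON-RPC 2.0 request for the two MCP tool methods.
  handle(request) {
    const { id, method, params = {} } = request;
    if (method === 'tools/list') {
      const tools = [...this.tools.values()].map(
        ({ name, description, inputSchema }) => ({ name, description, inputSchema })
      );
      return { jsonrpc: '2.0', id, result: { tools } };
    }
    if (method === 'tools/call') {
      const tool = this.tools.get(params.name);
      if (!tool) {
        const message = `Unknown tool: ${params.name}`;
        return { jsonrpc: '2.0', id, error: { code: -32602, message } };
      }
      const text = String(tool.handler(params.arguments || {}));
      return { jsonrpc: '2.0', id, result: { content: [{ type: 'text', text }], isError: false } };
    }
    return { jsonrpc: '2.0', id, error: { code: -32601, message: 'Method not found' } };
  }
}

const registry = new ToolRegistry();
registry.register(
  'has_meeting_tonight',
  'Report whether any event starts at 18:00 or later',
  { type: 'object', properties: { events: { type: 'array' } }, required: ['events'] },
  ({ events = [] }) => (events.some((e) => e.startHour >= 18) ? 'yes' : 'no')
);

const reply = registry.handle({
  jsonrpc: '2.0',
  id: 1,
  method: 'tools/call',
  params: { name: 'has_meeting_tonight', arguments: { events: [{ title: 'Sync', startHour: 19 }] } },
});
console.log(reply.result.content[0].text);
```

An orchestrator handling the dinner example would call this tool first, then decide whether to invoke a separate food-ordering tool based on the text it gets back.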

The avatar rendering system supports both VRM (3D) and Live2D (2D) models through a unified interface. It maps the AI's emotional state and speech patterns to facial expressions and body movements: the system analyzes response sentiment in real time and triggers the corresponding animation sequences. For Live2D, it uses the official Cubism SDK; for VRM, it leverages Three.js with VRM extensions. Audio input and output flow through speech-to-text and text-to-speech pipelines, with support for multiple providers (Azure, Google, ElevenLabs, local VITS models).
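As a rough illustration of the sentiment-to-animation step, the sketch below maps a sentiment score to an expression name and blend weight. The `pickExpression` function, the score range of -1 to 1, the 0.3 thresholds, and the expression names are all assumptions made for this example; the project's actual mapping may look quite different.

```javascript
// Hypothetical sketch: thresholds and expression names are illustrative assumptions.
function pickExpression(sentiment) {
  // Fall back to a neutral face when the score is missing or not a number.
  if (!Number.isFinite(sentiment)) return { name: 'neutral', weight: 1 };

  // Clamp to the assumed [-1, 1] range so weights stay within [0, 1].
  const s = Math.max(-1, Math.min(1, sentiment));

  if (s > 0.3) return { name: 'happy', weight: s };
  if (s < -0.3) return { name: 'sad', weight: -s };
  return { name: 'neutral', weight: 1 };
}

console.log(pickExpression(0.8));
console.log(pickExpression(-2));
console.log(pickExpression(0.1));
```

A renderer would then feed the returned name and weight into whatever expression API the avatar format provides.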

The extension bootstrapping feature is perhaps the most novel architectural choice. The sap-extension-creator skill gives the AI agent access to templates and the file system to generate new plugin modules. When you tell the agent "I need a skill to check Bitcoin prices," it can scaffold the API integration code, register the new skill, and reload itself—all without developer intervention. This creates a self-evolving system where the companion's capabilities expand through natural conversation. The implementation uses code generation prompts with strict validation to prevent arbitrary code execution exploits.
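One plausible piece of that validation is checking a generated skill's manifest before it is registered. The sketch below is a guess at what such a check might involve: the manifest fields (`name`, `entry`, `permissions`), the allowed permission list, and the `validateManifest` function are hypothetical. A check like this narrows what a generated skill can declare, but it does not by itself make generated code safe to run.

```javascript
// Hypothetical sketch: the manifest shape and permission names are assumptions.
const ALLOWED_PERMISSIONS = new Set(['network', 'storage']);
const NAME_PATTERN = /^[a-z][a-z0-9-]{2,39}$/;
// Relative .js path made of plain segments: rejects '..', absolute paths, backslashes.
const ENTRY_PATTERN = /^[A-Za-z0-9_-]+(\/[A-Za-z0-9_-]+)*\.js$/;

// Returns a list of problems; an empty list means the manifest passed.
function validateManifest(manifest) {
  if (typeof manifest !== 'object' || manifest === null) {
    return ['manifest must be an object'];
  }
  const errors = [];
  if (typeof manifest.name !== 'string' || !NAME_PATTERN.test(manifest.name)) {
    errors.push('invalid name');
  }
  if (typeof manifest.entry !== 'string' || !ENTRY_PATTERN.test(manifest.entry)) {
    errors.push('invalid entry path');
  }
  if (!Array.isArray(manifest.permissions)) {
    errors.push('permissions must be an array');
  } else {
    for (const permission of manifest.permissions) {
      if (!ALLOWED_PERMISSIONS.has(permission)) {
        errors.push(`permission not allowed: ${permission}`);
      }
    }
  }
  return errors;
}

const good = validateManifest({
  name: 'bitcoin-price',
  entry: 'skills/bitcoin-price.js',
  permissions: ['network'],
});
const bad = validateManifest({
  name: 'Bad Name',
  entry: '../../etc/passwd',
  permissions: ['shell'],
});
console.log(good);
console.log(bad);
```

Returning a list of errors rather than throwing lets the agent feed the problems back into its next generation attempt.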

For multi-platform bot deployment, Super Agent Party maintains adapter modules for Discord, QQ, WeChat Work, Telegram, and Chinese streaming platforms like Bilibili. Each adapter translates platform-specific events (messages, reactions, subscriptions) into a unified internal event format. The agent processes these through the same reasoning pipeline regardless of origin, with platform-specific formatting applied to responses. This means you write interaction logic once and deploy everywhere.
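The adapter pattern described here can be sketched as a table of small translation functions. The unified event shape (`platform`, `userId`, `channelId`, `text`) and the `normalize` and `respond` functions are hypothetical. The input fields follow the general shape of a discord.js message and a Telegram Bot API update, but treat the exact field access as an assumption rather than the project's adapter code.

```javascript
// Hypothetical sketch: the unified event shape is an assumption for illustration.
const adapters = {
  // Shaped like a discord.js Message object.
  discord: (msg) => ({
    platform: 'discord',
    userId: String(msg.author.id),
    channelId: String(msg.channelId),
    text: msg.content,
  }),
  // Shaped like a Telegram Bot API Update carrying a message.
  telegram: (update) => ({
    platform: 'telegram',
    userId: String(update.message.from.id),
    channelId: String(update.message.chat.id),
    text: update.message.text ?? '',
  }),
};

// Translate a raw platform payload into the unified event format.
function normalize(platform, raw) {
  if (!Object.hasOwn(adapters, platform)) {
    throw new Error(`No adapter for platform: ${platform}`);
  }
  return adapters[platform](raw);
}

// Interaction logic written once, against the unified format only.
function respond(event) {
  return event.text.trim().toLowerCase() === 'ping' ? 'pong' : 'unknown command';
}

const fromDiscord = normalize('discord', {
  author: { id: '42' },
  channelId: '7',
  content: ' Ping ',
});
const fromTelegram = normalize('telegram', {
  message: { from: { id: 99 }, chat: { id: -1001 }, text: 'ping' },
});
console.log(respond(fromDiscord), respond(fromTelegram));
```

Adding a platform then means adding one entry to the table; `respond` never needs to know where the event came from.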

Gotcha

The platform restrictions are immediately apparent: Windows 10/11 or Apple Silicon macOS only. If you're running Linux or an Intel Mac, you're locked out entirely. This stems from the desktop automation components using platform-specific native bindings that the team hasn't ported to other targets. The M-chip requirement is especially limiting, since it excludes the many Intel Mac users who might want AI companion features. For a project positioning itself as accessible and self-hosted, these hard platform requirements create significant barriers.
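If you're scripting around the project, the support matrix above reduces to a small check on Node's `process.platform` and `process.arch`. The `isSupportedPlatform` helper is hypothetical, and it does not verify the Windows version, so treat it as a quick pre-flight sketch rather than the project's own installer logic.

```javascript
// Hypothetical helper: mirrors the stated support matrix, not the project's code.
function isSupportedPlatform(platform, arch) {
  if (platform === 'win32') return true; // Windows (version not checked here)
  if (platform === 'darwin') return arch === 'arm64'; // Apple Silicon only
  return false; // Linux and everything else
}

console.log(isSupportedPlatform(process.platform, process.arch));
```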

The bundled runtime approach, while convenient for non-technical users, creates deployment headaches for developers. Each portable package includes full Python and Node.js installations, ballooning download sizes to multiple gigabytes. You can't easily swap in your preferred versions or share runtimes across projects. The architecture locks you into the bundled versions, which may lag behind security patches or lack dependencies you need for custom extensions.

Documentation is fragmented across Feishu (Chinese), Notion (English), and video tutorials on Bilibili and YouTube, making it difficult to find authoritative answers.

The computer control features, while impressive, raise legitimate security concerns: you're giving an AI agent the ability to control your mouse and keyboard based on visual analysis. A misinterpreted screen element could result in unintended clicks on destructive actions.

The project is still at v0.4.1, and the stability needed for production environments isn't there yet. Early adopters should expect breaking changes and bugs in desktop automation workflows.

Verdict

Use Super Agent Party if you're a content creator building AI VTuber personas for streaming platforms, a developer prototyping multi-platform bot interactions with minimal infrastructure setup, or an enthusiast exploring AI desktop automation without diving into Anthropic's enterprise-focused Computer Use. The bundled portable packages make experimentation frictionless, and the Chinese market integrations (Bilibili, QQ) are unmatched elsewhere. It's ideal for hackathon projects and proof-of-concepts where you need rapid assembly of AI agent capabilities across streaming, IM, and desktop control.

Skip it if you require Linux deployment, need production-grade reliability for critical workflows, want lightweight containerized deployments, or can't tolerate scattered documentation. The broad feature surface means depth suffers compared to specialized tools: SillyTavern offers richer character chat, Streamer.bot gives more granular streaming control, and Claude Desktop provides more mature computer use. If you're building one specific capability rather than experimenting with the full AI companion vision, dedicated tools will serve you better. The security implications of desktop automation also warrant careful consideration before deployment in any environment with sensitive data.
