Magentic-UI: Microsoft's Plan-Then-Execute Web Agent That Shows Its Work

Hook

Most AI agents are black boxes that click around your browser autonomously. Magentic-UI flips this: it shows you exactly what it plans to do, lets you edit the plan like a document, then executes only what you approve.

Context

The explosion of LLM-powered agents has created a trust problem. Tools like AutoGPT and browser automation frameworks can accomplish impressive tasks, but they operate opaquely—you click 'go' and hope they don't accidentally delete your production database or submit a form with hallucinated data. This works fine for sandboxed demos, but falls apart when agents interact with real systems where mistakes have consequences.

Microsoft Research's Magentic-UI tackles this with a fundamentally different approach: transparency through explicit planning. Instead of immediately executing actions, the agent generates a step-by-step plan that users can review, edit, and approve before anything happens. It's the difference between handing your car keys to a self-driving system and first reviewing the proposed route and making adjustments. For tasks like filling out government forms, monitoring competitor pricing, or analyzing financial data, this human-in-the-loop architecture transforms agents from risky experiments into practical tools.

Technical Insight

Magentic-UI's architecture centers on a plan-then-execute pattern implemented through Microsoft's AutoGen multi-agent framework. Unlike traditional agents that immediately translate instructions into actions, Magentic-UI uses an LLM to generate an explicit, editable plan as an intermediate artifact. This plan becomes a contract between human and machine.

The system employs three specialized agents: a browser automation agent using Playwright, a code execution agent running in Docker containers, and a file operations agent. When you submit a task like 'Find the cheapest flights from Seattle to Tokyo next month,' the orchestrator doesn't start clicking immediately. Instead, it generates a plan:

# Example plan structure (simplified)
plan = {
  "task": "Find cheapest flights SEA to Tokyo",
  "steps": [
    {
      "id": 1,
      "action": "navigate",
      "target": "https://google.com/flights",
      "description": "Open Google Flights"
    },
    {
      "id": 2,
      "action": "fill_form",
      "fields": {"origin": "SEA", "destination": "NRT"},
      "description": "Enter origin and destination"
    },
    {
      "id": 3,
      "action": "code_execution",
      "code": "parse_flight_results(page_html)",
      "description": "Extract and sort flight prices"
    }
  ],
  "expected_output": "CSV with flight options sorted by price"
}

This plan renders in the React frontend as an editable document. You can modify step 2 to search a different airport, delete step 3 if you want to review results manually, or insert a new step to apply specific filters. The frontend sends the modified plan back to the Python backend only when you click 'Execute.'
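
The edits described above amount to ordinary list operations on the plan. Here is a minimal Python sketch, assuming the simplified plan dictionary shown earlier. The helper names (update_step, delete_step, insert_step) and the apply_filter action are hypothetical illustrations, not part of Magentic-UI's API:

```python
import copy

# Shortened copy of the simplified plan from the article (illustrative only).
plan = {
    "task": "Find cheapest flights SEA to Tokyo",
    "steps": [
        {"id": 1, "action": "navigate", "target": "https://google.com/flights"},
        {"id": 2, "action": "fill_form", "fields": {"origin": "SEA", "destination": "NRT"}},
        {"id": 3, "action": "code_execution", "code": "parse_flight_results(page_html)"},
    ],
}

def _index_of(steps, step_id):
    """Return the list position of the step with the given id."""
    for index, step in enumerate(steps):
        if step["id"] == step_id:
            return index
    raise KeyError(f"no step with id {step_id}")

def _renumber(steps):
    """Reassign sequential ids after an insert or delete."""
    for new_id, step in enumerate(steps, start=1):
        step["id"] = new_id

def update_step(plan, step_id, **changes):
    """Return a copy of the plan with fields of one step replaced."""
    edited = copy.deepcopy(plan)
    edited["steps"][_index_of(edited["steps"], step_id)].update(changes)
    return edited

def delete_step(plan, step_id):
    """Return a copy of the plan without the given step."""
    edited = copy.deepcopy(plan)
    del edited["steps"][_index_of(edited["steps"], step_id)]
    _renumber(edited["steps"])
    return edited

def insert_step(plan, after_id, new_step):
    """Return a copy of the plan with new_step placed after step after_id."""
    edited = copy.deepcopy(plan)
    position = _index_of(edited["steps"], after_id) + 1
    edited["steps"].insert(position, dict(new_step))
    _renumber(edited["steps"])
    return edited

# Search a different airport, add a filter step, then drop the parsing step.
edited = update_step(plan, 2, fields={"origin": "SEA", "destination": "HND"})
edited = insert_step(edited, 2, {"action": "apply_filter", "filter": "nonstop"})
edited = delete_step(edited, 4)
print([step["action"] for step in edited["steps"]])
# prints ['navigate', 'fill_form', 'apply_filter']
```

Each helper returns a deep copy, so the original plan stays untouched until the edited version is explicitly submitted for execution.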

The backend's orchestration layer implements configurable action guards: safety checks that pause execution for human approval on sensitive operations. For instance, any action involving file deletion, form submission, or external API calls can trigger a confirmation prompt. A guard configuration along these lines illustrates the idea (a simplified YAML sketch, not necessarily the project's exact schema):

action_guards:
  - type: form_submission
    require_approval: true
    show_preview: true
  - type: file_operation
    operations: [delete, modify]
    require_approval: true
  - type: navigation
    domains_requiring_approval:
      - "*.bank.com"
      - "admin.*"
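
To make the guard logic concrete, here is a small Python sketch of how a configuration like the YAML above could be evaluated before each action. The needs_approval function, the shape of the action dictionaries, and the use of shell-style wildcard matching for domains are all assumptions made for illustration, not Magentic-UI's actual implementation:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# The YAML above, written out as the Python structure a parser would produce.
ACTION_GUARDS = [
    {"type": "form_submission", "require_approval": True, "show_preview": True},
    {"type": "file_operation", "operations": ["delete", "modify"], "require_approval": True},
    {"type": "navigation", "domains_requiring_approval": ["*.bank.com", "admin.*"]},
]

def needs_approval(action, guards=ACTION_GUARDS):
    """Return True if any guard says this action must wait for the user."""
    for guard in guards:
        if guard["type"] != action["type"]:
            continue
        # Navigation guards match the target host against wildcard patterns.
        if "domains_requiring_approval" in guard:
            host = urlparse(action["url"]).hostname or ""
            if any(fnmatch(host, p) for p in guard["domains_requiring_approval"]):
                return True
            continue
        # File guards only apply to the listed operations.
        if "operations" in guard and action.get("operation") not in guard["operations"]:
            continue
        if guard.get("require_approval", False):
            return True
    return False

print(needs_approval({"type": "navigation", "url": "https://login.bank.com/accounts"}))
print(needs_approval({"type": "file_operation", "operation": "read"}))
```

Under these assumptions the first call prints True and the second prints False. Note that a pattern like "*.bank.com" does not match the bare host "bank.com", so a real configuration would need to list both.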

The Docker-based code execution agent deserves special attention. When the plan includes data analysis or web scraping logic, the agent spins up an isolated container with a Python environment. This sandbox prevents code from accessing your local filesystem or network beyond what's explicitly permitted. The agent passes extracted data (like HTML content or API responses) into the container, executes the code, and returns results—all without the LLM-generated code touching your host machine.
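
The isolation described here maps onto standard docker run flags. The sketch below only builds the command line. The build_sandbox_command function, the python:3.12-slim image, the /workspace mount point, and the memory limit are illustrative assumptions, since Magentic-UI ships its own images and container setup:

```python
import shlex
from pathlib import Path

def build_sandbox_command(workdir, script_name, image="python:3.12-slim",
                          allow_network=False, memory="512m"):
    """Build a docker run command that executes one script in a throwaway container."""
    workdir = Path(workdir).resolve()
    cmd = ["docker", "run", "--rm", "--memory", memory]
    if not allow_network:
        # No network interfaces besides loopback inside the container.
        cmd += ["--network", "none"]
    cmd += [
        # Only this one directory is visible to the generated code.
        "--volume", f"{workdir}:/workspace",
        "--workdir", "/workspace",
        image, "python", script_name,
    ]
    return cmd

cmd = build_sandbox_command("/tmp/job", "analyze.py")
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run with a timeout would execute the script. The extracted HTML or API responses would be written into the mounted directory beforehand, so the generated code never needs access to anything else on the host.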

Perhaps most innovative is the 'Tell Me When' feature for long-running monitoring tasks. Traditional agents either run continuously (expensive) or require you to manually re-trigger them. Magentic-UI introduces a stateful monitoring pattern where you can instruct the agent to check a condition periodically: 'Tell me when the price drops below $500.' The backend maintains a scheduler that re-runs the plan at specified intervals (hourly, daily, etc.), comparing results against your condition. When the trigger fires, it alerts you through the UI or configured notification channel.
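
The monitoring pattern reduces to a polling loop with a condition. Here is a minimal sketch, assuming a hypothetical tell_me_when helper where fetch_value stands in for re-running the plan and sleep is injectable so the loop can be exercised without waiting. The real scheduler and notification channels are not shown:

```python
import time

def tell_me_when(fetch_value, condition, interval_seconds, max_checks, sleep=time.sleep):
    """Poll fetch_value() until condition(value) holds or max_checks is reached."""
    for attempt in range(1, max_checks + 1):
        value = fetch_value()
        if condition(value):
            return {"triggered": True, "value": value, "checks": attempt}
        if attempt < max_checks:
            # Wait before the next check; nothing runs in between.
            sleep(interval_seconds)
    return {"triggered": False, "value": None, "checks": max_checks}

# Simulated hourly price checks; a real fetch would re-run the browsing plan.
prices = iter([620, 540, 480])
result = tell_me_when(
    fetch_value=lambda: next(prices),
    condition=lambda price: price < 500,
    interval_seconds=3600,
    max_checks=5,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(result)
# prints {'triggered': True, 'value': 480, 'checks': 3}
```

The max_checks cap is a design choice for the sketch: it bounds how many times the plan, and therefore the LLM, is invoked on behalf of a single request.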

The system also implements plan learning through a gallery mechanism. Successful task executions get stored as templates with parameterized inputs. If you create a plan for 'Extract all product prices from Amazon search for [QUERY],' that plan becomes reusable. The next time you (or another user) issues a similar request, the LLM retrieves the successful plan from the gallery and adapts it, reducing planning time and improving reliability through proven patterns.
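
A parameterized gallery can be approximated with string templates. In the sketch below, the GALLERY entries, the best_template and instantiate helpers, and the step wording are hypothetical. Retrieval uses plain string similarity from the standard library as a stand-in for the LLM-based retrieval the article describes:

```python
from difflib import SequenceMatcher
from string import Template

# Stored plans with $placeholders for the inputs that vary between runs.
GALLERY = [
    {
        "name": "product_prices",
        "request": "Extract all product prices from Amazon search for $query",
        "steps": ["Open amazon.com", "Search for $query", "Collect the price of each result"],
    },
    {
        "name": "flight_search",
        "request": "Find the cheapest flights from $origin to $destination",
        "steps": ["Open Google Flights", "Enter $origin and $destination", "Sort results by price"],
    },
]

def best_template(request, gallery=GALLERY):
    """Pick the stored plan whose original request looks most like the new one."""
    def score(entry):
        return SequenceMatcher(None, request.lower(), entry["request"].lower()).ratio()
    return max(gallery, key=score)

def instantiate(entry, **params):
    """Fill the placeholders in each step; raises KeyError if a parameter is missing."""
    return [Template(step).substitute(params) for step in entry["steps"]]

entry = best_template("Extract product prices from Amazon search for usb-c hubs")
steps = instantiate(entry, query="usb-c hubs")
print(entry["name"], steps)
```

The instantiated steps would still go through the same review and approval flow as a freshly generated plan, since a reused template is only a starting point.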

Gotcha

The Docker requirement is non-negotiable for serious use. While Magentic-UI technically runs without Docker, you lose code execution capabilities, which eliminates data analysis, web scraping, and most sophisticated automation tasks. For Windows users, this means setting up WSL2, which adds friction to what could otherwise be a simple pip-install experience. If your organization has strict container policies or you're deploying to environments without Docker access, you'll hit a hard wall.

The research prototype status manifests in practical ways. The documentation acknowledges potential breaking changes, and the GitHub issues reveal edge cases around session management and agent coordination. Plan execution can fail partway through, leaving browser state inconsistent—imagine a form half-filled because step 4 errored out. The system doesn't yet have robust rollback or recovery mechanisms. You also can't easily distribute this to non-technical users; it requires command-line setup and understanding of API keys, model configuration, and Docker networking. This is a tool for developers exploring agent architectures, not a polished product you'd hand to your marketing team to automate their workflows.

Verdict

Use if: You're building automation for high-stakes tasks where transparency justifies overhead, such as compliance workflows, financial data analysis, or research tasks where you need to audit agent actions. Also ideal if you're researching human-agent collaboration patterns or need a foundation for building custom web agents with safety guardrails. The plan editing and action guard features are genuinely novel for scenarios where full autonomy is risky.

Skip if: You need production-grade reliability, want fully autonomous operation without human checkpoints, or can't accommodate Docker in your deployment environment. For simple web scraping or form filling where speed matters more than oversight, pure automation tools like Playwright with custom scripts will be faster and simpler. The transparency features are powerful but come with interaction costs that only make sense when the alternative is unacceptable risk.
