Back to Articles

Data Formulator: Microsoft's Hybrid AI Agent for Visual Analytics

[ View on GitHub ]

Data Formulator: Microsoft's Hybrid AI Agent for Visual Analytics

Hook

What if your data visualization tool could understand "show me sales trends for underperforming regions" and generate both the SQL transformation and the chart—while still letting you drag fields around when the AI gets it wrong?

Context

Modern data exploration sits in an awkward middle ground between two paradigms. Traditional BI tools like Tableau require users to manually wrangle data through GUI menus and drag-and-drop interfaces—powerful but tedious when exploring unfamiliar datasets. On the other end, AI chatbots like ChatGPT's Advanced Data Analysis can generate visualizations from natural language but offer no mechanism to refine results except writing more prompts, creating a frustrating "one-shot" experience where small adjustments require full regeneration.

Microsoft Research's Data Formulator attacks this gap with a hybrid architecture that treats AI and direct manipulation as equals. Built by the same team behind tools like SandDance and Charticulator, it introduces a "data thread" paradigm—an interactive timeline where chat-based AI transformations and manual visual encoding specifications coexist. Users can ask the AI to reshape data, then immediately tweak the resulting visualization by dragging fields, or vice versa. This isn't just a chatbot with a chart library bolted on; it's a fundamental rethinking of how humans and language models should collaborate on exploratory data analysis.

Technical Insight

At its core, Data Formulator orchestrates three distinct layers: a TypeScript frontend for the data thread UI, a Python backend housing AI agents and data processing, and DuckDB as the computational engine. The architectural choice to use DuckDB rather than in-memory Pandas operations is critical—it enables the system to handle datasets with millions of rows without loading everything into browser memory. When you drag a field onto the chart, Data Formulator generates SQL queries that DuckDB executes, streaming only the aggregated results needed for rendering.

The AI architecture evolved significantly. Early versions used four separate agents: a DataTransformer, ChartGenerator, DataInsighter, and Recommender. The current implementation consolidates these into a unified DataAgent powered by LiteLLM, which abstracts over OpenAI, Azure OpenAI, Anthropic, and local models like Ollama. Here's how the system handles a natural language request:

// Simplified flow of AI-driven transformation
interface TransformRequest {
  prompt: string;
  currentData: DataSchema;
  visualContext?: ChartSpec;
}

async function handleTransform(req: TransformRequest): Promise<TransformResult> {
  // AI generates transformation code (SQL or Python)
  const generatedCode = await dataAgent.generateTransform({
    userIntent: req.prompt,
    dataSchema: req.currentData.schema,
    sampleRows: req.currentData.sample(100)
  });
  
  // Code execution in sandboxed environment
  const signature = await signCode(generatedCode);
  const result = await executeSandboxed(generatedCode, signature);
  
  // Return both data and reasoning
  return {
    transformedData: result.data,
    sqlQuery: result.query,
    reasoning: generatedCode.explanation,
    confidence: generatedCode.confidence
  };
}

The "data thread" UI materializes this process as a vertical timeline. Each node represents either a data transformation (AI-generated or manual) or a visualization specification. Critically, these nodes maintain bidirectional links—you can branch from any point in the thread to explore alternative transformations without losing your previous work. This lineage tracking means if the AI generates SQL that filters data incorrectly, you can fork from the node just before that transformation and try a different approach.

The semantic chart engine deserves attention. Rather than binding raw columns to visual encodings directly, Data Formulator infers semantic concepts from your data transformations. If you aggregate sales by month, it understands the temporal nature and automatically suggests appropriate chart types (line charts for trends, bar charts for comparisons). The system supports 30+ chart types from the Vega-Lite grammar, but abstracts away much of Vega's verbose JSON specification:

# Example of generated transformation for "compare revenue by region, showing top 5"
generated_query = '''
SELECT 
  region,
  SUM(revenue) as total_revenue
FROM sales_data
GROUP BY region
ORDER BY total_revenue DESC
LIMIT 5
'''

# Data Formulator auto-infers encoding:
# X: region (nominal), Y: total_revenue (quantitative), Mark: bar
# User can then drag additional fields to color, size, or facet

The workspace architecture positions Data Formulator as more than a visualization tool—it's a lightweight data lake. The backend supports pluggable storage (local filesystem, Azure Blob Storage) and external data loaders for MySQL, PostgreSQL, Azure Data Explorer, and S3. This means you can connect to production databases, let the AI explore schema relationships, and generate visualizations without ETL pipelines. The security model uses code signing and sandboxed execution to prevent malicious code injection, a necessity when LLMs generate arbitrary SQL and Python that runs server-side.

One architectural trade-off worth noting: the system maintains dual execution paths. Simple transformations run as DuckDB SQL for performance. Complex operations that require Python (regex, custom date parsing, ML feature engineering) execute in isolated processes. This hybrid approach provides flexibility but introduces complexity in error handling—SQL syntax errors surface differently than Python exceptions, and the UI must translate both into user-friendly messages.

Gotcha

The alpha status isn't just a version number—it reflects genuine instability. Between v0.5 and v0.7, the agent architecture underwent a complete rewrite, and the configuration format changed incompatibly. If you're building on Data Formulator, expect to refactor integration code with each minor version bump. More fundamentally, the AI dependency introduces non-determinism that's uncomfortable for analysts accustomed to reproducible workflows. Run the same natural language query twice, and you might get different SQL. The system includes confidence scores and shows generated code for transparency, but there's no guarantee the AI won't hallucinate column names or apply incorrect aggregation logic. You're expected to validate every AI-generated transformation, which somewhat undermines the time-saving promise.

The LLM requirement also creates operational friction. You need API keys for OpenAI, Azure, Anthropic, or a local Ollama instance. For organizations with data governance policies prohibiting external API calls, this is a dealbreaker—your sensitive sales data will be sent to OpenAI's servers for prompt processing unless you self-host models, which requires GPU infrastructure. Latency matters too; complex transformations can take 5-10 seconds for the LLM to generate code, breaking the fluid exploration experience. And costs accumulate quickly if you're using GPT-4 with large context windows containing sample data.

Verdict

Use Data Formulator if you're a data analyst or researcher exploring unfamiliar, large-scale datasets where the AI can accelerate initial discovery, and you're comfortable validating generated transformations. It excels for rapid prototyping visualizations during stakeholder meetings where you need to answer ad-hoc questions without writing SQL. The data thread paradigm genuinely shines when your workflow alternates between "show me something interesting" (AI-driven) and "adjust this specific encoding" (direct manipulation). Skip it if you need production-grade, auditable analytics where every transformation must be deterministic and version-controlled—Jupyter notebooks or dbt remain better choices. Also avoid if you're in a regulated industry that can't send data to external LLM APIs, or if you're working with small datasets where traditional tools like Excel or Observable are simpler. The alpha status makes it unsuitable for mission-critical workflows where breaking changes would disrupt operations. Think of it as a powerful exploration scratchpad, not an analytics platform of record.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/data-knowledge/microsoft-data-formulator.svg)](https://starlog.is/api/badge-click/data-knowledge/microsoft-data-formulator)