Pydantic AI: The Agent Framework That Treats LLMs Like Statically-Typed APIs

Hook

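While most LLM frameworks treat AI outputs as unpredictable black boxes, Pydantic AI enforces typed contracts on model responses. Outputs are validated against your declared schema before your code ever sees them, and static type checkers can flag misuse of those types during development, so agent failures surface as validation errors and retries rather than as complaints from your users.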

Context

The explosion of LLM frameworks in 2023-2024 created a paradox: tools designed to make AI development easier often made production deployment harder. LangChain offered hundreds of integrations but sacrificed type safety for flexibility. LlamaIndex excelled at RAG but struggled with complex agent workflows. Most frameworks treated LLM interactions as inherently dynamic, forcing developers to add validation as an afterthought—usually after a production incident where an agent hallucinated malformed JSON or called the wrong tool.

Pydantic AI emerged from the team behind Pydantic, the validation library already powering FastAPI, LangChain, OpenAI's SDK, and many other major Python LLM tools. They recognized that agent frameworks were recreating validation logic Pydantic had already solved. Rather than bolt validation onto agent workflows, they built the framework around it from day one. The result is an opinionated framework that trades some flexibility for guarantees: your agents are typed, validated, and observable by default, and output that violates the declared schema is rejected rather than passed through.

Technical Insight

At its core, Pydantic AI inverts the traditional agent architecture. Instead of wrapping LLM calls in validation layers, it uses Pydantic models as the contract between your code and the AI. Define an agent with a result type, and the framework enforces that structure across model providers, streaming outputs, and tool executions.

Here's what a type-safe agent looks like in practice:

import asyncio

from pydantic import BaseModel, field_validator
from pydantic_ai import Agent, RunContext

class ResearchResult(BaseModel):
    summary: str
    confidence: float
    sources: list[str]

    @field_validator('confidence')
    @classmethod
    def check_confidence(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError('Confidence must be between 0 and 1')
        return v

# Note: newer Pydantic AI releases rename result_type to output_type
# (and result.data to result.output); adjust to your installed version.
agent = Agent(
    'openai:gpt-4',
    result_type=ResearchResult,
    system_prompt='You are a research assistant. Always provide sources.'
)

@agent.tool
async def search_papers(ctx: RunContext[None], query: str) -> str:
    """Search academic papers for the given query."""
    # Your search implementation
    return f"Found papers about {query}"

async def main() -> None:
    result = await agent.run('What are the latest developments in RAG?')
    # result.data is guaranteed to be a ResearchResult with validated fields
    print(f"Confidence: {result.data.confidence}")

asyncio.run(main())

The magic happens in that result_type parameter. By declaring ResearchResult, you're not just hoping the LLM returns the right shape: Pydantic AI generates the JSON schema, passes it to the model (typically as a tool or structured-output definition rather than free-form prompt text), validates the response, and retries with error feedback if validation fails. If the model returns confidence: 1.5, your validator rejects it before it reaches your application logic. The LLM gets the error message and tries again, up to the configured retry limit.
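
The validate-and-retry cycle described above can be sketched with plain Pydantic, without any agent framework. This is a minimal illustration under stated assumptions, not Pydantic AI's internal implementation: fake_model is a hypothetical stand-in for an LLM call, and run_with_retries and its max_retries parameter are names invented for this example.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ResearchResult(BaseModel):
    summary: str
    confidence: float = Field(ge=0, le=1)  # same bound as the validator above
    sources: list[str]


def fake_model(prompt: str, feedback: Optional[str]) -> str:
    """Hypothetical LLM stand-in: wrong on the first try, corrected after feedback."""
    if feedback is None:
        return '{"summary": "RAG overview", "confidence": 1.5, "sources": []}'
    return '{"summary": "RAG overview", "confidence": 0.9, "sources": []}'


def run_with_retries(prompt: str, max_retries: int = 2) -> tuple[ResearchResult, int]:
    """Validate the raw reply; on failure, feed the error text back and retry."""
    feedback: Optional[str] = None
    for attempt in range(max_retries + 1):
        raw = fake_model(prompt, feedback)
        try:
            return ResearchResult.model_validate_json(raw), attempt
        except ValidationError as exc:
            feedback = str(exc)  # becomes the corrective message for the next call
    raise RuntimeError(f"no valid output after {max_retries} retries: {feedback}")


result, retries_used = run_with_retries("What are the latest developments in RAG?")
schema = ResearchResult.model_json_schema()  # the contract a framework can send to the model
```

Here the first reply fails the confidence bound, the error text is fed back, and the second reply validates, so application code only ever receives a ResearchResult that satisfies the schema.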

This architecture extends to the 'capabilities' system, which solves a real problem in agent composition. Traditional frameworks force you to choose between monolithic agents (hard to test, impossible to reuse) or manual orchestration (brittle, verbose). Capabilities bundle tools, hooks, and instructions into reusable units:

from pydantic_ai.capabilities import WebSearch, Thinking

# Compose capabilities like middleware
research_agent = Agent(
    'anthropic:claude-3-5-sonnet',
    result_type=ResearchResult,
    capabilities=[WebSearch(), Thinking()]
)

# WebSearch adds search tools automatically
# Thinking adds chain-of-thought before responses
# Both include their own system prompts and hooks

Under the hood, capabilities are just Python classes that implement a standard interface. WebSearch registers search tools and adds instructions about citation formatting. Thinking injects a reasoning step before the final answer. You can compose them like building blocks without worrying about prompt conflicts or tool namespace collisions—the framework merges them intelligently.
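
To make the bundling idea concrete, here is a small standard-library sketch of how instructions and tools from several capabilities could be merged. The Capability dataclass and compose function are hypothetical names for illustration only, not the library's interface, and rejecting duplicate tool names is just one possible collision policy.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Capability:
    """Hypothetical bundle of instructions and tools (not the library's class)."""
    name: str
    instructions: str
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)


def compose(capabilities: list[Capability]) -> tuple[str, dict[str, Callable[..., str]]]:
    """Concatenate instructions in order and reject duplicate tool names."""
    instructions: list[str] = []
    tools: dict[str, Callable[..., str]] = {}
    for cap in capabilities:
        instructions.append(cap.instructions)
        for tool_name, fn in cap.tools.items():
            if tool_name in tools:
                raise ValueError(f"tool {tool_name!r} is registered twice")
            tools[tool_name] = fn
    return "\n".join(instructions), tools


web_search = Capability(
    name="web_search",
    instructions="Cite a source for every claim.",
    tools={"search": lambda query: f"results for {query}"},
)
thinking = Capability(name="thinking", instructions="Reason step by step before answering.")

prompt, tools = compose([web_search, thinking])
```

Failing loudly on a name clash keeps the merge predictable. A framework could instead prefix tool names per capability, which trades shorter names for guaranteed uniqueness.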

The model-agnostic design deserves attention because it's more sophisticated than simple adapter patterns. Pydantic AI doesn't just abstract provider APIs; it normalizes their capabilities. When you switch from openai:gpt-4 to gemini-1.5-pro, the framework automatically adjusts for differences in tool calling formats, streaming protocols, and token counting. Your agent code stays identical:

import asyncio

async def compare_models() -> None:
    # Same code, different model
    for model in ['openai:gpt-4', 'anthropic:claude-3-5-sonnet', 'gemini-1.5-flash']:
        agent = Agent(model, result_type=ResearchResult)
        result = await agent.run('Summarize quantum computing')
        # Identical interface, provider-specific optimizations

asyncio.run(compare_models())
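
One piece of that normalization, tool calling formats, can be illustrated with a short sketch. OpenAI-style chat tool calls carry their arguments as a JSON-encoded string, while Anthropic-style tool_use blocks carry an already-parsed object. The ToolCall dataclass and the two converter functions below are hypothetical names for this example and do not reflect how Pydantic AI represents calls internally.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """Provider-neutral tool call (hypothetical internal representation)."""
    call_id: str
    name: str
    args: dict[str, Any]


def from_openai(raw: dict[str, Any]) -> ToolCall:
    # OpenAI-style tool calls: arguments arrive as a JSON-encoded string.
    fn = raw["function"]
    return ToolCall(raw["id"], fn["name"], json.loads(fn["arguments"]))


def from_anthropic(raw: dict[str, Any]) -> ToolCall:
    # Anthropic-style tool_use blocks: arguments arrive as a parsed object.
    return ToolCall(raw["id"], raw["name"], raw["input"])


openai_call = from_openai({
    "id": "call_1",
    "type": "function",
    "function": {"name": "search_papers", "arguments": '{"query": "RAG"}'},
})
anthropic_call = from_anthropic({
    "type": "tool_use",
    "id": "toolu_1",
    "name": "search_papers",
    "input": {"query": "RAG"},
})
```

Once both payloads are reduced to the same ToolCall shape, the code that dispatches to search_papers no longer needs to know which provider produced the request.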

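For production deployments, durable execution support is the standout feature. Traditional agents fail catastrophically when services restart or connections drop mid-workflow. With durable execution, agent progress is persisted after each step, allowing a run to resume from the last successful operation. The project's documentation describes this feature through integrations with workflow engines such as Temporal and DBOS, so treat the module path and parameters in the snippet below as illustrative and check them against your installed version: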

from pydantic_ai.durable import DurableAgent

agent = DurableAgent(
    'openai:gpt-4',
    result_type=ResearchResult,
    storage='postgres://localhost/agent_state'
)

# If this crashes after the search but before the final response,
# rerunning with the same run_id resumes from the search result
result = await agent.run(
    'Deep research task',
    run_id='unique-job-id'
)
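
The checkpoint-and-resume behavior can be sketched in a few lines of standard-library Python. This is an illustration of the general technique under simplifying assumptions, not the framework's mechanism: the checkpoints dictionary stands in for durable storage, and run_steps, search, and flaky_summarize are hypothetical names.

```python
from typing import Callable

# In-memory stand-in for durable storage; a real system would use a database.
checkpoints: dict[str, dict[str, str]] = {}
calls: list[str] = []  # records which steps actually executed


def run_steps(run_id: str, steps: list[tuple[str, Callable[[], str]]]) -> dict[str, str]:
    """Run steps in order, skipping any whose result is already checkpointed."""
    done = checkpoints.setdefault(run_id, {})
    for name, step in steps:
        if name in done:
            continue  # completed in an earlier attempt
        done[name] = step()  # stored only if the step returns normally
    return done


def search() -> str:
    calls.append("search")
    return "3 papers found"


def flaky_summarize() -> str:
    calls.append("summarize")
    if calls.count("summarize") == 1:
        raise ConnectionError("connection dropped mid-workflow")
    return "summary of 3 papers"


steps = [("search", search), ("summarize", flaky_summarize)]
try:
    run_steps("unique-job-id", steps)
except ConnectionError:
    pass  # first attempt crashes after the search step has been checkpointed

final = run_steps("unique-job-id", steps)  # resumes: search is not executed again
```

The second call with the same run_id skips the search step because its result is already stored, which is the property that makes a crashed run cheap to resume.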

The observability integration with Pydantic Logfire (and OpenTelemetry) goes beyond basic logging. Every agent run generates structured traces showing token usage, latency per tool call, validation failures, and retry attempts. In production, this means you can correlate slow responses with specific tool executions or identify which validation errors trigger the most retries—critical data for optimizing costs and reliability.
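
As a rough illustration of the analysis such traces enable, the sketch below aggregates a handful of span records to find the slowest tool and the most frequent retry cause. The record shape and field names (kind, name, ms) are assumptions made for this example. Real Logfire or OpenTelemetry exports use their own schema.

```python
from collections import Counter, defaultdict

# Hypothetical span records; real trace exports carry more fields and other names.
spans = [
    {"kind": "tool", "name": "search_papers", "ms": 820},
    {"kind": "tool", "name": "search_papers", "ms": 640},
    {"kind": "tool", "name": "fetch_pdf", "ms": 2300},
    {"kind": "retry", "name": "confidence out of range", "ms": 0},
    {"kind": "retry", "name": "confidence out of range", "ms": 0},
    {"kind": "retry", "name": "sources missing", "ms": 0},
]

tool_latency_ms = defaultdict(int)  # total time spent per tool
retry_causes = Counter()            # number of retries per validation error

for span in spans:
    if span["kind"] == "tool":
        tool_latency_ms[span["name"]] += span["ms"]
    elif span["kind"] == "retry":
        retry_causes[span["name"]] += 1

slowest_tool = max(tool_latency_ms, key=tool_latency_ms.get)
top_retry_cause, top_retry_count = retry_causes.most_common(1)[0]
```

With these sample records, the totals point at fetch_pdf as the latency hotspot and at the confidence bound as the validation rule most worth addressing in the prompt.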

Gotcha

The framework's strength—deep Pydantic integration—is also its constraint. If your team isn't already comfortable with Pydantic's validation patterns, field validators, and type hints, there's a learning curve. You can't just pass dictionaries around and hope for the best; the framework enforces structure at every layer. This is exactly what makes it production-ready, but rapid prototyping can feel slower compared to LangChain's "anything goes" flexibility.

The ecosystem gap is real. LangChain has hundreds of pre-built integrations with vector databases, document loaders, and specialized tools. Pydantic AI launched in late 2024, so you'll often need to write custom tool implementations. The Model Context Protocol (MCP) capability helps—it lets you integrate MCP-compatible tools from the growing ecosystem—but you're still trading maturity for type safety. If your use case needs exotic integrations (legacy enterprise systems, niche APIs), you might spend more time building adapters than building agents. The documentation, while excellent for core features, doesn't yet cover every edge case you'll encounter when combining multiple capabilities or implementing custom model providers.

Verdict

Use if: You're building production GenAI applications where reliability justifies upfront structure, you're already using Pydantic/FastAPI and want consistent validation patterns, you need to support multiple LLM providers without vendor lock-in, or observability and debugging are critical (especially if cost tracking and performance optimization matter). Pydantic AI shines when agent outputs feed into typed systems (APIs, databases, financial calculations) where validation failures should fail fast, not corrupt data.

Skip if: You're in an exploratory phase and need rapid prototyping with minimal ceremony, your use case requires extensive third-party integrations that don't exist in the Pydantic AI ecosystem yet, your team strongly prefers dynamic typing and finds Pydantic's strictness more friction than safety, or you're building simple single-model workflows where the observability and durability features would be overkill. For proof-of-concept projects or research experiments where agent failure is low-stakes, lighter frameworks will let you move faster.