Instructor: How Pydantic Models Became the Best Way to Extract Structured Data from LLMs

Hook

Over 3 million monthly downloads make Instructor the go-to solution for extracting structured data from LLMs—solving the problem of getting predictable JSON instead of creative garbage.

Context

Large language models are brilliant at understanding context but terrible at following instructions precisely. Ask GPT-4 to extract a person’s name and age, and you might get back a beautifully formatted JSON object, a paragraph of prose, or—if you’re unlucky—a philosophical meditation on the nature of time. This unpredictability makes LLMs nearly unusable for production systems that need structured data.

The traditional solution involves writing complex prompts with JSON schema examples, manually parsing responses, implementing retry logic for malformed data, and crossing your fingers. A simple extraction task balloons into 30+ lines of error-prone boilerplate. Instructor eliminates this entire category of problems by treating structured extraction as a type-safety challenge, not a prompt engineering problem. It leverages Pydantic—Python’s de facto data validation library—to define what you want, then handles everything else automatically.
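For contrast, here is roughly what that boilerplate looks like. This is a minimal sketch of the manual approach, with a stubbed `call_llm` function standing in for a real API call (the function name, prompt wording, and schema checks are illustrative, not from any particular codebase):

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real API call; a real model might
    # return prose, markdown fences, or malformed JSON instead.
    return '{"name": "John Doe", "age": 32}'

def extract_user(text: str, max_retries: int = 3) -> dict:
    prompt = (
        "Extract the person's name and age from the text below. "
        'Respond ONLY with JSON like {"name": "...", "age": 0}.\n\n' + text
    )
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(prompt).strip()
        # Strip markdown fences the model may add despite instructions
        raw = raw.removeprefix("```json").removesuffix("```").strip()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = f"Invalid JSON: {e}"
            continue
        # Manual schema checks that a validation library would handle
        if not isinstance(data.get("name"), str):
            last_error = "Missing or non-string 'name'"
            continue
        if not isinstance(data.get("age"), int) or data["age"] < 0:
            last_error = "Missing or invalid 'age'"
            continue
        return data
    raise ValueError(f"Extraction failed after {max_retries} tries: {last_error}")

user = extract_user("John Doe, 32 years old")
print(user["age"])  # 32
```

Every new field or nested object multiplies these checks, which is exactly the category of code Instructor deletes.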

Technical Insight

Core Pipeline

System architecture (auto-generated diagram, rendered here as text): User Request → Pydantic Model → Schema Converter (schema introspection → function/tool schema) → LLM Client Wrapper → API Call + Schema → LLM Provider → JSON Response → Response Parser → Deserialize → Validator → Typed Python Object. On a validation error, the error is fed back to the provider through the retry logic until the output is valid or max retries are reached.

Instructor’s core insight is deceptively simple: if you can describe your desired output as a Pydantic model, the library can handle the entire LLM interaction pipeline. Under the hood, it converts your Pydantic schema into function-calling schemas (for providers like OpenAI and Anthropic) or JSON mode configurations, injects them into API calls, and deserializes responses back into validated Python objects.

Here’s what makes this powerful in practice:

from pydantic import BaseModel, field_validator
from typing import List
import instructor

class Address(BaseModel):
    street: str
    city: str
    country: str

class User(BaseModel):
    name: str
    age: int
    addresses: List[Address]
    
    @field_validator('age')
    @classmethod
    def validate_age(cls, v: int) -> int:
        if v < 0:
            raise ValueError('Age must be positive')
        return v

client = instructor.from_provider("openai/gpt-4o-mini")

user = client.chat.completions.create(
    response_model=User,
    messages=[{
        "role": "user", 
        "content": "John Doe, 32 years old, lives at 123 Main St, NYC, USA and has a summer home at 456 Beach Rd, Miami, USA"
    }],
    max_retries=3
)

print(user.addresses[0].city)  # "NYC" - fully typed, validated

The magic happens in three layers. First, Instructor introspects your Pydantic model and generates the appropriate schema format for your LLM provider—OpenAI’s function calling, Anthropic’s tool use, or Google’s schema format. Second, when the LLM responds, Instructor deserializes the JSON and runs it through Pydantic’s validation engine. Third—and this is where it gets interesting—when validation fails, Instructor automatically feeds the ValidationError back to the LLM as a message, giving it a chance to self-correct.
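The first layer is easy to see for yourself: the schema Instructor sends to the provider is derived from Pydantic's own JSON Schema introspection. Here is that introspection with plain Pydantic and no API call (Instructor then wraps this output in the provider-specific envelope, e.g. an OpenAI tools entry):

```python
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

# Pydantic emits a standard JSON Schema for the model; this is the
# raw material for the provider-specific function/tool schema.
schema = User.model_json_schema()
print(schema["properties"])
# {'name': {'title': 'Name', 'type': 'string'}, 'age': {'title': 'Age', 'type': 'integer'}}
print(schema["required"])
# ['name', 'age']
```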

This retry mechanism is smarter than it appears. Instead of generic “try again” prompts, the LLM receives specific error messages like “Age must be positive” or “Expected string, got null for field ‘city’”. In practice, this catches hallucinations, type mismatches, and missing required fields without custom error handling code.
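The feedback loop can be sketched without any network calls. Everything below (the scripted `fake_llm` responses, the message shapes, the helper names) is illustrative of the mechanism, not Instructor's actual internals:

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

# Scripted model outputs: the first attempt has the wrong type for
# 'age'; the second, after seeing the error, corrects it.
responses = iter(['{"name": "John", "age": "thirty-two"}',
                  '{"name": "John", "age": 32}'])

def fake_llm(messages: list) -> str:
    return next(responses)

def create_with_retries(messages: list, max_retries: int = 3) -> User:
    for _ in range(max_retries):
        raw = fake_llm(messages)
        try:
            return User.model_validate_json(raw)
        except ValidationError as e:
            # The key idea: the *specific* error text goes back to the model
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"Fix these validation errors: {e}"},
            ]
    raise RuntimeError("Validation failed after retries")

user = create_with_retries([{"role": "user", "content": "John, 32"}])
print(user.age)  # 32, recovered on the second attempt
```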

The library also solves streaming elegantly through Partial models:

from instructor import Partial

for partial_user in client.chat.completions.create(
    response_model=Partial[User],
    messages=[{"role": "user", "content": "..."}],
    stream=True
):
    print(partial_user.name)  # Updates as tokens arrive
    # None → "John" → "John Doe"

Partial validation means you can update UI elements in real-time as nested objects fill in, without waiting for complete responses. This is particularly valuable for complex extractions from long documents.
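Conceptually, `Partial[User]` behaves like a copy of `User` with every field made optional, so each intermediate JSON fragment validates cleanly. A rough illustration with plain Pydantic (`PartialUser` here is a hand-written stand-in, not the type Instructor actually generates):

```python
from typing import Optional
from pydantic import BaseModel

class PartialUser(BaseModel):
    # Hand-written stand-in for what Partial[User] produces:
    # every field optional, defaulting to None
    name: Optional[str] = None
    age: Optional[int] = None

# Simulated stream: the JSON object as it fills in over time
snapshots = ['{}', '{"name": "John"}', '{"name": "John Doe", "age": 32}']
for snap in snapshots:
    user = PartialUser.model_validate_json(snap)
    print(user.name)  # None → "John" → "John Doe"
```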

Instructor’s provider abstraction is deliberately thin—it wraps native client APIs rather than replacing them. You can pass any parameter your underlying provider supports (temperature, top_p, etc.) directly through. This design choice means you’re never fighting the library when you need provider-specific features, and upgrading to new model capabilities is trivial.

Gotcha

Instructor makes structured extraction reliable, but it can’t fix fundamental LLM limitations. Weaker models like GPT-3.5 or smaller Llama variants still hallucinate fields, ignore schema constraints, or return semantically wrong data that passes validation. A field validator can check that age is a positive integer, but it can’t verify that “35” is actually the person’s correct age versus an LLM guess.

The automatic retry mechanism is both a feature and a potential cost consideration. Each retry consumes additional tokens and API credits. With complex schemas or strict validation rules, you may see increased token usage on difficult inputs. The library defaults to 3 retries, but there’s no guarantee of success—some inputs simply won’t parse correctly no matter how many attempts. You’ll want monitoring around retry counts in production.

Instructor is deliberately scoped to extraction only. If you need agentic workflows—multi-step reasoning, tool orchestration, memory between interactions—you’ll outgrow it quickly. The README explicitly points developers toward PydanticAI for agent capabilities, which is telling. Instructor shines for stateless parsing tasks but lacks the scaffolding for complex AI applications.

Verdict

Use Instructor if you’re building production systems that extract structured data from text at scale—document parsing pipelines, form auto-fill, data enrichment APIs, or any workflow where you’re tired of writing JSON validation boilerplate. The automatic retry logic and provider abstraction are production-grade conveniences that save hours of debugging. It’s especially valuable when working with nested objects or streaming partial results to UIs. Skip it if you need full agent frameworks with memory and tools (reach for PydanticAI or LangChain instead), or if you’re doing simple one-off scripts where manual JSON parsing is trivial. Also skip if you need guaranteed schema compliance—consider constrained generation alternatives if you can run models locally. The sweet spot is teams shipping LLM features who want reliability without reinventing validation infrastructure.
