Guidance: Programming with LLMs Like They're Regular Expressions
Hook
What if you could make an LLM output valid JSON with 100% certainty—not through prompt engineering tricks or validation layers, but by making invalid outputs literally impossible to generate?
Context
The traditional LLM workflow is frustratingly probabilistic: you craft a prompt, cross your fingers, parse the response, and hope it matches your expected format. When it doesn't—and it often doesn't—you're stuck in a generate-validate-retry loop that burns tokens, adds latency, and still might fail after multiple attempts. Want a list of exactly five items? The model might give you four or six. Need valid JSON? Prepare for trailing commas and missing brackets. Asking for a selection from predefined options? The model will confidently invent new ones.
This isn't just annoying in development; it's a production nightmare. Every failure case needs error handling. Every retry costs money and time. Systems built on unreliable outputs require elaborate validation pipelines, fallback strategies, and defensive programming. Guidance emerged from a fundamental insight: instead of asking the LLM to follow instructions and validating afterward, what if we constrained the generation process itself? What if invalid outputs weren't wrong answers but structurally impossible results? This shifts LLM interaction from probabilistic hope to deterministic control, treating language models less like creative writers and more like programmable state machines.
Technical Insight
Guidance's core innovation is constrained decoding through context-free grammars. Instead of generating text freely and validating afterward, it restricts the model's token selection in real-time, ensuring every token conforms to your specified grammar. The library intercepts the generation process at the token level, presenting the model with only valid next-token options based on the current parse state.
Here's what this looks like in practice. Suppose you need a model to classify sentiment and provide a confidence score in a specific format:
from guidance import models, gen, select

# Load your model (works with Transformers, llama.cpp, OpenAI)
lm = models.TransformersChat('mistralai/Mistral-7B-Instruct-v0.1')

# Define the constrained generation. Plain concatenation keeps the '\n'
# escape out of an f-string expression, which Python before 3.12 rejects.
lm += 'Analyze this review: "The product exceeded expectations!"\n'
lm += 'Sentiment: ' + select(['positive', 'negative', 'neutral'], name='sentiment') + '\n'
lm += 'Confidence: ' + gen(regex=r'[0-9]{1,2}', name='confidence') + '%\n'
lm += 'Reasoning: ' + gen(stop='\n', max_tokens=50, name='reasoning')

print(lm['sentiment'])   # Guaranteed to be 'positive', 'negative', or 'neutral'
print(lm['confidence'])  # Guaranteed to be 1-2 digits
The select() function doesn't just prompt the model to choose; it literally prevents the model from generating any tokens outside that set. The regex constraint on confidence ensures the output is always 1-2 digits—the model cannot generate "high" or "85.5%" even if those would be higher probability completions. This is fundamentally different from prompt engineering.
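To see exactly which strings the confidence pattern admits, it can help to test a few candidates with Python's standard re module. This is only a sketch: the candidate list is made up for illustration, and the check runs on finished strings, not token by token.

```python
import re

# The same pattern used for the confidence field above.
CONFIDENCE = re.compile(r"[0-9]{1,2}")

# Hypothetical completions a model might prefer if left unconstrained.
candidates = ["85", "7", "high", "85.5", "100"]

# fullmatch requires the whole string to fit the pattern.
accepted = [c for c in candidates if CONFIDENCE.fullmatch(c)]
print(accepted)  # ['85', '7']
```

Note that "100" is rejected too: a two-digit pattern cannot express 100% confidence, so the pattern itself deserves a careful look. Guidance enforces the same pattern during generation rather than after it.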
The architecture achieves this through an immutable state chain. Each operation creates a new model state, appending to the context while maintaining the constraint grammar. When you call select() or gen(regex=...), Guidance constructs a parse tree representing valid continuations. During generation, it queries the backend model's logits (probability distribution over next tokens), masks out invalid tokens according to the grammar, and samples from the remaining valid set. This happens for every single token until generation completes.
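The masking step can be illustrated with a toy decoder in plain Python. This is a minimal sketch, not Guidance's implementation: the vocabulary, the fake logits, and the helper names (allowed_token_ids, constrained_greedy_decode) are all assumptions made for the example, and it uses greedy selection where a real decoder would usually sample.

```python
import math

# Toy vocabulary and option list; both are illustrative assumptions.
VOCAB = ["pos", "itive", "neg", "ative", "neu", "tral", "great", "!"]
OPTIONS = ["positive", "negative", "neutral"]


def allowed_token_ids(prefix, options, vocab):
    """Ids of tokens that keep prefix + token a prefix of some option."""
    return [
        i for i, tok in enumerate(vocab)
        if any(opt.startswith(prefix + tok) for opt in options)
    ]


def constrained_greedy_decode(logits_fn, options, vocab):
    """Greedy decoding where disallowed tokens are masked to -inf."""
    text = ""
    while text not in options:
        allowed = set(allowed_token_ids(text, options, vocab))
        if not allowed:
            raise ValueError(f"no valid continuation after {text!r}")
        logits = logits_fn(text)
        masked = [
            score if i in allowed else -math.inf
            for i, score in enumerate(logits)
        ]
        best = max(range(len(vocab)), key=lambda i: masked[i])
        text += vocab[best]
    return text


def fake_logits(prefix):
    """Stand-in for a model that strongly prefers 'great' and '!'."""
    return [1.0, 0.5, 0.2, 0.1, 0.3, 0.0, 5.0, 4.0]


result = constrained_greedy_decode(fake_logits, OPTIONS, VOCAB)
print(result)  # positive
```

Even though the fake model scores "great" highest at every step, that token is never selectable, so the output can only be one of the three options.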
The stateless decorator pattern enables composable grammar functions, letting you build complex structures from reusable components:
import guidance
from guidance import models, gen

lm = models.TransformersChat('mistralai/Mistral-7B-Instruct-v0.1')

@guidance(stateless=True)
def json_object(lm, fields):
    lm += '{\n'
    for i, (key, value_type) in enumerate(fields.items()):
        lm += f'  "{key}": '
        if value_type == 'string':
            # The program emits the quotes; the model only fills the value
            lm += '"' + gen(stop='"', name=key) + '"'
        elif value_type == 'number':
            # No leading zeros, optional fraction (JSON number without exponent)
            lm += gen(regex=r'-?(0|[1-9][0-9]*)(\.[0-9]+)?', name=key)
        if i < len(fields) - 1:
            lm += ','
        lm += '\n'
    lm += '}'
    return lm

# Now use it to guarantee valid JSON structure.
# Decorated functions are called without the lm argument.
lm += "Extract user info: "
lm += json_object({'name': 'string', 'age': 'number', 'email': 'string'})
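This function constructs a grammar in which the braces, keys, quotes, and commas come from the program rather than the model, so the JSON skeleton is well formed every time: no missing commas, no unbalanced brackets, no string where a number belongs. The model is guided through the structure step by step and only fills in the values. One caveat: stop='"' ends a string at the first double quote but does not forbid a raw backslash or newline inside it, so a fully strict guarantee needs a tighter pattern for string values.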
A particularly powerful feature is offline validation with Mock models. Before burning API credits, you can test your grammars:
from guidance import models, select

mock = models.Mock()
mock += f"Pick a color: {select(['red', 'blue', 'green'])}"
# This exercises the grammar without calling any LLM
Guidance provides a unified interface across backends. The same constraint code works whether you're using a local Transformers model, a llama.cpp model, or OpenAI's API. The library handles backend-specific token masking implementations, though the level of control varies: local models give you true token-level masking, while API-based models may require prompt-based approximations that are less reliable.
The role-based context managers integrate naturally with chat models:
from guidance import system, user, assistant

with system():
    lm += "You are a helpful assistant that outputs valid JSON."
with user():
    lm += "Give me info about Paris."
with assistant():
    lm += json_object({'city': 'string', 'country': 'string', 'population': 'number'})
This combines the conversational flow developers expect with the structural guarantees Guidance provides.
Gotcha
Constrained generation sounds like magic until you hit its practical boundaries. The first limitation is backend support: true token-level control requires access to the model's logit distribution. Local models via Transformers or llama.cpp work perfectly, but API-based services like OpenAI's don't expose logits. Guidance attempts to work around this with prompt-based approximations for API models, but you lose the hard guarantees—it becomes sophisticated prompting rather than true constraint.
The immutable state model creates mental friction. Every operation returns a new model state, so you can't do lm.generate() and mutate state like traditional libraries. You must reassign: lm = lm + "text" or use lm += "text". This functional approach is intentional for composability, but developers accustomed to mutable objects will write bugs until the paradigm clicks. The error messages when you forget this aren't always intuitive.
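The reassignment rule is easier to internalize with a toy example. The ToyState class below is a hypothetical stand-in written for this sketch, not Guidance's model class; it only mimics the "every operation returns a new state" behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyState:
    """Hypothetical immutable state; not Guidance's actual class."""
    text: str = ""

    def __add__(self, other):
        # Return a new state instead of mutating this one.
        return ToyState(self.text + str(other))


lm = ToyState()
lm + "dropped"          # result discarded, so lm is unchanged
lm += "kept"            # rebinds lm to the new state

base = lm
branch_a = base + " A"  # branches share a prefix without interfering
branch_b = base + " B"

print(lm.text, "|", branch_a.text, "|", branch_b.text)
# kept | kept A | kept B
```

The discarded expression on the third-to-last statement group is the classic bug: nothing fails, the text simply never lands in the state. The upside of the same design is cheap branching from a shared prefix.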
Performance overhead exists but varies widely. Simple constraints (selecting from a list) add minimal latency. Complex context-free grammars, especially deeply nested JSON schemas or intricate regular expressions, require more computation per token, because each generation step involves advancing the grammar, computing the valid next states, and masking logits. For a local model, this might add roughly 10-50% to generation time depending on grammar complexity; treat that range as a rough estimate and benchmark your own grammar.

Debugging complex grammars can also be painful. When generation fails or produces unexpected results, you're debugging both the LLM's tendencies and your grammar definition, and there's no visual debugger for parse states.
Finally, constrained generation doesn't mean semantically correct generation. Guidance guarantees your output is syntactically valid JSON, but the values inside might still be nonsense. A model constrained to output {"age": <number>} might generate {"age": 999}. You've eliminated parsing errors, not reasoning errors. This sounds obvious but trips up developers who assume structural constraints somehow improve factual accuracy.
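Because structure and meaning are separate concerns, a small semantic check after parsing is still worthwhile. The sketch below assumes a hypothetical check_person helper and an illustrative plausible-age range of 0 to 130; neither comes from Guidance.

```python
import json


def check_person(raw, min_age=0, max_age=130):
    """Parse JSON and return a list of semantic problems (empty if none)."""
    record = json.loads(raw)
    problems = []
    age = record.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, (int, float)):
        problems.append("age is not a number")
    elif not (min_age <= age <= max_age):
        problems.append(f"age {age} outside {min_age}-{max_age}")
    return problems


print(check_person('{"age": 999}'))  # ['age 999 outside 0-130']
print(check_person('{"age": 34}'))   # []
```

The first input parses without error, which is all a grammar can promise; only the range check catches that the value is implausible.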
Verdict
Use if: You're building production systems where output format reliability matters more than creative flexibility: data extraction pipelines, structured API responses, form filling, classification tasks with predefined categories. You have access to local models or endpoints that expose token-level control. You're tired of validate-retry loops burning tokens and adding latency. You need to compose complex output structures from reusable components and want compile-time-ish guarantees about format validity.

Skip if: You're doing exploratory prompting where flexibility matters more than format guarantees. You're locked into API-based LLM services without logit access and need true constraint guarantees (not approximations). Your use case is primarily unstructured creative generation where rigid formats would be counterproductive. You need maximum generation speed and can tolerate occasional parsing failures handled by retry logic. The learning curve and paradigm shift aren't justified for simple prompting tasks.