Guidance: Programming Language Models Like They’re Python Objects
Hook
What if you could constrain an LLM’s output to match a specific format—not through prompt engineering or hoping for the best, but through generation-time grammar constraints that prevent invalid tokens from being produced?
Context
The traditional approach to getting structured outputs from language models is a game of probabilistic whack-a-mole. You craft careful prompts, add examples, maybe fine-tune the model, then parse the output with regex and hope it works. When it fails—and it will—you add retry logic, validation layers, and error handling. The entire pipeline becomes a Rube Goldberg machine of prompt templates, string parsing, and exception handlers.
Guidance flips this paradigm on its head. Instead of treating LLMs as black boxes that you prompt and pray, it gives you a programming interface where generation is constrained by grammars. You define what valid output looks like using context-free grammars, regular expressions, or selection lists, and the library constrains the model to only produce tokens that match. No post-processing. No validation layers. The constraints are enforced during generation itself. This isn’t prompt engineering—it’s programming with probabilistic primitives.
Technical Insight
Guidance’s core innovation is treating language models as immutable objects that you build up through operations. Each interaction returns a new model instance, similar to how string operations work in Python. This might seem like a small detail, but it enables a remarkably clean API. You don’t mutate a global chat history or manage conversation state—you compose model interactions like you’d compose functions.
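The idea can be sketched in a few lines of plain Python. This is a toy stand-in for illustration, not guidance's actual Model class: each + returns a new object and leaves the original untouched.

```python
# Toy stand-in for the immutable-model idea -- NOT guidance's actual
# Model class. Each "+" builds a fresh instance; nothing is mutated.
class ImmutableLM:
    def __init__(self, text: str = ""):
        self.text = text

    def __add__(self, more: str) -> "ImmutableLM":
        # Return a new instance; the original is left untouched
        return ImmutableLM(self.text + more)

base = ImmutableLM()
greeted = base + "You are a helpful assistant"
print(base.text == "")  # True -- base is unchanged
print(greeted.text)     # You are a helpful assistant
```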
The library provides Pythonic context managers for chat roles that feel natural while maintaining complete control:
```python
from guidance import system, user, assistant, gen
from guidance.models import Transformers

phi_lm = Transformers("microsoft/Phi-4-mini-instruct")

lm = phi_lm
with system():
    lm += "You are a helpful assistant"
with user():
    lm += "Hello. What is your name?"
with assistant():
    lm += gen(name="lm_response", max_tokens=20)

print(f"{lm['lm_response']=}")
```
The real power emerges when you add constraints. The gen() function accepts a regex parameter that constrains generation to match a pattern. This isn’t post-generation filtering—the model’s token sampling is constrained to only produce tokens that could lead to a valid match. For a teenager chatbot that needs to state its age as a number:
```python
with assistant():
    lm += gen("lm_age", regex=r"\d+", temperature=0.8)
```
The model will only sample from tokens that could potentially complete a valid number. This dramatically reduces parsing failures and output validation errors in production systems.
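A toy sketch of what token-level constraining means (assumed mechanics for illustration, not guidance's internals): at each decoding step, candidate tokens that cannot extend a valid match are masked out before sampling.

```python
import re

# Toy sketch of constrained decoding -- assumed mechanics, NOT guidance's
# internals. A candidate token survives only if appending it keeps the
# output matching the pattern; for \d+ any all-digit extension qualifies.
# (Real implementations track partial matches against the full grammar.)
def allowed_tokens(prefix: str, candidates: list[str], pattern: str) -> list[str]:
    regex = re.compile(pattern)
    return [tok for tok in candidates if regex.fullmatch(prefix + tok)]

print(allowed_tokens("1", ["7", "42", "years", "+"], r"\d+"))  # ['7', '42']
```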
For discrete choices, the select() function is even more elegant. Instead of prompting “Answer A, B, C, or D” and parsing the response, you constrain generation to exactly those tokens:
```python
from guidance import select

with user():
    lm += """What is the capital of Sweden?
A) Helsinki
B) Reykjavík
C) Stockholm
D) Oslo
"""
with assistant():
    lm += select(["A", "B", "C", "D"], name="model_selection")
```
The model will produce exactly one of those four tokens. Not “The answer is C” or “I believe C is correct”—just “C”. This is transformative for structured workflows where you need to branch on LLM decisions.
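Because the output is guaranteed to be one of exactly four strings, downstream branching needs no parsing and no fallback path. A hypothetical grading step, where "C" stands in for the lm["model_selection"] value captured above:

```python
# Hypothetical downstream logic; "C" stands in for lm["model_selection"].
# select() guarantees the value is one of these keys, so no stripping,
# regex matching, or fallback handling is needed before the lookup.
capitals = {"A": "Helsinki", "B": "Reykjavík", "C": "Stockholm", "D": "Oslo"}

def is_correct(model_selection: str) -> bool:
    return capitals[model_selection] == "Stockholm"

print(is_correct("C"))  # True
print(is_correct("A"))  # False
```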
Guidance takes this further with composable grammar functions using the @guidance decorator. You can build reusable components that encapsulate both the prompt structure and output constraints:
```python
import guidance
from guidance import user, assistant, select
from guidance.models import Model

ASCII_OFFSET = ord("a")

@guidance
def zero_shot_multiple_choice(
    language_model: Model,
    question: str,
    choices: list[str],
):
    with user():
        language_model += question + "\n"
        for i, choice in enumerate(choices):
            language_model += f"{chr(i + ASCII_OFFSET)} : {choice}\n"
    with assistant():
        language_model += select(
            [chr(i + ASCII_OFFSET) for i in range(len(choices))],
            name="string_choice",
        )
    return language_model
```
This function is now a first-class component you can use anywhere. Because of the @guidance decorator, you invoke it without passing the model explicitly—it composes with +, as in lm += zero_shot_multiple_choice(question, choices). The immutable model pattern means these components compose cleanly: each returns a modified copy without side effects.
A brilliant developer experience feature is offline grammar debugging with the Mock model. You can validate your constraints without making expensive API calls:
```python
from guidance import gen
from guidance.models import Mock

grammar = "expr=" + gen(regex=r"\d+([+*]\d+)*", name="expr")

# Validate strings against the grammar
assert grammar.match("expr=12+7*3") is not None
assert grammar.match("expr=12+*3") is None

# Test with a mock model
lm = Mock(b"<s>expr=12+7*3")
lm += grammar
print(lm["expr"])  # 12+7*3
```
This dramatically tightens the iteration loop when building complex grammars. You’re not burning tokens or waiting for API calls—you’re unit testing your constraints.
The library supports multiple backends through a unified interface: Transformers for local models, llama.cpp for optimized inference, OpenAI for API access. The same grammar code works across all of them, though the README notes that constrained generation features require full backend support, meaning not all LLM APIs support the advanced constraint mechanisms.
Gotcha
The power of Guidance’s constraint system comes with a significant caveat: it requires backend support for the constraint mechanisms. The README explicitly states that the constraint system can “ensure that the output conforms to any context free grammar (so long as the backend LLM has full support for Guidance).” If you build your application around sophisticated grammar constraints and then need to switch to a backend that doesn’t support them, you’re rewriting code. This limits cross-platform portability in practice.
The immutable model pattern, while conceptually clean, can be confusing if you’re coming from stateful LLM libraries. Every operation returns a new model instance, which means you need to reassign the result: lm = lm + something. Forget that assignment and your changes disappear.
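The failure mode mirrors Python strings, which the library's design deliberately echoes:

```python
# The same pitfall with Python strings, which guidance's models imitate.
lm = "You are a helpful assistant. "
lm + "Hello. What is your name?"       # result discarded -- lm is unchanged
print(lm.endswith("name?"))            # False: the addition was silently lost
lm = lm + "Hello. What is your name?"  # rebind the name to keep the change
print(lm.endswith("name?"))            # True
```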
The repository’s primary language is Jupyter Notebook, which suggests the library was developed heavily in notebook environments. While the Python API works fine in production, the documentation and examples lean notebook-centric. The README mentions a widget for a “richer user experience” in Jupyter, but this doesn’t translate to production deployments where you’re building APIs or batch processing pipelines. You’ll need to adapt patterns from the notebook-oriented examples.
Verdict
Use Guidance if you’re building applications where LLM output needs to feed into downstream systems that expect specific formats—structured data extraction, form filling, multi-step workflows with branching logic, or anything where parsing failures create reliability issues. The constraint system can significantly reduce output validation problems. It’s also excellent for rapid prototyping when you want to iterate on prompt structures without writing boilerplate parsing code. Skip it if you’re doing free-form generation where output format doesn’t matter, if you need guaranteed portability across every LLM provider (the constraint features won’t work everywhere), or if your team strongly prefers stateful programming patterns and finds immutable objects unnatural. For simple chatbots or content generation, the learning curve may outweigh the benefits. But for structured LLM applications, Guidance’s grammar-based approach offers a more reliable alternative to traditional prompt engineering.