
BAML: The Schema-First Language That Makes LLM Outputs Actually Reliable

Hook

If testing your LLM pipeline takes 2 minutes, you’ll test 10 ideas in 20 minutes. Reduce it to 5 seconds, and you’ll test 240 ideas in the same time. BAML makes this possible by treating prompts as functions you can test directly in your editor.

Context

Most LLM applications face a cruel paradox: the models are powerful, but getting reliable structured outputs requires wrestling with JSON schemas, handling parsing failures, and rebuilding retry logic for every project. Teams either settle for unreliable string outputs or lock themselves into provider-specific tool-calling APIs that break when you switch models. Even worse, the iteration cycle kills productivity—you write a prompt, run your application, wait for the response, discover it failed parsing, adjust the prompt, and repeat. Each cycle burns minutes.

BAML (Basically a Made-up Language) treats this as a compilation problem, not a runtime problem. Built in Rust, BAML is a domain-specific language where you define typed functions with prompts in .baml files, and the compiler generates type-safe client libraries for Python, TypeScript, Ruby, and Go (with additional languages accessible via REST API). The key insight: prompts are functions that take parameters and return types. This simple mental model unlocks type safety, streaming with partial types, IDE tooling, and most importantly, reliable structured outputs even from models that don’t support native tool-calling APIs.

Technical Insight

BAML’s architecture separates prompt engineering from application logic through a compilation layer. You define functions in .baml files with explicit input parameters and return types. Here’s a complete agent example from the README:

function ChatAgent(message: Message[], tone: "happy" | "sad") -> StopTool | ReplyTool {
    client "openai/gpt-4o-mini"

    prompt #"
        Be a {{ tone }} bot.

        {{ ctx.output_format }}

        {% for m in message %}
        {{ _.role(m.role) }}
        {{ m.content }}
        {% endfor %}
    "#
}

class Message {
    role string
    content string
}

class ReplyTool {
  response string
}

class StopTool {
  action "stop" @description(#"
    when it might be a good time to end the conversation
  "#)
}

The function signature ChatAgent(message: Message[], tone: "happy" | "sad") -> StopTool | ReplyTool declares both inputs and outputs explicitly. The prompt uses Jinja-like templating with {{ ctx.output_format }}, which BAML automatically populates with schema information to guide the model. This declarative approach means you’re doing schema engineering—focusing on modeling your domain types—rather than wrestling with prompt strings.

The Rust compiler generates a baml_client for your target language. In Python, calling this function looks native:

from baml_client import b
from baml_client.types import Message, StopTool

messages = [Message(role="assistant", content="How can I help?")]

while True:
  print(messages[-1].content)
  user_reply = input()
  messages.append(Message(role="user", content=user_reply))
  tool = b.ChatAgent(messages, "happy")
  if isinstance(tool, StopTool):
    print("Goodbye!")
    break
  else:
    messages.append(Message(role="assistant", content=tool.response))

Notice the isinstance(tool, StopTool) check—this is real type discrimination based on the union type StopTool | ReplyTool. The generated client provides actual type objects, not just dictionaries. This agent is a simple while loop that maintains conversation state and terminates when the model returns a StopTool, demonstrating that BAML doesn’t force you into a framework—it’s just typed function calls.

Streaming adds only a couple of lines while maintaining type safety:

stream = b.stream.ChatAgent(messages, "happy")
for tool in stream:
    if isinstance(tool, StopTool):
        # handle the partial StopTool as its fields stream in
        ...

final = stream.get_final_response()

Each chunk in the stream is a “Partial type with all Optional fields,” meaning if your return type has five string fields, the partial type has five optional string fields that fill in progressively. This makes building real-time UIs trivial—you can show data as it arrives without manual state management.
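To make the partial-type idea concrete, here is a hypothetical sketch in plain Python (these dataclasses are illustrative stand-ins, not the actual code BAML generates):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical final type: what the generated client returns once the
# response is complete.
@dataclass
class ReplyTool:
    response: str

# Hypothetical partial counterpart: every field becomes Optional while
# streaming, filling in progressively as chunks arrive.
@dataclass
class PartialReplyTool:
    response: Optional[str] = None

def render(partial: PartialReplyTool) -> str:
    # A UI can show whatever has arrived so far, with no manual state tracking.
    return partial.response or "..."

print(render(PartialReplyTool()))               # "..." (nothing arrived yet)
print(render(PartialReplyTool(response="Hi")))  # "Hi"
```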

The killer feature is Schema-Aligned Parsing (SAP), BAML’s algorithm for extracting structured outputs from any model. Traditional approaches fail when models output markdown-wrapped JSON or include chain-of-thought reasoning before the answer. SAP handles these edge cases, which is why BAML worked with DeepSeek-R1 and OpenAI O1 on day one of their releases, even though those models don’t support native tool-calling. The README explicitly states: “With BAML, your structured outputs work in Day-1 of a model release. No need to figure out whether a model supports parallel tool calls, or whether it supports recursive schemas, or anyOf or oneOf etc.”
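To see why naive parsing breaks, consider a toy example in Python. The `tolerant_parse` below is only a rough approximation of the idea behind SAP, not BAML's actual algorithm, and the sample model reply is invented for illustration:

```python
import json
import re

# A model reply that reasons out loud first and wraps its answer in
# markdown fences -- json.loads on the raw text raises JSONDecodeError.
raw = 'Let me think step by step...\n```json\n{"name": "Ada", "role": "engineer"}\n```'

def naive_parse(text: str) -> dict:
    # Fails on the text above: the reply is not bare JSON.
    return json.loads(text)

def tolerant_parse(text: str) -> dict:
    # Toy approximation of schema-aware extraction: locate the JSON object
    # inside the surrounding prose and fences, then parse just that span.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found")
    return json.loads(match.group(0))

print(tolerant_parse(raw))  # {'name': 'Ada', 'role': 'engineer'}
```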

Model switching is declarative and static. To change from GPT-4o-mini to O3-mini, you modify the client declaration:

function Extract() -> Resume {
-  client openai/gpt-4o-mini
+  client openai/o3-mini
  prompt #"
    ....
  "#
}

Beyond simple client swapping, BAML supports sophisticated strategies defined statically: retry policies, fallbacks, and model rotations using a round-robin pattern. For runtime model selection, BAML provides a Client Registry pattern, which the README mentions but does not document in detail.
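Conceptually, a fallback chain with retries and a round-robin rotation behave like the following Python sketch. This is a hypothetical illustration of the behavior only; in BAML these strategies are declared statically in .baml files, not hand-coded at runtime, and the client functions here are invented:

```python
import itertools
from typing import Callable

# Hypothetical stand-ins for two model clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary unavailable")

def stable_backup(prompt: str) -> str:
    return f"backup handled: {prompt}"

def call_with_fallback(clients: list[Callable[[str], str]], prompt: str,
                       max_retries: int = 2) -> str:
    # Try each client in order, retrying each up to max_retries times
    # before moving on -- the shape of a retry policy plus a fallback chain.
    for client in clients:
        for _ in range(max_retries):
            try:
                return client(prompt)
            except Exception:
                continue
    raise RuntimeError("all clients failed")

print(call_with_fallback([flaky_primary, stable_backup], "extract resume"))
# backup handled: extract resume

# Round-robin rotation: spread successive calls across clients.
rotation = itertools.cycle(["client-a", "client-b"])
print([next(rotation) for _ in range(4)])  # alternates client-a, client-b
```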

The VSCode extension turns your editor into a prompt testing environment. You can visualize the full prompt—including multi-modal assets—and see the raw API request before executing. The playground runs tests in parallel, dramatically reducing iteration time. The README emphasizes this speed advantage: “If testing your pipeline takes 2 minutes, you can only test 10 ideas in 20 minutes. If you reduce it to 5 seconds, you can test 240 ideas in the same amount of time.” This isn’t hyperbole—testing prompts without restarting your application or navigating to external playgrounds compounds your iteration velocity.

Gotcha

BAML introduces a compilation step and a new language to your stack. You’re committing to learning BAML syntax, managing .baml files alongside your application code, and running the compiler as part of your build process. For teams that value simplicity and minimal tooling, this overhead might not justify the benefits, especially for straightforward use cases with single-model dependencies.

The IDE tooling currently focuses on VSCode. The README mentions “jetbrains + neovim coming soon,” which suggests teams on IntelliJ, PyCharm, or Vim may have a different experience with the playground’s iteration loop. This is worth considering given how central the IDE experience is to BAML’s value proposition. You can still use BAML without the VSCode extension, but the testing workflow may differ.

The Schema-Aligned Parsing algorithm is BAML’s creation for handling edge cases like markdown-wrapped JSON and chain-of-thought outputs. While BAML itself is Apache 2 licensed and fully open-source, migrating away would require reimplementing the parsing logic yourself or accepting different output reliability. The multi-language client generation partially mitigates this since your application code stays in your preferred language, but the prompt definitions and parsing logic are part of BAML’s ecosystem.

Verdict

Use BAML if you’re building LLM applications where structured outputs are critical—agents that make decisions, data extraction pipelines, or multi-step workflows where type safety prevents runtime failures. The investment in learning a DSL pays off quickly when you need to support multiple models, handle streaming, or iterate on prompts faster than your competitors. Teams working in VSCode will see immediate benefits from the integrated testing playground. BAML excels when prompt engineering is a core competency for your product, not a one-off experiment.

Skip BAML if your project is a simple chatbot with unstructured outputs, you’re locked into a single LLM provider whose native tool-calling meets all your needs, or your team is unwilling to add a compilation step and learn a new language. For production applications that need reliable structured outputs across multiple models, BAML is the rare tool that actually delivers on its promise to make prompt engineering feel like software engineering.
