ell: Treating Prompts as Versioned Functions Instead of Strings
Hook
What if every prompt iteration you’ve ever tried was automatically versioned, serialized, and queryable—without you changing a single line of your workflow?
Context
Prompt engineering has a dirty secret: most teams treat prompts like magic strings scattered across codebases, Notion docs, and Slack threads. When a prompt breaks in production, there’s no diff to review. When you want to A/B test two variations, you’re copy-pasting into spreadsheets. The ecosystem responded with heavyweight frameworks like LangChain that solve this by adding layers of abstraction—chains, agents, retrievers—but in doing so, they pull you away from writing normal Python.
The core tension is that prompts are fundamentally code—they have inputs, outputs, and versions—but we’ve been treating them as configuration. ell takes a different approach: what if prompts were just decorated Python functions? What if versioning happened automatically through code analysis, the same way Git tracks changes? What if you could visualize every prompt iteration in a local web UI without sending data to external platforms? This is prompt engineering as a software engineering discipline, not a dark art.
Technical Insight
ell introduces the concept of Language Model Programs (LMPs): Python functions decorated with @ell.simple or @ell.complex that define how to interact with language models. Judging by the examples in the codebase, the distinction is one of return type: simple LMPs return plain text, while complex LMPs work with richer message objects. Here's the canonical example from the README:
import ell

@ell.simple(model="gpt-4o")
def hello(world: str):
    """You are a helpful assistant that writes in lower case."""
    return f"Say hello to {world[::-1]} with a poem."

hello("sama")
Notice what’s happening: the docstring becomes the system message, the return value becomes the user message, and the decorator handles model invocation. This isn’t just syntactic sugar—it’s a fundamental shift in how prompts compose. Because LMPs are functions, you can call them from other LMPs, pass them as arguments, or unit test them with standard pytest.
The versioning mechanism is where ell gets interesting. Every time you modify an LMP, ell performs static and dynamic analysis to serialize the function’s source, dependencies, and parameters. It generates a content hash and stores the version in a local store. The framework even uses gpt-4o-mini to auto-generate commit messages describing what changed. This happens transparently—no Git hooks, no IDE plugins, just Python decorators doing work at import time.
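To make the idea concrete, here is a minimal stdlib-only sketch of content-addressed versioning: hashing a function's source together with its invocation parameters. The `lmp_version` helper is invented for illustration; ell's real serializer also walks dependencies and uses its own storage format.

```python
import hashlib
import inspect
import json

def lmp_version(fn, **params) -> str:
    """Hypothetical sketch: derive a version id from a function's
    source plus its model parameters (ell's actual analysis is richer)."""
    payload = inspect.getsource(fn) + json.dumps(params, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

def hello(world: str):
    """You are a helpful assistant that writes in lower case."""
    return f"Say hello to {world[::-1]} with a poem."

v1 = lmp_version(hello, model="gpt-4o")
v2 = lmp_version(hello, model="gpt-4o", temperature=0.1)
assert v1 != v2  # a parameter tweak yields a distinct version id
assert v1 == lmp_version(hello, model="gpt-4o")  # hashing is deterministic
```

The same property explains why hashing, rather than timestamps, is the right primitive here: identical code always maps to the identical version, so re-running a script never pollutes the history.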
Multimodal support feels native because ell coerces rich types directly into the message format expected by OpenAI’s API. Want to send an image from your webcam to GPT-4o?
from PIL import Image
import ell

@ell.simple(model="gpt-4o", temperature=0.1)
def describe_activity(image: Image.Image):
    return [
        ell.system("You are VisionGPT. Answer <5 words all lower case."),
        ell.user(["Describe what the person in the image is doing:", image])
    ]

describe_activity(capture_webcam_image())  # assumes capture_webcam_image() returns a PIL Image
The ell.user() helper accepts heterogeneous lists—strings, PIL Images, audio files—and handles the base64 encoding and MIME type detection under the hood. This is significantly cleaner than manually constructing OpenAI’s nested message dictionaries.
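As a rough illustration of what that coercion involves, here is a stdlib-only sketch that turns a mixed list of strings and image bytes into OpenAI-style content parts. The `to_image_part` and `to_content` helpers are invented for this example and are not ell's actual internals:

```python
import base64
import mimetypes

def to_image_part(data: bytes, filename: str) -> dict:
    """Guess the MIME type, base64-encode the bytes, and emit an
    OpenAI-style image_url content part (illustrative sketch only)."""
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    b64 = base64.b64encode(data).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}

def to_content(parts: list) -> list:
    """Strings become text parts; (bytes, filename) pairs become images."""
    out = []
    for p in parts:
        if isinstance(p, str):
            out.append({"type": "text", "text": p})
        else:
            out.append(to_image_part(*p))
    return out

msg = to_content(["Describe what the person is doing:", (b"\x89PNG...", "frame.png")])
assert msg[0]["type"] == "text"
assert msg[1]["image_url"]["url"].startswith("data:image/png;base64,")
```

Writing this dictionary plumbing by hand for every call site is exactly the boilerplate that a helper like ell.user() absorbs.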
Ell Studio, the companion web UI, runs locally via ell-studio --storage ./logdir. It visualizes your prompt version history as a graph, lets you compare outputs across versions, and tracks performance metrics. Because it’s reading from a local store, there’s zero latency and zero data exfiltration. The design philosophy is explicit: prompt optimization should feel like using TensorBoard for model training, not like navigating a SaaS product’s billing dashboard.
The framework also makes composition explicit. You can nest LMPs naturally:
@ell.simple(model="gpt-4o-mini")
def generate_title(content: str):
    return f"Generate a title for: {content}"

@ell.simple(model="gpt-4o")
def write_article(topic: str):
    content = f"Write a detailed article about {topic}"
    title = generate_title(content)
    return f"Article with title '{title}': {content}"
Each LMP is versioned independently, so changing generate_title doesn’t invalidate write_article’s version history unless the source of write_article itself changes. This appears to be possible because ell tracks the dependency graph through code analysis.
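A small sketch can show why independent histories fall out of per-function hashing. The `snapshot` helper below is hypothetical (not ell's API); it records a new version only when a function's own source changes, so an edit to `generate_title` leaves `write_article`'s history untouched:

```python
import hashlib
import inspect

history = {}  # function name -> list of version ids

def snapshot(fn):
    """Append a new version id only when fn's own source changed
    (simplified sketch of per-LMP versioning, not ell's internals)."""
    h = hashlib.sha256(inspect.getsource(fn).encode()).hexdigest()[:12]
    versions = history.setdefault(fn.__name__, [])
    if not versions or versions[-1] != h:
        versions.append(h)

def generate_title(content: str):
    return f"Generate a title for: {content}"

def write_article(topic: str):
    return f"Write a detailed article about {topic}"

snapshot(generate_title)
snapshot(write_article)

def generate_title(content: str):  # the prompt is edited...
    return f"Write a punchy title for: {content}"

snapshot(generate_title)
snapshot(write_article)

assert len(history["generate_title"]) == 2  # edit -> new version
assert len(history["write_article"]) == 1   # caller untouched -> history stable
```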
Gotcha
The automatic versioning is both ell’s superpower and its potential Achilles’ heel. Every code change triggers a new version, which means refactoring variable names or adding comments creates new entries in your version history. For teams iterating rapidly, this could lead to hundreds of versions that differ only trivially. The README doesn’t document any way to ignore certain changes or manually control version boundaries—it’s all or nothing.
The local-first storage is philosophically aligned with developer autonomy, but it’s a non-starter for team collaboration without additional infrastructure. If three engineers are iterating on prompts, they each have separate local stores. There’s no built-in way to sync versions across machines or deploy a shared Ell Studio instance for a team. You’d need to build your own synchronization layer or commit the storage directory to version control, which defeats the purpose of automatic versioning.
Finally, ell is explicitly a library, not a framework. It doesn’t have opinions about agents, retrieval-augmented generation, or orchestration. If you’re building a complex multi-agent system, you’ll need to bring your own architecture. LangChain gives you (too many) building blocks; ell gives you a better primitive for one part of the stack. Whether that’s a feature or a limitation depends entirely on your use case.
Verdict
Use ell if you’re a solo developer or small team building LLM applications where prompt iteration velocity matters more than enterprise collaboration features. It’s particularly strong for projects where you want version control and empirical tracking without adding external dependencies or SaaS platforms. The functional design means it slots into existing Python codebases with minimal friction—if you’re already writing functions that call OpenAI’s API, adding decorators is a one-line change. Skip it if you need multi-user collaboration out of the box, already have established prompt management infrastructure, or require the batteries-included abstractions of frameworks like LangChain (agents, vector stores, etc.). Also skip it if you’re allergic to magic—the automatic versioning and code serialization, while elegant, do involve runtime introspection that some teams will find too implicit for production systems.