Marsha: The Programming Language That Compiles English Into Tested Python Using LLMs
Hook
What if your compiler had a 20% chance of failing, but when it worked, it wrote all your boilerplate Python for you? Marsha is betting that's a trade-off worth making.
Context
The dream of programming in natural language is as old as programming itself. For decades, we've been promised that someday we'd just tell computers what we want in plain English, and they'd figure out the implementation. Every attempt—from COBOL's verbose syntax to visual programming environments—fell short because natural language is fundamentally ambiguous while code demands precision.
Large language models changed the equation. Suddenly, systems could interpret vague intent and generate working code. But raw LLM output is notoriously unreliable: it hallucinates APIs, produces subtly broken logic, and can't guarantee correctness. Marsha tackles this head-on by treating LLMs not as magical code writers, but as code generators that must prove their work. It introduces a functional programming language where you write specifications in structured English, provide examples of expected behavior, and let an LLM generate Python implementations that are automatically tested against those examples. If the tests fail, the compiler asks the LLM to try again. It's programming by specification with AI-powered implementation and automated validation—a genuinely novel approach to the old problem of closing the gap between human intent and machine execution.
Technical Insight
Marsha's architecture rests on three pillars: type definitions, function specifications, and a test-driven compilation loop. Unlike traditional compilers that deterministically translate syntax trees, Marsha feeds structured specifications to OpenAI's API and validates the resulting Python against developer-provided examples.
Type definitions use a CSV-inspired format that's both human-readable and mechanically parsable. You define data structures inline:
type User {
    name: string,
    email: string,
    age: int
}
Functions are where the magic happens. You write markdown-style declarations with natural language descriptions and concrete examples. The compiler extracts these as both specification hints for the LLM and as pytest test cases:
## filter_active_users
Given a list of users and a minimum age threshold, return only users whose age is greater than or equal to the threshold, sorted by name alphabetically.
### Examples
Input: users=[User("Alice", "a@example.com", 25), User("Bob", "b@example.com", 17)], min_age=18
Output: [User("Alice", "a@example.com", 25)]
Input: users=[User("Zoe", "z@example.com", 30), User("Adam", "adam@example.com", 30)], min_age=25
Output: [User("Adam", "adam@example.com", 30), User("Zoe", "z@example.com", 30)]
When you compile this, Marsha constructs a prompt containing the type definitions, function signature, natural language description, and examples. The LLM generates a Python implementation. Critically, Marsha doesn't just accept this output: it immediately runs the examples as tests. If any of them fail, it feeds the error messages back to the LLM with a prompt like "The previous implementation failed these tests: [errors]. Please fix the implementation." This repeats up to a configurable maximum number of attempts (typically 3-5).
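The generate, test, and retry cycle can be sketched in a few lines of Python. This is a minimal illustration, not Marsha's actual code: `run_examples`, `compile_with_retries`, and the `fake_llm` stand-in are hypothetical names, the model call is stubbed out so the sketch runs offline, and it assumes a retry is triggered whenever any example fails.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration only.
Example = Tuple[tuple, object]                    # (positional args, expected result)
Generator = Callable[[str, Optional[str]], str]   # (spec, feedback) -> Python source


def run_examples(source: str, func_name: str, examples: List[Example]) -> List[str]:
    """Execute generated source and return one message per failing example."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception as exc:
        return [f"could not load {func_name}: {exc!r}"]
    errors = []
    for args, expected in examples:
        try:
            actual = func(*args)
        except Exception as exc:
            errors.append(f"{func_name}{args} raised {exc!r}")
            continue
        if actual != expected:
            errors.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return errors


def compile_with_retries(spec: str, func_name: str, examples: List[Example],
                         generate: Generator, max_attempts: int = 3) -> str:
    """Ask the generator for code until every example passes or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        source = generate(spec, feedback)
        errors = run_examples(source, func_name, examples)
        if not errors:
            return source
        feedback = "The previous implementation failed these tests:\n" + "\n".join(errors)
    raise RuntimeError(f"no passing implementation after {max_attempts} attempts")


def fake_llm(spec: str, feedback: Optional[str]) -> str:
    # Stand-in for a model call: wrong on the first try, right once it gets feedback.
    if feedback is None:
        return "def double(x):\n    return x + 1\n"
    return "def double(x):\n    return x * 2\n"


source = compile_with_retries("double a number", "double", [((2,), 4), ((5,), 10)], fake_llm)
```

With the stub above, the first attempt fails both examples, the error messages become the feedback, and the second attempt is accepted. A generator that never produces passing code ends in a `RuntimeError`, which corresponds to a failed compilation.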
The generated Python emerges as a proper module with type hints derived from your Marsha types, complete with a pytest suite. For the example above, you'd get something like:
from dataclasses import dataclass
from typing import List


@dataclass
class User:
    name: str
    email: str
    age: int


def filter_active_users(users: List[User], min_age: int) -> List[User]:
    # Keep users at or above the threshold, then sort them by name.
    return sorted(
        [u for u in users if u.age >= min_age],
        key=lambda u: u.name
    )
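The accompanying pytest suite might look roughly like the following, with one test per example from the specification. The test function names are hypothetical, and the dataclass and function are repeated only so the sketch is self-contained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class User:
    name: str
    email: str
    age: int


def filter_active_users(users: List[User], min_age: int) -> List[User]:
    return sorted([u for u in users if u.age >= min_age], key=lambda u: u.name)


def test_filters_out_users_below_threshold():
    # First example: Bob is under the threshold and is dropped.
    users = [User("Alice", "a@example.com", 25), User("Bob", "b@example.com", 17)]
    assert filter_active_users(users, min_age=18) == [User("Alice", "a@example.com", 25)]


def test_sorts_remaining_users_by_name():
    # Second example: both users qualify and come back in alphabetical order.
    users = [User("Zoe", "z@example.com", 30), User("Adam", "adam@example.com", 30)]
    assert filter_active_users(users, min_age=25) == [
        User("Adam", "adam@example.com", 30),
        User("Zoe", "z@example.com", 30),
    ]
```

Because `@dataclass` generates `__eq__`, the expected outputs can be compared to the results directly with `==`.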
This test-driven feedback loop is Marsha's key innovation. It transforms unreliable LLM code generation into a system that produces verifiably correct implementations for the specific examples you care about. The compiler isn't just translating syntax—it's negotiating with an AI until it produces code that meets your specifications.
The markdown-inspired syntax serves a dual purpose. For humans, it reads like documentation you'd write anyway. For machines, it's structured enough to parse mechanically. Function names, type signatures, descriptions, and examples all occupy predictable positions that regex patterns can extract. This lets Marsha's relatively simple Python parser build a specification AST without needing a complex grammar. The heavy lifting happens not in parsing, but in the LLM-driven code generation and validation loop.
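To make the regex-extraction idea concrete, here is a minimal sketch that pulls the name, description, and examples out of a declaration written in the format shown earlier. The patterns and the `FunctionSpec` and `parse_function` names are assumptions made for illustration, not Marsha's actual parser.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FunctionSpec:
    name: str
    description: str
    examples: List[Tuple[str, str]]  # (input text, output text), kept as raw strings


# Heading line, free-text description, then everything after "### Examples".
SPEC_RE = re.compile(
    r"^## (?P<name>\w+)\n(?P<description>.*?)\n### Examples\n(?P<examples>.*)",
    re.DOTALL | re.MULTILINE,
)
# Each example is an "Input:" line immediately followed by an "Output:" line.
EXAMPLE_RE = re.compile(r"^Input: (?P<input>.+)\nOutput: (?P<output>.+)$", re.MULTILINE)


def parse_function(text: str) -> FunctionSpec:
    match = SPEC_RE.search(text)
    if match is None:
        raise ValueError("no function declaration found")
    examples = [(m["input"], m["output"]) for m in EXAMPLE_RE.finditer(match["examples"])]
    return FunctionSpec(match["name"], match["description"].strip(), examples)


SOURCE = """\
## filter_active_users
Given a list of users and a minimum age threshold, return the qualifying users.
### Examples
Input: users=[User("Alice", "a@example.com", 25), User("Bob", "b@example.com", 17)], min_age=18
Output: [User("Alice", "a@example.com", 25)]
"""

spec = parse_function(SOURCE)
```

The example inputs and outputs stay as raw strings here; turning them into executable test cases is a separate step that this sketch leaves out.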
One architectural choice worth noting: Marsha generates persistent Python files rather than ephemeral bytecode. You can read, debug, and even manually edit the generated code. This transparency makes it feel less like magic and more like an extremely sophisticated code generator. You're not locked into Marsha's ecosystem—the output is just Python.
Gotcha
Marsha's Achilles' heel is non-determinism at every level. Run the compiler twice on identical source, and you'll likely get different Python implementations. Sometimes one version uses list comprehensions while another uses explicit loops. Sometimes one handles edge cases the other misses. This makes version control a nightmare: diffs become meaningless because regenerating code produces unrelated changes. Debugging is similarly chaotic, since you can't reliably reproduce a bug when recompilation might generate code that doesn't exhibit the issue.
The 80% success rate target sounds reasonable until you realize it means 20% of your functions might not compile at all. And even when compilation succeeds, the passing tests only demonstrate correctness for your explicit examples. If you provided two examples but there are a dozen edge cases you didn't think to specify, the generated code might silently fail on those. You're getting code that passes the tests you wrote, not necessarily code that solves your problem comprehensively.
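The `filter_active_users` specification from earlier shows how this can happen. Neither of its two examples includes a user whose age equals the threshold, so a hypothetical implementation that uses `>` instead of `>=` would pass both and still be wrong. The buggy function below is written by hand for illustration; it is not actual Marsha output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class User:
    name: str
    email: str
    age: int


def filter_active_users_buggy(users: List[User], min_age: int) -> List[User]:
    # Bug: strict ">" drops users whose age equals the threshold.
    return sorted([u for u in users if u.age > min_age], key=lambda u: u.name)


alice = User("Alice", "a@example.com", 25)
bob = User("Bob", "b@example.com", 17)
zoe = User("Zoe", "z@example.com", 30)
adam = User("Adam", "adam@example.com", 30)

# Both examples from the specification pass.
example_1_passes = filter_active_users_buggy([alice, bob], 18) == [alice]
example_2_passes = filter_active_users_buggy([zoe, adam], 25) == [adam, zoe]

# The boundary case, which no example covers, comes back empty
# even though the description says an 18-year-old should be included.
carol = User("Carol", "c@example.com", 18)
boundary_result = filter_active_users_buggy([carol], 18)
```

Adding a third example with a user exactly at the threshold would close this particular gap, which is the general remedy: every behavior you care about needs its own example.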
Cost and vendor lock-in are practical concerns. Every compilation hits OpenAI's API, accruing charges. A medium-sized project with dozens of functions could cost dollars per build. During development, when you're iterating rapidly, this adds up. And you're completely dependent on OpenAI's infrastructure—no compilation offline, no control over model versions, no ability to use local or alternative LLMs despite the codebase theoretically supporting pluggable backends.
Verdict
Use if: You're prototyping data transformation pipelines where you can articulate clear examples but writing explicit Python feels tedious. You're building one-off analysis scripts where correctness can be manually verified and non-deterministic builds don't matter. You're exploring LLM-assisted development and want to experiment with specification-driven programming. You have OpenAI API budget to burn and value development speed over build reproducibility.
Skip if: You need production-grade reliability with deterministic builds. Your project requires debugging and maintaining code over time, where non-deterministic compilation would create chaos. You can't tolerate a 20% compilation failure rate or need guaranteed correctness beyond explicit examples. You're cost-sensitive or need to compile offline. You're working on anything mission-critical where the 'best-effort' compilation model is unacceptable. In those cases, stick with hand-written Python, or use LLM assistants like GitHub Copilot that augment your own coding rather than replace it with a non-deterministic compile step.