grex: Generate Regular Expressions from Test Cases Without the Headache
Hook
Writing a regex to match email addresses typically takes 20 minutes and three Stack Overflow tabs. What if you could just paste five examples and get a working pattern instantly?
Context
Regular expressions are simultaneously one of the most powerful and most frustrating tools in a developer's arsenal. We've all been there: staring at a pattern like ^(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@ and wondering where you went wrong. The traditional workflow involves writing a pattern, testing it against known inputs, discovering it breaks on edge cases, adjusting the pattern, and repeating until either the regex works or you give up and write parser code instead.
This trial-and-error approach is backwards. You already know which strings should match, so why not start from there? Enter grex, a tool written in Rust that inverts the regex creation process. Instead of translating requirements into cryptic symbol soup, you provide concrete examples of strings you want to match, and grex generates a regex that matches them. It's not magic; it's automata theory applied to pattern recognition. Originally inspired by the JavaScript library regexgen, grex takes the concept further with Unicode compliance, multiple output modes, and availability as a command-line tool, a Rust library, and a Python package.
Technical Insight
Under the hood, grex does more than join your inputs with pipes. According to the project's documentation, it builds a deterministic finite automaton (DFA) from the input strings, with states connected by character transitions, and then reduces the number of states and transitions using Hopcroft's minimization algorithm. The minimized automaton is finally converted into a regular expression with Brzozowski's algebraic method. The effect is that shared prefixes and suffixes are factored out instead of being repeated in every alternative. Repeated substrings become quantifiers, and characters collapse into classes such as \d, only when you enable the corresponding options; the ^ and $ anchors are added by default.
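The idea of factoring shared structure out of a set of examples can be sketched with a plain prefix tree. This is a deliberately simplified stand-in, not grex's actual algorithm: it only merges common prefixes, and the helper names build_trie, trie_to_regex, and regex_from_examples are hypothetical.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key '' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node):
    """Return a regex fragment that matches exactly the suffixes below node."""
    branches = []
    optional = False
    for ch, child in sorted(node.items()):
        if ch == "":
            optional = True  # some word ends at this node
        else:
            branches.append(re.escape(ch) + trie_to_regex(child))
    if not branches:
        return ""
    if len(branches) == 1 and not optional:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    return group + "?" if optional else group


def regex_from_examples(words):
    """Anchor the trie-derived fragment so it matches whole strings only."""
    return "^" + trie_to_regex(build_trie(words)) + "$"


if __name__ == "__main__":
    print(regex_from_examples(["ABC-123", "ABC-456", "XYZ-789", "XYZ-012"]))
```

A trie only shares prefixes. Minimizing an automaton also merges shared suffixes, which is one reason a DFA-based approach can produce more compact patterns than this sketch.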
Let's see it in action. Suppose you're building a validator for product SKUs that follow inconsistent formats from legacy systems. You have examples like ABC-123, ABC-456, XYZ-789, and XYZ-012. Instead of manually crafting [A-Z]{3}-[0-9]{3}, you invoke grex:
$ grex ABC-123 ABC-456 XYZ-789 XYZ-012
^(?:ABC|XYZ)\-(?:012|123|456|789)$
Not quite what we want: the pattern is overly specific, matching only those exact numbers. But grex offers flags for generalization. Add --digits to convert literal digits into \d, and --repetitions to collapse repeated runs into quantifiers such as {3}:
$ grex --digits --repetitions ABC-123 ABC-456 XYZ-789 XYZ-012
^(?:ABC|XYZ)\-\d{3}$
Much better. The pattern now captures the structure while generalizing the numeric portion. The --digits flag tells grex to replace each digit with the \d class instead of keeping it as a literal, and --repetitions turns \d\d\d into \d{3}. Similarly, --words converts word characters (letters, digits, and underscores) to \w, and --spaces converts whitespace to \s.
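To see what the generalization changes in practice, the two patterns quoted above can be compared with Python's standard re module. This is a small sketch; the unseen SKUs are made-up values chosen for illustration.

```python
import re

# Patterns copied from the two grex invocations shown in the article
exact = re.compile(r"^(?:ABC|XYZ)\-(?:012|123|456|789)$")
generalized = re.compile(r"^(?:ABC|XYZ)\-\d{3}$")

examples = ["ABC-123", "ABC-456", "XYZ-789", "XYZ-012"]
unseen = ["ABC-999", "XYZ-000"]  # hypothetical SKUs not among the examples

# Both patterns accept every original example
assert all(exact.fullmatch(s) and generalized.fullmatch(s) for s in examples)

# Only the generalized pattern accepts SKUs it has never seen
for sku in unseen:
    print(sku, bool(exact.fullmatch(sku)), bool(generalized.fullmatch(sku)))
```

The prefix stays literal in both patterns, so a SKU such as DEF-123 is still rejected unless the letters are generalized as well.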
The Python bindings make this even more useful for data pipeline integration. Here's a realistic scenario: you're processing user-generated content and need a pattern for the timestamps it contains, starting from a few sample values:
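from grex import RegExpBuilder

# Sample timestamps that all share one layout
timestamps = [
    "2024-01-15 14:30:22",
    "2024-02-28 09:15:44",
    "2024-03-10 18:45:01",
]

builder = RegExpBuilder.from_test_cases(timestamps)
regex_pattern = (
    builder
    .with_conversion_of_digits()       # digits become \d
    .with_conversion_of_repetitions()  # repeated runs become {n} quantifiers
    .build()
)
print(regex_pattern)
# Output along the lines of: ^\d{4}\-\d{2}\-\d{2}\ \d{2}:\d{2}:\d{2}$
# (grex may group repeated parts such as :\d{2} differently)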
The generated pattern isn't just correct—it's structured in a way that reveals the underlying format. You can immediately see the year-month-day structure followed by hour-minute-second. This makes the pattern auditable and maintainable, something you don't get from asking an LLM to "write a regex for timestamps."
Grex's Unicode handling deserves special attention. It treats grapheme clusters as atomic units, which matters when dealing with international text. A letter followed by a combining mark, such as é written as e + combining acute accent (U+0301), is treated as a single matching unit, not two separate characters. This is critical for applications processing user names, addresses, or any content in languages beyond ASCII. The tool supports Unicode 16.0, including proper handling of astral plane code points like emoji and historical scripts.
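Why grapheme clusters matter is easy to demonstrate with Python's standard library alone. The sketch below does not use grex; it only shows that the decomposed form of é consists of two code points, so a regex token that matches a single code point cannot cover it.

```python
import re
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(len(decomposed), len(composed))

# '.' matches exactly one code point, so it covers only the composed form
assert re.fullmatch(r".", composed) is not None
assert re.fullmatch(r".", decomposed) is None

# Both spellings look identical to a reader but compare unequal as strings
assert decomposed != composed
```

A tool that splits input by code point would treat the accent as a separate character; treating the cluster as one unit keeps the base letter and its accent together.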
One design decision that sets grex apart is its focus on correctness. The codebase uses property-based testing (via the proptest crate) to check, across many randomly generated inputs, that every generated regex actually matches the test cases it was built from. This goes beyond a handful of hand-written unit tests, although it is strong empirical evidence and not a formal proof. When you run grex, you can expect the pattern to match your examples without having to verify that by hand. The tradeoff is that patterns may be longer than hand-optimized alternatives, but that's an acceptable cost for eliminating an entire class of bugs.
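The property being tested can be sketched in a few lines of Python. This is not grex's test suite: it uses the standard random module in place of proptest, and naive_regex is a hypothetical stand-in generator that simply escapes each test case and joins the results with alternation.

```python
import random
import re


def naive_regex(cases):
    """Hypothetical stand-in generator: escape each case, join with '|'."""
    return "^(?:" + "|".join(re.escape(c) for c in sorted(set(cases))) + ")$"


def random_cases(rng, alphabet, max_cases=5, max_len=8):
    """Produce between 1 and max_cases random strings over alphabet."""
    count = rng.randint(1, max_cases)
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        for _ in range(count)
    ]


def check_property(trials=500, seed=0):
    """Property: a generated regex must match every case it was built from."""
    rng = random.Random(seed)
    alphabet = "abc019-.*+?()[]{}|^$\\ é"  # includes regex metacharacters
    for _ in range(trials):
        cases = random_cases(rng, alphabet)
        pattern = re.compile(naive_regex(cases))
        for case in cases:
            assert pattern.fullmatch(case), (pattern.pattern, case)
    return trials


if __name__ == "__main__":
    print(check_property(), "random trials passed")
```

Random trials can uncover escaping and edge-case bugs that hand-picked examples miss, but passing them shows the absence of failures only for the inputs that were sampled.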
Gotcha
The biggest gotcha with grex is that it optimizes for correctness, not elegance. The generated patterns can be verbose, sometimes comically so. If you feed it ten variations of phone numbers with different formatting, you might get a 300-character monstrosity with nested alternations. It's correct, but you wouldn't want to maintain it. You'll almost always need a human refinement pass to reduce complexity and improve readability. Think of grex as a sophisticated first draft, not publication-ready code.
The conversion flags (--digits, --words, etc.) are powerful but dangerous. Converting 123 to \d{3} means your pattern now also matches 000, 999, and every other three-digit sequence. If your original test cases were carefully chosen to represent valid inputs, using these flags throws away that specificity. You might generate a pattern that matches your examples but also matches invalid data you never intended to allow. This is especially problematic in security contexts, where a carelessly generalized regex for input validation might let malicious input through. Always test the generalized pattern against known invalid inputs, not just valid ones.
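One way to make that check routine is a small audit helper that reports both kinds of failure. The sketch below uses only the standard re module; audit_pattern and the sample inputs are hypothetical and stand in for whatever valid and invalid values your domain defines.

```python
import re


def audit_pattern(pattern, valid, invalid):
    """Return (missed, leaked): valid inputs rejected and invalid inputs accepted."""
    compiled = re.compile(pattern)
    missed = [s for s in valid if not compiled.fullmatch(s)]
    leaked = [s for s in invalid if compiled.fullmatch(s)]
    return missed, leaked


if __name__ == "__main__":
    # Hypothetical rule for this example: codes are three digits, but 000 is reserved
    missed, leaked = audit_pattern(
        r"^\d{3}$",
        valid=["123", "456"],
        invalid=["000", "12a", "1234"],
    )
    print("missed:", missed)
    print("leaked:", leaked)
```

An empty missed list confirms the examples still match, while anything in leaked is a value the generalization let through and the pattern needs tightening for.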
Another limitation: grex cannot optimize for regex engine performance. A hand-crafted pattern that orders its alternatives carefully or, in engines that support them, uses atomic groups or possessive quantifiers can outperform grex's output on complex strings. If you're processing millions of log lines and regex performance is your bottleneck, you'll need to profile and hand-optimize after generation. Grex also can't incorporate domain knowledge: it doesn't know that a four-digit sequence at the start of a timestamp is probably a year and could be constrained to (?:19|20)\d{2} for realistic date ranges. It only knows what the examples show.
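A human refinement pass that adds this kind of domain knowledge might look like the sketch below. Both patterns are illustrative: generated mimics the shape of a digit-generalized date pattern, and refined is a hand-written tightening that assumes years from 1900 to 2099, months 01 to 12, and days 01 to 31.

```python
import re

# Shape of a digit-generalized pattern for dates such as 2024-01-15
generated = re.compile(r"^\d{4}\-\d{2}\-\d{2}$")

# Hand-refined version with assumed ranges for year, month, and day
refined = re.compile(
    r"^(?:19|20)\d{2}\-(?:0[1-9]|1[0-2])\-(?:0[1-9]|[12]\d|3[01])$"
)

for candidate in ["2024-01-15", "1999-12-31", "9999-99-99", "2024-13-01"]:
    print(candidate, bool(generated.fullmatch(candidate)), bool(refined.fullmatch(candidate)))
```

Even the refined pattern accepts impossible dates such as 2024-02-31, so calendar validity is better checked with a date parser after the regex has confirmed the format.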
Verdict
Use grex if you're reverse-engineering patterns from existing data, bootstrapping validators for legacy formats, or need to quickly prototype a regex when you have representative examples but fuzzy requirements. It's invaluable when working with Unicode-heavy content or when you need confidence that your pattern handles the known cases. The Python bindings make it particularly valuable in data science pipelines where you're discovering patterns in messy real-world data.
Skip it if you're working with simple patterns where hand-crafting is faster, if you need highly optimized regexes for performance-critical code paths, or if you already know the abstract pattern and just need to translate it to regex syntax. Also skip it if you can't afford the time to validate and refine the output: using grex patterns blindly in production without human review is asking for trouble. This tool is a force multiplier for experienced developers who understand regex fundamentals, not a replacement for that knowledge.