Prompt Optimizer: Cutting LLM API Costs by Compressing Tokens Without Model Access
Hook
A single percentage point in token reduction can translate to $10,000 in annual savings for a production LLM application processing 10 million requests. Yet most teams are still paying full price for every redundant word in their prompts.
Context
The economics of large language models have created a peculiar optimization problem. Unlike traditional software, where compute costs scale predictably with infrastructure, LLM applications pay per token, for both input and output. Take a startup sending 100,000 GPT-4 requests a day with 1,000-token prompts: that is 100 million input tokens daily. At GPT-4's original list price of $0.03 per 1,000 input tokens, that comes to about $3,000 per day, or roughly $90,000 per month, on input tokens alone. Scale that to enterprise volumes and the cost becomes a board-level concern.
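The arithmetic behind an estimate like this can be sketched in a few lines. The per-token price here is an assumption (GPT-4's original input list price); the result scales linearly with whatever rate your provider currently charges, and the function name is only illustrative.

```python
def monthly_input_cost(requests_per_day: int, tokens_per_prompt: int,
                       usd_per_1k_tokens: float, days: int = 30) -> float:
    """Input-token spend per month, in USD."""
    tokens_per_day = requests_per_day * tokens_per_prompt
    return tokens_per_day / 1000 * usd_per_1k_tokens * days


# Assumed price: $0.03 per 1,000 input tokens. Substitute your provider's current rate.
PRICE = 0.03

baseline = monthly_input_cost(100_000, 1_000, PRICE)
trimmed = monthly_input_cost(100_000, 900, PRICE)  # prompts shortened by 10%

print(f"baseline: ${baseline:,.0f} per month")
print(f"after a 10% cut: ${trimmed:,.0f} per month")
print(f"saved: ${baseline - trimmed:,.0f} per month")
```

Because the cost is linear in token count, a given percentage cut in prompt length yields the same percentage cut in input spend.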
The obvious solution—simply shortening prompts—runs into immediate problems. Prompts aren't arbitrary text; they're carefully engineered instructions where phrasing matters. Remove the wrong words and your accuracy plummets. Add context and performance improves but costs spike. What developers needed was a programmatic way to compress prompts while preserving their semantic intent, without requiring access to model internals or retraining. That's the gap prompt-optimizer fills: a model-agnostic library that applies various text compression strategies to reduce token counts while letting you control the accuracy-cost tradeoff.
Technical Insight
The architecture of prompt-optimizer revolves around modular optimizer classes that each implement a specific compression strategy. Rather than a one-size-fits-all approach, the library provides a plugin system where you can chain optimizers or apply them selectively based on your use case. At its core, each optimizer inherits from a base class and implements an optimize() method that takes text and returns compressed text.
Here's how you'd use the entropy-based optimizer, which removes tokens that contribute least to information content:
# Import paths and call style follow the project's README; confirm against the version you install.
from prompt_optimizer.metric import TokenMetric
from prompt_optimizer.poptim import EntropyOptim

original_prompt = """Please carefully analyze the following question and provide a detailed answer.
Question: What is the capital of France?
Please think step by step and explain your reasoning thoroughly."""

# p=0.15 sets the cutoff: roughly the 15% of tokens with the lowest entropy are removed.
# TokenMetric reports the token reduction achieved on this prompt.
optimizer = EntropyOptim(verbose=True, p=0.15, metrics=[TokenMetric()])
compressed = optimizer(original_prompt)  # call the optimizer directly on the prompt text
print(compressed)
# The exact output depends on the scoring model; expect filler such as "Please" and "the" to go first.
The p parameter controls compression aggressiveness: it is the share of tokens, ranked by entropy, that gets removed. Set it to 0.05 for conservative optimization (keeping roughly 95% of tokens) or 0.30 for aggressive compression. Entropy here is estimated from a language model's token probabilities, so highly predictable words like "the" and "please" score low and get pruned first, while content-bearing words like "France" and "capital" are preserved.
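To make the percentile idea concrete, here is a minimal sketch of pruning the lowest-information words. It is not the library's implementation: it scores words by surprisal (self-information) from a small, hypothetical frequency table, a simple stand-in for whatever scoring the library applies internally. `WORD_FREQ`, `DEFAULT_FREQ`, and the function names are assumptions made for illustration.

```python
import math
import re

# Hypothetical relative word frequencies (illustrative values, not corpus data).
WORD_FREQ = {"the": 0.05, "please": 0.01, "a": 0.03, "is": 0.02, "of": 0.03,
             "what": 0.005, "capital": 0.0002, "france": 0.0001}
DEFAULT_FREQ = 0.0001  # assumed frequency for words missing from the table


def surprisal(word: str) -> float:
    """Self-information in bits: rare words score high, common words low."""
    return -math.log2(WORD_FREQ.get(word.lower(), DEFAULT_FREQ))


def prune_low_information(text: str, p: float) -> str:
    """Drop roughly the fraction p of words with the lowest surprisal."""
    words = text.split()
    scores = [surprisal(re.sub(r"\W", "", w)) for w in words]
    n_drop = int(len(words) * p)
    # Indices of the n_drop lowest-scoring words (ties broken by position).
    ranked = sorted(range(len(words)), key=lambda i: (scores[i], i))
    drop = set(ranked[:n_drop])
    return " ".join(w for i, w in enumerate(words) if i not in drop)


result = prune_low_information("Please tell me what is the capital of France", p=0.3)
print(result)  # Please tell me what is capital France
```

With nine words and p=0.3, two words are dropped, and the two most frequent ones ("the" and "of") go first.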
What makes this architecture particularly clever is the protected tags feature. You can wrap critical sections in special delimiters to prevent optimization:
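from prompt_optimizer.poptim import PunctuationOptim

# Protect the system message and the {user_input} placeholder, whose braces would otherwise be stripped.
prompt = """<PROTECTED>You are a JSON-generating API. Always respond with valid JSON.</PROTECTED>
Please analyze the user's query and generate a response. The response should be thoughtful and detailed.
User query: <PROTECTED>{user_input}</PROTECTED>"""

# The keyword name (protect_tag) follows the project's documentation; confirm against your installed version.
optimizer = PunctuationOptim(protect_tag="PROTECTED")
compressed = optimizer(prompt)
# Protected spans stay intact; only the surrounding instruction text is compressed.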
This is essential for production systems where certain prompt components—like few-shot examples, output format specifications, or safety instructions—must remain unchanged. Without protected tags, you'd risk breaking your application's contract with the model.
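The mechanism behind protected tags is easy to picture with a self-contained sketch. This is an illustration of the idea, not the library's code: it splits the prompt on a tag pair, passes protected spans through verbatim, and applies a compression function (here, plain punctuation stripping) to everything else. The tag name and function names are assumptions.

```python
import re
import string

TAG = "PROTECTED"  # tag name used in this sketch
# Capturing only the inner text means re.split returns protected content at odd indices.
_PROTECTED = re.compile(rf"<{TAG}>(.*?)</{TAG}>", flags=re.DOTALL)
_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation characters."""
    return text.translate(_PUNCT)


def compress_outside_tags(prompt: str, compress=strip_punctuation) -> str:
    """Apply `compress` only to text that is not wrapped in the protection tag."""
    pieces = _PROTECTED.split(prompt)
    out = [piece if i % 2 == 1 else compress(piece) for i, piece in enumerate(pieces)]
    return "".join(out)


prompt = '<PROTECTED>Reply with valid JSON: {"ok": true}.</PROTECTED> Please, answer the query; be brief!'
result = compress_outside_tags(prompt)
print(result)  # Reply with valid JSON: {"ok": true}. Please answer the query be brief
```

The JSON braces and colon survive because they sit inside the protected span, while the comma, semicolon, and exclamation mark outside it are removed.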
The library ships several optimization strategies, each with distinct compression characteristics. Among them: EntropyOptim prunes low-information tokens using information theory. SynonymReplaceOptim replaces words with shorter synonyms using WordNet. LemmatizerOptim reduces words to their base forms ("running" → "run"). PunctuationOptim strips non-essential punctuation. Each targets a different redundancy pattern in text.
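As a rough illustration of the synonym strategy, the sketch below swaps words for shorter equivalents. It uses a hypothetical hand-written table in place of a WordNet lookup, so it shows the shape of the technique rather than the library's behavior.

```python
import re

# Hypothetical hand-written synonym table standing in for a WordNet lookup.
SHORTER = {"utilize": "use", "demonstrate": "show",
           "approximately": "about", "assistance": "help"}


def shorten_words(text: str) -> str:
    """Replace known long words with shorter synonyms, preserving capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = SHORTER.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


result = shorten_words("Utilize the tool to demonstrate approximately ten cases.")
print(result)  # Use the tool to show about ten cases.
```

Note that fewer characters does not always mean fewer tokens: a long but common word may already be a single token, so gains from this strategy tend to be small.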
You can also chain optimizers for cumulative effect:
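# Class names and import path follow the project's documentation; confirm against your installed version.
from prompt_optimizer.poptim import EntropyOptim, LemmatizerOptim, PunctuationOptim, Sequential

# Optimizers run in the order listed.
optimizer = Sequential(
    PunctuationOptim(),
    LemmatizerOptim(),
    EntropyOptim(p=0.05),
)
compressed = optimizer(original_prompt)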
The sequencing matters here. Punctuation removal first prevents those characters from affecting entropy calculations. Lemmatization reduces vocabulary size before entropy-based pruning. This pipeline approach lets you compose compression strategies that match your specific prompt characteristics.
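Why ordering matters can be shown with a toy pipeline of plain functions. This is a sketch under stated assumptions, not the library's chaining mechanism: the step functions and the `FILLERS` list are hypothetical, and `chain` simply composes callables left to right.

```python
import re
import string
from functools import reduce
from typing import Callable

Step = Callable[[str], str]
FILLERS = {"please", "kindly", "very"}  # hypothetical filler-word list


def strip_punct(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_fillers(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in FILLERS)


def collapse_ws(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def chain(*steps: Step) -> Step:
    """Compose steps left to right: the output of one feeds the next."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


text = "Please,   summarize the report -- very briefly!"
punct_first = chain(strip_punct, drop_fillers, collapse_ws)(text)
fillers_first = chain(drop_fillers, strip_punct, collapse_ws)(text)
print(punct_first)    # summarize the report briefly
print(fillers_first)  # Please summarize the report briefly
```

When filler removal runs first, "Please," still carries its comma and slips past the filter, so the same steps in a different order produce a longer result.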
One architectural decision that deserves attention: the library operates purely at the text level, never touching model weights or requiring API access. This makes it work with any LLM—OpenAI, Anthropic, Cohere, or your self-hosted model—but it also means optimizations are blind to model-specific tokenization. GPT-4's tokenizer might split words differently than Claude's, so a "20% token reduction" measured with one tokenizer won't necessarily transfer. In practice, this means you should measure reduction using your actual target model's tokenizer rather than relying on word counts.
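A tokenizer-aware measurement can be sketched as a function that takes the counting rule as a parameter. The two counters below are toy stand-ins, not real tokenizers; they only demonstrate that the same compression yields different reduction figures under different tokenization schemes.

```python
from typing import Callable


def token_reduction(original: str, compressed: str,
                    count_tokens: Callable[[str], int]) -> float:
    """Fraction of tokens removed, according to the supplied counter."""
    before = count_tokens(original)
    if before == 0:
        return 0.0
    return 1.0 - count_tokens(compressed) / before


def count_words(text: str) -> int:
    """Toy counter: one token per whitespace-separated word."""
    return len(text.split())


def count_char_chunks(text: str, size: int = 4) -> int:
    """Toy counter: one token per `size` characters, rounded up."""
    return -(-len(text) // size)


# With a real tokenizer, pass its counting function instead, for example
# `lambda s: len(enc.encode(s))` where `enc = tiktoken.encoding_for_model("gpt-4")`.
original = "Please carefully analyze the following question"
compressed = "analyze following question"

by_words = token_reduction(original, compressed, count_words)
by_chunks = token_reduction(original, compressed, count_char_chunks)
print(f"word-based reduction:  {by_words:.1%}")
print(f"chunk-based reduction: {by_chunks:.1%}")
```

Here the word counter reports a 50% reduction while the character-chunk counter reports about 42%, which is the kind of gap you can see between real tokenizers as well.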
Gotcha
The fundamental limitation is the compression-performance tradeoff, and it's steeper than you might expect. The repository's own benchmarks on LogiQA (a logical reasoning task) show baseline accuracy at 32% dropping to 28% with gentle optimization (10% token reduction), and cratering to 8% with aggressive optimization (50% reduction). That's not a typo—cutting tokens in half can quarter your accuracy.
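One way to read these numbers is as cost per correct answer. The sketch below assumes that cost is proportional to input tokens only and that a wrong answer has no value; both are simplifications, and the function name is illustrative.

```python
def relative_cost_per_correct(token_reduction: float, accuracy: float,
                              baseline_accuracy: float) -> float:
    """Cost per correct answer relative to the uncompressed baseline (1.0 = unchanged)."""
    relative_cost = 1.0 - token_reduction  # input cost relative to baseline
    return (relative_cost / accuracy) / (1.0 / baseline_accuracy)


BASELINE_ACC = 0.32  # LogiQA baseline accuracy quoted above

gentle = relative_cost_per_correct(0.10, 0.28, BASELINE_ACC)
aggressive = relative_cost_per_correct(0.50, 0.08, BASELINE_ACC)
print(f"gentle:     {gentle:.2f}x the baseline cost per correct answer")
print(f"aggressive: {aggressive:.2f}x the baseline cost per correct answer")
```

Under these assumptions, gentle compression costs about 3% more per correct answer, and aggressive compression doubles it: halving the bill does not help if accuracy falls to a quarter.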
This happens because prompt engineering is fragile. Those "unnecessary" words often carry subtle semantic signals or structural cues that models rely on. When you strip "Please analyze carefully" down to "analyze", you've lost the politeness framing and emphasis that can influence model behavior. Remove punctuation and you've eliminated sentence boundaries that help models parse complex instructions. The optimizer doesn't understand these linguistic functions; it only sees a score for each token.
The second gotcha is the lack of universal guidelines. The repository provides benchmarks for LogiQA only—100 samples of a specific reasoning task. Will EntropyOptim with p=0.15 work well for your customer support chatbot? Your code generation pipeline? Your document summarization service? You won't know without running your own evaluations. This isn't a dial you can tune once; it's per-task empirical tuning that requires ground truth data and meaningful metrics. For teams without ML evaluation infrastructure, that's a non-trivial investment before seeing any cost savings.
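A per-task evaluation does not need to be elaborate to be useful. The sketch below compares accuracy with and without a prompt transformation on a labeled sample set. `fake_model` is a hypothetical stand-in for a real LLM call so the example runs offline; in practice you would pass a function that calls your provider and a sample set drawn from your own task.

```python
import string
from typing import Callable, Sequence, Tuple

Sample = Tuple[str, str]  # (prompt, expected answer)


def accuracy(samples: Sequence[Sample],
             transform: Callable[[str], str],
             call_model: Callable[[str], str]) -> float:
    """Fraction of samples answered correctly after `transform` is applied to each prompt."""
    hits = sum(call_model(transform(prompt)).strip() == expected
               for prompt, expected in samples)
    return hits / len(samples)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, so the sketch runs offline."""
    if "capital" in prompt and "France" in prompt:
        return "Paris"
    if "2 + 2" in prompt:
        return "4"
    return "unknown"


def strip_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


samples = [("What is the capital of France?", "Paris"), ("What is 2 + 2?", "4")]
baseline = accuracy(samples, lambda p: p, fake_model)
compressed = accuracy(samples, strip_punctuation, fake_model)
print(f"baseline accuracy: {baseline:.0%}, compressed accuracy: {compressed:.0%}")
```

In this toy run, stripping punctuation removes the "+" from the arithmetic prompt and the stand-in model can no longer answer it, which is exactly the kind of task-specific failure that only shows up when you measure.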
Verdict
Use if: You're operating at scale where a 5-10% cost reduction means real money (think $50K+ annual LLM spend), you have evaluation infrastructure to test optimizations against your specific tasks, your prompts contain genuinely redundant content (verbose instructions, repeated examples, excessive politeness), and you can tolerate a measured performance degradation in exchange for cost savings. This tool shines for high-volume, lower-stakes applications like content moderation, simple classification, or non-customer-facing analytics.
Skip if: You're optimizing prematurely with low request volumes where savings won't exceed implementation costs, your prompts are already tightly engineered, your use case demands maximum accuracy (medical, legal, financial decisions), or you lack the ability to A/B test and measure quality impacts. Don't use this as a substitute for good prompt engineering: compress verbose prompts, not well-crafted ones.