AutoTemp: Multi-Armed Bandit Optimization for LLM Temperature Selection
Hook
Every LLM engineer has cargo-culted temperature=0.7 into their API calls at some point. But what if the difference between temperature 0.3 and 0.9 is the difference between mediocre and exceptional output for your specific use case, and you're leaving performance on the table because you never systematically tested?
Context
Temperature is one of the most influential hyperparameters in large language model inference, controlling the randomness of token selection during generation. Set it too low and responses become deterministic but potentially rigid; set it too high and you get creative chaos that might derail task completion. The traditional approach involves manual experimentation—running the same prompt at a handful of temperatures, eyeballing the results, and picking one that "feels right." This works poorly for several reasons: human evaluation doesn't scale, we're terrible at assessing subtle quality differences across dozens of outputs, and our intuitions about temperature often don't transfer across different prompt types or models.
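To make the effect concrete, here is a minimal sketch of temperature-scaled softmax, the standard way sampling temperature reshapes a token distribution. The logits are made-up numbers and the function name is illustrative; it is not taken from AutoTemp or any particular inference stack.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # T=0 is usually handled separately as greedy (argmax) decoding.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.3, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and give the other tokens a real chance of being sampled.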
The research community has known for years that optimal temperature varies significantly by task. Creative writing might shine at 0.9, mathematical reasoning at 0.1, and nuanced explanation at 0.5. Yet most practitioners either stick with default values or use the same temperature across all their prompts. AutoTemp emerged from this gap—a recognition that temperature optimization is both important and under-explored. Rather than treating it as a one-time manual experiment, it frames temperature selection as a sequential decision problem amenable to bandit algorithms, where each temperature setting is an "arm" to pull and the goal is to efficiently identify the best performer without exhaustively testing everything.
Technical Insight
AutoTemp's architecture centers on three core components: a temperature sweep engine, a multi-judge evaluation system, and an optional UCB1 bandit optimizer. The basic workflow generates responses across a configurable temperature range (typically 0.0 to 1.0 in 0.1 increments), then uses LLM-as-judge scoring to rate each output across multiple quality dimensions.
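The sweep-then-judge workflow can be sketched in a few lines. This is a rough illustration, not AutoTemp's code: `temperature_grid`, `sweep`, `fake_generate`, and `fake_judge` are hypothetical names, and the two stubs stand in for real model and judge calls.

```python
from statistics import mean


def temperature_grid(start, stop, step):
    """Build an inclusive grid while avoiding floating-point drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 10) for i in range(count)]


def sweep(prompt, temperatures, generate, judge, num_samples):
    """Generate num_samples responses per temperature and average their scores."""
    results = {}
    for t in temperatures:
        responses = [generate(prompt, t) for _ in range(num_samples)]
        results[t] = mean([judge(prompt, r) for r in responses])
    return results


# Hypothetical stubs: replace with real LLM and judge calls.
def fake_generate(prompt, temperature):
    return f"{prompt} @ T={temperature}"


def fake_judge(prompt, response):
    return float(len(response))


grid = temperature_grid(0.0, 1.0, 0.1)
scores = sweep("Explain entanglement", grid, fake_generate, fake_judge, num_samples=2)
```

Passing `generate` and `judge` in as callables keeps the sweep logic testable without spending API calls.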
The evaluation system is more sophisticated than a simple "rate this output" approach. It employs multiple independent judge calls per response, aggregating scores to reduce the inherent variance of single-judge evaluations. Each judge evaluates outputs on dimensions like relevance, clarity, coherence, and task completion. Here's a simplified example of how you might use AutoTemp's core evaluation logic:
from autotemp import AutoTemp
import openai

# Initialize with your prompt and evaluation criteria
optimizer = AutoTemp(
    prompt="Explain quantum entanglement to a high school student",
    model="gpt-4",
    temperature_range=(0.0, 1.0),
    num_samples=5,  # responses per temperature
    num_judges=3,   # independent judges per response
    criteria=[
        "clarity",
        "accuracy",
        "pedagogical_effectiveness",
        "engagement",
    ],
)

# Run optimization with UCB1 bandit
results = optimizer.optimize(
    mode="bandit",
    iterations=50,
    ucb_c=2.0,  # exploration parameter
)

# Get best temperature and response
print(f"Optimal temperature: {results.best_temperature}")
print(f"Confidence interval: {results.confidence_interval}")
print(f"Best output: {results.best_response}")
The UCB1 implementation is where AutoTemp gets interesting from an algorithms perspective. Rather than naively testing all temperatures equally, it treats the problem as a multi-armed bandit where each temperature is an arm with unknown expected reward. The UCB1 algorithm balances exploration (trying under-sampled temperatures to reduce uncertainty) with exploitation (pulling temperatures that have performed well so far). The upper confidence bound for each temperature t is calculated as:
UCB(t) = average_score(t) + c * sqrt(ln(total_pulls) / pulls(t))
The c parameter controls exploration strength: higher values encourage more exploration of uncertain temperatures, and c = sqrt(2) recovers the textbook UCB1 bound for rewards scaled to [0, 1]. Because each sample costs an API call, this can make AutoTemp considerably more sample-efficient than exhaustive grid search, which spends as much budget on clearly poor temperatures as on promising ones. In practice, UCB1 often identifies the optimal temperature region within 20-30 iterations, compared to the hundreds needed for statistical confidence with uniform sampling.
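The selection rule above is short enough to sketch directly. The following is an illustrative implementation under stated assumptions, not AutoTemp's actual code: `simulated_judge_score` is a made-up stand-in for a real generate-and-judge call, and its peak near 0.4 is arbitrary.

```python
import math
import random


def ucb1_select(counts, totals, c):
    """Return the index of the arm with the highest upper confidence bound."""
    # Pull every arm once first so the formula never divides by zero.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total_pulls = sum(counts)
    bounds = [
        totals[arm] / counts[arm]
        + c * math.sqrt(math.log(total_pulls) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=bounds.__getitem__)


def run_bandit(temperatures, reward_fn, iterations, c=2.0):
    """Run UCB1 and return (best temperature, pull counts, mean scores)."""
    counts = [0] * len(temperatures)
    totals = [0.0] * len(temperatures)
    for _ in range(iterations):
        arm = ucb1_select(counts, totals, c)
        counts[arm] += 1
        totals[arm] += reward_fn(temperatures[arm])
    means = [totals[i] / counts[i] if counts[i] else 0.0
             for i in range(len(temperatures))]
    best = max(range(len(temperatures)), key=means.__getitem__)
    return temperatures[best], counts, means


def simulated_judge_score(temperature, rng):
    """Hypothetical reward: a noisy score in [0, 1] that peaks near 0.4."""
    base = 1.0 - abs(temperature - 0.4)
    return min(1.0, max(0.0, base + rng.gauss(0, 0.1)))


rng = random.Random(0)
temperatures = [round(0.1 * i, 1) for i in range(11)]
best_temperature, pull_counts, mean_scores = run_bandit(
    temperatures, lambda t: simulated_judge_score(t, rng), iterations=200
)
print(best_temperature, pull_counts)
```

Inspecting `pull_counts` after a run shows how the budget was distributed: arms with low observed means and shrinking confidence bonuses stop being selected, which is where the savings over a uniform sweep come from.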
The multi-judge system addresses a critical weakness in LLM-as-judge evaluation: individual judgments are noisy and potentially biased. By running multiple judge evaluations and aggregating the scores (typically with the mean or median), AutoTemp reduces variance and makes the optimization more robust. The implementation calls the judge model multiple times with temperature=0; each judge sees the same candidate response and produces its own score from scratch. One caveat: at temperature=0, repeated calls tend to return very similar scores, so they are not independent in the statistical sense, and the aggregation smooths out residual nondeterminism rather than a judge's systematic bias.
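As a rough illustration of the aggregation step, the sketch below collapses per-judge, per-criterion scores into a single number. The data layout (one dict per judge) and the function name are assumptions made for this example, not AutoTemp's internal format.

```python
from statistics import mean, median


def aggregate_judge_scores(judge_scores, method="mean"):
    """Average each judge's criteria, then combine the judges."""
    per_judge = [mean(list(scores.values())) for scores in judge_scores]
    if method == "mean":
        return mean(per_judge)
    if method == "median":
        return median(per_judge)
    raise ValueError(f"unknown aggregation method: {method}")


# Made-up scores from three judge calls; the third is an outlier.
judge_scores = [
    {"clarity": 8, "accuracy": 8},
    {"clarity": 7, "accuracy": 9},
    {"clarity": 2, "accuracy": 2},
]
print(aggregate_judge_scores(judge_scores, "mean"))    # 6
print(aggregate_judge_scores(judge_scores, "median"))  # 8
```

The median ignores the single outlier judge while the mean is pulled down by it, which is the practical difference between the two aggregation choices.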
For statistical rigor, AutoTemp supports bootstrap confidence intervals around the final temperature recommendation. After identifying a winning temperature, it resamples from the observed score distributions to estimate uncertainty in the ranking. This is crucial for understanding whether temperature 0.4 actually outperforms 0.5 or if you're just seeing random noise.
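One common way to answer the "is 0.4 really better than 0.5" question is a percentile bootstrap on the difference in mean scores. The sketch below assumes that approach; the article does not specify which bootstrap variant AutoTemp uses, and the score lists are invented for illustration.

```python
import random
from statistics import mean


def bootstrap_mean_difference(scores_a, scores_b, n_resamples=2000,
                              alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Made-up judge scores for two neighboring temperatures.
scores_at_04 = [7.5, 8.0, 7.0, 8.5, 7.8, 8.2, 7.4, 8.1]
scores_at_05 = [7.2, 8.1, 7.6, 7.9, 7.0, 8.3, 7.7, 7.5]
low, high = bootstrap_mean_difference(scores_at_04, scores_at_05)
print(f"95% interval for the difference: [{low:.2f}, {high:.2f}]")
```

If the interval includes zero, the observed gap between the two temperatures is compatible with noise and more samples are needed before preferring one over the other.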
The repository includes both a Python library for programmatic use and a client-side web interface deployed on GitHub Pages. The web version is architecturally fascinating—it runs entirely in the browser using JavaScript to directly call OpenAI's API. This eliminates server infrastructure but requires users to input their API keys client-side, which then get included in browser-executed requests. It's a pragmatic choice for a research tool but highlights the security tradeoffs of client-side LLM applications.
Gotcha
The most obvious limitation is cost. AutoTemp's value proposition is "spend more API calls now to find the optimal temperature," which only makes economic sense in specific scenarios. If you're running a prompt thousands of times in production, spending $10-50 upfront to optimize temperature might save money long-term through better quality or fewer retry loops. But for one-off tasks or when using expensive models like GPT-4, the optimization cost can exceed the cost of just running your original prompt a few times manually. With default settings (10 temperatures × 5 samples × 3 judges), you're making 150 judge calls on top of the 50 generation calls before seeing results.
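The call count is easy to estimate before committing to a run. This small helper is a back-of-the-envelope sketch that assumes one API call per generated response and one per judge evaluation; the function name is illustrative.

```python
def count_api_calls(num_temperatures, samples_per_temperature, judges_per_response):
    """Return (generation calls, judge calls, total) for an exhaustive sweep."""
    generation_calls = num_temperatures * samples_per_temperature
    judge_calls = generation_calls * judges_per_response
    return generation_calls, judge_calls, generation_calls + judge_calls


print(count_api_calls(10, 5, 3))  # (50, 150, 200)
print(count_api_calls(11, 5, 3))  # (55, 165, 220)
```

Multiplying the total by your per-call price gives a quick upper bound on what a full sweep costs, which is the number to compare against the expected savings in production.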
The reliance on LLM judges for evaluation introduces meta-model bias—you're optimizing for what GPT-4 (or another judge model) thinks is good, which may not align with human preferences or domain-specific quality criteria. This is particularly problematic for subjective tasks like creative writing or nuanced argumentation where "quality" is inherently contested. The multi-judge ensemble helps but doesn't eliminate the fundamental issue that you're letting an LLM grade an LLM's homework. For high-stakes applications, you'd want to validate AutoTemp's recommendations against human evaluation on a sample of outputs.
The client-side web application, while clever, exposes API keys in the browser environment. Anyone with access to the browser session (a shared machine, a malicious extension, an injected script) can read the key from network requests or JavaScript memory using nothing more than dev tools. For personal experimentation this might be acceptable, but it's not suitable for shared environments or production-adjacent work. The Python library doesn't have this issue, but running it requires more setup than clicking a web link.
Finally, AutoTemp assumes that temperature is the dominant variable worth optimizing. In reality, output quality depends on prompt engineering, model selection, top-p sampling, frequency penalties, and numerous other factors. Optimizing temperature in isolation might miss larger improvements available from better prompt design or different models entirely.
Verdict
Use if: You're running systematic evaluations of prompt performance and want to eliminate temperature as a confounding variable; you're building production systems where the same prompt gets executed thousands of times and marginal quality improvements justify upfront optimization costs; you're conducting research on LLM behavior across temperature ranges and need a reproducible evaluation framework; or you're prototyping agentic systems where temperature selection needs to be adaptive and you want to explore bandit-based approaches.
Skip if: You're working with tight API budgets or expensive models where 100+ calls for optimization is prohibitive; your use case involves highly subjective or domain-specific quality criteria where LLM judges won't align with actual stakeholder preferences; you're building production applications with strict latency requirements (AutoTemp is a batch optimization tool, not a real-time system); you already have strong empirical evidence about optimal temperatures for your task type; or you're working on prompts where temperature variation has minimal impact on output quality (like structured data extraction or deterministic reasoning tasks that work best at temperature=0 anyway).