AutoTemp: Multi-Armed Bandit Optimization for LLM Temperature Selection
Hook
You’re probably using the wrong temperature for your LLM prompts—and the only way to know for sure costs 15× your current token budget.
Context
Temperature is the most consequential hyperparameter in language model inference, yet it’s routinely set by gut instinct. Set it too low (0.0-0.3) and your creative writing sounds robotic; too high (0.9-1.2) and your technical documentation hallucinates. The standard practice—picking 0.7 because it ‘feels balanced’ or cargo-culting OpenAI’s defaults—ignores the reality that optimal temperature varies wildly by task type, prompt complexity, and desired output characteristics.
AutoTemp treats this guessing game as a solvable optimization problem. Instead of picking one temperature and hoping for the best, it runs your prompt at multiple temperatures simultaneously, generates competing outputs, evaluates them using an ensemble of LLM judges across seven scoring dimensions (relevance, clarity, utility, creativity, coherence, safety, and overall quality), then selects the winner. For research workflows and high-stakes content generation, it adds a UCB1 multi-armed bandit algorithm that learns which temperatures perform best and dynamically allocates more attempts to promising candidates.
Technical Insight
AutoTemp’s architecture is a meta-optimization layer that wraps OpenAI’s API (or compatible endpoints) with two distinct modes. Standard Mode operates like a simple tournament: it generates one completion per temperature in your specified range, sends each output to multiple independent judge LLMs that score it using a structured rubric, aggregates the scores, and returns the highest-ranked result. Advanced Mode implements Upper Confidence Bound (UCB1) bandit optimization, treating each temperature as a slot machine arm and iteratively deciding which to pull next based on observed rewards and exploration bonuses.
The multi-judge evaluation system is the cornerstone of AutoTemp’s reliability. Instead of trusting a single LLM to evaluate outputs (which introduces the model’s own biases and variance), it spawns multiple independent judges—defaulting to three—and aggregates their scores across dimensions including relevance, clarity, utility, creativity, coherence, safety, and overall quality. This ensemble approach mirrors established practices in LLM-as-a-judge research, where multiple evaluators demonstrably reduce position bias and stylistic preferences that plague single-judge setups.
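A minimal sketch of this kind of ensemble aggregation helps make the mechanism concrete. The seven dimension names come from the README; the mean-of-means combination, the 0–10 scale, and the score values are illustrative assumptions, not AutoTemp's documented internals:

```python
from statistics import mean

DIMENSIONS = ["relevance", "clarity", "utility", "creativity",
              "coherence", "safety", "overall"]

def aggregate_judge_scores(judge_scores):
    """Average each dimension across independent judges, then
    combine with a simple mean-of-means (assumed scheme)."""
    per_dim = {d: mean(s[d] for s in judge_scores) for d in DIMENSIONS}
    composite = mean(per_dim.values())
    return per_dim, composite

# Three hypothetical judges scoring one candidate output (0-10 scale)
judges = [
    {"relevance": 9, "clarity": 8, "utility": 8, "creativity": 6,
     "coherence": 9, "safety": 10, "overall": 8},
    {"relevance": 8, "clarity": 9, "utility": 7, "creativity": 7,
     "coherence": 8, "safety": 10, "overall": 8},
    {"relevance": 9, "clarity": 8, "utility": 8, "creativity": 6,
     "coherence": 9, "safety": 10, "overall": 9},
]
per_dim, composite = aggregate_judge_scores(judges)
```

Averaging per dimension first keeps any single judge's outlier on one axis (say, creativity) from dominating the composite, which is the usual rationale for this style of aggregation.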
Here’s a minimal example of running AutoTemp programmatically in Standard Mode:
import os

from autotemp import AutoTemp

# Assumes the API key is supplied via the environment rather than hard-coded
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

agent = AutoTemp(
    judges=3,
    model_version="gpt-4",
)

# The README presents the Gradio UI as the primary interface; Standard Mode
# runs there or via programmatic calls, and Advanced Mode (UCB) is toggled
# in the UI with rounds and an exploration coefficient as its parameters.
Advanced Mode (UCB) is enabled through the Gradio UI by toggling ‘Advanced Mode (UCB)’ and setting the number of rounds and the exploration coefficient c. UCB1 balances exploration and exploitation: a higher coefficient favors trying under-sampled temperatures, while a lower one doubles down on known winners. According to the README, this mode treats each temperature as a bandit arm and pulls arms iteratively, using UCB1, for the specified number of rounds.
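The UCB1 selection rule itself fits in a few lines. The sketch below is a generic illustration of the algorithm, not AutoTemp's code: the `judge_reward` stub stands in for the judge ensemble's composite score, and the reward shape (peaking near 0.7) is purely hypothetical:

```python
import math
import random

def ucb1_select(stats, c=1.0):
    """Pick the next temperature arm by UCB1: mean reward plus an
    exploration bonus that shrinks as an arm accumulates pulls."""
    # Try every arm at least once before applying the UCB formula
    for arm, (pulls, _) in stats.items():
        if pulls == 0:
            return arm
    total_pulls = sum(pulls for pulls, _ in stats.values())
    best_arm, best_score = None, float("-inf")
    for arm, (pulls, total_reward) in stats.items():
        mean = total_reward / pulls
        bonus = c * math.sqrt(math.log(total_pulls) / pulls)
        if mean + bonus > best_score:
            best_arm, best_score = arm, mean + bonus
    return best_arm

def judge_reward(temp, rng):
    # Stand-in for the judges' composite score; assumes 0.7 is ideal
    return max(0.0, 1.0 - abs(temp - 0.7)) + rng.uniform(-0.05, 0.05)

rng = random.Random(42)
stats = {0.4: [0, 0.0], 0.7: [0, 0.0], 1.0: [0, 0.0]}  # arm -> [pulls, reward]
for _ in range(8):  # "rounds" in AutoTemp's terminology
    arm = ucb1_select(stats, c=1.0)
    stats[arm][0] += 1
    stats[arm][1] += judge_reward(arm, rng)
```

With three arms and eight rounds, the first three rounds cover each temperature once; the remaining five are allocated by the mean-plus-bonus score, which is where a larger c shifts pulls toward less-sampled arms.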
For research and benchmarking workflows, AutoTemp includes a dataset evaluation mode that processes multiple prompt-reference pairs:
dataset = [
    {"prompt": "Summarize: The mitochondria is the powerhouse of the cell...",
     "reference": "Mitochondria generate cellular energy through ATP production."},
    {"prompt": "Translate to Spanish: Hello, how are you?",
     "reference": "Hola, ¿cómo estás?"},
]

summary = agent.benchmark(
    dataset=dataset,
    temperature_string="0.4,0.7,1.0",
    top_p=0.9,
    models=["gpt-3.5-turbo", "gpt-4"],
    advanced=True,
    rounds=8,
    judges=3,
    csv_path="results.csv",
)
The benchmark method generates CSV exports and computes bootstrap confidence intervals for aggregate statistics. The summary includes mean_overall scores with confidence intervals, external metric means and CIs if optional dependencies (BLEU via sacrebleu, ROUGE via rouge-score, BERTScore via bert-score) are installed, and token usage with estimated USD cost per model. This level of statistical rigor positions AutoTemp as a research instrument rather than just a production utility.
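The percentile bootstrap behind those confidence intervals is simple to sketch. The resample count, seed, and score values below are illustrative choices, not AutoTemp's defaults:

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean:
    resample with replacement, collect means, take the alpha tails."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), (lo, hi)

# Hypothetical per-prompt mean_overall scores from a benchmark run
overall_scores = [7.8, 8.4, 6.9, 8.1, 7.5, 8.8, 7.2, 8.0]
m, (lo, hi) = bootstrap_ci(overall_scores)
```

The appeal of the bootstrap here is that it makes no normality assumption, which matters when a benchmark has only a handful of prompts.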
The repository also includes a client-side single-page application deployable to GitHub Pages. The static web app runs entirely in the browser using the user’s API key, making it trivial to share AutoTemp’s functionality without maintaining server infrastructure. This architectural choice—eschewing a backend proxy in favor of client-side API calls—prioritizes deployment simplicity and cost (zero) over key security and rate limiting.
Gotcha
AutoTemp’s core limitation is that it solves a $10 problem with a $150 solution. Running five temperatures with three judges means 20 API calls per prompt (five generations plus 15 judge evaluations), multiplying your token costs by an order of magnitude before you’ve generated a single user-facing output. Advanced Mode with multiple UCB rounds multiplies the call count further. This cost structure makes AutoTemp economically viable only when output quality justifies extreme token budgets—academic research, high-value content creation, or prompt engineering workflows where you’re optimizing a template that will be reused thousands of times.
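The call-count arithmetic is easy to verify. The Standard Mode pattern (one generation per temperature, each scored by every judge) follows from the description above; the Advanced Mode pattern (one generation per UCB round, each scored by every judge) is an assumption about how rounds multiply calls:

```python
def call_count(n_temps, n_judges, rounds=None):
    """API calls per prompt: generations plus judge evaluations.
    Standard mode generates once per temperature; Advanced mode
    (assumed) generates once per UCB round instead."""
    generations = rounds if rounds is not None else n_temps
    return generations + generations * n_judges

standard = call_count(n_temps=5, n_judges=3)            # 5 + 15 = 20
advanced = call_count(n_temps=5, n_judges=3, rounds=8)  # 8 + 24 = 32
```

Under these assumptions, even a modest eight-round Advanced Mode run costs more than thirty calls before the first usable output, which is the order-of-magnitude overhead the paragraph above describes.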
The judges-judging-generators circularity introduces subtle bias. AutoTemp uses LLMs to evaluate LLM outputs, which means judges may systematically prefer outputs that resemble their own generation patterns. A GPT-4 judge evaluating GPT-4 generations might reward stylistic choices that GPT-3.5 wouldn’t make, creating a preference feedback loop. The multi-judge ensemble mitigates this somewhat, but doesn’t eliminate it. When ground truth references are available, the external NLP metrics (BLEU, ROUGE, BERTScore) provide model-agnostic signals—but these require installing additional dependencies and only work for tasks with clear correct answers.
The UCB optimization assumes temperature effects are stationary: that a temperature performing well early in the bandit process will continue performing well later for the same prompt. This holds reasonably well for single prompts but breaks down when you try to generalize learned temperature preferences across different tasks. A temperature optimized for creative storytelling won’t transfer to technical Q&A. AutoTemp doesn’t learn cross-prompt temperature policies; every new prompt restarts exploration from scratch.
Verdict
Use AutoTemp if you’re conducting research on LLM generation parameters, optimizing high-value prompt templates where 10-20× token costs are justified by output quality, or need reproducible evidence for temperature selection decisions in academic work. The benchmarking infrastructure and statistical rigor make it particularly valuable for systematic evaluation across datasets. Skip it for production applications with latency or cost constraints, simple tasks where established heuristics (0.0 for facts, 0.7 for general use, 1.0+ for creativity) already work, or real-time user-facing features where waiting for 15+ API calls is unacceptable. This is a power tool for prompt engineers and researchers, not a drop-in replacement for temperature=0.7.