
PromptWizard: Teaching LLMs to Write Their Own Prompts Through Self-Critique

Hook

What if your LLM could debug its own prompts? PromptWizard makes the model its own prompt engineer—generating critiques, synthesizing examples, and evolving instructions through iterative feedback loops, completing in roughly 20-30 minutes for mathematical reasoning tasks.

Context

Prompt engineering remains one of the most labor-intensive bottlenecks in LLM deployment. Finding the right combination of instructions and examples typically requires extensive manual iteration by domain experts. Existing approaches often require separate optimizer models, while manual few-shot prompting breaks down when you lack high-quality training examples.

Microsoft Research’s PromptWizard attacks this problem with a counterintuitive approach: make the LLM optimize itself. Instead of relying on human prompt engineers or external optimization algorithms, PromptWizard implements a self-evolving mechanism where the model generates candidate prompts, critiques their weaknesses on validation data, and synthesizes improvements—all autonomously. The framework targets scenarios where you have training samples (as few as 25 in their examples) but need both better instructions and synthetic examples to reach production-grade performance. In their demonstrations on GSM8k (grade school math), SVAMP (arithmetic word problems), AQUARAT (algebraic reasoning), and BBII (instruction induction) datasets, optimization runs complete in roughly 20-30 minutes on average.

Technical Insight

PromptWizard implements a two-stage architecture that separates instruction optimization from example generation, then unifies them through sequential co-optimization. Stage 1 focuses purely on iterative instruction refinement: the LLM generates an initial prompt instruction based on the task description, evaluates it against a validation set, writes a critique identifying failure modes, then proposes a refined version. This cycle repeats for multiple iterations, with each refinement building on validation performance feedback.
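The Stage 1 loop can be sketched in a few lines. This is an illustrative reconstruction, not PromptWizard's actual API: `llm()` and `score()` are stubs standing in for real model calls and validation accuracy, and the function names are mine.

```python
# Illustrative sketch of the Stage 1 generate-critique-refine loop.
# llm() and score() are stubs; a real run would call OpenAI/Azure OpenAI
# and measure accuracy on held-out validation examples.

def llm(prompt):
    # Stand-in for an LLM API call.
    return "refined: " + prompt

def score(instruction, validation_set):
    # Stand-in for validation accuracy of the instruction (here: a dummy metric).
    return sum(len(instruction) > len(q) for q, _ in validation_set) / len(validation_set)

def refine_instruction(task_description, validation_set, iterations=3):
    instruction = llm(f"Write a prompt instruction for: {task_description}")
    best, best_score = instruction, score(instruction, validation_set)
    for _ in range(iterations):
        # Critique identifies failure modes; the rewrite targets them.
        critique = llm(f"Critique this instruction's failure modes: {instruction}")
        instruction = llm(f"Rewrite the instruction to fix: {critique}")
        s = score(instruction, validation_set)
        if s >= best_score:
            best, best_score = instruction, s
    return best
```

The key design point is that each refinement is conditioned on both the critique and the validation score, so the loop climbs toward instructions that fix observed failures rather than mutating blindly.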

Stage 2 introduces synthetic example generation alongside instruction optimization. Here’s where the architecture gets interesting: instead of just mining existing training data for few-shot examples, PromptWizard prompts the LLM to generate synthetic training samples that specifically address weaknesses identified during critique. The framework then combines positive examples (correct predictions), negative examples (failures with corrections), and these task-aware synthetic examples into a unified prompt structure.
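Assembling the unified prompt from the three example pools might look like the following sketch. The field layout and section formatting here are assumptions for illustration, not PromptWizard's internal representation.

```python
# Hypothetical assembly of a unified prompt from example pools.
# positives: (question, answer); negatives: (question, wrong, correction);
# synthetics: (question, answer) generated to cover critique-identified gaps.

def build_prompt(instruction, positives, negatives, synthetics, answer_format):
    sections = [instruction]
    for q, a in positives:
        sections.append(f"Q: {q}\nA: {a}")
    for q, wrong, fix in negatives:
        # Negative examples carry the failure plus its correction.
        sections.append(f"Q: {q}\nIncorrect: {wrong}\nCorrected: {fix}")
    for q, a in synthetics:
        sections.append(f"Q: {q}\nA: {a}")
    sections.append(answer_format)
    return "\n\n".join(sections)
```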

The framework supports three distinct scenarios based on data availability. Scenario 1 optimizes prompts without any examples—pure instruction tuning through self-critique. Scenario 2 generates synthetic examples from scratch when you have no training data, creating diverse samples that cover edge cases the base prompt missed. Scenario 3, the most powerful mode, takes existing training data and augments it with synthetic examples while simultaneously optimizing instructions. This holistic co-optimization, treating instructions and examples as co-dependent rather than independent, is what differentiates PromptWizard's approach.
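The three scenarios reduce to a simple decision over what data you have. A toy dispatch, purely illustrative:

```python
# Toy dispatch over the three data-availability scenarios (illustrative only).
def choose_scenario(has_train_data: bool, want_examples: bool) -> int:
    if not want_examples:
        return 1  # Scenario 1: instruction-only optimization via self-critique
    if not has_train_data:
        return 2  # Scenario 2: synthesize few-shot examples from scratch
    return 3      # Scenario 3: co-optimize instructions + augment real data
```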

Configuration happens through two files: a YAML config specifying optimization parameters and a .env file for API credentials. Here’s what a minimal setup looks like based on the GSM8k example:

# promptopt_config.yaml
task_description: "You are a mathematics expert. You will be given a mathematics problem which you need to solve"
base_instruction: "Let's think step by step."
answer_format: "At the end, wrap only your final option between <ANS_START> and <ANS_END> tags"
seen_set_size: 25
# Additional parameters configured per README examples

And a minimal driver script, based on the demo notebooks:

from promptwizard import PromptOptimizer

# Initialize with config
optimizer = PromptOptimizer(
    config_path="configs/promptopt_config.yaml",
    train_data_path="data/train.jsonl",
    validation_data_path="data/val.jsonl"
)

# Run optimization (Scenario 3: with training data)
optimized_prompt = optimizer.optimize()

# The result includes both instructions and curated examples
print(optimized_prompt.instruction)  # Refined instruction
print(len(optimized_prompt.examples))  # Positive + negative + synthetic

The data format is deliberately simple—.jsonl files where each line contains a JSON object with question and answer fields. This simplicity requires strict formatting for answer extraction: PromptWizard uses formatting tags (<ANS_START> and <ANS_END>) to parse LLM outputs reliably. The README indicates that different task types require custom extraction logic, implemented in a function such as extract_final_answer().
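A minimal sketch of both pieces: a one-line .jsonl record, and a tag-based extractor mirroring the README's <ANS_START>/<ANS_END> convention. The question text and extractor body are my own illustration, not code from the repository.

```python
import json
import re

# One .jsonl record: a JSON object with question and answer fields.
record = json.dumps({"question": "What is 8 * 9?", "answer": "72"})
assert json.loads(record)["answer"] == "72"

def extract_final_answer(llm_output: str):
    # Pull the string between the formatting tags; None if the tags are absent.
    m = re.search(r"<ANS_START>(.*?)<ANS_END>", llm_output, re.DOTALL)
    return m.group(1).strip() if m else None

print(extract_final_answer("Reasoning steps here. <ANS_START>72<ANS_END>"))  # 72
```

Note the non-greedy `(.*?)`: if the model emits the tags more than once, only the first span is captured, which is usually the safer default.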

One particularly clever design decision is the Chain-of-Thought integration. Rather than requiring manual CoT examples, PromptWizard prompts the LLM to generate reasoning steps for synthetic examples automatically, creating what the README describes as “Self generated Chain of Thought (CoT) steps with combination of positive, negative and synthetic examples.”

The feedback loop mechanism runs entirely through API calls to Azure OpenAI or OpenAI endpoints, as configured in the .env file. For GSM8k, SVAMP, AQUARAT, and BBII datasets, the README notes optimization took “around 20-30 minutes on average.” The runtime is primarily network latency and API queuing rather than computational overhead.
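The .env file itself is a handful of key-value pairs. The variable names below are placeholders to show the shape; check the repository's .env template for the exact keys your endpoint requires.

```shell
# Illustrative .env layout (variable names are placeholders, not the
# repository's exact keys — consult the README's .env template).
OPENAI_API_KEY="sk-..."          # or Azure OpenAI credentials
OPENAI_MODEL_NAME="gpt-4o"       # model used for critique and generation
```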

Gotcha

The primary constraint is API dependency and associated costs. PromptWizard’s self-evolving approach requires extensive LLM calls per optimization run, which makes it impractical for resource-constrained scenarios or rapid iteration during development. That 20-30 minute optimization time assumes you have stable API access—if you hit rate limits or experience endpoint throttling, runs can extend significantly or fail entirely. There’s no mention of local inference support or caching layers to mitigate this; the framework appears designed around API-based optimization.
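Since the framework itself offers no mitigation, one common pattern is to wrap the API client in jittered exponential backoff. This is a generic sketch of that pattern, not part of PromptWizard:

```python
import random
import time

# Generic jittered exponential backoff around a flaky API call
# (a common mitigation for rate limits; not provided by PromptWizard).
def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Delays of base, 2x, 4x, ... plus random jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
```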

Answer extraction requires careful configuration. The framework’s reliance on <ANS_START> and <ANS_END> tags works for controlled benchmarks like GSM8k where you can enforce output formatting, but the README explicitly notes that “It is crucial to set the answer_format properly to ensure correct extraction by def extract_final_answer()” and that developers need to “write code to extract string between the tags.” For tasks beyond the demonstrated mathematical reasoning and instruction-following domains, you’ll need custom extraction functions, and the README provides limited guidance beyond the GSM8k/BBII examples.

The framework has only been demonstrated on mathematical reasoning (GSM8k, SVAMP, AQUARAT) and instruction-following (BBII) benchmarks according to the README. There’s no evidence provided for generalization to code synthesis, multilingual tasks, retrieval-augmented generation, or creative/open-ended tasks. The synthetic example generation appears optimized for problems with clear correctness criteria rather than subjective evaluation.

The README notes that “time taken for prompt optimization is dependent on the dataset” and provides timing only for their specific benchmarks, suggesting optimization time could vary significantly for other task types or dataset sizes.

Verdict

Use PromptWizard if you’re deploying LLMs on well-defined reasoning or instruction-following tasks where you have training samples (25+ in their examples), can work with API-based optimization, and can tolerate approximately 30-minute optimization cycles. It appears well-suited when you need both instructions and examples optimized together—particularly valuable when your training data has gaps or edge cases. The self-evolving critique mechanism offers an alternative to manual engineering for math/reasoning domains, and the synthetic example generation creates diversity beyond simple retrieval.

Skip it if you’re working with tight API budgets (optimization requires many API calls), need real-time or frequent re-optimization, require support for creative/open-ended output formats beyond strict answer extraction, or are working outside the demonstrated domains (mathematical reasoning and instruction-following tasks like GSM8k, SVAMP, AQUARAT, and BBII). The framework currently supports Python and requires API access to Azure OpenAI or OpenAI endpoints. For prototyping, simple tasks with abundant training data, or scenarios where you need more control over the optimization process, consider whether the self-evolving approach fits your workflow and budget constraints.
