Harness-1: Training Search Agents with State Externalization

Hook

What if your AI agent's entire search history—every query, every document scored, every decision point—lived in inspectable Python dictionaries instead of hidden neural activations? That's the architectural bet behind Harness-1.

Context

Large language models have gotten impressive at answering questions, but their search behavior remains opaque. When you ask GPT-4 or Claude to research a complex topic, the model either dumps everything into its context window or generates reasoning traces that mix genuine retrieval strategy with hallucinated confidence. You can't inspect the search graph, replay decisions with different documents, or understand why the model chose to explore one path over another. The reasoning happens inside a black box of matrix multiplications.

Harness-1 takes a different approach: separate the search policy from the search state. Instead of asking a model to simultaneously decide what to search for, remember what it has already found, and track which documents are most relevant, this system uses a 20B-parameter model that only makes decisions. All the actual state (the candidate documents, the evidence sets, the verification records, the budget tracking) lives in an external Python harness. The model emits actions like "search for X" or "inspect document Y", and the harness updates the world state accordingly. This isn't just cleaner architecture. It enables things that are hard to get when state lives implicitly in a model's context: exact state recovery, mid-search debugging, and RL training optimized specifically for evidence gathering rather than hoping search emerges from next-token prediction.
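
To make "state recovery" concrete, here is a minimal sketch of what keeping state in plain Python containers buys you. The `SearchState` class and its field names are hypothetical, chosen for illustration rather than taken from the Harness-1 repo. The only point is that state held outside the model can be checkpointed and restored exactly.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SearchState:
    """Hypothetical container for what the harness tracks (not the repo's class)."""
    question: str
    candidate_docs: list = field(default_factory=list)   # doc IDs surfaced so far
    inspected_docs: dict = field(default_factory=dict)   # doc ID -> score record
    search_history: list = field(default_factory=list)   # one entry per action
    token_budget: int = 4096

    def snapshot(self) -> str:
        # Plain lists and dicts round-trip through JSON without loss.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def restore(cls, blob: str) -> "SearchState":
        return cls(**json.loads(blob))


state = SearchState(question="example question")
state.candidate_docs.append("doc-17")
state.search_history.append({"step": 0, "action": "search", "query": "example"})
checkpoint = state.snapshot()

# Keep mutating the live state, then roll back to the checkpoint.
state.token_budget -= 200
restored = SearchState.restore(checkpoint)
```

Because the checkpoint is just a JSON string, it can be written to disk mid-search, diffed against a later snapshot, or edited by hand before replaying a decision.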

Technical Insight

The core architectural insight is treating the language model as a stateless policy function. On each step, the harness serializes the current search state into a prompt, sends it to the 20B model via vLLM's raw completions endpoint, interprets the model's output as a structured action, executes that action against external tools (Chroma vector stores, rerankers, verification APIs), and updates the state accordingly. The model never "remembers" anything—it just sees state and emits actions.
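
The loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the repo's implementation: `serialize_state`, `apply_action`, and `run_episode` are hypothetical names, and a stub policy and stub search tool stand in for the 20B model and the vector store.

```python
from typing import Callable


def serialize_state(state: dict) -> str:
    """Render the externalized state as the prompt text the policy sees."""
    candidates = ", ".join(state["candidates"]) or "none"
    lines = [f"Question: {state['question']}"]
    lines += [f"Step {h['step']}: {h['action']}" for h in state["history"]]
    lines.append(f"Candidates: {candidates}")
    return "\n".join(lines) + "\nNext action:"


def apply_action(state: dict, action: str, search: Callable[[str], list]) -> bool:
    """Update state in place; return False once the policy signals it is done."""
    action = action.strip()
    if action.startswith("SEARCH:"):
        query = action.split(":", 1)[1].strip()
        for doc_id in search(query):
            if doc_id not in state["candidates"]:
                state["candidates"].append(doc_id)
    elif action != "DONE":
        raise ValueError(f"unrecognized action: {action!r}")
    state["history"].append({"step": len(state["history"]), "action": action})
    return action != "DONE"


def run_episode(question: str, policy, search, max_steps: int = 8) -> dict:
    state = {"question": question, "history": [], "candidates": []}
    for _ in range(max_steps):
        prompt = serialize_state(state)   # the policy's only input
        action = policy(prompt)           # stateless: nothing carried between calls
        if not apply_action(state, action, search):
            break
    return state


# Stubs standing in for the model and the retrieval backend.
def stub_policy(prompt: str) -> str:
    return "DONE" if "doc-1" in prompt else "SEARCH: multi-hop reasoning"


def stub_search(query: str) -> list:
    return ["doc-1", "doc-2"]


final_state = run_episode("What is multi-hop reasoning?", stub_policy, stub_search)
```

Note that the stub policy decides to stop purely from what appears in the prompt. That is the property the architecture relies on: anything the policy needs must be visible in the serialized state.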

Here's what the action parsing looks like in practice. The model generates tokens that the harness interprets as structured commands:

# Model generates: "SEARCH: multi-hop reasoning approaches"
# Harness interprets and executes (shown here as a method on the harness
# class; the method name is illustrative):
def execute_action(self, action):
    # Completions often start with a space after "Next action:"
    action = action.strip()

    if action.startswith("SEARCH:"):
        query = action.split(":", 1)[1].strip()
        # Query Chroma with OpenAI embeddings
        results = self.vector_store.similarity_search(
            query,
            k=20,
            filter={"corpus": self.current_corpus},
        )
        # Update harness state
        self.candidate_docs.extend(results)
        self.search_history.append({
            "step": self.step_count,
            "action": "search",
            "query": query,
            "retrieved": len(results),
        })
        self.token_budget -= estimate_tokens(query)

    elif action.startswith("INSPECT:"):
        doc_id = action.split(":", 1)[1].strip()
        # Fetch full document and apply reranker
        doc = self.doc_store.get(doc_id)
        relevance_score = self.reranker.score(doc, self.original_question)
        self.inspected_docs[doc_id] = {
            "content": doc,
            "score": relevance_score,
            "step": self.step_count,
        }

This separation has enormous implications for training. Instead of fine-tuning on QA pairs and hoping the model learns to search, Harness-1 uses RL with trajectory-level rewards. The reward signal comes from evidence recall: did the model's search strategy surface the ground-truth supporting documents? The training loop generates rollouts using this harness, computes rewards based on which documents ended up in the final curated set, and updates the policy to maximize evidence gathering.
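
As a rough illustration of a trajectory-level reward, the sketch below scores each rollout by the fraction of ground-truth supporting documents that end up in its curated set. The function name and the rollout dictionary layout are assumptions made for this example. The article does not specify the exact reward formula used in training, so treat this as one plausible reading of "evidence recall".

```python
def evidence_recall_reward(curated_ids, gold_ids) -> float:
    """Fraction of ground-truth supporting documents present in the curated set."""
    gold = set(gold_ids)
    if not gold:
        return 0.0  # no annotated evidence, so nothing to reward
    return len(gold & set(curated_ids)) / len(gold)


# Two hypothetical rollouts for the same question.
rollouts = [
    {"curated": ["d1", "d4"], "gold": ["d1", "d2"]},        # found one of two
    {"curated": ["d1", "d2", "d9"], "gold": ["d1", "d2"]},  # found both
]
rewards = [evidence_recall_reward(r["curated"], r["gold"]) for r in rollouts]
```

The reward is assigned once per trajectory rather than per step, so the policy gets credit for whichever sequence of searches and inspections led to the final set.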

The prompting strategy is deliberately low-level. Rather than using chat templates or high-level abstractions, the system works with integer token IDs directly:

# Construct prompt as raw token IDs
state_tokens = self.tokenizer.encode(
    self._serialize_state(),
    add_special_tokens=False
)
action_prefix = self.tokenizer.encode(
    "\nNext action:",
    add_special_tokens=False
)
prompt_ids = state_tokens + action_prefix

# Send to vLLM completions endpoint
response = self.vllm_client.completions.create(
    model="harness-1-20b",
    prompt=prompt_ids,  # Raw integers
    max_tokens=50,
    temperature=0.7,
    # Stop strings are excluded from the returned text by default,
    # so a completion that is just "DONE" comes back empty
    stop=["\n", "DONE"]
)
action = response.choices[0].text.strip()

This gives precise control over tokenization without fighting chat templates that try to be helpful by adding conversational framing. The harness can inject structured state representations and be confident about exactly what tokens the model sees.

The budget tracking is particularly clever. Each action has a token cost (search queries cost less than full document inspection), and the harness maintains a running budget. The model learns cost-benefit tradeoffs:

class BudgetTracker:
    def __init__(self, total_budget=4096):
        self.remaining = total_budget
        self.action_costs = {
            "search": 50,
            "inspect": 200,
            "verify": 300,
            "curate": 100
        }
    
    def can_afford(self, action_type):
        return self.remaining >= self.action_costs[action_type]
    
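    def charge(self, action_type, actual_tokens=None):
        # Compare against None so a measured cost of 0 is not
        # silently replaced by the default cost
        if actual_tokens is None:
            cost = self.action_costs[action_type]
        else:
            cost = actual_tokens
        self.remaining -= cost
        return self.remaining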

The RL training rewards policies that gather comprehensive evidence before running out of budget. This creates pressure to prioritize high-value actions: broad search early to discover candidates, targeted inspection of promising documents, verification only when needed.
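
To see why the budget creates that pressure, consider this small sketch. It reuses the per-action costs from the `BudgetTracker` example, while `affordable_actions` and `spend` are hypothetical helpers written for this illustration. A plan that inspects and verifies too eagerly runs out of budget before it ever reaches the curate step.

```python
# Per-action costs copied from the BudgetTracker example.
ACTION_COSTS = {"search": 50, "inspect": 200, "verify": 300, "curate": 100}


def affordable_actions(remaining: int) -> list:
    """Action types the policy could still pay for."""
    return sorted(a for a, cost in ACTION_COSTS.items() if cost <= remaining)


def spend(plan: list, total_budget: int):
    """Execute a planned action sequence until the next action is unaffordable."""
    remaining, executed = total_budget, []
    for action in plan:
        if ACTION_COSTS[action] > remaining:
            break
        remaining -= ACTION_COSTS[action]
        executed.append(action)
    return executed, remaining


plan = ["search", "inspect", "verify", "inspect", "curate"]
executed, remaining = spend(plan, total_budget=600)
options_left = affordable_actions(remaining)
```

With a budget of 600, the plan stops after the first verify with 50 tokens left, enough for one more search but not for curation. A policy trained against this constraint has to learn to reserve budget for the actions that actually commit evidence.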

The evaluation infrastructure exposes a critical distinction: trajectory recall versus final-answer recall. Trajectory recall counts a ground-truth document as found if it surfaced at any point during the search. Final-answer recall counts it only if it ends up in the final curated set. The model might discover relevant documents during search (high trajectory recall) but fail to include them in the curated answer set (low final recall). That gap is itself an optimization target: the policy needs to not just find evidence but explicitly mark it for inclusion.
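
Both metrics are easy to compute from a logged trajectory. The sketch below assumes a hypothetical log format (a list of steps, each with an `action` and the `doc_ids` it touched) and uses made-up document IDs. The repo's evaluation code may structure this differently.

```python
def recall(found, gold) -> float:
    """Share of gold documents that appear in `found`."""
    gold = set(gold)
    return len(gold & set(found)) / len(gold) if gold else 0.0


# Hypothetical trajectory log: which documents each step touched.
trajectory = [
    {"action": "search", "doc_ids": ["d1", "d3", "d7"]},
    {"action": "inspect", "doc_ids": ["d3"]},
    {"action": "search", "doc_ids": ["d2", "d8"]},
    {"action": "curate", "doc_ids": ["d1", "d8"]},
]
gold = ["d1", "d2", "d3"]

# Everything the agent ever saw vs. what it committed to the final set.
seen = {d for step in trajectory for d in step["doc_ids"]}
curated = {
    d for step in trajectory if step["action"] == "curate" for d in step["doc_ids"]
}

trajectory_recall = recall(seen, gold)
final_recall = recall(curated, gold)
found_but_dropped = sorted((set(gold) & seen) - curated)
```

Here the agent surfaced all three gold documents but curated only one of them. The `found_but_dropped` list is the actionable part: it names exactly which evidence the policy saw and then failed to keep.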

Gotcha

The infrastructure requirements are non-trivial. You need a Chroma vector database with your entire document corpus pre-indexed using a specific chunking strategy and embedding model (OpenAI's text-embedding-3-large). The repo doesn't include corpus preparation scripts—it assumes you arrive with a "compatible retrieval backend" already built. For BrowseComp+ evaluation, document IDs must align exactly with ground-truth annotations, which means you're either using their exact preprocessing pipeline or rebuilding it from scratch.

The training loop uses private Tinker infrastructure. The README explains how to serve and evaluate the pre-trained checkpoint, but the actual RL training code that generated it is not open-sourced. You can't reproduce the full train-from-scratch workflow without access to Tinker's adapter merging and RL orchestration systems. This is evaluation-only infrastructure for external researchers.

Deployment costs are real. The 20B-parameter model requires H100-class GPUs for the validated performance numbers. Running vLLM with this checkpoint, maintaining a Chroma instance with your corpus, and paying for OpenAI embedding calls and reranker API calls all add up quickly. The promise is that 20B is more efficient than 70B+ frontier models, but you're still well outside CPU inference or consumer GPU territory. Performance on quantized variants or smaller hardware is undocumented.

Verdict

Use if: You're building agentic search systems where interpretability matters more than plug-and-play convenience, and you need to debug why your agent chose certain documents, validate that multi-hop retrieval is actually happening, or train policies specifically for evidence gathering rather than hoping search emerges from QA fine-tuning. The state externalization gives you visibility that opaque reasoning models can't match, and the RL training directly optimizes for retrieval behavior. Best for legal research, medical literature review, compliance checking, or any domain where comprehensive evidence gathering is the primary goal.

Skip if: You want simple inference without infrastructure complexity, lack H100-class GPUs, need offline operation without API dependencies, or your documents fit comfortably in extended context windows (just use Claude with 200K context instead). The tax for state externalization is building and maintaining a Chroma index, orchestrating multiple services, and accepting that evaluation means running complex multi-component pipelines. Also skip if you need to retrain from scratch—the open-sourced components support evaluation only, not reproducing the full RL training workflow.
