MiniHF: Building Domain-Specific Language Models Through Constitutional AI and Tree Search

Hook

What if your language model could critique its own outputs during inference—not through expensive self-reflection prompts, but through a frozen parallel model trained from the same foundation weights?

Context

The standard path to custom language models follows a well-worn groove: collect instruction data, fine-tune with supervised learning, maybe add RLHF if you have infrastructure budget. But this pipeline assumes you know what you want upfront and can afford the compute costs of enterprise-scale reinforcement learning. For individual researchers and small teams working with local models, the gap between "clever prompt" and "production-ready fine-tune" remains frustratingly wide.

MiniHF emerged from a different philosophy: prompts aren't products, they're scaffolding. The tool treats prompt development as an intermediate step toward creating entirely new document types that your base model has never seen. Rather than wrestling with system messages and few-shot examples, you interactively generate training data through a branching interface, then distill that knowledge into model weights. The twist is its dual LoRA architecture paired with Monte Carlo Tree Search inference—letting a single foundation model hold both a generator and an evaluator perspective simultaneously, with the evaluator guiding better outputs through rejection sampling during inference.

Technical Insight

MiniHF's architecture splits traditional language model capabilities into two distinct LoRA adapters trained on the same foundation model. The generator LoRA produces text, while the evaluator LoRA scores outputs. This separation of concerns is meant to guard against the value collapse common in self-improving systems. During inference, the evaluator stays frozen while guiding the generator through tree search, a technique MiniHF calls "Weave."

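The Weave algorithm applies Monte Carlo Tree Search to text generation, branching over short multi-token continuations. At each generation step, the system samples multiple candidate continuations, scores them with the evaluator LoRA, and continues from the highest-scoring branch. The sketch below reduces this to greedy best-of-n selection at each step: it keeps a single branch and never backtracks, so it illustrates the idea rather than a full tree search. Scoring happens entirely locally without additional API calls, turning inference into a search problem rather than a single sampling pass: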

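# Simplified Weave concept: greedy best-of-n selection at each step, not a full MCTS.
# generator_lora.generate and evaluator_lora.evaluate are illustrative placeholders;
# generate is assumed to return only the newly generated continuation.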
def weave_generate(prompt, generator_lora, evaluator_lora, branches=3, depth=5):
    current_text = prompt
    
    for step in range(depth):
        candidates = []
        
        # Generate multiple candidate continuations
        for _ in range(branches):
            continuation = generator_lora.generate(
                current_text, 
                max_new_tokens=20
            )
            candidates.append(continuation)
        
        # Score each candidate with frozen evaluator
        scores = []
        for candidate in candidates:
            score = evaluator_lora.evaluate(
                current_text + candidate,
                criterion="quality"  # Constitutional principle
            )
            scores.append(score)
        
        # Select best branch and continue
        best_idx = scores.index(max(scores))
        current_text += candidates[best_idx]
    
    return current_text
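
The sketch above leaves open how an evaluator turns text into a number. One common approach for zero-shot evaluators, assumed here rather than taken from MiniHF's code, is to ask the model a yes/no question about the text and compare the logits it assigns to the two answers. The helper names below (`yes_probability`, `pick_best`) are hypothetical, and the logits are passed in as plain floats so the example runs without a model:

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Probability of "yes" under a two-way softmax over the yes/no logits.

    This equals the sigmoid of the logit difference, computed in a
    numerically stable way for large positive or negative differences.
    """
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pick_best(candidates, logit_pairs):
    """Return the candidate with the highest yes-probability, plus all scores.

    candidates:  list of candidate continuations (strings)
    logit_pairs: list of (yes_logit, no_logit) tuples, one per candidate
    """
    scores = [yes_probability(y, n) for y, n in logit_pairs]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores
```

Scoring with a probability rather than a raw logit keeps scores in the range 0 to 1, which makes them easier to compare across prompts.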

The training workflow centers on constitutional AI principles adapted for local execution. Users start by generating examples through MiniHF's Flask-based branching interface, where each writing session creates a tree of alternative completions. You mark preferred branches, and this preference data feeds both LoRAs. The generator trains on human-preferred completions using standard supervised fine-tuning, while the evaluator learns to predict which branches got selected.
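
To make the data flow concrete, here is a minimal sketch of how a tree of marked branches could be turned into two datasets. The `Branch` class, the `collect_examples` function, and the dictionary field names are all hypothetical; MiniHF's actual session format is not shown in this article. The sketch assumes only preferred branches are expanded further, so the walk descends only through them:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Branch:
    """One completion in a writing session tree (hypothetical structure)."""
    text: str
    preferred: bool = False
    children: List["Branch"] = field(default_factory=list)


def collect_examples(root_prompt, branches):
    """Split a preference tree into generator and evaluator training data.

    Returns (sft_examples, evaluator_examples):
      - sft_examples: prompt/completion pairs for preferred branches only
      - evaluator_examples: every branch, labeled 1 if preferred, else 0
    """
    sft_examples = []
    evaluator_examples = []

    def walk(context, siblings):
        for branch in siblings:
            evaluator_examples.append(
                {"context": context, "completion": branch.text,
                 "label": int(branch.preferred)}
            )
            if branch.preferred:
                sft_examples.append(
                    {"prompt": context, "completion": branch.text}
                )
                # Only continue the session along the branch the user kept.
                walk(context + branch.text, branch.children)

    walk(root_prompt, branches)
    return sft_examples, evaluator_examples
```

The rejected branches are not wasted under this scheme: they become the negative examples the evaluator needs, while the generator sees only the text the user endorsed.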

What makes this architecture powerful is the frozen evaluator during RLAIF (Reinforcement Learning from AI Feedback) loops. After initial training, the generator LoRA continues updating via reinforcement learning signals from the evaluator, but the evaluator itself stays frozen. This prevents the feedback spiral in which a model's values drift because the weights being updated also supply the reward signal, a known risk in self-rewarding training setups. The evaluator maintains a stable "constitutional" perspective learned from your initial human preferences.

The dual LoRA approach also solves a practical deployment problem: you can load both adapters onto a single foundation model in VRAM and switch between them during tree search, activating the generator to sample and the evaluator to score. Because the base weights are shared, this is far more memory-efficient than running separate generator and reward models:

# Loading dual LoRAs in MiniHF
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load generator adapter
generator = PeftModel.from_pretrained(
    base_model,
    "./outputs/generator_lora",
    adapter_name="generator"
)

# Load evaluator adapter on same model
generator.load_adapter(
    "./outputs/evaluator_lora",
    adapter_name="evaluator"
)

# Switch between them during inference; each call makes the named adapter the active one
generator.set_adapter("generator")  # For text generation
generator.set_adapter("evaluator")  # For scoring
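
Because each `set_adapter` call changes state on the shared model, it is easy to leave the wrong adapter active after a scoring pass. One way to guard against that is a small context manager that restores the previous adapter on exit. This is a sketch, not part of MiniHF or PEFT: `use_adapter` is a hypothetical helper, and it assumes the model exposes `set_adapter` plus an `active_adapter` attribute holding a single adapter name, which is worth verifying against your installed PEFT version:

```python
from contextlib import contextmanager


@contextmanager
def use_adapter(model, name):
    """Temporarily activate adapter `name`, then restore the previous one.

    Assumes `model.active_adapter` is a single adapter name (a string) and
    `model.set_adapter(name)` switches the active adapter.
    """
    previous = model.active_adapter
    model.set_adapter(name)
    try:
        yield model
    finally:
        # Restore even if scoring raises, so generation resumes correctly.
        model.set_adapter(previous)
```

With the model from the loading example, a scoring pass would then be wrapped as `with use_adapter(generator, "evaluator"): ...`, and the generator adapter is active again once the block exits.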

The system's philosophy of "inventing document types" manifests in how it handles training data. Rather than collecting question-answer pairs or instruction-following examples, MiniHF encourages users to generate extended documents that exemplify the style, reasoning, or knowledge they want the model to internalize. A session might involve writing a detailed technical analysis, branching at decision points to explore different explanatory approaches, then marking which branches best achieve your goals. This interactive exploration becomes the training corpus.

MiniHF deliberately minimizes dependencies—it's built on Flask, PyTorch, Transformers, and PEFT, avoiding heavy frameworks. The web interface serves as a data collection tool first and inference UI second, with all the interesting work happening in the training loop configuration and Weave search implementation.

Gotcha

The project documentation is refreshingly honest about RLAIF limitations: the current implementation is "not robust" and tends toward mode collapse. In practice, this means models trained through multiple RLAIF iterations often converge to repetitive outputs or agreeable but vacuous responses ("Yes, that's correct" loops are specifically mentioned). The zero-shot evaluator setup lacks the grounding needed for stable long-term training, and there's currently no pathway to incorporate human feedback into evaluator tuning.

Hardware requirements are substantial. While MiniHF works with smaller models, practical use cases targeting Mistral-7B or larger demand A6000-class GPUs (40GB+ VRAM) and around 1TB storage for model checkpoints and LoRA experiments. The tree search inference, while clever, multiplies compute costs linearly with the branch factor: generating with branches=5 means running the generator five times per search step, plus an evaluator pass for each candidate. This makes Weave impractical for real-time applications or resource-constrained environments. The tool assumes you're committed to local infrastructure and willing to tolerate longer iteration cycles compared to API-based development workflows.
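
The overhead can be estimated with simple arithmetic. The function below is a back-of-the-envelope sketch based on the simplified Weave loop shown earlier in this article (one generator call and one evaluator call per candidate, 20 new tokens per candidate), not a measurement of MiniHF itself; `weave_cost` and its field names are hypothetical:

```python
def weave_cost(branches, depth, tokens_per_branch=20):
    """Count model calls and tokens for a greedy best-of-n search.

    Each of the `depth` steps samples `branches` candidates and scores
    each one once, but keeps only a single candidate's tokens.
    """
    generator_calls = branches * depth
    evaluator_calls = branches * depth
    generated_tokens = generator_calls * tokens_per_branch
    kept_tokens = depth * tokens_per_branch
    return {
        "generator_calls": generator_calls,
        "evaluator_calls": evaluator_calls,
        "generated_tokens": generated_tokens,
        "kept_tokens": kept_tokens,
        # Tokens generated per token that ends up in the final output.
        "generation_overhead": generated_tokens / kept_tokens,
    }


cost = weave_cost(branches=5, depth=5)
print(cost["generated_tokens"], cost["kept_tokens"])  # 500 100
```

Under these assumptions, branches=5 generates five tokens for every token kept, before counting the 25 evaluator passes, which is why the branch factor is the first setting to lower on constrained hardware.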

Verdict

Use MiniHF if you're developing specialized language models for narrow domains where you need full control over training data and can provide high-quality human feedback through interactive sessions. It's ideal for researchers exploring constitutional AI approaches, teams with domain expertise who want to encode knowledge into weights rather than prompts, and anyone allergic to API costs who has access to serious GPU hardware. The dual LoRA architecture and Weave search are genuinely novel approaches to local LLM development.

Skip it if you need production-ready RLHF infrastructure with proven stability, want polished UX for non-technical users, lack access to 40GB+ VRAM GPUs, or prefer working with commercial API-based models. The mode collapse issues and experimental nature make this strictly a power-user tool for those comfortable debugging training dynamics and accepting rough edges in exchange for architectural flexibility.