How Open-Assistant Built a ChatGPT Alternative with 160,000 Crowdsourced Conversations

Hook

In just 11 months, a volunteer army of 13,500 contributors created one of the most valuable open datasets in conversational AI—then deliberately shut the project down at the height of its success.

Context

When ChatGPT launched in November 2022, it ignited a race to democratize conversational AI. The problem wasn't just model architecture—OpenAI had already published the InstructGPT paper detailing their approach. The real bottleneck was Reinforcement Learning from Human Feedback (RLHF), which required massive amounts of human-labeled preference data that only well-funded labs could afford. Most open-source attempts focused on smaller datasets or synthetic data generation, but these approaches couldn't capture the nuance and diversity of real human conversations.

LAION-AI, the nonprofit behind the LAION-5B image dataset that powered Stable Diffusion, saw an opportunity. Rather than trying to replicate ChatGPT's capabilities immediately, they built Open-Assistant as a dual-purpose system: a functional chat assistant and a massively parallel data collection platform. By gamifying the annotation process with leaderboards, streaks, and varied tasks, they transformed RLHF data collection from an expensive outsourced process into a community-driven contribution model. The result was OASST2: 160,000+ messages across conversation trees in 35+ languages, with multiple human-ranked responses per prompt—all released under permissive licenses.

Technical Insight

Open-Assistant's architecture elegantly separated concerns across three primary systems. The frontend, built in Next.js, presented users with gamified tasks: creating prompts, writing assistant responses, ranking multiple replies, or labeling message quality. Each interaction contributed to conversation trees stored in PostgreSQL, where parent-child message relationships preserved context chains. The backend FastAPI service managed task distribution, ensuring users received diverse assignments weighted by their language preferences and previous contribution quality.

The conversation tree structure was the system's secret weapon. Unlike simple question-answer pairs, Open-Assistant stored messages as nodes with multiple children, enabling the same prompt to have several human-written responses that could then be ranked against each other. This data structure maps directly to RLHF's preference learning phase:

# Simplified conversation tree structure
class Message:
    id: UUID
    parent_id: Optional[UUID]  # None for root messages
    text: str
    role: Literal["prompter", "assistant"]
    lang: str
    
class MessageRanking:
    message_ids: List[UUID]  # Ordered from best to worst
    ranking_user_id: UUID
    
# Query for preference pairs
def get_preference_pairs(parent_msg_id):
    children = db.query(Message).filter(
        Message.parent_id == parent_msg_id,
        Message.role == "assistant"
    ).all()
    
    rankings = db.query(MessageRanking).filter(
        MessageRanking.message_ids.contains(children.ids)
    ).all()
    
    # Convert rankings to pairwise preferences
    pairs = []
    for ranking in rankings:
        for i, better_id in enumerate(ranking.message_ids[:-1]):
            for worse_id in ranking.message_ids[i+1:]:
                pairs.append((better_id, worse_id))
    return pairs

This tree structure enabled efficient preference dataset creation. For any prompt, the system could generate dozens of preference pairs from a single ranking task, dramatically improving data efficiency compared to traditional binary comparison approaches.

The training pipeline followed the three-stage InstructGPT methodology. Stage one used supervised fine-tuning on the highest-quality assistant responses, bootstrapping a model that could generate plausible completions. Stage two trained a reward model—essentially a classifier predicting which responses humans would prefer—using the Bradley-Terry preference model on ranked pairs. Stage three used Proximal Policy Optimization (PPO) to fine-tune the base model, treating the reward model as an objective function while maintaining KL-divergence constraints to prevent reward hacking.

What made Open-Assistant architecturally interesting wasn't novel ML techniques but rather the production engineering around community-scale data collection. The task assignment system used a sophisticated queue that balanced multiple objectives: ensuring geographic and linguistic diversity, preventing single users from dominating conversation threads, matching task difficulty to user reputation scores, and maintaining engagement through variety. The leaderboard system used a composite score weighing both quantity and quality metrics, with quality determined by downstream model performance and inter-rater agreement.

The inference system, containerized separately from the data collection platform, supported model hosting through a FastAPI endpoint with WebSocket streaming for incremental responses. The inference configuration exposed key parameters like temperature, top-p sampling, and repetition penalties:

# Inference configuration from the Open-Assistant API
class InferenceConfig:
    model_name: str
    max_new_tokens: int = 1024
    temperature: float = 1.0
    top_p: float = 0.9
    repetition_penalty: float = 1.2
    
    # Safety constraints
    stop_sequences: List[str] = ["<|endoftext|>", "Human:"]
    max_context_length: int = 2048
    
# The streaming implementation used server-sent events
@app.websocket("/inference/{conversation_id}")
async def stream_inference(websocket, conversation_id):
    config = await get_user_config(conversation_id)
    conversation = await load_conversation_tree(conversation_id)
    
    prompt = format_conversation_as_prompt(conversation)
    
    async for token in model.generate_stream(
        prompt,
        **config.dict()
    ):
        await websocket.send_json({"token": token})
        
    await update_conversation_tree(
        conversation_id,
        new_message=generated_text
    )

The Docker-compose orchestration tied everything together, with separate services for the web frontend, backend API, PostgreSQL database, Redis cache for task queues, and optional inference workers. This modularity meant contributors could run just the data collection interface without requiring GPU resources, while researchers could deploy only the inference stack with pre-trained models.

Gotcha

The project's completed status is the primary limitation—don't mistake this for an actively maintained chatbot platform. The repository clearly states the project finished its mission in October 2023, and while the code remains accessible, there's no ongoing support for issues, security patches, or integration with newer model architectures. The inference system in particular shows its age: it predates techniques like flash attention, grouped-query attention, and modern quantization methods that make large language models far more efficient today.

Local deployment is deceptively complex. While Docker Compose files exist, actually running the full stack requires navigating multiple configuration files, understanding the relationship between data collection and inference components, and managing database migrations. The documentation assumes significant familiarity with ML infrastructure. Expect to spend hours debugging environment issues if you're trying to replicate the training pipeline rather than just accessing the dataset. The reward modeling and PPO training code especially requires substantial GPU memory—training even the smallest models demands multiple A100s or equivalent hardware. This isn't a project you can experiment with on a laptop, and the infrastructure costs to reproduce the full training run would be prohibitive for individuals or small teams.

Verdict

Use if: You're researching RLHF methodologies and want to study a complete, production-grade implementation with real-world data at scale; you need high-quality multilingual conversational datasets for instruction tuning; you're building educational content about LLM fine-tuning and want concrete code examples of reward modeling or PPO; or you're analyzing the evolution of open-source AI and need a historical reference for early ChatGPT alternatives. Skip if: You want an actively maintained chatbot to deploy in production; you need modern inference optimizations and model serving infrastructure; you're looking for a user-friendly local AI chat interface (use Ollama or LM Studio instead); or you want to contribute to an active open-source AI project—consider alternatives like Hugging Face TRL or Axolotl that incorporate lessons learned from Open-Assistant's pioneering work. This repository's value lies in its dataset and historical significance, not as a foundation for new development.

How Open-Assistant Built a ChatGPT Alternative with 160,000 Crowdsourced Conversations

How Open-Assistant Built a ChatGPT Alternative with 160,000 Crowdsourced Conversations

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]

How Open-Assistant Built a ChatGPT Alternative with 160,000 Crowdsourced Conversations

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// RELATED

Harness-1: Training Search Agents with State Externalization

makemore: Understanding Language Models by Implementing Them Seven Different Ways

JARVIS: The LLM-Orchestrated AI System That Pioneered Multi-Model Task Automation

Stanford Alpaca: The $500 Experiment That Democratized LLM Fine-Tuning

Harness-1: Training Search Agents with State Externalization

makemore: Understanding Language Models by Implementing Them Seven Different Ways

JARVIS: The LLM-Orchestrated AI System That Pioneered Multi-Model Task Automation

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]