Back to Articles

WARC-GPT: Teaching Language Models to Read Web Archive History

[ View on GitHub ]

WARC-GPT: Teaching Language Models to Read Web Archive History

Hook

The Library of Congress stores over 800 billion archived web pages in WARC format, but finding specific information requires knowing exact URLs and dates. What if you could just ask questions instead?

Context

Web archives are the institutional memory of the internet. Organizations like the Internet Archive, national libraries, and universities capture billions of web pages in WARC (Web ARChive) format—a standardized container that packages HTTP responses, headers, and metadata. But these archives suffer from a fundamental usability problem: accessing them requires replay tools that show you pages as they appeared, forcing you to browse historically rather than search semantically.

Traditional full-text search over WARCs exists through tools like Solr and OpenWayback, but keyword matching falls short for research questions that span concepts rather than exact phrases. A historian researching "public sentiment about climate policy in 2015" doesn't want every page containing those words—they want relevant discussions, even if phrased differently. Harvard Library Innovation Lab built WARC-GPT to explore whether Retrieval Augmented Generation could make decades of archived web content conversationally queryable while keeping data under institutional control.

Technical Insight

WARC-GPT implements a classic RAG pipeline with domain-specific adaptations for web archive formats. The architecture splits cleanly into ingestion and inference phases, using ChromaDB as the semantic bridge between archived web content and language models.

The ingestion process walks through WARC files using the warcio library, filtering for HTML and PDF responses. For each record, it extracts text content—using BeautifulSoup for HTML and PyPDF2 for PDFs—then chunks it based on the embedding model's context window. Here's the core chunking logic:

def chunk_text(text, model_name, chunk_overlap=200):
    # Get context window for embedding model
    context_window = EMBEDDING_MODELS.get(model_name, 512)
    
    # Simple character-based chunking
    chunk_size = context_window * 4  # Rough chars-to-tokens
    chunks = []
    
    for i in range(0, len(text), chunk_size - chunk_overlap):
        chunk = text[i:i + chunk_size]
        chunks.append({
            'text': chunk,
            'metadata': {
                'start_char': i,
                'end_char': i + len(chunk)
            }
        })
    
    return chunks

This character-based approach with overlap ensures embeddings capture complete thoughts across chunk boundaries, though it's agnostic to semantic breaks like paragraphs or sections. Each chunk gets embedded using sentence-transformers (default: all-MiniLM-L6-v2) and stored in ChromaDB with metadata linking back to the original WARC record URL and timestamp.

The query-time architecture is where WARC-GPT's flexibility shines. It supports three LLM backends through a provider abstraction: OpenAI's API, Ollama for local models, and any OpenAI-compatible endpoint. The retrieval step queries ChromaDB with the user's question, fetching top-k semantically similar chunks:

# Semantic retrieval from ChromaDB
results = collection.query(
    query_texts=[user_question],
    n_results=5,
    include=['documents', 'metadatas', 'distances']
)

# Build context from retrieved chunks
context = "\n\n".join([
    f"[Source: {meta['url']} - {meta['timestamp']}]\n{doc}"
    for doc, meta in zip(results['documents'][0], results['metadatas'][0])
])

The retrieved context gets injected into a prompt template that instructs the LLM to answer based on archived content. WARC-GPT maintains conversation history in-memory, allowing multi-turn dialogues where users can ask follow-ups like "tell me more about that policy" without re-establishing context.

One thoughtful feature is the T-SNE visualization endpoint. After ingestion, you can generate 2D projections of the embedding space to see how content clusters—useful for understanding whether your WARC collection contains distinct topical areas or homogeneous content. This interpretability layer helps archivists validate that embeddings meaningfully represent their collections.

The Flask web UI provides a ChatGPT-style interface, but the real power is in the API endpoints that could integrate into archival discovery systems. A library could add a "Ask about this collection" feature to their digital collections portal, routing questions through WARC-GPT to surface relevant historical snapshots.

Gotcha

WARC-GPT's experimental status shows most clearly in its data management: every ingestion run wipes the entire ChromaDB database. There's no incremental indexing, no way to add new WARCs without re-processing everything, and no persistence guarantees beyond the local filesystem. For a 50GB WARC collection, this means hours of re-ingestion if you want to add one more file. Production use would require forking and implementing collection-level indexing with unique identifiers.

The content extraction pipeline also reveals age. It only handles text/html and application/pdf MIME types, ignoring the reality that modern web archives increasingly capture JavaScript-rendered single-page applications, multimedia content, and API responses. A WARC of a 2024 React application would yield almost nothing useful—just the sparse HTML shell before JavaScript execution. There's no headless browser rendering, no JavaScript execution, and no handling of dynamic content that defines contemporary web experiences. Additionally, the chunking strategy can split mid-sentence or separate critical context from its referent, and there's no deduplication to handle near-identical pages captured at different timestamps. For researchers working with archives of social media platforms or modern web applications, these limitations make WARC-GPT a non-starter without significant extension work.

Verdict

Use WARC-GPT if you're a digital librarian, archivist, or researcher exploring how RAG can unlock web archive collections, particularly when privacy concerns demand local model deployment. It's ideal for prototyping conversational interfaces over curated WARC collections (conferences, events, specific campaigns) where content is primarily textual HTML from the 2000s-2010s web era, and you need something working in a weekend hackathon. Skip if you need production-ready archival infrastructure, work with large-scale collections requiring incremental updates, have modern JavaScript-heavy websites in your WARCs, or lack the Python expertise to fork and extend when you inevitably hit its experimental limitations. For mission-critical discovery, invest in LangChain or LlamaIndex with custom WARC loaders instead—you'll write more code upfront but get production features like incremental indexing, proper document management, and active maintenance.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/automation/harvard-lil-warc-gpt.svg)](https://starlog.is/api/badge-click/automation/harvard-lil-warc-gpt)