WARC-GPT: Building a RAG Pipeline for Web Archive Forensics
Hook
What if you could ask natural-language questions about terabytes of archived web pages? Harvard Law’s Library Innovation Lab built a tool that treats WARC files—the standard format for web archiving—as a queryable knowledge base using retrieval-augmented generation (RAG).
Context
Web archives are digital time capsules. Libraries, researchers, and legal teams use the WARC (Web ARChive) format to preserve websites for posterity—court evidence, historical records, disappeared journalism. But searching these archives traditionally means keyword matching against indexed metadata, not understanding context or answering questions like “What did this company’s privacy policy say about data retention in 2019?”
WARC-GPT emerged from Harvard Law School’s Library Innovation Lab to solve this archival access problem. General-purpose RAG frameworks treat documents as isolated files, but web archives come with unique constraints: they bundle HTML, PDFs, images, and metadata into container files; they record when and where content was captured; and they’re used in contexts where provenance matters—legal discovery, academic research, journalism. The tool bridges the gap between modern LLM capabilities and the decidedly old-school world of digital preservation.
Technical Insight
WARC-GPT’s architecture follows a classic three-stage RAG pattern with archival-specific adaptations. The ingestion phase extracts text from text/html and application/pdf response records in WARC files, generates embeddings using configurable models (OpenAI’s or local embedding models), and stores them in a local ChromaDB vector database. Crucially, it preserves WARC metadata alongside embeddings—record IDs, capture dates, target URIs—so responses maintain forensic traceability.
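In sketch form, that filter-and-tag step looks something like the following. The dict-based record shape here is purely illustrative—WARC-GPT parses real WARC response records and writes embeddings to ChromaDB—but it shows how provenance fields travel with the extracted text:

```python
# Content types WARC-GPT indexes (per its documentation).
INDEXABLE_TYPES = {"text/html", "application/pdf"}

def extract_indexable(records):
    """Keep the response records WARC-GPT would index, carrying along
    the provenance metadata stored beside each embedding."""
    docs = []
    for rec in records:
        if rec["type"] == "response" and rec["content_type"] in INDEXABLE_TYPES:
            docs.append({
                "text": rec["payload"],
                "warc_record_id": rec["id"],
                "warc_record_date": rec["date"],
                "warc_record_target_uri": rec["uri"],
            })
    return docs
```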
Ingestion is a one-command operation. Drop WARC files in the ./warc directory and run:
poetry run flask ingest
Under the hood, the system chunks text based on the embedding model’s context window, meaning a single web page might generate multiple embedding vectors. Each chunk retains its parent WARC record metadata, creating a bidirectional link between semantic search results and archival provenance.
The query phase performs semantic search against ChromaDB using the user’s question, retrieves relevant text excerpts with their metadata, then injects them as context into an LLM prompt. The /api/search endpoint exposes this retrieval layer independently:
curl -X POST http://localhost:5000/api/search \
-H "Content-Type: application/json" \
-d '{"message": "privacy policy data retention"}'
This returns a JSON array where each result includes warc_record_id, warc_record_date, warc_record_target_uri, and warc_record_text—everything needed to cite sources in a legal brief or research paper.
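Turning one of those results into a citable reference is then trivial. The helper below is our own illustration, not part of WARC-GPT:

```python
def format_citation(result):
    """Render an /api/search result as a source citation using the
    provenance fields WARC-GPT returns with each excerpt."""
    return (f'{result["warc_record_target_uri"]} '
            f'(captured {result["warc_record_date"]}, '
            f'WARC record {result["warc_record_id"]})')
```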
The LLM integration layer supports three modes: OpenAI’s API, local Ollama instances, or OpenAI-compatible endpoints (HuggingFace TGI, vLLM). This flexibility matters for privacy-sensitive archives—law firms or governments might need inference to stay on-premises. The system prompt is configurable via environment variables, allowing customization for specific domains.
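A deployment might select a backend along these lines—note that the variable names below are illustrative, not verified against the project; the actual keys live in WARC-GPT’s own environment template:

```
# Illustrative only — check WARC-GPT's .env template for the real key names.
OLLAMA_API_URL="http://localhost:11434"   # keep inference on-premises
SYSTEM_PROMPT="You answer questions about a web archive and cite WARC records."
```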
The web UI automatically manages chat history, enabling multi-turn conversations where the LLM can reference previous exchanges. This supports chain-of-thought reasoning: “Based on the 2019 policy you just showed me, did they change this clause in 2020?”
One underrated feature is the t-SNE visualization (flask visualize), which projects embeddings into 2D space. For archivists, this reveals semantic clusters—all pages about a particular scandal, or shifts in language over time—that metadata alone wouldn’t surface.
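Conceptually, that visualization boils down to a dimensionality-reduction pass over the stored vectors. A sketch using scikit-learn—which may or may not match the project’s actual implementation:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_to_2d(embeddings, perplexity=5):
    """Project high-dimensional embedding vectors down to 2D points
    suitable for a scatter plot of semantic clusters.
    Note: perplexity must be smaller than the number of samples."""
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=0).fit_transform(np.asarray(embeddings))
```

Coloring the resulting points by capture date (available in each chunk’s metadata) is what makes temporal drift in a collection visible.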
Gotcha
The big limitation is ingestion: running flask ingest clears the entire ./chromadb folder. You can’t incrementally add WARCs to an existing knowledge base. For large archives (terabytes), this means reprocessing everything when you add new captures. The README explicitly notes this destructive behavior without offering workarounds.
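Until that changes, a crude mitigation—our suggestion, not something the project documents—is to snapshot the vector store before each destructive re-ingest; the directory creation below only exists to make the sketch self-contained:

```python
import datetime
import pathlib
import shutil

def snapshot_store(store="./chromadb"):
    """Copy the ChromaDB folder aside before a destructive re-ingest,
    so the previous knowledge base can be restored by renaming it back."""
    src = pathlib.Path(store)
    src.mkdir(exist_ok=True)  # stand-in for an existing store in this sketch
    dst = src.with_name(f"{src.name}.bak-{datetime.date.today():%Y%m%d}")
    shutil.copytree(src, dst)
    return dst
```

Run this before poetry run flask ingest; restoring is just moving the dated backup back into place.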
It’s also experimental software with a capital E. Harvard LIL includes a disclaimer that the project may be “sunsetted or significantly pivoted” without notice. There’s no versioning strategy, no production deployment guide, no performance benchmarks. The Flask development server isn’t suitable for multi-user load, and there’s no authentication layer—anyone with network access can query your archive. Content type support is limited to HTML and PDF; no extraction from images, video, or structured data formats like JSON-LD that modern web pages embed. If your WARC contains Twitter JSON archives or YouTube metadata, WARC-GPT ignores it.
Verdict
Use WARC-GPT if you’re working with web archive collections where provenance tracking matters—legal e-discovery, academic digital humanities, investigative journalism, or institutional memory projects. The ability to run inference locally with Ollama makes it viable for privacy-sensitive archives where cloud APIs are non-starters. The metadata preservation is genuinely differentiated from generic RAG tools that treat documents as context-free text blobs. Skip it if you need production reliability or incremental ingestion for large-scale archives, or if you’re working with standard document formats, where LangChain or LlamaIndex would give you better ecosystems and fewer sharp edges. This is a research prototype that prioritizes archival workflows over polish—valuable in its niche, but know you’re adopting experimental code.