
Building a $5/Month AI News Pipeline with Multi-Tier LLM Curation


Hook

Most news aggregation pipelines either cost hundreds per month in API fees or drown you in duplicates and low-quality content. OpenClaw Newsroom runs a five-source AI news operation with LLM editorial curation for less than the price of a latte.

Context

If you’ve ever tried to stay current with AI news, you know the problem: RSS feeds are noisy, Reddit discussions are scattered across subreddits, Twitter moves too fast to track manually, and GitHub trending surfaces interesting projects but lacks context. You need multiple sources, but aggregating them yourself means writing scrapers, handling rate limits, deduplicating stories that appear across platforms, and somehow filtering the signal from the noise.

The traditional solution is expensive third-party services or building a complex data pipeline with message queues, worker pools, and database clusters. OpenClaw Newsroom takes a different approach: a deliberately simple bash-orchestrated Python pipeline that treats source failures as expected, uses SQLite for persistence, and chains together increasingly capable (and expensive) LLMs only when needed. It’s designed to run as a cron job in the OpenClaw personal AI system, aggregating news every two hours with minimal infrastructure and maximum cost efficiency.

Technical Insight

System architecture (auto-generated diagram): five data sources (RSS feeds via the blogwatcher CLI, Reddit via its JSON API, Twitter/X via the bird CLI, GitHub Trending via a scraper, and Tavily Search via its API) feed a best-effort article fetcher that continues on failure. Raw article batches then pass through two-phase deduplication: within-scan fuzzy matching at an 80% similarity threshold, followed by a cross-scan hash check against the SQLite seen_articles table, which persists hashes across runs. Articles that clear both checks are emitted as the deduplicated set.

The architecture reveals several clever cost-optimization decisions. At the data ingestion layer, OpenClaw Newsroom pulls from five distinct sources using whatever tools are simplest: blogwatcher CLI for RSS feeds, direct Reddit JSON API calls, the bird CLI for Twitter/X, GitHub’s trending page, and Tavily for web search. The critical insight is the best-effort philosophy—if Twitter CLI isn’t installed or Reddit rate-limits you, the pipeline doesn’t fail. It continues with whatever data it successfully fetched.
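That best-effort loop can be sketched in a few lines (the fetcher names and article shape below are illustrative assumptions, not the repo's actual functions): each source is wrapped in a try/except, failures are logged, and the run continues with whatever succeeded.

```python
import logging

logger = logging.getLogger("newsroom")

def fetch_all(sources):
    """Call each source fetcher; a failing source is logged and skipped."""
    articles = []
    for name, fetcher in sources.items():
        try:
            batch = fetcher()
            articles.extend(batch)
            logger.info("%s: %d articles", name, len(batch))
        except Exception as exc:  # any source error is non-fatal
            logger.warning("%s failed, continuing: %s", name, exc)
    return articles

if __name__ == "__main__":
    def flaky():
        raise ConnectionError("rate limited")

    sources = {
        "rss": lambda: [{"title": "A"}, {"title": "B"}],
        "reddit": flaky,  # simulated outage
        "github": lambda: [{"title": "C"}],
    }
    print(len(fetch_all(sources)))  # 3: reddit skipped, others kept
```

The key design choice is that a source outage costs you that source's articles for one run, nothing more; the rest of the pipeline never sees the failure.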

Deduplication happens in two phases using SQLite. Within each scan run, article titles are compared using an 80% similarity threshold to catch near-duplicates from different sources. But the real power is cross-scan persistence: every article’s hash is stored, so if the same story resurfaces two hours later (common with trending topics), it’s automatically filtered out. This is implemented with a simple SQLite table that acts as a persistent seen-set, checked first by exact content hash and then by fuzzy title match:

import sqlite3
import hashlib
import time
from difflib import SequenceMatcher

class DedupStore:
    def __init__(self, db_path='memory/articles.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS seen_articles (
                content_hash TEXT PRIMARY KEY,
                title TEXT,
                url TEXT,
                first_seen INTEGER
            )
        ''')
        self.conn.commit()

    def is_duplicate(self, title, content, threshold=0.8):
        # Fast path: exact content hash already seen in a previous scan
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        cursor = self.conn.execute(
            'SELECT 1 FROM seen_articles WHERE content_hash = ?',
            (content_hash,)
        )
        if cursor.fetchone():
            return True

        # Slow path: fuzzy title match against the 100 most recent articles
        cursor = self.conn.execute(
            'SELECT title FROM seen_articles ORDER BY first_seen DESC LIMIT 100'
        )
        for (existing_title,) in cursor:
            similarity = SequenceMatcher(None, title.lower(), existing_title.lower()).ratio()
            if similarity >= threshold:
                return True

        return False

    def mark_seen(self, title, content, url):
        # Persist the hash so the same story is filtered on future scans
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        self.conn.execute(
            'INSERT OR IGNORE INTO seen_articles VALUES (?, ?, ?, ?)',
            (content_hash, title, url, int(time.time()))
        )
        self.conn.commit()

The three-tier LLM editorial chain is where cost optimization shines. Rather than sending every article to an expensive model, OpenClaw Newsroom implements a quality gate system. Gemini Flash Lite (the cheapest option) scores and filters initial articles. Only the top candidates proceed to Grok via OpenRouter for deeper analysis. Finally, Gemini Flash (more capable but pricier) performs final editorial selection and generates summaries. This cascading approach means you’re only paying premium rates for content that’s already proven its relevance:

def curate_articles(articles, profile_context):
    # gemini_lite_score, grok_rank_by_relevance, and gemini_flash_editorial
    # are thin wrappers around the respective model APIs (not shown here)

    # Tier 1: Gemini Flash Lite - cheap quality scoring
    scored = []
    for article in articles:
        score = gemini_lite_score(article['title'], article['summary'])
        if score > 7.0:  # Only promote high-quality candidates
            article['quality_score'] = score
            scored.append(article)

    # Tier 2: Grok via OpenRouter - contextual ranking
    if len(scored) > 10:
        ranked = grok_rank_by_relevance(scored, profile_context, top_k=10)
    else:
        ranked = scored

    # Tier 3: Gemini Flash - final editorial and summarization
    final_selections = []
    for article in ranked[:5]:
        editorial = gemini_flash_editorial(article, profile_context)
        if editorial['approved']:
            article['editorial_note'] = editorial['summary']
            final_selections.append(article)

    return final_selections

The editorial profile learning system is surprisingly sophisticated for such a lean codebase. When a human approves or rejects LLM-curated stories, those decisions are logged to SQLite with reasoning. Future curation runs query this history to build a profile context that informs the LLM’s judgment. Over time, the pipeline learns which story types and sources align with the user’s interests without requiring explicit rule configuration.
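A minimal sketch of that feedback loop (the table name, columns, and functions here are assumptions for illustration, not the project's actual schema): decisions are appended to SQLite, and recent history is rendered into a profile context string for the next curation prompt.

```python
import sqlite3

def init_feedback(conn):
    # One row per human approve/reject decision, with free-text reasoning
    conn.execute("""
        CREATE TABLE IF NOT EXISTS editorial_feedback (
            title TEXT,
            source TEXT,
            approved INTEGER,
            reasoning TEXT,
            decided_at INTEGER DEFAULT (strftime('%s', 'now'))
        )
    """)
    conn.commit()

def log_decision(conn, title, source, approved, reasoning):
    conn.execute(
        "INSERT INTO editorial_feedback (title, source, approved, reasoning) "
        "VALUES (?, ?, ?, ?)",
        (title, source, int(approved), reasoning),
    )
    conn.commit()

def build_profile_context(conn, limit=20):
    """Render recent decisions as plain text to prepend to the LLM prompt."""
    rows = conn.execute(
        "SELECT title, approved, reasoning FROM editorial_feedback "
        "ORDER BY decided_at DESC LIMIT ?", (limit,)
    ).fetchall()
    lines = ["Past editorial decisions (most recent first):"]
    for title, approved, reasoning in rows:
        verdict = "APPROVED" if approved else "REJECTED"
        lines.append(f"- {verdict}: {title} ({reasoning})")
    return "\n".join(lines)
```

Because the profile is just prompt context, no retraining or rules engine is involved; the LLM simply sees a running record of what the user kept and discarded.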

Orchestration happens via bash script rather than a Python workflow engine like Airflow or Prefect. While this seems primitive, it’s actually ideal for a cron-job-based system with no concurrent execution requirements. The bash script handles API key injection, sets up the memory directory structure, and logs outputs to markdown files. Python modules remain pure functions with no global state, making them trivially testable outside the OpenClaw context. The tradeoff is you lose sophisticated error handling and retry logic, but for a best-effort pipeline that runs every two hours, immediate failure recovery isn’t critical.
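The orchestration pattern might look roughly like this (a hypothetical sketch: the module names, paths, and log layout are assumptions, not taken from the repo): each stage is invoked through a helper that records success or failure to a markdown log and never aborts the run.

```shell
#!/usr/bin/env bash
# Hypothetical cron entrypoint sketch. Note: no `set -e`, because
# individual source failures are expected and must not kill the run.
set -u

MEMORY_DIR="${MEMORY_DIR:-memory}"
LOG_FILE="$MEMORY_DIR/logs/$(date +%Y-%m-%d-%H%M).md"
mkdir -p "$MEMORY_DIR/logs"

run_stage() {
  local name="$1"; shift
  echo "## $name" >> "$LOG_FILE"
  if "$@" >> "$LOG_FILE" 2>&1; then
    echo "- status: ok" >> "$LOG_FILE"
  else
    echo "- status: failed (continuing)" >> "$LOG_FILE"
  fi
}

# Each stage is a plain Python module with no global state
run_stage "RSS"    python3 -m pipeline.fetch_rss
run_stage "Reddit" python3 -m pipeline.fetch_reddit
run_stage "Curate" python3 -m pipeline.curate
```

The markdown log doubles as the audit trail: each run leaves one file per invocation, with a per-stage status line instead of structured logs.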

Gotcha

The tight coupling to OpenClaw’s directory structure and cron system is a real adoption barrier. The pipeline expects memory/ directories, specific environment variable names, and markdown file conventions that make sense in OpenClaw but are opaque to standalone users. You can’t just git clone and run—you need to understand OpenClaw’s architecture or spend time adapting paths and orchestration logic.

API dependency fragility is another concern. The pipeline requires valid API keys for Gemini, GitHub, Tavily, and OpenRouter, plus installed CLI tools for blogwatcher and bird. If any service changes its rate limits, pricing, or authentication flow, parts of your pipeline silently degrade. The best-effort philosophy means you won’t get hard errors, but you might not notice Twitter stopped working for three days until you check logs. For production use cases requiring audit trails and SLAs, this silent degradation is unacceptable. The bash orchestration also makes debugging multi-source failures tedious—you’re grepping through markdown logs rather than using structured logging with correlation IDs.

Verdict

Use if: You’re already running OpenClaw and want automated AI news aggregation with minimal maintenance overhead, you need cost-effective LLM curation (the $5/month budget is real), you value simplicity over feature richness, or you’re building a personal or small-team news digest where ‘good enough’ every two hours beats perfect real-time coverage. The stdlib-only Python and best-effort source model make this surprisingly robust for hobby projects.

Skip if: You’re not using OpenClaw (the adaptation tax is high), you need sub-hourly latency or enterprise reliability guarantees, you want extensive customization beyond the five built-in sources, or you require detailed observability and error tracking. For production news services, or any system where missing stories has business impact, the silent failure modes and bash orchestration are dealbreakers. This is a personal productivity tool, not infrastructure.
