Building a $5/Month AI News Pipeline with Multi-Tier LLM Curation
Hook
Most news aggregation pipelines either cost hundreds per month in API fees or drown you in duplicates and low-quality content. OpenClaw Newsroom runs a five-source AI news operation with LLM editorial curation for less than the price of a latte.
Context
If you’ve ever tried to stay current with AI news, you know the problem: RSS feeds are noisy, Reddit discussions are scattered across subreddits, Twitter moves too fast to track manually, and GitHub trending surfaces interesting projects but lacks context. You need multiple sources, but aggregating them yourself means writing scrapers, handling rate limits, deduplicating stories that appear across platforms, and somehow filtering the signal from the noise.
The traditional solution is expensive third-party services or building a complex data pipeline with message queues, worker pools, and database clusters. OpenClaw Newsroom takes a different approach: a deliberately simple bash-orchestrated Python pipeline that treats source failures as expected, uses SQLite for persistence, and chains together increasingly capable (and expensive) LLMs only when needed. It’s designed to run as a cron job in the OpenClaw personal AI system, aggregating news every two hours with minimal infrastructure and maximum cost efficiency.
Technical Insight
The architecture reveals several clever cost-optimization decisions. At the data ingestion layer, OpenClaw Newsroom pulls from five distinct sources using whatever tools are simplest: blogwatcher CLI for RSS feeds, direct Reddit JSON API calls, the bird CLI for Twitter/X, GitHub’s trending page, and Tavily for web search. The critical insight is the best-effort philosophy—if Twitter CLI isn’t installed or Reddit rate-limits you, the pipeline doesn’t fail. It continues with whatever data it successfully fetched.
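The best-effort pattern is simple to sketch: wrap each fetcher so a failure yields an empty list instead of an exception. This is a minimal illustration, not the project's actual code; the function names are invented stand-ins for the real source adapters:

```python
def best_effort(fetcher, source_name):
    # Run one source fetcher; on any failure, warn and return nothing.
    try:
        return fetcher()
    except Exception as exc:
        print(f'[warn] {source_name} failed, skipping: {exc}')
        return []

def gather_articles(fetchers):
    # fetchers: mapping of source name -> zero-arg callable returning a list of articles.
    articles = []
    for name, fetch in fetchers.items():
        articles.extend(best_effort(fetch, name))
    return articles
```

A rate-limited Reddit call or a missing `bird` binary then costs you one source for one run, not the whole scan.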
Deduplication happens in two phases using SQLite. Within each scan run, articles are compared using an 80% title-similarity threshold to catch near-duplicates from different sources. But the real power is cross-scan persistence: every article’s hash is stored, so if the same story resurfaces two hours later (common with trending topics), it’s automatically filtered out. This is implemented with a simple SQLite table that acts as a persistent seen-set (an exact-match lookup, so unlike a true bloom filter it never produces false positives):
```python
import hashlib
import sqlite3
import time
from difflib import SequenceMatcher

class DedupStore:
    def __init__(self, db_path='memory/articles.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS seen_articles (
                content_hash TEXT PRIMARY KEY,
                title TEXT,
                url TEXT,
                first_seen INTEGER
            )
        ''')

    def is_duplicate(self, title, content, threshold=0.8):
        # Fast exact check first: hash the content and look it up.
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        cursor = self.conn.execute(
            'SELECT 1 FROM seen_articles WHERE content_hash = ?',
            (content_hash,)
        )
        if cursor.fetchone():
            return True
        # Fuzzy title match against the 100 most recent articles.
        cursor = self.conn.execute(
            'SELECT title FROM seen_articles ORDER BY first_seen DESC LIMIT 100'
        )
        for (existing_title,) in cursor:
            similarity = SequenceMatcher(None, title.lower(), existing_title.lower()).ratio()
            if similarity >= threshold:
                return True
        return False

    def mark_seen(self, title, content, url):
        # Record the article so later scans filter the same story out.
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        self.conn.execute(
            'INSERT OR IGNORE INTO seen_articles VALUES (?, ?, ?, ?)',
            (content_hash, title, url, int(time.time()))
        )
        self.conn.commit()
```
The three-tier LLM editorial chain is where cost optimization shines. Rather than sending every article to an expensive model, OpenClaw Newsroom implements a quality gate system. Gemini Flash Lite (the cheapest option) scores and filters initial articles. Only the top candidates proceed to Grok via OpenRouter for deeper analysis. Finally, Gemini Flash (more capable but pricier) performs final editorial selection and generates summaries. This cascading approach means you’re only paying premium rates for content that’s already proven its relevance:
```python
def curate_articles(articles, profile_context):
    # Tier 1: Gemini Flash Lite - cheap quality scoring
    scored = []
    for article in articles:
        score = gemini_lite_score(article['title'], article['summary'])
        if score > 7.0:  # Only promote high-quality candidates
            article['quality_score'] = score
            scored.append(article)

    # Tier 2: Grok via OpenRouter - contextual ranking
    if len(scored) > 10:
        ranked = grok_rank_by_relevance(scored, profile_context, top_k=10)
    else:
        ranked = scored

    # Tier 3: Gemini Flash - final editorial and summarization
    final_selections = []
    for article in ranked[:5]:
        editorial = gemini_flash_editorial(article, profile_context)
        if editorial['approved']:
            article['editorial_note'] = editorial['summary']
            final_selections.append(article)
    return final_selections
```
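To see how the cascade limits expensive calls, here is a self-contained toy run with stubbed model calls and call counters. Everything here (the stub heuristics, the `calls` counter, the simplified `curate` function) is invented for illustration and not part of the project:

```python
calls = {'lite': 0, 'grok': 0, 'flash': 0}

def gemini_lite_score(title, summary):
    calls['lite'] += 1
    return 9.0 if 'AI' in title else 3.0  # toy heuristic standing in for the model

def grok_rank_by_relevance(articles, profile_context, top_k):
    calls['grok'] += 1  # one batched ranking call, not one per article
    return articles[:top_k]

def gemini_flash_editorial(article, profile_context):
    calls['flash'] += 1
    return {'approved': True, 'summary': f"Why it matters: {article['title']}"}

def curate(articles, profile_context=''):
    scored = [a for a in articles if gemini_lite_score(a['title'], a['summary']) > 7.0]
    ranked = grok_rank_by_relevance(scored, profile_context, top_k=10) if len(scored) > 10 else scored
    return [a for a in ranked[:5] if gemini_flash_editorial(a, profile_context)['approved']]

articles = [{'title': f'AI story {i}', 'summary': ''} for i in range(12)] \
         + [{'title': f'offtopic {i}', 'summary': ''} for i in range(8)]
picks = curate(articles)
print(calls)  # every article hits the cheap tier; only a handful reach the pricey ones
```

With 20 input articles, the cheap tier runs 20 times, the mid tier once (batched), and the expensive tier only 5 times, which is the whole cost story in miniature.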
The editorial profile learning system is surprisingly sophisticated for such a lean codebase. When a human approves or rejects LLM-curated stories, those decisions are logged to SQLite with reasoning. Future curation runs query this history to build a profile context that informs the LLM’s judgment. Over time, the pipeline learns which story types and sources align with the user’s interests without requiring explicit rule configuration.
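A minimal sketch of how that feedback loop could work, assuming a `curation_feedback` table and helpers named `log_decision` and `build_profile_context` (all three names are illustrative, not from the project):

```python
import sqlite3

def log_decision(conn, title, approved, reasoning):
    # Persist each human approve/reject along with its stated reasoning.
    conn.execute('''
        CREATE TABLE IF NOT EXISTS curation_feedback (
            title TEXT, approved INTEGER, reasoning TEXT
        )
    ''')
    conn.execute(
        'INSERT INTO curation_feedback VALUES (?, ?, ?)',
        (title, int(approved), reasoning)
    )
    conn.commit()

def build_profile_context(conn, limit=20):
    # Fold the most recent decisions into a plain-text block for the LLM prompt.
    rows = conn.execute(
        'SELECT title, approved, reasoning FROM curation_feedback '
        'ORDER BY rowid DESC LIMIT ?',
        (limit,)
    ).fetchall()
    lines = []
    for title, approved, reasoning in rows:
        verdict = 'APPROVED' if approved else 'REJECTED'
        lines.append(f'{verdict}: {title} -- {reasoning}')
    return '\n'.join(lines)
```

Feeding that text block into the tier-2 and tier-3 prompts is enough to bias future curation toward past approvals without any explicit rules.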
Orchestration happens via bash script rather than a Python workflow engine like Airflow or Prefect. While this seems primitive, it’s actually ideal for a cron-job-based system with no concurrent execution requirements. The bash script handles API key injection, sets up the memory directory structure, and logs outputs to markdown files. Python modules remain pure functions with no global state, making them trivially testable outside the OpenClaw context. The tradeoff is you lose sophisticated error handling and retry logic, but for a best-effort pipeline that runs every two hours, immediate failure recovery isn’t critical.
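A stripped-down sketch of that orchestration pattern, where the module names, paths, and environment variables are illustrative rather than OpenClaw's actual conventions:

```shell
#!/usr/bin/env bash
# Minimal orchestration sketch: inject keys, prepare state dirs, log to markdown.
set -u  # catch unset variables, but let individual sources fail softly

export GEMINI_API_KEY="${GEMINI_API_KEY:-}"   # injected from the host environment
RUN_DIR="memory/runs/$(date +%Y%m%d-%H%M)"
LOG="$RUN_DIR/scan.md"
mkdir -p "$RUN_DIR"

echo "# News scan $(date -u)" > "$LOG"
for source in rss reddit twitter github tavily; do
    # '|| true' is the best-effort philosophy: one dead source never kills the run.
    python -m newsroom.fetch "$source" >> "$LOG" 2>&1 || true
done
python -m newsroom.curate "$RUN_DIR" >> "$LOG" 2>&1 || true
```

Cron then only needs one line pointing at this script every two hours; there is no scheduler state to manage beyond the files it writes.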
Gotcha
The tight coupling to OpenClaw’s directory structure and cron system is a real adoption barrier. The pipeline expects memory/ directories, specific environment variable names, and markdown file conventions that make sense in OpenClaw but are opaque to standalone users. You can’t just git clone and run—you need to understand OpenClaw’s architecture or spend time adapting paths and orchestration logic.
API dependency fragility is another concern. The pipeline requires valid API keys for Gemini, GitHub, Tavily, and OpenRouter, plus installed CLI tools for blogwatcher and bird. If any service changes its rate limits, pricing, or authentication flow, parts of your pipeline silently degrade. The best-effort philosophy means you won’t get hard errors, but you might not notice Twitter stopped working for three days until you check logs. For production use cases requiring audit trails and SLAs, this silent degradation is unacceptable. The bash orchestration also makes debugging multi-source failures tedious—you’re grepping through markdown logs rather than using structured logging with correlation IDs.
Verdict
Use if: You’re already running OpenClaw and want automated AI news aggregation with minimal maintenance overhead, you need cost-effective LLM curation (the $5/month budget is real), you value simplicity over feature richness, or you’re building a personal/small-team news digest where ‘good enough’ every 2 hours beats perfect real-time coverage. The stdlib-only Python and best-effort source model make this surprisingly robust for hobby projects. Skip if: You’re not using OpenClaw (the adaptation tax is high), you need sub-hourly latency or enterprise reliability guarantees, you want extensive customization beyond the five built-in sources, or you require detailed observability and error tracking. For production news services or systems where missing stories has business impact, the silent failure modes and bash orchestration are dealbreakers. This is a personal productivity tool, not infrastructure.