
Building a GitHub Trending Bot: What 115 Stars Taught Us About Scraping, Storage, and Social Growth


Hook

In 2015, a single Go bot began tweeting every trending GitHub repository. Eight years later, it's still running—but the way it handles Twitter's character limits and GitHub's DOM structure reveals why web scraping bots are both powerful and precarious.

Context

Before GitHub's Explore page became algorithmically curated and before every developer tool had a newsletter, discovering trending open-source projects meant manually checking GitHub's trending page or relying on aggregators like Hacker News. The problem? Timing. A repository could trend for six hours while you were asleep, and you'd miss it entirely.

Andreas Grunwald built TrendingGithub to solve this discovery problem by creating a persistent watcher—a bot that scrapes GitHub's trending pages every 30 minutes and broadcasts finds to Twitter. But the real engineering challenge wasn't the scraping itself; it was preventing duplicate tweets, maximizing information density within character limits, and growing an audience without triggering Twitter's anti-spam mechanisms. The project became a case study in building resilient scrapers with proper observability.

Technical Insight

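[Figure: system architecture (auto-generated). A scheduler triggers the scraper every 30 minutes. The scraper reads GitHub Trending and passes repository metadata to the blacklist storage (Redis or in-memory), which checks the 30-day TTL. Repositories that are not blacklisted go to the tweet composer and optimizer, which sends the optimized tweet to the Twitter API; on success the repository is added to the blacklist with a 30-day TTL. A separate follow bot (the growth hack) takes the followers list from the Twitter API and issues random follows.]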

TrendingGithub's architecture centers on three interconnected systems: a scraping engine, a storage-backed blacklist, and a tweet composition optimizer. Let's examine each.

The scraping layer fetches GitHub's trending HTML and parses repository metadata without an official API. The bot uses Go's net/http package to retrieve trending pages, then extracts repository names, descriptions, and URLs through DOM traversal. This approach is brittle by design—any HTML structure change breaks parsing—but it's also the only option since GitHub doesn't expose trending data through its REST or GraphQL APIs. The codebase accepts this fragility and mitigates it through error handling and monitoring rather than trying to make scraping bulletproof.
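To make the "accept fragility, detect it early" idea concrete, here is a minimal sketch in Go. It is not the project's parser: the `Repo` type, the `ParseTrending` function, the `ErrNoRepos` sentinel, and the assumed markup shape (a repository link inside an `<h2>`) are all hypothetical, and a real scraper would more likely walk the DOM with an HTML parser than use a regular expression. The point it illustrates is that an empty result is treated as an error, so a markup change surfaces in monitoring instead of passing silently.

```go
package main

import (
	"errors"
	"regexp"
)

// Repo is a hypothetical minimal record for one trending entry.
type Repo struct {
	Owner, Name string
}

// repoLink matches a link of the form href="/owner/name" that directly
// follows an opening <h2> tag. This markup shape is an assumption made
// for illustration, not GitHub's guaranteed structure.
var repoLink = regexp.MustCompile(`<h2[^>]*>\s*<a[^>]*href="/([\w.-]+)/([\w.-]+)"`)

// ErrNoRepos signals that parsing found nothing, which most likely means
// the page structure changed and the pattern needs updating.
var ErrNoRepos = errors.New("no repositories found; markup may have changed")

// ParseTrending extracts owner/name pairs from a trending page body.
func ParseTrending(html string) ([]Repo, error) {
	matches := repoLink.FindAllStringSubmatch(html, -1)
	if len(matches) == 0 {
		return nil, ErrNoRepos
	}
	repos := make([]Repo, 0, len(matches))
	for _, m := range matches {
		repos = append(repos, Repo{Owner: m[1], Name: m[2]})
	}
	return repos, nil
}
```

A caller would fetch the page body with net/http, pass it to `ParseTrending`, and count every `ErrNoRepos` as a scrape failure worth alerting on.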

The deduplication system is where things get architecturally interesting. TrendingGithub keeps a blacklist in Redis (or an in-memory map for development) in which every entry expires after 30 days. When the bot successfully tweets a repository, it adds that repo to the blacklist with a 30-day TTL. The bot therefore never ignores a repository permanently: if a project trends again after its entry has expired, it is eligible for tweeting again, so only projects with recurring popularity show up more than once:

// Simplified blacklist check logic
func (s *Storage) IsBlacklisted(repo string) bool {
    key := fmt.Sprintf("blacklist:%s", repo)
    exists, err := s.redis.Exists(key).Result()
    if err != nil {
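        // Fail open: if Redis is unreachable, treat the repo as new. The bot
        // keeps tweeting, at the cost of possible duplicates during an outage.
        return false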
    }
    return exists > 0
}

func (s *Storage) AddToBlacklist(repo string) error {
    key := fmt.Sprintf("blacklist:%s", repo)
    return s.redis.Set(key, "1", 30*24*time.Hour).Err()
}

The storage interface abstraction means you can swap Redis for DynamoDB, PostgreSQL, or even a flat file without touching the core bot logic. This separation anticipates infrastructure changes—crucial for side projects that might move between hosting providers.
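As a rough illustration of that abstraction, the sketch below defines a storage interface and an in-memory implementation that mimics a Redis key TTL. The names `Storage`, `MemoryStorage`, `NewMemoryStorage`, and `BlacklistTTL` are assumptions chosen to line up with the snippet above; the project's real interface may look different.

```go
package main

import (
	"sync"
	"time"
)

// BlacklistTTL is the 30-day window described in the article.
const BlacklistTTL = 30 * 24 * time.Hour

// Storage is a hypothetical interface the bot logic could depend on,
// so that Redis, a database, or a map can sit behind it.
type Storage interface {
	IsBlacklisted(repo string) bool
	AddToBlacklist(repo string) error
}

// MemoryStorage stores an expiry time per repository in a map.
type MemoryStorage struct {
	mu      sync.Mutex
	expires map[string]time.Time
	ttl     time.Duration
}

// NewMemoryStorage returns an empty store whose entries live for ttl.
func NewMemoryStorage(ttl time.Duration) *MemoryStorage {
	return &MemoryStorage{expires: make(map[string]time.Time), ttl: ttl}
}

// IsBlacklisted reports whether repo has an entry that has not expired yet.
func (m *MemoryStorage) IsBlacklisted(repo string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	exp, ok := m.expires[repo]
	if !ok {
		return false
	}
	if !time.Now().Before(exp) {
		delete(m.expires, repo) // expired: the repo is eligible again
		return false
	}
	return true
}

// AddToBlacklist records repo with an expiry of now plus the TTL.
func (m *MemoryStorage) AddToBlacklist(repo string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.expires[repo] = time.Now().Add(m.ttl)
	return nil
}

// Compile-time check that MemoryStorage satisfies Storage.
var _ Storage = (*MemoryStorage)(nil)
```

Unlike Redis, this map only drops an expired entry when that entry is looked up, which is fine for development but would slowly leak memory in a long-running process that never re-checks old repositories.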

The tweet composition engine reveals careful attention to Twitter's constraints. In 2015, the character limit was 140, and Twitter wrapped every link in a t.co short URL that counted as a fixed number of characters regardless of the original URL's length. That number could change over time, so TrendingGithub's composer queries Twitter's help/configuration endpoint every 24 hours to get the current short_url_length value, then constructs tweets that maximize information density:

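// Tweet composition with dynamic URL length calculation.
// Needs the "fmt" and "unicode/utf8" imports. Lengths are counted in runes,
// a close approximation of how Twitter counts characters, so multi-byte
// text is never cut in the middle of a character.
func ComposeTweet(repo Repository, shortURLLength int) string {
    const tweetLimit = 140 // Twitter's limit when the bot was written

    // Format: "user/repo: description https://github.com/user/repo"
    baseFormat := "%s: %s %s"
    repoName := fmt.Sprintf("%s/%s", repo.Owner, repo.Name)
    url := fmt.Sprintf("https://github.com/%s", repoName)

    // Space left for the description. The URL always costs shortURLLength
    // characters because Twitter rewrites it to a t.co link.
    overhead := utf8.RuneCountInString(repoName) + len(": ") + len(" ") + shortURLLength
    maxDescLength := tweetLimit - overhead

    description := []rune(repo.Description)
    if len(description) > maxDescLength {
        if maxDescLength > 3 {
            description = append(description[:maxDescLength-3], []rune("...")...)
        } else {
            description = nil // no room left: drop the description
        }
    }

    return fmt.Sprintf(baseFormat, repoName, string(description), url)
}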

This dynamic calculation means the bot adapts to Twitter's infrastructure changes automatically. When Twitter updated t.co's URL length, the bot adjusted tweet formats without code changes—just by refreshing the configuration.
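One way to picture the refresh logic is a small cache that re-fetches the value once it is older than a maximum age. The sketch below is an assumption about how such a cache could look, not the project's code: `URLLengthCache`, `NewURLLengthCache`, and the injected `fetch` function (standing in for the call to Twitter's configuration endpoint) are hypothetical names, and the refresh here happens lazily on read rather than on a timer.

```go
package main

import (
	"sync"
	"time"
)

// URLLengthCache holds the last known short URL length and refreshes it
// through fetch once the cached value is older than maxAge.
type URLLengthCache struct {
	mu        sync.Mutex
	fetch     func() (int, error) // hypothetical stand-in for the API call
	maxAge    time.Duration
	value     int
	fetchedAt time.Time
}

// NewURLLengthCache starts with a fallback value that is served until the
// first successful fetch.
func NewURLLengthCache(fallback int, maxAge time.Duration, fetch func() (int, error)) *URLLengthCache {
	return &URLLengthCache{fetch: fetch, maxAge: maxAge, value: fallback}
}

// Get returns the cached length, refreshing it first when it is stale.
// If the fetch fails, the last known value keeps being served.
func (c *URLLengthCache) Get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.fetchedAt.IsZero() || time.Since(c.fetchedAt) >= c.maxAge {
		if v, err := c.fetch(); err == nil {
			c.value = v
			c.fetchedAt = time.Now()
		}
	}
	return c.value
}
```

With `maxAge` set to `24 * time.Hour`, the composer could call `Get()` before building each tweet and would hit the network at most once a day under normal conditions.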

The growth strategy implements a subtle social engineering pattern: following friends-of-followers rather than aggressive follow-backs. The bot periodically selects random followers, fetches their friends (accounts they follow), and follows a subset. This creates organic discovery through social graphs instead of appearing as spam. It's the digital equivalent of joining conversations your friends are already in, rather than cold-calling strangers.
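The selection step described above can be sketched as follows. Everything here is hypothetical: `SocialGraph` is a bundle of function fields standing in for whatever Twitter client the bot uses, and `intn` is an injected random source (in production you might pass `rand.Intn` from math/rand) so that the choice can be made deterministic in tests. The sketch ignores details a real bot would need, such as skipping accounts it already follows and respecting rate limits.

```go
package main

// SocialGraph bundles the three operations the strategy needs. The fields
// are hypothetical and do not mirror any real Twitter client library.
type SocialGraph struct {
	Followers func() ([]string, error)            // accounts following the bot
	Friends   func(user string) ([]string, error) // accounts that user follows
	Follow    func(user string) error
}

// FollowFriendsOfFollower picks one random follower and follows up to limit
// of that follower's friends, chosen at random. It returns the accounts it
// followed. intn(n) must return a value in [0, n).
func FollowFriendsOfFollower(g SocialGraph, intn func(n int) int, limit int) ([]string, error) {
	followers, err := g.Followers()
	if err != nil || len(followers) == 0 || limit <= 0 {
		return nil, err
	}
	seed := followers[intn(len(followers))]
	friends, err := g.Friends(seed)
	if err != nil {
		return nil, err
	}
	// Work on a copy so the slice returned by Friends is left untouched.
	candidates := append([]string(nil), friends...)
	if limit > len(candidates) {
		limit = len(candidates)
	}
	// Partial Fisher-Yates shuffle: the first limit slots end up holding a
	// random sample of the candidates.
	for i := 0; i < limit; i++ {
		j := i + intn(len(candidates)-i)
		candidates[i], candidates[j] = candidates[j], candidates[i]
	}
	followed := make([]string, 0, limit)
	for _, user := range candidates[:limit] {
		if err := g.Follow(user); err != nil {
			return followed, err
		}
		followed = append(followed, user)
	}
	return followed, nil
}
```

Keeping `limit` small per run is what separates this from bulk following: each pass touches only a handful of accounts drawn from one follower's neighborhood.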

Operational visibility comes through Go's expvar package, which publishes runtime metrics as JSON over HTTP (at /debug/vars by default); the bot serves them on a dedicated TCP port. You can curl the metrics endpoint and see tweet counts, blacklist sizes, error rates, and memory stats, which is enough telemetry to diagnose issues in production without adding heavyweight monitoring dependencies.
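For readers who have not used expvar, here is a minimal sketch of the pattern. The metric names (`tweets_sent`, `scrape_failures`) and the helper functions are hypothetical and may not match what the bot actually publishes; the expvar and net/http calls themselves are standard library.

```go
package main

import (
	"expvar"
	"net/http"
)

// Hypothetical counters; the real bot may publish different names.
var (
	tweetsSent  = expvar.NewInt("tweets_sent")
	scrapeFails = expvar.NewInt("scrape_failures")
)

// RecordTweet increments the counter of successfully sent tweets.
func RecordTweet() { tweetsSent.Add(1) }

// RecordScrapeFailure increments the counter of failed scrape attempts.
func RecordScrapeFailure() { scrapeFails.Add(1) }

// ServeMetrics blocks while serving the metrics on addr. Importing expvar
// registers the /debug/vars handler on http.DefaultServeMux, which is the
// mux ListenAndServe uses when the handler argument is nil.
func ServeMetrics(addr string) error {
	return http.ListenAndServe(addr, nil)
}
```

Running `go ServeMetrics("localhost:8123")` at startup (the port is an arbitrary example) and then requesting `http://localhost:8123/debug/vars` returns a JSON document containing these counters alongside the `cmdline` and `memstats` variables that expvar publishes by default.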

Gotcha

The fundamental limitation is architectural: web scraping is a hack that breaks. GitHub can restructure their HTML tomorrow, and TrendingGithub stops working until someone updates the parsing logic. There's no SLA, no deprecation notice, and no backward compatibility guarantee. Worse, aggressive scraping could violate GitHub's Terms of Service, potentially leading to IP blocks or account restrictions. The repository README doesn't address the legal gray area of automated scraping at scale.

The Twitter integration is also frozen in time. The codebase targets Twitter's 140-character limit and likely uses API v1.1, but Twitter deprecated v1.1 in favor of v2 and expanded character limits to 280. Running this bot today requires either forking and updating the Twitter client libraries or accepting that your tweets won't use modern features like thread composition, polls, or media attachments. The @TrendingGithub account itself appears inactive or minimally maintained, suggesting the original author moved on without handing off active development.

Verdict

Use if: You want to learn how to build resilient scrapers with proper storage abstraction, you're studying bot growth strategies that work within platform policies, or you need a reference implementation for Go-based social media automation with observability built in. It's also useful if you're prototyping custom trending content aggregation and need a starting architecture.

Skip if: You need a production-ready solution you can deploy without significant maintenance. The scraping brittleness and outdated Twitter API dependencies make this a research project rather than turn-key infrastructure. Also skip if you're simply trying to follow GitHub trends; GitHub's Watch feature only tracks repositories you already know about, so check GitHub's own Trending page, subscribe to newsletters like Console.dev or Changelog Nightly, or follow the existing bot accounts that are actively maintained.
