How They SRE: Mining 100+ Company Engineering Blogs So You Don't Have To
Hook
Netflix's chaos engineering, Google's error budgets, and Airbnb's incident response frameworks are all documented publicly—but scattered across hundreds of blog posts and conference talks that most engineers will never find. One repository changed that.
Context
Site Reliability Engineering emerged from Google in the early 2000s, but for years the practices remained opaque to outsiders. As SRE became an industry standard, companies began publishing their approaches through engineering blogs, conference talks, and post-mortems. The problem? This knowledge became badly fragmented. Want to understand how different companies handle on-call rotations? You'd need to know which companies even write about it, find their engineering blogs, search through years of archives, and repeat this for dozens of organizations.
Upendra Gundecha created howtheysre to solve this discovery problem. Rather than another "awesome list" of generic SRE tools, this repository focuses exclusively on how specific, named companies practice reliability engineering in production. It's the difference between reading about monitoring theory and seeing how Stripe actually implements its observability stack. The repository doesn't teach SRE; it shows you SRE as practiced by organizations running large production systems.
Technical Insight
The architecture of howtheysre is deliberately minimal, which turns out to be its greatest strength. The repository uses a static HTML structure with JavaScript-powered collapsible sections, making it trivially hostable on GitHub Pages and searchable with browser find functions. Each company gets a section with categorized links to their public resources. But the real engineering is in the curation methodology and automation.
The content organization follows a consistent taxonomy across companies: Culture, Hiring, Onboarding, Incident Management, Post-Mortems, Monitoring/Observability, Chaos Engineering, Performance, Security, and Platform Engineering. This standardization lets you compare approaches. Want to see how five different companies handle incident command structures? Search for "incident" and jump between company sections.
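To make the comparison idea concrete, here is a minimal sketch of how links tagged with that taxonomy could be grouped by category and then by company. The data shape, the company names, and the groupByCategory helper are all hypothetical; the repository itself stores plain links, not objects like these.

```javascript
// Hypothetical data shape for illustration only.
const entries = [
  { company: 'ExampleCo', category: 'Incident Management', title: 'How we run incident command', url: 'https://example.com/incident-command' },
  { company: 'ExampleCo', category: 'Chaos Engineering', title: 'Game days in practice', url: 'https://example.com/game-days' },
  { company: 'SampleCorp', category: 'Incident Management', title: 'Our on-call handbook', url: 'https://example.org/on-call' },
];

// Group entries by category, then by company, so one topic can be
// compared across companies.
function groupByCategory(items) {
  const byCategory = new Map();
  for (const { company, category, title, url } of items) {
    if (!byCategory.has(category)) byCategory.set(category, new Map());
    const byCompany = byCategory.get(category);
    if (!byCompany.has(company)) byCompany.set(company, []);
    byCompany.get(company).push({ title, url });
  }
  return byCategory;
}

const grouped = groupByCategory(entries);

// List every company's incident management resources side by side.
for (const [company, links] of grouped.get('Incident Management')) {
  console.log(company, links.map((link) => link.title));
}
```

A nested Map keeps the category as the primary key, which matches the "pick a topic, then walk the companies" reading pattern described above.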
The repository includes GitHub Actions workflows that validate link integrity and enforce content standards. Here's a simplified version of how the link validation might work:
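    import { readFile } from 'node:fs/promises';

    // Extract every absolute http(s) link from an HTML file and fail if any is broken.
    const checkLinks = async (htmlFile) => {
      const content = await readFile(htmlFile, 'utf-8');
      const linkRegex = /href="(https?:\/\/[^"]+)"/g;
      const links = [...content.matchAll(linkRegex)].map(m => m[1]);

      // Each check resolves to { url, status } or { url, error },
      // so the URL is preserved even when the request itself fails.
      const results = await Promise.all(
        links.map(async (url) => {
          try {
            // fetch has no `timeout` option; an AbortSignal enforces the 5 s limit.
            const response = await fetch(url, {
              method: 'HEAD',
              signal: AbortSignal.timeout(5000)
            });
            return { url, status: response.status };
          } catch (error) {
            return { url, error: error.message };
          }
        })
      );

      // A link is broken if the request failed or returned a 4xx/5xx status.
      const broken = results.filter(r => r.error || r.status >= 400);
      if (broken.length > 0) {
        console.error('Broken links found:', broken);
        process.exit(1);
      }
    };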
This automation addresses a critical challenge with curated link collections: decay. Engineering blogs reorganize, companies rebrand, and URLs break. Continuous validation catches these issues before users encounter dead ends.
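Not every non-200 response means a link has decayed. As a rough sketch, a checker could triage status codes into actions before failing the build. The classifyStatus function and its category names are hypothetical, and the sketch assumes the caller sees the raw status code (for example, with automatic redirect following disabled).

```javascript
// Hypothetical triage of an HTTP status code into an action for a link checker.
function classifyStatus(status) {
  if (status >= 200 && status < 300) return 'ok';
  if (status === 301 || status === 308) return 'moved';      // permanent redirect: update the stored link
  if (status >= 300 && status < 400) return 'ok';            // other redirects: leave the link as is
  if (status === 405 || status === 501) return 'retry-get';  // server may not support HEAD requests
  if (status === 429 || status >= 500) return 'retry-later'; // rate limiting or a transient server error
  return 'broken';                                           // everything else, e.g. 404 or 410
}

for (const status of [200, 301, 404, 405, 503]) {
  console.log(status, classifyStatus(status));
}
```

Separating "moved" and "retry" from "broken" keeps the check from failing on rate limits or on servers that reject HEAD, both of which are common false positives for link checkers.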
What makes this repository particularly valuable is its implicit comparative framework. By seeing how multiple companies approach the same problem, patterns emerge. You notice that most high-scale companies separate alerting into distinct tiers (pages vs tickets), that chaos engineering adoption follows a predictable maturity curve, and that incident post-mortem structures are remarkably similar across organizations despite different cultures.
The repository structure also reveals something interesting about SRE content distribution. Companies like Google, Netflix, and Uber have dozens of entries spanning years of public knowledge sharing. Smaller or newer SRE organizations have fewer resources, not because they're less sophisticated, but because they haven't invested as heavily in public content. This creates a natural weighting toward the most prolific publishers, which are not necessarily the organizations with the most mature SRE programs.
For teams building SRE practices, the repository functions as a requirements discovery tool. Browse what mature SRE organizations consider important enough to write about publicly, and you'll find capabilities you hadn't considered. Did you know Shopify publishes detailed runbooks? That Monzo documents their approach to testing resilience? These aren't resources you'd stumble upon organically, but they're exactly what you need when facing similar challenges.
Gotcha
The repository's biggest limitation is also its defining characteristic: it's purely a link aggregator with zero synthesis. You're handed 50+ resources about incident management across different companies, but you must extract patterns yourself. There's no meta-analysis explaining that most companies converge on similar incident command structures, or highlighting where approaches diverge meaningfully. If you're new to SRE, you might not have the context to evaluate whether Google's approach is applicable to your 10-person startup or whether Netflix's chaos engineering makes sense without their organizational maturity.
The flat organizational structure becomes unwieldy at scale. With 100+ companies and hundreds of links, finding specific information requires either knowing which company documented it or using browser search. Want to compare how different companies implement error budgets? You'll search through company sections by hand. The repository lacks metadata tags, topic-based navigation, or cross-referencing. It's optimized for "I want to see everything Cloudflare has written about SRE" but not for "Show me all resources about chaos engineering maturity models." The Hacktoberfest tag suggests active community contributions, but that openness also means coverage is uneven: some companies have rich, detailed resources while others have a single blog post or conference talk.
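One workaround for the missing topic navigation is to search link text across every company section at once. The sketch below is illustrative only: findLinksByTopic is a hypothetical helper, the sample markup is invented, and it assumes simple anchors of the form <a href="URL">text</a> with no nested tags.

```javascript
// Hypothetical helper: return links whose anchor text mentions a topic.
// Assumes simple anchors with no nested tags inside the link text.
function findLinksByTopic(html, topic) {
  const anchorRegex = /<a\s+[^>]*href="(https?:\/\/[^"]+)"[^>]*>([^<]*)<\/a>/gi;
  const needle = topic.toLowerCase();
  return [...html.matchAll(anchorRegex)]
    .map((m) => ({ url: m[1], text: m[2].trim() }))
    .filter((link) => link.text.toLowerCase().includes(needle));
}

// Invented sample markup standing in for two company sections.
const sample = `
  <a href="https://example.com/chaos">Chaos Engineering maturity at ExampleCo</a>
  <a href="https://example.org/oncall">On-call rotations at SampleCorp</a>
`;

console.log(findLinksByTopic(sample, 'chaos engineering'));
```

This only matches on link titles, so a resource that covers a topic without naming it in the title would still be missed.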
Verdict
Use if: You're establishing or scaling SRE practices and need real-world examples from named companies to validate your approach, benchmark against, or pitch internally ("Here's how Stripe does it"). Use it when researching specific companies before interviews or partnerships; you'll quickly understand their reliability philosophy. Use it when you're stuck on a specific problem and want to see if anyone has publicly documented their solution.
Skip if: You need structured SRE education (read the Google SRE books instead), want tool recommendations without company context (check awesome-sre lists), or expect synthesized best practices rather than raw source material. Skip if you want current, real-time updates: this is a curated archive, not a news feed.
The repository is a reference library for experienced practitioners who know what they're looking for and can evaluate sources critically, not a tutorial for SRE beginners.