How They SRE: The Open-Source Encyclopedia of Real-World Reliability Engineering
Hook
When your on-call system melts down at 3am, you don’t need theory—you need to know how Airbnb, Atlassian, and other leading tech companies actually solve this problem in production. That’s exactly what howtheysre delivers.
Context
Site Reliability Engineering grew from an internal Google practice into an industry standard, but a knowledge gap remains: SRE books teach principles and certifications test concepts, yet neither shows how companies with millions of users actually implement these ideas. Do they use managed services or custom tooling? How do they structure their incident response workflows? What does chaos engineering look like in production?
This is where howtheysre becomes invaluable. Created by Unmesh Gundecha, this repository functions as a living encyclopedia of SRE practices extracted directly from engineering blogs, conference talks, and public postings from technology leaders. Instead of synthesizing best practices into generic advice, it preserves the raw, contextual knowledge of how specific organizations tackle specific problems. It’s the difference between reading “implement monitoring” and reading Airbnb’s detailed post on their alerting framework.
Technical Insight
The repository’s architecture is deceptively simple but strategically organized. Rather than building a complex application, it relies on GitHub’s native features: markdown for content, GitHub Actions workflows that appear to handle validation, and a static structure organized by company name. Each company gets a collapsible details section containing categorized links to blog posts, videos, and conference talks.
The organizational model reveals the repository’s true value. Browse to the Airbnb section and you’ll find resources spanning incident management automation through Slack, their alerting framework, and multi-part series on automating data protection at scale. The Asana section shows their security incident response process and how they achieve stable web releases. This company-centric organization lets you study an organization’s complete SRE philosophy rather than fragmentary takes on isolated topics.
The repository covers a broad range of topics, including Site Reliability Engineering, Hiring and Building SRE teams, SRE Culture, DevOps, Monitoring & Observability, Alerting, Incident Response & Post-Mortem, On-Call, Testing in Production, Chaos Engineering, Automation, Performance, and Platform Engineering. The entries themselves are grouped by company rather than by topic, so the two research approaches work differently. Need to benchmark your startup against similar-scale companies? Browse by organization. Implementing chaos engineering and need case studies? Search the page for the topic keyword and compare the hits across companies.
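A topic-oriented pass can be approximated with a keyword filter over link titles. The sketch below is a minimal illustration and not part of the repository: the `index` dictionary, the `find_by_topic` helper, and the `ExampleCo` entry are hypothetical, while the two Airbnb titles and URLs are taken from the repository’s Airbnb section.

```python
# Hypothetical in-memory index: company -> list of (title, url).
# The Airbnb entries come from the repository's README; ExampleCo is made up.
index = {
    "Airbnb": [
        ("Automated Incident Management Through Slack",
         "https://medium.com/airbnb-engineering/incident-management-ae863dc5d47f"),
        ("Alerting Framework at Airbnb",
         "https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f"),
    ],
    "ExampleCo": [
        ("Chaos Engineering Game Days", "https://example.com/chaos-game-days"),
    ],
}


def find_by_topic(index, keywords):
    """Return (company, title, url) tuples whose title mentions any keyword."""
    needles = [keyword.lower() for keyword in keywords]
    hits = []
    # Sort by company name so the output order is stable.
    for company, links in sorted(index.items()):
        for title, url in links:
            if any(needle in title.lower() for needle in needles):
                hits.append((company, title, url))
    return hits


for company, title, url in find_by_topic(index, ["incident", "chaos"]):
    print(f"{company}: {title} ({url})")
```

Matching on titles alone will miss posts whose titles do not name the topic, so treat the result as a starting list rather than a complete one.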
Here’s what the actual structure looks like:
<details>
<summary>Airbnb</summary>
### Blog Posts
* [Automated Incident Management Through Slack](https://medium.com/airbnb-engineering/incident-management-ae863dc5d47f)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [Dynamic Kubernetes Cluster Scaling at Airbnb](https://medium.com/airbnb-engineering/dynamic-kubernetes-cluster-scaling-at-airbnb-d79ae3afa132)
</details>
This markdown-based approach means zero runtime dependencies, fast loading, and easy searching with the browser’s built-in find function. The repository appears to use CI/CD workflows to maintain quality without heavyweight tooling.
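Because the content is plain markdown, it is also easy to process with a script. The following is a minimal sketch, assuming every company section follows the layout in the excerpt above (a `<summary>` line, category headings, then bulleted links). The `parse_sections` function and the `SAMPLE` string are illustrative names, and the real README may contain variations this parser does not handle.

```python
import re

# Small sample mirroring the excerpt above; the real README is far longer.
SAMPLE = """\
<details>
<summary>Airbnb</summary>
### Blog Posts
* [Automated Incident Management Through Slack](https://medium.com/airbnb-engineering/incident-management-ae863dc5d47f)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
</details>
"""

SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>")
HEADING_RE = re.compile(r"^#{2,6}\s+(.*)$")
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def parse_sections(markdown):
    """Return {company: [(category, title, url), ...]} for each details section."""
    sections = {}
    company = None
    category = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        summary = SUMMARY_RE.search(line)
        if summary:
            # A new company section starts; reset the current category.
            company = summary.group(1).strip()
            category = None
            sections.setdefault(company, [])
            continue
        if line.startswith("</details>"):
            company = None
            continue
        if company is None:
            continue  # Ignore anything outside a company section.
        heading = HEADING_RE.match(line)
        if heading:
            category = heading.group(1).strip()
            continue
        for title, url in LINK_RE.findall(line):
            sections[company].append((category, title, url))
    return sections


for company, links in parse_sections(SAMPLE).items():
    for category, title, url in links:
        print(f"{company} | {category} | {title}")
```

To run this against the real file, read `README.md` from a local clone and pass its text to `parse_sections` in place of `SAMPLE`.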
The repository’s 9,710 stars suggest how widely practitioners rely on it, and because it welcomes pull requests, it functions as a collaborative, community-curated knowledge base. When Achievers publishes new posts about their GitOps tooling or Algolia shares CI/CD platform insights, contributors can submit PRs adding these resources.
What makes this particularly powerful for practitioners is the depth of multi-part series. Achievers’ two-part series on load testing Kubernetes doesn’t just advocate for load testing—it shows their framework architecture and bottleneck resolution process. Airbnb’s three-part series on automating data protection reveals implementation details you’d never find in a conference abstract. This is institutional knowledge made public, preserved in one discoverable location.
The topics extend beyond traditional SRE boundaries into platform engineering and security—reflecting how modern reliability engineering intersects with adjacent domains. An SRE investigating production secret management will find Airbnb’s detailed post on their approach. Teams exploring service mesh architectures can read Achievers’ experience scaling production globally with observability improvements.
The JavaScript language tag might confuse initial viewers—this isn’t a JS library. It likely reflects build tooling or a future static site generator implementation, but the core content remains language-agnostic SRE knowledge applicable regardless of your stack.
Gotcha
The repository’s greatest strength, linking to original sources, creates inherent fragility. Companies reorganize their engineering blogs, Medium changes URL structures, and organizations delete posts during website migrations. Whatever checks the repository’s CI workflows perform, they cannot restore a page once it is gone, so link rot remains a real risk. You’ll occasionally encounter dead links that point to valuable content now lost to the internet.
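If you keep your own reading list drawn from the repository, a periodic link check helps catch rot early. The sketch below uses only Python’s standard library. It is not the repository’s actual CI workflow, and `check_link` and `find_dead_links` are hypothetical helper names. The `opener` parameter exists so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or the error class name on failure."""
    try:
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
        )
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses surface as HTTPError; keep the numeric code.
        return err.code
    except (OSError, ValueError) as err:
        # OSError covers URLError and timeouts; ValueError covers malformed URLs.
        return type(err).__name__


def find_dead_links(urls, opener=urllib.request.urlopen):
    """Return (url, status) pairs for links that did not answer with 2xx or 3xx."""
    dead = []
    for url in urls:
        status = check_link(url, opener=opener)
        if not (isinstance(status, int) and 200 <= status < 400):
            dead.append((url, status))
    return dead
```

Some servers reject `HEAD` requests or block automated clients, so a failure here is a prompt to check the link by hand rather than proof that the content is gone.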
Quality and applicability vary dramatically. A post from a large-scale organization describing their SRE practices assumes infrastructure scale and tooling budgets that most organizations will never approach. Trying to implement enterprise-scale SRE at a 20-person startup will likely create more problems than it solves. The repository provides no filtering mechanism for organization size, maturity level, or implementation complexity. You’re responsible for determining whether a practice from Airbnb’s infrastructure team translates to your context.
The curation is also uneven. Some organizations have comprehensive coverage with detailed multi-part series, while others have a single blog post. This reflects the reality of public knowledge sharing—some companies invest heavily in engineering blogs while others remain relatively quiet. But it means your research depth depends on which companies have tackled your specific problem publicly.
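A quick way to gauge that unevenness before relying on a section is to count links per company. This is a rough sketch under the same layout assumption as the README excerpt earlier in this article. The `coverage` function is a hypothetical helper, and the `ExampleCo` section in the sample is made up for contrast.

```python
import re
from collections import Counter

# Sample input: Airbnb links from the README excerpt, plus a made-up ExampleCo.
SAMPLE = """\
<details>
<summary>Airbnb</summary>
### Blog Posts
* [Automated Incident Management Through Slack](https://medium.com/airbnb-engineering/incident-management-ae863dc5d47f)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [Dynamic Kubernetes Cluster Scaling at Airbnb](https://medium.com/airbnb-engineering/dynamic-kubernetes-cluster-scaling-at-airbnb-d79ae3afa132)
</details>
<details>
<summary>ExampleCo</summary>
### Blog Posts
* [A Single Hypothetical Post](https://example.com/only-post)
</details>
"""

SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>")
LINK_RE = re.compile(r"\[[^\]]+\]\(https?://[^)\s]+\)")


def coverage(markdown):
    """Return (company, link_count) pairs, thinnest coverage first."""
    counts = Counter()
    company = None
    for line in markdown.splitlines():
        summary = SUMMARY_RE.search(line)
        if summary:
            company = summary.group(1).strip()
            counts[company] += 0  # Register the section even if it has no links.
        elif "</details>" in line:
            company = None
        elif company is not None:
            counts[company] += len(LINK_RE.findall(line))
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


for company, count in coverage(SAMPLE):
    print(f"{count:3d}  {company}")
```

A low count says nothing about the quality of the one post that is there, only that you may need to look beyond that company for a second perspective.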
Verdict
Use if: You’re building or scaling an SRE practice and need real-world case studies showing how established companies implement specific capabilities. It’s invaluable for benchmarking, discovering approaches you hadn’t considered, and learning from others’ production incidents. It is particularly powerful when researching specific topics like chaos engineering or incident response, because seeing how several companies approach the same problem reveals patterns and tradeoffs better than any single authoritative guide.
Skip if: You need structured learning materials or step-by-step implementation guides. This is a reference encyclopedia, not a tutorial. Also skip if you want curated, vetted best practices, since you’ll need to critically evaluate whether each linked resource applies to your context, scale, and constraints. For foundational SRE learning, start with established SRE resources, then return here when you need to see how specific practices work in production.