Repokid: Netflix’s Battle-Tested Approach to Taming AWS IAM Permission Creep
Hook
Netflix operates thousands of AWS accounts with constantly evolving microservices, yet most IAM roles retain every permission they were ever granted. Repokid solves this by watching what services actually use, then deleting everything else.
Context
Anyone who’s managed AWS IAM at scale knows the painful reality: permissions only accumulate, never shrink. A developer requests S3 and DynamoDB access for a new service. Six months later, the service no longer touches DynamoDB, but the permission remains. Multiply this across hundreds of services, dozens of accounts, and constant deployments, and you’ve got a security nightmare.
The traditional approach—manual IAM audits—simply doesn’t scale in high-velocity environments. Security teams can’t keep pace with deployment velocity, and developers lack visibility into what permissions their services actually need versus what they currently have. Static analysis tools can identify overly broad policies, but they can’t tell you which permissions are actively used. Netflix built Repokid to solve this operationally: use actual AWS telemetry to identify genuinely unused permissions, then automatically remove them. It’s least privilege through observation rather than prediction.
Technical Insight
Repokid’s architecture centers on a data-driven feedback loop. It relies on AWS Access Advisor, an IAM feature that reports when each role last accessed each AWS service. Aardvark (another Netflix tool) collects this data and stores it centrally. Repokid queries this telemetry, compares it against current IAM policies, works out which permissions haven’t been used within your defined threshold period (typically 90 days), and removes them.
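The comparison at the heart of that loop can be sketched in a few lines. This is an illustration only, not Repokid's code: the function name `unused_services` and the shape of the `last_accessed` mapping are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def unused_services(last_accessed, granted_services, now, threshold_days=90):
    """Return granted service namespaces not used within the threshold.

    last_accessed maps a service namespace (e.g. "s3") to the datetime of its
    last recorded use; a missing key or None means no use was recorded.
    """
    cutoff = now - timedelta(days=threshold_days)
    unused = set()
    for service in granted_services:
        seen = last_accessed.get(service)
        if seen is None or seen < cutoff:
            unused.add(service)
    return unused


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_accessed = {
    "s3": now - timedelta(days=3),          # used recently, keep
    "dynamodb": now - timedelta(days=200),  # stale, candidate for removal
}
# "sqs" is granted but has no recorded use at all.
print(sorted(unused_services(last_accessed, {"s3", "dynamodb", "sqs"}, now)))
# prints ['dynamodb', 'sqs']
```

Treating "never seen" the same as "seen too long ago" is the aggressive choice; a more cautious variant could report the two cases separately.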
The system operates in a hub-and-spoke model designed for multi-account AWS organizations. You deploy a central Repokid instance with a RepokidInstanceProfile that assumes a RepokidRole in each target account. This role needs permissions to read IAM metadata and modify role policies. A Repokid configuration along these lines ties the pieces together:
# In your Repokid configuration
config = {
    'connection_iam': {
        'assume_role': 'RepokidRole',
        'session_name': 'repokid',
        'region': 'us-east-1'
    },
    'aardvark_api_location': 'https://aardvark.example.com/api',
    'dynamo_db': {
        'account_number': '123456789012',
        'region': 'us-east-1',
        'table': 'repokid_roles'
    },
    'filter_config': {
        'AgeFilter': {'minimum_age': 90},
        'BlocklistFilter': {'blocklist': ['PowerUser', 'Admin']}
    }
}
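To make the hub-and-spoke step concrete, here is a minimal sketch of how a central process might obtain credentials for the role in a spoke account. The helper names `role_arn` and `assume_in_account` are hypothetical and are not part of Repokid; the only real API assumed is the STS `assume_role` call, which takes `RoleArn` and `RoleSessionName` and returns a `Credentials` entry.

```python
def role_arn(account_id, role_name):
    """Build the ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_in_account(sts_client, account_id,
                      role_name="RepokidRole", session_name="repokid"):
    """Assume the spoke role and return its temporary credentials.

    sts_client is any object exposing assume_role(RoleArn=..., RoleSessionName=...),
    for example the client returned by boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=session_name,
    )
    return response["Credentials"]


print(role_arn("123456789012", "RepokidRole"))
# prints arn:aws:iam::123456789012:role/RepokidRole
```

Passing the client in, rather than creating it inside the function, keeps the sketch free of a hard boto3 dependency and makes it easy to exercise with a stub.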
The DynamoDB table serves as Repokid’s memory, storing complete role state including policy versions, repoability scores (a metric indicating how much a role can be reduced), and change history. When you run repokid repo_role <account_number> <role_name> -c, the system retrieves Access Advisor data, calculates unused permissions, generates a new minimal policy, and updates the role, all while keeping the previous version for rollback. Without the -c (commit) flag, the command only reports what it would change.
What makes Repokid production-ready is its filter and hook system. Filters determine which roles Repokid can touch. The BlocklistFilter protects critical roles by name or pattern. The ExclusiveFilter restricts Repokid to roles whose names match configured patterns. The AgeFilter prevents modifying newly created roles before they’ve accumulated meaningful usage data. Filters are applied together, and you can add your own:
from repokid.filters import Filter


class CustomDepartmentFilter(Filter):
    def apply(self, role):
        # Only manage roles tagged for your department.
        # Illustrative interface: check the Filter base class in your Repokid
        # version for the exact apply() signature and return convention.
        tags = role.get('Tags', [])
        dept_tag = next((t['Value'] for t in tags if t['Key'] == 'Department'), None)
        return dept_tag == 'platform-engineering'
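How several filters combine can be shown without depending on Repokid itself. The sketch below is a standalone illustration under simplifying assumptions: each filter is a plain predicate that flags a role for exclusion, and roles are dictionaries with `RoleName` and `CreateDate` keys. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatch


def too_young(role, now, minimum_age_days=90):
    """Flag roles created more recently than the minimum age."""
    return now - role["CreateDate"] < timedelta(days=minimum_age_days)


def blocklisted(role, patterns):
    """Flag roles whose name matches any shell-style pattern."""
    return any(fnmatch(role["RoleName"], pattern) for pattern in patterns)


def eligible_roles(roles, exclusions):
    """Keep only roles that no exclusion predicate flags."""
    return [r for r in roles if not any(exclude(r) for exclude in exclusions)]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
roles = [
    {"RoleName": "AdminBreakGlass", "CreateDate": now - timedelta(days=400)},
    {"RoleName": "checkout-service", "CreateDate": now - timedelta(days=10)},
    {"RoleName": "billing-service", "CreateDate": now - timedelta(days=300)},
]
exclusions = [
    lambda r: too_young(r, now),
    lambda r: blocklisted(r, ["Admin*", "PowerUser*"]),
]
print([r["RoleName"] for r in eligible_roles(roles, exclusions)])
# prints ['billing-service']
```

A role survives only if every filter lets it through, so adding a filter can shrink the set of managed roles but never grow it.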
Hooks provide extension points at critical lifecycle moments: before and after repo operations (the step that removes unused permissions), during policy generation, and on errors. Netflix uses hooks to integrate with internal ticketing systems, create change requests, and notify teams when their roles are modified. This transforms Repokid from a standalone tool into part of a broader governance workflow.
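The general shape of such a hook mechanism is a registry of callbacks keyed by lifecycle event. The sketch below illustrates the pattern only; the `HookRegistry` class, the `AFTER_REPO` event name, and the context dictionary are assumptions for this example rather than Repokid's actual hook interface.

```python
from collections import defaultdict


class HookRegistry:
    """Maps lifecycle event names to callbacks, run in priority order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, event, priority=0):
        """Decorator that attaches a function to an event."""
        def decorator(func):
            self._hooks[event].append((priority, func))
            # Stable sort: equal priorities keep registration order.
            self._hooks[event].sort(key=lambda item: item[0])
            return func
        return decorator

    def run(self, event, context):
        """Pass the context through each hook for the event, in order."""
        for _, func in self._hooks[event]:
            context = func(context)
        return context


hooks = HookRegistry()
notifications = []


@hooks.register("AFTER_REPO", priority=1)
def notify_owner(context):
    # Stand-in for posting to a ticketing or chat system.
    notifications.append(
        f"{context['role']}: removed {len(context['removed'])} permissions"
    )
    return context


hooks.run("AFTER_REPO", {"role": "checkout-service", "removed": ["dynamodb:Query"]})
print(notifications)
# prints ['checkout-service: removed 1 permissions']
```

Having each hook return the context lets an earlier hook enrich it, for example with a ticket ID, for later hooks to use.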
The permission removal logic itself is conservative by design. Repokid checks the actions in each policy statement against Access Advisor data. If an action’s service hasn’t been used within the threshold period and the action isn’t in a configured whitelist, it’s marked for removal. But here’s the critical detail: it maintains statement structure. If a statement grants s3:GetObject and dynamodb:Query, and only S3 was used, Repokid writes a new statement with just s3:GetObject rather than mangling the original. This preserves policy readability and auditability.
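A rough sketch of statement-preserving trimming follows. It is not Repokid's implementation: the function `trim_policy` is hypothetical, it works at service granularity (matching what Access Advisor reports), and it deliberately leaves Deny statements, NotAction statements, and bare wildcards untouched.

```python
import copy


def trim_policy(policy, unused_services, keep_actions=frozenset()):
    """Return a copy of the policy with actions for unused services removed.

    Each surviving statement keeps its Effect, Resource, and Condition; only
    the Action list shrinks. Statements left with no actions are dropped.
    """
    trimmed = {"Version": policy["Version"], "Statement": []}
    for statement in policy["Statement"]:
        # Only Allow statements with an explicit Action list are trimmed.
        if statement.get("Effect") != "Allow" or "Action" not in statement:
            trimmed["Statement"].append(copy.deepcopy(statement))
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        kept = [
            action for action in actions
            if action in keep_actions
            or action.split(":", 1)[0].lower() not in unused_services
        ]
        if kept:
            new_statement = copy.deepcopy(statement)
            new_statement["Action"] = kept
            trimmed["Statement"].append(new_statement)
    return trimmed


policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "dynamodb:Query"],
        "Resource": "*",
    }],
}
print(trim_policy(policy, {"dynamodb"})["Statement"][0]["Action"])
# prints ['s3:GetObject']
```

The `keep_actions` argument plays the role of the exemption list: anything named there survives even when its service looks unused.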
The rollback mechanism deserves attention. Every time Repokid modifies a role, it increments a version counter and stores the complete previous policy in DynamoDB. You can instantly restore any previous version:
# List the stored policy versions for a role (no changes are made)
repokid rollback_role 123456789012 MyServiceRole
# Restore a specific stored version; -c commits the change
repokid rollback_role 123456789012 MyServiceRole --selection=5 -c
This safety net is essential for production deployments. When a rarely-used but critical permission gets removed (say, a quarterly report generation that uses specific DynamoDB permissions), teams can immediately restore and update their filters to prevent future removal.
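The version-and-restore idea can be modeled with a small in-memory store. This is a sketch under stated assumptions: `PolicyHistory` is a hypothetical class standing in for the DynamoDB-backed history, and version numbers here are simple 1-based positions.

```python
import copy


class PolicyHistory:
    """Keeps every policy state recorded for a role, oldest first."""

    def __init__(self):
        self._versions = {}

    def record(self, role_name, policy):
        """Store a snapshot and return its 1-based version number."""
        versions = self._versions.setdefault(role_name, [])
        versions.append(copy.deepcopy(policy))
        return len(versions)

    def get(self, role_name, version=None):
        """Return a stored version; by default, the one before the latest."""
        versions = self._versions.get(role_name, [])
        if version is None:
            version = len(versions) - 1
        if not 1 <= version <= len(versions):
            raise LookupError(f"no version {version} stored for {role_name}")
        return copy.deepcopy(versions[version - 1])


history = PolicyHistory()
original = {"Statement": [{"Action": ["s3:GetObject", "dynamodb:Query"]}]}
trimmed = {"Statement": [{"Action": ["s3:GetObject"]}]}
history.record("MyServiceRole", original)  # version 1
history.record("MyServiceRole", trimmed)   # version 2
print(history.get("MyServiceRole") == original)
# prints True
```

Storing full snapshots rather than diffs costs more space but makes a restore a single read, which matters when a team is waiting on a broken service.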
Gotcha
The elephant in the room: Repokid is marked as maintenance mode in Netflix’s OSS lifecycle. Netflix still uses it internally, but active feature development has ceased. This means bug fixes and security patches may be slow, and long-term support is uncertain. You’re adopting proven technology, but not an actively evolving product.
The infrastructure requirements are non-trivial. You need a DynamoDB table, cross-account IAM roles in every account you manage, and most critically, Aardvark deployed and collecting data. Aardvark is a separate service to run: a central API with its own datastore, plus a cross-account role in each account so it can pull Access Advisor data on a schedule. You’re looking at days or weeks of infrastructure work before Repokid becomes operational. Then roles need enough usage history, typically 90 days, before you can confidently remove permissions. There’s no instant gratification here.
Access Advisor data has inherent limitations that Repokid inherits. As Repokid consumes it, the data is service-level: it shows that a role used S3, but not which API calls, buckets, or objects were involved. Permissions that are genuinely needed but used infrequently (disaster recovery roles, annual compliance jobs, break-glass procedures) will be flagged as unused whenever their last use falls outside your threshold. You’ll need comprehensive filters and manual oversight to prevent removing critical permissions. Lengthening the threshold helps only up to a point: AWS retains last-accessed data for a limited tracking period (400 days at the time of writing, per the IAM documentation), so anything used less often than that will always appear unused.
Verdict
Use if: You’re managing 20+ AWS accounts with frequent deployments, have dedicated security/platform engineering resources to handle the setup, and can tolerate a 90-day data collection period before seeing value. Repokid excels in environments where manual IAM governance is impossible due to scale and velocity, and where you can invest in building organizational knowledge around its filters and hooks. It’s particularly valuable if you’re already using infrastructure-as-code and treating IAM as cattle rather than pets.
Skip if: You’re operating fewer than 10 accounts where manual quarterly IAM reviews are still feasible, lack resources for deploying and maintaining Aardvark, or need active vendor support and ongoing feature development. Also skip if you’re on AWS GovCloud or other regions where Access Advisor has limited functionality.
For new projects starting from scratch, evaluate AWS IAM Access Analyzer first: it provides similar unused access detection with zero infrastructure setup, though without Repokid’s automation capabilities. Consider Repokid when you’ve outgrown native tools and need industrial-grade automated remediation.