// LATEST
Automation
Web-Shepherd: Teaching AI Agents to Navigate the Web With Interpretable Checklists
Cybersecurity
FuzzForge AI: When Your AI Assistant Becomes a Security Engineer
LLM Engineering
TOON: The Data Format That Makes LLMs 40% Cheaper to Feed
Cybersecurity
BountyBench: Testing Whether AI Can Actually Hunt Vulnerabilities
Cybersecurity
Dynamic Risk Assessment: Teaching AI Hackers When to Stop Before They Cause Real Damage
Cybersecurity
WASP: The First Security Benchmark That Proves Your AI Agent Can Be Hijacked
Data & Knowledge
ReasonRAG: Why Process Rewards Beat Outcome Rewards in Agentic Retrieval
AI Agents
GUARDIAN: Detecting When Your AI Agents Start Gaslighting Each Other
AI Agents
MCP Context Forge: Building an Enterprise Gateway for the Model Context Protocol
AI Agents
AgentAuditor: The AI Safety Tool Hiding Behind Academic Peer Review
AI Dev Tools
RAPTOR: Turning Claude Code Into a Security Research Agent With .claude.md Files
AI Agents
Happy: Turn Claude Code Into a Mobile-Controlled AI Agent With E2E Encryption
AI Agents
TheAgentCompany: The First Benchmark That Makes AI Agents Get a Real Job
AI Agents
TTI: Training Web Agents That Get Smarter By Learning From Their Own Mistakes
AI Agents
AGI SDK: Building a Benchmark Where AI Agents Actually Shop on Amazon (Sort Of)
AI Agents
SPORT-Agents: Teaching Multimodal AI to Learn from Its Own Mistakes
AI Agents
RF-Agent: Teaching Language Models to Design Reward Functions Through Tree Search
Cybersecurity
SEC-bench: Automated Benchmarking for LLM Security Agents in Real-World Vulnerability Scenarios
AI Agents
Superpowers: Teaching AI Agents to Stop Cowboy Coding
Automation
Stagehand: The Browser Automation Framework That Writes Its Own Selectors
Infrastructure
Steel Browser: The Missing Infrastructure Layer for AI Agent Automation
Cybersecurity
HackingBuddyGPT: Teaching LLMs to Escalate Privileges in 50 Lines of Code
Cybersecurity
ARTEMIS: When AI Agents Hunt for Zero-Days While You Sleep
AI Agents