All Benchmark Categories
8 categories
- Code Generation (2)
- Code Understanding (1)
- Knowledge & QA (4)
- Math & Reasoning (3)
- ML Research (1)
- Security (1)
- Software Engineering (7)
- Tool Use & Agents (6)
Benchmark Your MCP Server
Get hard numbers comparing tool-assisted vs. baseline agent performance on real tasks.
Created by Grey Newell