Code Generation Benchmarks
2 benchmarks in this category
-
HumanEval: OpenAI Python Programming Benchmark (164 Problems)
HumanEval evaluates AI agents on 164 Python programming problems from OpenAI, testing code generation from function signatures and docstrings with unit test verification.
-
MBPP: Mostly Basic Python Programming Problems Benchmark
The MBPP benchmark for mcpbr: about 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers.
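Both benchmarks score a generated function by running unit tests against it. The sketch below illustrates that idea with a made-up task loosely modeled on HumanEval's prompt, test, and entry-point layout. The task `running_max`, the helper `passes`, and the constants are hypothetical and are not part of mcpbr or of either dataset. A real harness would run untrusted model output in a sandbox with timeouts rather than calling `exec` directly.

```python
# Hypothetical task: the model sees only the signature and docstring.
PROMPT = '''def running_max(values):
    """Return a list where element i is the maximum of values[:i + 1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# Hypothetical model output: the function body that completes the prompt.
COMPLETION = '''    result = []
    best = None
    for v in values:
        best = v if best is None or v > best else best
        result.append(best)
    return result
'''

# Unit tests wrapped in a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([2, 2, 1]) == [2, 2, 2]
'''


def passes(prompt, completion, test, entry_point):
    """Return True if prompt + completion defines a function that passes the tests."""
    namespace = {}
    try:
        # Define the candidate function and the check() function.
        exec(prompt + completion + "\n" + test, namespace)
        # Run the unit tests against the candidate.
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


if __name__ == "__main__":
    print(passes(PROMPT, COMPLETION, TEST, "running_max"))
```

Any failed assertion or exception marks the problem as unsolved, so a completion that is almost right still scores zero for that problem.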
Benchmark Your MCP Server
Get hard numbers comparing tool-assisted vs. baseline agent performance on real tasks.
Created by Grey Newell