Code Generation Benchmarks

2 benchmarks in this category

HumanEval: OpenAI Python Programming Benchmark (164 Problems)
HumanEval evaluates AI agents on 164 Python programming problems from OpenAI, testing code generation from function signatures and docstrings with unit test verification.
MBPP: Mostly Basic Python Programming Problems Benchmark
MBPP, as supported in mcpbr, comprises roughly 1,000 crowd-sourced Python programming problems designed for entry-level programmers.
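
To make "code generation from function signatures and docstrings with unit test verification" concrete, here is a minimal sketch of how one such task might be scored. The task, the completion, and the `passes` helper are hypothetical illustrations, not an actual HumanEval problem and not mcpbr's harness. Only the overall shape (a prompt, a completion, and a `check(candidate)` test) follows the published HumanEval format.

```python
# Hypothetical task in the HumanEval style (not an actual benchmark problem).
# The prompt is a function signature plus docstring; the model writes the body.
PROMPT = '''def add_positive(numbers):
    """Return the sum of the strictly positive integers in `numbers`.

    >>> add_positive([1, -2, 3])
    4
    """
'''

# A candidate completion, as a model might produce it (function body only).
COMPLETION = "    return sum(n for n in numbers if n > 0)\n"

# Unit tests written as a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate([1, -2, 3]) == 4
    assert candidate([]) == 0
    assert candidate([-5, -1]) == 0
'''


def passes(prompt, completion, test, entry_point):
    """Return True if prompt + completion passes every assertion in test.

    Illustrative helper only: it runs code with exec() in-process and
    provides no sandboxing or timeout.
    """
    namespace = {}
    try:
        # Define the completed function and check() in a fresh namespace.
        exec(prompt + completion + "\n" + test, namespace)
        # Run the tests against the generated function.
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


print(passes(PROMPT, COMPLETION, TEST, "add_positive"))
```

Because `exec` runs arbitrary code in the current process, this sketch is only suitable for trusted input; generated code should be executed in an isolated environment with time limits.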
