HumanEval¶
Overview¶
| Property | Value |
|---|---|
| Benchmark ID | humaneval |
| Dataset | openai_humaneval |
| Tasks | 164 Python programming problems |
| Evaluation | Unit tests combined with solution, exit code 0 = pass |
| Output Type | Function code (saved to solution.py) |
| Timeout | 60-180s recommended |
| Pre-built Images | No (lightweight Python container) |
| Difficulty Levels | None (uniform difficulty) |
HumanEval is a code generation benchmark created by OpenAI consisting of 164 hand-written Python programming problems. Each task provides a function signature with a descriptive docstring, and the agent must produce a correct implementation that passes all accompanying unit tests.
HumanEval is widely used as a baseline measure of code generation capability. Problems range from simple string manipulation and list operations to moderate algorithmic challenges. Because tasks are self-contained single functions with no external dependencies, HumanEval is an excellent choice for quick evaluations and smoke testing before running more expensive benchmarks like SWE-bench.
In mcpbr, HumanEval evaluates how effectively an MCP server assists the language model in generating correct, test-passing Python code within lightweight Docker containers.
What It Measures¶
HumanEval tests a focused set of code generation capabilities:
- Function-level code synthesis: The ability to translate a natural language specification (docstring) into correct Python code
- Type-aware programming: Functions include type hints that the agent must respect and leverage
- Edge case handling: Unit tests cover boundary conditions, empty inputs, and special cases that require careful implementation
- Standard library fluency: All problems use only the Python standard library, testing the agent's knowledge of built-in data structures, string operations, and mathematical functions
- Docstring comprehension: The agent must correctly interpret examples, constraints, and expected behavior described in the docstring
HumanEval does not test:
- Multi-file or multi-class code generation
- External library usage (numpy, pandas, etc.)
- Bug fixing or debugging existing code
- Long-context repository understanding
Task Structure¶
Each HumanEval task contains the following fields:
| Field | Description |
|---|---|
| task_id | Unique identifier such as HumanEval/0, HumanEval/1, etc. |
| prompt | Function signature with docstring describing the requirements and examples |
| entry_point | Name of the function to implement (e.g., has_close_elements) |
| canonical_solution | Reference implementation (not shown to the agent) |
| test | Unit tests to verify correctness |
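These fields can be inspected directly from the openai_humaneval dataset; a short sketch using the HuggingFace `datasets` library (assuming it is installed):

```python
from datasets import load_dataset

# HumanEval ships a single "test" split containing all 164 tasks
tasks = load_dataset("openai_humaneval", split="test")

task = tasks[0]
print(task["task_id"])      # HumanEval/0
print(task["entry_point"])  # has_close_elements
print(task["prompt"])       # function signature + docstring (see the example below)
# task["canonical_solution"] and task["test"] are also present but never shown to the agent
```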
Example task prompt:
```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```
The agent receives the function signature and docstring, implements the function body, and saves the result to solution.py. Task IDs are converted to Docker-safe names by replacing slashes with underscores (e.g., HumanEval/0 becomes HumanEval_0).
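For the task above, one implementation the agent might save as solution.py could look like the following (a sketch; the dataset's canonical solution may differ):

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers in the list differ by less than threshold."""
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:   # compare each unordered pair once
            if abs(a - b) < threshold:
                return True
    return False
```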
Configuration¶
Basic Configuration¶
```bash
# Run HumanEval with default settings
mcpbr run -c config.yaml --benchmark humaneval

# Run a small sample for quick testing
mcpbr run -c config.yaml --benchmark humaneval -n 20

# Run specific tasks by ID
mcpbr run -c config.yaml --benchmark humaneval -t HumanEval/0 -t HumanEval/1

# Run with verbose output
mcpbr run -c config.yaml --benchmark humaneval -n 10 -v

# Save results to JSON
mcpbr run -c config.yaml --benchmark humaneval -n 20 -o results.json
```
Advanced Options¶
HumanEval does not support difficulty or category filtering since all 164 tasks are uniform Python function completions. However, you can select specific tasks by ID:
```bash
# Run specific tasks for targeted testing
mcpbr run -c config.yaml --benchmark humaneval \
  -t HumanEval/0 -t HumanEval/10 -t HumanEval/50

# Run all tasks and save JSON results plus a Markdown report
mcpbr run -c config.yaml --benchmark humaneval \
  -o results.json -r report.md -v
```
Configuration optimized for throughput:
```yaml
benchmark: "humaneval"
sample_size: 164       # All tasks
timeout_seconds: 120
max_iterations: 10
max_concurrent: 8      # High concurrency -- lightweight containers
mcp_server:
  command: "npx"
  args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
model: "sonnet"
```
Evaluation Methodology¶
HumanEval evaluation follows a straightforward test-execution pipeline:
- Solution discovery: The evaluator looks for the agent's solution file, checking several candidate filenames in order: `solution.py`, `answer.py`, `code.py`, `implementation.py`, `main.py`. If no file is found, it attempts to extract code from the agent's raw response (markdown code blocks or inline function definitions).
- Test assembly: The solution code is combined with the task's unit test code and a `check()` call that invokes the tests against the implemented entry point function.
- Execution: The combined file is executed with `python3` inside the Docker container with a 30-second timeout.
- Verdict: The task is marked as resolved if the Python process exits with code 0 and no `AssertionError` appears in stderr.
The evaluation does not require any external packages. All HumanEval tasks use only the Python standard library.
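A minimal sketch of this pipeline, using hypothetical helper names (mcpbr's internal implementation may differ):

```python
import subprocess
from pathlib import Path

CANDIDATE_FILES = ["solution.py", "answer.py", "code.py", "implementation.py", "main.py"]


def find_solution(workdir: Path) -> str | None:
    # 1. Solution discovery: the first candidate filename that exists wins
    for name in CANDIDATE_FILES:
        path = workdir / name
        if path.exists():
            return path.read_text()
    return None  # fallback to extracting code from the raw response is omitted here


def evaluate_task(workdir: Path, test_code: str, entry_point: str) -> bool:
    solution = find_solution(workdir)
    if solution is None:
        return False

    # 2. Test assembly: solution + unit tests + a check() call on the entry point
    combined = f"{solution}\n\n{test_code}\n\ncheck({entry_point})\n"
    test_file = workdir / "test_solution.py"
    test_file.write_text(combined)

    # 3. Execution: python3 with a 30-second timeout
    #    (mcpbr runs this step inside the task's Docker container)
    try:
        result = subprocess.run(
            ["python3", str(test_file)],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        return False

    # 4. Verdict: resolved if exit code 0 and no AssertionError in stderr
    return result.returncode == 0 and "AssertionError" not in result.stderr
```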
Interpreting Results¶
Key Metrics¶
| Metric | Description |
|---|---|
| pass@1 | Percentage of tasks resolved on the first attempt (primary metric) |
| Resolve rate | Equivalent to pass@1 in mcpbr's single-attempt evaluation |
| Error breakdown | Distribution of failure types: assertion errors vs. runtime errors vs. no solution |
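Because mcpbr makes a single attempt per task, pass@1 is simply the number of resolved tasks divided by the total. For reference, a small sketch of the general unbiased pass@k estimator from the HumanEval paper (not part of mcpbr):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With a single attempt per task (n = k = 1), pass@1 over a benchmark run
# reduces to resolved_tasks / total_tasks.
```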
What Good Results Look Like¶
| Score Range | Assessment |
|---|---|
| 90-100% | Excellent -- state-of-the-art code generation with effective MCP assistance |
| 80-90% | Good -- strong code generation, may struggle with complex algorithmic problems |
| 70-80% | Adequate -- baseline capability, check for prompt or configuration improvements |
| Below 70% | Needs investigation -- likely configuration issues, poor prompting, or MCP server problems |
Context: Industry Benchmarks
Leading models (Claude, GPT-4, etc.) achieve 85-95% on HumanEval without MCP assistance. An effective MCP server integration should maintain or slightly improve these scores. Significant drops below the base model's capability suggest integration issues rather than model limitations.
Common Failure Patterns¶
| Pattern | Cause | Solution |
|---|---|---|
| High assertion errors | Incorrect logic or edge case handling | Review failed tasks; consider enabling extended thinking |
| No solution file found | Agent describes code but does not save it | Strengthen prompt to emphasize saving to solution.py |
| Import errors | Agent uses third-party libraries | Add instruction to use only standard library |
| Timeout failures | Infinite loops or extremely slow algorithms | Set timeout_seconds: 180; review algorithmic complexity |
| Signature modification | Agent changes function name or parameters | Emphasize preserving the exact function signature |
Example Output¶
Successful resolution:
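A resolved task reports the same fields as the failure example below; illustratively:

```json
{
  "resolved": true,
  "exit_code": 0,
  "stdout": "",
  "stderr": ""
}
```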
Failed resolution (assertion error):
```json
{
  "resolved": false,
  "exit_code": 1,
  "stdout": "",
  "stderr": "Traceback (most recent call last):\n File \"test_solution.py\", line 12, in <module>\n check(has_close_elements)\n File \"test_solution.py\", line 8, in check\n assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\nAssertionError",
  "error": "Test assertions failed"
}
```
Failed resolution (no solution found):
Best Practices¶
Recommended Workflow¶
- Start with a small sample (10-20 tasks) to verify your setup before running all 164 tasks
- Review the first few results to confirm the agent saves solutions to `solution.py`
- Run the full benchmark once configuration is validated
- Compare across models by running with different model configurations
Performance Tips¶
- Use shorter timeouts (60-180s) since HumanEval tasks are quick and well-defined
- Run high concurrency (`max_concurrent: 8` or higher) since each task uses a lightweight Python container with minimal resource requirements
- Set `max_iterations` lower (10-15) since tasks are simpler than SWE-bench and require fewer agent turns
- Run HumanEval first, before expensive benchmarks like SWE-bench, as a smoke test for your MCP server integration
Cost Optimization¶
- 164 tasks is manageable: Running all tasks is feasible and recommended for consistent benchmarking
- Low token usage: Function completion typically requires fewer tokens than bug-fixing or multi-step reasoning tasks
- Quick turnaround: Most tasks complete in under 2 minutes, making full-suite runs practical even for iteration
- Use `sonnet` for cost efficiency: HumanEval tasks rarely benefit from the additional reasoning capability of more expensive models
Common Issues & Solutions¶
| Issue | Cause | Solution |
|---|---|---|
| Agent does not create solution.py | Agent outputs code in response but does not save a file | Ensure your agent prompt explicitly instructs saving to solution.py. The evaluator falls back to response extraction, but file-based solutions are more reliable. |
| Tests fail with import errors | Agent imports third-party packages (e.g., numpy) | HumanEval tasks use only the Python standard library. Verify that the agent prompt discourages external dependencies. |
| Timeout during evaluation | Agent's implementation contains infinite loops or slow algorithms | Individual test execution has a 30-second timeout. Set timeout_seconds: 180 in configuration for the overall task timeout. |
| Function signature is modified | Agent changes function name, parameters, or type hints | The test harness calls the original entry_point function name. The agent must preserve the exact signature. |
| Low scores despite correct logic | Agent wraps code in a class or adds extra boilerplate | HumanEval expects bare function definitions. Instruct the agent to write only the function implementation. |
Comparison with Similar Benchmarks¶
| Aspect | HumanEval | MBPP | SWE-bench | APPS |
|---|---|---|---|---|
| Goal | Complete a function | Solve a coding problem | Fix a real bug | Solve coding challenges |
| Task Count | 164 | ~1,000 | 300-2,294 | 10,000 |
| Difficulty | Moderate (uniform) | Easy to moderate | High | Easy to competition-level |
| Languages | Python only | Python only | Python only | Python only |
| Output | Function body | Function/program | Unified diff patch | Full program |
| Dependencies | Standard library only | Standard library only | Project-specific | Varies |
| Pre-built Images | No | No | Yes (Epoch AI) | No |
| Typical Runtime | 1-2 min/task | 1-2 min/task | 5-10 min/task | 2-5 min/task |
| Best For | Quick smoke tests, baseline metrics | Larger-scale code generation evaluation | Real-world software engineering | Broad code generation across difficulty levels |
When to Use HumanEval
Use HumanEval when you need a quick, reliable baseline for code generation capability. It is the standard first benchmark to run when setting up a new MCP server or model configuration. For more comprehensive evaluation, follow up with SWE-bench or APPS.
References¶
- HumanEval Repository
- HumanEval Paper (Evaluating Large Language Models Trained on Code)
- HumanEval Dataset on HuggingFace
- MBPP Benchmark -- related code generation benchmark with more tasks
- SWE-bench -- real-world bug fixing benchmark
- Benchmarks Overview
- Configuration Reference
- CLI Reference