BigCodeBench¶
| Property | Value |
|---|---|
| Benchmark ID | bigcodebench |
| Dataset | bigcode/bigcodebench |
| Tasks | 1,140 coding tasks |
| Evaluation | Run provided test cases against generated code |
| Output Type | Test pass/fail |
| Timeout | 180-300s recommended |
Overview¶
BigCodeBench is a benchmark designed to evaluate code generation capabilities on practical, real-world programming tasks. Unlike algorithmic benchmarks that focus on standalone functions, BigCodeBench tasks require composing multiple function calls from 139 different libraries across 7 major domains. This makes it an excellent test of whether an AI agent can effectively leverage library APIs in realistic software development scenarios.
The benchmark includes 1,140 tasks that span a diverse range of practical programming activities:
- Data Analysis: Tasks using Pandas, NumPy, SciPy for data manipulation and statistical analysis.
- Machine Learning: Tasks involving scikit-learn, TensorFlow, PyTorch for model building and evaluation.
- Web Development: Tasks using Flask, Django, requests for building and interacting with web services.
- Data Visualization: Tasks requiring Matplotlib, Seaborn, Plotly for creating charts and plots.
- File Processing: Tasks involving CSV, JSON, XML parsing and file system operations.
- Networking: Tasks using socket, HTTP libraries, and API interactions.
- System Programming: Tasks involving OS operations, subprocess management, and system utilities.
Each task provides either an instruction prompt (describing what to implement) or a completion prompt (providing partial code to complete), along with test cases that validate the implementation.
Task Structure¶
Each BigCodeBench task includes the following components:
- Task ID: A unique identifier for the task (e.g., `BigCodeBench/0`).
- Instruct Prompt: A natural language description of the function to implement, including its purpose, parameters, and expected behavior.
- Complete Prompt: An alternative prompt format providing the function signature and partial implementation for completion.
- Test Code: Python test cases that validate the generated implementation against expected behavior.
- Libraries: A list of required libraries that the solution must use (e.g., `["pandas", "numpy", "matplotlib"]`).
- Domain: The broad domain category the task belongs to.
- Instance ID: An auto-generated identifier in the format `bigcodebench_{task_id}`.
The agent receives the instruction or completion prompt along with the list of required libraries and must produce a complete implementation that passes all test cases.
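The snippet below is a minimal sketch of how a task record can be inspected with the Hugging Face `datasets` library. The split handling and field names (`task_id`, `instruct_prompt`, `test`, `libs`) reflect the published `bigcode/bigcodebench` dataset but may differ between releases, so treat them as illustrative rather than authoritative.

```python
# Sketch: inspect one BigCodeBench task record from the Hugging Face Hub.
# Field names and split handling are assumptions based on the public dataset
# card and may vary between dataset versions.
from datasets import load_dataset

# Splits in the published dataset are versioned; loading without `split=`
# returns a DatasetDict keyed by those version names.
ds_dict = load_dataset("bigcode/bigcodebench")
split_name = sorted(ds_dict.keys())[-1]  # pick one split as an example
task = ds_dict[split_name][0]

print(task["task_id"])          # e.g. "BigCodeBench/0"
print(task["libs"])             # libraries the solution is expected to use
print(task["instruct_prompt"])  # natural-language task description
print(task["test"])             # unittest code run against the generated solution
```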
Running the Benchmark¶
```bash
# Run BigCodeBench with default settings
mcpbr run -c config.yaml --benchmark bigcodebench

# Run a sample of 20 tasks
mcpbr run -c config.yaml --benchmark bigcodebench -n 20

# Run a specific task
mcpbr run -c config.yaml --benchmark bigcodebench -t BigCodeBench/0

# Filter by domain
mcpbr run -c config.yaml --benchmark bigcodebench --filter-category "data analysis"

# Filter by required library
mcpbr run -c config.yaml --benchmark bigcodebench --filter-tags pandas

# Filter tasks requiring both pandas and numpy
mcpbr run -c config.yaml --benchmark bigcodebench \
  --filter-tags pandas --filter-tags numpy
```
Evaluation Methodology¶
BigCodeBench evaluation combines the generated solution with provided test code:
- Solution Generation: The agent produces a Python implementation based on the task prompt.
- Test Assembly: The generated solution is concatenated with the task's test code to create a single executable test file (`test_solution.py`).
- Test Execution: The combined file is executed using Python 3 inside the Docker container with a 60-second timeout.
- Pass/Fail Determination: The task is marked as resolved if the test execution completes with exit code 0 (all assertions pass). Any assertion error, import error, or runtime exception results in a failure.
- Result Reporting: Results include the resolution status, exit code, and captured stdout/stderr for debugging.
Since BigCodeBench tasks require specific libraries, the Docker environment must have the relevant Python packages installed. The evaluation captures both stdout and stderr (truncated to 1,000 characters) to help diagnose failures.
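The sketch below mirrors the assembly and execution steps in plain Python. It is illustrative only: the actual harness runs this inside the task's Docker container, but the file name, timeout, and output truncation follow the description in this section, and the `evaluate` helper and its parameters are invented for the example.

```python
# Simplified sketch of the evaluation flow: concatenate the generated
# solution with the task's test code, then run the combined file with a
# per-test timeout. Not the real mcpbr implementation.
import subprocess
from pathlib import Path


def evaluate(solution_code: str, test_code: str, workdir: Path) -> dict:
    # Assemble solution + tests into a single executable test file.
    test_file = workdir / "test_solution.py"
    test_file.write_text(solution_code + "\n\n" + test_code)

    try:
        proc = subprocess.run(
            ["python3", str(test_file)],
            capture_output=True,
            text=True,
            timeout=60,  # per-test timeout described above
        )
    except subprocess.TimeoutExpired:
        return {"resolved": False, "error": "Test execution timed out"}

    return {
        "resolved": proc.returncode == 0,  # exit code 0 => all assertions pass
        "exit_code": proc.returncode,
        "stdout": proc.stdout[:1000],      # truncated for reporting
        "stderr": proc.stderr[:1000],
    }
```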
Example Output¶
Successful Resolution¶
```json
{
  "instance_id": "bigcodebench_BigCodeBench/42",
  "resolved": true,
  "exit_code": 0,
  "stdout": "...",
  "stderr": ""
}
```
Failed Resolution¶
```json
{
  "instance_id": "bigcodebench_BigCodeBench/99",
  "resolved": false,
  "exit_code": 1,
  "stdout": "",
  "stderr": "AssertionError: Expected DataFrame with 3 columns, got 2"
}
```
Missing Test Code¶
```json
{
  "instance_id": "bigcodebench_BigCodeBench/500",
  "resolved": false,
  "error": "No test code provided"
}
```
Troubleshooting¶
**ImportError for required libraries:** BigCodeBench tasks require specific Python libraries (e.g., pandas, numpy, scikit-learn). If the Docker environment does not have these packages installed, tests will fail with `ImportError`. Ensure your Docker image includes the common data science Python stack, or configure a base image that pre-installs these dependencies.
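One way to catch this before a full run is a quick preflight check inside the container. The script below is a hypothetical helper, not part of mcpbr, and the package list is only an example.

```python
# Preflight check (hypothetical helper): verify that commonly required
# packages are importable in the Docker image before running an evaluation.
import importlib.util

REQUIRED = ["pandas", "numpy", "scipy", "sklearn", "matplotlib", "requests"]

missing = [pkg for pkg in REQUIRED if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are available.")
```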
**Test assertions fail despite correct logic:** Some BigCodeBench tests check specific output formats, column names, or data types. The agent must follow the prompt specifications exactly. For example, a task may require returning a Pandas DataFrame with specific column names, and returning a dictionary instead will fail the assertion.
**Timeout during test execution:** The test execution has a 60-second timeout. Tasks involving large dataset generation, model training, or complex plotting may approach this limit. If you see frequent timeouts, consider increasing the per-test timeout in the evaluation configuration.
**Agent does not use required libraries:** BigCodeBench tasks explicitly list required libraries. If the agent implements the solution using different libraries or pure Python, it may technically work but fail specific test assertions that check for library-specific behavior (e.g., verifying the return type is a Pandas DataFrame rather than a list).
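The last two issues usually trace back to tests that assert exact return types, column names, or data types. The fragment below is a hypothetical test in that style (`task_func` and the expected columns are invented for illustration, not taken from the benchmark); a solution returning equivalent data as a dict or list would still fail it.

```python
# Hypothetical test fragment illustrating a format-sensitive assertion.
# `task_func` and the column names are invented for illustration; real
# BigCodeBench tests follow each task's own specification.
import unittest

import pandas as pd

from solution import task_func  # the generated implementation under test


class TestOutputFormat(unittest.TestCase):
    def test_returns_dataframe_with_expected_columns(self):
        result = task_func()
        # A dict or list holding the same values would fail both checks.
        self.assertIsInstance(result, pd.DataFrame)
        self.assertListEqual(list(result.columns), ["name", "score", "rank"])


if __name__ == "__main__":
    unittest.main()
```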
Best Practices¶
- Start with a small sample (`-n 10`) to verify that the Docker environment has the required libraries installed.
- Filter by domain using `filter_category` to focus evaluations on specific areas like data analysis or web development.
- Use `filter_tags` for library-specific evaluation: Test your MCP server's effectiveness with specific library ecosystems (e.g., `--filter-tags pandas` for data manipulation tasks).
- Ensure library availability: The Docker environment should include common Python packages. Consider using a data science base image or installing dependencies in advance.
- Review stderr on failures: The captured stderr output often contains the exact assertion error, making it straightforward to diagnose why a solution failed.
- Compare instruct vs. complete modes: BigCodeBench provides both instruction-style and completion-style prompts. Evaluating with both can reveal different aspects of your agent's capabilities.