# SWE-bench

## Overview
| Property | Value |
|---|---|
| Benchmark ID | swe-bench-verified, swe-bench-lite, swe-bench-full |
| Dataset | SWE-bench/SWE-bench_Verified, SWE-bench/SWE-bench_Lite, SWE-bench/SWE-bench |
| Tasks | 500 (Verified), 300 (Lite), 2,294 (Full) |
| Evaluation | Apply unified diff patch, run FAIL_TO_PASS and PASS_TO_PASS test suites |
| Output Type | Patch (unified diff) |
| Timeout | 300-600s recommended |
SWE-bench is the gold-standard benchmark for evaluating AI agents on real-world software engineering tasks. Each task is a genuine GitHub issue from a popular open-source Python repository, and the agent must produce a unified diff patch that resolves the bug. The evaluation verifies the fix by running the repository's test suite, checking that previously failing tests now pass while existing passing tests remain unbroken.
SWE-bench is widely used by the research community and industry to measure progress in automated software engineering. mcpbr supports all three official variants:
- `swe-bench-verified` (default): 500 tasks that have been manually validated by human annotators to confirm test correctness. This is the recommended variant for accurate benchmarking.
- `swe-bench-lite`: 300 curated tasks from popular repositories, suitable for quick testing and iteration.
- `swe-bench-full`: The complete dataset of 2,294 tasks for comprehensive evaluation and research purposes.
Pre-built Docker images from Epoch AI are available for most tasks. These images include the repository checked out at the correct commit with all dependencies pre-installed and validated, providing faster and more reproducible evaluations.
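Pre-built image usage is controlled from the configuration file passed with `-c`. The sketch below is illustrative only: the keys `use_prebuilt_images` and `timeout_seconds` are referenced elsewhere on this page, but their exact placement in `config.yaml` is an assumption to check against your mcpbr version.

```yaml
# config.yaml -- minimal sketch; key placement assumed, verify against your mcpbr version
use_prebuilt_images: true   # pull Epoch AI images instead of building each environment locally
timeout_seconds: 300        # per-task evaluation timeout; raise for large repositories
```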
## Task Structure
Each SWE-bench task includes the following components:
- Instance ID: A unique identifier combining the repository and issue number (e.g., `django__django-11099`).
- Problem Statement: The original bug description from the GitHub issue, including reproduction steps and expected behavior.
- Repository: The GitHub repository name (e.g., `django/django`, `scikit-learn/scikit-learn`).
- Base Commit: The specific commit hash where the bug exists, ensuring reproducible evaluation.
- Test Patch: Additional test code that verifies the fix, applied alongside the agent's patch.
- FAIL_TO_PASS: A list of test cases that should fail before the fix and pass after it is applied.
- PASS_TO_PASS: A list of test cases that must continue passing after the fix, ensuring no regressions.
The agent receives the problem statement and access to the repository at the base commit. It must analyze the codebase, identify the root cause, and generate a minimal patch that resolves the issue.
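For orientation, the fields above can be pictured roughly as follows. This is an illustrative sketch: the field names mirror the list above, but the commit hash, issue text, and test identifiers are hypothetical placeholders, not the real contents of the `django__django-11099` instance.

```yaml
# Illustrative SWE-bench task instance (YAML rendering of the fields above).
# Commit hash, issue text, and test identifiers are hypothetical placeholders.
instance_id: django__django-11099
repo: django/django
base_commit: 0123456789abcdef0123456789abcdef01234567   # placeholder 40-character commit hash
problem_statement: |
  <original GitHub issue text: bug description, reproduction steps, expected behavior>
test_patch: |
  <unified diff adding the tests that verify the fix>
FAIL_TO_PASS:
  - validators.tests.TestValidators.test_example_fail_to_pass   # placeholder test id
PASS_TO_PASS:
  - validators.tests.TestValidators.test_example_pass_to_pass   # placeholder test id
```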
## Running the Benchmark
```bash
# Run SWE-bench Verified (default, manually validated tests)
mcpbr run -c config.yaml --benchmark swe-bench-verified

# Run SWE-bench Lite (300 tasks, quick testing)
mcpbr run -c config.yaml --benchmark swe-bench-lite

# Run SWE-bench Full (2,294 tasks, comprehensive)
mcpbr run -c config.yaml --benchmark swe-bench-full

# Run a sample of 20 tasks
mcpbr run -c config.yaml --benchmark swe-bench-verified -n 20

# Run specific tasks by instance ID
mcpbr run -c config.yaml --benchmark swe-bench-verified -t django__django-11099

# Filter by repository
mcpbr run -c config.yaml --benchmark swe-bench-verified --filter-category django

# Filter by multiple repositories
mcpbr run -c config.yaml --benchmark swe-bench-verified \
  --filter-category django --filter-category scikit-learn
```
## Evaluation Methodology
SWE-bench evaluation follows a rigorous multi-step process:
- Patch Generation: The agent analyzes the repository and produces a unified diff patch targeting the bug described in the problem statement.
- Patch Application: The generated patch is applied to the repository at the base commit using standard `git apply` or `patch` utilities.
- Test Patch Application: If the task includes a test patch (additional tests that verify the fix), it is applied on top of the agent's changes.
- FAIL_TO_PASS Verification: The tests listed in FAIL_TO_PASS are executed. All of these tests must now pass, confirming the bug has been fixed.
- PASS_TO_PASS Verification: The tests listed in PASS_TO_PASS are executed. All of these tests must continue to pass, confirming no regressions were introduced.
- Resolution: A task is marked as resolved only if the patch applies cleanly, all FAIL_TO_PASS tests pass, and all PASS_TO_PASS tests remain passing.
The evaluation uses pre-built Docker images when available (`use_prebuilt_images: true`), which include the repository at the correct commit with all Python dependencies installed and validated. This eliminates environment setup variability and produces more reliable results.
## Example Output

### Successful Resolution
```json
{
  "instance_id": "django__django-11099",
  "resolved": true,
  "patch_applied": true,
  "fail_to_pass": {
    "passed": 3,
    "total": 3
  },
  "pass_to_pass": {
    "passed": 47,
    "total": 47
  }
}
```
### Failed Resolution
```json
{
  "instance_id": "scikit-learn__scikit-learn-13779",
  "resolved": false,
  "patch_applied": true,
  "fail_to_pass": {
    "passed": 1,
    "total": 2
  },
  "pass_to_pass": {
    "passed": 45,
    "total": 47
  }
}
```
### Patch Application Failure
```json
{
  "instance_id": "sympy__sympy-18199",
  "resolved": false,
  "patch_applied": false,
  "eval_error": "Patch failed to apply: hunks FAILED -- saving rejects to file"
}
```
## Troubleshooting
**Patch fails to apply cleanly.** The agent's patch may target incorrect line numbers or file paths. Ensure the agent is working with the correct version of the repository. Pre-built images guarantee the repository is at the exact base commit; if building from scratch, verify the checkout succeeded.

**PASS_TO_PASS tests fail after the patch.** The agent introduced a regression. This often happens when the fix is too broad or modifies shared utility functions. Encourage the agent to make minimal, targeted changes by using a focused prompt template.

**Evaluation times out.** SWE-bench tasks involving large repositories or complex test suites may need longer timeouts. Increase `timeout_seconds` to 600 or higher for repositories like Django or Matplotlib. Tasks from smaller repositories like Flask typically complete within 300 seconds.

**Docker image pull fails.** Pre-built images from Epoch AI may not be available for all tasks. If an image pull fails, mcpbr falls back to building the environment from scratch. Set `use_prebuilt_images: false` to always build from scratch, though this is slower and less reliable.
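The last two issues are typically addressed in the configuration file. A minimal sketch of the relevant settings is shown below; only the key names `timeout_seconds` and `use_prebuilt_images` come from this page, and their top-level placement in `config.yaml` is an assumption.

```yaml
# config.yaml adjustments for the two issues above (key placement assumed)
timeout_seconds: 600        # give large repositories (Django, Matplotlib) more time
use_prebuilt_images: false  # skip image pulls entirely and always build from scratch
```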
## Best Practices
- Start with `swe-bench-verified` for the most reliable evaluation results, as all tasks have been manually validated.
- Use pre-built images (`use_prebuilt_images: true`) for faster and more consistent evaluation environments (see the configuration sketch after this list).
- Test with a small sample first (`-n 5` or `-n 10`) before running the full benchmark to verify your configuration works.
- Filter by repository using `filter_category` to focus on specific projects relevant to your MCP server's capabilities.
- Set appropriate timeouts: 300 seconds works for most tasks, but complex repositories like Django may need 600 seconds.
- Monitor token usage: Bug-fixing tasks can require extensive code exploration, so track costs carefully with large sample sizes.
- Use `swe-bench-lite` for iteration: When developing or tuning your MCP server, the Lite variant offers a good balance of coverage and speed.
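Putting these recommendations together, a starting configuration might look like the sketch below. Treat it as illustrative: the key names `use_prebuilt_images`, `timeout_seconds`, and `filter_category` appear elsewhere on this page, but their exact structure in `config.yaml` (including whether `filter_category` accepts a list) is an assumption to verify against your mcpbr version.

```yaml
# Illustrative starting point for SWE-bench Verified runs (schema assumed, verify locally)
use_prebuilt_images: true    # prefer Epoch AI images for reproducible environments
timeout_seconds: 600         # generous ceiling; most tasks finish well under 300 seconds
filter_category:             # optional: restrict to repositories your MCP server targets
  - django
  - scikit-learn
```

Combine this with a small sample (for example, `mcpbr run -c config.yaml --benchmark swe-bench-verified -n 5`) to validate the setup before committing to a full run.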