Benchmarks¶
mcpbr supports a comprehensive suite of benchmarks for evaluating MCP servers and AI agent capabilities. Each benchmark targets different skills - from bug fixing and code generation to math reasoning, tool use, and security exploit generation.
Quick Start¶
# List all available benchmarks
mcpbr benchmarks
# Run a specific benchmark
mcpbr run -c config.yaml --benchmark humaneval -n 20
# Run default benchmark (SWE-bench Verified)
mcpbr run -c config.yaml
All Benchmarks¶
| Benchmark | ID | Tasks | Category | Evaluation | Docs |
|---|---|---|---|---|---|
| SWE-bench Verified | swe-bench-verified | 500 | Software Engineering | Test suite pass/fail | Details |
| SWE-bench Lite | swe-bench-lite | 300 | Software Engineering | Test suite pass/fail | Details |
| SWE-bench Full | swe-bench-full | 2,294 | Software Engineering | Test suite pass/fail | Details |
| APPS | apps | 10,000 | Software Engineering | stdin/stdout tests | Details |
| CodeContests | codecontests | Varies | Software Engineering | Test case comparison | Details |
| BigCodeBench | bigcodebench | 1,140 | Software Engineering | Test pass/fail | Details |
| LeetCode | leetcode | Varies | Software Engineering | Code execution | Details |
| CoderEval | codereval | Varies | Software Engineering | Language-specific tests | Details |
| Aider Polyglot | aider-polyglot | Varies | Software Engineering | Language-specific tests | Details |
| HumanEval | humaneval | 164 | Code Generation | Unit tests | Details |
| MBPP | mbpp | ~1,000 | Code Generation | Test pass/fail | Details |
| GSM8K | gsm8k | 1,319 | Math & Reasoning | Numeric answer matching | Details |
| MATH | math | 12,500 | Math & Reasoning | LaTeX answer extraction | Details |
| BigBench-Hard | bigbench-hard | 27 subtasks | Math & Reasoning | Exact match | Details |
| TruthfulQA | truthfulqa | ~800 | Knowledge & QA | Substring matching | Details |
| HellaSwag | hellaswag | Varies | Knowledge & QA | Option selection | Details |
| ARC | arc | 7,787 | Knowledge & QA | Multiple choice | Details |
| GAIA | gaia | Varies | Knowledge & QA | Exact match | Details |
| MCPToolBench++ | mcptoolbench | Varies | Tool Use & Agents | Tool accuracy metrics | Details |
| ToolBench | toolbench | Varies | Tool Use & Agents | Tool call comparison | Details |
| AgentBench | agentbench | Varies | Tool Use & Agents | String matching | Details |
| WebArena | webarena | Varies | Tool Use & Agents | Reference matching | Details |
| TerminalBench | terminalbench | Varies | Tool Use & Agents | Validation command | Details |
| InterCode | intercode | Varies | Tool Use & Agents | Output comparison | Details |
| MLAgentBench | mlagentbench | Varies | ML Research | Score comparison | Details |
| RepoQA | repoqa | Varies | Code Understanding | Function name match | Details |
| CyberGym | cybergym | Varies | Security | Crash detection | Details |
Benchmarks by Category¶
Software Engineering¶
Benchmarks that test an agent's ability to work with real codebases, fix bugs, and solve programming challenges.
| Benchmark | Focus | Difficulty | Best For |
|---|---|---|---|
| SWE-bench | Real GitHub bug fixes | High | MCP server evaluation, production benchmarking |
| APPS | Coding problems (intro → competition) | Low → High | Broad code generation assessment |
| CodeContests | Competitive programming | High | Algorithmic reasoning evaluation |
| BigCodeBench | Multi-library function composition | Medium | Real-world API usage testing |
| LeetCode | Algorithmic problems | Low → High | Data structure and algorithm evaluation |
| CoderEval | Code generation in project context | Medium | Contextual code generation |
| Aider Polyglot | Multi-language code editing | Medium | Cross-language editing capability |
Code Generation¶
Focused benchmarks for evaluating pure code generation from specifications.
| Benchmark | Focus | Best For |
|---|---|---|
| HumanEval | Python function completion | Quick smoke tests, baseline metrics |
| MBPP | Entry-level Python problems | Entry-level code generation |
Math & Reasoning¶
Benchmarks testing mathematical reasoning and multi-step problem solving.
| Benchmark | Focus | Best For |
|---|---|---|
| GSM8K | Grade-school math word problems | Chain-of-thought evaluation |
| MATH | Competition mathematics (AMC/AIME) | Advanced math reasoning |
| BigBench-Hard | 27 hard reasoning tasks | Broad reasoning assessment |
Knowledge & QA¶
Benchmarks evaluating knowledge, truthfulness, and question answering.
| Benchmark | Focus | Best For |
|---|---|---|
| TruthfulQA | Truthfulness and avoiding misconceptions | Truthfulness evaluation |
| HellaSwag | Commonsense reasoning | Commonsense evaluation |
| ARC | Grade-school science questions | Science reasoning |
| GAIA | General AI assistant tasks | Multi-modal, tool-use evaluation |
Tool Use & Agents¶
Benchmarks specifically testing tool use, API interaction, and agentic capabilities.
| Benchmark | Focus | Best For |
|---|---|---|
| MCPToolBench++ | MCP tool discovery and invocation | MCP server evaluation |
| ToolBench | Real-world API tool use | API tool selection testing |
| AgentBench | Multi-environment agent tasks | Broad agent evaluation |
| WebArena | Web browsing and interaction | Web automation testing |
| TerminalBench | Terminal/shell task completion | CLI and shell evaluation |
| InterCode | Interactive code environments | Multi-turn code interaction |
ML Research¶
| Benchmark | Focus | Best For |
|---|---|---|
| MLAgentBench | ML research tasks (Kaggle) | ML pipeline evaluation |
Code Understanding¶
| Benchmark | Focus | Best For |
|---|---|---|
| RepoQA | Long-context code understanding | Repository comprehension |
Security¶
| Benchmark | Focus | Best For |
|---|---|---|
| CyberGym | Vulnerability exploitation (PoC) | Security analysis evaluation |
Comparing Benchmarks¶
| Aspect | SWE-bench | HumanEval | GSM8K | CyberGym | MCPToolBench++ |
|---|---|---|---|---|---|
| Goal | Fix bugs | Generate code | Solve math | Exploit vulnerabilities | Use MCP tools |
| Output | Patch (diff) | Function code | Numeric answer | PoC code | Tool calls |
| Languages | Python | Python | N/A | C/C++ | N/A |
| Evaluation | Test suite | Unit tests | Answer matching | Crash detection | Tool accuracy |
| Pre-built Images | Yes | No | No | No | No |
| Typical Timeout | 300-600s | 60-180s | 60-180s | 600-900s | 180-300s |
| Task Count | 300-2,294 | 164 | 1,319 | Varies | Varies |
| Difficulty Levels | N/A | N/A | N/A | 0-3 | easy/hard |
| Best For | MCP evaluation | Quick tests | Reasoning | Security research | Tool use testing |
Benchmark Abstraction¶
mcpbr uses a Protocol-based abstraction that makes it easy to add new benchmarks:
from mcpbr.benchmarks import Benchmark

class MyBenchmark:
    """Custom benchmark implementation."""

    name = "my-benchmark"

    def load_tasks(self, sample_size, task_ids, level):
        """Load tasks from dataset."""
        ...

    def normalize_task(self, task):
        """Convert to normalized BenchmarkTask format."""
        ...

    async def create_environment(self, task, docker_manager):
        """Create isolated Docker environment."""
        ...

    async def evaluate(self, env, task, solution):
        """Evaluate the solution."""
        ...

    def get_prebuilt_image(self, task):
        """Return pre-built image name or None."""
        ...

    def get_prompt_template(self):
        """Return agent prompt template."""
        ...
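Note that MyBenchmark above does not subclass Benchmark: the interface is a structural Protocol, so any class that provides the expected methods is accepted. The sketch below illustrates the idea with a trimmed, hypothetical protocol; the real Benchmark protocol ships in mcpbr.benchmarks and may differ in its exact methods and signatures.

```python
# Illustrative only: a trimmed structural protocol in the spirit of
# mcpbr's Benchmark. The real protocol may differ in methods and types.
from typing import Any, Protocol


class BenchmarkLike(Protocol):
    """Any class exposing these members satisfies the protocol."""

    name: str

    def load_tasks(self, sample_size: Any, task_ids: Any, level: Any) -> list[Any]: ...

    def get_prompt_template(self) -> str: ...


def describe(benchmark: BenchmarkLike) -> str:
    # A type checker accepts MyBenchmark here even though it never
    # inherits from BenchmarkLike -- matching structure is enough.
    return f"{benchmark.name}: {benchmark.get_prompt_template()[:40]}..."
```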
Each benchmark implements:
- load_tasks(): Load tasks from HuggingFace or other sources
- normalize_task(): Convert to common format
- create_environment(): Set up Docker container with dependencies
- evaluate(): Run benchmark-specific evaluation
- get_prebuilt_image(): Return pre-built image name if available
- get_prompt_template(): Provide task-appropriate instructions
See src/mcpbr/benchmarks/ for reference implementations.
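As a rough illustration of how load_tasks() and normalize_task() might be filled in, the sketch below pulls rows from a hypothetical HuggingFace dataset. The dataset name, its column names, and the dict used as a stand-in for BenchmarkTask are assumptions for illustration; mirror a reference implementation in src/mcpbr/benchmarks/ for the actual field names and signatures.

```python
# Rough sketch of the data-loading half of a custom benchmark.
# Dataset name, column names, and the normalized fields are illustrative
# assumptions -- see src/mcpbr/benchmarks/ for the real interfaces.
from datasets import load_dataset  # HuggingFace `datasets` package


class MyBenchmark:
    name = "my-benchmark"

    def load_tasks(self, sample_size=None, task_ids=None, level=None):
        """Load raw rows from a (hypothetical) HuggingFace dataset."""
        rows = list(load_dataset("my-org/my-benchmark", split="test"))
        if task_ids:
            wanted = set(task_ids)
            rows = [row for row in rows if row["id"] in wanted]  # assumed "id" column
        return rows[:sample_size] if sample_size else rows

    def normalize_task(self, task):
        """Map dataset columns onto the common task format."""
        # A real implementation returns a BenchmarkTask; its fields are not
        # shown in this guide, so a plain dict stands in here.
        return {
            "task_id": task["id"],
            "prompt": task["description"],
            "metadata": {"difficulty": task.get("difficulty", "unknown")},
        }
```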
Listing Benchmarks¶
Use the CLI to see available benchmarks:
$ mcpbr benchmarks
Available Benchmarks
┌────────────────┬──────────────────────────────────────────────────────┬─────────────────────────┐
│ Benchmark │ Description │ Output Type │
├────────────────┼──────────────────────────────────────────────────────┼─────────────────────────┤
│ swe-bench │ Software bug fixes in GitHub repositories │ Patch (unified diff) │
│ cybergym │ Security vulnerability exploitation (PoC generation) │ Exploit code │
│ humaneval │ Python function completion (code generation) │ Function code │
│ mcptoolbench │ MCP tool use evaluation │ Tool call accuracy │
│ gsm8k │ Grade-school math reasoning │ Numeric answer │
│ ... │ ... │ ... │
└────────────────┴──────────────────────────────────────────────────────┴─────────────────────────┘
Use --benchmark flag with 'run' command to select a benchmark
Example: mcpbr run -c config.yaml --benchmark humaneval
Filtering Tasks¶
All benchmarks support task filtering to select specific subsets:
# Filter by difficulty level
filter_difficulty:
- "easy"
- "medium"
# Filter by category
filter_category:
- "django"
- "scikit-learn"
# Filter by tags
filter_tags:
- "security"
See Configuration for full filtering documentation.
Best Practices¶
Choosing a Benchmark¶
- Testing MCP servers: Start with SWE-bench or MCPToolBench++
- Quick smoke tests: Use HumanEval (164 tasks, fast)
- Math reasoning: Use GSM8K for grade-school problems or MATH for competition-level mathematics
- Security research: Use CyberGym with appropriate difficulty level
- Multi-language: Use Aider Polyglot or CoderEval
General Tips¶
- Start small: Run with -n 5 before scaling up
- Use pre-built images: Enabled by default for SWE-bench, much faster
- Set appropriate timeouts: See individual benchmark pages for recommendations
- Save results: Always use -o results.json to preserve data
- Compare benchmarks: Run multiple benchmarks to get a comprehensive picture
Related Links¶
- Configuration Guide - Full configuration reference
- CLI Reference - All command options
- Best Practices - Tips for effective evaluation
- Architecture - How mcpbr works internally