Benchmarks¶
mcpbr supports a comprehensive suite of benchmarks for evaluating MCP servers and AI agent capabilities. Each benchmark targets different skills - from bug fixing and code generation to math reasoning, tool use, and security exploit generation.
Quick Start¶
# List all available benchmarks
mcpbr benchmarks
# Run a specific benchmark
mcpbr run -c config.yaml --benchmark humaneval -n 20
# Run default benchmark (SWE-bench Verified)
mcpbr run -c config.yaml
All Benchmarks¶
| Benchmark | ID | Tasks | Category | Evaluation | Docs |
|---|---|---|---|---|---|
| SWE-bench Verified | swe-bench-verified | 500 | Software Engineering | Test suite pass/fail | Details |
| SWE-bench Lite | swe-bench-lite | 300 | Software Engineering | Test suite pass/fail | Details |
| SWE-bench Full | swe-bench-full | 2,294 | Software Engineering | Test suite pass/fail | Details |
| APPS | apps | 10,000 | Software Engineering | stdin/stdout tests | Details |
| CodeContests | codecontests | Varies | Software Engineering | Test case comparison | Details |
| BigCodeBench | bigcodebench | 1,140 | Software Engineering | Test pass/fail | Details |
| LeetCode | leetcode | Varies | Software Engineering | Code execution | Details |
| CoderEval | codereval | Varies | Software Engineering | Language-specific tests | Details |
| Aider Polyglot | aider-polyglot | Varies | Software Engineering | Language-specific tests | Details |
| HumanEval | humaneval | 164 | Code Generation | Unit tests | Details |
| MBPP | mbpp | ~1,000 | Code Generation | Test pass/fail | Details |
| GSM8K | gsm8k | 1,319 | Math & Reasoning | Numeric answer matching | Details |
| MATH | math | 12,500 | Math & Reasoning | LaTeX answer extraction | Details |
| BigBench-Hard | bigbench-hard | 27 subtasks | Math & Reasoning | Exact match | Details |
| TruthfulQA | truthfulqa | ~800 | Knowledge & QA | Substring matching | Details |
| HellaSwag | hellaswag | Varies | Knowledge & QA | Option selection | Details |
| ARC | arc | 7,787 | Knowledge & QA | Multiple choice | Details |
| GAIA | gaia | Varies | Knowledge & QA | Exact match | Details |
| MCPToolBench++ | mcptoolbench | Varies | Tool Use & Agents | Tool accuracy metrics | Details |
| ToolBench | toolbench | Varies | Tool Use & Agents | Tool call comparison | Details |
| AgentBench | agentbench | Varies | Tool Use & Agents | String matching | Details |
| WebArena | webarena | Varies | Tool Use & Agents | Reference matching | Details |
| TerminalBench | terminalbench | Varies | Tool Use & Agents | Validation command | Details |
| InterCode | intercode | Varies | Tool Use & Agents | Output comparison | Details |
| MLAgentBench | mlagentbench | Varies | ML Research | Score comparison | Details |
| RepoQA | repoqa | Varies | Code Understanding | Function name match | Details |
| CyberGym | cybergym | Varies | Security | Crash detection | Details |
Benchmarks by Category¶
Software Engineering¶
Benchmarks that test an agent's ability to work with real codebases, fix bugs, and solve programming challenges.
| Benchmark | Focus | Difficulty | Best For |
|---|---|---|---|
| SWE-bench | Real GitHub bug fixes | High | MCP server evaluation, production benchmarking |
| APPS | Coding problems (intro → competition) | Low → High | Broad code generation assessment |
| CodeContests | Competitive programming | High | Algorithmic reasoning evaluation |
| BigCodeBench | Multi-library function composition | Medium | Real-world API usage testing |
| LeetCode | Algorithmic problems | Low → High | Data structure and algorithm evaluation |
| CoderEval | Code generation in project context | Medium | Contextual code generation |
| Aider Polyglot | Multi-language code editing | Medium | Cross-language editing capability |
Code Generation¶
Focused benchmarks for evaluating pure code generation from specifications.
| Benchmark | Focus | Best For |
|---|---|---|
| HumanEval | Python function completion | Quick smoke tests, baseline metrics |
| MBPP | Entry-level Python problems | Entry-level code generation |
Math & Reasoning¶
Benchmarks testing mathematical reasoning and multi-step problem solving.
| Benchmark | Focus | Best For |
|---|---|---|
| GSM8K | Grade-school math word problems | Chain-of-thought evaluation |
| MATH | Competition mathematics (AMC/AIME) | Advanced math reasoning |
| BigBench-Hard | 27 hard reasoning tasks | Broad reasoning assessment |
Knowledge & QA¶
Benchmarks evaluating knowledge, truthfulness, and question answering.
| Benchmark | Focus | Best For |
|---|---|---|
| TruthfulQA | Truthfulness and avoiding misconceptions | Truthfulness evaluation |
| HellaSwag | Commonsense reasoning | Commonsense evaluation |
| ARC | Grade-school science questions | Science reasoning |
| GAIA | General AI assistant tasks | Multi-modal, tool-use evaluation |
Tool Use & Agents¶
Benchmarks specifically testing tool use, API interaction, and agentic capabilities.
| Benchmark | Focus | Best For |
|---|---|---|
| MCPToolBench++ | MCP tool discovery and invocation | MCP server evaluation |
| ToolBench | Real-world API tool use | API tool selection testing |
| AgentBench | Multi-environment agent tasks | Broad agent evaluation |
| WebArena | Web browsing and interaction | Web automation testing |
| TerminalBench | Terminal/shell task completion | CLI and shell evaluation |
| InterCode | Interactive code environments | Multi-turn code interaction |
ML Research¶
| Benchmark | Focus | Best For |
|---|---|---|
| MLAgentBench | ML research tasks (Kaggle) | ML pipeline evaluation |
Code Understanding¶
| Benchmark | Focus | Best For |
|---|---|---|
| RepoQA | Long-context code understanding | Repository comprehension |
Security¶
| Benchmark | Focus | Best For |
|---|---|---|
| CyberGym | Vulnerability exploitation (PoC) | Security analysis evaluation |
Comparing Benchmarks¶
| Aspect | SWE-bench | HumanEval | GSM8K | CyberGym | MCPToolBench++ |
|---|---|---|---|---|---|
| Goal | Fix bugs | Generate code | Solve math | Exploit vulnerabilities | Use MCP tools |
| Output | Patch (diff) | Function code | Numeric answer | PoC code | Tool calls |
| Languages | Python | Python | N/A | C/C++ | N/A |
| Evaluation | Test suite | Unit tests | Answer matching | Crash detection | Tool accuracy |
| Pre-built Images | Yes | No | No | No | No |
| Typical Timeout | 300-600s | 60-180s | 60-180s | 600-900s | 180-300s |
| Task Count | 300-2,294 | 164 | 1,319 | Varies | Varies |
| Difficulty Levels | N/A | N/A | N/A | 0-3 | easy/hard |
| Best For | MCP evaluation | Quick tests | Reasoning | Security research | Tool use testing |
Benchmark Abstraction¶
mcpbr uses a Protocol-based abstraction that makes it easy to add new benchmarks:
from mcpbr.benchmarks import Benchmark

class MyBenchmark:
    """Custom benchmark implementation."""

    name = "my-benchmark"

    def load_tasks(self, sample_size, task_ids, level):
        """Load tasks from dataset."""
        ...

    def normalize_task(self, task):
        """Convert to normalized BenchmarkTask format."""
        ...

    async def create_environment(self, task, docker_manager):
        """Create isolated Docker environment."""
        ...

    async def evaluate(self, env, task, solution):
        """Evaluate the solution."""
        ...

    def get_prebuilt_image(self, task):
        """Return pre-built image name or None."""
        ...

    def get_prompt_template(self):
        """Return agent prompt template."""
        ...
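Note that MyBenchmark above does not subclass Benchmark: the interface is a structural Protocol, so any class that provides the expected methods is accepted. The sketch below illustrates the idea with a trimmed, hypothetical protocol; the real Benchmark protocol ships in mcpbr.benchmarks and may differ in its exact methods and signatures.

```python
# Illustrative only: a trimmed structural protocol in the spirit of
# mcpbr's Benchmark. The real protocol may differ in methods and types.
from typing import Any, Protocol


class BenchmarkLike(Protocol):
    """Any class exposing these members satisfies the protocol."""

    name: str

    def load_tasks(self, sample_size: Any, task_ids: Any, level: Any) -> list[Any]: ...

    def get_prompt_template(self) -> str: ...


def describe(benchmark: BenchmarkLike) -> str:
    # A type checker accepts MyBenchmark here even though it never
    # inherits from BenchmarkLike -- matching structure is enough.
    return f"{benchmark.name}: {benchmark.get_prompt_template()[:40]}..."
```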
Each benchmark implements:
- load_tasks(): Load tasks from HuggingFace or other sources
- normalize_task(): Convert to common format
- create_environment(): Set up Docker container with dependencies
- evaluate(): Run benchmark-specific evaluation
- get_prebuilt_image(): Return pre-built image name if available
- get_prompt_template(): Provide task-appropriate instructions
See src/mcpbr/benchmarks/ for reference implementations.
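As a rough illustration of how load_tasks() and normalize_task() might be filled in, the sketch below pulls rows from a hypothetical HuggingFace dataset. The dataset name, its column names, and the dict used as a stand-in for BenchmarkTask are assumptions for illustration; mirror a reference implementation in src/mcpbr/benchmarks/ for the actual field names and signatures.

```python
# Rough sketch of the data-loading half of a custom benchmark.
# Dataset name, column names, and the normalized fields are illustrative
# assumptions -- see src/mcpbr/benchmarks/ for the real interfaces.
from datasets import load_dataset  # HuggingFace `datasets` package


class MyBenchmark:
    name = "my-benchmark"

    def load_tasks(self, sample_size=None, task_ids=None, level=None):
        """Load raw rows from a (hypothetical) HuggingFace dataset."""
        rows = list(load_dataset("my-org/my-benchmark", split="test"))
        if task_ids:
            wanted = set(task_ids)
            rows = [row for row in rows if row["id"] in wanted]  # assumed "id" column
        return rows[:sample_size] if sample_size else rows

    def normalize_task(self, task):
        """Map dataset columns onto the common task format."""
        # A real implementation returns a BenchmarkTask; its fields are not
        # shown in this guide, so a plain dict stands in here.
        return {
            "task_id": task["id"],
            "prompt": task["description"],
            "metadata": {"difficulty": task.get("difficulty", "unknown")},
        }
```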
Listing Benchmarks¶
Use the CLI to see available benchmarks:
$ mcpbr benchmarks
Available Benchmarks
┌────────────────┬──────────────────────────────────────────────────────┬─────────────────────────┐
│ Benchmark │ Description │ Output Type │
├────────────────┼──────────────────────────────────────────────────────┼─────────────────────────┤
│ swe-bench │ Software bug fixes in GitHub repositories │ Patch (unified diff) │
│ cybergym │ Security vulnerability exploitation (PoC generation) │ Exploit code │
│ humaneval │ Python function completion (code generation) │ Function code │
│ mcptoolbench │ MCP tool use evaluation │ Tool call accuracy │
│ gsm8k │ Grade-school math reasoning │ Numeric answer │
│ ... │ ... │ ... │
└────────────────┴──────────────────────────────────────────────────────┴─────────────────────────┘
Use --benchmark flag with 'run' command to select a benchmark
Example: mcpbr run -c config.yaml --benchmark humaneval
Filtering Tasks¶
All benchmarks support task filtering to select specific subsets:
# Filter by difficulty level
filter_difficulty:
- "easy"
- "medium"
# Filter by category
filter_category:
- "django"
- "scikit-learn"
# Filter by tags
filter_tags:
- "security"
See Configuration for full filtering documentation.
Best Practices¶
Choosing a Benchmark¶
- Testing MCP servers: Start with SWE-bench or MCPToolBench++
- Quick smoke tests: Use HumanEval (164 tasks, fast)
- Math reasoning: Use GSM8K for grade-school problems or MATH for competition-level mathematics
- Security research: Use CyberGym with appropriate difficulty level
- Multi-language: Use Aider Polyglot or CoderEval
General Tips¶
- Start small: Run with -n 5 before scaling up
- Use pre-built images: Enabled by default for SWE-bench, much faster
- Set appropriate timeouts: See individual benchmark pages for recommendations
- Save results: Always use -o results.json to preserve data
- Compare benchmarks: Run multiple benchmarks to get a comprehensive picture
Related Links¶
- Configuration Guide - Full configuration reference
- CLI Reference - All command options
- Best Practices - Tips for effective evaluation
- Architecture - How mcpbr works internally