API Reference¶
Comprehensive reference documentation for the mcpbr Python API. Use these modules to programmatically configure, execute, and analyze MCP server benchmarks.
Quick Start¶
The fastest way to use mcpbr programmatically is through the SDK module:
```python
from mcpbr import MCPBenchmark, list_benchmarks, list_models, get_version

# Check available benchmarks and models
print(get_version())  # e.g., "0.8.0"
for b in list_benchmarks():
    print(b["name"], b["class"])

# Configure and validate a benchmark
bench = MCPBenchmark({
    "mcp_server": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
    },
    "benchmark": "humaneval",
    "model": "sonnet",
    "sample_size": 10,
})

is_valid, errors = bench.validate()
if is_valid:
    plan = bench.dry_run()
    print(plan)
```
SDK vs Harness
The SDK (mcpbr.sdk) provides a high-level, user-friendly interface for configuration and validation. For full evaluation execution with Docker environments, use the Harness (mcpbr.harness.run_evaluation) directly.
Core Modules¶
mcpbr is organized into several focused modules. Click through to each sub-page for detailed API documentation with examples.
| Module | Description | Key Classes / Functions |
|---|---|---|
| SDK | High-level Python interface | MCPBenchmark, BenchmarkResult, list_benchmarks(), list_models() |
| Configuration | Config models and YAML loading | HarnessConfig, MCPServerConfig, AzureConfig, load_config() |
| Analytics | Statistical analysis and tracking | ResultsDatabase, ComparisonEngine, RegressionDetector, ABTest |
| Reports | Report generation in multiple formats | HTMLReportGenerator, EnhancedMarkdownGenerator, PDFReportGenerator |
| Benchmarks | Benchmark protocol and extensions | Benchmark protocol, BenchmarkTask, create_benchmark() |
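For orientation, the sketch below shows one plausible set of imports for the names in this table. The submodule paths (`mcpbr.sdk`, `mcpbr.config`, and so on) follow the architecture tree in the next section and should be treated as assumptions if your installed version differs.

```python
# Plausible import paths for the key names above; the exact submodule
# layout is an assumption based on the architecture tree below.
from mcpbr.sdk import MCPBenchmark, BenchmarkResult, list_benchmarks, list_models
from mcpbr.config import HarnessConfig, MCPServerConfig, AzureConfig, load_config
from mcpbr.analytics import ResultsDatabase, ComparisonEngine, RegressionDetector, ABTest
from mcpbr.reports import HTMLReportGenerator, EnhancedMarkdownGenerator, PDFReportGenerator
from mcpbr.benchmarks import Benchmark, BenchmarkTask, create_benchmark
```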
Architecture Overview¶
```text
mcpbr/
|-- sdk.py                    # Public Python SDK (MCPBenchmark, list_*)
|-- config.py                 # Configuration models (HarnessConfig, MCPServerConfig)
|-- harness.py                # Evaluation orchestration (run_evaluation)
|-- models.py                 # Model registry (SUPPORTED_MODELS)
|-- benchmarks/
|   |-- base.py               # Benchmark protocol and BenchmarkTask
|   |-- swebench.py           # SWE-bench implementation
|   |-- humaneval.py          # HumanEval implementation
|   +-- ...                   # 27+ benchmark implementations
|-- analytics/
|   |-- database.py           # SQLite results storage
|   |-- statistical.py        # Hypothesis testing
|   |-- comparison.py         # Multi-model comparison
|   |-- regression_detector.py  # Performance regression detection
|   |-- ab_testing.py         # A/B testing framework
|   |-- leaderboard.py        # Rankings generation
|   |-- metrics.py            # Custom metrics registry
|   |-- trends.py             # Time-series trends
|   |-- anomaly.py            # Outlier detection
|   |-- correlation.py        # Metric correlations
|   |-- error_analysis.py     # Error clustering
|   +-- difficulty.py         # Task difficulty scoring
+-- reports/
    |-- html_report.py        # Interactive HTML reports
    |-- enhanced_markdown.py  # GitHub-flavored markdown
    +-- pdf_report.py         # Print-friendly PDF reports
```
Harness API¶
The harness module orchestrates the full evaluation pipeline, including task loading, Docker environment management, agent execution, and result aggregation.
run_evaluation¶
run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None, state_tracker=None, from_task=None, incremental_save_path=None, mcp_logs_dir=None) async ¶
Run the full evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | HarnessConfig | Harness configuration. | required |
| run_mcp | bool | Whether to run MCP evaluation. | True |
| run_baseline | bool | Whether to run baseline evaluation. | True |
| verbose | bool | Enable verbose output. | False |
| verbosity | int | Verbosity level (0=silent, 1=summary, 2=detailed). | 1 |
| log_file | TextIO \| None | Optional file handle for writing raw JSON logs. | None |
| log_dir | Path \| None | Optional directory for per-instance JSON log files. | None |
| task_ids | list[str] \| None | Specific task IDs to run (None for all). | None |
| state_tracker | Any \| None | Optional state tracker for incremental evaluation. | None |
| from_task | str \| None | Optional task ID to resume from. | None |
| incremental_save_path | Path \| None | Optional path to save results incrementally for crash recovery. | None |
| mcp_logs_dir | Path \| None | Directory for MCP server logs. | None |
Returns:
| Type | Description |
|---|---|
| EvaluationResults | EvaluationResults with all results. |
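Because run_evaluation is a coroutine, drive it with asyncio. A minimal sketch, assuming load_config() accepts a path to a YAML config as described on the Configuration page; the config filename and task IDs shown are hypothetical:

```python
import asyncio

from mcpbr.config import load_config
from mcpbr.harness import run_evaluation

async def main():
    # Load a YAML config from disk; the filename here is hypothetical.
    config = load_config("mcpbr.yaml")
    # Run only the MCP side, with detailed output, for two example task IDs.
    return await run_evaluation(
        config,
        run_mcp=True,
        run_baseline=False,
        verbosity=2,
        task_ids=["HumanEval/0", "HumanEval/1"],  # hypothetical IDs
    )

results = asyncio.run(main())
```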
EvaluationResults¶
EvaluationResults dataclass ¶
Complete evaluation results.
TaskResult¶
TaskResult dataclass ¶
Result for a single task.
comparison_mode property ¶
Check if this result is from comparison mode.
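Continuing the harness sketch above, results can be inspected per task. Aside from comparison_mode, the attribute names below (task_results, task_id) are illustrative assumptions; check the dataclass definitions for the actual fields.

```python
# Illustrative only: apart from comparison_mode, the attribute names
# here (task_results, task_id) are assumptions about the dataclasses.
for task in results.task_results:  # hypothetical attribute name
    mode = "comparison" if task.comparison_mode else "single"
    print(task.task_id, mode)      # task_id is hypothetical
```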
Models¶
ModelInfo¶
ModelInfo dataclass ¶
Information about a supported model.
Model Functions¶
list_supported_models() ¶
Get a list of all supported models.
Returns:
| Type | Description |
|---|---|
| list[ModelInfo] | List of ModelInfo objects. |
get_model_info(model_id) ¶
Get information about a model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID. | required |
Returns:
| Type | Description |
|---|---|
| ModelInfo \| None | ModelInfo if found, None otherwise. |
is_model_supported(model_id) ¶
Check if a model is in the supported list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the model is supported. |
validate_model(model_id) ¶
Validate a model ID and return a helpful error message if invalid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID to validate. | required |
Returns:
| Type | Description |
|---|---|
| tuple[bool, str] | Tuple of (is_valid, error_message). |
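The snippet below exercises all four functions together. It assumes they are importable from mcpbr.models alongside DEFAULT_MODEL; the invalid ID passed to validate_model is deliberately made up.

```python
# Assumes these functions live in mcpbr.models, next to DEFAULT_MODEL.
from mcpbr.models import (
    list_supported_models,
    get_model_info,
    is_model_supported,
    validate_model,
)

# Enumerate every registered model.
for info in list_supported_models():
    print(info)

# Look up a single model; returns None for unknown IDs.
print(get_model_info("sonnet"))

# Boolean membership check.
print(is_model_supported("sonnet"))  # True

# Validation with a human-readable error message.
ok, err = validate_model("not-a-real-model")  # deliberately invalid ID
if not ok:
    print(err)
```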
Constants¶
Default Values¶
```python
from mcpbr.models import DEFAULT_MODEL
from mcpbr.config import VALID_PROVIDERS, VALID_HARNESSES, VALID_BENCHMARKS

print(DEFAULT_MODEL)     # "sonnet"
print(VALID_PROVIDERS)   # ("anthropic", "openai", "gemini", "qwen")
print(VALID_HARNESSES)   # ("claude-code",)
print(VALID_BENCHMARKS)  # 29 benchmark identifiers
```
Supported Models¶
| Model ID | Provider | Display Name | Context Window |
|---|---|---|---|
| claude-opus-4-5-20251101 | Anthropic | Claude Opus 4.5 | 200,000 |
| claude-sonnet-4-5-20250929 | Anthropic | Claude Sonnet 4.5 | 200,000 |
| claude-haiku-4-5-20251001 | Anthropic | Claude Haiku 4.5 | 200,000 |
| sonnet | Anthropic | Claude Sonnet (alias) | 200,000 |
| opus | Anthropic | Claude Opus (alias) | 200,000 |
| haiku | Anthropic | Claude Haiku (alias) | 200,000 |
| gpt-4o | OpenAI | GPT-4o | 128,000 |
| gpt-4-turbo | OpenAI | GPT-4 Turbo | 128,000 |
| gpt-4o-mini | OpenAI | GPT-4o Mini | 128,000 |
| gemini-2.0-flash | Google | Gemini 2.0 Flash | 1,048,576 |
| gemini-1.5-pro | Google | Gemini 1.5 Pro | 2,097,152 |
| gemini-1.5-flash | Google | Gemini 1.5 Flash | 1,048,576 |
| qwen-plus | Alibaba | Qwen Plus | 131,072 |
| qwen-turbo | Alibaba | Qwen Turbo | 131,072 |
| qwen-max | Alibaba | Qwen Max | 131,072 |
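The registry can also be filtered programmatically. In this sketch the context-window attribute name on ModelInfo is an assumption based on the table's Context Window column, so it is accessed defensively:

```python
# Hypothetical: ModelInfo's context-window attribute name is assumed from
# the table above; verify against the dataclass before relying on it.
from mcpbr.models import list_supported_models

large_context = [
    m for m in list_supported_models()
    if getattr(m, "context_window", 0) >= 1_000_000
]
```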