API Reference

Comprehensive reference documentation for the mcpbr Python API. Use these modules to programmatically configure, execute, and analyze MCP server benchmarks.

Quick Start

The fastest way to use mcpbr programmatically is through the SDK module:

from mcpbr import MCPBenchmark, list_benchmarks, list_models, get_version

# Check available benchmarks and models
print(get_version())  # e.g., "0.8.0"
for b in list_benchmarks():
    print(b["name"], b["class"])

# Configure and validate a benchmark
bench = MCPBenchmark({
    "mcp_server": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
    },
    "benchmark": "humaneval",
    "model": "sonnet",
    "sample_size": 10,
})

is_valid, errors = bench.validate()
if is_valid:
    plan = bench.dry_run()
    print(plan)

SDK vs Harness

The SDK (mcpbr.sdk) provides a high-level, user-friendly interface for configuration and validation. For full evaluation execution with Docker environments, use the Harness (mcpbr.harness.run_evaluation) directly.
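
A minimal sketch of the harness entry point, assuming the configuration is loaded with load_config (whose exact signature is not reproduced on this page) and the coroutine is driven with asyncio:

import asyncio

from mcpbr.config import load_config
from mcpbr.harness import run_evaluation

config = load_config("mcpbr.yaml")             # hypothetical config path
results = asyncio.run(run_evaluation(config))  # returns EvaluationResults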


Core Modules

mcpbr is organized into several focused modules. Click through to each sub-page for detailed API documentation with examples.

| Module | Description | Key Classes / Functions |
|--------|-------------|-------------------------|
| SDK | High-level Python interface | MCPBenchmark, BenchmarkResult, list_benchmarks(), list_models() |
| Configuration | Config models and YAML loading | HarnessConfig, MCPServerConfig, AzureConfig, load_config() |
| Analytics | Statistical analysis and tracking | ResultsDatabase, ComparisonEngine, RegressionDetector, ABTest |
| Reports | Report generation in multiple formats | HTMLReportGenerator, EnhancedMarkdownGenerator, PDFReportGenerator |
| Benchmarks | Benchmark protocol and extensions | Benchmark protocol, BenchmarkTask, create_benchmark() |

Architecture Overview

mcpbr
 |-- sdk.py                  # Public Python SDK (MCPBenchmark, list_*)
 |-- config.py               # Configuration models (HarnessConfig, MCPServerConfig)
 |-- harness.py              # Evaluation orchestration (run_evaluation)
 |-- models.py               # Model registry (SUPPORTED_MODELS)
 |-- benchmarks/
 |   |-- base.py             # Benchmark protocol and BenchmarkTask
 |   |-- swebench.py         # SWE-bench implementation
 |   |-- humaneval.py        # HumanEval implementation
 |   +-- ...                 # 27+ benchmark implementations
 |-- analytics/
 |   |-- database.py         # SQLite results storage
 |   |-- statistical.py      # Hypothesis testing
 |   |-- comparison.py       # Multi-model comparison
 |   |-- regression_detector.py
 |   |-- ab_testing.py       # A/B testing framework
 |   |-- leaderboard.py      # Rankings generation
 |   |-- metrics.py          # Custom metrics registry
 |   |-- trends.py           # Time-series trends
 |   |-- anomaly.py          # Outlier detection
 |   |-- correlation.py      # Metric correlations
 |   |-- error_analysis.py   # Error clustering
 |   +-- difficulty.py       # Task difficulty scoring
 +-- reports/
     |-- html_report.py      # Interactive HTML reports
     |-- enhanced_markdown.py # GitHub-flavored markdown
     +-- pdf_report.py       # Print-friendly PDF reports

Harness API

The harness module orchestrates the full evaluation pipeline, including task loading, Docker environment management, agent execution, and result aggregation.

run_evaluation

run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None, state_tracker=None, from_task=None, incremental_save_path=None, mcp_logs_dir=None) async

Run the full evaluation.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| config | HarnessConfig | Harness configuration. | required |
| run_mcp | bool | Whether to run MCP evaluation. | True |
| run_baseline | bool | Whether to run baseline evaluation. | True |
| verbose | bool | Enable verbose output. | False |
| verbosity | int | Verbosity level (0=silent, 1=summary, 2=detailed). | 1 |
| log_file | TextIO \| None | Optional file handle for writing raw JSON logs. | None |
| log_dir | Path \| None | Optional directory for per-instance JSON log files. | None |
| task_ids | list[str] \| None | Specific task IDs to run (None for all). | None |
| state_tracker | Any \| None | Optional state tracker for incremental evaluation. | None |
| from_task | str \| None | Optional task ID to resume from. | None |
| incremental_save_path | Path \| None | Optional path to save results incrementally for crash recovery. | None |
| mcp_logs_dir | Path \| None | Directory for MCP server logs. | None |

Returns:

| Type | Description |
|------|-------------|
| EvaluationResults | EvaluationResults with all results. |
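
A hedged usage sketch that exercises several of the parameters above; the configuration file and task IDs are placeholders, and load_config's exact signature is assumed:

import asyncio
from pathlib import Path

from mcpbr.config import load_config
from mcpbr.harness import run_evaluation

config = load_config("mcpbr.yaml")  # hypothetical config file

results = asyncio.run(
    run_evaluation(
        config,
        run_baseline=False,                          # skip the baseline pass
        verbosity=2,                                 # detailed progress output
        task_ids=["HumanEval/0", "HumanEval/1"],     # hypothetical task subset
        incremental_save_path=Path("results.json"),  # crash-recovery checkpoint
        mcp_logs_dir=Path("logs/mcp"),               # MCP server logs
    )
)

Leaving run_baseline at its default of True runs both the MCP and baseline passes, which is what enables side-by-side comparison.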

EvaluationResults

EvaluationResults dataclass

Complete evaluation results.

TaskResult

TaskResult dataclass

Result for a single task.

comparison_mode property

Check if this result is from comparison mode.
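
Because EvaluationResults and TaskResult are dataclasses, results can be inspected or serialized without assuming specific field names (the full field list is not reproduced on this page). A small sketch:

import json
from dataclasses import asdict, is_dataclass

def dump_results(results) -> str:
    """Serialize an EvaluationResults (or TaskResult) dataclass to JSON."""
    assert is_dataclass(results)
    # default=str covers non-JSON-native values such as Path or datetime
    return json.dumps(asdict(results), indent=2, default=str)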


Models

ModelInfo

ModelInfo dataclass

Information about a supported model.

Model Functions

list_supported_models()

Get a list of all supported models.

Returns:

| Type | Description |
|------|-------------|
| list[ModelInfo] | List of ModelInfo objects. |

get_model_info(model_id)

Get information about a model.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_id | str | Anthropic model ID. | required |

Returns:

| Type | Description |
|------|-------------|
| ModelInfo \| None | ModelInfo if found, None otherwise. |

is_model_supported(model_id)

Check if a model is in the supported list.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_id | str | Anthropic model ID. | required |

Returns:

| Type | Description |
|------|-------------|
| bool | True if the model is supported. |

validate_model(model_id)

Validate a model ID and return a helpful error message if invalid.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_id | str | Anthropic model ID to validate. | required |

Returns:

| Type | Description |
|------|-------------|
| tuple[bool, str] | Tuple of (is_valid, error_message). |
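
The four helpers compose naturally. The sketch below assumes they are importable from mcpbr.models (the module shown in the architecture tree), which is not stated explicitly on this page:

from mcpbr.models import (
    get_model_info,
    is_model_supported,
    list_supported_models,
    validate_model,
)

# Enumerate the full registry of ModelInfo entries.
for info in list_supported_models():
    print(info)

# Look up a single model, guarding against unknown IDs.
if is_model_supported("sonnet"):
    print(get_model_info("sonnet"))

# validate_model returns (is_valid, error_message) for friendlier errors.
ok, error = validate_model("not-a-real-model")
if not ok:
    print(error)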


Constants

Default Values

from mcpbr.models import DEFAULT_MODEL
from mcpbr.config import VALID_PROVIDERS, VALID_HARNESSES, VALID_BENCHMARKS

print(DEFAULT_MODEL)       # "sonnet"
print(VALID_PROVIDERS)     # ("anthropic", "openai", "gemini", "qwen")
print(VALID_HARNESSES)     # ("claude-code",)
print(VALID_BENCHMARKS)    # 29 benchmark identifiers

Supported Models

| Model ID | Provider | Display Name | Context Window |
|----------|----------|--------------|----------------|
| claude-opus-4-5-20251101 | Anthropic | Claude Opus 4.5 | 200,000 |
| claude-sonnet-4-5-20250929 | Anthropic | Claude Sonnet 4.5 | 200,000 |
| claude-haiku-4-5-20251001 | Anthropic | Claude Haiku 4.5 | 200,000 |
| sonnet | Anthropic | Claude Sonnet (alias) | 200,000 |
| opus | Anthropic | Claude Opus (alias) | 200,000 |
| haiku | Anthropic | Claude Haiku (alias) | 200,000 |
| gpt-4o | OpenAI | GPT-4o | 128,000 |
| gpt-4-turbo | OpenAI | GPT-4 Turbo | 128,000 |
| gpt-4o-mini | OpenAI | GPT-4o Mini | 128,000 |
| gemini-2.0-flash | Google | Gemini 2.0 Flash | 1,048,576 |
| gemini-1.5-pro | Google | Gemini 1.5 Pro | 2,097,152 |
| gemini-1.5-flash | Google | Gemini 1.5 Flash | 1,048,576 |
| qwen-plus | Alibaba | Qwen Plus | 131,072 |
| qwen-turbo | Alibaba | Qwen Turbo | 131,072 |
| qwen-max | Alibaba | Qwen Max | 131,072 |