API Reference¶
Comprehensive reference documentation for the mcpbr Python API. Use these modules to programmatically configure, execute, and analyze MCP server benchmarks.
Quick Start¶
The fastest way to use mcpbr programmatically is through the SDK module:
```python
from mcpbr import MCPBenchmark, list_benchmarks, list_models, get_version

# Check available benchmarks and models
print(get_version())  # e.g., "0.8.0"
for b in list_benchmarks():
    print(b["name"], b["class"])

# Configure and validate a benchmark
bench = MCPBenchmark({
    "mcp_server": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
    },
    "benchmark": "humaneval",
    "model": "sonnet",
    "sample_size": 10,
})

is_valid, errors = bench.validate()
if is_valid:
    plan = bench.dry_run()
    print(plan)
```
SDK vs Harness
The SDK (mcpbr.sdk) provides a high-level, user-friendly interface for configuration and validation. For full evaluation execution with Docker environments, use the Harness (mcpbr.harness.run_evaluation) directly.
Core Modules¶
mcpbr is organized into several focused modules. Click through to each sub-page for detailed API documentation with examples.
| Module | Description | Key Classes / Functions |
|---|---|---|
| SDK | High-level Python interface | MCPBenchmark, BenchmarkResult, list_benchmarks(), list_models() |
| Configuration | Config models and YAML loading | HarnessConfig, MCPServerConfig, AzureConfig, load_config() |
| Analytics | Statistical analysis and tracking | ResultsDatabase, ComparisonEngine, RegressionDetector, ABTest |
| Reports | Report generation in multiple formats | HTMLReportGenerator, EnhancedMarkdownGenerator, PDFReportGenerator |
| Benchmarks | Benchmark protocol and extensions | Benchmark protocol, BenchmarkTask, create_benchmark() |
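For orientation, the sketch below shows one plausible set of imports for the names in this table. The submodule paths (`mcpbr.sdk`, `mcpbr.config`, and so on) follow the architecture tree in the next section and should be treated as assumptions if your installed version differs.

```python
# Plausible import paths for the key names above; the exact submodule
# layout is an assumption based on the architecture tree below.
from mcpbr.sdk import MCPBenchmark, BenchmarkResult, list_benchmarks, list_models
from mcpbr.config import HarnessConfig, MCPServerConfig, AzureConfig, load_config
from mcpbr.analytics import ResultsDatabase, ComparisonEngine, RegressionDetector, ABTest
from mcpbr.reports import HTMLReportGenerator, EnhancedMarkdownGenerator, PDFReportGenerator
from mcpbr.benchmarks import Benchmark, BenchmarkTask, create_benchmark
```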
Architecture Overview¶
```text
mcpbr/
|-- sdk.py                    # Public Python SDK (MCPBenchmark, list_*)
|-- config.py                 # Configuration models (HarnessConfig, MCPServerConfig)
|-- harness.py                # Evaluation orchestration (run_evaluation)
|-- models.py                 # Model registry (SUPPORTED_MODELS)
|-- benchmarks/
|   |-- base.py               # Benchmark protocol and BenchmarkTask
|   |-- swebench.py           # SWE-bench implementation
|   |-- humaneval.py          # HumanEval implementation
|   +-- ...                   # 27+ benchmark implementations
|-- analytics/
|   |-- database.py           # SQLite results storage
|   |-- statistical.py        # Hypothesis testing
|   |-- comparison.py         # Multi-model comparison
|   |-- regression_detector.py  # Performance regression detection
|   |-- ab_testing.py         # A/B testing framework
|   |-- leaderboard.py        # Rankings generation
|   |-- metrics.py            # Custom metrics registry
|   |-- trends.py             # Time-series trends
|   |-- anomaly.py            # Outlier detection
|   |-- correlation.py        # Metric correlations
|   |-- error_analysis.py     # Error clustering
|   +-- difficulty.py         # Task difficulty scoring
+-- reports/
    |-- html_report.py        # Interactive HTML reports
    |-- enhanced_markdown.py  # GitHub-flavored markdown
    +-- pdf_report.py         # Print-friendly PDF reports
```
Harness API¶
The harness module orchestrates the full evaluation pipeline, including task loading, Docker environment management, agent execution, and result aggregation.
run_evaluation¶
run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None, state_tracker=None, from_task=None, incremental_save_path=None, mcp_logs_dir=None) async ¶
Run the full evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | HarnessConfig | Harness configuration. | required |
| run_mcp | bool | Whether to run MCP evaluation. | True |
| run_baseline | bool | Whether to run baseline evaluation. | True |
| verbose | bool | Enable verbose output. | False |
| verbosity | int | Verbosity level (0=silent, 1=summary, 2=detailed). | 1 |
| log_file | TextIO \| None | Optional file handle for writing raw JSON logs. | None |
| log_dir | Path \| None | Optional directory for per-instance JSON log files. | None |
| task_ids | list[str] \| None | Specific task IDs to run (None for all). | None |
| state_tracker | Any \| None | Optional state tracker for incremental evaluation. | None |
| from_task | str \| None | Optional task ID to resume from. | None |
| incremental_save_path | Path \| None | Optional path to save results incrementally for crash recovery. | None |
| mcp_logs_dir | Path \| None | Directory for MCP server logs. | None |
Returns:
| Type | Description |
|---|---|
| EvaluationResults | EvaluationResults with all results. |
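Because run_evaluation is a coroutine, drive it with asyncio. A minimal sketch, assuming load_config() accepts a path to a YAML config as described on the Configuration page; the config filename and task IDs shown are hypothetical:

```python
import asyncio

from mcpbr.config import load_config
from mcpbr.harness import run_evaluation

async def main():
    # Load a YAML config from disk; the filename here is hypothetical.
    config = load_config("mcpbr.yaml")
    # Run only the MCP side, with detailed output, for two example task IDs.
    return await run_evaluation(
        config,
        run_mcp=True,
        run_baseline=False,
        verbosity=2,
        task_ids=["HumanEval/0", "HumanEval/1"],  # hypothetical IDs
    )

results = asyncio.run(main())
```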
EvaluationResults¶
EvaluationResults dataclass ¶
Complete evaluation results.
TaskResult¶
TaskResult dataclass ¶
Result for a single task.
comparison_mode property ¶
Check if this result is from comparison mode.
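Continuing the harness sketch above, results can be inspected per task. Aside from comparison_mode, the attribute names below (task_results, task_id) are illustrative assumptions; check the dataclass definitions for the actual fields.

```python
# Illustrative only: apart from comparison_mode, the attribute names
# here (task_results, task_id) are assumptions about the dataclasses.
for task in results.task_results:  # hypothetical attribute name
    mode = "comparison" if task.comparison_mode else "single"
    print(task.task_id, mode)      # task_id is hypothetical
```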
Models¶
ModelInfo¶
ModelInfo dataclass ¶
Information about a supported model.
Model Functions¶
list_supported_models() ¶
Get a list of all supported models.
Returns:
| Type | Description |
|---|---|
| list[ModelInfo] | List of ModelInfo objects. |
get_model_info(model_id) ¶
Get information about a model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID. | required |
Returns:
| Type | Description |
|---|---|
| ModelInfo \| None | ModelInfo if found, None otherwise. |
is_model_supported(model_id) ¶
Check if a model is in the supported list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the model is supported. |
validate_model(model_id) ¶
Validate a model ID and return a helpful error message if invalid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID to validate. | required |
Returns:
| Type | Description |
|---|---|
| tuple[bool, str] | Tuple of (is_valid, error_message). |
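The snippet below exercises all four functions together. It assumes they are importable from mcpbr.models alongside DEFAULT_MODEL; the invalid ID passed to validate_model is deliberately made up.

```python
# Assumes these functions live in mcpbr.models, next to DEFAULT_MODEL.
from mcpbr.models import (
    list_supported_models,
    get_model_info,
    is_model_supported,
    validate_model,
)

# Enumerate every registered model.
for info in list_supported_models():
    print(info)

# Look up a single model; returns None for unknown IDs.
print(get_model_info("sonnet"))

# Boolean membership check.
print(is_model_supported("sonnet"))  # True

# Validation with a human-readable error message.
ok, err = validate_model("not-a-real-model")  # deliberately invalid ID
if not ok:
    print(err)
```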
Constants¶
Default Values¶
```python
from mcpbr.models import DEFAULT_MODEL
from mcpbr.config import VALID_PROVIDERS, VALID_HARNESSES, VALID_BENCHMARKS

print(DEFAULT_MODEL)     # "sonnet"
print(VALID_PROVIDERS)   # ("anthropic", "openai", "gemini", "qwen")
print(VALID_HARNESSES)   # ("claude-code",)
print(VALID_BENCHMARKS)  # 29 benchmark identifiers
```
Supported Models¶
| Model ID | Provider | Display Name | Context Window |
|---|---|---|---|
| claude-opus-4-5-20251101 | Anthropic | Claude Opus 4.5 | 200,000 |
| claude-sonnet-4-5-20250929 | Anthropic | Claude Sonnet 4.5 | 200,000 |
| claude-haiku-4-5-20251001 | Anthropic | Claude Haiku 4.5 | 200,000 |
| sonnet | Anthropic | Claude Sonnet (alias) | 200,000 |
| opus | Anthropic | Claude Opus (alias) | 200,000 |
| haiku | Anthropic | Claude Haiku (alias) | 200,000 |
| gpt-4o | OpenAI | GPT-4o | 128,000 |
| gpt-4-turbo | OpenAI | GPT-4 Turbo | 128,000 |
| gpt-4o-mini | OpenAI | GPT-4o Mini | 128,000 |
| gemini-2.0-flash | Google | Gemini 2.0 Flash | 1,048,576 |
| gemini-1.5-pro | Google | Gemini 1.5 Pro | 2,097,152 |
| gemini-1.5-flash | Google | Gemini 1.5 Flash | 1,048,576 |
| qwen-plus | Alibaba | Qwen Plus | 131,072 |
| qwen-turbo | Alibaba | Qwen Turbo | 131,072 |
| qwen-max | Alibaba | Qwen Max | 131,072 |
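The registry can also be filtered programmatically. In this sketch the context-window attribute name on ModelInfo is an assumption based on the table's Context Window column, so it is accessed defensively:

```python
# Hypothetical: ModelInfo's context-window attribute name is assumed from
# the table above; verify against the dataclass before relying on it.
from mcpbr.models import list_supported_models

large_context = [
    m for m in list_supported_models()
    if getattr(m, "context_window", 0) >= 1_000_000
]
```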