SDK Reference¶
The mcpbr.sdk module provides the public Python SDK for programmatic access to MCP server benchmarking. It is the primary entry point for Python users who want to configure, validate, and execute benchmarks without the CLI.
All public symbols are re-exported from the top-level mcpbr package.
from mcpbr import MCPBenchmark, BenchmarkResult
from mcpbr import list_benchmarks, list_models, list_providers, get_version
MCPBenchmark¶
The main class for configuring and running MCP benchmarks.
MCPBenchmark ¶
High-level interface for configuring and running MCP benchmarks.
Can be initialized from a config dict, a YAML file path (str or Path), or an existing HarnessConfig instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] \| str \| Path \| HarnessConfig | A dict of config values, a path to a YAML config file (str or Path), or a HarnessConfig instance. | required |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If a file path is given and the file does not exist. |
| ValueError | If the config dict is invalid. |
__init__(config) ¶
validate() ¶
Validate the current configuration.
Checks that the configuration is internally consistent, the model is recognized, and required fields are present.
Returns:
| Type | Description |
|---|---|
| tuple[bool, list[str]] | A tuple of (is_valid, list_of_warnings_or_errors). |
dry_run() ¶
Generate an execution plan without running anything.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dict describing what would be executed, including benchmark, model, provider, MCP server config, and runtime settings. |
run(**kwargs) async ¶
Execute the benchmark.
This is the main entry point for running a benchmark programmatically. It delegates to the internal _execute method, which can be overridden or mocked for testing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Additional keyword arguments passed to the executor. | {} |
Returns:
| Type | Description |
|---|---|
| BenchmarkResult | A BenchmarkResult with the evaluation results. |
Initialization¶
MCPBenchmark accepts three kinds of configuration input: a HarnessConfig instance, a plain dict, or a path to a YAML config file. The example below constructs one from a HarnessConfig:
from mcpbr import MCPBenchmark
from mcpbr.config import HarnessConfig, MCPServerConfig
config = HarnessConfig(
mcp_server=MCPServerConfig(
command="uvx",
args=["my-mcp-server", "--workdir", "{workdir}"],
),
benchmark="swe-bench-verified",
model="sonnet",
sample_size=5,
)
bench = MCPBenchmark(config)
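The same benchmark can also be constructed from a plain dict or from a YAML config file path. This is a minimal sketch: the dict keys mirror the HarnessConfig fields above, and config.yaml is a placeholder path to a file you would provide yourself.
from pathlib import Path
from mcpbr import MCPBenchmark
# From a plain dict (validated the same way as a HarnessConfig)
bench_from_dict = MCPBenchmark({
    "mcp_server": {"command": "uvx", "args": ["my-mcp-server", "--workdir", "{workdir}"]},
    "benchmark": "swe-bench-verified",
    "model": "sonnet",
    "sample_size": 5,
})
# From a YAML config file path (str or Path); raises FileNotFoundError if the file is missing
bench_from_file = MCPBenchmark(Path("config.yaml"))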
File Not Found
When passing a file path, MCPBenchmark raises FileNotFoundError if the file does not exist. When passing a dict, it raises ValueError if the configuration is invalid.
validate()¶
Check that the configuration is internally consistent before running.
bench = MCPBenchmark({
"mcp_server": {"command": "npx", "args": ["my-server"]},
"benchmark": "humaneval",
"model": "sonnet",
})
is_valid, errors = bench.validate()
if not is_valid:
for error in errors:
print(f"Validation error: {error}")
else:
print("Configuration is valid")
Returns: tuple[bool, list[str]] -- A tuple of (is_valid, list_of_warnings_or_errors).
Validation checks include:
| Check | Description |
|---|---|
| Model registry | Warns if the model ID is not in SUPPORTED_MODELS |
| Benchmark registry | Errors if the benchmark name is not in BENCHMARK_REGISTRY |
| Provider | Errors if the provider is not in VALID_PROVIDERS |
dry_run()¶
Generate an execution plan without running anything. Useful for previewing what would happen.
plan = bench.dry_run()
print(plan)
# {
# "benchmark": "humaneval",
# "model": "sonnet",
# "provider": "anthropic",
# "agent_harness": "claude-code",
# "timeout_seconds": 300,
# "max_concurrent": 4,
# "max_iterations": 10,
# "sample_size": 10,
# "mcp_server": {
# "command": "npx",
# "args": ["my-server"],
# "name": "mcpbr",
# },
# }
Returns: dict[str, Any] -- A dictionary describing the execution plan, including benchmark, model, provider, MCP server config, and runtime settings.
The plan includes comparison mode information when comparison_mode is enabled:
bench = MCPBenchmark({
"comparison_mode": True,
"mcp_server_a": {"command": "server-a", "name": "Server A"},
"mcp_server_b": {"command": "server-b", "name": "Server B"},
"benchmark": "humaneval",
"model": "sonnet",
})
plan = bench.dry_run()
# plan["comparison_mode"] == True
# plan["mcp_server_a"] and plan["mcp_server_b"] are present
run()¶
Execute the benchmark asynchronously.
import asyncio
from mcpbr import MCPBenchmark
bench = MCPBenchmark({
"mcp_server": {"command": "npx", "args": ["my-server", "{workdir}"]},
"benchmark": "humaneval",
"model": "sonnet",
})
# Run asynchronously
result = asyncio.run(bench.run())
print(result.success, result.summary)
Execution Status
Full benchmark execution via the SDK run() method delegates to an internal _execute() method. Currently, _execute() raises NotImplementedError -- use the mcpbr CLI for actual benchmark runs, or mock MCPBenchmark._execute for testing.
Returns: BenchmarkResult -- A dataclass with the evaluation results.
BenchmarkResult¶
Dataclass representing the result of a benchmark run.
BenchmarkResult dataclass ¶
Result of a benchmark run.
Attributes:
| Name | Type | Description |
|---|---|---|
| success | bool | Whether the benchmark completed successfully. |
| summary | dict[str, Any] | Aggregated results (e.g., pass rate, resolved count). |
| tasks | list[dict[str, Any]] | Per-task results as a list of dicts. |
| metadata | dict[str, Any] | Run metadata (benchmark name, model, timestamps, etc.). |
| total_cost | float | Total API cost in USD. |
| total_tokens | int | Total tokens consumed. |
| duration_seconds | float | Wall-clock duration of the run. |
Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| success | bool | (required) | Whether the benchmark completed successfully |
| summary | dict[str, Any] | (required) | Aggregated results (e.g., pass rate, resolved count) |
| tasks | list[dict[str, Any]] | (required) | Per-task results as a list of dicts |
| metadata | dict[str, Any] | (required) | Run metadata (benchmark name, model, timestamps, etc.) |
| total_cost | float | 0.0 | Total API cost in USD |
| total_tokens | int | 0 | Total tokens consumed |
| duration_seconds | float | 0.0 | Wall-clock duration of the run |
Working with BenchmarkResult
from mcpbr import BenchmarkResult
result = BenchmarkResult(
success=True,
summary={"pass_rate": 0.85, "resolved": 17, "total": 20},
tasks=[{"task_id": "task_1", "resolved": True}, ...],
metadata={"benchmark": "humaneval", "model": "sonnet"},
total_cost=1.23,
total_tokens=150000,
duration_seconds=245.7,
)
if result.success:
print(f"Pass rate: {result.summary['pass_rate']:.0%}")
print(f"Cost: ${result.total_cost:.2f}")
print(f"Duration: {result.duration_seconds:.0f}s")
Discovery Functions¶
list_benchmarks()¶
List all available benchmarks registered in the system.
from mcpbr import list_benchmarks
benchmarks = list_benchmarks()
for b in benchmarks:
print(f"{b['name']:25s} {b['class']}")
Returns: list[dict[str, str]] -- Each dict contains name (the benchmark identifier) and class (the benchmark class name).
list_providers()¶
List all supported model providers.
from mcpbr import list_providers
providers = list_providers()
print(providers)
# ['anthropic', 'openai', 'gemini', 'qwen']
Returns: list[str] -- A list of provider name strings.
list_models()¶
List all supported models with their metadata.
from mcpbr import list_models
models = list_models()
for m in models:
print(f"{m['id']:35s} {m['provider']:10s} {m['context_window']:>10,}")
Returns: list[dict[str, str]] -- Each dict contains:
| Key | Type | Description |
|---|---|---|
| id | str | Model identifier (e.g., "sonnet", "gpt-4o") |
| provider | str | Provider name (e.g., "Anthropic", "OpenAI") |
| display_name | str | Human-readable model name |
| context_window | int | Maximum context window in tokens |
| supports_tools | bool | Whether the model supports tool calling |
| notes | str | Additional notes (e.g., alias information) |
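Because each entry is a plain dict, the list can be filtered with ordinary comprehensions. A small sketch using the keys and provider strings from the table above:
from mcpbr import list_models
# Models from a single provider that support tool calling
anthropic_tool_models = [
    m["id"]
    for m in list_models()
    if m["provider"] == "Anthropic" and m["supports_tools"]
]
print(anthropic_tool_models)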
get_version()¶
Get the current mcpbr version string.
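A minimal usage sketch:
from mcpbr import get_version
print(f"mcpbr version: {get_version()}")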
Returns: str -- The version string.
Error Handling¶
The SDK raises standard Python exceptions:
| Exception | When |
|---|---|
| FileNotFoundError | Config file path does not exist |
| ValueError | Invalid config dict (Pydantic validation failure) |
| TypeError | Config argument is not a dict, str, Path, or HarnessConfig |
| NotImplementedError | MCPBenchmark.run() called (full execution not yet wired into SDK) |
Error Handling Pattern
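The following sketch shows one way to handle these exceptions; the file path and config values are illustrative only.
import asyncio
from mcpbr import MCPBenchmark
try:
    bench = MCPBenchmark("configs/benchmark.yaml")  # illustrative path
except FileNotFoundError as exc:
    print(f"Config file not found: {exc}")
except ValueError as exc:
    print(f"Invalid configuration: {exc}")
else:
    try:
        result = asyncio.run(bench.run())
    except NotImplementedError:
        # Full execution is not yet wired into the SDK -- use the mcpbr CLI instead.
        print("SDK execution not available; run the benchmark via the mcpbr CLI.")
    else:
        print(result.summary)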
Testing with the SDK¶
The SDK is designed to be easily mockable for testing:
import asyncio
from unittest.mock import AsyncMock
from mcpbr import MCPBenchmark, BenchmarkResult
# Create a benchmark instance
bench = MCPBenchmark({
"mcp_server": {"command": "test-server", "args": []},
"benchmark": "humaneval",
"model": "sonnet",
})
# Mock the internal _execute method
bench._execute = AsyncMock(return_value=BenchmarkResult(
success=True,
summary={"pass_rate": 0.90},
tasks=[],
metadata={},
total_cost=0.50,
total_tokens=10000,
duration_seconds=60.0,
))
# Run the benchmark (uses the mock)
result = asyncio.run(bench.run())
assert result.success
assert result.total_cost == 0.50