
SDK Reference

The mcpbr.sdk module provides the public Python SDK for programmatic access to MCP server benchmarking. It is the primary entry point for Python users who want to configure, validate, and execute benchmarks without the CLI.

All public symbols are re-exported from the top-level mcpbr package.

from mcpbr import MCPBenchmark, BenchmarkResult
from mcpbr import list_benchmarks, list_models, list_providers, get_version

MCPBenchmark

The main class for configuring and running MCP benchmarks.

MCPBenchmark

High-level interface for configuring and running MCP benchmarks.

Can be initialized from a config dict, a YAML file path (str or Path), or an existing HarnessConfig instance.

Parameters:

config (dict[str, Any] | str | Path | HarnessConfig, required) -- A dict of config values, a path to a YAML config file (str or Path), or a HarnessConfig instance.

Raises:

FileNotFoundError -- If a file path is given and the file does not exist.
ValueError -- If the config dict is invalid.

__init__(config)

validate()

Validate the current configuration.

Checks that the configuration is internally consistent, the model is recognized, and required fields are present.

Returns:

tuple[bool, list[str]] -- A tuple of (is_valid, list_of_warnings_or_errors).

dry_run()

Generate an execution plan without running anything.

Returns:

dict[str, Any] -- A dict describing what would be executed, including benchmark, model, provider, MCP server config, and runtime settings.

run(**kwargs) async

Execute the benchmark.

This is the main entry point for running a benchmark programmatically. It delegates to the internal _execute method, which can be overridden or mocked for testing.

Parameters:

**kwargs (Any, optional) -- Additional keyword arguments passed to the executor.

Returns:

BenchmarkResult -- A BenchmarkResult with the evaluation results.

Initialization

MCPBenchmark accepts three types of configuration input:

From a config dict:

from mcpbr import MCPBenchmark

bench = MCPBenchmark({
    "mcp_server": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
    },
    "benchmark": "humaneval",
    "model": "sonnet",
    "sample_size": 10,
    "timeout_seconds": 300,
})

From a YAML file path:

from mcpbr import MCPBenchmark

bench = MCPBenchmark("mcpbr.yaml")
# or with a Path object
from pathlib import Path
bench = MCPBenchmark(Path("configs/production.yaml"))

From a HarnessConfig instance:

from mcpbr import MCPBenchmark
from mcpbr.config import HarnessConfig, MCPServerConfig

config = HarnessConfig(
    mcp_server=MCPServerConfig(
        command="uvx",
        args=["my-mcp-server", "--workdir", "{workdir}"],
    ),
    benchmark="swe-bench-verified",
    model="sonnet",
    sample_size=5,
)
bench = MCPBenchmark(config)

File Not Found

When passing a file path, MCPBenchmark raises FileNotFoundError if the file does not exist. When passing a dict, it raises ValueError if the configuration is invalid.

validate()

Check that the configuration is internally consistent before running.

bench = MCPBenchmark({
    "mcp_server": {"command": "npx", "args": ["my-server"]},
    "benchmark": "humaneval",
    "model": "sonnet",
})

is_valid, errors = bench.validate()
if not is_valid:
    for error in errors:
        print(f"Validation error: {error}")
else:
    print("Configuration is valid")

Returns: tuple[bool, list[str]] -- A tuple of (is_valid, list_of_warnings_or_errors).

Validation checks include:

Check Description
Model registry Warns if the model ID is not in SUPPORTED_MODELS
Benchmark registry Errors if the benchmark name is not in BENCHMARK_REGISTRY
Provider Errors if the provider is not in VALID_PROVIDERS
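
For example, an unrecognized model ID should surface as a message in the returned list rather than raising (a minimal sketch; "my-custom-model" is a hypothetical ID assumed to be absent from SUPPORTED_MODELS):

from mcpbr import MCPBenchmark

bench = MCPBenchmark({
    "mcp_server": {"command": "npx", "args": ["my-server"]},
    "benchmark": "humaneval",
    "model": "my-custom-model",  # hypothetical ID, not expected in SUPPORTED_MODELS
})

is_valid, messages = bench.validate()
for msg in messages:
    print(f"Validation message: {msg}")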

dry_run()

Generate an execution plan without running anything. Useful for previewing what would happen.

plan = bench.dry_run()
print(plan)
# {
#     "benchmark": "humaneval",
#     "model": "sonnet",
#     "provider": "anthropic",
#     "agent_harness": "claude-code",
#     "timeout_seconds": 300,
#     "max_concurrent": 4,
#     "max_iterations": 10,
#     "sample_size": 10,
#     "mcp_server": {
#         "command": "npx",
#         "args": ["my-server"],
#         "name": "mcpbr",
#     },
# }

Returns: dict[str, Any] -- A dictionary describing the execution plan, including benchmark, model, provider, MCP server config, and runtime settings.

The plan includes comparison mode information when comparison_mode is enabled:

bench = MCPBenchmark({
    "comparison_mode": True,
    "mcp_server_a": {"command": "server-a", "name": "Server A"},
    "mcp_server_b": {"command": "server-b", "name": "Server B"},
    "benchmark": "humaneval",
    "model": "sonnet",
})
plan = bench.dry_run()
# plan["comparison_mode"] == True
# plan["mcp_server_a"] and plan["mcp_server_b"] are present

run()

Execute the benchmark asynchronously.

import asyncio
from mcpbr import MCPBenchmark

bench = MCPBenchmark({
    "mcp_server": {"command": "npx", "args": ["my-server", "{workdir}"]},
    "benchmark": "humaneval",
    "model": "sonnet",
})

# Run asynchronously
result = asyncio.run(bench.run())
print(result.success, result.summary)

Execution Status

Full benchmark execution via the SDK run() method delegates to an internal _execute() method. Currently, _execute() raises NotImplementedError -- use the mcpbr CLI for actual benchmark runs, or mock MCPBenchmark._execute for testing.

Returns: BenchmarkResult -- A dataclass with the evaluation results.
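
Until full execution is wired into the SDK, a direct call to run() can be guarded defensively (a minimal sketch, reusing the bench instance from the example above):

import asyncio

try:
    result = asyncio.run(bench.run())
except NotImplementedError:
    # Full SDK execution is not available yet; fall back to the mcpbr CLI,
    # or mock MCPBenchmark._execute as shown in "Testing with the SDK" below.
    print("SDK execution not available; use the mcpbr CLI instead.")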


BenchmarkResult

Dataclass representing the result of a benchmark run.

BenchmarkResult dataclass

Result of a benchmark run.

Attributes:

Name              Type                  Description
success           bool                  Whether the benchmark completed successfully.
summary           dict[str, Any]        Aggregated results (e.g., pass rate, resolved count).
tasks             list[dict[str, Any]]  Per-task results as a list of dicts.
metadata          dict[str, Any]        Run metadata (benchmark name, model, timestamps, etc.).
total_cost        float                 Total API cost in USD.
total_tokens      int                   Total tokens consumed.
duration_seconds  float                 Wall-clock duration of the run.

Fields

Field Type Default Description
success bool (required) Whether the benchmark completed successfully
summary dict[str, Any] (required) Aggregated results (e.g., pass rate, resolved count)
tasks list[dict[str, Any]] (required) Per-task results as a list of dicts
metadata dict[str, Any] (required) Run metadata (benchmark name, model, timestamps, etc.)
total_cost float 0.0 Total API cost in USD
total_tokens int 0 Total tokens consumed
duration_seconds float 0.0 Wall-clock duration of the run

Working with BenchmarkResult

result = BenchmarkResult(
    success=True,
    summary={"pass_rate": 0.85, "resolved": 17, "total": 20},
    tasks=[{"task_id": "task_1", "resolved": True}, ...],
    metadata={"benchmark": "humaneval", "model": "sonnet"},
    total_cost=1.23,
    total_tokens=150000,
    duration_seconds=245.7,
)

if result.success:
    print(f"Pass rate: {result.summary['pass_rate']:.0%}")
    print(f"Cost: ${result.total_cost:.2f}")
    print(f"Duration: {result.duration_seconds:.0f}s")

Discovery Functions

list_benchmarks()

List all available benchmarks registered in the system.

from mcpbr import list_benchmarks

benchmarks = list_benchmarks()
for b in benchmarks:
    print(f"{b['name']:25s} {b['class']}")

Returns: list[dict[str, str]] -- Each dict contains name (the benchmark identifier) and class (the benchmark class name).

Sample Output

swe-bench-lite            SWEBenchmark
swe-bench-verified        SWEBenchmark
swe-bench-full            SWEBenchmark
cybergym                  CyberGymBenchmark
humaneval                 HumanEvalBenchmark
mcptoolbench              MCPToolBenchmark
gsm8k                     GSM8KBenchmark
...
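
The returned entries are easy to filter; for example, selecting the SWE-bench variants shown above:

from mcpbr import list_benchmarks

swe_variants = [b["name"] for b in list_benchmarks() if b["name"].startswith("swe-bench")]
print(swe_variants)  # e.g. ['swe-bench-lite', 'swe-bench-verified', 'swe-bench-full']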

list_providers()

List all supported model providers.

from mcpbr import list_providers

providers = list_providers()
print(providers)
# ['anthropic', 'openai', 'gemini', 'qwen']

Returns: list[str] -- A list of provider name strings.

list_models()

List all supported models with their metadata.

from mcpbr import list_models

models = list_models()
for m in models:
    print(f"{m['id']:35s} {m['provider']:10s} {m['context_window']:>10,}")

Returns: list[dict[str, Any]] -- Each dict contains:

Key Type Description
id str Model identifier (e.g., "sonnet", "gpt-4o")
provider str Provider name (e.g., "Anthropic", "OpenAI")
display_name str Human-readable model name
context_window int Maximum context window in tokens
supports_tools bool Whether the model supports tool calling
notes str Additional notes (e.g., alias information)
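
Because each entry exposes provider and supports_tools, the registry can be filtered directly (a small sketch; it assumes provider strings use the casing shown above, e.g. "Anthropic"):

from mcpbr import list_models

anthropic_tool_models = [
    m["id"]
    for m in list_models()
    if m["provider"] == "Anthropic" and m["supports_tools"]
]
print(anthropic_tool_models)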

get_version()

Get the current mcpbr version string.

from mcpbr import get_version

version = get_version()
print(version)  # e.g., "0.8.0"

Returns: str -- The version string.


Error Handling

The SDK raises standard Python exceptions:

Exception When
FileNotFoundError Config file path does not exist
ValueError Invalid config dict (Pydantic validation failure)
TypeError Config argument is not a dict, str, Path, or HarnessConfig
NotImplementedError MCPBenchmark.run() called (full execution not yet wired into SDK)

Error Handling Pattern

from mcpbr import MCPBenchmark

try:
    bench = MCPBenchmark("nonexistent.yaml")
except FileNotFoundError as e:
    print(f"Config file not found: {e}")

try:
    bench = MCPBenchmark({"benchmark": "invalid-benchmark"})
except ValueError as e:
    print(f"Invalid configuration: {e}")

Testing with the SDK

The SDK is designed to be easily mockable for testing:

import asyncio
from unittest.mock import AsyncMock
from mcpbr import MCPBenchmark, BenchmarkResult

# Create a benchmark instance
bench = MCPBenchmark({
    "mcp_server": {"command": "test-server", "args": []},
    "benchmark": "humaneval",
    "model": "sonnet",
})

# Mock the internal _execute method
bench._execute = AsyncMock(return_value=BenchmarkResult(
    success=True,
    summary={"pass_rate": 0.90},
    tasks=[],
    metadata={},
    total_cost=0.50,
    total_tokens=10000,
    duration_seconds=60.0,
))

# Run the benchmark (uses the mock)
result = asyncio.run(bench.run())
assert result.success
assert result.total_cost == 0.50
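
The same pattern works inside a test suite. Here is a pytest-style sketch (it assumes pytest and pytest-asyncio are installed; the test name is illustrative):

import pytest
from unittest.mock import AsyncMock
from mcpbr import MCPBenchmark, BenchmarkResult


@pytest.mark.asyncio
async def test_run_uses_mocked_executor():
    bench = MCPBenchmark({
        "mcp_server": {"command": "test-server", "args": []},
        "benchmark": "humaneval",
        "model": "sonnet",
    })
    # Replace the internal executor so no real benchmark is run.
    bench._execute = AsyncMock(return_value=BenchmarkResult(
        success=True,
        summary={"pass_rate": 1.0},
        tasks=[],
        metadata={},
    ))

    result = await bench.run()

    assert result.success
    bench._execute.assert_called_once()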