
Benchmarks API Reference

The mcpbr.benchmarks package defines the benchmark abstraction layer and ships 29 benchmark implementation classes covering 31 registered benchmark IDs. Use the Benchmark protocol to add custom benchmarks or interact with existing ones programmatically.

from mcpbr.benchmarks import (
    Benchmark,
    BenchmarkTask,
    BENCHMARK_REGISTRY,
    create_benchmark,
    list_benchmarks,
)

Benchmark Protocol

The Benchmark protocol defines the interface that all benchmark implementations must satisfy. It is decorated with @runtime_checkable, so you can use isinstance() checks.

Benchmark

Bases: Protocol

Protocol for benchmark implementations.

Each benchmark (SWE-bench, CyberGym, etc.) implements this protocol to provide task loading, environment setup, and evaluation.

load_tasks(sample_size=None, task_ids=None, level=None, filter_difficulty=None, filter_category=None, filter_tags=None)

Load tasks from the benchmark dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample_size | int \| None | Maximum number of tasks to load (None for all). | None |
| task_ids | list[str] \| None | Specific task IDs to load (None for all). | None |
| level | int \| None | Difficulty/context level (benchmark-specific, e.g., CyberGym 0-3). | None |
| filter_difficulty | list[str] \| None | Filter by difficulty levels (benchmark-specific). | None |
| filter_category | list[str] \| None | Filter by categories (benchmark-specific). | None |
| filter_tags | list[str] \| None | Filter by tags (requires all tags to match). | None |

Returns:

| Type | Description |
| --- | --- |
| list[dict[str, Any]] | List of task dictionaries in benchmark-specific format. |

normalize_task(task)

Convert benchmark-specific task format to normalized BenchmarkTask.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| task | dict[str, Any] | Task in benchmark-specific format. | required |

Returns:

| Type | Description |
| --- | --- |
| BenchmarkTask | Normalized BenchmarkTask. |

create_environment(task, docker_manager) async

Create an isolated environment for the task.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| task | dict[str, Any] | Task dictionary. | required |
| docker_manager | DockerEnvironmentManager | Docker environment manager. | required |

Returns:

| Type | Description |
| --- | --- |
| TaskEnvironment | TaskEnvironment for the task. |

evaluate(env, task, solution) async

Evaluate a solution for the task.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| env | TaskEnvironment | Task environment. | required |
| task | dict[str, Any] | Task dictionary. | required |
| solution | str | Solution to evaluate (e.g., patch, PoC code). | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Dictionary with evaluation results including 'resolved' boolean. |

get_prebuilt_image(task)

Get pre-built Docker image name for the task, if available.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| task | dict[str, Any] | Task dictionary. | required |

Returns:

| Type | Description |
| --- | --- |
| str \| None | Docker image name or None if no pre-built image exists. |

get_prompt_template()

Get the benchmark-specific prompt template for agents.

Returns:

| Type | Description |
| --- | --- |
| str | Prompt template string with {problem_statement} placeholder. |

get_default_sandbox_level()

Get the default sandbox security level for this benchmark.

Benchmarks that run untrusted or adversarial code should return a stricter level. The user's explicit sandbox config always takes precedence over this default.

Returns:

| Type | Description |
| --- | --- |
| str \| None | Security level string ("permissive", "standard", "strict"), or None to use the global default. |

Required Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| name | str | Human-readable benchmark name |

Required Methods

load_tasks()

Load tasks from the benchmark dataset with optional filtering.

def load_tasks(
    self,
    sample_size: int | None = None,
    task_ids: list[str] | None = None,
    level: int | None = None,
    filter_difficulty: list[str] | None = None,
    filter_category: list[str] | None = None,
    filter_tags: list[str] | None = None,
) -> list[dict[str, Any]]

| Parameter | Type | Description |
| --- | --- | --- |
| sample_size | int \| None | Maximum number of tasks to load (None for all) |
| task_ids | list[str] \| None | Specific task IDs to load (None for all) |
| level | int \| None | Difficulty/context level (benchmark-specific, e.g., CyberGym 0-3) |
| filter_difficulty | list[str] \| None | Filter by difficulty levels |
| filter_category | list[str] \| None | Filter by categories |
| filter_tags | list[str] \| None | Filter by tags (all must match) |

Returns: list[dict[str, Any]] -- Task dictionaries in benchmark-specific format.
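
For example, loading a filtered sample might look like this (a sketch; it assumes an instance obtained from create_benchmark and that the benchmark exposes "easy" as a difficulty label, which is benchmark-specific):

from mcpbr.benchmarks import create_benchmark

benchmark = create_benchmark("swe-bench-lite")

# Load at most 10 tasks with difficulty "easy" (the label is an
# illustrative assumption; difficulty values vary per benchmark).
tasks = benchmark.load_tasks(sample_size=10, filter_difficulty=["easy"])

# Or load specific tasks by ID.
tasks = benchmark.load_tasks(task_ids=["django__django-11099"])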

normalize_task()

Convert a benchmark-specific task dictionary to the normalized BenchmarkTask format.

def normalize_task(self, task: dict[str, Any]) -> BenchmarkTask

Returns: BenchmarkTask with standardized fields.
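
In practice this is called on each raw task returned by load_tasks() (sketch, reusing the benchmark instance from the example above):

raw_tasks = benchmark.load_tasks(sample_size=1)

# Convert the benchmark-specific dict into the normalized dataclass.
task = benchmark.normalize_task(raw_tasks[0])
print(task.task_id, task.repo, task.commit)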

create_environment()

Create an isolated Docker environment for the task.

async def create_environment(
    self,
    task: dict[str, Any],
    docker_manager: DockerEnvironmentManager,
) -> TaskEnvironment

Returns: TaskEnvironment with the Docker container and working directory.
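
A usage sketch; it assumes DockerEnvironmentManager can be constructed with no arguments, which may not match the real constructor:

from mcpbr.docker_env import DockerEnvironmentManager

async def make_env(benchmark, raw_task):
    docker_manager = DockerEnvironmentManager()  # assumption: zero-arg constructor
    # The benchmark picks the image (pre-built or fallback) and prepares
    # the repository checkout inside the container.
    env = await benchmark.create_environment(raw_task, docker_manager)
    return env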

evaluate()

Evaluate a solution (e.g., a patch or generated code) against the task.

async def evaluate(
    self,
    env: TaskEnvironment,
    task: dict[str, Any],
    solution: str,
) -> dict[str, Any]

Returns: Dictionary with evaluation results including a resolved boolean.
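
A minimal end-to-end sketch (the solution string stands in for real agent output; the protocol only guarantees the 'resolved' key, other keys vary per benchmark):

async def score(benchmark, env, raw_task, solution: str) -> bool:
    result = await benchmark.evaluate(env, raw_task, solution)
    return result["resolved"]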

get_prebuilt_image()

Get the pre-built Docker image name for a task, if available.

def get_prebuilt_image(self, task: dict[str, Any]) -> str | None

Returns: Docker image name or None.
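
Callers typically fall back to a generic image when none exists, as the custom-benchmark example later on this page does:

image = benchmark.get_prebuilt_image(raw_task) or "python:3.11-slim"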

get_prompt_template()

Get the benchmark-specific prompt template for agents.

def get_prompt_template(self) -> str

Returns: Prompt template string with {problem_statement} placeholder.
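
The placeholder is filled with the normalized task's problem statement (sketch):

prompt = benchmark.get_prompt_template().format(
    problem_statement=task.problem_statement,
)

get_default_sandbox_level()

Get the default sandbox security level for this benchmark. Benchmarks that run untrusted or adversarial code return a stricter level; the user's explicit sandbox config always takes precedence over this default.

def get_default_sandbox_level(self) -> str | None

Returns: "permissive", "standard", or "strict", or None to use the global default.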


BenchmarkTask

A dataclass providing a normalized task representation across different benchmarks.

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| task_id | str | (required) | Unique task identifier |
| problem_statement | str | (required) | The problem description given to the agent |
| repo | str | (required) | Repository name or path |
| commit | str | (required) | Git commit hash for the task environment |
| metadata | dict[str, Any] | {} | Additional benchmark-specific metadata |

BenchmarkTask Example

from mcpbr.benchmarks import BenchmarkTask

task = BenchmarkTask(
    task_id="django__django-11099",
    problem_statement="Fix the bug in QuerySet.union() that drops ORDER BY...",
    repo="django/django",
    commit="abc123def456",
    metadata={
        "difficulty": "medium",
        "fail_to_pass": ["tests.queries.test_qs_combinators.QuerySetSetOperationTests"],
    },
)

TaskEnvironment

The Docker-based execution environment for benchmark tasks. This is returned by create_environment() and provides methods for interacting with the container.

Key Properties

| Property | Type | Description |
| --- | --- | --- |
| container | Container | Docker container object |
| workdir | str | Working directory inside the container |
| host_workdir | str | Working directory on the host |
| instance_id | str | Task instance identifier |
| uses_prebuilt | bool | Whether a pre-built image was used |

Key Methods

| Method | Description |
| --- | --- |
| exec_command(cmd) | Execute a command in the container |
| exec_command_streaming(cmd) | Execute with streaming output |
| write_file(path, content) | Write a file inside the container |
| read_file(path) | Read a file from the container |
| cleanup() | Remove the container and clean up resources |
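
A minimal interaction sketch; write_file and exec_command are awaited to match the evaluate() example later on this page, while treating read_file and cleanup as awaitable is an assumption:

async def probe(env):
    await env.write_file("/workspace/hello.py", "print('hello')")
    exit_code, output = await env.exec_command("python /workspace/hello.py")
    content = await env.read_file("/workspace/hello.py")  # assumption: awaitable
    await env.cleanup()  # assumption: awaitable
    return exit_code, output, content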

create_benchmark()

Factory function to create benchmark instances from the registry.

from mcpbr.benchmarks import create_benchmark

# Create a benchmark by name
benchmark = create_benchmark("humaneval")

# SWE-bench variants auto-set the dataset
benchmark = create_benchmark("swe-bench-verified")
# Internally sets dataset="SWE-bench/SWE-bench_Verified"

# Pass additional kwargs to the constructor
benchmark = create_benchmark("cybergym", level=2)

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Benchmark name from the registry |
| **kwargs | Any | Arguments passed to the benchmark constructor |

Raises: ValueError if the benchmark name is not recognized.
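
A defensive lookup can surface the valid names on failure (sketch):

from mcpbr.benchmarks import create_benchmark, list_benchmarks

try:
    benchmark = create_benchmark("swebench")  # not a registered ID
except ValueError:
    print("Valid benchmarks:", ", ".join(list_benchmarks()))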


BENCHMARK_REGISTRY

Dictionary mapping benchmark IDs to their implementation classes.

from mcpbr.benchmarks import BENCHMARK_REGISTRY

# List all registered benchmarks
for name, cls in BENCHMARK_REGISTRY.items():
    print(f"{name:25s} -> {cls.__name__}")

Available Benchmarks

mcpbr ships with 29 benchmark implementation classes, registered under 31 benchmark IDs (the three SWE-bench variants share one class):

Software Engineering

| Benchmark ID | Class | Description |
| --- | --- | --- |
| swe-bench-lite | SWEBenchmark | SWE-bench Lite -- 300 curated GitHub bug fixes |
| swe-bench-verified | SWEBenchmark | SWE-bench Verified -- 500 manually validated bug fixes |
| swe-bench-full | SWEBenchmark | SWE-bench Full -- the complete 2,294-instance dataset |
| aider-polyglot | AiderPolyglotBenchmark | Aider polyglot coding benchmark |

Code Generation

| Benchmark ID | Class | Description |
| --- | --- | --- |
| humaneval | HumanEvalBenchmark | OpenAI HumanEval -- function-level code generation |
| mbpp | MBPPBenchmark | Mostly Basic Python Problems |
| apps | APPSBenchmark | APPS competitive programming |
| codecontests | CodeContestsBenchmark | Google CodeContests |
| bigcodebench | BigCodeBenchBenchmark | BigCodeBench |
| leetcode | LeetCodeBenchmark | LeetCode-style problems |
| codereval | CoderEvalBenchmark | CoderEval repository-level code generation |
| repoqa | RepoQABenchmark | Repository-level code QA |

Reasoning and Knowledge

| Benchmark ID | Class | Description |
| --- | --- | --- |
| gsm8k | GSM8KBenchmark | GSM8K grade-school math word problems |
| math | MATHBenchmark | MATH competition-level mathematics |
| truthfulqa | TruthfulQABenchmark | TruthfulQA truthfulness evaluation |
| bigbench-hard | BigBenchHardBenchmark | BIG-bench Hard challenging tasks |
| hellaswag | HellaSwagBenchmark | HellaSwag commonsense reasoning |
| arc | ARCBenchmark | AI2 Reasoning Challenge |
| mmmu | MMMUBenchmark | Massive Multi-discipline Multimodal Understanding |
| longbench | LongBenchBenchmark | Long-context understanding |

Agent and Tool Use

| Benchmark ID | Class | Description |
| --- | --- | --- |
| mcptoolbench | MCPToolBenchmark | MCP Tool Bench -- MCP-specific tool usage |
| toolbench | ToolBenchBenchmark | ToolBench general tool usage |
| gaia | GAIABenchmark | GAIA general AI assistant |
| agentbench | AgentBenchBenchmark | AgentBench multi-domain agent |
| webarena | WebArenaBenchmark | WebArena web browsing tasks |
| mlagentbench | MLAgentBenchBenchmark | ML Agent Bench |
| intercode | InterCodeBenchmark | InterCode interactive coding |
| terminalbench | TerminalBenchBenchmark | TerminalBench terminal operations |

Security

| Benchmark ID | Class | Description |
| --- | --- | --- |
| cybergym | CyberGymBenchmark | CyberGym security challenges (levels 0-3) |
| adversarial | AdversarialBenchmark | Adversarial robustness testing |

Custom

| Benchmark ID | Class | Description |
| --- | --- | --- |
| custom | CustomBenchmark | User-defined custom benchmark |

Implementing a Custom Benchmark

To add a new benchmark, implement the Benchmark protocol:

from typing import Any
from mcpbr.benchmarks.base import Benchmark, BenchmarkTask
from mcpbr.docker_env import DockerEnvironmentManager, TaskEnvironment


class MyBenchmark:
    """Custom benchmark implementation."""

    name: str = "my-benchmark"

    def load_tasks(
        self,
        sample_size: int | None = None,
        task_ids: list[str] | None = None,
        level: int | None = None,
        filter_difficulty: list[str] | None = None,
        filter_category: list[str] | None = None,
        filter_tags: list[str] | None = None,
    ) -> list[dict[str, Any]]:
        """Load tasks from your dataset."""
        tasks = [
            {
                "instance_id": "task-001",
                "problem_statement": "Implement a function that...",
                "repo": "my-org/my-repo",
                "base_commit": "abc123",
                "difficulty": "easy",
            },
            # ... more tasks
        ]

        # Apply filters
        if task_ids:
            tasks = [t for t in tasks if t["instance_id"] in task_ids]
        if filter_difficulty:
            tasks = [t for t in tasks if t.get("difficulty") in filter_difficulty]
        if sample_size:
            tasks = tasks[:sample_size]

        return tasks

    def normalize_task(self, task: dict[str, Any]) -> BenchmarkTask:
        """Convert to normalized format."""
        return BenchmarkTask(
            task_id=task["instance_id"],
            problem_statement=task["problem_statement"],
            repo=task["repo"],
            commit=task["base_commit"],
            metadata={"difficulty": task.get("difficulty")},
        )

    async def create_environment(
        self,
        task: dict[str, Any],
        docker_manager: DockerEnvironmentManager,
    ) -> TaskEnvironment:
        """Create a Docker environment for the task."""
        image = self.get_prebuilt_image(task)
        return await docker_manager.create_environment(
            image=image or "python:3.11-slim",
            instance_id=task["instance_id"],
            repo=task["repo"],
            commit=task["base_commit"],
        )

    async def evaluate(
        self,
        env: TaskEnvironment,
        task: dict[str, Any],
        solution: str,
    ) -> dict[str, Any]:
        """Evaluate the agent's solution."""
        # Write the solution into the workspace so the tests can see it,
        # then run the test suite
        await env.write_file("/workspace/solution.py", solution)
        exit_code, output = await env.exec_command(
            "cd /workspace && python -m pytest tests/ -v"
        )
        return {
            "resolved": exit_code == 0,
            "test_output": output,
        }

    def get_prebuilt_image(self, task: dict[str, Any]) -> str | None:
        """Return pre-built Docker image if available."""
        return None  # No pre-built images

    def get_prompt_template(self) -> str:
        """Return the prompt template for agents."""
        return (
            "You are a software engineer. Solve the following problem:\n\n"
            "{problem_statement}\n\n"
            "Provide your solution as a Python implementation."
        )
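
    def get_default_sandbox_level(self) -> str | None:
        """Return the default sandbox level for this benchmark."""
        # Also part of the protocol (documented above): return "permissive",
        # "standard", or "strict" for benchmarks that run untrusted code,
        # or None to defer to the global sandbox default.
        return None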

Registering Your Benchmark

To make your benchmark available via create_benchmark() and the CLI, register it in BENCHMARK_REGISTRY:

# In mcpbr/benchmarks/__init__.py
from .my_benchmark import MyBenchmark

BENCHMARK_REGISTRY["my-benchmark"] = MyBenchmark

And add the benchmark ID to VALID_BENCHMARKS in mcpbr/config.py:

VALID_BENCHMARKS = (
    # ... existing benchmarks ...
    "my-benchmark",
)
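
Once registered, the benchmark resolves like any built-in:

from mcpbr.benchmarks import create_benchmark

benchmark = create_benchmark("my-benchmark")
assert benchmark.name == "my-benchmark"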

Protocol Compliance

You can verify your implementation satisfies the protocol at runtime:

from mcpbr.benchmarks import Benchmark

my_bench = MyBenchmark()
assert isinstance(my_bench, Benchmark)  # Runtime check


list_benchmarks()

List all available benchmark names from the registry.

from mcpbr.benchmarks import list_benchmarks

names = list_benchmarks()
print(names)
# ['swe-bench-lite', 'swe-bench-verified', 'swe-bench-full', 'cybergym',
#  'humaneval', 'mcptoolbench', 'gsm8k', 'mbpp', 'math', ...]

Returns: list[str] -- Sorted list of benchmark identifier strings.