Configuration Reference¶
The mcpbr.config module provides Pydantic-based configuration models for the evaluation harness. All configuration can be specified via YAML files or constructed programmatically.
HarnessConfig¶
The main configuration class for the evaluation harness. Every field has a default. mcp_server defaults to None but must be set in single-server mode; in comparison mode, set mcp_server_a and mcp_server_b instead.
HarnessConfig ¶
Bases: BaseModel
Main configuration for the test harness.
Supports multiple model providers and agent harnesses.
validate_thinking_budget(v) classmethod ¶
Validate thinking_budget is within acceptable bounds.
Claude API requires budget_tokens >= 1024 and < max_tokens. Claude Code caps thinking at 31999 tokens by default.
validate_stratified_sampling() ¶
Ensure stratify_field is set when using stratified sampling.
validate_checkpoint_interval(v) classmethod ¶
Validate checkpoint_interval is at least 1.
validate_max_failures(v) classmethod ¶
Validate max_failures is positive if set.
validate_rate_limit_rpm(v) classmethod ¶
Validate rate_limit_rpm is positive if set.
validate_rate_limit_strategy(v) classmethod ¶
Validate rate limit strategy.
validate_redaction_level(v) classmethod ¶
Validate redaction level.
validate_data_retention_days(v) classmethod ¶
Validate data_retention_days is positive if set.
validate_model_for_provider() ¶
Validate model ID based on the provider.
Anthropic provider accepts any model ID (direct API).
validate_server_config() ¶
Validate MCP server configuration consistency.
Fields¶
MCP Server Configuration¶
| Field | Type | Default | Description |
|---|---|---|---|
| mcp_server | MCPServerConfig \| None | None | MCP server configuration (required when comparison_mode is false) |
| mcp_server_a | MCPServerConfig \| None | None | First MCP server for comparison mode |
| mcp_server_b | MCPServerConfig \| None | None | Second MCP server for comparison mode |
| comparison_mode | bool | false | Enable side-by-side comparison mode |
Model and Provider¶
| Field | Type | Default | Description |
|---|---|---|---|
| provider | str | "anthropic" | Model provider. Valid: anthropic, openai, gemini, qwen |
| model | str | "sonnet" | Model ID for the selected provider |
| agent_harness | str | "claude-code" | Agent harness to use. Valid: claude-code |
| agent_prompt | str \| None | None | Custom prompt template. Use {problem_statement} as placeholder |
Benchmark Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| benchmark | str | "swe-bench-verified" | Benchmark to run. Use mcpbr benchmarks for the full list |
| sample_size | int \| None | None | Number of tasks to evaluate (None for full dataset) |
| cybergym_level | int | 1 | CyberGym difficulty level (0-3) |
| filter_difficulty | list[str] \| None | None | Filter by difficulty (e.g., ["easy", "medium"]) |
| filter_category | list[str] \| None | None | Filter by category (e.g., ["django", "flask"]) |
| filter_tags | list[str] \| None | None | Filter by tags (all tags must match) |
Runtime Settings¶
| Field | Type | Default | Description |
|---|---|---|---|
| timeout_seconds | int | 300 | Timeout for each task in seconds (minimum: 30) |
| max_concurrent | int | 4 | Maximum concurrent task evaluations (minimum: 1) |
| max_iterations | int | 10 | Maximum agent iterations per task |
| thinking_budget | int \| None | None | Extended thinking token budget (1024-31999 if set) |
| budget | float \| None | None | Maximum budget in USD (halts when reached) |
| continue_on_error | bool | true | Continue evaluation when individual tasks fail |
| max_failures | int \| None | None | Maximum task failures before halting (None for unlimited) |
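One way to picture how budget, continue_on_error, and max_failures could interact is the loop below. It is a sketch under assumptions: `run_tasks` is a hypothetical function, the order in which the limits are checked is a guess, and none of it is taken from mcpbr's scheduler.

```python
def run_tasks(task_results, budget=None, continue_on_error=True, max_failures=None):
    """Consume (succeeded, cost_usd) pairs until a limit is hit.

    Hypothetical illustration only; returns (tasks_completed, stop_reason).
    """
    spent, failures, completed = 0.0, 0, 0
    for succeeded, cost in task_results:
        spent += cost
        completed += 1
        if not succeeded:
            failures += 1
            if not continue_on_error:
                return completed, "task failed"
            if max_failures is not None and failures >= max_failures:
                return completed, "max_failures reached"
        # Assumed semantics: stop as soon as spending reaches the budget.
        if budget is not None and spent >= budget:
            return completed, "budget reached"
    return completed, "finished"
```

With max_failures left at None, failures never stop the run unless continue_on_error is false.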
Caching and Checkpointing¶
| Field | Type | Default | Description |
|---|---|---|---|
| cache_enabled | bool | false | Enable result caching to avoid re-running identical evaluations |
| cache_dir | Path \| None | None | Cache directory (default: ~/.cache/mcpbr) |
| checkpoint_interval | int | 1 | Save checkpoint every N completed tasks (minimum: 1) |
| resume_from_checkpoint | Path \| None | None | Path to checkpoint file to resume from |
Docker and Infrastructure¶
| Field | Type | Default | Description |
|---|---|---|---|
| use_prebuilt_images | bool | true | Use pre-built Docker images when available |
| volumes | dict[str, str] | {} | Additional volume mounts (host_path: container_path) |
| infrastructure | InfrastructureConfig | InfrastructureConfig() | Infrastructure configuration (local or azure) |
Output¶
| Field | Type | Default | Description |
|---|---|---|---|
| output_dir | str \| None | None | Directory for outputs (default: .mcpbr_run_TIMESTAMP) |
| disable_logs | bool | false | Disable detailed execution logs |
| enable_profiling | bool | false | Enable performance profiling (tool latency, memory) |
Validators¶
HarnessConfig includes automatic validation:
- provider: Must be one of VALID_PROVIDERS
- agent_harness: Must be one of VALID_HARNESSES
- benchmark: Must be one of VALID_BENCHMARKS
- timeout_seconds: Must be at least 30
- max_concurrent: Must be at least 1
- budget: Must be positive if set
- thinking_budget: Must be between 1024 and 31999 if set
- Server config consistency: comparison_mode requires both mcp_server_a and mcp_server_b; single mode requires mcp_server
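The bounds and the server-consistency rule listed above can be pictured with a small standalone check. This is a sketch only: `check_config` is a hypothetical helper that works on a plain dict, not mcpbr's Pydantic validators, and its messages are illustrative.

```python
# Hypothetical stand-in for the documented rules; not mcpbr's actual validators.
MIN_THINKING_BUDGET = 1024
MAX_THINKING_BUDGET = 31999


def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    budget = cfg.get("thinking_budget")
    if budget is not None and not (MIN_THINKING_BUDGET <= budget <= MAX_THINKING_BUDGET):
        problems.append("thinking_budget must be between 1024 and 31999")

    if cfg.get("timeout_seconds", 300) < 30:
        problems.append("timeout_seconds must be at least 30")

    if cfg.get("comparison_mode", False):
        # Comparison mode needs both servers to be configured.
        if not (cfg.get("mcp_server_a") and cfg.get("mcp_server_b")):
            problems.append("comparison_mode requires mcp_server_a and mcp_server_b")
    elif not cfg.get("mcp_server"):
        problems.append("single mode requires mcp_server")

    return problems
```

An empty list means the dict passed every check in the sketch. The real models raise validation errors at construction time instead of returning a list.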
MCPServerConfig¶
Configuration for an MCP server process.
MCPServerConfig ¶
Bases: BaseModel
Configuration for an MCP server.
name = Field(default='mcpbr', description='Name to register the MCP server as (appears in tool names)') class-attribute instance-attribute ¶
command = Field(description="Command to start the MCP server (e.g., 'npx', 'uvx', 'python')") class-attribute instance-attribute ¶
args = Field(default_factory=list, description='Arguments to pass to the command. Use {workdir} as placeholder.') class-attribute instance-attribute ¶
env = Field(default_factory=dict, description='Environment variables for the MCP server') class-attribute instance-attribute ¶
startup_timeout_ms = Field(default=60000, description='Timeout in milliseconds for MCP server startup (default: 60s)') class-attribute instance-attribute ¶
tool_timeout_ms = Field(default=900000, description='Timeout in milliseconds for MCP tool execution (default: 15 min for long-running tools)') class-attribute instance-attribute ¶
setup_command = Field(default=None, description='Shell command to run inside the container BEFORE the agent starts. Runs outside the task timer (does not count against timeout_seconds). Use {workdir} as placeholder. Useful for pre-computing caches.') class-attribute instance-attribute ¶
setup_timeout_ms = Field(default=900000, description='Timeout in milliseconds for the setup_command (default: 15 min)') class-attribute instance-attribute ¶
get_args_for_workdir(workdir) ¶
Replace {workdir} placeholder in args with actual path.
get_setup_command_for_workdir(workdir) ¶
Replace {workdir} placeholder in setup_command with actual path.
get_expanded_env() ¶
Expand ${VAR} references in env values using os.environ.
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Dictionary with environment variables expanded. |
Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| name | str | "mcpbr" | Name to register the MCP server as (appears in tool names) |
| command | str | (required) | Command to start the MCP server (e.g., npx, uvx, python) |
| args | list[str] | [] | Arguments to pass to the command. Use {workdir} as placeholder |
| env | dict[str, str] | {} | Environment variables for the MCP server |
| startup_timeout_ms | int | 60000 | Timeout in milliseconds for MCP server startup (default: 60s) |
| tool_timeout_ms | int | 900000 | Timeout in milliseconds for MCP tool execution (default: 15 min) |
| setup_command | str \| None | None | Shell command to run inside the container before the agent starts |
| setup_timeout_ms | int | 900000 | Timeout in milliseconds for the setup_command (default: 15 min) |
Methods¶
get_args_for_workdir(workdir: str) -> list[str]¶
Replace {workdir} placeholder in args with the actual working directory path.
```python
from mcpbr.config import MCPServerConfig

server = MCPServerConfig(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
)
resolved_args = server.get_args_for_workdir("/tmp/task-repo")
# ["-y", "@modelcontextprotocol/server-filesystem", "/tmp/task-repo"]
```
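The substitution itself can be approximated with plain string replacement. The sketch below assumes the placeholder may appear anywhere inside an argument; `substitute_workdir` and the `--root=` flag are hypothetical and only illustrate the idea, not how mcpbr implements the method.

```python
def substitute_workdir(args, workdir):
    """Return a new list with every {workdir} occurrence replaced."""
    return [arg.replace("{workdir}", workdir) for arg in args]


# "--root=" is a made-up flag showing a placeholder embedded in a longer argument.
args = ["-y", "@modelcontextprotocol/server-filesystem", "--root={workdir}"]
resolved = substitute_workdir(args, "/tmp/task-repo")
```

The original list is left untouched, so the same template can be resolved once per task directory.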
get_setup_command_for_workdir(workdir: str) -> str | None¶
Replace {workdir} placeholder in setup_command.
```python
from mcpbr.config import MCPServerConfig

server = MCPServerConfig(
    command="my-server",
    setup_command="cd {workdir} && pip install -e .",
)
cmd = server.get_setup_command_for_workdir("/tmp/task-repo")
# "cd /tmp/task-repo && pip install -e ."
```
get_expanded_env() -> dict[str, str]¶
Expand ${VAR} references in env values using os.environ.
```python
import os

from mcpbr.config import MCPServerConfig

os.environ["MY_API_KEY"] = "sk-123"
server = MCPServerConfig(
    command="my-server",
    env={"API_KEY": "${MY_API_KEY}"},
)
expanded = server.get_expanded_env()
# {"API_KEY": "sk-123"}
```
AzureConfig¶
Configuration for Azure cloud infrastructure.
AzureConfig ¶
Bases: BaseModel
Configuration for Azure infrastructure.
validate_resource_group(v) classmethod ¶
Validate resource group name format.
Azure resource groups must be 1-90 characters and contain only alphanumeric characters, dashes, and underscores.
validate_location(v) classmethod ¶
Validate Azure region.
validate_cpu_cores(v) classmethod ¶
Validate CPU cores is at least 1.
validate_memory_gb(v) classmethod ¶
Validate memory is at least 1 GB.
validate_disk_gb(v) classmethod ¶
Validate disk size is at least 30 GB.
validate_env_keys(v) classmethod ¶
Validate env_keys_to_export is a list of strings.
Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| resource_group | str | (required) | Azure resource group name (alphanumeric, dash, underscore, 1-90 chars) |
| location | str | "eastus" | Azure region (e.g., eastus, westus2, northeurope) |
| vm_size | str \| None | None | Azure VM size (e.g., Standard_D4s_v3). Alternative to cpu_cores/memory_gb |
| cpu_cores | int | 8 | Number of CPU cores (used if vm_size not specified) |
| memory_gb | int | 32 | Memory in GB (used if vm_size not specified) |
| disk_gb | int | 250 | Disk size in GB (minimum: 30) |
| auto_shutdown | bool | true | Automatically shut down the VM after evaluation completes |
| preserve_on_error | bool | true | Keep VM running if evaluation fails for debugging |
| env_keys_to_export | list[str] | ["ANTHROPIC_API_KEY"] | Environment variables to export to Azure VM |
| ssh_key_path | Path \| None | None | Path to SSH key (auto-generated if not provided) |
| zone | str \| None | None | Azure availability zone ("1", "2", or "3") |
| python_version | str | "3.11" | Python version to install on VM |
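The resource group and disk size rules can be expressed as simple predicates. The sketch below derives its pattern from the rule as documented here (1-90 characters; alphanumeric, dash, underscore). The helper names are hypothetical, and the real validators may differ in detail.

```python
import re

# Pattern built from the documented rule, not copied from mcpbr.
RESOURCE_GROUP_RE = re.compile(r"[A-Za-z0-9_-]{1,90}")


def is_valid_resource_group(name: str) -> bool:
    """True when the whole name matches the documented character set and length."""
    return RESOURCE_GROUP_RE.fullmatch(name) is not None


def is_valid_disk_gb(size: int) -> bool:
    """True when the disk size meets the documented 30 GB minimum."""
    return size >= 30
```

Using `fullmatch` rather than `match` matters here, since `match` would accept a valid prefix followed by disallowed characters.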
InfrastructureConfig¶
Configuration for the infrastructure mode selector.
InfrastructureConfig ¶
Bases: BaseModel
Configuration for infrastructure mode.
validate_provider_config() ¶
Ensure the correct provider config is present for the selected mode.
Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| mode | Literal["local", "azure"] | "local" | Infrastructure mode |
| azure | AzureConfig \| None | None | Azure configuration (required when mode="azure") |
Azure Validation
When mode is set to "azure", the azure field is required. Omitting it raises a ValueError.
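The cross-field rule can be mimicked with a standard-library dataclass. `InfraSketch` below is a hypothetical stand-in, not the real InfrastructureConfig; it only shows the shape of a check that runs after all fields are set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InfraSketch:
    """Hypothetical stand-in for InfrastructureConfig (not the real model)."""

    mode: str = "local"
    azure: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.mode not in ("local", "azure"):
            raise ValueError(f"unknown mode: {self.mode!r}")
        # Mirrors the documented rule: azure mode needs an azure section.
        if self.mode == "azure" and self.azure is None:
            raise ValueError("azure configuration is required when mode='azure'")
```

Constructing `InfraSketch(mode="azure")` without an azure value raises ValueError, while the default `InfraSketch()` succeeds.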
load_config()¶
Load configuration from a YAML file with environment variable expansion and inheritance support.
load_config(config_path, warn_security=True) ¶
Load configuration from a YAML file with environment variable expansion.
Automatically loads .env file from current directory if it exists. Supports ${VAR} and ${VAR:-default} syntax for environment variables. Supports config inheritance via the 'extends' field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str \| Path | Path to the YAML configuration file. | required |
| warn_security | bool | Whether to print security warnings for hardcoded secrets. | True |
Returns:
| Type | Description |
|---|---|
| HarnessConfig | Validated HarnessConfig instance. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If config file doesn't exist. |
| ValueError | If config is invalid or required environment variables are missing. |
| CircularInheritanceError | If circular inheritance is detected. |
| ConfigInheritanceError | If there's an error loading or merging inherited configs. |
Signature¶
load_config(config_path: str | Path, warn_security: bool = True) -> HarnessConfig
Features¶
- Automatically loads a .env file from the current directory if one exists
- Supports ${VAR} and ${VAR:-default} syntax for environment variables
- Supports config inheritance via the extends field
- Validates all fields with Pydantic
- Warns about hardcoded secrets when warn_security=True
Example¶
```python
from mcpbr.config import load_config

# Load from YAML file
config = load_config("mcpbr.yaml")

# Suppress security warnings (e.g., in CI)
config = load_config("mcpbr.yaml", warn_security=False)
```
Exceptions¶
| Exception | When |
|---|---|
| FileNotFoundError | Config file does not exist |
| ValueError | Invalid config or missing required environment variables |
| CircularInheritanceError | Circular inheritance chain detected |
| ConfigInheritanceError | Error loading or merging inherited configs |
YAML Configuration Format¶
Minimal Configuration¶
```yaml
mcp_server:
  command: npx
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
benchmark: humaneval
model: sonnet
```
Full Configuration¶
```yaml
# MCP Server
mcp_server:
  name: my-mcp-server
  command: npx
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env:
    API_KEY: "${MY_API_KEY}"
  startup_timeout_ms: 60000
  tool_timeout_ms: 900000
  setup_command: "cd {workdir} && npm install"
  setup_timeout_ms: 300000

# Model and Provider
provider: anthropic
model: sonnet
agent_harness: claude-code
agent_prompt: |
  You are a software engineer. Fix the following bug:
  {problem_statement}

# Benchmark
benchmark: swe-bench-verified
sample_size: 20
filter_category:
  - django
  - flask

# Runtime
timeout_seconds: 600
max_concurrent: 4
max_iterations: 15
thinking_budget: 10000
budget: 50.0

# Error Handling
continue_on_error: true
max_failures: 10

# Caching
cache_enabled: true
cache_dir: ~/.cache/mcpbr

# Checkpointing
checkpoint_interval: 5

# Docker
use_prebuilt_images: true
volumes:
  /host/cache: /container/cache

# Output
output_dir: ./results
disable_logs: false
enable_profiling: true
```
Comparison Mode Configuration¶
```yaml
comparison_mode: true

mcp_server_a:
  name: server-v1
  command: npx
  args: ["-y", "mcp-server-v1", "{workdir}"]

mcp_server_b:
  name: server-v2
  command: npx
  args: ["-y", "mcp-server-v2", "{workdir}"]

benchmark: swe-bench-verified
model: sonnet
sample_size: 50
```
Azure Infrastructure Configuration¶
```yaml
mcp_server:
  command: npx
  args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]

benchmark: swe-bench-verified
model: sonnet
sample_size: 100

infrastructure:
  mode: azure
  azure:
    resource_group: mcpbr-eval-rg
    location: eastus2
    vm_size: Standard_D8s_v3
    disk_gb: 500
    auto_shutdown: true
    preserve_on_error: true
    env_keys_to_export:
      - ANTHROPIC_API_KEY
      - GITHUB_TOKEN
    python_version: "3.11"
```
Config Inheritance¶
Configs can extend other configs using the extends field:
```yaml
# base.yaml
mcp_server:
  command: npx
  args: ["-y", "my-server", "{workdir}"]
model: sonnet
timeout_seconds: 300
```
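How a child config might be combined with base.yaml can be sketched in a few lines. Everything here is an assumption made for illustration: the merge rule (child values win, nested mappings merge recursively), the `child.yaml` name and its override, and the helper names. mcpbr's actual merge behavior may differ, and it reports cycles with CircularInheritanceError, not the plain ValueError used below.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into a copy of base; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(name: str, configs: dict, seen: tuple = ()) -> dict:
    """Follow 'extends' links from name, merging parents first."""
    if name in seen:
        raise ValueError("circular inheritance: " + " -> ".join(seen + (name,)))
    config = dict(configs[name])
    parent = config.pop("extends", None)
    if parent is None:
        return config
    return deep_merge(resolve(parent, configs, seen + (name,)), config)


# In-memory stand-ins for parsed YAML files.
configs = {
    "base.yaml": {
        "mcp_server": {"command": "npx", "args": ["-y", "my-server", "{workdir}"]},
        "model": "sonnet",
        "timeout_seconds": 300,
    },
    # Hypothetical child config; the file name and override are illustrative.
    "child.yaml": {"extends": "base.yaml", "timeout_seconds": 600},
}
resolved = resolve("child.yaml", configs)
```

In this sketch the child keeps the base's mcp_server and model and overrides only timeout_seconds.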
Environment Variable Support¶
Syntax¶
| Syntax | Description |
|---|---|
| ${VAR} | Required variable (error if not set) |
| ${VAR:-default} | Optional variable with default value |
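Both forms can be handled with a single regular expression. The sketch below is an illustration, not mcpbr's implementation: `expand` is a hypothetical helper, and it assumes the default applies only when the variable is absent (a POSIX shell would also apply it when the variable is set but empty).

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}; group 2 is None when no default is given.
_VAR_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(value: str, env=None) -> str:
    """Expand ${VAR} and ${VAR:-default} references in value."""
    env = os.environ if env is None else env

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise ValueError(f"required environment variable {name} is not set")

    return _VAR_RE.sub(replace, value)
```

Passing an explicit mapping as `env` keeps the function easy to test without touching the process environment.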
.env File¶
mcpbr automatically loads a .env file from the current directory if one exists.
Usage in YAML¶
```yaml
mcp_server:
  command: my-server
  env:
    API_KEY: "${ANTHROPIC_API_KEY}"
    DEBUG: "${DEBUG:-false}"
    REGION: "${AWS_REGION:-us-east-1}"
```
Security
Do not hardcode API keys or secrets directly in YAML config files. Use environment variables or a .env file instead. mcpbr will warn about potential hardcoded secrets when warn_security=True (the default).
Programmatic Configuration¶
Creating a Default Config¶
```python
from mcpbr.config import create_default_config

config = create_default_config()
# Uses default MCP server (filesystem), anthropic provider, sonnet model
```
Building Config Programmatically¶
```python
from mcpbr.config import HarnessConfig, MCPServerConfig, InfrastructureConfig, AzureConfig

config = HarnessConfig(
    mcp_server=MCPServerConfig(
        name="my-server",
        command="uvx",
        args=["my-mcp-server", "--workdir", "{workdir}"],
        env={"API_KEY": "from-env"},
    ),
    provider="anthropic",
    model="sonnet",
    benchmark="swe-bench-verified",
    sample_size=10,
    timeout_seconds=600,
    max_concurrent=2,
    budget=25.0,
    infrastructure=InfrastructureConfig(
        mode="azure",
        azure=AzureConfig(
            resource_group="my-eval-rg",
            location="eastus2",
            cpu_cores=8,
            memory_gb=32,
        ),
    ),
)
```