Best Practices Guide¶
This guide helps you get the most value from mcpbr while avoiding common pitfalls. Whether you're testing a new MCP server, optimizing costs, or setting up CI/CD pipelines, these practices will help you work effectively.
Why these practices matter
mcpbr was built on the principle that MCP servers should be tested like APIs, not like plugins. These best practices put that philosophy into action. See the Testing Philosophy page for the underlying principles.
Quick Reference¶
| Scenario | Recommended Approach |
|---|---|
| First-time setup | Use quick-test template, verify with n=1 |
| MCP server testing | Standalone test → quick-test → scale gradually |
| Cost optimization | Use Haiku model, small samples, --mcp-only flag |
| Production evaluation | Use production template, save all outputs |
| CI/CD integration | Use regression detection, JUnit XML, notifications |
| Security testing | Start with cybergym-basic, progress to advanced |
| Debugging failures | Enable -vv, use --log-dir, analyze tool usage |
Benchmark Selection Guidelines¶
Choosing the Right Benchmark¶
Use SWE-bench when:

- Testing code exploration and bug-fixing capabilities
- Evaluating Python-focused MCP servers
- Need proven, standardized benchmarks
- Want fast evaluation with pre-built images

Use CyberGym when:

- Testing security analysis capabilities
- Evaluating C/C++ code understanding
- Need vulnerability detection benchmarks
- Want to test different difficulty levels
SWE-bench Best Practices¶
Start Small, Scale Gradually
# Step 1: Single task smoke test
mcpbr run -c config.yaml -n 1 -v
# Step 2: Small sample
mcpbr run -c config.yaml -n 5 -o results-5.json
# Step 3: Medium sample
mcpbr run -c config.yaml -n 25 -o results-25.json
# Step 4: Full evaluation (if needed)
mcpbr run -c config.yaml -o results-full.json
Use Pre-built Images
Pre-built images provide:

- Validated dependency installation
- Consistent evaluation environment
- Faster startup (no package installation)
- Working Python imports inside containers
Anti-pattern: Disabling pre-built images without good reason
CyberGym Best Practices¶
Choose Appropriate Difficulty Level
| Level | Context | Use Case | Typical Success Rate |
|---|---|---|---|
| 0 | Minimal | Test discovery abilities | Low (10-20%) |
| 1 | Type only | Balanced challenge | Medium (20-40%) |
| 2 | Description | Practical testing | Higher (40-60%) |
| 3 | Full context | Maximum guidance | Highest (60-80%) |
Start with Level 2 for Most Use Cases
# Level 2 provides good balance of challenge and success
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5
Increase Timeouts for Compilation
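CyberGym tasks often compile C/C++ targets before analysis, so budget time for builds. A minimal sketch using config keys from this guide (the timeout guidance later in this guide suggests 600-900s for CyberGym):

# Allow time for C/C++ compilation on CyberGym tasks
timeout_seconds: 900
max_iterations: 30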
Anti-pattern: Using level 3 for all testing
# Level 3 is too easy for meaningful evaluation
mcpbr run -c config.yaml --benchmark cybergym --level 3 # Not recommended
MCP Server Configuration Best Practices¶
Selecting MCP Servers¶
Match Server Capabilities to Benchmark Needs
For SWE-bench (bug fixing):

- Filesystem access (read/write)
- Code search capabilities
- Test execution tools
- Git operations

For CyberGym (security):

- Code analysis tools
- AST parsing
- Vulnerability pattern detection
- Build system integration
Configuration Patterns¶
Good: Clear, Minimal Configuration
mcp_server:
name: "mcpbr"
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
env: {}
Better: With Environment Variables
mcp_server:
name: "codebase"
command: "npx"
args: ["-y", "@supermodeltools/mcp-server"]
env:
SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"
LOG_LEVEL: "info"
Anti-pattern: Hardcoded Secrets
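For illustration, this is the pattern to avoid (the key shown is a placeholder):

mcp_server:
  env:
    SUPERMODEL_API_KEY: "sk-live-abc123"   # Never commit a literal key; use "${SUPERMODEL_API_KEY}" instead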
Testing Your MCP Server¶
Step 1: Standalone Verification
# Test server starts correctly
npx -y @modelcontextprotocol/server-filesystem /tmp/test
# For custom servers
python -m my_mcp_server --workspace /tmp/test
Step 2: Quick Smoke Test
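# Single task, verbose output -- confirms the server registers and a patch is attempted
mcpbr run -c config.yaml -n 1 -v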
Step 3: Analyze Tool Usage
# Check if MCP tools are being used
mcpbr run -c config.yaml -n 5 -o results.json
cat results.json | jq '.tasks[0].mcp.tool_usage'
Step 4: Compare Against Baseline
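# Run both the MCP and baseline agents (the default), then inspect the summary
mcpbr run -c config.yaml -n 10 -o compare.json
cat compare.json | jq '.summary'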
Red Flags:

- MCP tools never appear in tool_usage
- Tool usage is always 0 or very low
- Similar results between MCP and baseline
- Server startup warnings in logs
Performance Optimization Tips¶
Docker Resource Management¶
Set Appropriate Concurrency
# For most systems
max_concurrent: 4
# For powerful machines (16+ GB RAM, 8+ cores)
max_concurrent: 8
# For limited resources or API rate limits
max_concurrent: 2
# For debugging
max_concurrent: 1
Monitor Resource Usage
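# Watch container CPU and memory usage while an evaluation runs
docker stats --filter "name=mcpbr"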
Clean Up Orphaned Containers
# Preview what will be removed
mcpbr cleanup --dry-run
# Remove orphaned containers
mcpbr cleanup -f
Apple Silicon Optimization¶
Expected Performance:

- Tasks take 2-3x longer than native x86_64
- This is normal due to emulation
- Pre-built images help reduce overhead
Recommended Settings:
max_concurrent: 2 # Reduce concurrency
timeout_seconds: 600 # Increase timeouts
use_prebuilt_images: true # Essential for performance
Install Rosetta 2 (if not already installed)
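# macOS ships Rosetta 2 as an optional install
softwareupdate --install-rosetta --agree-to-license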
Timeout Tuning¶
Default Timeouts by Benchmark
| Benchmark | Recommended Timeout | Max Iterations |
|---|---|---|
| SWE-bench | 300-600s | 10-30 |
| CyberGym | 600-900s | 15-30 |
Adjust Based on Task Complexity
# Simple tasks
timeout_seconds: 300
max_iterations: 10
# Complex tasks or slow hardware
timeout_seconds: 600
max_iterations: 30
Anti-pattern: Extremely Long Timeouts
Model Selection¶
Development/Testing: `haiku` -- fast and cheap for iteration and smoke tests
Production/Benchmarking: `sonnet` -- the recommended balance of capability and cost
Maximum Performance: `opus` -- highest capability for final benchmarks, at a premium price
Cost Management Strategies¶
Understanding Costs¶
Token Usage Factors:

- Model choice (Haiku < Sonnet < Opus)
- Number of iterations (more turns = more tokens)
- Task complexity (complex bugs require more exploration)
- Sample size (most obvious cost driver)

Typical Costs (per task, Sonnet model):

- Simple task: $0.10-0.30 (5-10K output tokens)
- Medium task: $0.30-0.80 (10-20K output tokens)
- Complex task: $0.80-2.00 (20-50K output tokens)
Cost Optimization Strategies¶
1. Start Small
# Test with 1 task first
mcpbr run -c config.yaml -n 1
# Scale to 5 tasks to validate
mcpbr run -c config.yaml -n 5
# Only run full evaluation when confident
mcpbr run -c config.yaml -n 50
2. Use Faster Models for Development
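# In your config while iterating (switch back to sonnet for final runs)
model: "haiku"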
3. Skip Baseline During Iteration
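# -M runs only the MCP agent, roughly halving API cost per run
mcpbr run -c config.yaml -n 5 -M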
4. Reduce Iterations
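# Fewer agent turns per task while iterating
max_iterations: 10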
5. Monitor Token Usage
# Save results and check token consumption
mcpbr run -c config.yaml -n 5 -o results.json
# Analyze token usage
cat results.json | jq '.tasks[] | {id: .instance_id, tokens: .mcp.tokens}'
Anti-pattern: Running Full Evaluations Repeatedly
# Bad: Running 300 tasks multiple times during development
mcpbr run -c config.yaml # Default: full dataset
mcpbr run -c config.yaml # Oops, again...
# Result: Hundreds of dollars in API costs
Good Pattern: Incremental Testing
# Development cycle
mcpbr run -c config.yaml -n 1 -M # $0.20
mcpbr run -c config.yaml -n 5 -M # $1-2
mcpbr run -c config.yaml -n 10 # $5-10
# Only when ready:
mcpbr run -c config.yaml -n 50 -o final.json # $50-100
Cost Tracking¶
Track Costs Per Run
import json
with open("results.json") as f:
results = json.load(f)
# Calculate approximate costs (Sonnet pricing as of 2026)
INPUT_COST = 3.00 / 1_000_000 # $3 per 1M tokens
OUTPUT_COST = 15.00 / 1_000_000 # $15 per 1M tokens
total_cost = 0
for task in results["tasks"]:
mcp = task.get("mcp", {})
tokens = mcp.get("tokens", {})
input_tokens = tokens.get("input", 0)
output_tokens = tokens.get("output", 0)
task_cost = (input_tokens * INPUT_COST) + (output_tokens * OUTPUT_COST)
total_cost += task_cost
print(f"Total cost: ${total_cost:.2f}")
print(f"Average per task: ${total_cost / len(results['tasks']):.2f}")
Result Interpretation Guidelines¶
Understanding Resolution Rates¶
What "Resolved" Means: 1. Patch was generated 2. Patch applied cleanly 3. All FAIL_TO_PASS tests now pass 4. All PASS_TO_PASS tests still pass
Interpreting Improvement:

For example, suppose a 50-task run resolves 5 tasks with the baseline agent (10%) and 8 tasks with the MCP agent (16%) -- hypothetical numbers chosen for illustration. This means:

- The MCP agent is 60% better than the baseline (relative improvement)
- Your MCP server helped on 3 additional tasks
- Both agents struggled (absolute rates are low)
Success Rate Benchmarks¶
Typical Resolution Rates (SWE-bench Lite):
| Configuration | Expected Range | Interpretation |
|---|---|---|
| Baseline (Sonnet) | 15-25% | Normal for single-shot |
| Basic filesystem MCP | 20-30% | Modest improvement |
| Advanced MCP server | 30-45% | Significant value |
| State-of-the-art | 45-60% | Excellent performance |
Low Rates (Both < 15%):

- Tasks may be inherently difficult
- Sample may include hard tasks
- Timeouts may be too short
- Model may need more iterations

High Baseline (> 25%):

- Sample may include easier tasks
- Good task selection
- Model is performing well

Low Improvement (< 10%):

- MCP tools not providing value
- Tools not being used effectively
- Baseline already sufficient
Analyzing Tool Usage¶
Extract Tool Statistics
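# Per-task tool call counts for the MCP agent
cat results.json | jq '.tasks[].mcp.tool_usage'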
Healthy Tool Distribution:
{
"Grep": 15, // Searching code
"Read": 20, // Reading files
"Bash": 25, // Running tests
"Edit": 5, // Making changes
"mcp__read": 10 // MCP tools being used
}
Red Flags:
{
"TodoWrite": 50, // Too much planning, not enough action
"mcp__search": 0 // MCP tools not being used at all
}
Comparing Configurations¶
Save Results with Descriptive Names
mcpbr run -c filesystem.yaml -o results-filesystem.json
mcpbr run -c supermodel.yaml -o results-supermodel.json
Compare Resolution Rates
import json
def compare_results(file1, file2):
with open(file1) as f1, open(file2) as f2:
r1 = json.load(f1)
r2 = json.load(f2)
rate1 = r1["summary"]["mcp"]["rate"]
rate2 = r2["summary"]["mcp"]["rate"]
print(f"{file1}: {rate1:.1%}")
print(f"{file2}: {rate2:.1%}")
print(f"Difference: {(rate2 - rate1):.1%}")
compare_results("results-filesystem.json", "results-supermodel.json")
Security Considerations¶
API Key Management¶
Good: Environment Variables
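export ANTHROPIC_API_KEY="sk-ant-..."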
Better: Shell Profile
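# Persist the key across sessions (zsh shown; use ~/.bashrc for bash)
echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc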
Best: Secret Management
# Using 1Password CLI
export ANTHROPIC_API_KEY=$(op read "op://vault/mcpbr/api_key")
# Using AWS Secrets Manager
export ANTHROPIC_API_KEY=$(aws secretsmanager get-secret-value \
--secret-id mcpbr/anthropic-key --query SecretString --output text)
Anti-pattern: Hardcoded in Config
Docker Security¶
Network Isolation (when external access not needed)
# For most use cases, network access is required for API calls
# But if testing without MCP:
docker_network_mode: "none"
Container Cleanup
# Regularly clean up orphaned containers
mcpbr cleanup -f
# Check for running containers
docker ps | grep mcpbr
Data Security¶
Sensitive Repositories:

- mcpbr runs on public datasets (SWE-bench, CyberGym)
- Do NOT use with proprietary code
- Task data is sent to the Anthropic API
- Logs may contain code snippets
Log Management:
# Logs contain code and conversations
mcpbr run -c config.yaml --log-dir logs/
# Secure log files
chmod 700 logs/
CI/CD Integration Patterns¶
GitHub Actions¶
Basic Workflow
name: MCP Benchmark
on:
pull_request:
paths:
- 'mcp-server/**'
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install mcpbr
run: pip install mcpbr
- name: Run benchmark
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
mcpbr run -c config.yaml -n 10 -o results.json
- name: Upload results
uses: actions/upload-artifact@v3
with:
name: benchmark-results
path: results.json
With Regression Detection
- name: Download baseline
run: |
gh run download --name baseline-results --dir .
- name: Run with regression detection
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
mcpbr run -c config.yaml -n 25 \
--baseline-results baseline.json \
--regression-threshold 0.1 \
--slack-webhook ${{ secrets.SLACK_WEBHOOK }} \
-o current.json
With JUnit XML
- name: Run benchmark
run: mcpbr run -c config.yaml --output-junit junit.xml
- name: Publish test results
uses: EnricoMi/publish-unit-test-result-action@v2
if: always()
with:
files: junit.xml
GitLab CI¶
mcpbr-benchmark:
image: python:3.11
services:
- docker:dind
variables:
DOCKER_HOST: tcp://docker:2375
script:
- pip install mcpbr
- mcpbr run -c config.yaml -n 10 --output-junit junit.xml
artifacts:
reports:
junit: junit.xml
paths:
- results.json
only:
- merge_requests
Cost Control in CI/CD¶
Sample Size Limits
# Don't run full benchmarks on every PR
mcpbr run -c config.yaml -n 10 # Small sample for PRs
# Full benchmarks on main branch only
mcpbr run -c config.yaml -n 50 # Larger sample for releases
Conditional Execution
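One approach, mirroring the branch check used in the workflow examples later in this guide:

# Only run the larger sample on the main branch
- name: Run full benchmark
  if: github.ref == 'refs/heads/main'
  run: mcpbr run -c config.yaml -n 50 -o results.json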
Debugging and Troubleshooting¶
Diagnostic Workflow¶
Step 1: Verify Prerequisites
# Check Docker
docker info
# Check API key
echo $ANTHROPIC_API_KEY | head -c 10
# Check Claude CLI
which claude
Step 2: Test MCP Server Standalone
# For filesystem server
npx -y @modelcontextprotocol/server-filesystem /tmp/test
# For custom servers
python -m my_mcp_server --workspace /tmp/test
Step 3: Run Single Task with Verbose Logging
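# Maximum verbosity plus per-instance JSON logs
mcpbr run -c config.yaml -n 1 -vv --log-dir debug/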
Step 4: Analyze Logs
# Check system events
cat debug/*.json | jq '.events[] | select(.type == "system")'
# Check tool usage
cat debug/*.json | jq '.events[] | select(.type == "assistant") |
.message.content[] | select(.type == "tool_use") | .name'
Common Issues and Solutions¶
MCP Server Not Starting
Solutions:

1. Test server command directly
2. Check environment variables are set
3. Verify command is in PATH
4. Check server logs for errors
No Patch Generated
Causes:

- Task too complex for iterations limit
- Agent couldn't find solution
- Agent made changes then reverted

Solutions:

- Increase `max_iterations` so the agent has more turns to converge
- Increase `timeout_seconds` for complex tasks
- Review per-instance logs (`--log-dir`) to see what the agent attempted
Timeouts
Solutions:

- Increase `timeout_seconds` (e.g. 600 for complex tasks or slow hardware)
- Reduce `max_concurrent` so each container gets more resources
- Keep `use_prebuilt_images: true` to avoid slow dependency installation
Tests Failing
This means:

- Patch applied successfully
- But didn't fix the bug
- Agent made incorrect changes
- Not an mcpbr issue -- agent behavior
Debug Flags¶
Verbose Output Levels
# Standard output
mcpbr run -c config.yaml
# Verbose: summary + task progress
mcpbr run -c config.yaml -v
# Very verbose: detailed tool calls
mcpbr run -c config.yaml -vv
Per-Instance Logs
# Create detailed JSON logs for each task
mcpbr run -c config.yaml --log-dir logs/
# Logs are timestamped: instance_id_runtype_timestamp.json
ls logs/
# django__django-11099_mcp_20260120_143052.json
# django__django-11099_baseline_20260120_143156.json
Single Log File
Iterative Development Workflow¶
Phase 1: Quick Validation¶
Goal: Verify basic functionality
# 1. Test MCP server starts
npx -y @modelcontextprotocol/server-filesystem /tmp/test
# 2. Run single task
mcpbr init -t quick-test
mcpbr run -c mcpbr.yaml -v
# 3. Check if it worked
# - Did the server start? (check for warnings)
# - Were tools registered? (check verbose output)
# - Was a patch generated?
Success Criteria:

- No server startup errors
- Task completes without timeout
- Patch generated (even if incorrect)
Phase 2: Small-Scale Testing¶
Goal: Validate at small scale
# 1. Run 5 tasks with MCP only
mcpbr run -c config.yaml -n 5 -M -o dev-mcp.json
# 2. Analyze tool usage
cat dev-mcp.json | jq '.tasks[].mcp.tool_usage'
# 3. Check if MCP tools are used
cat dev-mcp.json | jq '.tasks[].mcp.tool_usage |
to_entries | map(select(.key | startswith("mcp")))'
Success Criteria:

- MCP tools appear in tool_usage
- At least 1-2 tasks resolved
- No consistent errors
Phase 3: Baseline Comparison¶
Goal: Measure improvement
# 1. Run 10 tasks with MCP + baseline
mcpbr run -c config.yaml -n 10 -o comparison.json
# 2. Check improvement
cat comparison.json | jq '.summary'
# 3. Find MCP-only wins
cat comparison.json | jq '.tasks[] |
select(.mcp.resolved == true and .baseline.resolved == false) |
.instance_id'
Success Criteria:

- MCP rate > baseline rate
- At least 1-2 MCP-only wins
- Improvement > 10%
Phase 4: Optimization¶
Goal: Improve performance based on findings
Analyze Failures:
# Find tasks where MCP failed
cat comparison.json | jq '.tasks[] |
select(.mcp.resolved == false) |
{id: .instance_id, error: .mcp.error, iterations: .mcp.iterations}'
Common Optimizations:

- Increase iterations if hitting limits
- Adjust timeout if tasks time out
- Modify MCP server configuration
- Update agent prompt
Phase 5: Production Evaluation¶
Goal: Final comprehensive benchmark
# 1. Run larger sample
mcpbr run -c config.yaml -n 50 -o production.json -r report.md
# 2. Save for regression detection
cp production.json baseline.json
# 3. Generate all outputs
mcpbr run -c config.yaml -n 50 \
-o results.json \
-y results.yaml \
-r report.md \
--output-junit junit.xml \
--log-dir logs/
Success Criteria:

- Statistically significant sample (n >= 25)
- Results saved for future comparison
- Improvement is consistent
- Documentation completed
Templates and Configuration¶
Using Templates Effectively¶
Start with Templates
# List all templates
mcpbr templates
# Use appropriate template
mcpbr init -t quick-test # For testing
mcpbr init -t filesystem # For development
mcpbr init -t production # For final evaluation
Customize After Generation
# Generate from template
mcpbr init -t filesystem
# Edit to customize
vim mcpbr.yaml
# Test your changes
mcpbr run -c mcpbr.yaml -n 1 -v
Configuration Patterns¶
Development Configuration
mcp_server:
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
model: "haiku" # Fast and cheap
sample_size: 5 # Small sample
max_concurrent: 1 # Serial execution for debugging
timeout_seconds: 180 # Shorter timeout
max_iterations: 5 # Fewer iterations
use_prebuilt_images: true
Production Configuration
mcp_server:
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
model: "sonnet" # Best balance
sample_size: 50 # Meaningful sample
max_concurrent: 4 # Parallel execution
timeout_seconds: 600 # Generous timeout
max_iterations: 30 # More iterations
use_prebuilt_images: true
CI/CD Configuration
mcp_server:
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
model: "sonnet"
sample_size: 10 # Quick feedback
max_concurrent: 2 # Don't overload
timeout_seconds: 300
max_iterations: 15 # Balanced
use_prebuilt_images: true
Examples and Use Cases¶
Use Case 1: Testing a New MCP Server¶
Scenario: You've built a custom MCP server with advanced code search
Workflow:
# 1. Create config from template
mcpbr init -t custom-python -o my-server.yaml
# 2. Edit to point to your server
# Update args: ["-m", "my_mcp_server", "--workspace", "{workdir}"]
# 3. Test standalone
python -m my_mcp_server --workspace /tmp/test
# 4. Quick test
mcpbr run -c my-server.yaml -n 1 -v -M
# 5. Small comparison
mcpbr run -c my-server.yaml -n 10 -o results.json
# 6. Analyze tool usage
cat results.json | jq '.tasks[0].mcp.tool_usage'
# 7. Full evaluation if promising
mcpbr run -c my-server.yaml -n 50 -o final.json -r report.md
Use Case 2: Comparing Two MCP Servers¶
Scenario: Evaluating filesystem vs. Supermodel
Workflow:
# 1. Create configurations
mcpbr init -t filesystem -o filesystem.yaml
mcpbr init -t supermodel -o supermodel.yaml
# 2. Set API key for Supermodel
export SUPERMODEL_API_KEY="your-key"
# 3. Run identical samples
mcpbr run -c filesystem.yaml -n 25 -o fs-results.json
mcpbr run -c supermodel.yaml -n 25 -o sm-results.json
# 4. Compare results
python compare.py fs-results.json sm-results.json
# 5. Analyze differences
# Check which tasks each solved
# Compare tool usage patterns
# Analyze token consumption
Use Case 3: Cost-Optimized Development¶
Scenario: Limited budget, need to test iteratively
Workflow:
# Phase 1: Ultra-cheap validation (< $1)
mcpbr init -t quick-test
# Edit: model: "haiku", sample_size: 1, max_iterations: 3
mcpbr run -c mcpbr.yaml -M # MCP only, ~$0.10
# Phase 2: Small test (< $5)
# Edit: sample_size: 5, max_iterations: 5
mcpbr run -c mcpbr.yaml -M # ~$2-3
# Phase 3: Baseline comparison (< $20)
# Edit: sample_size: 10, max_iterations: 10
mcpbr run -c mcpbr.yaml -o results.json # ~$10-15
# Phase 4: Production (budgeted)
# Switch to: model: "sonnet", sample_size: 50
mcpbr run -c mcpbr.yaml -o final.json # ~$100-150
Use Case 4: CI/CD Integration¶
Scenario: Automated regression testing on PR
Workflow:
# .github/workflows/mcp-test.yml
name: MCP Regression Test
on:
pull_request:
paths: ['mcp-server/**']
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Download baseline
run: gh run download --name baseline --dir .
- name: Install mcpbr
run: pip install mcpbr
- name: Run regression test
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
mcpbr run -c config.yaml -n 10 \
--baseline-results baseline.json \
--regression-threshold 0.1 \
--slack-webhook ${{ secrets.SLACK_WEBHOOK }} \
-o current.json \
--output-junit junit.xml
- name: Publish results
uses: EnricoMi/publish-unit-test-result-action@v2
if: always()
with:
files: junit.xml
Use Case 5: Security Research¶
Scenario: Evaluating vulnerability detection capabilities
Workflow:
# 1. Start with basic level
mcpbr init -t cybergym-basic
# 2. Test single vulnerability
mcpbr run -c mcpbr.yaml -n 1 -v --log-dir logs/
# 3. Check PoC generation
# Look for poc.c, poc.py in logs
cat logs/*.json | jq '.events[] | select(.type == "assistant") |
.message.content[] | select(.type == "text") | .text' | grep -i poc
# 4. Scale up
mcpbr run -c mcpbr.yaml -n 5 -o level1.json
# 5. Try higher level
mcpbr init -t cybergym-advanced -o level3.yaml
mcpbr run -c level3.yaml -n 5 -o level3.json
# 6. Compare difficulty levels
python compare.py level1.json level3.json
Anti-Patterns to Avoid¶
Configuration Anti-Patterns¶
Bad: Hardcoded secrets
Bad: Unrealistic timeouts
Bad: Excessive concurrency
Workflow Anti-Patterns¶
Bad: Running full benchmark during development
Bad: Not saving results
Bad: Skipping standalone testing
Analysis Anti-Patterns¶
Bad: Focusing only on resolution rate
# Missing important insights
rate = results["summary"]["mcp"]["rate"]
print(f"Rate: {rate}") # That's it?
Bad: Not checking tool usage
Bad: Comparing different samples
# Results not comparable
mcpbr run -c config-a.yaml -n 10 # Random 10
mcpbr run -c config-b.yaml -n 10 # Different random 10
Quick Start Checklist¶
Before First Run:

- [ ] Docker installed and running
- [ ] ANTHROPIC_API_KEY set
- [ ] Claude CLI installed (`which claude`)
- [ ] mcpbr installed (`mcpbr --version`)

For New MCP Server:

- [ ] Test server standalone
- [ ] Create config from template
- [ ] Run single task test (n=1)
- [ ] Check tool registration
- [ ] Verify MCP tools used
- [ ] Scale to 5-10 tasks
- [ ] Save results
- [ ] Compare vs baseline

For Production Run:

- [ ] Config validated
- [ ] Sample size determined
- [ ] Timeout appropriate
- [ ] Output paths specified
- [ ] Baseline results saved
- [ ] Budget confirmed
- [ ] Results will be saved
Security Best Practices¶
Securing your mcpbr deployment is critical, especially when running evaluations in CI/CD pipelines, shared environments, or with third-party MCP servers.
API Key Management¶
Never commit API keys to version control
API keys in configuration files or source code are one of the most common security incidents. Use environment variables or secret management tools exclusively.
Hierarchy of Key Management (from basic to advanced):
| Method | Security Level | Best For |
|---|---|---|
| Environment variable | Basic | Local development |
| `.env` file (gitignored) | Better | Team development |
| Shell profile (`~/.zshrc`) | Better | Personal machines |
| Secret manager (1Password, AWS SM) | Best | Production / CI/CD |
| Hardware security module (HSM) | Maximum | Enterprise deployments |
Using .env files with mcpbr:
mcpbr automatically loads .env files from the current directory. Create one but ensure it is never committed:
# Create .env file
echo 'ANTHROPIC_API_KEY=sk-ant-...' > .env
echo 'SUPERMODEL_API_KEY=your-key-here' >> .env
# Ensure .env is gitignored
echo '.env' >> .gitignore
Then reference variables in your config:
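mcp_server:
  env:
    SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"  # Expanded from the environment / .env file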
Using secret managers in CI/CD:
# AWS Secrets Manager
export ANTHROPIC_API_KEY=$(aws secretsmanager get-secret-value \
--secret-id mcpbr/anthropic-key --query SecretString --output text)
# 1Password CLI
export ANTHROPIC_API_KEY=$(op read "op://vault/mcpbr/api_key")
# HashiCorp Vault
export ANTHROPIC_API_KEY=$(vault kv get -field=api_key secret/mcpbr)
Rotate keys regularly
Set a reminder to rotate your API keys on a regular schedule (e.g., every 90 days). If you suspect a key has been exposed, rotate it immediately in the Anthropic Console.
Docker Security¶
mcpbr runs evaluation tasks inside Docker containers, which provides a baseline of isolation. You can harden this further.
Resource limits to prevent runaway containers:
# In your config, control concurrency to bound resource usage
max_concurrent: 4 # Limit parallel containers
timeout_seconds: 600 # Hard timeout per task
max_iterations: 30 # Cap agent turns
Monitor container activity:
# Watch container resource usage in real time
docker stats --filter "name=mcpbr"
# List all mcpbr containers (including stopped)
docker ps -a --filter "name=mcpbr"
Network isolation for non-API workloads:
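# Disable container networking when the workload does not need external access
docker_network_mode: "none"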
Containers run as root by default
Docker containers in mcpbr run as root within the container to ensure dependency installation and test execution work correctly. This is isolated from the host via Docker's namespacing, but avoid mounting sensitive host directories as volumes.
Volume mount security:
# Only mount what is necessary
volumes:
"/path/to/cache": "/cache" # Read-write mount for caching
# NEVER mount these:
# volumes:
# "/etc": "/host-etc" # Host system configuration
# "/var/run/docker.sock": "/var/run/docker.sock" # Docker socket
# "~/.ssh": "/root/.ssh" # SSH keys
MCP Server Sandboxing¶
Third-party MCP servers execute arbitrary code
MCP servers run commands and access files within the Docker container. Only use MCP servers you trust, and review their source code before deploying in production.
Sandboxing recommendations:
- Review server source code before first use
- Pin server versions to prevent supply-chain attacks (see the sketch after this list)
- Limit environment variables exposed to the server -- only pass what is required
- Use `setup_command` cautiously -- it runs with full container privileges
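A minimal sketch of version pinning, reusing the filesystem server from earlier examples (the version placeholder is illustrative -- substitute the exact release you reviewed):

mcp_server:
  command: "npx"
  # Pin the reviewed release instead of resolving the latest version on every run
  args: ["-y", "@modelcontextprotocol/server-filesystem@<pinned-version>", "{workdir}"]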
Output Sanitization¶
Logs and results may contain sensitive data:
- Task logs can include code snippets from repositories
- API conversations are recorded in per-instance logs
- Results JSON contains tool call traces
Secure your outputs:
# Restrict log directory permissions
mcpbr run -c config.yaml --log-dir logs/
chmod 700 logs/
# Redact sensitive fields before sharing results
cat results.json | jq 'del(.tasks[].mcp.conversation)' > results-safe.json
Audit before sharing
Before sharing results files, reports, or logs externally, review them for any API keys, internal paths, or proprietary code that may have been captured during evaluation.
Secure CI/CD Pipeline Configuration¶
# .github/workflows/benchmark.yml
name: MCP Benchmark (Secure)
on:
pull_request:
paths: ['mcp-server/**']
permissions:
contents: read # Minimal permissions
jobs:
benchmark:
runs-on: ubuntu-latest
environment: benchmarks # Use a protected environment
steps:
- uses: actions/checkout@v4
- name: Install mcpbr
run: pip install mcpbr
- name: Run benchmark
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
mcpbr run -c config.yaml -n 10 -o results.json
- name: Upload results (restricted retention)
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: results.json
retention-days: 30 # Don't keep results forever
CI/CD security checklist:
- [ ] API keys stored as repository or environment secrets (never in workflow files)
- [ ] Workflow permissions set to minimum required (`contents: read`)
- [ ] Protected environments for production benchmarks
- [ ] Artifact retention policies configured
- [ ] Branch protection rules on main branch
- [ ] Audit logs enabled for secret access
Performance Optimization¶
Beyond the basic performance tips covered earlier, this section provides advanced optimization strategies for large-scale evaluations.
Concurrent Task Execution¶
The max_concurrent setting controls how many Docker containers run simultaneously. Tuning this requires balancing CPU, memory, network bandwidth, and API rate limits.
Recommended settings by machine profile:
| Machine Profile | RAM | CPU Cores | max_concurrent | Notes |
|---|---|---|---|---|
| Laptop (8 GB) | 8 GB | 4 | 1-2 | Memory constrained |
| Workstation (16 GB) | 16 GB | 8 | 3-4 | Good balance |
| Power workstation (32 GB) | 32 GB | 12+ | 4-8 | Check API rate limits |
| Cloud VM (64+ GB) | 64+ GB | 16+ | 8-12 | Monitor network I/O |
| Apple Silicon (16 GB) | 16 GB | 8-10 | 2-3 | Emulation overhead |
# Conservative (safe default)
max_concurrent: 4
# Aggressive (high-end hardware with sufficient API quota)
max_concurrent: 8
Monitor and adjust
Start with max_concurrent: 4 and monitor with docker stats. If you see containers being OOM-killed or Docker becoming unresponsive, reduce concurrency. If CPU and memory utilization are low, increase it.
Docker Image Caching and Pre-built Images¶
Docker image pulls and builds are often the largest time cost for first-time runs. Optimize this with caching strategies.
Always use pre-built images:
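use_prebuilt_images: true  # Skip per-task dependency installation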
Pre-pull images before evaluation:
# Pre-pull common base images to avoid per-task download delays
docker pull ghcr.io/epoch-research/swe-bench.eval.x86_64.django__django-11099
docker pull ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907
Docker build cache optimization:
# Ensure Docker build cache is not pruned aggressively
docker system df # Check disk usage
# Prune only dangling images, keep build cache
docker image prune -f # Remove dangling only
Disk space management
SWE-bench pre-built images are large (1-3 GB each). A full evaluation may require 50+ GB of Docker image storage. Monitor disk usage with docker system df and clean up between runs with mcpbr cleanup.
Dataset Caching¶
mcpbr supports result caching to avoid re-running identical evaluations. This is especially valuable during iterative development.
When caching helps most:
- Re-running after a configuration change that only affects one side (MCP or baseline)
- Iterating on MCP server changes with the same task set
- Resuming after a partial failure
Persistent volume caching for MCP servers:
If your MCP server performs expensive pre-computation (like codebase indexing), use the setup_command and volumes features:
mcp_server:
command: "npx"
args: ["-y", "@supermodeltools/mcp-server", "{workdir}"]
setup_command: "npx -y @supermodeltools/mcp-server --index {workdir}"
setup_timeout_ms: 900000 # 15 minutes for indexing
# Mount a persistent volume for caching across tasks
volumes:
"/tmp/mcpbr-cache": "/cache"
Memory Management for Large Benchmarks¶
Full dataset evaluations (300+ tasks) can strain system memory. Use these strategies to stay within bounds.
# Reduce concurrency to lower peak memory
max_concurrent: 2
# Use the graceful degradation system to handle failures
continue_on_error: true
max_failures: 10 # Stop if too many tasks fail (likely a systemic issue)
# Enable checkpointing for crash recovery
checkpoint_interval: 1 # Save state after every task
System-level memory monitoring:
# Monitor Docker memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Set Docker Desktop memory limit (macOS/Windows)
# Docker Desktop > Settings > Resources > Memory: 8-12 GB
Memory budget rule of thumb
Each concurrent container uses approximately 1-3 GB of RAM. For a machine with 16 GB total, allocate 8-12 GB to Docker and set max_concurrent to 3-4.
Network Optimization for Remote Evaluations¶
Reduce network latency:
- Use a cloud VM in the same region as the Anthropic API endpoint
- Pre-pull Docker images before starting evaluations
- Use `setup_command` to front-load network-intensive operations
Azure infrastructure mode for large runs:
infrastructure:
mode: azure
azure:
resource_group: "mcpbr-eval"
location: "eastus" # Close to Anthropic API
cpu_cores: 8
memory_gb: 32
disk_gb: 250
auto_shutdown: true
CI/CD Integration¶
This section provides production-ready CI/CD configurations for automated benchmarking.
GitHub Actions Example Workflow¶
Complete workflow with caching, regression detection, and notifications:
name: MCP Server Benchmark
on:
pull_request:
paths: ['mcp-server/**', 'config.yaml']
schedule:
- cron: '0 6 * * 1' # Weekly on Monday at 6 AM UTC
concurrency:
group: benchmark-${{ github.ref }}
cancel-in-progress: true
jobs:
benchmark:
runs-on: ubuntu-latest
timeout-minutes: 120
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.11
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install mcpbr
run: pip install mcpbr
- name: Cache Docker images
uses: actions/cache@v4
with:
path: /tmp/docker-cache
key: docker-${{ runner.os }}-${{ hashFiles('config.yaml') }}
restore-keys: docker-${{ runner.os }}-
- name: Load cached Docker images
run: |
if [ -d /tmp/docker-cache ]; then
for img in /tmp/docker-cache/*.tar; do
docker load -i "$img" 2>/dev/null || true
done
fi
- name: Download baseline results
uses: actions/download-artifact@v4
with:
name: baseline-results
path: .
continue-on-error: true # OK if no baseline exists yet
- name: Run benchmark
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
ARGS="-c config.yaml -n 10 -o results.json --output-junit junit.xml"
if [ -f baseline.json ]; then
ARGS="$ARGS --baseline-results baseline.json --regression-threshold 0.1"
fi
mcpbr run $ARGS
- name: Publish test results
uses: EnricoMi/publish-unit-test-result-action@v2
if: always()
with:
files: junit.xml
- name: Upload results
uses: actions/upload-artifact@v4
if: always()
with:
name: benchmark-results
path: |
results.json
junit.xml
retention-days: 90
- name: Update baseline (main branch only)
if: github.ref == 'refs/heads/main' && success()
run: cp results.json baseline.json
- name: Upload baseline
if: github.ref == 'refs/heads/main' && success()
uses: actions/upload-artifact@v4
with:
name: baseline-results
path: baseline.json
retention-days: 365
GitLab CI Example¶
stages:
- benchmark
- report
mcpbr-benchmark:
stage: benchmark
image: python:3.11
services:
- docker:24.0-dind
variables:
DOCKER_HOST: tcp://docker:2375
DOCKER_TLS_CERTDIR: ""
before_script:
- pip install mcpbr
script:
- mcpbr run -c config.yaml -n 10 -o results.json --output-junit junit.xml
artifacts:
reports:
junit: junit.xml
paths:
- results.json
expire_in: 90 days
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
changes:
- mcp-server/**
- config.yaml
benchmark-report:
stage: report
image: python:3.11
needs: ["mcpbr-benchmark"]
script:
- pip install mcpbr
- |
python3 -c "
import json
with open('results.json') as f:
r = json.load(f)
summary = r.get('summary', {})
mcp = summary.get('mcp', {})
print(f'MCP Resolution Rate: {mcp.get(\"rate\", 0):.1%}')
print(f'Tasks Resolved: {mcp.get(\"resolved\", 0)}/{mcp.get(\"total\", 0)}')
"
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
Running in CI with Cost Budgets¶
CI benchmarks incur API costs on every run
Without cost controls, a misconfigured CI pipeline can quickly accumulate significant API charges. Always set explicit limits.
Cost control configuration for CI:
# ci-config.yaml -- optimized for CI cost control
model: "sonnet"
sample_size: 10 # Small sample for PRs
max_concurrent: 2 # Don't overload CI runners
timeout_seconds: 300 # 5-minute hard limit per task
max_iterations: 15 # Cap agent turns
budget: 25.00 # Hard budget cap in USD
use_prebuilt_images: true
continue_on_error: true
max_failures: 3 # Stop early on systemic issues
Conditional execution to avoid unnecessary runs:
# Only run benchmarks when MCP server code changes
on:
pull_request:
paths:
- 'mcp-server/**'
- 'config.yaml'
- 'benchmarks/**'
Environment-based sample sizing:
# In your workflow
- name: Set sample size
run: |
if [ "${{ github.event_name }}" = "schedule" ]; then
echo "SAMPLE_SIZE=50" >> $GITHUB_ENV # Weekly: larger sample
elif [ "${{ github.ref }}" = "refs/heads/main" ]; then
echo "SAMPLE_SIZE=25" >> $GITHUB_ENV # Main: medium sample
else
echo "SAMPLE_SIZE=10" >> $GITHUB_ENV # PRs: small sample
fi
- name: Run benchmark
run: mcpbr run -c config.yaml -n $SAMPLE_SIZE -o results.json
Regression Detection in CI Pipelines¶
Use regression detection to automatically fail builds when MCP server performance degrades.
# Run with regression detection
mcpbr run -c config.yaml -n 25 \
--baseline-results baseline.json \
--regression-threshold 0.1 \
-o current.json
The --regression-threshold 0.1 flag means the pipeline will fail (exit code 1) if the MCP resolution rate drops by more than 10 percentage points compared to the baseline.
Recommended thresholds:
| Context | Threshold | Rationale |
|---|---|---|
| PR checks | 0.15 | Tolerant -- small samples have high variance |
| Main branch | 0.10 | Moderate -- catch meaningful regressions |
| Release gate | 0.05 | Strict -- protect production quality |
Caching Strategies in CI¶
Cache Docker images between runs:
- name: Cache Docker images
uses: actions/cache@v4
with:
path: /tmp/docker-cache
key: docker-${{ hashFiles('config.yaml') }}-${{ github.sha }}
restore-keys: |
docker-${{ hashFiles('config.yaml') }}-
docker-
Cache mcpbr results:
- name: Cache evaluation results
uses: actions/cache@v4
with:
path: ~/.cache/mcpbr
key: mcpbr-cache-${{ hashFiles('config.yaml') }}
Cache pip dependencies:
- name: Cache pip
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: pip-${{ hashFiles('**/requirements*.txt') }}
Troubleshooting Guide¶
This section provides a quick-reference table for common errors and detailed debugging techniques.
Common Errors and Solutions¶
| Error | Likely Cause | Solution |
|---|---|---|
| `Cannot connect to Docker daemon` | Docker not running | Start Docker Desktop or run `sudo systemctl start docker` |
| `ANTHROPIC_API_KEY not set` | Missing environment variable | `export ANTHROPIC_API_KEY="sk-ant-..."` |
| `Timeout after 300 seconds` | Task too complex or slow hardware | Increase `timeout_seconds: 600` and reduce `max_concurrent` |
| OOM killed / container exits 137 | Insufficient memory | Reduce `max_concurrent`, increase Docker memory allocation |
| Connection refused / `ECONNREFUSED` | Network issue or MCP server crash | Check server logs, verify `docker network ls`, restart Docker |
| `MCP server add failed (exit 1)` | Server command not found or misconfigured | Test server standalone: `npx -y @your/server /tmp/test` |
| `Patch does not apply` | Agent changes conflict with test patches | Agent behavior issue -- increase `max_iterations` or adjust prompt |
| `Rate limit exceeded` | Too many concurrent API calls | Reduce `max_concurrent: 2` and check Anthropic Console quota |
| `No space left on device` | Docker images filling disk | Run `mcpbr cleanup -f` and `docker system prune` |
| Pre-built image not found | Image not available for this task | Normal -- mcpbr falls back to building from scratch |
| Config file not found | Wrong path to YAML | Verify path: `ls -la config.yaml` |
| Invalid model | Unsupported model name | Run `mcpbr models` for valid options |
| MCP server registration timed out | Server startup taking too long | Increase `startup_timeout_ms` in MCP server config |
| Tool execution timed out | MCP tool call exceeding limit | Increase `tool_timeout_ms` in MCP server config |
Debugging Techniques¶
Verbose mode levels:
# Standard output (progress bars, summary)
mcpbr run -c config.yaml
# Verbose: task progress and summary details
mcpbr run -c config.yaml -v
# Very verbose: detailed tool calls and agent interactions
mcpbr run -c config.yaml -vv
Structured JSON logging:
# Enable structured logs for machine parsing
MCPBR_LOG_LEVEL=DEBUG mcpbr run -c config.yaml -n 1 --log-dir debug/
Log analysis workflow
# 1. Run a single task with maximum debugging
mcpbr run -c config.yaml -n 1 -vv --log-dir debug/
# 2. List generated log files
ls debug/
# 3. Extract system events (errors, warnings)
cat debug/*.json | jq '.events[] | select(.type == "system")'
# 4. Extract tool usage sequence
cat debug/*.json | jq '.events[] | select(.type == "assistant") |
.message.content[] | select(.type == "tool_use") | .name'
# 5. Check for MCP-specific tool calls
cat debug/*.json | jq '.events[] | select(.type == "assistant") |
.message.content[] | select(.type == "tool_use") |
select(.name | startswith("mcp__"))'
Profiling evaluations:
This records tool latency, memory usage, and overhead metrics for each task, helping identify bottlenecks.
Checking MCP server health:
# View MCP server logs for a specific instance
cat ~/.mcpbr_state/logs/*_mcp.log
# Follow logs in real time during evaluation
tail -f ~/.mcpbr_state/logs/*.log
# Test server independently
npx -y @modelcontextprotocol/server-filesystem /tmp/test
Checkpoint recovery after crashes:
If mcpbr crashes mid-evaluation, use checkpoint files to understand what completed:
# Check for checkpoint files
ls .mcpbr_run_*/checkpoint.json
# View checkpoint state
cat .mcpbr_run_*/checkpoint.json | jq '{
completed: (.completed | length),
failed: (.failed | length),
skipped: (.skipped | length)
}'
# Resume from checkpoint
mcpbr run -c config.yaml --resume-from-checkpoint .mcpbr_run_20260201/checkpoint.json
Getting Help (Troubleshooting)¶
Before opening an issue
Run through this checklist to gather the information needed for a quick resolution:
- Verify prerequisites: `docker info`, `which claude`, `echo $ANTHROPIC_API_KEY | head -c 10`
- Reproduce with minimal config: single task, verbose output
- Collect version info: `mcpbr --version`, `python --version`, `docker --version`
- Gather logs: run with `--log-dir debug/` and include relevant excerpts
- Redact secrets: remove API keys, internal paths, and proprietary code from logs before sharing
Where to go:
- GitHub Issues -- Bug reports and feature requests
- GitHub Discussions -- Questions and community help
- Documentation -- Full reference guides
Cost Management¶
This section provides detailed strategies for understanding, estimating, and controlling evaluation costs.
Budget Configuration and Monitoring¶
mcpbr supports a hard budget cap that halts evaluation when the estimated spend reaches the limit:
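budget: 25.00  # Hard budget cap in USD; evaluation halts once estimated spend reaches this limit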
When the budget is reached, mcpbr will:
- Complete the currently running tasks
- Skip remaining tasks
- Save partial results to the output file
- Report the budget limit in the summary
Combine budget with sample size for double protection
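For example, combining the config keys used in the CI cost-control section:

budget: 25.00     # Hard spend cap in USD
sample_size: 10   # Bounded task count (the budget still caps spend if tasks run long)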
Model Cost Comparison Table¶
Use this table to estimate costs and select the right model for your evaluation stage. Prices are per million tokens (MTok) as of January 2026.
| Model | Provider | Input $/MTok | Output $/MTok | Best For |
|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | Development, iteration, smoke tests |
| Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | Production evaluation (recommended) |
| Claude Opus 4.5 | Anthropic | $5.00 | $25.00 | Maximum performance benchmarks |
| GPT-4o | OpenAI | $2.50 | $10.00 | Cross-provider comparison |
| GPT-4o Mini | OpenAI | $0.15 | $0.60 | Ultra-low-cost exploration |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | Cheapest option for prototyping |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | Long-context evaluations |
| Qwen Plus | Alibaba | $0.40 | $1.20 | Budget evaluations |
| Qwen Max | Alibaba | $1.20 | $6.00 | Best Qwen performance |
Cost estimation formula
For Sonnet with typical SWE-bench tasks (~15K input + ~10K output tokens per task): roughly (15,000 x $3/MTok) + (10,000 x $15/MTok) = $0.045 + $0.15 ≈ $0.20 per task. Actual costs vary based on task complexity and iteration count.
Cost-Effective Evaluation Strategies¶
The incremental evaluation ladder:
| Phase | Model | Sample | MCP Only? | Est. Cost | Purpose |
|---|---|---|---|---|---|
| 1. Smoke test | haiku | 1 | Yes (-M) | < $0.10 | Verify setup |
| 2. Quick validation | haiku | 5 | Yes (-M) | < $1.00 | Check tool usage |
| 3. Small comparison | sonnet | 10 | No | $5-10 | Compare MCP vs baseline |
| 4. Medium evaluation | sonnet | 25 | No | $10-30 | Statistical significance |
| 5. Full benchmark | sonnet | 50+ | No | $50-150 | Production results |
Skip baseline during development:
# Only run MCP agent while iterating on server config
mcpbr run -c config.yaml -n 5 -M # Saves ~50% cost
Use Haiku for iteration, Sonnet for final results:
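One way to structure this, assuming separate development and production config files (the file names are illustrative):

# Iterate cheaply with Haiku, MCP agent only
mcpbr run -c dev-config.yaml -n 5 -M                 # dev-config.yaml sets model: "haiku"
# Final comparison run with Sonnet
mcpbr run -c prod-config.yaml -n 50 -o final.json    # prod-config.yaml sets model: "sonnet"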
Estimating Costs Before Running¶
Quick cost estimate script:
from mcpbr.pricing import calculate_cost, format_cost
# Estimate for a typical SWE-bench task with Sonnet
# Average: ~15K input tokens, ~10K output tokens per task
tasks = 25
avg_input = 15_000
avg_output = 10_000
# Both MCP and baseline run, so double the task count
total_runs = tasks * 2
per_task_cost = calculate_cost("sonnet", avg_input, avg_output)
total_est = per_task_cost * total_runs if per_task_cost else 0
print(f"Estimated cost per task: {format_cost(per_task_cost)}")
print(f"Estimated total ({tasks} tasks, MCP + baseline): {format_cost(total_est)}")
After a run, calculate actual costs:
import json
from mcpbr.pricing import calculate_cost, format_cost
with open("results.json") as f:
results = json.load(f)
total_cost = 0
model = results.get("config", {}).get("model", "sonnet")
for task in results.get("tasks", []):
for run_type in ["mcp", "baseline"]:
run = task.get(run_type, {})
tokens = run.get("tokens", {})
cost = calculate_cost(
model,
tokens.get("input", 0),
tokens.get("output", 0),
)
if cost:
total_cost += cost
print(f"Total actual cost: {format_cost(total_cost)}")
print(f"Cost per task: {format_cost(total_cost / len(results.get('tasks', [1])))}")
Analytics Best Practices¶
Getting meaningful insights from mcpbr evaluations requires careful experimental design and statistical awareness.
When to Use the Analytics Database¶
mcpbr saves results in structured JSON and YAML formats. For teams running frequent evaluations, consider importing results into a database for trend analysis:
- Single evaluation: JSON output is sufficient (`-o results.json`)
- Comparing 2-3 configs: Side-by-side JSON comparison works well
- Ongoing regression tracking: Import results into a database (SQLite, PostgreSQL) for trend queries
- Team dashboards: Use YAML/JSON outputs with visualization tools (Grafana, Jupyter)
# Export multiple formats for different consumers
mcpbr run -c config.yaml -n 25 \
-o results.json \
-y results.yaml \
-r report.md \
--output-junit junit.xml
Meaningful Comparisons¶
Comparing results from different task samples is invalid
mcpbr randomly samples tasks unless you specify them explicitly. Two runs with -n 25 may evaluate completely different tasks, making comparison meaningless.
How to ensure valid comparisons:
- Same tasks: Use `--task` flags to specify identical task sets (see the example after this list)
- Same benchmark: Never compare SWE-bench results with CyberGym results
- Same model: Model capability differences will dominate MCP server differences
- Same parameters: Use identical `timeout_seconds`, `max_iterations`, and other settings
Comparison script:
import json
def compare(file_a: str, file_b: str) -> None:
with open(file_a) as f:
a = json.load(f)
with open(file_b) as f:
b = json.load(f)
a_rate = a["summary"]["mcp"]["rate"]
b_rate = b["summary"]["mcp"]["rate"]
a_resolved = set(
t["instance_id"] for t in a["tasks"]
if t.get("mcp", {}).get("resolved", False)
)
b_resolved = set(
t["instance_id"] for t in b["tasks"]
if t.get("mcp", {}).get("resolved", False)
)
print(f"Config A: {a_rate:.1%} ({len(a_resolved)} resolved)")
print(f"Config B: {b_rate:.1%} ({len(b_resolved)} resolved)")
print(f"Only A solved: {a_resolved - b_resolved}")
print(f"Only B solved: {b_resolved - a_resolved}")
print(f"Both solved: {a_resolved & b_resolved}")
compare("results-a.json", "results-b.json")
Interpreting Statistical Significance¶
Small sample sizes produce noisy results. Before drawing conclusions, consider the variance in your measurements.
Sample size guidelines:
| Sample Size | Confidence | Use Case |
|---|---|---|
| 1-5 | Very low | Smoke testing only |
| 10-25 | Low-moderate | Directional signal |
| 25-50 | Moderate | Reasonable confidence for large effects |
| 50-100 | Good | Detect moderate improvements (10%+) |
| 100+ | High | Detect small improvements (5%+) |
A single run with n=10 is not statistically meaningful
If Config A resolves 3/10 (30%) and Config B resolves 4/10 (40%), the difference could easily be due to chance. You need larger samples or repeated runs to draw reliable conclusions.
Practical significance check:
# Simple binomial confidence interval
import math
def confidence_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
"""95% confidence interval for resolution rate."""
if total == 0:
return (0.0, 0.0)
p = resolved / total
margin = z * math.sqrt(p * (1 - p) / total)
return (max(0, p - margin), min(1, p + margin))
# Example: 8 out of 25 resolved
low, high = confidence_interval(8, 25)
print(f"Rate: {8/25:.1%}, 95% CI: [{low:.1%}, {high:.1%}]")
# Rate: 32.0%, 95% CI: [13.7%, 50.3%]
If the confidence intervals of two configurations overlap substantially, the difference is likely not statistically significant.
Regression Detection Thresholds¶
Choose regression thresholds based on your sample size and tolerance for false alarms:
| Scenario | Threshold | False Alarm Rate | Miss Rate |
|---|---|---|---|
| PR checks (n=10) | 0.20 | Low | High (misses small regressions) |
| Main branch (n=25) | 0.10 | Moderate | Moderate |
| Release gate (n=50) | 0.05 | Higher | Low (catches most regressions) |
# Conservative: only alert on large regressions
--regression-threshold 0.15
# Strict: alert on any meaningful drop
--regression-threshold 0.05
Use rolling baselines
Update your baseline results periodically (e.g., weekly on the main branch) so that regression detection compares against recent performance rather than a stale snapshot.
Building Leaderboards¶
For teams evaluating multiple MCP servers, build a leaderboard from saved results:
import json
import glob
results = []
for path in glob.glob("results-*.json"):
with open(path) as f:
data = json.load(f)
config_name = path.replace("results-", "").replace(".json", "")
mcp_summary = data.get("summary", {}).get("mcp", {})
results.append({
"config": config_name,
"rate": mcp_summary.get("rate", 0),
"resolved": mcp_summary.get("resolved", 0),
"total": mcp_summary.get("total", 0),
})
# Sort by resolution rate
results.sort(key=lambda x: x["rate"], reverse=True)
print(f"{'Rank':<6}{'Config':<25}{'Rate':<10}{'Resolved':<10}")
print("-" * 51)
for i, r in enumerate(results, 1):
print(f"{i:<6}{r['config']:<25}{r['rate']:.1%}{'':<4}{r['resolved']}/{r['total']}")
Leaderboard best practices:
- Always report sample size alongside resolution rate
- Use the same task set across all configurations
- Record the model, parameters, and date for reproducibility
- Track cost-per-resolved-task alongside raw performance
- Re-run periodically as MCP servers and models are updated
Additional Resources¶
- About mcpbr - The project story and vision
- Testing Philosophy - Principles behind meaningful evaluation
- Configuration Guide - Detailed configuration reference
- Troubleshooting - Common issues and solutions
- CLI Reference - All command options
- Benchmarks Guide - Benchmark details
- Evaluation Results - Understanding output
- Templates - Configuration templates
- MCP Integration - MCP server testing
Getting Help¶
Before Asking for Help:

1. Check the troubleshooting guide
2. Run with `-vv --log-dir debug/`
3. Test the MCP server standalone
4. Verify prerequisites

When Reporting Issues:

- Include mcpbr version (`mcpbr --version`)
- Include Python version
- Include Docker version
- Include config file (redact secrets!)
- Include relevant logs
- Describe expected vs actual behavior

Community:

- GitHub Issues - Bug reports
- GitHub Discussions - Questions
- Documentation - Comprehensive guides