Understanding Evaluation Results

This guide explains how to interpret and analyze mcpbr evaluation results.

Console Output

When running an evaluation, mcpbr displays real-time progress and a final summary.

Verbose Mode (-v)

mcpbr Evaluation
  Config: config.yaml
  Provider: anthropic
  Model: sonnet
  Agent Harness: claude-code
  Dataset: SWE-bench/SWE-bench_Lite
  Sample size: 10
  Run MCP: True, Run Baseline: True
  Pre-built images: True
  Log dir: my-logs

Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 10 tasks
Provider: anthropic, Harness: claude-code
14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp    > TodoWrite
14:23:22 astropy-12907:mcp    < Todos have been modified successfully...
14:23:26 astropy-12907:mcp    > Glob
14:23:26 astropy-12907:mcp    > Grep
14:23:27 astropy-12907:mcp    < $WORKDIR/astropy/modeling/separable.py
14:27:43 astropy-12907:mcp    * done turns=31 tokens=115/6,542

Legend:

  • > Tool call started
  • < Tool result received
  • * Run completed

Summary Table

Evaluation Results

                 Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 8/25      | 5/25     |
| Resolution Rate | 32.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +60.0%

Tool Coverage Analysis

            MCP Tool Usage
+------------------+---------+
| Metric           | Value   |
+------------------+---------+
| Available Tools  | 15      |
| Used Tools       | 8       |
| Coverage Rate    | 53.3%   |
+------------------+---------+

Most Used Tools:
  Bash: 127
  Read: 98
  Grep: 45
  Write: 23
  Edit: 12

Unused Tools (7):
  WebFetch
  NotebookEdit
  Skill
  ... and 4 more

Per-Task Results
+------------------------+------+----------+-------+
| Instance ID            | MCP  | Baseline | Error |
+------------------------+------+----------+-------+
| astropy__astropy-12907 | PASS |   PASS   |       |
| django__django-11099   | PASS |   FAIL   |       |
| sympy__sympy-18087     | FAIL |   FAIL   |       |
+------------------------+------+----------+-------+

What "Resolved" Means

A task is counted as resolved only when all four of the following conditions hold (a checking sketch follows the list):

  1. Patch Generated: The agent produced a non-empty diff
  2. Patch Applied: The diff applies cleanly to the repository
  3. FAIL_TO_PASS Tests Pass: Tests that were failing now pass
  4. PASS_TO_PASS Tests Pass: Existing tests still pass (no regressions)
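
Equivalently, resolution can be checked from a per-task record in the JSON output (described below). A minimal sketch, assuming the field names used throughout this guide:

def is_resolved(run: dict) -> bool:
    """Approximate the four criteria above for one run record (mcp or baseline)."""
    if not run.get("patch_generated") or not run.get("patch_applied"):
        return False
    f2p = run.get("fail_to_pass", {})
    p2p = run.get("pass_to_pass", {})
    # Every previously failing test must now pass, with no regressions.
    return (f2p.get("passed") == f2p.get("total")
            and p2p.get("passed") == p2p.get("total"))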

JSON Output

Save structured results with --output:

mcpbr run -c config.yaml -o results.json

Schema

{
  "metadata": {
    "timestamp": "2026-01-17T07:23:39.871437+00:00",
    "config": {
      "model": "sonnet",
      "provider": "anthropic",
      "agent_harness": "claude-code",
      "dataset": "SWE-bench/SWE-bench_Lite",
      "sample_size": 25,
      "timeout_seconds": 600,
      "max_iterations": 30
    },
    "mcp_server": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
    }
  },
  "summary": {
    "mcp": {
      "resolved": 8,
      "total": 25,
      "rate": 0.32
    },
    "baseline": {
      "resolved": 5,
      "total": 25,
      "rate": 0.20
    },
    "improvement": "+60.0%",
    "tool_coverage": {
      "total_available": 15,
      "total_used": 8,
      "coverage_rate": 0.533,
      "unused_tools": ["WebFetch", "NotebookEdit", "Skill", "WebSearch", "Task", "TodoWrite", "Agent"],
      "most_used": {
        "Bash": 127,
        "Read": 98,
        "Grep": 45,
        "Write": 23,
        "Edit": 12,
        "Glob": 8,
        "mcp__filesystem__read_file": 5,
        "mcp__filesystem__write_file": 2
      },
      "least_used": {
        "mcp__filesystem__write_file": 2,
        "mcp__filesystem__read_file": 5
      },
      "all_tool_usage": {
        "Bash": 127,
        "Read": 98,
        "Grep": 45,
        "Write": 23,
        "Edit": 12,
        "Glob": 8,
        "mcp__filesystem__read_file": 5,
        "mcp__filesystem__write_file": 2
      }
    }
  },
  "tasks": [...]
}
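
Once saved, the summary block can be pulled out with a few lines of Python:

import json

with open("results.json") as f:
    results = json.load(f)

summary = results["summary"]
print(f"MCP:      {summary['mcp']['resolved']}/{summary['mcp']['total']} ({summary['mcp']['rate']:.1%})")
print(f"Baseline: {summary['baseline']['resolved']}/{summary['baseline']['total']} ({summary['baseline']['rate']:.1%})")
print(f"Improvement: {summary['improvement']}")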

Per-Task Results

Each task includes detailed metrics:

{
  "instance_id": "astropy__astropy-12907",
  "mcp": {
    "patch_generated": true,
    "tokens": {
      "input": 115,
      "output": 6542
    },
    "iterations": 30,
    "tool_calls": 72,
    "tool_usage": {
      "TodoWrite": 4,
      "Task": 1,
      "Glob": 4,
      "Grep": 11,
      "Bash": 27,
      "Read": 22,
      "Write": 2,
      "Edit": 1
    },
    "resolved": true,
    "patch_applied": true,
    "fail_to_pass": {
      "passed": 2,
      "total": 2
    },
    "pass_to_pass": {
      "passed": 10,
      "total": 10
    }
  },
  "baseline": {
    "patch_generated": true,
    "tokens": {
      "input": 63,
      "output": 7615
    },
    "iterations": 30,
    "tool_calls": 57,
    "resolved": true,
    "patch_applied": true
  }
}
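
The per-task token counts also make it easy to compare how much work each agent did. For example, totalling tokens across a run:

import json

with open("results.json") as f:
    results = json.load(f)

for run_type in ("mcp", "baseline"):
    tokens = [t.get(run_type, {}).get("tokens", {}) for t in results["tasks"]]
    total_in = sum(tok.get("input", 0) for tok in tokens)
    total_out = sum(tok.get("output", 0) for tok in tokens)
    print(f"{run_type}: {total_in:,} input / {total_out:,} output tokens")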

Key Metrics

Field            Description
patch_generated  Whether the agent produced a diff
patch_applied    Whether the diff applied cleanly
resolved         Whether all required tests pass
tokens.input     Input tokens consumed
tokens.output    Output tokens generated
iterations       Number of agent turns
tool_calls       Total tool invocations
tool_usage       Per-tool breakdown of call counts
fail_to_pass     Previously failing tests that must now pass
pass_to_pass     Previously passing tests that must still pass (regression check)
error            Error message if the run failed

Tool Coverage Metrics

The tool_coverage section in the summary provides insights into which tools are being used:

Field            Description
total_available  Total number of tools available to the agent (built-in + MCP)
total_used       Number of distinct tools actually called during evaluation
coverage_rate    Fraction of available tools used (0.0 to 1.0)
unused_tools     List of tools that were never called
most_used        Most frequently used tools with call counts
least_used       Least frequently used tools with call counts
all_tool_usage   Complete per-tool call counts across all tasks

This helps identify:

  • Which MCP tools are actually being utilized
  • Whether certain tools are being ignored (low discoverability)
  • The most valuable tools for solving tasks
  • Opportunities to simplify tool offerings (see the sketch below)
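
For example, the coverage fields can be read straight from a saved results file:

import json

with open("results.json") as f:
    coverage = json.load(f)["summary"]["tool_coverage"]

print(f"Coverage: {coverage['total_used']}/{coverage['total_available']} "
      f"({coverage['coverage_rate']:.1%})")
for tool in coverage["unused_tools"]:
    print(f"  never called: {tool}")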

YAML Output

Save results in YAML format with --output-yaml:

mcpbr run -c config.yaml -y results.yaml

YAML output provides the same structured data as JSON but in a human-readable, hierarchical format that's ideal for:

  • DevOps Integration: Easy to parse in CI/CD pipelines
  • Configuration Management: Natural fit for tools like Ansible, Kubernetes
  • Version Control: More readable diffs when committed to Git
  • Manual Review: Easier to read and edit than JSON

Example YAML Output

metadata:
  timestamp: '2026-01-17T07:23:39.871437+00:00'
  config:
    model: sonnet
    provider: anthropic
    agent_harness: claude-code
    dataset: SWE-bench/SWE-bench_Lite
    sample_size: 25
    timeout_seconds: 600
    max_iterations: 30
  mcp_server:
    command: npx
    args:
    - -y
    - '@modelcontextprotocol/server-filesystem'
    - '{workdir}'
summary:
  mcp:
    resolved: 8
    total: 25
    rate: 0.32
  baseline:
    resolved: 5
    total: 25
    rate: 0.2
  improvement: +60.0%
tasks:
- instance_id: astropy__astropy-12907
  mcp:
    patch_generated: true
    tokens:
      input: 115
      output: 6542
    iterations: 30
    tool_calls: 72
    tool_usage:
      TodoWrite: 4
      Task: 1
      Glob: 4
      Grep: 11
      Bash: 27
      Read: 22
      Write: 2
      Edit: 1
    resolved: true
    patch_applied: true
    fail_to_pass:
      passed: 2
      total: 2
    pass_to_pass:
      passed: 10
      total: 10
  baseline:
    patch_generated: true
    tokens:
      input: 63
      output: 7615
    iterations: 30
    tool_calls: 57
    resolved: true
    patch_applied: true
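
The same data loads cleanly in Python; a minimal sketch, assuming PyYAML (pip install pyyaml) is installed:

import yaml

with open("results.yaml") as f:
    results = yaml.safe_load(f)

print(f"MCP resolution rate: {results['summary']['mcp']['rate']:.1%}")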

Using YAML in CI/CD Pipelines

YAML output is particularly useful in automated workflows:

# GitHub Actions example
- name: Run mcpbr evaluation
  run: mcpbr run -c config.yaml -y results.yaml

- name: Parse results
  run: |
    resolution_rate=$(yq '.summary.mcp.rate' results.yaml)
    echo "MCP Resolution Rate: $resolution_rate"

    if (( $(echo "$resolution_rate > 0.5" | bc -l) )); then
      echo "✓ Performance threshold met"
    else
      echo "✗ Performance below threshold"
      exit 1
    fi

Combining Output Formats

You can save results in multiple formats simultaneously:

# Save all three formats
mcpbr run -c config.yaml -o results.json -y results.yaml -r report.md

This allows you to:

  • Use JSON for programmatic analysis
  • Use YAML for DevOps integration
  • Use Markdown for team reviews

Markdown Report

Generate a human-readable report with --report:

mcpbr run -c config.yaml -r report.md

The report includes:

  • Summary statistics
  • Per-task results table
  • Analysis of which tasks each agent solved

Per-Instance Logs

For detailed debugging, use --log-dir:

mcpbr run -c config.yaml -v --log-dir logs/

This creates timestamped JSON files:

logs/
  astropy__astropy-12907_mcp_20260117_143052.json
  astropy__astropy-12907_baseline_20260117_143156.json
  django__django-11099_mcp_20260117_144023.json
  ...

Log File Contents

{
  "instance_id": "astropy__astropy-12907",
  "run_type": "mcp",
  "events": [
    {
      "type": "system",
      "subtype": "init",
      "cwd": "/workspace",
      "tools": ["Task", "Bash", "Glob", "Grep", "Read", "Edit", "Write"],
      "model": "claude-sonnet-4-5-20250929"
    },
    {
      "type": "assistant",
      "message": {
        "content": [
          {"type": "text", "text": "I'll help you fix this bug..."}
        ]
      }
    },
    {
      "type": "assistant",
      "message": {
        "content": [
          {"type": "tool_use", "name": "Grep", "input": {"pattern": "separability"}}
        ]
      }
    },
    {
      "type": "result",
      "num_turns": 31,
      "usage": {"input_tokens": 115, "output_tokens": 6542}
    }
  ]
}
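
Because these logs are plain JSON, failure modes can be mined programmatically. A sketch that counts tool calls per log file, assuming the event shapes shown above:

import json
from collections import Counter
from pathlib import Path

for log_path in sorted(Path("logs").glob("*.json")):
    events = json.loads(log_path.read_text())["events"]
    # Tool calls appear as tool_use blocks inside assistant messages.
    tools = Counter(
        block["name"]
        for event in events if event["type"] == "assistant"
        for block in event["message"]["content"]
        if block["type"] == "tool_use"
    )
    print(log_path.name, dict(tools))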

Analyzing Results

Improvement Calculation

improvement = ((mcp_rate - baseline_rate) / baseline_rate) * 100

Example: If MCP resolves 32% and baseline resolves 20%:

improvement = ((0.32 - 0.20) / 0.20) * 100 = +60%
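
The same calculation in Python, with a guard for a zero baseline rate (where relative improvement is undefined):

def improvement(mcp_rate: float, baseline_rate: float) -> str:
    if baseline_rate == 0:
        return "n/a (baseline resolved nothing)"
    return f"{(mcp_rate - baseline_rate) / baseline_rate:+.1%}"

print(improvement(0.32, 0.20))  # +60.0%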

Comparing Configurations

To compare different MCP servers or settings:

import json

with open("results-server-a.json") as f:
    a = json.load(f)

with open("results-server-b.json") as f:
    b = json.load(f)

print(f"Server A: {a['summary']['mcp']['rate']:.1%}")
print(f"Server B: {b['summary']['mcp']['rate']:.1%}")

Finding Interesting Tasks

Identify tasks where MCP helped but baseline failed:

import json

with open("results.json") as f:
    results = json.load(f)

mcp_only_wins = []
for task in results["tasks"]:
    mcp_resolved = task.get("mcp", {}).get("resolved", False)
    baseline_resolved = task.get("baseline", {}).get("resolved", False)
    if mcp_resolved and not baseline_resolved:
        mcp_only_wins.append(task["instance_id"])

print("MCP solved, baseline failed:", mcp_only_wins)

Tool Usage Analysis

Understand which tools are most used:

import json
from collections import Counter

with open("results.json") as f:
    results = json.load(f)

tool_counts = Counter()
for task in results["tasks"]:
    usage = task.get("mcp", {}).get("tool_usage", {})
    tool_counts.update(usage)

print("Most used tools:", tool_counts.most_common(10))

Common Patterns

High Resolution Rate

If MCP significantly outperforms baseline:

  • Your MCP tools provide valuable functionality
  • Consider which specific tools drove the improvement

Low Resolution Rate (Both)

If neither agent performs well:

  • Tasks may be inherently difficult
  • Consider increasing timeout_seconds and max_iterations (see the sketch below)
  • Review per-instance logs for common failure modes
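
A hedged sketch of raising those limits in config.yaml, assuming the keys mirror the metadata.config names shown earlier:

# config.yaml (excerpt)
timeout_seconds: 1200  # doubled from the 600 used in the examples above
max_iterations: 60     # doubled from 30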

Similar Rates

If MCP and baseline have similar rates:

  • MCP tools may not provide additional value for these tasks
  • Built-in tools may be sufficient
  • Review tool usage to see if MCP tools are being used

Next Steps