Understanding Evaluation Results

This guide explains how to interpret and analyze mcpbr evaluation results.

Console Output

When running an evaluation, mcpbr displays real-time progress and a final summary.

Verbose Mode (-v)

mcpbr Evaluation
  Config: config.yaml
  Provider: anthropic
  Model: sonnet
  Agent Harness: claude-code
  Dataset: SWE-bench/SWE-bench_Lite
  Sample size: 10
  Run MCP: True, Run Baseline: True
  Pre-built images: True
  Log dir: my-logs

Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 10 tasks
Provider: anthropic, Harness: claude-code
14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp    > TodoWrite
14:23:22 astropy-12907:mcp    < Todos have been modified successfully...
14:23:26 astropy-12907:mcp    > Glob
14:23:26 astropy-12907:mcp    > Grep
14:23:27 astropy-12907:mcp    < $WORKDIR/astropy/modeling/separable.py
14:27:43 astropy-12907:mcp    * done turns=31 tokens=115/6,542

Legend:

  • > Tool call started
  • < Tool result received
  • * Run completed

Summary Table

Evaluation Results

                 Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 8/25      | 5/25     |
| Resolution Rate | 32.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +60.0%

Tool Coverage Analysis

            MCP Tool Usage
+------------------+---------+
| Metric           | Value   |
+------------------+---------+
| Available Tools  | 15      |
| Used Tools       | 8       |
| Coverage Rate    | 53.3%   |
+------------------+---------+

Most Used Tools:
  Bash: 127
  Read: 98
  Grep: 45
  Write: 23
  Edit: 12

Unused Tools (7):
  WebFetch
  NotebookEdit
  Skill
  ... and 4 more

Per-Task Results
+------------------------+------+----------+-------+
| Instance ID            | MCP  | Baseline | Error |
+------------------------+------+----------+-------+
| astropy__astropy-12907 | PASS |   PASS   |       |
| django__django-11099   | PASS |   FAIL   |       |
| sympy__sympy-18087     | FAIL |   FAIL   |       |
+------------------------+------+----------+-------+

What "Resolved" Means

A task is counted as resolved only when all four of the following conditions hold (a checking sketch follows the list):

  1. Patch Generated: The agent produced a non-empty diff
  2. Patch Applied: The diff applies cleanly to the repository
  3. FAIL_TO_PASS Tests Pass: Tests that were failing now pass
  4. PASS_TO_PASS Tests Pass: Existing tests still pass (no regressions)
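
Equivalently, resolution can be checked from a per-task record in the JSON output (described below). A minimal sketch, assuming the field names used throughout this guide:

def is_resolved(run: dict) -> bool:
    """Approximate the four criteria above for one run record (mcp or baseline)."""
    if not run.get("patch_generated") or not run.get("patch_applied"):
        return False
    f2p = run.get("fail_to_pass", {})
    p2p = run.get("pass_to_pass", {})
    # Every previously failing test must now pass, with no regressions.
    return (f2p.get("passed") == f2p.get("total")
            and p2p.get("passed") == p2p.get("total"))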

JSON Output

Save structured results with --output:

mcpbr run -c config.yaml -o results.json

Schema

{
  "metadata": {
    "timestamp": "2026-01-17T07:23:39.871437+00:00",
    "config": {
      "model": "sonnet",
      "provider": "anthropic",
      "agent_harness": "claude-code",
      "dataset": "SWE-bench/SWE-bench_Lite",
      "sample_size": 25,
      "timeout_seconds": 600,
      "max_iterations": 30
    },
    "mcp_server": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
    }
  },
  "summary": {
    "mcp": {
      "resolved": 8,
      "total": 25,
      "rate": 0.32
    },
    "baseline": {
      "resolved": 5,
      "total": 25,
      "rate": 0.20
    },
    "improvement": "+60.0%",
    "tool_coverage": {
      "total_available": 15,
      "total_used": 8,
      "coverage_rate": 0.533,
      "unused_tools": ["WebFetch", "NotebookEdit", "Skill", "WebSearch", "Task", "TodoWrite", "Agent"],
      "most_used": {
        "Bash": 127,
        "Read": 98,
        "Grep": 45,
        "Write": 23,
        "Edit": 12,
        "Glob": 8,
        "mcp__filesystem__read_file": 5,
        "mcp__filesystem__write_file": 2
      },
      "least_used": {
        "mcp__filesystem__write_file": 2,
        "mcp__filesystem__read_file": 5
      },
      "all_tool_usage": {
        "Bash": 127,
        "Read": 98,
        "Grep": 45,
        "Write": 23,
        "Edit": 12,
        "Glob": 8,
        "mcp__filesystem__read_file": 5,
        "mcp__filesystem__write_file": 2
      }
    }
  },
  "tasks": [...]
}
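
Once saved, the summary block can be pulled out with a few lines of Python:

import json

with open("results.json") as f:
    results = json.load(f)

summary = results["summary"]
print(f"MCP:      {summary['mcp']['resolved']}/{summary['mcp']['total']} ({summary['mcp']['rate']:.1%})")
print(f"Baseline: {summary['baseline']['resolved']}/{summary['baseline']['total']} ({summary['baseline']['rate']:.1%})")
print(f"Improvement: {summary['improvement']}")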

Per-Task Results

Each task includes detailed metrics:

{
  "instance_id": "astropy__astropy-12907",
  "mcp": {
    "patch_generated": true,
    "tokens": {
      "input": 115,
      "output": 6542
    },
    "iterations": 30,
    "tool_calls": 72,
    "tool_usage": {
      "TodoWrite": 4,
      "Task": 1,
      "Glob": 4,
      "Grep": 11,
      "Bash": 27,
      "Read": 22,
      "Write": 2,
      "Edit": 1
    },
    "resolved": true,
    "patch_applied": true,
    "fail_to_pass": {
      "passed": 2,
      "total": 2
    },
    "pass_to_pass": {
      "passed": 10,
      "total": 10
    }
  },
  "baseline": {
    "patch_generated": true,
    "tokens": {
      "input": 63,
      "output": 7615
    },
    "iterations": 30,
    "tool_calls": 57,
    "resolved": true,
    "patch_applied": true
  }
}
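
The per-task token counts also make it easy to compare how much work each agent did. For example, totalling tokens across a run:

import json

with open("results.json") as f:
    results = json.load(f)

for run_type in ("mcp", "baseline"):
    tokens = [t.get(run_type, {}).get("tokens", {}) for t in results["tasks"]]
    total_in = sum(tok.get("input", 0) for tok in tokens)
    total_out = sum(tok.get("output", 0) for tok in tokens)
    print(f"{run_type}: {total_in:,} input / {total_out:,} output tokens")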

Key Metrics

Field            Description
patch_generated  Whether the agent produced a diff
patch_applied    Whether the diff applied cleanly
resolved         Whether all required tests pass
tokens.input     Input tokens consumed
tokens.output    Output tokens generated
iterations       Number of agent turns
tool_calls       Total tool invocations
tool_usage       Per-tool breakdown of call counts
fail_to_pass     Previously failing tests that must now pass
pass_to_pass     Previously passing tests that must still pass (regression check)
error            Error message if the run failed

Tool Coverage Metrics

The tool_coverage section in the summary provides insights into which tools are being used:

Field            Description
total_available  Total number of tools available to the agent (built-in + MCP)
total_used       Number of distinct tools actually called during evaluation
coverage_rate    Fraction of available tools used (0.0 to 1.0)
unused_tools     List of tools that were never called
most_used        Most frequently used tools with call counts
least_used       Least frequently used tools with call counts
all_tool_usage   Complete per-tool call counts across all tasks

This helps identify:

  • Which MCP tools are actually being utilized
  • Whether certain tools are being ignored (low discoverability)
  • The most valuable tools for solving tasks
  • Opportunities to simplify tool offerings (see the sketch below)
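
For example, the coverage fields can be read straight from a saved results file:

import json

with open("results.json") as f:
    coverage = json.load(f)["summary"]["tool_coverage"]

print(f"Coverage: {coverage['total_used']}/{coverage['total_available']} "
      f"({coverage['coverage_rate']:.1%})")
for tool in coverage["unused_tools"]:
    print(f"  never called: {tool}")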

YAML Output

Save results in YAML format with --output-yaml:

mcpbr run -c config.yaml -y results.yaml

YAML output provides the same structured data as JSON but in a human-readable, hierarchical format that's ideal for:

  • DevOps Integration: Easy to parse in CI/CD pipelines
  • Configuration Management: Natural fit for tools like Ansible, Kubernetes
  • Version Control: More readable diffs when committed to Git
  • Manual Review: Easier to read and edit than JSON

Example YAML Output

metadata:
  timestamp: '2026-01-17T07:23:39.871437+00:00'
  config:
    model: sonnet
    provider: anthropic
    agent_harness: claude-code
    dataset: SWE-bench/SWE-bench_Lite
    sample_size: 25
    timeout_seconds: 600
    max_iterations: 30
  mcp_server:
    command: npx
    args:
    - -y
    - '@modelcontextprotocol/server-filesystem'
    - '{workdir}'
summary:
  mcp:
    resolved: 8
    total: 25
    rate: 0.32
  baseline:
    resolved: 5
    total: 25
    rate: 0.2
  improvement: +60.0%
tasks:
- instance_id: astropy__astropy-12907
  mcp:
    patch_generated: true
    tokens:
      input: 115
      output: 6542
    iterations: 30
    tool_calls: 72
    tool_usage:
      TodoWrite: 4
      Task: 1
      Glob: 4
      Grep: 11
      Bash: 27
      Read: 22
      Write: 2
      Edit: 1
    resolved: true
    patch_applied: true
    fail_to_pass:
      passed: 2
      total: 2
    pass_to_pass:
      passed: 10
      total: 10
  baseline:
    patch_generated: true
    tokens:
      input: 63
      output: 7615
    iterations: 30
    tool_calls: 57
    resolved: true
    patch_applied: true
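
The same data loads cleanly in Python; a minimal sketch, assuming PyYAML (pip install pyyaml) is installed:

import yaml

with open("results.yaml") as f:
    results = yaml.safe_load(f)

print(f"MCP resolution rate: {results['summary']['mcp']['rate']:.1%}")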

Using YAML in CI/CD Pipelines

YAML output is particularly useful in automated workflows:

# GitHub Actions example
- name: Run mcpbr evaluation
  run: mcpbr run -c config.yaml -y results.yaml

- name: Parse results
  run: |
    resolution_rate=$(yq '.summary.mcp.rate' results.yaml)
    echo "MCP Resolution Rate: $resolution_rate"

    if (( $(echo "$resolution_rate > 0.5" | bc -l) )); then
      echo "✓ Performance threshold met"
    else
      echo "✗ Performance below threshold"
      exit 1
    fi

Combining Output Formats

You can save results in multiple formats simultaneously:

# Save all three formats
mcpbr run -c config.yaml -o results.json -y results.yaml -r report.md

This allows you to:

  • Use JSON for programmatic analysis
  • Use YAML for DevOps integration
  • Use Markdown for team reviews

Markdown Report

Generate a human-readable report with --report:

mcpbr run -c config.yaml -r report.md

The report includes:

  • Summary statistics
  • Per-task results table
  • Analysis of which tasks each agent solved

Per-Instance Logs

For detailed debugging, use --log-dir:

mcpbr run -c config.yaml -v --log-dir logs/

This creates timestamped JSON files:

logs/
  astropy__astropy-12907_mcp_20260117_143052.json
  astropy__astropy-12907_baseline_20260117_143156.json
  django__django-11099_mcp_20260117_144023.json
  ...

Log File Contents

{
  "instance_id": "astropy__astropy-12907",
  "run_type": "mcp",
  "events": [
    {
      "type": "system",
      "subtype": "init",
      "cwd": "/workspace",
      "tools": ["Task", "Bash", "Glob", "Grep", "Read", "Edit", "Write"],
      "model": "claude-sonnet-4-5-20250929"
    },
    {
      "type": "assistant",
      "message": {
        "content": [
          {"type": "text", "text": "I'll help you fix this bug..."}
        ]
      }
    },
    {
      "type": "assistant",
      "message": {
        "content": [
          {"type": "tool_use", "name": "Grep", "input": {"pattern": "separability"}}
        ]
      }
    },
    {
      "type": "result",
      "num_turns": 31,
      "usage": {"input_tokens": 115, "output_tokens": 6542}
    }
  ]
}
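
Because these logs are plain JSON, failure modes can be mined programmatically. A sketch that counts tool calls per log file, assuming the event shapes shown above:

import json
from collections import Counter
from pathlib import Path

for log_path in sorted(Path("logs").glob("*.json")):
    events = json.loads(log_path.read_text())["events"]
    # Tool calls appear as tool_use blocks inside assistant messages.
    tools = Counter(
        block["name"]
        for event in events if event["type"] == "assistant"
        for block in event["message"]["content"]
        if block["type"] == "tool_use"
    )
    print(log_path.name, dict(tools))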

Analyzing Results

Improvement Calculation

improvement = ((mcp_rate - baseline_rate) / baseline_rate) * 100

Example: If MCP resolves 32% and baseline resolves 20%:

improvement = ((0.32 - 0.20) / 0.20) * 100 = +60%
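
The same calculation in Python, with a guard for a zero baseline rate (where relative improvement is undefined):

def improvement(mcp_rate: float, baseline_rate: float) -> str:
    if baseline_rate == 0:
        return "n/a (baseline resolved nothing)"
    return f"{(mcp_rate - baseline_rate) / baseline_rate:+.1%}"

print(improvement(0.32, 0.20))  # +60.0%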

Comparing Configurations

To compare different MCP servers or settings:

import json

with open("results-server-a.json") as f:
    a = json.load(f)

with open("results-server-b.json") as f:
    b = json.load(f)

print(f"Server A: {a['summary']['mcp']['rate']:.1%}")
print(f"Server B: {b['summary']['mcp']['rate']:.1%}")

Finding Interesting Tasks

Identify tasks where MCP helped but baseline failed:

import json

with open("results.json") as f:
    results = json.load(f)

mcp_only_wins = []
for task in results["tasks"]:
    mcp_resolved = task.get("mcp", {}).get("resolved", False)
    baseline_resolved = task.get("baseline", {}).get("resolved", False)
    if mcp_resolved and not baseline_resolved:
        mcp_only_wins.append(task["instance_id"])

print("MCP solved, baseline failed:", mcp_only_wins)

Tool Usage Analysis

Understand which tools are most used:

import json
from collections import Counter

with open("results.json") as f:
    results = json.load(f)

tool_counts = Counter()
for task in results["tasks"]:
    usage = task.get("mcp", {}).get("tool_usage", {})
    tool_counts.update(usage)

print("Most used tools:", tool_counts.most_common(10))

Common Patterns

High Resolution Rate

If MCP significantly outperforms baseline:

  • Your MCP tools provide valuable functionality
  • Consider which specific tools drove the improvement

Low Resolution Rate (Both)

If neither agent performs well:

  • Tasks may be inherently difficult
  • Consider increasing timeout_seconds and max_iterations (see the sketch below)
  • Review per-instance logs for common failure modes
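
A hedged sketch of raising those limits in config.yaml, assuming the keys mirror the metadata.config names shown earlier:

# config.yaml (excerpt)
timeout_seconds: 1200  # doubled from the 600 used in the examples above
max_iterations: 60     # doubled from 30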

Similar Rates

If MCP and baseline have similar rates:

  • MCP tools may not provide additional value for these tasks
  • Built-in tools may be sufficient
  • Review tool usage to see if MCP tools are being used

Next Steps