MCPToolBench++¶
Overview¶
| Property | Value |
|---|---|
| Benchmark ID | mcptoolbench |
| Dataset | MCPToolBench/MCPToolBenchPP |
| Tasks | Varies (45+ categories) |
| Evaluation | Tool selection accuracy (>=0.8), parameter accuracy (>=0.7), call count limit (<=1.5x) |
| Output Type | Tool call accuracy metrics |
| Timeout | 180-300s recommended |
| Pre-built Images | No |
| Difficulty Levels | easy (single-step), hard/medium (multi-step) |
MCPToolBench++ is a benchmark specifically designed to evaluate how well AI agents use MCP (Model Context Protocol) tools. It tests the complete tool-use lifecycle across four key dimensions:
- Tool Discovery: Understanding what MCP tools are available and their capabilities
- Tool Selection: Choosing the appropriate tool(s) for a given task
- Tool Invocation: Calling tools with correct parameters matching their schemas
- Result Interpretation: Understanding and correctly using tool outputs
The benchmark covers 45+ categories spanning diverse tool-use scenarios:
- Browser: Web browsing and page interaction
- Finance: Financial data retrieval and calculations
- Code Analysis: Source code inspection and manipulation
- Database: Query and data management operations
- File Management: File system operations
- Communication: Email, messaging, and notification tools
- Weather: Weather data retrieval
- Search: Information retrieval and search operations
- And many more
Each task provides a query, a set of available MCP tools with their schemas, and a ground truth sequence of tool calls that correctly completes the task.
What It Measures¶
MCPToolBench++ evaluates MCP-specific tool use capabilities:
- Tool schema comprehension: Understanding tool definitions, parameter types, required vs. optional fields, and return value formats
- Tool selection accuracy: Identifying the correct tool(s) from a set of available options based on the task description
- Parameter precision: Providing the exact parameter names and values expected by the tool schema
- Sequential reasoning: For multi-step tasks, determining the correct order of tool calls and passing results between them
- Efficiency: Completing tasks without excessive exploratory or redundant tool calls
MCPToolBench++ does not test:
- The actual execution of tool calls (evaluation is based on the call structure, not results)
- Code generation or debugging
- Free-form reasoning or knowledge retrieval
- Tasks that require tools not listed in the task's available tool set
Task Structure¶
Each MCPToolBench++ task contains the following fields:
| Field | Description |
|---|---|
| uuid | Unique task identifier |
| query | The natural language task description |
| category | Task category (e.g., "browser", "finance", "code_analysis") |
| call_type | Task complexity -- "single" (one tool call) or "multi" (multiple sequential calls) |
| tools | List of available tool names |
| mcp_tools_dict | Full MCP tool definitions including schemas and descriptions |
| function_call_label | Ground truth sequence of tool calls with parameters |
The agent receives the query, task metadata, and available tools, then must select and invoke the correct tools with proper parameters.
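For reference, the raw dataset records can be inspected directly with the `datasets` library. The snippet below (using the same dataset ID and `train` split as the category-listing command later on this page) prints the fields from the table above for the first task; exact values vary per task:

```python
# Print the fields of one raw MCPToolBench++ task record.
# Field names come from the table above; values are truncated for readability.
from datasets import load_dataset

ds = load_dataset("MCPToolBench/MCPToolBenchPP", split="train")
task = ds[0]
for field in ["uuid", "query", "category", "call_type",
              "tools", "mcp_tools_dict", "function_call_label"]:
    print(f"{field}: {str(task[field])[:120]}")
```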
Example Task (Single-Step)¶
Category: finance
Task Type: single-step tool call
Available Tools: get_stock_price, get_exchange_rate, calculate_tax
Task:
What is the current stock price of Apple Inc. (AAPL)?
Expected Tool Call:
- name: get_stock_price
  parameters:
    symbol: "AAPL"
Example Task (Multi-Step)¶
Category: finance
Task Type: multi-step tool call
Available Tools: get_stock_price, get_exchange_rate, calculate_tax
Task:
Get the current stock price of AAPL in USD, then convert it to EUR.
Expected Tool Calls:
1. name: get_stock_price
   parameters: { symbol: "AAPL" }
2. name: get_exchange_rate
   parameters: { from: "USD", to: "EUR" }
Configuration¶
Basic Configuration¶
# Run MCPToolBench++ with default settings
mcpbr run -c config.yaml --benchmark mcptoolbench
# Run a small sample
mcpbr run -c config.yaml --benchmark mcptoolbench -n 20
# Filter by difficulty (single vs multi-step)
mcpbr run -c config.yaml --benchmark mcptoolbench --filter-difficulty easy
mcpbr run -c config.yaml --benchmark mcptoolbench --filter-difficulty hard
# Filter by category
mcpbr run -c config.yaml --benchmark mcptoolbench --filter-category browser
mcpbr run -c config.yaml --benchmark mcptoolbench --filter-category finance
# Combine filters
mcpbr run -c config.yaml --benchmark mcptoolbench \
--filter-difficulty easy --filter-category browser
# Run with verbose output and save results
mcpbr run -c config.yaml --benchmark mcptoolbench -n 50 -v -o results.json
benchmark: "mcptoolbench"
sample_size: 10
timeout_seconds: 300
mcp_server:
  command: "npx"
  args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
model: "sonnet"
# Optional: filter by difficulty and category
filter_difficulty:
- "easy" # single-step tasks
filter_category:
- "browser"
- "finance"
Advanced Options¶
Difficulty Filtering¶
MCPToolBench++ maps difficulty labels to the call_type field:
| Filter Value | Maps To | Description |
|---|---|---|
| easy or single | single | Single tool call tasks |
| hard, multi, or medium | multi | Multi-step tool call sequences |
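As a rough sketch of this mapping (illustrative only, not mcpbr's internal implementation):

```python
# Illustrative mapping from --filter-difficulty values to the dataset's
# call_type field; not the benchmark runner's actual code.
DIFFICULTY_TO_CALL_TYPE = {
    "easy": "single",
    "single": "single",
    "hard": "multi",
    "multi": "multi",
    "medium": "multi",
}

def matches_difficulty(task: dict, requested: str) -> bool:
    """Return True if a task's call_type matches the requested difficulty label."""
    return task.get("call_type") == DIFFICULTY_TO_CALL_TYPE[requested.lower()]
```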
Category Filtering¶
Filter by any of the 45+ task categories. Category matching is case-insensitive. Common categories include:
| Category | Description |
|---|---|
| browser | Web browsing tasks |
| finance | Financial operations |
| code_analysis | Code inspection and analysis |
| database | Database operations |
| file | File management |
| weather | Weather data |
| search | Information retrieval |
| communication | Messaging and notification |
Configuration for Multi-Step Evaluation¶
benchmark: "mcptoolbench"
sample_size: 20
timeout_seconds: 300
max_iterations: 20
filter_difficulty:
- "hard" # Multi-step tasks only
model: "sonnet"
agent_prompt: |
  {problem_statement}
  Output your tool calls as a JSON array. Each element should have:
  - "name": the tool name
  - "parameters": an object with parameter key-value pairs
  Example:
  [{"name": "tool_name", "parameters": {"param1": "value1"}}]
Configuration for Category-Specific Testing¶
benchmark: "mcptoolbench"
sample_size: 30
timeout_seconds: 180
filter_category:
- "finance"
- "search"
- "weather"
model: "sonnet"
Evaluation Methodology¶
MCPToolBench++ evaluation compares the agent's tool calls against the ground truth using three metrics:
1. Tool Selection Accuracy¶
Measures how many of the expected tools the agent correctly selected:
A tool is considered correctly selected if the agent made at least one call to a tool with the same name as an expected tool. Each expected tool is counted at most once.
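A minimal sketch of this metric (illustrative, not the evaluator's exact code):

```python
# Fraction of expected tools that the agent called at least once,
# with each expected tool counted at most once.
def tool_selection_accuracy(agent_calls: list[dict], expected_calls: list[dict]) -> float:
    agent_names = {call["name"] for call in agent_calls}
    expected_names = {call["name"] for call in expected_calls}
    if not expected_names:
        return 0.0
    return sum(1 for name in expected_names if name in agent_names) / len(expected_names)
```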
2. Parameter Accuracy¶
Measures how correctly the agent filled in tool parameters:
For each correctly selected tool, the agent's parameters are compared against the ground truth. A parameter is correct if both the name and value match exactly.
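The sketch below illustrates the idea, assuming parameters are compared by exact name/value equality (and that each expected tool is matched against the agent's call to the tool of the same name):

```python
# For each expected call whose tool the agent selected, count parameters
# whose name and value both match the ground truth exactly.
def parameter_accuracy(agent_calls: list[dict], expected_calls: list[dict]) -> float:
    agent_params_by_tool = {call["name"]: call.get("parameters", {}) for call in agent_calls}
    total = correct = 0
    for expected in expected_calls:
        expected_params = expected.get("parameters", {})
        total += len(expected_params)
        agent_params = agent_params_by_tool.get(expected["name"])
        if agent_params is None:
            continue  # tool not selected; its parameters count as missed
        correct += sum(1 for key, value in expected_params.items()
                       if agent_params.get(key) == value)
    return correct / total if total else 1.0
```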
3. Sequence Match¶
A boolean indicating whether the agent called the exact same tools in the exact same order as the ground truth.
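In sketch form:

```python
# Exact order match over tool names.
def sequence_match(agent_calls: list[dict], expected_calls: list[dict]) -> bool:
    return [call["name"] for call in agent_calls] == [call["name"] for call in expected_calls]
```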
Overall Resolution¶
A task is resolved when all three conditions are met:
resolved = (tool_selection_accuracy >= 0.8)
           AND (parameter_accuracy >= 0.7)
           AND (agent_call_count <= expected_call_count * 1.5)
The thresholds allow for minor variations:
- 0.8 tool selection: Allows missing up to 20% of expected tools (e.g., 4 out of 5 correct)
- 0.7 parameter accuracy: Allows up to 30% parameter errors (e.g., minor formatting differences)
- 1.5x call limit: Allows some extra exploratory tool calls but prevents excessive flailing
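Putting the thresholds together, reusing the metric sketches from the previous subsections (note that sequence match is reported separately and does not gate resolution in this formula):

```python
# Resolution rule from the formula above, built on the earlier metric sketches.
def is_resolved(agent_calls: list[dict], expected_calls: list[dict]) -> bool:
    selection = tool_selection_accuracy(agent_calls, expected_calls)
    params = parameter_accuracy(agent_calls, expected_calls)
    within_budget = len(agent_calls) <= 1.5 * len(expected_calls)
    return selection >= 0.8 and params >= 0.7 and within_budget
```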
Tool Call Extraction¶
The evaluation attempts to extract tool calls from the agent's response in two ways:
- JSON parsing: If the response is valid JSON (a list of tool calls or an object with a tool_calls key), the calls are extracted directly.
- Text parsing: If JSON parsing fails, the evaluation falls back to pattern matching in the response text.
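A simplified sketch of this two-stage extraction follows; the evaluator's actual fallback pattern is not documented here, so the regex below is purely illustrative and only handles flat parameter objects:

```python
import json
import re

def extract_tool_calls(response: str) -> list[dict]:
    """Try JSON first; fall back to scanning the text for call-shaped fragments."""
    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict):
            parsed = parsed.get("tool_calls", [])
        if isinstance(parsed, list):
            return [c for c in parsed if isinstance(c, dict) and "name" in c]
    except json.JSONDecodeError:
        pass
    # Fallback: pick out simple {"name": ..., "parameters": {...}} fragments.
    calls = []
    pattern = r'\{\s*"name"\s*:\s*"[^"]+"\s*,\s*"parameters"\s*:\s*\{[^{}]*\}\s*\}'
    for match in re.finditer(pattern, response):
        try:
            calls.append(json.loads(match.group(0)))
        except json.JSONDecodeError:
            continue
    return calls
```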
Interpreting Results¶
Key Metrics¶
| Metric | Description |
|---|---|
| Resolve rate | Percentage of tasks meeting all three thresholds |
| Tool selection accuracy (avg) | Average percentage of correct tools selected across tasks |
| Parameter accuracy (avg) | Average percentage of correct parameters across tasks |
| Sequence match rate | Percentage of tasks with exact tool call order match |
| Per-category accuracy | Resolve rate broken down by task category |
| Per-difficulty accuracy | Resolve rate for single-step vs. multi-step tasks |
What Good Results Look Like¶
| Task Type | Score Range | Assessment |
|---|---|---|
| Single-step (easy) | 70-90%+ | Good -- agent reliably selects and invokes individual tools |
| Single-step (easy) | 50-70% | Adequate -- basic tool use works but parameter accuracy needs improvement |
| Multi-step (hard) | 50-70%+ | Good -- agent handles sequential tool orchestration |
| Multi-step (hard) | 30-50% | Adequate -- struggles with tool sequencing or result passing |
| Multi-step (hard) | Below 30% | Needs investigation -- check structured output format and tool schema access |
Metric Independence
A high tool selection accuracy with low parameter accuracy indicates the agent understands which tools to use but struggles with exact parameter formatting. Conversely, low tool selection with high parameter accuracy (for selected tools) suggests the agent is good at invocation but poor at choosing the right tool. Track these metrics independently for targeted improvement.
Common Failure Patterns¶
| Pattern | Cause | Solution |
|---|---|---|
| No tool calls extracted | Agent describes actions in natural language instead of structured output | Configure prompt to request JSON-formatted tool calls |
| High tool selection, low parameter accuracy | Parameter name mismatches (e.g., "stock_symbol" vs "symbol") | Review MCP tool schemas; ensure agent has access to full tool definitions |
| Excessive tool calls | Agent retries with different parameters or explores alternatives | Instruct agent to be deliberate; the 1.5x limit prevents excessive flailing |
| Wrong tool order in multi-step | Agent calls tools in incorrect sequence | Provide clear instructions about sequential dependencies |
| Category-specific failures | Agent lacks domain knowledge for certain categories | Filter to categories relevant to your MCP server; investigate per-category metrics |
Example Output¶
Successful Evaluation¶
{
"resolved": true,
"tool_selection_accuracy": 1.0,
"parameter_accuracy": 0.85,
"sequence_match": true,
"details": "Tool selection: 100.0%, Parameter accuracy: 85.0%, Sequence match: True"
}
Failed Evaluation (Low Tool Selection)¶
{
"resolved": false,
"tool_selection_accuracy": 0.5,
"parameter_accuracy": 0.9,
"sequence_match": false,
"details": "Tool selection: 50.0%, Parameter accuracy: 90.0%, Sequence match: False"
}
Failed Evaluation (No Tool Calls Extracted)¶
{
"resolved": false,
"tool_selection_accuracy": 0.0,
"parameter_accuracy": 0.0,
"sequence_match": false,
"details": "Agent made no tool calls"
}
Failed Evaluation (No Ground Truth)¶
Best Practices¶
Recommended Workflow¶
- Start with single-step tasks (--filter-difficulty easy) to establish baseline tool selection and invocation capability
- Test category-by-category to identify which tool types your MCP server handles well
- Progress to multi-step tasks once single-step accuracy exceeds 70%
- Track all three metrics (tool selection, parameter accuracy, sequence match) separately for targeted optimization
- Test with your actual MCP server -- MCPToolBench++ is most valuable when evaluating your real tool configuration
Performance Tips¶
- Use structured output prompts: MCPToolBench++ evaluation depends on extracting tool calls from the agent's response. JSON-formatted output is most reliable.
- Provide tool schemas: Ensure your agent has access to the full MCP tool definitions. The mcp_tools_dict field in each task contains the complete schemas.
- Increase timeout for multi-step tasks: Multi-step tasks require sequential tool calls and result interpretation. Use at least 300 seconds.
- Monitor all three metrics: Track tool selection accuracy, parameter accuracy, and sequence match separately. Each reveals different aspects of tool-use capability.
Cost Optimization¶
- MCPToolBench++ is moderately priced: Tasks are shorter than code generation benchmarks but involve tool schema processing
- Single-step tasks are cheapest: One tool call per task means fewer tokens and faster completion
- Use sonnet for all difficulty levels: Tool use tasks depend more on structured output capability than deep reasoning
- Filter by category for focused evaluation rather than running all 45+ categories
- Start with 20 tasks per category for statistically meaningful results without excessive cost
- JSON output prompts reduce cost: Structured prompts lead to more concise, parseable responses
Common Issues & Solutions¶
| Issue | Cause | Solution |
|---|---|---|
| Agent does not produce structured tool calls | Agent describes tool calls in natural language | Configure prompt to request JSON output with name and parameters fields |
| Tool selection is high but parameter accuracy is low | Parameter name mismatches or incorrect value types | Review MCP tool schemas to ensure parameter names and types match exactly |
| Extra tool calls cause resolution failure | Agent exceeds 1.5x expected call count | Instruct agent to be deliberate; avoid exploratory calls |
| Category filter returns no results | Category name does not match dataset | Inspect available categories (see category listing below) |
| Low sequence match despite high individual metrics | Correct tools called in wrong order | For multi-step tasks, instruct agent to consider dependencies between calls |
To inspect available categories:
uv run python -c "
from datasets import load_dataset
ds = load_dataset('MCPToolBench/MCPToolBenchPP', split='train')
cats = sorted(set(item['category'] for item in ds))
for cat in cats:
    print(cat)
"
Comparison with Similar Benchmarks¶
| Aspect | MCPToolBench++ | ToolBench | GAIA | AgentBench | WebArena |
|---|---|---|---|---|---|
| Goal | MCP tool use | API tool use | General assistant | Multi-environment agent | Web browsing tasks |
| Tool Type | MCP tool schemas | REST API endpoints | Any available tools | Environment-specific | Browser actions |
| Evaluation | Accuracy thresholds | Tool call comparison | Exact match (answer) | String matching | Reference matching |
| Metrics | Selection + params + sequence | Tool call match | Answer correctness | Task completion | Action accuracy |
| Task Types | Single + multi-step | Single + multi-step | Varied (QA) | Varied (multi-env) | Web interaction |
| Categories | 45+ | Varies | 3 levels | Multiple environments | Web domains |
| MCP-Specific | Yes | No | No | No | No |
| Typical Timeout | 180-300s | 120-300s | 180-600s | 120-300s | 120-300s |
| Best For | MCP server evaluation | General API tool testing | Overall assistant quality | Broad agent capability | Web automation |
When to Use MCPToolBench++
Use MCPToolBench++ when you need to evaluate an MCP server's tool use pipeline specifically. It is the only benchmark designed around the MCP tool lifecycle (discovery, selection, invocation, interpretation). For general assistant capability, use GAIA. For code-focused evaluation, use SWE-bench or HumanEval. For real-world API testing, use ToolBench.
References¶
- MCPToolBench++ Dataset on HuggingFace
- Model Context Protocol Specification
- ToolBench -- general API tool use benchmark
- GAIA -- general AI assistant benchmark with tool use
- AgentBench -- multi-environment agent benchmark
- Benchmarks Overview
- MCP Integration Guide
- Configuration Reference
- CLI Reference