AgentBench¶
Overview¶
| Property | Value |
|---|---|
| Benchmark ID | agentbench |
| Dataset | THUDM/AgentBench |
| Tasks | Varies by environment |
| Evaluation | String matching on expected output (case-insensitive substring) |
| Output Type | Completion verification |
| Timeout | 180-600 seconds |
AgentBench is a multi-dimensional benchmark designed to evaluate LLMs as autonomous agents across diverse, interactive environments. Unlike benchmarks that test a single capability, AgentBench provides a comprehensive evaluation of an LLM's ability to understand instructions, interact with environments, and produce correct outcomes across fundamentally different domains.
The benchmark spans the following environment types:
| Environment | Code | Description |
|---|---|---|
| Operating System | os | Execute shell commands, navigate file systems, manage processes |
| Database | db | Write and execute SQL queries, manage data |
| Knowledge Graph | kg | Navigate and query structured knowledge bases |
| Web Shopping | web | Browse web stores, find products, complete purchases |
| Digital Card Game | card | Play strategy card games with defined rules |
| Lateral Thinking | lateral | Solve puzzles requiring creative, non-obvious reasoning |
| House-holding | house | Complete household tasks in simulated environments |
AgentBench is particularly valuable for evaluating:
- Environment adaptation: Can the agent adjust its approach across different domains?
- Instruction following: Does the agent correctly interpret and execute complex instructions?
- Multi-step planning: Can the agent break down complex tasks into executable steps?
- Tool interaction: How effectively does the agent use environment-specific tools and commands?
- Error recovery: Can the agent handle unexpected outcomes and adjust its strategy?
Task Structure¶
Each AgentBench task contains the following fields:
- task_id: Unique identifier for the task
- environment: The environment type (e.g., "os", "db", "kg", "web")
- instruction (or description): The task description telling the agent what to accomplish
- expected_output: The expected result for evaluation
The agent receives the instruction along with the environment context and must interact with the environment to produce the expected output.
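A raw task record therefore looks roughly like the following sketch (field names are those listed above; the identifier and values are illustrative, not taken from the dataset):

```python
# Illustrative AgentBench task record -- field names from the list above,
# values invented for the example.
task = {
    "task_id": "os-0042",            # unique identifier (hypothetical)
    "environment": "os",             # one of: os, db, kg, web, card, lateral, house
    "instruction": "Find all Python files in /home/user/projects that import pandas and count them.",
    "expected_output": "7",          # ground truth used for substring matching
}
```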
Example Task (OS Environment)¶
Environment: os
Instruction: Find all Python files in the /home/user/projects directory that
contain the string "import pandas" and count how many there are.
Expected Output: 7
Example Task (Database Environment)¶
Environment: db
Instruction: Write a SQL query to find the top 5 customers by total order
amount from the orders table. Return their names and total amounts.
Expected Output: SELECT c.name, SUM(o.amount) as total FROM customers c
JOIN orders o ON c.id = o.customer_id GROUP BY c.name ORDER BY total DESC LIMIT 5
Example Task (Web Shopping)¶
Environment: web
Instruction: Find a pair of wireless noise-cancelling headphones under $100
on the shopping website. Add the cheapest option to the cart.
Expected Output: Added to cart
Running the Benchmark¶
# Run AgentBench with default settings
mcpbr run -c config.yaml --benchmark agentbench
# Run a small sample
mcpbr run -c config.yaml --benchmark agentbench -n 10
# Filter by environment type
mcpbr run -c config.yaml --benchmark agentbench --filter-category os
mcpbr run -c config.yaml --benchmark agentbench --filter-category db
# Run multiple environment types
mcpbr run -c config.yaml --benchmark agentbench \
--filter-category os --filter-category db
# Run with extended timeout and verbose output
mcpbr run -c config.yaml --benchmark agentbench -n 20 -v -o results.json
Environment Filtering¶
AgentBench supports filtering by environment type using filter_category. The environment names are matched case-insensitively against the environment field in each task.
Common environment codes:
| Code | Environment |
|---|---|
| os | Operating system interaction |
| db | Database queries |
| kg | Knowledge graph navigation |
| web | Web shopping and browsing |
| card | Digital card games |
| lateral | Lateral thinking puzzles |
| house | House-holding tasks |
You can combine multiple environment types in a single evaluation by specifying multiple filter_category values.
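For instance, a config.yaml excerpt along the following lines could restrict a run to the OS and database environments (the key placement is an assumption; filter_category is the setting referenced above, and the equivalent --filter-category CLI flags appear under Running the Benchmark):

```yaml
# config.yaml (excerpt) -- structure assumed; filter_category values are
# matched case-insensitively against each task's environment field.
filter_category:
  - os
  - db
```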
Evaluation Methodology¶
AgentBench uses a straightforward string matching evaluation:
- Expected output retrieval: The `expected_output` field from the task is used as the ground truth.
- Case-insensitive substring matching: Both the expected output and the agent's solution are stripped of whitespace and lowercased. The task is resolved if the normalized expected output appears as a substring within the normalized solution.
- Result truncation: The agent's output is truncated to 500 characters when stored in results for display purposes, but the full output is used for evaluation.
Scoring¶
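The resolution check can be sketched in Python as follows; this is a paraphrase of the substring rule described above, not the literal mcpbr implementation:

```python
def is_resolved(expected_output: str, solution: str) -> bool:
    """Sketch of the scoring rule: case-insensitive, whitespace-stripped substring match."""
    return expected_output.strip().lower() in solution.strip().lower()
```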
Where:
- `expected_output`: The ground truth answer from the task
- `solution`: The agent's full response text
Evaluation Characteristics¶
- Lenient matching: Since substring matching is used, the agent can include the expected output within a longer explanation and still pass.
- Case insensitive: "SELECT" and "select" are treated as equivalent.
- Whitespace normalized: Leading and trailing whitespace is stripped before comparison.
- No partial credit: The task is either resolved or not -- there is no scoring gradient.
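As a quick illustration of these characteristics, using the is_resolved sketch from the Scoring section (the outputs follow directly from the substring rule; they are not taken from a real run):

```python
is_resolved("7", "I found 7 Python files containing 'import pandas'.")  # True: substring within a longer answer
is_resolved("SELECT", "select c.name from customers c")                 # True: case-insensitive
is_resolved("7", "There are seven matching files.")                     # False: no partial credit for "seven"
```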
Example Output¶
Successful Evaluation¶
{
"resolved": true,
"agent_output": "After searching the directory, I found 7 Python files containing 'import pandas'.",
"expected_output": "7"
}
Failed Evaluation (Wrong Answer)¶
{
"resolved": false,
"agent_output": "I found 5 Python files with the import statement.",
"expected_output": "7"
}
Successful Evaluation (Substring Match)¶
{
"resolved": true,
"agent_output": "The item has been added to cart successfully. Your cart now contains 1 item.",
"expected_output": "Added to cart"
}
The substring "added to cart" (lowercased) is found within the agent's response.
Troubleshooting¶
Agent provides correct reasoning but wrong output format¶
Since evaluation uses substring matching, the agent's response must contain the expected output string. If the expected output is "7" and the agent responds "seven", the evaluation will fail. Instruct the agent to include explicit, direct answers:
agent_prompt: |
{problem_statement}
After completing the task, state your final answer clearly.
If the answer is a number, include the numeric value.
If the answer is a command result, include the exact output.
Timeout on complex environment tasks¶
Some AgentBench tasks, especially those in the OS and web environments, require multiple interaction steps. Increase the timeout for these environments.
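A minimal config.yaml sketch, assuming the timeout_seconds key referenced under Best Practices sits at the top level of the config (its exact placement may differ):

```yaml
# config.yaml (excerpt) -- placement assumed; timeout_seconds is the key
# referenced in Best Practices below.
timeout_seconds: 600   # upper end of the 180-600 second range, for multi-step tasks
```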
Environment-specific capabilities are missing¶
Different AgentBench environments require different agent capabilities:
- OS tasks: Need shell access and file system navigation
- DB tasks: Need SQL query execution
- Web tasks: Need web browsing capabilities
- KG tasks: Need knowledge graph query tools
Ensure your MCP server provides the necessary tools for the environment types you are evaluating.
Low scores across all environments¶
If the agent performs poorly across all environments, check:
- The agent has access to the necessary interaction tools
- The timeout is long enough for multi-step tasks
- The agent prompt encourages environment interaction rather than just reasoning
- The model is capable of following complex, multi-step instructions
Best Practices¶
- Evaluate per-environment: Track performance separately for each environment type. Aggregating across environments can mask environment-specific weaknesses.
- Start with OS and DB environments: These are typically the most straightforward and help verify basic agent interaction capabilities before testing more complex environments.
- Use generous timeouts: Agent tasks often require multiple rounds of interaction. Set `timeout_seconds` to at least 300, and 600 for web and lateral thinking tasks.
- Provide appropriate tools: Match your MCP server capabilities to the environments you are evaluating. An agent without web browsing tools will fail all web shopping tasks regardless of model capability.
- Monitor agent_output: The stored output is truncated to 500 characters. Use verbose logging (`-v` or `-vv`) to capture the full agent interaction for debugging.
- Use for comprehensive evaluation: AgentBench's multi-environment design makes it ideal for understanding an agent's overall capability profile. Run it alongside domain-specific benchmarks for a complete picture.
- Compare across models: AgentBench provides a standardized multi-dimensional score that is useful for comparing different models' agent capabilities.
Related Links¶
- Benchmarks Overview
- GAIA - General AI assistant benchmark
- MCPToolBench++ - MCP-specific tool use benchmark
- ToolBench - Real-world API tool use benchmark
- AgentBench Dataset
- AgentBench Paper
- AgentBench Project
- Configuration Reference
- CLI Reference