Tool Use & Agents Benchmarks
6 benchmarks in this category
AgentBench: Autonomous Agent Evaluation Across OS, Database & Web
AgentBench evaluates LLMs as autonomous agents across diverse environments, including operating systems, databases, and the web.
InterCode: Interactive Coding with Bash, SQL & Python Interpreters
InterCode evaluates agents on interactive coding tasks requiring multi-turn interaction with Bash, SQL, and Python interpreters through observation-action loops.
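To make the observation-action pattern concrete, here is a minimal sketch of such a loop against a Bash interpreter; the `agent` callable and the `submit` convention are placeholders for illustration, not InterCode's actual interface.

```python
import subprocess

def run_episode(agent, task_prompt, max_turns=10):
    """Minimal observation-action loop against a Bash interpreter.
    `agent` is any callable mapping (task, history) -> next command;
    it is a stand-in, not InterCode's real API."""
    history = []
    for _ in range(max_turns):
        action = agent(task_prompt, history)          # e.g. "ls data/ | wc -l"
        if action.strip() == "submit":                # agent signals it is done
            break
        result = subprocess.run(
            ["bash", "-c", action],
            capture_output=True, text=True, timeout=30,
        )
        observation = result.stdout + result.stderr   # fed back on the next turn
        history.append((action, observation))
    return history
```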
MCPToolBench++: MCP Tool Discovery, Selection & Invocation Benchmark
MCPToolBench++ evaluates AI agents on MCP tool discovery, selection, invocation, and result interpretation across 45+ categories with accuracy-threshold-based evaluation.
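As a rough illustration of accuracy-threshold scoring, the sketch below compares a predicted tool call against a reference call; the dictionary fields and the 0.8 threshold are assumptions made for the example, not the benchmark's real schema.

```python
def score_tool_call(predicted, expected, threshold=0.8):
    """Toy scorer: checks the chosen tool and the fraction of correctly
    filled parameters, then applies an accuracy threshold. Field names
    and the threshold value are illustrative only."""
    if predicted["tool"] != expected["tool"]:
        return 0.0, False
    exp_args = expected["arguments"]
    hits = sum(1 for k, v in exp_args.items()
               if predicted["arguments"].get(k) == v)
    accuracy = hits / max(len(exp_args), 1)
    return accuracy, accuracy >= threshold

# Example: correct tool, one of two parameters wrong -> 0.5, below threshold
acc, passed = score_tool_call(
    {"tool": "get_weather", "arguments": {"city": "Paris", "unit": "F"}},
    {"tool": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
)
```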
TerminalBench: Shell & Terminal Task Evaluation for AI Agents
TerminalBench evaluates AI agents on practical terminal and shell tasks including file manipulation, system administration, scripting, and command-line tool usage with validation-command-based evaluation.
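The sketch below shows the general shape of validation-command-based scoring, assuming each task is judged by a shell command's exit status; it omits the sandboxing and task isolation a real harness would use.

```python
import subprocess

def evaluate_task(agent_commands, validation_command, timeout=60):
    """Run the agent's shell commands, then a task-specific validation
    command; the task passes iff the validator exits with status 0.
    A rough sketch of the scoring idea, not the benchmark's harness."""
    for cmd in agent_commands:
        subprocess.run(["bash", "-c", cmd], capture_output=True, timeout=timeout)
    check = subprocess.run(["bash", "-c", validation_command],
                           capture_output=True, timeout=timeout)
    return check.returncode == 0

# e.g. task: "archive the logs directory"
# evaluate_task(["tar czf logs.tar.gz logs/"], "tar tzf logs.tar.gz > /dev/null")
```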
ToolBench: Real-World API Tool Selection & Invocation Benchmark
ToolBench evaluates agents on selecting the right real-world API tool and invoking it with correct parameters.
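As an illustration of parameter-level checking, the sketch below validates a model's API call against a small hand-written spec registry; the registry format, API name, and parameters are invented for the example and are not ToolBench's data format.

```python
def validate_invocation(call, api_specs):
    """Check that the selected API exists and that required parameters are
    supplied with the expected types. `api_specs` mimics an OpenAPI-style
    registry; names and fields are illustrative."""
    spec = api_specs.get(call["api_name"])
    if spec is None:
        return False, "unknown API"
    for name, param in spec["parameters"].items():
        if param.get("required") and name not in call["arguments"]:
            return False, f"missing required parameter: {name}"
        if name in call["arguments"] and not isinstance(call["arguments"][name], param["type"]):
            return False, f"wrong type for parameter: {name}"
    return True, "ok"

api_specs = {
    "flight_search": {"parameters": {
        "origin": {"type": str, "required": True},
        "date":   {"type": str, "required": True},
        "cabin":  {"type": str, "required": False},
    }},
}
ok, msg = validate_invocation(
    {"api_name": "flight_search",
     "arguments": {"origin": "SFO", "date": "2025-01-10"}},
    api_specs,
)
```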
WebArena: Autonomous Web Agent Evaluation in Realistic Environments
WebArena evaluates autonomous web agents in realistic web environments featuring functional e-commerce sites, forums, content management systems, and maps with multi-step interaction tasks.
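To make the multi-step interaction concrete, here is a minimal browser-agent loop using Playwright; the `agent` callable and its click/fill/stop action format are assumptions for the sketch, not WebArena's actual observation or action space.

```python
from playwright.sync_api import sync_playwright

def run_web_task(agent, start_url, max_steps=15):
    """Minimal web-agent loop: the agent reads the page text and proposes
    one action per step (click / fill / stop). The agent callable and the
    action dictionary format are placeholders, not WebArena's interface."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            observation = page.inner_text("body")[:4000]   # truncated page text
            action = agent(observation, page.url)          # e.g. {"op": "click", "selector": "#add-to-cart"}
            if action["op"] == "stop":
                break
            if action["op"] == "click":
                page.click(action["selector"])
            elif action["op"] == "fill":
                page.fill(action["selector"], action["text"])
        final_url = page.url                               # final state used for scoring
        browser.close()
    return final_url
```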
Benchmark Your MCP Server
Get hard numbers comparing tool-assisted vs. baseline agent performance on real tasks.
Created by Grey Newell