MLAgentBench¶
| Property | Value |
|---|---|
| Benchmark ID | mlagentbench |
| Dataset | MLAgentBench/MLAgentBench |
| Tasks | ML research tasks based on real Kaggle competitions and research challenges |
| Evaluation | Runs eval script, extracts numeric score, compares against baseline with automatic metric direction detection |
| Output Type | Numeric metric (accuracy, loss, F1, etc.) |
| Timeout | 300-900s recommended |
Overview¶
MLAgentBench evaluates AI agents on their ability to perform real-world machine learning research tasks. Each task is based on an actual Kaggle competition or ML research challenge, requiring agents to analyze datasets, design and implement model architectures, train models, tune hyperparameters, debug ML pipelines, and ultimately improve performance metrics beyond a given baseline.
Unlike code generation benchmarks that test isolated function implementations, MLAgentBench tests end-to-end ML engineering competency. Agents must:
- Understand the research problem and target metric
- Explore and analyze provided datasets
- Implement or modify ML training pipelines
- Train models and evaluate results
- Iterate on their approach to improve performance
The evaluation is automated: after the agent completes its work, an evaluation script runs in the environment and produces a numeric score. This score is compared against a known baseline to determine whether the agent improved performance. The system automatically detects whether the metric is "higher is better" (e.g., accuracy, F1 score) or "lower is better" (e.g., loss, RMSE) based on the metric name.
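A minimal sketch of that keyword-based check, using the keyword list documented under Evaluation Methodology below (illustrative only, not mcpbr's internal code):

```python
# Illustrative sketch of metric direction detection; not mcpbr's actual code.
# A metric is treated as "lower is better" when its name contains a loss-style
# keyword; every other metric is treated as "higher is better".
LOWER_IS_BETTER_KEYWORDS = ("loss", "rmse", "mae", "mse", "error", "perplexity")

def higher_is_better(metric_name: str) -> bool:
    name = metric_name.lower()
    return not any(keyword in name for keyword in LOWER_IS_BETTER_KEYWORDS)

assert higher_is_better("accuracy")      # higher accuracy beats the baseline
assert not higher_is_better("val_rmse")  # lower RMSE beats the baseline
```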
MLAgentBench is particularly useful for evaluating MCP servers that provide data analysis, ML framework integration, or computational notebook capabilities.
Task Structure¶
Each MLAgentBench task contains the following fields:
| Field | Description |
|---|---|
| task_id | Unique identifier for the task |
| research_problem | Detailed description of the ML research challenge |
| domain | ML domain: nlp, cv, tabular, or other specializations |
| metric | Target metric name (e.g., accuracy, loss, rmse, f1) |
| baseline_score | Known baseline performance to improve upon |
| eval_command | Command to run the evaluation script (default: python3 evaluate.py) |
| repo | Repository with starter code, data, and evaluation scripts |
Example task:
Complete the following ML research task:
Improve the text classification model on the IMDB sentiment analysis dataset.
The current model achieves 85% accuracy using a basic logistic regression approach.
Implement a more effective model architecture and training procedure.
Target metric: accuracy
Baseline score: 0.85
Improve upon the baseline and save your results.
The agent must analyze the provided code, implement improvements, train a model, and ensure the evaluation script reports a score above 0.85.
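As a rough illustration, the example above could be represented as a task record with the fields described earlier; the task_id and repo values here are invented placeholders and do not come from the actual dataset:

```python
# Hypothetical task record mirroring the fields in the table above.
# task_id and repo are made up; the other values follow the example task.
example_task = {
    "task_id": "imdb-sentiment-improvement",  # illustrative identifier
    "research_problem": "Improve the text classification model on the IMDB "
                        "sentiment analysis dataset.",
    "domain": "nlp",
    "metric": "accuracy",
    "baseline_score": 0.85,
    "eval_command": "python3 evaluate.py",
    "repo": "<starter repo with code, data, and evaluation scripts>",  # placeholder
}
```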
Running the Benchmark¶
# Run MLAgentBench with default settings
mcpbr run -c config.yaml --benchmark mlagentbench
# Run a sample of 5 tasks (ML tasks are resource-intensive)
mcpbr run -c config.yaml --benchmark mlagentbench -n 5
# Filter by ML domain
mcpbr run -c config.yaml --benchmark mlagentbench --filter-category nlp
# Filter by multiple domains
mcpbr run -c config.yaml --benchmark mlagentbench \
--filter-category cv --filter-category tabular
# Run with extended timeout for training
mcpbr run -c config.yaml --benchmark mlagentbench -n 3 --timeout 900
# Run with verbose output
mcpbr run -c config.yaml --benchmark mlagentbench -n 5 -v
# Save results to JSON
mcpbr run -c config.yaml --benchmark mlagentbench -n 10 -o results.json
Example configuration:
benchmark: "mlagentbench"
sample_size: 5
timeout_seconds: 600
mcp_server:
  command: "npx"
  args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
model: "sonnet"
# Optional: Filter by domain
filter_category:
- "nlp"
Configuration for compute-intensive CV tasks:
benchmark: "mlagentbench"
sample_size: 3
timeout_seconds: 900
max_iterations: 40
max_concurrent: 2
filter_category:
- "cv"
model: "opus"
Configuration for quick tabular tasks:
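The values below are a suggested starting point rather than settings mandated by the benchmark; tabular tasks are usually lighter, so a shorter timeout tends to be sufficient:

```yaml
# Suggested values; adjust sample_size and timeout_seconds for your environment.
benchmark: "mlagentbench"
sample_size: 5
timeout_seconds: 300
filter_category:
  - "tabular"
model: "sonnet"
```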
Evaluation Methodology¶
MLAgentBench evaluation measures performance improvement over a baseline through the following process:
1. Agent Execution: The agent receives the research problem, target metric, and baseline score. It works within the provided repository to analyze data, modify code, train models, and save results.
2. Evaluation Script Execution: After the agent completes its work, the evaluation command (default: python3 evaluate.py) is executed in the environment with a 300-second timeout. This script loads the agent's trained model or predictions and computes the target metric.
3. Score Extraction: The evaluation parses stdout for a line matching the pattern score|accuracy|loss|metric = <number> (case-insensitive). The extracted numeric value is the agent's achieved score.
4. Metric Direction Detection: The system automatically determines whether higher or lower values indicate improvement:
   - Higher is better: metrics not containing loss-related keywords (accuracy, score, f1, precision, recall, etc.)
   - Lower is better: metrics containing loss, rmse, mae, mse, error, or perplexity
5. Baseline Comparison: The agent's score is compared against the task's baseline:
   - For higher-is-better metrics: resolved if score > baseline
   - For lower-is-better metrics: resolved if score < baseline (and baseline > 0)
6. Resolution: The task is marked as resolved if the agent's score improves upon the baseline in the correct direction. If the evaluation script fails (non-zero exit code) or no score can be extracted, the task is marked as unresolved.
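Steps 3-5 can be sketched as follows; the regular expression and comparison mirror the description above, but the code is an illustration, not mcpbr's actual implementation:

```python
import re

# Illustrative sketch of score extraction and baseline comparison (steps 3 and 5);
# not mcpbr's actual implementation.
SCORE_PATTERN = re.compile(
    r"(?:score|accuracy|loss|metric)\s*[=:]\s*([-+]?\d*\.?\d+)", re.IGNORECASE
)

def extract_score(stdout: str) -> float | None:
    """Return the last score-like number printed to stdout, if any."""
    matches = SCORE_PATTERN.findall(stdout)
    return float(matches[-1]) if matches else None

def is_resolved(score: float, baseline: float, higher_is_better: bool) -> bool:
    """True when the score beats the baseline in the correct direction."""
    if higher_is_better:
        return score > baseline
    return baseline > 0 and score < baseline

stdout = "epoch 3/3 finished\naccuracy = 0.91\n"
score = extract_score(stdout)                             # 0.91
print(is_resolved(score, 0.85, higher_is_better=True))    # True: beats the 0.85 baseline
print(is_resolved(0.40, 0.35, higher_is_better=False))    # False: 0.40 RMSE does not beat 0.35
```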
Example Output¶
Successful resolution (higher is better):
Successful resolution (lower is better):
Failed resolution (did not beat baseline):
Failed resolution (evaluation script error):
{
"resolved": false,
"error": "Evaluation script failed: ModuleNotFoundError: No module named 'sklearn'"
}
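If results are saved with -o results.json, a quick tally can be sketched as below; this assumes the file holds a list of per-task objects with a boolean resolved field like the example above, which may not match the exact output schema:

```python
import json

# Hypothetical post-processing sketch; the real results.json schema may differ.
with open("results.json") as f:
    results = json.load(f)  # assumed: a list of per-task result objects

resolved = sum(1 for task in results if task.get("resolved"))
print(f"resolved {resolved}/{len(results)} tasks")
```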
Troubleshooting¶
Evaluation script fails with import errors
ML tasks often require specific Python packages (scikit-learn, torch, tensorflow, pandas, etc.) that may not be in the base Docker image. The agent should install required packages as part of its workflow, or the task environment may need custom setup. Check stderr output for specific missing modules.
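One defensive pattern, shown here only as an illustration, is for the agent's training or evaluation code to install a dependency before importing it; the package names below are examples, not a fixed requirement of the benchmark:

```python
import importlib
import subprocess
import sys

def ensure(package: str, module: str | None = None) -> None:
    """Install a package with pip if its import module is missing."""
    try:
        importlib.import_module(module or package)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Example usage; the pip package name and import name can differ.
ensure("scikit-learn", module="sklearn")
ensure("pandas")
```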
Score not extracted from evaluation output
The evaluation looks for patterns like accuracy = 0.92 or loss: 0.234 in stdout. If the evaluation script uses a different format, the score extraction will fail. Ensure the evaluation script outputs scores in a recognized format: metric_name = numeric_value or metric_name: numeric_value.
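For example, a minimal evaluate.py could end by printing the metric in one of the recognized forms; the metric value here is a placeholder:

```python
# Sketch of the tail end of an evaluation script. The metric value is a
# placeholder; the important part is the print format, which the score
# extraction step can parse.
accuracy = 0.92  # replace with the metric computed from the model's predictions

print(f"accuracy = {accuracy}")   # recognized: "name = value"
# print(f"accuracy: {accuracy}")  # also recognized: "name: value"
```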
Training times out
ML training can be computationally expensive, especially for CV tasks with large datasets or complex models. Increase timeout_seconds to 900 for GPU-intensive tasks. Consider reducing max_concurrent to 1-2 for resource-constrained environments, and use --filter-category tabular for faster tasks during initial testing.
Agent does not improve baseline
The agent may struggle with complex ML tasks. Ensure the agent prompt encourages iterative experimentation and mentions the baseline score as a target to beat. Raising max_iterations to 30-40 gives the agent more attempts to refine its approach.
Best Practices¶
- Start with a very small sample (-n 2 or -n 3) since ML tasks are computationally expensive and time-consuming.
- Use extended timeouts (600-900s) to account for model training time, especially for CV and NLP tasks.
- Reduce concurrency (max_concurrent: 1-2) for ML tasks that are memory and CPU intensive.
- Filter by domain to focus on task types relevant to your evaluation goals. Tabular tasks tend to be fastest; CV tasks tend to be slowest.
- Increase max_iterations to 30-40 to give the agent sufficient turns to explore the data, implement solutions, train models, and iterate on improvements.
- Monitor metric direction in results to verify the system correctly identified whether higher or lower values are better for each task's metric.
- Start with tabular tasks (--filter-category tabular) for initial testing, as they typically have shorter training times and lower resource requirements.
- Track costs carefully since ML tasks require many agent turns and long execution times, which increases API usage.