Analytics API Reference¶
The mcpbr.analytics package provides comprehensive statistical analysis, historical tracking, and comparison tools for benchmark results. All calculations use only the Python standard library -- no NumPy or SciPy required.
from mcpbr.analytics import (
ResultsDatabase,
ComparisonEngine,
RegressionDetector,
ABTest,
Leaderboard,
MetricsRegistry,
)
ResultsDatabase¶
SQLite-backed persistent storage for evaluation runs and per-task results.
ResultsDatabase ¶
SQLite-backed storage for mcpbr evaluation results.
Stores evaluation runs and per-task results, supporting queries for trend analysis, filtering, and cleanup of old data.
Example:
with ResultsDatabase("my_results.db") as db:
run_id = db.store_run(results_data)
run = db.get_run(run_id)
trends = db.get_trends(benchmark="swe-bench-verified")
__init__(db_path='mcpbr_results.db') ¶
Open or create the SQLite results database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
db_path | str \| Path | Path to the SQLite database file. The file and any parent directories are created if they do not exist. | 'mcpbr_results.db' |
store_run(results_data) ¶
Store a complete evaluation run with its task results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
results_data | dict[str, Any] | Evaluation results dictionary in the standard mcpbr output format; its summary fields and per-task results populate the runs and task_results tables (see Database Schema). | required |
Returns:
| Type | Description |
|---|---|
int | The auto-generated run ID for the stored run. |
Raises:
| Type | Description |
|---|---|
sqlite3.Error | On database write failures. |
get_run(run_id) ¶
Retrieve a specific evaluation run by ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
run_id | int | The run identifier returned by store_run(). | required |
Returns:
| Type | Description |
|---|---|
dict[str, Any] \| None | A dictionary with the run's columns, or None if no run exists with that ID. |
list_runs(limit=50, benchmark=None, model=None, provider=None) ¶
List evaluation runs with optional filtering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
limit | int | Maximum number of runs to return. Runs are ordered by timestamp descending (most recent first). | 50 |
benchmark | str \| None | Filter by benchmark name (exact match). | None |
model | str \| None | Filter by model identifier (exact match). | None |
provider | str \| None | Filter by provider name (exact match). | None |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of run dictionaries, most recent first. |
get_task_results(run_id) ¶
Get all task-level results for a specific run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
run_id | int | The run identifier. | required |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of task result dictionaries for the run. |
delete_run(run_id) ¶
Delete an evaluation run and all its associated task results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
run_id | int | The run identifier to delete. | required |
Returns:
| Type | Description |
|---|---|
bool | True if a run and its task results were deleted, False if no run exists with the given ID. |
get_trends(benchmark=None, model=None, limit=20) ¶
Get resolution rate, cost, and token trends over time.
Returns a time-ordered list of aggregate metrics for each run matching the optional filters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
benchmark | str \| None | Filter by benchmark name. | None |
model | str \| None | Filter by model identifier. | None |
limit | int | Maximum number of data points to return. | 20 |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | Time-ordered list of dicts, one per matching run, with aggregate metrics such as the run timestamp, resolution rate, cost, and token usage. |
cleanup(max_age_days=90) ¶
Delete runs older than the specified age.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
max_age_days | int | Maximum age in days. Runs with a timestamp older than this many days from now are deleted, along with their task results. | 90 |
Returns:
| Type | Description |
|---|---|
int | Number of runs deleted. |
close() ¶
Close the database connection.
After calling this method the database instance should not be used.
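When the context-manager form shown under Usage below is not convenient, pair construction with an explicit close() in a try/finally block. A minimal sketch (results_data stands in for a standard mcpbr results dictionary loaded elsewhere):
from mcpbr.analytics import ResultsDatabase
db = ResultsDatabase("my_results.db")
try:
    run_id = db.store_run(results_data)  # results_data assumed to be loaded elsewhere
finally:
    db.close()  # always release the SQLite connection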
Usage¶
from mcpbr.analytics import ResultsDatabase
# Open or create database (context manager supported)
with ResultsDatabase("my_results.db") as db:
# Store a run
run_id = db.store_run(results_data)
# Query runs
runs = db.list_runs(limit=10, benchmark="swe-bench-verified")
run = db.get_run(run_id)
# Get per-task results
task_results = db.get_task_results(run_id)
# Get trend data for charting
trends = db.get_trends(benchmark="swe-bench-verified", model="sonnet")
# Clean up old data
deleted = db.cleanup(max_age_days=90)
Methods¶
| Method | Returns | Description |
|---|---|---|
store_run(results_data) | int | Store evaluation results, returns run ID |
get_run(run_id) | dict \| None | Retrieve a specific run by ID |
list_runs(limit, benchmark, model, provider) | list[dict] | List runs with optional filtering |
get_task_results(run_id) | list[dict] | Get per-task results for a run |
delete_run(run_id) | bool | Delete a run and its task results |
get_trends(benchmark, model, limit) | list[dict] | Get time-series trend data |
cleanup(max_age_days) | int | Delete runs older than max_age_days |
close() | None | Close the database connection |
Database Schema¶
The database has two tables:
runs -- One row per evaluation run:
| Column | Type | Description |
|---|---|---|
id | INTEGER | Auto-incremented primary key |
timestamp | TEXT | ISO 8601 timestamp |
benchmark | TEXT | Benchmark name |
model | TEXT | Model identifier |
provider | TEXT | Provider name |
resolution_rate | REAL | Overall resolution rate |
total_cost | REAL | Total cost in USD |
total_tasks | INTEGER | Number of tasks evaluated |
resolved_tasks | INTEGER | Number of tasks resolved |
metadata_json | TEXT | Full metadata as JSON |
task_results -- One row per task per run:
| Column | Type | Description |
|---|---|---|
run_id | INTEGER | Foreign key to runs |
instance_id | TEXT | Task identifier |
resolved | INTEGER | 1 if resolved, 0 otherwise |
cost | REAL | Task cost in USD |
tokens_input | INTEGER | Input tokens used |
tokens_output | INTEGER | Output tokens used |
runtime_seconds | REAL | Task runtime |
error | TEXT | Error message if failed |
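For ad-hoc analysis outside the ResultsDatabase API, the schema above can also be queried directly with the standard-library sqlite3 module. The sketch below assumes the documented columns and an existing mcpbr_results.db file; the ResultsDatabase methods remain the supported interface.
import sqlite3
conn = sqlite3.connect("mcpbr_results.db")
conn.row_factory = sqlite3.Row
# Average resolution rate per model across all stored runs, best first
rows = conn.execute(
    "SELECT model, COUNT(*) AS runs, AVG(resolution_rate) AS avg_rate "
    "FROM runs GROUP BY model ORDER BY avg_rate DESC"
).fetchall()
for row in rows:
    print(f"{row['model']}: {row['avg_rate']:.1%} over {row['runs']} runs")
conn.close()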
Statistical Tests¶
Pure Python implementations of common statistical tests for comparing benchmark results.
chi_squared_test()¶
Compare two proportions (resolution rates) using a 2x2 chi-squared test.
from mcpbr.analytics import chi_squared_test
result = chi_squared_test(
success_a=45, total_a=100,
success_b=60, total_b=100,
)
print(f"Chi2: {result['chi2']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
print(f"Effect size (phi): {result['effect_size']:.3f}")
| Parameter | Type | Description |
|---|---|---|
success_a | int | Successes in group A |
total_a | int | Total observations in group A |
success_b | int | Successes in group B |
total_b | int | Total observations in group B |
significance_level | float | Alpha threshold (default: 0.05) |
Returns: dict with chi2, p_value, significant, effect_size (phi coefficient).
bootstrap_confidence_interval()¶
Bootstrap confidence interval for a metric.
from mcpbr.analytics import bootstrap_confidence_interval
ci = bootstrap_confidence_interval(
values=[0.85, 0.90, 0.78, 0.92, 0.88, 0.82],
confidence=0.95,
n_bootstrap=1000,
)
print(f"Mean: {ci['mean']:.3f}")
print(f"95% CI: [{ci['ci_lower']:.3f}, {ci['ci_upper']:.3f}]")
print(f"Std Error: {ci['std_error']:.4f}")
| Parameter | Type | Default | Description |
|---|---|---|---|
values | list[float] | (required) | Observed metric values |
confidence | float | 0.95 | Confidence level |
n_bootstrap | int | 1000 | Number of resamples |
Returns: dict with mean, ci_lower, ci_upper, std_error.
effect_size_cohens_d()¶
Cohen's d effect size between two groups.
from mcpbr.analytics import effect_size_cohens_d
d = effect_size_cohens_d(
group_a=[0.85, 0.90, 0.88, 0.92],
group_b=[0.70, 0.75, 0.72, 0.68],
)
print(f"Cohen's d: {d:.3f}")
# Interpretation: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, >= 0.8 large
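To turn the interpretation rule in the comment above into code, a small helper suffices; interpret_cohens_d below is a hypothetical convenience function for illustration, not part of mcpbr.analytics.
def interpret_cohens_d(d: float) -> str:
    # Conventional |d| thresholds: 0.2 / 0.5 / 0.8
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"
print(f"Effect: {interpret_cohens_d(d)}")  # uses d from the example above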
mann_whitney_u()¶
Non-parametric Mann-Whitney U test for comparing two independent samples.
from mcpbr.analytics import mann_whitney_u
result = mann_whitney_u(
group_a=[0.85, 0.90, 0.88, 0.92, 0.87],
group_b=[0.70, 0.75, 0.72, 0.68, 0.74],
)
print(f"U: {result['u_statistic']:.1f}, p={result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
permutation_test()¶
Permutation test for difference in means between two groups.
from mcpbr.analytics import permutation_test
result = permutation_test(
group_a=[0.85, 0.90, 0.88],
group_b=[0.70, 0.75, 0.72],
n_permutations=5000,
)
print(f"Observed diff: {result['observed_diff']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
compare_resolution_rates()¶
Comprehensive comparison of two result sets with chi-squared testing, effect sizes, and a human-readable summary.
from mcpbr.analytics import compare_resolution_rates
comparison = compare_resolution_rates(
results_a={"resolved": 45, "total": 100, "name": "Server A"},
results_b={"resolved": 38, "total": 100, "name": "Server B"},
)
print(comparison["summary"])
# "Server A (45.0%) vs Server B (38.0%): Server A is 7.0pp higher.
# Difference is not significant (p=0.3123, phi=0.072)."
ComparisonEngine¶
Compare evaluation results across multiple models with summary tables, rankings, Pareto frontiers, and pairwise analysis.
ComparisonEngine ¶
Engine for comparing evaluation results across multiple models.
Supports adding multiple labeled result sets and generating comprehensive comparisons including summary tables, task matrices, unique wins, rankings, pairwise comparisons, cost-performance frontiers, and winner analysis.
Example:
engine = ComparisonEngine()
engine.add_results("claude-sonnet", sonnet_data)
engine.add_results("gpt-4o", gpt4o_data)
comparison = engine.compare()
add_results(label, results_data) ¶
Add a labeled result set for comparison.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
label | str | Human-readable label identifying the model or run (e.g., "claude-sonnet-run-1"). | required |
results_data | dict[str, Any] | Results dictionary with the standard mcpbr output structure, containing per-task results and summary metadata. | required |
compare() ¶
Generate a comprehensive comparison across all added result sets.
Returns:
| Type | Description |
|---|---|
dict[str, Any] | Dictionary with keys including models, summary_table, rankings, unique_wins, and pairwise. |
Raises:
| Type | Description |
|---|---|
ValueError | If fewer than two result sets have been added. |
get_cost_performance_frontier() ¶
Compute the Pareto frontier of cost vs resolution rate.
Points on the frontier represent models where no other model is both cheaper and has a higher resolution rate. The frontier is sorted by ascending cost.
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of dicts, each with label, rate, and cost, representing models on the Pareto-optimal frontier, sorted by ascending cost. |
Raises:
| Type | Description |
|---|---|
ValueError | If fewer than two result sets have been added. |
get_winner_analysis() ¶
Determine which model wins on each metric.
Evaluates models across resolution rate, total cost, cost efficiency (cost per resolved task), and average speed (runtime per task).
Returns:
| Type | Description |
|---|---|
dict[str, Any] | Dictionary with metric names as keys; each value is a dict containing the winning model's label (winner) and its metric value (value). |
Raises:
| Type | Description |
|---|---|
ValueError | If fewer than two result sets have been added. |
Usage¶
from mcpbr.analytics import ComparisonEngine
engine = ComparisonEngine()
engine.add_results("claude-sonnet", sonnet_results)
engine.add_results("gpt-4o", gpt4o_results)
engine.add_results("gemini-2.0-flash", gemini_results)
# Full comparison
comparison = engine.compare()
print(comparison["models"]) # ["claude-sonnet", "gpt-4o", "gemini-2.0-flash"]
print(comparison["summary_table"]) # Per-model summary metrics
print(comparison["rankings"]) # by_rate, by_cost_efficiency, by_speed
print(comparison["unique_wins"]) # Tasks only one model resolved
print(comparison["pairwise"]) # All pairwise comparisons
# Pareto-optimal models (cost vs resolution rate)
frontier = engine.get_cost_performance_frontier()
for point in frontier:
print(f"{point['label']}: rate={point['rate']:.1%}, cost=${point['cost']:.2f}")
# Winner on each metric
winners = engine.get_winner_analysis()
for metric, info in winners.items():
print(f"{metric}: {info['winner']} ({info['value']})")
Convenience Functions¶
from mcpbr.analytics import compare_results_files, format_comparison_table
# Compare JSON result files directly
comparison = compare_results_files(
["results_sonnet.json", "results_gpt4o.json"],
labels=["Claude Sonnet", "GPT-4o"],
)
# Format as ASCII table
print(format_comparison_table(comparison))
RegressionDetector¶
Detect performance regressions between evaluation runs across multiple dimensions.
RegressionDetector ¶
Detect performance regressions between evaluation runs.
Compares a current run against a baseline across multiple dimensions: resolution rate (with statistical significance testing), cost, latency, and token usage. Also reports per-task regressions and improvements.
Example:
detector = RegressionDetector(threshold=0.05)
result = detector.detect(current_results, baseline_results)
if result["overall_status"] == "fail":
print("Regression detected!")
print(detector.format_report())
__init__(threshold=0.05, significance_level=0.05) ¶
Configure the regression detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold | float | Minimum absolute change in resolution rate to consider as a potential regression. Defaults to 0.05 (5 percentage points). | 0.05 |
significance_level | float | Alpha level for statistical significance testing. Defaults to 0.05. | 0.05 |
detect(current, baseline) ¶
Detect regressions between current and baseline results.
Analyzes resolution rate, cost, latency, and token usage, plus per-task changes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
current | dict[str, Any] | Current evaluation results dictionary. | required |
baseline | dict[str, Any] | Baseline evaluation results dictionary to compare against. | required |
Returns:
| Type | Description |
|---|---|
dict[str, Any] | Dictionary containing overall_status, score_regression, cost_regression, latency_regression, token_regression, task_regressions, and task_improvements. |
format_report() ¶
Format the last detection result as a human-readable report.
Returns:
| Type | Description |
|---|---|
str | Multi-line string containing the formatted regression report. |
Raises:
| Type | Description |
|---|---|
ValueError | If detect() has not been called yet. |
Usage¶
from mcpbr.analytics import RegressionDetector
detector = RegressionDetector(threshold=0.05, significance_level=0.05)
result = detector.detect(current_results, baseline_results)
# Check overall status
if result["overall_status"] == "fail":
print("REGRESSION DETECTED!")
elif result["overall_status"] == "warning":
print("Warning: potential issues")
else:
print("All clear")
# Inspect specific regressions
print(result["score_regression"]) # Resolution rate analysis
print(result["cost_regression"]) # Cost change analysis
print(result["latency_regression"]) # Latency change analysis
print(result["token_regression"]) # Token usage change analysis
print(result["task_regressions"]) # Per-task regressions
print(result["task_improvements"]) # Per-task improvements
# Human-readable report
print(detector.format_report())
Detection Thresholds¶
| Dimension | Regression Threshold | Description |
|---|---|---|
| Resolution rate | > 5pp decrease + statistically significant | Chi-squared test at alpha=0.05 |
| Cost | > 20% increase | Percentage increase in total cost |
| Latency | > 25% increase | Percentage increase in average runtime |
| Token usage | > 25% increase | Percentage increase in average tokens |
Overall Status¶
| Status | Meaning |
|---|---|
"pass" | No regressions detected |
"warning" | Cost, latency, or token regression; or per-task regressions |
"fail" | Statistically significant resolution rate regression |
ABTest¶
A/B testing framework for comparing two MCP server configurations.
ABTest ¶
A/B testing framework for comparing two MCP server configurations.
Creates a structured comparison between a control group (A) and treatment group (B), running chi-squared significance testing on resolution rates and comparing cost metrics.
Example:
test = ABTest("Model Comparison")
test.add_control(results_baseline)
test.add_treatment(results_candidate)
analysis = test.analyze()
print(test.format_report())
__init__(name, control_label='A', treatment_label='B') ¶
Initialize the A/B test.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Human-readable name for this test. | required |
control_label | str | Label for the control group (default 'A'). | 'A' |
treatment_label | str | Label for the treatment group (default 'B'). | 'B' |
add_control(results_data) ¶
Add the control group results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
results_data | dict[str, Any] | Evaluation results dictionary for the control configuration. | required |
add_treatment(results_data) ¶
Add the treatment group results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
results_data | dict[str, Any] | Evaluation results dictionary for the treatment configuration. | required |
analyze() ¶
Run the A/B test analysis.
Compares resolution rates using a chi-squared test, and reports differences in cost and other metrics.
Returns:
| Type | Description |
|---|---|
dict[str, Any] | Dictionary containing winner, rate_difference, statistical_significance, and recommendation. |
Raises:
| Type | Description |
|---|---|
ValueError | If control or treatment data has not been added. |
format_report() ¶
Format the analysis results as a human-readable report.
Calls analyze() automatically if it has not been called yet.
Returns:
| Type | Description |
|---|---|
str | Multi-line string containing the formatted A/B test report. |
Raises:
| Type | Description |
|---|---|
ValueError | If control or treatment data has not been added. |
Usage¶
from mcpbr.analytics import ABTest
test = ABTest(
name="Filesystem v2 vs v1",
control_label="v1 (current)",
treatment_label="v2 (candidate)",
)
test.add_control(results_v1)
test.add_treatment(results_v2)
analysis = test.analyze()
print(f"Winner: {analysis['winner']}")
print(f"Rate difference: {analysis['rate_difference']:+.4f}")
print(f"Significant: {analysis['statistical_significance']['significant']}")
print(f"Recommendation: {analysis['recommendation']}")
# Formatted report
print(test.format_report())
Quick A/B Test¶
from mcpbr.analytics import run_ab_test
result = run_ab_test(results_a, results_b, test_name="Quick Comparison")
print(result["winner"])
print(result["recommendation"])
Leaderboard¶
Generate ranked leaderboards from multiple evaluation results.
Leaderboard ¶
Generate ranked leaderboards from multiple evaluation results.
Collects results from multiple configurations or models and produces a ranked comparison sorted by any supported metric.
Example:
lb = Leaderboard()
lb.add_entry("Claude Sonnet", results_sonnet)
lb.add_entry("GPT-4o", results_gpt4o)
print(lb.format_table())
add_entry(label, results_data) ¶
Add a result set to the leaderboard.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
label | str | Human-readable label for this entry (e.g., model name or configuration description). | required |
results_data | dict[str, Any] | Evaluation results dictionary in the standard mcpbr output format. | required |
generate(sort_by='resolution_rate') ¶
Generate the sorted leaderboard.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sort_by | str | Metric to sort by. Supported values: resolution_rate, resolved, total_cost, cost_per_resolved, avg_tokens, avg_runtime. | 'resolution_rate' |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of ranked entry dictionaries, each containing rank, label, and the metrics listed under Sort Keys below. |
Raises:
| Type | Description |
|---|---|
ValueError | If sort_by is not a supported sort key. |
format_table(sort_by='resolution_rate') ¶
Format the leaderboard as an ASCII table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sort_by | str | Metric to sort by (see generate()). | 'resolution_rate' |
Returns:
| Type | Description |
|---|---|
str | Multi-line ASCII table string. |
format_markdown(sort_by='resolution_rate') ¶
Format the leaderboard as a Markdown table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sort_by | str | Metric to sort by (see generate()). | 'resolution_rate' |
Returns:
| Type | Description |
|---|---|
str | Markdown-formatted table string. |
Usage¶
from mcpbr.analytics import Leaderboard
lb = Leaderboard()
lb.add_entry("Claude Sonnet", results_sonnet)
lb.add_entry("GPT-4o", results_gpt4o)
lb.add_entry("Gemini Flash", results_gemini)
# Generate sorted leaderboard
entries = lb.generate(sort_by="resolution_rate")
for entry in entries:
print(f"#{entry['rank']} {entry['label']}: {entry['resolution_rate']:.1%}")
# ASCII table output
print(lb.format_table(sort_by="resolution_rate"))
# Markdown table (for GitHub/docs)
print(lb.format_markdown(sort_by="resolution_rate"))
Sort Keys¶
| Key | Direction | Description |
|---|---|---|
resolution_rate | Higher is better | Fraction of tasks resolved |
resolved | Higher is better | Absolute number of resolved tasks |
total_cost | Lower is better | Total cost in USD |
cost_per_resolved | Lower is better | Cost per resolved task |
avg_tokens | Lower is better | Average tokens per task |
avg_runtime | Lower is better | Average runtime per task |
Quick Leaderboard¶
from mcpbr.analytics import generate_leaderboard
entries = generate_leaderboard([
("Claude Sonnet", results_sonnet),
("GPT-4o", results_gpt4o),
], sort_by="resolution_rate")
MetricsRegistry¶
Registry of metric definitions with built-in defaults and support for custom metrics.
MetricsRegistry ¶
Registry of metric definitions with built-in defaults.
Built-in metrics registered on initialisation:
- resolution_rate: Fraction of tasks resolved.
- cost_per_resolution: Total cost divided by resolved count (inf if none resolved).
- avg_tokens_per_task: Mean total token count per task.
- tool_failure_rate: Ratio of tool failures to total tool calls.
- efficiency_score: Composite score: rate / (cost + 0.01).
register(metric) ¶
Register a custom metric definition.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metric | MetricDefinition | The metric to register. | required |
Raises:
| Type | Description |
|---|---|
ValueError | If a metric with the same name is already registered. |
calculate_all(results_data) ¶
Calculate all registered metrics against the given results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
results_data | dict[str, Any] | Evaluation results dictionary in the standard mcpbr output format. | required |
Returns:
| Type | Description |
|---|---|
dict[str, float] | Dictionary mapping each metric name to its computed float value. If a metric calculation raises an exception, a fallback value is recorded for that metric rather than propagating the error. |
get_metric(name) ¶
Look up a metric by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Metric identifier. | required |
Returns:
| Type | Description |
|---|---|
MetricDefinition \| None | The MetricDefinition if a metric with that name is registered, otherwise None. |
list_metrics() ¶
Return a sorted list of all registered metric names.
Built-in Metrics¶
| Metric | Unit | Higher is Better | Description |
|---|---|---|---|
resolution_rate | ratio | Yes | Fraction of tasks resolved |
cost_per_resolution | USD | No | Total cost / resolved count |
avg_tokens_per_task | tokens | No | Average total tokens per task |
tool_failure_rate | ratio | No | Tool failures / total tool calls |
efficiency_score | score | No | resolution_rate / (total_cost + 0.01) |
Usage¶
from mcpbr.analytics import MetricsRegistry, MetricDefinition
registry = MetricsRegistry()
# Calculate all built-in metrics
metrics = registry.calculate_all(results_data)
print(f"Resolution rate: {metrics['resolution_rate']:.1%}")
print(f"Efficiency: {metrics['efficiency_score']:.2f}")
# Register a custom metric
registry.register(MetricDefinition(
name="cost_per_token",
description="Average cost per 1000 tokens",
unit="USD/1k tokens",
calculate=lambda data: (
sum(t.get("mcp", {}).get("cost", 0) for t in data.get("tasks", [])) /
max(sum(
t.get("mcp", {}).get("tokens", {}).get("input", 0) +
t.get("mcp", {}).get("tokens", {}).get("output", 0)
for t in data.get("tasks", [])
), 1) * 1000
),
higher_is_better=False,
))
# List all registered metrics
print(registry.list_metrics())
TrendAnalysis¶
Time-series trend analysis for evaluation results.
calculate_trends()¶
Calculate trend information from a list of run summaries.
from mcpbr.analytics import calculate_trends
# runs from ResultsDatabase.get_trends()
trends = calculate_trends(runs)
print(f"Direction: {trends['direction']}") # "improving", "declining", "stable"
print(trends["resolution_rate_trend"]) # [{timestamp, rate}, ...]
print(trends["cost_trend"]) # [{timestamp, cost}, ...]
print(trends["moving_averages"]) # 3-point moving averages
detect_trend_direction()¶
Determine whether a series of values is improving, declining, or stable using linear regression.
from mcpbr.analytics import detect_trend_direction
direction = detect_trend_direction([0.40, 0.42, 0.45, 0.48, 0.50])
print(direction) # "improving"
calculate_moving_average()¶
Compute a simple moving average over a list of values.
from mcpbr.analytics import calculate_moving_average
ma = calculate_moving_average([0.40, 0.42, 0.45, 0.48, 0.50], window=3)
# [None, None, 0.4233..., 0.45, 0.4766...]
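The None padding shown in the expected output (no value until a full window is available) matches the following pure-Python sketch; it is equivalent logic for illustration, not the library's implementation.
def simple_moving_average(values, window=3):
    # None until a full window is available, then the mean of the last `window` values
    return [
        None if i + 1 < window else sum(values[i + 1 - window : i + 1]) / window
        for i in range(len(values))
    ]
print(simple_moving_average([0.40, 0.42, 0.45, 0.48, 0.50], window=3))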
AnomalyDetection¶
Statistical methods to identify outlier values in benchmark metrics.
detect_anomalies()¶
Detect anomalous values using z-score, IQR, or MAD methods.
from mcpbr.analytics import detect_anomalies
anomalies = detect_anomalies(
values=[0.5, 0.6, 0.55, 0.58, 5.0, 0.52],
method="zscore", # "zscore", "iqr", or "mad"
threshold=2.0,
)
for a in anomalies:
print(f"Index {a['index']}: value={a['value']}, score={a['score']:.2f}")
| Method | Description | Threshold Meaning |
|---|---|---|
zscore | Z-score exceeds threshold | Number of standard deviations |
iqr | IQR fence method | Fence multiplier (commonly 1.5) |
mad | Median absolute deviation | Number of MADs |
detect_metric_anomalies()¶
Run anomaly detection across standard benchmark metrics (cost, tokens, runtime, iterations).
from mcpbr.analytics import detect_metric_anomalies
anomalies = detect_metric_anomalies(results_data)
print(f"Cost anomalies: {len(anomalies['cost'])}")
print(f"Token anomalies: {len(anomalies['tokens'])}")
print(f"Runtime anomalies: {len(anomalies['runtime'])}")
print(f"Iteration anomalies: {len(anomalies['iterations'])}")
CorrelationAnalysis¶
Compute correlations between evaluation metrics.
pearson_correlation()¶
Compute the Pearson correlation coefficient between two sequences.
from mcpbr.analytics import pearson_correlation
result = pearson_correlation(
x=[100, 200, 300, 400, 500],
y=[0.5, 1.1, 1.4, 2.0, 2.5],
)
print(f"r = {result['r']:.3f}, R^2 = {result['r_squared']:.3f}, p = {result['p_value']:.4f}")
spearman_correlation()¶
Compute the Spearman rank correlation (non-parametric, handles non-linear relationships).
from mcpbr.analytics import spearman_correlation
result = spearman_correlation(x=[1, 2, 3, 4, 5], y=[5, 6, 7, 8, 7])
analyze_metric_correlations()¶
Compute all pairwise Pearson correlations between standard metrics extracted from results.
from mcpbr.analytics import analyze_metric_correlations, find_strong_correlations
correlations = analyze_metric_correlations(results_data)
# Correlations between: cost, tokens_input, tokens_output, iterations,
# runtime_seconds, tool_calls
# Filter for strong correlations
strong = find_strong_correlations(correlations, threshold=0.7)
for c in strong:
print(f"{c['pair']}: r={c['r']:.3f} ({c['direction']})")
ErrorPatternAnalyzer¶
Analyze error patterns across benchmark results with clustering, temporal analysis, and recommendations.
ErrorPatternAnalyzer ¶
Analyzes error patterns across benchmark results.
Clusters similar errors, detects temporal patterns, correlates errors with specific tools, and produces actionable recommendations.
analyze(results) ¶
Analyze error patterns across benchmark results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
results | list[dict[str, Any]] | List of task result dicts. Each may contain keys such as an error message, resolution status, and tool-call information. | required |
Returns:
| Type | Description |
|---|---|
dict[str, Any] | Dictionary with keys: - total_errors: Total number of errors found. - error_clusters: List of cluster dicts with pattern, count, examples, and category. - temporal_patterns: Dict describing whether errors increase over iterations. - tool_error_correlation: Dict mapping tool names to error rates. - recommendations: List of actionable recommendation strings. |
cluster_errors(errors, similarity_threshold=0.6) ¶
Cluster similar error messages using token-overlap similarity.
Groups errors whose Jaccard similarity exceeds the given threshold, then categorises each cluster.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
errors | list[str] | List of raw error message strings. | required |
similarity_threshold | float | Minimum Jaccard similarity to merge two errors into the same cluster. Defaults to 0.6. | 0.6 |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]] | List of cluster dicts, each containing: - pattern: Representative error string (most common). - count: Number of errors in the cluster. - examples: Up to 3 distinct example messages. - category: High-level category string. |
Usage¶
from mcpbr.analytics import ErrorPatternAnalyzer
analyzer = ErrorPatternAnalyzer()
analysis = analyzer.analyze(task_results)
print(f"Total errors: {analysis['total_errors']}")
# Error clusters (grouped by similarity)
for cluster in analysis["error_clusters"]:
print(f" {cluster['category']}: {cluster['count']}x - {cluster['pattern'][:80]}")
# Temporal patterns
if analysis["temporal_patterns"]["increasing"]:
print("Warning: errors increasing over iterations")
# Tool-error correlation
for tool, rate in analysis["tool_error_correlation"].items():
if rate > 0.3:
print(f" High error rate tool: {tool} ({rate:.0%})")
# Actionable recommendations
for rec in analysis["recommendations"]:
print(f" - {rec}")
Error Categories¶
The analyzer automatically categorizes errors into:
| Category | Pattern Keywords |
|---|---|
timeout | timeout, timed out, deadline |
authentication | auth, unauthorized, 401, 403 |
rate_limit | rate limit, 429, throttle |
connection | connection, refused, DNS, network |
validation | invalid, validation, schema, parse |
permission | permission, denied, access |
unknown | Everything else |
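Conceptually the categorisation is a keyword match against the table above. The sketch below (categorize_error is a hypothetical helper, not the analyzer's exact implementation) shows the idea.
CATEGORY_KEYWORDS = {
    "timeout": ["timeout", "timed out", "deadline"],
    "authentication": ["auth", "unauthorized", "401", "403"],
    "rate_limit": ["rate limit", "429", "throttle"],
    "connection": ["connection", "refused", "dns", "network"],
    "validation": ["invalid", "validation", "schema", "parse"],
    "permission": ["permission", "denied", "access"],
}

def categorize_error(message: str) -> str:
    # First category whose keywords appear in the message wins; otherwise "unknown"
    lowered = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unknown"

print(categorize_error("Request timed out after 30s"))  # timeout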
identify_flaky_tasks()¶
Identify tasks with inconsistent outcomes across multiple runs.
from mcpbr.analytics import identify_flaky_tasks
flaky = identify_flaky_tasks([results_run1, results_run2, results_run3])
for task in flaky:
if task["flaky"]:
print(f"{task['instance_id']}: pass_rate={task['pass_rate']:.0%} over {task['run_count']} runs")
DifficultyEstimation¶
Estimate per-task difficulty based on resolution rates, resource usage, and runtime.
estimate_difficulty()¶
Score each task's difficulty on a 0-1 scale.
from mcpbr.analytics import estimate_difficulty, aggregate_difficulty_stats
difficulties = estimate_difficulty(results_data)
for d in difficulties[:5]:
print(f"{d['instance_id']}: {d['difficulty_level']} ({d['difficulty_score']:.2f})")
# Aggregate statistics
stats = aggregate_difficulty_stats(difficulties)
print(f"Distribution: {stats['distribution']}")
print(f"Avg difficulty: {stats['avg_difficulty']:.2f}")
print(f"Hardest tasks: {[t['instance_id'] for t in stats['hardest_tasks']]}")
Difficulty Levels¶
| Score Range | Level |
|---|---|
| 0.00 - 0.25 | easy |
| 0.25 - 0.50 | medium |
| 0.50 - 0.75 | hard |
| 0.75 - 1.00 | very_hard |
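If you need to bucket scores yourself (for example when post-processing stored results), the table maps directly onto a small helper; score_to_level below is an illustrative helper, not part of the package.
def score_to_level(score: float) -> str:
    # Thresholds mirror the Difficulty Levels table above
    if score < 0.25:
        return "easy"
    if score < 0.50:
        return "medium"
    if score < 0.75:
        return "hard"
    return "very_hard"
print(score_to_level(0.62))  # hard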
estimate_task_difficulty_score()¶
Score a single task's difficulty given its metrics and run averages.