Analytics API Reference

The mcpbr.analytics package provides comprehensive statistical analysis, historical tracking, and comparison tools for benchmark results. All calculations use only the Python standard library -- no NumPy or SciPy required.

from mcpbr.analytics import (
    ResultsDatabase,
    ComparisonEngine,
    RegressionDetector,
    ABTest,
    Leaderboard,
    MetricsRegistry,
)

ResultsDatabase

SQLite-backed persistent storage for mcpbr evaluation runs and per-task results, supporting queries for trend analysis, filtering, and cleanup of old data.

Example:

with ResultsDatabase("my_results.db") as db:
    run_id = db.store_run(results_data)
    run = db.get_run(run_id)
    trends = db.get_trends(benchmark="swe-bench-verified")

__init__(db_path='mcpbr_results.db')

Open or create the SQLite results database.

Parameters:

Name Type Description Default
db_path str | Path

Path to the SQLite database file. The file and any parent directories are created if they do not exist.

'mcpbr_results.db'

store_run(results_data)

Store a complete evaluation run with its task results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary. Expected keys are metadata (with timestamp, config, and optionally mcp_server), summary (with mcp sub-dict), and tasks (list of per-task result dicts).

required

Returns:

Type Description
int

The auto-generated run_id for the stored run.

Raises:

Type Description
Error

On database write failures.
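
The expected shape of results_data follows mcpbr's standard results output. The sketch below illustrates that shape based on the description above; key names inside summary and the per-task dicts beyond the documented top-level keys are assumptions, not a guaranteed schema.

from mcpbr.analytics import ResultsDatabase

# Illustrative shape only -- nested keys beyond metadata/summary/tasks are
# assumptions based on the surrounding documentation.
results_data = {
    "metadata": {
        "timestamp": "2024-06-01T12:00:00Z",
        "config": {"benchmark": "swe-bench-verified", "model": "sonnet"},
        "mcp_server": "filesystem",  # optional
    },
    "summary": {"mcp": {"resolved": 45, "total": 100, "total_cost": 12.34}},
    "tasks": [
        {"instance_id": "task-001", "resolved": True, "cost": 0.12},
    ],
}

with ResultsDatabase("my_results.db") as db:
    run_id = db.store_run(results_data)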

get_run(run_id)

Retrieve a specific evaluation run by ID.

Parameters:

Name Type Description Default
run_id int

The run identifier returned by store_run().

required

Returns:

Type Description
dict[str, Any] | None

A dictionary with the run's columns, or None if not found.

list_runs(limit=50, benchmark=None, model=None, provider=None)

List evaluation runs with optional filtering.

Parameters:

Name Type Description Default
limit int

Maximum number of runs to return. Runs are ordered by timestamp descending (most recent first).

50
benchmark str | None

Filter by benchmark name (exact match).

None
model str | None

Filter by model identifier (exact match).

None
provider str | None

Filter by provider name (exact match).

None

Returns:

Type Description
list[dict[str, Any]]

List of run dictionaries, most recent first.

get_task_results(run_id)

Get all task-level results for a specific run.

Parameters:

Name Type Description Default
run_id int

The run identifier.

required

Returns:

Type Description
list[dict[str, Any]]

List of task result dictionaries for the run.

delete_run(run_id)

Delete an evaluation run and all its associated task results.

Parameters:

Name Type Description Default
run_id int

The run identifier to delete.

required

Returns:

Type Description
bool

True if a run was deleted, False if no run existed with the given ID.

get_trends(benchmark=None, model=None, limit=20)

Get resolution rate, cost, and token trends over time.

Returns a time-ordered list of aggregate metrics for each run matching the optional filters.

Parameters:

Name Type Description Default
benchmark str | None

Filter by benchmark name.

None
model str | None

Filter by model identifier.

None
limit int

Maximum number of data points to return.

20

Returns:

Type Description
list[dict[str, Any]]

List of dicts with keys timestamp, resolution_rate, total_cost, total_tokens, resolved_tasks, and total_tasks, ordered by timestamp ascending.
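
A quick sketch of consuming these rows (key names taken from the Returns description above; db is assumed to be an open ResultsDatabase):

# Print one line per trend point returned by get_trends().
for point in db.get_trends(benchmark="swe-bench-verified", limit=20):
    print(
        f"{point['timestamp']}: "
        f"{point['resolved_tasks']}/{point['total_tasks']} resolved "
        f"({point['resolution_rate']:.1%}), "
        f"${point['total_cost']:.2f}, {point['total_tokens']} tokens"
    )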

cleanup(max_age_days=90)

Delete runs older than the specified age.

Parameters:

Name Type Description Default
max_age_days int

Maximum age in days. Runs with a timestamp older than this many days from now will be deleted along with their task results (via ON DELETE CASCADE).

90

Returns:

Type Description
int

Number of runs deleted.

close()

Close the database connection.

After calling this method the database instance should not be used.

Usage

from mcpbr.analytics import ResultsDatabase

# Open or create database (context manager supported)
with ResultsDatabase("my_results.db") as db:
    # Store a run
    run_id = db.store_run(results_data)

    # Query runs
    runs = db.list_runs(limit=10, benchmark="swe-bench-verified")
    run = db.get_run(run_id)

    # Get per-task results
    task_results = db.get_task_results(run_id)

    # Get trend data for charting
    trends = db.get_trends(benchmark="swe-bench-verified", model="sonnet")

    # Clean up old data
    deleted = db.cleanup(max_age_days=90)

Methods

Method Returns Description
store_run(results_data) int Store evaluation results, returns run ID
get_run(run_id) dict | None Retrieve a specific run by ID
list_runs(limit, benchmark, model, provider) list[dict] List runs with optional filtering
get_task_results(run_id) list[dict] Get per-task results for a run
delete_run(run_id) bool Delete a run and its task results
get_trends(benchmark, model, limit) list[dict] Get time-series trend data
cleanup(max_age_days) int Delete runs older than max_age_days
close() None Close the database connection

Database Schema

The database has two tables:

runs -- One row per evaluation run:

Column Type Description
id INTEGER Auto-incremented primary key
timestamp TEXT ISO 8601 timestamp
benchmark TEXT Benchmark name
model TEXT Model identifier
provider TEXT Provider name
resolution_rate REAL Overall resolution rate
total_cost REAL Total cost in USD
total_tasks INTEGER Number of tasks evaluated
resolved_tasks INTEGER Number of tasks resolved
metadata_json TEXT Full metadata as JSON

task_results -- One row per task per run:

Column Type Description
run_id INTEGER Foreign key to runs
instance_id TEXT Task identifier
resolved INTEGER 1 if resolved, 0 otherwise
cost REAL Task cost in USD
tokens_input INTEGER Input tokens used
tokens_output INTEGER Output tokens used
runtime_seconds REAL Task runtime
error TEXT Error message if failed
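
Because the storage is plain SQLite, the tables above can also be queried directly for ad-hoc analysis. A minimal sketch using the documented column names (the query itself is illustrative, not part of the mcpbr API):

import sqlite3

# Recompute per-run resolution rate and cost from task_results, joined to runs.
conn = sqlite3.connect("mcpbr_results.db")
rows = conn.execute(
    """
    SELECT r.id, r.benchmark, r.model,
           AVG(t.resolved) AS resolution_rate,
           SUM(t.cost)     AS total_cost
    FROM runs AS r
    JOIN task_results AS t ON t.run_id = r.id
    GROUP BY r.id
    ORDER BY r.timestamp DESC
    """
).fetchall()
for run_id, benchmark, model, rate, cost in rows:
    print(f"run {run_id} [{benchmark}/{model}]: {rate:.1%} resolved, ${cost:.2f}")
conn.close()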

Statistical Tests

Pure Python implementations of common statistical tests for comparing benchmark results.

chi_squared_test()

Compare two proportions (resolution rates) using a 2x2 chi-squared test.

from mcpbr.analytics import chi_squared_test

result = chi_squared_test(
    success_a=45, total_a=100,
    success_b=60, total_b=100,
)
print(f"Chi2: {result['chi2']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
print(f"Effect size (phi): {result['effect_size']:.3f}")
Parameter Type Description
success_a int Successes in group A
total_a int Total observations in group A
success_b int Successes in group B
total_b int Total observations in group B
significance_level float Alpha threshold (default: 0.05)

Returns: dict with chi2, p_value, significant, effect_size (phi coefficient).

bootstrap_confidence_interval()

Bootstrap confidence interval for a metric.

from mcpbr.analytics import bootstrap_confidence_interval

ci = bootstrap_confidence_interval(
    values=[0.85, 0.90, 0.78, 0.92, 0.88, 0.82],
    confidence=0.95,
    n_bootstrap=1000,
)
print(f"Mean: {ci['mean']:.3f}")
print(f"95% CI: [{ci['ci_lower']:.3f}, {ci['ci_upper']:.3f}]")
print(f"Std Error: {ci['std_error']:.4f}")
Parameter Type Default Description
values list[float] (required) Observed metric values
confidence float 0.95 Confidence level
n_bootstrap int 1000 Number of resamples

Returns: dict with mean, ci_lower, ci_upper, std_error.
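
Conceptually, a percentile bootstrap resamples the observed values with replacement and takes quantiles of the resampled means. A stdlib-only sketch of the idea (illustrative; mcpbr's exact interval method may differ):

import random
import statistics

def bootstrap_ci_sketch(values, confidence=0.95, n_bootstrap=1000, seed=0):
    """Percentile bootstrap CI for the mean -- illustrative only."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    lower = means[int(alpha / 2 * n_bootstrap)]
    upper = means[int((1 - alpha / 2) * n_bootstrap) - 1]
    return statistics.fmean(values), lower, upper

print(bootstrap_ci_sketch([0.85, 0.90, 0.78, 0.92, 0.88, 0.82]))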

effect_size_cohens_d()

Cohen's d effect size between two groups.

from mcpbr.analytics import effect_size_cohens_d

d = effect_size_cohens_d(
    group_a=[0.85, 0.90, 0.88, 0.92],
    group_b=[0.70, 0.75, 0.72, 0.68],
)
print(f"Cohen's d: {d:.3f}")
# Interpretation: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, >= 0.8 large
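
For reference, Cohen's d is the difference in group means divided by a pooled standard deviation. A stdlib-only sketch (illustrative; mcpbr's pooling convention may differ slightly):

import statistics

def cohens_d_sketch(group_a, group_b):
    """(mean_a - mean_b) / pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd

print(cohens_d_sketch([0.85, 0.90, 0.88, 0.92], [0.70, 0.75, 0.72, 0.68]))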

mann_whitney_u()

Non-parametric Mann-Whitney U test for comparing two independent samples.

from mcpbr.analytics import mann_whitney_u

result = mann_whitney_u(
    group_a=[0.85, 0.90, 0.88, 0.92, 0.87],
    group_b=[0.70, 0.75, 0.72, 0.68, 0.74],
)
print(f"U: {result['u_statistic']:.1f}, p={result['p_value']:.4f}")
print(f"Significant: {result['significant']}")

permutation_test()

Permutation test for difference in means between two groups.

from mcpbr.analytics import permutation_test

result = permutation_test(
    group_a=[0.85, 0.90, 0.88],
    group_b=[0.70, 0.75, 0.72],
    n_permutations=5000,
)
print(f"Observed diff: {result['observed_diff']:.4f}")
print(f"p-value: {result['p_value']:.4f}")

compare_resolution_rates()

Comprehensive comparison of two result sets with chi-squared testing, effect sizes, and a human-readable summary.

from mcpbr.analytics import compare_resolution_rates

comparison = compare_resolution_rates(
    results_a={"resolved": 45, "total": 100, "name": "Server A"},
    results_b={"resolved": 38, "total": 100, "name": "Server B"},
)
print(comparison["summary"])
# "Server A (45.0%) vs Server B (38.0%): Server A is 7.0pp higher.
#  Difference is not significant (p=0.3123, phi=0.072)."

ComparisonEngine

Engine for comparing evaluation results across multiple models.

Supports adding multiple labeled result sets and generating comprehensive comparisons including summary tables, task matrices, unique wins, rankings, pairwise comparisons, cost-performance frontiers, and winner analysis.

Example:

engine = ComparisonEngine()
engine.add_results("claude-sonnet", sonnet_data)
engine.add_results("gpt-4o", gpt4o_data)
comparison = engine.compare()

add_results(label, results_data)

Add a labeled result set for comparison.

Parameters:

Name Type Description Default
label str

Human-readable label identifying the model or run (e.g., "claude-sonnet-run-1").

required
results_data dict[str, Any]

Results dictionary with the standard mcpbr output structure containing metadata, summary, and tasks keys.

required

compare()

Generate a comprehensive comparison across all added result sets.

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- models: List of model labels.
- summary_table: List of dicts with per-model summary metrics including label, model, provider, benchmark, resolved, total, rate, cost, avg_cost_per_task, and avg_tokens.
- task_matrix: Dict mapping instance_id to a dict of {label: resolved_bool} for each model.
- unique_wins: Dict mapping label to list of instance_ids that only that model resolved.
- rankings: Dict with by_rate, by_cost_efficiency, and by_speed lists, each sorted best-first.
- pairwise: List of pairwise comparison dicts between all model pairs, including rate difference.

Raises:

Type Description
ValueError

If fewer than two result sets have been added.

get_cost_performance_frontier()

Compute the Pareto frontier of cost vs resolution rate.

Points on the frontier represent models where no other model is both cheaper and has a higher resolution rate. The frontier is sorted by ascending cost.

Returns:

Type Description
list[dict[str, Any]]

List of dicts, each with label, cost, and rate keys, representing models on the Pareto-optimal frontier.

Raises:

Type Description
ValueError

If fewer than two result sets have been added.
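
For intuition, a model is on the frontier when no other model is both cheaper (or equal cost) and at least as high-rate. A standalone sketch of that filter over (label, cost, rate) tuples (illustrative only, not ComparisonEngine's internal code):

def pareto_frontier_sketch(points):
    """Keep points not dominated on (cost, rate); points are (label, cost, rate)."""
    frontier = [
        (label, cost, rate)
        for label, cost, rate in points
        if not any(
            other_cost <= cost
            and other_rate >= rate
            and (other_cost, other_rate) != (cost, rate)
            for _, other_cost, other_rate in points
        )
    ]
    return sorted(frontier, key=lambda p: p[1])  # ascending cost

print(pareto_frontier_sketch([
    ("claude-sonnet", 12.0, 0.52),
    ("gpt-4o", 9.5, 0.47),
    ("gemini-2.0-flash", 4.0, 0.38),
]))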

get_winner_analysis()

Determine which model wins on each metric.

Evaluates models across resolution rate, total cost, cost efficiency (cost per resolved task), and average speed (runtime per task).

Returns:

Type Description
dict[str, Any]

Dictionary with metric names as keys and dicts containing winner (label) and value (the winning metric value).

Raises:

Type Description
ValueError

If fewer than two result sets have been added.

Usage

from mcpbr.analytics import ComparisonEngine

engine = ComparisonEngine()
engine.add_results("claude-sonnet", sonnet_results)
engine.add_results("gpt-4o", gpt4o_results)
engine.add_results("gemini-2.0-flash", gemini_results)

# Full comparison
comparison = engine.compare()
print(comparison["models"])          # ["claude-sonnet", "gpt-4o", "gemini-2.0-flash"]
print(comparison["summary_table"])   # Per-model summary metrics
print(comparison["rankings"])        # by_rate, by_cost_efficiency, by_speed
print(comparison["unique_wins"])     # Tasks only one model resolved
print(comparison["pairwise"])        # All pairwise comparisons

# Pareto-optimal models (cost vs resolution rate)
frontier = engine.get_cost_performance_frontier()
for point in frontier:
    print(f"{point['label']}: rate={point['rate']:.1%}, cost=${point['cost']:.2f}")

# Winner on each metric
winners = engine.get_winner_analysis()
for metric, info in winners.items():
    print(f"{metric}: {info['winner']} ({info['value']})")

Convenience Functions

from mcpbr.analytics import compare_results_files, format_comparison_table

# Compare JSON result files directly
comparison = compare_results_files(
    ["results_sonnet.json", "results_gpt4o.json"],
    labels=["Claude Sonnet", "GPT-4o"],
)

# Format as ASCII table
print(format_comparison_table(comparison))

RegressionDetector

Detect performance regressions between evaluation runs.

Compares a current run against a baseline across multiple dimensions: resolution rate (with statistical significance testing), cost, latency, and token usage. Also reports per-task regressions and improvements.

Example:

detector = RegressionDetector(threshold=0.05)
result = detector.detect(current_results, baseline_results)
if result["overall_status"] == "fail":
    print("Regression detected!")
print(detector.format_report())

__init__(threshold=0.05, significance_level=0.05)

Configure the regression detector.

Parameters:

Name Type Description Default
threshold float

Minimum absolute change in resolution rate to consider as a potential regression. Defaults to 0.05 (5 percentage points).

0.05
significance_level float

Alpha level for statistical significance testing. Defaults to 0.05.

0.05

detect(current, baseline)

Detect regressions between current and baseline results.

Analyzes resolution rate, cost, latency, and token usage, plus per-task changes.

Parameters:

Name Type Description Default
current dict[str, Any]

Current evaluation results dictionary.

required
baseline dict[str, Any]

Baseline evaluation results dictionary to compare against.

required

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- score_regression: Resolution rate regression analysis with detected, current_rate, baseline_rate, delta, and significant.
- cost_regression: Cost change analysis with detected, current_cost, baseline_cost, and delta_pct.
- latency_regression: Latency change analysis.
- token_regression: Token usage change analysis.
- task_regressions: List of tasks that regressed.
- task_improvements: List of tasks that improved.
- overall_status: "pass", "warning", or "fail".
- summary: Human-readable summary string.

format_report()

Format the last detection result as a human-readable report.

Returns:

Type Description
str

Multi-line string containing the formatted regression report.

Raises:

Type Description
ValueError

If detect() has not been called yet.

Usage

from mcpbr.analytics import RegressionDetector

detector = RegressionDetector(threshold=0.05, significance_level=0.05)
result = detector.detect(current_results, baseline_results)

# Check overall status
if result["overall_status"] == "fail":
    print("REGRESSION DETECTED!")
elif result["overall_status"] == "warning":
    print("Warning: potential issues")
else:
    print("All clear")

# Inspect specific regressions
print(result["score_regression"])     # Resolution rate analysis
print(result["cost_regression"])      # Cost change analysis
print(result["latency_regression"])   # Latency change analysis
print(result["token_regression"])     # Token usage change analysis
print(result["task_regressions"])     # Per-task regressions
print(result["task_improvements"])    # Per-task improvements

# Human-readable report
print(detector.format_report())

Detection Thresholds

Dimension Regression Threshold Description
Resolution rate > 5pp decrease + statistically significant Chi-squared test at alpha=0.05
Cost > 20% increase Percentage increase in total cost
Latency > 25% increase Percentage increase in average runtime
Token usage > 25% increase Percentage increase in average tokens

Overall Status

Status Meaning
"pass" No regressions detected
"warning" Cost, latency, or token regression; or per-task regressions
"fail" Statistically significant resolution rate regression

ABTest

A/B testing framework for comparing two MCP server configurations.

Creates a structured comparison between a control group (A) and treatment group (B), running chi-squared significance testing on resolution rates and comparing cost metrics.

Example:

test = ABTest("Model Comparison")
test.add_control(results_baseline)
test.add_treatment(results_candidate)
analysis = test.analyze()
print(test.format_report())

__init__(name, control_label='A', treatment_label='B')

Initialize the A/B test.

Parameters:

Name Type Description Default
name str

Human-readable name for this test.

required
control_label str

Label for the control group (default "A").

'A'
treatment_label str

Label for the treatment group (default "B").

'B'

add_control(results_data)

Add the control group results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary for the control configuration.

required

add_treatment(results_data)

Add the treatment group results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary for the treatment configuration.

required

analyze()

Run the A/B test analysis.

Compares resolution rates using a chi-squared test, and reports differences in cost and other metrics.

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- test_name: The test name.
- control: Metrics for the control group.
- treatment: Metrics for the treatment group.
- rate_difference: Absolute difference in resolution rates.
- rate_relative_change: Percentage change in resolution rate.
- cost_difference: Difference in total cost.
- statistical_significance: Chi-squared test results.
- winner: "control", "treatment", or "no_significant_difference".
- recommendation: Human-readable recommendation.

Raises:

Type Description
ValueError

If control or treatment data has not been added.

format_report()

Format the analysis results as a human-readable report.

Calls analyze() automatically if it has not been called yet.

Returns:

Type Description
str

Multi-line string containing the formatted A/B test report.

Raises:

Type Description
ValueError

If control or treatment data has not been added.

Usage

from mcpbr.analytics import ABTest

test = ABTest(
    name="Filesystem v2 vs v1",
    control_label="v1 (current)",
    treatment_label="v2 (candidate)",
)
test.add_control(results_v1)
test.add_treatment(results_v2)

analysis = test.analyze()
print(f"Winner: {analysis['winner']}")
print(f"Rate difference: {analysis['rate_difference']:+.4f}")
print(f"Significant: {analysis['statistical_significance']['significant']}")
print(f"Recommendation: {analysis['recommendation']}")

# Formatted report
print(test.format_report())

Quick A/B Test

from mcpbr.analytics import run_ab_test

result = run_ab_test(results_a, results_b, test_name="Quick Comparison")
print(result["winner"])
print(result["recommendation"])

Leaderboard

Generate ranked leaderboards from multiple evaluation results.

Collects results from multiple configurations or models and produces a ranked comparison sorted by any supported metric.

Example:

lb = Leaderboard()
lb.add_entry("Claude Sonnet", results_sonnet)
lb.add_entry("GPT-4o", results_gpt4o)
print(lb.format_table())

add_entry(label, results_data)

Add a result set to the leaderboard.

Parameters:

Name Type Description Default
label str

Human-readable label for this entry (e.g., model name or configuration description).

required
results_data dict[str, Any]

Evaluation results dictionary with summary.mcp and tasks keys.

required

generate(sort_by='resolution_rate')

Generate the sorted leaderboard.

Parameters:

Name Type Description Default
sort_by str

Metric to sort by. Supported values: "resolution_rate", "total_cost", "cost_per_resolved", "avg_tokens", "avg_runtime", "resolved". Default is "resolution_rate" (higher is better).

'resolution_rate'

Returns:

Type Description
list[dict[str, Any]]

List of ranked entry dictionaries, each containing rank, label, model, provider, resolution_rate, resolved, total, total_cost, cost_per_resolved, avg_tokens, and avg_runtime.

Raises:

Type Description
ValueError

If sort_by is not a supported sort key.

format_table(sort_by='resolution_rate')

Format the leaderboard as an ASCII table.

Parameters:

Name Type Description Default
sort_by str

Metric to sort by (see generate() for options).

'resolution_rate'

Returns:

Type Description
str

Multi-line ASCII table string.

format_markdown(sort_by='resolution_rate')

Format the leaderboard as a Markdown table.

Parameters:

Name Type Description Default
sort_by str

Metric to sort by (see generate() for options).

'resolution_rate'

Returns:

Type Description
str

Markdown-formatted table string.

Usage

from mcpbr.analytics import Leaderboard

lb = Leaderboard()
lb.add_entry("Claude Sonnet", results_sonnet)
lb.add_entry("GPT-4o", results_gpt4o)
lb.add_entry("Gemini Flash", results_gemini)

# Generate sorted leaderboard
entries = lb.generate(sort_by="resolution_rate")
for entry in entries:
    print(f"#{entry['rank']} {entry['label']}: {entry['resolution_rate']:.1%}")

# ASCII table output
print(lb.format_table(sort_by="resolution_rate"))

# Markdown table (for GitHub/docs)
print(lb.format_markdown(sort_by="resolution_rate"))

Sort Keys

Key Direction Description
resolution_rate Higher is better Fraction of tasks resolved
resolved Higher is better Absolute number of resolved tasks
total_cost Lower is better Total cost in USD
cost_per_resolved Lower is better Cost per resolved task
avg_tokens Lower is better Average tokens per task
avg_runtime Lower is better Average runtime per task

Quick Leaderboard

from mcpbr.analytics import generate_leaderboard

entries = generate_leaderboard([
    ("Claude Sonnet", results_sonnet),
    ("GPT-4o", results_gpt4o),
], sort_by="resolution_rate")

MetricsRegistry

Registry of metric definitions with built-in defaults and support for custom metrics.

Built-in metrics registered on initialisation:

- resolution_rate: Fraction of tasks resolved.
- cost_per_resolution: Total cost divided by resolved count (inf if none resolved).
- avg_tokens_per_task: Mean total token count per task.
- tool_failure_rate: Ratio of tool failures to total tool calls.
- efficiency_score: Composite score: rate / (cost + 0.01).

register(metric)

Register a custom metric definition.

Parameters:

Name Type Description Default
metric MetricDefinition

The metric to register.

required

Raises:

Type Description
ValueError

If a metric with the same name is already registered.

calculate_all(results_data)

Calculate all registered metrics against the given results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary with metadata, summary, and tasks keys.

required

Returns:

Type Description
dict[str, float]

Dictionary mapping metric name to its computed float value. If a metric calculation raises an exception the value is float('nan').

get_metric(name)

Look up a metric by name.

Parameters:

Name Type Description Default
name str

Metric identifier.

required

Returns:

Type Description
MetricDefinition | None

The MetricDefinition if found, otherwise None.

list_metrics()

Return a sorted list of all registered metric names.

Built-in Metrics

Metric Unit Higher is Better Description
resolution_rate ratio Yes Fraction of tasks resolved
cost_per_resolution USD No Total cost / resolved count
avg_tokens_per_task tokens No Average total tokens per task
tool_failure_rate ratio No Tool failures / total tool calls
efficiency_score score Yes resolution_rate / (total_cost + 0.01)
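
As a worked example of the efficiency_score formula: a run that resolves 45% of tasks at a total cost of $2.49 scores 0.45 / (2.49 + 0.01) = 0.18.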

Usage

from mcpbr.analytics import MetricsRegistry, MetricDefinition

registry = MetricsRegistry()

# Calculate all built-in metrics
metrics = registry.calculate_all(results_data)
print(f"Resolution rate: {metrics['resolution_rate']:.1%}")
print(f"Efficiency: {metrics['efficiency_score']:.2f}")

# Register a custom metric
registry.register(MetricDefinition(
    name="cost_per_token",
    description="Average cost per 1000 tokens",
    unit="USD/1k tokens",
    calculate=lambda data: (
        sum(t.get("mcp", {}).get("cost", 0) for t in data.get("tasks", [])) /
        max(sum(
            t.get("mcp", {}).get("tokens", {}).get("input", 0) +
            t.get("mcp", {}).get("tokens", {}).get("output", 0)
            for t in data.get("tasks", [])
        ), 1) * 1000
    ),
    higher_is_better=False,
))

# List all registered metrics
print(registry.list_metrics())

TrendAnalysis

Time-series trend analysis for evaluation results.

calculate_trends()

Calculate trend information from a list of run summaries.

from mcpbr.analytics import calculate_trends

# runs from ResultsDatabase.get_trends()
trends = calculate_trends(runs)
print(f"Direction: {trends['direction']}")  # "improving", "declining", "stable"
print(trends["resolution_rate_trend"])      # [{timestamp, rate}, ...]
print(trends["cost_trend"])                 # [{timestamp, cost}, ...]
print(trends["moving_averages"])            # 3-point moving averages

detect_trend_direction()

Determine whether a series of values is improving, declining, or stable using linear regression.

from mcpbr.analytics import detect_trend_direction

direction = detect_trend_direction([0.40, 0.42, 0.45, 0.48, 0.50])
print(direction)  # "improving"
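
Under the hood this amounts to the sign of a least-squares slope over the series. A sketch of the idea (the cutoff used for "stable" here is an assumption, not mcpbr's exact threshold):

def trend_direction_sketch(values, flat_threshold=0.005):
    """Classify a series by its least-squares slope -- illustrative only."""
    n = len(values)
    if n < 2:
        return "stable"
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values)) / sum(
        (x - x_mean) ** 2 for x in range(n)
    )
    if slope > flat_threshold:
        return "improving"
    if slope < -flat_threshold:
        return "declining"
    return "stable"

print(trend_direction_sketch([0.40, 0.42, 0.45, 0.48, 0.50]))  # improving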

calculate_moving_average()

Compute a simple moving average over a list of values.

from mcpbr.analytics import calculate_moving_average

ma = calculate_moving_average([0.40, 0.42, 0.45, 0.48, 0.50], window=3)
# [None, None, 0.4233..., 0.45, 0.4766...]

AnomalyDetection

Statistical methods to identify outlier values in benchmark metrics.

detect_anomalies()

Detect anomalous values using z-score, IQR, or MAD methods.

from mcpbr.analytics import detect_anomalies

anomalies = detect_anomalies(
    values=[0.5, 0.6, 0.55, 0.58, 5.0, 0.52],
    method="zscore",    # "zscore", "iqr", or "mad"
    threshold=2.0,
)
for a in anomalies:
    print(f"Index {a['index']}: value={a['value']}, score={a['score']:.2f}")
Method Description Threshold Meaning
zscore Z-score exceeds threshold Number of standard deviations
iqr IQR fence method Fence multiplier (commonly 1.5)
mad Median absolute deviation Number of MADs
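
For reference, the z-score rule flags values whose distance from the mean exceeds threshold standard deviations. A stdlib sketch of that rule (illustrative; the fields mcpbr returns may differ):

import statistics

def zscore_anomalies_sketch(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [
        {"index": i, "value": v, "score": abs(v - mean) / stdev}
        for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]

print(zscore_anomalies_sketch([0.5, 0.6, 0.55, 0.58, 5.0, 0.52]))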

detect_metric_anomalies()

Run anomaly detection across standard benchmark metrics (cost, tokens, runtime, iterations).

from mcpbr.analytics import detect_metric_anomalies

anomalies = detect_metric_anomalies(results_data)
print(f"Cost anomalies: {len(anomalies['cost'])}")
print(f"Token anomalies: {len(anomalies['tokens'])}")
print(f"Runtime anomalies: {len(anomalies['runtime'])}")
print(f"Iteration anomalies: {len(anomalies['iterations'])}")

CorrelationAnalysis

Compute correlations between evaluation metrics.

pearson_correlation()

Compute the Pearson correlation coefficient between two sequences.

from mcpbr.analytics import pearson_correlation

result = pearson_correlation(
    x=[100, 200, 300, 400, 500],
    y=[0.5, 1.1, 1.4, 2.0, 2.5],
)
print(f"r = {result['r']:.3f}, R^2 = {result['r_squared']:.3f}, p = {result['p_value']:.4f}")

spearman_correlation()

Compute the Spearman rank correlation (non-parametric, handles non-linear relationships).

from mcpbr.analytics import spearman_correlation

result = spearman_correlation(x=[1, 2, 3, 4, 5], y=[5, 6, 7, 8, 7])

analyze_metric_correlations()

Compute all pairwise Pearson correlations between standard metrics extracted from results.

from mcpbr.analytics import analyze_metric_correlations, find_strong_correlations

correlations = analyze_metric_correlations(results_data)
# Correlations between: cost, tokens_input, tokens_output, iterations,
#                        runtime_seconds, tool_calls

# Filter for strong correlations
strong = find_strong_correlations(correlations, threshold=0.7)
for c in strong:
    print(f"{c['pair']}: r={c['r']:.3f} ({c['direction']})")

ErrorPatternAnalyzer

Analyzes error patterns across benchmark results.

Clusters similar errors, detects temporal patterns, correlates errors with specific tools, and produces actionable recommendations.

analyze(results)

Analyze error patterns across benchmark results.

Parameters:

Name Type Description Default
results list[dict[str, Any]]

List of task result dicts. Each may contain keys like error, errors (list), tool, iteration, and instance_id.

required

Returns:

Type Description
dict[str, Any]

Dictionary with keys:

- total_errors: Total number of errors found.
- error_clusters: List of cluster dicts with pattern, count, examples, and category.
- temporal_patterns: Dict describing whether errors increase over iterations.
- tool_error_correlation: Dict mapping tool names to error rates.
- recommendations: List of actionable recommendation strings.

cluster_errors(errors, similarity_threshold=0.6)

Cluster similar error messages using token-overlap similarity.

Groups errors whose Jaccard similarity exceeds the given threshold, then categorises each cluster.

Parameters:

Name Type Description Default
errors list[str]

List of raw error message strings.

required
similarity_threshold float

Minimum Jaccard similarity to merge two errors into the same cluster. Defaults to 0.6.

0.6

Returns:

Type Description
list[dict[str, Any]]

List of cluster dicts, each containing:

- pattern: Representative error string (most common).
- count: Number of errors in the cluster.
- examples: Up to 3 distinct example messages.
- category: High-level category string.
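
The clustering criterion is token-set Jaccard similarity between two error strings. A minimal sketch of that measure (not the analyzer's exact tokenisation or normalisation):

def jaccard_similarity_sketch(error_a: str, error_b: str) -> float:
    """Token-set Jaccard similarity between two error messages."""
    tokens_a = set(error_a.lower().split())
    tokens_b = set(error_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

a = "Connection refused by host 10.0.0.1"
b = "Connection refused by host 10.0.0.2"
print(jaccard_similarity_sketch(a, b))  # 0.67 -- above the 0.6 default, so clustered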

Usage

from mcpbr.analytics import ErrorPatternAnalyzer

analyzer = ErrorPatternAnalyzer()
analysis = analyzer.analyze(task_results)

print(f"Total errors: {analysis['total_errors']}")

# Error clusters (grouped by similarity)
for cluster in analysis["error_clusters"]:
    print(f"  {cluster['category']}: {cluster['count']}x - {cluster['pattern'][:80]}")

# Temporal patterns
if analysis["temporal_patterns"]["increasing"]:
    print("Warning: errors increasing over iterations")

# Tool-error correlation
for tool, rate in analysis["tool_error_correlation"].items():
    if rate > 0.3:
        print(f"  High error rate tool: {tool} ({rate:.0%})")

# Actionable recommendations
for rec in analysis["recommendations"]:
    print(f"  - {rec}")

Error Categories

The analyzer automatically categorizes errors into:

Category Pattern Keywords
timeout timeout, timed out, deadline
authentication auth, unauthorized, 401, 403
rate_limit rate limit, 429, throttle
connection connection, refused, DNS, network
validation invalid, validation, schema, parse
permission permission, denied, access
unknown Everything else
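
A sketch of how keyword-based categorisation like the table above can work (the keyword lists below mirror the table but are not necessarily mcpbr's exact internal lists):

# Illustrative keyword buckets; order and contents are assumptions.
CATEGORY_KEYWORDS = {
    "timeout": ["timeout", "timed out", "deadline"],
    "authentication": ["auth", "unauthorized", "401", "403"],
    "rate_limit": ["rate limit", "429", "throttle"],
    "connection": ["connection", "refused", "dns", "network"],
    "validation": ["invalid", "validation", "schema", "parse"],
    "permission": ["permission", "denied", "access"],
}

def categorize_error_sketch(message: str) -> str:
    lowered = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unknown"

print(categorize_error_sketch("HTTP 429: rate limit exceeded"))  # rate_limit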

identify_flaky_tasks()

Identify tasks with inconsistent outcomes across multiple runs.

from mcpbr.analytics import identify_flaky_tasks

flaky = identify_flaky_tasks([results_run1, results_run2, results_run3])
for task in flaky:
    if task["flaky"]:
        print(f"{task['instance_id']}: pass_rate={task['pass_rate']:.0%} over {task['run_count']} runs")

DifficultyEstimation

Estimate per-task difficulty based on resolution rates, resource usage, and runtime.

estimate_difficulty()

Score each task's difficulty on a 0-1 scale.

from mcpbr.analytics import estimate_difficulty, aggregate_difficulty_stats

difficulties = estimate_difficulty(results_data)
for d in difficulties[:5]:
    print(f"{d['instance_id']}: {d['difficulty_level']} ({d['difficulty_score']:.2f})")

# Aggregate statistics
stats = aggregate_difficulty_stats(difficulties)
print(f"Distribution: {stats['distribution']}")
print(f"Avg difficulty: {stats['avg_difficulty']:.2f}")
print(f"Hardest tasks: {[t['instance_id'] for t in stats['hardest_tasks']]}")

Difficulty Levels

Score Range Level
0.00 - 0.25 easy
0.25 - 0.50 medium
0.50 - 0.75 hard
0.75 - 1.00 very_hard

estimate_task_difficulty_score()

Score a single task's difficulty given its metrics and run averages.

from mcpbr.analytics import estimate_task_difficulty_score

score = estimate_task_difficulty_score(
    resolved=False,
    cost=0.15,
    tokens=50000,
    iterations=8,
    runtime=250.0,
    avg_cost=0.10,
    avg_tokens=30000,
    avg_iterations=5,
    avg_runtime=180.0,
)
print(f"Difficulty: {score:.2f}")  # Higher = harder