Analytics API Reference

The mcpbr.analytics package provides comprehensive statistical analysis, historical tracking, and comparison tools for benchmark results. All calculations use only the Python standard library -- no NumPy or SciPy required.

from mcpbr.analytics import (
    ResultsDatabase,
    ComparisonEngine,
    RegressionDetector,
    ABTest,
    Leaderboard,
    MetricsRegistry,
)

ResultsDatabase

SQLite-backed persistent storage for mcpbr evaluation runs and per-task results, supporting queries for trend analysis, filtering, and cleanup of old data.

Example:

with ResultsDatabase("my_results.db") as db:
    run_id = db.store_run(results_data)
    run = db.get_run(run_id)
    trends = db.get_trends(benchmark="swe-bench-verified")

__init__(db_path='mcpbr_results.db')

Open or create the SQLite results database.

Parameters:

Name Type Description Default
db_path str | Path

Path to the SQLite database file. The file and any parent directories are created if they do not exist.

'mcpbr_results.db'

store_run(results_data)

Store a complete evaluation run with its task results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary. Expected keys are metadata (with timestamp, config, and optionally mcp_server), summary (with mcp sub-dict), and tasks (list of per-task result dicts).

required

Returns:

Type Description
int

The auto-generated run_id for the stored run.

Raises:

Type Description
Error

On database write failures.
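
The expected shape of results_data follows mcpbr's standard results output. The sketch below illustrates that shape based on the description above; key names inside summary and the per-task dicts beyond the documented top-level keys are assumptions, not a guaranteed schema.

from mcpbr.analytics import ResultsDatabase

# Illustrative shape only -- nested keys beyond metadata/summary/tasks are
# assumptions based on the surrounding documentation.
results_data = {
    "metadata": {
        "timestamp": "2024-06-01T12:00:00Z",
        "config": {"benchmark": "swe-bench-verified", "model": "sonnet"},
        "mcp_server": "filesystem",  # optional
    },
    "summary": {"mcp": {"resolved": 45, "total": 100, "total_cost": 12.34}},
    "tasks": [
        {"instance_id": "task-001", "resolved": True, "cost": 0.12},
    ],
}

with ResultsDatabase("my_results.db") as db:
    run_id = db.store_run(results_data)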

get_run(run_id)

Retrieve a specific evaluation run by ID.

Parameters:

Name Type Description Default
run_id int

The run identifier returned by store_run().

required

Returns:

Type Description
dict[str, Any] | None

A dictionary with the run's columns, or None if not found.

list_runs(limit=50, benchmark=None, model=None, provider=None)

List evaluation runs with optional filtering.

Parameters:

Name Type Description Default
limit int

Maximum number of runs to return. Runs are ordered by timestamp descending (most recent first).

50
benchmark str | None

Filter by benchmark name (exact match).

None
model str | None

Filter by model identifier (exact match).

None
provider str | None

Filter by provider name (exact match).

None

Returns:

Type Description
list[dict[str, Any]]

List of run dictionaries, most recent first.

get_task_results(run_id)

Get all task-level results for a specific run.

Parameters:

Name Type Description Default
run_id int

The run identifier.

required

Returns:

Type Description
list[dict[str, Any]]

List of task result dictionaries for the run.

delete_run(run_id)

Delete an evaluation run and all its associated task results.

Parameters:

Name Type Description Default
run_id int

The run identifier to delete.

required

Returns:

Type Description
bool

True if a run was deleted, False if no run existed with the given ID.

get_trends(benchmark=None, model=None, limit=20)

Get resolution rate, cost, and token trends over time.

Returns a time-ordered list of aggregate metrics for each run matching the optional filters.

Parameters:

Name Type Description Default
benchmark str | None

Filter by benchmark name.

None
model str | None

Filter by model identifier.

None
limit int

Maximum number of data points to return.

20

Returns:

Type Description
list[dict[str, Any]]

List of dicts with keys timestamp, resolution_rate, total_cost, total_tokens, resolved_tasks, and total_tasks, ordered by timestamp ascending.
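
A quick sketch of consuming these rows (key names taken from the Returns description above; db is assumed to be an open ResultsDatabase):

# Print one line per trend point returned by get_trends().
for point in db.get_trends(benchmark="swe-bench-verified", limit=20):
    print(
        f"{point['timestamp']}: "
        f"{point['resolved_tasks']}/{point['total_tasks']} resolved "
        f"({point['resolution_rate']:.1%}), "
        f"${point['total_cost']:.2f}, {point['total_tokens']} tokens"
    )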

cleanup(max_age_days=90)

Delete runs older than the specified age.

Parameters:

Name Type Description Default
max_age_days int

Maximum age in days. Runs with a timestamp older than this many days from now will be deleted along with their task results (via ON DELETE CASCADE).

90

Returns:

Type Description
int

Number of runs deleted.

close()

Close the database connection.

After calling this method the database instance should not be used.

Usage

from mcpbr.analytics import ResultsDatabase

# Open or create database (context manager supported)
with ResultsDatabase("my_results.db") as db:
    # Store a run
    run_id = db.store_run(results_data)

    # Query runs
    runs = db.list_runs(limit=10, benchmark="swe-bench-verified")
    run = db.get_run(run_id)

    # Get per-task results
    task_results = db.get_task_results(run_id)

    # Get trend data for charting
    trends = db.get_trends(benchmark="swe-bench-verified", model="sonnet")

    # Clean up old data
    deleted = db.cleanup(max_age_days=90)

Methods

Method Returns Description
store_run(results_data) int Store evaluation results, returns run ID
get_run(run_id) dict | None Retrieve a specific run by ID
list_runs(limit, benchmark, model, provider) list[dict] List runs with optional filtering
get_task_results(run_id) list[dict] Get per-task results for a run
delete_run(run_id) bool Delete a run and its task results
get_trends(benchmark, model, limit) list[dict] Get time-series trend data
cleanup(max_age_days) int Delete runs older than max_age_days
close() None Close the database connection

Database Schema

The database has two tables:

runs -- One row per evaluation run:

Column Type Description
id INTEGER Auto-incremented primary key
timestamp TEXT ISO 8601 timestamp
benchmark TEXT Benchmark name
model TEXT Model identifier
provider TEXT Provider name
resolution_rate REAL Overall resolution rate
total_cost REAL Total cost in USD
total_tasks INTEGER Number of tasks evaluated
resolved_tasks INTEGER Number of tasks resolved
metadata_json TEXT Full metadata as JSON

task_results -- One row per task per run:

Column Type Description
run_id INTEGER Foreign key to runs
instance_id TEXT Task identifier
resolved INTEGER 1 if resolved, 0 otherwise
cost REAL Task cost in USD
tokens_input INTEGER Input tokens used
tokens_output INTEGER Output tokens used
runtime_seconds REAL Task runtime
error TEXT Error message if failed
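
Because the storage is plain SQLite, the tables above can also be queried directly for ad-hoc analysis. A minimal sketch using the documented column names (the query itself is illustrative, not part of the mcpbr API):

import sqlite3

# Recompute per-run resolution rate and cost from task_results, joined to runs.
conn = sqlite3.connect("mcpbr_results.db")
rows = conn.execute(
    """
    SELECT r.id, r.benchmark, r.model,
           AVG(t.resolved) AS resolution_rate,
           SUM(t.cost)     AS total_cost
    FROM runs AS r
    JOIN task_results AS t ON t.run_id = r.id
    GROUP BY r.id
    ORDER BY r.timestamp DESC
    """
).fetchall()
for run_id, benchmark, model, rate, cost in rows:
    print(f"run {run_id} [{benchmark}/{model}]: {rate:.1%} resolved, ${cost:.2f}")
conn.close()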

Statistical Tests

Pure Python implementations of common statistical tests for comparing benchmark results.

chi_squared_test()

Compare two proportions (resolution rates) using a 2x2 chi-squared test.

from mcpbr.analytics import chi_squared_test

result = chi_squared_test(
    success_a=45, total_a=100,
    success_b=60, total_b=100,
)
print(f"Chi2: {result['chi2']:.4f}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Significant: {result['significant']}")
print(f"Effect size (phi): {result['effect_size']:.3f}")
Parameter Type Description
success_a int Successes in group A
total_a int Total observations in group A
success_b int Successes in group B
total_b int Total observations in group B
significance_level float Alpha threshold (default: 0.05)

Returns: dict with chi2, p_value, significant, effect_size (phi coefficient).

bootstrap_confidence_interval()

Bootstrap confidence interval for a metric.

from mcpbr.analytics import bootstrap_confidence_interval

ci = bootstrap_confidence_interval(
    values=[0.85, 0.90, 0.78, 0.92, 0.88, 0.82],
    confidence=0.95,
    n_bootstrap=1000,
)
print(f"Mean: {ci['mean']:.3f}")
print(f"95% CI: [{ci['ci_lower']:.3f}, {ci['ci_upper']:.3f}]")
print(f"Std Error: {ci['std_error']:.4f}")
Parameter Type Default Description
values list[float] (required) Observed metric values
confidence float 0.95 Confidence level
n_bootstrap int 1000 Number of resamples

Returns: dict with mean, ci_lower, ci_upper, std_error.
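
Conceptually, a percentile bootstrap resamples the observed values with replacement and takes quantiles of the resampled means. A stdlib-only sketch of the idea (illustrative; mcpbr's exact interval method may differ):

import random
import statistics

def bootstrap_ci_sketch(values, confidence=0.95, n_bootstrap=1000, seed=0):
    """Percentile bootstrap CI for the mean -- illustrative only."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence
    lower = means[int(alpha / 2 * n_bootstrap)]
    upper = means[int((1 - alpha / 2) * n_bootstrap) - 1]
    return statistics.fmean(values), lower, upper

print(bootstrap_ci_sketch([0.85, 0.90, 0.78, 0.92, 0.88, 0.82]))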

effect_size_cohens_d()

Cohen's d effect size between two groups.

from mcpbr.analytics import effect_size_cohens_d

d = effect_size_cohens_d(
    group_a=[0.85, 0.90, 0.88, 0.92],
    group_b=[0.70, 0.75, 0.72, 0.68],
)
print(f"Cohen's d: {d:.3f}")
# Interpretation: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, >= 0.8 large
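
For reference, Cohen's d is the difference in group means divided by a pooled standard deviation. A stdlib-only sketch (illustrative; mcpbr's pooling convention may differ slightly):

import statistics

def cohens_d_sketch(group_a, group_b):
    """(mean_a - mean_b) / pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd

print(cohens_d_sketch([0.85, 0.90, 0.88, 0.92], [0.70, 0.75, 0.72, 0.68]))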

mann_whitney_u()

Non-parametric Mann-Whitney U test for comparing two independent samples.

from mcpbr.analytics import mann_whitney_u

result = mann_whitney_u(
    group_a=[0.85, 0.90, 0.88, 0.92, 0.87],
    group_b=[0.70, 0.75, 0.72, 0.68, 0.74],
)
print(f"U: {result['u_statistic']:.1f}, p={result['p_value']:.4f}")
print(f"Significant: {result['significant']}")

permutation_test()

Permutation test for difference in means between two groups.

from mcpbr.analytics import permutation_test

result = permutation_test(
    group_a=[0.85, 0.90, 0.88],
    group_b=[0.70, 0.75, 0.72],
    n_permutations=5000,
)
print(f"Observed diff: {result['observed_diff']:.4f}")
print(f"p-value: {result['p_value']:.4f}")

compare_resolution_rates()

Comprehensive comparison of two result sets with chi-squared testing, effect sizes, and a human-readable summary.

from mcpbr.analytics import compare_resolution_rates

comparison = compare_resolution_rates(
    results_a={"resolved": 45, "total": 100, "name": "Server A"},
    results_b={"resolved": 38, "total": 100, "name": "Server B"},
)
print(comparison["summary"])
# "Server A (45.0%) vs Server B (38.0%): Server A is 7.0pp higher.
#  Difference is not significant (p=0.3123, phi=0.072)."

ComparisonEngine

Engine for comparing evaluation results across multiple models.

Supports adding multiple labeled result sets and generating comprehensive comparisons including summary tables, task matrices, unique wins, rankings, pairwise comparisons, cost-performance frontiers, and winner analysis.

Example:

engine = ComparisonEngine()
engine.add_results("claude-sonnet", sonnet_data)
engine.add_results("gpt-4o", gpt4o_data)
comparison = engine.compare()

add_results(label, results_data)

Add a labeled result set for comparison.

Parameters:

Name Type Description Default
label str

Human-readable label identifying the model or run (e.g., "claude-sonnet-run-1").

required
results_data dict[str, Any]

Results dictionary with the standard mcpbr output structure containing metadata, summary, and tasks keys.

required

compare()

Generate a comprehensive comparison across all added result sets.

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- models: List of model labels.
- summary_table: List of dicts with per-model summary metrics including label, model, provider, benchmark, resolved, total, rate, cost, avg_cost_per_task, and avg_tokens.
- task_matrix: Dict mapping instance_id to a dict of {label: resolved_bool} for each model.
- unique_wins: Dict mapping label to list of instance_ids that only that model resolved.
- rankings: Dict with by_rate, by_cost_efficiency, and by_speed lists, each sorted best-first.
- pairwise: List of pairwise comparison dicts between all model pairs, including rate difference.

Raises:

Type Description
ValueError

If fewer than two result sets have been added.

get_cost_performance_frontier()

Compute the Pareto frontier of cost vs resolution rate.

Points on the frontier represent models where no other model is both cheaper and has a higher resolution rate. The frontier is sorted by ascending cost.

Returns:

Type Description
list[dict[str, Any]]

List of dicts, each with label, cost, and rate keys, representing models on the Pareto-optimal frontier.

Raises:

Type Description
ValueError

If fewer than two result sets have been added.
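
For intuition, a model is on the frontier when no other model is both cheaper (or equal cost) and at least as high-rate. A standalone sketch of that filter over (label, cost, rate) tuples (illustrative only, not ComparisonEngine's internal code):

def pareto_frontier_sketch(points):
    """Keep points not dominated on (cost, rate); points are (label, cost, rate)."""
    frontier = [
        (label, cost, rate)
        for label, cost, rate in points
        if not any(
            other_cost <= cost
            and other_rate >= rate
            and (other_cost, other_rate) != (cost, rate)
            for _, other_cost, other_rate in points
        )
    ]
    return sorted(frontier, key=lambda p: p[1])  # ascending cost

print(pareto_frontier_sketch([
    ("claude-sonnet", 12.0, 0.52),
    ("gpt-4o", 9.5, 0.47),
    ("gemini-2.0-flash", 4.0, 0.38),
]))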

get_winner_analysis()

Determine which model wins on each metric.

Evaluates models across resolution rate, total cost, cost efficiency (cost per resolved task), and average speed (runtime per task).

Returns:

Type Description
dict[str, Any]

Dictionary with metric names as keys and dicts containing winner (label) and value (the winning metric value).

Raises:

Type Description
ValueError

If fewer than two result sets have been added.

Usage

from mcpbr.analytics import ComparisonEngine

engine = ComparisonEngine()
engine.add_results("claude-sonnet", sonnet_results)
engine.add_results("gpt-4o", gpt4o_results)
engine.add_results("gemini-2.0-flash", gemini_results)

# Full comparison
comparison = engine.compare()
print(comparison["models"])          # ["claude-sonnet", "gpt-4o", "gemini-2.0-flash"]
print(comparison["summary_table"])   # Per-model summary metrics
print(comparison["rankings"])        # by_rate, by_cost_efficiency, by_speed
print(comparison["unique_wins"])     # Tasks only one model resolved
print(comparison["pairwise"])        # All pairwise comparisons

# Pareto-optimal models (cost vs resolution rate)
frontier = engine.get_cost_performance_frontier()
for point in frontier:
    print(f"{point['label']}: rate={point['rate']:.1%}, cost=${point['cost']:.2f}")

# Winner on each metric
winners = engine.get_winner_analysis()
for metric, info in winners.items():
    print(f"{metric}: {info['winner']} ({info['value']})")

Convenience Functions

from mcpbr.analytics import compare_results_files, format_comparison_table

# Compare JSON result files directly
comparison = compare_results_files(
    ["results_sonnet.json", "results_gpt4o.json"],
    labels=["Claude Sonnet", "GPT-4o"],
)

# Format as ASCII table
print(format_comparison_table(comparison))

RegressionDetector

Detect performance regressions between evaluation runs.

Compares a current run against a baseline across multiple dimensions: resolution rate (with statistical significance testing), cost, latency, and token usage. Also reports per-task regressions and improvements.

Example:

detector = RegressionDetector(threshold=0.05)
result = detector.detect(current_results, baseline_results)
if result["overall_status"] == "fail":
    print("Regression detected!")
print(detector.format_report())

__init__(threshold=0.05, significance_level=0.05)

Configure the regression detector.

Parameters:

Name Type Description Default
threshold float

Minimum absolute change in resolution rate to consider as a potential regression. Defaults to 0.05 (5 percentage points).

0.05
significance_level float

Alpha level for statistical significance testing. Defaults to 0.05.

0.05

detect(current, baseline)

Detect regressions between current and baseline results.

Analyzes resolution rate, cost, latency, and token usage, plus per-task changes.

Parameters:

Name Type Description Default
current dict[str, Any]

Current evaluation results dictionary.

required
baseline dict[str, Any]

Baseline evaluation results dictionary to compare against.

required

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- score_regression: Resolution rate regression analysis with detected, current_rate, baseline_rate, delta, and significant.
- cost_regression: Cost change analysis with detected, current_cost, baseline_cost, and delta_pct.
- latency_regression: Latency change analysis.
- token_regression: Token usage change analysis.
- task_regressions: List of tasks that regressed.
- task_improvements: List of tasks that improved.
- overall_status: "pass", "warning", or "fail".
- summary: Human-readable summary string.

format_report()

Format the last detection result as a human-readable report.

Returns:

Type Description
str

Multi-line string containing the formatted regression report.

Raises:

Type Description
ValueError

If detect() has not been called yet.

Usage

from mcpbr.analytics import RegressionDetector

detector = RegressionDetector(threshold=0.05, significance_level=0.05)
result = detector.detect(current_results, baseline_results)

# Check overall status
if result["overall_status"] == "fail":
    print("REGRESSION DETECTED!")
elif result["overall_status"] == "warning":
    print("Warning: potential issues")
else:
    print("All clear")

# Inspect specific regressions
print(result["score_regression"])     # Resolution rate analysis
print(result["cost_regression"])      # Cost change analysis
print(result["latency_regression"])   # Latency change analysis
print(result["token_regression"])     # Token usage change analysis
print(result["task_regressions"])     # Per-task regressions
print(result["task_improvements"])    # Per-task improvements

# Human-readable report
print(detector.format_report())

Detection Thresholds

Dimension Regression Threshold Description
Resolution rate > 5pp decrease + statistically significant Chi-squared test at alpha=0.05
Cost > 20% increase Percentage increase in total cost
Latency > 25% increase Percentage increase in average runtime
Token usage > 25% increase Percentage increase in average tokens

Overall Status

Status Meaning
"pass" No regressions detected
"warning" Cost, latency, or token regression; or per-task regressions
"fail" Statistically significant resolution rate regression

ABTest

A/B testing framework for comparing two MCP server configurations.

Creates a structured comparison between a control group (A) and treatment group (B), running chi-squared significance testing on resolution rates and comparing cost metrics.

Example:

test = ABTest("Model Comparison")
test.add_control(results_baseline)
test.add_treatment(results_candidate)
analysis = test.analyze()
print(test.format_report())

__init__(name, control_label='A', treatment_label='B')

Initialize the A/B test.

Parameters:

Name Type Description Default
name str

Human-readable name for this test.

required
control_label str

Label for the control group (default "A").

'A'
treatment_label str

Label for the treatment group (default "B").

'B'

add_control(results_data)

Add the control group results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary for the control configuration.

required

add_treatment(results_data)

Add the treatment group results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary for the treatment configuration.

required

analyze()

Run the A/B test analysis.

Compares resolution rates using a chi-squared test, and reports differences in cost and other metrics.

Returns:

Type Description
dict[str, Any]

Dictionary containing:

- test_name: The test name.
- control: Metrics for the control group.
- treatment: Metrics for the treatment group.
- rate_difference: Absolute difference in resolution rates.
- rate_relative_change: Percentage change in resolution rate.
- cost_difference: Difference in total cost.
- statistical_significance: Chi-squared test results.
- winner: "control", "treatment", or "no_significant_difference".
- recommendation: Human-readable recommendation.

Raises:

Type Description
ValueError

If control or treatment data has not been added.

format_report()

Format the analysis results as a human-readable report.

Calls analyze() automatically if it has not been called yet.

Returns:

Type Description
str

Multi-line string containing the formatted A/B test report.

Raises:

Type Description
ValueError

If control or treatment data has not been added.

Usage

from mcpbr.analytics import ABTest

test = ABTest(
    name="Filesystem v2 vs v1",
    control_label="v1 (current)",
    treatment_label="v2 (candidate)",
)
test.add_control(results_v1)
test.add_treatment(results_v2)

analysis = test.analyze()
print(f"Winner: {analysis['winner']}")
print(f"Rate difference: {analysis['rate_difference']:+.4f}")
print(f"Significant: {analysis['statistical_significance']['significant']}")
print(f"Recommendation: {analysis['recommendation']}")

# Formatted report
print(test.format_report())

Quick A/B Test

from mcpbr.analytics import run_ab_test

result = run_ab_test(results_a, results_b, test_name="Quick Comparison")
print(result["winner"])
print(result["recommendation"])

Leaderboard

Generate ranked leaderboards from multiple evaluation results.

Collects results from multiple configurations or models and produces a ranked comparison sorted by any supported metric.

Example:

lb = Leaderboard()
lb.add_entry("Claude Sonnet", results_sonnet)
lb.add_entry("GPT-4o", results_gpt4o)
print(lb.format_table())

add_entry(label, results_data)

Add a result set to the leaderboard.

Parameters:

Name Type Description Default
label str

Human-readable label for this entry (e.g., model name or configuration description).

required
results_data dict[str, Any]

Evaluation results dictionary with summary.mcp and tasks keys.

required

generate(sort_by='resolution_rate')

Generate the sorted leaderboard.

Parameters:

Name Type Description Default
sort_by str

Metric to sort by. Supported values: "resolution_rate", "total_cost", "cost_per_resolved", "avg_tokens", "avg_runtime", "resolved". Default is "resolution_rate" (higher is better).

'resolution_rate'

Returns:

Type Description
list[dict[str, Any]]

List of ranked entry dictionaries, each containing rank, label, model, provider, resolution_rate, resolved, total, total_cost, cost_per_resolved, avg_tokens, and avg_runtime.

Raises:

Type Description
ValueError

If sort_by is not a supported sort key.

format_table(sort_by='resolution_rate')

Format the leaderboard as an ASCII table.

Parameters:

Name Type Description Default
sort_by str

Metric to sort by (see generate() for options).

'resolution_rate'

Returns:

Type Description
str

Multi-line ASCII table string.

format_markdown(sort_by='resolution_rate')

Format the leaderboard as a Markdown table.

Parameters:

Name Type Description Default
sort_by str

Metric to sort by (see generate() for options).

'resolution_rate'

Returns:

Type Description
str

Markdown-formatted table string.

Usage

from mcpbr.analytics import Leaderboard

lb = Leaderboard()
lb.add_entry("Claude Sonnet", results_sonnet)
lb.add_entry("GPT-4o", results_gpt4o)
lb.add_entry("Gemini Flash", results_gemini)

# Generate sorted leaderboard
entries = lb.generate(sort_by="resolution_rate")
for entry in entries:
    print(f"#{entry['rank']} {entry['label']}: {entry['resolution_rate']:.1%}")

# ASCII table output
print(lb.format_table(sort_by="resolution_rate"))

# Markdown table (for GitHub/docs)
print(lb.format_markdown(sort_by="resolution_rate"))

Sort Keys

Key Direction Description
resolution_rate Higher is better Fraction of tasks resolved
resolved Higher is better Absolute number of resolved tasks
total_cost Lower is better Total cost in USD
cost_per_resolved Lower is better Cost per resolved task
avg_tokens Lower is better Average tokens per task
avg_runtime Lower is better Average runtime per task

Quick Leaderboard

from mcpbr.analytics import generate_leaderboard

entries = generate_leaderboard([
    ("Claude Sonnet", results_sonnet),
    ("GPT-4o", results_gpt4o),
], sort_by="resolution_rate")

MetricsRegistry

Registry of metric definitions with built-in defaults and support for custom metrics.

Built-in metrics registered on initialisation:

- resolution_rate: Fraction of tasks resolved.
- cost_per_resolution: Total cost divided by resolved count (inf if none resolved).
- avg_tokens_per_task: Mean total token count per task.
- tool_failure_rate: Ratio of tool failures to total tool calls.
- efficiency_score: Composite score: rate / (cost + 0.01).

register(metric)

Register a custom metric definition.

Parameters:

Name Type Description Default
metric MetricDefinition

The metric to register.

required

Raises:

Type Description
ValueError

If a metric with the same name is already registered.

calculate_all(results_data)

Calculate all registered metrics against the given results.

Parameters:

Name Type Description Default
results_data dict[str, Any]

Evaluation results dictionary with metadata, summary, and tasks keys.

required

Returns:

Type Description
dict[str, float]

Dictionary mapping metric name to its computed float value. If a metric calculation raises an exception the value is float('nan').

get_metric(name)

Look up a metric by name.

Parameters:

Name Type Description Default
name str

Metric identifier.

required

Returns:

Type Description
MetricDefinition | None

The MetricDefinition if found, otherwise None.

list_metrics()

Return a sorted list of all registered metric names.

Built-in Metrics

Metric Unit Higher is Better Description
resolution_rate ratio Yes Fraction of tasks resolved
cost_per_resolution USD No Total cost / resolved count
avg_tokens_per_task tokens No Average total tokens per task
tool_failure_rate ratio No Tool failures / total tool calls
efficiency_score score Yes resolution_rate / (total_cost + 0.01)
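
As a worked example of the efficiency_score formula: a run that resolves 45% of tasks at a total cost of $2.49 scores 0.45 / (2.49 + 0.01) = 0.18.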

Usage

from mcpbr.analytics import MetricsRegistry, MetricDefinition

registry = MetricsRegistry()

# Calculate all built-in metrics
metrics = registry.calculate_all(results_data)
print(f"Resolution rate: {metrics['resolution_rate']:.1%}")
print(f"Efficiency: {metrics['efficiency_score']:.2f}")

# Register a custom metric
registry.register(MetricDefinition(
    name="cost_per_token",
    description="Average cost per 1000 tokens",
    unit="USD/1k tokens",
    calculate=lambda data: (
        sum(t.get("mcp", {}).get("cost", 0) for t in data.get("tasks", [])) /
        max(sum(
            t.get("mcp", {}).get("tokens", {}).get("input", 0) +
            t.get("mcp", {}).get("tokens", {}).get("output", 0)
            for t in data.get("tasks", [])
        ), 1) * 1000
    ),
    higher_is_better=False,
))

# List all registered metrics
print(registry.list_metrics())

TrendAnalysis

Time-series trend analysis for evaluation results.

calculate_trends()

Calculate trend information from a list of run summaries.

from mcpbr.analytics import calculate_trends

# runs from ResultsDatabase.get_trends()
trends = calculate_trends(runs)
print(f"Direction: {trends['direction']}")  # "improving", "declining", "stable"
print(trends["resolution_rate_trend"])      # [{timestamp, rate}, ...]
print(trends["cost_trend"])                 # [{timestamp, cost}, ...]
print(trends["moving_averages"])            # 3-point moving averages

detect_trend_direction()

Determine whether a series of values is improving, declining, or stable using linear regression.

from mcpbr.analytics import detect_trend_direction

direction = detect_trend_direction([0.40, 0.42, 0.45, 0.48, 0.50])
print(direction)  # "improving"
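
Under the hood this amounts to the sign of a least-squares slope over the series. A sketch of the idea (the cutoff used for "stable" here is an assumption, not mcpbr's exact threshold):

def trend_direction_sketch(values, flat_threshold=0.005):
    """Classify a series by its least-squares slope -- illustrative only."""
    n = len(values)
    if n < 2:
        return "stable"
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values)) / sum(
        (x - x_mean) ** 2 for x in range(n)
    )
    if slope > flat_threshold:
        return "improving"
    if slope < -flat_threshold:
        return "declining"
    return "stable"

print(trend_direction_sketch([0.40, 0.42, 0.45, 0.48, 0.50]))  # improving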

calculate_moving_average()

Compute a simple moving average over a list of values.

from mcpbr.analytics import calculate_moving_average

ma = calculate_moving_average([0.40, 0.42, 0.45, 0.48, 0.50], window=3)
# [None, None, 0.4233..., 0.45, 0.4766...]

AnomalyDetection

Statistical methods to identify outlier values in benchmark metrics.

detect_anomalies()

Detect anomalous values using z-score, IQR, or MAD methods.

from mcpbr.analytics import detect_anomalies

anomalies = detect_anomalies(
    values=[0.5, 0.6, 0.55, 0.58, 5.0, 0.52],
    method="zscore",    # "zscore", "iqr", or "mad"
    threshold=2.0,
)
for a in anomalies:
    print(f"Index {a['index']}: value={a['value']}, score={a['score']:.2f}")
Method Description Threshold Meaning
zscore Z-score exceeds threshold Number of standard deviations
iqr IQR fence method Fence multiplier (commonly 1.5)
mad Median absolute deviation Number of MADs
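
For reference, the z-score rule flags values whose distance from the mean exceeds threshold standard deviations. A stdlib sketch of that rule (illustrative; the fields mcpbr returns may differ):

import statistics

def zscore_anomalies_sketch(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [
        {"index": i, "value": v, "score": abs(v - mean) / stdev}
        for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]

print(zscore_anomalies_sketch([0.5, 0.6, 0.55, 0.58, 5.0, 0.52]))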

detect_metric_anomalies()

Run anomaly detection across standard benchmark metrics (cost, tokens, runtime, iterations).

from mcpbr.analytics import detect_metric_anomalies

anomalies = detect_metric_anomalies(results_data)
print(f"Cost anomalies: {len(anomalies['cost'])}")
print(f"Token anomalies: {len(anomalies['tokens'])}")
print(f"Runtime anomalies: {len(anomalies['runtime'])}")
print(f"Iteration anomalies: {len(anomalies['iterations'])}")

CorrelationAnalysis

Compute correlations between evaluation metrics.

pearson_correlation()

Compute the Pearson correlation coefficient between two sequences.

from mcpbr.analytics import pearson_correlation

result = pearson_correlation(
    x=[100, 200, 300, 400, 500],
    y=[0.5, 1.1, 1.4, 2.0, 2.5],
)
print(f"r = {result['r']:.3f}, R^2 = {result['r_squared']:.3f}, p = {result['p_value']:.4f}")

spearman_correlation()

Compute the Spearman rank correlation (non-parametric, handles non-linear relationships).

from mcpbr.analytics import spearman_correlation

result = spearman_correlation(x=[1, 2, 3, 4, 5], y=[5, 6, 7, 8, 7])

analyze_metric_correlations()

Compute all pairwise Pearson correlations between standard metrics extracted from results.

from mcpbr.analytics import analyze_metric_correlations, find_strong_correlations

correlations = analyze_metric_correlations(results_data)
# Correlations between: cost, tokens_input, tokens_output, iterations,
#                        runtime_seconds, tool_calls

# Filter for strong correlations
strong = find_strong_correlations(correlations, threshold=0.7)
for c in strong:
    print(f"{c['pair']}: r={c['r']:.3f} ({c['direction']})")

ErrorPatternAnalyzer

Analyzes error patterns across benchmark results.

Clusters similar errors, detects temporal patterns, correlates errors with specific tools, and produces actionable recommendations.

analyze(results)

Analyze error patterns across benchmark results.

Parameters:

Name Type Description Default
results list[dict[str, Any]]

List of task result dicts. Each may contain keys like error, errors (list), tool, iteration, and instance_id.

required

Returns:

Type Description
dict[str, Any]

Dictionary with keys:

- total_errors: Total number of errors found.
- error_clusters: List of cluster dicts with pattern, count, examples, and category.
- temporal_patterns: Dict describing whether errors increase over iterations.
- tool_error_correlation: Dict mapping tool names to error rates.
- recommendations: List of actionable recommendation strings.

cluster_errors(errors, similarity_threshold=0.6)

Cluster similar error messages using token-overlap similarity.

Groups errors whose Jaccard similarity exceeds the given threshold, then categorises each cluster.

Parameters:

Name Type Description Default
errors list[str]

List of raw error message strings.

required
similarity_threshold float

Minimum Jaccard similarity to merge two errors into the same cluster. Defaults to 0.6.

0.6

Returns:

Type Description
list[dict[str, Any]]

List of cluster dicts, each containing:

- pattern: Representative error string (most common).
- count: Number of errors in the cluster.
- examples: Up to 3 distinct example messages.
- category: High-level category string.
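
The clustering criterion is token-set Jaccard similarity between two error strings. A minimal sketch of that measure (not the analyzer's exact tokenisation or normalisation):

def jaccard_similarity_sketch(error_a: str, error_b: str) -> float:
    """Token-set Jaccard similarity between two error messages."""
    tokens_a = set(error_a.lower().split())
    tokens_b = set(error_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

a = "Connection refused by host 10.0.0.1"
b = "Connection refused by host 10.0.0.2"
print(jaccard_similarity_sketch(a, b))  # 0.67 -- above the 0.6 default, so clustered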

Usage

from mcpbr.analytics import ErrorPatternAnalyzer

analyzer = ErrorPatternAnalyzer()
analysis = analyzer.analyze(task_results)

print(f"Total errors: {analysis['total_errors']}")

# Error clusters (grouped by similarity)
for cluster in analysis["error_clusters"]:
    print(f"  {cluster['category']}: {cluster['count']}x - {cluster['pattern'][:80]}")

# Temporal patterns
if analysis["temporal_patterns"]["increasing"]:
    print("Warning: errors increasing over iterations")

# Tool-error correlation
for tool, rate in analysis["tool_error_correlation"].items():
    if rate > 0.3:
        print(f"  High error rate tool: {tool} ({rate:.0%})")

# Actionable recommendations
for rec in analysis["recommendations"]:
    print(f"  - {rec}")

Error Categories

The analyzer automatically categorizes errors into:

Category Pattern Keywords
timeout timeout, timed out, deadline
authentication auth, unauthorized, 401, 403
rate_limit rate limit, 429, throttle
connection connection, refused, DNS, network
validation invalid, validation, schema, parse
permission permission, denied, access
unknown Everything else
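
A sketch of how keyword-based categorisation like the table above can work (the keyword lists below mirror the table but are not necessarily mcpbr's exact internal lists):

# Illustrative keyword buckets; order and contents are assumptions.
CATEGORY_KEYWORDS = {
    "timeout": ["timeout", "timed out", "deadline"],
    "authentication": ["auth", "unauthorized", "401", "403"],
    "rate_limit": ["rate limit", "429", "throttle"],
    "connection": ["connection", "refused", "dns", "network"],
    "validation": ["invalid", "validation", "schema", "parse"],
    "permission": ["permission", "denied", "access"],
}

def categorize_error_sketch(message: str) -> str:
    lowered = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unknown"

print(categorize_error_sketch("HTTP 429: rate limit exceeded"))  # rate_limit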

identify_flaky_tasks()

Identify tasks with inconsistent outcomes across multiple runs.

from mcpbr.analytics import identify_flaky_tasks

flaky = identify_flaky_tasks([results_run1, results_run2, results_run3])
for task in flaky:
    if task["flaky"]:
        print(f"{task['instance_id']}: pass_rate={task['pass_rate']:.0%} over {task['run_count']} runs")

DifficultyEstimation

Estimate per-task difficulty based on resolution rates, resource usage, and runtime.

estimate_difficulty()

Score each task's difficulty on a 0-1 scale.

from mcpbr.analytics import estimate_difficulty, aggregate_difficulty_stats

difficulties = estimate_difficulty(results_data)
for d in difficulties[:5]:
    print(f"{d['instance_id']}: {d['difficulty_level']} ({d['difficulty_score']:.2f})")

# Aggregate statistics
stats = aggregate_difficulty_stats(difficulties)
print(f"Distribution: {stats['distribution']}")
print(f"Avg difficulty: {stats['avg_difficulty']:.2f}")
print(f"Hardest tasks: {[t['instance_id'] for t in stats['hardest_tasks']]}")

Difficulty Levels

Score Range Level
0.00 - 0.25 easy
0.25 - 0.50 medium
0.50 - 0.75 hard
0.75 - 1.00 very_hard

estimate_task_difficulty_score()

Score a single task's difficulty given its metrics and run averages.

from mcpbr.analytics import estimate_task_difficulty_score

score = estimate_task_difficulty_score(
    resolved=False,
    cost=0.15,
    tokens=50000,
    iterations=8,
    runtime=250.0,
    avg_cost=0.10,
    avg_tokens=30000,
    avg_iterations=5,
    avg_runtime=180.0,
)
print(f"Difficulty: {score:.2f}")  # Higher = harder