EvalResults

Returned by EvalRunner.run(). Contains per-model results and aggregate comparisons.
results = runner.run(dataset)

.compare(metric_name=None)

Return a formatted comparison table as a string. Uses the first metric if none is specified.
print(results.compare())
print(results.compare("rouge_rougeL"))
print(results.compare("llm_judge"))

.summary()

Return a dict of per-model aggregate stats (mean score, stdev, latency, token usage).
summary = results.summary()
# {
#   "openai/gpt-5.4-nano": {
#     "provider": "openai",
#     "model": "gpt-5.4-nano",
#     "success_rate": 1.0,
#     "mean_latency_ms": 312.4,
#     "total_tokens": 4821,
#     "rouge_rougeL_mean": 0.782,
#     "rouge_rougeL_stdev": 0.134,
#   },
#   ...
# }
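
The per-metric *_mean keys make it easy to rank models programmatically; for example (a minimal sketch assuming the rouge_rougeL metric shown above):
best = max(summary, key=lambda label: summary[label]["rouge_rougeL_mean"])
print(best, summary[best]["rouge_rougeL_mean"])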

.to_dataframe()

Convert the summary to a pandas DataFrame for further analysis.
df = results.to_dataframe()
print(df.sort_values("rouge_rougeL_mean", ascending=False))

.to_json(path=None)

Export full results (summary + per-sample data) as JSON. Returns the JSON string; pass a path to also write it to a file.
results.to_json("results.json")
json_str = results.to_json()

Properties

Property               Description
results.models         List of model labels.
results.metric_names   List of metric names used in this run.
results.model_results  Dict mapping label → ModelResult.
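
A minimal sketch combining these properties to walk every model's results (success_rate is a ModelResult property described below):
print(results.models)        # e.g. ["openai/gpt-5.4-nano", ...]
print(results.metric_names)  # e.g. ["rouge_rougeL", "llm_judge"]
for label in results.models:
    model_result = results.model_results[label]
    print(label, model_result.success_rate)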

ModelResult

Per-model results accessible via results.model_results["label"].
Method                      Description
.mean_score(metric_name)    Mean score for a metric across all samples.
.median_score(metric_name)  Median score for a metric across all samples.
.stdev_score(metric_name)   Standard deviation of scores for a metric.
.mean_latency_ms()          Mean API latency in milliseconds.
.total_tokens()             Total tokens used across all completions.
Property       Description
.num_samples   Total samples evaluated.
.num_errors    Samples that failed after all retries.
.success_rate  Fraction of samples that completed without error.
.completions   List of CompletionResult objects (one per sample).
.scores        List of {metric_name: ScoreResult} dicts (one per sample).
.errors        List of error strings or None (one per sample).
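
A short usage sketch for per-sample inspection of one model, reusing the label and metric name from the examples above:
model_result = results.model_results["openai/gpt-5.4-nano"]
print(model_result.num_samples, model_result.success_rate)
print(model_result.mean_score("rouge_rougeL"), model_result.stdev_score("rouge_rougeL"))

# Inspect samples that failed after all retries
for i, error in enumerate(model_result.errors):
    if error is not None:
        print(f"sample {i} failed: {error}")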