## EvalResults
Returned by `EvalRunner.run()`. Contains per-model results and aggregate comparisons.
| Method | Description |
|---|---|
| `.compare(metric_name=None)` | Return a formatted comparison table as a string. Uses the first metric if none is specified. |
| `.summary()` | Return a dict of per-model aggregate stats (mean score, stdev, latency, token usage). |
| `.to_dataframe()` | Convert the summary to a pandas `DataFrame` for further analysis. |
| `.to_json(path=None)` | Export full results (summary + per-sample data) as JSON. Returns the JSON string. |
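The reference only documents `.to_json()`'s return value. A minimal sketch of the likely export pattern, assuming a plain-dict payload and assuming (not confirmed above) that a provided `path` is also written to disk:

```python
import json

def to_json_sketch(payload, path=None):
    # Serialize the results payload (summary + per-sample data) to a JSON string.
    text = json.dumps(payload, indent=2)
    if path is not None:
        # Assumption: when a path is given, the JSON is also written to that file.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    return text  # the JSON string is returned either way

print(to_json_sketch({"summary": {"model-a": {"mean_score": 0.85}}}))
```

`to_json_sketch` is a hypothetical stand-in, not the library's implementation.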
### Properties
| Property | Description |
|---|---|
| `results.models` | List of model labels. |
| `results.metric_names` | List of metric names used in this run. |
| `results.model_results` | Dict mapping label → `ModelResult`. |
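The per-model aggregates that `.summary()` reports (mean score, stdev, latency) can be reproduced by hand from per-sample data. A minimal sketch with invented sample values, using only the standard library:

```python
import statistics

# Hypothetical per-sample data for one model.
scores = [0.7, 0.9, 0.8, 1.0]
latencies_ms = [120.0, 95.0, 140.0, 110.0]

summary = {
    "mean_score": statistics.mean(scores),            # 0.85
    "stdev_score": statistics.stdev(scores),          # sample standard deviation
    "mean_latency_ms": statistics.mean(latencies_ms),
}
print(summary)
```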
## ModelResult
Per-model results, accessible via `results.model_results["label"]`.
| Method | Description |
|---|---|
| `.mean_score(metric_name)` | Mean score for a metric across all samples. |
| `.median_score(metric_name)` | Median score for a metric across all samples. |
| `.stdev_score(metric_name)` | Standard deviation of a metric's scores across all samples. |
| `.mean_latency_ms()` | Mean API latency in milliseconds. |
| `.total_tokens()` | Total tokens used across all completions. |
| Property | Description |
|---|---|
| `.num_samples` | Total samples evaluated. |
| `.num_errors` | Number of samples that failed after all retries. |
| `.success_rate` | Fraction of samples that completed without error. |
| `.completions` | List of `CompletionResult` objects (one per sample). |
| `.scores` | List of `{metric_name: ScoreResult}` dicts (one per sample). |
| `.errors` | List of error strings (or `None` for successes), one per sample. |
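The three counters above are related. A sketch of how `.num_errors` and `.success_rate` follow from the `.errors` list (the one-entry-per-sample shape comes from the table; the sample data is invented):

```python
# One entry per sample: an error string, or None if the sample succeeded.
errors = ["timeout", None, None, None]

num_samples = len(errors)
num_errors = sum(e is not None for e in errors)
success_rate = (num_samples - num_errors) / num_samples

print(num_errors, success_rate)  # 1 0.75
```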