Reference-based metrics
These compare the model’s output against a known-good ideal answer in the dataset.
ROUGE
Measures word overlap between the response and the reference. Best for summarisation
and open-ended generation tasks where phrasing can vary.
CLI: --metric rouge
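To make the overlap idea concrete, here is a minimal, illustrative ROUGE-1 F1 sketch in pure Python. It is not the tool's implementation (production ROUGE adds stemming and n-gram variants); it only shows what "word overlap between response and reference" means.

```python
from collections import Counter

def rouge1_f1(response: str, reference: str) -> float:
    # Unigram counts for each text (lower-cased, whitespace-tokenised)
    resp = Counter(response.lower().split())
    ref = Counter(reference.lower().split())
    # Overlap: each word counts at most as often as it appears in both
    overlap = sum((resp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("the cat sat on the mat", "the cat is on the mat")` scores 5/6, because five of six tokens on each side overlap.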
BLEU
N-gram precision with brevity penalty. More common in machine translation evals.
CLI: --metric bleu
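The brevity penalty is the part that most often surprises people, so here is an illustrative unigram-only sketch. Real BLEU averages clipped precisions over n-grams up to n=4; this simplified version only demonstrates how short responses get penalised.

```python
import math
from collections import Counter

def bleu1(response: str, reference: str) -> float:
    # Simplified unigram BLEU with brevity penalty (illustration only)
    resp = response.lower().split()
    ref = reference.lower().split()
    if not resp:
        return 0.0
    # Clipped precision: each response token counts at most as often
    # as it appears in the reference
    overlap = sum((Counter(resp) & Counter(ref)).values())
    precision = overlap / len(resp)
    # Brevity penalty: exp(1 - ref_len/resp_len) when the response is
    # shorter than the reference, 1.0 otherwise
    bp = 1.0 if len(resp) >= len(ref) else math.exp(1 - len(ref) / len(resp))
    return bp * precision
```

A perfectly precise but half-length response, e.g. `bleu1("the cat sat", "the cat sat on the mat")`, keeps precision 1.0 but is scaled down by the brevity penalty.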
Exact match
Binary score — 1.0 if the response matches the ideal exactly, 0.0 otherwise.
Useful for classification, short answers, and code generation with deterministic output.
CLI: --metric exact
Label-free evaluation
When your dataset has no reference answers, use LLMJudge or a CustomMetric.
The runner checks metric.requires_ideal against dataset.has_ideals() before any
API calls are made and raises a clear error naming each offending metric.
Reference-based metrics (RougeScore, BleuScore, ExactMatch) set requires_ideal = True
and will be rejected upfront on label-free datasets.
LLM-as-judge
Uses a separate model to evaluate response quality on configurable criteria. Works with or without a reference answer.
Multi-dimensional scoring
Score across multiple dimensions in a single API call. The overall score is the mean across dimensions; individual scores are available in result.sub_scores.
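The mean-over-dimensions aggregation can be sketched as follows. The JudgeResult class here is a hypothetical stand-in for the tool's actual result object; only the sub_scores name and the mean aggregation come from the description above.

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    sub_scores: dict  # per-dimension scores from a single judge call

    @property
    def score(self) -> float:
        # Overall score is the unweighted mean across dimensions
        return sum(self.sub_scores.values()) / len(self.sub_scores)

r = JudgeResult(sub_scores={"accuracy": 0.9, "helpfulness": 0.7, "tone": 0.8})
```

Here `r.score` is the mean of the three dimensions, 0.8, while each dimension stays individually inspectable via `r.sub_scores`.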
Custom criteria
Custom prompt template
For full control over the judge prompt, pass a .md file with these placeholders:
{criteria}, {conversation}, {response}, {ideal_section}.
examples/judge_prompt.md in the repo is a copy of the default template to start from.
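For illustration, here is how those four placeholders might be filled once the .md template is loaded. The template text and values below are made up; the real tool reads your file and performs the substitution itself.

```python
# Hypothetical template text; a real template lives in a .md file
template = (
    "Evaluate the response against these criteria: {criteria}\n\n"
    "Conversation:\n{conversation}\n\n"
    "Response:\n{response}\n"
    "{ideal_section}"
)

prompt = template.format(
    criteria="accuracy, tone",
    conversation="user: What is 2+2?",
    response="4",
    # ideal_section is typically empty on label-free datasets
    ideal_section="\nIdeal answer:\n4",
)
```

All four placeholders must appear in the file so the tool can substitute them; any literal braces in your template text would need doubling (`{{`, `}}`).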
CLI
Custom metrics
Pass any Python function that takes (response, ideal=None, messages=None, **kwargs)
and returns a float or a dict with a "score" key.
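A small example of a function matching that signature. The metric itself (keyword coverage against the ideal answer) is made up for illustration; only the signature and the two accepted return shapes come from the description above.

```python
def keyword_coverage(response, ideal=None, messages=None, **kwargs):
    """Illustrative custom metric: fraction of ideal-answer words that
    appear in the response. Returns a dict with a "score" key, one of
    the two accepted return shapes (the other is a bare float)."""
    if not ideal:
        return {"score": 0.0}
    ideal_words = set(ideal.lower().split())
    found = sum(1 for word in ideal_words if word in response.lower())
    return {"score": found / len(ideal_words)}
```

The unused ideal, messages, and **kwargs parameters are part of the contract: the runner passes them whether or not your metric needs them.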
CLI
Point at a Python file and name the function. See examples/custom_metrics.py for three ready-to-use examples.