Reference-based metrics
These compare the model’s output against a known-good ideal answer in the dataset.
ROUGE
Measures word overlap between the response and the reference. Best for summarisation
and open-ended generation tasks where phrasing can vary.
CLI: --metric rouge
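To make the overlap idea concrete, here is a minimal, illustrative ROUGE-1 F1 sketch in pure Python. It is not the tool's implementation (production ROUGE adds stemming and n-gram variants); it only shows what "word overlap between response and reference" means.

```python
from collections import Counter

def rouge1_f1(response: str, reference: str) -> float:
    # Unigram counts for each text (lower-cased, whitespace-tokenised)
    resp = Counter(response.lower().split())
    ref = Counter(reference.lower().split())
    # Overlap: each word counts at most as often as it appears in both
    overlap = sum((resp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("the cat sat on the mat", "the cat is on the mat")` scores 5/6, because five of six tokens on each side overlap.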
BLEU
N-gram precision with brevity penalty. More common in machine translation evals.
CLI: --metric bleu
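The brevity penalty is the part that most often surprises people, so here is an illustrative unigram-only sketch. Real BLEU averages clipped precisions over n-grams up to n=4; this simplified version only demonstrates how short responses get penalised.

```python
import math
from collections import Counter

def bleu1(response: str, reference: str) -> float:
    # Simplified unigram BLEU with brevity penalty (illustration only)
    resp = response.lower().split()
    ref = reference.lower().split()
    if not resp:
        return 0.0
    # Clipped precision: each response token counts at most as often
    # as it appears in the reference
    overlap = sum((Counter(resp) & Counter(ref)).values())
    precision = overlap / len(resp)
    # Brevity penalty: exp(1 - ref_len/resp_len) when the response is
    # shorter than the reference, 1.0 otherwise
    bp = 1.0 if len(resp) >= len(ref) else math.exp(1 - len(ref) / len(resp))
    return bp * precision
```

A perfectly precise but half-length response, e.g. `bleu1("the cat sat", "the cat sat on the mat")`, keeps precision 1.0 but is scaled down by the brevity penalty.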
Exact match
Binary score — 1.0 if the response matches the ideal exactly, 0.0 otherwise.
Useful for classification, short answers, and code generation with deterministic output.
CLI: --metric exact
Label-free evaluation
When your dataset has no reference answers, use LLMJudge or a CustomMetric.
The runner checks metric.requires_ideal against dataset.has_ideals() before any
API calls are made and raises a clear error naming each offending metric.
Reference-based metrics (RougeScore, BleuScore, ExactMatch) set requires_ideal = True
and will be rejected upfront on label-free datasets.
LLM-as-judge
Uses a separate model to evaluate response quality on configurable criteria. Works with or without a reference answer.
Multi-dimensional scoring
Score across multiple dimensions in a single API call. The overall score is the mean across dimensions; individual scores are available in result.sub_scores.
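The mean-over-dimensions aggregation can be sketched as follows. The JudgeResult class here is a hypothetical stand-in for the tool's actual result object; only the sub_scores name and the mean aggregation come from the description above.

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    sub_scores: dict  # per-dimension scores from a single judge call

    @property
    def score(self) -> float:
        # Overall score is the unweighted mean across dimensions
        return sum(self.sub_scores.values()) / len(self.sub_scores)

r = JudgeResult(sub_scores={"accuracy": 0.9, "helpfulness": 0.7, "tone": 0.8})
```

Here `r.score` is the mean of the three dimensions, 0.8, while each dimension stays individually inspectable via `r.sub_scores`.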
Custom criteria
Custom prompt template
For full control over the judge prompt, pass a .md file with these placeholders:
{criteria}, {conversation}, {response}, {ideal_section}.
examples/judge_prompt.md in the repo is a copy of the default template to start from.
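For illustration, here is how those four placeholders might be filled once the .md template is loaded. The template text and values below are made up; the real tool reads your file and performs the substitution itself.

```python
# Hypothetical template text; a real template lives in a .md file
template = (
    "Evaluate the response against these criteria: {criteria}\n\n"
    "Conversation:\n{conversation}\n\n"
    "Response:\n{response}\n"
    "{ideal_section}"
)

prompt = template.format(
    criteria="accuracy, tone",
    conversation="user: What is 2+2?",
    response="4",
    # ideal_section is typically empty on label-free datasets
    ideal_section="\nIdeal answer:\n4",
)
```

All four placeholders must appear in the file so the tool can substitute them; any literal braces in your template text would need doubling (`{{`, `}}`).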
CLI
Custom metrics
Pass any Python function that takes (response, ideal=None, messages=None, **kwargs)
and returns a float or a dict with a "score" key.
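A small example of a function matching that signature. The metric itself (keyword coverage against the ideal answer) is made up for illustration; only the signature and the two accepted return shapes come from the description above.

```python
def keyword_coverage(response, ideal=None, messages=None, **kwargs):
    """Illustrative custom metric: fraction of ideal-answer words that
    appear in the response. Returns a dict with a "score" key, one of
    the two accepted return shapes (the other is a bare float)."""
    if not ideal:
        return {"score": 0.0}
    ideal_words = set(ideal.lower().split())
    found = sum(1 for word in ideal_words if word in response.lower())
    return {"score": found / len(ideal_words)}
```

The unused ideal, messages, and **kwargs parameters are part of the contract: the runner passes them whether or not your metric needs them.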
CLI
Point at a Python file and name the function. See examples/custom_metrics.py for three ready-to-use examples.