aevyra-verdict runs completions against any combination of models, scores the responses with pluggable metrics, and returns structured results for comparison, from the terminal or in Python.

What it does

Given a dataset of prompts in OpenAI message format, aevyra-verdict:
  1. Sends each prompt to every model you’ve configured, concurrently
  2. Scores each response with your chosen metrics (ROUGE, BLEU, exact match, LLM-as-judge, or custom Python functions)
  3. Returns a comparison table with scores, latency, and token usage per model
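The input in step 1 is a dataset of prompts in the standard OpenAI chat message format. A minimal sketch of one dataset entry as a JSONL line (the `reference` field name here is illustrative, not necessarily aevyra-verdict's schema; check the Quick start for the exact layout):

```python
import json

# One dataset entry: a list of role/content messages in OpenAI chat format,
# plus an optional reference answer for metrics that compare against one.
entry = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "reference": "Paris",  # illustrative field name, not a confirmed schema
}

# JSONL datasets store one such JSON object per line.
line = json.dumps(entry)
decoded = json.loads(line)
```

Each line is self-contained, so datasets can be streamed and prompts dispatched to all configured models independently.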

When to use it

  • Choosing between models for a specific task
  • Catching regressions after a prompt or model change
  • Measuring the effect of system prompt variations
  • Benchmarking a locally running model against hosted APIs

Supported providers

OpenAI, Anthropic, Google (Gemini), Mistral, Cohere, OpenRouter, and any OpenAI-compatible API (vLLM, Ollama, Together, etc.).
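"OpenAI-compatible" means the server accepts the same chat completions request body at a `/v1/chat/completions` endpoint, so only the base URL differs between a hosted API and a local instance. A sketch of that shared request shape (the model name and localhost port below are placeholders):

```python
import json

# The same request body works against any OpenAI-compatible server;
# only the endpoint URL changes.
request_body = {
    "model": "llama-3-8b-instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Say hi"}],
    "temperature": 0.0,
}

hosted_url = "https://api.openai.com/v1/chat/completions"
local_url = "http://localhost:8000/v1/chat/completions"  # e.g. a local vLLM server

payload = json.dumps(request_body)
```

This is why tools like vLLM, Ollama, and Together can be benchmarked side by side with hosted providers: the wire format is identical.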

Quick start

Run your first eval in under 5 minutes

CLI reference

All commands and flags

Providers

Configure models and local instances

Metrics

ROUGE, BLEU, LLM-as-judge, custom functions
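At its simplest, a custom metric is a scoring function: given a model response and a reference, it returns a float in [0, 1]. The signature below is illustrative, not aevyra-verdict's confirmed interface; see the Metrics page for the exact contract:

```python
# Illustrative custom metric: the (response, reference) -> float signature
# is an assumption about the plugin interface, not documented API.
def exact_match(response: str, reference: str) -> float:
    """1.0 if the normalized response equals the reference, else 0.0."""
    return float(response.strip().lower() == reference.strip().lower())
```

Heuristics like this are cheap and deterministic; LLM-as-judge metrics trade that determinism for tolerance of paraphrased answers.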