aevyra-verdict runs completions against any combination of models, scores the responses with pluggable metrics, and returns structured results for comparison, from the terminal or in Python.

What it does

Given a dataset of prompts in OpenAI message format, aevyra-verdict:
  1. Sends each prompt to every model you’ve configured, concurrently
  2. Scores each response with your chosen metrics (ROUGE, BLEU, exact match, LLM-as-judge, or custom Python functions)
  3. Returns a comparison table with scores, latency, and token usage per model
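The input in step 1 is a dataset of prompts in the standard OpenAI chat message format. A minimal sketch of one dataset entry as a JSONL line (the `reference` field name here is illustrative, not necessarily aevyra-verdict's schema; check the Quick start for the exact layout):

```python
import json

# One dataset entry: a list of role/content messages in OpenAI chat format,
# plus an optional reference answer for metrics that compare against one.
entry = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "reference": "Paris",  # illustrative field name, not a confirmed schema
}

# JSONL datasets store one such JSON object per line.
line = json.dumps(entry)
decoded = json.loads(line)
```

Each line is self-contained, so datasets can be streamed and prompts dispatched to all configured models independently.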

When to use it

  • Choosing between models for a specific task
  • Catching regressions after a prompt or model change
  • Measuring the effect of system prompt variations
  • Benchmarking a locally running model against hosted APIs

Supported providers

OpenAI, Anthropic, Google (Gemini), Mistral, Cohere, OpenRouter, and any OpenAI-compatible API (vLLM, Ollama, Together, etc.).
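"OpenAI-compatible" means the server accepts the same chat completions request body at a `/v1/chat/completions` endpoint, so only the base URL differs between a hosted API and a local instance. A sketch of that shared request shape (the model name and localhost port below are placeholders):

```python
import json

# The same request body works against any OpenAI-compatible server;
# only the endpoint URL changes.
request_body = {
    "model": "llama-3-8b-instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Say hi"}],
    "temperature": 0.0,
}

hosted_url = "https://api.openai.com/v1/chat/completions"
local_url = "http://localhost:8000/v1/chat/completions"  # e.g. a local vLLM server

payload = json.dumps(request_body)
```

This is why tools like vLLM, Ollama, and Together can be benchmarked side by side with hosted providers: the wire format is identical.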

Quick start

Run your first eval in under 5 minutes

CLI reference

All commands and flags

Providers

Configure models and local instances

Metrics

ROUGE, BLEU, LLM-as-judge, custom functions
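At its simplest, a custom metric is a scoring function: given a model response and a reference, it returns a float in [0, 1]. The signature below is illustrative, not aevyra-verdict's confirmed interface; see the Metrics page for the exact contract:

```python
# Illustrative custom metric: the (response, reference) -> float signature
# is an assumption about the plugin interface, not documented API.
def exact_match(response: str, reference: str) -> float:
    """1.0 if the normalized response equals the reference, else 0.0."""
    return float(response.strip().lower() == reference.strip().lower())
```

Heuristics like this are cheap and deterministic; LLM-as-judge metrics trade that determinism for tolerance of paraphrased answers.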