What it does
Given a dataset of prompts in OpenAI message format, aevyra-verdict:

- Sends each prompt to every model you’ve configured, concurrently
- Scores each response with your chosen metrics (ROUGE, BLEU, exact match, LLM-as-judge, or custom Python functions)
- Returns a comparison table with scores, latency, and token usage per model
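The dataset records and custom metrics above can be sketched in a few lines. The `reference` field name and the metric signature below are illustrative assumptions, not the tool's documented schema; only the `messages` structure is the standard OpenAI message format.

```python
import json

# Hypothetical eval dataset record: one JSON object per line (JSONL).
# "messages" follows the OpenAI message format; "reference" is an
# assumed field name for the expected answer.
record = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "reference": "Paris",
}
print(json.dumps(record))

# A custom metric as a plain Python function. This signature is an
# assumption for illustration, not aevyra-verdict's documented interface.
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 when the normalized strings match, else 0.0."""
    return float(response.strip().casefold() == reference.strip().casefold())

print(exact_match("Paris", "  paris "))  # 1.0
```

Normalizing with `strip()` and `casefold()` keeps the metric from penalizing trivial whitespace or casing differences.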
When to use it
- Choosing between models for a specific task
- Catching regressions after a prompt or model change
- Measuring the effect of system prompt variations
- Benchmarking a locally-running model against hosted APIs
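The concurrent fan-out described under "What it does" can be sketched with `asyncio`. Here `query_model` is a stub standing in for a real provider call, not part of aevyra-verdict's API:

```python
import asyncio

async def query_model(model: str, prompt: str) -> dict:
    """Stub provider call; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return {"model": model, "response": f"echo: {prompt}"}

async def fan_out(models: list[str], prompt: str) -> list[dict]:
    # Dispatch the same prompt to every model concurrently;
    # gather() preserves the input order of the models.
    return await asyncio.gather(*(query_model(m, prompt) for m in models))

results = asyncio.run(fan_out(["model-a", "model-b"], "hello"))
print([r["model"] for r in results])  # ['model-a', 'model-b']
```

Because the requests run concurrently rather than sequentially, total wall-clock time is bounded by the slowest model, not the sum of all models.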
Supported providers
OpenAI, Anthropic, Google (Gemini), Mistral, Cohere, OpenRouter, and any OpenAI-compatible API (vLLM, Ollama, Together, etc.).

Quick start
Run your first eval in under 5 minutes
CLI reference
All commands and flags
Providers
Configure models and local instances
Metrics
ROUGE, BLEU, LLM-as-judge, custom functions