Install

pip install aevyra-verdict

Set your API keys

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
Run aevyra-verdict providers to see which keys are configured.

Prepare a dataset

Create a JSONL file where each line is a conversation in OpenAI message format. Each line's ideal field holds the reference answer that reference-based scoring metrics compare model output against.
{"messages": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"messages": [{"role": "user", "content": "Explain binary search in one sentence."}], "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time."}
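If you are generating the dataset programmatically, a short Python sketch can write and sanity-check the JSONL before you run an eval (the file name here is just an example; use whatever path you pass to the CLI):

```python
import json

# Two example rows in the expected format: an OpenAI-style
# "messages" list plus an "ideal" reference answer for scoring.
rows = [
    {"messages": [{"role": "user", "content": "What is the capital of France?"}],
     "ideal": "Paris"},
    {"messages": [{"role": "user", "content": "Explain binary search in one sentence."}],
     "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time."},
]

with open("sample_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back and check each line parses and has the required keys.
with open("sample_data.jsonl") as f:
    parsed = [json.loads(line) for line in f]
for row in parsed:
    assert "messages" in row and "ideal" in row
```

A quick round-trip check like this catches the most common failure mode (a malformed line) before any API calls are made.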

Run your first eval

aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano -m qwen/qwen3.5-9b
You’ll see a progress bar and a comparison table when it finishes:
Eval: dataset | Metric: rouge_rougeL
------------------------------------------------------------------------
Model                                 Mean     Stdev    Latency   Errors
------------------------------------------------------------------------
openai/gpt-5.4-nano                 0.7823       N/A    312.4ms        0
qwen/qwen3.5-9b                     0.7541       N/A    289.1ms        0
------------------------------------------------------------------------
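The scores in the table are ROUGE-L, an overlap metric based on the longest common subsequence (LCS) between the model's answer and the reference. The tool presumably uses a standard library implementation; purely for intuition, here is a minimal sketch of the F-measure variant:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(candidate, reference):
    # Token-level ROUGE-L F-measure over whitespace-split words.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Note that a verbose but correct answer ("the capital is Paris" vs. reference "Paris") is penalized on precision, which is why short reference answers reward concise model output.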

Run against a local model

If you have Ollama running locally, you can benchmark against it without any API keys:
ollama pull llama3.1:8b
ollama pull mistral
aevyra-verdict run examples/sample_data.jsonl \
  -m local/llama3.1:8b \
  -m local/mistral \
  --base-url http://localhost:11434/v1
Or with a local vLLM instance:
aevyra-verdict run examples/sample_data.jsonl \
  -m openai/gpt-5.4-nano \
  -m local/meta-llama/Llama-3.1-8B-Instruct \
  --base-url http://localhost:8000/v1
This is useful for benchmarking a fine-tuned model against a hosted baseline before deciding whether to deploy it.
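If you don't already have a vLLM server running, one way to expose an OpenAI-compatible endpoint on port 8000 is vLLM's serve command (assuming vLLM is installed; the model name is just an example and must match the -m flag you pass above):

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```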

Save results

aevyra-verdict run examples/sample_data.jsonl -m openai/gpt-5.4-nano -o results.json
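The saved file can then be post-processed in a notebook or script. The exact schema of results.json isn't shown here, so the field names below ("results", "model", "mean") are assumptions for illustration only; adapt them to the actual output:

```python
import json

# Hypothetical schema: a list of per-model summaries. These field
# names are illustrative assumptions, not the tool's documented format.
sample = {
    "results": [
        {"model": "openai/gpt-5.4-nano", "mean": 0.7823, "errors": 0},
        {"model": "qwen/qwen3.5-9b", "mean": 0.7541, "errors": 0},
    ]
}
with open("results.json", "w") as f:
    json.dump(sample, f)

# Load the saved results and rank models by mean score.
with open("results.json") as f:
    data = json.load(f)
ranked = sorted(data["results"], key=lambda r: r["mean"], reverse=True)
for row in ranked:
    print(f'{row["model"]}: {row["mean"]:.4f}')
```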

Next steps

Compare more models

Use a config file to manage multiple models including local vLLM instances

Add an LLM judge

Score responses with an LLM judge instead of reference-based metrics