Installation

The CLI is included with the package.

run

Run evals on a dataset and print a comparison table.
Models
| Flag | Short | Description |
|---|---|---|
| --model | -m | Model in provider/model format. Repeat for multiple. |
| --config | -c | Path to a models config file (.yaml, .json, .toml). |
--model and --config are mutually exclusive.
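As a sketch, a models config file passed to --config might look like the following. The schema (a top-level models list) and the model names are illustrative assumptions; the actual format is not documented here.

```yaml
# Hypothetical models config — schema and model names are illustrative only.
models:
  - openai/gpt-4o-mini
  - anthropic/claude-sonnet-4
```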
Metrics
| Flag | Description |
|---|---|
| --metric | Built-in metric: rouge, bleu, or exact. Repeat for multiple. Default: rouge. |
| --judge | Add an LLM-as-judge using this model spec. |
| --judge-prompt | Path to a custom judge prompt template (.md or .txt). |
| --custom-metric | Custom scoring function in file.py:function_name format. Repeat for multiple. |
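The expected signature for a --custom-metric function is not documented here; a minimal sketch, assuming the function receives the model's prediction and the reference answer and returns a float score:

```python
# my_metrics.py — usable as: --custom-metric my_metrics.py:exact_ci
# The (prediction, reference) -> float signature is an assumption, not documented here.

def exact_ci(prediction: str, reference: str) -> float:
    """Case-insensitive exact match: 1.0 on a match, 0.0 otherwise."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
```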
Dataset field mapping
| Flag | Description |
|---|---|
| --input-field | Field name to use as the user message (for JSONL that doesn’t follow a standard schema). Example: --input-field question |
| --output-field | Field name to use as the reference answer. Omit for label-free datasets. Example: --output-field answer |
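As a sketch, mapping a non-standard JSONL record onto these flags (the command name evalcli is a placeholder for the actual CLI, and the record contents are illustrative):

```shell
# data.jsonl record (illustrative):
#   {"question": "What is the capital of France?", "answer": "Paris"}
# "evalcli" is a placeholder for the actual command name.
evalcli run data.jsonl --input-field question --output-field answer
```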
Output
| Flag | Short | Description |
|---|---|---|
| --output | -o | Save results as JSON to this path. |
Tuning
| Flag | Default | Description |
|---|---|---|
| --max-workers | 10 | Concurrent requests per model. Lower if hitting rate limits. |
| --temperature | 0.0 | Sampling temperature. |
| --max-tokens | 1024 | Max tokens per completion. |
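Putting the flags together, a sketch of a full invocation. The command name evalcli, the dataset path, and the model specs are placeholders, not documented names:

```shell
# "evalcli" is a placeholder for the actual command name.
evalcli run data.jsonl \
  --model openai/gpt-4o-mini \
  --model anthropic/claude-sonnet-4 \
  --metric rouge --metric exact \
  --max-workers 5 \
  --output results.json
```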