## optimize

The main command. Runs a baseline eval, optimizes the prompt, and verifies the
result.

```shell
aevyra-reflex optimize <dataset> <prompt> [OPTIONS]
```
### Arguments

| Argument | Description |
|---|---|
| dataset | Path to a JSONL file in verdict format |
| prompt | Path to a text file containing the system prompt |
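The verdict dataset schema is defined by the verdict tool itself and is not reproduced here. Purely to illustrate the one-record-per-line JSONL shape, a dataset might look like the following (the field names are assumptions for illustration, not the documented schema):

```jsonl
{"input": "Summarize the following support ticket: ...", "expected": "Customer reports ..."}
{"input": "Classify the sentiment of this review: ...", "expected": "negative"}
```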
### Options

| Flag | Default | Description |
|---|---|---|
| -m, --model | (required) | Model to optimize, in provider/model format. Examples: local/llama3.1:8b, openai/gpt-5.4-nano, openrouter/meta-llama/llama-3.1-8b-instruct |
| --target | — | Target model(s) to benchmark against. The best score becomes the threshold. Repeatable. Mutually exclusive with --verdict-results |
| --verdict-results | — | Path to a verdict results JSON file. Sets the threshold from the best model’s score. Mutually exclusive with --target |
| -s, --strategy | auto | Optimization strategy: auto, iterative, structural, pdo, fewshot (or any custom registered strategy) |
| --metric | rouge | Scoring metric: rouge, bleu, exact. Mutually exclusive with --judge |
| --judge | — | Use an LLM judge instead of automated metrics. Format: provider/model. Mutually exclusive with --metric |
| --judge-criteria | — | Path to a text file containing a custom evaluation rubric for the judge. Only used with --judge. Without this flag the judge uses a default accuracy/helpfulness/clarity/completeness rubric |
| --max-iterations | 10 | Maximum optimization iterations (total budget for auto) |
| --threshold | — | Explicit score threshold (0.0–1.0). Overrides --target and --verdict-results. Defaults to 0.85 if no target is set |
| --max-workers | 4 | Parallel workers for variant evaluation |
| --reasoning-model | claude-sonnet-4-20250514 | LLM for reasoning, in provider/model format. Supports ollama/, openai/, openrouter/, etc. |
| --reasoning-api-key | — | API key for the reasoning model. Also reads the REFLEX_REASONING_API_KEY env var |
| --reasoning-base-url | — | Base URL for self-hosted reasoning model endpoints |
| --source-model | — | The model family this prompt was originally written for (e.g. claude-sonnet, gpt-4o). Enables migration mode — the reasoning model adapts idioms (XML tags → Markdown, role framing, etc.) for the target model |
| --eval-runs | 1 | Number of eval passes to average for the baseline and final verification. Use 3–5 for noisy tasks or small datasets. Reports mean ± std and always tests significance |
| -o, --output | — | Save the optimized prompt to this file |
| --results-json | — | Save full results (iterations, scores, analysis) to JSON |
| --run-dir | .reflex/ | Directory for run history and checkpoints |
| --train-split | 0.65 | Fraction of data used for optimization. The rest is held out for baseline and final eval scores. Set to 1.0 to disable splitting |
| --val-split | 0.20 | Fraction of total data reserved as a validation set, carved from the training portion. Val scores are tracked per iteration to detect overfitting. Set to 0.0 to disable |
| --early-stopping-patience | 3 | Stop optimization early if the val score has not improved for N consecutive iterations. Only active when --val-split > 0. Set to 0 to disable |
| --batch-size | 0 | Mini-batch size for each optimization iteration. 0 = full training set. When > 0, each iteration samples this many examples at random. Baseline and final evals are unaffected |
| --full-eval-steps | 0 | When using --batch-size, run a full training-set eval every N iterations. 0 = never. E.g. --batch-size 32 --full-eval-steps 5 gives accurate checkpoint scores on iters 5, 10, 15, … Full-eval iters are marked in the dashboard |
| --resume | false | Resume the latest interrupted run for this dataset |
| --resume-from | — | Resume a specific run by ID (e.g. 001) or directory path |
| -v, --verbose | false | Show detailed logs including timing |
| --version | — | Show version and exit |
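To make the interaction between --train-split and --val-split concrete, here is a small Python sketch (illustrative only, not the tool's actual code) that computes the split sizes under the documented semantics: --train-split is the fraction of the dataset available for optimization, and --val-split is a fraction of the *total* dataset that is carved out of that training portion.

```python
def split_sizes(n, train_split=0.65, val_split=0.20):
    """Compute (train, val, holdout) sizes for a dataset of n examples,
    mirroring the documented --train-split / --val-split semantics."""
    train_pool = round(n * train_split)  # examples available for optimization
    holdout = n - train_pool             # held out for baseline + final eval
    val = round(n * val_split)           # validation set, carved from the pool
    train = train_pool - val             # what each iteration optimizes on
    return train, val, holdout

print(split_sizes(200))  # (90, 40, 70)
```

With the defaults and 200 examples, 130 go to the training pool, 40 of those become the validation set, and 70 are held out for the baseline and final scores.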
### Examples

```shell
# Optimize llama to match gpt-4o-mini (live benchmark sets the target)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --target openai/gpt-4o-mini \
  -o best_prompt.md
```

```shell
# Multiple targets — the best score wins
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --target openai/gpt-4o-mini \
  --target openai/gpt-4o \
  -o best_prompt.md
```

```shell
# Use existing verdict results as the target
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --verdict-results results.json \
  -o best_prompt.md
```

```shell
# Explicit threshold (overrides --target and --verdict-results)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --threshold 0.90 \
  -o best_prompt.md
```

```shell
# Auto strategy with defaults (threshold defaults to 0.85)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --max-workers 4 \
  -o best_prompt.md
```

```shell
# PDO with more rounds
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  -s pdo \
  --max-iterations 50 \
  --results-json results.json
```

```shell
# Judge-only scoring (no ROUGE)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m openrouter/meta-llama/llama-3.1-8b-instruct \
  --judge openrouter/openai/gpt-4o-mini \
  -o best_prompt.md
```

```shell
# Judge with a custom evaluation rubric
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m anthropic/claude-haiku-4-5-20251001 \
  --judge anthropic/claude-sonnet-4-6 \
  --judge-criteria rubric.md \
  -o best_prompt.md
```

```shell
# Use a local Ollama model for reasoning
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.2:1b \
  --reasoning-model ollama/llama3.3:70b \
  -o best_prompt.md
```

```shell
# Use OpenAI for reasoning instead of Claude
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model openai/gpt-4o \
  -o best_prompt.md
```

```shell
# Self-hosted reasoning model (vLLM, TGI, etc.)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model openai/my-model \
  --reasoning-base-url http://localhost:8000/v1
```
### Resume examples

```shell
# Resume the latest interrupted run for this dataset
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume
```

```shell
# Resume a specific run by ID
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume-from 003
```

```shell
# Use a custom run directory
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --run-dir ./my-experiments/.reflex \
  -o best_prompt.md
```
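If you drive these runs from scripts, a thin wrapper that assembles the argv list keeps the flag spelling in one place. This is an illustrative sketch, not part of the tool; it only builds the command, and the actual invocation is left commented out:

```python
import subprocess

def build_optimize_cmd(dataset, prompt, model, *, target=None, output=None,
                       extra=()):
    """Assemble an `aevyra-reflex optimize` invocation as an argv list.
    Flag names mirror the options table; anything else goes in `extra`."""
    cmd = ["aevyra-reflex", "optimize", dataset, prompt, "-m", model]
    for t in (target or []):      # --target is repeatable
        cmd += ["--target", t]
    if output:
        cmd += ["-o", output]
    cmd += list(extra)
    return cmd

cmd = build_optimize_cmd("dataset.jsonl", "prompt.md", "local/llama3.1:8b",
                         target=["openai/gpt-4o-mini"], output="best_prompt.md")
# subprocess.run(cmd, check=True)  # uncomment to actually run the CLI
```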
## runs

List all optimization runs and their status.

```shell
aevyra-reflex runs [OPTIONS]
```

### Options

| Flag | Default | Description |
|---|---|---|
| --run-dir | .reflex/ | Directory for run history |
| -v, --verbose | false | Show config details for each run |
### Output

```
ID   Status         Strategy    Iters  Baseline  Best    Final   Dataset
------------------------------------------------------------------------------------------
001  ✓ completed    auto        5      0.5821    0.8612  0.8612  dataset.jsonl
002  ⚡ interrupted  iterative   3      0.6100    0.7450  —       dataset.jsonl
003  … running      structural  1      —         —       —       other.jsonl
```

Status icons: ✓ completed, ⚡ interrupted (resumable), … running.
## dashboard

Launch a local web UI for exploring optimization runs.

```shell
aevyra-reflex dashboard [OPTIONS]
```

### Options

| Flag | Default | Description |
|---|---|---|
| --run-dir | .reflex/ | Directory for run history |
| -p, --port | 8128 | Port to serve on |
| --host | 127.0.0.1 | Bind address |
| --no-open | false | Don’t open the browser automatically |
The dashboard shows all runs with score trajectory charts, per-iteration
prompt diffs, reasoning analysis, and configuration snapshots. It’s a
read-only view backed by the same .reflex/ directory that optimize
and runs use.
## Model format

Models are specified as provider/model:

| Provider | Format | API key env var |
|---|---|---|
| Local (Ollama) | local/llama3.1:8b | — |
| OpenAI | openai/gpt-5.4-nano | OPENAI_API_KEY |
| OpenRouter | openrouter/meta-llama/llama-3.1-8b-instruct | OPENROUTER_API_KEY |
| Together | together/meta-llama/Llama-3.1-8B-Instruct | TOGETHER_API_KEY |
| Fireworks | fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct | FIREWORKS_API_KEY |
| Groq | groq/llama-3.1-8b-instant | GROQ_API_KEY |
| DeepInfra | deepinfra/meta-llama/Llama-3.1-8B-Instruct | DEEPINFRA_API_KEY |
Provider aliases resolve to OpenAI-compatible endpoints automatically. No need
to manually set OPENAI_BASE_URL.
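One subtlety of the provider/model format: model names may themselves contain slashes (as in the OpenRouter row above), so a spec must be split on the first slash only. The sketch below illustrates that parsing rule along with the env-var table; the mapping is taken from the table above, but the tool's internal resolution logic may differ.

```python
# Assumed provider → env var mapping, copied from the table above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "together": "TOGETHER_API_KEY",
    "fireworks": "FIREWORKS_API_KEY",
    "groq": "GROQ_API_KEY",
    "deepinfra": "DEEPINFRA_API_KEY",
}

def parse_model_spec(spec):
    """Split a provider/model spec on the FIRST slash only, since model
    names can contain slashes (e.g. openrouter/meta-llama/...)."""
    provider, _, model = spec.partition("/")
    return provider, model, ENV_VARS.get(provider)  # env var is None for local

print(parse_model_spec("openrouter/meta-llama/llama-3.1-8b-instruct"))
# ('openrouter', 'meta-llama/llama-3.1-8b-instruct', 'OPENROUTER_API_KEY')
```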