
optimize

The main command. Runs a baseline eval, optimizes the prompt, and verifies the result.
aevyra-reflex optimize <dataset> <prompt> [OPTIONS]

Arguments

| Argument | Description |
| --- | --- |
| `dataset` | Path to a JSONL file in verdict format |
| `prompt` | Path to a text file containing the system prompt |
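A verdict-format dataset is a JSONL file: one JSON object per line. The sketch below writes a minimal two-example file. The field names (`input`, `expected`) are assumptions for illustration only, not confirmed by this reference; check the verdict format documentation for the real schema.

```python
import json

# Hypothetical verdict-format examples. The "input"/"expected" field
# names are an assumption, not the documented schema.
examples = [
    {"input": "Summarize: The quick brown fox jumps over the lazy dog.",
     "expected": "A fox jumps over a dog."},
    {"input": "Summarize: Rain fell steadily over the city all day.",
     "expected": "It rained all day."},
]

# JSONL: one compact JSON object per line, newline-terminated.
with open("dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```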

Options

| Flag | Default | Description |
| --- | --- | --- |
| `-m, --model` | (required) | Model to optimize, in `provider/model` format. Examples: `local/llama3.1:8b`, `openai/gpt-5.4-nano`, `openrouter/meta-llama/llama-3.1-8b-instruct` |
| `--target` | – | Target model(s) to benchmark against. The best score becomes the threshold. Repeatable. Mutually exclusive with `--verdict-results` |
| `--verdict-results` | – | Path to a verdict results JSON file. Sets the threshold from the best model's score. Mutually exclusive with `--target` |
| `-s, --strategy` | `auto` | Optimization strategy: `auto`, `iterative`, `structural`, `pdo`, `fewshot` (or any custom registered strategy) |
| `--metric` | `rouge` | Scoring metric: `rouge`, `bleu`, `exact`. Mutually exclusive with `--judge` |
| `--judge` | – | Use an LLM judge instead of automated metrics. Format: `provider/model`. Mutually exclusive with `--metric` |
| `--judge-criteria` | – | Path to a text file containing a custom evaluation rubric for the judge. Only used with `--judge`. Without this flag the judge uses a default accuracy/helpfulness/clarity/completeness rubric |
| `--max-iterations` | 10 | Maximum optimization iterations (total budget for `auto`) |
| `--threshold` | – | Explicit score threshold (0.0–1.0). Overrides `--target` and `--verdict-results`. Defaults to 0.85 if no target is set |
| `--max-workers` | 4 | Parallel workers for variant evaluation |
| `--reasoning-model` | `claude-sonnet-4-20250514` | LLM for reasoning, in `provider/model` format. Supports `ollama/`, `openai/`, `openrouter/`, etc. |
| `--reasoning-api-key` | – | API key for the reasoning model. Also reads the `REFLEX_REASONING_API_KEY` env var |
| `--reasoning-base-url` | – | Base URL for self-hosted reasoning model endpoints |
| `--source-model` | – | The model family this prompt was originally written for (e.g. `claude-sonnet`, `gpt-4o`). Enables migration mode: the reasoning model adapts idioms (XML tags → Markdown, role framing, etc.) for the target model |
| `--eval-runs` | 1 | Number of eval passes to average for the baseline and final verification. Use 3–5 for noisy tasks or small datasets. Reports mean ± std and always tests significance |
| `-o, --output` | – | Save the optimized prompt to this file |
| `--results-json` | – | Save full results (iterations, scores, analysis) to JSON |
| `--run-dir` | `.reflex/` | Directory for run history and checkpoints |
| `--train-split` | 0.65 | Fraction of data used for optimization. The rest is held out for baseline and final eval scores. Set to 1.0 to disable splitting |
| `--val-split` | 0.20 | Fraction of total data reserved as a validation set, carved from the training portion. Val scores are tracked per iteration to detect overfitting. Set to 0.0 to disable |
| `--early-stopping-patience` | 3 | Stop optimization early if the val score has not improved for N consecutive iterations. Only active when `--val-split` > 0. Set to 0 to disable |
| `--batch-size` | 0 | Mini-batch size for each optimization iteration. 0 = full training set. When > 0, each iteration samples this many examples at random. Baseline and final evals are unaffected |
| `--full-eval-steps` | 0 | When using `--batch-size`, run a full training-set eval every N iterations. 0 = never. E.g. `--batch-size 32 --full-eval-steps 5` gives accurate checkpoint scores on iterations 5, 10, 15, … Full-eval iterations are marked in the dashboard |
| `--resume` | false | Resume the latest interrupted run for this dataset |
| `--resume-from` | – | Resume a specific run by ID (e.g. `001`) or directory path |
| `-v, --verbose` | false | Show detailed logs, including timing |
| `--version` | – | Show version and exit |
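The three split flags interact in a way worth spelling out: `--val-split` is a fraction of the *total* dataset, but those examples are carved out of the `--train-split` portion. A rough sketch of the resulting sizes, assuming simple truncation (the tool's actual rounding behavior may differ):

```python
# Sketch of how --train-split and --val-split divide a dataset.
# Rounding via int() truncation is an assumption for illustration.
def split_sizes(n_examples, train_split=0.65, val_split=0.20):
    train = int(n_examples * train_split)   # optimization portion
    val = int(n_examples * val_split)       # fraction of TOTAL data...
    train -= val                            # ...carved out of the training portion
    holdout = n_examples - train - val      # baseline / final eval set
    return train, val, holdout

print(split_sizes(100))  # → (45, 20, 35)
```

So with the defaults, a 100-example dataset yields 45 training examples, 20 validation examples, and 35 held out for the baseline and final eval.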

Examples

# Optimize llama to match gpt-4o-mini (live benchmark sets the target)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --target openai/gpt-4o-mini \
  -o best_prompt.md

# Multiple targets — the best score wins
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --target openai/gpt-4o-mini \
  --target openai/gpt-4o \
  -o best_prompt.md

# Use existing verdict results as the target
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --verdict-results results.json \
  -o best_prompt.md

# Explicit threshold (overrides --target and --verdict-results)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --threshold 0.90 \
  -o best_prompt.md

# Auto strategy with defaults (threshold defaults to 0.85)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --max-workers 4 \
  -o best_prompt.md

# PDO with more rounds
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  -s pdo \
  --max-iterations 50 \
  --results-json results.json

# Judge-only scoring (no ROUGE)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m openrouter/meta-llama/llama-3.1-8b-instruct \
  --judge openrouter/openai/gpt-4o-mini \
  -o best_prompt.md

# Judge with a custom evaluation rubric
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m anthropic/claude-haiku-4-5-20251001 \
  --judge anthropic/claude-sonnet-4-6 \
  --judge-criteria rubric.md \
  -o best_prompt.md

# Use a local Ollama model for reasoning
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.2:1b \
  --reasoning-model ollama/llama3.3:70b \
  -o best_prompt.md

# Use OpenAI for reasoning instead of Claude
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model openai/gpt-4o \
  -o best_prompt.md

# Self-hosted reasoning model (vLLM, TGI, etc.)
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model openai/my-model \
  --reasoning-base-url http://localhost:8000/v1

Resume examples

# Resume the latest interrupted run for this dataset
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume

# Resume a specific run by ID
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume-from 003

# Use a custom run directory
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --run-dir ./my-experiments/.reflex \
  -o best_prompt.md

runs

List all optimization runs and their status.
aevyra-reflex runs [OPTIONS]

Options

| Flag | Default | Description |
| --- | --- | --- |
| `--run-dir` | `.reflex/` | Directory for run history |
| `-v, --verbose` | false | Show config details for each run |

Output

  ID  Status        Strategy      Iters  Baseline      Best     Final  Dataset
------------------------------------------------------------------------------------------
 001  ✓ completed   auto              5    0.5821    0.8612    0.8612  dataset.jsonl
 002  ⚡ interrupted  iterative        3    0.6100    0.7450         —  dataset.jsonl
 003  … running     structural        1         —         —         —  other.jsonl
Status icons: completed, interrupted (resumable), running.

dashboard

Launch a local web UI for exploring optimization runs.
aevyra-reflex dashboard [OPTIONS]

Options

| Flag | Default | Description |
| --- | --- | --- |
| `--run-dir` | `.reflex/` | Directory for run history |
| `-p, --port` | 8128 | Port to serve on |
| `--host` | 127.0.0.1 | Bind address |
| `--no-open` | false | Don’t open the browser automatically |
The dashboard shows all runs with score trajectory charts, per-iteration prompt diffs, reasoning analysis, and configuration snapshots. It’s a read-only view backed by the same .reflex/ directory that optimize and runs use.

Provider format

Models are specified as `provider/model`:

| Provider | Format | API key env var |
| --- | --- | --- |
| Local (Ollama) | `local/llama3.1:8b` | – |
| OpenAI | `openai/gpt-5.4-nano` | `OPENAI_API_KEY` |
| OpenRouter | `openrouter/meta-llama/llama-3.1-8b-instruct` | `OPENROUTER_API_KEY` |
| Together | `together/meta-llama/Llama-3.1-8B-Instruct` | `TOGETHER_API_KEY` |
| Fireworks | `fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct` | `FIREWORKS_API_KEY` |
| Groq | `groq/llama-3.1-8b-instant` | `GROQ_API_KEY` |
| DeepInfra | `deepinfra/meta-llama/Llama-3.1-8B-Instruct` | `DEEPINFRA_API_KEY` |
Provider aliases resolve to OpenAI-compatible endpoints automatically. No need to manually set OPENAI_BASE_URL.
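Note that the model part of a spec may itself contain slashes (e.g. OpenRouter IDs), so only the first slash separates provider from model. The function below is a hypothetical sketch of that convention, mirroring the env-var table above; it is not the tool's actual implementation.

```python
# Env-var mapping taken from the provider table; "local" (Ollama) needs no key.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "together": "TOGETHER_API_KEY",
    "fireworks": "FIREWORKS_API_KEY",
    "groq": "GROQ_API_KEY",
    "deepinfra": "DEEPINFRA_API_KEY",
}

def parse_model(spec):
    # Split on the FIRST "/" only: model IDs may contain further slashes,
    # e.g. openrouter/meta-llama/llama-3.1-8b-instruct.
    provider, model = spec.split("/", 1)
    return provider, model, ENV_VARS.get(provider)  # None for local models

print(parse_model("openrouter/meta-llama/llama-3.1-8b-instruct"))
# → ('openrouter', 'meta-llama/llama-3.1-8b-instruct', 'OPENROUTER_API_KEY')
```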