Setting the target score
The score threshold tells reflex when to stop — “make my model at least this
good.” Rather than picking an arbitrary number, you should set the target from
a real benchmark.
There are three ways to set the target, in priority order:
1. Verdict results file. If you’ve already run aevyra-verdict to compare
models, pass the results file and reflex uses the best model’s score:
# Step 1: benchmark with verdict
aevyra-verdict run data.jsonl \
-m openai/gpt-4o-mini -m local/llama3.1:8b \
-o results.json
# Step 2: optimize — target is automatically gpt-4o-mini's score
aevyra-reflex optimize data.jsonl prompt.md \
-m local/llama3.1:8b \
--verdict-results results.json \
-o best_prompt.md
Python API:
optimizer.set_target_from_verdict("results.json")
2. Target models. Pass --target models and reflex benchmarks them first,
then optimizes your model to match the best:
aevyra-reflex optimize data.jsonl prompt.md \
-m local/llama3.1:8b \
--target openai/gpt-4o-mini \
--target openai/gpt-4o \
-o best_prompt.md
Reflex runs all models (yours + targets) against the dataset, picks the
highest-scoring one, and uses its score as the threshold.
Python API:
benchmark = optimizer.benchmark_and_set_target(
prompt, all_providers,
)
3. Explicit threshold. Set the number directly. This overrides
--verdict-results and --target:
aevyra-reflex optimize data.jsonl prompt.md \
-m local/llama3.1:8b \
--threshold 0.90
If no target source is set, defaults to 0.85.
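The precedence above can be sketched as a tiny resolver. The function and argument names here are illustrative, not reflex internals:

```python
def resolve_threshold(threshold=None, verdict_best=None, target_best=None):
    """Sketch of target-score precedence (hypothetical helper, not the
    library's internals): an explicit --threshold wins, then the best score
    from a --verdict-results file, then --target benchmarking, then 0.85."""
    if threshold is not None:        # --threshold overrides everything
        return threshold
    if verdict_best is not None:     # best model's score from results.json
        return verdict_best
    if target_best is not None:      # best score from --target benchmarking
        return target_best
    return 0.85                      # documented default

print(resolve_threshold())                                   # 0.85
print(resolve_threshold(verdict_best=0.92))                  # 0.92
print(resolve_threshold(threshold=0.90, verdict_best=0.92))  # 0.90
```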
OptimizerConfig
All configuration goes through OptimizerConfig:
from aevyra_reflex import OptimizerConfig
config = OptimizerConfig(
strategy="auto", # auto, iterative, structural, pdo, fewshot
max_iterations=10, # total iteration budget
score_threshold=0.85, # stop when score reaches this
train_ratio=0.8, # train fraction; with val_ratio=0.1, defaults give 70/10/20
val_ratio=0.1, # fraction reserved for validation (0 = disabled)
early_stopping_patience=3, # stop when val score stagnates for N iters (0 = disabled)
batch_size=0, # 0 = full training set; >0 = examples per iteration
batch_seed=42, # base seed for mini-batch sampling
full_eval_steps=0, # full-set checkpoint every N iters (0 = disabled)
max_workers=4, # parallel workers for variant evaluation
eval_runs=1, # eval passes to average (1 = single pass)
reasoning_model="claude-sonnet-4-20250514",
eval_temperature=0.0, # temperature for target model completions
extra_kwargs={}, # strategy-specific parameters
)
| Parameter | Default | Description |
|---|---|---|
| strategy | "auto" | Which optimization strategy to use (or any custom registered name) |
| max_iterations | 10 | Maximum iterations (total budget for auto across all phases) |
| score_threshold | 0.85 | Target score — optimization stops when reached. Set automatically by --target or --verdict-results |
| train_ratio | 0.8 | Fraction of examples used for optimization. The rest are held out for baseline and final eval. Set to 1.0 to disable splitting |
| val_ratio | 0.1 | Fraction of total examples reserved as a validation set, carved from the training portion. Used to detect overfitting mid-run. Set to 0.0 to disable |
| early_stopping_patience | 3 | Stop optimization early if val score has not improved for this many consecutive iterations. Only active when val_ratio > 0. Set to 0 to disable |
| batch_size | 0 | Mini-batch size per iteration. 0 = full training set. When > 0, each iteration samples this many examples at random from the training data |
| batch_seed | 42 | Base seed for mini-batch sampling. Iteration i uses batch_seed + i so every batch is different but the run is reproducible |
| full_eval_steps | 0 | When using mini-batch mode, run a full training-set eval every this many iterations. 0 = never (use mini-batch scores throughout). E.g. 5 runs a full eval on iterations 5, 10, 15, … Full-eval iterations are marked in the dashboard |
| max_workers | 4 | Thread pool size for parallel evaluation |
| eval_runs | 1 | Number of eval passes to run and average for the baseline and final verification. Use 3–5 for noisy tasks or small datasets. Does not affect optimization iterations |
| reasoning_model | "claude-sonnet-4-20250514" | LLM used for reasoning (failure analysis, prompt rewriting) |
| reasoning_provider | None | Provider for the reasoning model: "anthropic", "openai", "ollama", "gemini", or any alias ("openrouter", "groq", etc.). Auto-detected from the model name if None |
| reasoning_api_key | None | API key for the reasoning model. Falls back to the provider’s default env var |
| reasoning_base_url | None | Base URL for self-hosted or OpenAI-compatible reasoning model endpoints |
| eval_temperature | 0.0 | Temperature for target model completions |
| target_model | None | Label of the model whose score is the target (set automatically) |
| target_source | None | How the target was set: "verdict_json", "verdict_run", or "manual" |
| source_model | None | The model family this prompt was originally written for (e.g. "claude-sonnet", "gpt-4o"). Enables migration mode — the reasoning model adapts idioms for the target model. Also set via --source-model |
| extra_kwargs | {} | Strategy-specific parameters (see below) |
Train/test split
By default, reflex splits your dataset 80/20 before running. The 80% train
portion is used during the optimization loop — failing samples are drawn from
here and all iteration scores are computed on these examples. The 20% test
set is held out completely, and the baseline and final scores you see in the
results summary are computed on that held-out set only.
This matters because without a split, the same examples that drive rewrites
also determine whether the prompt “improved.” That inflates the reported
improvement, especially on small datasets where a handful of examples have
an outsized effect on the mean score.
# Default: 80/20 split (the banner shows "80 train / 20 test")
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1
# Larger test set (70/30)
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --train-split 0.7
# No split — all examples used for both optimization and eval
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --train-split 1.0
The split is deterministic (seeded at 42) so the same dataset always
produces the same partition.
With small datasets (fewer than ~20 examples), the test set may be too small
for stable scores. In that case, use --train-split 1.0 to disable splitting
and rely on the score trajectory to judge improvement instead.
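The split logic can be sketched in a few lines: shuffle once with a fixed seed, then cut at the train ratio. This is an illustration of the behavior described above, not reflex's actual implementation:

```python
import random

def train_test_split(examples, train_ratio=0.8, seed=42):
    """Deterministic split sketch: shuffle indices with a fixed seed,
    then cut at the train ratio. Same dataset + same seed = same split."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(examples) * train_ratio)
    train = [examples[i] for i in idx[:cut]]
    test = [examples[i] for i in idx[cut:]]
    return train, test

data = [f"ex{i}" for i in range(100)]
first = train_test_split(data)
second = train_test_split(data)
assert first == second            # same seed → same partition every run
print(len(first[0]), len(first[1]))  # 80 20
```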
Validation split and early stopping
Reflex supports a 3-way train / val / test split to detect overfitting
mid-run. Without a validation set, a prompt can score well on training
examples simply by overfitting to their specific patterns — the optimization
loop has no way to notice.
With a validation set, reflex evaluates each candidate prompt on the val
examples after every iteration. If the train score climbs but val plateaus
or declines, that’s a sign the prompt is fitting the training examples
specifically rather than generalizing. Early stopping on val plateau saves
you compute and returns the least-overfit prompt.
# Default: 70% train / 10% val / 20% test, early stopping patience=3
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1
# Disable val split
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --val-split 0.0
# Custom patience
aevyra-reflex optimize data.jsonl prompt.md \
-m local/llama3.1:8b \
--train-split 0.8 \
--val-split 0.1 \
--early-stopping-patience 5
Python API:
config = OptimizerConfig(
train_ratio=0.8,
val_ratio=0.1,
early_stopping_patience=3,
)
The validation split is deterministic (same seed as train/test). The summary
shows all three split sizes, a per-iteration val trajectory, and — if early
stopping triggered — marks which iteration was actually the best:
Train/val/test : 70 / 10 / 20 samples
Baseline score : 0.5500 (on 20-sample test set)
Final score : 0.7100 (on 20-sample test set)
Improvement : +0.1600 (+29.1%)
Iterations : 6
Early stopped : Yes (val score plateaued)
Train traj : 0.600 → 0.650 → 0.710 → 0.720 → 0.725 → 0.724
Val traj : 0.580 → 0.640 → 0.690 → 0.688 → 0.685 → 0.682
Use --early-stopping-patience 2 for fast iteration or 3–4 when
your dataset is small and single-iteration val scores are noisy.
With --early-stopping-patience 0, val scores are still tracked and
reported but optimization runs to completion.
The validation set adds one extra eval call per optimization iteration.
With val_ratio=0.1 on 100 examples you’re evaluating 10 extra samples
per iteration — usually inexpensive compared to the reasoning step.
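The plateau check can be sketched as a simple patience rule over the validation trajectory. This is an illustrative sketch of the behavior, not reflex's code:

```python
def should_stop(val_scores, patience=3):
    """Early-stopping sketch: stop once no score in the last `patience`
    iterations beats the best score seen before them (0 disables)."""
    if patience == 0 or len(val_scores) <= patience:
        return False
    best_so_far = max(val_scores[:-patience])
    return all(s <= best_so_far for s in val_scores[-patience:])

# Matches the val trajectory in the summary above: plateau after iter 3
traj = [0.580, 0.640, 0.690, 0.688, 0.685, 0.682]
print(should_stop(traj, patience=3))  # True — last 3 iters never beat 0.690
```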
Mini-batch mode
By default every optimization iteration evaluates the full training set. For
large datasets this can be slow — each iteration costs n_train LLM calls.
Mini-batch mode samples a random subset each round:
# 32 examples per iteration
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --batch-size 32
# Combine with train/test split
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b \
--train-split 0.8 --batch-size 40
config = OptimizerConfig(
batch_size=32, # examples per iteration; 0 = full training set (default)
batch_seed=42, # base seed — iteration i uses batch_seed + i
full_eval_steps=5, # full training-set eval every 5 iterations (0 = never)
)
Each iteration draws a fresh sample seeded by batch_seed + i, so the
optimizer sees different examples each round. The stochasticity can also
help escape local optima by smoothing out noise from individual examples.
Baseline and final verification evals always use the full test set —
batch_size only affects the per-iteration training evals used by the
optimization strategy.
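The per-iteration sampling described above can be sketched as follows (an illustration of the seeding scheme, not reflex's implementation):

```python
import random

def minibatch(train, batch_size, iteration, batch_seed=42):
    """Mini-batch sketch: iteration i seeds its own RNG with batch_seed + i,
    so every round sees a fresh sample but reruns are reproducible."""
    if batch_size == 0 or batch_size >= len(train):
        return list(train)            # 0 = full training set
    rng = random.Random(batch_seed + iteration)
    return rng.sample(train, batch_size)

train = list(range(80))
b1 = minibatch(train, 32, iteration=1)
b2 = minibatch(train, 32, iteration=2)
assert b1 != b2                              # different batch each round
assert b1 == minibatch(train, 32, iteration=1)  # but reproducible
```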
Periodic full-eval checkpoints
Mini-batch scores are noisy — a single good or bad batch can skew the
trajectory. Use --full-eval-steps to periodically score the full
training set for an accurate checkpoint:
# Batch of 32 per iter; full eval every 5 iterations
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b \
--batch-size 32 --full-eval-steps 5
Iterations 5, 10, 15, … score the full training set; all others score
the mini-batch. Full-eval iterations are marked with a ◈ full eval
badge in the dashboard flow graph.
A good starting --batch-size is 20–50% of your training set size. Too
small a batch makes the per-iteration score noisy; too large gives
diminishing returns over evaluating the full set.
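The checkpoint schedule is simple enough to state in code (sketch of the rule described above):

```python
def is_full_eval(iteration, full_eval_steps):
    """Checkpoint schedule sketch: full training-set eval every N
    iterations (0 = never; stick with mini-batch scores throughout)."""
    return full_eval_steps > 0 and iteration % full_eval_steps == 0

marked = [i for i in range(1, 16) if is_full_eval(i, 5)]
print(marked)  # [5, 10, 15]
```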
Statistical significance
Reflex reports whether the improvement from baseline to final is statistically
significant using a paired test on per-sample scores. This compares how
each individual sample scored under the original prompt vs the optimized one.
The result appears in result.summary():
Baseline score : 0.6200
Final score : 0.7450
Improvement : +0.1250 (+20.2%)
Significance : p=0.0234 ✓ significant (α=0.05, paired test)
The test uses the Wilcoxon signed-rank test (non-parametric, no normality
assumption) if scipy is installed, and falls back to a paired t-test otherwise.
Install scipy for best results:
pip install "aevyra-reflex[stats]"
For noisy tasks where LLM responses vary run-to-run, use --eval-runs to
average multiple passes before computing the test:
# Average 3 eval passes for baseline and final, then test significance
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--eval-runs 3
This reports mean ± std in the results summary:
Baseline score : 0.6200 ± 0.0180 (3 runs)
Final score : 0.7450 ± 0.0120 (3 runs)
--eval-runs only affects the baseline and final verification evals — not
the optimization iterations, which always use a single pass for speed.
--eval-runs 3 triples the cost of those two checkpoints.
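The exact Wilcoxon/t-test machinery needs scipy, but the paired idea is easy to illustrate without dependencies. Here is a sign-flip permutation test on per-sample score differences, a stand-in for the tests reflex actually uses, not its implementation:

```python
import random

def paired_permutation_test(baseline, final, n_perm=20000, seed=0):
    """Two-sided paired test sketch: randomly flip the sign of each
    per-sample difference and count how often the permuted total is as
    extreme as the observed one."""
    diffs = [f - b for b, f in zip(baseline, final)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = sum(
        abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(n_perm)
    )
    return hits / n_perm

base = [0.5, 0.6, 0.4, 0.7, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5]
best = [b + 0.12 for b in base]        # every sample improved
print(paired_permutation_test(base, best) < 0.05)  # True
```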
Strategy-specific parameters
Pass strategy-specific parameters via extra_kwargs:
Auto
config = OptimizerConfig(
strategy="auto",
max_iterations=20,
extra_kwargs={
"max_phases": 4,
"start_structural": True,
"min_phases": 2,
},
)
| Parameter | Default | Description |
|---|---|---|
| max_phases | 4 | Maximum number of optimization phases |
| start_structural | True | Always begin with structural optimization |
| min_phases | 2 | Minimum phases before convergence (prevents noisy early stops) |
PDO
config = OptimizerConfig(
strategy="pdo",
max_iterations=50,
extra_kwargs={
"duels_per_round": 3,
"samples_per_duel": 10,
"initial_pool_size": 6,
"thompson_alpha": 1.2,
"mutation_frequency": 5,
"num_top_to_mutate": 2,
"max_pool_size": 20,
# "auto" (default) learns which ranking method works best for
# this dataset over time via adaptive Dirichlet fusion.
# Other options: "fused", "copeland", "borda", "elo", "avg_winrate"
"ranking_method": "auto",
},
)
| Parameter | Default | Description |
|---|---|---|
| duels_per_round | 3 | Duels per round |
| samples_per_duel | 10 | Dataset samples per duel |
| initial_pool_size | 6 | Starting pool of candidate prompts |
| thompson_alpha | 1.2 | Thompson sampling exploration parameter |
| mutation_frequency | 5 | Mutate top prompts every N rounds |
| num_top_to_mutate | 2 | How many top prompts to mutate |
| max_pool_size | 20 | Maximum pool size before pruning |
| ranking_method | "auto" | How to rank the prompt pool each round. "auto" adaptively learns the best method; "fused" uses equal-weight fusion; "copeland", "borda", "elo", "avg_winrate" use a single method explicitly |
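Thompson sampling's role in duel selection can be illustrated with Beta posteriors over win rates. This is a generic sketch of the technique, not reflex's PDO code, and the prior-widening use of alpha here is an assumption:

```python
import random

def thompson_pick(records, rng, alpha=1.2):
    """Thompson-sampling sketch: each prompt's win/loss record gets a Beta
    posterior; sample one draw per prompt and duel the highest draw.
    A larger alpha flattens the prior, encouraging exploration of
    rarely-dueled prompts."""
    best, best_draw = None, -1.0
    for prompt, (wins, losses) in records.items():
        draw = rng.betavariate(wins + alpha, losses + alpha)
        if draw > best_draw:
            best, best_draw = prompt, draw
    return best

rng = random.Random(7)
records = {"A": (9, 1), "B": (1, 9), "C": (0, 0)}  # C is unexplored
picks = [thompson_pick(records, rng) for _ in range(1000)]
print(picks.count("A") > picks.count("B"))  # True — A's posterior dominates
```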
Few-shot
config = OptimizerConfig(
strategy="fewshot",
max_iterations=8,
extra_kwargs={
"max_examples": 5,
"candidate_pool_size": 20,
"bootstrap_rounds": 3,
"selection_strategy": "diverse",
},
)
| Parameter | Default | Description |
|---|---|---|
| max_examples | 5 | Examples to include in the prompt |
| candidate_pool_size | 20 | Exemplars to bootstrap |
| bootstrap_rounds | 3 | Re-bootstrap every N iterations |
| selection_strategy | "diverse" | Example selection method |
Structural
config = OptimizerConfig(
strategy="structural",
max_iterations=6,
extra_kwargs={
"variants_per_round": 4,
},
)
| Parameter | Default | Description |
|---|---|---|
| variants_per_round | 4 | Structural variants to try per iteration |
Parallel execution
Strategies like structural and PDO evaluate multiple prompt variants per
iteration. These variants are evaluated in parallel using threads.
For cloud APIs (OpenAI, OpenRouter, Together), parallelism works out of the
box:
aevyra-reflex optimize dataset.jsonl prompt.md \
-m openrouter/meta-llama/llama-3.1-8b-instruct \
--max-workers 4
For Ollama, you need to explicitly enable parallel inference — by default
Ollama processes one request at a time:
OLLAMA_NUM_PARALLEL=4 ollama serve &
OLLAMA_NUM_PARALLEL=4 aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--max-workers 4
If reflex detects Ollama without OLLAMA_NUM_PARALLEL set, it automatically
falls back to 1 worker and logs a warning with setup instructions.
Higher OLLAMA_NUM_PARALLEL uses more VRAM. Guidelines:
| Model size | VRAM | Suggested parallel |
|---|---|---|
| 1B | 4GB+ | 8 |
| 8B | 8GB+ | 4 |
| 70B | 48GB+ | 2 |
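The thread-based variant evaluation can be sketched with the standard library. The scoring function here is a toy stand-in for a real eval call against the target model:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_variants(variants, score_fn, max_workers=4):
    """Parallel-evaluation sketch: score several prompt variants at once
    with a thread pool (threads suit I/O-bound LLM calls). Returns the
    (score, variant) pair with the highest score."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, variants))
    return max(zip(scores, variants))

variants = ["short", "a bit longer", "the longest variant", "mid size"]
best = evaluate_variants(variants, score_fn=len)  # toy scorer: length
print(best)  # (19, 'the longest variant')
```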
Choosing a reasoning model
Reflex is an agent — it observes eval results, diagnoses failures, and
iteratively rewrites your prompt. To do this reasoning, it calls an LLM.
By default that’s Claude Sonnet, but you can swap in any model: a local
Ollama model, an OpenAI-compatible endpoint, or another cloud provider.
The reasoning model is separate from the model being optimized (-m).
A more capable reasoning model produces better prompt rewrites.
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.2:1b \
--reasoning-model ollama/llama3.3:70b
Python API:
config = OptimizerConfig(
reasoning_model="llama3.3:70b",
reasoning_provider="ollama",
)
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--reasoning-model openai/gpt-4o
Python API:
config = OptimizerConfig(
reasoning_model="gpt-4o",
reasoning_provider="openai",
)
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--reasoning-model gemini/gemini-2.0-flash
Requires GOOGLE_API_KEY. Routes through Google’s OpenAI-compatible
v1beta endpoint — no extra package needed beyond the default install.
gemini-2.5-pro is the strongest option for complex diagnostic reasoning.
Python API:
config = OptimizerConfig(
reasoning_model="gemini-2.0-flash",
reasoning_provider="gemini", # or omit — inferred from model name
)
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--reasoning-model openrouter/meta-llama/llama-3.1-70b-instruct
Provider aliases (openrouter, together, groq, gemini, etc.) work
the same way as for --target and -m. The API key is read from the
provider’s env var (e.g. OPENROUTER_API_KEY), or you can pass
--reasoning-api-key.
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--reasoning-model openai/my-model \
--reasoning-base-url http://localhost:8000/v1 \
--reasoning-api-key dummy
Any OpenAI-compatible server (vLLM, TGI, llama.cpp, etc.) works with
provider openai and a custom --reasoning-base-url.
The reasoning model needs strong analytical abilities. Local models work
well for simpler strategies (iterative), but the auto strategy benefits
from a more capable model since it makes multi-step decisions about which
optimization axes to apply.
Run persistence and checkpointing
Reflex automatically saves every run to a .reflex/ directory. Each
iteration is checkpointed so that if a run crashes or is interrupted, you
can resume exactly where it left off.
Directory structure
.reflex/
runs/
001_2026-04-04T10-32-15/
config.json # full config snapshot
baseline.json # baseline eval scores
checkpoint.json # resume state (updated each iteration)
best_prompt.md # current best prompt (always up to date)
iterations/
001.json # per-iteration state
002.json
003.json
result.json # only written when run completes
002_2026-04-05T14-10-00/
...
Each run gets a sequential ID (001, 002, …) and a timestamped
directory. The config, dataset path, and initial prompt are captured at
the start so every run is fully reproducible.
Resuming interrupted runs
If a run is interrupted (Ctrl-C, crash, network timeout), use --resume
to pick up where it left off:
# Resume the latest interrupted run matching this dataset
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--resume
# Resume a specific run by ID
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--resume-from 003
The checkpoint contains the full state: current best prompt, score
trajectory, completed iterations, baseline scores, and strategy-specific
state. On resume, reflex skips the baseline eval and jumps straight to
the next iteration.
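The checkpoint round-trip can be illustrated generically. The field names below are assumptions for illustration, not reflex's actual checkpoint schema:

```python
import json, os, tempfile

# Checkpoint round-trip sketch: save state each iteration, reload on resume.
state = {
    "completed_iterations": 4,
    "best_prompt": "You are a concise assistant...",
    "score_trajectory": [0.60, 0.65, 0.71, 0.72],
}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
with open(path, "w") as f:
    json.dump(state, f)            # updated after every iteration

with open(path) as f:              # on --resume: reload and continue
    resumed = json.load(f)
next_iteration = resumed["completed_iterations"] + 1
print(next_iteration)  # 5 — skip the baseline, continue from iteration 5
```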
Listing runs
This lists all runs with their status, strategy, scores, and iteration
count. Add -v for config details (reasoning model, target models).
Custom run directory
By default, runs are stored in .reflex/ in the current working
directory. Use --run-dir to change this:
aevyra-reflex optimize dataset.jsonl prompt.md \
-m local/llama3.1:8b \
--run-dir ./experiments/.reflex
Python API
from aevyra_reflex import PromptOptimizer, OptimizerConfig, RunStore
store = RunStore(root=".reflex")
# New run with checkpointing (assumes `optimizer` was constructed as usual)
result = optimizer.run(
initial_prompt,
run_store=store,
)
# Resume an interrupted run
incomplete = store.find_incomplete_run(dataset_path="data.jsonl")
if incomplete:
result = optimizer.run(
initial_prompt,
run_store=store,
resume_run=incomplete,
)
# List all runs
for run_summary in store.list_runs():
print(f"{run_summary.run_id}: {run_summary.status} — best: {run_summary.best_score}")
Metrics
By default, reflex uses ROUGE. You can switch to a different metric or use an
LLM judge:
# Explicit metric
aevyra-reflex optimize ... --metric bleu
# LLM judge (no automated metric)
aevyra-reflex optimize ... --judge openrouter/openai/gpt-4o-mini
# LLM judge with a custom evaluation rubric
aevyra-reflex optimize ... \
--judge anthropic/claude-sonnet-4-6 \
--judge-criteria rubric.md
--metric and --judge are mutually exclusive. If neither is specified,
ROUGE is used as the default.
--judge-criteria only applies when --judge is set. It accepts a path to a
plain text file describing your scoring rubric (1–5 scale). Use this when the
default accuracy/helpfulness/clarity/completeness criteria don’t match your
task — for example, when you need to enforce a strict output format, check
domain-specific correctness, or evaluate against a proprietary style guide.
An example rubric.md:
Score the response from 1 to 5:
5 — Exactly 3 sentences covering what happened, impact, and remediation.
4 — 3 sentences but missing one required detail (e.g. financial cost).
3 — Wrong number of sentences but factually accurate.
2 — Free-form prose with correct facts but ignores structure.
1 — Missing critical content or fabricated details.
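To show how a rubric like this feeds back into a 0–1 score, here is a hypothetical post-processing sketch: pull the first 1–5 rating out of a judge reply and normalize it. Reflex's actual judge parsing may differ; this only illustrates the rubric's scale:

```python
import re

def parse_judge_score(judge_reply, lo=1, hi=5):
    """Hypothetical sketch: extract the first standalone 1-5 digit from
    the judge's reply and map it onto 0-1 (1 -> 0.0, 5 -> 1.0)."""
    m = re.search(r"\b([1-5])\b", judge_reply)
    if not m:
        return None
    score = int(m.group(1))
    return (score - lo) / (hi - lo)

print(parse_judge_score("Score: 4 - three sentences, missing cost."))  # 0.75
```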