
Setting the target score

The score threshold tells reflex when to stop — “make my model at least this good.” Rather than picking an arbitrary number, you should set the target from a real benchmark. There are three ways to set the target, in priority order:

OptimizerConfig

All configuration goes through OptimizerConfig:
from aevyra_reflex import OptimizerConfig

config = OptimizerConfig(
    strategy="auto",                   # auto, iterative, structural, pdo, fewshot
    max_iterations=10,                 # total iteration budget
    score_threshold=0.85,              # stop when score reaches this
    train_ratio=0.8,                   # 70% train / 10% val / 20% test (default)
    val_ratio=0.1,                     # fraction reserved for validation (0 = disabled)
    early_stopping_patience=3,         # stop when val score stagnates for N iters (0 = disabled)
    batch_size=0,                      # 0 = full training set; >0 = examples per iteration
    batch_seed=42,                     # base seed for mini-batch sampling
    full_eval_steps=0,                 # full-set checkpoint every N iters (0 = disabled)
    max_workers=4,                     # parallel workers for variant evaluation
    eval_runs=1,                       # eval passes to average (1 = single pass)
    reasoning_model="claude-sonnet-4-20250514",
    eval_temperature=0.0,              # temperature for target model completions
    extra_kwargs={},                   # strategy-specific parameters
)
Parameter | Default | Description
strategy | "auto" | Which optimization strategy to use (or any custom registered name)
max_iterations | 10 | Maximum iterations (total budget for auto across all phases)
score_threshold | 0.85 | Target score — optimization stops when reached. Set automatically by --target or --verdict-results
train_ratio | 0.8 | Fraction of examples used for optimization. The rest are held out for baseline and final eval. Set to 1.0 to disable splitting
val_ratio | 0.1 | Fraction of total examples reserved as a validation set, carved from the training portion. Used to detect overfitting mid-run. Set to 0.0 to disable
early_stopping_patience | 3 | Stop optimization early if val score has not improved for this many consecutive iterations. Only active when val_ratio > 0. Set to 0 to disable
batch_size | 0 | Mini-batch size per iteration. 0 = full training set. When > 0, each iteration samples this many examples at random from the training data
batch_seed | 42 | Base seed for mini-batch sampling. Iteration i uses batch_seed + i so every batch is different but the run is reproducible
full_eval_steps | 0 | When using mini-batch mode, run a full training-set eval every this many iterations. 0 = never (use mini-batch scores throughout). E.g. 5 runs a full eval on iterations 5, 10, 15, … Full-eval iterations are marked in the dashboard
max_workers | 4 | Thread pool size for parallel evaluation
eval_runs | 1 | Number of eval passes to run and average for the baseline and final verification. Use 3–5 for noisy tasks or small datasets. Does not affect optimization iterations
reasoning_model | "claude-sonnet-4-20250514" | LLM used for reasoning (failure analysis, prompt rewriting)
reasoning_provider | None | Provider for the reasoning model: "anthropic", "openai", "ollama", "gemini", or any alias ("openrouter", "groq", etc.). Auto-detected from model name if None
reasoning_api_key | None | API key for the reasoning model. Falls back to the provider’s default env var
reasoning_base_url | None | Base URL for self-hosted or OpenAI-compatible reasoning model endpoints
eval_temperature | 0.0 | Temperature for target model completions
target_model | None | Label of the model whose score is the target (set automatically)
target_source | None | How the target was set: "verdict_json", "verdict_run", or "manual"
source_model | None | The model family this prompt was originally written for (e.g. "claude-sonnet", "gpt-4o"). Enables migration mode — the reasoning model adapts idioms for the target model. Also set via --source-model
extra_kwargs | {} | Strategy-specific parameters (see below)

Train/test split

By default, reflex splits your dataset 80/20 before running. The 80% train portion is used during the optimization loop — failing samples are drawn from here and all iteration scores are computed on these examples. The 20% test set is held out completely, and the baseline and final scores you see in the results summary are computed on that held-out set only. This matters because without a split, the same examples that drive rewrites also determine whether the prompt “improved.” That inflates the reported improvement, especially on small datasets where a handful of examples have an outsized effect on the mean score.
# Default: 80/20 split (the banner shows "80 train / 20 test")
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1

# Larger test set (70/30)
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --train-split 0.7

# No split — all examples used for both optimization and eval
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --train-split 1.0
The split is deterministic (seeded at 42) so the same dataset always produces the same partition.
With small datasets (fewer than ~20 examples), the test set may be too small for stable scores. In that case, use --train-split 1.0 to disable splitting and rely on the score trajectory to judge improvement instead.
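The mechanics of the deterministic split can be pictured in a few lines. This is an illustrative sketch, not the actual implementation — the seeded-shuffle approach is an assumption, but it reproduces the behavior described above (fixed seed 42, same partition every run):

```python
import random

def train_test_split(examples, train_ratio=0.8, seed=42):
    # Shuffle a copy with a fixed seed so the same dataset always
    # yields the same partition, then cut at the train ratio.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With train_ratio=1.0 the test slice is empty, which matches the --train-split 1.0 behavior above.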

Validation split and early stopping

Reflex supports a 3-way train / val / test split to detect overfitting mid-run. Without a validation set, a prompt can score well on training examples simply by overfitting to their specific patterns — the optimization loop has no way to notice. With a validation set, reflex evaluates each candidate prompt on the val examples after every iteration. If the train score climbs but val plateaus or declines, that’s a sign the prompt is fitting the training examples specifically rather than generalizing. Early stopping on val plateau saves you compute and returns the least-overfit prompt.
# Default: 70% train / 10% val / 20% test, early stopping patience=3
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1

# Disable val split
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --val-split 0.0

# Custom patience
aevyra-reflex optimize data.jsonl prompt.md \
  -m local/llama3.1:8b \
  --train-split 0.8 \
  --val-split 0.1 \
  --early-stopping-patience 5
Python API:
config = OptimizerConfig(
    train_ratio=0.8,
    val_ratio=0.1,
    early_stopping_patience=3,
)
The validation split is deterministic (same seed as train/test). The summary shows all three split sizes, a per-iteration val trajectory, and — if early stopping triggered — marks which iteration was actually the best:
  Train/val/test   : 70 / 10 / 20 samples
  Baseline score   : 0.5500  (on 20-sample test set)
  Final score      : 0.7100  (on 20-sample test set)
  Improvement      : +0.1600 (+29.1%)
  Iterations       : 6
  Early stopped    : Yes (val score plateaued)
  Train traj   : 0.600 → 0.650 → 0.710 → 0.720 → 0.725 → 0.724
  Val traj     : 0.580 → 0.640 → 0.690 → 0.688 → 0.685 → 0.682
Use --early-stopping-patience 2 for fast iteration or 3–4 when your dataset is small and single-iteration val scores are noisy. Without --early-stopping-patience, val scores are still tracked and reported but optimization runs to completion.
The validation set adds one extra eval call per optimization iteration. With val_ratio=0.1 on 100 examples you’re evaluating 10 extra samples per iteration — usually inexpensive compared to the reasoning step.
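The patience rule itself is simple to state. A sketch (the helper name is hypothetical and the real check may differ in details): stop once the best validation score is at least `patience` iterations in the past.

```python
def val_plateaued(val_scores, patience=3):
    # Trigger early stopping when the best validation score is at
    # least `patience` iterations old (no recent improvement).
    if patience <= 0 or len(val_scores) <= patience:
        return False
    best = max(range(len(val_scores)), key=val_scores.__getitem__)
    return len(val_scores) - 1 - best >= patience
```

Applied to the val trajectory in the summary above, the best score sits at iteration 3, so with patience 3 the check fires after iteration 6.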

Mini-batch mode

By default every optimization iteration evaluates the full training set. For large datasets this can be slow — each iteration costs n_train LLM calls. Mini-batch mode samples a random subset each round:
# 32 examples per iteration
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --batch-size 32

# Combine with train/test split
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b \
  --train-split 0.8 --batch-size 40
config = OptimizerConfig(
    batch_size=32,        # examples per iteration; 0 = full training set (default)
    batch_seed=42,        # base seed — iteration i uses batch_seed + i
    full_eval_steps=5,    # full training-set eval every 5 iterations (0 = never)
)
Each iteration draws a fresh sample seeded by batch_seed + i, so the optimizer sees different examples each round. The stochasticity can also help escape local optima by smoothing out noise from individual examples. Baseline and final verification evals always use the full test set; batch_size only affects the per-iteration training evals used by the optimization strategy.
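The seeded sampling can be sketched as follows (an assumed implementation detail, but consistent with the batch_seed + i description above):

```python
import random

def draw_batch(train_examples, batch_size, batch_seed, iteration):
    # batch_size 0 (or >= the dataset size) means the full training set.
    if batch_size <= 0 or batch_size >= len(train_examples):
        return list(train_examples)
    # Each iteration seeds its own RNG with batch_seed + iteration:
    # batches differ round to round, but rerunning reproduces them.
    return random.Random(batch_seed + iteration).sample(train_examples, batch_size)
```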

Periodic full-eval checkpoints

Mini-batch scores are noisy — a single good or bad batch can skew the trajectory. Use --full-eval-steps to periodically score the full training set for an accurate checkpoint:
# Batch of 32 per iter; full eval every 5 iterations
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b \
  --batch-size 32 --full-eval-steps 5
Iterations 5, 10, 15, … score the full training set; all others score the mini-batch. Full-eval iterations are marked with a ◈ full eval badge in the dashboard flow graph.
A good starting batch size is 20–50% of your training set. Too small a batch makes the per-iteration score noisy; too large gives diminishing returns over the full set.
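The checkpoint schedule reduces to a modulo check (sketch; the function name is illustrative, not part of the reflex API):

```python
def is_full_eval(iteration, full_eval_steps):
    # With --full-eval-steps 5, iterations 5, 10, 15, ... run the full
    # training-set eval; 0 disables full-eval checkpoints entirely.
    return full_eval_steps > 0 and iteration > 0 and iteration % full_eval_steps == 0
```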

Statistical significance

Reflex reports whether the improvement from baseline to final is statistically significant using a paired test on per-sample scores. This compares how each individual sample scored under the original prompt vs the optimized one. The result appears in result.summary():
  Baseline score   : 0.6200
  Final score      : 0.7450
  Improvement      : +0.1250 (+20.2%)
  Significance     : p=0.0234  ✓ significant (α=0.05, paired test)
The test uses the Wilcoxon signed-rank test (non-parametric, no normality assumption) if scipy is installed, and falls back to a paired t-test otherwise. Install scipy for best results:
pip install "aevyra-reflex[stats]"
For noisy tasks where LLM responses vary run-to-run, use --eval-runs to average multiple passes before computing the test:
# Average 3 eval passes for baseline and final, then test significance
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --eval-runs 3
This reports mean ± std in the results summary:
  Baseline score   : 0.6200 ± 0.0180  (3 runs)
  Final score      : 0.7450 ± 0.0120  (3 runs)
--eval-runs only affects the baseline and final verification evals — not the optimization iterations, which always use a single pass for speed. --eval-runs 3 triples the cost of those two checkpoints.

Strategy-specific parameters

Pass strategy-specific parameters via extra_kwargs:
config = OptimizerConfig(
    strategy="auto",
    max_iterations=20,
    extra_kwargs={
        "max_phases": 4,
        "start_structural": True,
        "min_phases": 2,
    },
)
Parameter | Default | Description
max_phases | 4 | Maximum number of optimization phases
start_structural | True | Always begin with structural optimization
min_phases | 2 | Minimum phases before convergence (prevents noisy early stops)

Parallel execution

Strategies like structural and PDO evaluate multiple prompt variants per iteration. These variants are evaluated in parallel using threads. For cloud APIs (OpenAI, OpenRouter, Together), parallelism works out of the box:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m openrouter/meta-llama/llama-3.1-8b-instruct \
  --max-workers 4
For Ollama, you need to explicitly enable parallel inference — by default Ollama processes one request at a time:
OLLAMA_NUM_PARALLEL=4 ollama serve &
OLLAMA_NUM_PARALLEL=4 aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --max-workers 4
If reflex detects Ollama without OLLAMA_NUM_PARALLEL set, it automatically falls back to 1 worker and logs a warning with setup instructions.
Higher OLLAMA_NUM_PARALLEL uses more VRAM. Guidelines:
Model size | VRAM | Suggested parallel
1B | 4GB+ | 8
8B | 8GB+ | 4
70B | 48GB+ | 2

Choosing a reasoning model

Reflex is an agent — it observes eval results, diagnoses failures, and iteratively rewrites your prompt. To do this reasoning, it calls an LLM. By default that’s Claude Sonnet, but you can swap in any model: a local Ollama model, an OpenAI-compatible endpoint, or another cloud provider. The reasoning model is separate from the model being optimized (-m). A more capable reasoning model produces better prompt rewrites.
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.2:1b \
  --reasoning-model ollama/llama3.3:70b
Python API:
config = OptimizerConfig(
    reasoning_model="llama3.3:70b",
    reasoning_provider="ollama",
)
The reasoning model needs strong analytical abilities. Local models work well for simpler strategies (iterative), but the auto strategy benefits from a more capable model since it makes multi-step decisions about which optimization axes to apply.

Run persistence and checkpointing

Reflex automatically saves every run to a .reflex/ directory. Each iteration is checkpointed so that if a run crashes or is interrupted, you can resume exactly where it left off.

Directory structure

.reflex/
  runs/
    001_2026-04-04T10-32-15/
      config.json           # full config snapshot
      baseline.json         # baseline eval scores
      checkpoint.json       # resume state (updated each iteration)
      best_prompt.md        # current best prompt (always up to date)
      iterations/
        001.json            # per-iteration state
        002.json
        003.json
      result.json           # only written when run completes
    002_2026-04-05T14-10-00/
      ...
Each run gets a sequential ID (001, 002, …) and a timestamped directory. The config, dataset path, and initial prompt are captured at the start so every run is fully reproducible.
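Because best_prompt.md is always kept current, the latest run's best prompt can be read back without any reflex APIs. A sketch, assuming the default layout shown above:

```python
from pathlib import Path

def latest_best_prompt(root=".reflex"):
    # Run IDs are zero-padded and sequential, so lexicographic order
    # of the directory names matches chronological order.
    runs = sorted(p for p in Path(root, "runs").iterdir() if p.is_dir())
    return (runs[-1] / "best_prompt.md").read_text()
```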

Resuming interrupted runs

If a run is interrupted (Ctrl-C, crash, network timeout), use --resume to pick up where it left off:
# Resume the latest interrupted run matching this dataset
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume

# Resume a specific run by ID
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume-from 003
The checkpoint contains the full state: current best prompt, score trajectory, completed iterations, baseline scores, and strategy-specific state. On resume, reflex skips the baseline eval and jumps straight to the next iteration.

Listing runs

aevyra-reflex runs
Shows all runs with their status, strategy, scores, and iteration count. Add -v for config details (reasoning model, target models).

Custom run directory

By default, runs are stored in .reflex/ in the current working directory. Use --run-dir to change this:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --run-dir ./experiments/.reflex

Python API

from aevyra_reflex import PromptOptimizer, OptimizerConfig, RunStore

store = RunStore(root=".reflex")

# optimizer: a configured PromptOptimizer; initial_prompt: the prompt text to optimize

# New run with checkpointing
result = optimizer.run(
    initial_prompt,
    run_store=store,
)

# Resume an interrupted run
incomplete = store.find_incomplete_run(dataset_path="data.jsonl")
if incomplete:
    result = optimizer.run(
        initial_prompt,
        run_store=store,
        resume_run=incomplete,
    )

# List all runs
for run_summary in store.list_runs():
    print(f"{run_summary.run_id}: {run_summary.status} — best: {run_summary.best_score}")

Metrics

By default, reflex uses ROUGE. You can switch to a different metric or use an LLM judge:
# Explicit metric
aevyra-reflex optimize ... --metric bleu

# LLM judge (no automated metric)
aevyra-reflex optimize ... --judge openrouter/openai/gpt-4o-mini

# LLM judge with a custom evaluation rubric
aevyra-reflex optimize ... \
  --judge anthropic/claude-sonnet-4-6 \
  --judge-criteria rubric.md
--metric and --judge are mutually exclusive. If neither is specified, ROUGE is used as the default. --judge-criteria only applies when --judge is set. It accepts a path to a plain text file describing your scoring rubric (1–5 scale). Use this when the default accuracy/helpfulness/clarity/completeness criteria don’t match your task — for example, when you need to enforce a strict output format, check domain-specific correctness, or evaluate against a proprietary style guide.
rubric.md (example)
Score the response from 1 to 5:

5 — Exactly 3 sentences covering what happened, impact, and remediation.
4 — 3 sentences but missing one required detail (e.g. financial cost).
3 — Wrong number of sentences but factually accurate.
2 — Free-form prose with correct facts but ignores structure.
1 — Missing critical content or fabricated details.