
Setting the target score

The score threshold tells reflex when to stop — “make my model at least this good.” Rather than picking an arbitrary number, you should set the target from a real benchmark. There are three ways to set the target, in priority order:

OptimizerConfig

All configuration goes through OptimizerConfig:
from aevyra_reflex import OptimizerConfig

config = OptimizerConfig(
    strategy="auto",                   # auto, iterative, structural, pdo, fewshot
    max_iterations=10,                 # total iteration budget
    score_threshold=0.85,              # stop when score reaches this
    train_ratio=0.8,                   # 70% train / 10% val / 20% test (default)
    val_ratio=0.1,                     # fraction reserved for validation (0 = disabled)
    early_stopping_patience=3,         # stop when val score stagnates for N iters (0 = disabled)
    batch_size=0,                      # 0 = full training set; >0 = examples per iteration
    batch_seed=42,                     # base seed for mini-batch sampling
    full_eval_steps=0,                 # full-set checkpoint every N iters (0 = disabled)
    max_workers=4,                     # parallel workers for variant evaluation
    eval_runs=1,                       # eval passes to average (1 = single pass)
    reasoning_model="claude-sonnet-4-20250514",
    eval_temperature=0.0,              # temperature for target model completions
    extra_kwargs={},                   # strategy-specific parameters
)
Parameter | Default | Description
strategy | "auto" | Which optimization strategy to use (or any custom registered name)
max_iterations | 10 | Maximum iterations (total budget for auto across all phases)
score_threshold | 0.85 | Target score — optimization stops when reached. Set automatically by --target or --verdict-results
train_ratio | 0.8 | Fraction of examples used for optimization. The rest are held out for baseline and final eval. Set to 1.0 to disable splitting
val_ratio | 0.1 | Fraction of total examples reserved as a validation set, carved from the training portion. Used to detect overfitting mid-run. Set to 0.0 to disable
early_stopping_patience | 3 | Stop optimization early if val score has not improved for this many consecutive iterations. Only active when val_ratio > 0. Set to 0 to disable
batch_size | 0 | Mini-batch size per iteration. 0 = full training set. When > 0, each iteration samples this many examples at random from the training data
batch_seed | 42 | Base seed for mini-batch sampling. Iteration i uses batch_seed + i so every batch is different but the run is reproducible
full_eval_steps | 0 | When using mini-batch mode, run a full training-set eval every this many iterations. 0 = never (use mini-batch scores throughout). E.g. 5 runs a full eval on iterations 5, 10, 15, … Full-eval iterations are marked in the dashboard
max_workers | 4 | Thread pool size for parallel evaluation
eval_runs | 1 | Number of eval passes to run and average for the baseline and final verification. Use 3–5 for noisy tasks or small datasets. Does not affect optimization iterations
reasoning_model | "claude-sonnet-4-20250514" | LLM used for reasoning (failure analysis, prompt rewriting)
reasoning_provider | None | Provider for the reasoning model: "anthropic", "openai", "ollama", "gemini", or any alias ("openrouter", "groq", etc.). Auto-detected from model name if None
reasoning_api_key | None | API key for the reasoning model. Falls back to the provider’s default env var
reasoning_base_url | None | Base URL for self-hosted or OpenAI-compatible reasoning model endpoints
eval_temperature | 0.0 | Temperature for target model completions
target_model | None | Label of the model whose score is the target (set automatically)
target_source | None | How the target was set: "verdict_json", "verdict_run", or "manual"
source_model | None | The model family this prompt was originally written for (e.g. "claude-sonnet", "gpt-4o"). Enables migration mode — the reasoning model adapts idioms for the target model. Also set via --source-model
extra_kwargs | {} | Strategy-specific parameters (see below)

Train/test split

By default, reflex splits your dataset 80/20 before running. The 80% train portion is used during the optimization loop — failing samples are drawn from here and all iteration scores are computed on these examples. The 20% test set is held out completely, and the baseline and final scores you see in the results summary are computed on that held-out set only. This matters because without a split, the same examples that drive rewrites also determine whether the prompt “improved.” That inflates the reported improvement, especially on small datasets where a handful of examples have an outsized effect on the mean score.
# Default: 80/20 split (the banner shows "80 train / 20 test")
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1

# Larger test set (70/30)
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --train-split 0.7

# No split — all examples used for both optimization and eval
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --train-split 1.0
The split is deterministic (seeded at 42) so the same dataset always produces the same partition.
With small datasets (fewer than ~20 examples), the test set may be too small for stable scores. In that case, use --train-split 1.0 to disable splitting and rely on the score trajectory to judge improvement instead.
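The mechanics of the deterministic split can be pictured in a few lines. This is an illustrative sketch, not the actual implementation — the seeded-shuffle approach is an assumption, but it reproduces the behavior described above (fixed seed 42, same partition every run):

```python
import random

def train_test_split(examples, train_ratio=0.8, seed=42):
    # Shuffle a copy with a fixed seed so the same dataset always
    # yields the same partition, then cut at the train ratio.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With train_ratio=1.0 the test slice is empty, which matches the --train-split 1.0 behavior above.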

Validation split and early stopping

Reflex supports a 3-way train / val / test split to detect overfitting mid-run. Without a validation set, a prompt can score well on training examples simply by overfitting to their specific patterns — the optimization loop has no way to notice. With a validation set, reflex evaluates each candidate prompt on the val examples after every iteration. If the train score climbs but val plateaus or declines, that’s a sign the prompt is fitting the training examples specifically rather than generalizing. Early stopping on val plateau saves you compute and returns the least-overfit prompt.
# Default: 70% train / 10% val / 20% test, early stopping patience=3
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1

# Disable val split
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --val-split 0.0

# Custom patience
aevyra-reflex optimize data.jsonl prompt.md \
  -m local/llama3.1:8b \
  --train-split 0.8 \
  --val-split 0.1 \
  --early-stopping-patience 5
Python API:
config = OptimizerConfig(
    train_ratio=0.8,
    val_ratio=0.1,
    early_stopping_patience=3,
)
The validation split is deterministic (same seed as train/test). The summary shows all three split sizes, a per-iteration val trajectory, and — if early stopping triggered — marks which iteration was actually the best:
  Train/val/test   : 70 / 10 / 20 samples
  Baseline score   : 0.5500  (on 20-sample test set)
  Final score      : 0.7100  (on 20-sample test set)
  Improvement      : +0.1600 (+29.1%)
  Iterations       : 6
  Early stopped    : Yes (val score plateaued)
  Train traj   : 0.600 → 0.650 → 0.710 → 0.720 → 0.725 → 0.724
  Val traj     : 0.580 → 0.640 → 0.690 → 0.688 → 0.685 → 0.682
Use --early-stopping-patience 2 for fast iteration or 3–4 when your dataset is small and single-iteration val scores are noisy. Without --early-stopping-patience, val scores are still tracked and reported but optimization runs to completion.
The validation set adds one extra eval call per optimization iteration. With val_ratio=0.1 on 100 examples you’re evaluating 10 extra samples per iteration — usually inexpensive compared to the reasoning step.
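The patience rule itself is simple to state. A sketch (the helper name is hypothetical and the real check may differ in details): stop once the best validation score is at least `patience` iterations in the past.

```python
def val_plateaued(val_scores, patience=3):
    # Trigger early stopping when the best validation score is at
    # least `patience` iterations old (no recent improvement).
    if patience <= 0 or len(val_scores) <= patience:
        return False
    best = max(range(len(val_scores)), key=val_scores.__getitem__)
    return len(val_scores) - 1 - best >= patience
```

Applied to the val trajectory in the summary above, the best score sits at iteration 3, so with patience 3 the check fires after iteration 6.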

Mini-batch mode

By default every optimization iteration evaluates the full training set. For large datasets this can be slow — each iteration costs n_train LLM calls. Mini-batch mode samples a random subset each round:
# 32 examples per iteration
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b --batch-size 32

# Combine with train/test split
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b \
  --train-split 0.8 --batch-size 40
config = OptimizerConfig(
    batch_size=32,        # examples per iteration; 0 = full training set (default)
    batch_seed=42,        # base seed — iteration i uses batch_seed + i
    full_eval_steps=5,    # full training-set eval every 5 iterations (0 = never)
)
Each iteration draws a fresh sample seeded by batch_seed + i, so the optimizer sees different examples each round. The stochasticity can also help escape local optima by smoothing out noise from individual examples. Baseline and final verification evals always use the full test set; batch_size only affects the per-iteration training evals used by the optimization strategy.
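The seeded sampling can be sketched as follows (an assumed implementation detail, but consistent with the batch_seed + i description above):

```python
import random

def draw_batch(train_examples, batch_size, batch_seed, iteration):
    # batch_size 0 (or >= the dataset size) means the full training set.
    if batch_size <= 0 or batch_size >= len(train_examples):
        return list(train_examples)
    # Each iteration seeds its own RNG with batch_seed + iteration:
    # batches differ round to round, but rerunning reproduces them.
    return random.Random(batch_seed + iteration).sample(train_examples, batch_size)
```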

Periodic full-eval checkpoints

Mini-batch scores are noisy — a single good or bad batch can skew the trajectory. Use --full-eval-steps to periodically score the full training set for an accurate checkpoint:
# Batch of 32 per iter; full eval every 5 iterations
aevyra-reflex optimize data.jsonl prompt.md -m local/llama3.1:8b \
  --batch-size 32 --full-eval-steps 5
Iterations 5, 10, 15, … score the full training set; all others score the mini-batch. Full-eval iterations are marked with a ◈ full eval badge in the dashboard flow graph.
A good starting batch size is 20–50% of your training set. Too small a batch makes the per-iteration score noisy; too large gives diminishing returns over the full set.
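The checkpoint schedule reduces to a modulo check (sketch; the function name is illustrative, not part of the reflex API):

```python
def is_full_eval(iteration, full_eval_steps):
    # With --full-eval-steps 5, iterations 5, 10, 15, ... run the full
    # training-set eval; 0 disables full-eval checkpoints entirely.
    return full_eval_steps > 0 and iteration > 0 and iteration % full_eval_steps == 0
```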

Statistical significance

Reflex reports whether the improvement from baseline to final is statistically significant using a paired test on per-sample scores. This compares how each individual sample scored under the original prompt vs the optimized one. The result appears in result.summary():
  Baseline score   : 0.6200
  Final score      : 0.7450
  Improvement      : +0.1250 (+20.2%)
  Significance     : p=0.0234  ✓ significant (α=0.05, paired test)
The test uses the Wilcoxon signed-rank test (non-parametric, no normality assumption) if scipy is installed, and falls back to a paired t-test otherwise. Install scipy for best results:
pip install "aevyra-reflex[stats]"
For noisy tasks where LLM responses vary run-to-run, use --eval-runs to average multiple passes before computing the test:
# Average 3 eval passes for baseline and final, then test significance
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --eval-runs 3
This reports mean ± std in the results summary:
  Baseline score   : 0.6200 ± 0.0180  (3 runs)
  Final score      : 0.7450 ± 0.0120  (3 runs)
--eval-runs only affects the baseline and final verification evals — not the optimization iterations, which always use a single pass for speed. --eval-runs 3 triples the cost of those two checkpoints.

Strategy-specific parameters

Pass strategy-specific parameters via extra_kwargs:
config = OptimizerConfig(
    strategy="auto",
    max_iterations=20,
    extra_kwargs={
        "max_phases": 4,
        "start_structural": True,
        "min_phases": 2,
    },
)
Parameter | Default | Description
max_phases | 4 | Maximum number of optimization phases
start_structural | True | Always begin with structural optimization
min_phases | 2 | Minimum phases before convergence (prevents noisy early stops)

Parallel execution

Strategies like structural and PDO evaluate multiple prompt variants per iteration. These variants are evaluated in parallel using threads. For cloud APIs (OpenAI, OpenRouter, Together), parallelism works out of the box:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m openrouter/meta-llama/llama-3.1-8b-instruct \
  --max-workers 4
For Ollama, you need to explicitly enable parallel inference — by default Ollama processes one request at a time:
OLLAMA_NUM_PARALLEL=4 ollama serve &
OLLAMA_NUM_PARALLEL=4 aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --max-workers 4
If reflex detects Ollama without OLLAMA_NUM_PARALLEL set, it automatically falls back to 1 worker and logs a warning with setup instructions.
Higher OLLAMA_NUM_PARALLEL uses more VRAM. Guidelines:
Model size | VRAM | Suggested parallel
1B | 4GB+ | 8
8B | 8GB+ | 4
70B | 48GB+ | 2

Choosing a reasoning model

Reflex is an agent — it observes eval results, diagnoses failures, and iteratively rewrites your prompt. To do this reasoning, it calls an LLM. By default that’s Claude Sonnet, but you can swap in any model: a local Ollama model, an OpenAI-compatible endpoint, or another cloud provider. The reasoning model is separate from the model being optimized (-m). A more capable reasoning model produces better prompt rewrites.
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.2:1b \
  --reasoning-model ollama/llama3.3:70b
Python API:
config = OptimizerConfig(
    reasoning_model="llama3.3:70b",
    reasoning_provider="ollama",
)
The reasoning model needs strong analytical abilities. Local models work well for simpler strategies (iterative), but the auto strategy benefits from a more capable model since it makes multi-step decisions about which optimization axes to apply.

Run persistence and checkpointing

Reflex automatically saves every run to a .reflex/ directory. Each iteration is checkpointed so that if a run crashes or is interrupted, you can resume exactly where it left off.

Directory structure

.reflex/
  runs/
    001_2026-04-04T10-32-15/
      config.json           # full config snapshot
      baseline.json         # baseline eval scores
      checkpoint.json       # resume state (updated each iteration)
      best_prompt.md        # current best prompt (always up to date)
      iterations/
        001.json            # per-iteration state
        002.json
        003.json
      result.json           # only written when run completes
    002_2026-04-05T14-10-00/
      ...
Each run gets a sequential ID (001, 002, …) and a timestamped directory. The config, dataset path, and initial prompt are captured at the start so every run is fully reproducible.
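Because best_prompt.md is always kept current, the latest run's best prompt can be read back without any reflex APIs. A sketch, assuming the default layout shown above:

```python
from pathlib import Path

def latest_best_prompt(root=".reflex"):
    # Run IDs are zero-padded and sequential, so lexicographic order
    # of the directory names matches chronological order.
    runs = sorted(p for p in Path(root, "runs").iterdir() if p.is_dir())
    return (runs[-1] / "best_prompt.md").read_text()
```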

Resuming interrupted runs

If a run is interrupted (Ctrl-C, crash, network timeout), use --resume to pick up where it left off:
# Resume the latest interrupted run matching this dataset
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume

# Resume a specific run by ID
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --resume-from 003
The checkpoint contains the full state: current best prompt, score trajectory, completed iterations, baseline scores, and strategy-specific state. On resume, reflex skips the baseline eval and jumps straight to the next iteration.

Listing runs

aevyra-reflex runs
Shows all runs with their status, strategy, scores, and iteration count. Add -v for config details (reasoning model, target models).

Custom run directory

By default, runs are stored in .reflex/ in the current working directory. Use --run-dir to change this:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --run-dir ./experiments/.reflex

Python API

from aevyra_reflex import PromptOptimizer, OptimizerConfig, RunStore

store = RunStore(root=".reflex")

# optimizer: a configured PromptOptimizer; initial_prompt: the prompt text to optimize

# New run with checkpointing
result = optimizer.run(
    initial_prompt,
    run_store=store,
)

# Resume an interrupted run
incomplete = store.find_incomplete_run(dataset_path="data.jsonl")
if incomplete:
    result = optimizer.run(
        initial_prompt,
        run_store=store,
        resume_run=incomplete,
    )

# List all runs
for run_summary in store.list_runs():
    print(f"{run_summary.run_id}: {run_summary.status} — best: {run_summary.best_score}")

Metrics

By default, reflex uses ROUGE. You can switch to a different metric or use an LLM judge:
# Explicit metric
aevyra-reflex optimize ... --metric bleu

# LLM judge (no automated metric)
aevyra-reflex optimize ... --judge openrouter/openai/gpt-4o-mini

# LLM judge with a custom evaluation rubric
aevyra-reflex optimize ... \
  --judge anthropic/claude-sonnet-4-6 \
  --judge-criteria rubric.md
--metric and --judge are mutually exclusive. If neither is specified, ROUGE is used as the default. --judge-criteria only applies when --judge is set. It accepts a path to a plain text file describing your scoring rubric (1–5 scale). Use this when the default accuracy/helpfulness/clarity/completeness criteria don’t match your task — for example, when you need to enforce a strict output format, check domain-specific correctness, or evaluate against a proprietary style guide.
rubric.md (example)
Score the response from 1 to 5:

5 — Exactly 3 sentences covering what happened, impact, and remediation.
4 — 3 sentences but missing one required detail (e.g. financial cost).
3 — Wrong number of sentences but factually accurate.
2 — Free-form prose with correct facts but ignores structure.
1 — Missing critical content or fabricated details.