Reflex ships with five strategies. The auto strategy (default) chains multiple axes adaptively. Each axis can also be used standalone with -s <name>.

Auto (default)

The auto strategy runs a multi-phase pipeline:
  1. Run a baseline eval to measure the starting score
  2. The reasoning model analyzes the prompt’s weaknesses and recommends an optimization axis
  3. Apply that axis for a few iterations
  4. Re-evaluate — if the threshold is met, stop
  5. Otherwise the reasoning model picks the next axis based on what changed
  6. Repeat until the global iteration budget runs out
A typical run: structural (fix formatting) → iterative (fix wording) → fewshot (add examples) — each phase builds on the previous one’s improvements. Auto requires at least two phases before it considers converging, because LLM judge scores can be noisy and a single high score may not hold on re-evaluation.
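The phase pipeline above can be sketched roughly like this. This is a simplified illustration, not the real implementation: `run_eval`, `pick_axis`, and `apply_axis` are hypothetical stand-ins for Reflex internals.

```python
def auto_optimize(prompt, run_eval, pick_axis, apply_axis,
                  threshold=0.9, budget=20, min_phases=2):
    """Sketch of the auto strategy's multi-phase loop (illustrative only)."""
    score = run_eval(prompt)                  # 1. baseline eval
    phases = used = 0
    while used < budget:                      # 6. global iteration budget
        axis = pick_axis(prompt, score)       # 2./5. reasoning model picks an axis
        for _ in range(3):                    # 3. apply that axis for a few iterations
            if used >= budget:
                break
            prompt = apply_axis(axis, prompt)
            used += 1
        score = run_eval(prompt)              # 4. re-evaluate
        phases += 1
        # Judge scores are noisy, so require at least two phases
        # before trusting a score above the threshold.
        if phases >= min_phases and score >= threshold:
            break
    return prompt, score
```

Note the `min_phases` guard: even a passing baseline-adjacent score is re-confirmed by a second phase before the loop converges.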
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1

Iterative

Each iteration:
  1. Run completions with the current prompt via verdict
  2. Score all responses with the configured metrics
  3. Identify the worst-scoring samples
  4. The reasoning model analyzes the failures and proposes a revised prompt
  5. If the score meets the threshold, stop; otherwise repeat
The reasoning model maintains a causal rewrite log across iterations — a compact record of what was changed each round and what score delta resulted. From iteration 2 onwards, this history is injected into the prompt so the model knows which approaches helped (✓), had no effect (✗ no effect), or hurt (✗ hurt) — and can avoid repeating dead ends.
Rewrite history:
Iter 1 (score: 0.6234, Δ+0.0871 — ✓ helped): Added numbered reasoning steps
Iter 2 (score: 0.7105, Δ+0.0029 — ✗ no effect): Added "think carefully" instruction
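A minimal way to model such a rewrite log is a small record type per iteration. This is a hypothetical sketch of the idea, not Reflex's actual record format; the `RewriteEntry` name and the 0.01 noise band are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RewriteEntry:
    iteration: int
    score: float
    delta: float
    change: str

    def verdict(self) -> str:
        # Mirror the ✓ / ✗ labels above, treating tiny deltas as noise.
        if self.delta > 0.01:
            return "✓ helped"
        if self.delta < -0.01:
            return "✗ hurt"
        return "✗ no effect"

    def render(self) -> str:
        return (f"Iter {self.iteration} (score: {self.score:.4f}, "
                f"Δ{self.delta:+.4f} — {self.verdict()}): {self.change}")

log = [
    RewriteEntry(1, 0.6234, 0.0871, "Added numbered reasoning steps"),
    RewriteEntry(2, 0.7105, 0.0029, 'Added "think carefully" instruction'),
]
history = "Rewrite history:\n" + "\n".join(e.render() for e in log)
```

Rendering the log this way keeps each round's change and its score delta adjacent, which is exactly what the reasoning model needs to avoid repeating dead ends.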
Best for: prompts that are structurally fine but have wording issues, missing constraints, or ambiguous instructions.
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -s iterative

Structural

Optimizes the organization and formatting of the prompt:
  1. Run eval with the current prompt structure
  2. Generate variants using different transformations:
    • Markdown headers for clear sections
    • XML tags for structural clarity
    • Minimal flat paragraphs
    • Role/task/format split
    • Constraint emphasis
    • Task decomposition
    • Input-anchored layout
  3. The reasoning model also generates a free-form structural improvement
  4. Evaluate all variants in parallel; keep the best
  5. Repeat with the winning structure
Best for: prompts that are long, disorganized, or missing clear structure.
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -s structural
Structural evaluates multiple variants per iteration. Use --max-workers to control parallelism. For Ollama, see the parallelism guide.
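The parallel evaluation in step 4 can be sketched with a thread pool. Here `eval_prompt` is a hypothetical stand-in for the real scoring call, and `max_workers` plays the role of the CLI flag of the same name.

```python
from concurrent.futures import ThreadPoolExecutor

def best_variant(variants, eval_prompt, max_workers=4):
    """Score all structural variants concurrently and keep the winner."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(eval_prompt, variants))
    best_idx = max(range(len(variants)), key=scores.__getitem__)
    return variants[best_idx], scores[best_idx]
```

Threads suit this workload because each evaluation is dominated by waiting on model I/O rather than local computation.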

PDO (Prompt Duel Optimizer)

Tournament-style search over prompt variants using dueling bandits with Thompson sampling:
  1. Generate an initial pool of diverse prompts
  2. Each round, Thompson sampling selects two prompts to duel
  3. Both are evaluated on a sample of the dataset
  4. An LLM judge picks the winner on each sample; majority wins the duel
  5. Win matrix is updated; Copeland rankings recalculated
  6. Periodically, top-ranked prompts are mutated to explore new variants
  7. Worst performers are pruned to keep the pool manageable
Based on the PDO paper (arXiv:2510.13907). Best for: when you have budget for many evaluations and want broad exploration of the prompt space.
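The duel-selection and ranking steps can be illustrated with Beta posteriors over per-prompt win rates. This is a simplified sketch under the assumptions that each prompt's wins and losses are tracked as flat counts; the actual implementation maintains the full pairwise machinery described above.

```python
import random

def select_duel(wins, losses):
    """Thompson sampling: draw a win-rate sample from each prompt's
    Beta posterior and duel the two prompts with the highest draws."""
    samples = [(random.betavariate(w + 1, l + 1), i)
               for i, (w, l) in enumerate(zip(wins, losses))]
    samples.sort(reverse=True)
    return samples[0][1], samples[1][1]

def record_duel(win_matrix, a, b, a_won):
    """Update the pairwise win matrix after a duel between prompts a and b."""
    if a_won:
        win_matrix[a][b] += 1
    else:
        win_matrix[b][a] += 1

def copeland_scores(win_matrix):
    """Copeland score: how many opponents a prompt beats head-to-head."""
    n = len(win_matrix)
    return [sum(win_matrix[i][j] > win_matrix[j][i]
                for j in range(n) if j != i)
            for i in range(n)]
```

Thompson sampling naturally balances exploration and exploitation: uncertain prompts (few duels) draw from wide posteriors and occasionally win selection, while proven prompts draw consistently high samples.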
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  -s pdo \
  --max-iterations 50

Few-shot

Optimizes which examples to include in the prompt:
  1. Bootstrap: run the bare instruction and collect highest-scoring samples as candidate exemplars
  2. The reasoning model selects a diverse, informative subset
  3. Build a composite prompt: instruction + curated few-shot examples
  4. Evaluate, identify remaining failures
  5. The reasoning model swaps examples to better cover the failure modes
  6. Periodically re-bootstrap to discover new exemplar candidates
Best for: tasks where showing the model examples helps more than refining instructions (translation, classification, structured extraction).
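The bootstrap and composite-prompt steps (1 and 3) might look like this. The sample dict shape (`input`, `output`, `score` keys) and both function names are hypothetical, chosen only for illustration.

```python
def bootstrap_exemplars(samples, k=8):
    """Keep the top-k highest-scoring samples as candidate exemplars."""
    ranked = sorted(samples, key=lambda s: s["score"], reverse=True)
    return ranked[:k]

def build_prompt(instruction, exemplars):
    """Composite prompt: instruction followed by curated few-shot examples."""
    shots = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}" for e in exemplars
    )
    return f"{instruction}\n\n{shots}"
```

In the real strategy, step 2's diversity selection sits between these two calls, so the exemplars cover distinct failure modes rather than just the highest scores.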
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -s fewshot

Custom strategies

You can implement your own strategy by subclassing Strategy and registering it. Your strategy then works in both the Python API and the CLI.
from aevyra_reflex import Strategy, register_strategy
from aevyra_reflex.result import OptimizationResult, IterationRecord

class MonteCarloStrategy(Strategy):
    def run(self, *, initial_prompt, dataset, providers, metrics,
            agent, config, on_iteration=None):
        best_prompt = initial_prompt
        best_score = 0.0
        iterations = []

        for i in range(config.max_iterations):
            # Generate a candidate and score it (elided), then track the best
            if score > best_score:
                best_score, best_prompt = score, candidate
            record = IterationRecord(i + 1, candidate, score)
            iterations.append(record)
            if on_iteration:
                on_iteration(record)
            if score >= config.score_threshold:
                break

        return OptimizationResult(
            best_prompt=best_prompt,
            best_score=best_score,
            iterations=iterations,
            converged=best_score >= config.score_threshold,
        )

register_strategy("montecarlo", MonteCarloStrategy)
Then use it like any built-in:
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -s montecarlo
The run() method receives the full eval infrastructure — dataset, providers, metrics, the reasoning model (as agent), and the optimizer config — so your strategy has everything it needs to evaluate prompts and propose improvements.
Register your strategy before calling PromptOptimizer.run() or the CLI. A common pattern is to put the registration in a module that’s imported at startup.
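As a wiring sketch of that pattern (module and file names here are illustrative, not part of the library):

```python
# my_strategies.py — imported once at startup so the custom strategy
# is registered before the optimizer or CLI looks it up.
from aevyra_reflex import register_strategy
from my_project.montecarlo import MonteCarloStrategy  # hypothetical module

register_strategy("montecarlo", MonteCarloStrategy)
```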