Dashboard
Every run is immediately explorable in the built-in dashboard at http://localhost:8128, with score
trajectory charts, prompt diffs, reasoning analysis, and token usage. Branch
from any iteration to continue with a different strategy.
Quick start
Optimize your first prompt in under 5 minutes
Tutorial
Full walkthrough: 0.38 → 0.89 on a real format-compliance task
Open the dashboard
Score charts, prompt diffs, reasoning traces, and branch runs
Strategies
Auto, iterative, structural, PDO, fewshot
Why reflex
No config files. No YAML. No framework to learn. Point it at a dataset and a prompt file and it runs.
Lightweight. No heavy framework dependencies. Just Python, the standard library, and numpy for PDO math. Installs in seconds and has no opinion
about the rest of your stack.
Fully local. Ollama and vLLM are supported, so you can run everything on your own
hardware and nothing leaves your machine.
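A fully local setup amounts to pointing eval calls at an OpenAI-compatible server on your own machine. Ollama serves one on its default port 11434; the sketch below only builds such a request (nothing is sent), and the model name is an illustrative assumption.

```python
import json
from urllib import request

# Ollama's OpenAI-compatible endpoint on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_local_request(prompt: str, model: str = "llama3.1") -> request.Request:
    """Build a chat-completion request against a local Ollama server.

    Nothing is sent here; an optimizer would POST this and read the reply.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_local_request("Summarize: the cat sat on the mat.")
```

Because the API shape is OpenAI-compatible, swapping between a local server and a hosted provider is just a base-URL change.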
Resumable. --resume picks up exactly where a run left off. Val history, best-prompt
selection, and token totals are all restored correctly across as many
interruptions as you need.
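Resuming boils down to serializing run state at each iteration and reloading it on start. A minimal sketch, with a hypothetical checkpoint layout (the field names and on-disk format here are assumptions, not reflex's actual format):

```python
import json
import tempfile
from pathlib import Path

def load_checkpoint(path: Path) -> dict:
    """Restore val history, best prompt, and token totals from a prior session.

    Returns fresh state when no checkpoint exists (first run).
    """
    if not path.exists():
        return {"val_history": [], "best_prompt": None, "total_tokens": 0}
    return json.loads(path.read_text())

def save_checkpoint(path: Path, state: dict) -> None:
    """Persist run state so an interrupted run can be resumed."""
    path.write_text(json.dumps(state))

# Demo: save a mid-run state, then restore it as --resume would.
ckpt = Path(tempfile.mkdtemp()) / "reflex.ckpt"
save_checkpoint(ckpt, {"val_history": [0.40, 0.55], "best_prompt": "v2",
                       "total_tokens": 4200})
state = load_checkpoint(ckpt)
```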
Full token accounting. Eval tokens and reasoning tokens are tracked per
iteration and accumulated across sessions, including resumed runs. The final
results show the true total cost of the optimization, not just the last session.
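Cross-session accounting is just summing each iteration's eval and reasoning tokens onto whatever the previous sessions recorded. A minimal sketch (the field names are assumptions for illustration):

```python
def accumulate_tokens(prior_total: int, iterations: list[dict]) -> int:
    """Add this session's eval + reasoning tokens onto prior sessions' total."""
    session = sum(it["eval_tokens"] + it["reasoning_tokens"] for it in iterations)
    return prior_total + session

# Two iterations from a resumed session, on top of 10,000 tokens already spent.
total = accumulate_tokens(
    10_000,
    [{"eval_tokens": 1200, "reasoning_tokens": 300},
     {"eval_tokens": 900, "reasoning_tokens": 450}],
)
# total == 12_850: the true cumulative cost, not just this session's
```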
Overfitting protection. An optional validation split monitors generalization
throughout training. The best prompt is selected against the val set — so the
final test eval reflects real-world performance rather than a prompt tuned to
the specific examples it was optimized on.
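The selection rule described above can be sketched in a few lines: rank candidate prompts by their validation score, never their training score. The scores below are made up to show why this matters.

```python
def select_best(candidates: list[dict]) -> dict:
    """Pick the prompt that generalizes best, judged on the held-out val split."""
    return max(candidates, key=lambda c: c["val_score"])

candidates = [
    {"prompt": "v1", "train_score": 0.95, "val_score": 0.70},  # overfit to train
    {"prompt": "v2", "train_score": 0.88, "val_score": 0.84},  # generalizes better
]
best = select_best(candidates)  # "v2" wins despite the lower train score
```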
What it does
Given a dataset and a starting prompt, aevyra-reflex:
- Runs a baseline eval on a held-out test set to measure the starting score
- Optimizes the prompt on the training set, iterating until the score meets the target
- Re-evaluates on the held-out test set so reported improvement is honest
- Returns the optimized prompt with a full before/after comparison, token costs, and significance test
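The steps above can be sketched as a loop: baseline on the held-out test set, iterate on train until the target is met, then re-evaluate on test. The scorer and rewrite step here are stubs standing in for real model calls, not reflex's internals.

```python
def optimize(prompt, train, test, score, rewrite, target=0.9, max_iters=5):
    """Baseline on test, iterate on train, then re-evaluate on test."""
    baseline = score(prompt, test)            # honest starting point
    for _ in range(max_iters):
        if score(prompt, train) >= target:    # stop once the target is met
            break
        prompt = rewrite(prompt)              # one optimization step
    final = score(prompt, test)               # held-out re-eval
    return prompt, baseline, final

# Stubs: score grows with prompt length; rewrite appends an instruction.
score = lambda p, data: min(len(p) / 40, 1.0)
rewrite = lambda p: p + " Be concise."
prompt, before, after = optimize("Answer the question.", [], [], score, rewrite)
```

Reporting both `before` and `after` on the same held-out set is what keeps the claimed improvement honest.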
When to use it
- You ran verdict and model A beats model B — you want to close the gap through prompt engineering, not by switching models
- A model scores poorly on your eval — you want a better prompt, not a bigger model
- You’re iterating on a system prompt and want to automate the feedback loop
- You want to understand why a prompt works (the analysis teaches prompt engineering)
- You’re migrating a prompt from one model family to another (e.g. Claude → Llama)
Optimization strategies
The auto strategy (default) picks the right technique for each phase. You can also run any strategy directly.
Auto
Multi-phase pipeline — structural → iterative → fewshot, chosen adaptively
Iterative
Diagnose failures, revise wording, repeat. Handles label-free datasets.
Structural
Reorganize formatting, sections, and hierarchy
PDO
Tournament-style search with dueling bandits and adaptive ranking
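A toy version of tournament-style selection: duel candidates pairwise, record wins, rank by win count. Real PDO uses dueling bandits with adaptive ranking to avoid comparing every pair; this sketch only shows the tournament skeleton, with a deterministic judge standing in for model-based comparison.

```python
from collections import Counter
from itertools import combinations

def tournament(candidates: list[str], duel) -> list[str]:
    """Rank candidates by round-robin duel wins (highest first)."""
    wins = Counter({c: 0 for c in candidates})
    for a, b in combinations(candidates, 2):
        wins[duel(a, b)] += 1            # credit the duel winner
    return [c for c, _ in wins.most_common()]

# Stub judge: the longer prompt wins each duel.
duel = lambda a, b: a if len(a) >= len(b) else b
ranking = tournament(["short", "a medium prompt", "the longest prompt here"], duel)
```

A dueling-bandit variant would sample which pairs to duel next based on uncertainty, trading exactness for far fewer comparisons.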
How it fits together
CLI reference
All commands and flags
Strategies
How each optimization axis works
Configuration
Tuning iterations, thresholds, and parallelism
Providers
OpenAI, Anthropic, Gemini, Ollama, and more