aevyra-reflex takes your dataset and prompt, runs evals, diagnoses why scores are falling short, and rewrites the prompt — iterating until it converges. Runs can be interrupted and resumed at any point without losing work, and every token spent is tracked across sessions. Every rewrite is accompanied by a reasoned explanation of what changed and why — prompt diffs, score attributions, and the full reasoning trace are persisted as a durable audit trail.
pip install aevyra-reflex
aevyra-reflex optimize dataset.jsonl prompt.md -m local/llama3.1:8b -o best_prompt.md
Works with any model — local Ollama or vLLM, OpenAI, Anthropic, Gemini, or any OpenAI-compatible endpoint. All evaluation runs through aevyra-verdict.
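This page doesn't spell out the dataset schema, so as a purely hypothetical illustration (field names are assumptions — see the tutorial for the actual format), an eval dataset for prompt optimization typically pairs an input with an expected output, one JSON object per line:

```json
{"input": "Summarize: The server returned a 502 after the deploy.", "expected": "deploy caused 502 errors"}
{"input": "Summarize: Login latency doubled at peak traffic.", "expected": "peak-load login latency regression"}
```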

Dashboard

Every run is immediately explorable in the built-in dashboard:
aevyra-reflex dashboard
No separate server, no build step — opens http://localhost:8128 with score trajectory charts, prompt diffs, reasoning analysis, and token usage. Branch from any iteration to continue with a different strategy.

Quick start

Optimize your first prompt in under 5 minutes

Tutorial

Full walkthrough: 0.38 → 0.89 on a real format-compliance task

Open the dashboard

Score charts, prompt diffs, reasoning traces, and branch runs

Strategies

Auto, iterative, structural, PDO, fewshot

Why reflex

  • No config files. No YAML, no framework to learn. Point it at a dataset and a prompt file and it runs.
  • Lightweight. No heavy framework dependencies. Just Python, the standard library, and numpy for the PDO math. Installs in seconds and has no opinion about the rest of your stack.
  • Fully local. Ollama and vLLM are supported — run everything on your own hardware so nothing leaves your machine:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model ollama/qwen3:8b
  • Agentic, not scripted. Each iteration the reasoning model explains why it made a change — you learn from the run, not just get an output. The causal rewrite log tracks what helped, what had no effect, and what hurt, so the model avoids repeating dead ends.
  • Crash-safe resumption. Every iteration is checkpointed to disk as it completes. Kill the process, restart the machine, lose your connection — --resume picks up exactly where it left off. Val history, best-prompt selection, and token totals are all restored correctly across as many interruptions as you need.
  • Full token accounting. Eval tokens and reasoning tokens are tracked per iteration and accumulated across sessions, including resumed runs. The final results show the true total cost of the optimization, not just the last session.
  • Overfitting protection. An optional validation split monitors generalization throughout training. The best prompt is selected against the val set — so the final test eval reflects real-world performance rather than a prompt tuned to the specific examples it was optimized on.
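Crash-safe resumption boils down to persisting each iteration's state as it completes. A minimal sketch of that pattern (hypothetical file layout and field names, not aevyra-reflex internals):

```python
import json
import os

def save_checkpoint(path, state):
    # Write to a temp file, then atomically rename: a crash
    # mid-write can never leave a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # On resume, restore val history, best prompt, and token
    # totals; otherwise start from a clean slate.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "val_history": [], "best_prompt": None, "tokens_total": 0}
```

Because the rename is atomic, the checkpoint on disk is always either the previous complete state or the new complete state, never a partial write.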

What it does

Given a dataset and a starting prompt, aevyra-reflex:
  1. Runs a baseline eval on a held-out test set to measure the starting score
  2. Optimizes the prompt on the training set, iterating until the score meets the target
  3. Re-evaluates on the held-out test set so reported improvement is honest
  4. Returns the optimized prompt with a full before/after comparison, token costs, and significance test
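In pseudocode, the four steps above look roughly like the following (all function names here are hypothetical stand-ins to show the control flow, not the aevyra-reflex API):

```python
import random

def optimize(dataset, prompt, eval_fn, rewrite_fn, target=0.9, max_iters=10, test_frac=0.2):
    """Conceptual sketch: baseline on held-out test, optimize on train, re-eval on test."""
    random.seed(0)
    data = dataset[:]
    random.shuffle(data)
    n_test = max(1, int(len(data) * test_frac))
    test, train = data[:n_test], data[n_test:]

    baseline = eval_fn(prompt, test)            # 1. baseline on held-out test set
    best, best_score = prompt, eval_fn(prompt, train)
    for _ in range(max_iters):                  # 2. iterate on the training set
        if best_score >= target:
            break
        candidate = rewrite_fn(best)
        score = eval_fn(candidate, train)
        if score > best_score:
            best, best_score = candidate, score
    final = eval_fn(best, test)                 # 3. honest re-eval on held-out test
    return {"prompt": best, "baseline": baseline, "final": final}  # 4. before/after
```

The key property is that the score reported at the end comes from data the optimizer never saw, so improvement is measured, not assumed.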

When to use it

  • You ran aevyra-verdict and model A beats model B — you want to close the gap through prompt engineering, not by switching models
  • A model scores poorly on your eval — you want a better prompt, not a bigger model
  • You’re iterating on a system prompt and want to automate the feedback loop
  • You want to understand why a prompt works (the analysis teaches prompt engineering)
  • You’re migrating a prompt from one model family to another (e.g. Claude → Llama)

Optimization strategies

The auto strategy (default) picks the right technique for each phase. You can also run any strategy directly.

Auto

Multi-phase pipeline — structural → iterative → fewshot, chosen adaptively

Iterative

Diagnose failures, revise wording, repeat. Works with labeled or label-free datasets.

Structural

Reorganize formatting, sections, and hierarchy

PDO

Tournament-style search with dueling bandits and adaptive ranking

A typical auto run (from the security incidents tutorial): each phase hands its best prompt to the next. Structural made the biggest jump (formatting was the main gap); PDO polished it to convergence.
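The core idea behind tournament-style search can be sketched as pairwise duels between candidate prompts. This is a deliberately simplified, uniformly-sampled version; the actual PDO strategy adds dueling-bandit logic so duels are spent where they are most informative:

```python
import random

def duel_tournament(candidates, judge, rounds=200, seed=0):
    # Run pairwise duels and track per-candidate win counts.
    rng = random.Random(seed)
    wins = {c: 0 for c in candidates}
    plays = {c: 0 for c in candidates}
    for _ in range(rounds):
        a, b = rng.sample(candidates, 2)  # uniform pairing; bandit methods sample adaptively
        winner = judge(a, b)              # e.g. a judge model comparing two prompts head-to-head
        wins[winner] += 1
        plays[a] += 1
        plays[b] += 1
    # Rank by empirical win rate across each candidate's duels.
    return max(candidates, key=lambda c: wins[c] / max(plays[c], 1))
```

Pairwise comparison sidesteps noisy absolute scores: the judge only has to say which of two prompts did better, and the ranking emerges from accumulated duel outcomes.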

How it fits together

CLI reference

All commands and flags

Strategies

How each optimization axis works

Configuration

Tuning iterations, thresholds, and parallelism

Providers

OpenAI, Anthropic, Gemini, Ollama, and more