Dashboard
Every run is immediately explorable in the built-in dashboard at http://localhost:8128, with score
trajectory charts, prompt diffs, reasoning analysis, and token usage. Branch
from any iteration to continue with a different strategy.
Quick start
Optimize your first prompt in under 5 minutes
Tutorial
Full walkthrough: 0.38 → 0.89 on a real format-compliance task
Open the dashboard
Score charts, prompt diffs, reasoning traces, and branch runs
Strategies
Auto, iterative, structural, PDO, fewshot
Why reflex
No config files. No YAML. No framework to learn. Point it at a dataset and a prompt file and it runs.
Lightweight. No heavy framework dependencies. Just Python, the standard library, and numpy for PDO math. Installs in seconds and has no opinion
about the rest of your stack.
Fully local. Ollama and vLLM are supported, so you can run everything on your own
hardware and nothing leaves your machine.
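A fully local setup amounts to pointing eval calls at an OpenAI-compatible server on your own machine. Ollama serves one on its default port 11434; the sketch below only builds such a request (nothing is sent), and the model name is an illustrative assumption.

```python
import json
from urllib import request

# Ollama's OpenAI-compatible endpoint on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_local_request(prompt: str, model: str = "llama3.1") -> request.Request:
    """Build a chat-completion request against a local Ollama server.

    Nothing is sent here; an optimizer would POST this and read the reply.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_local_request("Summarize: the cat sat on the mat.")
```

Because the API shape is OpenAI-compatible, swapping between a local server and a hosted provider is just a base-URL change.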
Resumable. --resume picks up exactly where a run left off. Val history, best-prompt
selection, and token totals are all restored correctly across as many
interruptions as you need.
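Resuming boils down to serializing run state at each iteration and reloading it on start. A minimal sketch, with a hypothetical checkpoint layout (the field names and on-disk format here are assumptions, not reflex's actual format):

```python
import json
import tempfile
from pathlib import Path

def load_checkpoint(path: Path) -> dict:
    """Restore val history, best prompt, and token totals from a prior session.

    Returns fresh state when no checkpoint exists (first run).
    """
    if not path.exists():
        return {"val_history": [], "best_prompt": None, "total_tokens": 0}
    return json.loads(path.read_text())

def save_checkpoint(path: Path, state: dict) -> None:
    """Persist run state so an interrupted run can be resumed."""
    path.write_text(json.dumps(state))

# Demo: save a mid-run state, then restore it as --resume would.
ckpt = Path(tempfile.mkdtemp()) / "reflex.ckpt"
save_checkpoint(ckpt, {"val_history": [0.40, 0.55], "best_prompt": "v2",
                       "total_tokens": 4200})
state = load_checkpoint(ckpt)
```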
Full token accounting. Eval tokens and reasoning tokens are tracked per
iteration and accumulated across sessions, including resumed runs. The final
results show the true total cost of the optimization, not just the last session.
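Cross-session accounting is just summing each iteration's eval and reasoning tokens onto whatever the previous sessions recorded. A minimal sketch (the field names are assumptions for illustration):

```python
def accumulate_tokens(prior_total: int, iterations: list[dict]) -> int:
    """Add this session's eval + reasoning tokens onto prior sessions' total."""
    session = sum(it["eval_tokens"] + it["reasoning_tokens"] for it in iterations)
    return prior_total + session

# Two iterations from a resumed session, on top of 10,000 tokens already spent.
total = accumulate_tokens(
    10_000,
    [{"eval_tokens": 1200, "reasoning_tokens": 300},
     {"eval_tokens": 900, "reasoning_tokens": 450}],
)
# total == 12_850: the true cumulative cost, not just this session's
```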
Overfitting protection. An optional validation split monitors generalization
throughout training. The best prompt is selected against the val set — so the
final test eval reflects real-world performance rather than a prompt tuned to
the specific examples it was optimized on.
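The selection rule described above can be sketched in a few lines: rank candidate prompts by their validation score, never their training score. The scores below are made up to show why this matters.

```python
def select_best(candidates: list[dict]) -> dict:
    """Pick the prompt that generalizes best, judged on the held-out val split."""
    return max(candidates, key=lambda c: c["val_score"])

candidates = [
    {"prompt": "v1", "train_score": 0.95, "val_score": 0.70},  # overfit to train
    {"prompt": "v2", "train_score": 0.88, "val_score": 0.84},  # generalizes better
]
best = select_best(candidates)  # "v2" wins despite the lower train score
```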
What it does
Given a dataset and a starting prompt, aevyra-reflex:
- Runs a baseline eval on a held-out test set to measure the starting score
- Optimizes the prompt on the training set, iterating until the score meets the target
- Re-evaluates on the held-out test set so reported improvement is honest
- Returns the optimized prompt with a full before/after comparison, token costs, and significance test
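The steps above can be sketched as a loop: baseline on the held-out test set, iterate on train until the target is met, then re-evaluate on test. The scorer and rewrite step here are stubs standing in for real model calls, not reflex's internals.

```python
def optimize(prompt, train, test, score, rewrite, target=0.9, max_iters=5):
    """Baseline on test, iterate on train, then re-evaluate on test."""
    baseline = score(prompt, test)            # honest starting point
    for _ in range(max_iters):
        if score(prompt, train) >= target:    # stop once the target is met
            break
        prompt = rewrite(prompt)              # one optimization step
    final = score(prompt, test)               # held-out re-eval
    return prompt, baseline, final

# Stubs: score grows with prompt length; rewrite appends an instruction.
score = lambda p, data: min(len(p) / 40, 1.0)
rewrite = lambda p: p + " Be concise."
prompt, before, after = optimize("Answer the question.", [], [], score, rewrite)
```

Reporting both `before` and `after` on the same held-out set is what keeps the claimed improvement honest.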
When to use it
- You ran verdict and model A beats model B — you want to close the gap through prompt engineering, not by switching models
- A model scores poorly on your eval — you want a better prompt, not a bigger model
- You’re iterating on a system prompt and want to automate the feedback loop
- You want to understand why a prompt works (the analysis teaches prompt engineering)
- You’re migrating a prompt from one model family to another (e.g. Claude → Llama)
Optimization strategies
The auto strategy (default) picks the right technique for each phase. You can also run any strategy directly.
Auto
Multi-phase pipeline — structural → iterative → fewshot, chosen adaptively
Iterative
Diagnose failures, revise wording, repeat. Handles label-free datasets.
Structural
Reorganize formatting, sections, and hierarchy
PDO
Tournament-style search with dueling bandits and adaptive ranking
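A toy version of tournament-style selection: duel candidates pairwise, record wins, rank by win count. Real PDO uses dueling bandits with adaptive ranking to avoid comparing every pair; this sketch only shows the tournament skeleton, with a deterministic judge standing in for model-based comparison.

```python
from collections import Counter
from itertools import combinations

def tournament(candidates: list[str], duel) -> list[str]:
    """Rank candidates by round-robin duel wins (highest first)."""
    wins = Counter({c: 0 for c in candidates})
    for a, b in combinations(candidates, 2):
        wins[duel(a, b)] += 1            # credit the duel winner
    return [c for c, _ in wins.most_common()]

# Stub judge: the longer prompt wins each duel.
duel = lambda a, b: a if len(a) >= len(b) else b
ranking = tournament(["short", "a medium prompt", "the longest prompt here"], duel)
```

A dueling-bandit variant would sample which pairs to duel next based on uncertainty, trading exactness for far fewer comparisons.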
How it fits together
CLI reference
All commands and flags
Strategies
How each optimization axis works
Configuration
Tuning iterations, thresholds, and parallelism
Providers
OpenAI, Anthropic, Gemini, Ollama, and more