
Install

pip install aevyra-reflex
That’s the entire setup. No YAML config, no framework to configure, no separate server. This also installs aevyra-verdict for evaluation.

Set your API keys

Set keys for whichever model provider you’re using:
export OPENROUTER_API_KEY=sk-or-...    # for OpenRouter models (eval + reasoning)
export ANTHROPIC_API_KEY=sk-ant-...    # if using Claude as the reasoning model
export OPENAI_API_KEY=sk-...           # if optimizing an OpenAI model
For local models (Ollama), no key is needed — everything runs on your machine:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model ollama/qwen3:8b

Run the example

The examples/ directory includes a ready-to-run dataset: 100 security incident reports where the task is to produce a strict 3-sentence executive brief. The starting prompt is four words. The model starts at 0.38 and finishes at 0.89 — a 134% improvement, statistically significant on held-out data.
export OPENROUTER_API_KEY=sk-or-...

aevyra-reflex optimize examples/security_incidents.jsonl \
  examples/security_incidents_prompt.md \
  -m openrouter/meta-llama/llama-3.1-8b-instruct \
  --reasoning-model openrouter/qwen/qwen3-8b \
  --judge openrouter/qwen/qwen3-8b \
  --judge-criteria examples/security_incidents_judge.md \
  --max-workers 4 \
  -o examples/security_incidents_best_prompt.md
You’ll see a baseline eval, 4 strategy phases, and a final test set verification:
====================================================
  OPTIMIZATION RESULTS
====================================================
  Train/val/test   : 45 / 20 / 35 samples
  Baseline score   : 0.3786  (on 35-sample test set)
  Final score      : 0.8857  (on 35-sample test set)
  Improvement      : +0.5071 (+134.0%)
  Significance     : p=0.0000  ✓ significant (α=0.05, paired test)
  Iterations       : 10
  Converged        : True
====================================================
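The improvement figure is the score delta relative to the baseline. Recomputing it from the (rounded) scores shown above as a quick sanity check:

```python
baseline = 0.3786
final = 0.8857

delta = final - baseline        # absolute improvement on the test set
pct = 100 * delta / baseline    # improvement relative to the baseline score

# The displayed scores are rounded to 4 decimals, so this lands at
# roughly +133.9%, matching the reported +134.0% up to display rounding.
print(f"{delta:+.4f} ({pct:+.1f}%)")
```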
See the full walkthrough for a phase-by-phase breakdown of every decision reflex made.

Bring your own dataset

Use the same JSONL format as verdict — each line has messages and an ideal answer:
{"messages": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"messages": [{"role": "user", "content": "Explain binary search in one sentence."}], "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time."}
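Datasets in this format can be produced with nothing but the standard library. A minimal sketch (the record contents are placeholders, not part of the examples/ dataset):

```python
import json

# Each record pairs a chat-style "messages" list with an "ideal" reference answer.
records = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "ideal": "Paris",
    },
    {
        "messages": [{"role": "user", "content": "Explain binary search in one sentence."}],
        "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time.",
    },
]

# JSONL: one JSON object per line, no enclosing array.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```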
CSV is also supported:
aevyra-reflex optimize data.csv prompt.md -m openrouter/meta-llama/llama-3.1-8b-instruct
No ideal answers? Use an LLM judge instead of automated metrics — see Label-free evaluation.
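If your data starts life as a spreadsheet, converting it to the JSONL layout above takes a few lines of standard-library Python. A sketch, assuming hypothetical column names question and ideal (substitute your actual CSV headers):

```python
import csv
import json

# Hypothetical input: a CSV with "question" and "ideal" columns.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "ideal"])
    writer.writerow(["What is the capital of France?", "Paris"])

# Convert each CSV row into a messages/ideal JSONL record.
with open("data.csv", newline="", encoding="utf-8") as src, \
     open("data.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            "messages": [{"role": "user", "content": row["question"]}],
            "ideal": row["ideal"],
        }
        dst.write(json.dumps(record) + "\n")
```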

Write a starting prompt

Create a plain text file with your system prompt. It doesn’t need to be good — reflex will improve it:
You are a helpful assistant. Answer questions concisely.

Use the Python API

from aevyra_verdict import Dataset, LLMJudge
from aevyra_verdict.providers import OpenRouterProvider
from aevyra_reflex import PromptOptimizer
from pathlib import Path

result = (
    PromptOptimizer()
    .set_dataset(Dataset.from_jsonl("examples/security_incidents.jsonl"))
    .add_provider("openrouter", "meta-llama/llama-3.1-8b-instruct")
    .add_metric(LLMJudge(
        judge_provider=OpenRouterProvider(model="qwen/qwen3-8b"),
        criteria=Path("examples/security_incidents_judge.md").read_text(),
    ))
    .run(Path("examples/security_incidents_prompt.md").read_text())
)

print(result.summary())
result.save_best_prompt("best_prompt.md")

Explore the run in the dashboard

Once you have a run, open the dashboard to see score trajectory, prompt diffs between iterations, and the reasoning model’s analysis:
aevyra-reflex dashboard
Opens http://localhost:8128. No separate server, no build step. Click into any run to see what changed each iteration and why.

Set a real target (verdict → reflex)

Instead of an arbitrary threshold, set the target from a real benchmark. If you already ran aevyra-verdict, pass the results file:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --verdict-results results.json \
  -o best_prompt.md
Or let reflex benchmark for you in one command:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --target openai/gpt-4o-mini \
  -o best_prompt.md

Next steps

Tutorial: Full walkthrough of the security incidents example
Dashboard: Score charts, prompt diffs, branch runs
Strategies: Auto, iterative, structural, PDO, fewshot
Configuration: Iterations, thresholds, parallelism, strategy params