
Install

pip install aevyra-reflex
That’s the entire setup. No YAML config, no framework to configure, no separate server. This also installs aevyra-verdict for evaluation.

Set your API keys

Set keys for whichever model provider you’re using:
export OPENROUTER_API_KEY=sk-or-...    # for OpenRouter models (eval + reasoning)
export ANTHROPIC_API_KEY=sk-ant-...    # if using Claude as the reasoning model
export OPENAI_API_KEY=sk-...           # if optimizing an OpenAI model
For local models (Ollama), no key is needed — everything runs on your machine:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --reasoning-model ollama/qwen3:8b

Run the example

The examples/ directory includes a ready-to-run dataset: 100 security incident reports where the task is to produce a strict 3-sentence executive brief. The starting prompt is four words. The model starts at 0.38 and finishes at 0.89 — a 134% improvement, statistically significant on held-out data.
export OPENROUTER_API_KEY=sk-or-...

aevyra-reflex optimize examples/security_incidents.jsonl \
  examples/security_incidents_prompt.md \
  -m openrouter/meta-llama/llama-3.1-8b-instruct \
  --reasoning-model openrouter/qwen/qwen3-8b \
  --judge openrouter/qwen/qwen3-8b \
  --judge-criteria examples/security_incidents_judge.md \
  --max-workers 4 \
  -o examples/security_incidents_best_prompt.md
You’ll see a baseline eval, 4 strategy phases, and a final test set verification:
====================================================
  OPTIMIZATION RESULTS
====================================================
  Train/val/test   : 45 / 20 / 35 samples
  Baseline score   : 0.3786  (on 35-sample test set)
  Final score      : 0.8857  (on 35-sample test set)
  Improvement      : +0.5071 (+134.0%)
  Significance     : p=0.0000  ✓ significant (α=0.05, paired test)
  Iterations       : 10
  Converged        : True
====================================================
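The improvement figure is the score delta relative to the baseline. Recomputing it from the (rounded) scores shown above as a quick sanity check:

```python
baseline = 0.3786
final = 0.8857

delta = final - baseline        # absolute improvement on the test set
pct = 100 * delta / baseline    # improvement relative to the baseline score

# The displayed scores are rounded to 4 decimals, so this lands at
# roughly +133.9%, matching the reported +134.0% up to display rounding.
print(f"{delta:+.4f} ({pct:+.1f}%)")
```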
See the full walkthrough for a phase-by-phase breakdown of every decision reflex made.

Bring your own dataset

Use the same JSONL format as verdict — each line has messages and an ideal answer:
{"messages": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"messages": [{"role": "user", "content": "Explain binary search in one sentence."}], "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time."}
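Datasets in this format can be produced with nothing but the standard library. A minimal sketch (the record contents are placeholders, not part of the examples/ dataset):

```python
import json

# Each record pairs a chat-style "messages" list with an "ideal" reference answer.
records = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "ideal": "Paris",
    },
    {
        "messages": [{"role": "user", "content": "Explain binary search in one sentence."}],
        "ideal": "Binary search repeatedly halves a sorted array to find a target value in O(log n) time.",
    },
]

# JSONL: one JSON object per line, no enclosing array.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```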
CSV is also supported:
aevyra-reflex optimize data.csv prompt.md -m openrouter/meta-llama/llama-3.1-8b-instruct
No ideal answers? Use an LLM judge instead of automated metrics — see Label-free evaluation.
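If your data starts life as a spreadsheet, converting it to the JSONL layout above takes a few lines of standard-library Python. A sketch, assuming hypothetical column names question and ideal (substitute your actual CSV headers):

```python
import csv
import json

# Hypothetical input: a CSV with "question" and "ideal" columns.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "ideal"])
    writer.writerow(["What is the capital of France?", "Paris"])

# Convert each CSV row into a messages/ideal JSONL record.
with open("data.csv", newline="", encoding="utf-8") as src, \
     open("data.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            "messages": [{"role": "user", "content": row["question"]}],
            "ideal": row["ideal"],
        }
        dst.write(json.dumps(record) + "\n")
```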

Write a starting prompt

Create a plain text file with your system prompt. It doesn’t need to be good — reflex will improve it:
You are a helpful assistant. Answer questions concisely.

Use the Python API

from aevyra_verdict import Dataset, LLMJudge
from aevyra_verdict.providers import OpenRouterProvider
from aevyra_reflex import PromptOptimizer
from pathlib import Path

result = (
    PromptOptimizer()
    .set_dataset(Dataset.from_jsonl("examples/security_incidents.jsonl"))
    .add_provider("openrouter", "meta-llama/llama-3.1-8b-instruct")
    .add_metric(LLMJudge(
        judge_provider=OpenRouterProvider(model="qwen/qwen3-8b"),
        criteria=Path("examples/security_incidents_judge.md").read_text(),
    ))
    .run(Path("examples/security_incidents_prompt.md").read_text())
)

print(result.summary())
result.save_best_prompt("best_prompt.md")

Explore the run in the dashboard

Once you have a run, open the dashboard to see score trajectory, prompt diffs between iterations, and the reasoning model’s analysis:
aevyra-reflex dashboard
Opens http://localhost:8128. No separate server, no build step. Click into any run to see what changed each iteration and why.

Set a real target (verdict → reflex)

Instead of an arbitrary threshold, set the target from a real benchmark. If you already ran aevyra-verdict, pass the results file:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --verdict-results results.json \
  -o best_prompt.md
Or let reflex benchmark for you in one command:
aevyra-reflex optimize dataset.jsonl prompt.md \
  -m local/llama3.1:8b \
  --target openai/gpt-4o-mini \
  -o best_prompt.md

Next steps

Tutorial: Full walkthrough of the security incidents example
Dashboard: Score charts, prompt diffs, branch runs
Strategies: Auto, iterative, structural, PDO, fewshot
Configuration: Iterations, thresholds, parallelism, strategy params