Install
Set your API keys
Set keys for whichever model provider you’re using:Run the example
Theexamples/ directory includes a ready-to-run dataset: 100 security
incident reports where the task is to produce a strict 3-sentence executive
brief. The starting prompt is four words. The model starts at 0.38 and finishes
at 0.89 — a 134% improvement, statistically significant on held-out data.
Bring your own dataset
Use the same JSONL format as verdict — each line hasmessages and an ideal answer:
Write a starting prompt
Create a plain text file with your system prompt. It doesn’t need to be good — reflex will improve it:Use the Python API
Explore the run in the dashboard
Once you have a run, open the dashboard to see score trajectory, prompt diffs between iterations, and the reasoning model’s analysis:http://localhost:8128. No separate server, no build step. Click into
any run to see what changed each iteration and why.
Set a real target (verdict → reflex)
Instead of an arbitrary threshold, set the target from a real benchmark. If you already ranaevyra-verdict, pass the results file:
Next steps
Tutorial
Full walkthrough of the security incidents example
Dashboard
Score charts, prompt diffs, branch runs
Strategies
Auto, iterative, structural, PDO, fewshot
Configuration
Iterations, thresholds, parallelism, strategy params