## What gets recorded per iteration
For every iteration, reflex saves:

| Field | Description |
|---|---|
| `iteration` | Iteration number |
| `system_prompt` | The exact prompt used |
| `score` | Train score for this iteration |
| `val_score` | Validation score (when a val split is active) |
| `scores_by_metric` | Per-metric breakdown (rouge, bleu, judge, etc.) |
| `reasoning` | The reasoning model's full explanation of what it changed and why |
| `change_summary` | A one-line human-readable summary of the change (e.g. "Added numbered reasoning steps and explicit output constraints") |
| `eval_tokens` | Tokens consumed by the target model and eval scoring this iteration |
| `reasoning_tokens` | Tokens consumed by the reasoning model |
| `elapsed_seconds` | Cumulative wall time from run start at the point this iteration completed |
| `timestamp` | ISO 8601 timestamp |
Each record is written to `.reflex/runs/<run_id>/iterations/` as an individual JSON file — one per iteration — and is never overwritten.
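As a sketch, those per-iteration files can be read back with nothing but the standard library. The zero-padded file-naming pattern below is an assumption — check a real run's `iterations/` directory:

```python
import json
import pathlib
import tempfile

def load_iterations(run_dir):
    """Load every iteration record under <run_dir>/iterations/, ordered by iteration number."""
    files = pathlib.Path(run_dir, "iterations").glob("*.json")
    records = [json.loads(f.read_text()) for f in files]
    return sorted(records, key=lambda r: r["iteration"])

# Demo with synthetic records standing in for a real run's artifacts.
run_dir = pathlib.Path(tempfile.mkdtemp())
(run_dir / "iterations").mkdir()
for i, score in enumerate([0.41, 0.49, 0.55], start=1):
    record = {"iteration": i, "score": score, "system_prompt": "..."}
    (run_dir / "iterations" / f"{i:04d}.json").write_text(json.dumps(record))

print([r["score"] for r in load_iterations(run_dir)])  # [0.41, 0.49, 0.55]
```

Sorting by the `iteration` field, rather than relying on filesystem order, keeps the history correct regardless of how the files are named.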
## The causal rewrite log
The iterative and auto strategies maintain a causal rewrite log — a compact, structured record of every change made across iterations and its observed effect. The `reasoning` field of each `IterationState` contains the full rationale that produced that iteration's prompt, including the rewrite history that was visible at the time. This means you can reconstruct the exact reasoning chain that led to any prompt.
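That chain can be walked directly from the iteration records. A minimal sketch, using synthetic records with only the fields needed here:

```python
# Synthetic iteration records; a real run reads these from
# .reflex/runs/<run_id>/iterations/ (only the fields used here are shown).
iterations = [
    {"iteration": 2, "score": 0.49, "change_summary": "Added numbered reasoning steps"},
    {"iteration": 1, "score": 0.41, "change_summary": "Baseline prompt"},
    {"iteration": 3, "score": 0.55, "change_summary": "Added explicit output constraints"},
]

chain = []
prev_score = None
for it in sorted(iterations, key=lambda r: r["iteration"]):
    delta = "" if prev_score is None else f" ({it['score'] - prev_score:+.2f})"
    chain.append(f"#{it['iteration']} score {it['score']:.2f}{delta}: {it['change_summary']}")
    prev_score = it["score"]

print("\n".join(chain))
```

Each line pairs a score delta with the one-line summary of the change that produced it, which is the attribution the causal log is meant to support.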
## The run artifact

Every run produces a structured directory under `.reflex/runs/`.
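A plausible layout, inferred from the artifacts named in this document (exact file names may differ from a real run):

```
.reflex/runs/<run_id>/
├── config.json        # every setting that influenced the run
├── result.json        # final summary (append-only)
└── iterations/
    ├── 0001.json      # one IterationState per iteration
    └── ...
```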
result.json and all iteration files are append-only. They are never
modified after being written, making them suitable as audit trail artifacts.
## Reading programmatically
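For example, the final summary in `result.json` can be consumed with plain `json`. The field names below are illustrative assumptions, not reflex's exact schema — inspect a real run's file:

```python
import json

# Stand-in for the contents of .reflex/runs/<run_id>/result.json.
# Field names here are assumptions, not reflex's exact schema.
result_json = '''{
  "run_id": "demo",
  "baseline_score": 0.41,
  "final_score": 0.57,
  "iterations": 6
}'''

result = json.loads(result_json)
improvement = result["final_score"] - result["baseline_score"]
print(f"{result['run_id']}: +{improvement:.2f} over {result['iterations']} iterations")
```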
## Prompt diffs
Every iteration's prompt is stored in full. Reflex computes diffs between consecutive iterations on demand — in the dashboard as a line-level colored diff, and programmatically.
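A sketch of that programmatic path using the standard library's `difflib`; the prompt strings here are stand-ins for two stored `system_prompt` values:

```python
import difflib

# Two consecutive stored prompts (stand-ins for real system_prompt values).
prev_prompt = "You are a helpful assistant.\nAnswer concisely.\n"
curr_prompt = (
    "You are a helpful assistant.\n"
    "Answer concisely.\n"
    "Think step by step before answering.\n"
)

diff_text = "".join(difflib.unified_diff(
    prev_prompt.splitlines(keepends=True),
    curr_prompt.splitlines(keepends=True),
    fromfile="iteration_3",
    tofile="iteration_4",
))
print(diff_text)
```

`unified_diff` yields the familiar `---`/`+++` header and `+`/`-` prefixed lines, which is the same line-level view the dashboard colors.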
## Reproducibility

A run can be reproduced by re-running with the same config; evaluation is repeatable when `eval_temperature=0` (the default). Reasoning model outputs, however, are non-deterministic even at temperature 0 for most providers, so the optimization trajectory itself may differ between runs.
The config.json artifact records every setting that influenced the run,
including the reasoning model, target model, strategy, thresholds, dataset
path, metrics, and train/val/test split ratios. This is sufficient to
reconstruct the conditions of any run.
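A sketch of what `config.json` might contain, limited to the settings this section names (field names are illustrative, not reflex's exact schema):

```json
{
  "reasoning_model": "...",
  "target_model": "...",
  "strategy": "auto",
  "dataset_path": "data/train.jsonl",
  "metrics": ["rouge", "bleu", "judge"],
  "split": {"train": 0.7, "val": 0.15, "test": 0.15},
  "eval_temperature": 0
}
```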
## Using reflex as part of a compliance workflow
### What reflex can provide
- Durable artifacts — all run outputs are written to disk and never modified, providing a tamper-evident record of the optimization process
- Reasoning traces — every prompt revision includes the reasoning model’s explanation of what it changed and why, attributing each score delta to a specific design decision
- Score accountability — baseline and final scores are computed on a held-out test set, separate from the training set used during optimization, reducing the risk of overfitting being mistaken for genuine improvement
- Statistical significance — reflex runs a paired Wilcoxon/t-test and reports whether the improvement is statistically significant at α=0.05, flagging when a score gain may be noise
- Token accounting — full token usage is recorded per iteration and summarized in the final result, supporting cost attribution
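To illustrate the kind of significance test involved — this is not reflex's internal code, just a stdlib sketch of the t-test branch (reflex reports a paired Wilcoxon/t-test) on per-example score deltas:

```python
import math
import statistics

def paired_t(before, after):
    """Paired t-statistic over per-example score deltas (illustrative only)."""
    deltas = [a - b for b, a in zip(before, after)]
    n = len(deltas)
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)  # sample standard deviation
    return mean / (sd / math.sqrt(n))

# Hypothetical per-example scores on a held-out test set.
baseline = [0.42, 0.38, 0.55, 0.47, 0.51, 0.40, 0.44, 0.49]
final    = [0.51, 0.47, 0.60, 0.52, 0.58, 0.46, 0.50, 0.57]

t = paired_t(baseline, final)
# Two-sided critical value for df = n - 1 = 7 at alpha = 0.05 is about 2.365.
print(f"t = {t:.2f}, significant at alpha=0.05: {t > 2.365}")
```

Pairing matters: testing the per-example deltas, rather than the two score lists independently, is what lets a consistent small improvement reach significance.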
### What reflex cannot provide
- Model behavior guarantees — scores reflect performance on your specific evaluation dataset; reflex cannot guarantee the optimized prompt generalizes to all inputs
- Deterministic reasoning — the reasoning model’s choices (which edits to make, which strategy to pursue) are not deterministic and will differ across runs even with the same config
- Human review — reflex automates the iteration loop but does not replace human judgment about whether the optimized prompt is appropriate for deployment
## What the dashboard shows

The dashboard provides a visual audit trail of any run:

- Flow graph — each iteration as a node with its score, score delta, and token usage
- Iteration detail — click any node to see the full prompt (with diff against the previous version), the reasoning model’s explanation, and per-metric scores
- Phase breakdown (auto strategy) — which optimization axis ran in each phase and what it contributed
- Token summary — total eval and reasoning token costs for the run
- Duration — total wall time per iteration and for the full run