Usage
Pass a `callbacks` list to `.run()`:
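A minimal sketch of the wiring, using stand-ins: `DemoOptimizer` and `DemoCallback` below are assumptions that mimic the shape of the real optimizer and of callbacks like `MLflowCallback`, not the library's actual API.

```python
# Sketch only — DemoOptimizer/DemoCallback are stand-ins; substitute the
# library's real optimizer and callback classes.
class DemoCallback:
    """Records lifecycle calls the way a logging callback would."""
    def __init__(self):
        self.events = []

    def on_run_start(self, params):
        self.events.append(("run_start", params))


class DemoOptimizer:
    """Stand-in optimizer: .run() accepts a callbacks list and notifies each one."""
    def run(self, callbacks=()):
        for cb in callbacks:
            cb.on_run_start({"strategy": "reflective"})


cb = DemoCallback()
DemoOptimizer().run(callbacks=[cb])  # callbacks are passed as a list to .run()
```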
MLflow
MLflowCallback logs the full run to an MLflow experiment using MLflow’s
standard tracking API. No server required — by default it writes to a local
./mlruns directory that you can open with mlflow ui.
Install:
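The install command is elided here; MLflow itself installs with the following (whether this library also needs an extras group is not stated in the source):

```shell
pip install mlflow
```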
What gets logged
| When | What |
|---|---|
| Run start | Params: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source |
| Baseline eval | Metric: score_test at step=0 — held-out test score before any optimization |
| Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge) |
| Each iteration | Artifact: iterations.json — table of iteration, score, prompt, reasoning (updated live) |
| Final eval | Metric: score_test at step=N+1 — held-out test score after optimization |
| Run end | Metrics: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations |
| Run end | Artifact: prompts/best_prompt_*.txt — the winning prompt |
Viewing results
- Experiments (left sidebar) → click the experiment name (e.g. `security-incidents`)
- Click Evaluation Runs — the table lists every reflex run with its baseline and best score
- Click a run name to open it, then:
  - Overview — run params (`strategy`, `reasoning_model`, etc.) and summary metrics (`best_score_train`, `baseline_score`, `final_score_test`, `improvement`)
  - Model metrics — score trajectory chart, one point per iteration
  - Artifacts → `iterations.json` — interactive table showing the prompt and reasoning used at each iteration alongside its score
  - Artifacts → `prompts/` — `best_prompt_*.txt` with the final winning prompt
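To launch the UI these steps walk through, point it at the default local store (the `./mlruns` directory mentioned above):

```shell
# Serve the local ./mlruns tracking directory, then open
# http://127.0.0.1:5000 in a browser
mlflow ui --backend-store-uri ./mlruns
```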
Writing a custom callback
Implement any subset of the lifecycle hooks, which fire in order: `on_run_start` → `on_baseline` → `on_iteration` (×N) → `on_final` → `on_run_end`.
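A minimal custom-callback sketch. The hook names follow the lifecycle above, but the argument shapes (dicts for params/summary, floats for scores) are assumptions about what the runner passes:

```python
# Sketch of a custom callback that records every lifecycle event.
# Argument shapes are assumed, not taken from the library's real API.
class HistoryCallback:
    def __init__(self):
        self.history = []

    def on_run_start(self, params):
        self.history.append(("run_start", params.get("strategy")))

    def on_baseline(self, score):
        self.history.append(("baseline", score))

    def on_iteration(self, iteration, score, prompt):
        self.history.append(("iteration", iteration, score))

    def on_final(self, score):
        self.history.append(("final", score))

    def on_run_end(self, summary):
        self.history.append(("run_end", summary.get("improvement")))
```

Since any subset of hooks is allowed, a runner would presumably check each hook (e.g. with `hasattr`) before calling it, so a callback can implement only what it needs.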
Weights & Biases
WandbCallback logs the full run to a W&B project using the standard
wandb Python SDK.
Install:
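The install command is elided here; the W&B SDK itself installs with:

```shell
pip install wandb
```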
Set `mode="offline"` to write runs locally without any network access.
Run wandb sync ./wandb/offline-run-* later to push them.
What gets logged
| When | What |
|---|---|
| Run start | Config: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source |
| Baseline eval | Metric: score_test — held-out test score before any optimization |
| Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge) |
| Final eval | Metric: score_test — held-out test score after optimization |
| Run end | Summary: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations |
| Run end | Artifact: best-prompt (type: prompt) containing best_prompt.txt |
W&B charts `score_train`, `score_val`, and `score_test` (baseline + final) as separate series,
giving you a clear view of train vs. validation vs. held-out test performance across the run.
Summary fields appear in the run Overview table, making it easy to compare runs side by side.