Reflex ships with a callback system that lets you stream run data to experiment tracking platforms. Callbacks are optional — they never affect core optimization behavior, and a broken callback will never crash a run.
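That isolation can be pictured as a guarded dispatch loop. The sketch below is hypothetical (not aevyra-reflex internals): each hook call is wrapped so one failing callback cannot interrupt the others or the run:

```python
# Hypothetical sketch of callback isolation (not the library's actual code):
# each hook call is wrapped so a raising callback is logged and skipped.
import logging

logger = logging.getLogger(__name__)

def dispatch(callbacks, hook_name, *args):
    """Call hook_name on every callback that defines it; never raise."""
    for cb in callbacks:
        method = getattr(cb, hook_name, None)  # missing methods are skipped
        if method is None:
            continue
        try:
            method(*args)
        except Exception:
            logger.exception("callback %r failed in %s", cb, hook_name)

class Broken:
    def on_iteration(self, record):
        raise RuntimeError("boom")

class Counter:
    def __init__(self):
        self.calls = 0
    def on_iteration(self, record):
        self.calls += 1

counter = Counter()
dispatch([Broken(), counter], "on_iteration", {"iteration": 1})
print(counter.calls)  # prints 1 — the broken callback did not stop dispatch
```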

Usage

Pass a callbacks list to .run():
from aevyra_reflex import PromptOptimizer, MLflowCallback

result = (
    PromptOptimizer()
    .set_dataset(dataset)
    .add_provider("openai", "gpt-4o-mini")
    .add_metric(RougeScore())
    .run("You are a helpful assistant.", callbacks=[MLflowCallback()])
)
Multiple callbacks can be composed freely:
result = optimizer.run(prompt, callbacks=[MLflowCallback(), MyCustomCallback()])

MLflow

MLflowCallback logs the full run to an MLflow experiment using MLflow’s standard tracking API. No server required — by default it writes to a local ./mlruns directory that you can open with mlflow ui. Install:
pip install aevyra-reflex[mlflow]
CLI:
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 --mlflow

# Custom experiment name and remote tracking server
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 \
  --mlflow \
  --mlflow-experiment security-incidents \
  --mlflow-tracking-uri http://localhost:5000
Python API:
from aevyra_reflex import MLflowCallback

result = optimizer.run(prompt, callbacks=[MLflowCallback()])
With options:
cb = MLflowCallback(
    run_name="summarization-v2",
    tracking_uri="http://localhost:5000",  # remote MLflow server
    experiment_name="prompt-experiments",
    tags={"team": "nlp", "dataset": "cnn-dm"},
    log_prompt_each_iter=True,             # save prompt artifact every iteration
)
result = optimizer.run(prompt, callbacks=[cb])

What gets logged

When | What
Run start | Params: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source
Baseline eval | Metric: score_test at step=0 — held-out test score before any optimization
Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge)
Each iteration | Artifact: iterations.json — table of iteration, score, prompt, reasoning (updated live)
Final eval | Metric: score_test at step=N+1 — held-out test score after optimization
Run end | Metrics: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations
Run end | Artifact: prompts/best_prompt_*.txt — the winning prompt
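Because iterations.json is a plain JSON table of iteration, score, prompt, and reasoning, it is easy to post-process outside MLflow. A minimal sketch, using fabricated sample records with the field names from the table above:

```python
# Sketch: pick the best iteration out of an iterations.json artifact.
# Field names (iteration, score, prompt, reasoning) follow the table above;
# these records are fabricated sample data.
import json

raw = json.dumps([
    {"iteration": 1, "score": 0.41, "prompt": "v1", "reasoning": "baseline tweak"},
    {"iteration": 2, "score": 0.57, "prompt": "v2", "reasoning": "added constraints"},
    {"iteration": 3, "score": 0.53, "prompt": "v3", "reasoning": "too aggressive"},
])

# In practice you would json.load() the artifact file from your mlruns directory.
iterations = json.loads(raw)
best = max(iterations, key=lambda row: row["score"])
print(best["iteration"], best["score"])  # 2 0.57
```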

Viewing results

mlflow ui
# opens at http://localhost:5000
Navigate to your results in three steps:
  1. Experiments (left sidebar) → click the experiment name (e.g. security-incidents)
  2. Click Evaluation Runs — the table lists every reflex run with its baseline and best score
  3. Click a run name to open it, then:
    • Overview — run params (strategy, reasoning_model, etc.) and summary metrics (best_score_train, baseline_score, final_score_test, improvement)
    • Model metrics — score trajectory chart, one point per iteration
    • Artifacts → iterations.json — interactive table showing the prompt and reasoning used at each iteration alongside its score
    • Artifacts → prompts/best_prompt_*.txt with the final winning prompt

Writing a custom callback

Implement any subset of the five lifecycle methods:
class MyCallback:
    def on_run_start(self, config, initial_prompt: str) -> None:
        """Called once before the baseline eval."""
        print(f"Starting {config.strategy} run")

    def on_baseline(self, snapshot) -> None:
        """Called once after the baseline (test-set) eval completes.

        snapshot fields:
            mean_score (float)      — baseline score on the held-out test set
            scores_by_metric (dict) — per-metric breakdown
            system_prompt (str)     — the initial prompt that was evaluated
        """
        print(f"Baseline: {snapshot.mean_score:.4f}")

    def on_iteration(self, record) -> None:
        """Called after each iteration completes (after val eval if val-ratio is set).

        record fields:
            iteration (int)         — 1-based iteration number
            score (float)           — mean train score this iteration
            val_score (float|None)  — mean val score, or None if no val split
            scores_by_metric (dict) — per-metric breakdown
            system_prompt (str)     — prompt used this iteration
            reasoning (str)         — reasoning model's explanation
        """
        print(f"  #{record.iteration}  train={record.score:.4f}")

    def on_final(self, snapshot) -> None:
        """Called once after the final verification (test-set) eval completes.

        snapshot fields:
            mean_score (float)      — final score on the held-out test set
            scores_by_metric (dict) — per-metric breakdown
            system_prompt (str)     — the best prompt that was evaluated
        """
        print(f"Final test: {snapshot.mean_score:.4f}")

    def on_run_end(self, result) -> None:
        """Called once after on_final, with the full result object.

        result fields:
            best_prompt (str)         — the best system prompt found
            best_score (float)        — best train score seen during optimization
            baseline.mean_score       — test-set score before optimization
            final.mean_score          — test-set score after optimization
            improvement (float)       — final - baseline
            improvement_pct (float)   — improvement as a percentage
            score_trajectory (list)   — train score after each iteration
            val_trajectory (list)     — val score after each iteration (empty if no val split)
            phase_history (list)      — auto strategy phase breakdown
        """
        print(f"Done. {result.baseline.mean_score:.4f} → {result.final.mean_score:.4f}")

result = optimizer.run(prompt, callbacks=[MyCallback()])
You only need to implement the methods you care about. Callbacks that are missing a method are skipped silently — there is no base class to inherit from. The lifecycle order is: on_run_start → on_baseline → on_iteration (×N) → on_final → on_run_end.
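As a concrete instance of the subset rule, here is a hypothetical callback that implements only on_iteration and on_run_end, appending per-iteration scores to a JSONL file. The driver at the end fakes a record object just to exercise it (normally the optimizer supplies it):

```python
# Minimal custom callback: log each iteration's scores to a JSONL file.
# Only two of the five lifecycle methods are implemented; the runner
# skips the missing ones silently.
import json
import os
import tempfile
from types import SimpleNamespace

class JsonlCallback:
    def __init__(self, path):
        self.path = path

    def on_iteration(self, record):
        # record fields follow the docstrings above: iteration, score, val_score
        with open(self.path, "a") as f:
            f.write(json.dumps({
                "iteration": record.iteration,
                "train": record.score,
                "val": record.val_score,
            }) + "\n")

    def on_run_end(self, result):
        print(f"improvement: {result.improvement:+.4f}")

# Exercise on_iteration with a faked record object:
path = os.path.join(tempfile.mkdtemp(), "scores.jsonl")
cb = JsonlCallback(path)
cb.on_iteration(SimpleNamespace(iteration=1, score=0.62, val_score=None))
```

In a real run you would pass it as `optimizer.run(prompt, callbacks=[JsonlCallback("scores.jsonl")])`.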

Weights & Biases

WandbCallback logs the full run to a W&B project using the standard wandb Python SDK. Install:
pip install aevyra-reflex[wandb]
wandb login  # one-time setup
CLI:
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 --wandb

# Custom project name
aevyra-reflex optimize dataset.jsonl prompt.md -m openrouter/llama3 \
  --wandb \
  --wandb-project security-incidents
Python API:
from aevyra_reflex import WandbCallback

result = optimizer.run(prompt, callbacks=[WandbCallback()])
With options:
cb = WandbCallback(
    project="prompt-experiments",
    run_name="summarization-v2",
    entity="my-team",
    tags=["auto", "cnn-dm"],
    log_prompt_each_iter=True,  # log prompt text as a W&B Table each iteration
)
result = optimizer.run(prompt, callbacks=[cb])
Testing without a W&B account: Use mode="offline" to write runs locally without any network access. Run wandb sync ./wandb/offline-run-* later to push them.
cb = WandbCallback(mode="offline")

What gets logged

When | What
Run start | Config: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source
Baseline eval | Metric: score_test — held-out test score before any optimization
Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge)
Final eval | Metric: score_test — held-out test score after optimization
Run end | Summary: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations
Run end | Artifact: best-prompt (type: prompt) containing best_prompt.txt
The Charts tab shows score_train, score_val, and score_test (baseline + final) as separate series, giving you a clear view of train vs. validation vs. held-out test performance across the run. Summary fields appear in the run Overview table, making it easy to compare runs side by side.
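The improvement and improvement_pct summary fields tie the two score_test points together. A minimal sketch of the presumed arithmetic (assuming improvement is final minus baseline and the percentage is relative to baseline; the formula is inferred from the field names, not confirmed by the docs):

```python
# Assumed derivation of the run-end summary fields from the two
# held-out test scores; the exact formula is an inference, not
# taken from aevyra-reflex source.
baseline_score = 0.42    # score_test logged at the baseline eval
final_score_test = 0.55  # score_test logged at the final eval

improvement = final_score_test - baseline_score
improvement_pct = 100.0 * improvement / baseline_score

print(f"improvement={improvement:.2f} ({improvement_pct:.1f}%)")
```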

Using both together

MLflow and W&B can run simultaneously — pass both in the callbacks list:
result = optimizer.run(prompt, callbacks=[
    MLflowCallback(experiment_name="reflex"),
    WandbCallback(project="reflex"),
])