Usage
Pass a `callbacks` list to `.run()`:
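A minimal sketch of the wiring, using stand-ins: `DemoOptimizer` and `DemoCallback` below are assumptions that mimic the shape of the real optimizer and of callbacks like `MLflowCallback`, not the library's actual API.

```python
# Sketch only — DemoOptimizer/DemoCallback are stand-ins; substitute the
# library's real optimizer and callback classes.
class DemoCallback:
    """Records lifecycle calls the way a logging callback would."""
    def __init__(self):
        self.events = []

    def on_run_start(self, params):
        self.events.append(("run_start", params))


class DemoOptimizer:
    """Stand-in optimizer: .run() accepts a callbacks list and notifies each one."""
    def run(self, callbacks=()):
        for cb in callbacks:
            cb.on_run_start({"strategy": "reflective"})


cb = DemoCallback()
DemoOptimizer().run(callbacks=[cb])  # callbacks are passed as a list to .run()
```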
MLflow
MLflowCallback logs the full run to an MLflow experiment using MLflow’s
standard tracking API. No server required — by default it writes to a local
./mlruns directory that you can open with mlflow ui.
Install:
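The install command is elided here; MLflow itself installs with the following (whether this library also needs an extras group is not stated in the source):

```shell
pip install mlflow
```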
What gets logged
| When | What |
|---|---|
| Run start | Params: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source |
| Baseline eval | Metric: score_test at step=0 — held-out test score before any optimization |
| Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge) |
| Each iteration | Artifact: iterations.json — table of iteration, score, prompt, reasoning (updated live) |
| Final eval | Metric: score_test at step=N+1 — held-out test score after optimization |
| Run end | Metrics: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations |
| Run end | Artifact: prompts/best_prompt_*.txt — the winning prompt |
Viewing results
- Experiments (left sidebar) → click the experiment name (e.g. `security-incidents`)
- Click Evaluation Runs — the table lists every reflex run with its baseline and best score
- Click a run name to open it, then:
  - Overview — run params (`strategy`, `reasoning_model`, etc.) and summary metrics (`best_score_train`, `baseline_score`, `final_score_test`, `improvement`)
  - Model metrics — score trajectory chart, one point per iteration
  - Artifacts → `iterations.json` — interactive table showing the prompt and reasoning used at each iteration alongside its score
  - Artifacts → `prompts/` — `best_prompt_*.txt` with the final winning prompt
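To launch the UI these steps walk through, point it at the default local store (the `./mlruns` directory mentioned above):

```shell
# Serve the local ./mlruns tracking directory, then open
# http://127.0.0.1:5000 in a browser
mlflow ui --backend-store-uri ./mlruns
```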
Writing a custom callback
Implement any subset of the lifecycle hooks, which fire in order: `on_run_start` → `on_baseline` → `on_iteration` (×N) → `on_final` → `on_run_end`.
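A minimal custom-callback sketch. The hook names follow the lifecycle above, but the argument shapes (dicts for params/summary, floats for scores) are assumptions about what the runner passes:

```python
# Sketch of a custom callback that records every lifecycle event.
# Argument shapes are assumed, not taken from the library's real API.
class HistoryCallback:
    def __init__(self):
        self.history = []

    def on_run_start(self, params):
        self.history.append(("run_start", params.get("strategy")))

    def on_baseline(self, score):
        self.history.append(("baseline", score))

    def on_iteration(self, iteration, score, prompt):
        self.history.append(("iteration", iteration, score))

    def on_final(self, score):
        self.history.append(("final", score))

    def on_run_end(self, summary):
        self.history.append(("run_end", summary.get("improvement")))
```

Since any subset of hooks is allowed, a runner would presumably check each hook (e.g. with `hasattr`) before calling it, so a callback can implement only what it needs.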
Weights & Biases
WandbCallback logs the full run to a W&B project using the standard
wandb Python SDK.
Install:
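The install command is elided here; the W&B SDK itself installs with:

```shell
pip install wandb
```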
Set `mode="offline"` to write runs locally without any network access.
Run wandb sync ./wandb/offline-run-* later to push them.
What gets logged
| When | What |
|---|---|
| Run start | Config: strategy, reasoning_model, max_iterations, score_threshold, temperature, max_workers, target_model, target_source |
| Baseline eval | Metric: score_test — held-out test score before any optimization |
| Each iteration | Metrics: score_train, score_val (when --val-ratio is set), score_<metric> (e.g. score_rouge) |
| Final eval | Metric: score_test — held-out test score after optimization |
| Run end | Summary: best_score_train, baseline_score, final_score_test, improvement, improvement_pct, converged, total_iterations |
| Run end | Artifact: best-prompt (type: prompt) containing best_prompt.txt |
W&B charts `score_train`, `score_val`, and `score_test` (baseline + final) as separate series,
giving you a clear view of train vs. validation vs. held-out test performance across the run.
Summary fields appear in the run Overview table, making it easy to compare runs side by side.