# OptimizationResult

Returned by PromptOptimizer.run(). Contains the optimized prompt, scores,
iteration history, and analysis.
## Properties
| Property | Type | Description |
|---|---|---|
| best_prompt | str | The highest-scoring prompt found |
| best_score | float | The score of the best prompt |
| iterations | list[IterationRecord] | All iteration records |
| converged | bool | Whether the score threshold was reached |
| baseline | EvalSnapshot \| None | Baseline eval snapshot (on held-out test set if split is enabled) |
| final | EvalSnapshot \| None | Final verification snapshot (on held-out test set if split is enabled) |
| train_size | int | Number of training examples used for optimization (0 if no split) |
| test_size | int | Number of held-out test examples used for baseline and final eval (0 if no split) |
| val_size | int | Number of validation examples tracked per-iteration (0 if val_ratio=0) |
| val_trajectory | list[float] | Val set mean score after each optimization iteration (empty if no val split) |
| early_stopped | bool | True if optimization was stopped early because the val score plateaued |
| batch_size | int | Per-iteration mini-batch size (0 = full training set was used) |
| p_value | float \| None | p-value from a paired significance test (Wilcoxon or t-test). None if fewer than 2 samples or scipy not installed |
| is_significant | bool \| None | True if p_value < 0.05. None when p_value is unavailable |
| total_eval_tokens | int | Total tokens used by the eval model across the run |
| total_reasoning_tokens | int | Total tokens used by the reasoning model across the run |
| strategy_name | str \| None | Strategy that was used |
| phase_history | list[dict] \| None | Auto mode phase breakdown |
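The is_significant flag is described above as a simple threshold on p_value, with None propagated when the test could not be run. A minimal sketch of that rule (the alpha parameter name and standalone function form are illustrative assumptions, not the library's code):

```python
def is_significant(p_value, alpha=0.05):
    """Threshold a paired-test p-value; None means the test was unavailable."""
    if p_value is None:
        return None  # fewer than 2 samples, or scipy not installed
    return p_value < alpha

print(is_significant(0.03))  # True
print(is_significant(None))  # None
```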
## Computed properties
| Property | Type | Description |
|---|---|---|
| score_trajectory | list[float] | Score at each iteration |
| improvement | float \| None | Absolute score improvement (final − baseline) |
| improvement_pct | float \| None | Improvement as a percentage of the baseline score |
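As a rough sketch of how these two computed properties relate to the baseline and final snapshots (this mirrors the descriptions above, not the library's actual code; the percentage is assumed to be taken relative to the baseline mean):

```python
def improvement(baseline_mean, final_mean):
    """Absolute score improvement: final mean minus baseline mean."""
    return final_mean - baseline_mean

def improvement_pct(baseline_mean, final_mean):
    """Improvement as a percentage of the baseline; None when baseline is 0."""
    if baseline_mean == 0:
        return None
    return 100.0 * (final_mean - baseline_mean) / baseline_mean
```

For example, a baseline mean of 0.50 and a final mean of 0.60 gives an improvement of 0.10 and an improvement_pct of 20%.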
## Methods

### summary()

Returns a formatted string with scores, trajectory, strategy analysis,
prompt diff, and before/after example.

### to_dict()

Serialize to a dictionary.

### to_json(path)

Save full results to a JSON file.

### save_best_prompt(path)

Write the optimized prompt to a text file.
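A rough sketch of what the two file-writing methods do; these standalone functions are illustrative stand-ins for the real methods, and the dictionary contents shown are hypothetical:

```python
import json
import tempfile
from pathlib import Path

def to_json(result_dict, path):
    """Sketch of to_json: dump the full result dictionary as JSON."""
    Path(path).write_text(json.dumps(result_dict, indent=2))

def save_best_prompt(best_prompt, path):
    """Sketch of save_best_prompt: write just the prompt as plain text."""
    Path(path).write_text(best_prompt)

with tempfile.TemporaryDirectory() as d:
    to_json({"best_score": 0.87, "converged": True}, f"{d}/result.json")
    save_best_prompt("You are a careful assistant.", f"{d}/best_prompt.txt")
    print(Path(f"{d}/best_prompt.txt").read_text())  # You are a careful assistant.
```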
# IterationRecord

A single optimization iteration.

| Property | Type | Description |
|---|---|---|
| iteration | int | Iteration number |
| system_prompt | str | The prompt used in this iteration |
| score | float | Overall score |
| scores_by_metric | dict[str, float] | Per-metric scores |
| reasoning | str | Agent’s reasoning for the change |
| eval_tokens | int | Tokens used by the eval model this iteration |
| reasoning_tokens | int | Tokens used by the reasoning model this iteration |
| change_summary | str | One-line description of what the agent changed (e.g. “Added output format constraints”) |
| val_score | float \| None | Validation set score for this iteration (None when val_ratio=0) |
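The per-iteration val_score values drive the early-stopping behavior noted under early_stopped above. A minimal sketch of one plausible plateau rule over the validation trajectory (the patience and min_delta names and defaults are assumptions, not the library's actual criterion):

```python
def plateaued(val_trajectory, patience=3, min_delta=0.001):
    """True if the best val score in the last `patience` iterations failed
    to beat the best earlier score by at least min_delta."""
    if len(val_trajectory) <= patience:
        return False
    best_before = max(val_trajectory[:-patience])
    return max(val_trajectory[-patience:]) < best_before + min_delta

print(plateaued([0.50, 0.62, 0.62, 0.62, 0.62]))  # True: no recent gain
print(plateaued([0.50, 0.55, 0.61, 0.68, 0.74]))  # False: still improving
```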
# EvalSnapshot

Scores from a single eval run (baseline or final).

| Property | Type | Description |
|---|---|---|
| mean_score | float | Mean score across all samples (and across runs when eval_runs > 1) |
| std_score | float | Std dev of per-run mean scores. 0.0 when eval_runs=1 |
| n_runs | int | Number of eval passes averaged to produce mean_score. 1 by default |
| scores_by_metric | dict[str, float] | Per-metric mean scores |
| system_prompt | str | The system prompt used |
| samples | list[SampleSnapshot] | Per-sample results (scores averaged across runs when eval_runs > 1) |
| total_tokens | int | Total tokens used by the eval model in this snapshot |
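The mean_score/std_score semantics above (mean across samples and runs; std dev of the per-run means, 0.0 for a single run) can be sketched as follows, assuming the raw data arrives as one list of per-sample scores per eval run — a hypothetical input shape for illustration:

```python
from statistics import mean, pstdev

def snapshot_stats(runs):
    """Aggregate per-run sample scores into (mean_score, std_score):
    the mean of the per-run means, and their population std dev
    (0.0 for a single run, matching eval_runs=1)."""
    run_means = [mean(r) for r in runs]
    std = pstdev(run_means) if len(run_means) > 1 else 0.0
    return mean(run_means), std

m, s = snapshot_stats([[0.8, 0.6], [0.9, 0.7]])  # run means 0.7 and 0.8
```

Here m is the average of the two run means (≈0.75) and s is their spread (≈0.05).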
# SampleSnapshot

A single sample’s input, output, and score.

| Property | Type | Description |
|---|---|---|
| input | str | The input prompt |
| response | str | The model’s response |
| ideal | str | The reference answer |
| score | float | Score for this sample |