The problem
A security team needs executive briefs from incident reports. The ideal format is strict:

- Sentence 1 — what happened and the attack vector
- Sentence 2 — duration, affected count, and financial cost
- Sentence 3 — containment action and longer-term controls
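A contract this strict is easy to check mechanically. A minimal sketch of such a check (the function name `check_brief` and the naive sentence splitter are illustrative, not part of the pipeline):

```python
import re

def check_brief(text: str) -> bool:
    """Return True if the brief is exactly three sentences.

    Naive splitter: a sentence ends at '.', '!' or '?' followed by
    whitespace. Real briefs contain figures like "$1.2M" (no split,
    since no whitespace follows the dot) but abbreviations would still
    trip it up; this only illustrates the format contract.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(sentences) == 3

check_brief(
    "Attackers phished a contractor account to gain VPN access. "
    "The intrusion lasted six hours, touched 1,200 records, and cost an estimated $40K. "
    "The account was disabled within the hour and MFA is now mandatory."
)
```

A production check would also verify the per-sentence content requirements, which is exactly what the LLM judge below is for.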
security_incidents_prompt.md
The dataset
100 synthetic security incident reports across 15 incident types (ransomware, phishing, credential stuffing, supply chain, API key leaks, and more). Each report is 200–400 words of structured plain text; each ideal is a tight 3-sentence brief. The dataset is split automatically: 45 train / 20 val / 35 test.

The judge
Instead of ROUGE (which measures word overlap, not format compliance), this run uses an LLM judge with a custom rubric:

security_incidents_judge.md
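The judge awards a 1–5 rubric rating that is reported on a 0–1 scale. The exact mapping isn't shown in the rubric excerpt; a plausible sketch, assuming a simple score/5 normalization (an assumption, not documented reflex behavior), under which a consistent 2-rating lands near the 0.38 baseline seen below:

```python
def normalize_rubric(score: int, max_score: int = 5) -> float:
    """Map a 1..max_score rubric rating onto a 0..1 scale.

    Assumed mapping: score / max_score. Under it, a judge that
    consistently awards 2/5 yields 0.40 — close to the 0.3786 baseline.
    """
    if not 1 <= score <= max_score:
        raise ValueError(f"rubric score must be in 1..{max_score}, got {score}")
    return score / max_score
```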
Running the optimization
What happened: phase by phase
The auto strategy ran 4 phases over 90 minutes, chaining strategies based on what the score trajectory revealed.
Baseline: 0.3786
The starting prompt scores 0.3786 on the held-out test set. A score of 0.38 on a 1–5 rubric normalized to 0–1 means the model is consistently producing 2-rated output — free-form prose with the right facts but none of the required structure.

Phase 1 — structural (12 minutes)
The reasoning model identified this as a structural problem: the prompt says nothing about output format. It generated 4 structural variants and tested each on all 45 training examples in parallel:

| Variant | Score | Delta |
|---|---|---|
| markdown_structure | 0.3889 | +0.0000 — no effect |
| minimal_flat | 0.4167 | +0.0278 — helped |
| section_reorder | 0.4333 | +0.0444 — helped |
| agent_guided | 0.8389 | +0.4500 — helped |
The agent_guided variant — which added explicit numbered instructions, a word limit, and a priority order — jumped the score from 0.39 → 0.84 in a single step. Adding headers or flattening the prompt barely moved the needle. The model just needed to be told what to put in each sentence.
The structural phase converged at 0.8611 after its second iteration, exceeding
the 0.85 target.
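Scoring every variant against all 45 training examples in parallel is simple to sketch. Here `score_example` is a hypothetical stand-in for the model call plus judge scoring; the variant names come from the table above, their prompt texts are invented:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def score_example(prompt: str, example: dict) -> float:
    """Hypothetical stand-in for: run model on example, score with judge."""
    return 0.84 if "numbered" in prompt else 0.39

def score_variant(prompt: str, train_set: list[dict]) -> float:
    """Average judge score for one prompt variant over the train split."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(lambda ex: score_example(prompt, ex), train_set))
    return mean(scores)

variants = {
    "markdown_structure": "Format the brief with markdown headers.",
    "agent_guided": "Write exactly three numbered sentences, under 75 words total.",
}
train = [{"report": f"incident {i}"} for i in range(45)]
best = max(variants, key=lambda name: score_variant(variants[name], train))
```

Threads are the right tool here because each evaluation is I/O-bound (two API calls), not CPU-bound.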
Phase 2 — fewshot (29 minutes)
The reasoning model recommended adding few-shot examples to show the exact phrasing and word economy expected:

“The prompt has good structure but would benefit from concrete examples showing the exact 3-sentence format and word economy expected. Few-shot examples would clarify how to distill complex incident reports into the precise output format while maintaining the priority order of information.”

Reflex bootstrapped exemplar candidates by running the current best prompt across the training set and collecting the 20 highest-scoring outputs. It then tested prompts with 5 examples included.
| Fewshot iteration | Train score |
|---|---|
| 1 | 0.7389 |
| 2 | 0.8278 |
| 3 | 0.8222 |
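The bootstrapping step above reduces to: score every training output, keep the top 20 as a candidate pool, include 5 as exemplars. A sketch under that reading (function name and the take-the-k-best selection are illustrative; reflex may sample the pool differently):

```python
def pick_fewshot(outputs: list[dict], pool_size: int = 20, k: int = 5) -> list[dict]:
    """Keep the top `pool_size` outputs by judge score, then take
    `k` exemplars from that pool (here, simply the k best)."""
    pool = sorted(outputs, key=lambda o: o["score"], reverse=True)[:pool_size]
    return pool[:k]

# 45 training outputs with synthetic scores, one per train example
outputs = [{"id": i, "score": i / 45} for i in range(45)]
exemplars = pick_fewshot(outputs)
```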
Phase 3 — iterative (35 minutes)
The reasoning model switched to iterative refinement to fix the instruction language causing the remaining failures:

“The fewshot phase showed declining performance, suggesting the core instructions may have unclear wording or missing edge case handling that examples alone can’t fix.”

Each iteration diagnosed the worst-scoring samples and proposed a targeted fix:
- Iter 1 → 2 (+0.028): Added explicit 75-word limit and mandatory financial impact inclusion
- Iter 2 → 3 (-0.006): Word count verification and financial guidance restructuring — marginal regression
- Iter 3 → 4 (-0.156): Overly specific conditional logic (“use context-aware rules based on data type, access evidence”) — this hurt badly. The model struggled when given complex conditional rules.
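The selection step each iteration starts from is trivial — the diagnosis and fix proposal are the reasoning model's job. A minimal sketch of picking the worst-scoring samples (names and the sample data are illustrative):

```python
def worst_samples(results: list[dict], n: int = 5) -> list[dict]:
    """Select the n lowest-scoring train samples to diagnose this iteration."""
    return sorted(results, key=lambda r: r["score"])[:n]

results = [
    {"id": "inc-031", "score": 0.55},
    {"id": "inc-007", "score": 0.95},
    {"id": "inc-019", "score": 0.40},
]
focus = worst_samples(results, n=2)  # the two weakest briefs
```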
Phase 4 — PDO (7 minutes)
PDO (Prompt Distribution Optimization) runs tournament-style duels between prompt variants, using Copeland, Borda, Elo, and average win-rate scoring to find the best overall:

“PDO is ideal as a final polish step to identify subtle improvements when other optimization approaches have been exhausted.”
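Of the four duel aggregations, Copeland is the simplest to state: each variant earns one point per head-to-head opponent it beats, and half a point for a tie. A sketch of that scoring, not reflex's implementation:

```python
from itertools import combinations

def copeland(wins: dict[tuple[str, str], int]) -> dict[str, float]:
    """Copeland scores from duel tallies.

    wins[(a, b)] = number of duels a won against b. Each unordered pair
    contributes 1 point to its head-to-head winner, 0.5 to each on a tie.
    """
    names = sorted({name for pair in wins for name in pair})
    scores = {name: 0.0 for name in names}
    for a, b in combinations(names, 2):
        wa, wb = wins.get((a, b), 0), wins.get((b, a), 0)
        if wa > wb:
            scores[a] += 1
        elif wb > wa:
            scores[b] += 1
        else:
            scores[a] += 0.5
            scores[b] += 0.5
    return scores
```

Borda, Elo, and average win-rate weight the same duel results differently; agreement across all four is a good sign the winner is robust.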
Final test set eval
The test set eval uses the best-val prompt (iteration 9: val=0.9000), not the final training-perfect prompt — this guards against overfitting on the training distribution.

Before and after
The same input, the same model, two different prompts.

Input (abridged):

The optimized prompt
security_incidents_best_prompt.md
Score trajectory
Key takeaways
Format problems need structural fixes, not examples. The single biggest gain (+0.45) came from adding explicit numbered instructions and a word limit — before the optimizer had seen a single failure in detail. Examples helped briefly but introduced fragility.

Complex conditional logic hurts small models. The worst regression (-0.16) came from adding “context-aware” estimation rules. Llama 3.1 8B follows clear imperative instructions well; it struggles when given conditional branching within a prompt.

The val split prevents overfitting from being invisible. The training score hit 1.0, but the val score was 0.9 and the test score was 0.886. Without the val checkpoint, reflex would have saved the iteration-10 prompt (train=1.0, val=0.838) instead of the iteration-9 prompt (train=1.0, val=0.9) — a worse choice for generalization.

Small judge models work for format compliance. Qwen3 8B (judge) is a fraction of the cost of GPT-4o, and format compliance is an easy judgment: count the sentences, check the required content is present, verify figures match the report. You don’t need a frontier model for that.

Run it yourself
The dataset, starting prompt, judge rubric, and best prompt are all in examples/: