Why standard prompt optimization fails for agentic systems
Standard prompt optimization has a clean abstraction: a dataset of (input, ideal output) pairs, a model, a judge that compares outputs to ideals, and an optimization loop that rewrites the prompt until scores improve. This works well when the prompt is the only thing between the input and the output.

An agentic system breaks that abstraction immediately. The prompt doesn't produce the output directly; it shapes how the agent behaves across a chain of decisions: which tools it decides to call, whether it verifies arithmetic or estimates from memory, how it incorporates what the tools returned before composing its answer. The final output is downstream of all of those decisions, and a static eval sees none of them, only the question and the final answer. Whether search_docs was called, whether calculate was used or the model estimated, whether get_date was consulted at all: none of that is visible when you compare a string to an ideal.
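As a schematic, the static loop fits in a few lines. Every name here is illustrative; the point is what the judge can and cannot see:

```python
# Schematic of standard prompt optimization. All names are illustrative.
def score_all(prompt, dataset, model, judge):
    # judge(output, ideal) compares one string to another; tool calls,
    # intermediate steps, and grounding are all invisible at this layer.
    return sum(judge(model(prompt, q), ideal) for q, ideal in dataset) / len(dataset)

def optimize(prompt, dataset, model, judge, rewrite, iterations=10):
    best_score, best_prompt = score_all(prompt, dataset, model, judge), prompt
    for _ in range(iterations):
        candidate = rewrite(best_prompt)  # an LLM proposes a revised prompt
        score = score_all(candidate, dataset, model, judge)
        if score > best_score:
            best_score, best_prompt = score, candidate
    return best_prompt
```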
The failure mode is predictable. A model with broad domain knowledge can produce
answers that look correct — right ballpark numbers, plausible API details —
without calling a single tool. A static (input, output) dataset scores those
answers a 4 or 5. A judge watching the full execution trace scores them a 2:
correct answer, grounding bypassed. The two scores say completely different
things about whether the system is working correctly. Optimizing against the
first signal points the prompt in the wrong direction.
How reflex handles this: pipeline mode
Pipeline mode changes what gets evaluated. Instead of scoring a string against an ideal, reflex re-runs your entire agent pipeline on every candidate prompt. When the prompt changes, the agent's tool-calling behaviour changes with it, and the score reflects that change in behaviour, not just in output text.

You write a pipeline_fn that runs your agent and returns an AgentTrace: a structured record of every step the agent took, which tools it called, what they returned, and what the agent finally said. Reflex calls that function on every candidate prompt, passes the full trace to the judge, and uses the score to drive the optimization loop.
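Reflex's actual AgentTrace type isn't shown in this excerpt; as a mental model, a stand-in with the fields the prose describes might look like:

```python
# A stand-in for AgentTrace; the field names are assumptions taken from
# the description above, not reflex's documented type.
from dataclasses import dataclass

@dataclass
class AgentTrace:
    tool_calls: list   # ordered [(tool_name, arguments, result), ...]
    final_answer: str  # what the agent finally said
```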
Because the judge sees the full trace, the rubric can enforce grounding rules
that a static eval cannot express: “if calculate was available and the
question involves arithmetic, a score of 5 requires that calculate was
called.” When the optimizer diagnoses a low-scoring sample, it has the complete
picture — which tools were called, what they returned, where the final answer
diverged — and can write prompt revisions that target the actual failure mode
rather than surface-level output differences.
Two constraints follow from this architecture. First, temperature must be zero
inside pipeline_fn. Reflex compares prompt variants by running the same inputs
multiple times; sampling noise would make comparisons unreliable. Second, the
judge rubric must explicitly penalise bypassed tools — without that rule, a
model that estimates correctly scores the same as one that calls the tool, and
the optimizer has no signal to improve grounding.
The rest of this tutorial walks through a real run that shows both of these in
action.
The task
A developer assistant needs to answer questions about pyflow accurately. Some questions are pure doc lookups. Others require arithmetic (throughput estimates, batch counts, time-to-completion). Some ask how long ago a version was released, which requires today's date. All of them have one thing in common: an answer is only provably correct when it comes from the tools. The starting prompt is intentionally vague:

dev_assistant/prompt.md
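The file isn't reproduced in this excerpt; an intentionally vague starting prompt in this spirit might be as thin as (a stand-in, not the actual prompt.md):

```text
You are a helpful assistant for the pyflow library.
Answer developer questions. You have tools available.
```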
Left unguided, the baseline agent skips search_docs and answers pyflow API questions from training knowledge, which may be stale or simply wrong. For math questions it estimates rather than calling calculate. The judge scores both of these cases a 2, not a 4, even when the answer happens to be numerically close. Grounding is the primary criterion.
The pipeline
The agent runs an agentic loop with three tools (search_docs, calculate, and get_date) and up to four rounds of tool calls per question:

dev_assistant/pipeline.py
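The file itself isn't reproduced in this excerpt. A minimal sketch of such a loop, assuming an OpenAI-compatible client; TOOL_SCHEMAS and run_tool are hypothetical helpers, and AgentTrace is the stand-in dataclass from above, not reflex's real type:

```python
# A sketch of the agentic loop, not the real dev_assistant/pipeline.py.
import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works

def pipeline_fn(prompt: str, question: str) -> AgentTrace:
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": question},
    ]
    calls = []
    for _ in range(4):  # up to four rounds of tool calls
        reply = client.chat.completions.create(
            model="qwen3:8b",
            messages=messages,
            tools=TOOL_SCHEMAS,   # search_docs, calculate, get_date
            temperature=0.0,      # determinism; see the note below
        ).choices[0].message
        if not reply.tool_calls:
            break                 # the model answered directly
        messages.append(reply)
        for call in reply.tool_calls:
            args = json.loads(call.function.arguments)
            result = run_tool(call.function.name, args)
            calls.append((call.function.name, args, result))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    # A real implementation would force a final answer if the last
    # round still requested tools; elided here for brevity.
    return AgentTrace(tool_calls=calls, final_answer=reply.content or "")
```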
:::note temperature=0 in your pipeline function
Reflex compares prompt variants by running pipeline_fn on the same inputs and
comparing scores. If temperature is left at the provider default (typically
0.7–1.0), the same prompt will score differently on different runs due to
sampling noise — a variant may appear to improve or hurt purely by chance.
Always set temperature=0.0 on every LLM call inside pipeline_fn. This makes
each eval deterministic so variant comparisons reflect real prompt differences,
not random variation. The smoke test also benefits: you get the same output
every time you run it, making it easier to confirm a fix worked.
:::
The tools_called and tool_results nodes are marked optimize=False —
reflex passes them to the judge for grounding evaluation but does not try to
optimize them directly. Only the answer node is marked optimize=True.
The multi-round loop matters: a question like “how many records can I process
in 90 minutes at workers=4?” requires two tool calls in sequence — first
search_docs to retrieve the benchmark figure (170,000 records/sec), then
calculate to multiply 170_000 * 5400. A single-round loop would miss this
pattern.
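Concretely, a grounded trace for that question might look like this (illustrative values, using the stand-in AgentTrace from earlier):

```python
# Illustrative two-round trace: retrieve the benchmark, then compute.
AgentTrace(
    tool_calls=[
        ("search_docs", {"query": "throughput workers=4"}, "~170,000 records/sec"),
        ("calculate", {"expression": "170000 * 5400"}, 918_000_000),  # 90 min = 5,400 s
    ],
    final_answer="At ~170,000 records/sec (workers=4), 90 minutes "
                 "(5,400 s) is roughly 918 million records.",
)
```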
The dataset
30 questions about the pyflow library, each with an ideal answer used by the judge. The questions are split into four categories:

- Doc lookup (8 questions) — API configuration, error handling, source types
- Doc + arithmetic (14 questions) — throughput estimates, batch counts, time-to-completion, queue capacities
- Doc + date (4 questions) — how long ago a version was released, derived from get_date() plus subtraction
- Multi-tool chains (4 questions) — combine all three tools in sequence
dev_assistant/questions.json (excerpt)
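The excerpt isn't reproduced here; a plausible shape, with field names as assumptions rather than the real schema:

```json
[
  {
    "question": "How many records can I process in 90 minutes at workers=4?",
    "ideal": "About 918 million records: 170,000 records/sec * 5,400 seconds.",
    "category": "doc+arithmetic"
  },
  {
    "question": "How long ago was the latest pyflow release?",
    "ideal": "Derived from get_date() minus the release date in the docs.",
    "category": "doc+date"
  }
]
```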
The judge
A standard LLM judge comparing final answers to ideals would miss the most important failure mode: the model answers from training knowledge instead of calling tools. The judge rubric is built around this:

dev_assistant/judge.md
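The rubric file isn't reproduced in this excerpt; the rules described in this tutorial suggest a shape like this (a sketch, not the actual judge.md):

```text
Score 5: answer matches the ideal AND every claim is grounded in a tool
         result present in the trace. If calculate was available and the
         question involves arithmetic, a 5 requires that calculate was called.
Score 4: grounding intact, minor wording or precision issues.
Score 2: answer is plausible or even numerically correct, but a required
         tool was bypassed (e.g. arithmetic done mentally).
Score 1: answer is wrong or contradicts the tool results.
```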
Running the optimization
Note the absence of the -m / --model flag: in pipeline mode the model is baked into pipeline_fn; reflex calls your function and receives an AgentTrace back. The judge model (--judge) is separate and scores the trace.
Reflex prints the run header before doing anything:
Score trajectory
Key takeaways
Static datasets cannot score tool-grounding. A dataset of (question, ideal-answer) pairs has no way to detect whether calculate was called or whether the agent multiplied in its head. Pipeline mode surfaces this by passing the full trace — tool calls, tool results, and final answer — to the judge on every iteration.
The judge rubric must explicitly penalise bypassing tools. Without the
“correct-but-ignores-tools → score 2” rule, the judge would reward plausible
estimates and the optimizer would have no signal to fix the grounding problem.
The rubric is as important as the starting prompt.
Structural fixes dominate for tool-use problems. The single biggest gain
(+0.16 on train, iteration 1) came from adding markdown structure and explicit
response phases — before the optimizer had diagnosed a single specific failure.
Tool use is a behaviour that models exhibit reliably when told to, and ignore
reliably when not told to.
The val split catches over-optimization. Train and val diverged sharply in
the iterative phase: the prompt that scored 0.9583 on val (iteration 5) scored
only 0.6429 on train, while the PDO champion scored 0.7679 on train but 0.5000
on val. Reflex correctly picked iteration 5 for the final test eval. Without a
val set this would have been invisible.
Prompt length is not a proxy for quality. The iterative phase grew the
prompt from 856 to 3,251 characters and train score dropped by 0.18. The final
best-val prompt was the 2,277-character iteration-5 version — not the longest,
not the shortest, but the one that generalised best.
Run it yourself
The pipeline, dataset, starting prompt, and judge rubric are in examples/dev_assistant/. The pipeline works with any OpenAI-compatible
provider — pick whichever you have credentials for.
A note on the judge
The --judge model is separate from the pipeline model and matters a lot for
this task. The judge reads the full trace — which tools were called, what they
returned, and what the final answer said — and applies the judge.md rubric.
Small models (7B–14B) consistently score too leniently: they see that the
arithmetic in the answer is correct and give 5/5 even when calculate was never
called. A frontier model (Claude Sonnet, GPT-4o) enforces the “tool bypassed →
score 2” rule reliably because it strictly audits tools_called against the
rubric rather than just checking the answer for correctness.
Recommendation: use a frontier model as judge even if you run the pipeline
locally on Ollama. The pipeline model pays the per-token cost of tool calling;
the judge only reads the final trace — it makes a handful of calls per
iteration, so the added cost is small.
OpenRouter (cloud, needs OPENROUTER_API_KEY):
qwen3:8b runs on your machine for free; Claude Sonnet scores the traces
reliably. The judge accounts for most of the
OpenRouter cost — well under $5 for a full run.
Fully local (Ollama for both pipeline and judge — lower quality scores but
no cloud calls needed):
In the baseline trace, calculate was never called; the model multiplied mentally from the retrieved benchmark figure. The judge rubric scores this 2/5: correct answer, tool bypassed. The pre-check flags this mechanically, without needing a judge inference call.
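That pre-check can be a few lines of trace inspection. A sketch, using the stand-in AgentTrace and the assumed category labels from earlier (this is not a built-in reflex check):

```python
# Mechanical grounding pre-check: flag traces that answered an
# arithmetic-flavoured question without a single call to calculate.
def bypassed_calculate(trace: AgentTrace, category: str) -> bool:
    if category not in ("doc+arithmetic", "multi-tool"):
        return False  # pure lookups don't require calculate
    tools_used = {name for name, _args, _result in trace.tool_calls}
    return "calculate" not in tools_used

# Anything flagged here can be scored 2/5 without spending a judge call.
```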
This is exactly the signal the optimizer uses. After running evyra-reflex optimize, test the best prompt with --prompt:
This time the trace shows a call to calculate, and the answer explicitly cites the tool result. This is the target behaviour the optimizer produces.
Total cost on OpenRouter for a full run: well under $5. The agent itself
(Qwen3 8B) handles both the tool-calling and the final generation. The judge
(also Qwen3 8B) scores each trace. The reasoning model that rewrites the prompt
runs on Claude Sonnet by default and accounts for most of the cost.