Standard mode
Standard mode is the default. You provide a dataset of (input, ideal output) pairs and a starting prompt. Reflex runs evals, scores each prompt candidate against the dataset, and rewrites the prompt until scores converge.

Use standard mode when:

- Your task has a well-defined correct output — summarization, extraction, classification, translation, format compliance
- You can build a dataset of (input, ideal) pairs that covers the failure cases you care about
- The prompt directly produces the output you want to score
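A standard-mode dataset is just a collection of (input, ideal) pairs. A minimal sketch for an extraction task (the field names here are illustrative, not necessarily the exact schema Reflex expects):

```python
# Hypothetical (input, ideal) pairs for an invoice-extraction task.
# Each item pairs a raw input with the exact output the prompt should produce.
dataset = [
    {
        "input": "Invoice #4412, due 2024-05-01, total $310.00",
        "ideal": {"invoice_id": "4412", "due": "2024-05-01", "total": "310.00"},
    },
    {
        "input": "Invoice #9001, due 2024-06-15, total $89.50",
        "ideal": {"invoice_id": "9001", "due": "2024-06-15", "total": "89.50"},
    },
]
```

Cover the failure cases you actually care about: a handful of representative hard examples is worth more than many easy ones.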
Pipeline mode
In pipeline mode, you provide a Python function that runs your full agent. Reflex re-runs that function on every candidate prompt and passes the complete execution trace — tool calls, tool results, and final answer — to the judge.

Use pipeline mode when:

- Your agent calls tools, retrieves context, or takes actions — and correctness depends on whether those steps happened correctly, not just whether the final string looks right
- A static (input, ideal) dataset cannot tell whether the model used its tools or answered from memory
- You need the judge to see intermediate steps — which tools were called, what they returned, where the answer diverged from the evidence
How it works
You write a pipeline_fn(prompt, input) that wraps your existing agent and returns an AgentTrace. Your agent code doesn't change — you add one function around it:
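A minimal sketch of such a wrapper, using stand-in trace types (the real fields of AgentTrace and its node structure are assumptions here; check Reflex's API for the exact shapes):

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical stand-ins for Reflex's trace types; real names and fields may differ.
@dataclass
class TraceNode:
    name: str
    prompt: str
    output: Any
    optimize: bool = False  # True for the node whose prompt is being optimized

@dataclass
class AgentTrace:
    nodes: list = field(default_factory=list)
    final_answer: str = ""

def pipeline_fn(prompt: str, input: dict) -> AgentTrace:
    # Run your existing agent with the candidate prompt.
    # A fake tool call and final answer stand in for the real agent here.
    tool_result = {"doc": f"retrieved context for {input['query']}"}
    answer = f"Answer based on {tool_result['doc']}"
    return AgentTrace(
        nodes=[
            TraceNode("retriever", prompt="", output=tool_result, optimize=False),
            TraceNode("answerer", prompt=prompt, output=answer, optimize=True),
        ],
        final_answer=answer,
    )
```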
optimize=True marks the node whose prompt is being optimized. Nodes marked
optimize=False are passed to the judge as context but not directly optimized.
The judge receives the full trace and can enforce rules across all nodes —
for example, penalizing a correct answer that bypassed available tools.
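A trace-level rule like that can be expressed directly in the rubric text. The wording below is an assumption, not a Reflex-supplied template; adapt the deduction rules to your agent's failure modes:

```python
# Hypothetical judge rubric enforcing trace-level rules: the deductions
# mirror the failure modes described above (answers that bypassed tools).
JUDGE_RUBRIC = """\
Score the trace from 0 to 10.
Deduct heavily if:
- the final answer was produced without calling the available tools,
- a tool result in the trace contradicts the final answer,
- the answer states facts that appear nowhere in the tool outputs.
"""
```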
Judge rubric
The judge rubric is the most important part of pipeline mode. It must explicitly penalize the failure modes that a static eval cannot detect, such as a correct-looking answer that bypassed the available tools or was produced from memory.

Temperature
Set temperature=0.0 on every LLM call inside pipeline_fn. Reflex compares
prompt variants by running the same inputs through your function multiple times.
Sampling noise would make variant comparisons unreliable — a prompt may appear
to improve or hurt purely by chance.
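One way to keep this from drifting is to centralize request construction so every call inside pipeline_fn goes through one place (the helper name and message shape below are illustrative):

```python
def llm_kwargs(system_prompt: str, user_input: str, model: str = "gpt-4o-mini"):
    # Pin temperature to 0.0 so repeated eval runs of the same input are
    # as deterministic as the provider allows, making prompt-variant
    # comparisons reflect the prompt rather than sampling noise.
    return {
        "model": model,
        "temperature": 0.0,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }
```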
All strategies work in pipeline mode
Standard mode strategies (iterative, structural, fewshot, PDO, auto) all work in pipeline mode. The optimizer calls eval_fn(prompt, dataset) instead of
_run_eval(prompt, dataset) — your pipeline_fn is invoked inside eval_fn
on every candidate. The strategy sees the same score signal regardless of mode.
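Conceptually, eval_fn re-runs pipeline_fn on every dataset item with the candidate prompt and aggregates judge scores. A rough sketch (the function shapes are assumptions, not Reflex's exact signatures):

```python
def eval_fn(prompt, dataset, pipeline_fn, judge):
    # Re-run the full agent on each input with the candidate prompt,
    # then let the judge score each complete trace; return the mean.
    scores = [judge(pipeline_fn(prompt, item["input"])) for item in dataset]
    return sum(scores) / len(scores)
```

The strategy layer only ever sees the returned score, which is why every standard-mode strategy carries over unchanged.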
See the pipeline mode tutorial for a complete
walkthrough with real run logs.