Origin ships three attribution methods. You can run any one individually or
combine them with method="all" (the default).
LLM-as-critic (method="critic")
One LLM call. Origin sends the full execution trace, the rubric, and the
judge score to an LLM and asks: “given that this pipeline scored poorly, which
span is most responsible and why?” The LLM reads the trace holistically — the
way a senior engineer would scan logs — and returns a ranked list of culprit
spans, each with a severity, confidence score, explanation, and fix_type.
Think of it as asking a colleague to eyeball the trace and point at what looks
wrong. It's fast and cheap to run.
Best for: fast diagnosis, single-cause failures, and traces where one span
clearly dominates the failure.
Limitations: the critic sees the trace as text and can be misled by a
span that looks suspicious but is not the root cause. It has no causal
guarantee.
```python
result = origin.diagnose(trace=trace, score=0.2, rubric=rubric, method="critic")
```
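Assuming the result exposes the ranked culprit spans with the fields listed
above, inspecting it might look like this (every attribute name below is an
assumption, not confirmed API):

```python
# Attribute names (result.spans, .span_id, .severity, .confidence,
# .explanation, .fix_type) are assumptions based on the fields described
# above -- check your Origin version for the actual result shape.
for span in result.spans:
    print(f"{span.span_id}: severity={span.severity}, "
          f"confidence={span.confidence:.2f}, fix_type={span.fix_type}")
    print(f"  {span.explanation}")
```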
Score decomposition (method="decomposition")
One LLM call. Instead of looking at the trace holistically, decomposition
first breaks the rubric into its individual pass/fail criteria, then asks which
span is responsible for each criterion that failed.
For example, a rubric like “the agent should acknowledge the charge, cite the
refund policy, and confirm the refund” gets split into three separate
questions. For each failed criterion, the LLM identifies the responsible span.
Blame is then aggregated per span across all the criteria it failed.
This gives you a more structured view: you can see not just which span failed,
but which requirement it failed and why — useful when a failure has multiple
contributing factors.
Best for: rubrics that bundle multiple requirements, distributed failures
where two or three spans each contributed, and cases where you want a richer
breakdown by criterion.
Limitations: still an LLM judgment — the decomposition of the rubric into
criteria can be imperfect.
```python
result = origin.diagnose(trace=trace, score=0.2, rubric=rubric, method="decomposition")
```
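To make the aggregation step concrete, here is a toy illustration of the idea
in plain Python (not Origin's internals; the criterion-to-span assignments are
hypothetical):

```python
# Toy sketch of decomposition-style blame aggregation: each failed
# criterion blames one span, and blame is tallied per span.
from collections import Counter

failed_criteria = {                      # criterion -> responsible span
    "acknowledge the charge": "responder",
    "cite the refund policy": "retriever",
    "confirm the refund": "responder",
}

blame = Counter(failed_criteria.values())
print(blame.most_common())  # [('responder', 2), ('retriever', 1)]
```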
Ablation (method="ablation")
Causal. For each candidate span, Origin replaces its output with a neutral
placeholder ("null" by default, or the ideal output if ablation_placeholder="ideal"),
replays the pipeline via your runner, and re-scores via your judge. A
large score drop when span X is ablated means span X is genuinely causal —
removing its real output materially changed the outcome.
Best for: confirming that a span is the root cause (not just suspicious),
ruling out false positives, and pipelines where LLM confabulation is a risk.
Limitations: requires a deterministic runner and a judge callable. Each
ablated span costs one runner invocation + one judge call. Use
ablation_budget=N to cap total invocations.
```python
def my_runner(trace: AgentTrace, overrides: dict) -> AgentTrace:
    # Replay with overrides[span_id] forced as that span's output.
    ...

result = origin.diagnose(
    trace=trace, score=0.2, rubric=rubric,
    method="ablation",
    runner=my_runner,
)
```
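The runner's contract: re-execute the pipeline, but pin any span listed in
overrides to the supplied output instead of recomputing it. A toy sketch for a
two-step pipeline (the step functions, span ids, and AgentTrace constructor
are illustration only):

```python
# Hypothetical two-step replay. plan_step, code_step, the span ids, and
# the AgentTrace constructor are placeholders -- adapt to your pipeline.
def my_runner(trace: AgentTrace, overrides: dict) -> AgentTrace:
    # Pinned (ablated) output if present, otherwise recompute for real.
    plan = overrides.get("planner", plan_step(trace.input))
    code = overrides.get("coder", code_step(plan))
    return AgentTrace(input=trace.input, outputs={"planner": plan, "coder": code})
```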
How ablation confidence is calculated
When a span is ablated, Origin runs the pipeline and asks the judge to score
the result. Confidence is the normalized score drop:
```python
ablation_confidence = (score_original - score_ablated) / score_range
```
score_range is score_max - score_min (typically 1.0 - 0.0 = 1.0). A span
that drops the score from 1.0 to 0.0 gets confidence 1.0; a span that
changes nothing gets 0.0.
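In code, a direct transcription of the formula (with the 0.0/1.0 defaults for
score_min/score_max noted above):

```python
def ablation_confidence(score_original: float, score_ablated: float,
                        score_min: float = 0.0, score_max: float = 1.0) -> float:
    # Normalized score drop caused by ablating one span.
    return (score_original - score_ablated) / (score_max - score_min)

assert ablation_confidence(1.0, 0.0) == 1.0  # full drop
assert ablation_confidence(0.5, 0.5) == 0.0  # no change
```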
Neutral placeholders by output type. The ablated output must be
structurally valid so the rest of the pipeline doesn’t crash:
| Span output type | Placeholder |
|---|---|
| `dict` | `{}` |
| `list` | `[]` |
| `str` | `""` |
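A sketch of that mapping in plain Python (illustration only, not Origin's
internal function):

```python
def neutral_placeholder(output):
    # Structurally valid stand-ins, mirroring the table above.
    if isinstance(output, dict):
        return {}
    if isinstance(output, list):
        return []
    if isinstance(output, str):
        return ""
    raise TypeError(f"no neutral placeholder for {type(output).__name__}")
```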
Judge scoring tiers. Deterministic judges often return one of a small set of
scores based on observable facts rather than LLM opinion. In the coding agent
example the judge uses three tiers:
| Result | Score | Meaning |
|---|---|---|
| All tests pass | 1.0 | Fully correct |
| Compiles but tests fail | 0.4 | Partial credit |
| Compile error | 0.0 | Broken output |
With score_original = 0.4 and score_ablated = 0.0, ablation confidence is
(0.4 - 0.0) / 1.0 = 0.40.
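A deterministic judge implementing those tiers might look like this; the
compiles and tests_pass helpers (and the trace field) are hypothetical
stand-ins for your own build/test harness:

```python
def judge(trace: AgentTrace) -> float:
    code = trace.final_output          # assumed field name
    if not compiles(code):             # hypothetical helper
        return 0.0                     # broken output
    if not tests_pass(code):           # hypothetical helper
        return 0.4                     # partial credit
    return 1.0                         # fully correct
```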
Ablation cost control
```python
result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm, runner=my_runner,
    ablation_budget=5,  # cap at 5 runner+judge invocations
)
```
The raw on-ramp (Origin.diagnose) also exposes candidates=["span_a", "span_b"]
to restrict the ablation sweep to specific span ids.
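For example (the span ids are placeholders for your trace's actual ids):

```python
# Sweep only two suspect spans instead of every candidate.
result = origin.diagnose(
    trace=trace, score=0.2, rubric=rubric,
    method="ablation",
    runner=my_runner,
    candidates=["span_a", "span_b"],
)
```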
Combined (method="all")
Always runs critic and decomposition (two LLM calls total). Ablation
participates when a runner is supplied and is silently skipped otherwise.
Results are merged per span:
- Confidence — spans named by multiple methods receive a corroboration
bonus. Merged confidence lies between the arithmetic mean and the max of the
individual confidences, pulled toward the max as more methods agree. A span
named by all three methods gets the strongest possible bonus: two-thirds of
the way from the mean to the peak (the exact formula is below).
- Severity — the max severity across methods wins.
- fix_type — resolved to the most specific type across methods using a
priority ordering: prompt > tool_schema > retrieval > routing >
infrastructure > unknown. If critic says retrieval and decomposition says
unknown, the merged fix_type is retrieval. A sketch of this resolution
follows the list.
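A minimal sketch of that fix_type resolution rule (the constant and function
names are ours, not Origin API):

```python
# Most-specific-wins fix_type resolution, per the priority ordering above.
FIX_TYPE_PRIORITY = ["prompt", "tool_schema", "retrieval",
                     "routing", "infrastructure", "unknown"]

def resolve_fix_type(fix_types: list[str]) -> str:
    # Lower index = more specific = wins the merge.
    return min(fix_types, key=FIX_TYPE_PRIORITY.index)

assert resolve_fix_type(["retrieval", "unknown"]) == "retrieval"
```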
Let confidences be the list of per-method confidence scores for a span. The
merged confidence is:

```python
from statistics import mean

weight = 1.0 - 1.0 / len(confidences)
avg = mean(confidences)
peak = max(confidences)
merged = avg + (peak - avg) * weight
```
weight scales with the number of agreeing methods:
| Methods that named the span | Weight | Effect |
|---|---|---|
| 1 | 0.00 | merged = avg (no bonus — only one signal) |
| 2 | 0.50 | merged halfway between avg and peak |
| 3 | 0.67 | merged pulled strongly toward the peak |
The more methods agree, the closer the merged score is to the highest individual
confidence. A span that only one method flags gets no bonus at all.
Worked example
In the coding agent tutorial, the planner span receives these per-method scores:
| Method | Confidence | How it was derived |
|---|---|---|
| Critic | 0.95 | LLM judged the planner as the most suspicious span |
| Decomposition | 0.28 | Partial blame — planner contributed to one failed criterion |
| Ablation | 0.40 | Score dropped from 0.4 → 0.0 when planner output was ablated |
Applying the formula with N=3 methods:
```
weight = 1.0 - 1.0 / 3 = 0.667
avg    = (0.95 + 0.28 + 0.40) / 3 = 0.543
peak   = 0.95
merged = 0.543 + (0.95 - 0.543) × 0.667 = 0.543 + 0.271 = 0.814
```
The planner ends up with confidence=0.81. Three independent methods all
flagged it, so the corroboration bonus pulled the merged score well above the
average of the individual confidences.
```python
result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm,
    runner=my_runner,  # enables ablation
    method="all",
)
```
Choosing a method
| | Critic | Decomposition | Ablation |
|---|---|---|---|
| LLM calls | 1 | 1 | 0 (+ N runner and judge invocations) |
| Runner required | No | No | Yes |
| Causal guarantee | No | No | Yes |
| Multi-criterion rubrics | Partial | Yes | Partial |
| Cost | Low | Low | Medium–High |
Start with method="all" (without a runner) for most use cases — two LLM calls,
no runner needed, corroboration bonus when both methods agree. Add a runner when
you want ablation’s causal confirmation.
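In code, that starting point is the earlier diagnose_pipeline call with the
runner simply omitted:

```python
# Critic + decomposition only; ablation is silently skipped without a runner.
result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm,
    method="all",
)
```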