Origin ships three attribution methods. You can run any one individually or combine them with method="all" (the default).

LLM-as-critic (method="critic")

One LLM call. Origin sends the full execution trace, the rubric, and the judge score to an LLM and asks: “given that this pipeline scored poorly, which span is most responsible and why?” The LLM reads the trace holistically, the way a senior engineer would scan logs, and returns a ranked list of culprit spans, each with a severity, confidence score, explanation, and fix_type. Think of it as asking a colleague to eyeball the trace and point at what looks wrong. Best for: fast diagnosis and single-cause failures where one span clearly dominates. Limitations: the critic sees the trace as text and can be misled by a span that looks suspicious but is not the root cause; it offers no causal guarantee.
result = origin.diagnose(trace=trace, score=0.2, rubric=rubric, method="critic")
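What comes back is the ranked culprit list described above. A sketch of consuming it; the attribute names here mirror the fields named in the prose, but the exact result schema is an assumption, not confirmed API:

# Hypothetical result shape: culprit spans, ranked most-suspect first.
for culprit in result.culprits:              # assumed attribute name
    print(culprit.span_id, culprit.severity, culprit.confidence,
          culprit.fix_type)
    print("  why:", culprit.explanation)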

Score decomposition (method="decomposition")

One LLM call. Instead of reading the trace holistically, decomposition first breaks the rubric into its individual pass/fail criteria, then asks which span is responsible for each criterion that failed. For example, a rubric like “the agent should acknowledge the charge, cite the refund policy, and confirm the refund” splits into three separate questions, and for each failed criterion the LLM names the responsible span. Blame is then aggregated per span across all the failed criteria attributed to it. This gives you a more structured view: you see not just which span failed, but which requirement it broke and why, which is useful when a failure has multiple contributing factors. Best for: rubrics that bundle multiple requirements, distributed failures where two or three spans each contributed, and cases where you want a richer breakdown by criterion. Limitations: still an LLM judgment, and the decomposition of the rubric into criteria can itself be imperfect.
result = origin.diagnose(trace=trace, score=0.2, rubric=rubric, method="decomposition")
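Conceptually, the split-and-aggregate step looks like this. A sketch only: failed_attributions is a hypothetical list of (criterion, span_id) pairs standing in for the LLM's per-criterion answers:

# The example rubric splits into three pass/fail criteria.
criteria = [
    "the agent acknowledges the charge",
    "the agent cites the refund policy",
    "the agent confirms the refund",
]

# For each failed criterion the LLM names a responsible span;
# blame accumulates per span across all failed criteria.
blame: dict[str, int] = {}
for criterion, span_id in failed_attributions:   # hypothetical input
    blame[span_id] = blame.get(span_id, 0) + 1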

Ablation (method="ablation")

Causal. For each candidate span, Origin replaces its output with a neutral placeholder ("null" by default, or the ideal output if ablation_placeholder="ideal"), replays the pipeline via your runner, and re-scores via your judge. A large score drop when span X is ablated means span X is genuinely causal — removing its real output materially changed the outcome. Best for: confirming that a span is the root cause (not just suspicious), ruling out false positives, and pipelines where LLM confabulation is a risk. Limitations: requires a deterministic runner and a judge callable. Each ablated span costs one runner invocation + one judge call. Use ablation_budget=N to cap total invocations.
def my_runner(trace: AgentTrace, overrides: dict) -> AgentTrace:
    # Re-execute the pipeline. Wherever a span id appears in overrides,
    # skip the real call and use overrides[span_id] as that span's output.
    ...

result = origin.diagnose(
    trace=trace, score=0.2, rubric=rubric,
    method="ablation",
    runner=my_runner,
)
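Under the hood, the sweep is conceptually a loop like the one below. This is a simplification: candidate_spans, neutral_placeholder_for, and the judge signature are assumptions for illustration, not Origin internals:

score_original = judge(trace)                     # assumed judge callable
for span in candidate_spans:
    overrides = {span.id: neutral_placeholder_for(span)}  # e.g. "", [], {}
    replayed = my_runner(trace, overrides)        # replay with span ablated
    score_ablated = judge(replayed)
    drop = score_original - score_ablated         # large drop => span is causal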

How ablation confidence is calculated

When a span is ablated, Origin runs the pipeline and asks the judge to score the result. Confidence is the normalized score drop:
ablation_confidence = (score_original - score_ablated) / score_range
score_range is score_max - score_min (typically 1.0 - 0.0 = 1.0). A span that drops the score from 1.0 to 0.0 gets confidence 1.0; a span that changes nothing gets 0.0.

Neutral placeholders by output type

The ablated output must be structurally valid so the rest of the pipeline doesn’t crash:
Span output type   Placeholder
dict               {}
list               []
str                ""
Judge scoring tiers

Deterministic judges often return one of a small set of scores based on observable facts rather than LLM opinion. In the coding agent example the judge uses three tiers:
Result                    Score   Meaning
All tests pass            1.0     Fully correct
Compiles but tests fail   0.4     Partial credit
Compile error             0.0     Broken output
With score_original = 0.4 and score_ablated = 0.0, ablation confidence is (0.4 - 0.0) / 1.0 = 0.40.
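The same arithmetic as a runnable sketch; the helper name is ours, not part of the Origin API:

def ablation_confidence(score_original: float, score_ablated: float,
                        score_min: float = 0.0, score_max: float = 1.0) -> float:
    # Normalized score drop: 1.0 means ablating the span destroyed the
    # result entirely, 0.0 means ablating it changed nothing.
    return (score_original - score_ablated) / (score_max - score_min)

assert ablation_confidence(0.4, 0.0) == 0.4   # the tiered-judge example above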

Ablation cost control

result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm, runner=my_runner,
    ablation_budget=5,          # cap at 5 runner+judge invocations
)
The raw on-ramp (Origin.diagnose) also exposes candidates=["span_a", "span_b"] to restrict the ablation sweep to specific span ids.
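For example, restricting the sweep to two suspect spans (same diagnose signature as above; the span ids are placeholders):

result = origin.diagnose(
    trace=trace, score=0.2, rubric=rubric,
    method="ablation",
    runner=my_runner,
    candidates=["span_a", "span_b"],   # only these spans are ablated
)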

Combined (method="all")

Always runs critic and decomposition (two LLM calls total). Ablation participates when a runner is supplied and is silently skipped otherwise. Results are merged per span:
  • Confidence: spans named by multiple methods receive a corroboration bonus. Merged confidence lies between the arithmetic mean and the max of the individual confidences, weighted toward the max by the number of methods that agreed; a span named by all three methods gets the strongest pull toward the peak.
  • Severity: the max severity across methods wins.
  • fix_type: resolved to the most specific type across methods using a priority ordering: prompt > tool_schema > retrieval > routing > infrastructure > unknown. If critic says retrieval and decomposition says unknown, the merged fix_type is retrieval; see the sketch after this list.
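A minimal sketch of the fix_type resolution rule; the constant and helper names are illustrative, not Origin API:

# Lower index = more specific fix type.
FIX_TYPE_PRIORITY = ["prompt", "tool_schema", "retrieval",
                     "routing", "infrastructure", "unknown"]

def merge_fix_type(fix_types: list[str]) -> str:
    # Pick the most specific type any method proposed.
    return min(fix_types, key=FIX_TYPE_PRIORITY.index)

assert merge_fix_type(["retrieval", "unknown"]) == "retrieval"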

Corroboration formula

Let confidences be the list of per-method confidence scores for a span. The merged confidence is:
weight  = 1.0 - 1.0 / len(confidences)
avg     = mean(confidences)
peak    = max(confidences)
merged  = avg + (peak - avg) * weight
weight scales with the number of agreeing methods:
Methods that named the span   Weight   Effect
1                             0.00     merged = avg (no bonus; only one signal)
2                             0.50     merged halfway between avg and peak
3                             0.67     merged pulled strongly toward the peak
The more methods agree, the closer the merged score is to the highest individual confidence. A span that only one method flags gets no bonus at all.
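The same rule as a runnable sketch (statistics.mean is from the standard library; the helper name is ours):

from statistics import mean

def merged_confidence(confidences: list[float]) -> float:
    # weight is 0.0 for a single method (no corroboration bonus)
    # and grows toward 1.0 as more methods name the same span.
    weight = 1.0 - 1.0 / len(confidences)
    avg, peak = mean(confidences), max(confidences)
    return avg + (peak - avg) * weight

assert merged_confidence([0.8]) == 0.8   # one method: merged = avg, no bonus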

Worked example

In the coding agent tutorial, the planner span receives these per-method scores:
Method          Confidence   How it was derived
Critic          0.95         LLM judged the planner as the most suspicious span
Decomposition   0.28         Partial blame; planner contributed to one failed criterion
Ablation        0.40         Score dropped from 0.4 → 0.0 when planner output was ablated
Applying the formula with N=3 methods:
weight  = 1.0 - 1.0 / 3  = 0.667
avg     = (0.95 + 0.28 + 0.40) / 3  = 0.543
peak    = 0.95
merged  = 0.543 + (0.95 - 0.543) × 0.667  = 0.543 + 0.271  = 0.814
The planner ends up with confidence=0.81. Three independent methods all flagged it, so the corroboration bonus pulled the merged score well above the average of the individual confidences. To reproduce this end to end, supply a runner so ablation participates alongside the two LLM methods:
result = diagnose_pipeline(
    my_agent, question,
    judge=judge, rubric=rubric, llm=llm,
    runner=my_runner,   # enables ablation
    method="all",
)

Choosing a method

                          Critic    Decomposition   Ablation
LLM calls                 1         1               0 (+ runner×N)
Runner required           No        No              Yes
Causal guarantee          No        No              Yes
Multi-criterion rubrics   Partial   Yes             Partial
Cost                      Low       Low             Medium–High
Start with method="all" without a runner for most use cases: two LLM calls, no runner required, and a corroboration bonus when both methods agree. Add a runner when you want ablation’s causal confirmation.