tune

Start a new tuning run or resume an interrupted one.
aevyra-forge tune [OPTIONS]
aevyra-forge tune resume
The resume subcommand reads all parameters from the run's config.json, so no flags are needed.

Options

| Flag | Default | Description |
| --- | --- | --- |
| --model | (required) | HuggingFace model ID or local path (e.g. Qwen/Qwen2.5-3B, meta-llama/Llama-3.2-1B-Instruct) |
| --device | cuda | GPU backend: cuda, rocm, or cpu. cuda and rocm auto-detect GPU name and VRAM via nvidia-smi / rocm-smi. Use cpu with --dry-run |
| --workload | (required) | Path to workload JSONL. Each line must have a "prompt" field and optionally "expected_output_tokens" and "arrival_offset_s" |
| --concurrency | 8 | Max concurrent in-flight requests during benchmarking. T4/A10: 8–16. A100/H100: 32–64 |
| --llm | anthropic/claude-sonnet-4-6 | Agent LLM in provider/model format. Examples: openrouter/meta-llama/llama-3.1-70b, openai/gpt-4o, ollama/qwen3:8b |
| --max-experiments | 50 | Total experiment budget across all layers |
| --max-hours | 12.0 | Wall-clock time limit in hours |
| --max-dollars | | LLM spend cap in USD |
| --accuracy-floor | 0.99 | Minimum acceptable accuracy. Experiments that regress below this are not kept regardless of throughput gains |
| --playbook | (bundled) | Path to a custom playbook .md file. Defaults to the bundled playbook.md |
| --run-dir | .forge | Root directory for run storage |
| --dry-run | false | Skip vLLM; use synthetic bench results. Useful for testing the loop without a GPU |
| --verbose | false | Debug logging |
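
To make the --workload format concrete, here is a minimal sketch that writes a three-line JSONL file and runs a quick sanity check on it. Only the three field names come from the table above. The file path, the prompts, the numeric values, and the reading of arrival_offset_s as seconds are assumptions made for illustration.

```shell
# Write a small workload file: one JSON object per line.
# Only "prompt" is required; the other two fields are optional.
# File name, prompts, and values are illustrative assumptions.
workload="${TMPDIR:-/tmp}/forge_sample_workload.jsonl"
cat > "$workload" <<'EOF'
{"prompt": "Summarize the following paragraph in one sentence.", "expected_output_tokens": 64, "arrival_offset_s": 0.0}
{"prompt": "Translate 'good morning' into French.", "expected_output_tokens": 16, "arrival_offset_s": 0.5}
{"prompt": "List three uses of a hash map."}
EOF

# Quick sanity check: every line should contain a "prompt" key.
total=$(wc -l < "$workload" | tr -d ' ')
with_prompt=$(grep -c '"prompt"' "$workload")
echo "lines=$total with_prompt=$with_prompt"
```

If the two counts differ, at least one line is missing the required field. The resulting path can then be passed to --workload.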

Layer control

| Flag | Default | Description |
| --- | --- | --- |
| --skip-config | false | Skip Layer 1 config tuning and go straight to Layer 2 quantization |
| --skip-quant | false | Skip Layer 2 quantization |
| --skip-kernel | false | Skip Layer 3 kernel synthesis |
| --max-config-experiments N | | Cap Layer 1 at N experiments, then escalate to Layer 2 regardless of convergence. Useful on T4 where the config search space is narrow |
| --max-quant-experiments N | | Cap Layer 2 at N experiments |

Examples

# Standard overnight run — L1 then L2 automatically
aevyra-forge tune \
  --model Qwen/Qwen2.5-3B \
  --device cuda \
  --workload prod_trace.jsonl \
  --max-experiments 50 \
  --max-hours 10

# Cap L1 to 3 experiments on a T4 (small config search space)
aevyra-forge tune \
  --model Qwen/Qwen2.5-3B \
  --device cuda \
  --workload examples/sample_workload.jsonl \
  --max-config-experiments 3

# Skip L1 — quantize only (you already have a tuned config)
aevyra-forge tune \
  --model Qwen/Qwen2.5-3B \
  --device cuda \
  --workload examples/sample_workload.jsonl \
  --skip-config

# Use a different agent LLM
aevyra-forge tune \
  --model Qwen/Qwen2.5-3B \
  --device cuda \
  --workload examples/sample_workload.jsonl \
  --llm openrouter/meta-llama/llama-3.1-70b

# Dry-run on CPU (no vLLM, no GPU)
aevyra-forge tune \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --device cpu \
  --workload examples/sample_workload.jsonl \
  --max-experiments 5 \
  --dry-run

# Resume latest interrupted run (zero args)
aevyra-forge tune resume

Run directory layout

.forge/
  runs/
    001_2026-05-13T04-10-00/
      config.json          ← model, hardware, workload path, all CLI flags
      experiments.jsonl    ← append-only log (one line per experiment)
      experiments.tsv      ← human-readable table
      experiments.json     ← structured table for tooling
      best_recipe.yaml     ← best config found so far
      completed.json       ← written on clean finish; absent = interrupted
A run with experiments.jsonl but no completed.json was interrupted and can be resumed with aevyra-forge tune resume.
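
The marker-file rule above can be checked from a script. The following sketch classifies run directories using only the two file names documented here. The run_status helper, the "empty" label, and the demo directory names are hypothetical and not part of aevyra-forge.

```shell
# Classify a run directory by the marker files described above:
#   completed.json present                  -> finished
#   experiments.jsonl but no completed.json -> interrupted (resumable)
#   neither file                            -> empty (label is an assumption)
run_status() {
  if [ -f "$1/completed.json" ]; then
    echo finished
  elif [ -f "$1/experiments.jsonl" ]; then
    echo interrupted
  else
    echo empty
  fi
}

# Demo on a throwaway directory tree (names are made up for illustration).
root=$(mktemp -d)
mkdir -p "$root/runs/001_demo" "$root/runs/002_demo"
touch "$root/runs/001_demo/experiments.jsonl" "$root/runs/001_demo/completed.json"
touch "$root/runs/002_demo/experiments.jsonl"

for run in "$root"/runs/*/; do
  printf '%s\t%s\n' "$(basename "$run")" "$(run_status "$run")"
done
```

Pointing the loop at .forge/runs/*/ instead of the temporary tree would list the status of real runs under the default --run-dir.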

report

Print a summary of a completed or in-progress run.
aevyra-forge report <run-dir> [OPTIONS]

Arguments

| Argument | Description |
| --- | --- |
| run-dir | Path to a run directory (e.g. .forge/ or .forge/runs/001_2026-05-13T04-10-00) |

Options

| Flag | Default | Description |
| --- | --- | --- |
| --format | table | Output format: table or json |

Output

=== Forge Report: .forge/runs/001_2026-05-13T04-10-00 ===

Total experiments: 7
Best score:        2.2509
Best recipe ID:    c3d4e5f6
Best generation:   6
Throughput:        516.4 tok/s
P99 latency:       181 ms

exp  id        layer   score   throughput  p99_ms  accuracy  kept  rationale
0    a1b2c3d4  config  1.0000  229.4       312     0.991     ✓     baseline
1    b2c3d4e5  config  1.0467  240.0       298     0.993     ✓     enable_prefix_caching
2    c3d4e5f6  config  0.9942  228.1       341     0.989     ✗     max_num_seqs=64 stressed VRAM
3    d4e5f6a7  quant   1.2703  290.9       261     0.990     ✗     int8: score below best
4    c3d4e5f6  quant   2.2509  516.4       181     0.992     ✓     int4_awq: 60% VRAM freed for KV cache

playbook

Inspect or validate the active playbook.
aevyra-forge playbook show [--playbook PATH]
aevyra-forge playbook validate [--playbook PATH]

Subcommands

| Subcommand | Description |
| --- | --- |
| show | Print the full playbook text to stdout |
| validate | Check the playbook's structure and YAML front-matter. Exits non-zero if invalid |

Options

| Flag | Default | Description |
| --- | --- | --- |
| --playbook | (bundled) | Path to a custom playbook file. Defaults to the bundled playbook.md |

Examples

# View the bundled playbook
aevyra-forge playbook show

# Validate a custom playbook before a run
aevyra-forge playbook validate --playbook my_playbook.md

# Pipe to less
aevyra-forge playbook show | less

LLM providers

The --llm flag follows a provider/model convention shared across the Aevyra stack:
| Provider | Format | Required env var |
| --- | --- | --- |
| Anthropic (default) | anthropic/claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| OpenAI | openai/gpt-4o | OPENAI_API_KEY |
| OpenRouter | openrouter/meta-llama/llama-3.1-70b | OPENROUTER_API_KEY |
| Together AI | together/meta-llama/Llama-3-70b | TOGETHER_API_KEY |
| Groq | groq/llama3-70b-8192 | GROQ_API_KEY |
| Ollama (local) | ollama/qwen3:8b | |
| Any OpenAI-compatible | openai/model-name + custom base URL | |
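
Because a missing API key would otherwise only surface once a run is under way, a pre-flight check can be useful. The sketch below maps the provider prefix of an --llm value to the environment variable listed in the table above. The helper names required_env_var and check_llm_key are hypothetical, and the sketch assumes that OpenAI-compatible endpoints using the openai/ prefix also read OPENAI_API_KEY, which the table does not state.

```shell
# Map the provider prefix of an --llm value to its env var from the table above.
# Prints nothing for providers with no listed key (e.g. ollama) or unknown prefixes.
required_env_var() {
  case "${1%%/*}" in   # strip everything from the first "/" onward
    anthropic)  echo ANTHROPIC_API_KEY ;;
    openai)     echo OPENAI_API_KEY ;;      # assumption: also used for OpenAI-compatible endpoints
    openrouter) echo OPENROUTER_API_KEY ;;
    together)   echo TOGETHER_API_KEY ;;
    groq)       echo GROQ_API_KEY ;;
  esac
}

# Return 0 if the key for the given --llm value is set (or none is needed), 1 otherwise.
check_llm_key() {
  var=$(required_env_var "$1")
  [ -z "$var" ] && return 0
  eval "val=\${$var:-}"   # indirect lookup; $var comes from the fixed list above
  if [ -z "$val" ]; then
    echo "missing $var for --llm $1" >&2
    return 1
  fi
}
```

A typical use is to call check_llm_key with the intended --llm value and start aevyra-forge tune only if it returns 0.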