Evaluation Guide

How the skill evaluation system verifies skill quality and reliability.

What Is the Eval System?

The evaluation system checks every skill against the four failure modes defined by the Agent Skills whitepaper (Singhal et al., 2026): trigger (does the skill activate when it should, and only then?), execution (is the output correct?), token budget (does SKILL.md stay within its token limit?), and regression (do co-loaded skills interfere with each other?).

The Four Failure Modes

| Mode | What It Tests | Eval Type |
|------|---------------|-----------|
| Trigger | Does the skill fire when it should (and stay silent when it shouldn't)? | trigger_positive, trigger_negative |
| Execution | Does the skill produce correct output according to a rubric? | golden.jsonl with rubric criteria |
| Token Budget | Does the SKILL.md stay within 5000 tokens? | Automated by skill-lint.sh |
| Regression | Do co-loaded skills interfere with each other? | regression cases |
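
The trigger mode has two pass conditions: a positive case passes when the skill fires, and a negative case passes when it stays silent. The sketch below scores a batch of trigger results under that rule. The `TriggerResult` type and its `fired` field are hypothetical names chosen for illustration; only the `trigger_positive` and `trigger_negative` labels come from this guide, and the real runner may record results differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerResult:
    # eval_type uses the labels from this guide; `fired` is a hypothetical
    # field recording whether the skill activated for the prompt.
    eval_type: str  # "trigger_positive" or "trigger_negative"
    fired: bool


def trigger_passed(result: TriggerResult) -> bool:
    """Positive cases pass when the skill fires; negative cases pass when it stays silent."""
    if result.eval_type == "trigger_positive":
        return result.fired
    if result.eval_type == "trigger_negative":
        return not result.fired
    raise ValueError(f"unknown eval type: {result.eval_type!r}")


def trigger_accuracy(results: list[TriggerResult]) -> float:
    """Fraction of trigger cases that passed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(trigger_passed(r) for r in results) / len(results)


results = [
    TriggerResult("trigger_positive", fired=True),
    TriggerResult("trigger_positive", fired=False),  # missed activation
    TriggerResult("trigger_negative", fired=False),
    TriggerResult("trigger_negative", fired=True),   # false activation
]
print(trigger_accuracy(results))  # 2 of the 4 cases pass
```

Scoring both directions together means a skill that fires on everything cannot reach full accuracy: each false activation on a negative case counts against it just as a missed activation does.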

Adding Evals to a Skill

Each skill has an evals/ directory with up to three files: trigger.jsonl (positive and negative trigger tests), golden.jsonl (known-good inputs with rubrics), and adversarial.jsonl (rephrasing, boundary, and edge cases). Use the skill-creator skill to generate eval cases automatically.

skills/<skill-name>/
├── SKILL.md
└── evals/
    ├── trigger.jsonl       # 2+ positive, 1+ negative
    ├── golden.jsonl        # 1+ case with rubric
    └── adversarial.jsonl   # rephrasing + boundary + edge
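
The minimum counts noted in the tree above can be checked mechanically. The following is a minimal sketch, not the project's own coverage check: it assumes each JSONL line is an object, that trigger cases carry a `type` field holding `trigger_positive` or `trigger_negative`, and that golden cases carry a `rubric` field. Both field names are assumptions; the actual schema is defined in EVAL-GUIDE.md. Files that are absent are skipped, since a skill has up to three eval files, and adversarial.jsonl is not inspected here.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-blank line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_evals_dir(evals_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []

    trigger_path = evals_dir / "trigger.jsonl"
    if trigger_path.exists():
        cases = load_jsonl(trigger_path)
        # "type" is an assumed field name, not taken from the real schema.
        positives = sum(c.get("type") == "trigger_positive" for c in cases)
        negatives = sum(c.get("type") == "trigger_negative" for c in cases)
        if positives < 2:
            problems.append(f"trigger.jsonl: {positives} positive case(s), need 2+")
        if negatives < 1:
            problems.append("trigger.jsonl: no negative case, need 1+")

    golden_path = evals_dir / "golden.jsonl"
    if golden_path.exists():
        # "rubric" is likewise an assumed field name.
        with_rubric = sum(bool(c.get("rubric")) for c in load_jsonl(golden_path))
        if with_rubric < 1:
            problems.append("golden.jsonl: no case with a rubric, need 1+")

    return problems


# Demo against a throwaway directory with one positive trigger case too few.
with tempfile.TemporaryDirectory() as tmp:
    evals = Path(tmp)
    (evals / "trigger.jsonl").write_text(
        '{"type": "trigger_positive", "prompt": "..."}\n'
        '{"type": "trigger_negative", "prompt": "..."}\n',
        encoding="utf-8",
    )
    (evals / "golden.jsonl").write_text(
        '{"input": "...", "rubric": ["..."]}\n', encoding="utf-8"
    )
    problems = check_evals_dir(evals)

print(problems)
```

Returning a list of problems rather than raising on the first one lets a caller report every shortfall in a skill's evals/ directory in a single pass.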

Running Evals

Use the eval scripts to validate skill behavior:

bash scripts/eval/run-evals.sh --all           # 172 cases
bash scripts/eval/run-golden.sh --all          # 60 cases
bash scripts/eval/run-adversarial.sh --all     # 118 cases
bash scripts/eval/check-coverage.sh            # Coverage report
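
When the four commands need to run as one step, for example before a commit, a small wrapper can run them in order and collect the ones that fail. This is a sketch under one assumption that this guide does not confirm: that each script signals failure with a non-zero exit code. The script paths and the `--all` flag are copied from the commands above; `run_eval_suite` and `run_shell` are hypothetical helper names.

```python
import subprocess
from typing import Callable, Sequence

# Script paths and flags copied from the commands in this guide.
EVAL_COMMANDS: list[list[str]] = [
    ["bash", "scripts/eval/run-evals.sh", "--all"],
    ["bash", "scripts/eval/run-golden.sh", "--all"],
    ["bash", "scripts/eval/run-adversarial.sh", "--all"],
    ["bash", "scripts/eval/check-coverage.sh"],
]


def run_shell(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def run_eval_suite(run: Callable[[Sequence[str]], int] = run_shell) -> list[str]:
    """Run every eval command in order and return the ones that exited non-zero."""
    failed = []
    for argv in EVAL_COMMANDS:
        # Assumption: a non-zero exit code means the suite failed.
        if run(argv) != 0:
            failed.append(" ".join(argv))
    return failed


def fake_run(argv: Sequence[str]) -> int:
    """Stand-in runner for a dry run: pretends only the golden suite fails."""
    return 1 if "run-golden.sh" in argv[1] else 0


print(run_eval_suite(fake_run))
```

Calling `run_eval_suite()` with no argument from the repository root would invoke the real scripts; the injectable `run` parameter exists so the wrapper can be exercised without them.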

Skill Tiers

Every skill declares one of three tiers in its frontmatter, and the tier governs what the skill can do: read-only (analyzes, never writes), draft (in development, not trusted for production), and action-allowed (full read/write/execute capability). Generated skills start at draft and require human review before promotion.
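
Reading the tier from a skill file and refusing draft skills in production might look like the sketch below. It assumes the frontmatter is a `---`-delimited block at the top of SKILL.md with a key named `tier`; the key name, the delimiter, and the `example-skill` sample are assumptions for illustration, and only the three tier values come from this guide.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read-only"
    DRAFT = "draft"
    ACTION_ALLOWED = "action-allowed"


def read_tier(skill_md: str) -> Tier:
    """Extract the tier from a '---'-delimited frontmatter block (assumed layout)."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("no frontmatter block found")
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "tier":
            # Tier(...) raises ValueError for any value outside the three tiers.
            return Tier(value.strip().strip("\"'"))
    raise ValueError("frontmatter has no tier field")


def is_production_ready(tier: Tier) -> bool:
    """Draft skills are in development and not trusted for production."""
    return tier is not Tier.DRAFT


# Hypothetical generated skill: generated skills start at draft.
generated = "---\nname: example-skill\ntier: draft\n---\n# Example\n"
tier = read_tier(generated)
print(tier.value, is_production_ready(tier))  # draft False
```

Modeling the tiers as an enum makes an unrecognized tier value fail loudly at parse time instead of silently granting or denying capabilities.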

Full Documentation

For the complete reference including format specification, schema, trigger accuracy targets, coverage checklist, and examples, read EVAL-GUIDE.md.
