Evaluation Guide
How the skill evaluation system verifies skill quality and reliability.
What Is the Eval System?
The evaluation system checks every skill against the four failure modes defined by the Agent Skills whitepaper (Singhal et al., 2026): trigger (does the skill activate when it should, and only then?), execution (is the output correct?), token budget (does SKILL.md stay within its limit?), and regression (do co-loaded skills interfere with each other?).
The Four Failure Modes
| Mode | What It Tests | Eval Type |
|---|---|---|
| Trigger | Does the skill fire when it should (and stay silent when it shouldn't)? | trigger_positive, trigger_negative |
| Execution | Does the skill produce correct output according to a rubric? | golden.jsonl with rubric criteria |
| Token Budget | Does the SKILL.md stay within 5000 tokens? | Automated by skill-lint.sh |
| Regression | Do co-loaded skills interfere with each other? | regression cases |
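To make the trigger mode concrete, here is a minimal sketch of how positive and negative trigger cases could be scored. It is not the project's actual runner: the `TriggerCase` fields, the `activated` callable, and the toy router are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TriggerCase:
    prompt: str
    should_trigger: bool  # True for trigger_positive, False for trigger_negative


def score_trigger_cases(cases, activated):
    """Compare expected and observed activation for each case.

    `activated` is a hypothetical callable (prompt -> bool) standing in
    for whatever decides whether the skill fires.
    """
    # Positive cases where the skill stayed silent.
    missed = [c.prompt for c in cases if c.should_trigger and not activated(c.prompt)]
    # Negative cases where the skill fired anyway.
    spurious = [c.prompt for c in cases if not c.should_trigger and activated(c.prompt)]
    correct = len(cases) - len(missed) - len(spurious)
    accuracy = correct / len(cases) if cases else 0.0
    return {"accuracy": accuracy, "missed": missed, "spurious": spurious}


def toy_router(prompt):
    # Toy stand-in for a real skill router: fires on the word "invoice".
    return "invoice" in prompt.lower()


cases = [
    TriggerCase("Generate an invoice for ACME", True),
    TriggerCase("Fix the invoice totals", True),
    TriggerCase("Bill the client for March", True),  # rephrasing the toy router misses
    TriggerCase("Summarize this meeting", False),
]
report = score_trigger_cases(cases, toy_router)
print(report)
```

Reporting missed and spurious activations separately keeps the two trigger failure directions visible instead of folding them into a single accuracy number.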
Adding Evals to a Skill
Each skill has an evals/ directory with up to three files: trigger.jsonl (positive and negative trigger tests), golden.jsonl (known-good inputs with rubrics), and adversarial.jsonl (rephrasing, boundary, and edge cases). Use the skill-creator skill to generate eval cases automatically.
skills/<skill-name>/
├── SKILL.md
└── evals/
    ├── trigger.jsonl      # 2+ positive, 1+ negative
    ├── golden.jsonl       # 1+ case with rubric
    └── adversarial.jsonl  # rephrasing + boundary + edge
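The minimums noted in the tree comments can be checked mechanically. The sketch below is one way to do it, under stated assumptions: the JSONL field names `type` and `rubric` are guesses rather than the documented schema (see EVAL-GUIDE.md for the real format), and `check_minimums` is a hypothetical helper, not part of the repository's scripts.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Parse one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def check_minimums(evals_dir):
    """Return a list of problems; an empty list means the minimums are met."""
    evals_dir = Path(evals_dir)
    problems = []

    trigger_path = evals_dir / "trigger.jsonl"
    if trigger_path.exists():
        cases = load_jsonl(trigger_path)
        # "type" is an assumed field name for the positive/negative label.
        positives = sum(1 for c in cases if c.get("type") == "trigger_positive")
        negatives = sum(1 for c in cases if c.get("type") == "trigger_negative")
        if positives < 2:
            problems.append(f"trigger.jsonl: {positives} positive case(s), need 2+")
        if negatives < 1:
            problems.append(f"trigger.jsonl: {negatives} negative case(s), need 1+")

    golden_path = evals_dir / "golden.jsonl"
    if golden_path.exists():
        cases = load_jsonl(golden_path)
        # "rubric" is an assumed field name; empty or missing rubrics do not count.
        with_rubric = sum(1 for c in cases if c.get("rubric"))
        if with_rubric < 1:
            problems.append("golden.jsonl: need 1+ case with a rubric")

    return problems


# Usage: build a throwaway evals/ directory that is one positive case short.
with tempfile.TemporaryDirectory() as tmp:
    evals = Path(tmp)
    (evals / "trigger.jsonl").write_text(
        '{"type": "trigger_positive", "prompt": "a"}\n'
        '{"type": "trigger_negative", "prompt": "b"}\n',
        encoding="utf-8",
    )
    (evals / "golden.jsonl").write_text(
        '{"input": "x", "rubric": ["is correct"]}\n', encoding="utf-8"
    )
    problems = check_minimums(evals)
print(problems)
```

Because a skill has up to three eval files, the sketch only checks files that exist. Adversarial categories are left out since their field name is not given here.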
Running Evals
Use the eval scripts to validate skill behavior:
bash scripts/eval/run-evals.sh --all # 172 cases
bash scripts/eval/run-golden.sh --all # 60 cases
bash scripts/eval/run-adversarial.sh --all # 118 cases
bash scripts/eval/check-coverage.sh # Coverage report
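If the four commands are run together, for example in CI, a small wrapper can stop at the first failure. This is a sketch only: it assumes each script exits with a non-zero status when a check fails, which this guide does not state, and `run_eval_suite` is a hypothetical helper.

```python
import subprocess

# The commands listed above, in the same order.
EVAL_COMMANDS = [
    ["bash", "scripts/eval/run-evals.sh", "--all"],
    ["bash", "scripts/eval/run-golden.sh", "--all"],
    ["bash", "scripts/eval/run-adversarial.sh", "--all"],
    ["bash", "scripts/eval/check-coverage.sh"],
]


def run_eval_suite(run=subprocess.run):
    """Run each command in order; return the first failing command, or None.

    `run` is injectable so the control flow can be exercised without
    the real scripts. Assumes a non-zero exit status signals failure.
    """
    for cmd in EVAL_COMMANDS:
        result = run(cmd)
        if result.returncode != 0:
            return cmd
    return None
```

Calling `run_eval_suite()` from the repository root would invoke the real scripts. Passing a fake `run` makes it possible to test the wrapper in isolation.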
Skill Tiers
Every skill declares a tier in its frontmatter that governs what it can do: read-only (analyzes, never writes), draft (in development, not trusted for production), or action-allowed (full read/write/execute capability). Generated skills start at draft and require human review before promotion.
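The tier rules can be sketched as a small model. Everything below is illustrative, not the project's enforcement code: `Tier`, `promote`, and `may_write` are hypothetical names. The source does not say what a draft skill may do, so this sketch denies it write access as an assumption.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read-only"
    DRAFT = "draft"
    ACTION_ALLOWED = "action-allowed"


def initial_tier_for_generated_skill():
    # Generated skills start at draft.
    return Tier.DRAFT


def promote(current, target, human_reviewed):
    """Return the new tier, refusing to leave draft without human review."""
    if current is Tier.DRAFT and target is not Tier.DRAFT and not human_reviewed:
        raise PermissionError("draft skills require human review before promotion")
    return target


def may_write(tier):
    # read-only never writes; action-allowed has full read/write/execute.
    # Draft behavior is not specified in this guide, so it is denied here (assumption).
    return tier is Tier.ACTION_ALLOWED


tier = initial_tier_for_generated_skill()
tier = promote(tier, Tier.READ_ONLY, human_reviewed=True)
print(tier.value, may_write(tier))
```

Keeping the frontmatter strings as enum values means `Tier("draft")` parses a tier directly and rejects unknown values with a `ValueError`.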
Full Documentation
For the complete reference including format specification, schema, trigger accuracy targets, coverage checklist, and examples, read EVAL-GUIDE.md.