
Evaluation Guide

How the skill evaluation system verifies skill quality and reliability.

What Is the Eval System?

The evaluation system checks every skill against the four failure modes defined by the Agent Skills whitepaper (Singhal et al., 2026): trigger (does the skill activate when it should?), execution (is the output correct?), token budget (is SKILL.md within its token limit?), and regression (do co-loaded skills interfere?).

The Four Failure Modes

Mode	What It Tests	Eval Type
Trigger	Does the skill fire when it should (and stay silent when it shouldn't)?	`trigger_positive`, `trigger_negative`
Execution	Does the skill produce correct output according to a rubric?	`golden.jsonl` with rubric criteria
Token Budget	Does the SKILL.md stay within 5000 tokens?	Automated by `skill-lint.sh`
Regression	Do co-loaded skills interfere with each other?	`regression` cases
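As a rough illustration of the token budget row, the sketch below estimates a file's token count from its size and compares it to the 5000-token limit. The 4-bytes-per-token ratio and the function names `estimate_tokens` and `check_token_budget` are assumptions made for this example; `skill-lint.sh` may count tokens differently.

```shell
#!/usr/bin/env bash
# Sketch only: approximate a token count from file size.
# ASSUMPTION: ~4 bytes per token. This is a heuristic, not a real tokenizer,
# and not necessarily what skill-lint.sh does.

estimate_tokens() {
  # Print the estimated token count for file $1 (bytes / 4, rounded up).
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( (bytes + 3) / 4 ))
}

check_token_budget() {
  # Return 0 if the estimate for file $1 is within limit $2 (default 5000).
  local file="$1" limit="${2:-5000}" estimate
  estimate=$(estimate_tokens "$file")
  if [ "$estimate" -gt "$limit" ]; then
    echo "$file: ~$estimate tokens exceeds budget of $limit"
    return 1
  fi
  echo "$file: ~$estimate tokens (budget $limit)"
}
```

Calling `check_token_budget skills/<skill-name>/SKILL.md` gives a quick local signal before running the real linter, which remains the authority on the limit.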

Adding Evals to a Skill

Each skill has an `evals/` directory with up to three files: `trigger.jsonl` (positive and negative trigger tests), `golden.jsonl` (known-good inputs with rubrics), and `adversarial.jsonl` (rephrasing, boundary, and edge cases). Use the skill-creator skill to generate eval cases automatically.

skills/<skill-name>/
├── SKILL.md
└── evals/
    ├── trigger.jsonl       # 2+ positive, 1+ negative
    ├── golden.jsonl        # 1+ case with rubric
    └── adversarial.jsonl   # rephrasing + boundary + edge
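The minimum counts noted in the tree above (2+ positive and 1+ negative trigger cases, 1+ golden case) could be checked with a sketch like the following. It assumes each line of `trigger.jsonl` contains the literal string `trigger_positive` or `trigger_negative`; the actual schema is defined in EVAL-GUIDE.md, and the function names here are hypothetical. It does not check that golden cases carry a rubric.

```shell
#!/usr/bin/env bash
# Sketch only: verify minimum eval case counts for one evals/ directory.
# ASSUMPTION: each trigger case line names its eval type literally
# (trigger_positive / trigger_negative). The real schema may differ.

count_matching() {
  # Print how many lines of file $2 contain the fixed string $1 (0 if missing).
  local pattern="$1" file="$2"
  if [ -f "$file" ]; then
    grep -cF -- "$pattern" "$file" || true   # grep -c prints 0 but exits 1 on no match
  else
    echo 0
  fi
}

count_nonblank() {
  # Print the number of non-blank lines in file $1 (0 if missing).
  if [ -f "$1" ]; then
    grep -c '[^[:space:]]' "$1" || true
  else
    echo 0
  fi
}

check_eval_minimums() {
  # Return 0 if directory $1 meets the minimum counts, 1 otherwise.
  local evals_dir="$1" status=0 pos neg golden
  pos=$(count_matching 'trigger_positive' "$evals_dir/trigger.jsonl")
  neg=$(count_matching 'trigger_negative' "$evals_dir/trigger.jsonl")
  golden=$(count_nonblank "$evals_dir/golden.jsonl")

  [ "$pos" -ge 2 ]    || { echo "need 2+ positive trigger cases, found $pos"; status=1; }
  [ "$neg" -ge 1 ]    || { echo "need 1+ negative trigger case, found $neg"; status=1; }
  [ "$golden" -ge 1 ] || { echo "need 1+ golden case, found $golden"; status=1; }
  return "$status"
}
```

For example, `check_eval_minimums skills/<skill-name>/evals` prints one line per unmet minimum and returns non-zero if any is missing.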

Running Evals

Use the eval scripts to validate skill behavior:

bash scripts/eval/run-evals.sh --all           # 172 cases
bash scripts/eval/run-golden.sh --all          # 60 cases
bash scripts/eval/run-adversarial.sh --all     # 118 cases
bash scripts/eval/check-coverage.sh            # Coverage report
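To run the four commands above as a single gate (for example before a commit), a wrapper along these lines may help. It assumes each script exits non-zero when a case fails, which this guide does not state; `run_eval_suite` is a hypothetical name, not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch only: run the eval scripts in order and stop at the first failure.
# ASSUMPTION: each script signals failure with a non-zero exit status.

run_eval_suite() {
  # $1: directory holding the eval scripts (defaults to scripts/eval).
  local script_dir="${1:-scripts/eval}" script
  for script in run-evals.sh run-golden.sh run-adversarial.sh; do
    echo "==> $script --all"
    bash "$script_dir/$script" --all || { echo "FAILED: $script" >&2; return 1; }
  done
  echo "==> check-coverage.sh"
  bash "$script_dir/check-coverage.sh" || { echo "FAILED: check-coverage.sh" >&2; return 1; }
  echo "all eval steps passed"
}
```

Stopping at the first failure keeps the output short; drop the `return 1` in the loop if a full report across all scripts is more useful.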

Skill Tiers

Every skill declares a tier in its frontmatter that governs what it can do: `read-only` (analyzes, never writes), `draft` (in development, not trusted for production), or `action-allowed` (full read/write/execute capability). Generated skills start at `draft` and require human review before promotion.
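A minimal sketch of reading and validating a skill's tier follows. It assumes the frontmatter is a leading block delimited by `---` lines and that the key is spelled `tier:`; both are assumptions about the file layout, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: extract and validate the tier declared in a SKILL.md.
# ASSUMPTION: frontmatter is a leading --- ... --- block with a `tier:` key.

read_skill_tier() {
  # Print the value of the first `tier:` key inside the frontmatter of file $1.
  # Prints nothing if the file has no frontmatter or no tier key.
  awk '
    NR == 1 && $0 != "---" { exit }   # file does not start with frontmatter
    NR > 1  && $0 == "---" { exit }   # reached the end of the frontmatter
    /^tier:/ {
      sub(/^tier:[[:space:]]*/, "")   # drop the key and leading spaces
      sub(/[[:space:]]+$/, "")        # drop trailing spaces
      print
      exit
    }
  ' "$1"
}

is_valid_tier() {
  # Return 0 if $1 is one of the three tiers named in this guide.
  case "$1" in
    read-only|draft|action-allowed) return 0 ;;
    *) return 1 ;;
  esac
}
```

A check such as `is_valid_tier "$(read_skill_tier skills/<skill-name>/SKILL.md)"` could then flag skills with a missing or misspelled tier.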

Full Documentation

For the complete reference including format specification, schema, trigger accuracy targets, coverage checklist, and examples, read EVAL-GUIDE.md.
