Skills are agent prompts. Like any prompt, they have measurable failure modes. This directory documents the empirical evidence that shaped every structural choice in
Frontify/skillsβ from the wording of adescriptionfield to the absence of any automation in the repo.
Every choice in README.md, CONTRIBUTING.md, AGENTS.md, and the shipped skills traces to one of the documents below. If a contributor asks "why is the description in this form?", "why are there no scripts in this repo?", or "why don't skills reference each other?" β the answer is here, with a citation, not a hunch.
| Document | What it documents |
|---|---|
| Activation | Why every description is in directive form, the exclusion-clause discipline, the compliance ceiling |
| Body anatomy | Numbered rules with rationale, length budgets, anti-patterns sections, reference depth |
| Execution | The two reliability problems; the forced-visible-output fix; Reflexion as the deeper pattern |
| Self-containment | Why skills don't reference each other; the AGENTS.md contract; file-based state externalisation |
| Task files | How task templates relate to plan-mode files; why most workflow skills ship one and a few don't; Anthropic's three-file canonical pattern |
| Scope | What belongs here, what deliberately doesn't, and the principle behind each exclusion |
| Sources | Full bibliography. Every claim in this directory cites one of these |
A controlled 650-trial experiment measured Claude Code skill activation across three description styles and four environment conditions [3]. Passive "Use when β¦" descriptions activated 50β77 % of the time and collapsed to 37 % under hooks. Directive descriptions β "ALWAYS apply this skill when β¦ Do not Y. Skip for Z." β activated 100 % of trials, with ~20Γ higher odds (CMH OR = 20.6, p < 0.0001). A separate analysis distinguished that activation failure from a different failure mode where late-stage verification steps inside an already-loaded skill get silently skipped because they delay output without producing visible content [4] β the same verbal-feedback discipline Reflexion [27] showed lifts HumanEval pass@1 from 80 % to 91 %. Anthropic's official guidance adds a 500-line body cap [2] anchored in the U-shaped attention curve [5][30], and a canonical three-file note-taking pattern for long-running tasks (task_plan / progress_log / decisions) [20] β vendor-grade validation of file-based state externalisation, which an InfiAgent ablation [29] measured at 21Γ performance degradation when removed. The ETH Zurich AGENTS.md study [32] closes the loop: tool-specific commands have 2.5Γ the impact when present, but LLM-generated narrative context costs +20 % for β3 % success β the empirical case for this repo's split between universal how-to-work skills and project-specific AGENTS.md what-to-run commands.
Everything in this directory applies that paragraph to specific structural choices in the repo.
flowchart LR
A["Empirical findings<br/>(see Sources)"] --> B("6 design principles")
B --> C1["description: directive form"]
B --> C2["body: numbered rules + anti-patterns"]
B --> C3["length: ~200 lines, 500 cap"]
B --> C4["verification: forced visible output"]
B --> C5["skills: self-contained"]
B --> C6["state: externalised to task files"]
C1 --> D1["skills/*/SKILL.md frontmatter"]
C2 --> D2["skills/*/SKILL.md body"]
C3 --> D2
C4 --> D3["empirical-proof, write-testing, write-research"]
C5 --> D4["no cross-skill links + AGENTS.md contract"]
C6 --> D5["skills/*/references/task-template.md"]
Each principle has a dedicated document in this directory; the Sources page lists the primary evidence behind every link in the chain.
This is a working record, not a frozen artefact. When a new primary source materially changes a design choice, the relevant document and the Sources entry are updated together. If a source's URL goes dead, the citation is replaced or marked [archived].