## Summary

Build an evals extension for Spec Kit following EDD (Eval-Driven Development) principles, integrating with PromptFoo for eval execution and annotation.
## Reference

### The 10 EDD Principles

| # | Principle | Core Implication |
|---|-----------|------------------|
| I | Spec-Driven Contracts | Specs precede evals; evals validate spec compliance |
| II | Binary Pass/Fail | No Likert scales — output meets spec or fails |
| III | Error Analysis & Pattern Discovery | Open/axial coding → bottom-up failure taxonomy |
| IV | Evaluation Pyramid (Offline vs. Online) | CI/CD: fast checks + goldset judges; production: 10–20% sampling |
| V | Trajectory Observability | Full multi-turn traces, not just outputs |
| VI | RAG Decomposition | Separate retrieval from generation; IR metrics + LLM judges |
| VII | Annotation Queues | Route high-risk traces to humans for binary review |
| VIII | Close the Production Loop | Spec failures → fix directives; gen failures → add to dataset + evaluator |
| IX | Test Data is Code | Version datasets; include adversarial inputs; hold-out test set |
| X | Cross-Functional Observability | PMs, domain experts, and AI engineers all collaborate |
## Project Structure

```
{project}/
├── .specify/
│   ├── drafts/                       # Draft eval records (markdown + YAML frontmatter)
│   │   └── eval-*.md
│   └── config.yml                    # evals.system configured here
├── evals/
│   └── {system}/                     # promptfoo | custom | llm-judge
│       ├── goldset.md                # Published evals (markdown + YAML frontmatter)
│       ├── goldset.json              # Auto-generated for PromptFoo consumption
│       ├── graders/
│       │   ├── check_pii_leakage.py  # Security baseline (always applied)
│       │   ├── check_prompt_injection.py
│       │   └── {failure-mode}.py     # One per generalization_failure
│       ├── config.js                 # Generated PromptFoo config
│       └── config.yml                # System-specific config
├── evals/results/                    # Git-ignored; PromptFoo outputs + traces
└── specs/
    └── {feature}/
        └── tasks.md                  # [EVAL] markers per task
```
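As a rough sketch, a Tier 1 code-based grader such as `check_pii_leakage.py` might look like the following. The regexes and return shape here are illustrative assumptions, not the actual grader template; the only firm constraint from the spec is a binary pass/fail verdict (EDD Principle II).

```python
import re

# Illustrative patterns only — a real security baseline would cover more PII classes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_pii_leakage(output: str) -> dict:
    """Return a binary verdict: pass unless the output leaks PII."""
    leaked = bool(EMAIL_RE.search(output) or SSN_RE.search(output))
    return {"pass": not leaked, "reason": "pii detected" if leaked else "clean"}
```

Because the grader is deterministic, it can run on every CI/CD invocation without goldset fixtures.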
Commands
| Command |
Type |
Purpose |
evals.init |
Manual |
Initialize evals/ directory |
evals.specify |
Manual |
Bottom-up goldset definition from human error analysis → drafts/ |
evals.clarify |
Manual |
Axial coding + accept drafts → goldset.md + goldset.json |
evals.analyze |
Manual |
Re-code + quantify + saturation + adversarial check + holdout split |
evals.tasks |
Hook (after_tasks) |
Match published evals to feature tasks → [EVAL] markers |
evals.implement |
Hook (after_implement) |
Generate PromptFoo config + graders from goldset |
evals.levelup |
Manual |
Scan evals/results/ + annotation queue → PR to team-ai-directives |
evals.validate |
Manual |
TPR/TNR + goldset quality + PromptFoo pass rate thresholds |
## Hooks

```yaml
hooks:
  after_tasks:
    command: "adlc.evals.tasks"
  after_implement:
    command: "adlc.evals.implement"
```
## Goldset Lifecycle (ADR/CDR Pattern)

```
evals.specify  →  evals.clarify  →  evals.analyze  →  evals.implement
   (draft)          (accept)          (finalize)         (publish)
```
## Goldset Record Format (Markdown + YAML frontmatter)

```yaml
---
id: eval-001
status: draft | accepted | published
name: {name}
description: {description}

# Binary pass/fail only (EDD Principle II)
pass_condition: {precise spec constraint}
fail_condition: {precise spec violation}

# Failure type gate (EDD Principle VIII)
failure_type:
  specification_failure:
    action: fix_directive
  generalization_failure:
    action: build_evaluator
    evaluator_type: code-based | llm-judge

# Error analysis provenance (EDD Principle III)
error_analysis:
  traces_analyzed: N
  theoretical_saturation: true | false

# Test data hygiene (EDD Principle IX)
test_data:
  adversarial_included: true | false
  holdout_ratio: 0.2

# RAG decomposition (EDD Principle VI, optional)
rag_decomposition:
  retrieval_check: ir-metrics | llm-judge | none
  generation_check: llm-judge | none
---
```
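The `status` field gates what reaches PromptFoo: only non-draft records are serialized into `goldset.json`. A minimal sketch of that publish step (the `publish_goldset` name and the exact status gate are illustrative assumptions):

```python
import json


def publish_goldset(records: list[dict]) -> str:
    """Serialize accepted/published goldset records; drafts stay in .specify/drafts/."""
    published = [r for r in records if r.get("status") in ("accepted", "published")]
    return json.dumps(published, indent=2)
```

The real command would first parse the YAML frontmatter out of `goldset.md`; this sketch starts from already-parsed dicts to stay library-free.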
## EDD Principles → Command Mapping

| Principle | Integration |
|-----------|-------------|
| I Spec-Driven Contracts | `evals.specify` creates criteria from human error analysis, not generic metrics |
| II Binary Pass/Fail | All graders output binary pass/fail — no Likert scales |
| III Error Analysis | `evals.specify` (open coding) → `evals.clarify` (axial coding) → `evals.analyze` (saturation) |
| IV Evaluation Pyramid | `evals.implement` generates Tier 1 (fast CI/CD checks) + Tier 2 (goldset LLM judges) |
| V Trajectory Observability | `evals.levelup` scans full traces from `evals/results/`, including tool calls |
| VI RAG Decomposition | `rag_decomposition` fields in the goldset record; separate graders for retrieval vs. generation |
| VII Annotation Queues | Annotation integration via PromptFoo's (or the system's) native annotation support |
| VIII Close the Loop | Gate in `evals.implement`: `specification_failure` → fix directive; `generalization_failure` → build evaluator |
| IX Test Data is Code | `evals.analyze` ensures adversarial inputs are included, the hold-out set is preserved, and datasets are versioned |
| X Cross-Functional | `evals.levelup` PR targets `team-ai-directives/AGENTS.md` for PMs, domain experts, and AI engineers |
## Lifecycle

```
evals.init
    │
    ▼
evals.specify    ← bottom-up error analysis from human
    (free-text notes per trace → drafts/eval-*.md)
    │
    ▼
evals.clarify    ← axial coding
    (cluster notes → failure modes → accept → goldset.md → goldset.json)
    │
    ▼
evals.analyze    ← quantify + saturation + adversarial + holdout
    │
    ▼
evals.implement  ← after_implement hook
    ┌─ Tier 1 (CI/CD): fast deterministic graders
    │    • code-based: XML structure, SQL syntax, regex match
    │    • security baseline: pii_leakage, injection, etc.
    ├─ Tier 2 (CI/CD): goldset LLM judges
    │    • semantic checks on the version-controlled goldset
    │    • blocks bad merges
    └─ Annotation integration: routes traces to the system's annotation support
    │
    ▼
evals.validate   ← TPR/TNR + goldset quality + pass rate thresholds
    │
    ▼
evals.levelup    ← evals/results/ scan + annotation queue
    (full trajectories → high-risk routing → binary human review
     → PR to team-ai-directives/AGENTS.md)
    │
    ▼
evals.tasks      ← after_tasks hook
    (match all published evals → [EVAL] markers in specs/{feature}/tasks.md)
```
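The TPR/TNR check in `evals.validate` measures how well each grader agrees with human binary labels on the goldset. A minimal sketch, assuming parallel lists of human and grader verdicts (the function name and input shape are illustrative, not the actual implementation):

```python
def tpr_tnr(human: list[bool], grader: list[bool]) -> tuple[float, float]:
    """True positive rate and true negative rate of a grader vs. human labels."""
    tp = sum(h and g for h, g in zip(human, grader))          # both flag pass
    tn = sum((not h) and (not g) for h, g in zip(human, grader))  # both flag fail
    pos = sum(human)              # human-labeled passes
    neg = len(human) - pos        # human-labeled fails
    return tp / pos, tn / neg
```

A grader that falls below the configured TPR/TNR thresholds should be treated as unreliable and sent back through error analysis rather than trusted in CI/CD.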
## evals.tasks — Task Marker Matching

- **Matching logic:** keyword overlap (default) + exact tag match (if `task_tags` is defined in the goldset)
- **Multiple evals per task:** allowed; conflicts are flagged for human review with an `AMBIGUOUS:` marker
- **Scope:** all published evals from `goldset.md` (no feature filtering)
- **Marker format:** `[EVAL]: eval-001 (check_name)`

Variants:

- `/evals.tasks --dry-run` → show proposed markers (default for manual invocation)
- `/evals.tasks --apply` → write markers to `tasks.md`
- Hook mode → auto-write (via the `after_tasks` hook)

Example output:

```markdown
## TASK-001: Implement token validation
[EVAL]: eval-001 (auth_token_present)
[EVAL]: eval-003 (pii_not_leaked)
⚠ AMBIGUOUS: eval-002 vs eval-005 — review before apply

## TASK-002: Implement password reset
[EVAL]: eval-001 (auth_token_present)
```
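The default keyword-overlap matching could be sketched like this (hypothetical names and threshold; the real command additionally applies exact `task_tags` matching and ambiguity flagging):

```python
def match_evals(task_title: str, evals: list[dict], threshold: int = 1) -> list[str]:
    """Propose [EVAL] markers for a task by word overlap with eval names/descriptions."""
    task_words = set(task_title.lower().split())
    markers = []
    for e in evals:
        eval_words = set((e["name"] + " " + e["description"]).lower().split())
        if len(task_words & eval_words) >= threshold:
            markers.append(f'[EVAL]: {e["id"]} ({e["name"]})')
    return markers
```

With `--dry-run` the proposed markers would only be printed; `--apply` (or the hook) would write them under the matching task headings in `tasks.md`.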
## Extension Files

```
.specify/extensions/evals/
├── extension.yml               # 8 commands, 2 hooks, tags, defaults, handoffs
├── README.md                   # Full docs + EDD Principles + Evaluation Pyramid
├── CHANGELOG.md
├── config-template.yml         # system: promptfoo (default)
│
├── commands/
│   ├── init.md                 # Initialize evals/{system}/ directory
│   ├── specify.md              # Bottom-up from human error analysis → drafts/
│   ├── clarify.md              # Axial coding → goldset.md + goldset.json
│   ├── analyze.md              # Quantify + saturation + adversarial + holdout
│   ├── tasks.md                # Match evals → [EVAL] markers (after_tasks hook)
│   ├── implement.md            # Tier 1 + Tier 2 graders + annotation integration
│   ├── validate.md             # TPR/TNR + goldset quality + pass rate thresholds
│   └── levelup.md              # evals/results/ scan + annotation queue + PR
│
├── scripts/bash/
│   └── setup-evals.sh          # All 8 actions + --json output
│
└── templates/
    ├── eval-criterion.md       # Individual eval criterion (binary pass/fail)
    ├── goldset-record.md       # Draft → Accept → Published lifecycle
    ├── failure-mode-registry.md  # Failure taxonomy (per-system)
    ├── promptfoo-test.yaml     # PromptFoo test template
    └── grader-template.py      # Python grader template (code-based)
```
## Implementation Order

1. **Scaffold:** `extension.yml` + `setup-evals.sh` + `config-template.yml`
2. **Goldset lifecycle:** `init.md` → `specify.md` → `clarify.md` → `analyze.md`
3. **Evaluator building:** `implement.md` + PromptFoo templates + grader template
4. **Validation:** `validate.md` (TPR/TNR + quality + pass rates)
5. **Ops:** `levelup.md` (traces + annotation + PR) + `tasks.md` (markers)
6. **Docs:** `templates/` + `README.md` + `CHANGELOG.md`