feat: add failure taxonomy and LLM judge template #1

Merged: widingmarcus-cyber merged 1 commit into main from feat/evals-skills-taxonomy on Mar 4, 2026

@widingmarcus-cyber

Owner

Summary

Based on Hamel Husain's evals-skills approach and team discussion.

Failure Taxonomy (v1.1)

8 primary modes:

  • MISREAD_SPEC - Task misunderstanding
  • HALLUCINATION - Referenced non-existent things
  • IMPL_BUG - Implementation bug
  • REGRESSION - Fixed problem, broke something else
  • TIMEOUT_STUCK - No progress, looped
  • BUDGET_EXCEEDED - Token/time/cost limits
  • CONSTRAINT_VIOLATION - Broke rules
  • INFRA_FAILURE - Environment failed

7 secondary signals: scope_drift, premature_victory, instruction_amnesia, tool_fumbling, loop_stuck, state_drift, retry_violation
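
The taxonomy above could be encoded roughly as follows. This is a minimal sketch, not the code merged in this PR: the names `PrimaryFailure`, `SECONDARY_SIGNALS`, and `FailureLabel` are hypothetical, and the rule that each attempt gets exactly one primary mode plus any number of secondary signals is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class PrimaryFailure(Enum):
    """The 8 primary failure modes listed in taxonomy v1.1."""
    MISREAD_SPEC = "Task misunderstanding"
    HALLUCINATION = "Referenced non-existent things"
    IMPL_BUG = "Implementation bug"
    REGRESSION = "Fixed problem, broke something else"
    TIMEOUT_STUCK = "No progress, looped"
    BUDGET_EXCEEDED = "Token/time/cost limits"
    CONSTRAINT_VIOLATION = "Broke rules"
    INFRA_FAILURE = "Environment failed"


# The 7 secondary signals, kept as plain strings to match the list above.
SECONDARY_SIGNALS = frozenset({
    "scope_drift", "premature_victory", "instruction_amnesia",
    "tool_fumbling", "loop_stuck", "state_drift", "retry_violation",
})


@dataclass(frozen=True)
class FailureLabel:
    """Hypothetical record: one primary mode plus optional secondary signals."""
    primary: PrimaryFailure
    secondary: frozenset = field(default_factory=frozenset)

    def __post_init__(self):
        # Reject any signal name that is not part of the taxonomy.
        unknown = set(self.secondary) - SECONDARY_SIGNALS
        if unknown:
            raise ValueError(f"unknown secondary signals: {sorted(unknown)}")
```

Rejecting unknown signal names at construction time keeps typos such as `scope-drift` out of stored labels.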

LLM-as-Judge Template

challenges/_templates/judge.yaml with:

  • Binary 0/1 criteria with weights
  • Anchored examples (good/bad)
  • JSON-only output with schema validation
  • temperature=0, seed=42 for reproducibility
  • Calibration set for drift detection
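
To illustrate how binary weighted criteria and JSON-only output fit together, here is a minimal sketch of validating and scoring a judge reply. The criterion names, the weights, the flat `{criterion: 0 or 1}` output shape, and the function `score_judge_output` are all assumptions made for this example; the actual fields in `judge.yaml` may differ.

```python
import json

# Hypothetical criteria mapping name -> weight (not taken from judge.yaml).
CRITERIA = {"tests_pass": 0.5, "follows_spec": 0.3, "no_scope_drift": 0.2}


def score_judge_output(raw: str, criteria: dict) -> float:
    """Validate a JSON-only judge reply and return a weighted score in [0, 1]."""
    verdicts = json.loads(raw)  # raises ValueError if the reply is not JSON
    if not isinstance(verdicts, dict) or set(verdicts) != set(criteria):
        raise ValueError("judge output must contain exactly the expected criteria")
    for name, value in verdicts.items():
        # Accept only the integers 0 and 1 (booleans and floats are rejected).
        if type(value) is not int or value not in (0, 1):
            raise ValueError(f"criterion {name!r} must be 0 or 1, got {value!r}")
    total_weight = sum(criteria.values())
    return sum(criteria[name] * v for name, v in verdicts.items()) / total_weight
```

For example, a reply of `{"tests_pass": 1, "follows_spec": 0, "no_scope_drift": 1}` scores about 0.7 under these weights, while a reply containing prose around the JSON is rejected outright.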

Next Steps

  1. Implement opengym judge <attempt> command
  2. Add judge.yaml to a few challenges as examples
  3. Write calibration tests
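
A calibration test for step 3 might look something like the sketch below: re-run the judge on a fixed set of already-labeled attempts and flag drift when agreement with the stored labels falls below a threshold. The function names, the `{attempt_id: 0 or 1}` data shape, and the 0.9 threshold are hypothetical choices for illustration, not part of this PR.

```python
def agreement_rate(expected: dict, observed: dict) -> float:
    """Fraction of calibration items where the judge matches the stored label."""
    if not expected:
        raise ValueError("calibration set is empty")
    matches = sum(
        1 for item_id, label in expected.items()
        if observed.get(item_id) == label  # a missing verdict counts as a mismatch
    )
    return matches / len(expected)


def has_drifted(expected: dict, observed: dict, min_agreement: float = 0.9) -> bool:
    """Return True when agreement falls below the (assumed) threshold."""
    return agreement_rate(expected, observed) < min_agreement
```

Here `expected` holds the human-assigned verdicts for the calibration set and `observed` holds what the judge returned on the latest run.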

Co-authored-by: Loyd, Liselott (swarm team)

Based on Hamel Husain's evals-skills approach and team input:

## Failure Taxonomy (v1.1)
8 primary modes: MISREAD_SPEC, HALLUCINATION, IMPL_BUG, REGRESSION,
TIMEOUT_STUCK, BUDGET_EXCEEDED, CONSTRAINT_VIOLATION, INFRA_FAILURE

7 secondary signals: scope_drift, premature_victory, instruction_amnesia,
tool_fumbling, loop_stuck, state_drift, retry_violation

## LLM-as-Judge Template (challenges/_templates/judge.yaml)
- Binary 0/1 criteria with weights
- Anchored examples (good/bad)
- JSON-only output with schema validation
- temperature=0, seed=42 for reproducibility
- Calibration set for drift detection

Inspired by: https://hamel.dev/blog/posts/evals-skills/
Co-authored-by: Loyd <loyd@swarm.local>
Co-authored-by: Liselott <liselott@swarm.local>
@widingmarcus-cyber widingmarcus-cyber merged commit 7f5393e into main Mar 4, 2026
9 checks passed
@widingmarcus-cyber widingmarcus-cyber deleted the feat/evals-skills-taxonomy branch March 4, 2026 06:41