feat: add failure taxonomy and LLM judge template by widingmarcus-cyber · Pull Request #1 · widingmarcus-cyber/opengym

widingmarcus-cyber · 2026-03-04T06:39:17Z

Summary

Adds a failure taxonomy and an LLM-as-judge template, based on Hamel Husain's evals-skills approach and team discussion.

Failure Taxonomy (v1.1)

8 primary modes:

MISREAD_SPEC - Task misunderstanding
HALLUCINATION - Referenced non-existent things
IMPL_BUG - Implementation bug
REGRESSION - Fixed problem, broke something else
TIMEOUT_STUCK - No progress, looped
BUDGET_EXCEEDED - Token/time/cost limits
CONSTRAINT_VIOLATION - Broke rules
INFRA_FAILURE - Environment failed

7 secondary signals: scope_drift, premature_victory, instruction_amnesia, tool_fumbling, loop_stuck, state_drift, retry_violation
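As a rough illustration of how the taxonomy could be represented in code, the sketch below models the 8 primary modes and 7 secondary signals as enums. The class names, the `FailureLabel` container, and the rule of one primary mode plus any number of secondary signals per attempt are assumptions for illustration; the PR does not specify them.

```python
from dataclasses import dataclass
from enum import Enum


class PrimaryMode(Enum):
    """The 8 primary failure modes; values are the PR's short descriptions."""
    MISREAD_SPEC = "Task misunderstanding"
    HALLUCINATION = "Referenced non-existent things"
    IMPL_BUG = "Implementation bug"
    REGRESSION = "Fixed problem, broke something else"
    TIMEOUT_STUCK = "No progress, looped"
    BUDGET_EXCEEDED = "Token/time/cost limits"
    CONSTRAINT_VIOLATION = "Broke rules"
    INFRA_FAILURE = "Environment failed"


class SecondarySignal(Enum):
    """The 7 secondary signals, keyed by the names used in the PR."""
    SCOPE_DRIFT = "scope_drift"
    PREMATURE_VICTORY = "premature_victory"
    INSTRUCTION_AMNESIA = "instruction_amnesia"
    TOOL_FUMBLING = "tool_fumbling"
    LOOP_STUCK = "loop_stuck"
    STATE_DRIFT = "state_drift"
    RETRY_VIOLATION = "retry_violation"


@dataclass(frozen=True)
class FailureLabel:
    """Hypothetical container: one primary mode, zero or more signals."""
    primary: PrimaryMode
    secondary: frozenset = frozenset()


def parse_label(primary: str, secondary: list) -> FailureLabel:
    """Build a label from raw strings.

    Raises KeyError for an unknown primary mode name and ValueError
    for an unknown secondary signal.
    """
    return FailureLabel(
        primary=PrimaryMode[primary],
        secondary=frozenset(SecondarySignal(s) for s in secondary),
    )


label = parse_label("REGRESSION", ["scope_drift", "loop_stuck"])
print(label.primary.name, sorted(s.value for s in label.secondary))
```

Rejecting unknown names at parse time keeps free-form strings out of downstream reports, which is one reason to prefer enums over plain strings here.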

LLM-as-Judge Template

Adds challenges/_templates/judge.yaml with:

Binary 0/1 criteria with weights
Anchored examples (good/bad)
JSON-only output with schema validation
temperature=0, seed=42 for reproducibility
Calibration set for drift detection
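To make the template's properties concrete, here is a minimal sketch of validating a judge's JSON-only output and combining binary 0/1 criteria into a weighted score. The criterion names, the weights, and the function names are hypothetical and are not taken from judge.yaml; only the `temperature=0` and `seed=42` settings come from the PR.

```python
import json

# Hypothetical criteria and weights, for illustration only.
CRITERIA = {"follows_spec": 0.5, "tests_pass": 0.3, "no_scope_drift": 0.2}

# Sampling settings named in the PR, to be passed to whatever model client is used.
JUDGE_PARAMS = {"temperature": 0, "seed": 42}


def parse_verdict(raw: str) -> dict:
    """Parse the judge's raw output and enforce a strict schema.

    The output must be a JSON object with exactly the expected
    criterion names, each mapped to the integer 0 or 1.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if set(data) != set(CRITERIA):
        raise ValueError("verdict keys do not match the criteria")
    for name, value in data.items():
        # type() check rejects booleans and floats such as 1.0
        if type(value) is not int or value not in (0, 1):
            raise ValueError(f"criterion {name!r} must be 0 or 1")
    return data


def weighted_score(verdict: dict) -> float:
    """Weighted fraction of criteria passed, normalised to [0, 1]."""
    total = sum(CRITERIA.values())
    return sum(CRITERIA[name] * value for name, value in verdict.items()) / total


verdict = parse_verdict('{"follows_spec": 1, "tests_pass": 0, "no_scope_drift": 1}')
print(round(weighted_score(verdict), 3))
```

Keeping each criterion binary means disagreements between the judge and a human reviewer can be traced to a single named criterion rather than to a point on a fuzzy scale.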

Next Steps

Implement opengym judge <attempt> command
Add judge.yaml to a few challenges as examples
Write calibration tests
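One possible shape for the calibration tests, sketched under assumptions: the calibration set is taken to be a fixed list of attempts with reference 0/1 labels, and drift is flagged when the judge's agreement with those labels falls below a threshold. The function names and the 0.9 threshold are hypothetical; the PR does not define how drift is measured.

```python
def agreement_rate(judge_labels: list, reference_labels: list) -> float:
    """Fraction of calibration items where the judge matches the reference."""
    if not judge_labels or len(judge_labels) != len(reference_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == r for j, r in zip(judge_labels, reference_labels))
    return matches / len(judge_labels)


def has_drifted(judge_labels: list, reference_labels: list,
                min_agreement: float = 0.9) -> bool:
    """Flag drift when agreement drops below the (assumed) threshold."""
    return agreement_rate(judge_labels, reference_labels) < min_agreement


# Illustrative labels only, not real calibration data.
reference = [1, 0, 0, 1]
judged = [1, 0, 1, 1]
print(agreement_rate(judged, reference), has_drifted(judged, reference))
```

Rerunning such a check whenever the judge prompt or the underlying model changes would show whether verdicts on known cases have shifted.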

Co-authored-by: Loyd, Liselott (swarm team)

Based on Hamel Husain's evals-skills approach and team input:

## Failure Taxonomy (v1.1)

8 primary modes: MISREAD_SPEC, HALLUCINATION, IMPL_BUG, REGRESSION, TIMEOUT_STUCK, BUDGET_EXCEEDED, CONSTRAINT_VIOLATION, INFRA_FAILURE

7 secondary signals: scope_drift, premature_victory, instruction_amnesia, tool_fumbling, loop_stuck, state_drift, retry_violation

## LLM-as-Judge Template (challenges/_templates/judge.yaml)

- Binary 0/1 criteria with weights
- Anchored examples (good/bad)
- JSON-only output with schema validation
- temperature=0, seed=42 for reproducibility
- Calibration set for drift detection

Inspired by: https://hamel.dev/blog/posts/evals-skills/

Co-authored-by: Loyd <loyd@swarm.local>
Co-authored-by: Liselott <liselott@swarm.local>

widingmarcus-cyber merged commit 7f5393e into main Mar 4, 2026
9 checks passed

widingmarcus-cyber deleted the feat/evals-skills-taxonomy branch March 4, 2026 06:41

widingmarcus-cyber merged 1 commit into main from feat/evals-skills-taxonomy