feat(eval): resolver-eval harness + fix BRO-1338 (BRO-1411 slice 1)#7
Conversation
Skillify step 7 (resolver eval), built bstack-native — BRO-1411 slice 1.
Origin: /checkit on Garry Tan's skillify essay surfaced that bstack tests its
code + governs skill promotion but never tests its skills as artifacts; nothing
asserts a lens trigger actually routes.
- Fix BRO-1338: _kw_matches() substring-matches multi-word/punctuated keywords
('check this out', '/checkit', "let's research this") against the raw prompt;
clean single tokens keep word-boundary set-membership (no regression).
- Add role-x.py eval: runs roles/<lens>.eval.yaml should_fire/should_not_fire
fixtures through _select_lenses, asserts selection; exit 1 on fail (CI-gate).
- Add include_statuses to _select_lenses so candidate lenses are routing-testable
before promotion.
- 13 tests (tests/test_eval.py). Full suite 66 green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 50 minutes and 29 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR introduces a resolver-eval testing harness (BRO-1411) that validates lens routing logic against YAML-defined intent fixtures. It also fixes multi-word keyword phrase matching in lens scoring (BRO-1338) and extends lens selection to filter by status for pre-promotion evaluation. ChangesResolver-eval harness with keyword phrase matching and lens status filtering
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/test_eval.py (1)
164-173: ⚡ Quick winAdd a failing-case
--jsontest to lock the output contract.This suite checks JSON only on pass. Add one failure-path JSON parse/assertion so machine-readable output is enforced when eval fails too.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_eval.py` around lines 164 - 173, Extend the JSON-output test suite by adding a new test (similar to test_eval_json_output) that seeds a workspace producing a failing lens and invokes _run("eval", "--roles-dir", str(roles), "--json", cwd=tmp_path) to ensure machine-readable output on failure; parse json.loads(out) and assert the payload contains the expected keys and values (e.g., payload["failed"] > 0, payload["passed"] == 0 and payload["results"][0]["lens"] == "<failing-lens-name>"), using the existing helpers _seed_eval_workspace and _run to locate where to add the test and mirror the success-case assertions for structure and fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/role-x.py`:
- Around line 1528-1530: When running with --json, human log lines must not go
to stdout; change the print calls that emit failure/verbose messages (the print
inside the if spec.get("_error") block referencing spec['_path'] and the prints
covering the 1551-1557 section) to write to stderr when JSON mode is enabled
(i.e., detect the JSON flag/variable used by the script and, when set, direct
those human-readable prints to sys.stderr or an equivalent logger that writes to
stderr) so only pure JSON remains on stdout.
- Around line 1472-1478: The code currently returns None for malformed eval
items and the caller filters those out (lines ~1496-1498), hiding broken
fixtures; change the behavior so malformed entries fail loudly: either have this
parsing block raise a ValueError (including the offending item and a short
message) instead of returning None, or keep returning None but update the caller
to detect any None results and raise an exception listing the malformed inputs;
reference the anonymous parsing logic that checks isinstance(item, dict) and
item.get("intent") and the filtering at lines 1496-1498 when making the change.
- Around line 1486-1491: The code assumes the YAML root is a mapping and calls
data.get("lens"), which will raise AttributeError for non-mapping roots; before
using data.get() (after yaml.safe_load), check that data is a dict (e.g.
isinstance(data, dict)) and if not, append an error entry to specs (similar
shape to the existing {"_path": path, "_error": ...}) and continue; ensure you
still derive lens only from mapping roots and keep the existing fallback of
path.name[: -len(".eval.yaml")] when appropriate.
---
Nitpick comments:
In `@tests/test_eval.py`:
- Around line 164-173: Extend the JSON-output test suite by adding a new test
(similar to test_eval_json_output) that seeds a workspace producing a failing
lens and invokes _run("eval", "--roles-dir", str(roles), "--json", cwd=tmp_path)
to ensure machine-readable output on failure; parse json.loads(out) and assert
the payload contains the expected keys and values (e.g., payload["failed"] > 0,
payload["passed"] == 0 and payload["results"][0]["lens"] ==
"<failing-lens-name>"), using the existing helpers _seed_eval_workspace and _run
to locate where to add the test and mirror the success-case assertions for
structure and fields.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6be866a4-bf05-453a-a648-97177d8ef68c
📒 Files selected for processing (4)
CHANGELOG.mdSKILL.mdscripts/role-x.pytests/test_eval.py
…ot type, clean --json P20 cross-review (role-x#7), all 3 findings — correctness holes in the correctness-testing tool itself: - malformed eval cases now FAIL loudly (counted) instead of silently dropped → no false-green from a broken fixture shrinking the assertion count. - guard non-mapping YAML root (list/scalar) → reported as fixture error, no AttributeError abort. - --json keeps stdout strictly machine-readable; human log lines → stderr. +3 tests (69 green; real-registry 19/19; --json parses on failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What & why
Skillify step 7 (resolver eval), built bstack-native — BRO-1411 slice 1.
Origin: `/checkit` on Garry Tan's "skillify" essay surfaced the gap (filed `research/entities/concept/skillify.md`): bstack tests its code (P11) and governs skill promotion (P16), but never tests its skills as artifacts — nothing asserts a lens trigger actually routes. A resolver trigger says "phrase X selects lens Y"; a resolver eval proves it does.
Changes
_score_lensmatchedprompt_keywordsonly via single-token set membership, so multi-word/punctuated triggers ("check this out","/checkit","let's research this","last 30 days") could never match. New_kw_matches()substring-matches punctuated/multi-word keywords against the raw lowercased prompt; clean single tokens keep the original word-boundary-safe semantics → no regression.role-x.py eval— runsroles/<lens>.eval.yaml(should_fire/should_not_fire, optional per-casetouched_files/branch) through_select_lenses, asserts the declared lens's selection. Exit 1 on any failure (CI-gateable).--json,--verbose,--lens,--active-only.include_statuseson_select_lenses(default("active",)) — eval passes("active","candidate")so a lens is routing-testable before promotion.Validation (P11)
tests/test_eval.py).checkit's"check this out https://…"now fires at score 1/1 — pre-fix it scored 0/1 and would have FAILED. The unit testtest_score_lens_phrase_via_prompt_lcpins the before/after.Consumers
The real
roles/*.eval.yamlfixtures + Makefile wiring land in the workspace PR (slice 1b). Slices 2–4 (skill-script test gate, E2E skill-smoke, per-skill LLM-eval) tracked in BRO-1411.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests