feat(eval): resolver-eval harness + fix BRO-1338 (BRO-1411 slice 1) by broomva · Pull Request #7 · broomva/role-x

broomva · 2026-06-05T17:28:46Z

What & why

Skillify step 7 (resolver eval), built bstack-native — BRO-1411 slice 1.

Origin: `/checkit` on Garry Tan's "skillify" essay surfaced the gap (filed `research/entities/concept/skillify.md`): bstack tests its code (P11) and governs skill promotion (P16), but never tests its skills as artifacts — nothing asserts a lens trigger actually routes. A resolver trigger says "phrase X selects lens Y"; a resolver eval proves it does.

Changes

- Fix BRO-1338: `_score_lens` matched `prompt_keywords` only by single-token set membership, so multi-word or punctuated triggers ("check this out", "/checkit", "let's research this", "last 30 days") could never match. The new `_kw_matches()` substring-matches punctuated and multi-word keywords against the raw lowercased prompt; clean single tokens keep the original word-boundary-safe set membership, so existing behavior does not regress.
- `role-x.py eval`: runs `roles/<lens>.eval.yaml` fixtures (`should_fire`/`should_not_fire`, with optional per-case `touched_files`/`branch`) through `_select_lenses` and asserts that the declared lens is selected, or not selected, as expected. Exits 1 on any failure, so it can gate CI. Flags: `--json`, `--verbose`, `--lens`, `--active-only`.
- `include_statuses` on `_select_lenses` (default `("active",)`): eval passes `("active", "candidate")` so a lens is routing-testable before promotion.
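
To make the BRO-1338 fix concrete, here is a minimal sketch of the two matching modes described above. It is not the code in `scripts/role-x.py`: the names `tokenize`, `kw_matches`, and `score`, and the regex used to decide what counts as a "clean" token, are assumptions made for illustration.

```python
import re

# Assumption: a "clean" keyword is a single token of word characters only.
_CLEAN_TOKEN = re.compile(r"[a-z0-9_]+")


def tokenize(prompt: str) -> set[str]:
    """Hypothetical tokenizer: the lowercase word tokens of the prompt."""
    return set(_CLEAN_TOKEN.findall(prompt.lower()))


def kw_matches(keyword: str, tokens: set[str], prompt_lc: str) -> bool:
    """Return True if `keyword` should count as present in the prompt."""
    kw = keyword.lower().strip()
    if not kw:
        return False
    if _CLEAN_TOKEN.fullmatch(kw):
        # Single clean token: set membership, so "art" never matches "start".
        return kw in tokens
    # Multi-word or punctuated keyword: substring match on the raw prompt.
    return kw in prompt_lc


def score(keywords: list[str], prompt: str) -> int:
    """Count matching keywords (a stand-in for the real lens scoring)."""
    prompt_lc = prompt.lower()
    tokens = tokenize(prompt)
    return sum(kw_matches(kw, tokens, prompt_lc) for kw in keywords)
```

With this sketch, `score(["check this out"], "Check this out https://example.com")` returns 1, while `score(["art"], "start the job")` returns 0 because the single-token path never falls back to substring matching.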

Validation (P11)

Full suite 66 green (53 prior + 13 new in tests/test_eval.py).
End-to-end against the real workspace registry: 19/19 eval cases pass. The proof that BRO-1338 is closed: checkit's "check this out https://…" now fires at score 1/1 — pre-fix it scored 0/1 and would have FAILED. The unit test test_score_lens_phrase_via_prompt_lc pins the before/after.

Consumers

The real roles/*.eval.yaml fixtures + Makefile wiring land in the workspace PR (slice 1b). Slices 2–4 (skill-script test gate, E2E skill-smoke, per-skill LLM-eval) tracked in BRO-1411.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a resolver-eval testing harness to validate lens routing configurations
- Extended lens selection to support testing both active and candidate lenses
Bug Fixes
- Fixed keyword scoring for multi-word and punctuated search terms
Documentation
- Updated documentation and changelog with new version 0.6.0 entries
Tests
- Added comprehensive unit and integration tests for the resolver-eval harness and keyword matching behavior

Skillify step 7 (resolver eval), built bstack-native (BRO-1411 slice 1).

Origin: /checkit on Garry Tan's skillify essay surfaced that bstack tests its code and governs skill promotion but never tests its skills as artifacts; nothing asserts a lens trigger actually routes.

- Fix BRO-1338: _kw_matches() substring-matches multi-word/punctuated keywords ('check this out', '/checkit', "let's research this") against the raw prompt; clean single tokens keep word-boundary set-membership (no regression).
- Add role-x.py eval: runs roles/<lens>.eval.yaml should_fire/should_not_fire fixtures through _select_lenses, asserts selection; exit 1 on fail (CI-gate).
- Add include_statuses to _select_lenses so candidate lenses are routing-testable before promotion.
- 13 tests (tests/test_eval.py). Full suite 66 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

linear · 2026-06-05T17:28:49Z

BRO-1338

BRO-1411

coderabbitai · 2026-06-05T17:28:58Z

Warning

Review limit reached

@broomva, we couldn't start this review because you've reached your PR review rate limit.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3dfc6066-cad0-44a9-829c-073ff4b18096

📥 Commits

Reviewing files that changed from the base of the PR and between d88c629 and 48ab34a.

📒 Files selected for processing (2)

scripts/role-x.py
tests/test_eval.py

📝 Walkthrough

This PR introduces a resolver-eval testing harness (BRO-1411) that validates lens routing logic against YAML-defined intent fixtures. It also fixes multi-word keyword phrase matching in lens scoring (BRO-1338) and extends lens selection to filter by status for pre-promotion evaluation.

Changes

Resolver-eval harness with keyword phrase matching and lens status filtering

Layer / File(s)	Summary
Keyword phrase matching for multi-word prompt keywords (BRO-1338) `scripts/role-x.py`, `tests/test_eval.py`	Adds `_kw_matches()` to handle both single-token set-membership and multi-word phrase matching. `_score_lens()` now accepts raw lowercased prompt to enable phrase matching for non-single-token keywords. Keyword scoring uses `_kw_matches()` instead of direct token lookup. Unit tests verify phrase matching, slash normalization, apostrophe handling, and false-negative prevention.
Lens selection with status-based filtering `scripts/role-x.py`, `tests/test_eval.py`	`_select_lenses()` gains `include_statuses` parameter to filter lenses by active/candidate status. Lowercased prompt is passed through to scoring to enable phrase matching during lens evaluation. Unit tests validate default candidate exclusion and conditional inclusion via parameter override.
Resolver-eval harness and command implementation (BRO-1411) `scripts/role-x.py`, `tests/test_eval.py`	Introduces `_normalize_eval_case()` and `_load_eval_specs()` utilities to load and parse `roles/*.eval.yaml` intent test fixtures. `cmd_eval()` executes routing assertions against loaded specs, compares resolved lenses to expected results, and outputs pass/fail with optional JSON structure and verbose diagnostics. Integration tests seed temporary eval fixtures and validate correct/false routing, JSON payload structure, and error handling.
CLI wiring for eval subcommand `scripts/role-x.py`	Argparse `build_parser()` adds `eval` subcommand with `--roles-dir`, `--lens` (optional filter), `--active-only` (exclude candidates), `--json`, and `--verbose` options.
Changelog and documentation updates `CHANGELOG.md`, `SKILL.md`	CHANGELOG.md documents v0.6.0 with resolver-eval harness, keyword phrase-matching fix (BRO-1338), status-filtering enhancement, and test additions. SKILL.md updated to list `roles/<name>.eval.yaml` fixtures and document the new `eval` CLI command.
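
The harness and status-filtering rows in the table above can be sketched as follows. This is an illustration only: `REGISTRY`, `select_lenses`, `run_eval`, and `exit_code` are hypothetical stand-ins, the fixture is shown as an already-parsed mapping rather than a YAML file, and the real `_select_lenses` scores lenses rather than doing the plain substring check used here.

```python
from typing import Callable

# Hypothetical in-memory registry: lens name -> status and trigger phrases.
REGISTRY = {
    "checkit": {"status": "candidate", "keywords": ["check this out", "/checkit"]},
    "research": {"status": "active", "keywords": ["let's research this"]},
}


def select_lenses(prompt: str, include_statuses=("active",)) -> list[str]:
    """Toy selector: a lens fires when any trigger phrase is in the prompt."""
    prompt_lc = prompt.lower()
    return [
        name
        for name, lens in REGISTRY.items()
        if lens["status"] in include_statuses
        and any(kw in prompt_lc for kw in lens["keywords"])
    ]


def run_eval(spec: dict, select: Callable[..., list[str]]) -> list[dict]:
    """Run one fixture and record whether routing matched the expectation."""
    results = []
    for key, expect_fire in (("should_fire", True), ("should_not_fire", False)):
        for case in spec.get(key, []):
            # Candidates are included so a lens can be tested before promotion.
            selected = select(case["intent"], include_statuses=("active", "candidate"))
            fired = spec["lens"] in selected
            results.append(
                {"lens": spec["lens"], "intent": case["intent"], "ok": fired == expect_fire}
            )
    return results


def exit_code(results: list[dict]) -> int:
    """Exit 1 on any failed case so the command can gate CI."""
    return 1 if any(not r["ok"] for r in results) else 0


# Assumed shape of a parsed roles/checkit.eval.yaml fixture.
SPEC = {
    "lens": "checkit",
    "should_fire": [{"intent": "check this out https://example.com"}],
    "should_not_fire": [{"intent": "deploy the service"}],
}
```

Under the default `include_statuses=("active",)`, the candidate lens `checkit` is never selected, which is why the eval path widens the filter before asserting.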

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

broomva/life#1632: Implements the resolver-eval harness (step 7) including the eval subcommand, fixture loading, and routing assertion logic directly described in the issue.
broomva/life#1550: Implements the phrase/substring matching fix for multi-word/punctuated keywords via _kw_matches() and raw lowercased prompt passing, addressing the exact scoring correction requested.

Possibly related PRs

broomva/role-x#1: Both PRs modify scripts/role-x.py lens-resolution pipeline and CLI wiring (build_parser), so the main PR's updated _score_lens/_select_lenses logic and status filtering interact with the intake subcommand behavior.
broomva/role-x#2: Both PRs update _score_lens() prompt-keyword matching and _select_lenses() call semantics; the main PR adds phrase matching and status filtering while the retrieved PR implements per-lens thresholds and signal weights at the same scoring/selection checkpoint.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main changes: adding a resolver-eval harness and fixing a specific bug (BRO-1338), with reference to the parent issue (BRO-1411).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/test_eval.py (1)

164-173: ⚡ Quick win

Add a failing-case --json test to lock the output contract.

This suite checks JSON only on pass. Add one failure-path JSON parse/assertion so machine-readable output is enforced when eval fails too.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval.py` around lines 164 - 173, Extend the JSON-output test suite
by adding a new test (similar to test_eval_json_output) that seeds a workspace
producing a failing lens and invokes _run("eval", "--roles-dir", str(roles),
"--json", cwd=tmp_path) to ensure machine-readable output on failure; parse
json.loads(out) and assert the payload contains the expected keys and values
(e.g., payload["failed"] > 0, payload["passed"] == 0 and
payload["results"][0]["lens"] == "<failing-lens-name>"), using the existing
helpers _seed_eval_workspace and _run to locate where to add the test and mirror
the success-case assertions for structure and fields.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/role-x.py`:
- Around line 1528-1530: When running with --json, human log lines must not go
to stdout; change the print calls that emit failure/verbose messages (the print
inside the if spec.get("_error") block referencing spec['_path'] and the prints
covering the 1551-1557 section) to write to stderr when JSON mode is enabled
(i.e., detect the JSON flag/variable used by the script and, when set, direct
those human-readable prints to sys.stderr or an equivalent logger that writes to
stderr) so only pure JSON remains on stdout.
- Around line 1472-1478: The code currently returns None for malformed eval
items and the caller filters those out (lines ~1496-1498), hiding broken
fixtures; change the behavior so malformed entries fail loudly: either have this
parsing block raise a ValueError (including the offending item and a short
message) instead of returning None, or keep returning None but update the caller
to detect any None results and raise an exception listing the malformed inputs;
reference the anonymous parsing logic that checks isinstance(item, dict) and
item.get("intent") and the filtering at lines 1496-1498 when making the change.
- Around line 1486-1491: The code assumes the YAML root is a mapping and calls
data.get("lens"), which will raise AttributeError for non-mapping roots; before
using data.get() (after yaml.safe_load), check that data is a dict (e.g.
isinstance(data, dict)) and if not, append an error entry to specs (similar
shape to the existing {"_path": path, "_error": ...}) and continue; ensure you
still derive lens only from mapping roots and keep the existing fallback of
path.name[: -len(".eval.yaml")] when appropriate.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6be866a4-bf05-453a-a648-97177d8ef68c

📥 Commits

Reviewing files that changed from the base of the PR and between d8455a4 and d88c629.

📒 Files selected for processing (4)

CHANGELOG.md
SKILL.md
scripts/role-x.py
tests/test_eval.py

…ot type, clean --json

P20 cross-review (role-x#7), all 3 findings. These were correctness holes in the correctness-testing tool itself:

- Malformed eval cases now FAIL loudly (counted) instead of being silently dropped, so a broken fixture can no longer produce a false green by shrinking the assertion count.
- Guard against a non-mapping YAML root (list/scalar): it is reported as a fixture error instead of aborting with AttributeError.
- --json keeps stdout strictly machine-readable; human log lines go to stderr.

+3 tests (69 green; real-registry 19/19; --json parses on failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
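
The three hardening changes in this commit can be sketched together. The sketch assumes the YAML has already been parsed (so it needs no YAML library), and `load_spec` and `evaluate` are hypothetical names, not the functions in `scripts/role-x.py`. It only checks that cases are well formed; the routing assertion itself is elided. The payload keys `passed`, `failed`, and `results` follow the names used in the review comment above.

```python
import json
import sys

SUFFIX = ".eval.yaml"


def load_spec(path: str, data) -> dict:
    """Validate parsed fixture data; a non-mapping root is a fixture error."""
    if not isinstance(data, dict):
        kind = type(data).__name__
        return {"_path": path, "_error": f"root must be a mapping, got {kind}"}
    name = path.rsplit("/", 1)[-1]
    fallback = name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name
    return {
        "_path": path,
        "lens": data.get("lens") or fallback,
        "cases": data.get("should_fire") or [],
    }


def evaluate(specs: list[dict], as_json: bool) -> int:
    """Count malformed cases as failures and keep stdout clean in JSON mode."""
    # Human-readable lines go to stderr when JSON is requested.
    log = sys.stderr if as_json else sys.stdout
    passed = failed = 0
    results = []
    for spec in specs:
        if spec.get("_error"):
            failed += 1
            print(f"FAIL {spec['_path']}: {spec['_error']}", file=log)
            results.append({"lens": None, "ok": False, "error": spec["_error"]})
            continue
        for item in spec["cases"]:
            # Assumption: a well-formed case is a mapping with an 'intent'.
            ok = isinstance(item, dict) and bool(item.get("intent"))
            if ok:
                passed += 1  # routing check elided in this sketch
            else:
                failed += 1
                print(f"FAIL {spec['_path']}: malformed case {item!r}", file=log)
            results.append({"lens": spec["lens"], "ok": ok})
    if as_json:
        print(json.dumps({"passed": passed, "failed": failed, "results": results}))
    return 1 if failed else 0
```

Because malformed cases are counted rather than filtered, a fixture that loses assertions through a typo shows up as a failure, and a consumer can still parse stdout with `json.loads` when the run fails.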

coderabbitai Bot reviewed Jun 5, 2026


broomva merged commit ccc4fe7 into main Jun 5, 2026
3 checks passed

broomva deleted the feature/bro-1411-resolver-eval branch June 5, 2026 19:34

broomva merged 2 commits into main from feature/bro-1411-resolver-eval
