feat(eval): resolver-eval harness + fix BRO-1338 (BRO-1411 slice 1) #7

Merged
broomva merged 2 commits into main from feature/bro-1411-resolver-eval
Jun 5, 2026

broomva (Owner) commented Jun 5, 2026

What & why

Skillify step 7 (resolver eval), built bstack-native — BRO-1411 slice 1.

Origin: `/checkit` on Garry Tan's "skillify" essay surfaced the gap (filed `research/entities/concept/skillify.md`): bstack tests its code (P11) and governs skill promotion (P16), but never tests its skills as artifacts — nothing asserts a lens trigger actually routes. A resolver trigger says "phrase X selects lens Y"; a resolver eval proves it does.

Changes

  • Fix BRO-1338: _score_lens matched prompt_keywords only via single-token set membership, so multi-word or punctuated triggers ("check this out", "/checkit", "let's research this", "last 30 days") could never match. The new _kw_matches() substring-matches punctuated or multi-word keywords against the raw lowercased prompt; clean single tokens keep the original word-boundary-safe semantics, so there is no regression.
  • role-x.py eval: runs roles/<lens>.eval.yaml fixtures (should_fire/should_not_fire, with optional per-case touched_files/branch) through _select_lenses and asserts whether the declared lens is selected. Exits 1 on any failure (CI-gateable). Flags: --json, --verbose, --lens, --active-only.
  • include_statuses on _select_lenses (default ("active",)) — eval passes ("active","candidate") so a lens is routing-testable before promotion.
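
To make the BRO-1338 fix concrete, here is a minimal sketch of the matching rule described above. It is not the code in `scripts/role-x.py`: the name `kw_matches_sketch`, the `tokenize` helper, and the regex that defines a "clean single token" are all assumptions made for illustration.

```python
import re

# Assumption: a "clean single token" is one run of letters, digits, or underscores.
_CLEAN_TOKEN = re.compile(r"[a-z0-9_]+")


def tokenize(prompt: str) -> set[str]:
    """Split a prompt into lowercase word tokens (hypothetical tokenizer)."""
    return set(_CLEAN_TOKEN.findall(prompt.lower()))


def kw_matches_sketch(keyword: str, tokens: set[str], prompt_lc: str) -> bool:
    """Return True if the keyword should count as a hit for this prompt."""
    kw = keyword.lower().strip()
    if not kw:
        return False
    if _CLEAN_TOKEN.fullmatch(kw):
        # Clean single token: set membership, so "heck" never matches "check".
        return kw in tokens
    # Multi-word or punctuated keyword: substring match on the raw lowercased prompt.
    return kw in prompt_lc


prompt = "Check this out https://example.com"
prompt_lc = prompt.lower()
tokens = tokenize(prompt)

print(kw_matches_sketch("check this out", tokens, prompt_lc))
print(kw_matches_sketch("heck", tokens, prompt_lc))
```

With set membership alone, a phrase such as "check this out" can never equal a single token, which is why such triggers could not score before the fix.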

Validation (P11)

  • Full suite 66 green (53 prior + 13 new in tests/test_eval.py).
  • End-to-end against the real workspace registry: 19/19 eval cases pass. The proof that BRO-1338 is closed: checkit's "check this out https://…" now fires at score 1/1 — pre-fix it scored 0/1 and would have FAILED. The unit test test_score_lens_phrase_via_prompt_lc pins the before/after.

Consumers

The real roles/*.eval.yaml fixtures + Makefile wiring land in the workspace PR (slice 1b). Slices 2–4 (skill-script test gate, E2E skill-smoke, per-skill LLM-eval) tracked in BRO-1411.
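
As a rough illustration of what a resolver eval does with such a fixture, the sketch below runs should_fire/should_not_fire cases through a selector and derives a CI exit code. The fixture layout (a `lens` key plus case dicts with an `intent` key), the `toy_select` stand-in, and `run_eval` are assumptions for illustration; the real harness routes through `_select_lenses` and may differ in every detail.

```python
from typing import Callable

# Hypothetical fixture, shaped like the parsed contents of roles/<lens>.eval.yaml.
SPEC = {
    "lens": "checkit",
    "should_fire": [
        {"intent": "check this out https://example.com"},
        {"intent": "/checkit this essay"},
    ],
    "should_not_fire": [{"intent": "rename this variable"}],
}


def toy_select(prompt: str) -> list[str]:
    """Stand-in for the real selector: fires 'checkit' on two trigger phrases."""
    p = prompt.lower()
    return ["checkit"] if ("check this out" in p or "/checkit" in p) else []


def run_eval(spec: dict, select: Callable[[str], list[str]]) -> tuple[list[dict], int]:
    """Run every case and return (results, exit_code); exit_code is 1 on any failure."""
    lens = spec["lens"]
    results = []
    for expected, key in ((True, "should_fire"), (False, "should_not_fire")):
        for case in spec.get(key, []):
            fired = lens in select(case["intent"])
            results.append({
                "lens": lens,
                "intent": case["intent"],
                "expected": expected,
                "fired": fired,
                "ok": fired == expected,
            })
    exit_code = 0 if all(r["ok"] for r in results) else 1
    return results, exit_code


results, exit_code = run_eval(SPEC, toy_select)
print(f"{sum(r['ok'] for r in results)}/{len(results)} cases passed, exit {exit_code}")
```

Returning the exit code rather than calling `sys.exit` inside the loop keeps the sketch easy to test; a CLI wrapper would pass the value on to the shell.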

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a resolver-eval testing harness to validate lens routing configurations
    • Extended lens selection to support testing both active and candidate lenses
  • Bug Fixes

    • Fixed keyword scoring for multi-word and punctuated search terms
  • Documentation

    • Updated documentation and changelog with new version 0.6.0 entries
  • Tests

    • Added comprehensive unit and integration tests for the resolver-eval harness and keyword matching behavior

Skillify step 7 (resolver eval), built bstack-native — BRO-1411 slice 1.
Origin: /checkit on Garry Tan's skillify essay surfaced that bstack tests its
code + governs skill promotion but never tests its skills as artifacts; nothing
asserts a lens trigger actually routes.

- Fix BRO-1338: _kw_matches() substring-matches multi-word/punctuated keywords
  ('check this out', '/checkit', "let's research this") against the raw prompt;
  clean single tokens keep word-boundary set-membership (no regression).
- Add role-x.py eval: runs roles/<lens>.eval.yaml should_fire/should_not_fire
  fixtures through _select_lenses, asserts selection; exit 1 on fail (CI-gate).
- Add include_statuses to _select_lenses so candidate lenses are routing-testable
  before promotion.
- 13 tests (tests/test_eval.py). Full suite 66 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

linear Bot commented Jun 5, 2026

BRO-1338

BRO-1411


coderabbitai Bot commented Jun 5, 2026

Warning

Review limit reached

@broomva, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 50 minutes and 29 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3dfc6066-cad0-44a9-829c-073ff4b18096

📥 Commits

Reviewing files that changed from the base of the PR and between d88c629 and 48ab34a.

📒 Files selected for processing (2)
  • scripts/role-x.py
  • tests/test_eval.py
Walkthrough

This PR introduces a resolver-eval testing harness (BRO-1411) that validates lens routing logic against YAML-defined intent fixtures. It also fixes multi-word keyword phrase matching in lens scoring (BRO-1338) and extends lens selection to filter by status for pre-promotion evaluation.

Changes

Resolver-eval harness with keyword phrase matching and lens status filtering

| Layer | File(s) | Summary |
|---|---|---|
| Keyword phrase matching for multi-word prompt keywords (BRO-1338) | scripts/role-x.py, tests/test_eval.py | Adds _kw_matches() to handle both single-token set-membership and multi-word phrase matching. _score_lens() now accepts raw lowercased prompt to enable phrase matching for non-single-token keywords. Keyword scoring uses _kw_matches() instead of direct token lookup. Unit tests verify phrase matching, slash normalization, apostrophe handling, and false-negative prevention. |
| Lens selection with status-based filtering | scripts/role-x.py, tests/test_eval.py | _select_lenses() gains include_statuses parameter to filter lenses by active/candidate status. Lowercased prompt is passed through to scoring to enable phrase matching during lens evaluation. Unit tests validate default candidate exclusion and conditional inclusion via parameter override. |
| Resolver-eval harness and command implementation (BRO-1411) | scripts/role-x.py, tests/test_eval.py | Introduces _normalize_eval_case() and _load_eval_specs() utilities to load and parse roles/*.eval.yaml intent test fixtures. cmd_eval() executes routing assertions against loaded specs, compares resolved lenses to expected results, and outputs pass/fail with optional JSON structure and verbose diagnostics. Integration tests seed temporary eval fixtures and validate correct/false routing, JSON payload structure, and error handling. |
| CLI wiring for eval subcommand | scripts/role-x.py | Argparse build_parser() adds eval subcommand with --roles-dir, --lens (optional filter), --active-only (exclude candidates), --json, and --verbose options. |
| Changelog and documentation updates | CHANGELOG.md, SKILL.md | CHANGELOG.md documents v0.6.0 with resolver-eval harness, keyword phrase-matching fix (BRO-1338), status-filtering enhancement, and test additions. SKILL.md updated to list roles/<name>.eval.yaml fixtures and document the new eval CLI command. |
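
The status filter summarized in the walkthrough can be pictured with a short sketch. Only the `include_statuses` parameter name and its `("active",)` default come from the PR description; the `Lens` dataclass, the `filter_by_status` helper, and the lens names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lens:
    name: str
    status: str  # e.g. "active" or "candidate"


def filter_by_status(lenses, include_statuses=("active",)):
    """Keep only lenses whose status is in include_statuses."""
    allowed = set(include_statuses)
    return [lens for lens in lenses if lens.status in allowed]


# Hypothetical registry: one lens awaiting promotion, one already promoted.
REGISTRY = [Lens("draft-lens", "candidate"), Lens("stable-lens", "active")]

# Normal routing: candidates stay invisible.
default_names = [lens.name for lens in filter_by_status(REGISTRY)]

# Eval mode: candidates are included so their triggers can be tested pre-promotion.
eval_names = [lens.name for lens in filter_by_status(REGISTRY, ("active", "candidate"))]

print(default_names, eval_names)
```

Keeping `("active",)` as the default means existing callers see no behavior change, while the eval path opts in to candidates explicitly.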

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • broomva/life#1632: Implements the resolver-eval harness (step 7) including the eval subcommand, fixture loading, and routing assertion logic directly described in the issue.
  • broomva/life#1550: Implements the phrase/substring matching fix for multi-word/punctuated keywords via _kw_matches() and raw lowercased prompt passing, addressing the exact scoring correction requested.

Possibly related PRs

  • broomva/role-x#1: Both PRs modify scripts/role-x.py lens-resolution pipeline and CLI wiring (build_parser), so the main PR's updated _score_lens/_select_lenses logic and status filtering interact with the intake subcommand behavior.
  • broomva/role-x#2: Both PRs update _score_lens() prompt-keyword matching and _select_lenses() call semantics; the main PR adds phrase matching and status filtering while the retrieved PR implements per-lens thresholds and signal weights at the same scoring/selection checkpoint.

Poem

🐰 A harness for testing, keywords now sing,
Multi-word phrases in prompts take wing,
Lenses can filter by status so bright,
Eval fixtures guide routing just right!
🎯✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changes: adding a resolver-eval harness and fixing a specific bug (BRO-1338), with reference to the parent issue (BRO-1411). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/test_eval.py (1)

164-173: ⚡ Quick win

Add a failing-case --json test to lock the output contract.

This suite checks JSON only on pass. Add one failure-path JSON parse/assertion so machine-readable output is enforced when eval fails too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval.py` around lines 164 - 173, Extend the JSON-output test suite
by adding a new test (similar to test_eval_json_output) that seeds a workspace
producing a failing lens and invokes _run("eval", "--roles-dir", str(roles),
"--json", cwd=tmp_path) to ensure machine-readable output on failure; parse
json.loads(out) and assert the payload contains the expected keys and values
(e.g., payload["failed"] > 0, payload["passed"] == 0 and
payload["results"][0]["lens"] == "<failing-lens-name>"), using the existing
helpers _seed_eval_workspace and _run to locate where to add the test and mirror
the success-case assertions for structure and fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/role-x.py`:
- Around line 1528-1530: When running with --json, human log lines must not go
to stdout; change the print calls that emit failure/verbose messages (the print
inside the if spec.get("_error") block referencing spec['_path'] and the prints
covering the 1551-1557 section) to write to stderr when JSON mode is enabled
(i.e., detect the JSON flag/variable used by the script and, when set, direct
those human-readable prints to sys.stderr or an equivalent logger that writes to
stderr) so only pure JSON remains on stdout.
- Around line 1472-1478: The code currently returns None for malformed eval
items and the caller filters those out (lines ~1496-1498), hiding broken
fixtures; change the behavior so malformed entries fail loudly: either have this
parsing block raise a ValueError (including the offending item and a short
message) instead of returning None, or keep returning None but update the caller
to detect any None results and raise an exception listing the malformed inputs;
reference the anonymous parsing logic that checks isinstance(item, dict) and
item.get("intent") and the filtering at lines 1496-1498 when making the change.
- Around line 1486-1491: The code assumes the YAML root is a mapping and calls
data.get("lens"), which will raise AttributeError for non-mapping roots; before
using data.get() (after yaml.safe_load), check that data is a dict (e.g.
isinstance(data, dict)) and if not, append an error entry to specs (similar
shape to the existing {"_path": path, "_error": ...}) and continue; ensure you
still derive lens only from mapping roots and keep the existing fallback of
path.name[: -len(".eval.yaml")] when appropriate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6be866a4-bf05-453a-a648-97177d8ef68c

📥 Commits

Reviewing files that changed from the base of the PR and between d8455a4 and d88c629.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • SKILL.md
  • scripts/role-x.py
  • tests/test_eval.py

Comment thread scripts/role-x.py
Comment thread scripts/role-x.py Outdated
Comment thread scripts/role-x.py
…ot type, clean --json

P20 cross-review (role-x#7), all 3 findings — correctness holes in the
correctness-testing tool itself:
- malformed eval cases now FAIL loudly (counted) instead of silently dropped
  → no false-green from a broken fixture shrinking the assertion count.
- guard non-mapping YAML root (list/scalar) → reported as fixture error, no
  AttributeError abort.
- --json keeps stdout strictly machine-readable; human log lines → stderr.
+3 tests (69 green; real-registry 19/19; --json parses on failure).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
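
The three hardening fixes in this commit can be sketched together as follows. This is an illustration under stated assumptions, not the merged code: `load_spec` and `report` are hypothetical names, the fixture data is passed in already parsed (as `yaml.safe_load` would return it), only dict cases with an `intent` key are treated as well-formed, and the JSON payload is reduced to a single `failed` count. Routing itself is omitted; only fixture validation and output channels are shown.

```python
import io
import json
import sys


def load_spec(path: str, data: object) -> dict:
    """Validate parsed fixture data; never raise on a bad root or a bad case."""
    if not isinstance(data, dict):
        # A list or scalar root is reported as a fixture error, not an AttributeError.
        return {"_path": path, "_error": f"root must be a mapping, got {type(data).__name__}"}
    cases = []
    for key in ("should_fire", "should_not_fire"):
        for item in data.get(key) or []:
            ok = isinstance(item, dict) and bool(item.get("intent"))
            # Malformed cases are kept so they are counted as failures, not dropped.
            cases.append({"raw": item, "expect": key, "malformed": not ok})
    return {"_path": path, "lens": data.get("lens"), "cases": cases}


def report(specs, as_json, out=None, err=None) -> int:
    """Emit results; with as_json, stdout carries only the JSON payload."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    log = err if as_json else out  # human-readable lines move to stderr in JSON mode
    failed = 0
    for spec in specs:
        if spec.get("_error"):
            failed += 1
            print(f"FAIL {spec['_path']}: {spec['_error']}", file=log)
            continue
        for case in spec["cases"]:
            if case["malformed"]:
                failed += 1
                print(f"FAIL {spec['_path']}: malformed case {case['raw']!r}", file=log)
    if as_json:
        json.dump({"failed": failed}, out)
    return 1 if failed else 0


specs = [
    load_spec("roles/a.eval.yaml", ["not", "a", "mapping"]),
    load_spec("roles/b.eval.yaml", {
        "lens": "b",
        "should_fire": [{"intent": "check this out"}, {"branch": "main"}],
    }),
]
out, err = io.StringIO(), io.StringIO()
exit_code = report(specs, as_json=True, out=out, err=err)
```

Because the malformed case and the bad root both increment `failed`, a broken fixture cannot shrink the assertion count into a false green, and `out` still parses as JSON on failure.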
@broomva broomva merged commit ccc4fe7 into main Jun 5, 2026
3 checks passed
@broomva broomva deleted the feature/bro-1411-resolver-eval branch June 5, 2026 19:34