fix(ci): rule-3 accepts competitors/*.md as competitive-benchmark doc surface #1052

Merged
fyodoriv merged 1 commit into main from fix/rule-3-accept-competitor-md
Jun 2, 2026
@fyodoriv fyodoriv commented Jun 2, 2026

Why this is needed

Every corpus-refresh / competitor task fails the rule-3-doc-first CI gate, and because ci aggregates rule-3, that failure blocks ci. Root cause: those tasks update both the reading in novel/competitive-benchmark/src/competitors.ts (code) and the narrative + provenance in competitors/<name>.md (doc), but rule-3 recognized only user-stories/*.md and the package README.md as doc surfaces, not competitors/*.md. The gate is skipped locally (there is no pr-body), so worker --stage=full passed while CI failed. This silently blocked #1047 (corpus-refresh-openhands) and #1050 (corpus-refresh-swe-agent).

What changed

  • competitors/<name>.md now satisfies the doc clause for the novel/competitive-benchmark package (those files ARE the human-facing corpus docs).
  • Scoped to that package only — a competitors/*.md touch does NOT excuse code changes in other packages.
  • Extracted a packageDocError helper to keep checkRule3DocFirst under the complexity gate.
  • 3 paired tests: corpus-refresh passes, scoping holds (budget-guard still fails), no-doc still fails with a hint that names competitors/*.md.
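For readers who want the scoping rule in concrete form, here is a minimal sketch of the behaviour described above. It is not the code from this PR: the function names `hasDocFor` and `sketchPackageDocError`, the `src/` layout check, the `novel/budget-guard` path, and the assumption that user-stories/*.md counts for any package are all hypothetical, chosen only to mirror the three paired tests.

```typescript
// Hypothetical sketch of the scoping rule; names and path layout are assumptions.
const BENCH_PKG = "novel/competitive-benchmark";

// True when the diff contains a doc surface that counts for `pkg`.
function hasDocFor(pkg: string, changed: string[]): boolean {
  return changed.some(
    (f) =>
      f === `${pkg}/README.md` ||
      /^user-stories\/[^/]+\.md$/.test(f) ||
      // competitors/*.md counts only for the competitive-benchmark package.
      (pkg === BENCH_PKG && /^competitors\/[^/]+\.md$/.test(f)),
  );
}

// Returns an error message when `pkg` has code changes but no doc change, else null.
function sketchPackageDocError(pkg: string, changed: string[]): string | null {
  const touchesCode = changed.some((f) => f.startsWith(`${pkg}/src/`));
  if (!touchesCode || hasDocFor(pkg, changed)) return null;
  const hint = pkg === BENCH_PKG ? ", or competitors/*.md" : "";
  return `${pkg}: code changed without a doc update (user-stories/*.md, ${pkg}/README.md${hint})`;
}

// Corpus refresh: code + competitors/<name>.md in the same diff passes.
const refresh = sketchPackageDocError(BENCH_PKG, [
  "novel/competitive-benchmark/src/competitors.ts",
  "competitors/openhands.md",
]);

// Scoping: a competitors/*.md touch does not excuse another package's code change.
const otherPkg = sketchPackageDocError("novel/budget-guard", [
  "novel/budget-guard/src/index.ts",
  "competitors/openhands.md",
]);

// No doc at all: fails, and the hint names competitors/*.md.
const noDoc = sketchPackageDocError(BENCH_PKG, [
  "novel/competitive-benchmark/src/competitors.ts",
]);
```

Keeping the per-package decision in one small function is roughly what extracting a helper buys: the caller only loops over packages and collects non-null messages.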

Verification

  • pnpm exec vitest run scripts/check-rule-3-doc-first.test.mjs → 22 passed.
  • biome check --error-on-warnings on both files → exit 0.
  • Simulated corpus-refresh diff (competitors.ts + competitors/openhands.md) → rule-3 PASS.

Hypothesis self-grade

Vision trace

Security & privacy

No new attack surface. Pure-function change to a lint over the PR diff; reads no new files, writes nothing, binds no ports. vision.md § 13 minimum-bar reviewed.


🤖 Written by an agent, not Fyodor. Ping me if this looks off.

… surface

corpus-refresh + competitor tasks update both the reading in
novel/competitive-benchmark/src/competitors.ts (code) AND the narrative in
competitors/<name>.md (doc), but rule-3-doc-first only recognized
user-stories/*.md and the package README — so every corpus refresh failed the
gate in CI (it's skipped locally without a pr-body, so worker --stage=full
passed). Teach rule-3 that competitors/*.md IS the doc surface for the
competitive-benchmark package. Extracted packageDocError helper to keep
complexity ≤10; added 3 paired tests (corpus-refresh passes, scoping holds,
no-doc still fails). Unblocks #1047/#1050 and the whole corpus-refresh class.
@fyodoriv fyodoriv merged commit 3dae1a1 into main Jun 2, 2026
93 checks passed
@fyodoriv fyodoriv deleted the fix/rule-3-accept-competitor-md branch June 2, 2026 15:33
fyodoriv added a commit that referenced this pull request Jun 2, 2026
…-swe-agent (#1050)

* chore: refresh swe-agent SWE-bench Verified reading to mini-swe-agent 0.74 (corpus-refresh-swe-agent)

Replace the stale 2024 NeurIPS SWE-agent + GPT-4 reading (0.125, full-split
proxy, asOf 2024-10-01, 604 days very-stale) with the SWE-agent project's
current flagship scaffold mini-swe-agent: Gemini 3 Pro at 0.74 on the
SWE-bench Verified 500-instance split, submitted 2026-02-26 to the official
swebench.com "Bash Only" leaderboard (primary statement at mini-swe-agent.com).

This is a true Verified-split number, so the prior full-split/Lite proxy
caveat is dropped. asOf 2026-02-26 is 96 days old -> freshness status moves
very-stale -> stale (3 days outside the 90-day "fresh" bucket). Per rule #9
and the task Pivot, the most authoritative primary-source date for the
project's own scaffold is used rather than a fabricated fresher date.

- competitors.ts: refreshed citation, asOf, value for the swe-agent entry
- competitors/swe-agent.md: Scorecard readings table + superseded-reading
  history note + Last reviewed date
- competitors/scorecard.md: updated the two swe-agent rows in the (now
  hand-maintained) static snapshot to match

Hypothesis self-grade:
Predicted: refreshing swe-agent to a publication <=90 days old returns "fresh".
Observed: freshest cleanly-attributable primary source is 2026-02-26 (96d) -> "stale", not "fresh"; value 0.125 -> 0.74.
Match: partial (very-stale -> stale, not fresh; the 90-day bar is missed by 3 days; no honest <=90d project-scaffold Verified source exists).
Lesson: the SWE-agent project's last dated Verified-split scaffold submission is 2026-02-26; a strict 90-day fresh bar can be unmeetable without fabricating a date, which the Pivot forbids.

* chore(ci): re-trigger rule-3 against fixed main (#1052) for corpus-refresh-swe-agent
fyodoriv added a commit that referenced this pull request Jun 2, 2026
…-openhands (#1047)

* chore: refresh openhands corpus reading to 0.728 SWE-bench Verified (corpus-refresh-openhands)

Refresh the openhands competitor reading from the 406-day-stale 2025-04-15
65.8% (0.658) inference-time-scaling number to the current first-party
72.8% (0.728), citing the OpenHands Software Agent SDK paper
(arXiv:2511.03690v2, 2026-04-22, Table 4 section 5.4 - Claude Sonnet 4.5 +
extended thinking on the V1 SDK). The v2 revision (41 days old) flips the
corpus-freshness bucket from "very-stale" to "fresh".

Hypothesis self-grade:
Predicted: check-corpus-freshness returns "fresh" (<=90d) for openhands.
Observed: status="fresh", ageDays=41 (asOf 2026-04-22).
Match: yes
Lesson: the vendor's newest exact-number publication (SDK paper v2) supersedes
the Apr-2025 reading; refreshing to the real higher number is honest, not
masking - Pivot's "stale-by-vendor" clause does not apply when the vendor
actively publishes.

* chore(ci): re-trigger rule-3 against fixed main (#1052) for corpus-refresh-openhands
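
The freshness arithmetic quoted in the two commit messages above can be checked with a short sketch. Only the 90-day "fresh" bar comes from the messages; the `ageDays` and `freshness` helpers, the 365-day boundary between "stale" and "very-stale", and the choice of 2026-06-02 (the merge date) as "today" are assumptions made for illustration, not the behaviour of check-corpus-freshness.

```typescript
type Freshness = "fresh" | "stale" | "very-stale";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two ISO dates (YYYY-MM-DD), computed in UTC to avoid DST drift.
function ageDays(asOf: string, today: string): number {
  const from = Date.parse(`${asOf}T00:00:00Z`);
  const to = Date.parse(`${today}T00:00:00Z`);
  return Math.round((to - from) / MS_PER_DAY);
}

// <=90 days is "fresh" (from the commit messages).
// The 365-day "very-stale" boundary is an ASSUMED placeholder.
function freshness(days: number, veryStaleAfter = 365): Freshness {
  if (days <= 90) return "fresh";
  return days <= veryStaleAfter ? "stale" : "very-stale";
}

const TODAY = "2026-06-02"; // assumed evaluation date (the merge date)

const openhandsAge = ageDays("2026-04-22", TODAY); // 41
const sweAgentAge = ageDays("2026-02-26", TODAY); // 96

const openhandsStatus = freshness(openhandsAge); // "fresh"
const sweAgentStatus = freshness(sweAgentAge); // "stale"
```

Under these assumptions the openhands reading lands in "fresh" and the swe-agent reading in "stale", matching the observed statuses in the self-grades.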