Skip to content

feat: remove full eval from PR builds, add daily scheduled cron eval#381

Open
hadarzaguri wants to merge 3 commits into
mainfrom
feat/reduce-eval-ci-costs
Open

feat: remove full eval from PR builds, add daily scheduled cron eval#381
hadarzaguri wants to merge 3 commits into
mainfrom
feat/reduce-eval-ci-costs

Conversation

@hadarzaguri

Copy link
Copy Markdown
Contributor

Summary

  • PR builds: removed the 30-min EvalForge eval poll entirely. PRs now run lightweight validation only — YAML parse, file path existence, and tag name check against EvalForge. Fast (~seconds), cheap, still catches structural errors before merge.
  • Daily cron (skill-eval-scheduled.yml): new workflow runs at 8am UTC every day. Finds yaml/wix-manage/**/documentation.yaml files changed in the last 25 hours, collects their tags, and runs the full eval against those tags. If nothing changed → skips. On failure → opens a GitHub Issue labeled skill-eval-regression. On recovery → auto-closes the issue.
  • Action refactored: mode input added (required: pr | scheduled). runPr and runScheduled are separate code paths. Removed unused formatEvalPassed/Failed/Timeout/NoScenarios. Unified scheduled formatters into single formatScheduledResult(status, sha, opts).

One-time setup (before first cron fires)

gh label create skill-eval-regression --color e11d48 --description "Skill eval regression detected by scheduled run"

Note

dist/index.js needs to be rebuilt locally before merging:

cd .github/actions/skill-eval && yarn install && yarn build

The skill-eval-ci.yml workflow will verify this on the PR.

Test plan

  • Open a PR touching yaml/wix-manage/ — confirm workflow completes in seconds, PR comment shows validation passed/failed (no EvalForge eval run)
  • Trigger skill-eval-scheduled via workflow_dispatch — confirm it runs the full eval and logs correctly
  • Confirm a regression creates a skill-eval-regression issue and a passing run closes it

🤖 Generated with Claude Code

hadarzaguri and others added 3 commits June 9, 2026 15:47
PRs now run lightweight validation only (YAML parse + tag name check
against EvalForge), cutting per-PR cost from a 30-min EvalForge poll
to a single cheap API call. A new daily cron workflow (8am UTC) runs
the full eval against tags from YAML files changed in the last 25 hours,
reports failures as GitHub Issues labeled skill-eval-regression, and
auto-closes them on recovery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Version arg order)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant