feat: eval and optimization suite for skills (in-progress) #6
Closed
jdehorty wants to merge 5 commits into
…n branch
- rsync of tuning/ excluding traces/, demo HTML with baked tokens, and lean4 tooling
- updated .gitignore with surgical excludes (traces, demos, internal plans)
- scripts/check-secrets.sh for two-mode gitleaks scan + project regexes
- coverage-report.json paths scrubbed to relative form (no /Users/... leaks)
6 tasks
…o/viewer/

A single-page CesiumJS viewer designed to be driven by browser automation (Chrome DevTools MCP, already wired into .mcp.json, or Playwright MCP) from a Claude Code session. Loads once and stays loaded; drivers inject code, screenshot, and read state via evaluate() without reloading.

Distinct from the existing demo/index.html showcase: that page is for humans to click around in. This page is for agents to drive.

Files:
- demo/viewer/index.html: CDN-loaded Cesium 1.139, error and warning capture on window.__evalErrors and window.__evalWarnings, Ion token read from ?ionToken= URL param, helpers __viewerState(), __resetViewer(), and __viewer convention slot. Status badge in the upper-left for at-a-glance verification.
- demo/viewer/README.md: usage walkthrough for driving the viewer via chrome-devtools MCP, table of exposed window globals, rationale for the persistent-page pattern vs per-eval HTML materialization.

Smoke-tested end to end before commit on a prior branch: page loads, __viewerReady is true, viewer creates from injected code, __resetViewer clears state, error capture correctly recorded the "no Ion token" warning.

The viewer has no opinion about what code gets injected. It just provides a clean Cesium environment, error capture, and reset machinery, so skill authors can iterate locally without rebuilding HTML files between attempts.
Replace the gitleaks --no-git working-tree scan with a `git grep`-based scan over tracked content for the project-specific regex patterns.

Why: --no-git mode reads every file on disk, including gitignored local development artifacts. On the main worktree (where the user keeps their local tuning/<skill>/iterations/<n>/traces/ content for ongoing eval runs), those gitignored files contain the old (rotated) Cesium Ion token, which trips the scanner with false positives and blocks legitimate pushes that don't include any of that content.

A push only ever sends tracked commits. Scanning history (Mode 1, unchanged) plus tracked content (Mode 2, new) covers exactly that. The scanner now runs cleanly from the canonical eval worktree, the main worktree with local dev content present, or a fresh checkout, without needing to dance around the working-tree state.

Mode 1 still catches anything ever committed across all refs. Mode 2 still catches the project-specific patterns (hardcoded Cesium Ion token literals, JWT-shaped access_token query params, /Users/ path leaks, @bentley.com email leaks). The pathspec excludes the scanner script itself since it carries the patterns as string literals.
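The Mode 2 idea (scan only tracked content for project-specific patterns) can be sketched in Python as follows. This is an illustration, not the contents of scripts/check-secrets.sh: the regexes, the pattern names, and the helpers `tracked_files` and `scan_text` are all assumptions made for this example, and the real script uses `git grep` rather than reading files in Python.

```python
import re
import subprocess

# Hypothetical approximations of the project-specific patterns named in
# the commit message; the real regexes in check-secrets.sh may differ.
PATTERNS = {
    "users_path": re.compile(r"/Users/[^/\s]+"),
    "bentley_email": re.compile(r"[\w.+-]+@bentley\.com"),
    "jwt_access_token": re.compile(r"access_token=eyJ[\w-]+\.[\w-]+\.[\w-]+"),
    "ion_token_literal": re.compile(
        r"Cesium\.Ion\.defaultAccessToken\s*=\s*[\"'][^\"']+[\"']"
    ),
}


def tracked_files(exclude=("scripts/check-secrets.sh",)):
    """List files tracked by git, skipping the scanner script itself.

    Gitignored and untracked files never appear in `git ls-files`, so
    local trace directories cannot cause false positives.
    """
    out = subprocess.run(
        ["git", "ls-files", "-z"], capture_output=True, text=True, check=True
    ).stdout
    return [path for path in out.split("\0") if path and path not in exclude]


def scan_text(text):
    """Return (line_number, pattern_name) for every pattern hit in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits
```

A caller would loop over `tracked_files()`, pass each file's text to `scan_text`, and fail the push if any hits come back.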
This was referenced Jun 2, 2026
Superseded by #13, which republishes the evaluation and optimization suite from the cleaned public-facing
Status
Draft. In progress. Not intended to merge into `main`. This PR exists for review visibility and to give a place for inline file-level feedback on top of the discussion happening in #5. The `eval-and-optimization` branch is meant to live alongside `main` as a standalone artifact, not be merged in.
Summary
Introduces an eval and optimization suite for the skills in this repo, published as a standalone branch. Provides regression detection plus a propose, evaluate, decide loop for actively improving the skills as CesiumJS evolves.
Why
CesiumJS evolves continuously. New APIs land, old ones deprecate, behavior changes across versions. Without an automated quality bar the skills silently drift. To stay healthy over time the skills need:
Recent agentic AI work (Meta-Harness, Lee et al., Stanford 2025; SPO, EMNLP 2025) describes a sustainable propose, evaluate, decide loop driven by a coding agent with full filesystem access to prior iteration history. The same primitives used to write the skills can be turned around to evaluate and optimize them.
See #5 for the full discussion thread.
What is included
- `tuning/cesiumjs-skills-eval-methodology.md`
- `tuning/RESEARCH-DIARY.md`
- `tuning/<skill>/evals/eval-NNN-<landmark>.json`
- `tuning/<skill>/iterations/<n>/`
- `tuning/<skill>/best.json`
- `tuning/<skill>/coverage-report.json`
- `tuning/tools/run_eval_suite.py`
- `tuning/tools/coverage-analyzer.py`
- `tuning/.claude/`
- `tuning/examples/screenshots/`
- `tuning/README.md`
- `scripts/check-secrets.sh`
- `.gitignore`: `tuning/` ignore on `main`
How it works
A coding agent (Claude Code) reads the full prior-iteration history and proposes a revised SKILL.md. The runner generates code from that SKILL.md, executes it in headless Chrome, captures screenshots, and three independent judges compare the candidate against the current best in pairwise fashion. Majority vote decides keep or discard. Programmatic checks (code runs, no console errors, expected APIs present) gate code-correctness regressions independently of the visual signal.
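A minimal sketch of the keep-or-discard step described above, assuming each judge returns one of three labels and that a candidate must win a strict majority to replace the current best. The label strings, the `ProgrammaticChecks` type, and the `decide` function are hypothetical names for this example, and treating ties or split votes as "discard" is an assumption, not something the runner is confirmed to do.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProgrammaticChecks:
    """Code-correctness gates, evaluated independently of the judges."""
    code_runs: bool
    no_console_errors: bool
    expected_apis_present: bool

    def passed(self) -> bool:
        return self.code_runs and self.no_console_errors and self.expected_apis_present


def decide(verdicts: list[str], checks: ProgrammaticChecks) -> str:
    """Return "keep" if the candidate should replace the current best.

    Each verdict is "candidate", "best", or "tie" (assumed labels).
    """
    # A failed programmatic gate rejects the candidate regardless of votes.
    if not checks.passed():
        return "discard"
    votes = Counter(verdicts)
    majority = len(verdicts) // 2 + 1
    # Assumption: anything short of a strict majority keeps the current best.
    return "keep" if votes["candidate"] >= majority else "discard"
```

With three judges, two votes for the candidate are enough to keep it, while a one-one-one split leaves the current best in place.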
Visual scenes use iconic landmarks (Eiffel Tower, Grand Canyon, NYC, London) so judges can reliably tell when output matches the eval's expected outcome.
The runner is intentionally runtime-independent. It executes generated CesiumJS code against a generic browser harness, not against any plugin-provided MCP tools. This keeps it usable today and leaves room for an MCP-native mode later.
Why pairwise visual comparison with three judges
An early pilot used absolute scoring with a single judge. Variance was around 0.05 to 0.10 per response on functionally identical outputs. In one camera eval, the same generated code scored 0.90 on the baseline iteration and 0.793 on a candidate that produced an identical screenshot. That swing alone flipped the aggregate decision. Pairwise comparison (judge sees both screenshots, picks A or B or TIE) eliminates absolute-calibration noise. Three independent judges with no shared context eliminate anchoring bias and consistency pressure.
What is intentionally not on the branch
Materialized per-eval traces (eval HTML, console captures, generation prompts, generated code, raw screenshots) are excluded. They contain run-time artifacts that may include access tokens, local file paths, and prompt commentary. Scores, decisions, judge verdicts, proposer reasoning, and the curated gallery are all committed and readable as is.
Roadmap
`index.html` plus `http-server` plus Playwright or Chrome DevTools MCP, replacing the per-eval HTML materialization step.
Security
Three independent secret scans ran during the push:
- gitleaks over git history (`--log-opts=--all`): 21 commits, 1.5 MB scanned, no leaks.
- gitleaks over the working tree (`--no-git`, includes untracked and ignored): 1.36 MB scanned, no leaks.
- Project-specific regexes (`/Users/`, `@bentley.com`, JWT shapes, hardcoded `Cesium.Ion.defaultAccessToken`): no matches.
Absolute paths in the coverage report were scrubbed to relative form before the first commit. The Cesium Ion token used during eval runs was rotated as a hygiene measure.
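As an illustration of the path scrubbing mentioned above, the sketch below rewrites absolute paths in a JSON-like structure to repo-relative form. The schema of `coverage-report.json` is not shown in this PR, so the sample data, the `scrub` and `scrub_path` helpers, and the fallback of keeping only the file name for paths outside the repo are all assumptions.

```python
from pathlib import PurePosixPath


def scrub_path(path: str, repo_root: str) -> str:
    """Convert an absolute POSIX path to a repo-relative one."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return path
    try:
        return str(p.relative_to(repo_root))
    except ValueError:
        # Assumption: for paths outside the repo, keep only the file name
        # so that no home-directory prefix survives.
        return p.name


def scrub(value, repo_root: str):
    """Recursively scrub absolute path strings in dicts and lists."""
    if isinstance(value, dict):
        return {key: scrub(item, repo_root) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item, repo_root) for item in value]
    if isinstance(value, str) and value.startswith("/"):
        return scrub_path(value, repo_root)
    return value
```

Running `scrub` on the parsed report before writing it back would be one way to guarantee that no `/Users/...` prefix reaches a commit.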
Test plan
- `tuning/cesiumjs-skills-eval-methodology.md`.
- `tuning/cesiumjs-camera/evals/` and `tuning/cesiumjs-camera/iterations/003/` to see scenario format and iteration artifacts.
- `tuning/examples/screenshots/` and confirm the 8 curated images look representative.
- `./scripts/check-secrets.sh` from the branch checkout and confirm it returns clean.
Out of scope
`main`.
Related
Tracks #5.