feat: eval and optimization suite for skills (in-progress) #6

Closed
jdehorty wants to merge 5 commits into main from eval-and-optimization

@jdehorty

Contributor

Status

Draft. In progress. Not intended to merge into main. This PR exists for review visibility and as a place for inline file-level feedback on top of the discussion happening in #5. The eval-and-optimization branch is meant to live alongside main as a standalone artifact, not to be merged in.

Summary

Introduces an eval and optimization suite for the skills in this repo, published as a standalone branch. Provides regression detection plus a propose, evaluate, decide loop for actively improving the skills as CesiumJS evolves.

Why

CesiumJS evolves continuously. New APIs land, old ones deprecate, behavior changes across versions. Without an automated quality bar the skills silently drift. To stay healthy over time the skills need:

  1. Regression detection when SKILL.md edits, CesiumJS API changes, or model upgrades affect output.
  2. Active quality improvement, not just measurement. Manually tuning prose across 14 SKILL.md files based on hunches does not scale.

Recent agentic AI work (Meta-Harness, Lee et al., Stanford 2025; SPO, EMNLP 2025) describes a sustainable propose, evaluate, decide loop driven by a coding agent with full filesystem access to prior iteration history. The same primitives used to write the skills can be turned around to evaluate and optimize them.

See #5 for the full discussion thread.

What is included

| Path | Purpose |
| --- | --- |
| tuning/cesiumjs-skills-eval-methodology.md | Full methodology document |
| tuning/RESEARCH-DIARY.md | Historical notes from harness development |
| tuning/<skill>/evals/eval-NNN-<landmark>.json | Per-skill eval scenarios (full sets for camera, viewer-setup, entities, imagery; baselines for the rest) |
| tuning/<skill>/iterations/<n>/ | SKILL.md versions, scores, decisions, pairwise judge verdicts, proposer reasoning |
| tuning/<skill>/best.json | Pointer to the current-best iteration |
| tuning/<skill>/coverage-report.json | Section coverage analysis |
| tuning/tools/run_eval_suite.py | Browser execution helper (headless Chrome, screenshots, console capture) |
| tuning/tools/coverage-analyzer.py | Enforces 90 percent SKILL.md section coverage |
| tuning/.claude/ | Proposer skill and Stop hook used to drive the optimization loop |
| tuning/examples/screenshots/ | Curated 8-image visual gallery |
| tuning/README.md | Orientation for outside readers |
| scripts/check-secrets.sh | Two-mode gitleaks scan plus project regexes |
| .gitignore | Surgical excludes (traces, demos, internal plans) replacing the blanket tuning/ ignore on main |
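
To make the 90 percent section-coverage bar listed for coverage-analyzer.py concrete, here is a minimal sketch of such a check. It is not the actual tool. It assumes that a "section" is a second-level `## ` heading in SKILL.md and that each eval declares which section titles it exercises; both are illustrative assumptions, as are all function names.

```python
import re

COVERAGE_THRESHOLD = 0.90  # the 90 percent bar named for coverage-analyzer.py


def section_titles(skill_md: str) -> list[str]:
    """Return the titles of all second-level ('## ') headings.

    Assumption: a "section" is a '## ' heading. Deeper headings ('###')
    and the top-level title ('#') are ignored by this pattern.
    """
    return [m.group(1).strip()
            for m in re.finditer(r"^##\s+(.+)$", skill_md, re.MULTILINE)]


def coverage(skill_md: str, covered: set[str]) -> float:
    """Fraction of sections that at least one eval claims to exercise."""
    titles = section_titles(skill_md)
    if not titles:
        return 1.0  # nothing to cover
    return sum(1 for title in titles if title in covered) / len(titles)


def meets_bar(skill_md: str, covered: set[str]) -> bool:
    """True when coverage reaches the threshold."""
    return coverage(skill_md, covered) >= COVERAGE_THRESHOLD
```

With three sections and two of them covered, `coverage` returns 2/3 and `meets_bar` returns `False`, so the skill would need another eval before passing.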

How it works

A coding agent (Claude Code) reads the full prior-iteration history and proposes a revised SKILL.md. The runner generates code from that SKILL.md, executes it in headless Chrome, captures screenshots, and three independent judges compare the candidate against the current best in pairwise fashion. Majority vote decides keep or discard. Programmatic checks (code runs, no console errors, expected APIs present) gate code-correctness regressions independently of the visual signal.
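
The keep-or-discard step described above can be sketched as follows. This is an illustration, not the runner's code: the verdict labels, the `ProgrammaticChecks` type, and the rule that ties and split votes keep the current best are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProgrammaticChecks:
    """Hypothetical container for the three programmatic gates."""
    code_ran: bool
    console_errors: int
    missing_apis: list[str] = field(default_factory=list)

    def passed(self) -> bool:
        return self.code_ran and self.console_errors == 0 and not self.missing_apis


def decide(verdicts: list[str], checks: ProgrammaticChecks) -> str:
    """Return 'keep' or 'discard' for a candidate SKILL.md.

    verdicts holds one entry per judge: 'candidate', 'best', or 'tie'.
    The programmatic gate is applied first, independently of the judges.
    Assumption: anything short of a strict majority for the candidate
    (including ties) leaves the current best in place.
    """
    if not checks.passed():
        return "discard"
    majority = len(verdicts) // 2 + 1
    return "keep" if Counter(verdicts)["candidate"] >= majority else "discard"
```

For three judges, `decide(["candidate", "candidate", "best"], checks)` keeps the candidate when the checks pass, while a unanimous visual win is still discarded if the generated code logs a console error.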

Visual scenes use iconic landmarks (Eiffel Tower, Grand Canyon, NYC, London) so judges can reliably tell when output matches the eval's expected outcome.

The runner is intentionally runtime-independent. It executes generated CesiumJS code against a generic browser harness, not against any plugin-provided MCP tools. This keeps it usable today and leaves room for an MCP-native mode later.

Why pairwise visual comparison with three judges

An early pilot used absolute scoring with a single judge. Variance was around 0.05 to 0.10 per response on functionally identical outputs. In one camera eval, the same generated code scored 0.90 on the baseline iteration and 0.793 on a candidate that produced an identical screenshot. That swing alone flipped the aggregate decision. Pairwise comparison (judge sees both screenshots, picks A or B or TIE) eliminates absolute-calibration noise. Three independent judges with no shared context eliminate anchoring bias and consistency pressure.
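
A short worked example shows how that single swing can flip a mean-based decision. Only the 0.90 and 0.793 scores come from the pilot described above. The suite size of five and the 0.85 scores for the other four evals are hypothetical, chosen so that the two runs differ only in the noisy eval.

```python
# Scores reported for the same generated code with identical screenshots.
baseline_score = 0.90
candidate_score = 0.793
swing = round(baseline_score - candidate_score, 3)  # 0.107


def mean(scores: list[float]) -> float:
    return sum(scores) / len(scores)


# Hypothetical: four other evals that score identically in both runs.
other_evals = [0.85, 0.85, 0.85, 0.85]

baseline_mean = mean(other_evals + [baseline_score])    # about 0.86
candidate_mean = mean(other_evals + [candidate_score])  # about 0.8386

# Absolute scoring: the candidate loses on judge noise alone.
absolute_decision = "keep" if candidate_mean > baseline_mean else "discard"
```

Under these assumptions the candidate is discarded even though nothing observable changed. A pairwise judge shown the two identical screenshots has no absolute scale to miscalibrate and can simply answer TIE.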

What is intentionally not on the branch

Materialized per-eval traces (eval HTML, console captures, generation prompts, generated code, raw screenshots) are excluded. They contain run-time artifacts that may include access tokens, local file paths, and prompt commentary. Scores, decisions, judge verdicts, proposer reasoning, and the curated gallery are all committed and readable as is.

Roadmap

  1. Tier 1, now: archival publication on this branch (this PR).
  2. Tier 2, planned: a faster localhost iteration loop using a single CDN-loaded index.html plus http-server plus Playwright or Chrome DevTools MCP, replacing the per-eval HTML materialization step.
  3. Tier 3, later: MCP-native evals once the skills gain a runtime.

Security

Three independent secret scans ran during the push:

  • gitleaks history scan across all refs (--log-opts=--all): 21 commits, 1.5 MB scanned, no leaks.
  • gitleaks working-tree scan (--no-git, includes untracked and ignored): 1.36 MB scanned, no leaks.
  • Project-specific regex patterns (/Users/, @bentley.com, JWT shapes, hardcoded Cesium.Ion.defaultAccessToken): no matches.
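
The project-specific patterns in the last bullet could look roughly like the sketch below. The exact expressions in scripts/check-secrets.sh are not shown in this PR, so every regex here is an assumption that only approximates the four categories named above.

```python
import re

# Assumed approximations of the four pattern categories; the real
# expressions in scripts/check-secrets.sh may differ.
PATTERNS = {
    "users_path": re.compile(r"/Users/[^\s\"']+"),
    "bentley_email": re.compile(r"[A-Za-z0-9._%+-]+@bentley\.com"),
    # Three base64url segments separated by dots, starting with 'eyJ'.
    "jwt_shape": re.compile(
        r"eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}"
    ),
    # Only a string literal assignment counts as hardcoded.
    "hardcoded_ion_token": re.compile(
        r"Cesium\.Ion\.defaultAccessToken\s*=\s*[\"'][^\"']+[\"']"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every match in text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

An empty result from `scan` corresponds to the "no matches" outcome reported above. Assigning the token from a variable, as in `Cesium.Ion.defaultAccessToken = token;`, is deliberately not flagged by this sketch.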

Absolute paths in the coverage reports were scrubbed to relative form before the first commit. The Cesium Ion token used during eval runs was rotated as a hygiene measure.

Test plan

  • Browse the methodology document at tuning/cesiumjs-skills-eval-methodology.md.
  • Sample tuning/cesiumjs-camera/evals/ and tuning/cesiumjs-camera/iterations/003/ to see scenario format and iteration artifacts.
  • Open tuning/examples/screenshots/ and confirm the 8 curated images look representative.
  • Run ./scripts/check-secrets.sh from the branch checkout and confirm it returns clean.
  • Leave inline review comments here, or top-level comments on [Feedback] Evaluation and Optimization #5.

Out of scope

  • The skills themselves continue to ship from main.
  • Closed-source MCP server evals belong with that server's code, not in this repo.
  • Tier 2 (faster localhost iteration loop) and Tier 3 (MCP-native tool-call evals) are separate follow-ups, not included in this PR.

Related

Tracks #5.

…n branch

- rsync of tuning/ excluding traces/, demo HTML with baked tokens, and lean4 tooling
- updated .gitignore with surgical excludes (traces, demos, internal plans)
- scripts/check-secrets.sh for two-mode gitleaks scan + project regexes
- coverage-report.json paths scrubbed to relative form (no /Users/... leaks)
…o/viewer/

A single-page CesiumJS viewer designed to be driven by browser
automation (Chrome DevTools MCP, already wired into .mcp.json, or
Playwright MCP) from a Claude Code session. Loads once and stays
loaded; drivers inject code, screenshot, and read state via evaluate()
without reloading.

Distinct from the existing demo/index.html showcase: that page is for
humans to click around in. This page is for agents to drive.

Files:
- demo/viewer/index.html: CDN-loaded Cesium 1.139, error and warning
  capture on window.__evalErrors and window.__evalWarnings, Ion token
  read from ?ionToken= URL param, helpers __viewerState(),
  __resetViewer(), and __viewer convention slot. Status badge in the
  upper-left for at-a-glance verification.
- demo/viewer/README.md: usage walkthrough for driving the viewer via
  chrome-devtools MCP, table of exposed window globals, rationale for
  the persistent-page pattern vs per-eval HTML materialization.

Smoke-tested end to end before commit on a prior branch: page loads,
__viewerReady is true, viewer creates from injected code,
__resetViewer clears state, error capture correctly recorded the
"no Ion token" warning.

The viewer has no opinion about what code gets injected. It just
provides a clean Cesium environment, error capture, and reset
machinery, so skill authors can iterate locally without rebuilding
HTML files between attempts.
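
The reset, inject, read cycle described in this commit message might be driven as in the sketch below. It assumes a page object with an `evaluate(expression)` method, which Playwright's sync `Page` provides; the function names are hypothetical. The `window` globals are the ones listed above, but their return shapes are not documented here, so the result is passed through untouched.

```python
from typing import Any, Protocol
from urllib.parse import urlencode


class PageLike(Protocol):
    """Anything with an evaluate() method, e.g. a Playwright sync Page."""
    def evaluate(self, expression: str) -> Any: ...


def viewer_url(base: str, ion_token: str) -> str:
    """Build the viewer URL with the token in the ?ionToken= parameter."""
    return f"{base}?{urlencode({'ionToken': ion_token})}"


def run_snippet(page: PageLike, snippet: str) -> dict:
    """Reset the persistent viewer, inject code, and read back its state.

    The page is never reloaded; only window globals are touched.
    """
    if not page.evaluate("window.__viewerReady"):
        raise RuntimeError("viewer page is not ready")
    page.evaluate("window.__resetViewer()")
    page.evaluate(snippet)
    return {
        "errors": page.evaluate("window.__evalErrors"),
        "warnings": page.evaluate("window.__evalWarnings"),
        "state": page.evaluate("window.__viewerState()"),
    }
```

Because `run_snippet` depends only on `evaluate`, the same function can be exercised with a stub page in unit tests and with a real browser page during an eval run.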
Replace the gitleaks --no-git working-tree scan with a `git grep`-based
scan over tracked content for the project-specific regex patterns.

Why: --no-git mode reads every file on disk, including gitignored
local development artifacts. On the main worktree (where the user keeps
their local tuning/<skill>/iterations/<n>/traces/ content for ongoing
eval runs), those gitignored files contain the old (rotated) Cesium
Ion token, which trips the scanner with false positives and blocks
legitimate pushes that don't include any of that content.

A push only ever sends tracked commits. Scanning history (Mode 1,
unchanged) plus tracked content (Mode 2, new) covers exactly that.
The scanner now runs cleanly from the canonical eval worktree, the
main worktree with local dev content present, or a fresh checkout,
without needing to dance around the working-tree state.

Mode 1 still catches anything ever committed across all refs.
Mode 2 still catches the project-specific patterns (hardcoded Cesium
Ion token literals, JWT-shaped access_token query params, /Users/
path leaks, @bentley.com email leaks). The pathspec excludes the
scanner script itself since it carries the patterns as string literals.
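
A minimal sketch of the Mode 2 idea, written in Python rather than the shell the real script uses. It assumes the scan runs from the repository root and that the pathspec exclusion uses git's `:(exclude)` magic. The function names are hypothetical and the actual invocation in scripts/check-secrets.sh may differ.

```python
import subprocess

SCANNER_PATH = "scripts/check-secrets.sh"


def git_grep_command(pattern: str, exclude: str = SCANNER_PATH) -> list[str]:
    """Build a git grep call over tracked content only.

    -n prints line numbers, -I skips binary files, -E enables extended
    regular expressions. The exclude pathspec keeps the scanner from
    matching its own pattern literals.
    """
    return ["git", "grep", "-n", "-I", "-E", "-e", pattern,
            "--", ".", f":(exclude){exclude}"]


def tracked_matches(pattern: str, cwd: str = ".") -> list[str]:
    """Return matching lines from tracked files, or [] when clean."""
    proc = subprocess.run(git_grep_command(pattern), cwd=cwd,
                          capture_output=True, text=True)
    if proc.returncode == 0:   # at least one match
        return proc.stdout.splitlines()
    if proc.returncode == 1:   # no match: the clean case
        return []
    raise RuntimeError(proc.stderr.strip())
```

Since `git grep` reads only tracked content by default, gitignored local traces never reach the scanner, which is the false-positive source this commit removes.
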

jdehorty commented Jun 2, 2026

Contributor Author

Superseded by #13, which republishes the evaluation and optimization suite from the cleaned public-facing feat/eval-and-optimization branch. Closing this older draft to avoid keeping two public PR references active.

@jdehorty jdehorty closed this Jun 2, 2026
@jdehorty jdehorty deleted the eval-and-optimization branch June 3, 2026 17:23