feat: eval and optimization suite for skills (in-progress) by jdehorty · Pull Request #6 · CesiumGS/cesiumjs-skills

jdehorty · 2026-04-30T12:08:18Z

Status

Draft. In progress. Not intended to be merged into main. This PR exists for review visibility and as a place for inline file-level feedback on top of the discussion happening in #5. The eval-and-optimization branch is meant to live alongside main as a standalone artifact, not to be merged in.

Summary

Introduces an eval and optimization suite for the skills in this repo, published as a standalone branch. Provides regression detection plus a propose, evaluate, decide loop for actively improving the skills as CesiumJS evolves.

Why

CesiumJS evolves continuously. New APIs land, old ones are deprecated, and behavior changes across versions. Without an automated quality bar, the skills silently drift. To stay healthy over time, the skills need:

Regression detection when SKILL.md edits, CesiumJS API changes, or model upgrades affect output.
Active quality improvement, not just measurement. Manually tuning prose across 14 SKILL.md files based on hunches does not scale.

Recent agentic AI work (Meta-Harness, Lee et al., Stanford 2025; SPO, EMNLP 2025) describes a sustainable propose, evaluate, decide loop driven by a coding agent with full filesystem access to prior iteration history. The same primitives used to write the skills can be turned around to evaluate and optimize them.

See #5 for the full discussion thread.

What is included

| Path | Purpose |
| --- | --- |
| `tuning/cesiumjs-skills-eval-methodology.md` | Full methodology document |
| `tuning/RESEARCH-DIARY.md` | Historical notes from harness development |
| `tuning/<skill>/evals/eval-NNN-<landmark>.json` | Per-skill eval scenarios (full sets for camera, viewer-setup, entities, imagery; baselines for the rest) |
| `tuning/<skill>/iterations/<n>/` | SKILL.md versions, scores, decisions, pairwise judge verdicts, proposer reasoning |
| `tuning/<skill>/best.json` | Pointer to the current-best iteration |
| `tuning/<skill>/coverage-report.json` | Section coverage analysis |
| `tuning/tools/run_eval_suite.py` | Browser execution helper (headless Chrome, screenshots, console capture) |
| `tuning/tools/coverage-analyzer.py` | Enforces 90 percent SKILL.md section coverage |
| `tuning/.claude/` | Proposer skill and Stop hook used to drive the optimization loop |
| `tuning/examples/screenshots/` | Curated 8-image visual gallery |
| `tuning/README.md` | Orientation for outside readers |
| `scripts/check-secrets.sh` | Two-mode gitleaks scan plus project regexes |
| `.gitignore` | Surgical excludes (traces, demos, internal plans) replacing the blanket `tuning/` ignore on `main` |
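
The 90 percent section-coverage rule enforced by `tuning/tools/coverage-analyzer.py` could be pictured roughly as below. This is a minimal sketch, not the actual tool: it assumes "section coverage" means the fraction of SKILL.md headings exercised by at least one eval, and the function names and inputs are hypothetical.

```python
import re

COVERAGE_THRESHOLD = 0.90  # the 90 percent bar named in the table


def section_coverage(skill_md: str, covered_sections: set) -> float:
    """Fraction of markdown headings in skill_md that appear in covered_sections.

    Assumption: a "section" is any line starting with one to six '#' characters.
    """
    headings = [
        m.group(1).strip()
        for m in re.finditer(r"^#{1,6}[ \t]+(.+)$", skill_md, flags=re.MULTILINE)
    ]
    if not headings:
        return 1.0  # nothing to cover, so trivially covered
    hit = sum(1 for heading in headings if heading in covered_sections)
    return hit / len(headings)


def passes_coverage(skill_md: str, covered_sections: set) -> bool:
    """True when coverage meets or exceeds the threshold."""
    return section_coverage(skill_md, covered_sections) >= COVERAGE_THRESHOLD
```

With four headings and three of them covered, `section_coverage` returns 0.75 and the check fails; covering all four passes.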

How it works

A coding agent (Claude Code) reads the full prior-iteration history and proposes a revised SKILL.md. The runner generates code from that SKILL.md, executes it in headless Chrome, captures screenshots, and three independent judges compare the candidate against the current best in pairwise fashion. Majority vote decides keep or discard. Programmatic checks (code runs, no console errors, expected APIs present) gate code-correctness regressions independently of the visual signal.
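
The programmatic gate described above might look something like the following. It is a sketch under stated assumptions, not the runner's actual code: the `RunResult` shape, the function names, and the substring check for "expected APIs present" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Outcome of executing one generated snippet (hypothetical shape)."""
    ran_to_completion: bool
    console_errors: list = field(default_factory=list)
    generated_code: str = ""


def passes_programmatic_gate(result: RunResult, expected_apis: list) -> bool:
    """True only if the code ran, logged no console errors, and mentions every expected API.

    Assumption: API presence is checked by substring match on the generated source.
    """
    if not result.ran_to_completion:
        return False
    if result.console_errors:
        return False
    return all(api in result.generated_code for api in expected_apis)


def decide(gate_ok: bool, judges_prefer_candidate: bool) -> str:
    """Keep the candidate only when the gate passes and the visual vote favors it."""
    return "keep" if gate_ok and judges_prefer_candidate else "discard"
```

Keeping the gate separate from the visual vote means a candidate that renders a convincing screenshot but throws console errors is still discarded.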

Visual scenes use iconic landmarks (Eiffel Tower, Grand Canyon, NYC, London) so judges can reliably tell when output matches the eval's expected outcome.

The runner is intentionally runtime-independent. It executes generated CesiumJS code against a generic browser harness, not against any plugin-provided MCP tools. This keeps it usable today and leaves room for an MCP-native mode later.

Why pairwise visual comparison with three judges

An early pilot used absolute scoring with a single judge. Scores varied by around 0.05 to 0.10 per response on functionally identical outputs. In one camera eval, the same generated code scored 0.90 on the baseline iteration and 0.793 on a candidate that produced an identical screenshot. That 0.107 swing alone flipped the aggregate decision. Pairwise comparison (the judge sees both screenshots and picks A, B, or TIE) removes absolute-calibration noise. Three independent judges with no shared context remove anchoring bias and consistency pressure.
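
A majority vote over three pairwise verdicts can be tallied as in the sketch below. The function name is hypothetical, and treating a split vote with no majority as a TIE is an assumption, not something the PR states.

```python
from collections import Counter

VALID_VERDICTS = {"A", "B", "TIE"}


def majority_verdict(verdicts: list) -> str:
    """Return the verdict chosen by more than half of the judges, else "TIE".

    Assumption: a vote with no strict majority (for example A, B, TIE) counts as a TIE.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    unknown = [v for v in verdicts if v not in VALID_VERDICTS]
    if unknown:
        raise ValueError(f"unknown verdicts: {unknown}")
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count > len(verdicts) / 2 else "TIE"
```

Because each judge only picks a side, a judge that is systematically harsh or lenient in absolute terms no longer shifts the outcome.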

What is intentionally not on the branch

Materialized per-eval traces (eval HTML, console captures, generation prompts, generated code, raw screenshots) are excluded. They contain run-time artifacts that may include access tokens, local file paths, and prompt commentary. Scores, decisions, judge verdicts, proposer reasoning, and the curated gallery are all committed and readable as is.

Roadmap

Tier 1, now: archival publication on this branch (this PR).
Tier 2, planned: a faster localhost iteration loop using a single CDN-loaded index.html plus http-server plus Playwright or Chrome DevTools MCP, replacing the per-eval HTML materialization step.
Tier 3, later: MCP-native evals once the skills gain a runtime.

Security

Three independent secret scans ran during the push:

gitleaks history scan across all refs (--log-opts=--all): 21 commits, 1.5 MB scanned, no leaks.
gitleaks working-tree scan (--no-git, includes untracked and ignored): 1.36 MB scanned, no leaks.
Project-specific regex patterns (/Users/, @bentley.com, JWT shapes, hardcoded Cesium.Ion.defaultAccessToken): no matches.
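
The project-specific pattern scan can be illustrated with a short sketch. The regular expressions here are illustrative guesses at the four pattern families named above, not the patterns actually used in `scripts/check-secrets.sh`, and `find_leaks` is a hypothetical helper.

```python
import re

# Illustrative patterns only; the real script's regexes are not shown in this PR.
PATTERNS = {
    "users_path": re.compile(r"/Users/[^/\s]+"),
    "bentley_email": re.compile(r"[\w.+-]+@bentley\.com"),
    "jwt_shape": re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "hardcoded_ion_token": re.compile(
        r"Cesium\.Ion\.defaultAccessToken\s*=\s*['\"][^'\"]+['\"]"
    ),
}


def find_leaks(text: str) -> list:
    """Return (pattern_name, matched_text) pairs for every hit in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits
```

An empty list from `find_leaks` corresponds to the "no matches" result reported above.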

Absolute paths in the coverage reports were scrubbed to relative form before the first commit. The Cesium Ion token used during eval runs was rotated as a hygiene measure.

Test plan

Browse the methodology document at tuning/cesiumjs-skills-eval-methodology.md.
Sample tuning/cesiumjs-camera/evals/ and tuning/cesiumjs-camera/iterations/003/ to see scenario format and iteration artifacts.
Open tuning/examples/screenshots/ and confirm the 8 curated images look representative.
Run ./scripts/check-secrets.sh from the branch checkout and confirm it returns clean.
Leave inline review comments here, or top-level comments on [Feedback] Evaluation and Optimization #5.

Out of scope

The skills themselves continue to ship from main.
Closed-source MCP server evals belong with that server's code, not in this repo.
Tier 2 (faster localhost iteration loop) and Tier 3 (MCP-native tool-call evals) are separate follow-ups, not included in this PR.

…n branch

- rsync of tuning/ excluding traces/, demo HTML with baked tokens, and lean4 tooling
- updated .gitignore with surgical excludes (traces, demos, internal plans)
- scripts/check-secrets.sh for two-mode gitleaks scan + project regexes
- coverage-report.json paths scrubbed to relative form (no /Users/... leaks)

…layout

…o/viewer/

A single-page CesiumJS viewer designed to be driven by browser automation (Chrome DevTools MCP, already wired into .mcp.json, or Playwright MCP) from a Claude Code session. Loads once and stays loaded; drivers inject code, screenshot, and read state via evaluate() without reloading.

Distinct from the existing demo/index.html showcase: that page is for humans to click around in. This page is for agents to drive.

Files:

- demo/viewer/index.html: CDN-loaded Cesium 1.139, error and warning capture on window.__evalErrors and window.__evalWarnings, Ion token read from ?ionToken= URL param, helpers __viewerState(), __resetViewer(), and __viewer convention slot. Status badge in the upper-left for at-a-glance verification.
- demo/viewer/README.md: usage walkthrough for driving the viewer via chrome-devtools MCP, table of exposed window globals, rationale for the persistent-page pattern vs per-eval HTML materialization.

Smoke-tested end to end before commit on a prior branch: page loads, __viewerReady is true, viewer creates from injected code, __resetViewer clears state, error capture correctly recorded the "no Ion token" warning.

The viewer has no opinion about what code gets injected. It just provides a clean Cesium environment, error capture, and reset machinery, so skill authors can iterate locally without rebuilding HTML files between attempts.
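
The persistent-page pattern could be driven from Python with Playwright roughly as follows. This is a sketch, not part of the commit: the base URL, the helper names `viewer_url` and `run_snippets`, and the assumption that `__evalErrors`, `__evalWarnings`, and the return value of `__viewerState()` are JSON-serializable are all hypothetical. Only the `window` globals and the `?ionToken=` parameter come from the commit message.

```python
from typing import Optional
from urllib.parse import urlencode


def viewer_url(base: str, ion_token: Optional[str] = None) -> str:
    """Append the ?ionToken= query parameter that the viewer page reads."""
    if ion_token is None:
        return base
    return f"{base}?{urlencode({'ionToken': ion_token})}"


def run_snippets(url: str, snippets: list) -> list:
    """Load the viewer once, then inject each snippet and collect its report."""
    # Imported lazily so viewer_url works without Playwright installed.
    from playwright.sync_api import sync_playwright

    reports = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_function("window.__viewerReady === true")
        for snippet in snippets:
            # Reset instead of reloading: the page stays loaded between attempts.
            page.evaluate("() => window.__resetViewer()")
            page.evaluate(snippet)
            reports.append(page.evaluate(
                "() => ({errors: window.__evalErrors,"
                " warnings: window.__evalWarnings,"
                " state: window.__viewerState()})"
            ))
        browser.close()
    return reports
```

For example, `viewer_url("http://localhost:8080/demo/viewer/index.html", "abc")` yields the same URL with `?ionToken=abc` appended; the host and port are placeholders.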

Replace the gitleaks --no-git working-tree scan with a `git grep`-based scan over tracked content for the project-specific regex patterns.

Why: --no-git mode reads every file on disk, including gitignored local development artifacts. On the main worktree (where the user keeps their local tuning/<skill>/iterations/<n>/traces/ content for ongoing eval runs), those gitignored files contain the old (rotated) Cesium Ion token, which trips the scanner with false positives and blocks legitimate pushes that don't include any of that content.

A push only ever sends tracked commits. Scanning history (Mode 1, unchanged) plus tracked content (Mode 2, new) covers exactly that. The scanner now runs cleanly from the canonical eval worktree, the main worktree with local dev content present, or a fresh checkout, without needing to dance around the working-tree state.

Mode 1 still catches anything ever committed across all refs. Mode 2 still catches the project-specific patterns (hardcoded Cesium Ion token literals, JWT-shaped access_token query params, /Users/ path leaks, @bentley.com email leaks). The pathspec excludes the scanner script itself since it carries the patterns as string literals.
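
The tracked-content scan (Mode 2) can be approximated from Python as shown below. This is a sketch of the idea, not a port of `scripts/check-secrets.sh`: the helper names are hypothetical, and the exact flags the script passes to `git grep` are an assumption.

```python
import subprocess

SCANNER_PATH = "scripts/check-secrets.sh"  # excluded because it holds the patterns as literals


def build_git_grep_argv(pattern: str, exclude: str = SCANNER_PATH) -> list:
    """Build a git grep command that searches tracked files only, skipping the scanner."""
    return [
        "git", "grep",
        "-n",           # show line numbers
        "-I",           # skip binary files
        "-E",           # extended regular expressions
        "-e", pattern,
        "--", ".", f":(exclude){exclude}",
    ]


def tracked_matches(pattern: str, repo_dir: str = ".") -> list:
    """Return matching lines from tracked content; empty list means clean."""
    proc = subprocess.run(
        build_git_grep_argv(pattern),
        cwd=repo_dir, capture_output=True, text=True,
    )
    if proc.returncode == 1:   # git grep exits 1 when nothing matches
        return []
    if proc.returncode != 0:   # any other nonzero code is a real error
        raise RuntimeError(proc.stderr.strip())
    return proc.stdout.splitlines()
```

Because `git grep` only reads tracked files by default, gitignored local traces never reach the scanner, which is the property the commit relies on.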

jdehorty · 2026-06-02T12:13:34Z

Superseded by #13, which republishes the evaluation and optimization suite from the cleaned public-facing feat/eval-and-optimization branch. Closing this older draft to avoid keeping two public PR references active.

jdehorty added 3 commits April 30, 2026 06:02

docs: add curated 8-image eval screenshot gallery under tuning/examples/

25634fc

docs: add tuning/ README orienting outside readers to the eval suite …

91943a1

…layout

jdehorty mentioned this pull request Apr 30, 2026

feat: add persistent in-browser eval harness under harness/ #7

Closed

6 tasks

jdehorty added 2 commits April 30, 2026 07:49

This was referenced Jun 2, 2026

feat: public evaluation and optimization suite #13

Draft

[Feedback] Evaluation and Optimization #5

Open

jdehorty closed this Jun 2, 2026

jdehorty deleted the eval-and-optimization branch June 3, 2026 17:23

jdehorty wants to merge 5 commits into main from eval-and-optimization

1 participant