CesiumGS · jdehorty · Jun 3, 2026 · Jun 4, 2026 · Jun 4, 2026
diff --git a/.architecture/acd.md b/.architecture/acd.md
diff --git a/.architecture/adr0001-skills-first-sequencing.md b/.architecture/adr0001-skills-first-sequencing.md
@@ -0,0 +1,27 @@
+# ADR-0001: Use Skills-First Sequencing
+
+## Status
+
+Accepted
+
+## Context
+
+The long-term goal is a reusable Cesium AI evaluation framework that can evaluate skills, documentation, examples, MCP tools, and Cesium ion workflows. The near-term implementation evidence exists in `CesiumGS/cesiumjs-skills`, where skill documents already have representative scenarios, browser execution helpers, and iteration history.
+
+Two sequencing options were considered:
+
+- **Suite-first:** design a generic benchmark framework first, then apply it to skills.
+- **Skills-first:** prove the framework in `cesiumjs-skills`, then generalize reusable components.
+
+## Decision
+
+Use skills-first sequencing.
+
+The initial canonical implementation should focus on `CesiumGS/cesiumjs-skills` because the target artifacts, scenarios, and runner already exist there. The architecture must still preserve extension points for MCP and tool-call evaluations.
+
+## Consequences
+
+- The first ACD and ADR set can be grounded in real branch artifacts.
+- The framework can produce useful value before MCP/tool runtimes are mature.
+- Some abstractions may need to be extracted later.
+- Scenario and runner design must avoid hard-coding assumptions that make future MCP evaluation impossible.
diff --git a/.architecture/adr0002-pairwise-judge-protocol.md b/.architecture/adr0002-pairwise-judge-protocol.md
@@ -0,0 +1,28 @@
+# ADR-0002: Use Pairwise Judge Protocol
+
+## Status
+
+Accepted
+
+## Context
+
+CesiumJS correctness is often visual. A scene can execute without console errors and still be wrong because the camera is poorly framed, terrain is missing, imagery is incorrect, or an expected object is not visible.
+
+Early absolute visual scoring is vulnerable to judge calibration noise. The same or equivalent output can receive different numeric scores across runs, which can flip aggregate decisions even when the candidate did not materially change a scenario.
+
+## Decision
+
+Use pairwise A/B/TIE judging for visual and qualitative comparisons:
+
+1. Present the same scenario evidence for the current best (the baseline) and the candidate.
+2. Ask each judge to choose `BASELINE`, `CANDIDATE`, or `TIE`.
+3. Use three independent judges.
+4. Compute the per-scenario majority verdict.
+
+## Consequences
+
+- Relative comparison reduces absolute-score calibration noise.
+- Three independent judges reduce single-judge bias.
+- Judge results are easier for maintainers to inspect.
+- The protocol costs more than a single-judge run, so it should be used selectively in CI.
+- Numeric scores can still be produced for reporting, but they are not the primary visual decision mechanism.
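Steps 2 to 4 of ADR-0002 reduce to a small tally. The following is a minimal sketch, not the repository's implementation: the function name `majority_verdict` is hypothetical, and falling back to `TIE` when no label wins a strict majority (for example a 1-1-1 split) is an assumption that the ADR does not state.

```python
from collections import Counter

# Verdict labels taken from ADR-0002.
VERDICTS = ("BASELINE", "CANDIDATE", "TIE")


def majority_verdict(votes):
    """Return the label chosen by a strict majority of judges, else "TIE"."""
    if not votes:
        raise ValueError("at least one judge vote is required")
    unknown = set(votes) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: without a strict majority (e.g. a 1-1-1 split), report TIE.
    return label if count * 2 > len(votes) else "TIE"
```

With three judges, two matching votes decide a scenario. Treating a three-way split as `TIE` keeps the outcome independent of vote order.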
diff --git a/.architecture/adr0003-deterministic-decision-policy.md b/.architecture/adr0003-deterministic-decision-policy.md
@@ -0,0 +1,34 @@
+# ADR-0003: Separate Deterministic Decision Policy From Report Scores
+
+## Status
+
+Accepted
+
+## Context
+
+Evaluation systems often mix LLM judgment, programmatic checks, and numeric aggregate scores. That can make decisions hard to audit. For Cesium AI evaluation, maintainers need to know whether a candidate was kept because it passed critical correctness gates, won visual comparisons, or merely improved a weighted average.
+
+The desired pattern is:
+
+```text
+structured evidence -> deterministic decision -> report score
+```
+
+## Decision
+
+Use deterministic gates and rule-based decisions as the source of truth:
+
+1. Deterministic checks produce pass/fail facts.
+2. Pairwise judges produce structured verdicts.
+3. Critical-regression failures block acceptance.
+4. Majority verdicts decide visual wins, losses, and ties.
+5. Numeric report scores are derived afterward for dashboards and trends.
+
+Manual override is allowed only when a maintainer records the reason.
+
+## Consequences
+
+- Decisions are explainable and auditable.
+- Dashboards can still show scores without hiding the decision basis.
+- Weighted formulas can evolve without changing the core keep/discard semantics.
+- More metadata must be stored so decisions can be reconstructed later.
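The `structured evidence -> deterministic decision -> report score` pattern in ADR-0003 can be illustrated with a short sketch. Everything named here (`ScenarioResult`, `decide`, `report_score`) is hypothetical, and both the "keep only when visual wins outnumber losses" rule and the 1.0 / 0.5 / 0.0 point mapping are assumptions made for illustration. The ADR fixes only the ordering: gates first, majority verdicts second, scores last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    scenario_id: str
    critical_checks_passed: bool  # outcome of the deterministic gates
    verdict: str  # majority verdict: "BASELINE", "CANDIDATE", or "TIE"


def decide(results):
    """Return (decision, reason) from rules only; no numeric score is consulted."""
    failed = [r.scenario_id for r in results if not r.critical_checks_passed]
    if failed:
        return "discard", "critical regression in " + ", ".join(failed)
    wins = sum(r.verdict == "CANDIDATE" for r in results)
    losses = sum(r.verdict == "BASELINE" for r in results)
    # Assumption: keep only when visual wins outnumber losses.
    if wins > losses:
        return "keep", f"{wins} visual wins vs {losses} losses"
    return "discard", f"no net visual win ({wins} wins vs {losses} losses)"


def report_score(results):
    """Derive a 0-1 dashboard score after the decision has been made."""
    points = {"CANDIDATE": 1.0, "TIE": 0.5, "BASELINE": 0.0}
    if not results:
        return 0.0
    return sum(points[r.verdict] for r in results) / len(results)
```

Because `decide` never calls `report_score`, the scoring weights can change without altering which candidates are kept or discarded.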
diff --git a/.architecture/adr0004-browser-visual-evaluation.md b/.architecture/adr0004-browser-visual-evaluation.md
@@ -0,0 +1,25 @@
+# ADR-0004: Use Browser-Backed Visual Evaluation
+
+## Status
+
+Accepted
+
+## Context
+
+Many CesiumJS failures cannot be detected from generated code alone. Examples include wrong camera distance, reversed longitude sign, missing terrain, invisible imagery, flat city scenes where 3D buildings were expected, or asynchronous loading that never completes.
+
+Code-pattern checks are necessary but insufficient.
+
+## Decision
+
+Use a browser-backed Cesium evaluation environment for CesiumJS scenarios.
+
+The runner should execute generated code in a controlled page and capture console output, page errors, screenshots, and, where useful, scene state. Visual scenarios should use recognizable landmarks or controlled fixtures to make judgment reliable.
+
+## Consequences
+
+- The framework measures rendered behavior, not just code shape.
+- Screenshots and scene evidence help maintainers debug failures.
+- Browser automation adds setup cost and potential flakiness.
+- Scenarios need careful timing and stable viewport settings.
+- Some checks should use scene state in addition to screenshots when visual evidence alone is ambiguous.
diff --git a/.architecture/adr0005-ci-trigger-policy.md b/.architecture/adr0005-ci-trigger-policy.md
@@ -0,0 +1,26 @@
+# ADR-0005: Use Cost-Aware CI Trigger Policy
+
+## Status
+
+Accepted
+
+## Context
+
+Full LLM-assisted judging is useful but can be expensive and slow. Public contributors need fast feedback on routine changes, while maintainers need deeper evaluation before releases or high-risk changes.
+
+## Decision
+
+Use tiered CI triggers:
+
+- Pull requests run deterministic checks by default.
+- Visual judge suites run on scheduled, manual, release, or high-risk triggers.
+- Maintainers can request a full judge run when a change plausibly affects generated visual quality.
+- CI output must include enough artifacts to reproduce failures locally.
+
+## Consequences
+
+- Routine PRs remain practical for contributors.
+- Expensive judge runs are used where they add the most value.
+- Some visual regressions may go undetected until the next scheduled, release, or manual judge run.
+- Release and scheduled runs become important safety nets.
+- Trigger policy must be documented so contributors know what evidence is required.
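The tiering in ADR-0005 amounts to a mapping from CI trigger to evaluation lanes. A minimal sketch, assuming GitHub Actions event names: the `lanes_for` function, the lane names, and the `high_risk` flag are hypothetical, and the ADR does not define how a change is classified as high-risk.

```python
# GitHub Actions event names assumed to warrant the expensive visual judge lane.
FULL_JUDGE_EVENTS = frozenset({"schedule", "workflow_dispatch", "release"})


def lanes_for(event_name, high_risk=False):
    """Return the evaluation lanes to run for a CI trigger."""
    lanes = ["deterministic"]  # always runs, including on pull requests
    if event_name in FULL_JUDGE_EVENTS or high_risk:
        lanes.append("visual-judge")
    return lanes
```

For example, `lanes_for("pull_request")` yields only the deterministic lane, while `lanes_for("pull_request", high_risk=True)` adds the visual judge lane.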
diff --git a/.architecture/adr0006-public-artifact-policy.md b/.architecture/adr0006-public-artifact-policy.md
@@ -0,0 +1,26 @@
+# ADR-0006: Commit Public-Safe Evidence Only
+
+## Status
+
+Accepted
+
+## Context
+
+Evaluation traces can include generated HTML, access tokens, local paths, prompts, console output, screenshots, and model metadata. Some of these artifacts are useful for debugging but unsafe to commit publicly without review.
+
+The framework should be open and inspectable without leaking credentials or private context.
+
+## Decision
+
+Commit public-safe artifacts and ignore raw sensitive traces by default.
+
+Committed artifacts may include scenario manifests, coverage reports, programmatic check summaries, judge verdict summaries, decision records, curated screenshots, and public-safe reports.
+
+Raw generated traces should remain local or CI artifacts unless explicitly reviewed and sanitized.
+
+## Consequences
+
+- The public repository stays safer and easier to review.
+- Maintainers may need to regenerate raw traces locally for deep debugging.
+- Secret scanning remains mandatory before publication.
+- Reports should use relative paths and avoid private environment details.
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -4,13 +4,13 @@
     "name": "CesiumGS"
   },
   "metadata": {
-    "description": "CesiumJS domain skills covering ~535 public symbols across the CesiumJS v1.139 API surface"
+    "description": "CesiumJS domain skills covering ~550 public symbols across the CesiumJS v1.142 API surface"
   },
   "plugins": [
     {
       "name": "cesiumjs-skills",
       "source": "./",
-      "description": "14 domain skills for CesiumJS development with SessionStart hook and Chrome DevTools MCP integration",
+      "description": "14 domain skills for CesiumJS development with Chrome DevTools MCP integration",
       "version": "0.3.0",
       "category": "development",
       "tags": ["cesium", "cesiumjs", "3d", "geospatial", "gis", "maps"],

diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -53,7 +53,7 @@ body:
       label: Branch
       options:
         - main
-        - eval-and-optimization
+        - feat/eval-and-optimization
         - Other (please note in Context)
     validations:
       required: true

diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
@@ -16,5 +16,5 @@ contact_links:
     url: https://docs.claude.com/en/docs/claude-code/overview
     about: Questions about Claude Code, plugins, or the Skill API belong with Anthropic's docs.
   - name: Eval and optimization branch
-    url: https://github.com/CesiumGS/cesiumjs-skills/tree/eval-and-optimization
+    url: https://github.com/CesiumGS/cesiumjs-skills/tree/feat/eval-and-optimization
     about: Browse the eval suite, methodology, scoring history, and curated screenshot gallery.
diff --git a/.github/ISSUE_TEMPLATE/feature-request.yml b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -41,7 +41,7 @@ body:
     attributes:
       label: Area
       description: Which skill, file, or branch this affects. Paths welcome.
-      placeholder: e.g. "skills/cesiumjs-entities/SKILL.md" or "tuning/tools/run_eval_suite.py"
+      placeholder: e.g. "skills/cesiumjs-entities/SKILL.md" or "optimization/scripts/run-all-evals.py"
     validations:
       required: false
 
@@ -97,5 +97,5 @@ body:
       options:
         - label: I searched existing issues and PRs for duplicates.
           required: true
-        - label: For new skills, I checked `docs/DOMAINS.md` to confirm the area is not already covered by an existing skill.
+        - label: For new skills, I checked `wiki/Domain-Mapping.md` to confirm the area is not already covered by an existing skill.
           required: false
diff --git a/.github/ISSUE_TEMPLATE/feedback-or-rfc.yml b/.github/ISSUE_TEMPLATE/feedback-or-rfc.yml
@@ -13,7 +13,7 @@ body:
   - type: markdown
     attributes:
       value: |
-        Use this form for feedback on the skills in `main`, the eval and optimization suite on `eval-and-optimization`, the methodology, or the runner. For a likely bug in a specific skill, add the `bug` label.
+        Use this form for feedback on the skills in `main`, the eval and optimization suite on `feat/eval-and-optimization`, the methodology, or the runner. For a likely bug in a specific skill, add the `bug` label.
 
   - type: input
     id: summary
@@ -43,7 +43,7 @@ body:
     attributes:
       label: Area affected
       description: Which skill, file, or branch is this about? Paths welcome.
-      placeholder: e.g. "tuning/cesiumjs-camera/evals/" or "skills/cesiumjs-imagery/SKILL.md"
+      placeholder: e.g. "evaluation/cases/cesiumjs-camera/" or "skills/cesiumjs-imagery/SKILL.md"
     validations:
       required: false
 
@@ -86,5 +86,5 @@ body:
       options:
         - label: I searched existing issues for duplicates.
           required: true
-        - label: If proposing an eval scenario, I followed the schema in `tuning/<skill>/evals/eval-NNN-*.json`.
+        - label: If proposing an eval scenario, I followed the schema in `evaluation/cases/<skill>/eval-NNN-*.json`.
           required: false
diff --git a/.github/workflows/baseline-audit.yml b/.github/workflows/baseline-audit.yml
@@ -0,0 +1,137 @@
+name: Baseline Audit (deterministic + qualitative)
+
+# Two-lane baseline audit. The deterministic lane is the binding PR gate and
+# runs over tracked evidence fixtures (no secrets, fast). The qualitative lane
+# (static 0-10 visual judge) needs rendered baselines, which live under the
+# gitignored optimization/runs/ — so it renders them first and runs nightly /
+# on demand, posting an advisory result. Both lanes share ONE entrypoint:
+# evaluation/scripts/run-baseline-audit.py.
+
+on:
+  pull_request:
+    paths:
+      - 'skills/**'
+      - 'evaluation/**'
+      - 'optimization/scenarios/**'
+      - '.github/workflows/baseline-audit.yml'
+  push:
+    branches: [main]
+    paths:
+      - 'skills/**'
+      - 'evaluation/**'
+  workflow_dispatch:
+    inputs:
+      skills:
+        description: "Skills to audit ('all' or comma-separated)"
+        required: false
+        default: 'all'
+      judge_model:
+        description: 'Judge model alias'
+        required: false
+        default: 'sonnet'
+      n_judges:
+        description: 'Panel size'
+        required: false
+        default: '3'
+  schedule:
+    - cron: '0 3 * * *'   # nightly qualitative audit
+
+jobs:
+  # ----- Lane 1: deterministic (blocking PR gate, no secrets) -----
+  deterministic:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+          cache: 'pip'
+      - run: pip install -r requirements.txt
+      - name: Enforce eval/optimization import boundary
+        run: python3 evaluation/scripts/validate-evaluation.py
+      - name: Unit tests (judge + audit, FakeAdapter; CI-safe)
+        run: python3 -m pytest -q evaluation/tests
+      - name: Deterministic baseline audit over tracked evidence (binding gate)
+        run: |
+          python3 evaluation/scripts/run-baseline-audit.py \
+            --skills "${{ github.event.inputs.skills || 'all' }}" \
+            --no-judge \
+            --output-dir evaluation/artifacts/audits/ci-deterministic
+      - name: Upload deterministic scorecard
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: baseline-audit-deterministic-${{ github.run_number }}
+          path: evaluation/artifacts/audits/ci-deterministic/**/scorecard.*
+
+  # ----- Lane 2: qualitative (nightly / dispatch, advisory) -----
+  qualitative:
+    needs: deterministic
+    if: ${{ github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
+    runs-on: ubuntu-latest
+    timeout-minutes: 120
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+          cache: 'pip'
+      - run: |
+          pip install -r requirements.txt
+          playwright install chromium
+          playwright install-deps chromium
+      - uses: actions/setup-node@v4
+        with:
+          node-version: '22'
+      - name: Install Claude CLI
+        run: npm install -g @anthropic-ai/claude-code && claude --version
+      - name: Validate secrets
+        env:
+          CESIUM_ION_TOKEN: ${{ secrets.CESIUM_ION_TOKEN }}
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          test -n "$CESIUM_ION_TOKEN" || { echo "CESIUM_ION_TOKEN missing"; exit 1; }
+          test -n "$ANTHROPIC_API_KEY" || { echo "ANTHROPIC_API_KEY missing"; exit 1; }
+      - name: Render current-best baselines (gitignored; needed for the visual lane)
+        env:
+          CESIUM_ION_TOKEN: ${{ secrets.CESIUM_ION_TOKEN }}
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          # The optimization loop's baseline guard renders skills/<skill>/SKILL.md
+          # into optimization/runs/<skill>/baseline (1 iteration is enough to
+          # establish the baseline bundle the audit judges).
+          SKILLS="${{ github.event.inputs.skills || 'all' }}"
+          python3 optimization/scripts/run-all-evals.py --skills "$SKILLS" --max-iterations 1 --stop-on max || true
+      - name: Qualitative + deterministic baseline audit (advisory)
+        id: audit
+        continue-on-error: true
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          python3 evaluation/scripts/run-baseline-audit.py \
+            --skills "${{ github.event.inputs.skills || 'all' }}" \
+            --judge-model "${{ github.event.inputs.judge_model || 'sonnet' }}" \
+            --n-judges "${{ github.event.inputs.n_judges || '3' }}" \
+            --output-dir evaluation/artifacts/audits/ci
+      - name: Sanitize before upload
+        if: always()
+        run: |
+          python3 optimization/scripts/check-public-artifacts.py
+          bash optimization/scripts/check-secrets.sh
+          find optimization/runs/ evaluation/artifacts/ -name '*.html' -delete || true
+      - name: Build audit dashboard
+        if: always()
+        run: |
+          SC=$(ls -t evaluation/artifacts/audits/ci/*/scorecard.json 2>/dev/null | head -1)
+          if [ -n "$SC" ]; then python3 evaluation/scripts/build-audit-ui.py "$SC" --output evaluation/artifacts/audits/ci/audit.html; fi
+      - name: Upload audit scorecard + dashboard
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: baseline-audit-qualitative-${{ github.run_number }}
+          path: |
+            evaluation/artifacts/audits/ci/**/scorecard.*
+            evaluation/artifacts/audits/ci/audit.html