504 changes: 504 additions & 0 deletions .architecture/acd.md

Large diffs are not rendered by default.

27 changes: 27 additions & 0 deletions .architecture/adr0001-skills-first-sequencing.md
@@ -0,0 +1,27 @@
# ADR-0001: Use Skills-First Sequencing

## Status

Accepted

## Context

The long-term goal is a reusable Cesium AI evaluation framework that can evaluate skills, documentation, examples, MCP tools, and Cesium ion workflows. The near-term implementation evidence exists in `CesiumGS/cesiumjs-skills`, where skill documents already have representative scenarios, browser execution helpers, and iteration history.

Two sequencing options were considered:

- **Suite-first:** design a generic benchmark framework first, then apply it to skills.
- **Skills-first:** prove the framework in `cesiumjs-skills`, then generalize reusable components.

## Decision

Use skills-first sequencing.

The initial canonical implementation should focus on `CesiumGS/cesiumjs-skills` because the target artifacts, scenarios, and runner already exist there. The architecture must still preserve extension points for MCP and tool-call evaluations.

## Consequences

- The first ACD and ADR set can be grounded in real branch artifacts.
- The framework can produce useful value before MCP/tool runtimes are mature.
- Some abstractions may need to be extracted later.
- Scenario and runner design must avoid hard-coding assumptions that make future MCP evaluation impossible.
28 changes: 28 additions & 0 deletions .architecture/adr0002-pairwise-judge-protocol.md
@@ -0,0 +1,28 @@
# ADR-0002: Use Pairwise Judge Protocol

## Status

Accepted

## Context

CesiumJS correctness is often visual. A scene can execute without console errors and still be wrong because the camera is poorly framed, terrain is missing, imagery is incorrect, or an expected object is not visible.

Early absolute visual scoring is vulnerable to judge calibration noise. The same or equivalent output can receive different numeric scores across runs, which can flip aggregate decisions even when the candidate did not materially change a scenario.

## Decision

Use pairwise judging (baseline versus candidate, with ties allowed) for visual and qualitative comparisons:

1. Present the same scenario's evidence for the current best (the baseline) and the candidate.
2. Ask each judge to choose `BASELINE`, `CANDIDATE`, or `TIE`.
3. Use three independent judges.
4. Compute the per-scenario majority verdict.
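
The majority step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `majority_verdict` is hypothetical, and treating a 1-1-1 split as `TIE` is an assumption, since the protocol above does not say how a split panel resolves.

```python
from collections import Counter

VERDICTS = ("BASELINE", "CANDIDATE", "TIE")


def majority_verdict(votes):
    """Return the label chosen by a strict majority of judges, else "TIE"."""
    if not votes:
        raise ValueError("at least one judge vote is required")
    unknown = [v for v in votes if v not in VERDICTS]
    if unknown:
        raise ValueError(f"unknown verdict(s): {unknown}")
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: a split with no strict majority (e.g. 1-1-1) counts as a TIE.
    return label if count > len(votes) / 2 else "TIE"
```

With three judges, two matching votes are enough to settle a scenario; any other split falls back to `TIE` under the assumption stated above.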

## Consequences

- Relative comparison reduces absolute-score calibration noise.
- Three independent judges reduce single-judge bias.
- Judge results are easier for maintainers to inspect.
- The protocol costs more than a single-judge run, so it should be used selectively in CI.
- Numeric scores can still be produced for reporting, but they are not the primary visual decision mechanism.
34 changes: 34 additions & 0 deletions .architecture/adr0003-deterministic-decision-policy.md
@@ -0,0 +1,34 @@
# ADR-0003: Separate Deterministic Decision Policy From Report Scores

## Status

Accepted

## Context

Evaluation systems often mix LLM judgment, programmatic checks, and numeric aggregate scores. That can make decisions hard to audit. For Cesium AI evaluation, maintainers need to know whether a candidate was kept because it passed critical correctness gates, won visual comparisons, or merely improved a weighted average.

The desired pattern is:

```text
structured evidence -> deterministic decision -> report score
```

## Decision

Use deterministic gates and rule-based decisions as the source of truth:

1. Deterministic checks produce pass/fail facts.
2. Pairwise judges produce structured verdicts.
3. Critical-regression failures block acceptance.
4. Majority verdicts decide visual wins, losses, and ties.
5. Numeric report scores are derived afterward for dashboards and trends.

Manual override is allowed only when a maintainer records the reason.
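
One way to picture the separation between decision and score is the sketch below. Every name here (`ScenarioResult`, `decide`, `report_score`) is hypothetical, and the keep rule (visual wins must outnumber losses) and the score weights are assumptions chosen for illustration; the ADR fixes only the ordering of gates, verdicts, and scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    scenario_id: str
    critical_checks_passed: bool  # deterministic gate: a pass/fail fact
    verdict: str                  # majority verdict: BASELINE, CANDIDATE, or TIE


def decide(results, override=None):
    """Return (decision, basis). `override` is an optional (decision, reason) pair."""
    if override is not None:
        decision, reason = override
        if not reason.strip():
            raise ValueError("a manual override must record a reason")
        return decision, f"manual override: {reason}"
    blocked = [r.scenario_id for r in results if not r.critical_checks_passed]
    if blocked:
        # Critical regressions block acceptance regardless of visual wins.
        return "DISCARD", f"critical regression in {blocked}"
    wins = sum(r.verdict == "CANDIDATE" for r in results)
    losses = sum(r.verdict == "BASELINE" for r in results)
    # Assumption: keep only when visual wins outnumber losses.
    decision = "KEEP" if wins > losses else "DISCARD"
    return decision, f"{wins} wins vs {losses} losses"


def report_score(results):
    """Dashboard-only score, derived afterward; decide() never reads it."""
    if not results:
        return 0.0
    points = {"CANDIDATE": 1.0, "TIE": 0.5, "BASELINE": 0.0}  # illustrative weights
    return sum(points[r.verdict] for r in results) / len(results)
```

Because `decide` never calls `report_score`, the weights can change without altering any keep or discard outcome, which is the property this ADR is after.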

## Consequences

- Decisions are explainable and auditable.
- Dashboards can still show scores without hiding the decision basis.
- Weighted formulas can evolve without changing the core keep/discard semantics.
- More metadata must be stored so decisions can be reconstructed later.
25 changes: 25 additions & 0 deletions .architecture/adr0004-browser-visual-evaluation.md
@@ -0,0 +1,25 @@
# ADR-0004: Use Browser-Backed Visual Evaluation

## Status

Accepted

## Context

Many CesiumJS failures cannot be detected from generated code alone. Examples include wrong camera distance, reversed longitude sign, missing terrain, invisible imagery, flat city scenes where 3D buildings were expected, or asynchronous loading that never completes.

Code-pattern checks are necessary but insufficient.

## Decision

Use a browser-backed Cesium evaluation environment for CesiumJS scenarios.

The runner should execute generated code in a controlled page, capture console output, page errors, screenshots, and, where useful, scene state. Visual scenarios should use recognizable landmarks or controlled fixtures to make judgment reliable.
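
A minimal sketch of such a runner, using Playwright's synchronous Python API, might look like this. The `Evidence` container, the function names, the viewport size, and the fixed settle delay are all assumptions made for illustration; the actual runner may collect evidence differently.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    console: list = field(default_factory=list)      # (type, text) pairs
    page_errors: list = field(default_factory=list)  # uncaught exception messages
    screenshot_path: str = ""


def has_runtime_errors(evidence):
    """True when the page threw or logged at console 'error' level."""
    return bool(evidence.page_errors) or any(
        kind == "error" for kind, _ in evidence.console
    )


def run_scenario(html, screenshot_path, settle_ms=5000):
    """Load generated HTML in headless Chromium and collect evidence."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    evidence = Evidence(screenshot_path=screenshot_path)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # A fixed viewport keeps screenshots comparable across runs.
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.on("console", lambda msg: evidence.console.append((msg.type, msg.text)))
        page.on("pageerror", lambda err: evidence.page_errors.append(str(err)))
        page.set_content(html, wait_until="load")
        # Assumption: a fixed settle delay; a real runner might poll scene state.
        page.wait_for_timeout(settle_ms)
        page.screenshot(path=screenshot_path)
        browser.close()
    return evidence
```

A clean console is only a precondition here: a scene can pass `has_runtime_errors` and still be visually wrong, which is why the screenshot is captured in every run.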

## Consequences

- The framework measures rendered behavior, not just code shape.
- Screenshots and scene evidence help maintainers debug failures.
- Browser automation adds setup cost and potential flakiness.
- Scenarios need careful timing and stable viewport settings.
- Some checks should use scene state in addition to screenshots when visual evidence alone is ambiguous.
26 changes: 26 additions & 0 deletions .architecture/adr0005-ci-trigger-policy.md
@@ -0,0 +1,26 @@
# ADR-0005: Use Cost-Aware CI Trigger Policy

## Status

Accepted

## Context

Full LLM-assisted judging is useful but can be expensive and slow. Public contributors need fast feedback on routine changes, while maintainers need deeper evaluation before releases or high-risk changes.

## Decision

Use tiered CI triggers:

- Pull requests run deterministic checks by default.
- Visual judge suites run on scheduled, manual, release, or high-risk triggers.
- Maintainers can request a full judge run when a change plausibly affects generated visual quality.
- CI output must include enough artifacts to reproduce failures locally.
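
The tiering above can be expressed as a small lookup. This is a sketch under stated assumptions: the event names follow GitHub Actions conventions, while `select_lanes`, the lane names, and the `high_risk` flag are hypothetical and stand in for whatever signal a maintainer or label check would supply.

```python
# Assumption: these GitHub Actions event names map to the "scheduled, manual,
# release" triggers named in the policy.
JUDGE_EVENTS = {"schedule", "workflow_dispatch", "release"}


def select_lanes(event_name, high_risk=False):
    """Return the evaluation lanes to run for a CI event."""
    lanes = ["deterministic"]  # always runs, including on pull requests
    if event_name in JUDGE_EVENTS or high_risk:
        lanes.append("visual-judge")
    return lanes
```

Under this sketch a routine pull request runs only the deterministic lane, and the judge lane is added by trigger type or by an explicit high-risk request.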

## Consequences

- Routine PRs remain practical for contributors.
- Expensive judge runs are used where they add the most value.
- Some visual regressions may not be caught immediately on every PR.
- Release and scheduled runs become important safety nets.
- Trigger policy must be documented so contributors know what evidence is required.
26 changes: 26 additions & 0 deletions .architecture/adr0006-public-artifact-policy.md
@@ -0,0 +1,26 @@
# ADR-0006: Commit Public-Safe Evidence Only

## Status

Accepted

## Context

Evaluation traces can include generated HTML, access tokens, local paths, prompts, console output, screenshots, and model metadata. Some of these artifacts are useful for debugging but unsafe to commit publicly without review.

The framework should be open and inspectable without leaking credentials or private context.

## Decision

Commit public-safe artifacts and ignore raw sensitive traces by default.

Committed artifacts may include scenario manifests, coverage reports, programmatic check summaries, judge verdict summaries, decision records, curated screenshots, and public-safe reports.

Raw generated traces should remain local or CI artifacts unless explicitly reviewed and sanitized.
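
As an illustration of what a pre-commit artifact check could look for, the sketch below flags two kinds of unsafe content. It is not the repository's checker: `find_unsafe` and both patterns are assumptions, and a real policy would need broader coverage plus a dedicated secret scanner.

```python
import re

# Illustrative patterns only; both are assumptions, not the project's rules.
PATTERNS = {
    # Three dot-separated base64url segments beginning with "eyJ" (JWT-shaped).
    "jwt-like token": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
    # A user home directory on Linux, macOS, or Windows.
    "absolute home path": re.compile(
        r"(?:/home/|/Users/|[A-Za-z]:\\Users\\)[^\s/\\]+"
    ),
}


def find_unsafe(text):
    """Return the sorted names of patterns that match the artifact text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))
```

An empty result means only that these two patterns did not match; it does not replace review or sanitization.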

## Consequences

- The public repository stays safer and easier to review.
- Maintainers may need to regenerate raw traces locally for deep debugging.
- Secret scanning remains mandatory before publication.
- Reports should use relative paths and avoid private environment details.
4 changes: 2 additions & 2 deletions .claude-plugin/marketplace.json
@@ -4,13 +4,13 @@
     "name": "CesiumGS"
   },
   "metadata": {
-    "description": "CesiumJS domain skills covering ~535 public symbols across the CesiumJS v1.139 API surface"
+    "description": "CesiumJS domain skills covering ~550 public symbols across the CesiumJS v1.142 API surface"
   },
   "plugins": [
     {
       "name": "cesiumjs-skills",
       "source": "./",
-      "description": "14 domain skills for CesiumJS development with SessionStart hook and Chrome DevTools MCP integration",
+      "description": "14 domain skills for CesiumJS development with Chrome DevTools MCP integration",
       "version": "0.3.0",
       "category": "development",
       "tags": ["cesium", "cesiumjs", "3d", "geospatial", "gis", "maps"],
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug-report.yml
@@ -53,7 +53,7 @@ body:
       label: Branch
       options:
         - main
-        - eval-and-optimization
+        - feat/eval-and-optimization
         - Other (please note in Context)
     validations:
       required: true
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/config.yml
@@ -16,5 +16,5 @@ contact_links:
     url: https://docs.claude.com/en/docs/claude-code/overview
     about: Questions about Claude Code, plugins, or the Skill API belong with Anthropic's docs.
   - name: Eval and optimization branch
-    url: https://github.com/CesiumGS/cesiumjs-skills/tree/eval-and-optimization
+    url: https://github.com/CesiumGS/cesiumjs-skills/tree/feat/eval-and-optimization
     about: Browse the eval suite, methodology, scoring history, and curated screenshot gallery.
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/feature-request.yml
@@ -41,7 +41,7 @@ body:
     attributes:
       label: Area
       description: Which skill, file, or branch this affects. Paths welcome.
-      placeholder: e.g. "skills/cesiumjs-entities/SKILL.md" or "tuning/tools/run_eval_suite.py"
+      placeholder: e.g. "skills/cesiumjs-entities/SKILL.md" or "optimization/scripts/run-all-evals.py"
     validations:
       required: false

@@ -97,5 +97,5 @@ body:
       options:
         - label: I searched existing issues and PRs for duplicates.
           required: true
-        - label: For new skills, I checked `docs/DOMAINS.md` to confirm the area is not already covered by an existing skill.
+        - label: For new skills, I checked `wiki/Domain-Mapping.md` to confirm the area is not already covered by an existing skill.
           required: false
6 changes: 3 additions & 3 deletions .github/ISSUE_TEMPLATE/feedback-or-rfc.yml
@@ -13,7 +13,7 @@ body:
   - type: markdown
     attributes:
       value: |
-        Use this form for feedback on the skills in `main`, the eval and optimization suite on `eval-and-optimization`, the methodology, or the runner. For a likely bug in a specific skill, add the `bug` label.
+        Use this form for feedback on the skills in `main`, the eval and optimization suite on `feat/eval-and-optimization`, the methodology, or the runner. For a likely bug in a specific skill, add the `bug` label.

   - type: input
     id: summary
@@ -43,7 +43,7 @@ body:
     attributes:
       label: Area affected
       description: Which skill, file, or branch is this about? Paths welcome.
-      placeholder: e.g. "tuning/cesiumjs-camera/evals/" or "skills/cesiumjs-imagery/SKILL.md"
+      placeholder: e.g. "evaluation/cases/cesiumjs-camera/" or "skills/cesiumjs-imagery/SKILL.md"
     validations:
       required: false

@@ -86,5 +86,5 @@ body:
       options:
         - label: I searched existing issues for duplicates.
           required: true
-        - label: If proposing an eval scenario, I followed the schema in `tuning/<skill>/evals/eval-NNN-*.json`.
+        - label: If proposing an eval scenario, I followed the schema in `evaluation/cases/<skill>/eval-NNN-*.json`.
           required: false
137 changes: 137 additions & 0 deletions .github/workflows/baseline-audit.yml
@@ -0,0 +1,137 @@
name: Baseline Audit (deterministic + qualitative)

# Two-lane baseline audit. The deterministic lane is the binding PR gate and
# runs over tracked evidence fixtures (no secrets, fast). The qualitative lane
# (static 0-10 visual judge) needs rendered baselines, which live under the
# gitignored optimization/runs/, so it renders them first and runs nightly /
# on demand, posting an advisory result. Both lanes share ONE entrypoint:
# evaluation/scripts/run-baseline-audit.py.

on:
  pull_request:
    paths:
      - 'skills/**'
      - 'evaluation/**'
      - 'optimization/scenarios/**'
      - '.github/workflows/baseline-audit.yml'
  push:
    branches: [main]
    paths:
      - 'skills/**'
      - 'evaluation/**'
  workflow_dispatch:
    inputs:
      skills:
        description: "Skills to audit ('all' or comma-separated)"
        required: false
        default: 'all'
      judge_model:
        description: 'Judge model alias'
        required: false
        default: 'sonnet'
      n_judges:
        description: 'Panel size'
        required: false
        default: '3'
  schedule:
    - cron: '0 3 * * *' # nightly qualitative audit

jobs:
  # ----- Lane 1: deterministic (blocking PR gate, no secrets) -----
  deterministic:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: pip install -r requirements.txt
      - name: Enforce eval/optimization import boundary
        run: python3 evaluation/scripts/validate-evaluation.py
      - name: Unit tests (judge + audit, FakeAdapter; CI-safe)
        run: python3 -m pytest -q evaluation/tests
      - name: Deterministic baseline audit over tracked evidence (binding gate)
        run: |
          python3 evaluation/scripts/run-baseline-audit.py \
            --skills "${{ github.event.inputs.skills || 'all' }}" \
            --no-judge \
            --output-dir evaluation/artifacts/audits/ci-deterministic
      - name: Upload deterministic scorecard
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-audit-deterministic-${{ github.run_number }}
          path: evaluation/artifacts/audits/ci-deterministic/**/scorecard.*

  # ----- Lane 2: qualitative (nightly / dispatch, advisory) -----
  qualitative:
    needs: deterministic
    if: ${{ github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - run: |
          pip install -r requirements.txt
          playwright install chromium
          playwright install-deps chromium
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - name: Install Claude CLI
        run: npm install -g @anthropic-ai/claude-code && claude --version
      - name: Validate secrets
        env:
          CESIUM_ION_TOKEN: ${{ secrets.CESIUM_ION_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          test -n "$CESIUM_ION_TOKEN" || { echo "CESIUM_ION_TOKEN missing"; exit 1; }
          test -n "$ANTHROPIC_API_KEY" || { echo "ANTHROPIC_API_KEY missing"; exit 1; }
      - name: Render current-best baselines (gitignored; needed for the visual lane)
        env:
          CESIUM_ION_TOKEN: ${{ secrets.CESIUM_ION_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # The optimization loop's baseline guard renders skills/<skill>/SKILL.md
          # into optimization/runs/<skill>/baseline (1 iteration is enough to
          # establish the baseline bundle the audit judges).
          SKILLS="${{ github.event.inputs.skills || 'all' }}"
          python3 optimization/scripts/run-all-evals.py --skills "$SKILLS" --max-iterations 1 --stop-on max || true
      - name: Qualitative + deterministic baseline audit (advisory)
        id: audit
        continue-on-error: true
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python3 evaluation/scripts/run-baseline-audit.py \
            --skills "${{ github.event.inputs.skills || 'all' }}" \
            --judge-model "${{ github.event.inputs.judge_model || 'sonnet' }}" \
            --n-judges "${{ github.event.inputs.n_judges || '3' }}" \
            --output-dir evaluation/artifacts/audits/ci
      - name: Sanitize before upload
        if: always()
        run: |
          python3 optimization/scripts/check-public-artifacts.py
          bash optimization/scripts/check-secrets.sh
          find optimization/runs/ evaluation/artifacts/ -name '*.html' -delete || true
      - name: Build audit dashboard
        if: always()
        run: |
          SC=$(ls -t evaluation/artifacts/audits/ci/*/scorecard.json 2>/dev/null | head -1)
          if [ -n "$SC" ]; then python3 evaluation/scripts/build-audit-ui.py "$SC" --output evaluation/artifacts/audits/ci/audit.html; fi
      - name: Upload audit scorecard + dashboard
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-audit-qualitative-${{ github.run_number }}
          path: |
            evaluation/artifacts/audits/ci/**/scorecard.*
            evaluation/artifacts/audits/ci/audit.html