84 changes: 84 additions & 0 deletions .github/workflows/agent-compat.yml
@@ -0,0 +1,84 @@
name: agent-compat

# Verify external coding-agent CLIs (Claude Code, Gemini CLI, Antigravity) can drive
# the termchart CLI to push valid boards to a live viewer. Orchestrated by
# agent-generator's benchmark-runner; the real logic lives in
# scripts/experiments/agent-compat/ (run.sh + definitions/).
#
# Runnable in CI (dispatch) and locally via act+podman
# (scripts/experiments/agent-compat/run_locally.sh) — cloud-auth steps are ACT-gated.

on:
  workflow_dispatch:
    inputs:
      backends:
        description: "Space-separated backend sets"
        default: "tc-claude tc-gemini tc-antigravity"
      case_set:
        description: "Case set (pattern ^TC-)"
        default: "termchart-compat"

jobs:
  compat:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write # needed by google-github-actions/auth for Workload Identity Federation
    env:
      AGENT_GENERATOR_DIR: ${{ vars.AGENT_GENERATOR_DIR || '/home/ivanmkc/agent-generator' }}
      GOOGLE_CLOUD_PROJECT: ${{ vars.GCP_PROJECT_ID || 'adk-coding-agents' }}
      ANTHROPIC_VERTEX_PROJECT_ID: ${{ vars.GCP_PROJECT_ID || 'adk-coding-agents' }}
      CASE_SET: ${{ inputs.case_set }}
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with: { node-version: "20" }
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }

      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh

      - name: Install agent CLIs + termchart
        run: |
          npm i -g @anthropic-ai/claude-code @google/gemini-cli @ivanmkc/termchart
          # Antigravity (agy) is not public; run_locally.sh bind-mounts the host binary
          # under act. In hosted CI the tc-antigravity backend is skipped (agy absent).

      - name: Install benchmark-runner (agent-generator)
        run: |
          if [ -d "$AGENT_GENERATOR_DIR" ]; then
            (cd "$AGENT_GENERATOR_DIR" && uv tool install --prerelease allow .)
          else
            # This branch fails the step, so report it as an error rather than a warning.
            echo "::error::AGENT_GENERATOR_DIR not present; provide agent-generator (bind-mount under act, or clone in CI)"
            exit 1
          fi
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"

      - name: Install skills into the coding agents (agents-cli setup)
        run: |
          # Installs skills into ~/.claude/skills, ~/.gemini/extensions, ~/.agents/skills,
          # the locations the simulator sandbox mounts. termchart's own skills are then
          # installed on top by run.sh (from plugin/skills/), since termchart ships them as
          # a plugin that doesn't land in those dirs.
          if command -v agents-cli >/dev/null 2>&1; then
            agents-cli setup --skip-auth || echo "::warning::agents-cli setup failed (continuing; run.sh still installs termchart skills)"
          else
            echo "::warning::agents-cli not found; skipping general skill setup (run.sh still installs termchart skills)"
          fi

      # Hosted CI: authenticate to GCP via WIF (skipped under act, which uses the mounted ADC).
      - name: GCP auth (CI only)
        if: ${{ !env.ACT }}
        uses: google-github-actions/auth@v3
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.WIF_SERVICE_ACCOUNT }}

      # Local act: verify a host ADC was mounted (models are Vertex-routed).
      - name: Verify ADC (act only)
        if: ${{ env.ACT }}
        run: |
          test -f "$HOME/.config/gcloud/application_default_credentials.json" \
            || { echo "Mount host ADC: run via run_locally.sh"; exit 1; }

      - name: Run agent-compat
        env:
          BACKENDS: ${{ inputs.backends }}
        # The input goes through env instead of being interpolated into the script text.
        # $BACKENDS is deliberately unquoted so it splits into one argument per backend.
        run: bash scripts/experiments/agent-compat/run.sh $BACKENDS
85 changes: 85 additions & 0 deletions scripts/experiments/agent-compat/README.md
@@ -0,0 +1,85 @@
# agent-compat — do Claude Code, Gemini CLI & Antigravity play nice with termchart?

Runs each external coding-agent CLI **headless** on termchart tasks and verifies, via a
live viewer, that each one:

1. **can drive the `termchart` CLI** to push a valid board (smoke cases), and
2. **activates the right journey/recipe** — given a *scenario only* (never told the
diagram type), does it pick the correct termchart type for the job? (journey cases)

Orchestrated by [agent-generator]'s `benchmark-runner` (interactive-simulation cases,
`output_format: direct` — prompts go straight to the CLI backend).

## How it works

```
run.sh
 ├─ build + run the repo's termchart CLI: termchart serve → local viewer, capture URL+token
 ├─ PATH wrappers (under $HOME, NOT /tmp; the sandbox tmpfs-masks /tmp):
 │    termchart   → inject viewer URL/token, exec the repo CLI
 │    antigravity → exec `agy --dangerously-skip-permissions` (harness calls `antigravity -p …`; agy -p == --print)
 ├─ for each backend {claude, gemini-cli, antigravity}:
 │    (antigravity: preflight agy auth under a fresh HOME → SKIP if it can't auth)
 │    termchart clear --all → fresh viewer state
 │    benchmark-runner --config-dir definitions --case-set <set> --generator-set <backend>
 └─ summary (PASS / FAIL / SKIP per backend)
```
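
The per-backend loop and PASS / FAIL / SKIP summary in the outline above could be sketched roughly like this. It is an illustration only, not the contents of `run.sh`: `preflight_ok`, `run_backend`, `summarize`, and the `FAILING_BACKEND` variable are hypothetical placeholders standing in for the real auth preflight and `benchmark-runner` invocation.

```bash
#!/usr/bin/env bash
# Sketch of a per-backend loop that records PASS / FAIL / SKIP.
# Every function here is a hypothetical placeholder, not code from run.sh.

preflight_ok() {
  # Placeholder: pretend only tc-antigravity fails its auth preflight.
  [ "$1" != "tc-antigravity" ]
}

run_backend() {
  # Placeholder for the real benchmark-runner call.
  # Succeeds unless the backend is named in FAILING_BACKEND.
  [ "$1" != "${FAILING_BACKEND:-}" ]
}

summarize() {
  local backend status
  for backend in "$@"; do
    if ! preflight_ok "$backend"; then
      status=SKIP        # could not even start: not counted as a failure
    elif run_backend "$backend"; then
      status=PASS
    else
      status=FAIL
    fi
    printf '%s %s\n' "$backend" "$status"
  done
}

summarize tc-claude tc-gemini tc-antigravity
```

Keeping SKIP distinct from FAIL is the point of the design: a backend that cannot authenticate says nothing about whether it can drive termchart.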

**Why a `termchart` wrapper, not env vars:** agent-generator's sandbox forwards `PATH`
to the agent but not arbitrary env (`TERMCHART_VIEWER_URL/TOKEN`). Baking the viewer
config into a PATH wrapper lets every agent run `termchart push` with zero credential
handling. It must live under `$HOME` (ro-bound into the sandbox), not `/tmp` (the sandbox
mounts a fresh tmpfs over `/tmp`, which would hide it).
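
A wrapper of this kind could be generated along the following lines. This is a sketch under assumptions, not the wrapper `run.sh` actually writes: `make_wrapper`, the stand-in `real-cli`, the URL, and the token are all made up for the demo, the variable names `TERMCHART_VIEWER_URL` and `TERMCHART_VIEWER_TOKEN` are read off the shorthand above, and the real wrapper may inject the settings differently. The demo writes to a temp dir only so it is safe to run anywhere; per the paragraph above, the real wrapper would need to live under `$HOME`.

```bash
#!/usr/bin/env bash
# Sketch: generate a PATH wrapper that bakes viewer settings into every call.
# Names, paths and values are illustrative assumptions.

make_wrapper() {
  # $1 = wrapper dir, $2 = path to the real CLI, $3 = viewer URL, $4 = token
  # (values containing single quotes are not handled in this sketch)
  mkdir -p "$1"
  cat > "$1/termchart" <<EOF
#!/usr/bin/env bash
export TERMCHART_VIEWER_URL='$3'
export TERMCHART_VIEWER_TOKEN='$4'
exec '$2' "\$@"
EOF
  chmod +x "$1/termchart"
}

# Demo with a fake CLI that just echoes what it received.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
cat > "$demo_dir/real-cli" <<'EOF'
#!/usr/bin/env bash
echo "url=$TERMCHART_VIEWER_URL args=$*"
EOF
chmod +x "$demo_dir/real-cli"

make_wrapper "$demo_dir/bin" "$demo_dir/real-cli" "http://127.0.0.1:8080" "demo-token"

# Only PATH is changed for the caller; the wrapper supplies the rest.
PATH="$demo_dir/bin:$PATH" termchart push --type flow
```

The caller's environment carries nothing but `PATH`, which is the one thing the sandbox is said to forward.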

**Verification is deterministic, scope-specific, and trace-independent.** Each case pins
a unique scope (`--project … --agent <case>`) but lets the agent choose the `--type`;
a `command` objective then checks, via `termchart list`, that *that scope's* board has
the prescribed type (e.g. `/compare …[component]`). Because the check is scope-specific,
one case's board cannot satisfy another case's check, so there is no cross-case
contamination and there are no false passes. Because it does not depend on a tool trace,
it is the only fair way to score **Antigravity**, whose harness produces no tool trace
(the `cli_command` objective, which needs a trace, is `required: false`).
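
The idea of a scope-specific, trace-independent check can be sketched as a filter over listing text. The listing format below is invented for illustration (the real `termchart list` output may look different), and `board_has_type` is a hypothetical helper, not part of termchart or the case definitions.

```bash
#!/usr/bin/env bash
# Sketch: pass only if the line for one scope carries the expected [type].
# The sample listing format is an assumption, not real termchart output.

board_has_type() {
  # $1 = scope text to look for, $2 = expected type; listing is read from stdin.
  # Fixed-string matching (-F) so "[" and "]" are taken literally.
  grep -F -- "$1" | grep -qF -- "[$2]"
}

sample_listing='demo/compare  Compare options  [component]
demo/arch     Service layout   [flow]'

if printf '%s\n' "$sample_listing" | board_has_type "demo/compare" "component"; then
  echo "PASS"
else
  echo "FAIL"
fi
```

Because the check filters on the scope first, a `[component]` board pushed under a different scope cannot make this case pass. A production check would want a stricter match than a substring on the scope.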

## Cases (`definitions/cases/`)

| Case set | Cases | Tests |
|---|---|---|
| `termchart-smoke` | `TC-FLOW-001`, `TC-COMPONENT-001` | Prompt names the type — can the agent drive `termchart push` at all |
| `termchart-journeys` | `TC-JOURNEY-{COMPARE,ARCH,METRICS,ER,DASHBOARD}-001` | **Scenario only** — does the agent pick the right type/journey (component / flow / vegalite / flow / panes\|component) |
| `termchart-compat` | all `TC-*` | smoke + journeys |

Every case seeds one shared `AGENTS.md` "choose the diagram" decision guide (auto-read by
claude/gemini/agy), so all backends have the same recipe knowledge — the test is whether
they **activate** the right journey, not whether they know termchart exists.

## Run it

```bash
scripts/experiments/agent-compat/run.sh # all backends, all cases
CASE_SET=termchart-journeys scripts/experiments/agent-compat/run.sh tc-claude tc-gemini
scripts/experiments/agent-compat/run_locally.sh # via GitHub Actions + act+podman
```

## Requirements

- `claude`, `gemini`, `agy`, `node` on PATH (run.sh warns if any are missing).
- `benchmark-runner` (`cd <agent-generator> && uv tool install --prerelease allow .`) or `uv` on PATH.
- GCP ADC — models are Vertex-routed (`gcloud auth application-default login`).
- The repo's termchart CLI + viewer are built automatically on first run.
- `AGENT_GENERATOR_DIR` pointing at a local agent-generator checkout (default `/home/ivanmkc/agent-generator`).
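
A warn-if-missing check for the tools listed above might look like the following. This is a generic sketch using `command -v`, not the check `run.sh` performs; `missing_tools` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: report which required tools are not on PATH, without aborting.

missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v exits non-zero when the name cannot be resolved.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools claude gemini agy node)"
if [ -n "$missing" ]; then
  # $missing is unquoted on purpose so the names print on one line.
  echo "warning: not on PATH:" $missing >&2
fi
```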

## Backend status & caveats

- **Claude Code, Gemini CLI — supported and verified.** Both drive termchart correctly;
auth is forwarded into the sandbox via env (Vertex).
- **Antigravity (agy) — SKIPPED by default (auth limitation).** agy authenticates via a
login stored in `$HOME`, but agent-generator's sandbox forces `HOME=/tmp`, so agy can't
authenticate inside a case (its creds are neither ADC- nor env-based, so they can't be
forwarded the way Claude/Gemini creds are). run.sh detects this and marks `tc-antigravity`
**SKIP** instead of a misleading FAIL. The wrapper + deterministic check are correct and
will work once agy's auth is available in the sandbox — e.g. by extending agent-generator's
(stub) antigravity harness `get_sandbox_auth` to mount agy's credential dir, or running
agy's backend outside the sandbox. Its harness also extracts no tool trace (stub), so only
the deterministic viewer check applies to it.
- Add a case by dropping a `TC-*` interactive-simulation YAML in `definitions/cases/`.
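
The Antigravity SKIP decision described above rests on a preflight under a fresh `HOME`. A generic version of that pattern is sketched here; it does not call `agy` (its auth-check command is not documented in this README), so the probe is a stand-in `sh -c` test and `preflight_fresh_home` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: run a probe command with HOME pointing at an empty temp dir,
# mimicking a sandbox that does not expose the user's real home.

preflight_fresh_home() {
  local fresh_home rc=0
  fresh_home="$(mktemp -d)"
  env HOME="$fresh_home" "$@" >/dev/null 2>&1 || rc=$?
  rm -rf "$fresh_home"
  return "$rc"
}

# Stand-in probe: a tool whose login lives in $HOME finds nothing in a fresh one.
if preflight_fresh_home sh -c 'test -f "$HOME/.fake-agent/creds"'; then
  echo "run backend"
else
  echo "SKIP"
fi
```

A probe that fails here would also fail inside the sandbox, so marking the backend SKIP up front avoids reporting an auth problem as a compatibility failure.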

[agent-generator]: https://github.com/ivanmkc/agent-generator