
agent-compat: journey-activation tests for Claude/Gemini/Antigravity vs termchart #223

Draft
ivanmkc wants to merge 8 commits into master from agent-compat

ivanmkc (Owner) commented Jul 3, 2026

What this adds

A harness (scripts/experiments/agent-compat/) that uses agent-generator's
benchmark-runner to drive Claude Code, Gemini CLI, and Antigravity headless
against termchart and check — deterministically, via a live viewer — whether each one
activates the right diagram journey for a scenario (not just "can it push").

  • Smoke cases (termchart-smoke): prompt names the type → can the agent drive termchart push at all.
  • Journey cases (termchart-journeys): scenario only (e.g. "compare 3 laptops", "show our request flow", "plot MAU growth") → does the agent pick the correct --type (component / flow / vegalite / flow / panes)? Shared AGENTS.md decision guide; scope-specific, trace-independent verify (the only fair way to score Antigravity, which produces no trace).
  • Runner (run.sh): builds and serves the repo viewer, PATH-wraps termchart (injecting viewer creds, since the sandbox forwards PATH but not env) and antigravity→agy, clears the viewer per backend, and preflights agy, SKIPping it when its auth can't survive the sandbox.
  • agent-compat.yml + run_locally.sh: runnable in CI or locally via act+podman (cloud-auth steps ACT-gated).
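
The trace-independent verify can be pictured as a check on viewer state alone. This is a minimal sketch, assuming the viewer can return the stored boards as records with a scope and a type. The `Board` shape, the scope names, and `verify_journey` are hypothetical illustrations, not the harness's actual code.

```python
from dataclasses import dataclass


@dataclass
class Board:
    """Hypothetical shape of a board as read back from the viewer."""
    scope: str  # which test case the board belongs to
    type: str   # the --type the agent passed to `termchart push`


def verify_journey(boards, scope, expected_type):
    """Pass only if the newest board in `scope` has the expected type.

    Only viewer state is inspected, so an agent that produces no trace
    is scored exactly like one that does.
    """
    in_scope = [b for b in boards if b.scope == scope]
    if not in_scope:
        return False  # the agent never pushed anything for this case
    return in_scope[-1].type == expected_type


boards = [
    Board(scope="compare-laptops", type="comparison"),
    Board(scope="mau-growth", type="vegalite"),
]
print(verify_journey(boards, "compare-laptops", "component"))  # False
print(verify_journey(boards, "mau-growth", "vegalite"))        # True
```

Scoping the check per case is what lets the runner clear the viewer between backends and still attribute each board to one scenario.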

First results (RESULTS.md)

| Backend | Smoke | Journey activation |
|---|---|---|
| Claude Code | 2/2 ✅ | 0/5: uses Mermaid/semantic type names (flowchart, erDiagram, table, line) |
| Gemini CLI | 2/2 ✅ | 2/5: right for vegalite/panes, wrong (comparison, er, mermaid) elsewhere |
| Antigravity | SKIP | SKIP: agy auth can't survive the sandbox HOME=/tmp |

Key findings:

  1. Neither CLI reliably activates termchart's real --type vocabulary from a scenario prompt → the journeys are mostly not being used correctly.
  2. termchart push silently accepts unknown --type values (exit 0, unrenderable board) → agents get no corrective feedback. Validating --type in push is the highest-leverage fix and would let any agent self-correct.
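
Finding 2 amounts to rejecting unknown types with a non-zero exit and a message the agent can act on. A minimal sketch of that idea follows, assuming a vocabulary limited to the types named in this PR (the real list may be longer). `validate_type` and `push` are hypothetical stand-ins, not termchart's implementation.

```python
import sys

# Assumed vocabulary: only the types mentioned in this PR.
VALID_TYPES = {"component", "flow", "vegalite", "panes", "mermaid"}


def validate_type(board_type):
    """Return (ok, message); message is an actionable error when ok is False."""
    if board_type in VALID_TYPES:
        return True, ""
    valid = ", ".join(sorted(VALID_TYPES))
    return False, f"unknown --type '{board_type}'; valid types: {valid}"


def push(board_type):
    """Hypothetical stand-in for `termchart push`; returns a process exit code."""
    ok, message = validate_type(board_type)
    if not ok:
        print(message, file=sys.stderr)
        return 2  # non-zero, so the agent sees a failure and can retry
    return 0


print(push("flow"))       # 0
print(push("erDiagram"))  # 2
```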

Notes

  • Draft. This is an experiment/harness under scripts/experiments/, plus one CI workflow; no product code changed.
  • Requires agent-generator + GCP ADC (Vertex-routed models). See README.

…y vs termchart

Uses agent-generator's benchmark-runner to drive each external coding-agent
CLI headless on termchart tasks and verify (via a local viewer, deterministically)
whether it activates the right diagram journey for a scenario.

- definitions/: smoke cases (can it push) + journey cases (scenario-only, does it
  pick the right --type). Shared AGENTS.md decision guide; scope-specific,
  trace-independent verify (fair to Antigravity which has no trace).
- run.sh: builds+serves the repo viewer, PATH-wraps termchart (inject viewer creds)
  and antigravity->agy, clears per backend, loops backends, preflights+SKIPs agy
  (its $HOME login can't survive the sandbox's HOME=/tmp).
- agent-compat.yml + run_locally.sh: run in CI or locally via act+podman.
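
The PATH-wrapping trick can be sketched as a shim that sits ahead of the real binary on PATH, exports the credentials, and then execs the real binary. This sketch assumes a POSIX shell is available. The variable name `VIEWER_TOKEN`, the `termchart-real` file, and `write_wrapper` are hypothetical and are not taken from run.sh.

```python
import os
import shlex
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_wrapper(wrapper_dir, name, real_binary, env):
    """Write a shell shim `name` that exports `env`, then execs `real_binary`."""
    lines = ["#!/bin/sh"]
    for key, value in env.items():
        lines.append(f"export {key}={shlex.quote(value)}")
    lines.append(f'exec {shlex.quote(str(real_binary))} "$@"')
    shim = Path(wrapper_dir) / name
    shim.write_text("\n".join(lines) + "\n")
    shim.chmod(0o755)
    return shim


with tempfile.TemporaryDirectory() as tmp:
    # Fake "real" binary that just reports what it received.
    real = Path(tmp) / "termchart-real"
    real.write_text('#!/bin/sh\necho "token=$VIEWER_TOKEN args=$*"\n')
    real.chmod(0o755)

    shim_dir = Path(tmp) / "bin"
    shim_dir.mkdir()
    write_wrapper(shim_dir, "termchart", real, {"VIEWER_TOKEN": "dummy"})

    # Simulate a sandbox that forwards PATH but no other environment.
    sandbox_env = {"PATH": f"{shim_dir}{os.pathsep}{os.environ.get('PATH', '')}"}
    resolved = shutil.which("termchart", path=sandbox_env["PATH"])
    result = subprocess.run(
        [resolved, "push"], env=sandbox_env, capture_output=True, text=True
    )
    output = result.stdout.strip()

print(output)  # token=dummy args=push
```

Because the credential lives inside the shim rather than in the environment, it still reaches the real binary when only PATH is forwarded.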

First results (RESULTS.md): Gemini 2/5, Claude 0/5 journey activation; both mis-use
termchart's --type vocabulary and `push` silently accepts invalid types (no
corrective feedback). Antigravity SKIP (auth).
… works, journey selection still the gap

#219 (push --type validation) merged into master. Re-ran journeys: agents now
get an actionable error on an invalid --type and retry with a VALID type every
time (no more silently-stored unrenderable boards — finding #2 fixed). Scores
stay ~1/5 because agents retry to a valid-but-wrong type (Claude falls back to
mermaid); picking the RIGHT journey is a recipe-activation gap, not validation.
…), 2/5->3/5 (Gemini)

The #229 did-you-mean/intent hint closes the journey-selection gap: agents now
self-correct to the RIGHT type (comparison->component, architecture->flow, etc.),
not just a valid one. Only ER residual (agents prefer mermaid's native erDiagram).
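
A did-you-mean hint of this kind could combine an intent alias table with fuzzy matching. The sketch below is one way to do it, not the #229 implementation. The two alias pairs come from the message above; the type list is an assumed partial vocabulary, and `suggest_type` and `error_message` are hypothetical names.

```python
import difflib

# Assumed, partial vocabulary: only the types mentioned in this PR.
VALID_TYPES = ["component", "flow", "vegalite", "panes", "mermaid"]

# Intent aliases named in the commit message; a real table would cover more.
INTENT_ALIASES = {"comparison": "component", "architecture": "flow"}


def suggest_type(bad_type):
    """Return a suggested valid type for an invalid one, or None."""
    key = bad_type.lower()
    if key in INTENT_ALIASES:
        return INTENT_ALIASES[key]
    # Fall back to spelling similarity for near-misses such as "flowchart".
    close = difflib.get_close_matches(key, VALID_TYPES, n=1, cutoff=0.6)
    return close[0] if close else None


def error_message(bad_type):
    """Build the error text, appending a hint only when one exists."""
    hint = suggest_type(bad_type)
    base = f"unknown --type '{bad_type}'"
    return f"{base}; did you mean '{hint}'?" if hint else base


print(error_message("comparison"))  # unknown --type 'comparison'; did you mean 'component'?
print(error_message("flowchart"))   # unknown --type 'flowchart'; did you mean 'flow'?
```

Aliases are checked first because intent words such as "comparison" are not close in spelling to the type they should map to.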
… npx skills

Agents previously only had the seeded AGENTS.md — termchart's real skills
(diagram-recipes etc.) ship as a plugin under ~/.claude/plugins, which the
simulator sandbox does NOT mount. Install them with the Vercel `npx skills add
plugin/skills -g --all` CLI (same mechanism agents-cli setup uses) so they land
in the canonical ~/.agents/skills store with the per-agent symlinks the sandbox
mounts. Workflow also runs `agents-cli setup --skip-auth` for the general bundle.
…o-activate them

Installing termchart's skills into the sandbox (diagram-recipes etc.) did NOT
improve journey activation (Claude 1/5, Gemini 2/5) — transcripts show Claude
never read the skill (read-skill=False) and kept using Mermaid keywords. Passive
availability != activation. The #229 in-error did-you-mean hint (4/5) remains the
effective lever because it lands where the agent is already looking.
…mchart skill mis-activates

Traces show Claude DOES call the Skill tool; it activates the 'termchart' terminal
(Mermaid->ASCII) skill, not diagram-recipes, and gets anchored on Mermaid types.
Removing it and leaving only diagram-recipes still drifts to mermaid because the
Mermaid-terminal identity pervades termchart (CLI --help tagline etc.). run.sh now
installs only the viewer skill; the durable fix is deprecating the terminal surface.
….md) + self-contained shapes

Root cause of Claude's low journey scores was NOT skill quality/model: Claude Code reads
CLAUDE.md, but the harness seeded the guide only as AGENTS.md (Gemini's file). Claude ran
blind and fell back to mermaid 'graph LR' under --type flow. Fixes: seed CLAUDE.md too +
make the guide self-contained (exact per-type JSON + 'flow content is JSON not mermaid').
Now one-shot with NO error hint: Claude haiku 5/5, Claude sonnet 5/5, Gemini 5/5
(was 1/5, ~1/5, 2/5). Adds tc_claude_sonnet generator + SKIP_SKILL_INSTALL for A/B.
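
The seeding fix boils down to writing the same guide under every file name an agent looks for. A minimal sketch, assuming the guide is a single text blob; `seed_guide` and the placeholder guide text are hypothetical, and the harness's real seeding step may differ.

```python
import tempfile
from pathlib import Path

# Per the message above: Claude Code reads CLAUDE.md, while the harness
# had seeded the guide only as AGENTS.md.
GUIDE_FILES = ("AGENTS.md", "CLAUDE.md")


def seed_guide(workspace, guide_text):
    """Write the same decision guide under every name in GUIDE_FILES."""
    workspace = Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    written = []
    for name in GUIDE_FILES:
        path = workspace / name
        path.write_text(guide_text, encoding="utf-8")
        written.append(path)
    return written


guide = "For --type flow, the content is JSON, not Mermaid.\n"  # placeholder text
with tempfile.TemporaryDirectory() as tmp:
    paths = seed_guide(tmp, guide)
    names = sorted(p.name for p in paths)
    same = all(p.read_text(encoding="utf-8") == guide for p in paths)

print(names, same)  # ['AGENTS.md', 'CLAUDE.md'] True
```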