agent-compat: journey-activation tests for Claude/Gemini/Antigravity vs termchart by ivanmkc · Pull Request #223 · ivanmkc/termchart

ivanmkc · 2026-07-03T23:10:02Z

What this adds

A harness (scripts/experiments/agent-compat/) that uses agent-generator's
benchmark-runner to drive Claude Code, Gemini CLI, and Antigravity headless
against termchart and check — deterministically, via a live viewer — whether each one
activates the right diagram journey for a scenario (not just "can it push").

- Smoke cases (termchart-smoke): the prompt names the type → can the agent drive termchart push at all?
- Journey cases (termchart-journeys): the prompt gives only a scenario (e.g. "compare 3 laptops", "show our request flow", "plot MAU growth") → does the agent pick the correct --type (component / flow / vegalite / flow / panes)? All cases share an AGENTS.md decision guide; verification is scope-specific and trace-independent (the only fair way to score Antigravity, which produces no trace).
- Runner (run.sh): builds and serves the repo viewer, PATH-wraps termchart (to inject viewer creds, since the sandbox forwards PATH but not env) and antigravity→agy, clears the viewer per backend, and preflights agy, SKIPping it when its auth can't survive the sandbox.
- agent-compat.yml + run_locally.sh: runnable in CI or locally via act+podman (cloud-auth steps are ACT-gated).
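The trace-independent check described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual verify code: `JourneyCase`, `score_journeys`, the case names, and the shape of the observed viewer state are all hypothetical, and the scenario-to-type pairing is inferred from the order in which the list above names them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyCase:
    name: str           # hypothetical case id
    expected_type: str  # the --type the scenario should activate


def score_journeys(cases, pushed_types):
    """Return (passed, total).

    Only the board type found in the viewer is compared against the
    expected type; no agent trace is consulted, so an agent that emits
    no trace can be scored the same way as one that does.
    """
    passed = sum(1 for c in cases if pushed_types.get(c.name) == c.expected_type)
    return passed, len(cases)


cases = [
    JourneyCase("compare-laptops", "component"),
    JourneyCase("request-flow", "flow"),
    JourneyCase("mau-growth", "vegalite"),
]

# Hypothetical viewer state after one run: case name -> board type found.
observed = {
    "compare-laptops": "table",   # a semantic name, not the expected type
    "request-flow": "flow",
    "mau-growth": "vegalite",
}

print(score_journeys(cases, observed))
```

A case with no board in the viewer simply fails, because `pushed_types.get` returns `None`.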

First results (`RESULTS.md`)

| | Smoke | Journey activation |
|---|---|---|
| Claude Code | 2/2 ✅ | 0/5: uses Mermaid/semantic type names (`flowchart`, `erDiagram`, `table`, `line`) |
| Gemini CLI | 2/2 ✅ | 2/5: right for `vegalite`/`panes`, wrong (`comparison`, `er`, `mermaid`) elsewhere |
| Antigravity | SKIP | SKIP: `agy` auth can't survive the sandbox `HOME=/tmp` |

Key findings:

1. Neither CLI reliably activates termchart's real --type vocabulary from a scenario prompt → the journeys are mostly not being used correctly.
2. termchart push silently accepts unknown --type values (exit 0, unrenderable board) → agents get no corrective feedback. Validating --type in push is the highest-leverage fix and would let any agent self-correct.
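The validation proposed in the second finding might look something like the sketch below. It is illustrative only and not termchart's implementation: `KNOWN_TYPES` contains just the type names this PR mentions as accepted (the real vocabulary may be larger), and `validate_type` and its exit code of 2 are assumptions.

```python
# Assumed vocabulary: only the type names this PR refers to as valid.
KNOWN_TYPES = frozenset({"component", "flow", "vegalite", "panes", "mermaid"})


def validate_type(board_type: str) -> tuple[int, str]:
    """Return (exit_code, message); non-zero for an unknown type.

    A non-zero exit plus a message listing the valid types is what gives
    an agent something to react to, instead of a silently stored board.
    """
    if board_type in KNOWN_TYPES:
        return 0, ""
    valid = ", ".join(sorted(KNOWN_TYPES))
    return 2, f"error: unknown --type '{board_type}'. Valid types: {valid}"


for candidate in ("flow", "erDiagram"):
    code, message = validate_type(candidate)
    print(candidate, code, message)
```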

Notes

- Draft. This is an experiment/harness under scripts/experiments/, plus one CI workflow; no product code changed.
- Requires agent-generator + GCP ADC (Vertex-routed models). See README.

…y vs termchart

Uses agent-generator's benchmark-runner to drive each external coding-agent CLI headless on termchart tasks and verify (via a local viewer, deterministically) whether it activates the right diagram journey for a scenario.

- definitions/: smoke cases (can it push) + journey cases (scenario-only, does it pick the right --type). Shared AGENTS.md decision guide; scope-specific, trace-independent verify (fair to Antigravity, which has no trace).
- run.sh: builds+serves the repo viewer, PATH-wraps termchart (inject viewer creds) and antigravity->agy, clears per backend, loops backends, preflights+SKIPs agy (its $HOME login can't survive the sandbox's HOME=/tmp).
- agent-compat.yml + run_locally.sh: run in CI or locally via act+podman.

First results (RESULTS.md): Gemini 2/5, Claude 0/5 journey activation; both misuse termchart's --type vocabulary, and `push` silently accepts invalid types (no corrective feedback). Antigravity SKIP (auth).

… works, journey selection still the gap

#219 (push --type validation) merged into master. Re-ran journeys: agents now get an actionable error on an invalid --type and retry with a VALID type every time (no more silently stored unrenderable boards; finding #2 fixed). Scores stay ~1/5 because agents retry to a valid-but-wrong type (Claude falls back to mermaid); picking the RIGHT journey is a recipe-activation gap, not a validation gap.

…), 2/5->3/5 (Gemini)

The #229 did-you-mean/intent hint closes the journey-selection gap: agents now self-correct to the RIGHT type (comparison->component, architecture->flow, etc.), not just a valid one. The only residual is ER (agents prefer mermaid's native erDiagram).
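A did-you-mean hint of this kind could be sketched as below. This is an assumption-laden illustration rather than the #229 implementation: `INTENT_ALIASES` holds only the two pairs the commit message names, `KNOWN_TYPES` lists only journey types that appear in this PR, and the fuzzy fallback via `difflib` is a technique swapped in for demonstration.

```python
import difflib

# Assumed vocabulary: journey types named in this PR.
KNOWN_TYPES = ["component", "flow", "panes", "vegalite"]

# The two intent -> type pairs the commit message mentions; a real table
# would presumably carry more entries.
INTENT_ALIASES = {"comparison": "component", "architecture": "flow"}


def suggest_type(bad_type):
    """Return a suggested valid type for an invalid one, or None."""
    key = bad_type.strip().lower()
    if key in INTENT_ALIASES:
        return INTENT_ALIASES[key]
    # Fallback: nearest spelling among the known types (handles typos).
    close = difflib.get_close_matches(key, KNOWN_TYPES, n=1, cutoff=0.6)
    return close[0] if close else None


def error_message(bad_type):
    """Build the error text, appending a hint when one is available."""
    hint = suggest_type(bad_type)
    message = f"unknown --type '{bad_type}'"
    return f"{message}; did you mean '{hint}'?" if hint else message


print(error_message("comparison"))
print(error_message("vegalit"))
```

Putting the hint inside the error text means it arrives in output the agent is already reading, which is the property the later commits credit for its effectiveness.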

… npx skills

Agents previously only had the seeded AGENTS.md. termchart's real skills (diagram-recipes etc.) ship as a plugin under ~/.claude/plugins, which the simulator sandbox does NOT mount. Install them with the Vercel `npx skills add plugin/skills -g --all` CLI (same mechanism agents-cli setup uses) so they land in the canonical ~/.agents/skills store with the per-agent symlinks the sandbox mounts. The workflow also runs `agents-cli setup --skip-auth` for the general bundle.

…o-activate them

Installing termchart's skills into the sandbox (diagram-recipes etc.) did NOT improve journey activation (Claude 1/5, Gemini 2/5): transcripts show Claude never read the skill (read-skill=False) and kept using Mermaid keywords. Passive availability != activation. The #229 in-error did-you-mean hint (4/5) remains the effective lever because it lands where the agent is already looking.

…mchart skill mis-activates

Traces show Claude DOES call the Skill tool; it activates the 'termchart' terminal (Mermaid->ASCII) skill, not diagram-recipes, and gets anchored on Mermaid types. Removing it and leaving only diagram-recipes still drifts to mermaid because the Mermaid-terminal identity pervades termchart (CLI --help tagline etc.). run.sh now installs only the viewer skill; the durable fix is deprecating the terminal surface.

….md) + self-contained shapes

Root cause of Claude's low journey scores was NOT skill quality or the model: Claude Code reads CLAUDE.md, but the harness seeded the guide only as AGENTS.md (Gemini's file). Claude ran blind and fell back to mermaid 'graph LR' under --type flow. Fixes: seed CLAUDE.md too, and make the guide self-contained (exact per-type JSON + 'flow content is JSON not mermaid'). Now one-shot with NO error hint: Claude haiku 5/5, Claude sonnet 5/5, Gemini 5/5 (was 1/5, ~1/5, 2/5). Adds tc_claude_sonnet generator + SKIP_SKILL_INSTALL for A/B.
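The seeding fix amounts to writing the same guide under both filenames. A minimal sketch, assuming a per-run working directory: `seed_guide` and the guide text are hypothetical, and the actual harness may well do this in run.sh rather than Python.

```python
import tempfile
from pathlib import Path

# Both filenames named in the commit message: Claude Code reads CLAUDE.md,
# and AGENTS.md was the only file previously seeded.
GUIDE_FILENAMES = ("AGENTS.md", "CLAUDE.md")


def seed_guide(workdir: Path, guide_text: str) -> list[Path]:
    """Write the same decision guide under every agent-specific filename."""
    workdir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in GUIDE_FILENAMES:
        path = workdir / name
        path.write_text(guide_text, encoding="utf-8")
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder guide text; the real guide carries exact per-type JSON.
    paths = seed_guide(Path(tmp), "flow content is JSON, not mermaid\n")
    print([p.name for p in paths])
```

Writing identical content to both files keeps the A/B comparison between agents fair, since neither sees a different guide.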

ivanmkc-google added 8 commits July 3, 2026 23:09

Merge remote-tracking branch 'origin/master' into agent-compat

515a779

ivanmkc wants to merge 8 commits into master from agent-compat


2 participants
