agent-compat: journey-activation tests for Claude/Gemini/Antigravity vs termchart#223
Draft
ivanmkc wants to merge 8 commits into
…y vs termchart
Uses agent-generator's benchmark-runner to drive each external coding-agent CLI headless on termchart tasks and verify (via a local viewer, deterministically) whether it activates the right diagram journey for a scenario.
- definitions/: smoke cases (can it push) + journey cases (scenario-only, does it pick the right --type). Shared AGENTS.md decision guide; scope-specific, trace-independent verify (fair to Antigravity, which has no trace).
- run.sh: builds+serves the repo viewer, PATH-wraps termchart (inject viewer creds) and antigravity->agy, clears per backend, loops backends, preflights+SKIPs agy (its $HOME login can't survive the sandbox's HOME=/tmp).
- agent-compat.yml + run_locally.sh: run in CI or locally via act+podman.
First results (RESULTS.md): Gemini 2/5, Claude 0/5 journey activation; both mis-use termchart's --type vocabulary and `push` silently accepts invalid types (no corrective feedback). Antigravity SKIP (auth).
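The PATH-wrapping step can be pictured as writing a small shim that exports the viewer credentials and then execs the real binary, so the credentials travel with PATH rather than with the environment. The following is only a sketch of that idea in Python, not the contents of run.sh: the function names, the `/opt/real/termchart` path, and the `TERMCHART_VIEWER_URL` variable are all hypothetical.

```python
import os
import shlex
import stat
from pathlib import Path


def write_path_shim(shim_dir, real_binary, env):
    """Write an executable sh shim that exports `env` and execs `real_binary`.

    The shim takes the real binary's file name, so placing `shim_dir` first
    on PATH makes callers hit the shim instead of the real binary.
    """
    shim_dir = Path(shim_dir)
    shim_dir.mkdir(parents=True, exist_ok=True)
    shim = shim_dir / Path(real_binary).name

    lines = ["#!/bin/sh"]
    for key, value in sorted(env.items()):
        # shlex.quote keeps values with spaces or shell metacharacters intact.
        lines.append(f"export {key}={shlex.quote(value)}")
    lines.append(f'exec {shlex.quote(str(real_binary))} "$@"')

    shim.write_text("\n".join(lines) + "\n", encoding="utf-8")
    mode = shim.stat().st_mode
    shim.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return shim


def prepend_to_path(shim_dir, path):
    """Return a PATH value with `shim_dir` searched first."""
    return os.pathsep.join([str(shim_dir), path]) if path else str(shim_dir)
```

A caller would write the shim once before looping over backends and pass `prepend_to_path(shim_dir, os.environ.get("PATH", ""))` to the sandboxed process.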
… works, journey selection still the gap
#219 (push --type validation) merged into master. Re-ran journeys: agents now get an actionable error on an invalid --type and retry with a VALID type every time (no more silently-stored unrenderable boards; finding #2 fixed). Scores stay ~1/5 because agents retry to a valid-but-wrong type (Claude falls back to mermaid); picking the RIGHT journey is a recipe-activation gap, not validation.
…), 2/5->3/5 (Gemini)
The #229 did-you-mean/intent hint closes the journey-selection gap: agents now self-correct to the RIGHT type (comparison->component, architecture->flow, etc.), not just a valid one. The only residual failure is ER (agents prefer mermaid's native erDiagram).
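To make the difference between plain validation (#219) and an in-error hint (#229) concrete, here is a minimal Python sketch. It is not termchart's implementation. The type list contains only the types named in this PR and may be incomplete, and the exit code and message wording are assumptions. The two intent mappings are the ones quoted in the commit message; near-miss spellings fall back to `difflib`.

```python
import difflib

# Types mentioned in this PR; the real CLI's list may differ (assumption).
VALID_TYPES = ("component", "flow", "vegalite", "panes", "mermaid")

# Intent word -> type, taken from the commit message above.
INTENT_HINTS = {
    "comparison": "component",
    "architecture": "flow",
}


def suggest_type(requested):
    """Return the most likely intended type, or None if nothing is close."""
    key = requested.strip().lower()
    if key in INTENT_HINTS:
        return INTENT_HINTS[key]
    close = difflib.get_close_matches(key, VALID_TYPES, n=1, cutoff=0.6)
    return close[0] if close else None


def check_type(requested):
    """Return (exit_code, message); exit code 2 on rejection is an assumption."""
    if requested in VALID_TYPES:
        return 0, ""
    message = (
        f"error: unknown --type {requested!r}; "
        f"valid types: {', '.join(VALID_TYPES)}"
    )
    hint = suggest_type(requested)
    if hint is not None:
        message += f". Did you mean --type {hint}?"
    return 2, message
```

Without the hint, the error only tells the agent which values are valid, which matches the earlier "valid-but-wrong" retries; with it, the error names the type the agent most likely wanted.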
… npx skills
Agents previously only had the seeded AGENTS.md; termchart's real skills (diagram-recipes etc.) ship as a plugin under ~/.claude/plugins, which the simulator sandbox does NOT mount. Install them with the Vercel `npx skills add plugin/skills -g --all` CLI (same mechanism agents-cli setup uses) so they land in the canonical ~/.agents/skills store with the per-agent symlinks the sandbox mounts. Workflow also runs `agents-cli setup --skip-auth` for the general bundle.
…o-activate them
Installing termchart's skills into the sandbox (diagram-recipes etc.) did NOT improve journey activation (Claude 1/5, Gemini 2/5): transcripts show Claude never read the skill (read-skill=False) and kept using Mermaid keywords. Passive availability != activation. The #229 in-error did-you-mean hint (4/5) remains the effective lever because it lands where the agent is already looking.
…mchart skill mis-activates
Traces show Claude DOES call the Skill tool; it activates the 'termchart' terminal (Mermaid->ASCII) skill, not diagram-recipes, and gets anchored on Mermaid types. Removing it and leaving only diagram-recipes still drifts to mermaid because the Mermaid-terminal identity pervades termchart (CLI --help tagline etc.). run.sh now installs only the viewer skill; the durable fix is deprecating the terminal surface.
….md) + self-contained shapes
Root cause of Claude's low journey scores was NOT skill quality/model: Claude Code reads CLAUDE.md, but the harness seeded the guide only as AGENTS.md (Gemini's file). Claude ran blind and fell back to mermaid 'graph LR' under --type flow. Fixes: seed CLAUDE.md too + make the guide self-contained (exact per-type JSON + 'flow content is JSON not mermaid'). Now one-shot with NO error hint: Claude haiku 5/5, Claude sonnet 5/5, Gemini 5/5 (was 1/5, ~1/5, 2/5). Adds tc_claude_sonnet generator + SKIP_SKILL_INSTALL for A/B.
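The fix described above amounts to writing the same guide under every filename an agent might read. A minimal Python sketch of that seeding step follows. The function name and the workdir layout are hypothetical, and the harness may do this differently (for example in shell).

```python
from pathlib import Path

# AGENTS.md was already seeded; CLAUDE.md is the file Claude Code reads.
GUIDE_FILENAMES = ("AGENTS.md", "CLAUDE.md")


def seed_guide(workdir, guide_text):
    """Write identical guide content under each agent-specific filename."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in GUIDE_FILENAMES:
        path = workdir / name
        path.write_text(guide_text, encoding="utf-8")
        written.append(path)
    return written
```

Keeping one source text and copying it avoids the two files drifting apart, which matters here because the guide is meant to be self-contained.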
What this adds
A harness (`scripts/experiments/agent-compat/`) that uses agent-generator's `benchmark-runner` to drive Claude Code, Gemini CLI, and Antigravity headless against termchart and check, deterministically via a live viewer, whether each one activates the right diagram journey for a scenario (not just "can it push").

- Smoke cases (`termchart-smoke`): the prompt names the type → can the agent drive `termchart push` at all.
- Journey cases (`termchart-journeys`): scenario only (e.g. "compare 3 laptops", "show our request flow", "plot MAU growth") → does the agent pick the correct `--type` (component / flow / vegalite / flow / panes)? Shared `AGENTS.md` decision guide; scope-specific, trace-independent verify (the only fair way to score Antigravity, which produces no trace).
- `run.sh`: builds+serves the repo viewer, PATH-wraps `termchart` (injects viewer creds; the sandbox forwards PATH but not env) and `antigravity`→`agy`, clears the viewer per backend, and preflights+SKIPs agy when its auth can't survive the sandbox.
- `agent-compat.yml` + `run_locally.sh`: runnable in CI or locally via `act`+podman (cloud-auth steps ACT-gated).

First results (`RESULTS.md`)

| Backend | Journey activation | Notes |
| --- | --- | --- |
| Claude | 0/5 | wrong `--type` values (`flowchart`, `erDiagram`, `table`, `line`) |
| Gemini | 2/5 | right on `vegalite`/`panes`, wrong (`comparison`, `er`, `mermaid`) elsewhere |
| Antigravity | SKIP | `agy` auth can't survive the sandbox `HOME=/tmp` |

Key findings:

1. Both agents mis-use termchart's `--type` vocabulary when given only a scenario prompt → the journeys are mostly not being used correctly.
2. `termchart push` silently accepts unknown `--type` values (exit 0, unrenderable board) → agents get no corrective feedback. Validating `--type` in `push` is the highest-leverage fix and would let any agent self-correct.

Notes

- Changes are confined to `scripts/experiments/`, plus one CI workflow; no product code changed.
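
The trace-independent verify listed under "What this adds" can be sketched as a check on what actually reached the viewer, which is why it works for a backend that produces no trace. The Python below is an illustration only. The board record shape (a dict with a `type` key), the choice to score the last pushed board, and the pairing of the three quoted scenarios with types are assumptions inferred from the description, not the harness's real data model.

```python
from typing import Mapping, Optional, Sequence

# Scenario -> expected --type. Pairing inferred from the order in the
# description above (assumption); only three of the five cases are quoted.
EXPECTED_TYPE = {
    "compare 3 laptops": "component",
    "show our request flow": "flow",
    "plot MAU growth": "vegalite",
}


def final_board_type(boards: Sequence[Mapping[str, str]]) -> Optional[str]:
    """Type of the most recently pushed board, or None if nothing was pushed."""
    return boards[-1].get("type") if boards else None


def verify_journey(scenario: str, boards: Sequence[Mapping[str, str]]) -> bool:
    """Pass only if the last board on the viewer has the expected type.

    Reads viewer state alone, so it needs no agent trace. Scoring the last
    board means a retry that lands on the right type still passes.
    """
    return final_board_type(boards) == EXPECTED_TYPE[scenario]


def tally(results: Sequence[bool]) -> str:
    """Format pass counts the way the results are reported, e.g. '2/5'."""
    return f"{sum(results)}/{len(results)}"
```

Because the viewer is cleared per backend, each backend's boards can be read in isolation before calling `verify_journey`.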