test(routing): real-world integration report against forgecode CLI #3

Merged
Duy-Nguyen-2006 merged 1 commit into main from claude/test-forgekit-routing-OGMPj
May 12, 2026

@Duy-Nguyen-2006

Owner

Summary

Verifies ForgeKit routing end-to-end against the real forge CLI (forgecode@2.12.14, model cx/gpt-5.5 via openai-compatible provider at api.trannhatcse.tokyo/v1), on real public repositories with concrete tasks, instead of generated model fixtures.

Adds tests/integration/:

  • real-world-tasks.json — 20 hand-written prompts tied to real repos (expressjs/express, vercel/next.js, prisma/prisma, stripe/stripe-node, microsoft/playwright, shadcn-ui/ui, …).
  • run-real-tasks.cjs — scores them with the deterministic router.
  • real-world-tasks-results.json — captured results.
  • README.md — full methodology and findings.

Results

Deterministic router (scripts/route-intent.cjs)

  • Existing fixtures: 98/100 (98.0%) — unchanged baseline.
  • Hand-written real-repo prompts: 19/20 (95.0%).

End-to-end via forge -p ':ck:auto ...' against real cloned repos: 5/5 expected behaviours.

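| # | Repo | Task (original prompt, with English gloss) | Action | Primary | Conf | Verdict |
|---|------|--------------------------------------------|--------|---------|------|---------|
| 1 | expressjs/express | `Tạo REST API endpoint POST /api/products với middleware validate request body` (Create a REST API endpoint POST /api/products with middleware that validates the request body) | route | backend-development | 1.00 | PASS |
| 2 | expressjs/express | `Viết unit test với Jest cho lib/router/index.js, tăng test coverage` (Write unit tests with Jest for lib/router/index.js, increase test coverage) | route | test | 1.00 | PASS |
| 3 | expressjs/express | `Viết playwright tests for login page tại tests/login.spec.ts` (Write playwright tests for login page at tests/login.spec.ts) | disambiguate | auth vs web-testing | 0.50 | PASS (asked, didn't mis-route) |
| 4 | expressjs/express | `Thêm đăng nhập Google OAuth2 với JWT session management` (Add Google OAuth2 login with JWT session management) | route | auth | 1.00 | PASS |
| 5 | shadcn-ui/ui | `Thiết kế landing page responsive đẹp với dark mode cho coffee shop` (Design a beautiful responsive landing page with dark mode for a coffee shop) | route | ui-ux-pro-max | 1.00 | PASS |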
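
Row 3 is the only run where the router returned `disambiguate` rather than `route`. As a rough illustration of how a gap-based decision of that kind can work, here is a minimal sketch. The function name `decide`, the score map shape, and the 0.15 default are assumptions for illustration (0.15 is the threshold figure quoted from the README in the review excerpt); the real logic lives in scripts/route-intent.cjs and may differ.

```javascript
// Hypothetical sketch: not ForgeKit's actual API or scoring.
// Routes to the top skill only when it leads the runner-up by a clear margin.
function decide(scores, gapThreshold = 0.15) {
  // Rank candidate skills by score, highest first.
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { action: 'none', candidates: [] };

  const [top, second] = ranked;
  // A single candidate cannot be ambiguous.
  if (!second) return { action: 'route', primary: top[0] };

  const gap = top[1] - second[1];
  if (gap < gapThreshold) {
    // Too close to call: ask the user instead of guessing.
    return { action: 'disambiguate', candidates: [top[0], second[0]] };
  }
  return { action: 'route', primary: top[0] };
}

console.log(decide({ auth: 0.5, 'web-testing': 0.45 }));
console.log(decide({ auth: 1.0, test: 0.2 }));
```

Under these assumed scores the first call asks the user to choose between `auth` and `web-testing`, while the second routes straight to `auth`.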

All decisions hit .forgekit/route-log.jsonl with intent hashed (no raw text on disk) — verified.
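
For readers unfamiliar with the "hashed intent" pattern, the following is a minimal sketch of appending a decision to a JSONL log without storing the raw prompt. The field names (`ts`, `intent_sha256`), the choice of SHA-256, and the helper names are assumptions for illustration; the PR does not show the actual log schema or hash function.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hash the prompt so the log can correlate entries without keeping raw text.
// SHA-256 is an assumption; the project may use a different digest.
function hashIntent(intent) {
  return crypto.createHash('sha256').update(intent, 'utf8').digest('hex');
}

// Append one decision as a single JSON line (hypothetical schema).
function logDecision(logPath, intent, decision) {
  const entry = {
    ts: new Date().toISOString(),
    intent_sha256: hashIntent(intent),
    ...decision,
  };
  fs.mkdirSync(path.dirname(logPath), { recursive: true });
  fs.appendFileSync(logPath, JSON.stringify(entry) + '\n');
  return entry;
}

// Demo: write to a temp directory rather than a real project.
const demoLog = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'route-log-')),
  '.forgekit',
  'route-log.jsonl'
);
logDecision(demoLog, 'Add Google OAuth2 login', { action: 'route', primary: 'auth' });
console.log(fs.readFileSync(demoLog, 'utf8'));
```

A check like the "verified" claim above could then amount to searching the log file for any fragment of the original prompts and confirming none appears.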

Issues found (documented, not fixed in this PR)

  • B1: the installer ships an incomplete MCP runtime. bin/lgmmo-forgekit-installer.js doesn't copy mcp-server/, doesn't write a real project-root .mcp.json (only .mcp.json.example inside .forge/), and doesn't install @modelcontextprotocol/sdk. After npx lgmmo-forgekit-installer, ForgeCode has no MCP router, and the "MANDATORY FIRST ACTION: call route_intent" prompt silently degrades.
  • B2: login verb collision (fixture #34 and the microsoft/playwright real-repo case). auth.verbs contains the bare word "login", which hijacks testing prompts that merely mention a login page.
  • B3: the prompt "Deploy và scan security" ("Deploy and scan security") routes to security-scan instead of deploy, because both skills match on verbs and nouns and the noun bonus tilts security-scan ahead.
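
To make the B2 collision concrete, here is a toy keyword scorer that reproduces the symptom. Everything in it is hypothetical: the table contents, the weights (1.0 per verb hit, 0.5 per noun hit), and the function names are chosen only to illustrate how a bare "login" entry can outrank a testing skill. The dropped-keyword variant is one possible mitigation, not necessarily the fix suggested in the README.

```javascript
// Toy routing table (hypothetical; not the real ForgeKit table).
const TABLE = {
  auth: { verbs: ['login', 'authenticate'], nouns: ['oauth', 'jwt', 'session'] },
  'web-testing': { verbs: ['verify', 'assert'], nouns: ['playwright', 'e2e'] },
};

// Score each skill: 1.0 per matching verb, 0.5 per matching noun (assumed weights).
function scoreSkills(prompt, table) {
  const tokens = new Set(prompt.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const scores = {};
  for (const [skill, { verbs, nouns }] of Object.entries(table)) {
    const verbHits = verbs.filter((v) => tokens.has(v)).length;
    const nounHits = nouns.filter((n) => tokens.has(n)).length;
    scores[skill] = verbHits * 1.0 + nounHits * 0.5;
  }
  return scores;
}

// Return the name of the highest-scoring skill.
function topSkill(scores) {
  return Object.entries(scores).sort((a, b) => b[1] - a[1])[0][0];
}

const prompt = 'Write playwright tests for login page';

// "login" counts as an auth verb, so auth (1.0) beats web-testing (0.5).
console.log(topSkill(scoreSkills(prompt, TABLE)));

// One possible mitigation: drop the bare "login" keyword from auth.verbs.
const patched = {
  ...TABLE,
  auth: { ...TABLE.auth, verbs: TABLE.auth.verbs.filter((v) => v !== 'login') },
};
console.log(topSkill(scoreSkills(prompt, patched)));
```

In this toy setup the first call picks `auth` and the second picks `web-testing`; a real fix would also need to keep genuine login requests routing to `auth`.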

See tests/integration/README.md for full reproduction details and suggested fixes.

Test plan

  • npm run test:routing — baseline 98/100
  • node tests/integration/run-real-tasks.cjs — 19/20 on hand-written prompts
  • 5 live forge -p runs against cloned express/ui repos with MCP wired manually
  • (Follow-up) fix B1 in installer
  • (Follow-up) fix B2/B3 in routing table
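
Because the live runs needed MCP wired manually, a small preflight check could catch the B1 gap before a `forge -p` run. The sketch below only tests for the three pieces B1 names. The function name `checkMcpRuntime` and the assumption that all three live directly under the project root (with the SDK under node_modules/) are illustrative, not taken from the installer.

```javascript
const fs = require('fs');
const path = require('path');

// Report which of the MCP runtime pieces named in B1 are absent.
// Locations are assumptions: adjust if the installer targets other paths.
function checkMcpRuntime(projectRoot) {
  const required = {
    'mcp-server/': path.join(projectRoot, 'mcp-server'),
    '.mcp.json': path.join(projectRoot, '.mcp.json'),
    '@modelcontextprotocol/sdk': path.join(
      projectRoot, 'node_modules', '@modelcontextprotocol', 'sdk'
    ),
  };
  const missing = Object.entries(required)
    .filter(([, target]) => !fs.existsSync(target))
    .map(([name]) => name);
  return { ok: missing.length === 0, missing };
}

// Demo against the current working directory.
const report = checkMcpRuntime(process.cwd());
console.log(report.ok ? 'MCP runtime looks complete' : `Missing: ${report.missing.join(', ')}`);
```

Existence checks like these say nothing about whether .mcp.json is valid or points at the right server; they only flag the silent-failure case described in B1.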

Generated by Claude Code

Tested the deterministic router and end-to-end MCP routing through the
actual `forge` CLI (forgecode@2.12.14, model cx/gpt-5.5 via openai-
compatible provider) on real repositories (expressjs/express, shadcn-ui/ui)
with concrete tasks. Results, runner, and findings live under
tests/integration/.

Key findings:
- Deterministic router: 19/20 (95%) on 20 hand-written real-repo prompts;
  the one miss reproduces existing fixture #34 (login → auth vs test).
- End-to-end via `forge -p ':ck:auto ...'`: 5/5 expected behaviours,
  including correct `disambiguate` action on the ambiguous case.
- B1 (installer ships incomplete MCP runtime): mcp-server/ + .mcp.json +
  @modelcontextprotocol/sdk are not copied by lgmmo-forgekit-installer,
  so MCP routing silently fails after `npx` install.
- B2 / B3: documented `login`-keyword collision and deploy+security
  compound-intent regression.
@Duy-Nguyen-2006 Duy-Nguyen-2006 marked this pull request as ready for review May 12, 2026 14:20
Copilot AI review requested due to automatic review settings May 12, 2026 14:20
@Duy-Nguyen-2006 Duy-Nguyen-2006 merged commit 1fb675b into main May 12, 2026
2 checks passed

Copilot AI left a comment

Pull request overview

Adds a new tests/integration/ “real-world” routing evaluation bundle to validate ForgeKit’s deterministic router (and document end-to-end CLI+MCP observations) using hand-written prompts tied to public repos.

Changes:

  • Add a Node runner to score real-world prompts via scripts/route-intent.cjs and write a results JSON snapshot.
  • Add a curated prompt set (real-world-tasks.json) and a captured run output (real-world-tasks-results.json).
  • Add a methodology/report README documenting setup, results, and known routing/installer issues.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| tests/integration/run-real-tasks.cjs | Adds a CLI-style runner to route prompts and emit a results JSON file. |
| tests/integration/real-world-tasks.json | Adds 20 real-repo prompt cases with expected primary skills. |
| tests/integration/real-world-tasks-results.json | Commits a snapshot of routing results for the 20 cases. |
| tests/integration/README.md | Documents methodology, commands, results, and observed issues (B1–B3). |

const { route } = require('../../scripts/route-intent.cjs');
const fs = require('fs');

const cases = JSON.parse(fs.readFileSync('./real-world-tasks.json', 'utf8'));
}
}
console.log(`\n${pass}/${cases.length} passed (${(pass/cases.length*100).toFixed(1)}%)`);
fs.writeFileSync('./real-world-tasks-results.json', JSON.stringify(results, null, 2));
deterministic router (no LLM in the loop):

```
$ node tests/integration/run-real-tasks.cjs
```
Comment on lines +79 to +81
because the action was `disambiguate` (gap 0.20 < 0.15 threshold violated),
the orchestrator correctly stopped and asked the user. The product flow is
not broken on that input even though the top score is wrong.
3 participants