test(routing): real-world integration report against forgecode CLI #3
Merged
Tested the deterministic router and end-to-end MCP routing through the actual `forge` CLI (forgecode@2.12.14, model cx/gpt-5.5 via openai-compatible provider) on real repositories (expressjs/express, shadcn-ui/ui) with concrete tasks. Results, runner, and findings live under tests/integration/.

Key findings:
- Deterministic router: 19/20 (95%) on 20 hand-written real-repo prompts; the one miss reproduces existing fixture #34 (login → auth vs test).
- End-to-end via `forge -p ':ck:auto ...'`: 5/5 expected behaviours, including correct `disambiguate` action on the ambiguous case.
- B1 (installer ships incomplete MCP runtime): mcp-server/ + .mcp.json + @modelcontextprotocol/sdk are not copied by lgmmo-forgekit-installer, so MCP routing silently fails after `npx` install.
- B2 / B3: documented `login`-keyword collision and deploy+security compound-intent regression.
Pull request overview
Adds a new tests/integration/ “real-world” routing evaluation bundle to validate ForgeKit’s deterministic router (and document end-to-end CLI+MCP observations) using hand-written prompts tied to public repos.
Changes:
- Add a Node runner to score real-world prompts via `scripts/route-intent.cjs` and write a results JSON snapshot.
- Add a curated prompt set (`real-world-tasks.json`) and a captured run output (`real-world-tasks-results.json`).
- Add a methodology/report README documenting setup, results, and known routing/installer issues.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/integration/run-real-tasks.cjs | Adds a CLI-style runner to route prompts and emit a results JSON file. |
| tests/integration/real-world-tasks.json | Adds 20 real-repo prompt cases with expected primary skills. |
| tests/integration/real-world-tasks-results.json | Commits a snapshot of routing results for the 20 cases. |
| tests/integration/README.md | Documents methodology, commands, results, and observed issues (B1–B3). |
const { route } = require('../../scripts/route-intent.cjs');
const fs = require('fs');

const cases = JSON.parse(fs.readFileSync('./real-world-tasks.json', 'utf8'));
  }
}
console.log(`\n${pass}/${cases.length} passed (${(pass/cases.length*100).toFixed(1)}%)`);
fs.writeFileSync('./real-world-tasks-results.json', JSON.stringify(results, null, 2));
deterministic router (no LLM in the loop):

```
$ node tests/integration/run-real-tasks.cjs
```
Comment on lines +79 to +81
because the action was `disambiguate` (gap 0.20 < 0.15 threshold violated),
the orchestrator correctly stopped and asked the user. The product flow is
not broken on that input even though the top score is wrong.
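The README excerpt ties the `disambiguate` action to the gap between the top two scores and a 0.15 threshold. A minimal sketch of such a check is shown here, assuming the router asks the user whenever that gap falls below the threshold. The function name `decideAction`, the `route` action label, the shape of the score map, and the sample scores are all hypothetical and are not taken from `scripts/route-intent.cjs`.

```javascript
// Hypothetical sketch: not the actual logic in scripts/route-intent.cjs.
// `scores` maps skill name -> numeric score; `threshold` is the minimum gap
// between the top two scores required to route without asking the user.
function decideAction(scores, threshold = 0.15) {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) {
    return { action: 'disambiguate', candidates: [] };
  }
  const [top, runnerUp] = ranked;
  // With a single candidate there is nothing to confuse it with.
  const gap = runnerUp ? top[1] - runnerUp[1] : Infinity;
  if (gap < threshold) {
    // Too close to call: surface both candidates instead of guessing.
    return { action: 'disambiguate', candidates: [top[0], runnerUp[0]], gap };
  }
  return { action: 'route', skill: top[0], gap };
}

// Made-up scores for illustration only.
console.log(decideAction({ auth: 0.55, test: 0.5 }));
console.log(decideAction({ deploy: 0.9, 'security-scan': 0.4 }));
```

Under this reading, a wrong top score on a near-tie still ends in a question to the user rather than a misroute, which is the behaviour the excerpt describes.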
Summary

Verifies ForgeKit routing end-to-end against the real `forge` CLI (forgecode@2.12.14, model `cx/gpt-5.5` via openai-compatible provider at `api.trannhatcse.tokyo/v1`), on real public repositories with concrete tasks, instead of generated model fixtures.

Adds `tests/integration/`:
- `real-world-tasks.json`: 20 hand-written prompts tied to real repos (expressjs/express, vercel/next.js, prisma/prisma, stripe/stripe-node, microsoft/playwright, shadcn-ui/ui, …).
- `run-real-tasks.cjs`: scores them with the deterministic router.
- `real-world-tasks-results.json`: captured results.
- `README.md`: full methodology and findings.

Results

Deterministic router (`scripts/route-intent.cjs`): 19/20 on the hand-written prompts.

End-to-end via `forge -p ':ck:auto ...'` against real cloned repos: 5/5 expected behaviours.

All decisions hit `.forgekit/route-log.jsonl` with intent hashed (no raw text on disk); verified.

Issues found (documented, not fixed in this PR)

- B1: `bin/lgmmo-forgekit-installer.js` doesn't copy `mcp-server/`, doesn't write a real project-root `.mcp.json` (only `.mcp.json.example` inside `.forge/`), and doesn't pull `@modelcontextprotocol/sdk`. After `npx lgmmo-forgekit-installer`, ForgeCode has no MCP router and the "MANDATORY FIRST ACTION: call route_intent" prompt silently degrades.
- B2: `login` verb collision (fixture #34 + microsoft/playwright real case): `auth.verbs` contains the bare word `"login"`, hijacking testing prompts that mention a login page.
- B3: `Deploy và scan security` (Vietnamese for "Deploy and scan security") routes to `security-scan` instead of `deploy` because both verbs+nouns match and the noun-bonus tilts security-scan ahead.

See `tests/integration/README.md` for full reproduction details and suggested fixes.

Test plan

- `npm run test:routing`: baseline 98/100
- `node tests/integration/run-real-tasks.cjs`: 19/20 on hand-written prompts
- `forge -p` runs against cloned express/ui repos with MCP wired manually

Generated by Claude Code