test: add mock-LLM E2E test for /model slash command (mid-conversation profile switch)#998
test: add mock-LLM E2E test for /model slash command (mid-conversation profile switch)#998malhotra5 wants to merge 6 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Add tests/e2e/mock-llm/mock-llm-model-switch.spec.ts exercising the
full /model mid-conversation profile switch flow against the real
agent-server with a scripted mock LLM backend.
Step 1 — setup:
- ensureMockLLMProfile() configures agent_settings.llm (proven pattern)
- POST /api/profiles/{name} creates profile B as the switch target
- Registers a 3-entry trajectory: padding for the internal LLM call,
INITIAL_REPLY_TOKEN, POST_SWITCH_REPLY_TOKEN
Step 2 — conversation + /model switch:
- Starts conversation from home page, waits for agent reply
- Types '/model model-switch-profile-b' and submits
- Verifies 'Switched to profile' confirmation in data-testid=model-messages
- Verifies POST /api/conversations/{id}/switch_profile was intercepted
- Sends follow-up, verifies agent responds (post-switch continuity)
- Verifies no error banners
Co-authored-by: openhands <openhands@all-hands.dev>
a806b0f to
b566848
Compare
Two Docker E2E reliability fixes: 1. Concurrency group: workflow_run triggers used the unique workflow_run.id as the group key, so multiple Docker builds completing on main each spawned their own E2E run (they never cancelled each other). Changed to key by the triggering workflow's head_branch, so main-branch workflow_run triggers share the group 'wr-main' and only the latest run survives. pull_request triggers still key by PR number (unchanged). 2. Retry: Docker Playwright config now uses retries:1 in CI to handle transient container startup ECONNREFUSED failures. The webServer health-check confirms the stack is up, but occasional races between container readiness and the first test request can still cause failures. Co-authored-by: openhands <openhands@all-hands.dev>
6484f16 to
1ab0495
Compare
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
The previous review comments have all been addressed cleanly — the setChatInput helper is now correctly in shared utils, the trajectory padding comment is well-documented with upstream references, and the switchProfileBody assertion uses a direct field check rather than a loose JSON.stringify substring match. The redundant toBeVisible check is gone too.
The CI changes are well-motivated: restricting workflow_run to main/master prevents duplicate E2E runs on e2e-tests PRs, the revamped concurrency group makes cancellation semantics explicit, and the single retry in CI is a pragmatic fix for Docker startup races. Comments in the workflow are unusually thorough.
A couple of things still worth addressing before merge:
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
- Make saveProfile idempotent with best-effort delete-first so stale profiles from crashed CI runs don't cause persistent 409 failures - Increase switch-confirmation timeout from 15s to 30s for consistency with all other waitForNonUserMessageText calls in this test - Add waitForTestId guard before post-switch setChatInput to handle potential UI disable during profile switch settling Co-authored-by: openhands <openhands@all-hands.dev>
✅ Mock-LLM E2E Tests14/14 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
❌ Mock-LLM Docker E2E Test Results8/14 passed · 2 failed · 4 skipped Commit:
🔍 Failure details (2)❌ mock-llm-conversation.spec.ts › mock-LLM agent-server conversation › step 1: create an LLM profile pointing at the mock LLM server❌ mock-llm-model-switch.spec.ts › mock-LLM /model slash command › step 1: configure LLM, create switch-target profile, register trajectoryPosted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
❌ Mock-LLM Docker E2E Test Results12/14 passed · 1 failed · 1 skipped Commit:
🔍 Failure details (1)❌ mock-llm-model-switch.spec.ts › mock-LLM /model slash command › step 1: configure LLM, create switch-target profile, register trajectoryPosted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
all-hands-bot
left a comment
There was a problem hiding this comment.
All feedback from the previous two review rounds has been addressed cleanly. The current state of the code is solid. A couple of minor observations below — neither is a blocker.
🟢 Good taste overall. This is a well-structured, comprehensive E2E test that fills a real gap in coverage for the /model slash command. The serial step split, idempotent profile setup (delete-before-post), trajectory padding documentation, shared setChatInput helper, and CI deduplication guards are all done correctly.
[IMPROVEMENT OPPORTUNITIES]
-
[
tests/e2e/mock-llm/mock-llm-model-switch.spec.ts, line 244–245] Afterexpect(switchProfileBody).toBeTruthy()on line 242, the subsequent(switchProfileBody as Record<string, unknown>)?.profile_namecast is slightly redundant — the?.optional chain suggests null is still possible, but the assertion above already guarantees it isn't.switchProfileBody!.profile_nameis more direct and makes the intent clearer (asserting on a known-non-null value). Minor nit, not a blocker. -
[
tests/e2e/mock-llm/utils/mock-llm-helpers.ts, line 394] The error thrown insidepage.evaluate()reads"Chat input not found"— sincetestIdis passed as a parameter, including it in the message (e.g.`Chat input [data-testid="${testId}"] not found`) would make failures easier to diagnose if the helper is ever extended to support different test IDs. Tiny nit.
CI changes are well-motivated:
- Restricting
workflow_runtomain/masterto eliminate duplicate E2E comment pairs one2e-testsPRs is the right call. - The revamped concurrency group keying (
pr_number || wr-{branch} || github.ref) is correct and explicit; the inline comment explaining each case is unusually good. - The empty
pr_number=forworkflow_runevents (results go to step summary only, no PR comment) is documented clearly — that's the right trade-off. retries: process.env.CI ? 1 : 0is a pragmatic fix for Docker container startup races with no correctness downside.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
New E2E test spec with zero production code changes. CI workflow changes are scoped to test infrastructure. The retry increase introduces minor latency for flaky-startup tests in CI but carries no correctness risk. Theworkflow_runmain/master guard removes a class of duplicate comment noise without affecting coverage.
VERDICT:
✅ Worth merging — all prior feedback addressed, code is clean, meaningful E2E coverage added for a previously untested flow.
KEY INSIGHT:
The serial step split with test.describe.configure({ mode: "serial" }) is the correct pattern here — step 2's browser interaction genuinely depends on step 1's API state, and Playwright's serial mode ensures skip-on-failure without requiring brittle test.skip guards.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and context the reviewer is missing. See the customization docs for the required frontmatter format.- Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
Was this review helpful? React with 👍 or 👎 to give feedback.
- Use switchProfileBody!.profile_name after toBeTruthy guard instead of optional chain (the assertion guarantees non-null) - Include testId in setChatInput error message for easier diagnosis; accept optional testId parameter for future reuse Co-authored-by: openhands <openhands@all-hands.dev>
✅ Mock-LLM E2E Tests14/14 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
✅ Mock-LLM Docker E2E Test Results14/14 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
✅ Mock-LLM Docker E2E Test Results14/14 passed Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses) |
📸 Snapshot Test Report✅ All snapshots match the main branch baselines.
✅ Unchanged snapshots (73)
Generated by the Snapshot Tests workflow. This comment was created by an AI agent (OpenHands) on behalf of the repo maintainers. |
Summary
Adds a new mock-LLM E2E test spec (
tests/e2e/mock-llm/mock-llm-model-switch.spec.ts) that exercises the/modelslash command for mid-conversation LLM profile switching — a key feature that previously had zero test coverage.What it tests
The test exercises the full
/modelflow end-to-end against the real agent-server with a scripted mock LLM backend:Step 1: Setup
openai/mock-model-alpha, B:openai/mock-model-beta) via the profiles API, both pointing at the mock LLM serverStep 2: Conversation +
/modelswitchGET /api/settings/model model-switch-profile-bin the chat input and submitsdata-testid="model-messages")POST /api/conversations/{id}/switch_profilerequest was intercepted with the correct profile nameTest infrastructure
playwright.mock-llm.config.ts)Coverage
Covers the following items from APP-1785:
/model gpt4and submit → intercept the switch-profile API call and assert it is called with the correct profile nameThis PR was created by an AI agent (OpenHands) on behalf of @rmalhot.
@malhotra5 can click here to continue refining the PR
🐳 Docker images for this PR
• GHCR package: https://github.com/OpenHands/agent-canvas/pkgs/container/agent-canvas
ghcr.io/openhands/agent-canvasghcr.io/openhands/agent-server:1.24.0-pythonopenhands-automation==1.0.0a5b5562647de4143341d67eb1da958cde6466ecf37Pull (multi-arch manifest)
# Multi-arch manifest — Docker automatically pulls the correct architecture docker pull ghcr.io/openhands/agent-canvas:sha-b556264Run
All tags pushed for this build
About Multi-Architecture Support
sha-b556264) is a multi-arch manifest supporting both amd64 and arm64sha-b556264-amd64) are also available if needed