test(gsar): add 5 substantive end-to-end integration tests by fede-kamel · Pull Request #17 · oracle-samples/locus

fede-kamel · 2026-04-30T16:35:25Z

Follow-up to #8. The original GSAR integration suite had 3 tests, all probing only the first iteration. This adds 5 substantive tests covering the outer-loop dynamics — the load-bearing claim of the framework.

New live integration tests

| Test | What it proves |
| --- | --- |
| `test_gsar_recovery_then_proceed_live_cycle` | Loose synthesis with contradicted claim → some recovery branch fires → next judge pass converges to proceed; trajectory monotonic non-regressing. |
| `test_gsar_replan_then_proceed_live_cycle` | Empty evidence corpus → recovery → fresh evidence appended → proceed. Verifies abstain-as-replan dispatch. |
| `test_gsar_budget_exhaustion_sets_degraded_live` | Unsalvageable input + no-op replan → K_max=2 exhausted → `degraded=True` after exactly 3 trajectory entries. §5.3 contract under live judge noise. |
| `test_gsar_rho_zero_inflation_visible_live` | Property P5 in practice. Live judge produces partition with contradicted claim; `ρ=0` must strictly inflate score vs `ρ=0.5` when \|X\| > 0. Skips cleanly when judge doesn't surface a contradiction. |
| `test_gsar_cross_judge_decision_agreement` | Same grounded + ungrounded reports through `gpt-4o-mini` and `gpt-4o`; both must land in compatible decision tiers (paper §11 / Table 10 judge-agnostic claim). |
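The budget-exhaustion row above can be illustrated with a toy outer loop. This is a minimal sketch, not the locus implementation: `run_outer_loop`, `LoopResult`, and the string decisions are hypothetical names, and it assumes K_max counts recovery attempts after the initial judge pass, which is what "exactly 3 trajectory entries" for K_max=2 suggests.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    """Hypothetical result container; not the framework's real type."""
    trajectory: List[str] = field(default_factory=list)  # one decision per judge pass
    degraded: bool = False


def run_outer_loop(
    report: str,
    judge: Callable[[str], str],
    recover: Callable[[str], str],
    k_max: int = 2,
) -> LoopResult:
    """Judge the report, recovering at most k_max times before giving up."""
    result = LoopResult()
    for attempt in range(k_max + 1):  # initial pass + k_max recovery passes
        decision = judge(report)
        result.trajectory.append(decision)
        if decision == "proceed":
            return result
        if attempt < k_max:
            # Budget remains: apply a recovery step and judge again.
            report = recover(report)
    # Budget exhausted without a "proceed" decision.
    result.degraded = True
    return result
```

With a judge that never returns "proceed" and a no-op `recover`, the loop records three decisions and sets `degraded`; with a judge that proceeds on its second pass, it stops after two entries and leaves `degraded` unset.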

Drive-by

Strengthened `StructuredOutputGSARJudge` default system prompt:

- Explicit "every atomic claim in exactly one bucket".
- "Plausibility is NOT grounding" rule with concrete examples (proper nouns, unmapped IDs, unmapped timestamps).
- Symmetric reminder that evidence-matching claims MUST go to grounded; without it the judge over-corrected and dropped supported claims.

Needed because the live judge was previously labelling clearly-unsupported claims as grounded with synthesis type — a labelling error that masked real grounding violations.
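The "exactly one bucket" rule can also be checked mechanically on a judge's output. The sketch below is illustrative only: `partition_violations` is a hypothetical helper, the bucket names in the usage note are placeholders, and the real judge output schema is not shown in this description.

```python
from collections import Counter
from typing import Dict, List


def partition_violations(claims: List[str], buckets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Report claims that break the 'every claim in exactly one bucket' rule."""
    # Count how many times each claim appears across all buckets.
    counts = Counter(claim for members in buckets.values() for claim in members)
    known = set(claims)
    return {
        "missing": [c for c in claims if counts[c] == 0],     # in no bucket
        "duplicated": [c for c in claims if counts[c] > 1],   # in several buckets (or repeated)
        "unknown": [c for c in counts if c not in known],     # bucketed but never extracted
    }
```

For example, with claims `["a", "b", "c"]` and buckets `{"grounded": ["a"], "ungrounded": ["b", "a"], "contradicted": ["z"]}`, the helper reports `c` as missing, `a` as duplicated, and `z` as unknown. A valid partition yields three empty lists.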

Validation

- 8/8 GSAR live integration tests pass on `gpt-4o-mini` in 58-63s. Repeated runs stable.
- 14/14 combined providers + GSAR live integration suite passes with `OPENAI_API_KEY` + `LOCUS_LIVE_IMAGE=1` + `LOCUS_LIVE_SPEECH=1`.
- 3179 unit tests pass, no regressions.
- `hatch run lint` clean.

Test plan

CI runs the existing test files cleanly (the new tests are gated on OPENAI_API_KEY and skip in vanilla CI).
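Gating of this kind can be sketched as follows. This is an assumption-laden illustration: the repository's actual skip mechanism is not shown in this description, `live_tests_enabled` and the test class are hypothetical, and stdlib `unittest` is used here only to keep the example dependency-free.

```python
import os
import unittest
from typing import Mapping


def live_tests_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when a non-empty OPENAI_API_KEY is configured."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


# The decorator is evaluated at import time, so a vanilla CI run without the
# key reports the whole class as skipped instead of failing.
@unittest.skipUnless(live_tests_enabled(), "OPENAI_API_KEY not set; live GSAR tests skipped")
class LiveGsarPlaceholderTest(unittest.TestCase):
    def test_key_is_present(self) -> None:
        # Placeholder body; a real live test would call the judge here.
        self.assertTrue(os.environ["OPENAI_API_KEY"].strip())
```

Keeping the check in a small function that accepts any mapping makes the gating logic itself testable without touching the real environment.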

test(gsar): add 5 substantive end-to-end integration tests See merge request saas-observ-eng/locus!101


* chore(workbench): rename sandbox/ → workbench/ end-to-end

  Single name for the playground app: directory, npm package names, docker COPY paths, devcontainer scripts, docs, localStorage key (locus.sandbox.theme → locus.workbench.theme), env var (LOCUS_SANDBOX_REFLEXION → LOCUS_WORKBENCH_REFLEXION), e2e spec filename.

  Generic security-context "sandboxing" wording in SECURITY.md and the security review doc, and the TestPyPI "sandbox" reference in the release workflow, are intentionally left alone (different meaning).

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* fix(sdk): robustness for async runs and parallel pipelines

  - Drain httpx clients inside the run loop so consecutive run_sync() calls don't trip "RuntimeError: Event loop is closed" during the prior client's TLS teardown (agent.py + AnthropicModel.close).
  - ParallelPipeline.run() now uses gather(return_exceptions=True) and surfaces per-agent failures in error/outputs instead of collapsing the whole result to outputs=[] with a generic message.
  - Pin explicit timeout + max_retries on OCIOpenAIModel's AsyncOpenAI + httpx clients so a stuck request can no longer hang gather() for the openai SDK's ~10-min default.
  - Bump default request_timeout 60s → 120s on Openai/AnthropicConfig to give reasoning + tool-heavy turns enough headroom.

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* feat(workbench): add Model A/B/C slots in Provider settings

  Settings modal now shows three model dropdowns sharing the active provider's API key. A is required; B and C are optional and fall back to A when blank. The backend forwards LOCUS_MODEL_ID, LOCUS_MODEL_ID_B, LOCUS_MODEL_ID_C to the subprocess.

  examples/config.py grows two helpers, get_model_b() and get_model_c(), that read the slot env vars and fall through to the primary slot when unset, so tutorials that mix models still work in plain CLI runs where only LOCUS_MODEL_ID is configured.

  Lets multi-agent tutorials demonstrate realistic load mixing (e.g. a fast model for triage/routing alongside a deep model for specialist work) without bespoke wiring.

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* chore(tutorials): rename "OCI call" → "model call" + wire 16/17 to slot B

  The 22 tutorials that print a per-call timing/token banner used a hardcoded "[OCI call: ...]" label, which was misleading whenever the workbench (or any CLI run) was pointed at OpenAI / Anthropic. The banner is now provider-agnostic ("[model call: ...]").

  #16 (agent_handoff) and #17 (orchestrator_pattern) demonstrate the multi-model load-mixing pattern: triage/commentary roles read get_model_b() (= the workbench's Model B slot when set, else A), specialist/escalation roles stay on the primary model. With B unset, behavior is unchanged from before.

  #16 also drops a few redundant live receive_handoff calls; the same pattern is exercised live in Parts 1, 4, and 7, so the demos in 2/3/5/6/8 now print the data structures without a separate LLM round-trip. Same pedagogical value, runtime cut from ~6 minutes to under 2.

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* test(workbench/e2e): add per-tutorial sweep specs for each provider

  Three sister Playwright specs that fan a single test out per tutorial in the workbench catalog (skipping needs_stdin and OCI-only ones). Each test spawns its own browser context, configures a provider via the Settings modal, then drives one tutorial through the UI and asserts exit 0.

  Catalog is fetched synchronously via curl at module load so test() calls can be generated at the top level, which avoids the top-level-await dance under CommonJS.

  All three honour Model A/B/C env vars (e.g. ANTHROPIC_MODEL_B, OPENAI_MODEL_B, OCI_MODEL_B) so the slot-B speedup we wired into tutorials 16/17 is exercised end-to-end.

      npx playwright test tests/all-openai.spec.ts --workers=3

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* feat(sdk): re-export multi-agent + composition primitives at top level

  Surface Orchestrator, Specialist, StateGraph, Send, Handoff, HandoffContext, HandoffReason, RoutingDecision, SequentialPipeline, ParallelPipeline, and LoopAgent from ``locus`` directly so users can do ``from locus import Orchestrator, Specialist`` instead of hunting through ``locus.multiagent.*`` and ``locus.agent.composition``.

  These are first-class features of the SDK and were previously only discoverable via the implementation modules. Lazy-loaded so import cost stays put.

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* chore(examples/17): trim AI commentary calls to fit per-tutorial budget

  Tutorial 17's _llm_call() helper fired 6 separate "AI commentary" sidebar prompts in addition to 4 specialist/orchestrator runs. Most of those commentary calls just narrated what the surrounding typed object (Orchestrator, RoutingDecision, etc.) already demonstrates. Drops 5 of 6; keeps Part 2's commentary call as the helper's exemplar.

  Net live LLM-firing operations: 5 (was ~10). Tutorial finishes inside the workbench's per-tutorial budget under parallel sweep load (was timing out at 10 min on OCI v1 with workers=2).

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* chore(coverage): update baseline for AnthropicModel client lifecycle

  Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

---------

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
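The `gather(return_exceptions=True)` behaviour described in the fix(sdk) commit above can be sketched in isolation. This is not `ParallelPipeline.run()` itself: `run_parallel` and the result layout are hypothetical, and only `asyncio.gather` is a real API here.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def run_parallel(
    agents: Dict[str, Callable[[], Awaitable[Any]]],
) -> Dict[str, Dict[str, Any]]:
    """Run all agents concurrently and keep per-agent successes and failures."""
    names = list(agents)
    # return_exceptions=True returns raised exceptions as ordinary results,
    # so one failing agent does not discard the others' outputs.
    results = await asyncio.gather(
        *(agents[name]() for name in names), return_exceptions=True
    )
    outputs: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            errors[name] = f"{type(result).__name__}: {result}"
        else:
            outputs[name] = result
    return {"outputs": outputs, "errors": errors}
```

Results come back in the same order as the awaitables passed to `gather`, which is why zipping them with `names` attributes each failure to the right agent.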

Merge branch 'tests/gsar-integration-expansion' into 'main'

8a9e1e5


oracle-contributor-agreement Bot added the "OCA Verified" label (all contributors have signed the Oracle Contributor Agreement) on Apr 30, 2026

fede-kamel merged commit 457bb60 into main Apr 30, 2026
1 check passed

fede-kamel mentioned this pull request on Apr 30, 2026 in #18, "test(gsar): stabilize live integration suite under judge variance" (merged).

fede-kamel deleted the tests/gsar-integration-expansion-github branch May 13, 2026 04:29

fede-kamel merged 1 commit into main from tests/gsar-integration-expansion-github
