test(gsar): add 5 substantive end-to-end integration tests #17

Merged: fede-kamel merged 1 commit into main from tests/gsar-integration-expansion-github on Apr 30, 2026

@fede-kamel
Contributor

Follow-up to #8. The original GSAR integration suite had 3 tests, all probing only the first iteration. This adds 5 substantive tests covering the outer-loop dynamics — the load-bearing claim of the framework.

New live integration tests

| Test | What it proves |
| --- | --- |
| test_gsar_recovery_then_proceed_live_cycle | Loose synthesis with contradicted claim → some recovery branch fires → next judge pass converges to proceed; trajectory monotonic non-regressing. |
| test_gsar_replan_then_proceed_live_cycle | Empty evidence corpus → recovery → fresh evidence appended → proceed. Verifies abstain-as-replan dispatch. |
| test_gsar_budget_exhaustion_sets_degraded_live | Unsalvageable input + no-op replan → K_max=2 exhausted → degraded=True after exactly 3 trajectory entries. §5.3 contract under live judge noise. |
| test_gsar_rho_zero_inflation_visible_live | Property P5 in practice. Live judge produces partition with contradicted claim; ρ=0 must strictly inflate score vs ρ=0.5 when \|X\| > 0. Skips cleanly when judge doesn't surface a contradiction. |
| test_gsar_cross_judge_decision_agreement | Same grounded + ungrounded reports through gpt-4o-mini and gpt-4o; both must land in compatible decision tiers (paper §11 / Table 10 judge-agnostic claim). |
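
The budget-exhaustion contract in the table can be illustrated with a minimal sketch. It assumes that K_max counts recovery passes after one initial judge pass, which is one reading consistent with "K_max=2" yielding "exactly 3 trajectory entries". `LoopResult`, `run_outer_loop`, and the stub judges are hypothetical names for illustration, not the framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    trajectory: List[str] = field(default_factory=list)  # one decision per judge pass
    degraded: bool = False


def run_outer_loop(judge: Callable[[int], str], k_max: int = 2) -> LoopResult:
    """Run one initial judge pass plus up to k_max recovery passes (assumed semantics)."""
    result = LoopResult()
    for attempt in range(k_max + 1):
        decision = judge(attempt)
        result.trajectory.append(decision)
        if decision == "proceed":
            return result  # converged within budget
    result.degraded = True  # budget exhausted without reaching "proceed"
    return result


# Stub judge that never converges: models unsalvageable input with a no-op replan.
exhausted = run_outer_loop(lambda attempt: "replan", k_max=2)

# Stub judge that converges on the second pass: models recovery then proceed.
recovered = run_outer_loop(lambda attempt: "proceed" if attempt == 1 else "replan")
```

Under these assumptions, `exhausted` carries three trajectory entries with `degraded` set, while `recovered` stops after two entries and stays non-degraded.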

Drive-by

Strengthened StructuredOutputGSARJudge default system prompt:

  • Explicit "every atomic claim in exactly one bucket".
  • "Plausibility is NOT grounding" rule with concrete examples (proper nouns, unmapped IDs, unmapped timestamps).
  • Symmetric reminder that evidence-matching claims MUST go to grounded — without it the judge over-corrected and dropped supported claims.

Needed because the live judge was previously labelling clearly-unsupported claims as grounded with synthesis type — a labelling error that masked real grounding violations.

Validation

  • 8/8 GSAR live integration tests pass on gpt-4o-mini in 58-63s. Repeated runs stable.
  • 14/14 combined providers + GSAR live integration suite passes with OPENAI_API_KEY + LOCUS_LIVE_IMAGE=1 + LOCUS_LIVE_SPEECH=1.
  • 3179 unit tests pass, no regressions.
  • hatch run lint clean.

Test plan

  • CI runs the existing test files cleanly (the new tests are gated on OPENAI_API_KEY and skip in vanilla CI).
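
The PR does not show how the gating is implemented. As a rough sketch of the idea, the stdlib `unittest.skipUnless` decorator can skip a live test class when the key is absent; with pytest, `pytest.mark.skipif` plays the same role. The helper `live_enabled` and the test class below are hypothetical.

```python
import os
import unittest
from typing import Mapping


def live_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when a non-empty OPENAI_API_KEY is present."""
    return bool(env.get("OPENAI_API_KEY"))


@unittest.skipUnless(live_enabled(), "live test: set OPENAI_API_KEY to run")
class LiveJudgeSmokeTest(unittest.TestCase):
    def test_key_is_present(self) -> None:
        # A real live test would call the judge here; this only checks the gate.
        self.assertTrue(live_enabled())
```

In a CI job without the key, the class is reported as skipped rather than failed, so the existing test files still run cleanly.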

test(gsar): add 5 substantive end-to-end integration tests

See merge request saas-observ-eng/locus!101
oracle-contributor-agreement (bot) added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Apr 30, 2026
fede-kamel merged commit 457bb60 into main on Apr 30, 2026
1 check passed
fede-kamel added a commit that referenced this pull request May 5, 2026
…ot B

The 22 tutorials that print a per-call timing/token banner used a
hardcoded "[OCI call: ...]" label, which was misleading whenever the
workbench (or any CLI run) was pointed at OpenAI / Anthropic. The
banner is now provider-agnostic ("[model call: ...]").

#16 (agent_handoff) and #17 (orchestrator_pattern) demonstrate the
multi-model load-mixing pattern: triage/commentary roles read
get_model_b() (= the workbench's Model B slot when set, else A),
specialist/escalation roles stay on the primary model. With B
unset, behavior is unchanged from before.

#16 also drops a few redundant live receive_handoff calls — the
same pattern is exercised live in Parts 1, 4, and 7, so the demos
in 2/3/5/6/8 now print the data structures without a separate LLM
round-trip. Same pedagogical value, runtime cut from ~6 minutes to
under 2.
fede-kamel added a commit that referenced this pull request May 5, 2026
* chore(workbench): rename sandbox/ → workbench/ end-to-end

Single name for the playground app: directory, npm package names,
docker COPY paths, devcontainer scripts, docs, localStorage key
(locus.sandbox.theme → locus.workbench.theme), env var
(LOCUS_SANDBOX_REFLEXION → LOCUS_WORKBENCH_REFLEXION), e2e spec
filename. Generic security-context "sandboxing" wording in
SECURITY.md and the security review doc, and the TestPyPI
"sandbox" reference in the release workflow, are intentionally
left alone (different meaning).

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* fix(sdk): robustness for async runs and parallel pipelines

- Drain httpx clients inside the run loop so consecutive run_sync()
  calls don't trip "RuntimeError: Event loop is closed" during the
  prior client's TLS teardown (agent.py + AnthropicModel.close).
- ParallelPipeline.run() now uses gather(return_exceptions=True) and
  surfaces per-agent failures in error/outputs instead of collapsing
  the whole result to outputs=[] with a generic message.
- Pin explicit timeout + max_retries on OCIOpenAIModel's AsyncOpenAI
  + httpx clients so a stuck request can no longer hang gather()
  for the openai SDK's ~10-min default.
- Bump default request_timeout 60s → 120s on Openai/AnthropicConfig
  to give reasoning + tool-heavy turns enough headroom.
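
The `gather(return_exceptions=True)` change described above can be sketched as follows. This is an illustration of the general asyncio pattern, not the actual `ParallelPipeline.run()` code; `run_parallel`, `ok`, and `boom` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, List, Tuple


async def run_parallel(coros: List[Awaitable[Any]]) -> Tuple[List[Any], List[str]]:
    """Await all coroutines; keep successful outputs and report failures separately."""
    results = await asyncio.gather(*coros, return_exceptions=True)
    outputs = [r for r in results if not isinstance(r, BaseException)]
    errors = [repr(r) for r in results if isinstance(r, BaseException)]
    return outputs, errors


async def ok(value: str) -> str:
    return value


async def boom() -> str:
    raise RuntimeError("agent failed")


async def demo() -> Tuple[List[Any], List[str]]:
    # One agent fails; the other two outputs still survive.
    return await run_parallel([ok("a"), boom(), ok("b")])


outputs, errors = asyncio.run(demo())
```

Without `return_exceptions=True`, the first exception would propagate out of `gather()` and the caller would lose the results of the agents that succeeded.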

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* feat(workbench): add Model A/B/C slots in Provider settings

Settings modal now shows three model dropdowns sharing the active
provider's API key. A is required; B and C are optional and fall
back to A when blank. The backend forwards LOCUS_MODEL_ID,
LOCUS_MODEL_ID_B, LOCUS_MODEL_ID_C to the subprocess.

examples/config.py grows two helpers — get_model_b() and
get_model_c() — that read the slot env vars and fall through to the
primary slot when unset, so tutorials that mix models still work in
plain CLI runs where only LOCUS_MODEL_ID is configured.
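
A minimal sketch of the fall-through behaviour, assuming the helpers return model ID strings (the real helpers in examples/config.py may return something else, such as configured model objects). The `env` parameter and `get_model_id` are additions for illustration only.

```python
import os
from typing import Mapping, Optional


def get_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Primary slot (Model A)."""
    env = os.environ if env is None else env
    return env.get("LOCUS_MODEL_ID", "")


def get_model_b(env: Optional[Mapping[str, str]] = None) -> str:
    """Slot B; falls back to the primary slot when unset or blank."""
    env = os.environ if env is None else env
    return env.get("LOCUS_MODEL_ID_B") or get_model_id(env)


def get_model_c(env: Optional[Mapping[str, str]] = None) -> str:
    """Slot C; falls back to the primary slot when unset or blank."""
    env = os.environ if env is None else env
    return env.get("LOCUS_MODEL_ID_C") or get_model_id(env)
```

Using `or` rather than a default argument to `get()` means an empty string, such as a slot left blank in the settings modal, also falls back to slot A.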

Lets multi-agent tutorials demonstrate realistic load mixing —
e.g. a fast model for triage/routing alongside a deep model for
specialist work — without bespoke wiring.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* chore(tutorials): rename "OCI call" → "model call" + wire 16/17 to slot B

The 22 tutorials that print a per-call timing/token banner used a
hardcoded "[OCI call: ...]" label, which was misleading whenever the
workbench (or any CLI run) was pointed at OpenAI / Anthropic. The
banner is now provider-agnostic ("[model call: ...]").

#16 (agent_handoff) and #17 (orchestrator_pattern) demonstrate the
multi-model load-mixing pattern: triage/commentary roles read
get_model_b() (= the workbench's Model B slot when set, else A),
specialist/escalation roles stay on the primary model. With B
unset, behavior is unchanged from before.

#16 also drops a few redundant live receive_handoff calls — the
same pattern is exercised live in Parts 1, 4, and 7, so the demos
in 2/3/5/6/8 now print the data structures without a separate LLM
round-trip. Same pedagogical value, runtime cut from ~6 minutes to
under 2.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* test(workbench/e2e): add per-tutorial sweep specs for each provider

Three sister Playwright specs that fan a single test out per
tutorial in the workbench catalog (skipping needs_stdin and
OCI-only ones). Each test spawns its own browser context,
configures a provider via the Settings modal, then drives one
tutorial through the UI and asserts exit 0.

Catalog is fetched synchronously via curl at module load so test()
calls can be generated at the top level — avoids the top-level-
await dance under CommonJS.

All three honour Model A/B/C env vars (e.g. ANTHROPIC_MODEL_B,
OPENAI_MODEL_B, OCI_MODEL_B) so the slot-B speedup we wired into
tutorials 16/17 is exercised end-to-end.

  npx playwright test tests/all-openai.spec.ts --workers=3

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* feat(sdk): re-export multi-agent + composition primitives at top level

Surface Orchestrator, Specialist, StateGraph, Send, Handoff,
HandoffContext, HandoffReason, RoutingDecision, SequentialPipeline,
ParallelPipeline, and LoopAgent from ``locus`` directly so users
can do ``from locus import Orchestrator, Specialist`` instead of
hunting through ``locus.multiagent.*`` and ``locus.agent.composition``.

These are first-class features of the SDK and were previously only
discoverable via the implementation modules. Lazy-loaded so import
cost stays put.
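
The commit does not say how the lazy loading is done. One common approach is a module-level `__getattr__` (PEP 562) that imports the implementing module on first access. The sketch below uses stdlib modules as stand-ins for the SDK's implementation modules; `make_lazy_getattr` and the export table are hypothetical.

```python
import importlib
from typing import Any, Callable, Dict, Tuple


def make_lazy_getattr(exports: Dict[str, Tuple[str, str]]) -> Callable[[str], Any]:
    """Build a PEP 562 style __getattr__ that imports on first attribute access."""

    def lazy_getattr(name: str) -> Any:
        try:
            module_name, attr = exports[name]
        except KeyError:
            raise AttributeError(f"module has no attribute {name!r}") from None
        # import_module is cached in sys.modules, so repeated lookups stay cheap.
        return getattr(importlib.import_module(module_name), attr)

    return lazy_getattr


# In a package's __init__.py this would be assigned to the name __getattr__.
# Stdlib stand-ins are used here so the sketch runs anywhere.
lazy_getattr = make_lazy_getattr(
    {
        "OrderedDict": ("collections", "OrderedDict"),
        "dumps": ("json", "dumps"),
    }
)
```

With this pattern, importing the top-level package does not import the implementation modules until one of the exported names is actually used.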

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* chore(examples/17): trim AI commentary calls to fit per-tutorial budget

Tutorial 17's _llm_call() helper fired 6 separate "AI commentary"
sidebar prompts in addition to 4 specialist/orchestrator runs. Most
of those commentary calls just narrated what the surrounding
typed object (Orchestrator, RoutingDecision, etc.) already
demonstrates. Drops 5 of 6 — keeps Part 2's commentary call as the
helper's exemplar.

Net live LLM-firing operations: 5 (was ~10). Tutorial finishes
inside the workbench's per-tutorial budget under parallel sweep
load (was timing out at 10 min on OCI v1 with workers=2).

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

* chore(coverage): update baseline for AnthropicModel client lifecycle

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>

---------

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
fede-kamel deleted the tests/gsar-integration-expansion-github branch on May 13, 2026 at 04:29