test(gsar): add 5 substantive end-to-end integration tests #17
Merged
test(gsar): add 5 substantive end-to-end integration tests
See merge request saas-observ-eng/locus!101
fede-kamel added a commit that referenced this pull request on May 5, 2026
…ot B
The 22 tutorials that print a per-call timing/token banner used a
hardcoded "[OCI call: ...]" label, which was misleading whenever the
workbench (or any CLI run) was pointed at OpenAI / Anthropic. The
banner is now provider-agnostic ("[model call: ...]").
#16 (agent_handoff) and #17 (orchestrator_pattern) demonstrate the
multi-model load-mixing pattern: triage/commentary roles read
get_model_b() (= the workbench's Model B slot when set, else A),
specialist/escalation roles stay on the primary model. With B
unset, behavior is unchanged from before.
#16 also drops a few redundant live receive_handoff calls — the
same pattern is exercised live in Parts 1, 4, and 7, so the demos
in 2/3/5/6/8 now print the data structures without a separate LLM
round-trip. Same pedagogical value, runtime cut from ~6 minutes to
under 2.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
fede-kamel added a commit that referenced this pull request on May 5, 2026
* chore(workbench): rename sandbox/ → workbench/ end-to-end
Single name for the playground app: directory, npm package names,
docker COPY paths, devcontainer scripts, docs, localStorage key
(locus.sandbox.theme → locus.workbench.theme), env var
(LOCUS_SANDBOX_REFLEXION → LOCUS_WORKBENCH_REFLEXION), e2e spec
filename. Generic security-context "sandboxing" wording in
SECURITY.md and the security review doc, and the TestPyPI
"sandbox" reference in the release workflow, are intentionally
left alone (different meaning).
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
* fix(sdk): robustness for async runs and parallel pipelines
- Drain httpx clients inside the run loop so consecutive run_sync()
calls don't trip "RuntimeError: Event loop is closed" during the
prior client's TLS teardown (agent.py + AnthropicModel.close).
- ParallelPipeline.run() now uses gather(return_exceptions=True) and
surfaces per-agent failures in error/outputs instead of collapsing
the whole result to outputs=[] with a generic message.
- Pin explicit timeout + max_retries on OCIOpenAIModel's AsyncOpenAI
+ httpx clients so a stuck request can no longer hang gather()
for the openai SDK's ~10-min default.
- Bump default request_timeout 60s → 120s on Openai/AnthropicConfig
to give reasoning + tool-heavy turns enough headroom.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
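A minimal sketch of the gather(return_exceptions=True) pattern named in the ParallelPipeline bullet above. This is not the actual ParallelPipeline code: run_parallel, the dict-shaped result, and the two toy agents are hypothetical names used only for illustration.

```python
import asyncio


async def run_parallel(agents, prompt):
    """Run every agent concurrently and keep per-agent failures visible.

    Hypothetical helper, not the locus ParallelPipeline API.
    """
    results = await asyncio.gather(
        *(agent(prompt) for agent in agents),
        return_exceptions=True,  # exceptions come back as values, not raised
    )
    outputs, errors = [], []
    for index, result in enumerate(results):
        if isinstance(result, BaseException):
            errors.append(f"agent[{index}]: {type(result).__name__}: {result}")
            outputs.append(None)  # placeholder keeps positions aligned
        else:
            outputs.append(result)
    return {"outputs": outputs, "error": "; ".join(errors) or None}


async def shout(prompt):
    return prompt.upper()


async def broken(prompt):
    raise ValueError("boom")


# One agent succeeds, one fails; the success is still reported.
result = asyncio.run(run_parallel([shout, broken], "hi"))
```

Without return_exceptions=True, the first failure would propagate out of gather() and the successful agent's output would be lost to the caller, which matches the "outputs=[] with a generic message" symptom the commit describes.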
* feat(workbench): add Model A/B/C slots in Provider settings
Settings modal now shows three model dropdowns sharing the active
provider's API key. A is required; B and C are optional and fall
back to A when blank. The backend forwards LOCUS_MODEL_ID,
LOCUS_MODEL_ID_B, LOCUS_MODEL_ID_C to the subprocess.
examples/config.py grows two helpers — get_model_b() and
get_model_c() — that read the slot env vars and fall through to the
primary slot when unset, so tutorials that mix models still work in
plain CLI runs where only LOCUS_MODEL_ID is configured.
Lets multi-agent tutorials demonstrate realistic load mixing —
e.g. a fast model for triage/routing alongside a deep model for
specialist work — without bespoke wiring.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
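The slot fallback described above could look roughly like the sketch below. Only the three environment variable names come from the commit message; resolve_model_id, its env parameter, and the model IDs are assumptions for illustration, and the real get_model_b() / get_model_c() helpers in examples/config.py may return model objects rather than ID strings.

```python
import os

# Environment variable names taken from the commit message.
SLOT_VARS = {
    "a": "LOCUS_MODEL_ID",
    "b": "LOCUS_MODEL_ID_B",
    "c": "LOCUS_MODEL_ID_C",
}


def resolve_model_id(slot, env=None):
    """Return the model ID for a slot, falling back to slot A when unset.

    Hypothetical helper; `env` is injectable so the logic can be tried
    without touching the real process environment.
    """
    env = os.environ if env is None else env
    primary = (env.get(SLOT_VARS["a"]) or "").strip()
    if not primary:
        raise RuntimeError("LOCUS_MODEL_ID must be set (slot A is required)")
    chosen = (env.get(SLOT_VARS[slot]) or "").strip()
    return chosen or primary  # a blank B or C falls through to A


# Placeholder model IDs, not real model names.
cli_env = {"LOCUS_MODEL_ID": "deep-model"}
mixed_env = {"LOCUS_MODEL_ID": "deep-model", "LOCUS_MODEL_ID_B": "fast-model"}

triage_cli = resolve_model_id("b", cli_env)      # only A configured
triage_mixed = resolve_model_id("b", mixed_env)  # B configured
```

Under this sketch a plain CLI run with only LOCUS_MODEL_ID set keeps every role on the primary model, while setting slot B moves only the roles that ask for it.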
* chore(tutorials): rename "OCI call" → "model call" + wire 16/17 to slot B
The 22 tutorials that print a per-call timing/token banner used a
hardcoded "[OCI call: ...]" label, which was misleading whenever the
workbench (or any CLI run) was pointed at OpenAI / Anthropic. The
banner is now provider-agnostic ("[model call: ...]").
#16 (agent_handoff) and #17 (orchestrator_pattern) demonstrate the
multi-model load-mixing pattern: triage/commentary roles read
get_model_b() (= the workbench's Model B slot when set, else A),
specialist/escalation roles stay on the primary model. With B
unset, behavior is unchanged from before.
#16 also drops a few redundant live receive_handoff calls — the
same pattern is exercised live in Parts 1, 4, and 7, so the demos
in 2/3/5/6/8 now print the data structures without a separate LLM
round-trip. Same pedagogical value, runtime cut from ~6 minutes to
under 2.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
* test(workbench/e2e): add per-tutorial sweep specs for each provider
Three sister Playwright specs that fan a single test out per
tutorial in the workbench catalog (skipping needs_stdin and
OCI-only ones). Each test spawns its own browser context,
configures a provider via the Settings modal, then drives one
tutorial through the UI and asserts exit 0.
Catalog is fetched synchronously via curl at module load so test()
calls can be generated at the top level — avoids the top-level-
await dance under CommonJS.
All three honour Model A/B/C env vars (e.g. ANTHROPIC_MODEL_B,
OPENAI_MODEL_B, OCI_MODEL_B) so the slot-B speedup we wired into
tutorials 16/17 is exercised end-to-end.
npx playwright test tests/all-openai.spec.ts --workers=3
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
* feat(sdk): re-export multi-agent + composition primitives at top level
Surface Orchestrator, Specialist, StateGraph, Send, Handoff,
HandoffContext, HandoffReason, RoutingDecision, SequentialPipeline,
ParallelPipeline, and LoopAgent from ``locus`` directly so users
can do ``from locus import Orchestrator, Specialist`` instead of
hunting through ``locus.multiagent.*`` and ``locus.agent.composition``.
These are first-class features of the SDK and were previously only
discoverable via the implementation modules. Lazy-loaded so import
cost stays put.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
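One common way to get lazy top-level re-exports is a module-level `__getattr__` (PEP 562). The commit does not say which mechanism locus uses, so the sketch below is an assumption: make_lazy_package and the lazy_demo package are hypothetical, and standard-library classes stand in for Orchestrator, Specialist, and the other re-exported names.

```python
import importlib
import sys
import types

# Hypothetical map: public name -> (implementation module, attribute).
# Standard-library targets stand in for the SDK's implementation modules.
_LAZY_EXPORTS = {
    "JSONDecoder": ("json.decoder", "JSONDecoder"),
    "Fraction": ("fractions", "Fraction"),
}


def make_lazy_package(name, exports):
    """Build a module whose listed names are imported on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        try:
            module_path, target = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_path), target)
        setattr(pkg, attr, value)  # cache so later lookups skip __getattr__
        return value

    pkg.__getattr__ = __getattr__  # PEP 562 hook, found in the module dict
    pkg.__all__ = sorted(exports)
    sys.modules[name] = pkg
    return pkg


demo = make_lazy_package("lazy_demo", _LAZY_EXPORTS)
loaded_before = "Fraction" in vars(demo)  # nothing resolved yet

from lazy_demo import Fraction  # triggers the lazy lookup

loaded_after = "Fraction" in vars(demo)  # now cached on the package
```

In a real package the same `__getattr__` function would sit directly in `__init__.py`, so `import locus` stays cheap until one of the re-exported names is actually used.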
* chore(examples/17): trim AI commentary calls to fit per-tutorial budget
Tutorial 17's _llm_call() helper fired 6 separate "AI commentary"
sidebar prompts in addition to 4 specialist/orchestrator runs. Most
of those commentary calls just narrated what the surrounding
typed object (Orchestrator, RoutingDecision, etc.) already
demonstrates. Drops 5 of 6 — keeps Part 2's commentary call as the
helper's exemplar.
Net live LLM-firing operations: 5 (was ~10). Tutorial finishes
inside the workbench's per-tutorial budget under parallel sweep
load (was timing out at 10 min on OCI v1 with workers=2).
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
* chore(coverage): update baseline for AnthropicModel client lifecycle
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
---------
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Follow-up to #8. The original GSAR integration suite had 3 tests, all probing only the first iteration. This adds 5 substantive tests covering the outer-loop dynamics — the load-bearing claim of the framework.
New live integration tests
- test_gsar_recovery_then_proceed_live_cycle
- test_gsar_replan_then_proceed_live_cycle
- test_gsar_budget_exhaustion_sets_degraded_live: degraded=True after exactly 3 trajectory entries. §5.3 contract under live judge noise.
- test_gsar_rho_zero_inflation_visible_live: ρ=0 must strictly inflate score vs ρ=0.5 when |X| > 0. Skips cleanly when judge doesn't surface a contradiction.
- test_gsar_cross_judge_decision_agreement: gpt-4o-mini and gpt-4o; both must land in compatible decision tiers (paper §11 / Table 10 judge-agnostic claim).
Drive-by
Strengthened StructuredOutputGSARJudge default system prompt:
Needed because the live judge was previously labelling clearly-unsupported claims as grounded with synthesis type, a labelling error that masked real grounding violations.
Validation
- OPENAI_API_KEY + LOCUS_LIVE_IMAGE=1 + LOCUS_LIVE_SPEECH=1.
- hatch run lint clean.
Test plan
- OPENAI_API_KEY and skip in vanilla CI).
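The test plan gates the live tests on OPENAI_API_KEY so they skip in vanilla CI. The PR text does not show how that gate is implemented; the sketch below is one possible shape using only the standard library. requires_env and LiveGateExample are hypothetical names, and a pytest suite would typically use pytest.mark.skipif for the same effect.

```python
import os
import unittest


def requires_env(*names):
    """Skip the decorated test unless every named variable is set and non-empty.

    Hypothetical helper; the check runs once, when the test is defined.
    """
    missing = [name for name in names if not os.environ.get(name)]
    return unittest.skipIf(bool(missing), "missing env: " + ", ".join(missing))


class LiveGateExample(unittest.TestCase):
    @requires_env("OPENAI_API_KEY")
    def test_live_judge_placeholder(self):
        # A real test would call the live judge here.
        self.assertTrue(os.environ["OPENAI_API_KEY"])
```

A skip is reported separately from a pass or a failure, so a CI run without credentials stays green while still showing that the live tests were not exercised.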