feat(tests): L0 live-wire + L1 catalog cross-validate (P1)#1191
Conversation
L0 follow-up of Q00#1170 — wires the ``OUROBOROS_RUN_CANONICAL=1`` path to programmatically invoke the ``ouroboros_auto`` MCP tool against a canonical scenario, and adds the L1 catalog cross-validation tests that were deferred from L0-a (Q00#1174) until L1-a (Q00#1173) was on main. ## Summary What this PR adds: 1. **L1 catalog cross-validate** (always runs, hermetic): - ``test_scenario_domain_class_resolves_in_l1_catalog`` — pins ``expected.yaml`` ``domain_class`` to a real :class:`TaskClass` value. - ``test_scenario_completion_mode_matches_l1_catalog_default`` — pins ``completion_mode`` to the L1 catalog default for the scenario's class. - ``test_scenario_runtime_probe_kinds_subset_of_l1_catalog`` — pins ``runtime_probe_kinds`` to a subset of the catalog's declared probes (no ad-hoc sideways). 2. **Live-run wiring** (opt-in via env var): when ``OUROBOROS_RUN_CANONICAL=1`` is set, the test programmatically invokes ``AutoHandler.handle(...)`` with a goal/cwd/skip_run/pipeline_timeout_seconds arguments dict built from the scenario fixture. Result-shape assertions surface MCP misconfiguration with the underlying error message. ## What lands - ``tests/canonical/test_canonical.py``: - Three new L1-cross-validation tests (parametrized per scenario). - ``_invoke_ouroboros_auto`` helper that builds the arguments dict and calls ``AutoHandler.handle``. - ``test_scenario_live_run_or_skip`` upgraded from a permanent ``pytest.skip`` to a real invocation gated on the env var. - Header imports for ``Path`` / ``Any`` / ``asyncio`` (mypy + ruff clean). ## Test plan - [x] ``uv run pytest tests/canonical/ -v`` (hermetic default) → 9 passed, 1 skipped (live-run gate). - [x] ``uv run ruff check tests/canonical/test_canonical.py`` → clean. - [x] ``uv run ruff format tests/canonical/test_canonical.py`` → clean. - [x] ``uv run mypy tests/canonical/test_canonical.py`` → clean. ## What is NOT in this PR - Additional scenario fixtures (``webhook-receiver``, ``vertical-slice-refactor``, ``2d-kart-racer``) — fixtures are maintainer-curated example data, not substrate PRs. - Probe runner wiring for the canonical workdir — handled by ``probe_runner`` argument on the AutoPipeline; L0 live-run exercises the *end-to-end pipeline*, not the probe schema. - Recorded-replay layer — deferred per Q00#1170 minimal-substrate audit. ## References - Q00#1157 — Meta SSOT for ``ooo auto`` (L0 lane body). - Q00#1170 — L0 design issue (minimal harness). - Q00#1174 — L0-a harness skeleton (this PR's parent). - Q00#1173 — L1-a task-class catalog (consumed via cross-validation tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
4d61b52for PR #1191
Review record:
c4c29c21-3e1e-4137-91c6-23400d9d9d8d
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
Unable to assess architecture or changed behavior because the required local files could not be read in this execution environment.
Policy Notes
- Omitted 1 finding(s) that referenced files outside the current PR changed-files scope.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
The L0 live path should fail when ouroboros_auto only returns an MCP-level OK while the pipeline itself is failed/blocked or stops at an unverified product handoff. Tie product-complete scenarios to complete_product and assert the terminal metadata the SSOT claims.\n\nConstraint: L0 remains a manual, opt-in pytest harness with no replay or CI/live-cost substrate.\nRejected: Adding replay, nightly runs, or new scenario fixtures | Q00#1170 explicitly excludes operational substrate until evidence demands it.\nConfidence: high\nScope-risk: narrow\nDirective: Keep canonical live assertions tied to documented expected.yaml semantics, not just handler transport success.\nTested: uv run ruff check tests/canonical/test_canonical.py; uv run ruff format tests/canonical/test_canonical.py --check; uv run mypy tests/canonical/test_canonical.py; uv run pytest tests/canonical/ -v\nNot-tested: OUROBOROS_RUN_CANONICAL=1 live LLM run
|
Readiness update after independent AgentOS/SSOT re-review: This PR is a narrow L0 follow-up for the What changed in the latest push:
Why this is merge-ready:
I consider this PR appropriate to merge once GitHub checks are green. |
PR Review SummaryVerdictApprove Scope Reviewed
Blocking IssuesNone. WarningsNone. Mutation-Test Thinking
Complexity / CRAP-style Risk
Test Quality Assessment (6/7)
Security / Operational RiskNone. The PR does not add default live execution, CI scheduling, new credentials, or persistence beyond the per-run temp workdir/store. Looks Good
Final RecommendationApprove. The PR is narrow, SSOT-aligned, and merge-ready once CI is green. |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
3570686for PR #1191
Review record:
2636e42c-d844-41ed-8d0a-f7780f5ae228
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
Unable to assess architecture or changed behavior because the source snapshot and diff were inaccessible in this environment.
Policy Notes
- Omitted 1 finding(s) that referenced files outside the current PR changed-files scope.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
PR #1191
Branch: feat/pP1 | 1 files, +164/-21 | CI: Bridge TypeScript pass 11s https://github.com/Q00/ouroboros/actions/runs/26318488944/job/77482637405
Scope: architecture-level
HEAD checked: 3570686ea47e475b74e7528c1b64383ee0a909ee
What Improved
- The live canonical harness now passes
complete_product=Truefor product-complete scenarios and checks more than MCP transport success. - The hermetic canonical checks validate
domain_class,completion_mode, and runtime probe kinds against the L1 catalog.
Issue #N/A Requirements
| Requirement | Status |
|---|---|
| Live canonical product-complete scenarios must prove verified terminal product completion, not just pipeline completion. | Blocked by Finding 1. |
| Hermetic L1 catalog cross-validation for canonical fixture metadata. | Satisfied. |
Opt-in live run wiring through AutoHandler.handle(...). |
Partially satisfied; invocation is wired, but terminal assertion is incomplete. |
| Meaningful coverage for newly added live assertion logic. | Insufficient for plugin/external handoff boundary. |
Prior Findings Status
| Prior Finding | Status |
|---|---|
| Prior review context | MODIFIED — Prior concern is maintained, modified: the harness no longer stops at MCP-level OK or simple status="complete" for normal handoff-only completions, but it still does not close the affected product-completion boundary because plugin/external Ralph handoffs can report status="complete" without verified product completion and still pass this test. |
Blockers
| # | File:Line | Severity | Confidence | Finding |
|---|---|---|---|---|
| 1 | tests/canonical/test_canonical.py:263 | High | 90% | The product-complete live assertion still false-passes external OpenCode/Ralph plugin handoffs. It only rejects product_status == "not_verified_complete", but MCP metadata only sets that field for plain run-handoff-only completions (src/ouroboros/mcp/tools/auto_handler.py:1200), while plugin Ralph completion is a known external/pending product state (src/ouroboros/cli/commands/auto.py:684, src/ouroboros/cli/commands/auto.py:710) and current pipeline returns status="complete" with ralph_dispatch_mode="plugin" for that path (src/ouroboros/auto/pipeline.py:1308). Because meta.get("product_status") is then absent, this test passes even though the product is not verified complete, so the new canonical live gate does not prove the claimed terminal product completion contract. |
Follow-ups
| # | File:Line | Priority | Confidence | Suggestion |
|---|---|---|---|---|
| — | — | — | — | None. |
Test Coverage
- Ran
uv run pytest tests/canonical/ -q: 9 passed, 1 skipped. - Ran
uv run pytest tests/unit/auto/test_surface.py::test_auto_handler_meta_and_text_distinguish_handoff_from_product_completion -q: passed. - Ran
uv run pytest tests/unit/auto/test_pipeline_ralph_handoff.py -q -k "plugin and delegation": passed. - The opt-in live LLM path was not run; the blocker is in the assertion contract itself.
Design / Roadmap Gate
Fails affected-boundary review. The PR changed the canonical live acceptance surface, so the relevant boundary is the consumer-visible MCP result contract, not only the changed test lines. Current HEAD has at least three product states sharing status="complete": verified non-plugin Ralph completion, plain run handoff with product_status="not_verified_complete", and plugin/external Ralph delegation with ralph_dispatch_mode="plugin". The new assertion distinguishes only the second state, leaving the plugin/external state accepted as product-complete even though CLI code explicitly labels it not verified complete.
Merge Recommendation
Retrospective recommendation: follow up with a fix before relying on this canonical live gate. For completion_mode: product_complete, assert a positive verified-product signal, such as non-plugin completed Ralph/evaluator evidence, and explicitly fail ralph_dispatch_mode == "plugin" or any external/pending handoff shape.
Review-Metadata:
verdict: REQUEST_CHANGES
github_event: COMMENT
review_kind: post_merge_audit
merge_eligible: false
head_sha: 3570686
source_read_ok: true
diff_read_ok: true
blocking_count: 0
Summary
L0 follow-up of #1170 — wires the `OUROBOROS_RUN_CANONICAL=1` path to programmatically invoke the `ouroboros_auto` MCP tool against a canonical scenario, and adds the L1 catalog cross-validation tests deferred from L0-a (#1174) until L1-a (#1173) was on main.
What lands
What is NOT in this PR
Test plan
Refs #1157 (L0 lane), #1170 (L0 design), #1174 (L0-a skeleton), #1173 (L1-a catalog).