feat(tests): L0 live-wire + L1 catalog cross-validate (P1) by shaun0927 · Pull Request #1191 · Q00/ouroboros

shaun0927 · 2026-05-22T14:36:25Z

Summary

L0 follow-up of #1170 — wires the `OUROBOROS_RUN_CANONICAL=1` path to programmatically invoke the `ouroboros_auto` MCP tool against a canonical scenario, and adds the L1 catalog cross-validation tests deferred from L0-a (#1174) until L1-a (#1173) was on main.

What lands

L1 catalog cross-validate (always runs, hermetic):
- `test_scenario_domain_class_resolves_in_l1_catalog` — pins `expected.yaml` `domain_class` to a real `TaskClass` value.
- `test_scenario_completion_mode_matches_l1_catalog_default` — pins `completion_mode` to the L1 catalog default for the scenario's class.
- `test_scenario_runtime_probe_kinds_subset_of_l1_catalog` — pins `runtime_probe_kinds` to a subset of the catalog's declared probes.
Live-run wiring (opt-in via env var): when `OUROBOROS_RUN_CANONICAL=1` is set, the test programmatically invokes `AutoHandler.handle(...)` with a goal/cwd/skip_run/pipeline_timeout_seconds arguments dict built from the scenario fixture. Result assertions surface MCP misconfiguration with the underlying error message.

What is NOT in this PR

Additional scenario fixtures (`webhook-receiver`, `vertical-slice-refactor`, `2d-kart-racer`) — fixtures are maintainer-curated example data, not substrate PRs.
Probe runner wiring for the canonical workdir.
Recorded-replay layer — deferred per Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170 minimal-substrate audit.

Test plan

`uv run pytest tests/canonical/ -v` (hermetic default) → 9 passed, 1 skipped (live-run gate).
`uv run ruff check tests/canonical/test_canonical.py` → clean.
`uv run ruff format tests/canonical/test_canonical.py` → clean.
`uv run mypy tests/canonical/test_canonical.py` → clean.

Refs #1157 (L0 lane), #1170 (L0 design), #1174 (L0-a skeleton), #1173 (L1-a catalog).

L0 follow-up of Q00#1170 — wires the ``OUROBOROS_RUN_CANONICAL=1`` path to programmatically invoke the ``ouroboros_auto`` MCP tool against a canonical scenario, and adds the L1 catalog cross-validation tests that were deferred from L0-a (Q00#1174) until L1-a (Q00#1173) was on main. ## Summary What this PR adds: 1. **L1 catalog cross-validate** (always runs, hermetic): - ``test_scenario_domain_class_resolves_in_l1_catalog`` — pins ``expected.yaml`` ``domain_class`` to a real :class:`TaskClass` value. - ``test_scenario_completion_mode_matches_l1_catalog_default`` — pins ``completion_mode`` to the L1 catalog default for the scenario's class. - ``test_scenario_runtime_probe_kinds_subset_of_l1_catalog`` — pins ``runtime_probe_kinds`` to a subset of the catalog's declared probes (no ad-hoc sideways). 2. **Live-run wiring** (opt-in via env var): when ``OUROBOROS_RUN_CANONICAL=1`` is set, the test programmatically invokes ``AutoHandler.handle(...)`` with a goal/cwd/skip_run/pipeline_timeout_seconds arguments dict built from the scenario fixture. Result-shape assertions surface MCP misconfiguration with the underlying error message. ## What lands - ``tests/canonical/test_canonical.py``: - Three new L1-cross-validation tests (parametrized per scenario). - ``_invoke_ouroboros_auto`` helper that builds the arguments dict and calls ``AutoHandler.handle``. - ``test_scenario_live_run_or_skip`` upgraded from a permanent ``pytest.skip`` to a real invocation gated on the env var. - Header imports for ``Path`` / ``Any`` / ``asyncio`` (mypy + ruff clean). ## Test plan - [x] ``uv run pytest tests/canonical/ -v`` (hermetic default) → 9 passed, 1 skipped (live-run gate). - [x] ``uv run ruff check tests/canonical/test_canonical.py`` → clean. - [x] ``uv run ruff format tests/canonical/test_canonical.py`` → clean. - [x] ``uv run mypy tests/canonical/test_canonical.py`` → clean. ## What is NOT in this PR - Additional scenario fixtures (``webhook-receiver``, ``vertical-slice-refactor``, ``2d-kart-racer``) — fixtures are maintainer-curated example data, not substrate PRs. - Probe runner wiring for the canonical workdir — handled by ``probe_runner`` argument on the AutoPipeline; L0 live-run exercises the *end-to-end pipeline*, not the probe schema. - Recorded-replay layer — deferred per Q00#1170 minimal-substrate audit. ## References - Q00#1157 — Meta SSOT for ``ooo auto`` (L0 lane body). - Q00#1170 — L0 design issue (minimal harness). - Q00#1174 — L0-a harness skeleton (this PR's parent). - Q00#1173 — L1-a task-class catalog (consumed via cross-validation tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ouroboros-agent

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 4d61b52 for PR #1191

Review record: c4c29c21-3e1e-4137-91c6-23400d9d9d8d

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

None.

Design Notes

Unable to assess architecture or changed behavior because the required local files could not be read in this execution environment.

Policy Notes

Omitted 1 finding(s) that referenced files outside the current PR changed-files scope.

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Reviewed by ouroboros-agent[bot] via Codex deep analysis

The L0 live path should fail when ouroboros_auto only returns an MCP-level OK while the pipeline itself is failed/blocked or stops at an unverified product handoff. Tie product-complete scenarios to complete_product and assert the terminal metadata the SSOT claims.\n\nConstraint: L0 remains a manual, opt-in pytest harness with no replay or CI/live-cost substrate.\nRejected: Adding replay, nightly runs, or new scenario fixtures | Q00#1170 explicitly excludes operational substrate until evidence demands it.\nConfidence: high\nScope-risk: narrow\nDirective: Keep canonical live assertions tied to documented expected.yaml semantics, not just handler transport success.\nTested: uv run ruff check tests/canonical/test_canonical.py; uv run ruff format tests/canonical/test_canonical.py --check; uv run mypy tests/canonical/test_canonical.py; uv run pytest tests/canonical/ -v\nNot-tested: OUROBOROS_RUN_CANONICAL=1 live LLM run

shaun0927 · 2026-05-23T00:35:32Z

Readiness update after independent AgentOS/SSOT re-review:

This PR is a narrow L0 follow-up for the ooo auto canonical acceptance harness. It does not add replay infrastructure, nightly/live CI, cost budgeting, or new scenario-matrix operations; those remain deliberately out of scope per the #1170/#961 minimal-substrate direction.

What changed in the latest push:

The opt-in live canonical path now asks ouroboros_auto for complete_product when the scenario declares completion_mode: product_complete.
The live test no longer treats transport-level MCP success as enough. It now fails if the tool result is failed/blocked, if terminal metadata is not status=complete, or if a product-complete scenario stops at an unverified run handoff.
The test emits a concise CANONICAL <slug>: ... line for maintainer copy/paste evidence after a live run.

Why this is merge-ready:

Alignment: the PR now matches the AgentOS SSOT direction: a manual, opt-in, concrete acceptance gate rather than operational substrate.
Scope control: still one test file, no new framework, no scheduler, no replay layer, no additional fixture surface.
Risk posture: the new assertions make the live path prove the terminal semantics it claims instead of merely proving that AutoHandler.handle(...) returned an OK wrapper.
Local validation passed:
- uv run ruff check tests/canonical/test_canonical.py
- uv run ruff format tests/canonical/test_canonical.py --check
- uv run mypy tests/canonical/test_canonical.py
- uv run pytest tests/canonical/ -v → 9 passed, 1 skipped (live gate disabled by default)

I consider this PR appropriate to merge once GitHub checks are green.

shaun0927 · 2026-05-23T00:35:33Z

PR Review Summary

Verdict

Approve

Scope Reviewed

PR intent: L0 follow-up for the canonical ooo auto acceptance harness: add L1 catalog cross-validation and wire the opt-in live canonical run path.
Main changed areas: tests/canonical/test_canonical.py only.
Tests reviewed: canonical shape tests, L1 catalog contract tests, opt-in live-run assertions.
Checks considered: local ruff, format check, mypy, and uv run pytest tests/canonical/ -v.

Blocking Issues

None.

Warnings

None.

Mutation-Test Thinking

Likely mutants that should be killed:
- Changing a fixture domain_class to a value outside TaskClass should fail the L1 catalog resolution test.
- Changing a scenario completion_mode away from the catalog default should fail the catalog-default test.
- Returning a failed/blocked MCP tool result from the live path should fail the live-run assertion.
- Stopping a product-complete scenario at an unverified handoff should fail the new product_status assertion.
Mutants current tests may not catch:
- The expensive live LLM path is intentionally skipped unless OUROBOROS_RUN_CANONICAL=1 is set.
Additional tests recommended:
- None for this PR. A recorded replay or CI live-run path would be over-substrate under the current Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170/Meta SSOT: AgentOS roadmap sequencing (#920–#960) #961 direction and should wait for evidence.

Complexity / CRAP-style Risk

High-risk functions/modules: none introduced; the helper is local to the canonical test module.
Complexity increase: low. The branch adds direct contract assertions rather than a new runner framework.
Test coverage concern: default CI covers hermetic fixture/catalog contracts; live behavior remains opt-in by design.
Refactoring recommendation: none.

Test Quality Assessment (6/7)

Strong tests: fixture schema, catalog consistency, probe whitelist, and terminal-result checks are concrete and mutation-resistant.
Weak tests: live run depends on external LLM/runtime configuration and is skipped by default.
Missing edge cases: no live replay; intentionally out of scope.
Mocking concerns: none; this is an acceptance harness, not a mocked unit path.

Security / Operational Risk

None. The PR does not add default live execution, CI scheduling, new credentials, or persistence beyond the per-run temp workdir/store.

Looks Good

Keeps the AgentOS direction minimal and manual.
Avoids the earlier over-engineered nightly/replay/cost-budget substrate.
Correctly ties product-complete scenarios to complete_product and verifies terminal metadata.

Final Recommendation

Approve. The PR is narrow, SSOT-aligned, and merge-ready once CI is green.

ouroboros-agent

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 3570686 for PR #1191

Review record: 2636e42c-d844-41ed-8d0a-f7780f5ae228

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

None.

Design Notes

Unable to assess architecture or changed behavior because the source snapshot and diff were inaccessible in this environment.

Policy Notes

Omitted 1 finding(s) that referenced files outside the current PR changed-files scope.

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Reviewed by ouroboros-agent[bot] via Codex deep analysis

ouroboros-agent

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

PR #1191
Branch: feat/pP1 | 1 files, +164/-21 | CI: Bridge TypeScript pass 11s https://github.com/Q00/ouroboros/actions/runs/26318488944/job/77482637405
Scope: architecture-level
HEAD checked: 3570686ea47e475b74e7528c1b64383ee0a909ee

What Improved

The live canonical harness now passes complete_product=True for product-complete scenarios and checks more than MCP transport success.
The hermetic canonical checks validate domain_class, completion_mode, and runtime probe kinds against the L1 catalog.

Issue #N/A Requirements

Requirement	Status
Live canonical product-complete scenarios must prove verified terminal product completion, not just pipeline completion.	Blocked by Finding 1.
Hermetic L1 catalog cross-validation for canonical fixture metadata.	Satisfied.
Opt-in live run wiring through `AutoHandler.handle(...)`.	Partially satisfied; invocation is wired, but terminal assertion is incomplete.
Meaningful coverage for newly added live assertion logic.	Insufficient for plugin/external handoff boundary.

Prior Findings Status

Prior Finding	Status
Prior review context	MODIFIED — Prior concern is maintained, modified: the harness no longer stops at MCP-level OK or simple `status="complete"` for normal handoff-only completions, but it still does not close the affected product-completion boundary because plugin/external Ralph handoffs can report `status="complete"` without verified product completion and still pass this test.

Blockers

#	File:Line	Severity	Confidence	Finding
1	tests/canonical/test_canonical.py:263	High	90%	The product-complete live assertion still false-passes external OpenCode/Ralph plugin handoffs. It only rejects `product_status == "not_verified_complete"`, but MCP metadata only sets that field for plain run-handoff-only completions (`src/ouroboros/mcp/tools/auto_handler.py:1200`), while plugin Ralph completion is a known external/pending product state (`src/ouroboros/cli/commands/auto.py:684`, `src/ouroboros/cli/commands/auto.py:710`) and current pipeline returns `status="complete"` with `ralph_dispatch_mode="plugin"` for that path (`src/ouroboros/auto/pipeline.py:1308`). Because `meta.get("product_status")` is then absent, this test passes even though the product is not verified complete, so the new canonical live gate does not prove the claimed terminal product completion contract.

Follow-ups

#	File:Line	Priority	Confidence	Suggestion
—	—	—	—	None.

Test Coverage

Ran uv run pytest tests/canonical/ -q: 9 passed, 1 skipped.
Ran uv run pytest tests/unit/auto/test_surface.py::test_auto_handler_meta_and_text_distinguish_handoff_from_product_completion -q: passed.
Ran uv run pytest tests/unit/auto/test_pipeline_ralph_handoff.py -q -k "plugin and delegation": passed.
The opt-in live LLM path was not run; the blocker is in the assertion contract itself.

Design / Roadmap Gate

Fails affected-boundary review. The PR changed the canonical live acceptance surface, so the relevant boundary is the consumer-visible MCP result contract, not only the changed test lines. Current HEAD has at least three product states sharing status="complete": verified non-plugin Ralph completion, plain run handoff with product_status="not_verified_complete", and plugin/external Ralph delegation with ralph_dispatch_mode="plugin". The new assertion distinguishes only the second state, leaving the plugin/external state accepted as product-complete even though CLI code explicitly labels it not verified complete.

Merge Recommendation

Retrospective recommendation: follow up with a fix before relying on this canonical live gate. For completion_mode: product_complete, assert a positive verified-product signal, such as non-plugin completed Ralph/evaluator evidence, and explicitly fail ralph_dispatch_mode == "plugin" or any external/pending handoff shape.

Review-Metadata:
verdict: REQUEST_CHANGES
github_event: COMMENT
review_kind: post_merge_audit
merge_eligible: false
head_sha: 3570686
source_read_ok: true
diff_read_ok: true
blocking_count: 0

ouroboros-agent Bot approved these changes May 22, 2026

View reviewed changes

Q00 mentioned this pull request May 22, 2026

Meta SSOT: AgentOS roadmap sequencing (#920–#960) #961

Open

ouroboros-agent Bot approved these changes May 23, 2026

View reviewed changes

shaun0927 merged commit 47e0eed into Q00:main May 23, 2026
8 checks passed

ouroboros-agent Bot reviewed May 23, 2026

View reviewed changes

Q00 mentioned this pull request May 24, 2026

fix: resolve post-merge review blocker foundations #1205

Merged

shaun0927 mentioned this pull request May 25, 2026

fix(canonical): call Result.is_ok/is_err as properties #1218

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tests): L0 live-wire + L1 catalog cross-validate (P1)#1191

feat(tests): L0 live-wire + L1 catalog cross-validate (P1)#1191
shaun0927 merged 2 commits into
Q00:mainfrom
shaun0927:feat/pP1

shaun0927 commented May 22, 2026

Uh oh!

ouroboros-agent Bot left a comment

Uh oh!

shaun0927 commented May 23, 2026

Uh oh!

shaun0927 commented May 23, 2026

Uh oh!

ouroboros-agent Bot left a comment

Uh oh!

Uh oh!

ouroboros-agent Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shaun0927 commented May 22, 2026

Summary

What lands

What is NOT in this PR

Test plan

Uh oh!

ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

Review — ouroboros-agent[bot]

Blocking Findings

Non-blocking Suggestions

Design Notes

Policy Notes

Recovery Notes

Uh oh!

shaun0927 commented May 23, 2026

Uh oh!

shaun0927 commented May 23, 2026

PR Review Summary

Verdict

Scope Reviewed

Blocking Issues

Warnings

Mutation-Test Thinking

Complexity / CRAP-style Risk

Test Quality Assessment (6/7)

Security / Operational Risk

Looks Good

Final Recommendation

Uh oh!

ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

Review — ouroboros-agent[bot]

Blocking Findings

Non-blocking Suggestions

Design Notes

Policy Notes

Recovery Notes

Uh oh!

Uh oh!

ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

Review — ouroboros-agent[bot]

What Improved

Issue #N/A Requirements

Prior Findings Status

Blockers

Follow-ups

Test Coverage

Design / Roadmap Gate

Merge Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant