Skip to content

feat(tests): L0 live-wire + L1 catalog cross-validate (P1)#1191

Merged
shaun0927 merged 2 commits into
Q00:mainfrom
shaun0927:feat/pP1
May 23, 2026
Merged

feat(tests): L0 live-wire + L1 catalog cross-validate (P1)#1191
shaun0927 merged 2 commits into
Q00:mainfrom
shaun0927:feat/pP1

Conversation

@shaun0927
Copy link
Copy Markdown
Collaborator

Summary

L0 follow-up of #1170 — wires the `OUROBOROS_RUN_CANONICAL=1` path to programmatically invoke the `ouroboros_auto` MCP tool against a canonical scenario, and adds the L1 catalog cross-validation tests deferred from L0-a (#1174) until L1-a (#1173) was on main.

What lands

  • L1 catalog cross-validate (always runs, hermetic):
    • `test_scenario_domain_class_resolves_in_l1_catalog` — pins `expected.yaml` `domain_class` to a real `TaskClass` value.
    • `test_scenario_completion_mode_matches_l1_catalog_default` — pins `completion_mode` to the L1 catalog default for the scenario's class.
    • `test_scenario_runtime_probe_kinds_subset_of_l1_catalog` — pins `runtime_probe_kinds` to a subset of the catalog's declared probes.
  • Live-run wiring (opt-in via env var): when `OUROBOROS_RUN_CANONICAL=1` is set, the test programmatically invokes `AutoHandler.handle(...)` with a goal/cwd/skip_run/pipeline_timeout_seconds arguments dict built from the scenario fixture. Result assertions surface MCP misconfiguration with the underlying error message.

What is NOT in this PR

Test plan

  • `uv run pytest tests/canonical/ -v` (hermetic default) → 9 passed, 1 skipped (live-run gate).
  • `uv run ruff check tests/canonical/test_canonical.py` → clean.
  • `uv run ruff format tests/canonical/test_canonical.py` → clean.
  • `uv run mypy tests/canonical/test_canonical.py` → clean.

Refs #1157 (L0 lane), #1170 (L0 design), #1174 (L0-a skeleton), #1173 (L1-a catalog).

L0 follow-up of Q00#1170 — wires the ``OUROBOROS_RUN_CANONICAL=1``
path to programmatically invoke the ``ouroboros_auto`` MCP tool
against a canonical scenario, and adds the L1 catalog
cross-validation tests that were deferred from L0-a (Q00#1174) until
L1-a (Q00#1173) was on main.

## Summary

What this PR adds:

1. **L1 catalog cross-validate** (always runs, hermetic):
   - ``test_scenario_domain_class_resolves_in_l1_catalog`` — pins
     ``expected.yaml`` ``domain_class`` to a real :class:`TaskClass`
     value.
   - ``test_scenario_completion_mode_matches_l1_catalog_default`` —
     pins ``completion_mode`` to the L1 catalog default for the
     scenario's class.
   - ``test_scenario_runtime_probe_kinds_subset_of_l1_catalog`` —
     pins ``runtime_probe_kinds`` to a subset of the catalog's
     declared probes (no ad-hoc sideways).
2. **Live-run wiring** (opt-in via env var): when
   ``OUROBOROS_RUN_CANONICAL=1`` is set, the test
   programmatically invokes ``AutoHandler.handle(...)`` with a
   goal/cwd/skip_run/pipeline_timeout_seconds arguments dict built
   from the scenario fixture. Result-shape assertions surface MCP
   misconfiguration with the underlying error message.

## What lands

- ``tests/canonical/test_canonical.py``:
  - Three new L1-cross-validation tests (parametrized per
    scenario).
  - ``_invoke_ouroboros_auto`` helper that builds the arguments
    dict and calls ``AutoHandler.handle``.
  - ``test_scenario_live_run_or_skip`` upgraded from a permanent
    ``pytest.skip`` to a real invocation gated on the env var.
  - Header imports for ``Path`` / ``Any`` / ``asyncio`` (mypy +
    ruff clean).

## Test plan

- [x] ``uv run pytest tests/canonical/ -v`` (hermetic default)
  → 9 passed, 1 skipped (live-run gate).
- [x] ``uv run ruff check tests/canonical/test_canonical.py`` →
  clean.
- [x] ``uv run ruff format tests/canonical/test_canonical.py`` →
  clean.
- [x] ``uv run mypy tests/canonical/test_canonical.py`` → clean.

## What is NOT in this PR

- Additional scenario fixtures (``webhook-receiver``,
  ``vertical-slice-refactor``, ``2d-kart-racer``) — fixtures are
  maintainer-curated example data, not substrate PRs.
- Probe runner wiring for the canonical workdir — handled by
  ``probe_runner`` argument on the AutoPipeline; L0 live-run
  exercises the *end-to-end pipeline*, not the probe schema.
- Recorded-replay layer — deferred per Q00#1170 minimal-substrate
  audit.

## References

- Q00#1157 — Meta SSOT for ``ooo auto`` (L0 lane body).
- Q00#1170 — L0 design issue (minimal harness).
- Q00#1174 — L0-a harness skeleton (this PR's parent).
- Q00#1173 — L1-a task-class catalog (consumed via
  cross-validation tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 4d61b52 for PR #1191

Review record: c4c29c21-3e1e-4137-91c6-23400d9d9d8d

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

None.

Design Notes

Unable to assess architecture or changed behavior because the required local files could not be read in this execution environment.

Policy Notes

  • Omitted 1 finding(s) that referenced files outside the current PR changed-files scope.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

The L0 live path should fail when ouroboros_auto only returns an MCP-level OK while the pipeline itself is failed/blocked or stops at an unverified product handoff. Tie product-complete scenarios to complete_product and assert the terminal metadata the SSOT claims.\n\nConstraint: L0 remains a manual, opt-in pytest harness with no replay or CI/live-cost substrate.\nRejected: Adding replay, nightly runs, or new scenario fixtures | Q00#1170 explicitly excludes operational substrate until evidence demands it.\nConfidence: high\nScope-risk: narrow\nDirective: Keep canonical live assertions tied to documented expected.yaml semantics, not just handler transport success.\nTested: uv run ruff check tests/canonical/test_canonical.py; uv run ruff format tests/canonical/test_canonical.py --check; uv run mypy tests/canonical/test_canonical.py; uv run pytest tests/canonical/ -v\nNot-tested: OUROBOROS_RUN_CANONICAL=1 live LLM run
@shaun0927
Copy link
Copy Markdown
Collaborator Author

Readiness update after independent AgentOS/SSOT re-review:

This PR is a narrow L0 follow-up for the ooo auto canonical acceptance harness. It does not add replay infrastructure, nightly/live CI, cost budgeting, or new scenario-matrix operations; those remain deliberately out of scope per the #1170/#961 minimal-substrate direction.

What changed in the latest push:

  • The opt-in live canonical path now asks ouroboros_auto for complete_product when the scenario declares completion_mode: product_complete.
  • The live test no longer treats transport-level MCP success as enough. It now fails if the tool result is failed/blocked, if terminal metadata is not status=complete, or if a product-complete scenario stops at an unverified run handoff.
  • The test emits a concise CANONICAL <slug>: ... line for maintainer copy/paste evidence after a live run.

Why this is merge-ready:

  • Alignment: the PR now matches the AgentOS SSOT direction: a manual, opt-in, concrete acceptance gate rather than operational substrate.
  • Scope control: still one test file, no new framework, no scheduler, no replay layer, no additional fixture surface.
  • Risk posture: the new assertions make the live path prove the terminal semantics it claims instead of merely proving that AutoHandler.handle(...) returned an OK wrapper.
  • Local validation passed:
    • uv run ruff check tests/canonical/test_canonical.py
    • uv run ruff format tests/canonical/test_canonical.py --check
    • uv run mypy tests/canonical/test_canonical.py
    • uv run pytest tests/canonical/ -v → 9 passed, 1 skipped (live gate disabled by default)

I consider this PR appropriate to merge once GitHub checks are green.

@shaun0927
Copy link
Copy Markdown
Collaborator Author

PR Review Summary

Verdict

Approve

Scope Reviewed

  • PR intent: L0 follow-up for the canonical ooo auto acceptance harness: add L1 catalog cross-validation and wire the opt-in live canonical run path.
  • Main changed areas: tests/canonical/test_canonical.py only.
  • Tests reviewed: canonical shape tests, L1 catalog contract tests, opt-in live-run assertions.
  • Checks considered: local ruff, format check, mypy, and uv run pytest tests/canonical/ -v.

Blocking Issues

None.

Warnings

None.

Mutation-Test Thinking

  • Likely mutants that should be killed:
    • Changing a fixture domain_class to a value outside TaskClass should fail the L1 catalog resolution test.
    • Changing a scenario completion_mode away from the catalog default should fail the catalog-default test.
    • Returning a failed/blocked MCP tool result from the live path should fail the live-run assertion.
    • Stopping a product-complete scenario at an unverified handoff should fail the new product_status assertion.
  • Mutants current tests may not catch:
    • The expensive live LLM path is intentionally skipped unless OUROBOROS_RUN_CANONICAL=1 is set.
  • Additional tests recommended:

Complexity / CRAP-style Risk

  • High-risk functions/modules: none introduced; the helper is local to the canonical test module.
  • Complexity increase: low. The branch adds direct contract assertions rather than a new runner framework.
  • Test coverage concern: default CI covers hermetic fixture/catalog contracts; live behavior remains opt-in by design.
  • Refactoring recommendation: none.

Test Quality Assessment (6/7)

  • Strong tests: fixture schema, catalog consistency, probe whitelist, and terminal-result checks are concrete and mutation-resistant.
  • Weak tests: live run depends on external LLM/runtime configuration and is skipped by default.
  • Missing edge cases: no live replay; intentionally out of scope.
  • Mocking concerns: none; this is an acceptance harness, not a mocked unit path.

Security / Operational Risk

None. The PR does not add default live execution, CI scheduling, new credentials, or persistence beyond the per-run temp workdir/store.

Looks Good

  • Keeps the AgentOS direction minimal and manual.
  • Avoids the earlier over-engineered nightly/replay/cost-budget substrate.
  • Correctly ties product-complete scenarios to complete_product and verifies terminal metadata.

Final Recommendation

Approve. The PR is narrow, SSOT-aligned, and merge-ready once CI is green.

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 3570686 for PR #1191

Review record: 2636e42c-d844-41ed-8d0a-f7780f5ae228

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

None.

Design Notes

Unable to assess architecture or changed behavior because the source snapshot and diff were inaccessible in this environment.

Policy Notes

  • Omitted 1 finding(s) that referenced files outside the current PR changed-files scope.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

PR #1191
Branch: feat/pP1 | 1 files, +164/-21 | CI: Bridge TypeScript pass 11s https://github.com/Q00/ouroboros/actions/runs/26318488944/job/77482637405
Scope: architecture-level
HEAD checked: 3570686ea47e475b74e7528c1b64383ee0a909ee

What Improved

  • The live canonical harness now passes complete_product=True for product-complete scenarios and checks more than MCP transport success.
  • The hermetic canonical checks validate domain_class, completion_mode, and runtime probe kinds against the L1 catalog.

Issue #N/A Requirements

Requirement Status
Live canonical product-complete scenarios must prove verified terminal product completion, not just pipeline completion. Blocked by Finding 1.
Hermetic L1 catalog cross-validation for canonical fixture metadata. Satisfied.
Opt-in live run wiring through AutoHandler.handle(...). Partially satisfied; invocation is wired, but terminal assertion is incomplete.
Meaningful coverage for newly added live assertion logic. Insufficient for plugin/external handoff boundary.

Prior Findings Status

Prior Finding Status
Prior review context MODIFIED — Prior concern is maintained, modified: the harness no longer stops at MCP-level OK or simple status="complete" for normal handoff-only completions, but it still does not close the affected product-completion boundary because plugin/external Ralph handoffs can report status="complete" without verified product completion and still pass this test.

Blockers

# File:Line Severity Confidence Finding
1 tests/canonical/test_canonical.py:263 High 90% The product-complete live assertion still false-passes external OpenCode/Ralph plugin handoffs. It only rejects product_status == "not_verified_complete", but MCP metadata only sets that field for plain run-handoff-only completions (src/ouroboros/mcp/tools/auto_handler.py:1200), while plugin Ralph completion is a known external/pending product state (src/ouroboros/cli/commands/auto.py:684, src/ouroboros/cli/commands/auto.py:710) and current pipeline returns status="complete" with ralph_dispatch_mode="plugin" for that path (src/ouroboros/auto/pipeline.py:1308). Because meta.get("product_status") is then absent, this test passes even though the product is not verified complete, so the new canonical live gate does not prove the claimed terminal product completion contract.

Follow-ups

# File:Line Priority Confidence Suggestion
None.

Test Coverage

  • Ran uv run pytest tests/canonical/ -q: 9 passed, 1 skipped.
  • Ran uv run pytest tests/unit/auto/test_surface.py::test_auto_handler_meta_and_text_distinguish_handoff_from_product_completion -q: passed.
  • Ran uv run pytest tests/unit/auto/test_pipeline_ralph_handoff.py -q -k "plugin and delegation": passed.
  • The opt-in live LLM path was not run; the blocker is in the assertion contract itself.

Design / Roadmap Gate

Fails affected-boundary review. The PR changed the canonical live acceptance surface, so the relevant boundary is the consumer-visible MCP result contract, not only the changed test lines. Current HEAD has at least three product states sharing status="complete": verified non-plugin Ralph completion, plain run handoff with product_status="not_verified_complete", and plugin/external Ralph delegation with ralph_dispatch_mode="plugin". The new assertion distinguishes only the second state, leaving the plugin/external state accepted as product-complete even though CLI code explicitly labels it not verified complete.

Merge Recommendation

Retrospective recommendation: follow up with a fix before relying on this canonical live gate. For completion_mode: product_complete, assert a positive verified-product signal, such as non-plugin completed Ralph/evaluator evidence, and explicitly fail ralph_dispatch_mode == "plugin" or any external/pending handoff shape.

Review-Metadata:
verdict: REQUEST_CHANGES
github_event: COMMENT
review_kind: post_merge_audit
merge_eligible: false
head_sha: 3570686
source_read_ok: true
diff_read_ok: true
blocking_count: 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant