auto cli-todo canonical run blocked at safe-default closure (R1 evidence) #1219

@shaun0927

Summary

R1 live canonical run (OUROBOROS_RUN_CANONICAL=1 pytest tests/canonical/ -k cli-todo on 265aedb4) terminated with phase=blocked at safe-default synthesis. ooo auto cannot complete the cli-todo canonical scenario, which fails SSOT #1157 success condition #1 (all four canonical goals reach status=complete). Two coupled defects produce a single symptom:

  • B1 — Logic: safe-default fallback cannot close the interview when the backend accepts the synthesis answer but does not flag the resulting turn as seed_ready/completed. Defaults are then rolled back and the auto pipeline exits blocked.
  • B2 — Envelope: the resulting blocker is emitted with last_error_code=None, violating the canonical 8-code mapping contract introduced by feat(auto): canonical stop_reason_code for interview-layer blockers #1151.

This issue tracks the behavioural fix. #1211 (open) is the observability complement for the same decision point and should land alongside, not instead of, the work proposed here. Bug A (the test-harness is_ok() / unwrap_err() mis-call that previously masked this evidence behind a TypeError) is fixed in #1218.

Reproduction

# Once #1218 merges; current main HEAD = 265aedb4 cannot reproduce without it
OUROBOROS_RUN_CANONICAL=1 uv run pytest tests/canonical/ -k cli-todo -v
# ~350s wall, ~$1 LLM cost. Expect phase=blocked terminal.

Local evidence preserved in .ooo-observability/R1-cli-todo-20260525-1739.log on the reporter's worktree (auto_session_id=auto_3f44b20d63b7, interview_id=interview_169770e8f45c48cf).

R1 evidence — interview ambiguity trajectory (13 rounds, never crossed the 0.2 gate)

round  overall  goal  constraints  success_criteria  ready?
3      0.298    0.78  0.62         0.68              False
4      0.304    0.78  0.62         0.66              False
5      0.540    0.55  0.35         0.45              False
6      0.401    0.74  0.52         0.49              False
7      0.464    0.68  0.46         0.42              False
8      0.361    0.72  0.55         0.62              False
9      0.421    0.72  0.42         0.55              False
10     0.353    0.74  0.55         0.62              False
11     0.397    0.72  0.55         0.50              False
12     0.373    0.72  0.58         0.55              False

Score oscillates between 0.298 and 0.540 — never converges below the 0.2 readiness gate. After round 12 the auto driver enters the safe-default fallback at 08:45:41:

auto.interview.safe_default.entered
  ambiguity_score=0.373  backend_done=False  ledger_done=False
  open_gaps=['runtime_context']  max_rounds=12
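The trajectory summary can be checked mechanically against the logged rounds. The sketch below is illustrative only: the scores are copied from the table, the 0.2 gate is the value quoted in this issue, and the comparison direction (a score at or below the gate counts as ready) is an assumption, as are the names OVERALL, READINESS_GATE and summarize.

```python
# Overall ambiguity scores for rounds 3-12, copied from the R1 table.
OVERALL = {
    3: 0.298, 4: 0.304, 5: 0.540, 6: 0.401, 7: 0.464,
    8: 0.361, 9: 0.421, 10: 0.353, 11: 0.397, 12: 0.373,
}
READINESS_GATE = 0.2  # gate value quoted in this issue


def summarize(scores: dict[int, float], gate: float) -> dict[str, object]:
    """Return the score range, the final score, and whether any round reached the gate."""
    last_round = max(scores)
    return {
        "min": min(scores.values()),
        "max": max(scores.values()),
        "final": scores[last_round],
        # Assumption: "ready" means the overall score is at or below the gate.
        "crossed_gate": any(score <= gate for score in scores.values()),
    }


summary = summarize(OVERALL, READINESS_GATE)
print(summary)
```

The final score matches the ambiguity_score=0.373 reported when the safe-default fallback was entered.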

R1 evidence — terminal state (recovered from log; JSON dump was not retained on disk)

  • phase=blocked
  • blocker: safe-default synthesis did not close the persisted interview: backend_done=False, ledger defaults rolled back
  • last_error_code=None
  • seed_origin=none
  • runtime_probe_evidence=[]

Suspect code

B1 — Logic: safe-default cannot close when backend silently no-ops synthesis

src/ouroboros/auto/interview_driver.py:440-454

state.interview_session_id = synthesis_turn.session_id
state.pending_question = synthesis_turn.question
if not (synthesis_turn.seed_ready or synthesis_turn.completed):
    _revert_safe_default_entries(ledger, finalization.defaulted_sections)
    blocker = (
        "safe-default synthesis did not close the persisted interview: "
        "backend_done=False, ledger defaults rolled back"
    )
    state.ledger = ledger.to_dict()
    state.mark_blocked(blocker, tool_name="interview.safe_default_synthesis")
    record_authoring_backend(state)
    self._save(state)
    return AutoInterviewResult(
        "blocked", state.interview_session_id, ledger, self.max_rounds, blocker
    )

When the Socratic backend accepts the synthesis answer but never flags seed_ready/completed on the resulting turn (the cli-todo runtime_context gap reproduces this reliably), #1167's policy rolls every default back and exits blocked. The backend appears to treat the driver-injected synthesis as just another user response, not a terminator.

B2 — Envelope: last_error_code never set for this blocker

src/ouroboros/auto/state.py:626-636

def mark_blocked(
    self,
    message: str,
    *,
    tool_name: str | None = None,
    error_code: str | None = None,
) -> None:
    self.last_tool_name = tool_name
    self.last_error_code = error_code
    self.transition(AutoPhase.BLOCKED, message, error=message)

Both safe-default failure sites in interview_driver.py (lines 434 and 449) call mark_blocked(blocker, tool_name="interview.safe_default_synthesis") without passing error_code=, so last_error_code defaults to None. The terminal envelope carries the rich blocker text but no canonical code — breaking the #1151 8-code mapping contract.
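A minimal sketch of the B2 symptom, assuming a toy state object in place of the real one. ToyState, its phase strings and the NONCLOSURE_CODE constant are hypothetical; the code name is the one proposed under Sub-tasks and is not confirmed to be part of the #1151 alphabet. Only the mark_blocked signature and field assignments follow the quoted method.

```python
from __future__ import annotations

from dataclasses import dataclass

# Name proposed in this issue; assumed here, not confirmed to be in the #1151 alphabet.
NONCLOSURE_CODE = "INTERVIEW_SAFE_DEFAULT_SYNTHESIS_NONCLOSURE"
BLOCKER = (
    "safe-default synthesis did not close the persisted interview: "
    "backend_done=False, ledger defaults rolled back"
)


@dataclass
class ToyState:
    """Hypothetical stand-in for the auto state; keeps only the envelope fields."""

    phase: str = "interview"
    last_tool_name: str | None = None
    last_error_code: str | None = None
    last_error: str | None = None

    def mark_blocked(
        self,
        message: str,
        *,
        tool_name: str | None = None,
        error_code: str | None = None,
    ) -> None:
        # Mirrors the quoted method: both fields are overwritten on every call.
        self.last_tool_name = tool_name
        self.last_error_code = error_code
        self.phase = "blocked"  # simplified replacement for self.transition(...)
        self.last_error = message


# Current call shape: no error_code, so the envelope carries None.
current = ToyState()
current.mark_blocked(BLOCKER, tool_name="interview.safe_default_synthesis")

# Proposed call shape: the same call with an explicit canonical code.
proposed = ToyState()
proposed.mark_blocked(
    BLOCKER,
    tool_name="interview.safe_default_synthesis",
    error_code=NONCLOSURE_CODE,
)

print(current.last_error_code, proposed.last_error_code)
```

Because error_code is keyword-only with a None default, omitting it raises no error at the call site, which is consistent with the gap only showing up in the terminal envelope.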

Sub-tasks

  • B1 (logic): src/ouroboros/auto/interview_driver.py:440-454. Decide the closure policy for the case where the backend acknowledgement is content-only: either extend the safe-default contract with a third closure mode (alongside mutual_agreement and ledger_only) that accepts "backend echoed, ledger satisfied" as a close, or fail forward into a deterministic ledger_only close instead of reverting defaults. Document the chosen policy on #1167 (feat(auto): safe-default closure mode + partial-unsafe blocker code, PR-B2).
  • B2 (envelope): Add INTERVIEW_SAFE_DEFAULT_SYNTHESIS_NONCLOSURE (or an equivalent; it must be drawn from the #1151 canonical stop_reason_code alphabet) and pass it as error_code= at both mark_blocked call sites in interview_driver.py (lines 434 and 449). Add a regression test under tests/auto/ asserting that any safe-default blocker emits a non-None last_error_code from the documented alphabet.
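One way to picture the fail-forward option from B1 is as a small pure decision function. This is a sketch under stated assumptions, not the driver's logic: Closure, decide_closure and the ledger_satisfied flag are hypothetical, and the meaning given to mutual_agreement and ledger_only here is a guess at the #1167 contract. It also assumes the ledger counts as satisfied once defaults are applied, which the R1 log does not confirm.

```python
from enum import Enum


class Closure(Enum):
    """Hypothetical outcomes; the first two names are taken from this issue."""

    MUTUAL_AGREEMENT = "mutual_agreement"
    LEDGER_ONLY = "ledger_only"
    BLOCKED = "blocked"


def decide_closure(*, backend_done: bool, ledger_satisfied: bool) -> Closure:
    """Fail-forward variant: a satisfied ledger closes even without a backend flag."""
    if ledger_satisfied and backend_done:
        # Assumed meaning: backend and ledger both consider the interview closed.
        return Closure.MUTUAL_AGREEMENT
    if ledger_satisfied:
        # Backend accepted the answer but set neither seed_ready nor completed.
        return Closure.LEDGER_ONLY
    # Required sections are still open, so this sketch keeps the blocked outcome.
    return Closure.BLOCKED


# The R1 shape: backend_done=False, ledger assumed satisfied after defaults.
r1_case = decide_closure(backend_done=False, ledger_satisfied=True)
print(r1_case.value)
```

Under this policy the R1 case would close as ledger_only rather than reverting defaults. Whether a backend-done turn with an unsatisfied ledger should block, as it does here, is a separate policy choice that the sketch does not settle.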

Test-harness coordination

Prior art / related work

Cross-refs

Constraints (per evidence-driven minimal-substrate policy)

  • No second live R1 run until B1 lands — same evidence, $1 wasted.
  • No new substrates or abstractions; both sub-tasks are edits to existing modules.
  • No direct push to main.
