Meta SSOT slice: L1 — TaskClass Catalog (ledger-derived domain inference + default AC injection) #1171

@shaun0927

Meta SSOT slice: L1 — DomainProfile Catalog (ledger-derived domain inference + default AC injection)

Terminology cleanup (2026-05-23 KST). Implementation has standardized on TaskClass / TaskClassProfile. Older DomainProfile Catalog wording in this issue is historical and should be read as superseded unless it explicitly refers to #849's prior recovery-hint concept.

This issue is the L1 design slice of #1157. It promotes #849's DomainProfile from a typed recovery hint to a first-class domain taxonomy that drives default AC and runtime-probe binding — without introducing a separate LLM classifier.

Self-audit note (2026-05-22)

An earlier draft of this issue proposed a separate LLM-based classifier (Sonnet) at the Big Bang phase, plus an eval set, accuracy floor, confidence threshold, and opt-in telemetry. That design was wrong for Ouroboros: it duplicated the work the Socratic interview already does (extracting structured spec into a ledger) and violated the SSOT's own promise to inherit Ouroboros's interview substrate as-is. This issue replaces that design with ledger-derived domain inference — a deterministic pattern-match against the entries the existing interview already populates. Zero new LLM calls, zero new external API surface, zero new substrate. See the Why ledger-derive and Ledger-derived domain inference sections below.

Why now

The current ooo auto pipeline implicitly treats every task as if it were a library with unit-test acceptance:

  • The Seed Architect emits identical AC templates regardless of domain.
  • The verifier evaluates the same way for ooo auto "habit tracker CLI" and ooo auto "2D kart racer".
  • The user has to manually say things like "and also include a game loop" / "and also exit code 0" — defeating the SSOT north-star ("one vague line, no follow-up needed").

L1 closes this by deriving the domain class from the structured ledger the Socratic interview already produces and injecting class-appropriate default AC + completion mode + runtime-probe binding before the Seed is sealed.

Why ledger-derive, not classifier

The interview already runs an LLM that converts the user's goal + interview answers into structured SeedDraftLedger entries: actors, inputs, outputs, runtime_context, constraints, non_goals, acceptance_criteria, verification_plan, failure_modes. Domain class is a derived property of those entries, not a separate inference:

  • outputs says "HTTP response" + inputs says "webhook payload" → webhook
  • outputs says "stdout, exit code 0" + runtime_context says "shell" → cli
  • outputs says "render frames" → game-2d
  • goal contains "refactor" + constraints says "preserve behavior" → refactor-in-place

These are pattern matches on the entries the interview already populated and standardized (the interview's other implicit job is to coerce raw user prose toward canonical vocabulary — "do you mean stdout, stderr, or both?"). A separate LLM classifier on top would:

  • Duplicate the inference the interview is already doing.
  • Introduce a parallel confidence signal that can disagree with the interview's own ambiguity_score.
  • Require an eval set, accuracy floor, telemetry pipeline, and model-cost decision — all unnecessary if we reuse what the interview already produced.

Ledger-derive keeps L1 inside Ouroboros's substrate. Pattern matching is deterministic, auditable, and reproducible; growing the catalog is a ~10 LoC PR per new class (pattern function + unit tests), not an eval-set re-curation.

Frozen class taxonomy for L1-a (7 classes)

After design review, L1-a ships seven classes — the smallest set that covers every canonical L0 scenario without over-claiming distinctions we cannot defend. Three classes (game-3d, desktop-app, notebook-analysis) are deferred to follow-up PRs after the L1-a enum lands; the schema is forward-extensible so additions do not require migration.

| Class | Ledger signals (illustrative) | Default completion mode | Probe binding (anticipated) |
| --- | --- | --- | --- |
| library | outputs: API surface / importable symbols; no shell/HTTP runtime | CODE_COMPLETE | unit tests + API import smoke |
| cli | outputs: stdout / exit code / printed text; runtime_context: shell / terminal | PRODUCT_COMPLETE | headless run + stdout/exit-code golden |
| web-service | outputs: REST/HTTP endpoints, JSON body, multiple routes | PRODUCT_COMPLETE | API smoke (request → response shape) |
| webhook | inputs: webhook payload / POST event; outputs: side effect (DB row / file / external call) | PRODUCT_COMPLETE | API smoke + side-effect probe |
| data-pipeline | inputs: dataset / CSV / log file; outputs: aggregated / transformed / Parquet | PRODUCT_COMPLETE | input fixture → output fixture diff |
| game-2d | outputs: render / frame / screen / canvas; (provisional: also browser-based interactive frontend) | PRODUCT_COMPLETE | headless N-tick sim + state-progression assertion |
| refactor-in-place | goal: refactor / rewrite / restructure; constraints: preserve behavior / same tests | CODE_COMPLETE | before/after test-suite parity |

Deferred (each becomes its own follow-up PR after L1-a lands):

  • game-3d — render-hash probe is meaningfully harder; defer until L3 ships the render-hash kind.
  • desktop-app — Electron / native / PWA tri-furcation is too broad to lock in one class. Browser-based interactive frontends are provisionally absorbed by game-2d.
  • notebook-analysis — outlier in completion semantics; defer until at least one real-world notebook scenario hits the test matrix.

Resolved taxonomy decisions

  1. webhook vs web-service: keep separate. The runtime probe genuinely differs (side-effect vs request-shape). User goals are linguistically distinct.
  2. game-3d deferral: deferred (see above).
  3. desktop-app deferral: deferred (see above).
  4. refactor-in-place as a class: kept. vertical-slice-refactor in the L0 canonical matrix needs this class.

Per-class schema (frozen by L1-a)

@dataclass(frozen=True, slots=True)
class DomainProfile:
    name: str                                       # canonical class name
    default_completion_mode: CompletionMode         # CODE_COMPLETE | PRODUCT_COMPLETE
    default_ac_template: tuple[str, ...]            # matches Seed.acceptance_criteria
                                                    # (plain strings, not AcceptanceCriterion objects)
    runtime_probe_kinds: tuple[ProbeKind, ...]      # bound by L3 when ready
    safe_defaults: Mapping[str, str]                # existing #849 field
    # Other existing typed-recovery fields preserved

Ledger-derived domain inference (L1-b)

A pure-Python function in src/ouroboros/auto/domain_inference.py:

@dataclass(frozen=True, slots=True)
class DomainInference:
    """Outcome of pattern-matching the ledger against the L1 catalog."""
    single: DomainProfile | None
    candidates: frozenset[DomainProfile]  # populated only when ambiguous
    reason: str                            # which pattern(s) matched, for audit


def derive_domain_from_ledger(ledger: SeedDraftLedger) -> DomainInference:
    """
    Pattern-match the ledger's structured entries against the L1 catalog.
    Returns one of:
      - single = <DomainProfile>           (exactly one class matched)
      - candidates = {A, B, ...}           (ambiguous; interview must disambiguate)
      - single = LIBRARY, reason = "unmatched"  (no pattern matched; safe default)

    Zero LLM calls. All matching is keyword/regex/substring against the
    *interview-standardized* ledger entries (which are already coerced toward
    canonical vocabulary by the Socratic loop).
    """

Pattern functions are registered per-class in the same module. Each pattern is one function that returns bool against a SeedDraftLedger:

def _matches_cli(ledger: SeedDraftLedger) -> bool:
    return (
        _ledger_entry_any(ledger, "outputs", ("stdout", "exit code", "printed text"))
        and _ledger_entry_any(ledger, "runtime_context", ("shell", "terminal", "subprocess"))
    )

Adding a new class = new pattern function + unit test. ~10 LoC PR.

Ambiguity handling

When derive_domain_from_ledger() returns multiple candidates, the interview driver gets a next-round question candidate (small hook in interview_driver.py):

# Sort names so the question text is stable across runs (frozenset order is not).
if len(inference.candidates) > 1:
    names = ", ".join(sorted(c.name for c in inference.candidates))
    next_round_questions.append(
        f"This goal could be interpreted as: {names}. "
        "Which is the primary surface?"
    )

The interview's existing ambiguity-gate loop drives the next round. No new escalation system. The disambiguation question gets answered → ledger updates → derive runs again → one class wins (or remains ambiguous, in which case the loop continues until either max_rounds or convergence).

Unmatched handling

When no pattern matches, the inference returns single = LIBRARY with reason = "unmatched" and emits a domain_unmatched event into the EventStore. Reasoning:

  • LIBRARY has the narrowest completion gate (CODE_COMPLETE with unit tests + API import smoke). Mis-classification therefore has the lowest blast radius: the system makes the weakest product-completeness claim, so a wrong default never overstates what was verified.
  • The domain_unmatched event lets maintainers spot patterns of unmatched goals and add new catalog classes when justified.

Sub-PR breakdown

  1. L1-a — Catalog data only. Frozen 7-class enum + per-class fields (default_completion_mode, default_ac_template, runtime_probe_kinds, safe_defaults). ~100 LoC + 1 unit test per class. Unblocks L3-c probe binding. No inference logic; uses are explicit-only at this stage.
  2. L1-b — Pattern-matching derive_domain_from_ledger in src/ouroboros/auto/domain_inference.py + per-class _matches_* pattern functions + unit tests covering positive / negative / ambiguous / unmatched cases. ~150 LoC. No new LLM, no eval set, no accuracy floor — just deterministic pattern unit tests.
  3. L1-c — Interview-driver integration: between rounds (after each _step settles), call derive_domain_from_ledger; if ambiguous, append a disambiguation question to next-round candidates; if unmatched after max_rounds, emit domain_unmatched and proceed with LIBRARY default. ~50 LoC + integration tests.
  4. L1-d — Seed AC injection hook in seed_architect: when the Seed is assembled, look up the active DomainProfile.default_ac_template and prepend each entry to Seed.acceptance_criteria (user-supplied AC takes precedence on conflict, with domain_default_ac_overridden audit event). ~80 LoC + unit tests.
  5. L1-e — Envelope surface (AutoPipelineResult.active_domain_profile: str | None, already populated in state.active_domain_profile_name by #849, "feat(auto): DomainProfile and VerifiablePredicate contracts (#809 P3, PR 1/6)"; just plumb it through _result()). ~10 LoC.

L1-a is the smallest first slice. L1-b/c/d/e each ≤ 1 PR.
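
The L1-d injection rule (prepend defaults, user AC wins on conflict, audit the override) could be sketched like this. The issue does not define what counts as a conflict, so the sketch takes the conflict test as a parameter; inject_default_ac, the audit callback shape, and the toy same-topic predicate in the usage lines are all assumptions for illustration.

```python
from typing import Callable, Sequence


def inject_default_ac(
    user_ac: Sequence[str],
    default_ac: Sequence[str],
    conflicts: Callable[[str, str], bool],
    audit: Callable[[str, dict[str, str]], None],
) -> tuple[str, ...]:
    """Prepend class defaults to user AC, dropping defaults the user overrides."""
    kept: list[str] = []
    for default in default_ac:
        # Find the first user criterion that conflicts with this default, if any.
        rival = next((u for u in user_ac if conflicts(default, u)), None)
        if rival is None:
            kept.append(default)
        else:
            audit("domain_default_ac_overridden", {"default": default, "user": rival})
    return (*kept, *user_ac)


# Usage with a toy predicate: two criteria conflict if they open with the same word.
events: list[tuple[str, dict[str, str]]] = []
result = inject_default_ac(
    user_ac=["Stdout is empty; all results go to todo.json."],
    default_ac=[
        "Running the command headlessly exits with code 0.",
        "Stdout matches the golden output for the documented example.",
    ],
    conflicts=lambda d, u: d.split()[0].lower() == u.split()[0].lower(),
    audit=lambda kind, payload: events.append((kind, payload)),
)
```

Here the stdout default is dropped in favor of the user's criterion and one audit event is recorded, while the exit-code default is prepended unchanged.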

Acceptance criteria

  • L1-a: 7-class enum + per-class fields + ≥ 1 unit test per class — 🟢 acceptance.
  • L1-b: pattern unit tests cover positive / negative / ambiguous (≥ 2 patterns matching) / unmatched cases. Adding a class requires ≤ 10 LoC + 1 test.
  • L1-c: an ambiguous derive_domain_from_ledger result appends exactly one disambiguation question to the next interview round; the question is resolved within max_interview_rounds.
  • L1-d: a canonical L0 scenario (e.g. cli-todo) Seed contains the CLI default AC template entries prepended to user AC.
  • L1-e: result.active_domain_profile populated on every ooo auto run.
  • An unmatched goal (e.g. a constructed test fixture) emits domain_unmatched and falls through to the LIBRARY default (CODE_COMPLETE) without raising.

Out of scope

  • Adding new classes after L1-a freezes — each becomes its own ~10-LoC follow-up PR.
  • Runtime probes themselves (L3).
  • Watchdog / resilience (L2 / L5).
  • Multi-language ledger normalization (assumed: interview standardizes ledger entries to English-leaning canonical vocabulary; verified by reading existing tests/unit/auto/test_ledger_grading_answerer.py fixtures).

Decisions awaiting maintainer triage

None. The earlier draft listed L1-5 (classifier model) and L1-10 (opt-in telemetry) as BLOCK questions. The ledger-derive redesign retires both: there is no classifier model to choose, and no eval set / telemetry pipeline to set up.

Known residual risks (documented, not blockers)

  • R1 — sparse ledger after short interview. If a highly confident user converges the interview in 2 rounds, the ledger may be sparse and pattern matching may fail more often. Mitigation: the inference reports unmatched rather than guessing; the LIBRARY default keeps blast radius low.
  • R2 — non-English ledger entries. Existing test fixtures show ledger entries are model-generated in English-leaning canonical vocabulary. If a Korean-only interview thread emerges (the auto-answerer's from-auto synthesis is also English-tagged), a separate normalization pass would be needed. Add only if observed in the wild.
  • R3 — pattern catalog conflicts as the catalog grows. Two domains' patterns may match the same ledger configuration. Mitigation: DomainInference.candidates makes this explicit; no silent precedence. New patterns ship with a positive-set and a disambiguator-set test, so adding a pattern that causes a regression is caught by CI.
