Meta SSOT slice: L1 — TaskClass Catalog (ledger-derived domain inference + default AC injection) #1171

# Meta SSOT slice: L1 — DomainProfile Catalog (ledger-derived domain inference + default AC injection)

> **Terminology cleanup (2026-05-23 KST).** Implementation has standardized on `TaskClass` / `TaskClassProfile`. Older `DomainProfile Catalog` wording in this issue is historical and should be read as superseded unless it explicitly refers to #849's prior recovery-hint concept.

This issue is the **L1 design slice** of #1157. It promotes #849's `DomainProfile` from a typed recovery hint to a first-class **domain taxonomy** that drives default AC and runtime-probe binding — *without* introducing a separate LLM classifier.

## Self-audit note (2026-05-22)

An earlier draft of this issue proposed a separate LLM-based classifier (Sonnet) at the Big Bang phase, plus an eval set, accuracy floor, confidence threshold, and opt-in telemetry. **That design was wrong** for Ouroboros: it duplicated the work the Socratic interview already does (extracting structured spec into a ledger) and violated the SSOT's own promise to *inherit Ouroboros's interview substrate as-is*. This issue replaces that design with **ledger-derived domain inference** — a deterministic pattern-match against the entries the existing interview already populates. Zero new LLM calls, zero new external API surface, zero new substrate. See the *Why ledger-derive* and *Ledger-derived domain inference* sections below.

## Why now

The current `ooo auto` pipeline implicitly treats every task as if it were a library with unit-test acceptance:

- The Seed Architect emits identical AC templates regardless of domain.
- The verifier evaluates the same way for `ooo auto "habit tracker CLI"` and `ooo auto "2D kart racer"`.
- The user has to manually say things like *"and also include a game loop"* / *"and also exit code 0"* — defeating the SSOT north-star (*"one vague line, no follow-up needed"*).

L1 closes this by **deriving the domain class from the structured ledger the Socratic interview already produces** and injecting class-appropriate default AC + completion mode + runtime-probe binding *before the Seed is sealed*.

## Why ledger-derive, not classifier

The interview already runs an LLM that converts the user's goal + interview answers into structured `SeedDraftLedger` entries: `actors`, `inputs`, `outputs`, `runtime_context`, `constraints`, `non_goals`, `acceptance_criteria`, `verification_plan`, `failure_modes`. **Domain class is a derived property of those entries, not a separate inference**:

- `outputs` says "HTTP response" + `inputs` says "webhook payload" → `webhook`
- `outputs` says "stdout, exit code 0" + `runtime_context` says "shell" → `cli`
- `outputs` says "render frames" → `game-2d`
- `goal` contains "refactor" + `constraints` says "preserve behavior" → `refactor-in-place`

These are pattern matches on the entries the interview *already populated and standardized* (the interview's other implicit job is to coerce raw user prose toward canonical vocabulary — *"do you mean stdout, stderr, or both?"*). A separate LLM classifier on top would:

- Duplicate the inference the interview is already doing.
- Introduce a parallel confidence signal that can disagree with the interview's own `ambiguity_score`.
- Require an eval set, accuracy floor, telemetry pipeline, and model-cost decision — all unnecessary if we reuse what the interview already produced.

Ledger-derive keeps L1 inside Ouroboros's substrate. Pattern matching is deterministic, auditable, and reproducible; growing the catalog is a ~10 LoC PR per new class (one pattern function plus unit tests), not a re-curation of an eval set.

## Frozen class taxonomy for L1-a (7 classes)

After design review, L1-a ships **seven** classes — the smallest set that covers every canonical L0 scenario without over-claiming distinctions we cannot defend. Three classes (`game-3d`, `desktop-app`, `notebook-analysis`) are deferred to follow-up PRs *after* the L1-a enum lands; the schema is forward-extensible so additions do not require migration.

| Class | Ledger signals (illustrative) | Default completion mode | Probe binding (anticipated) |
|---|---|---|---|
| `library` | `outputs`: API surface / importable symbols; no shell/HTTP runtime | `CODE_COMPLETE` | unit tests + API import smoke |
| `cli` | `outputs`: stdout / exit code / printed text; `runtime_context`: shell / terminal | `PRODUCT_COMPLETE` | headless run + stdout/exit-code golden |
| `web-service` | `outputs`: REST/HTTP endpoints, JSON body, multiple routes | `PRODUCT_COMPLETE` | API smoke (request → response shape) |
| `webhook` | `inputs`: webhook payload / POST event; `outputs`: side effect (DB row / file / external call) | `PRODUCT_COMPLETE` | API smoke + side-effect probe |
| `data-pipeline` | `inputs`: dataset / CSV / log file; `outputs`: aggregated / transformed / Parquet | `PRODUCT_COMPLETE` | input fixture → output fixture diff |
| `game-2d` | `outputs`: render / frame / screen / canvas; *(provisional: also browser-based interactive frontend)* | `PRODUCT_COMPLETE` | headless N-tick sim + state-progression assertion |
| `refactor-in-place` | `goal`: refactor / rewrite / restructure; `constraints`: preserve behavior / same tests | `CODE_COMPLETE` | before/after test-suite parity |

Deferred (each becomes its own follow-up PR after L1-a lands):

- `game-3d` — render-hash probe is meaningfully harder; defer until L3 ships the render-hash kind.
- `desktop-app` — Electron / native / PWA tri-furcation is too broad to lock in one class. Browser-based interactive frontends are provisionally absorbed by `game-2d`.
- `notebook-analysis` — outlier in completion semantics; defer until at least one real-world notebook scenario hits the test matrix.

## Resolved taxonomy decisions

1. **`webhook` vs `web-service`** → **keep separate**. The runtime probe genuinely differs (side-effect vs request-shape). User goals are linguistically distinct.
2. **`game-3d` deferral** → **deferred** (see above).
3. **`desktop-app` deferral** → **deferred** (see above).
4. **`refactor-in-place` as a class** → **kept**. `vertical-slice-refactor` in the L0 canonical matrix needs this class.

## Per-class schema (frozen by L1-a)

```python
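from collections.abc import Mapping
from dataclasses import dataclass

# CompletionMode and ProbeKind are project types; their imports are omitted here.
@dataclass(frozen=True, slots=True)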
class DomainProfile:
    name: str                                       # canonical class name
    default_completion_mode: CompletionMode         # CODE_COMPLETE | PRODUCT_COMPLETE
    default_ac_template: tuple[str, ...]            # matches Seed.acceptance_criteria
                                                    # (plain strings, not AcceptanceCriterion objects)
    runtime_probe_kinds: tuple[ProbeKind, ...]      # bound by L3 when ready
    safe_defaults: Mapping[str, str]                # existing #849 field
    # Other existing typed-recovery fields preserved
```

## Ledger-derived domain inference (L1-b)

A pure-Python function in `src/ouroboros/auto/domain_inference.py`:

```python
@dataclass(frozen=True, slots=True)
class DomainInference:
    """Outcome of pattern-matching the ledger against the L1 catalog."""
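    single: DomainProfile | None
    # Populated only when ambiguous. Members must be hashable: a frozen
    # DomainProfile hashes every compared field, so a plain-dict
    # safe_defaults would raise TypeError when the set is built.
    candidates: frozenset[DomainProfile]
    reason: str                            # which pattern(s) matched, for audit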


def derive_domain_from_ledger(ledger: SeedDraftLedger) -> DomainInference:
    """
    Pattern-match the ledger's structured entries against the L1 catalog.
    Returns one of:
      - single = <DomainProfile>           (exactly one class matched)
      - candidates = {A, B, ...}           (ambiguous; interview must disambiguate)
      - single = LIBRARY, reason = "unmatched"  (no pattern matched; safe default)

    Zero LLM calls. All matching is keyword/regex/substring against the
    *interview-standardized* ledger entries (which are already coerced toward
    canonical vocabulary by the Socratic loop).
    """
```

Pattern functions are registered per-class in the same module. Each pattern is a single function that takes a `SeedDraftLedger` and returns `bool`:

```python
def _matches_cli(ledger: SeedDraftLedger) -> bool:
    return (
        _ledger_entry_any(ledger, "outputs", ("stdout", "exit code", "printed text"))
        and _ledger_entry_any(ledger, "runtime_context", ("shell", "terminal", "subprocess"))
    )
```

Adding a new class = new pattern function + unit test. ~10 LoC PR.
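To make the three outcomes (single match, ambiguous, unmatched) concrete, the following is a minimal sketch of how the registry and derivation could fit together. Everything here is an assumption for illustration: the `SeedDraftLedger` stand-in is a plain mapping of section name to entries, `_ledger_entry_any` is implemented as a case-insensitive substring check, `_matches_webhook` is simplified to the `inputs` signal only, and the function returns class names rather than `DomainProfile` objects to stay short.

```python
from dataclasses import dataclass, field


@dataclass
class SeedDraftLedger:
    """Stand-in: section name -> interview-standardized entries."""
    entries: dict[str, list[str]] = field(default_factory=dict)


def _ledger_entry_any(ledger, section, needles):
    """True if any needle occurs (case-insensitively) in the section's entries."""
    text = " ".join(ledger.entries.get(section, [])).lower()
    return any(needle in text for needle in needles)


def _matches_cli(ledger):
    return (
        _ledger_entry_any(ledger, "outputs", ("stdout", "exit code", "printed text"))
        and _ledger_entry_any(ledger, "runtime_context", ("shell", "terminal", "subprocess"))
    )


def _matches_webhook(ledger):
    # Simplified: only the `inputs` signal from the taxonomy table.
    return _ledger_entry_any(ledger, "inputs", ("webhook payload", "post event"))


# One entry per class; adding a class means adding one function and one line here.
PATTERNS = {"cli": _matches_cli, "webhook": _matches_webhook}


def derive_class_names(ledger):
    """Return (single, candidates, reason) using class names only."""
    matched = sorted(name for name, fn in PATTERNS.items() if fn(ledger))
    if len(matched) == 1:
        return matched[0], frozenset(), f"matched: {matched[0]}"
    if matched:
        return None, frozenset(matched), "ambiguous: " + ", ".join(matched)
    return "library", frozenset(), "unmatched"


cli_ledger = SeedDraftLedger({
    "outputs": ["Prints the todo list to stdout, exit code 0"],
    "runtime_context": ["POSIX shell"],
})
mixed_ledger = SeedDraftLedger({**cli_ledger.entries, "inputs": ["Webhook payload from GitHub"]})
```

Sorting the matched names before building the reason string keeps the audit text stable across runs, which fits the deterministic and reproducible goal stated earlier.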

### Ambiguity handling

When `derive_domain_from_ledger()` returns multiple candidates, the interview driver gets a **next-round question candidate** (small hook in `interview_driver.py`):

```python
if len(inference.candidates) > 1:
    # Sort names so the question text is deterministic (frozenset order is not).
    names = ", ".join(sorted(c.name for c in inference.candidates))
    next_round_questions.append(
        f"This goal could be interpreted as: {names}. "
        "Which is the primary surface?"
    )
```

The interview's existing ambiguity-gate loop drives the next round. **No new escalation system.** Once the disambiguation question is answered, the ledger updates and the derivation runs again. Either one class wins, or the result stays ambiguous and the loop continues until convergence or `max_rounds`.
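The re-derive cycle described here can be sketched as a small loop. This is not the interview driver's actual code: `run_rounds`, the `derive` and `answer` callables, and the `primary_surface` ledger key are all hypothetical, and what happens when the result is still ambiguous at `max_rounds` is left to the caller because this issue does not specify it.

```python
Ledger = dict[str, list[str]]


def disambiguation_question(candidates):
    """Build the question with sorted names so the text is reproducible."""
    names = ", ".join(sorted(candidates))
    return f"This goal could be interpreted as: {names}. Which is the primary surface?"


def run_rounds(ledger, derive, answer, max_rounds):
    """Re-derive after each answered question.

    Returns (class_name or None, rounds_used). None means still ambiguous
    after max_rounds; the policy for that case is up to the caller.
    """
    for round_no in range(max_rounds):
        single, candidates = derive(ledger)
        if single is not None:
            return single, round_no
        # `answer` is expected to update the ledger in place.
        answer(disambiguation_question(candidates), ledger)
    single, _ = derive(ledger)
    return single, max_rounds


def toy_derive(ledger):
    # Toy rule: resolved as soon as the (hypothetical) primary_surface entry exists.
    primary = ledger.get("primary_surface", [])
    if primary:
        return primary[0], frozenset()
    return None, frozenset({"webhook", "web-service"})


def toy_answer(question, ledger):
    ledger["primary_surface"] = ["webhook"]


result = run_rounds({}, toy_derive, toy_answer, max_rounds=3)
```

In this toy run the first derivation is ambiguous, one question is asked, and the second derivation resolves to `webhook`.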

### Unmatched handling

When no pattern matches, the inference returns `single = LIBRARY` with `reason = "unmatched"` and emits a `domain_unmatched` event into the EventStore. Reasoning:

- `LIBRARY` has the *narrowest* completion gate (`CODE_COMPLETE` with unit tests + API import smoke), so a misclassification has the smallest blast radius: the system makes the weakest PRODUCT claim, and no irreversible mistake follows from it.
- The `domain_unmatched` event lets maintainers spot patterns of unmatched goals and add new catalog classes when justified.
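A rough sketch of the fallback plus event emission, under stated assumptions: `InMemoryEventStore` is a hypothetical stand-in (the real EventStore API is not shown in this issue), the event payload carries only the goal text, and class names are plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryEventStore:
    """Hypothetical stand-in for the EventStore."""
    events: list[dict] = field(default_factory=list)

    def append(self, event_type, **payload):
        self.events.append({"type": event_type, **payload})


def apply_unmatched_default(matched, store, goal):
    """Return (class_names, reason); emit domain_unmatched when nothing matched."""
    if matched:
        return matched, "matched: " + ", ".join(matched)
    # No pattern fired: record it for maintainers, then fall back to the safe default.
    store.append("domain_unmatched", goal=goal)
    return ["library"], "unmatched"


store = InMemoryEventStore()
fallback = apply_unmatched_default([], store, goal="do the thing")
```

Because the event is emitted at the moment of fallback, a maintainer scanning the store sees exactly the goals that no pattern covered.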

## Sub-PR breakdown

1. **L1-a** — Catalog data only. Frozen 7-class enum + per-class fields (`default_completion_mode`, `default_ac_template`, `runtime_probe_kinds`, `safe_defaults`). ~100 LoC + 1 unit test per class. Unblocks L3-c probe binding. *No inference logic; uses are explicit-only at this stage.*
2. **L1-b** — Pattern-matching `derive_domain_from_ledger` in `src/ouroboros/auto/domain_inference.py` + per-class `_matches_*` pattern functions + unit tests covering positive / negative / ambiguous / unmatched cases. ~150 LoC. **No new LLM, no eval set, no accuracy floor — just deterministic pattern unit tests.**
3. **L1-c** — Interview-driver integration: between rounds (after each `_step` settles), call `derive_domain_from_ledger`; if ambiguous, append a disambiguation question to next-round candidates; if unmatched after `max_rounds`, emit `domain_unmatched` and proceed with LIBRARY default. ~50 LoC + integration tests.
4. **L1-d** — Seed AC injection hook in `seed_architect`: when the Seed is assembled, look up the active `DomainProfile.default_ac_template` and prepend each entry to `Seed.acceptance_criteria` (user-supplied AC takes precedence on conflict, with `domain_default_ac_overridden` audit event). ~80 LoC + unit tests.
5. **L1-e** — Envelope surface (`AutoPipelineResult.active_domain_profile: str | None`, already populated in `state.active_domain_profile_name` from #849; just plumb it through `_result()`). ~10 LoC.

L1-a is the smallest first slice. L1-b/c/d/e each ≤ 1 PR.
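The L1-d injection step in the breakdown above could look roughly like the sketch below. How a "conflict" between a default AC and a user AC is detected is not specified in this issue, so it is passed in as a predicate; `same_topic` and its `TOPICS` markers are a deliberately naive, hypothetical example, and the audit events are plain dicts rather than real EventStore records.

```python
def inject_default_ac(user_ac, default_ac, conflicts):
    """Prepend class defaults to user AC; user AC wins on conflict.

    Returns (merged_ac, audit_events).
    """
    kept_defaults = []
    audit_events = []
    for default in default_ac:
        winner = next((u for u in user_ac if conflicts(default, u)), None)
        if winner is None:
            kept_defaults.append(default)
        else:
            # The default is dropped and the override is recorded for audit.
            audit_events.append({
                "type": "domain_default_ac_overridden",
                "default": default,
                "overridden_by": winner,
            })
    return kept_defaults + list(user_ac), audit_events


TOPICS = ("exit code", "--help")  # hypothetical conflict markers


def same_topic(default, user):
    """Naive conflict check: both criteria mention the same marker."""
    d, u = default.lower(), user.lower()
    return any(topic in d and topic in u for topic in TOPICS)


defaults = ("Exit code is 0 on success.", "Usage text is printed for --help.")
user = ["Exit code is 1 on success (legacy wrapper compatibility)."]
merged, events = inject_default_ac(user, defaults, same_topic)
```

Here the exit-code default is dropped in favor of the user's criterion, while the `--help` default is still prepended.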

## Acceptance criteria

- [ ] L1-a: 7-class enum + per-class fields + ≥ 1 unit test per class — 🟢 acceptance.
- [ ] L1-b: pattern unit tests cover positive / negative / ambiguous (≥ 2 patterns matching) / unmatched cases. Adding a class requires ≤ 10 LoC + 1 test.
- [ ] L1-c: an ambiguous `derive_domain_from_ledger` result appends exactly one disambiguation question to the next interview round; the question is resolved within `max_interview_rounds`.
- [ ] L1-d: a canonical L0 scenario (e.g. `cli-todo`) Seed contains the CLI default AC template entries prepended to user AC.
- [ ] L1-e: `result.active_domain_profile` populated on every `ooo auto` run.
- [ ] An unmatched goal (e.g. constructed test fixture) emits `domain_unmatched` and falls through to LIBRARY completion mode without raising.

## Out of scope

- Adding new classes after L1-a freezes — each becomes its own ~10-LoC follow-up PR.
- Runtime probes themselves (L3).
- Watchdog / resilience (L2 / L5).
- Multi-language ledger normalization (assumed: interview standardizes ledger entries to English-leaning canonical vocabulary; verified by reading existing `tests/unit/auto/test_ledger_grading_answerer.py` fixtures).

## Decisions awaiting maintainer triage

**None.** The earlier draft listed L1-5 (classifier model) and L1-10 (opt-in telemetry) as BLOCK questions. The ledger-derive redesign retires both: there is no classifier model to choose, and no eval set / telemetry pipeline to set up.

## Known residual risks (documented, not blockers)

- **R1: sparse ledger after a short interview.** If a highly confident user converges the interview in 2 rounds, the ledger may be sparse and pattern matching may fail more often. Mitigation: when no pattern matches, the inference reports `unmatched` rather than guessing, and the LIBRARY default keeps the blast radius low.
- **R2: non-English ledger entries.** Existing test fixtures show ledger entries are model-generated in English-leaning canonical vocabulary, and the auto-answerer's `from-auto` synthesis is also English-tagged. If a Korean-only interview thread emerges, a separate normalization pass would be needed. Add it only if observed in the wild.
- **R3: pattern conflicts as the catalog grows.** Two classes' patterns may match the same ledger configuration. Mitigation: `DomainInference.candidates` makes this *explicit*; there is no silent precedence. New patterns ship with a *positive-set* and a *disambiguator-set* test, so a new pattern that causes a regression is caught by CI.

## References

- #1157 — Meta SSOT for `ooo auto` (L1 lane body).
- #849 — DomainProfile contract (the data structure this lane promotes).
- #961 — AgentOS roadmap (L1 follow-ups route as Track B follow-ups outside Track C tier gates, mirroring the L4 PR pattern).

