Meta-Cognition: Improve Self-Verification on the Agent Harness #94

## Background

Ouroboros has the skeleton of recursive self-improvement — observations, checkpoints, dream cycles, the reflect+crystallize pipeline, an evolution log, and a 5-tier permission model — but its self-verification is shallow:

- Tools are checked **syntactically** (call returned `ok`), not **semantically** (did the action accomplish the goal?).
- Reflection only fires *after* a task completes (post-task `runRSIPostTask` at `packages/cli/src/agent.ts:922`).
- The evolution log is audit-only — no agent reads it to make decisions.
- Skills are immortal once promoted; no usage tracking, no auto-demotion of failing ones.

The agent can confidently complete a task that didn't actually solve the user's problem, repeat the same mistake across sessions, and accumulate stale skills it never reviews.

## Research grounding

Three converging lines of work give us a clear roadmap:

1. **Liu & van der Schaar (ICML 2025) — "Truly Self-Improving Agents Require Intrinsic Metacognitive Learning"** ([arXiv 2506.05109](https://arxiv.org/abs/2506.05109)). Three-component framework: metacognitive **knowledge** (self-assessment of capabilities), **planning** (deciding what/how to learn), **evaluation** (reflecting on learning). Key distinction: *intrinsic* (agent decides) vs *extrinsic* (human-designed fixed loops). Ouroboros's RSI is currently extrinsic.
2. **Reflexion** (Shinn et al. 2023, [arXiv 2303.11366](https://arxiv.org/abs/2303.11366)) — Actor / Evaluator / Self-Reflection architecture using verbal RL stored in episodic memory. AlfWorld 73 → 89 %, HumanEval +11 %.
3. **Meta-Harness** (2025, [emergentmind summary](https://www.emergentmind.com/papers/2603.28052)) — outer-loop optimization where an agent reads its own source code, scores, and **raw execution traces** (not summaries) to propose harness improvements. +7.7 pts text classification, +4.7 pts on IMO-level math.

Important caveat from Renze & Guven ([arXiv 2405.06682](https://arxiv.org/abs/2405.06682)): self-reflection **hurts** performance when always-on (overcorrection, token bloat). All metacognitive passes must be conditional with explicit triggers and off-switches.

## Proposed delivery — four phases, each shippable on its own

### Phase 1 — Semantic Outcome Verification + Mid-Turn Failure Reflection (high-impact, low-risk)

**1a. Semantic outcome verification.** After every write-class tool call (`file-edit`, `file-write`, mutating `bash`), a lightweight LLM evaluator pass returns `{ verdict: 'advanced' | 'neutral' | 'regressed' | 'wrong-target', reasoning }`. On `regressed` / `wrong-target`, a synthetic correction hint is injected into conversation history.
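
A minimal sketch of how the verdict could be turned into a correction hint. The verdict shape follows the description above; `HistoryMessage`, `needsCorrection`, `buildCorrectionHint`, the `user` role, and the hint wording are all assumptions made for illustration, not the existing harness API.

```typescript
// Sketch only: helper names and the message shape are hypothetical.
type Verdict = 'advanced' | 'neutral' | 'regressed' | 'wrong-target';

interface OutcomeVerification {
  verdict: Verdict;
  reasoning: string;
}

interface HistoryMessage {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

// Only the two negative verdicts call for a correction hint.
function needsCorrection(v: OutcomeVerification): boolean {
  return v.verdict === 'regressed' || v.verdict === 'wrong-target';
}

// Returns the synthetic message to append to conversation history,
// or null when the verdict is 'advanced' or 'neutral'.
function buildCorrectionHint(toolName: string, v: OutcomeVerification): HistoryMessage | null {
  if (!needsCorrection(v)) return null;
  return {
    role: 'user', // ASSUMPTION: hints are injected as user-role messages
    content: `[outcome-check] ${toolName} was judged "${v.verdict}": ${v.reasoning}`,
  };
}
```

Returning `null` for the non-negative verdicts keeps the common path free of extra history entries, which matters for the token budget.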

**1b. Mid-turn failure-triggered reflection.** Reflexion-style verbal reflection fires *during* the loop on structural failure signals: the same tool retried with the same args, N consecutive iterations without progress, approaching `maxSteps`, or steer-message dissatisfaction. The reflection is injected the same way steer messages are at `packages/cli/src/agent.ts:735`.

**New modules under `packages/cli/src/metacognition/`:**
- `types.ts` — `MetacognitiveState`, `OutcomeVerification`, `MidTurnReflection`, `ReflectionTrigger`
- `triggers.ts` — pure functions `shouldVerifyOutcome`, `shouldReflect` (zero token cost)
- `engine.ts` — `MetacognitiveEngine` class wrapping LLM sub-calls, full try/catch isolation
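
The pure trigger functions in `triggers.ts` might look roughly like the sketch below. The function names and config keys come from this issue; the `LoopState` fields, the trigger labels, the priority order, and the mapping of `metacognitiveErrorRetryThreshold` to identical-call retries are assumptions. Steer-message dissatisfaction is left out because it needs message inspection.

```typescript
// Sketch only: LoopState and the trigger labels are hypothetical.
type ReflectionTrigger = 'repeated-call' | 'stalled' | 'approaching-limit';

interface TriggerConfig {
  midTurnReflection: boolean;
  metacognitiveStallThreshold: number;
  metacognitiveErrorRetryThreshold: number;
  metacognitiveApproachingLimitRatio: number;
}

interface LoopState {
  step: number;
  maxSteps: number;
  identicalCallStreak: number; // same tool with the same args, in a row
  stepsWithoutProgress: number;
  firedTriggers: ReadonlySet<ReflectionTrigger>; // per-run dedup
}

const WRITE_TOOLS = new Set(['file-edit', 'file-write']);

// Write-class tools are always checked; bash only when it mutates state.
function shouldVerifyOutcome(toolName: string, mutatesState: boolean, verifyWriteOps: boolean): boolean {
  if (!verifyWriteOps) return false;
  return WRITE_TOOLS.has(toolName) || (toolName === 'bash' && mutatesState);
}

// Returns the first active trigger that has not fired yet in this run.
function shouldReflect(state: LoopState, cfg: TriggerConfig): ReflectionTrigger | null {
  if (!cfg.midTurnReflection) return null;
  const candidates: Array<[ReflectionTrigger, boolean]> = [
    ['repeated-call', state.identicalCallStreak >= cfg.metacognitiveErrorRetryThreshold],
    ['stalled', state.stepsWithoutProgress >= cfg.metacognitiveStallThreshold],
    ['approaching-limit', state.step / state.maxSteps >= cfg.metacognitiveApproachingLimitRatio],
  ];
  for (const [trigger, active] of candidates) {
    if (active && !state.firedTriggers.has(trigger)) return trigger;
  }
  return null;
}
```

Because both functions only read plain values, they cost no tokens and can be unit-tested without a model in the loop.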

**Insertion point:** `packages/cli/src/agent.ts:892`, after the tool-results push and before `continue`. This is the only safe ReAct seam between tool execution and the next LLM call.

**Token budget:** ~450 tokens per outcome check, ~750 per reflection (dedup per trigger per run). 50-iter task with 20 write-ops ≈ 1 % overhead on a 200k-context model. Hard off-switches in config.
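
The arithmetic behind the budget can be sketched as follows. Two inputs are assumptions: that all four trigger types fire once (the maximum under per-trigger dedup), and a baseline of 24,000 tokens per iteration, a figure chosen here only because it reproduces the quoted ~1 %. This issue does not state what the percentage is measured against.

```typescript
// Sketch only: baselineTokens is an assumed denominator, not a measured value.
interface BudgetInputs {
  writeOps: number;
  reflections: number; // at most one per trigger type per run
  tokensPerOutcomeCheck: number;
  tokensPerReflection: number;
  baselineTokens: number; // ASSUMPTION: tokens the task would consume anyway
}

function metacognitiveOverhead(b: BudgetInputs): { extraTokens: number; ratio: number } {
  const extraTokens =
    b.writeOps * b.tokensPerOutcomeCheck + b.reflections * b.tokensPerReflection;
  return { extraTokens, ratio: extraTokens / b.baselineTokens };
}

const estimate = metacognitiveOverhead({
  writeOps: 20,
  reflections: 4,
  tokensPerOutcomeCheck: 450,
  tokensPerReflection: 750,
  baselineTokens: 50 * 24000, // 50 iterations at an assumed 24k tokens each
});
// extraTokens = 20 * 450 + 4 * 750 = 12000; ratio = 12000 / 1200000 = 0.01
```

With a smaller baseline the ratio grows proportionally, so the 1 % figure is worth re-deriving from real traces before relying on it.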

**Config knobs in `rsi` block of `packages/cli/src/config.ts`:** `verifyWriteOps`, `midTurnReflection`, `metacognitiveStallThreshold` (default 3), `metacognitiveErrorRetryThreshold` (default 2), `metacognitiveApproachingLimitRatio` (default 0.85).
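
One way these knobs could be typed and defaulted is sketched below. The key names and numeric defaults are the ones listed above; the boolean defaults, the validation rules, and the names `METACOGNITION_DEFAULTS` and `resolveMetacognitionConfig` are assumptions, and the real `config.ts` may use a schema library instead.

```typescript
// Sketch only: not the actual shape of the `rsi` block in config.ts.
interface MetacognitionConfig {
  verifyWriteOps: boolean;
  midTurnReflection: boolean;
  metacognitiveStallThreshold: number;
  metacognitiveErrorRetryThreshold: number;
  metacognitiveApproachingLimitRatio: number;
}

const METACOGNITION_DEFAULTS: MetacognitionConfig = {
  verifyWriteOps: true, // ASSUMPTION: boolean defaults are not stated in this issue
  midTurnReflection: true, // ASSUMPTION: same as above
  metacognitiveStallThreshold: 3,
  metacognitiveErrorRetryThreshold: 2,
  metacognitiveApproachingLimitRatio: 0.85,
};

// Merges user overrides over the defaults and rejects out-of-range values.
function resolveMetacognitionConfig(
  overrides: Partial<MetacognitionConfig> = {},
): MetacognitionConfig {
  const cfg: MetacognitionConfig = { ...METACOGNITION_DEFAULTS, ...overrides };
  const ratio = cfg.metacognitiveApproachingLimitRatio;
  if (!(ratio > 0 && ratio <= 1)) {
    throw new RangeError('metacognitiveApproachingLimitRatio must be in (0, 1]');
  }
  for (const key of ['metacognitiveStallThreshold', 'metacognitiveErrorRetryThreshold'] as const) {
    if (!Number.isInteger(cfg[key]) || cfg[key] < 1) {
      throw new RangeError(`${key} must be a positive integer`);
    }
  }
  return cfg;
}
```

Setting `verifyWriteOps: false` and `midTurnReflection: false` together would then disable every Phase 1 LLM sub-call.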

### Phase 2 — Metacognitive-Knowledge Layer

A persistent self-assessment the agent reads at task start: *what do I know about my own capability profile?* Distinct from durable memory (which is about the user/project) — this is about the **agent**.

- New file `memory/metacognition.json` — `MetacognitiveSnapshot` with `capabilityProfiles`, `learningStrategiesEffective/Ineffective`, `openQuestions`. Hard 3000-token cap when rendered into the system prompt.
- New `packages/cli/src/rsi/metacognition.ts` (read/write/merge, atomic-write pattern from `evolution-log.ts`) and `metacognitive-synthesizer.ts` (LLM call consuming recent evolution entries + Phase 1 outcomes, returns snapshot delta).
- Wire into `RSIOrchestrator.onSessionEnd` after the dream cycle. Extend `BuildSystemPromptOptions` with `metacognitiveGuidance`. Add `metacognition-updated` and `skill-invocation-recorded` to `evolutionEntryTypeSchema`.
- Anti-hallucination: synthesizer must cite evolution-log entry IDs as evidence; uncited claims are not written.
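
The uncited-claim rule could be enforced mechanically before anything is written to the snapshot. In the sketch below, `SnapshotClaim`, `evidenceEntryIds`, and `keepCitedClaims` are hypothetical names, and dropping citations that point at unknown entry IDs is an added assumption that goes slightly beyond the rule as stated.

```typescript
// Sketch only: the claim shape is an assumption, not the MetacognitiveSnapshot schema.
interface SnapshotClaim {
  statement: string;
  evidenceEntryIds: string[]; // evolution-log entry IDs cited by the synthesizer
}

// Keeps only claims backed by at least one entry ID that exists in the log.
function keepCitedClaims(
  claims: SnapshotClaim[],
  knownEntryIds: ReadonlySet<string>,
): SnapshotClaim[] {
  return claims
    .map((claim) => ({
      ...claim,
      evidenceEntryIds: claim.evidenceEntryIds.filter((id) => knownEntryIds.has(id)),
    }))
    .filter((claim) => claim.evidenceEntryIds.length > 0);
}
```

Running the filter in code rather than trusting the synthesizer prompt means a hallucinated ID cannot carry a claim into the system prompt.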

### Phase 3 — Skill Lifecycle Management

Skills are currently immortal. Add invocation tracking, per-skill metrics, and human-gated demotion.

- Every skill invocation records an outcome (`success | failure | no-op`) attributed via Phase 1 verdict.
- `packages/cli/src/rsi/skill-metrics.ts` computes rolling success rates over a configurable window (default 20 invocations).
- Skills at or below `archiveSuccessRateThreshold` (default 0.2) over at least `minInvocationsForReview` (default 5) invocations generate a proposal in `memory/skill-review-queue.json`.
- New tool `packages/cli/src/tools/skill-lifecycle.ts` (actions: `list-reviews`, `approve-archive`, `approve-refactor`, `reject-review`) classified at elevated risk — same tier as RSI tools, mandatory human approval.
- Calibration signal: persistent overconfidence (confidence > 0.7 but success < 0.4 over 10 invocations) flags a skill for review.
- Archiving moves `skills/active/{name}` → `skills/archive/{name}` (no data loss; re-promotion is a directory move).
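
The review rules above can be sketched as a pure function over invocation history. Several choices are assumptions: `no-op` counts as a non-success, "confidence" means the mean over the calibration window, and the threshold comparison is inclusive so that the 20 % seed in the Phase 3 check qualifies. The names `ReviewConfig`, `reviewReason`, and `calibrationWindow` are hypothetical.

```typescript
// Sketch only: not the actual skill-metrics.ts API.
type InvocationOutcome = 'success' | 'failure' | 'no-op';

interface SkillInvocation {
  outcome: InvocationOutcome;
  confidence: number; // 0..1, self-reported before the run
}

interface ReviewConfig {
  windowSize: number; // default 20
  archiveSuccessRateThreshold: number; // default 0.2
  minInvocationsForReview: number; // default 5
  calibrationWindow: number; // 10 in the calibration rule
}

type ReviewReason = 'low-success-rate' | 'overconfident';

// Callers must pass a non-empty array.
function successRate(invocations: SkillInvocation[]): number {
  const successes = invocations.filter((i) => i.outcome === 'success').length;
  return successes / invocations.length;
}

function reviewReason(history: SkillInvocation[], cfg: ReviewConfig): ReviewReason | null {
  const recent = history.slice(-cfg.windowSize);
  if (
    recent.length >= cfg.minInvocationsForReview &&
    successRate(recent) <= cfg.archiveSuccessRateThreshold
  ) {
    return 'low-success-rate';
  }
  const calib = history.slice(-cfg.calibrationWindow);
  if (cfg.calibrationWindow > 0 && calib.length >= cfg.calibrationWindow) {
    const meanConfidence = calib.reduce((sum, i) => sum + i.confidence, 0) / calib.length;
    if (meanConfidence > 0.7 && successRate(calib) < 0.4) return 'overconfident';
  }
  return null;
}
```

The function only says whether a proposal should be queued; the archive itself still goes through the human-approved `skill-lifecycle` tool.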

### Phase 4 — Harness Self-Optimization (Meta-Harness inspired)

The agent proposes changes to its own source code in an isolated worktree, runs tests, presents the diff for **mandatory human approval**.

- New tool `packages/cli/src/tools/harness-proposal.ts` — actions `analyze` (read-only, consumes raw evolution-log entries + observation files per the Meta-Harness raw-traces > summaries finding), `draft` (creates worktree under `.ouroboros/harness-worktrees/{id}`, writes change, runs `bun run ts-check && bun test`), `submit` (writes pending-review record, emits desktop IPC notification), `list`.
- **Hard safety constraints (non-negotiable):**
  - Human approval required for `draft` and `submit` — existing permission lease flow checked, not bypassed.
  - All changes happen in a worktree; running tree untouched.
  - `testOutcome: 'fail'` blocks submission (enforced in tool, not by LLM).
  - File-scope allowlist: only `packages/cli/src/`. Reject changes to `agent-invocation-permissions.ts`, `permission-lease.ts`, `packages/desktop/`.
  - Rate-limit: one pending proposal per 48 h.
  - Changes under `packages/cli/src/rsi/` are flagged in draft output as "self-modification of the metric/feedback layer" so the reviewer sees the gaming risk.
- Desktop integration: new IPC notification `harness-proposal-pending` in `packages/desktop/src/shared/protocol.ts`; new persistent review panel (separate from mid-flight steering UI in commit `fd97a97` since proposals are not time-sensitive).
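
Three of the hard constraints (file scope, test gate, rate limit) are simple enough to enforce in tool code, as sketched below. The sketch assumes repo-relative POSIX paths, matches the two protected files by basename because their full paths are not given here, and reads the rate limit as "no new submission within 48 h of the last pending one". `checkProposalPath` and `canSubmit` are hypothetical names.

```typescript
// Sketch only: not the actual harness-proposal.ts implementation.
const ALLOWED_PREFIX = 'packages/cli/src/';
const METRIC_LAYER_PREFIX = 'packages/cli/src/rsi/';
const PROTECTED_BASENAMES = new Set(['agent-invocation-permissions.ts', 'permission-lease.ts']);

type ScopeDecision =
  | { allowed: true; flags: string[] }
  | { allowed: false; reason: string };

function checkProposalPath(relPath: string): ScopeDecision {
  const segments = relPath.split('/');
  if (relPath.startsWith('/') || segments.includes('..')) {
    return { allowed: false, reason: 'path must be repo-relative without ".." segments' };
  }
  // packages/desktop/ is rejected here because it is outside the allowlist.
  if (!relPath.startsWith(ALLOWED_PREFIX)) {
    return { allowed: false, reason: `outside ${ALLOWED_PREFIX}` };
  }
  const basename = segments[segments.length - 1] ?? '';
  if (PROTECTED_BASENAMES.has(basename)) {
    return { allowed: false, reason: `${basename} is a protected permission file` };
  }
  const flags = relPath.startsWith(METRIC_LAYER_PREFIX)
    ? ['self-modification of the metric/feedback layer']
    : [];
  return { allowed: true, flags };
}

const RATE_LIMIT_MS = 48 * 60 * 60 * 1000;

// Enforced in code so a failing test run or a recent proposal cannot be argued away.
function canSubmit(
  testOutcome: 'pass' | 'fail',
  lastPendingAtMs: number | null,
  nowMs: number,
): boolean {
  if (testOutcome === 'fail') return false;
  return lastPendingAtMs === null || nowMs - lastPendingAtMs >= RATE_LIMIT_MS;
}
```

Neither check replaces the permission lease; they only narrow what can reach the human reviewer.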

## Mapping: research → Ouroboros change

| Research finding | Phase | Ouroboros change |
|---|---|---|
| Outcome-based verification | 1a | `verifyOutcome` after every write-op |
| Reflexion verbal RL with conditional triggers | 1b | `reflectMidTurn` on stall/retry/limit signals |
| Renze & Guven: reflection HURTS when always-on | 1 | Explicit triggers, dedup per run, hard off-switch |
| Liu/van der Schaar: metacognitive **knowledge** | 2 | `MetacognitiveSnapshot` rendered into system prompt |
| Liu/van der Schaar: metacognitive **planning** | 3 | Agent decides which skills to demote via `skill-lifecycle` tool |
| Liu/van der Schaar: metacognitive **evaluation** | 1 + 2 | Outcome verification feeds the synthesizer |
| Meta-Harness: raw traces > compressed summaries | 4 | `harness-proposal analyze` reads raw evolution log + observation files |
| Meta-Harness: code-space search | 4 | Proposals are unified diffs; humans approve/reject |

## Verification

After each phase:

1. `bun run --filter @ouroboros/cli test` — required by [CLAUDE.md](CLAUDE.md) testing policy.
2. `bun run verify` from repo root (lint + ts-check + CLI tests + desktop E2E).
3. `bun run test:cli:live` — live-LLM smoke test, inspect `evolution.log.json` for new entry types.
4. **Phase 1 behavioral check:** run a misleading task ("edit foo.ts to add X" when X is already present) and verify that a `'rsi-outcome-verified'` event carries a `wrong-target` verdict, a correction hint lands in conversation history, and the agent does not falsely claim success.
5. **Phase 2 check:** two consecutive sessions on similar tasks — verify the second session's system prompt contains a non-empty `## Agent Self-Assessment` block referencing the first.
6. **Phase 3 check:** seed evolution log with 5 fake `skill-invocation-recorded` entries at 20 % success — verify `runSkillReview` emits an `archive` proposal and `skill-lifecycle approve-archive` requires the elevated-risk lease.
7. **Phase 4 check:** `harness-proposal analyze` against a seeded log; `draft` a no-op change; verify the worktree is created, tests run, submission requires human approval, rejection cleans up the worktree.

## Risks & mitigations

- **Reflection always-on hurts performance** (Renze & Guven) → Phase 1 conditional triggers + per-run dedup + config off-switches.
- **Stale metacognitive knowledge becoming dogma** → Phase 2 staleness flags (>30 sessions) + uncited-claim suppression in synthesizer.
- **Agent over-confidently archiving useful skills** → Phase 3 mandatory human approval at elevated tier; min-invocation guard against sparse data.
- **Agent gaming success metrics by self-modifying the metric layer** → Phase 4 file-scope allowlist + explicit "self-modification of the metric/feedback layer" flag in draft output.
- **Token bloat in system prompt** → Phase 2 hard 3000-token cap; profile rendering omits low-sample entries.
- **RSI errors crashing the agent loop** → Every metacognitive seam wraps in try/catch and routes failures to `appendEntry`, mirroring the existing `RSIOrchestrator` isolation pattern.
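
The isolation pattern in the last item could take the shape of a small wrapper like the one below. `runIsolated` and `FailureSink` are hypothetical; the sink stands in for `appendEntry`, whose real signature is not shown in this issue.

```typescript
// Sketch only: FailureSink is a stand-in for the evolution-log writer.
type FailureSink = (entry: { type: string; error: string }) => Promise<void> | void;

// Runs a metacognitive pass; on any error, logs it and returns null so the
// agent loop continues as if the pass had been skipped.
async function runIsolated<T>(
  label: string,
  pass: () => Promise<T>,
  onFailure: FailureSink,
): Promise<T | null> {
  try {
    return await pass();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    try {
      await onFailure({ type: `${label}-failed`, error: message });
    } catch {
      // A logging failure must not take the loop down either.
    }
    return null;
  }
}
```

A caller at the Phase 1 seam would treat a `null` result as "no verdict" and proceed to the next LLM call unchanged.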

## Open questions

1. Phase 1 → Phase 2 handoff shape — recommended `{ success: boolean; confidence: number; failureReasons?: string[]; perToolVerdicts?: OutcomeVerification[] }`.
2. Cross-session observation reads in Phase 2 — needs a `listRecentSessionIds(basePath, limit)` helper. Lives in `packages/cli/src/memory/paths.ts` or a new `memory/sessions.ts`.
3. Phase 3 attribution when multiple skills active in one task — agent annotates which skill was responsible, or per-skill outcomes inferred by checkpoint deltas.
4. Phase 1 same-model vs. smaller evaluator model — start same-model (prompt-cache benefit, no second API key); add `evaluatorModel` override only if latency becomes a concern.

## Related

- #93 — Apply SecondOrder AI meta-cognition patterns to Ouroboros (UX/IPC trust signals, planner/critic for tier-3+ RSI). Complementary surface — this issue is the agent-internal verification layer; #93 is the user-facing trust-signal layer.

## Plan source

Full plan with code path traces and integration-point analysis: `~/.claude/plans/conduct-research-on-meta-agile-russell.md`.
