Meta-Cognition: Improve Self-Verification on the Agent Harness #94

@kinwo

Background

Ouroboros has the skeleton of recursive self-improvement — observations, checkpoints, dream cycles, the reflect+crystallize pipeline, an evolution log, and a 5-tier permission model — but its self-verification is shallow:

  • Tools are checked syntactically (call returned ok), not semantically (did the action accomplish the goal?).
  • Reflection only fires after a task completes (post-task runRSIPostTask at packages/cli/src/agent.ts:922).
  • The evolution log is audit-only — no agent reads it to make decisions.
  • Skills are immortal once promoted; no usage tracking, no auto-demotion of failing ones.

The agent can confidently complete a task that didn't actually solve the user's problem, repeat the same mistake across sessions, and accumulate stale skills it never reviews.

Research grounding

Three converging lines of work give us a clear roadmap:

  1. Liu & van der Schaar (ICML 2025) — "Truly Self-Improving Agents Require Intrinsic Metacognitive Learning" (arXiv 2506.05109). Three-component framework: metacognitive knowledge (self-assessment of capabilities), planning (deciding what/how to learn), evaluation (reflecting on learning). Key distinction: intrinsic (agent decides) vs extrinsic (human-designed fixed loops). Ouroboros's RSI is currently extrinsic.
  2. Reflexion (Shinn et al. 2023, arXiv 2303.11366) — Actor / Evaluator / Self-Reflection architecture using verbal RL stored in episodic memory. AlfWorld 73 → 89 %, HumanEval +11 %.
  3. Meta-Harness (2025, emergentmind summary) — outer-loop optimization where an agent reads its own source code, scores, and raw execution traces (not summaries) to propose harness improvements. +7.7 pts text classification, +4.7 pts on IMO-level math.

Important caveat from Renze & Guven (arXiv 2405.06682): self-reflection hurts performance when always-on (overcorrection, token bloat). All metacognitive passes must be conditional with explicit triggers and off-switches.

Proposed delivery — four phases, each shippable on its own

Phase 1 — Semantic Outcome Verification + Mid-Turn Failure Reflection (high-impact, low-risk)

1a. Semantic outcome verification. After every write-class tool call (file-edit, file-write, mutating bash), a lightweight LLM evaluator pass returns { verdict: 'advanced' | 'neutral' | 'regressed' | 'wrong-target', reasoning }. On regressed / wrong-target, a synthetic correction hint is injected into conversation history.
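A minimal sketch of the verdict handling in 1a, assuming the evaluator replies with a JSON object. The verdict values and the { verdict, reasoning } shape come from the text above; parseVerification, correctionHint, and the hint wording are hypothetical names chosen for illustration.

```typescript
// Verdict values and the { verdict, reasoning } shape come from the plan;
// the function names and hint wording are illustrative assumptions.
export type Verdict = "advanced" | "neutral" | "regressed" | "wrong-target";

export interface OutcomeVerification {
  verdict: Verdict;
  reasoning: string;
}

const VERDICTS: readonly Verdict[] = ["advanced", "neutral", "regressed", "wrong-target"];

// Parse the evaluator's raw reply. Anything malformed yields null so that a
// bad evaluator response can never break the agent loop.
export function parseVerification(raw: string): OutcomeVerification | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return null;
    const { verdict, reasoning } = data as Record<string, unknown>;
    if (typeof verdict !== "string" || typeof reasoning !== "string") return null;
    if (!VERDICTS.includes(verdict as Verdict)) return null;
    return { verdict: verdict as Verdict, reasoning };
  } catch {
    return null;
  }
}

// Only the two negative verdicts produce a synthetic correction hint.
export function correctionHint(v: OutcomeVerification): string | null {
  if (v.verdict === "regressed" || v.verdict === "wrong-target") {
    return `[outcome-check: ${v.verdict}] ${v.reasoning}`;
  }
  return null;
}
```

Treating an unparseable reply as "no verdict" rather than as a failure keeps the evaluator advisory: the loop proceeds exactly as it would with verification switched off.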

1b. Mid-turn failure-triggered reflection. Reflexion-style verbal reflection fires during the loop on structural failure signals: the same tool retried with the same args, N consecutive iterations without progress, the step count approaching maxSteps, or a steer message expressing dissatisfaction. The reflection is injected the same way steer messages are, at packages/cli/src/agent.ts:735.

New modules under packages/cli/src/metacognition/:

  • types.ts: MetacognitiveState, OutcomeVerification, MidTurnReflection, ReflectionTrigger
  • triggers.ts: pure functions shouldVerifyOutcome, shouldReflect (zero token cost)
  • engine.ts: MetacognitiveEngine class wrapping LLM sub-calls, full try/catch isolation
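The trigger functions could look roughly like the sketch below. Only the names shouldVerifyOutcome, shouldReflect, and ReflectionTrigger appear in the plan; the state fields, the trigger labels, the inclusive (>=) comparisons, and the signatures are assumptions. The steer-message signal is left out because detecting dissatisfaction is not a pure structural check.

```typescript
// Trigger kinds mirror the structural failure signals listed in 1b; the state
// fields and signatures are assumptions made for this sketch.
export type ReflectionTrigger = "repeated-call" | "stall" | "approaching-limit";

export interface LoopState {
  step: number;                 // current iteration, 1-based
  maxSteps: number;
  stepsWithoutProgress: number; // consecutive iterations without progress
  identicalRetries: number;     // same tool retried with the same args
  firedTriggers: ReadonlySet<ReflectionTrigger>; // per-run dedup
}

export interface TriggerThresholds {
  stallThreshold: number;        // plan default: 3
  errorRetryThreshold: number;   // plan default: 2
  approachingLimitRatio: number; // plan default: 0.85
}

const WRITE_CLASS_TOOLS = new Set(["file-edit", "file-write"]);

// Write-class tools always qualify; bash qualifies only when flagged as mutating.
export function shouldVerifyOutcome(tool: string, mutatingBash = false): boolean {
  return WRITE_CLASS_TOOLS.has(tool) || (tool === "bash" && mutatingBash);
}

// Returns the first trigger that applies and has not already fired this run.
export function shouldReflect(s: LoopState, t: TriggerThresholds): ReflectionTrigger | null {
  const candidates: [ReflectionTrigger, boolean][] = [
    ["repeated-call", s.identicalRetries >= t.errorRetryThreshold],
    ["stall", s.stepsWithoutProgress >= t.stallThreshold],
    ["approaching-limit", s.step / s.maxSteps >= t.approachingLimitRatio],
  ];
  for (const [trigger, active] of candidates) {
    if (active && !s.firedTriggers.has(trigger)) return trigger;
  }
  return null;
}
```

Because both functions are pure and take plain data, they can be unit-tested without an LLM and cost no tokens when they return false or null.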

Insertion point: packages/cli/src/agent.ts:892, after the tool-results push and before continue. This is the only safe ReAct seam between tool execution and the next LLM call.

Token budget: ~450 tokens per outcome check and ~750 per reflection (deduplicated per trigger per run). A 50-iteration task with 20 write-ops comes to ≈ 1 % overhead on a 200k-context model. Both passes have hard off-switches in config.
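The budget arithmetic can be made explicit. The per-call costs below are the plan's estimates; the reflection count and the baseline token total are assumptions, because the plan does not say what the 1 % is measured against.

```typescript
// Per-call costs are the plan's estimates; everything passed in by the caller
// (reflection count, baseline total) is an assumption for illustration.
const TOKENS_PER_OUTCOME_CHECK = 450;
const TOKENS_PER_REFLECTION = 750;

export function metacognitiveOverheadTokens(writeOps: number, reflections: number): number {
  return writeOps * TOKENS_PER_OUTCOME_CHECK + reflections * TOKENS_PER_REFLECTION;
}

// Overhead as a fraction of the tokens the task would have used anyway.
export function overheadRatio(overheadTokens: number, baselineTokens: number): number {
  if (baselineTokens <= 0) throw new RangeError("baselineTokens must be positive");
  return overheadTokens / baselineTokens;
}

// 20 write-ops plus an assumed 3 reflections (one per trigger kind after dedup).
const overhead = metacognitiveOverheadTokens(20, 3); // 20 * 450 + 3 * 750 = 11250
```

Measured against a single 200k-token window, 11,250 tokens is about 5.6 %. The ≈ 1 % figure is reached only when the baseline is the cumulative token count over all 50 iterations, for example 11,250 / 1,125,000 = 1 %, so that is presumably the intended denominator.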

Config knobs in rsi block of packages/cli/src/config.ts: verifyWriteOps, midTurnReflection, metacognitiveStallThreshold (default 3), metacognitiveErrorRetryThreshold (default 2), metacognitiveApproachingLimitRatio (default 0.85).
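As a sketch, the knobs might be typed and defaulted as follows. The knob names and numeric defaults are from the list above. The boolean defaults are not stated in the plan, so false (opt-in) is an assumption here, as are the MetacognitionKnobs and resolveKnobs names and the validation rules.

```typescript
// Knob names and numeric defaults are from the plan. The boolean defaults are
// NOT stated there; false (opt-in) is an assumption of this sketch.
export interface MetacognitionKnobs {
  verifyWriteOps: boolean;
  midTurnReflection: boolean;
  metacognitiveStallThreshold: number;
  metacognitiveErrorRetryThreshold: number;
  metacognitiveApproachingLimitRatio: number;
}

export const DEFAULT_KNOBS: MetacognitionKnobs = {
  verifyWriteOps: false,
  midTurnReflection: false,
  metacognitiveStallThreshold: 3,
  metacognitiveErrorRetryThreshold: 2,
  metacognitiveApproachingLimitRatio: 0.85,
};

// Merge user overrides over the defaults and reject out-of-range values.
export function resolveKnobs(overrides: Partial<MetacognitionKnobs> = {}): MetacognitionKnobs {
  const knobs = { ...DEFAULT_KNOBS, ...overrides };
  const ratio = knobs.metacognitiveApproachingLimitRatio;
  if (!(ratio > 0 && ratio <= 1)) {
    throw new RangeError("metacognitiveApproachingLimitRatio must be in (0, 1]");
  }
  for (const n of [knobs.metacognitiveStallThreshold, knobs.metacognitiveErrorRetryThreshold]) {
    if (!Number.isInteger(n) || n < 1) throw new RangeError("thresholds must be positive integers");
  }
  return knobs;
}
```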

Phase 2 — Metacognitive-Knowledge Layer

A persistent self-assessment the agent reads at task start: what do I know about my own capability profile? Distinct from durable memory (which is about the user/project) — this is about the agent.

  • New file memory/metacognition.json: MetacognitiveSnapshot with capabilityProfiles, learningStrategiesEffective/Ineffective, openQuestions. Hard 3000-token cap when rendered into the system prompt.
  • New packages/cli/src/rsi/metacognition.ts (read/write/merge, atomic-write pattern from evolution-log.ts) and metacognitive-synthesizer.ts (LLM call consuming recent evolution entries + Phase 1 outcomes, returns snapshot delta).
  • Wire into RSIOrchestrator.onSessionEnd after the dream cycle. Extend BuildSystemPromptOptions with metacognitiveGuidance. Add metacognition-updated and skill-invocation-recorded to evolutionEntryTypeSchema.
  • Anti-hallucination: synthesizer must cite evolution-log entry IDs as evidence; uncited claims are not written.
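The anti-hallucination rule reduces to a deterministic filter applied to the synthesizer's output before anything is written. The sketch below assumes each proposed claim carries a list of cited evolution-log entry IDs; SnapshotClaim and dropUncitedClaims are hypothetical names, and the real delta shape is not specified in the plan.

```typescript
// The claim shape and helper name are assumptions of this sketch; the rule
// itself (uncited claims are not written) is from the plan.
export interface SnapshotClaim {
  text: string;
  evidenceEntryIds: string[]; // evolution-log entry IDs cited by the synthesizer
}

// Keep only claims that cite at least one entry ID that actually exists in
// the evolution log. Citations to unknown IDs are stripped first, so a claim
// backed solely by invented IDs is dropped as well.
export function dropUncitedClaims(
  claims: SnapshotClaim[],
  knownEntryIds: ReadonlySet<string>,
): SnapshotClaim[] {
  return claims
    .map((c) => ({
      ...c,
      evidenceEntryIds: c.evidenceEntryIds.filter((id) => knownEntryIds.has(id)),
    }))
    .filter((c) => c.evidenceEntryIds.length > 0);
}
```

Checking IDs against the log, rather than only checking that some ID is present, matters because an LLM can fabricate plausible-looking identifiers.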

Phase 3 — Skill Lifecycle Management

Skills are currently immortal. Add invocation tracking, per-skill metrics, and human-gated demotion.

  • Every skill invocation records an outcome (success | failure | no-op) attributed via Phase 1 verdict.
  • packages/cli/src/rsi/skill-metrics.ts computes rolling success rates over a configurable window (default 20 invocations).
  • Skills below archiveSuccessRateThreshold (default 0.2) for >= minInvocationsForReview (default 5) generate a proposal in memory/skill-review-queue.json.
  • New tool packages/cli/src/tools/skill-lifecycle.ts (actions: list-reviews, approve-archive, approve-refactor, reject-review) classified at elevated risk — same tier as RSI tools, mandatory human approval.
  • Calibration signal: persistent overconfidence (confidence > 0.7 but success < 0.4 over 10 invocations) flags a skill for review.
  • Archiving moves skills/active/{name} → skills/archive/{name} (no data loss; re-promotion is a directory move).
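A minimal sketch of the rolling success rate and the archive-review check, using the defaults listed above. The function names, the decision to count no-op outcomes in the denominator, and the strict "below" comparison are assumptions; the calibration signal is not covered.

```typescript
export type SkillOutcome = "success" | "failure" | "no-op";

export interface ReviewPolicy {
  window: number;                      // plan default: 20 invocations
  archiveSuccessRateThreshold: number; // plan default: 0.2
  minInvocationsForReview: number;     // plan default: 5
}

export const DEFAULT_POLICY: ReviewPolicy = {
  window: 20,
  archiveSuccessRateThreshold: 0.2,
  minInvocationsForReview: 5,
};

// Success rate over the most recent `window` outcomes (input is oldest first).
// Counting "no-op" in the denominator is an assumption of this sketch.
export function rollingSuccessRate(outcomes: SkillOutcome[], window: number): number | null {
  const recent = outcomes.slice(-window);
  if (recent.length === 0) return null;
  return recent.filter((o) => o === "success").length / recent.length;
}

// True when the skill has enough recent invocations and its success rate is
// strictly below the archive threshold.
export function needsArchiveReview(outcomes: SkillOutcome[], policy: ReviewPolicy = DEFAULT_POLICY): boolean {
  const recent = outcomes.slice(-policy.window);
  if (recent.length < policy.minInvocationsForReview) return false;
  const rate = rollingSuccessRate(recent, policy.window);
  return rate !== null && rate < policy.archiveSuccessRateThreshold;
}
```

Under this strict reading, one success in five invocations (a rate of exactly 0.2) does not trigger a review; an inclusive comparison would.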

Phase 4 — Harness Self-Optimization (Meta-Harness inspired)

The agent proposes changes to its own source code in an isolated worktree, runs tests, presents the diff for mandatory human approval.

  • New tool packages/cli/src/tools/harness-proposal.ts — actions analyze (read-only, consumes raw evolution-log entries + observation files per the Meta-Harness raw-traces > summaries finding), draft (creates worktree under .ouroboros/harness-worktrees/{id}, writes change, runs bun run ts-check && bun test), submit (writes pending-review record, emits desktop IPC notification), list.
  • Hard safety constraints (non-negotiable):
    • Human approval required for draft and submit — existing permission lease flow checked, not bypassed.
    • All changes happen in a worktree; running tree untouched.
    • testOutcome: 'fail' blocks submission (enforced in tool, not by LLM).
    • File-scope allowlist: only packages/cli/src/. Reject changes to agent-invocation-permissions.ts, permission-lease.ts, packages/desktop/.
    • Rate-limit: one pending proposal per 48 h.
    • Changes under packages/cli/src/rsi/ flagged in draft output as "self-modification of the metric/feedback layer" so reviewer sees the gaming risk.
  • Desktop integration: new IPC notification harness-proposal-pending in packages/desktop/src/shared/protocol.ts; new persistent review panel (separate from mid-flight steering UI in commit fd97a97 since proposals are not time-sensitive).
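The deterministic parts of the safety constraints (file scope, test gate, rate limit) can be enforced in plain code, independent of the LLM. In the sketch below the paths and rules are taken from the constraints above, while the types, function names, the 'pass' outcome value, and the reading of the rate limit as a minimum 48-hour gap between proposals are assumptions.

```typescript
// Paths and rules are taken from the safety constraints; the types and
// function names are assumptions of this sketch.
const ALLOWED_PREFIX = "packages/cli/src/";
const DENIED_BASENAMES = new Set(["agent-invocation-permissions.ts", "permission-lease.ts"]);
const METRIC_LAYER_PREFIX = "packages/cli/src/rsi/";
const RATE_LIMIT_MS = 48 * 60 * 60 * 1000;

export interface ProposalCheck {
  rejected: string[];    // paths outside the allowlist or explicitly denied
  metricLayer: string[]; // allowed, but flagged as self-modification
}

// Paths are expected to be repo-relative with forward slashes.
export function checkFileScope(paths: string[]): ProposalCheck {
  const rejected: string[] = [];
  const metricLayer: string[] = [];
  for (const p of paths) {
    const segments = p.split("/");
    const basename = segments[segments.length - 1] ?? "";
    const escapes = segments.includes(".."); // block traversal out of the allowlist
    if (escapes || !p.startsWith(ALLOWED_PREFIX) || DENIED_BASENAMES.has(basename)) {
      rejected.push(p);
    } else if (p.startsWith(METRIC_LAYER_PREFIX)) {
      metricLayer.push(p);
    }
  }
  return { rejected, metricLayer };
}

// Deterministic gate enforced in the tool, not by the LLM. Human approval is
// a separate step and is not modeled here.
export function canSubmit(
  testOutcome: "pass" | "fail",
  check: ProposalCheck,
  lastProposalAt: number | null, // epoch ms of the previous proposal, if any
  now: number,
): boolean {
  if (testOutcome === "fail" || check.rejected.length > 0) return false;
  return lastProposalAt === null || now - lastProposalAt >= RATE_LIMIT_MS;
}
```

Changes under packages/desktop/ need no explicit deny rule in this sketch because they already fall outside the packages/cli/src/ allowlist.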

Mapping: research → Ouroboros change

| Research finding | Phase | Ouroboros change |
| --- | --- | --- |
| Outcome-based verification | 1a | verifyOutcome after every write-op |
| Reflexion verbal RL with conditional triggers | 1b | reflectMidTurn on stall/retry/limit signals |
| Renze & Guven: reflection HURTS when always-on | 1 | Explicit triggers, dedup per run, hard off-switch |
| Liu/van der Schaar: metacognitive knowledge | 2 | MetacognitiveSnapshot rendered into system prompt |
| Liu/van der Schaar: metacognitive planning | 3 | Agent decides which skills to demote via skill-lifecycle tool |
| Liu/van der Schaar: metacognitive evaluation | 1 + 2 | Outcome verification feeds the synthesizer |
| Meta-Harness: raw traces > compressed summaries | 4 | harness-proposal analyze reads raw evolution log + observation files |
| Meta-Harness: code-space search | 4 | Proposals are unified diffs; humans approve/reject |

Verification

After each phase:

  1. bun run --filter @ouroboros/cli test — required by CLAUDE.md testing policy.
  2. bun run verify from repo root (lint + ts-check + CLI tests + desktop E2E).
  3. bun run test:cli:live — live-LLM smoke test, inspect evolution.log.json for new entry types.
  4. Phase 1 behavioral check: misleading task ("edit foo.ts to add X" but X already present) — verify 'rsi-outcome-verified' event with wrong-target verdict, correction hint in conversation history, agent does not falsely claim success.
  5. Phase 2 check: two consecutive sessions on similar tasks — verify the second session's system prompt contains a non-empty ## Agent Self-Assessment block referencing the first.
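  6. Phase 3 check: seed evolution log with 5 fake skill-invocation-recorded entries at 20 % success. This sits exactly on the default archiveSuccessRateThreshold of 0.2, so the threshold comparison must be inclusive (or the seeded rate lower) for the check to fire. Verify runSkillReview emits an archive proposal and skill-lifecycle approve-archive requires the elevated-risk lease.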
  7. Phase 4 check: harness-proposal analyze against a seeded log; draft a no-op change; verify the worktree is created, tests run, submission requires human approval, rejection cleans up the worktree.

Risks & mitigations

  • Reflection always-on hurts performance (Renze & Guven) → Phase 1 conditional triggers + per-run dedup + config off-switches.
  • Stale metacognitive knowledge becoming dogma → Phase 2 staleness flags (>30 sessions) + uncited-claim suppression in synthesizer.
  • Agent over-confidently archiving useful skills → Phase 3 mandatory human approval at elevated tier; min-invocation guard against sparse data.
  • Agent gaming success metrics by self-modifying the metric layer → Phase 4 file-scope allowlist + explicit "self-modification of metric layer" flag in draft output.
  • Token bloat in system prompt → Phase 2 hard 3000-token cap; profile rendering omits low-sample entries.
  • RSI errors crashing the agent loop → Every metacognitive seam wraps in try/catch and routes failures to appendEntry, mirroring the existing RSIOrchestrator isolation pattern.
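The isolation pattern in the last mitigation might be sketched as a small wrapper. appendEntry is named in the plan, but its real signature is not shown, so the FailureLogger type, the isolated helper, and the "metacognition-error" entry type below are stand-in assumptions.

```typescript
// `appendEntry` is named in the plan, but its signature is not shown; this
// logger type is a stand-in assumption.
type FailureLogger = (entry: { type: string; seam: string; error: string }) => void | Promise<void>;

// Run a metacognitive step. On any error, log it and return the fallback so
// the agent loop continues as if the step had been switched off.
export async function isolated<T>(
  seam: string,
  step: () => Promise<T>,
  fallback: T,
  logFailure: FailureLogger,
): Promise<T> {
  try {
    return await step();
  } catch (err) {
    try {
      await logFailure({
        type: "metacognition-error", // hypothetical entry type
        seam,
        error: err instanceof Error ? err.message : String(err),
      });
    } catch {
      // Logging must never take the loop down either.
    }
    return fallback;
  }
}
```

A call site would pass null (or another "no opinion" value) as the fallback, for example isolated("verify-outcome", () => engine.verify(call), null, appendEntry), where engine.verify is likewise hypothetical.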

Open questions

  1. Phase 1 → Phase 2 handoff shape — recommended { success: boolean; confidence: number; failureReasons?: string[]; perToolVerdicts?: OutcomeVerification[] }.
  2. Cross-session observation reads in Phase 2 — needs a listRecentSessionIds(basePath, limit) helper. Lives in packages/cli/src/memory/paths.ts or a new memory/sessions.ts.
  3. Phase 3 attribution when multiple skills active in one task — agent annotates which skill was responsible, or per-skill outcomes inferred by checkpoint deltas.
  4. Phase 1 same-model vs. smaller evaluator model — start same-model (prompt-cache benefit, no second API key); add evaluatorModel override only if latency becomes a concern.

Related

Plan source

Full plan with code path traces and integration-point analysis: ~/.claude/plans/conduct-research-on-meta-agile-russell.md.
