Background
Ouroboros has the skeleton of recursive self-improvement — observations, checkpoints, dream cycles, the reflect+crystallize pipeline, an evolution log, and a 5-tier permission model — but its self-verification is shallow:
- Tools are checked syntactically (call returned ok), not semantically (did the action accomplish the goal?).
- Reflection only fires after a task completes (post-task runRSIPostTask at packages/cli/src/agent.ts:922).
- The evolution log is audit-only — no agent reads it to make decisions.
- Skills are immortal once promoted; no usage tracking, no auto-demotion of failing ones.
The agent can confidently complete a task that didn't actually solve the user's problem, repeat the same mistake across sessions, and accumulate stale skills it never reviews.
Research grounding
Three converging lines of work give us a clear roadmap:
- Liu & van der Schaar (ICML 2025) — "Truly Self-Improving Agents Require Intrinsic Metacognitive Learning" (arXiv 2506.05109). Three-component framework: metacognitive knowledge (self-assessment of capabilities), planning (deciding what/how to learn), evaluation (reflecting on learning). Key distinction: intrinsic (agent decides) vs extrinsic (human-designed fixed loops). Ouroboros's RSI is currently extrinsic.
- Reflexion (Shinn et al. 2023, arXiv 2303.11366) — Actor / Evaluator / Self-Reflection architecture using verbal RL stored in episodic memory. AlfWorld 73 → 89 %, HumanEval +11 %.
- Meta-Harness (2025, emergentmind summary) — outer-loop optimization where an agent reads its own source code, scores, and raw execution traces (not summaries) to propose harness improvements. +7.7 pts text classification, +4.7 pts on IMO-level math.
Important caveat from Renze & Guven (arXiv 2405.06682): self-reflection hurts performance when always-on (overcorrection, token bloat). All metacognitive passes must be conditional with explicit triggers and off-switches.
Proposed delivery — four phases, each shippable on its own
Phase 1 — Semantic Outcome Verification + Mid-Turn Failure Reflection (high-impact, low-risk)
1a. Semantic outcome verification. After every write-class tool call (file-edit, file-write, mutating bash), a lightweight LLM evaluator pass returns { verdict: 'advanced' | 'neutral' | 'regressed' | 'wrong-target', reasoning }. On regressed / wrong-target, a synthetic correction hint is injected into conversation history.
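A minimal sketch of how the evaluator reply in 1a might be parsed and turned into a correction hint. The verdict union and the `reasoning` field come from the plan; `HistoryMessage`, `parseVerification`, `buildCorrectionHint`, and the hint wording are hypothetical names chosen for illustration, not existing Ouroboros APIs.

```typescript
// Verdict shape taken from the plan; everything else is an illustrative assumption.
type Verdict = 'advanced' | 'neutral' | 'regressed' | 'wrong-target';

interface OutcomeVerification {
  verdict: Verdict;
  reasoning: string;
}

// Hypothetical message shape; the real conversation-history type may differ.
interface HistoryMessage {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

const VERDICTS: readonly Verdict[] = ['advanced', 'neutral', 'regressed', 'wrong-target'];

// Parse the evaluator's raw JSON reply. Any malformed reply yields null,
// so a bad evaluator response cannot break the agent loop.
function parseVerification(raw: string): OutcomeVerification | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null) return null;
    const { verdict, reasoning } = parsed as Record<string, unknown>;
    if (typeof verdict !== 'string' || typeof reasoning !== 'string') return null;
    if (VERDICTS.indexOf(verdict as Verdict) === -1) return null;
    return { verdict: verdict as Verdict, reasoning };
  } catch {
    return null;
  }
}

// Only 'regressed' and 'wrong-target' produce a hint; other verdicts stay silent.
function buildCorrectionHint(toolName: string, v: OutcomeVerification): HistoryMessage | null {
  if (v.verdict !== 'regressed' && v.verdict !== 'wrong-target') return null;
  return {
    role: 'user',
    content: `[outcome-check] ${toolName} was judged "${v.verdict}": ${v.reasoning}. Re-check the goal before continuing.`,
  };
}
```

Returning null for both malformed replies and benign verdicts keeps the call site to a single "push if non-null" branch.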
1b. Mid-turn failure-triggered reflection. Reflexion-style verbal reflection fires during the loop on structural failure signals — same tool retried with same args, N consecutive iterations without progress, approaching maxSteps, steer-message dissatisfaction. Injected via the same precedent as steer messages at packages/cli/src/agent.ts:735.
New modules under packages/cli/src/metacognition/:
- types.ts: MetacognitiveState, OutcomeVerification, MidTurnReflection, ReflectionTrigger
- triggers.ts: pure functions shouldVerifyOutcome, shouldReflect (zero token cost)
- engine.ts: MetacognitiveEngine class wrapping LLM sub-calls, full try/catch isolation
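The pure trigger functions could look roughly like the sketch below. The function names and config knobs come from the plan; the members of `ReflectionTrigger`, the `TriggerState` shape, the `mutating` flag, and the reading of metacognitiveErrorRetryThreshold as "number of trailing identical calls" are assumptions. Steer-message dissatisfaction is left out because detecting it needs more than structural signals.

```typescript
// Hypothetical trigger names; the plan names the type but not its members.
type ReflectionTrigger = 'repeated-call' | 'stall' | 'approaching-limit';

interface ToolCall { name: string; argsJson: string; }

interface TriggerState {
  recentCalls: ToolCall[];            // most recent last
  iterationsWithoutProgress: number;
  step: number;
  maxSteps: number;
  firedTriggers: ReflectionTrigger[]; // used for per-run dedup
}

interface TriggerConfig {
  verifyWriteOps: boolean;
  midTurnReflection: boolean;
  metacognitiveStallThreshold: number;
  metacognitiveErrorRetryThreshold: number;
  metacognitiveApproachingLimitRatio: number;
}

// Write-class calls: file-edit, file-write, and bash flagged as mutating by the caller.
function shouldVerifyOutcome(call: { name: string; mutating?: boolean }, cfg: TriggerConfig): boolean {
  if (!cfg.verifyWriteOps) return false;
  if (call.name === 'file-edit' || call.name === 'file-write') return true;
  return call.name === 'bash' && call.mutating === true;
}

// Length of the trailing run of calls with the same tool name and arguments.
function trailingIdenticalCalls(calls: ToolCall[]): number {
  const last = calls[calls.length - 1];
  if (last === undefined) return 0;
  let count = 0;
  for (let i = calls.length - 1; i >= 0; i--) {
    const c = calls[i];
    if (c === undefined || c.name !== last.name || c.argsJson !== last.argsJson) break;
    count++;
  }
  return count;
}

// Returns the first trigger that applies and has not fired yet in this run, else null.
function shouldReflect(s: TriggerState, cfg: TriggerConfig): ReflectionTrigger | null {
  if (!cfg.midTurnReflection) return null;
  const candidates: ReflectionTrigger[] = [];
  if (trailingIdenticalCalls(s.recentCalls) >= cfg.metacognitiveErrorRetryThreshold) candidates.push('repeated-call');
  if (s.iterationsWithoutProgress >= cfg.metacognitiveStallThreshold) candidates.push('stall');
  if (s.maxSteps > 0 && s.step / s.maxSteps >= cfg.metacognitiveApproachingLimitRatio) candidates.push('approaching-limit');
  for (const t of candidates) {
    if (s.firedTriggers.indexOf(t) === -1) return t;
  }
  return null;
}
```

Because both functions are pure and take plain data, they can be unit-tested without any LLM call, which matches the "zero token cost" goal.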
Insertion point: packages/cli/src/agent.ts:892 — after the tool-results push, before continue. Only safe ReAct seam between tool execution and the next LLM call.
Token budget: ~450 tokens per outcome check, ~750 per reflection (dedup per trigger per run). A 50-iteration task with 20 write-ops spends about 9,000 tokens on outcome checks (20 × 450), estimated at ≈ 1 % overhead on a 200k-context model. Hard off-switches in config.
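The budget arithmetic can be reproduced with a few lines. The per-call costs are the plan's figures; the baseline of 50 iterations averaging 20,000 prompt tokens each is an assumption made here only to show how an overhead near 1 % can arise, since the plan does not state what the percentage is measured against.

```typescript
// Per-call costs from the plan: ~450 tokens per outcome check, ~750 per reflection.
const TOKENS_PER_OUTCOME_CHECK = 450;
const TOKENS_PER_REFLECTION = 750;

function metacognitiveTokens(writeOps: number, reflections: number): number {
  return writeOps * TOKENS_PER_OUTCOME_CHECK + reflections * TOKENS_PER_REFLECTION;
}

// Overhead as a fraction of the tokens the task would consume anyway.
function overheadRatio(extraTokens: number, baselineTokens: number): number {
  if (baselineTokens <= 0) throw new RangeError('baselineTokens must be positive');
  return extraTokens / baselineTokens;
}

// ASSUMPTION (not from the plan): 50 iterations averaging 20,000 prompt tokens each.
const assumedBaselineTokens = 50 * 20000;              // 1,000,000 tokens
const checksOnlyTokens = metacognitiveTokens(20, 0);   // 20 * 450 = 9,000
const withReflectionTokens = metacognitiveTokens(20, 3); // 9,000 + 3 * 750 = 11,250

console.log(overheadRatio(checksOnlyTokens, assumedBaselineTokens));     // 0.009
console.log(overheadRatio(withReflectionTokens, assumedBaselineTokens)); // 0.01125
```

Under a smaller assumed baseline the ratio grows proportionally, so the real overhead should be measured in the live smoke test.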
Config knobs in rsi block of packages/cli/src/config.ts: verifyWriteOps, midTurnReflection, metacognitiveStallThreshold (default 3), metacognitiveErrorRetryThreshold (default 2), metacognitiveApproachingLimitRatio (default 0.85).
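One possible shape for these knobs, with validation at load time. The knob names and numeric defaults come from the plan; the `true` defaults for the two switches, the interface name, and `resolveMetacognitionConfig` are assumptions, and the real rsi block may be defined with a schema library instead.

```typescript
interface MetacognitionConfig {
  verifyWriteOps: boolean;
  midTurnReflection: boolean;
  metacognitiveStallThreshold: number;
  metacognitiveErrorRetryThreshold: number;
  metacognitiveApproachingLimitRatio: number;
}

const METACOGNITION_DEFAULTS: MetacognitionConfig = {
  verifyWriteOps: true,    // assumption: the plan gives no default for the switches
  midTurnReflection: true, // assumption, as above
  metacognitiveStallThreshold: 3,
  metacognitiveErrorRetryThreshold: 2,
  metacognitiveApproachingLimitRatio: 0.85,
};

function isPositiveInteger(n: number): boolean {
  return isFinite(n) && n >= 1 && Math.floor(n) === n;
}

// Merge user overrides over the defaults and reject values that would make
// the triggers meaningless (for example a ratio above 1 never fires).
function resolveMetacognitionConfig(overrides: Partial<MetacognitionConfig>): MetacognitionConfig {
  const merged: MetacognitionConfig = { ...METACOGNITION_DEFAULTS, ...overrides };
  if (!isPositiveInteger(merged.metacognitiveStallThreshold)) {
    throw new RangeError('metacognitiveStallThreshold must be an integer >= 1');
  }
  if (!isPositiveInteger(merged.metacognitiveErrorRetryThreshold)) {
    throw new RangeError('metacognitiveErrorRetryThreshold must be an integer >= 1');
  }
  const ratio = merged.metacognitiveApproachingLimitRatio;
  if (!(ratio > 0 && ratio <= 1)) {
    throw new RangeError('metacognitiveApproachingLimitRatio must be in (0, 1]');
  }
  return merged;
}
```

Setting both switches to false would disable every metacognitive pass, which is the hard off-switch the plan asks for.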
Phase 2 — Metacognitive-Knowledge Layer
A persistent self-assessment the agent reads at task start: what do I know about my own capability profile? Distinct from durable memory (which is about the user/project) — this is about the agent.
- New file memory/metacognition.json: MetacognitiveSnapshot with capabilityProfiles, learningStrategiesEffective/Ineffective, openQuestions. Hard 3000-token cap when rendered into the system prompt.
- New packages/cli/src/rsi/metacognition.ts (read/write/merge, atomic-write pattern from evolution-log.ts) and metacognitive-synthesizer.ts (LLM call consuming recent evolution entries + Phase 1 outcomes, returns snapshot delta).
- Wire into RSIOrchestrator.onSessionEnd after the dream cycle. Extend BuildSystemPromptOptions with metacognitiveGuidance. Add metacognition-updated and skill-invocation-recorded to evolutionEntryTypeSchema.
- Anti-hallucination: synthesizer must cite evolution-log entry IDs as evidence; uncited claims are not written.
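The citation rule and the token cap can be sketched as two small pure functions. `SnapshotClaim`, `filterCitedClaims`, `renderSelfAssessment`, and the four-characters-per-token estimate are assumptions for illustration; the plan specifies only that uncited claims are dropped, that rendering is capped at 3000 tokens, and (in the verification section) that the block is headed "## Agent Self-Assessment".

```typescript
// Hypothetical claim shape; the plan does not spell out MetacognitiveSnapshot's field types.
interface SnapshotClaim {
  text: string;
  evidenceEntryIds: string[]; // evolution-log entry IDs cited as evidence
}

// Keep a claim only if it cites at least one ID and every cited ID exists in the log.
function filterCitedClaims(claims: SnapshotClaim[], knownEntryIds: string[]): SnapshotClaim[] {
  return claims.filter(
    (claim) =>
      claim.evidenceEntryIds.length > 0 &&
      claim.evidenceEntryIds.every((id) => knownEntryIds.indexOf(id) !== -1),
  );
}

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const SELF_ASSESSMENT_HEADER = '## Agent Self-Assessment';

// Render claims in order, stopping before the cap would be exceeded.
// Returns an empty string when nothing fits, so no empty block is emitted.
function renderSelfAssessment(claims: SnapshotClaim[], maxTokens: number = 3000): string {
  const lines: string[] = [SELF_ASSESSMENT_HEADER];
  let used = estimateTokens(SELF_ASSESSMENT_HEADER);
  for (const claim of claims) {
    const line = `- ${claim.text} (evidence: ${claim.evidenceEntryIds.join(', ')})`;
    const cost = estimateTokens(line);
    if (used + cost > maxTokens) break;
    lines.push(line);
    used += cost;
  }
  return lines.length > 1 ? lines.join('\n') : '';
}
```

Enforcing the citation check in code rather than in the synthesizer prompt means a hallucinated entry ID is rejected even if the model ignores its instructions.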
Phase 3 — Skill Lifecycle Management
Skills are currently immortal. Add invocation tracking, per-skill metrics, and human-gated demotion.
- Every skill invocation records an outcome (success | failure | no-op) attributed via Phase 1 verdict.
- packages/cli/src/rsi/skill-metrics.ts computes rolling success rates over a configurable window (default 20 invocations).
- Skills whose success rate is below archiveSuccessRateThreshold (default 0.2) with at least minInvocationsForReview (default 5) recorded invocations generate a proposal in memory/skill-review-queue.json.
- New tool packages/cli/src/tools/skill-lifecycle.ts (actions: list-reviews, approve-archive, approve-refactor, reject-review) classified at elevated risk: same tier as RSI tools, mandatory human approval.
- Calibration signal: persistent overconfidence (confidence > 0.7 but success < 0.4 over 10 invocations) flags a skill for review.
- Archiving moves skills/active/{name} → skills/archive/{name} (no data loss; re-promotion is a directory move).
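A sketch of the metrics described in the list above. The thresholds and window sizes are the plan's defaults; the function names, the `confidence` field, counting no-op as a non-success, using the mean confidence for the calibration check, and making the archive comparison inclusive are all assumptions. The inclusive comparison is chosen so that a skill seeded at exactly 20 % success, as in the Phase 3 verification check, produces a proposal.

```typescript
type SkillOutcome = 'success' | 'failure' | 'no-op';

interface SkillInvocation {
  outcome: SkillOutcome;
  confidence: number; // hypothetical field: self-reported confidence in [0, 1]
}

interface ReviewConfig {
  archiveSuccessRateThreshold: number;
  minInvocationsForReview: number;
}

const REVIEW_DEFAULTS: ReviewConfig = { archiveSuccessRateThreshold: 0.2, minInvocationsForReview: 5 };

// Success rate over the most recent `window` invocations; null when there is no data.
// ASSUMPTION: 'no-op' counts in the denominator as a non-success.
function rollingSuccessRate(invocations: SkillInvocation[], window: number = 20): number | null {
  const recent = invocations.slice(-Math.max(1, window));
  if (recent.length === 0) return null;
  const successes = recent.filter((i) => i.outcome === 'success').length;
  return successes / recent.length;
}

// ASSUMPTION: the threshold comparison is inclusive (<=).
function needsArchiveReview(invocations: SkillInvocation[], cfg: ReviewConfig = REVIEW_DEFAULTS): boolean {
  if (invocations.length < cfg.minInvocationsForReview) return false; // sparse-data guard
  const rate = rollingSuccessRate(invocations);
  return rate !== null && rate <= cfg.archiveSuccessRateThreshold;
}

// Calibration signal: mean confidence > 0.7 but success rate < 0.4 over the last 10 invocations.
function isOverconfident(invocations: SkillInvocation[]): boolean {
  if (invocations.length < 10) return false;
  const recent = invocations.slice(-10);
  const meanConfidence = recent.reduce((sum, i) => sum + i.confidence, 0) / recent.length;
  const rate = rollingSuccessRate(recent, 10);
  return rate !== null && meanConfidence > 0.7 && rate < 0.4;
}
```

Both checks only decide whether to queue a proposal; the archive itself still goes through the human-approved skill-lifecycle tool.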
Phase 4 — Harness Self-Optimization (Meta-Harness inspired)
The agent proposes changes to its own source code in an isolated worktree, runs tests, presents the diff for mandatory human approval.
- New tool packages/cli/src/tools/harness-proposal.ts with actions analyze (read-only, consumes raw evolution-log entries + observation files per the Meta-Harness raw-traces > summaries finding), draft (creates worktree under .ouroboros/harness-worktrees/{id}, writes change, runs bun run ts-check && bun test), submit (writes pending-review record, emits desktop IPC notification), list.
- Hard safety constraints (non-negotiable):
  - Human approval required for draft and submit: existing permission lease flow checked, not bypassed.
  - All changes happen in a worktree; running tree untouched.
  - testOutcome: 'fail' blocks submission (enforced in tool, not by LLM).
  - File-scope allowlist: only packages/cli/src/. Reject changes to agent-invocation-permissions.ts, permission-lease.ts, packages/desktop/.
  - Rate-limit: one pending proposal per 48 h.
  - Changes under packages/cli/src/rsi/ flagged in draft output as "self-modification of the metric/feedback layer" so reviewer sees the gaming risk.
- Desktop integration: new IPC notification harness-proposal-pending in packages/desktop/src/shared/protocol.ts; new persistent review panel (separate from mid-flight steering UI in commit fd97a97 since proposals are not time-sensitive).
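The tool-enforced constraints above (file scope, failing tests, rate limit) might be checked along these lines. All function and type names are hypothetical, the protected files are matched by basename because the plan does not give their full paths, and the rate limit is read as "no new submission within 48 h of the previous one".

```typescript
type TestOutcome = 'pass' | 'fail';

const ALLOWED_PREFIX = 'packages/cli/src/';
const SELF_MODIFICATION_PREFIX = 'packages/cli/src/rsi/';
// ASSUMPTION: protected files are matched by basename wherever they live.
const PROTECTED_BASENAMES = ['agent-invocation-permissions.ts', 'permission-lease.ts'];
const RATE_LIMIT_MS = 48 * 60 * 60 * 1000;

interface ScopeResult {
  allowed: boolean;
  rejected: string[];
  selfModification: string[]; // paths to flag as metric/feedback-layer changes
}

function toPosix(path: string): string {
  return path.replace(/\\/g, '/');
}

function isInScope(path: string): boolean {
  const normalized = toPosix(path);
  const segments = normalized.split('/');
  // Refuse traversal segments instead of trying to resolve them.
  if (segments.indexOf('..') !== -1 || segments.indexOf('.') !== -1) return false;
  // The prefix check also excludes packages/desktop/ and absolute paths.
  if (normalized.indexOf(ALLOWED_PREFIX) !== 0) return false;
  const basename = segments[segments.length - 1];
  return basename !== undefined && basename !== '' && PROTECTED_BASENAMES.indexOf(basename) === -1;
}

function checkFileScope(paths: string[]): ScopeResult {
  const rejected = paths.filter((p) => !isInScope(p));
  const selfModification = paths.filter(
    (p) => isInScope(p) && toPosix(p).indexOf(SELF_MODIFICATION_PREFIX) === 0,
  );
  return { allowed: paths.length > 0 && rejected.length === 0, rejected, selfModification };
}

// Enforced in the tool, not by the LLM: failing tests or a recent proposal block submission.
function canSubmit(
  testOutcome: TestOutcome,
  lastProposalAtMs: number | null,
  nowMs: number,
): { ok: boolean; reason: string } {
  if (testOutcome === 'fail') return { ok: false, reason: 'tests failed' };
  if (lastProposalAtMs !== null && nowMs - lastProposalAtMs < RATE_LIMIT_MS) {
    return { ok: false, reason: 'rate limit: one proposal per 48 h' };
  }
  return { ok: true, reason: 'ok' };
}
```

These checks complement, and do not replace, the permission lease flow: a proposal that passes them still needs human approval.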
Mapping: research → Ouroboros change
| Research finding | Phase | Ouroboros change |
| --- | --- | --- |
| Outcome-based verification | 1a | verifyOutcome after every write-op |
| Reflexion verbal RL with conditional triggers | 1b | reflectMidTurn on stall/retry/limit signals |
| Renze & Guven: reflection HURTS when always-on | 1 | Explicit triggers, dedup per run, hard off-switch |
| Liu/van der Schaar: metacognitive knowledge | 2 | MetacognitiveSnapshot rendered into system prompt |
| Liu/van der Schaar: metacognitive planning | 3 | Agent decides which skills to demote via skill-lifecycle tool |
| Liu/van der Schaar: metacognitive evaluation | 1 + 2 | Outcome verification feeds the synthesizer |
| Meta-Harness: raw traces > compressed summaries | 4 | harness-proposal analyze reads raw evolution log + observation files |
| Meta-Harness: code-space search | 4 | Proposals are unified diffs; humans approve/reject |
Verification
After each phase:
- bun run --filter @ouroboros/cli test (required by CLAUDE.md testing policy).
- bun run verify from repo root (lint + ts-check + CLI tests + desktop E2E).
- bun run test:cli:live (live-LLM smoke test; inspect evolution.log.json for new entry types).
- Phase 1 behavioral check: misleading task ("edit foo.ts to add X" but X already present); verify 'rsi-outcome-verified' event with wrong-target verdict, correction hint in conversation history, agent does not falsely claim success.
- Phase 2 check: two consecutive sessions on similar tasks; verify the second session's system prompt contains a non-empty ## Agent Self-Assessment block referencing the first.
- Phase 3 check: seed evolution log with 5 fake skill-invocation-recorded entries at 20 % success; verify runSkillReview emits an archive proposal and skill-lifecycle approve-archive requires the elevated-risk lease.
- Phase 4 check: harness-proposal analyze against a seeded log; draft a no-op change; verify the worktree is created, tests run, submission requires human approval, rejection cleans up the worktree.
Risks & mitigations
- Reflection always-on hurts performance (Renze & Guven) → Phase 1 conditional triggers + per-run dedup + config off-switches.
- Stale metacognitive knowledge becoming dogma → Phase 2 staleness flags (>30 sessions) + uncited-claim suppression in synthesizer.
- Agent over-confidently archiving useful skills → Phase 3 mandatory human approval at elevated tier; min-invocation guard against sparse data.
- Agent gaming success metrics by self-modifying the metric layer → Phase 4 file-scope allowlist + explicit "self-modification of metric layer" flag in draft output.
- Token bloat in system prompt → Phase 2 hard 3000-token cap; profile rendering omits low-sample entries.
- RSI errors crashing the agent loop → Every metacognitive seam wraps in try/catch and routes failures to appendEntry, mirroring the existing RSIOrchestrator isolation pattern.
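The isolation pattern in the last risk item could be captured by one generic wrapper, sketched below. `runIsolated`, `FailureEntry`, and the 'metacognition-error' entry type are hypothetical; the `appendEntry` parameter stands in for the evolution-log function of that name, whose real signature is not shown in the plan.

```typescript
// Hypothetical stand-in for the evolution-log appendEntry; the real signature may differ.
interface FailureEntry {
  type: string;
  seam: string;
  message: string;
}
type AppendEntry = (entry: FailureEntry) => void;

// Run a metacognitive sub-call so that no failure can reach the agent loop.
// On error the failure is logged and the caller receives `fallback` instead.
async function runIsolated<T>(
  seam: string,
  fn: () => Promise<T>,
  fallback: T,
  appendEntry: AppendEntry,
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    try {
      appendEntry({ type: 'metacognition-error', seam, message });
    } catch {
      // Logging must never escalate; a secondary failure is swallowed.
    }
    return fallback;
  }
}
```

A call site would then read roughly `await runIsolated('verifyOutcome', () => engine.verifyOutcome(call), null, appendEntry)`, where `engine.verifyOutcome` is likewise an assumed method name, and would treat a null result as "no verdict available".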
Open questions
- Phase 1 → Phase 2 handoff shape: recommended { success: boolean; confidence: number; failureReasons?: string[]; perToolVerdicts?: OutcomeVerification[] }.
- Cross-session observation reads in Phase 2: needs a listRecentSessionIds(basePath, limit) helper. Lives in packages/cli/src/memory/paths.ts or a new memory/sessions.ts.
- Phase 3 attribution when multiple skills active in one task: agent annotates which skill was responsible, or per-skill outcomes inferred by checkpoint deltas.
- Phase 1 same-model vs. smaller evaluator model: start same-model (prompt-cache benefit, no second API key); add evaluatorModel override only if latency becomes a concern.
Related
Plan source
Full plan with code path traces and integration-point analysis: ~/.claude/plans/conduct-research-on-meta-agile-russell.md.