Skip to content

fix(auto-recall): Tier 1 memory counter fix (ms-based suppression + lazy-heal)#712

Open
raw34 wants to merge 4 commits intoCortexReach:masterfrom
raw34:fix/tier1-memory-counters
Open

fix(auto-recall): Tier 1 memory counter fix (ms-based suppression + lazy-heal)#712
raw34 wants to merge 4 commits intoCortexReach:masterfrom
raw34:fix/tier1-memory-counters

Conversation

@raw34
Copy link
Copy Markdown
Contributor

@raw34 raw34 commented Apr 28, 2026

Summary

Rewrites the Tier 1 auto-recall lifecycle to fix counter bugs that blocked
decay-engine, access-tracker reinforcement, and future tier-manager promotion.
See Issue #633 / RFC #445.

Key changes

Schema (src/smart-metadata.ts)

  • Adds optional suppressed_until_ms?: number field with presence semantics
    (undefined = never touched by Tier 1; 0/number = touched, with optional active suppression)

Config (openclaw.plugin.json)

  • autoRecallBadRecallDecayMs — Option C decay window, default 24h
  • autoRecallSuppressionDurationMs — suppression duration, default 30min

Auto-recall inline patch (index.ts)

  • Always increment access_count and last_accessed_at on injection (the core fix)
  • Lazy-heal legacy pollution on first Tier 1 touch
  • Option C decay: reset bad_recall_count if gap > decay window; clock-skew safe
  • Write ms-based suppressed_until_ms, replacing turn-number suppression
  • Always zero suppressed_until_turn to prevent stale-number leaks

Governance filter (index.ts)

  • Replace suppressed_until_turn vs currentTurn with absolute-time check
    against suppressed_until_ms. Gateway restart no longer creates false
    "just expired" windows.

Out of scope

Test plan

  • New test/tier1-counters.test.mjs (22 tests) — asserts access_count
    0→1, suppressed_until_ms semantics, decay, lazy-heal idempotence
  • Registered in npm test, ci-test-manifest.mjs, and
    verify-ci-test-manifest.mjs baseline
  • test/smart-memory-lifecycle.mjs — Tier 1 access_count integration assertion
  • test/smart-metadata-v2.mjs — schema regression
  • test/clawteam-scope.test.mjs + per-agent-auto-recall + scope-access-undefined (65 tests) — scope regression

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review action: COMMENT

Thanks for the PR. GitHub currently reports this branch as not mergeable (mergeable=CONFLICTING, merge_state_status=DIRTY), so I am deferring deep review until the diff can be reviewed against the current base.

Please rebase onto the latest base branch, resolve the merge conflicts, and push the updated branch. Once it is cleanly mergeable again, I will re-run the full review on the updated diff.

…azy-heal)

Rewrite the Tier 1 auto-recall lifecycle to fix counter bugs that blocked
decay-engine, access-tracker reinforcement, and future tier-manager promotion.
See Issue CortexReach#633 / RFC CortexReach#445.

Schema:
- Add optional `suppressed_until_ms?: number` field to SmartMemoryMetadata.
  parseSmartMetadata preserves the `undefined` sentinel for "never touched
  by Tier 1" vs "touched, no active suppression". Includes guard comment
  + NaN test to prevent accidental clampCount refactors.

Config:
- Add `autoRecallBadRecallDecayMs` (default 24h, Option C decay window)
  and `autoRecallSuppressionDurationMs` (default 30min). Both intentionally
  omit a `maximum` bound (documented via $comment).

Auto-recall inline patch:
- Increment access_count and last_accessed_at on every injection — the
  core fix unblocking downstream lifecycle components.
- Lazy-heal legacy pollution: memories with no suppressed_until_ms AND
  non-zero bad_recall_count / suppressed_until_turn get reset on first
  Tier 1 touch.
- Option C decay: if gap since last_injected_at exceeds badRecallDecayMs,
  reset bad_recall_count to 0 before stale-injection increment. Negative
  gaps from clock skew fall through as "no decay" (conservative).
- Write ms-based suppressed_until_ms (replacing turn-number suppression);
  gateway restart no longer creates false "just expired" windows.
- Always zero legacy suppressed_until_turn to prevent stale-number leaks.

Governance filter:
- Replace `suppressed_until_turn` vs `currentTurn` check with absolute-time
  comparison against `suppressed_until_ms`. Suppression anchor is now
  wall-clock, not session-local.

Out of scope (intentionally unchanged):
- staleInjected judgment (Proposal A / PR CortexReach#597 territory)
- Manual memory_recall path
- Tier-manager activation

Tests:
- New test/tier1-counters.test.mjs asserts access_count increments 0->1
  via the shared parseSmartMetadata plumbing.
- Registered in npm test chain, ci-test-manifest, and
  verify-ci-test-manifest baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raw34 raw34 force-pushed the fix/tier1-memory-counters branch from 0212126 to 0841c4a Compare May 4, 2026 12:10
Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #712 Review: fix(auto-recall): Tier 1 memory counter fix (ms-based suppression + lazy-heal)

Verdict: REQUEST-CHANGES | 6 rounds completed | Value: 67% | Size: LARGE | Author: raw34

Value Assessment

Problem: Auto-recall memories can accumulate broken recall/suppression counters and fail to update access counters, which undermines recall availability and downstream reinforcement/decay behavior. This PR attempts to move suppression from session turn numbers to absolute millisecond timestamps, lazily heal legacy counter pollution, and increment access metadata on injection.

Dimension Assessment
Value Score 67%
Value Verdict review
Issue Linked true
Project Aligned true
Duplicate false
AI Slop Score 1/6
User Impact high
Urgency medium

Scope Drift: 1 flag(s)

  • test/smart-memory-lifecycle.mjs and test/tier1-counters.test.mjs duplicate the production patch logic in local helpers rather than exercising the actual index.ts auto-recall injection path

AI Slop Signals:

  • The new tests mirror production logic in local computeTier1Patch/isSuppressed helpers and include a comment saying index.ts wiring is verified by visual inspection, so part of the claimed fix is not directly tested.

Open Questions:

  • The provided linked issue data has no labels, assignment, or maintainer comments, so maintainer acknowledgement of #633 cannot be confirmed from this context.
  • PR #597 overlaps with recall governance and last_confirmed_use_at behavior; maintainers should confirm whether this Tier 1 ms-suppression approach should land independently or after Proposal A.
  • The full suite currently fails in test/smart-extractor-branches.mjs; that may be unrelated, but it must be resolved or classified before merge.
  • Maintainers should decide whether autoRecallBadRecallDecayMs and autoRecallSuppressionDurationMs need to be public config or could remain internal constants until behavior stabilizes.

Summary

Auto-recall memories can accumulate broken recall/suppression counters and fail to update access counters, which undermines recall availability and downstream reinforcement/decay behavior. This PR attempts to move suppression from session turn numbers to absolute millisecond timestamps, lazily heal legacy counter pollution, and increment access metadata on injection.

Evaluation Signals

Signal Value
Blockers 0
Warnings 0
PR Size LARGE
Verdict Floor request-changes
Risk Level high
Value Model codex
Primary Model codex
Adversarial Model claude

Must Fix

  • EF1: Full test suite is failing

Nice to Have

  • F2: Manual recall no longer clears active auto-recall suppression
  • F3: Debug output reports the retired suppression field
  • F4: New counter tests duplicate production logic instead of exercising it

Recommended Action

Good direction — problem is worth solving. Author should address must-fix findings, then this is ready to merge.


Reviewed at 2026-05-05T03:31:29Z | 6 rounds | Value: codex | Primary: codex | Adversarial: claude

Per rwmjhb's CHANGES_REQUESTED review (2026-05-05):

F2: manual `memory_recall` now clears `suppressed_until_ms` alongside
`suppressed_until_turn` in src/tools.ts:594-614. Without this, after the
governance check moved to ms-based suppression, a memory the user just
explicitly searched for would remain suppressed — a regression vs
pre-Tier-1 semantics where zeroing the turn field cleared the only
suppression mechanism. Same fix applied to `memory_promote` at line
~1939, which carries identical positive-signal semantics.

F3: `memory_explain_rank` debug output now reports `suppressedUntilMs`
instead of the retired `suppressedUntilTurn` (which is always 0 after
Tier 1 touches a memory). Cosmetic but eliminates a misleading metric.

F4: extracted Tier 1 patch logic to `src/auto-recall-tier1.ts` as pure
functions (`isSuppressed`, `computeTier1Patch`, plus default constants).
- index.ts auto-recall path now imports and calls these helpers,
  collapsing ~85 lines of inline logic into a single computeTier1Patch
  call site.
- test/tier1-counters.test.mjs and test/smart-memory-lifecycle.mjs now
  import the same helpers via jiti, replacing the local mirrored copies
  the reviewer flagged. Tests now drive the production code path
  directly, so any future drift surfaces immediately.

EF1 (test suite failure) is the pre-existing `smart-extractor-branches.mjs`
flake on master, verified against upstream/master CI runs. Out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raw34
Copy link
Copy Markdown
Contributor Author

raw34 commented May 5, 2026

Addressed in 97cffbe.

F2 (manual recall no longer clears active suppression) — fixed. memory_recall and memory_promote patches now zero suppressed_until_ms alongside suppressed_until_turn, restoring pre-Tier-1 semantics where a positive recall signal clears active suppression. (src/tools.ts:600-614, src/tools.ts:1933-1947)

F3 (debug output reports retired field) — fixed. memory_explain_rank now reports suppressedUntilMs=… instead of the retired suppressedUntilTurn (which is always 0 after Tier 1 touch). (src/tools.ts:2201)

F4 (tests duplicate production logic) — refactored. Pulled the Tier 1 logic out of index.ts into src/auto-recall-tier1.ts as two pure functions:

  • isSuppressed(meta, nowMs) — used by the governance filter
  • computeTier1Patch(meta, opts) — used by the auto-recall injection patch loop

test/tier1-counters.test.mjs and test/smart-memory-lifecycle.mjs now import these same helpers via jiti rather than maintaining local mirrors. The 22 unit tests + lifecycle integration assertion now drive the production code path directly, so any future drift surfaces immediately. The "verified by visual inspection" comment is gone.

EF1 (full test suite failing) — not caused by this PR. The failure is test/smart-extractor-branches.mjs (Smart extraction should trigger on turn 2 with cumulative count >= 2 + Failed to generate embedding from 127.0.0.1:XXXXX: Connection error), which is a pre-existing flake on master. Latest master CI run 25315749827 (commit 47b635d, the base this PR rebased onto) fails on the identical assertion, alongside update-consistency-lancedb (also flaky). The failure path is in smart-extractor / embedder code, which this PR does not touch. Out of scope here; would suggest a separate issue if the env-coupling needs hardening.

Open questions noted:

Verification: tier1-counters (22/22), smart-memory-lifecycle (incl. integration assertion), core-regression (22/22 excluding the master flake), packaging-and-workflow, storage-and-schema, cli-smoke all green locally.

@raw34
Copy link
Copy Markdown
Contributor Author

raw34 commented May 5, 2026

CI rerun shows two failures, both inherited from master (introduced by #716, merged ~3 hours ago):

  1. packaging-and-workflowError: unexpected manifest entry: test/issue606_sdk-migration.test.mjs. PR fix(sdk): migrate Bug 2 to api.runtime.agent (Issue #606) #716 added the file to scripts/ci-test-manifest.mjs (line 63) but did not add the matching entry to scripts/verify-ci-test-manifest.mjs EXPECTED_BASELINE, so the lockstep check fails.
  2. core-regressiontest/issue606_sdk-migration.test.mjs exited 1 with Cannot find module 'src/store.js' imported from index.ts and F1: circuit breaker opens… not-ok. The test itself is failing on master.

Verified the same failures on master at 09965e3: CI run 25359252529 has the identical two-line failure pattern. My PR is auto-merged with master before CI runs (default actions/checkout@v4 behavior on pull_request events), which is why the master-side breakage surfaces here.

No new code from this PR is implicated:

  • git merge --no-ff upstream/master against 97cffbe resolves cleanly with no conflicts.
  • All Tier 1 helpers and tests still pass locally — tier1-counters (22/22), smart-memory-lifecycle, core-regression minus the master-broken issue606_sdk-migration test, packaging-and-workflow minus the master EXPECTED_BASELINE mismatch.

Will be unblocked when master fixes #716. Happy to post the one-line EXPECTED_BASELINE patch as a separate PR if that helps.

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #712 Review: fix(auto-recall): Tier 1 memory counter fix (ms-based suppression + lazy-heal)

Verdict: REQUEST-CHANGES | 6 rounds completed | Value: 64% | Size: XL | Author: raw34

Value Assessment

Problem: Auto-recall can leave memory access and suppression counters in a broken state, causing recalled memories to stop appearing or fail to reinforce downstream decay/access tracking. The PR replaces session-turn suppression with millisecond-based suppression, lazily heals legacy counter pollution, and increments access metadata on Tier 1 injection.

Dimension Assessment
Value Score 64%
Value Verdict review
Issue Linked true
Project Aligned true
Duplicate false
AI Slop Score 0/6
User Impact high
Urgency medium

Scope Drift: 2 flag(s)

  • openclaw.plugin.json exposes new public configuration for decay and suppression duration; this may be justified operationally, but it creates long-term API surface beyond the narrow bug fix
  • test/smart-memory-lifecycle.mjs appends a helper-level assertion after the existing lifecycle pass message, which is slightly awkward test structure but still related to the fix

Open Questions:

  • Linked issue #633 has no labels, assignment, or maintainer comments in the provided context, so maintainer acknowledgement cannot be confirmed.
  • Maintainers should confirm whether PR #712 should land independently of PR #597 or wait for the broader Proposal A recall-governance direction.
  • Maintainers should decide whether autoRecallBadRecallDecayMs and autoRecallSuppressionDurationMs should remain public config or start as internal constants.
  • The full suite failure appears inherited from master according to the timeline, but CI should be green or the inherited failure should be clearly documented before merge.

Summary

Auto-recall can leave memory access and suppression counters in a broken state, causing recalled memories to stop appearing or fail to reinforce downstream decay/access tracking. The PR replaces session-turn suppression with millisecond-based suppression, lazily heals legacy counter pollution, and increments access metadata on Tier 1 injection.

Evaluation Signals

Signal Value
Blockers 0
Warnings 0
PR Size XL
Verdict Floor approve
Risk Level high
Value Model codex
Primary Model codex
Adversarial Model claude

Must Fix

  • F1: New Tier 1 config fields are dropped during config parsing

Nice to Have

  • EF1: Full test suite fails in smart extraction branch coverage
  • MR1: Read path silently retires legacy suppressed_until_turn — currently-suppressed memories surface immediately on deploy
  • MR2: buildSmartMetadata loses lazy-heal sentinel if any persistence layer maps undefined → null
  • MR3: TIER1_BAD_RECALL_SUPPRESSION_THRESHOLD remains hardcoded while companion knobs become public config

Recommended Action

Good direction — problem is worth solving. Author should address must-fix findings, then this is ready to merge.


Reviewed at 2026-05-05T10:30:44Z | 6 rounds | Value: codex | Primary: codex | Adversarial: claude

…+ MR2 + MR3)

F1 (must-fix): parsePluginConfig was dropping the two new Tier 1 config
fields. The interface declared them and index.ts read them, but the
return-object construction never copied them through, so user config
was always silently discarded and the runtime always saw undefined.

- Add a parseNonNegativeInt helper (parsePositiveInt rejects 0, but 0
  is a meaningful sentinel for both fields: disable decay / collapse
  suppression to a no-op).
- Wire autoRecallBadRecallDecayMs and autoRecallSuppressionDurationMs
  through parsePluginConfig.
- Add 5 regression tests in tier1-counters.test.mjs covering: field
  propagation, the 0 sentinel, undefined fallback, and negative-input
  rejection.

MR2 (nice-to-have): parseSmartMetadata used `!== undefined` to detect
the lazy-heal sentinel. If any persistence layer ever round-trips
undefined → null on the metadata JSON, the sentinel was lost (null
fell through to clampCount → 0, marking the memory as "Tier 1 touched"
incorrectly).

- Switch to `!= null` (covers both undefined and null) in both
  parseSmartMetadata and the patch path in buildSmartMetadata.
- Add a regression test for the null round-trip case.

MR3 (nice-to-have): document why TIER1_BAD_RECALL_SUPPRESSION_THRESHOLD
is a constant while companion knobs are public config — "3 strikes"
is a behavioral design choice; decay/duration are ops tuning. Note
the existing `minRepeated` opt seam is available if tuning is ever
needed.

EF1: pre-existing master flake, documented in earlier comment.
MR1 (legacy turn-based suppression at deploy): deferred — old turn
numbers were never deploy-stable in the pre-Tier-1 design (turn
counter resets per session), so "currently-suppressed memories" do
not have well-defined cross-deploy semantics. Lazy-heal is the
intended migration. Will document in the follow-up PR comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raw34
Copy link
Copy Markdown
Contributor Author

raw34 commented May 5, 2026

Second-round review addressed in 49693bc.

F1 (must-fix) — parsePluginConfig dropped new Tier 1 config fields: confirmed and fixed. The PluginConfig interface declared the fields and index.ts read them, but parsePluginConfig's return-object construction never copied them through, so user config was silently discarded and the runtime always saw undefined. Added a parseNonNegativeInt helper (the existing parsePositiveInt rejects 0, but 0 is a meaningful sentinel for both fields — disable decay / collapse suppression to a no-op) and wired both fields through. 5 new regression tests in tier1-counters.test.mjs cover: field propagation, the 0 sentinel, undefined fallback, and negative-input rejection.

MR2 (nice-to-have) — parseSmartMetadata loses lazy-heal sentinel on null: confirmed and fixed. parsed.suppressed_until_ms !== undefined failed to catch null, which falls through to clampCount(null, 0) === 0 and incorrectly marks the memory as "Tier 1 touched". Switched both the parse path and buildSmartMetadata's patch path to != null. Added a regression test for the null round-trip case.

MR3 (nice-to-have) — threshold hardcoded while knobs are public config: documented the rationale in src/auto-recall-tier1.ts. The "3 strikes" rule is a behavioral design choice that should hold across deployments; decay window and suppression duration are operational tuning parameters that ops may legitimately tune. The minRepeated opt on computeTier1Patch already provides the seam if real tuning need surfaces.

MR1 (nice-to-have) — read path silently retires suppressed_until_turn at deploy: I'd push back on this one. Pre-Tier-1 turn-based suppression was never deploy-stable — the turn counter is per-session and resets at every gateway restart, so a memory with suppressed_until_turn=999 from a previous session was already meaningless against a new session's currentTurn=1 (it would over-suppress for ~999 turns of the new session, an accidental behavior). Post-Tier-1 the legacy field is correctly ignored on read; lazy-heal is the migration path on next injection. The transition exposes some memories that were accidentally over-suppressed by the old logic, which is a better semantic, not a regression. Happy to add a short migration note in the PR description if maintainers want it documented for users.

EF1 — full test suite failure: still the pre-existing master flake (smart-extractor-branches.mjs + issue606_sdk-migration.test.mjs introduced by #716). Noted in earlier comment.

Verification: tier1-counters (28/28 — was 22, +6 new for F1/MR2), smart-metadata-v2, clawteam-scope + per-agent-auto-recall + scope-access-undefined (65/65), core-regression (22/22 minus master flake) all green locally.

Pre-fix the smart-memory-lifecycle test printed "OK: passed" before
the appended Tier 1 integration assertion ran, so a Tier 1 failure
would surface as "OK" followed by an AssertionError — confusing under
log skimming. Move the print to the end and reword it to reflect both
the legacy lifecycle and Tier 1 integration coverage.

Caught by self-audit before the next reviewer round; reviewer flagged
the same as a Scope Drift signal in the previous review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants