fix(memory): actually update on near-duplicate dedup (#34) by chiruno-9 · Pull Request #36 · cyzus/suzent

chiruno-9 · 2026-05-20T07:10:29Z

Summary

Fixes the data-loss bug described in #34. In MemoryManager._deduplicate_and_store_facts, when a newly extracted fact's cosine similarity to an existing memory exceeded DEDUPLICATION_SIMILARITY_THRESHOLD (0.85), the code appended the existing id to result.memories_updated but performed no update — the new content was silently dropped.

The most damaging case is factual updates: "I work at Google" → "I just joined Microsoft" embed close enough on short sentences that the newer fact gets collapsed into the older one and discarded. The system then retains stale information indefinitely.
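To make the collapse concrete, here is a minimal sketch of the similarity check. The three-dimensional vectors are made-up toy embeddings (an assumption for illustration, not output from any real model); only the 0.85 threshold comes from this PR.

```python
import math

DEDUPLICATION_SIMILARITY_THRESHOLD = 0.85  # value quoted in this PR


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings for two short, structurally similar sentences.
old_fact = [0.9, 0.4, 0.1]  # "I work at Google"
new_fact = [0.8, 0.5, 0.2]  # "I just joined Microsoft"

similarity = cosine_similarity(old_fact, new_fact)
is_near_duplicate = similarity > DEDUPLICATION_SIMILARITY_THRESHOLD
print(f"similarity={similarity:.3f} near_duplicate={is_near_duplicate}")
```

With vectors this close, the newer fact lands on the near-duplicate branch even though it contradicts the older one, which is why that branch has to write something.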

Changes

src/suzent/memory/manager.py — the actual fix

On a near-duplicate hit, generate an embedding for the new fact and call store.update_memory(memory_id, content=..., embedding=..., metadata=..., importance=...). The store layer preserves the original row's timestamps; only content/vector/metadata/importance change.
If update_memory reports failure, fall back to inserting a fresh row so the fact is never silently lost.
Constant renamed DEDUPLICATION_SIMILARITY_THRESHOLD → DEFAULT_DEDUPLICATION_SIMILARITY_THRESHOLD to reflect that it's now a default, not a hard-coded global. Per-instance threshold lives on self.dedup_threshold.
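The update-then-fallback flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `manager.py`: `InMemoryStore`, `add_memory`, `store_fact`, the `memories_created` field, and the `"similarity"` key are hypothetical names; only `update_memory`, its keyword arguments, `memories_updated`, and `similar[0]["id"]` appear in the PR.

```python
import asyncio
from dataclasses import dataclass, field

DEFAULT_DEDUPLICATION_SIMILARITY_THRESHOLD = 0.85  # value quoted in this PR


@dataclass
class Fact:
    content: str
    importance: float = 0.5


@dataclass
class Result:
    memories_created: list = field(default_factory=list)  # hypothetical field
    memories_updated: list = field(default_factory=list)


class InMemoryStore:
    """Hypothetical stand-in for the real store layer."""

    def __init__(self, fail_updates=False):
        self.rows = {}
        self.fail_updates = fail_updates

    async def add_memory(self, content, embedding, metadata, importance):
        memory_id = str(len(self.rows) + 1)
        self.rows[memory_id] = {
            "content": content, "embedding": embedding,
            "metadata": metadata, "importance": importance,
        }
        return memory_id

    async def update_memory(self, memory_id, content, embedding, metadata, importance):
        if self.fail_updates or memory_id not in self.rows:
            return False  # report failure instead of raising
        self.rows[memory_id].update(
            content=content, embedding=embedding,
            metadata=metadata, importance=importance,
        )
        return True


async def store_fact(store, fact, embedding, similar, result,
                     threshold=DEFAULT_DEDUPLICATION_SIMILARITY_THRESHOLD):
    """Overwrite the closest match on a near-duplicate hit, else insert."""
    metadata = {}
    if similar and similar[0]["similarity"] > threshold:
        existing_id = str(similar[0]["id"])
        updated = await store.update_memory(
            memory_id=existing_id, content=fact.content, embedding=embedding,
            metadata=metadata, importance=fact.importance,
        )
        if updated:
            result.memories_updated.append(existing_id)
            return
        # Update failed: fall through so the fact is never silently lost.
    new_id = await store.add_memory(
        content=fact.content, embedding=embedding,
        metadata=metadata, importance=fact.importance,
    )
    result.memories_created.append(new_id)


async def demo():
    store, result = InMemoryStore(), Result()
    await store_fact(store, Fact("I work at Google"), [0.9, 0.4], [], result)
    hit = [{"id": "1", "similarity": 0.95}]
    await store_fact(store, Fact("I just joined Microsoft"), [0.8, 0.5], hit, result)
    return store, result


store, result = asyncio.run(demo())
print(store.rows["1"]["content"])
```

Falling through to the insert path when `update_memory` returns a falsy value keeps a single write path for new rows and means a failed update costs at most a duplicate row, never a lost fact.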

Make the threshold tunable (issue's "reason 3": threshold is embedding-model-dependent)

MemoryManager.__init__ accepts dedup_threshold: Optional[float] (default = constant).
ConfigModel.memory_dedup_threshold: Optional[float] = None in src/suzent/config/__init__.py.
memory/lifecycle.py wires CONFIG.memory_dedup_threshold through on init.
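A minimal sketch of how the threshold might flow from config to manager, assuming a simplified dataclass in place of the real `ConfigModel` and a hypothetical `build_manager` helper standing in for the wiring in `memory/lifecycle.py`:

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_DEDUPLICATION_SIMILARITY_THRESHOLD = 0.85


@dataclass
class ConfigModel:
    # Simplified stand-in; the real model has many more fields.
    memory_dedup_threshold: Optional[float] = None


class MemoryManager:
    def __init__(self, dedup_threshold: Optional[float] = None):
        # Compare against None explicitly so an explicit 0.0 is not
        # mistaken for "unset" and replaced by the default.
        self.dedup_threshold = (
            dedup_threshold
            if dedup_threshold is not None
            else DEFAULT_DEDUPLICATION_SIMILARITY_THRESHOLD
        )


def build_manager(config: ConfigModel) -> MemoryManager:
    """Hypothetical helper mirroring the lifecycle wiring."""
    return MemoryManager(dedup_threshold=config.memory_dedup_threshold)


default_manager = build_manager(ConfigModel())
tuned_manager = build_manager(ConfigModel(memory_dedup_threshold=0.92))
print(default_manager.dedup_threshold, tuned_manager.dedup_threshold)
```

Leaving the config field as `None` by default is what keeps existing deployments on the old 0.85 value without touching `default.yaml`.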

Backwards compatible: if neither callers nor default.yaml set the new field, behavior is identical to before — except that near-duplicates are now actually updated instead of silently dropped.

Tests

tests/memory/test_dedup_update.py — 8 tests, all backend-only with mocked I/O:

test_near_duplicate_actually_updates_existing_memory — pins the bug: store.update_memory is called with the new content + freshly generated embedding + new importance.
test_below_threshold_inserts_new_memory — non-duplicate path unchanged.
test_no_similar_memories_inserts_new_memory — empty search result path unchanged.
test_threshold_is_configurable_per_instance — a tighter threshold (0.99) lets a 0.95 hit be treated as new.
test_update_failure_falls_back_to_insert — never lose a fact silently if the store rejects the update.
test_constructor_accepts_explicit_threshold[None|0.7|0.92] — public API honors the new arg.
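For readers who want the shape of these tests without opening the file, here is a rough sketch of two of them using `unittest.mock.AsyncMock`. The `upsert_fact` function is a hypothetical, condensed stand-in for `_deduplicate_and_store_facts`, and `add_memory` and the `"similarity"` key are assumed names; the actual tests in `tests/memory/test_dedup_update.py` may be structured differently.

```python
import asyncio
from unittest.mock import AsyncMock


async def upsert_fact(store, embedding_gen, content, importance, similar,
                      threshold=0.85):
    """Hypothetical condensed version of the logic under test."""
    embedding = await embedding_gen.generate(content)
    if similar and similar[0]["similarity"] > threshold:
        updated = await store.update_memory(
            memory_id=str(similar[0]["id"]), content=content,
            embedding=embedding, metadata={}, importance=importance,
        )
        if updated:
            return "updated"
    await store.add_memory(content=content, embedding=embedding,
                           metadata={}, importance=importance)
    return "inserted"


def make_mocks(update_ok):
    store = AsyncMock()  # child attributes are AsyncMock too
    store.update_memory.return_value = update_ok
    embedding_gen = AsyncMock()
    embedding_gen.generate.return_value = [0.1, 0.2]
    return store, embedding_gen


def test_update_failure_falls_back_to_insert():
    store, embedding_gen = make_mocks(update_ok=False)
    hit = [{"id": 7, "similarity": 0.95}]
    outcome = asyncio.run(
        upsert_fact(store, embedding_gen, "I just joined Microsoft", 0.8, hit)
    )
    assert outcome == "inserted"
    store.update_memory.assert_awaited_once()
    store.add_memory.assert_awaited_once()


def test_threshold_is_configurable():
    store, embedding_gen = make_mocks(update_ok=True)
    hit = [{"id": 7, "similarity": 0.95}]
    outcome = asyncio.run(
        upsert_fact(store, embedding_gen, "new fact", 0.5, hit, threshold=0.99)
    )
    assert outcome == "inserted"  # 0.95 does not exceed 0.99
    store.update_memory.assert_not_awaited()
```

The sketch drives the coroutine with `asyncio.run` so it does not depend on any particular pytest async plugin.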

Out of scope (deliberately deferred)

The issue's "possible future paths" — LLM-assisted dedup, adaptive thresholds, entity-graph merging — are all bigger directional changes. They're not needed to stop the data loss and would dilute review. Happy to take any of them on as follow-up PRs if you'd like.

Test plan

uv run pytest tests/memory/test_dedup_update.py — 8/8 pass.
uv run pytest tests/memory/ — 22/22 pass (no regression).
uv run ruff check on changed files — clean.
Manual smoke: with a real embedding model, store "I work at Google", then submit "I just joined Microsoft" — confirm the stored memory's content is updated rather than the new fact being dropped.

When ``_deduplicate_and_store_facts`` detected a fact whose cosine similarity to an existing memory exceeded the threshold, it appended the existing id to ``memories_updated`` but performed no update, so the new content was silently dropped. For factual *updates* ("I work at Google" → "I just joined Microsoft"), the two short sentences embed close enough to be collapsed, so the system retained stale information indefinitely.

Now, on a near-duplicate:

* generate an embedding for the new fact,
* call ``store.update_memory`` to overwrite content + vector + metadata + importance on the existing row (timestamps/ids preserved),
* if the store update fails, fall back to inserting so the fact is never silently lost.

Also expose the threshold as a tunable. The fixed 0.85 was embedding-model-dependent and would over-merge short paraphrases on some models. Add ``MemoryManager(dedup_threshold=...)`` and a ``CONFIG.memory_dedup_threshold`` knob so users can calibrate without patching code. Backwards compatible (default unchanged).

The richer follow-ups noted in the issue (LLM-assisted dedup, adaptive thresholds, entity-graph merging) are intentionally out of scope here: this PR fixes the data-loss bug and unblocks per-model tuning.

Refs cyzus#34

Copilot

Pull request overview

Fixes a data-loss bug in the backend memory deduplication flow by ensuring near-duplicate facts actually update the existing memory entry (instead of being silently dropped), and makes the dedup similarity threshold configurable via app config and MemoryManager initialization.

Changes:

Update near-duplicate handling to call store.update_memory(...), with fallback to insert if the update fails.
Make the dedup similarity threshold configurable (ConfigModel.memory_dedup_threshold → MemoryManager(dedup_threshold=...)).
Add focused tests covering update behavior, threshold configurability, and update-failure fallback.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/suzent/memory/manager.py` | Implements near-duplicate update behavior and adds an instance-level configurable dedup threshold. |
| `src/suzent/memory/lifecycle.py` | Wires the new config value into `MemoryManager` initialization. |
| `src/suzent/config/__init__.py` | Adds `memory_dedup_threshold` to the config model. |
| `tests/memory/test_dedup_update.py` | Adds regression + behavior tests for near-duplicate update/insert logic and threshold handling. |


+                existing_id = str(similar[0]["id"])
+                embedding = await self.embedding_gen.generate(fact.content)
+                updated = await self.store.update_memory(
+                    memory_id=existing_id,
+                    content=fact.content,


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

+                updated = await self.store.update_memory(
+                    memory_id=existing_id,
+                    content=fact.content,
+                    embedding=embedding,
+                    metadata=metadata,
+                    importance=fact.importance,
+                )


cyzus requested a review from Copilot May 20, 2026 12:19

Copilot started reviewing on behalf of cyzus May 20, 2026 12:20

Copilot AI reviewed May 20, 2026

Potential fix for pull request finding

23974ac

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

cyzus requested a review from Copilot May 28, 2026 21:53

Copilot started reviewing on behalf of cyzus May 28, 2026 21:53

Copilot AI reviewed May 28, 2026

Comment thread src/suzent/memory/manager.py

Comment on lines +531 to +537

                updated = await self.store.update_memory(
                    memory_id=existing_id,
                    content=fact.content,
                    embedding=embedding,
                    metadata=metadata,
                    importance=fact.importance,
                )

DeerGoat mentioned this pull request May 29, 2026

refactor(memory): append-only writes + retrieval-driven consolidation (fixes #34) #41

Draft

chiruno-9 wants to merge 2 commits into cyzus:main from chiruno-9:fix/memory-dedup-actually-update


2 participants
