[FEAT] Memory deduplication: fixed cosine threshold causes silent data loss #34

@DeerGoat

Description

Is your feature request related to a problem? Please describe.

The current memory deduplication in MemoryManager._deduplicate_and_store_facts() (src/suzent/memory/manager.py) uses a fixed cosine similarity threshold of 0.85 to decide whether a newly extracted fact is a duplicate of an existing memory. When similarity exceeds the threshold, the new fact is silently dropped.

This has several issues:

1. Contradictions and updates are treated as duplicates

If a user previously said "I work at Google" and later says "I just joined Microsoft", the two facts — both short sentences about employment at a tech company — can score above 0.85 cosine similarity. The system silently discards the newer fact, retaining stale information indefinitely.

More generally, any factual update to previously stored information (job changes, preference changes, project pivots) is likely to be dropped because the old and new facts are semantically close.

2. "Updated" memories are not actually updated

When a duplicate is detected, the code appends the existing memory's ID to memories_updated but performs no actual modification to the stored memory. The new content is lost.

# manager.py L510-515
if similar and similar[0].get("similarity", 0) > DEDUPLICATION_SIMILARITY_THRESHOLD:
    result.memories_updated.append(str(similar[0]["id"]))  # nothing is actually updated
else:
    memory_id = await self._add_memory_internal(...)
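As a rough illustration of what "updated" could mean here, the sketch below overwrites the matched memory with the newer fact instead of only recording its ID. Everything in it is hypothetical: `store_fact`, `DedupResult`, the `memories_created` field, and the plain-dict store stand in for the real manager and storage layer, which this issue does not show.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.85  # the fixed value described in this issue


@dataclass
class DedupResult:
    """Hypothetical stand-in for the result object used by the manager."""
    memories_created: list[str] = field(default_factory=list)
    memories_updated: list[str] = field(default_factory=list)


def store_fact(store: dict[str, str], fact: str,
               similar: list[dict], result: DedupResult) -> None:
    """Store `fact`; overwrite the best match when it is a near-duplicate."""
    if similar and similar[0].get("similarity", 0) > THRESHOLD:
        memory_id = str(similar[0]["id"])
        store[memory_id] = fact  # newer content replaces the stale memory
        result.memories_updated.append(memory_id)
    else:
        memory_id = str(len(store) + 1)  # toy ID scheme, assumption only
        store[memory_id] = fact
        result.memories_created.append(memory_id)
```

Blind replacement has its own failure mode: if the two facts are related but distinct, the older one is lost instead. It only makes `memories_updated` truthful; deciding whether a near-match is a duplicate, an update, or a separate fact is a different problem.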

3. Threshold is embedding-model-dependent

A fixed 0.85 does not generalize across embedding models. Different models produce different similarity distributions for the same text pairs. Swapping embedding models (e.g. gemini-embedding-001 → text-embedding-3-small) changes what 0.85 means in practice, with no automatic adjustment.
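One possible way to tie the threshold to the model, sketched under assumptions: embed a small set of fact pairs known to be distinct, collect their similarities, and take a high quantile of that distribution as the cutoff. `calibrate_threshold` and its `quantile` parameter are hypothetical names, and the calibration pairs would have to come from somewhere (hand-labelled or synthetic).

```python
import math


def calibrate_threshold(distinct_pair_sims: list[float],
                        quantile: float = 0.99) -> float:
    """Return the nearest-rank quantile of similarities between distinct facts.

    Pairs scoring above the returned value are rarer than (1 - quantile)
    among known-distinct facts for the model that produced the scores.
    """
    if not distinct_pair_sims:
        raise ValueError("need at least one similarity score")
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(distinct_pair_sims)
    rank = math.ceil(quantile * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

The result would be recomputed whenever the embedding model changes, so the constant is replaced by a per-model value. It still cannot separate an update from a duplicate on its own, since both sit at the high end of the distribution.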

4. Short single-sentence embeddings cluster tightly

Each fact is embedded as a single sentence. Short texts produce vectors dominated by broad topic signal, so two facts about the same topic but with meaningfully different content (e.g. "Uses React for frontend" vs "Migrating from React to Vue") can exceed 0.85 and be collapsed.
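A toy illustration of the effect, using hand-made 3-dimensional vectors rather than real embeddings: when most of each vector's magnitude lies along a shared "topic" axis, the cosine similarity stays high even though the remaining components do not overlap at all. The vectors and axis meanings are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Axis 0: shared topic signal; axes 1 and 2: fact-specific detail.
# Hand-made values for illustration, not output from any embedding model.
uses_react = [0.9, 0.3, 0.0]
migrating_to_vue = [0.9, 0.0, 0.3]

similarity = cosine(uses_react, migrating_to_vue)
print(similarity > 0.85)
```

Here the detail components are orthogonal, yet the pair would still be collapsed under a 0.85 cutoff because the shared component dominates both norms.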

Possible future paths

  • LLM-assisted dedup: Use embedding similarity as a cheap pre-filter, then ask an LLM to classify near-matches as duplicate / update / distinct (similar to Mem0's approach).
  • Adaptive thresholds: Auto-calibrate per embedding model rather than using a fixed constant.
  • Entity-level merging: Extract entities and relations, dedup at the entity graph level (similar to LangMem).
  • Actually update on conflict: When a near-match is detected, update or replace the existing memory rather than silently dropping the new fact.
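A minimal sketch of how the first and last options could fit together, assuming the similarity search returns the best match with `id`, `similarity`, and `content` keys (the `content` key is an assumption). The LLM call is represented by an injected `classify` callable so the control flow can be shown without committing to any provider; `Verdict`, `resolve_fact`, and the returned action strings are all hypothetical.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    DUPLICATE = "duplicate"
    UPDATE = "update"
    DISTINCT = "distinct"


# (existing_content, new_fact) -> Verdict; in practice this would wrap an LLM call.
Classifier = Callable[[str, str], Verdict]


def resolve_fact(new_fact: str, best_match: Optional[dict],
                 classify: Classifier,
                 prefilter: float = 0.85) -> Tuple[str, Optional[str]]:
    """Return (action, memory_id), where action is 'add', 'skip', or 'replace'."""
    # Cheap pre-filter: below the cutoff, store without consulting the classifier.
    if best_match is None or best_match.get("similarity", 0) <= prefilter:
        return ("add", None)

    verdict = classify(best_match["content"], new_fact)
    memory_id = str(best_match["id"])
    if verdict is Verdict.DUPLICATE:
        return ("skip", memory_id)      # same information, nothing to store
    if verdict is Verdict.UPDATE:
        return ("replace", memory_id)   # newer fact supersedes the old one
    return ("add", None)                # related but distinct, keep both
```

With this shape the embedding threshold only decides when to pay for a classification, so a somewhat miscalibrated value costs extra LLM calls rather than lost facts.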

Labels: enhancement (New feature or request)
