Add content-fingerprint dedup on the write path#61
Merged
Conversation
Skip mem0's LLM fact-extraction when byte-identical content is re-submitted.
Before add, the raw input is normalized (lowercase + collapse whitespace) and
SHA-256'd; the fingerprint is stored in the `content_fp` Qdrant payload field
and looked up (via the vector store's filter) on the next add. An exact repeat
returns {"results": [], "deduplicated": true, "memory_id": ...} without calling
the LLM. Adapts the OB1 content-fingerprint-dedup recipe.
Verified against a live Qdrant: metadata lands top-level in the payload and
filtering on the (unindexed) content_fp field works, so no payload index is
required. The dedup check is fail-open — any lookup error just proceeds with a
normal add, so it can only ever save work, never block a write.
- app/memory.py: content_fingerprint(), _existing_fingerprint_id() (fail-open),
and add_memory(content, dedup=True, **kwargs) wrapper.
- app/rest.py: AddMemoryRequest.dedup flag (default true); route through wrapper.
- app/mcp_server.py: add_memory tool routes through the wrapper (always dedups).
- tests: fingerprint normalization, lookup found/empty/fail-open, wrapper
store/skip/dedup-false/metadata-merge, plus REST + MCP integration cases.
conftest defaults the dedup lookup to empty so existing add tests are unaffected.
- docs: USER_GUIDE (How memory works + add endpoint + import idempotency),
DEVELOPER_GUIDE memory.py description.
Closes #48.
https://claude.ai/code/session_017835DVrvURaYnbQiPQwzue
There was a problem hiding this comment.
Pull request overview
Adds a lightweight, deterministic content-fingerprint lookup on the write path to short-circuit exact re-submissions before mem0’s LLM fact-extraction runs, making repeated imports/webhook retries cheaper and more idempotent.
Changes:
- Introduces
content_fingerprint()+_existing_fingerprint_id()and anadd_memory(..., dedup=True)wrapper inapp/memory.py. - Routes REST and MCP “add memory” calls through the new wrapper and adds a REST opt-out flag (
dedup: bool = True). - Expands tests and documentation to cover fingerprint storage, dedup short-circuiting, and the
dedup=falsebypass.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| app/memory.py | Adds fingerprint computation, best-effort lookup, and add wrapper that can short-circuit duplicates. |
| app/rest.py | Adds dedup request flag and routes adds through memory.add_memory(). |
| app/mcp_server.py | Routes MCP add_memory tool through the wrapper and documents dedup behavior. |
| tests/test_memory.py | Adds unit tests for fingerprinting, lookup behavior, and wrapper behavior. |
| tests/test_rest.py | Adds REST-level tests for fingerprint storage and dedup responses. |
| tests/test_mcp.py | Adds MCP-level test asserting dedup avoids calling .add(). |
| tests/conftest.py | Sets default mock vector_store list result to “empty” to avoid impacting existing tests. |
| docs/USER_GUIDE.md | Documents dedup behavior and the new dedup flag across user guide sections. |
| docs/DEVELOPER_GUIDE.md | Updates module description to mention dedup wrapper in memory.py. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…cal" wording - app/memory.py: the message-list fingerprint path previously json-dumped + lowercased, which did NOT collapse internal whitespace (newlines/tabs become escaped \n in JSON), contradicting the docstring. Normalize each message's role and text individually (lowercase + collapse whitespace) so equivalent transcripts dedupe. Extract a shared _normalize_text() helper. - Wording: dedup matches a normalized fingerprint (case-insensitive, whitespace-collapsed), not raw bytes. Replace "byte-identical" everywhere (app/rest.py dedup-field comment, USER_GUIDE x3) with accurate phrasing. - tests: assert message-transcript normalization (case/whitespace/newline equivalence, and that differing role/text fingerprint differently). https://claude.ai/code/session_017835DVrvURaYnbQiPQwzue
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds cheap content-fingerprint dedup on the write path so that re-submitting byte-identical content skips mem0's LLM fact-extraction (a Claude call per
add). Implements the OB1content-fingerprint-deduprecipe (backlog issue #48) — the flagship cost-saver.Flow: before
add, the raw input is normalized (lowercase + collapse whitespace) and SHA-256'd; the fingerprint is stored in acontent_fpQdrant payload field and looked up on the next add. An exact repeat returns{"results": [], "deduplicated": true, "memory_id": "…"}without calling the LLM. This makes import re-runs and webhook/n8n retries cheap and idempotent.Validated against a live Qdrant
Before building, I confirmed the two assumptions this depends on against a real deployment:
_create_memorymerges custom metadata into the payload top-level, socontent_fpis directly queryable (confirmed from a real point's payload).content_fpfield fine (a filter on the non-indexedhashfield returned HTTP 200) — so no payload index is required._create_filter({"content_fp": …})builds a valid QdrantFieldCondition/MatchValue(checked locally).Safety
None, so the add proceeds normally. The check can only ever save work, never block a write.AddMemoryRequestgainsdedup: bool = True; passfalseto force re-extraction.user_idonly (notagent_id) — consistent with the one-shared-pool model; the same fact from two agents dedupes to one.Files
app/memory.py—content_fingerprint(),_existing_fingerprint_id()(fail-open),add_memory(content, dedup=True, **kwargs).app/rest.py—dedupflag, route add through the wrapper.app/mcp_server.py—add_memorytool routes through the wrapper (always dedups; the tool docstring tells the model repeats are safe).Tests
tests/test_memory.py— fingerprint normalization (whitespace/case, message lists),_existing_fingerprint_idfound/empty/fail-open-on-error, andadd_memorystore-new/skip-duplicate/dedup=False/metadata-merge.tests/test_rest.py+tests/test_mcp.py— integration: fingerprint stored, exact repeat deduplicated (no.add),dedup=falsebypasses the check.conftest.pydefaults the dedup lookup to empty so existing add tests are unaffected.ruffclean.Docs
dedupflag + deduplicated response), and the import-toolkit idempotency note.memory.pydescription.Post-merge verification (optional, on your live deploy)
After this deploys, the same
contentposted twice should returndeduplicated: truethe second time with no new Qdrant point:Closes #48.
https://claude.ai/code/session_017835DVrvURaYnbQiPQwzue
Generated by Claude Code