Add content-fingerprint dedup on the write path #61

Merged
imonroe merged 2 commits into main from claude/ob1-content-dedup
Jun 7, 2026
@imonroe imonroe commented Jun 7, 2026

Summary

Adds cheap content-fingerprint dedup on the write path so that re-submitting content that is identical after normalization (case-insensitive, whitespace-collapsed) skips mem0's LLM fact-extraction (a Claude call per add). Implements the OB1 content-fingerprint-dedup recipe (backlog issue #48), the flagship cost-saver.

Flow: before add, the raw input is normalized (lowercase + collapse whitespace) and SHA-256'd; the fingerprint is stored in a content_fp Qdrant payload field and looked up on the next add. An exact repeat returns {"results": [], "deduplicated": true, "memory_id": "…"} without calling the LLM. This makes import re-runs and webhook/n8n retries cheap and idempotent.
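The normalize-then-hash step described above can be sketched as follows. This is a minimal illustration, not the code merged in app/memory.py: the function bodies are assumptions, and stripping leading and trailing whitespace is an extra assumption on top of the "lowercase + collapse whitespace" rule.

```python
import hashlib
import re


def normalize_text(text: str) -> str:
    """Collapse every whitespace run to one space, trim, and lowercase."""
    # Trimming the ends is an assumption; the PR only states
    # "lowercase + collapse whitespace".
    return re.sub(r"\s+", " ", text).strip().lower()


def content_fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


a = content_fingerprint("Dedup  smoke\ttest 12345\n")
b = content_fingerprint("dedup smoke test 12345")
print(a == b)  # True: case and whitespace differences are normalized away
print(len(a))  # 64 hex characters
```

Because the digest is computed over normalized text, two inputs that differ only in case or spacing map to the same content_fp value, while any change to the words themselves produces a different fingerprint.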

Validated against a live Qdrant

Before building, I checked the assumptions this depends on, the first two against a real deployment:

  • mem0's _create_memory merges custom metadata into the payload top-level, so content_fp is directly queryable (confirmed from a real point's payload).
  • Qdrant executes a filter on the unindexed content_fp field fine (a filter on the non-indexed hash field returned HTTP 200) — so no payload index is required.
  • _create_filter({"content_fp": …}) builds a valid Qdrant FieldCondition/MatchValue (checked locally).
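For orientation, the lookup validated above corresponds roughly to a Qdrant scroll request with a payload filter. The sketch below builds only the JSON body for Qdrant's REST scroll endpoint (POST /collections/{collection_name}/points/scroll). The helper names are hypothetical, and the assumption that user_id sits next to content_fp as a top-level payload key is mine; the PR itself goes through mem0's _create_filter rather than hand-built JSON.

```python
import json


def fingerprint_filter(content_fp: str, user_id: str) -> dict:
    """Build a Qdrant filter that matches one fingerprint for one user."""
    return {
        "must": [
            {"key": "content_fp", "match": {"value": content_fp}},
            # Assumed payload key: dedup is scoped by user_id in the PR.
            {"key": "user_id", "match": {"value": user_id}},
        ]
    }


def scroll_request_body(content_fp: str, user_id: str) -> dict:
    """Body for a scroll request that only needs to prove a match exists."""
    return {
        "filter": fingerprint_filter(content_fp, user_id),
        "limit": 1,             # one hit is enough to call it a duplicate
        "with_payload": False,  # only the point id is needed
    }


# Placeholder fingerprint: 64 hex characters, not a real digest.
print(json.dumps(scroll_request_body("f" * 64, "alice"), indent=2))
```

Qdrant evaluates such a filter without a payload index, which matches the observation above; an index would presumably only matter for lookup speed on large collections.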

Safety

  • Fail-open: any error in the dedup lookup returns None, so the add proceeds normally. The check can only ever save work, never block a write.
  • Opt-out: REST AddMemoryRequest gains dedup: bool = True; pass false to force re-extraction.
  • Dedup is scoped by user_id only (not agent_id) — consistent with the one-shared-pool model; the same fact from two agents dedupes to one.
  • Distinct from mem0's semantic dedup, which still applies to similar-but-not-identical content that reaches the LLM.
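The fail-open and opt-out behaviour can be sketched with plain callables standing in for the vector-store lookup and for mem0's extract-and-store step. Everything here is hypothetical scaffolding (the lookup and extract_and_store parameters, the in-memory store, the generated ids); only the response shape and the dedup flag come from the PR description.

```python
import hashlib
import re
from typing import Callable, Optional


def content_fingerprint(text: str) -> str:
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def existing_fingerprint_id(
    lookup: Callable[[str, str], Optional[str]], fp: str, user_id: str
) -> Optional[str]:
    """Fail-open lookup: any error is treated as 'no duplicate found'."""
    try:
        return lookup(fp, user_id)
    except Exception:
        return None


def add_memory(content, user_id, *, lookup, extract_and_store, dedup=True):
    fp = content_fingerprint(content)
    if dedup:
        memory_id = existing_fingerprint_id(lookup, fp, user_id)
        if memory_id is not None:
            # Exact repeat: skip the expensive extraction entirely.
            return {"results": [], "deduplicated": True, "memory_id": memory_id}
    return extract_and_store(content, user_id, {"content_fp": fp})


# In-memory stand-ins for the vector store and the LLM-backed add.
store: dict = {}
calls: list = []


def lookup(fp, user_id):
    return store.get((user_id, fp))


def extract_and_store(content, user_id, metadata):
    calls.append(content)  # each entry represents one LLM extraction
    memory_id = f"mem-{len(calls)}"
    store[(user_id, metadata["content_fp"])] = memory_id
    return {"results": [{"id": memory_id, "memory": content}]}


def broken_lookup(fp, user_id):
    raise ConnectionError("vector store unreachable")


deps = {"lookup": lookup, "extract_and_store": extract_and_store}
first = add_memory("Dedup smoke test 12345", "alice", **deps)
second = add_memory("dedup  smoke test 12345", "alice", **deps)
forced = add_memory("dedup smoke test 12345", "alice", dedup=False, **deps)
failed = add_memory(
    "dedup smoke test 12345", "alice",
    lookup=broken_lookup, extract_and_store=extract_and_store,
)

print(second)      # {'results': [], 'deduplicated': True, 'memory_id': 'mem-1'}
print(len(calls))  # 3: the first add, the forced add, and the fail-open add
```

The key design point is that the lookup sits strictly in front of the normal add path: a hit returns early, and every other outcome, including an exception, falls through to the same call that would have run without dedup.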

Files

  • app/memory.py: content_fingerprint(), _existing_fingerprint_id() (fail-open), add_memory(content, dedup=True, **kwargs).
  • app/rest.py: dedup flag, route add through the wrapper.
  • app/mcp_server.py: add_memory tool routes through the wrapper (always dedups; the tool docstring tells the model repeats are safe).

Tests

  • tests/test_memory.py — fingerprint normalization (whitespace/case, message lists), _existing_fingerprint_id found/empty/fail-open-on-error, and add_memory store-new/skip-duplicate/dedup=False/metadata-merge.
  • tests/test_rest.py + tests/test_mcp.py — integration: fingerprint stored, exact repeat deduplicated (no .add), dedup=false bypasses the check.
  • conftest.py defaults the dedup lookup to empty so existing add tests are unaffected.
  • Full suite: 149 passed, ruff clean.

Docs

  • User Guide: "How memory works" (exact-duplicate add is free), the add endpoint (dedup flag + deduplicated response), and the import-toolkit idempotency note.
  • Developer Guide: memory.py description.

Post-merge verification (optional, on your live deploy)

After this deploys, the same content posted twice should return deduplicated: true the second time with no new Qdrant point:

curl -s -X POST https://mem0.rage5.com/api/v1/memories \
  -H "Authorization: Bearer $MEM0_API_KEY" -H 'Content-Type: application/json' \
  -d '{"content":"dedup smoke test 12345"}'   # run twice; 2nd → {"deduplicated": true, ...}

Closes #48.

https://claude.ai/code/session_017835DVrvURaYnbQiPQwzue


Generated by Claude Code

Skip mem0's LLM fact-extraction when byte-identical content is re-submitted.
Before add, the raw input is normalized (lowercase + collapse whitespace) and
SHA-256'd; the fingerprint is stored in the `content_fp` Qdrant payload field
and looked up (via the vector store's filter) on the next add. An exact repeat
returns {"results": [], "deduplicated": true, "memory_id": ...} without calling
the LLM. Adapts the OB1 content-fingerprint-dedup recipe.

Verified against a live Qdrant: metadata lands top-level in the payload and
filtering on the (unindexed) content_fp field works, so no payload index is
required. The dedup check is fail-open — any lookup error just proceeds with a
normal add, so it can only ever save work, never block a write.

- app/memory.py: content_fingerprint(), _existing_fingerprint_id() (fail-open),
  and add_memory(content, dedup=True, **kwargs) wrapper.
- app/rest.py: AddMemoryRequest.dedup flag (default true); route through wrapper.
- app/mcp_server.py: add_memory tool routes through the wrapper (always dedups).
- tests: fingerprint normalization, lookup found/empty/fail-open, wrapper
  store/skip/dedup-false/metadata-merge, plus REST + MCP integration cases.
  conftest defaults the dedup lookup to empty so existing add tests are unaffected.
- docs: USER_GUIDE (How memory works + add endpoint + import idempotency),
  DEVELOPER_GUIDE memory.py description.

Closes #48.

https://claude.ai/code/session_017835DVrvURaYnbQiPQwzue

Copilot AI left a comment

Pull request overview

Adds a lightweight, deterministic content-fingerprint lookup on the write path to short-circuit exact re-submissions before mem0’s LLM fact-extraction runs, making repeated imports/webhook retries cheaper and more idempotent.

Changes:

  • Introduces content_fingerprint() + _existing_fingerprint_id() and an add_memory(..., dedup=True) wrapper in app/memory.py.
  • Routes REST and MCP “add memory” calls through the new wrapper and adds a REST opt-out flag (dedup: bool = True).
  • Expands tests and documentation to cover fingerprint storage, dedup short-circuiting, and the dedup=false bypass.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| app/memory.py | Adds fingerprint computation, best-effort lookup, and add wrapper that can short-circuit duplicates. |
| app/rest.py | Adds dedup request flag and routes adds through memory.add_memory(). |
| app/mcp_server.py | Routes MCP add_memory tool through the wrapper and documents dedup behavior. |
| tests/test_memory.py | Adds unit tests for fingerprinting, lookup behavior, and wrapper behavior. |
| tests/test_rest.py | Adds REST-level tests for fingerprint storage and dedup responses. |
| tests/test_mcp.py | Adds MCP-level test asserting dedup avoids calling .add(). |
| tests/conftest.py | Sets default mock vector_store list result to “empty” to avoid impacting existing tests. |
| docs/USER_GUIDE.md | Documents dedup behavior and the new dedup flag across user guide sections. |
| docs/DEVELOPER_GUIDE.md | Updates module description to mention dedup wrapper in memory.py. |

Comment thread app/memory.py
Comment thread docs/USER_GUIDE.md Outdated
Comment thread docs/USER_GUIDE.md Outdated
Comment thread docs/USER_GUIDE.md Outdated
Comment thread app/rest.py Outdated
Comment thread tests/test_memory.py Outdated
…cal" wording

- app/memory.py: the message-list fingerprint path previously json-dumped +
  lowercased, which did NOT collapse internal whitespace (newlines/tabs become
  escaped \n in JSON), contradicting the docstring. Normalize each message's
  role and text individually (lowercase + collapse whitespace) so equivalent
  transcripts dedupe. Extract a shared _normalize_text() helper.
- Wording: dedup matches a normalized fingerprint (case-insensitive,
  whitespace-collapsed), not raw bytes. Replace "byte-identical" everywhere
  (app/rest.py dedup-field comment, USER_GUIDE x3) with accurate phrasing.
- tests: assert message-transcript normalization (case/whitespace/newline
  equivalence, and that differing role/text fingerprint differently).

https://claude.ai/code/session_017835DVrvURaYnbQiPQwzue
@imonroe imonroe merged commit 824651d into main Jun 7, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

Content-fingerprint dedup on the write path (cheap hash dedup before mem0 extraction)

3 participants