Make the inner Jaccard loop cheaper #43
Open
Labels: enhancement
Description
- See TidyObsidian/find-duplicate-blocks.py
For each candidate, you recompute `jaccard(tokens, cb["tokens"])` by doing `&` and `|` on Python sets, which allocates both an intersection set and a union set on every comparison.
Change:
- Pre-store `len(tokens)` alongside `tokens` inside `canonical_blocks`, so you compute union size as `len_a + len_b - inter` and avoid building a full union set.
- If you keep tokens in a sorted list instead of a set, you can do an intersection with a two-pointer walk, which is often faster and more cache-friendly at this scale.
Both reduce per-comparison overhead without changing behavior.
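A rough sketch of both changes, for illustration only. The function names (`jaccard_with_len`, `sorted_intersection_size`, `jaccard_sorted`) and the sample tokens are hypothetical and not taken from `find-duplicate-blocks.py`. The empty-input return value is an assumption and should be matched to whatever the existing `jaccard` does. The sorted-list variant assumes each list is sorted and contains no duplicates.

```python
def jaccard_with_len(a: set, len_a: int, b: set, len_b: int) -> float:
    """Jaccard similarity using pre-stored sizes; no union set is built."""
    if len_a == 0 and len_b == 0:
        return 1.0  # assumption: align this with the existing jaccard()
    inter = len(a & b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return inter / (len_a + len_b - inter)


def sorted_intersection_size(a: list, b: list) -> int:
    """Count common elements of two sorted, duplicate-free lists."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def jaccard_sorted(a: list, b: list) -> float:
    """Jaccard similarity for sorted, duplicate-free token lists."""
    if not a and not b:
        return 1.0  # same assumption as above
    inter = sorted_intersection_size(a, b)
    return inter / (len(a) + len(b) - inter)


# Hypothetical sample data.
tokens_a = {"the", "quick", "brown", "fox"}
tokens_b = {"the", "lazy", "brown", "dog"}

baseline = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
with_len = jaccard_with_len(tokens_a, len(tokens_a), tokens_b, len(tokens_b))
with_sorted = jaccard_sorted(sorted(tokens_a), sorted(tokens_b))
```

All three values should agree for the same inputs. Since `set & set` runs in C in CPython while the two-pointer walk is a Python-level loop, it is worth timing the second variant on real block sizes before adopting it.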