Use a hash-based pick #41
Open
Labels
enhancement: New feature or request
Description
- See TidyObsidian/find-duplicate-blocks.py
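Use a hash-based pick: sort each block's tokens by a stable hash and take the first N. The hash should be deterministic, e.g. sha1(token). Python's built-in hash(token) is salted per process for strings unless PYTHONHASHSEED is set, so it is consistent within a single run but not reproducible across runs. Keeping the N smallest hash values is a bottom-k variant of MinHash, and it tends to distribute blocks more evenly across the token space.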
This change shrinks candidate_indices per block, so the inner Jaccard loop runs far fewer times.
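
A minimal sketch of the idea, not the actual code in find-duplicate-blocks.py. The names stable_hash, pick_tokens, and build_candidates are hypothetical, and the sketch assumes each block is already a list of string tokens and that candidates are found through an inverted index from picked tokens to block indices:

```python
import hashlib
import heapq
from collections import defaultdict


def stable_hash(token: str) -> int:
    """Deterministic 64-bit hash of a token (hypothetical helper).

    Uses SHA-1 so the value is the same in every process, unlike the
    built-in hash() for strings.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_tokens(tokens, n):
    """Return up to n distinct tokens with the smallest stable hashes."""
    return heapq.nsmallest(n, set(tokens), key=stable_hash)


def build_candidates(blocks, n):
    """For each block, collect indices of other blocks sharing a picked token.

    Assumption: this stands in for however the script builds
    candidate_indices; only the picked tokens are indexed.
    """
    index = defaultdict(set)  # picked token -> indices of blocks that picked it
    picks = []
    for i, tokens in enumerate(blocks):
        picked = pick_tokens(tokens, n)
        picks.append(picked)
        for token in picked:
            index[token].add(i)

    candidates = []
    for i, picked in enumerate(picks):
        found = set()
        for token in picked:
            found |= index[token]
        found.discard(i)  # a block is not its own candidate
        candidates.append(found)
    return candidates
```

Because the pick depends only on the set of tokens, two blocks with the same tokens in a different order select the same N tokens and therefore end up as candidates for each other, while blocks with disjoint tokens never meet in the Jaccard loop.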