Parallelize block comparison itself #42
Open
Labels
enhancement (New feature or request)
Description
- See TidyObsidian/find-duplicate-blocks.py
This script takes 30–40 minutes to process 22K files. Currently, only file reading and block extraction run in parallel; step 5 (the block comparison) still runs in a single process.
A simple structural improvement:
- Split `all_blocks` into chunks and run the “for each block, find candidate indices, check Jaccard, append or merge” logic in worker processes.
- Have each worker build its own local `canonical_blocks`/`token_index`, then merge the results at the end (e.g., by re-running a cheaper global dedup on the worker outputs).
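A rough sketch of the two bullets above, not the actual code in `find-duplicate-blocks.py`. It assumes each block is a `(frozenset_of_tokens, [source_ids])` pair and that step 5 is a greedy "first match wins" pass; `dedup`, `parallel_dedup`, `threshold`, and `map_fn` are hypothetical names made up for illustration. Greedy dedup is order-dependent, so the chunked result may not be identical to a single-pass run on every input.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(blocks, threshold=0.8):
    """Greedy pass: each block merges into the first similar canonical block.

    Assumed input shape: list of (frozenset_of_tokens, [source_ids]).
    Returns canonical blocks in the same shape, with merged source lists.
    """
    canonical_blocks = []           # list of (tokens, sources)
    token_index = defaultdict(set)  # token -> indices into canonical_blocks
    for tokens, sources in blocks:
        # Candidates are canonical blocks sharing at least one token.
        candidates = set()
        for tok in tokens:
            candidates.update(token_index.get(tok, ()))
        for i in sorted(candidates):
            if jaccard(tokens, canonical_blocks[i][0]) >= threshold:
                canonical_blocks[i][1].extend(sources)  # merge
                break
        else:
            # No similar canonical block found: append a new one.
            idx = len(canonical_blocks)
            canonical_blocks.append((tokens, list(sources)))
            for tok in tokens:
                token_index[tok].add(idx)
    return canonical_blocks


def parallel_dedup(all_blocks, workers=4, threshold=0.8, map_fn=None):
    """Dedup each chunk in a worker, then dedup the survivors globally."""
    chunk_size = max(1, -(-len(all_blocks) // workers))  # ceiling division
    chunks = [all_blocks[i:i + chunk_size]
              for i in range(0, len(all_blocks), chunk_size)]
    thresholds = [threshold] * len(chunks)
    if map_fn is not None:
        # Injected mapper (e.g. builtin map) for testing or debugging.
        partials = list(map_fn(dedup, chunks, thresholds))
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(dedup, chunks, thresholds))
    # The global pass only sees per-chunk canonical blocks, so it is
    # cheaper than the original pass whenever chunks contain duplicates.
    survivors = [block for part in partials for block in part]
    return dedup(survivors, threshold)
```

Calling `parallel_dedup(all_blocks, workers=os.cpu_count())` would use every core. On platforms where `multiprocessing` starts workers with spawn, the call needs to sit under an `if __name__ == "__main__":` guard. The final global pass stays single-process, so the speedup depends on how much the per-chunk passes shrink the input.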
This is more work to implement, but if the 30–40 minutes is mostly CPU time, using all cores in step 5 could cut it substantially.
Status
Todo