The current implementation hashes every orphan and candidate file via digest, err := getDigest(path).
This becomes very expensive when processing large archives that contain many large files.
However, many datasets contain files that are already uniquely identifiable by the combination of filename and size.
Example: DSC_1023.JPG (5.2 MB)
In most real-world archives, this combination uniquely identifies the file.
Proposed Optimization
Add a fast pre-filter index before hashing.
If zero or multiple matches exist, fall back to the current digest-based approach.
Benefits
Typical archive restructuring scenario:
- 100k files
- ~90% of files have a unique filename+size combination
Result:
- roughly 90% fewer files hashed in this scenario
- faster scanning
- improved scalability for large media archives
Safety
The optimization only skips hashing when:
- exactly one filename+size candidate exists
If the lookup is ambiguous (zero or multiple candidates), hashing still runs exactly as before.
Solution Code Snippet
Fast index structure
type FastKey struct {
	Name string
	Size int64
}

fastIndex := map[FastKey][]string{}
for path, meta := range destinationFiles {
	key := FastKey{
		Name: filepath.Base(path),
		Size: meta.Size,
	}
	fastIndex[key] = append(fastIndex[key], path)
}
Fast lookup before hashing
key := FastKey{
	Name: filepath.Base(orphanAtSource),
	Size: sourceFiles[orphanAtSource].Size,
}
candidates := fastIndex[key]
if len(candidates) == 1 {
	candidateAtDestination := candidates[0]
	actions = append(actions, action.MoveFileAction{
		BasePath:         destinationDirPath,
		RelativeFromPath: candidateAtDestination,
		RelativeToPath:   orphanAtSource,
	})
	continue
}
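The lookup above shows only the unambiguous fast path and relies on its surrounding loop context (actions, continue). As a self-contained sketch, the decision can be factored into a helper; matchOrphan is a hypothetical name, and the false return marks the point where the existing getDigest-based comparison would run instead:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// FastKey mirrors the pre-filter key from the snippet above.
type FastKey struct {
	Name string
	Size int64
}

// matchOrphan returns the single destination candidate when the
// filename+size lookup is unambiguous. ok == false signals that the
// caller must fall back to the digest-based comparison.
func matchOrphan(orphanAtSource string, size int64, fastIndex map[FastKey][]string) (candidate string, ok bool) {
	key := FastKey{Name: filepath.Base(orphanAtSource), Size: size}
	candidates := fastIndex[key]
	if len(candidates) == 1 {
		return candidates[0], true
	}
	return "", false // zero or multiple candidates: hash as before
}

func main() {
	fastIndex := map[FastKey][]string{
		{Name: "DSC_1023.JPG", Size: 5452595}: {"2023/DSC_1023.JPG"},
		{Name: "IMG_0001.JPG", Size: 1048576}: {"a/IMG_0001.JPG", "b/IMG_0001.JPG"},
	}

	if candidate, ok := matchOrphan("camera/DSC_1023.JPG", 5452595, fastIndex); ok {
		fmt.Println("fast path:", candidate)
	}
	if _, ok := matchOrphan("camera/IMG_0001.JPG", 1048576, fastIndex); !ok {
		fmt.Println("ambiguous: fall back to digest-based matching")
	}
}
```

Keeping the ambiguity check in one helper makes it easy to verify the safety property: hashing is skipped only for buckets of exactly one candidate.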