
Performance improvement: avoid unnecessary hashing by pre-indexing files using filename + size #27

@mervenator

Description

The current implementation hashes every orphan and candidate file via `digest, err := getDigest(path)`. This becomes very expensive when processing large archives with many large files. However, many datasets contain files that are already uniquely identifiable by filename + size alone.

Example: DSC_1023.JPG (5.2 MB). In most real-world archives, this name/size combination uniquely identifies the file.
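
For context, the kind of helper the fast path avoids calling looks roughly like the sketch below. This is a hedged illustration, not the repository's actual getDigest: the SHA-256 choice and the exact signature are assumptions; the point is that every call has to stream the entire file from disk.

    import (
        "crypto/sha256"
        "encoding/hex"
        "io"
        "os"
    )

    // getDigest streams a file through SHA-256 so memory use stays constant,
    // but it still reads every byte, which is what dominates scan time.
    func getDigest(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()

        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return hex.EncodeToString(h.Sum(nil)), nil
    }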

Proposed Optimization

Add a fast pre-filter index, keyed on filename + size, before hashing (see the snippets below). When a key resolves to exactly one candidate, the match is accepted without hashing; if zero or multiple matches exist, fall back to the current digest-based approach.

Benefits

Typical archive restructuring scenario:

  • 100k files
  • 90% unique by filename + size

Result:

  • roughly 90% fewer digest computations (about 90k of the 100k files skip hashing entirely)
  • faster scanning
  • improved scalability for large media archives

Safety

The optimization only skips hashing when exactly one filename + size candidate exists. If the match is ambiguous (zero or multiple candidates), hashing still runs as before, so the results are identical to the current behavior.
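
To make the ambiguous case concrete, here is a hypothetical collision that forces the fallback; the paths and sizes are invented for illustration:

    // Two different photos can share both name and size; only their
    // content digests can tell them apart.
    //
    //   dest/2019/IMG_0001.JPG  (4,194,304 bytes)
    //   dest/2020/IMG_0001.JPG  (4,194,304 bytes)
    //
    // fastIndex[FastKey{Name: "IMG_0001.JPG", Size: 4194304}] holds both
    // paths, so len(candidates) == 2 and the digest comparison runs as before.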

Solution Code Snippet

Fast index structure:

    // FastKey identifies a file by base name and size; both come from
    // directory metadata, so building a key never touches file contents.
    type FastKey struct {
        Name string
        Size int64
    }

    // Build the pre-filter index over all destination files in a single pass.
    fastIndex := map[FastKey][]string{}
    for path, meta := range destinationFiles {
        key := FastKey{
            Name: filepath.Base(path),
            Size: meta.Size,
        }
        fastIndex[key] = append(fastIndex[key], path)
    }
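
Building the index is one O(n) pass over destinationFiles and each lookup is O(1) on average, so the pre-filter adds negligible overhead even in the worst case where every key is ambiguous and all files end up hashed anyway.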

Fast lookup before hashing:

    key := FastKey{
        Name: filepath.Base(orphanAtSource),
        Size: sourceFiles[orphanAtSource].Size,
    }
    candidates := fastIndex[key]
    if len(candidates) == 1 {
        // Unambiguous match: emit the move action without hashing either file.
        candidateAtDestination := candidates[0]
        actions = append(actions, action.MoveFileAction{
            BasePath:         destinationDirPath,
            RelativeFromPath: candidateAtDestination,
            RelativeToPath:   orphanAtSource,
        })
        continue
    }
    // Zero or multiple candidates: fall through to the existing digest logic.
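
For completeness, a hedged sketch of the fall-through path. It assumes the enclosing loop from the current code; findByDigest is a hypothetical stand-in for whatever digest-based candidate search the implementation already performs:

    // Fallback: the fast index was ambiguous or empty, so hash as before.
    digest, err := getDigest(orphanAtSource)
    if err != nil {
        return nil, err
    }
    // findByDigest is a placeholder for the existing digest-based lookup.
    if candidateAtDestination, ok := findByDigest(digest, destinationFiles); ok {
        actions = append(actions, action.MoveFileAction{
            BasePath:         destinationDirPath,
            RelativeFromPath: candidateAtDestination,
            RelativeToPath:   orphanAtSource,
        })
    }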
