GitHub - izag8216/md-dedupe: Find and handle duplicate markdown files in any directory tree

Markdown-based knowledge bases accumulate duplicates over time: exported notes, clipped articles, bookmark files, research notes. md-dedupe detects both exact and near-duplicate Markdown files. Everything runs locally, with zero API dependencies.

English | Japanese (README.ja.md)

Installation

pip install md-dedupe

Quick Start

# Scan for duplicates
md-dedupe scan /path/to/markdown/files

# Scan with URL and frontmatter detection
md-dedupe scan /path/to/files --check-urls --check-frontmatter

# Adjust similarity threshold
md-dedupe scan /path/to/files --threshold 0.7

# Generate a report
md-dedupe report /path/to/files --format markdown -o report.md

# Interactive merge (dry-run by default)
md-dedupe merge /path/to/files --interactive

# Execute merges
md-dedupe merge /path/to/files --interactive --apply

Features

Exact dedup -- SHA-256 body hashing (frontmatter excluded)
Near-duplicate detection -- n-gram Jaccard similarity with size pre-filter
URL-based dedup -- group files sharing common URLs
Frontmatter comparison -- match on title, date, source fields
Union-find clustering -- merge results from all detection methods
Multiple report formats -- terminal (Rich), JSON, Markdown
Interactive merge -- TUI for reviewing and resolving duplicates
Safe defaults -- dry-run mode, automatic backups before merge
Zero API dependencies -- all computation is local
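
The detection features above can be illustrated with a minimal sketch. It is not md-dedupe's actual source: the names strip_frontmatter, body_hash, ngrams, jaccard and UnionFind are hypothetical, frontmatter is assumed to be a leading block delimited by --- lines, and n-grams are assumed to be word-based (the tool may use character n-grams instead).

```python
import hashlib
import re

# Assumption: frontmatter is a leading block fenced by lines containing only "---".
FRONTMATTER_RE = re.compile(r"\A---[ \t]*\n(?:.*?\n)?---[ \t]*(?:\n|\Z)", re.DOTALL)


def strip_frontmatter(text: str) -> str:
    """Drop a leading frontmatter block, if present."""
    return FRONTMATTER_RE.sub("", text, count=1)


def body_hash(text: str) -> str:
    """SHA-256 of the body only, so metadata edits do not change the hash."""
    body = strip_frontmatter(text).strip()
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams (hypothetical choice; could be characters)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class UnionFind:
    """Merges pairwise matches from any detector into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def clusters(self):
        """Return only groups with more than one member, each sorted."""
        groups = {}
        for x in list(self.parent):
            groups.setdefault(self.find(x), []).append(x)
        return [sorted(g) for g in groups.values() if len(g) > 1]
```

In this sketch, two files whose body_hash values match would be joined with union(), as would any pair whose jaccard score meets the threshold. clusters() then returns the combined duplicate groups regardless of which method found each pair.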

Commands

md-dedupe scan <path>                          # Find all duplicates
md-dedupe scan <path> --threshold 0.7          # Lower similarity threshold
md-dedupe scan <path> --check-urls             # Include URL-based dedup
md-dedupe scan <path> --check-frontmatter      # Include frontmatter comparison
md-dedupe scan <path> --min-size 100           # Skip files under 100 bytes
md-dedupe scan <path> --format json            # JSON output
md-dedupe report <path> --format markdown      # Generate report file
md-dedupe merge <path> --interactive           # Interactive merge review
md-dedupe merge <path> --apply                 # Execute pending merges

Configuration

Create .md-dedupe.toml in your project root:

threshold = 0.8
ngram_size = 3
check_urls = true
check_frontmatter = true
min_size = 50
url_overlap = 0.8
frontmatter_fields = ["title", "date", "source"]
exclude = [".git", ".obsidian", "node_modules", "__pycache__", ".venv"]
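
For readers who want to reuse the same file from their own scripts, here is a rough sketch of reading it in Python. The functions parse_config and load_config are hypothetical and not part of md-dedupe. The DEFAULTS values are copied from the example above; the tool's real built-in defaults are not stated here, so treat them as assumptions.

```python
from pathlib import Path

try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:  # older interpreters: third-party backport
    import tomli as tomllib

# Assumption: these mirror the example file, not md-dedupe's documented defaults.
DEFAULTS = {
    "threshold": 0.8,
    "ngram_size": 3,
    "check_urls": True,
    "check_frontmatter": True,
    "min_size": 50,
    "url_overlap": 0.8,
    "frontmatter_fields": ["title", "date", "source"],
    "exclude": [".git", ".obsidian", "node_modules", "__pycache__", ".venv"],
}


def parse_config(text: str) -> dict:
    """Overlay the keys found in a TOML string on top of DEFAULTS."""
    config = dict(DEFAULTS)
    config.update(tomllib.loads(text))
    # Both values are similarity ratios, so they should lie in [0, 1].
    for key in ("threshold", "url_overlap"):
        if not 0.0 <= config[key] <= 1.0:
            raise ValueError(f"{key} must be between 0 and 1")
    return config


def load_config(root: Path = Path(".")) -> dict:
    """Read .md-dedupe.toml from root, falling back to DEFAULTS if absent."""
    path = Path(root) / ".md-dedupe.toml"
    if not path.is_file():
        return dict(DEFAULTS)
    return parse_config(path.read_text(encoding="utf-8"))
```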

Performance

Strategy	Complexity	Description
Exact dedup	O(n)	Single SHA-256 pass
Size pre-filter	O(n log n)	Only compare pairs within 20% size range
URL dedup	O(n)	Build URL-to-file index, then intersect
Near-dedup	O(k)	k = candidate pairs after size filter

Expected: 3000+ files processed in under 30 seconds.
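
The two index-based strategies in the table can be sketched as follows. This is an illustration under stated assumptions, not the project's code: candidate_pairs and url_pairs are hypothetical names, "within 20%" is read as the larger file being at most 1.2 times the smaller, and URL overlap is read as the Jaccard ratio of the two files' URL sets.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simplified URL pattern (assumption): stops at whitespace and closing brackets.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def candidate_pairs(sizes: dict, tolerance: float = 0.2) -> list:
    """Return pairs of names whose sizes are within `tolerance` of each other.

    Sorting costs O(n log n); each file is then compared only with its
    neighbours in the sorted order until the size limit is exceeded.
    """
    ordered = sorted(sizes.items(), key=lambda item: item[1])
    pairs = []
    for i, (name_a, size_a) in enumerate(ordered):
        limit = size_a * (1 + tolerance)
        j = i + 1
        while j < len(ordered) and ordered[j][1] <= limit:
            pairs.append((name_a, ordered[j][0]))
            j += 1
    return pairs


def url_pairs(docs: dict, min_overlap: float = 0.8) -> set:
    """Return pairs of files whose URL sets overlap by at least `min_overlap`."""
    urls = {name: set(URL_RE.findall(text)) for name, text in docs.items()}

    # Inverted index: URL -> files that mention it.
    index = defaultdict(set)
    for name, found in urls.items():
        for url in found:
            index[url].add(name)

    pairs = set()
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            overlap = len(urls[a] & urls[b]) / len(urls[a] | urls[b])
            if overlap >= min_overlap:
                pairs.add((a, b))
    return pairs
```

Only the pairs these filters return would need the more expensive n-gram comparison, which is where the k in the near-dedup row comes from.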

License

MIT

