This tool converts email files (.eml) to Markdown format (.md) while preserving the email thread structure, extracting attachments, and intelligently removing duplicate emails in threads.
- Converts
.emlfiles to Markdown format - Preserves email thread structure using multiple methods:
- Standard RFC822 embedded message detection
- Pattern matching to detect quoted emails in the body text
- SimHash-based deduplication to remove redundant email content
- Extracts and saves attachments
- Organizes content in chronological order (oldest-to-newest or newest-to-oldest)
- Moves processed files to a separate directory
- Creates a clean directory structure for output files
project/
├── input/ # Place .eml files here
├── output/ # Converted files will be saved here (one directory per email)
│ └── email1/ # Example output directory
│ ├── email1.md
│ └── [attachments]
├── done/ # Processed .eml files are moved here
├── .gitignore # Configured to ignore email content
└── create_gitkeep.py # Utility script to set up directory structure
- Make sure you have Python 3.7+ installed
- Clone this repository or download the source code
- Install the required dependencies:
pip install -r requirements.txt- Set up the required directory structure:
python create_gitkeep.py- Place your
.emlfiles in theinputdirectory - Run the script:
# Default: oldest to newest email order, default deduplication
python eml2md.py
# Optional: newest to oldest email order
python eml2md.py --newest-first
# Adjust deduplication sensitivity (higher = more aggressive)
python eml2md.py --dedup-threshold 12- Find your converted files in the
outputdirectory, organized in subdirectories named after the original email files
| Option | Description |
|---|---|
--newest-first |
Sort emails from newest to oldest in the markdown file (default is oldest to newest) |
--dedup-threshold VALUE |
Set the similarity threshold for deduplication (default is 8, higher values mean more aggressive deduplication) |
The tool uses three complementary methods to handle email threads:
-
Embedded Messages: Detects properly formatted
message/rfc822parts in multipart emails. -
Pattern Matching: Analyzes the email body text to find patterns indicating quoted emails such as:
- Outlook format: "From: ... Sent: ... To: ... Subject: ..."
- Reply format: "On [date], [person] wrote:"
- Gmail format: "On [date] at [time], [person] wrote:"
-
SimHash Deduplication: Uses content-based hashing to identify and remove duplicate emails, even when formatting differs:
- Creates a fingerprint for each email that preserves similarity
- Compares emails using Hamming distance between fingerprints
- Removes duplicates while prioritizing newer emails
The SimHash algorithm:
- Extracts features from the email (sender, subject, key content lines)
- Creates a 64-bit fingerprint that preserves content similarity
- Compares fingerprints using Hamming distance (bit differences)
- Groups similar emails and keeps only the most representative one
The default threshold of 8 bits (out of 64) works well for most emails, but you can adjust it with the --dedup-threshold parameter. Higher values will be more aggressive in identifying duplicates.
The generated Markdown file includes:
- Email thread organized chronologically
- Metadata for each email (date, from, to, cc, subject)
- Email content
- Links to extracted attachments
# Email Thread
## Email 1
**Date**: 2023-01-15 14:32:45
**From**: sender@example.com
**To**: recipient@example.com
**Subject**: Example Subject
### Content
This is the content of the email.
### Attachments
- [document.pdf](document.pdf)
---
## Email 2
...- Python 3.7+
- email-validator
- python-dateutil
eml2md.py- The main script that handles the conversion process
create_gitkeep.py- Sets up the required directory structure and creates .gitkeep files to ensure the directories are tracked by Git while ignoring their contents. Run this script once when setting up the project.
.gitignore- Configured to track the directory structure but ignore email files and contentrequirements.txt- Lists the Python package dependencies
- The tool is designed for English-language emails
- Complex HTML formatting may be simplified in the conversion process
- Pattern matching depends on email client formatting and may not detect all thread styles
- Deduplication thresholds may need adjustment for your specific emails
- Very short emails might be incorrectly identified as duplicates (adjust threshold if needed)
This project is licensed under the MIT License - see the LICENSE file for details