Extract images from EPUB, MOBI, AZW, and AZW3 with reading-order support, filtering, manifest output, and comic archive export.
- Reading-order-aware extraction for EPUB and MOBI
- Image role classification (`cover`, `page`, `thumbnail`, `decoration`)
- Filtering by size, width, height, and aspect ratio
- Incremental deduplication with persistent hash cache
- Optional `manifest.json` and `debug-order` output
- Export formats: `CBZ`, `CBR`, and `PDF` (PDF requires Pillow)
- Parallel extraction by file
- Subcommand-based CLI: `scan`, `extract`, `inspect`, `verify`
- Optional JSON logs
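
The incremental deduplication feature can be pictured roughly as follows. This is a minimal sketch, not the project's implementation: `load_hash_cache`, `save_hash_cache`, and `is_new_image` are hypothetical names, and the cache layout (a JSON list of SHA-256 hex digests) is an assumption.

```python
import hashlib
import json
from pathlib import Path


def load_hash_cache(path):
    """Return the set of digests stored at `path` (empty if the file is missing)."""
    cache_file = Path(path)
    if not cache_file.exists():
        return set()
    return set(json.loads(cache_file.read_text(encoding="utf-8")))


def save_hash_cache(path, seen):
    """Persist the digests as a sorted JSON list (assumed layout)."""
    cache_file = Path(path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(sorted(seen)), encoding="utf-8")


def is_new_image(data, seen):
    """Record the SHA-256 of `data`; return False if it was already seen."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Because the digests are saved between runs, an image that was already extracted in an earlier run would be skipped on the next one.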

Install:

    pip install -r requirements.txt
    pip install -e .

CLI usage:

    ebook-extract scan /path/to/books -r --format auto

    ebook-extract extract /path/to/books \
      --format auto \
      --recursive \
      --manifest \
      --debug-order \
      --archive-format cbz \
      --hash-cache .cache/image_hashes.json \
      --parallelism 4

    ebook-extract inspect /path/to/book.mobi --debug-order --verbose

    ebook-extract verify /path/to/books -r --format auto

Options:

- `--min-size <bytes>`
- `--min-width <px>`
- `--min-height <px>`
- `--max-aspect-ratio <float>`
- `--no-dedup`
- `--add-ignore-hash <sha256>` (repeatable)
- `--all-images` (EPUB only)
- `--manifest`
- `--debug-order`
- `--archive-format cbz|cbr|pdf`
- `--hash-cache <path>`
- `--parallelism <n>`
- `--json-logs`
- `--dry-run`
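
The size and shape options above suggest a simple per-image predicate. The sketch below is illustrative only: `ImageInfo` and `passes_filters` are hypothetical names, and defining the aspect ratio as long side divided by short side is an assumption, since the tool's exact definition is not stated here.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    """Hypothetical record describing one candidate image."""
    size_bytes: int
    width: int
    height: int


def passes_filters(img, min_size=0, min_width=0, min_height=0, max_aspect_ratio=None):
    """Return True if the image satisfies every configured threshold."""
    if img.size_bytes < min_size:
        return False
    if img.width < min_width or img.height < min_height:
        return False
    if max_aspect_ratio is not None:
        short_side = min(img.width, img.height)
        if short_side == 0:
            return False
        # Assumption: ratio is long side / short side, so orientation is ignored.
        if max(img.width, img.height) / short_side > max_aspect_ratio:
            return False
    return True
```

Under these assumptions, a 4000x400 banner (ratio 10) would be rejected by `--max-aspect-ratio 3.5`, while an 800x1200 page (ratio 1.5) would pass.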

Python API:

    from src import EPUBImageExtractor, MobiImageExtractor

    # EPUB: apply size/shape filters, write a manifest, and export CBZ archives.
    epub = EPUBImageExtractor(
        min_image_size=2048,
        min_width=300,
        min_height=300,
        max_aspect_ratio=3.5,
        write_manifest=True,
        write_debug_order=True,
        archive_format="cbz",
        hash_cache_path=".cache/hashes.json",
        parallelism=2,
    )
    epub.extract_from_directory("books", recursive=True)

    # MOBI: write a manifest and export CBR archives.
    mobi = MobiImageExtractor(write_manifest=True, archive_format="cbr")
    mobi.extract_from_directory("books", recursive=True)

For each book, the extractor creates a folder with files named like:

    0000_cover.jpg
    0001_page.jpg
    0002_page.jpg

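
The naming pattern shown above (a zero-padded reading-order index, then the image role) could be produced by a helper like this one. `output_name` is a hypothetical name used for illustration; the four-digit padding is inferred from the sample file names.

```python
def output_name(index, role, extension):
    """Build a file name such as 0001_page.jpg from reading-order position and role."""
    # Accept the extension with or without a leading dot.
    return f"{index:04d}_{role}.{extension.lstrip('.')}"
```

Zero-padding keeps lexicographic order equal to reading order, which matters for comic readers that sort pages by file name.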
When enabled:
- `manifest.json` is created in the output folder
- `debug_order` payload is embedded in the manifest
- archive is generated next to the folder (`.cbz`, `.cbr`, or `.pdf`)
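
A CBZ file is an ordinary ZIP archive of page images, so the "archive next to the folder" step can be sketched with the standard library alone. `pack_cbz` is a hypothetical helper, not the project's API, and this version packs every regular file in the folder; a real tool might leave `manifest.json` out.

```python
import zipfile
from pathlib import Path


def pack_cbz(image_dir):
    """Write <folder>.cbz beside `image_dir` and return the archive path."""
    folder = Path(image_dir).resolve()
    archive_path = folder.parent / (folder.name + ".cbz")
    # Images are already compressed, so store them without recompression.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        # Sort by name so the archive order follows the numbered file names.
        for item in sorted(p for p in folder.iterdir() if p.is_file()):
            archive.write(item, arcname=item.name)
    return archive_path
```

Calling `pack_cbz("out/MyBook")` would produce `out/MyBook.cbz` under these assumptions.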

Notes:

- DRM-protected files are not supported.
- PDF export requires Pillow: `pip install Pillow`

License: MIT