ocr_md_book.py turns a folder of scanned page images into clean Markdown, merges every page into a single book.md, and finally packages the result as an EPUB (plus optional AZW3/MOBI if Calibre is available). The OCR step relies on Alibaba DashScope (Tongyi) multimodal models and includes light post-processing to remove headers, footers, page numbers, and unwanted hard wraps. The tool is resumable and designed for macOS but remains cross-platform friendly.
- Python 3.10 or newer.
- Python dependencies (install inside a virtual environment if possible):
python3 -m pip install httpx pillow tqdm pyyaml
- External tools:
  - `pandoc` (required), e.g. `brew install pandoc` on macOS.
  - `ebook-convert` from Calibre (optional) if you want AZW3/MOBI output.
- DashScope API Key - export before running:
export DASHSCOPE_API_KEY="sk-your-key"
If your source is a PDF, convert it to page images first. Install Poppler (provides pdftoppm), e.g. brew install poppler, then run:
pdftoppm -png -r 300 "input.pdf" "output-prefix"

This will create files such as output-prefix-01.png, output-prefix-02.png, ... that you can place in the images directory for the OCR step.
- Place all page images (PNG/JPG) in one directory; natural sorting is handled automatically, but numeric suffixes are recommended.
- Provide a cover image if you want EPUB metadata to include it; the path must exist or pandoc will fail.
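
The script's own sorting code is not shown here, so the following is only a minimal sketch of what "natural sorting" usually means: digit runs are compared as integers, so `page-2.png` comes before `page-10.png`. The helper name `natural_key` is hypothetical.

```python
import re

def natural_key(name: str) -> list:
    """Split a file name into text and integer chunks so that 2 sorts before 10."""
    # re.split with a capturing group keeps the digit runs in the result,
    # so the list always alternates: text, number, text, number, ...
    parts = re.split(r"(\d+)", name.lower())
    return [int(p) if p.isdigit() else p for p in parts]

files = ["page-10.png", "page-2.png", "page-1.png"]
ordered = sorted(files, key=natural_key)
print(ordered)
```

Zero-padded numeric suffixes (as produced by `pdftoppm`) sort correctly even with plain string ordering, which is one reason they are recommended.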
From the project root, run:
python3 ocr_md_book.py \
--images-dir ./book_images \
--title "The Wealth Handbook" \
--author "Unknown" \
--lang zh-CN \
--max-width 1800 \
--concurrency 4 \
--model qwen3-omni-flash \
--cover ./book_images/output-001.png \
--out-name book \
--skip-ocr-existing \
--to-azw3 \
--to-mobi

Results land in book_images/_out/:
- `pages/page-0001.md`, ...: individual Markdown files
- `book.md`: merged document
- `book.epub`: main deliverable (and `book.azw3` / `book.mobi` when Calibre is detected and the flags are set)
- `--images-dir` (required): folder containing images.
- `--title`, `--author`, `--lang`: EPUB metadata.
- `--max-width`: downscale width before upload (never upscale).
- `--concurrency`: async OCR concurrency; start between 1 and 4.
- `--model`: DashScope model name (e.g. `qwen3-omni-flash`).
- `--cover`: cover image path for EPUB metadata (must exist).
- `--out-name`: output file prefix (default `book`).
- `--skip-ocr-existing`: skip pages with existing Markdown (resume support).
- `--from-list`: newline-separated file list to control ordering.
- `--pages`: subset of pages, like `1-50,120,121-130`.
- `--dry-run`: list pages to process without running OCR.
- `--to-azw3`, `--to-mobi`: build Kindle formats if `ebook-convert` is available.
- `--verbose`: show detailed logs (default output is concise).
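
To illustrate the `--pages` syntax, here is a rough sketch of how a spec such as `1-50,120,121-130` could be expanded into page indices. The function name `parse_pages` and the choice to reject descending ranges are assumptions, not necessarily what the script does.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like '1-50,120,121-130' into a sorted list of page numbers."""
    pages: set[int] = set()
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate stray commas
        if "-" in chunk:
            start_s, end_s = chunk.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"descending range: {chunk!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(chunk))
    return sorted(pages)

print(parse_pages("1-3,7,9-10"))
```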
- Gather images (or read from `--from-list`) and sort naturally.
- Auto-rotate with EXIF data, optionally downscale, and forward to DashScope using several payload variants for compatibility.
- Clean the resulting Markdown and write to `_out/pages/page-XXXX.md`.
- Merge pages into `_out/book.md`.
- Build the EPUB via pandoc, and optionally call Calibre to produce AZW3/MOBI.
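
The cleaning step can be pictured with a small sketch like the one below. It is not the script's actual implementation: the regular expressions, the `clean_page` name, and the `joiner` parameter are all hypothetical, and the heuristics only cover bare page-number lines and hard-wrapped paragraphs.

```python
import re

# A line holding only a short number, optionally wrapped in dashes ("- 12 -").
PAGE_NUMBER = re.compile(r"^\s*(?:[-–]\s*)?\d{1,4}(?:\s*[-–])?\s*$")
# Lines that start a Markdown structure and must not be merged into a paragraph.
STRUCTURAL = re.compile(r"^\s*(?:#{1,6}\s|[-*+]\s|\d+[.)]\s|>|\|)")

def clean_page(text: str, joiner: str = " ") -> str:
    """Drop page-number lines and join hard-wrapped paragraph lines."""
    out: list[str] = []
    buffer: list[str] = []  # plain lines of the paragraph being joined
    in_fence = False

    def flush() -> None:
        if buffer:
            out.append(joiner.join(buffer))
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            flush()
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            out.append(line)  # never touch code
        elif PAGE_NUMBER.match(line):
            continue  # checked before STRUCTURAL so "- 12 -" is not kept as a list item
        elif not stripped or STRUCTURAL.match(line):
            flush()
            out.append(line)
        else:
            buffer.append(stripped)
    flush()
    return "\n".join(out)

sample = "# Title\n\nThis is a line\nthat was wrapped.\n\n- 12 -\n\nNext paragraph.\n42"
cleaned = clean_page(sample)
print(cleaned)
```

For Chinese text (as in the `--lang zh-CN` example), passing `joiner=""` avoids inserting spaces between characters. A real cleaner would also need to recognize repeated running headers and footers, which this sketch does not attempt.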
- Combine `--skip-ocr-existing` with the default output structure to resume after interruptions.
- Failed pages are logged by index; re-run the command (optionally with `--pages`) to fill the gaps.
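
As a rough sketch of the resume logic, the pages still needing OCR can be found by checking which `page-XXXX.md` files already exist. The `pending_pages` helper is hypothetical, and treating an empty file as "not done" is an assumption about how interrupted writes should be handled.

```python
from pathlib import Path

def pending_pages(indices: list[int], pages_dir: Path) -> list[int]:
    """Return the page indices that have no non-empty Markdown file yet."""
    todo = []
    for i in indices:
        md = pages_dir / f"page-{i:04d}.md"
        # Assumption: a zero-byte file is a leftover from a failed run.
        if not (md.exists() and md.stat().st_size > 0):
            todo.append(i)
    return todo
```

The returned indices could then be turned into a `--pages` value for the next run.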
- HTTP 400 "url error": ensure the chosen model supports base64 payloads. If it requires public URLs, upload images to accessible HTTPS locations and reference them via `--from-list`.
- Cover file missing: confirm the path passed to `--cover` exists or omit the flag.
- Calibre not found: the script logs a warning and skips AZW3/MOBI when `ebook-convert` is absent.
No specific license is provided. Use internally or personally as needed, and comply with the licenses of DashScope, Calibre, pandoc, and other dependencies.