A robust command-line tool to crawl, scrape, and convert modern documentation websites (React, Vue, Docusaurus, Sphinx, Redocly) into clean, offline PDFs.
Unlike standard HTML-to-PDF converters, this tool uses a headless browser (Playwright) to render client-side JavaScript, ensuring that Single Page Applications (SPAs) and dynamic sidebars are captured correctly.
- Dynamic Spider: Auto-detects sidebars, TOCs, and navigation menus even if rendered via JS. Includes fallback strategies for difficult sites.
- Smart Sanitizer: Strips "web" artifacts (Copy buttons, navbars, footers, breadcrumbs) to create a book-like reading experience.
- Professional PDF Output: Generates a custom Cover Page, Table of Contents with real page numbers, and rewires web links to point to internal PDF chapters.
- Interactive Mode: If the spider fails to find links automatically, a visible browser launches to allow manual selector identification.
- Universal Support: Pre-configured for Flask, React, Playwright, ReadTheDocs, and Redocly, with generic support for others.
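The link-discovery step of the Dynamic Spider can be illustrated with a small standard-library sketch. It is not the tool's actual implementation: the class `SidebarLinkParser`, the function `extract_sidebar_links`, and the assumption that the sidebar is a `<nav>` element are all hypothetical, and the HTML is assumed to have already been rendered (for example by the headless browser).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class SidebarLinkParser(HTMLParser):
    """Collect href values from <a> tags nested inside a <nav> element."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0  # greater than 0 while inside at least one <nav>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def extract_sidebar_links(html, base_url):
    """Return absolute, de-duplicated page URLs in sidebar order."""
    parser = SidebarLinkParser()
    parser.feed(html)
    seen, ordered = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so each page appears once.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


# Hypothetical rendered HTML: the footer link must not be picked up.
sample = """
<nav>
  <a href="intro">Intro</a>
  <a href="/docs/install#linux">Install</a>
  <a href="intro#top">Intro again</a>
</nav>
<footer><a href="/privacy">Privacy</a></footer>
"""
links = extract_sidebar_links(sample, "https://example.com/docs/")
print(links)
```

Dropping fragments before de-duplicating keeps in-page anchors from producing the same chapter twice, which matters when each URL becomes one chapter of the PDF.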
- Python 3.8+
- Windows Users: GTK3 Runtime (Required for WeasyPrint PDF generation).
- Clone the repository:

      git clone https://github.com/YOUR_USERNAME/universal-doc-downloader.git
      cd universal-doc-downloader

- Create a virtual environment (recommended):

      python -m venv venv
      # Windows:
      .\venv\Scripts\activate
      # Mac/Linux:
      source venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- Install browser binaries. This tool requires Chromium to render pages.

      playwright install chromium

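After installation, a quick sanity check can confirm that the Python version and the two main packages are visible from the active virtual environment. This is only a sketch: the helper name `check_environment` is hypothetical, and it does not verify that the Chromium binary or the GTK3 runtime is present, only that the `playwright` and `weasyprint` packages can be found.

```python
import importlib.util
import sys


def check_environment(modules=("playwright", "weasyprint")):
    """Report whether the interpreter and required packages look usable."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in modules:
        # find_spec locates a top-level package without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'MISSING'}")
```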
Pass the URL of the documentation homepage. The tool attempts to auto-detect the sidebar selector.

    python doc_dl.py https://flask.palletsprojects.com/en/stable/

Set a custom cover title and output filename:

    python doc_dl.py https://playwright.dev/python/docs/intro \
      --title "Playwright Python Manual" \
      --output playwright.pdf

Download only the first 5 pages to test the layout before scraping the whole site.

    python doc_dl.py https://docs.bria.ai/ --limit 5

If auto-detection fails, inspect the website and provide the CSS selector for the sidebar navigation.

    python doc_dl.py https://example.com/docs --selector "div.my-custom-menu"

Run with the browser visible to watch the scraping process.

    python doc_dl.py https://example.com/docs --visible --verbose

| Flag | Short | Description | Default |
|---|---|---|---|
| url | | The target URL (required) | N/A |
| --output | -o | Filename for the generated PDF | manual.pdf |
| --title | -t | Title displayed on the cover page | "Documentation" |
| --selector | -s | CSS selector for the sidebar | Auto-detect |
| --limit | -l | Stop after N pages (0 = download all) | 0 |
| --visible | | Run browser in headful mode (visible) | False |
| --verbose | -v | Enable detailed debug logging | False |

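The flag table maps naturally onto an `argparse` definition. The sketch below mirrors the documented flags and defaults, but it is an assumption about how such a parser could be written, not a copy of `doc_dl.py`. The helper `apply_limit` is likewise hypothetical and only illustrates the "0 = download all" rule.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="doc_dl.py",
        description="Convert a documentation website into an offline PDF.",
    )
    parser.add_argument("url", help="The target URL (required)")
    parser.add_argument("-o", "--output", default="manual.pdf",
                        help="Filename for the generated PDF")
    parser.add_argument("-t", "--title", default="Documentation",
                        help="Title displayed on the cover page")
    # None stands in for "auto-detect" in this sketch.
    parser.add_argument("-s", "--selector", default=None,
                        help="CSS selector for the sidebar")
    parser.add_argument("-l", "--limit", type=int, default=0,
                        help="Stop after N pages (0 = download all)")
    parser.add_argument("--visible", action="store_true",
                        help="Run browser in headful mode (visible)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable detailed debug logging")
    return parser


def apply_limit(urls, limit):
    """Keep the first `limit` URLs; a limit of 0 (or less) keeps them all."""
    return urls if limit <= 0 else urls[:limit]


# Equivalent to: python doc_dl.py https://docs.bria.ai/ --limit 5
args = build_parser().parse_args(["https://docs.bria.ai/", "--limit", "5"])
pages = apply_limit([f"page-{i}" for i in range(10)], args.limit)
print(args.output, args.title, len(pages))
```

Using `action="store_true"` for `--visible` and `--verbose` matches their documented default of False: the flags take no value and switch on only when present.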
Distributed under the MIT License. See LICENSE for more information.