varunbhandarii/universal-doc-downloader

Universal Doc Downloader

License: MIT | Python 3.8+

A command-line tool that crawls and scrapes modern documentation websites (including sites built with React, Vue, Docusaurus, Sphinx, or Redocly) and converts them into clean, offline PDFs.

Unlike standard HTML-to-PDF converters, this tool uses a headless browser (Playwright) to render client-side JavaScript, ensuring that Single Page Applications (SPAs) and dynamic sidebars are captured correctly.
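
A minimal sketch of the rendering step described above, using Playwright's synchronous Python API. The function name `render_html`, the timeout value, and the `networkidle` wait strategy are assumptions for illustration; they are not taken from `doc_dl.py`.

```python
def render_html(url: str, headless: bool = True, timeout_ms: int = 30_000) -> str:
    """Return the DOM of `url` after client-side JavaScript has run."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle, which gives
            # SPAs a chance to render their sidebar (assumed strategy).
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Calling `render_html("https://flask.palletsprojects.com/en/stable/")` should return the fully rendered HTML as a string, provided Playwright and its Chromium binary are installed.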

Features

  • Dynamic Spider: Auto-detects sidebars, TOCs, and navigation menus even if rendered via JS. Includes fallback strategies for difficult sites.
  • Smart Sanitizer: Strips "web" artifacts (Copy buttons, navbars, footers, breadcrumbs) to create a book-like reading experience.
  • Professional PDF Output: Generates a custom Cover Page, Table of Contents with real page numbers, and rewires web links to point to internal PDF chapters.
  • Interactive Mode: If the spider fails to find links automatically, a visible browser launches to allow manual selector identification.
  • Universal Support: Pre-configured for Flask, React, Playwright, ReadTheDocs, and Redocly, with generic support for others.
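
As one illustration of the link rewiring mentioned under Professional PDF Output, the sketch below maps each crawled page URL to a chapter anchor and rewrites matching `href` values. The anchor scheme (`#chapter-N`), the function names, and the regex-based rewrite are all assumptions; the tool's own implementation may differ.

```python
import re
from urllib.parse import urldefrag, urljoin


def build_anchor_map(page_urls):
    """Map each crawled page URL to an internal anchor such as '#chapter-1'."""
    return {url.rstrip("/"): f"#chapter-{i}" for i, url in enumerate(page_urls, start=1)}


def rewire_links(html, page_url, anchor_map):
    """Point links to crawled pages at their chapter anchor; leave others alone."""

    def replace(match):
        href = match.group(2)
        if href.startswith("#"):
            return match.group(0)  # same-page anchor: keep as-is
        # Resolve relative links against the current page and drop fragments.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        target = anchor_map.get(absolute.rstrip("/"))
        if target is None:
            return match.group(0)  # external or uncrawled page: keep as-is
        return f"{match.group(1)}{target}{match.group(3)}"

    # Only double-quoted href attributes are handled in this sketch.
    return re.sub(r'(href=")([^"]*)(")', replace, html)
```

Links to pages outside the crawl keep their original web URL, so they remain clickable in the PDF.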

Prerequisites

  • Python 3.8+
  • Windows Users: GTK3 Runtime (Required for WeasyPrint PDF generation).

Installation

  1. Clone the repository:

    git clone https://github.com/varunbhandarii/universal-doc-downloader.git
    cd universal-doc-downloader
  2. Create a virtual environment (recommended):

    python -m venv venv
    # Windows:
    .\venv\Scripts\activate
    # Mac/Linux:
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Install browser binaries: This tool requires Chromium to render pages.

    playwright install chromium

Usage

Basic Usage

Pass the URL of the documentation homepage. The tool attempts to auto-detect the sidebar selector.

python doc_dl.py https://flask.palletsprojects.com/en/stable/
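
To make the auto-detection idea concrete, here is a rough, standard-library-only sketch: it prefers links found inside `<nav>` or `<aside>` elements and falls back to every same-site link when none are found. The container tags, the fallback rule, and all names are assumptions; the actual tool works on the browser-rendered page, and its heuristics are not documented here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

CONTAINER_TAGS = {"nav", "aside"}  # assumed sidebar containers


class LinkCollector(HTMLParser):
    """Collect hrefs, tracking which ones sit inside a sidebar container."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # number of open sidebar containers
        self.sidebar_links = []
        self.all_links = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.all_links.append(href)
                if self.depth > 0:
                    self.sidebar_links.append(href)

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.depth > 0:
            self.depth -= 1


def discover_pages(html, base_url):
    """Return ordered, de-duplicated same-site page URLs."""
    parser = LinkCollector()
    parser.feed(html)
    # Fallback: if no sidebar container held links, consider every link.
    candidates = parser.sidebar_links or parser.all_links
    site = urlparse(base_url).netloc
    pages = []
    for href in candidates:
        url = urljoin(base_url, href).split("#")[0]
        if urlparse(url).netloc == site and url not in pages:
            pages.append(url)
    return pages
```

Keeping the links in document order matters because the sidebar order usually becomes the chapter order of the generated PDF.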

Custom Output & Metadata

Set the cover-page title with --title and the output filename with --output.

python doc_dl.py https://playwright.dev/python/docs/intro \
  --title "Playwright Python Manual" \
  --output playwright.pdf

Testing (Limit Pages)

Download only the first 5 pages to test the layout before scraping the whole site.

python doc_dl.py https://docs.bria.ai/ --limit 5
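
The `--limit` semantics (0 means no limit) can be sketched as a small helper. `apply_limit` is a hypothetical name, and rejecting negative values is an assumption rather than documented behavior.

```python
def apply_limit(urls, limit=0):
    """Return the first `limit` URLs, or all of them when limit is 0."""
    if limit < 0:
        # Assumed validation; the tool may treat negatives differently.
        raise ValueError("limit must be 0 (no limit) or a positive integer")
    urls = list(urls)
    return urls if limit == 0 else urls[:limit]
```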

Manual Selector

If auto-detection fails, inspect the website and pass the CSS selector for its sidebar navigation.

python doc_dl.py https://example.com/docs --selector "div.my-custom-menu"

Debugging

Run with the browser visible (--visible) and detailed logging enabled (--verbose) to watch the scraping process.

python doc_dl.py https://example.com/docs --visible --verbose
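
One plausible way the two debugging flags could translate into runtime settings is shown below. The helper name `debug_settings` and the choice of `INFO` as the non-verbose level are assumptions.

```python
import logging


def debug_settings(visible=False, verbose=False):
    """Translate the CLI debug flags into browser and logging settings."""
    # Playwright's launch() accepts `headless`; a visible browser means headless=False.
    launch_kwargs = {"headless": not visible}
    log_level = logging.DEBUG if verbose else logging.INFO
    return launch_kwargs, log_level
```

The returned dictionary could be passed as `chromium.launch(**launch_kwargs)` and the level as `logging.basicConfig(level=log_level)`.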

Command Line Options

| Flag       | Short | Description                           | Default         |
|------------|-------|---------------------------------------|-----------------|
| url        |       | The target URL (required)             | N/A             |
| --output   | -o    | Filename for the generated PDF        | manual.pdf      |
| --title    | -t    | Title displayed on the cover page     | "Documentation" |
| --selector | -s    | CSS selector for the sidebar          | Auto-detect     |
| --limit    | -l    | Stop after N pages (0 = download all) | 0               |
| --visible  |       | Run browser in headful mode (visible) | False           |
| --verbose  | -v    | Enable detailed debug logging         | False           |
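
The options above map naturally onto an `argparse` parser. The sketch below mirrors the documented flags and defaults; the function name `build_parser` and the use of `None` to represent "auto-detect" are assumptions, not taken from the tool's source.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="doc_dl.py",
        description="Crawl a documentation site and convert it into a single PDF.",
    )
    parser.add_argument("url", help="The target URL")
    parser.add_argument("-o", "--output", default="manual.pdf",
                        help="Filename for the generated PDF")
    parser.add_argument("-t", "--title", default="Documentation",
                        help="Title displayed on the cover page")
    # None stands in for "auto-detect" (an assumed sentinel).
    parser.add_argument("-s", "--selector", default=None,
                        help="CSS selector for the sidebar")
    parser.add_argument("-l", "--limit", type=int, default=0,
                        help="Stop after N pages (0 = download all)")
    parser.add_argument("--visible", action="store_true",
                        help="Run browser in headful mode (visible)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable detailed debug logging")
    return parser
```

For example, `build_parser().parse_args(["https://example.com/docs", "-l", "5"])` yields a namespace with `limit` set to 5 and every other option at its default.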

License

Distributed under the MIT License. See LICENSE for more information.
