This project is a versatile web scraper designed to extract and compile documentation from any given URL into a single Markdown file. It's particularly useful for creating offline copies of online documentation or for aggregating information from multiple pages into a single document.
- Recursively scrapes documentation from a given base URL
- Extracts main content from each page
- Identifies and preserves code snippets, attempting to detect the programming language
- Compiles all content into a single, well-structured Markdown file
- Handles pagination and relative links
- Configurable maximum page limit and request timeout
- Python 3.7+
- aiohttp
- beautifulsoup4
- html2text
- Clone this repository:
git clone https://github.com/CryptoMaN-Rahul/Universal-Documentation-Scraper
cd Universal-Documentation-Scraper
- Install the required packages:
pip install -r requirements.txt
Run the script with a target documentation URL:
python main.py --url "https://docs.example.com"
You can customize the behavior with additional command-line arguments:
python main.py --url "https://docs.example.com" --max-pages 200 --timeout 60
Arguments:
- --url: The base URL of the documentation (required)
- --max-pages: Maximum number of pages to scrape (default: 100)
- --timeout: Timeout for requests in seconds (default: 30)
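For reference, the three arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual contents of main.py: the function name `build_parser` and the help strings are assumptions, and only the flag names, the required flag, and the defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Scrape documentation into a single Markdown file."
    )
    # --url is the only required argument.
    parser.add_argument("--url", required=True,
                        help="Base URL of the documentation")
    # Defaults match the documented values: 100 pages, 30 seconds.
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to scrape")
    parser.add_argument("--timeout", type=int, default=30,
                        help="Timeout for requests in seconds")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(
    ["--url", "https://docs.example.com", "--max-pages", "200"]
)
print(args.url, args.max_pages, args.timeout)  # https://docs.example.com 200 30
```

Note that `argparse` exposes `--max-pages` as the attribute `args.max_pages`, with the hyphen converted to an underscore.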
The script will create a Markdown file named documentation.md in the documentation_output directory. This file will contain the compiled documentation with preserved structure and code snippets.
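The output step can be pictured with a short sketch like the one below. The directory and file names come from the paragraph above. The function name `write_markdown`, the `(title, body)` input shape, and the use of `##` headings per page are assumptions made for illustration, and the real script may structure the file differently.

```python
from pathlib import Path


def write_markdown(pages, output_dir="documentation_output",
                   filename="documentation.md"):
    """Write (title, markdown_body) pairs to one file and return its path.

    Hypothetical helper: the section layout is an assumption.
    """
    out = Path(output_dir)
    # Create the output directory if it does not exist yet.
    out.mkdir(parents=True, exist_ok=True)
    target = out / filename
    sections = [f"## {title}\n\n{body.strip()}\n" for title, body in pages]
    target.write_text("\n".join(sections), encoding="utf-8")
    return target
```

Calling `write_markdown([("Intro", "Welcome.")])` from the project root would create documentation_output/documentation.md if it is missing and overwrite it otherwise.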
- The scraper starts from the provided base URL and extracts all links on the page.
- It then visits each link that shares the base URL's domain and falls under its path.
- For each page, it extracts the main content, preserving the structure.
- Code blocks are identified, and the script attempts to determine the programming language.
- All content is converted to Markdown format.
- The process continues recursively until all discovered pages are scraped or the maximum page limit is reached.
- Finally, all collected content is compiled into a single Markdown file.
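The crawl described in the steps above can be sketched as a breadth-first traversal with a scope check. This is an illustration under stated assumptions rather than the project's code: the real scraper uses aiohttp and beautifulsoup4, whereas this sketch is synchronous, uses only the standard library, and takes a hypothetical `fetch` callable so it can run without network access. The names `LinkCollector`, `in_scope`, and `crawl` are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_scope(base_url, candidate):
    """True if candidate is on the base URL's domain and under its path."""
    base, cand = urlparse(base_url), urlparse(candidate)
    base_path = base.path.rstrip("/")
    same_domain = cand.netloc == base.netloc
    under_path = cand.path == base_path or cand.path.startswith(base_path + "/")
    return same_domain and under_path


def crawl(base_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None on failure."""
    queue, seen, pages = deque([base_url]), {base_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen and in_scope(base_url, link):
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "site" standing in for real HTTP responses.
site = {
    "https://docs.example.com/guide/":
        '<a href="intro.html">Intro</a> <a href="/blog/post">Blog</a>',
    "https://docs.example.com/guide/intro.html":
        '<a href="./#top">Back</a>',
}
pages = crawl("https://docs.example.com/guide/", site.get)
print(list(pages))
```

In this example the `/blog/post` link is skipped because it lies outside the `/guide` path, and the `./#top` link resolves to the already visited base URL, so only the two guide pages are collected.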
- The script may not capture dynamically loaded content that requires JavaScript execution.
- Language detection for code snippets is based on simple heuristics and may not always be accurate.
- The effectiveness of content extraction may vary depending on the structure of the target website.
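To make the language-detection limitation concrete, here is a minimal sketch of what a heuristic detector might look like. The patterns, their order, and the function name `guess_language` are all hypothetical. The project's actual rules are not documented here and may differ.

```python
import re

# Hypothetical keyword patterns, checked in order; first match wins.
PATTERNS = [
    ("python", re.compile(r"^\s*(def |import |from \w+ import )", re.M)),
    ("javascript", re.compile(r"\b(const|let|function)\b|=>")),
    ("bash", re.compile(r"^\s*(\$ |pip |npm |git |cd )", re.M)),
    ("json", re.compile(r"^\s*[\[{]")),
]


def guess_language(code, css_classes=()):
    """Guess a code block's language; return "" when nothing matches."""
    # Prefer an explicit hint such as class="language-python" when present.
    for cls in css_classes:
        for prefix in ("language-", "lang-"):
            if cls.startswith(prefix):
                return cls[len(prefix):]
    # Otherwise fall back to keyword matching on the code itself.
    for name, pattern in PATTERNS:
        if pattern.search(code):
            return name
    return ""


print(guess_language("pip install -r requirements.txt"))  # bash
print(guess_language('import React from "react"'))        # python
```

The second call shows how such heuristics go wrong: a JavaScript import statement matches the Python pattern first and is mislabeled.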
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
The requirements.txt file pins the following versions:
aiohttp==3.8.4
beautifulsoup4==4.10.0
html2text==2020.1.16