An intelligent web scraper that crawls entire websites, processes images using OCR, and creates comprehensive Markdown documentation in a single text file. Perfect for creating large knowledge bases for GPT training or documentation purposes.
- Complete Site Crawling: Recursively crawls all pages starting from a root URL
- Image OCR Processing: Extracts text from images using Tesseract OCR
- Intelligent Content Extraction: Focuses on main content, avoiding navigation and ads
- Markdown Output: Creates clean, structured Markdown in a single .txt file
- Respectful Crawling: Includes rate limiting and robots.txt compliance
- Error Handling: Robust error handling with retry mechanisms
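
The respectful-crawling and retry features listed above could be combined roughly as follows. This is a minimal sketch, not the project's actual code: the `PoliteFetcher` class, its parameters, and the `download` callback are hypothetical names chosen for illustration. Only `urllib.robotparser` and `time` come from the standard library.

```python
import time
import urllib.robotparser


class PoliteFetcher:
    """Hypothetical helper: robots.txt check, fixed delay, retries with backoff."""

    def __init__(self, robots_lines, user_agent="ai-web-scraper",
                 min_delay=1.0, max_retries=3,
                 sleep=time.sleep, clock=time.monotonic):
        # Parse robots.txt rules that were already downloaded as text lines.
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_lines)
        self.user_agent = user_agent
        self.min_delay = min_delay      # assumed minimum seconds between requests
        self.max_retries = max_retries  # assumed total attempts per URL
        self._sleep = sleep
        self._clock = clock
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def _wait_for_slot(self):
        # Rate limiting: wait until min_delay has passed since the last request.
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url, download):
        """Call download(url) politely; return None if robots.txt disallows it."""
        if not self.allowed(url):
            return None
        for attempt in range(self.max_retries):
            self._wait_for_slot()
            try:
                return download(url)
            except OSError:
                if attempt == self.max_retries - 1:
                    raise
                self._sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting. In practice `download` would wrap whatever HTTP client the scraper uses.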
- Install Python dependencies:
  pip install -r requirements.txt
- Install Tesseract OCR:
  - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
  - macOS: brew install tesseract
  - Linux: sudo apt-get install tesseract-ocr
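
After installation, it can help to confirm that the Tesseract executable is reachable before starting a crawl. The check below is a small sketch using only the standard library. The function name `find_tesseract` is hypothetical, and the check assumes the binary is named `tesseract` and is on `PATH`, which may not hold on Windows unless the install directory is added manually.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path to the Tesseract executable, or None if not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; see the install steps above.")
    else:
        print(f"Tesseract found at {path}")
```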
from src.scraper import WebScraper
scraper = WebScraper()
scraper.crawl("https://support.atlassian.com/jira-service-management-cloud")

python main.py --url "https://support.atlassian.com/jira-service-management-cloud" --output "atlassian_docs.txt"

Edit config.yaml to customize:
- Maximum pages to crawl
- Rate limiting settings
- Content filters
- OCR settings
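
The four groups of settings above might map onto a structure like the one below once the YAML file has been parsed into a dictionary. Every key name and default value here is an illustrative assumption. The real keys are whatever `config.yaml` defines.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ScraperSettings:
    # All names and defaults below are assumptions for illustration only.
    max_pages: int = 500             # maximum pages to crawl
    delay_seconds: float = 1.0       # rate limiting: pause between requests
    exclude_selectors: list = field(  # content filters: elements to skip
        default_factory=lambda: ["nav", "footer", "aside"])
    ocr_enabled: bool = True         # OCR settings
    ocr_language: str = "eng"


def settings_from_mapping(raw):
    """Build settings from a dict (e.g. parsed YAML), rejecting unknown keys."""
    known = {f.name for f in fields(ScraperSettings)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return ScraperSettings(**raw)
```

Rejecting unknown keys is a design choice that surfaces typos in the config file early instead of silently falling back to defaults.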
The scraper writes a single Markdown file containing:
- Page titles and URLs
- Main content from each page
- Text extracted from images
- Structured navigation and hierarchy
# Website Documentation: Jira Service Management Cloud
## Page: Getting Started
URL: https://support.atlassian.com/jira-service-management-cloud/getting-started
Content goes here...
### Images Found:
- Image 1: [OCR extracted text]
- Image 2: [OCR extracted text]
---
## Page: Configuration
URL: https://support.atlassian.com/jira-service-management-cloud/configuration
Content goes here...
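
The layout shown above could be produced by a small rendering function along these lines. This is a sketch under stated assumptions: the `Page` dataclass and `render_document` function are hypothetical and are not taken from `markdown_generator.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical container for one crawled page.
    title: str
    url: str
    content: str
    image_texts: list = field(default_factory=list)  # OCR text, one per image


def render_document(site_title, pages):
    """Render pages into the single-file layout shown in the sample output."""
    sections = []
    for page in pages:
        section = [
            f"## Page: {page.title}",
            f"URL: {page.url}",
            "",
            page.content.strip(),
        ]
        # Only emit the images heading when OCR produced some text.
        if page.image_texts:
            section += ["", "### Images Found:"]
            section += [
                f"- Image {i}: {text}"
                for i, text in enumerate(page.image_texts, start=1)
            ]
        sections.append("\n".join(section))
    header = f"# Website Documentation: {site_title}"
    # Pages are separated by a horizontal rule, as in the sample.
    return header + "\n\n" + "\n\n---\n\n".join(sections) + "\n"
```

Writing the result to disk is then a single `open(path, "w", encoding="utf-8").write(...)` call on the returned string.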
ai-web-scraper/
├── src/
│   ├── __init__.py
│   ├── scraper.py              # Main scraper class
│   ├── crawler.py              # URL crawling logic
│   ├── content_extractor.py    # Content extraction
│   ├── ocr_processor.py        # Image OCR processing
│   └── markdown_generator.py   # Output formatting
├── config/
│   └── config.yaml             # Configuration settings
├── output/                     # Generated documentation
├── main.py                     # CLI entry point
├── requirements.txt            # Python dependencies
└── README.md                   # This file