chintanparekh2510/ai-web-scraper

AI Web Scraper

An intelligent web scraper that crawls an entire website, runs OCR on its images, and writes the result as Markdown to a single text file. It is suited to building large knowledge bases for GPT training or for documentation.

Features

  • Complete Site Crawling: Recursively crawls all pages starting from a root URL
  • Image OCR Processing: Extracts text from images using Tesseract OCR
  • Intelligent Content Extraction: Focuses on main content, avoiding navigation and ads
  • Markdown Output: Creates clean, structured Markdown in a single .txt file
  • Respectful Crawling: Includes rate limiting and robots.txt compliance
  • Error Handling: Robust error handling with retry mechanisms
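As an illustration of the "Respectful Crawling" feature, the sketch below combines a robots.txt check with a minimum delay between requests. It is not the project's actual code: the class name `PoliteFetchGate`, the user agent string, and the one-second default delay are all assumptions made for this example. It uses only the standard library.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetchGate:
    """Decides whether and when a URL may be fetched (hypothetical helper)."""

    def __init__(self, robots_lines, user_agent="ai-web-scraper", min_delay=1.0):
        self.user_agent = user_agent      # assumed name, not from the project
        self.min_delay = min_delay        # assumed default, in seconds
        self._last_request = None
        self._robots = RobotFileParser()
        # parse() takes the robots.txt body as a list of lines, so no network
        # access happens here; a real crawler would download the file first.
        self._robots.parse(robots_lines)

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self._robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep just long enough to keep min_delay seconds between requests."""
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
```

A crawler loop would call `allowed(url)` before each request and `wait_turn()` immediately before sending it.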

Installation

  1. Install Python dependencies:
pip install -r requirements.txt
  2. Install Tesseract OCR:
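How Tesseract is installed depends on your platform. Once it is installed, a quick check along these lines can confirm that the `tesseract` executable is reachable on your PATH. The function name `find_tesseract` is hypothetical and is not part of this project.

```python
import shutil


def find_tesseract():
    """Return the full path to the tesseract executable, or None if absent."""
    # shutil.which searches the directories listed in PATH, like a shell does.
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; OCR will not work until it is installed.")
    else:
        print(f"tesseract found at {path}")
```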

Usage

Basic Usage

from src.scraper import WebScraper

scraper = WebScraper()
scraper.crawl("https://support.atlassian.com/jira-service-management-cloud")
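To give a rough idea of what a recursive crawl involves, the following sketch extracts the links from one page and keeps only those that stay on the same host and under the root URL's path. This is an assumption about how such a crawler might scope itself, not a description of what `WebScraper.crawl` does internally; `LinkCollector` and `same_site_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(page_url, html, root_url):
    """Return absolute, de-duplicated links that fall under root_url."""
    collector = LinkCollector()
    collector.feed(html)
    root = urlparse(root_url)
    found = []
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment so that
        # /page and /page#section count as the same URL.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if (
            parsed.scheme in ("http", "https")
            and parsed.netloc == root.netloc
            and parsed.path.startswith(root.path)
            and absolute not in found
        ):
            found.append(absolute)
    return found
```

A full crawl would repeat this for each newly found URL, tracking visited pages so that none is fetched twice.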

Command Line

python main.py --url "https://support.atlassian.com/jira-service-management-cloud" --output "atlassian_docs.txt"
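For reference, the two flags shown above could be parsed with `argparse` roughly as follows. This is a minimal sketch, not the contents of `main.py`: the function name `build_parser`, the help texts, and the default output filename are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the --url and --output flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website and write its content as Markdown to a text file."
    )
    parser.add_argument("--url", required=True, help="Root URL to start crawling from")
    # The default filename below is an assumption made for this example.
    parser.add_argument("--output", default="output.txt", help="Path of the text file to write")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Would crawl {args.url} and write to {args.output}")
```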

Configuration

Edit config/config.yaml to customize:

  • Maximum pages to crawl
  • Rate limiting settings
  • Content filters
  • OCR settings
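The sketch below shows one way settings like these could be represented and validated once loaded. The key names (`max_pages`, `request_delay_seconds`, `exclude_selectors`, `ocr_enabled`, `ocr_language`) and their defaults are invented for illustration; check the actual config.yaml for the real keys. Reading the file assumes the third-party PyYAML package is installed.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ScraperConfig:
    """Hypothetical settings; every key name and default here is an assumption."""

    max_pages: int = 100
    request_delay_seconds: float = 1.0
    exclude_selectors: list = field(default_factory=lambda: ["nav", "footer"])
    ocr_enabled: bool = True
    ocr_language: str = "eng"

    @classmethod
    def from_dict(cls, data):
        """Build a config from a dict, rejecting keys that are not defined."""
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        return cls(**data)


def load_config(path):
    """Read a YAML file and return a ScraperConfig (requires PyYAML)."""
    import yaml  # third-party: pip install pyyaml

    with open(path, encoding="utf-8") as fh:
        # safe_load returns None for an empty file, so fall back to {}.
        return ScraperConfig.from_dict(yaml.safe_load(fh) or {})
```

Rejecting unknown keys makes a misspelled setting fail loudly instead of being silently ignored.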

Output

The scraper creates a comprehensive Markdown file containing:

  • Page titles and URLs
  • Main content from each page
  • Text extracted from images
  • Structured navigation and hierarchy

Example Output Structure

# Website Documentation: Jira Service Management Cloud

## Page: Getting Started
URL: https://support.atlassian.com/jira-service-management-cloud/getting-started

Content goes here...

### Images Found:
- Image 1: [OCR extracted text]
- Image 2: [OCR extracted text]

---

## Page: Configuration
URL: https://support.atlassian.com/jira-service-management-cloud/configuration

Content goes here...
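The structure above could be produced by a small renderer such as the one sketched here. It is an illustration of the output format only, not the project's `markdown_generator.py`; the `Page` dataclass and the `render_site` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One crawled page (hypothetical container for this example)."""

    title: str
    url: str
    content: str
    image_texts: list = field(default_factory=list)  # OCR text, one entry per image


def render_site(site_title, pages):
    """Render all pages into a single Markdown string."""
    sections = []
    for page in pages:
        lines = [f"## Page: {page.title}", f"URL: {page.url}", "", page.content]
        if page.image_texts:
            lines += ["", "### Images Found:"]
            lines += [
                f"- Image {number}: {text}"
                for number, text in enumerate(page.image_texts, start=1)
            ]
        sections.append("\n".join(lines))
    header = f"# Website Documentation: {site_title}"
    # Pages are separated by a horizontal rule, as in the example structure.
    return header + "\n\n" + "\n\n---\n\n".join(sections) + "\n"
```

Writing the returned string to a file opened with `encoding="utf-8"` yields the single .txt output.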

Project Structure

ai-web-scraper/
├── src/
│   ├── __init__.py
│   ├── scraper.py          # Main scraper class
│   ├── crawler.py          # URL crawling logic
│   ├── content_extractor.py # Content extraction
│   ├── ocr_processor.py    # Image OCR processing
│   └── markdown_generator.py # Output formatting
├── config/
│   └── config.yaml         # Configuration settings
├── output/                 # Generated documentation
├── main.py                # CLI entry point
├── requirements.txt       # Python dependencies
└── README.md             # This file
