Crawlio

Extensible web crawling and data extraction framework.

A technical foundation for building scalable data pipelines using Scrapy and Playwright.

🌐 Architecture Overview

graph TD
    subgraph Client Layer
        Browser[Web Browser]
        Admin[Admin Dashboard]
    end

    subgraph "Application Layer (FastAPI)"
        API[FastAPI Backend]
        Auth[Auth Service]
        Routers[API Routers]
        Templates[Jinja2 Templates]
    end

    subgraph "Data Layer"
        DB[(PostgreSQL/SQLite)]
        Redis[(Redis Cache)]
    end

    subgraph "Crawling Layer"
        Core[Lazy Core]
        Scrapy[Scrapy Engine]
        Playwright[Playwright Renderer]
        Ollama[Ollama AI Service]
    end

    Browser -->|HTTP/HTTPS| API
    Admin -->|Manage| API

    API --> Routers
    Routers --> Auth
    Routers --> Templates
    Routers --> Core

    Core --> Scrapy
    Core --> Playwright
    Core --> Ollama

    API --> DB
    Core --> DB
    Core --> Redis


Crawlio is an extensible web crawling framework for developers and organizations that need robust data extraction pipelines. It combines the speed of Scrapy with the dynamic rendering capabilities of Playwright to handle modern, JavaScript-heavy websites.

What is Crawlio?

If you need to collect data from websites, such as product prices, news articles, or social media updates, Crawlio handles the hard parts for you:

Automatic Scrolling & Clicking: It can "browse" like a human to see content that only appears when you scroll or click.
Multiple Save Locations: Send your data directly to CSV or Excel (.xlsx) files, Google Sheets, or databases (PostgreSQL/MongoDB).
Security & Reliability: Built-in protection against being blocked, including smart rate limiting and proxy support.
Easy Dashboard: A simple web interface to see how your data collection is going in real-time.

Features

Automated Workflows: Fast setup for new data collection tasks ("spiders").
Modern Web Support: Built-in Playwright integration for sites like Amazon, Twitter, or React apps.
Google Sheets Integration: Push data directly to your spreadsheets for easy sharing.
Smart Rate Limiting (Enhanced): Protects the application and target websites from abuse by enforcing fair usage, identifying clients by both IP address and user so that limits are harder to bypass.
Integrated Proxy Manager: Built-in system for automatic proxy rotation and health checks, compatible with both Scrapy and Playwright.
Developer First: Clean, modular code that is easy to extend.
Production Ready: Full Docker support for stable, long-running deployments.
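
The rate-limiting feature listed above is not described in detail in this README. As a rough illustration of the general idea only, and not Crawlio's actual implementation, a sliding-window limiter keyed by a client identifier (for example an IP address or user ID) might look like the sketch below. The class name `SlidingWindowLimiter` and all of its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each key.

    Hypothetical illustration; not part of the Crawlio API.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key):
        """Return True and record the request if `key` is under its limit."""
        now = self._clock()
        hits = self._hits[key]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Each key is tracked separately, so one client exhausting its quota does not affect the others.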

Quick Start

1. Installation

This project uses uv for dependency management.

# Install from PyPI
pip install crawlio

# OR add it to a uv-managed project
uv add crawlio

For development:

# Initialize and sync dependencies
uv sync

Note

Install Playwright browser binaries after the initial setup: playwright install

2. Static Site Crawler

Create my_agent.py:

import scrapy
from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler
from scrapy.crawler import CrawlerProcess

class MyAgent(LazyBaseCrawler):
    name = "my_agent"

    def start_requests(self):
        yield scrapy.Request("https://example.com", self.parse)

    def parse(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url
        }

process = CrawlerProcess()
process.crawl(MyAgent)
process.start()

3. Dynamic Content (JavaScript)

Leverage Playwright for sites that require browser rendering:

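import scrapy
from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler

class DynamicAgent(LazyBaseCrawler):
    name = "dynamic"

    def start_requests(self):
        # The "playwright" meta key asks for the page to be rendered in a browser
        yield scrapy.Request(
            "https://example.com",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        # The response contains the HTML after JavaScript has run
        data = response.css(".rendered-content::text").get()
        yield {"content": data}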
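
The `meta={"playwright": True}` key matches the convention of the scrapy-playwright plugin. Assuming Crawlio's integration is built on that plugin, and that `LazyBaseCrawler` does not already configure it (neither is confirmed by this README), the Scrapy settings it needs would look roughly like this. The name `PLAYWRIGHT_SETTINGS` is hypothetical.

```python
# Assumption: the Playwright integration is provided by scrapy-playwright.
# These keys are that plugin's documented settings, not Crawlio-specific ones.
PLAYWRIGHT_SETTINGS = {
    # Route HTTP and HTTPS downloads through the Playwright handler.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
    "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
}
```

If the base class does not set these already, the dictionary could be assigned to `custom_settings` on the spider.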

Data Management

1. MongoDB Integration

Configuration (.env):

MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lazy_crawler_db

Settings:

ITEM_PIPELINES = {
    "lazy_crawler.crawler.pipelines.MongoPipeline": 400,
}
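
How the pipeline reads these variables is not shown in this README. As a minimal sketch, assuming they are exposed as ordinary environment variables, they could be collected like this. The helper `load_mongo_config` is hypothetical, and its defaults simply mirror the example values above.

```python
import os


def load_mongo_config(env=None):
    """Read MongoDB settings from an environment mapping.

    Hypothetical helper; the fallback values copy the .env example.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("MONGO_URI", "mongodb://localhost:27017"),
        "database": env.get("MONGO_DATABASE", "lazy_crawler_db"),
    }
```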

2. Google Sheets Export

Configuration (.env):

GOOGLE_SHEETS_CREDS_FILE=creds.json
GOOGLE_SHEETS_SPREADSHEET_NAME=CrawlData
GOOGLE_SHEETS_WORKSHEET_NAME=Results

3. JSON & CSV Export

Enable the built-in pipelines to save to local files:

custom_settings = {
    "ITEM_PIPELINES": {
        # Export to scraped_data.json
        "lazy_crawler.crawler.pipelines.JsonWriterPipeline": 300,

        # Export to scraped_data_{timestamp}.csv
        "lazy_crawler.crawler.pipelines.CSVPipeline": 301,
    }
}
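
To make the timestamped CSV naming concrete, here is a simplified sketch of a pipeline with that behavior. It is not the source of `CSVPipeline`; the class name `TimestampedCsvPipeline`, the `output_dir` parameter, and the timestamp format are assumptions made for illustration.

```python
import csv
from datetime import datetime
from pathlib import Path


class TimestampedCsvPipeline:
    """Write each item as a row in scraped_data_<timestamp>.csv (illustrative only)."""

    def __init__(self, output_dir="."):
        self.output_dir = Path(output_dir)
        self.path = None
        self._file = None
        self._writer = None

    def open_spider(self, spider):
        # Assumed timestamp format; the real pipeline may use a different one.
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.path = self.output_dir / f"scraped_data_{stamp}.csv"
        self._file = self.path.open("w", newline="", encoding="utf-8")

    def process_item(self, item, spider):
        row = dict(item)
        if self._writer is None:
            # Use the first item's keys as the header; later extra keys are ignored.
            self._writer = csv.DictWriter(
                self._file, fieldnames=list(row), extrasaction="ignore"
            )
            self._writer.writeheader()
        self._writer.writerow(row)
        return item

    def close_spider(self, spider):
        if self._file is not None:
            self._file.close()
```

Taking the header from the first item keeps the sketch short, but it means fields that appear only in later items are silently dropped.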

4. Excel Export

Enable the Excel pipeline to save data as .xlsx:

custom_settings = {
    "ITEM_PIPELINES": {
        "lazy_crawler.crawler.pipelines.ExcelWriterPipeline": 302,
    }
}

Dashboard & API

The project includes a dashboard for monitoring crawl progress and exploring extracted data.

Start the service:

uv run python -m lazy_crawler.app.main

Dashboard: http://localhost:8000/
API Documentation: http://localhost:8000/docs

Docker Deployment (Production)

Deploy using the provided orchestration files:

# Manual startup
docker compose up --build -d

Dashboard: http://localhost/
API Docs: http://localhost/docs
Health: http://localhost/health

Customization

The framework is designed to be modified. You can extend LazyBaseCrawler or implement custom pipelines to handle specific data requirements.
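
As an example of a custom pipeline, the sketch below normalizes whitespace in every string field of an item. A Scrapy item pipeline is a plain class with a `process_item(item, spider)` method, so no base class is needed. The class name `WhitespaceCleanerPipeline` is hypothetical and is not shipped with Crawlio.

```python
class WhitespaceCleanerPipeline:
    """Collapse runs of whitespace in every string field of an item."""

    def process_item(self, item, spider):
        # Iterate over a copy of the pairs because values are reassigned in place.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = " ".join(value.split())
        return item
```

It would be enabled like the built-in pipelines, by adding its import path (for instance a hypothetical `my_project.pipelines.WhitespaceCleanerPipeline`) to `ITEM_PIPELINES` with a priority number.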

Commercial Support

Need help building complex spiders? We offer expert integration services.

Custom Spider Development: We build the scraper for you.
Enterprise SLA: Guaranteed support and maintenance.
Hire an Expert

Contributing

Technical contributions and bug reports are welcome.

License

Crawlio is licensed under the MIT License.

Created by Pradip P.

