Sahil-coder-30/YourCrawl

🚀 YourCrawl: The Intelligent Enterprise Web Crawler

YourCrawl is a high-performance, enterprise-grade web crawler built with Next.js and TypeScript. It crawls websites, extracts their content, and structures it as JSON while respecting robots.txt and enforcing crawl policies, and it reports crawl metrics in a dashboard.

✨ Key Features

  • 🚀 Blazing Fast Performance: Optimized for high-speed data extraction with intelligent concurrency management
  • 🤖 AI-Powered Data Structuring: Uses Gemini AI to automatically structure messy HTML content into clean JSON
  • 🛡️ Enterprise Compliance: Respects robots.txt, implements crawl delays, and handles crawl politeness
  • 🎨 Modern Dashboard: Interactive frontend with real-time stats, charts, and result visualization
  • 📊 Real-Time Analytics: Visualizes crawl metrics including successful crawls, errors, average duration, and data volume
  • 🔐 Secure & Efficient: Type-safe TypeScript, efficient data pipelining, and streamlined architecture

🛠️ Tech Stack

  • Framework: Next.js 14 (App Router)
  • Language: TypeScript
  • Styling: Tailwind CSS
  • AI: Google Gemini
  • Data Processing: Robust crawl policies and structured output processing
  • Architecture: Server Actions, Server Components, and intelligent data pipelining

📂 Project Structure

YourCrawl/
├── app/                   # Next.js App Router: Pages, Layouts, Routes
├── components/            # Reusable React components and UI elements
├── lib/                   # Core logic, utilities, AI integration, crawl policies
│   ├── crawl-policy.ts    # robots.txt parsing and crawl policy enforcement
│   ├── ai-parser.ts       # Gemini AI integration for data structuring
│   └── utils.ts           # Utility functions and helpers
├── public/                # Static assets
├── styles/                # Global styles and CSS
├── server/                # API endpoints and server-side logic
├── types/                 # TypeScript type definitions
└── server.ts              # Application entry point

🚀 Getting Started

Prerequisites

  • Node.js: >= 20.x
  • npm/yarn/pnpm: Package manager
  • Google Gemini API Key: For AI-powered data structuring

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd YourCrawl
  2. Install dependencies:

    npm install
    # or
    yarn install
    # or
    pnpm install

Configuration

  1. Create a .env.local file in the project root:

    cp .env.example .env.local
  2. Add your Google Gemini API key to the .env.local file:

    GOOGLE_API_KEY="[GCP_API_KEY]"

Running the Application

  1. Start the development server:

    npm run dev
    # or
    yarn dev
    # or
    pnpm dev
  2. Open the application in your browser: http://localhost:3000

Build and Run (Production)

  1. Build the application for production:

    npm run build
    # or
    yarn build
    # or
    pnpm build
  2. Start the production server:

    npm start
    # or
    yarn start
    # or
    pnpm start

🛠️ Development Commands

Command           Description
npm run dev       Start development server
npm run build     Build for production
npm run start     Start production server
npm run lint      Run ESLint and TypeScript checks
npm run format    Format code with Prettier

🏗️ Architecture Overview

🔌 Server Components & Actions

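// app/page.tsx
// Pages in the App Router are Server Components by default, so this file
// needs no directive. If CrawlDashboard is a Client Component, `performCrawl`
// must be a Server Action (exported from a module marked 'use server')
// to be passed to it as a prop.

import CrawlDashboard from '@/components/CrawlDashboard'
import { performCrawl } from '@/lib/crawl-engine'

export default async function HomePage() {
  return <CrawlDashboard performCrawl={performCrawl} />
}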

⚙️ AI-Powered Data Structuring

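// lib/ai-parser.ts
import { GoogleGenerativeAI } from '@google/generative-ai'

export async function structureDataWithAI(htmlContent: string): Promise<unknown> {
  // Fail early with a clear message instead of relying on a non-null assertion.
  const apiKey = process.env.GOOGLE_API_KEY
  if (!apiKey) throw new Error('GOOGLE_API_KEY is not set')

  const genAI = new GoogleGenerativeAI(apiKey)
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' })

  const prompt = `
    Extract and structure the following HTML content into clean JSON:
    HTML Content: ${htmlContent}

    Return only the JSON object, no explanations.
  `

  const result = await model.generateContent(prompt)
  // The model may wrap its answer in a Markdown code fence; strip it before parsing.
  const text = result.response.text().replace(/^\s*`{3}(?:json)?\s*|\s*`{3}\s*$/g, '')
  // JSON.parse throws on malformed output, so callers should handle that error.
  return JSON.parse(text)
}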

📋 Robots.txt Compliance

// lib/crawl-policy.ts
import { RobotsTxtFile } from '@robotstxt/robotstxt'

export async function getCrawlPolicy(url: string): Promise<RobotsTxtFile> {
  const robotsTxtPath = new URL('/robots.txt', url).toString()
  const response = await fetch(robotsTxtPath)
  
  if (response.ok) {
    const text = await response.text()
    return new RobotsTxtFile(text, url)
  }
  
  return new RobotsTxtFile('', url)
}

export function isAllowed(policy: RobotsTxtFile, url: string, userAgent: string): boolean {
  return policy.isAllowed(userAgent, url)
}

🎨 Frontend Dashboard

The dashboard provides:

  • Real-time Stats: Track successful crawls, errors, and average duration
  • Analytics: Visualize crawl performance with charts
  • Crawl History: View past crawl results and performance
  • Live Results: Monitor and interact with ongoing crawls
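
As an illustration of how the headline numbers above could be derived, here is a minimal sketch. The `CrawlResult` and `CrawlStats` shapes and the `summarizeCrawl` function are hypothetical names chosen for this example; the project's actual types may differ.

```typescript
// Hypothetical shape of one crawl result; not taken from the project source.
interface CrawlResult {
  url: string
  ok: boolean
  durationMs: number
  bytes: number
}

// Hypothetical summary shape matching the metrics listed above.
interface CrawlStats {
  successful: number
  errors: number
  averageDurationMs: number
  totalBytes: number
}

// Reduce a list of results to the numbers a dashboard could display.
function summarizeCrawl(results: CrawlResult[]): CrawlStats {
  const successful = results.filter((r) => r.ok).length
  const totalDuration = results.reduce((sum, r) => sum + r.durationMs, 0)
  const totalBytes = results.reduce((sum, r) => sum + r.bytes, 0)
  return {
    successful,
    errors: results.length - successful,
    // Avoid dividing by zero when no pages have been crawled yet.
    averageDurationMs: results.length === 0 ? 0 : totalDuration / results.length,
    totalBytes,
  }
}
```

Keeping the summary a pure function of the result list makes it easy to recompute after each crawl and to test without a running server.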

Dashboard Preview


🔍 Advanced Features

Intelligent Crawl Strategy

The crawler implements:

  1. Robots.txt Verification: Automatically fetches and parses robots.txt
  2. User-Agent Rotation: Allows setting custom user-agents for different crawl policies
  3. Crawl Delays: Respects Crawl-delay directives to avoid overwhelming servers
  4. Sitemaps: (Optional) Supports sitemap discovery and parsing for comprehensive crawling
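
To make the robots.txt and Crawl-delay steps concrete, here is a dependency-free sketch of the parsing involved. It is not the project's implementation: `RobotsRules`, `parseRobots`, and `isPathAllowed` are hypothetical names, matching is by plain path prefix (no `*` or `$` wildcards), and rules from the `*` group and a matching named group are merged rather than picking the most specific group. Note that Crawl-delay is a widely used extension and not part of the Robots Exclusion Protocol standard (RFC 9309).

```typescript
// Hypothetical result type for this sketch.
interface RobotsRules {
  allow: string[]
  disallow: string[]
  crawlDelaySeconds?: number
}

// Collect the rules from every group whose User-agent is "*" or is contained
// in `userAgent`. Simplified: prefix rules only, groups are merged.
function parseRobots(body: string, userAgent: string): RobotsRules {
  const rules: RobotsRules = { allow: [], disallow: [] }
  const ua = userAgent.toLowerCase()
  let applies = false
  let inAgentLines = false

  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim() // drop comments
    const sep = line.indexOf(':')
    if (sep === -1) continue
    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      const matches = value === '*' || ua.includes(value.toLowerCase())
      applies = inAgentLines ? applies || matches : matches
      inAgentLines = true
      continue
    }
    inAgentLines = false
    if (!applies || value === '') continue // an empty Disallow allows everything

    if (field === 'allow') rules.allow.push(value)
    else if (field === 'disallow') rules.disallow.push(value)
    else if (field === 'crawl-delay' && !Number.isNaN(Number(value))) {
      rules.crawlDelaySeconds = Number(value)
    }
  }
  return rules
}

// The longest matching prefix wins; Allow wins ties.
function isPathAllowed(rules: RobotsRules, path: string): boolean {
  const longest = (prefixes: string[]) =>
    Math.max(-1, ...prefixes.filter((p) => path.startsWith(p)).map((p) => p.length))
  return longest(rules.allow) >= longest(rules.disallow)
}
```

A crawler would call `isPathAllowed` before each fetch and wait `crawlDelaySeconds` (when present) between requests to the same host.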

Data Pipelining

The system uses an efficient data pipeline:

HTML Fetch → Robots.txt Check → Content Extraction → AI Structuring → Result Storage

Each step can be independently monitored and optimized, making the system highly maintainable.
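
One way to get per-step monitoring is to model each step as a named stage and time it as it runs. The sketch below is an assumption about how such a pipeline could be wired, not the project's code; `Stage`, `StageTiming`, and `runPipeline` are hypothetical names.

```typescript
// A stage receives the previous stage's output and returns the next value.
interface Stage {
  name: string
  run: (input: unknown) => Promise<unknown> | unknown
}

interface StageTiming {
  name: string
  durationMs: number
}

// Run the stages in order and record how long each completed stage took.
// A failing stage aborts the run with an error that names the stage.
async function runPipeline(
  input: unknown,
  stages: Stage[]
): Promise<{ output: unknown; timings: StageTiming[] }> {
  const timings: StageTiming[] = []
  let value = input
  for (const stage of stages) {
    const start = Date.now()
    try {
      value = await stage.run(value)
    } catch (err) {
      throw new Error(`Stage "${stage.name}" failed: ${String(err)}`)
    }
    timings.push({ name: stage.name, durationMs: Date.now() - start })
  }
  return { output: value, timings }
}
```

Because each stage is just a named function, a slow or failing step can be identified from `timings` or the error message, and a single stage can be swapped out without touching the others.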


🧪 Testing

Run the development server:

npm run dev

🔐 Security & Compliance

  • Respect robots.txt: Built-in compliance with crawl directives
  • Rate Limiting: Implemented through crawl delay policies
  • User-Agent Handling: Proper user-agent identification for policy enforcement
  • Error Handling: Graceful error handling and retry mechanisms
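
The retry mechanism mentioned above is not shown in this README. A common pattern is retrying with exponential backoff, sketched here under that assumption; `withRetry`, `sleep`, and the default values are hypothetical and not taken from the project.

```typescript
// Resolve after `ms` milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Call `task` up to `maxAttempts` times, doubling the wait after each failure
// (baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...). Rethrows the last error.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task()
    } catch (err) {
      lastError = err
      // Only wait if another attempt will follow.
      if (attempt < maxAttempts) await sleep(baseDelayMs * 2 ** (attempt - 1))
    }
  }
  throw lastError
}
```

A fetch could then be wrapped as `withRetry(() => fetch(url))`. In practice a crawler would likely retry only transient failures such as timeouts or 5xx responses, not 4xx responses or robots.txt denials.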

🀝 Contributing

Contributions are welcome! Please feel free to submit a pull request.


📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


📧 Contact
