🚀 YourCrawl: The Intelligent Enterprise Web Crawler

YourCrawl is a high-performance, enterprise-grade web crawler built with Next.js and TypeScript. It’s designed to intelligently crawl, extract, and structure data from websites while respecting robots.txt, managing crawl policies, and providing actionable insights.

✨ Key Features

🚀 Blazing Fast Performance: Optimized for high-speed data extraction with intelligent concurrency management
🤖 AI-Powered Data Structuring: Uses Gemini AI to automatically structure messy HTML content into clean JSON
🛡️ Enterprise Compliance: Respects robots.txt, implements crawl delays, and handles crawl politeness
🎨 Modern Dashboard: Interactive frontend with real-time stats, charts, and result visualization
📊 Real-Time Analytics: Visualizes crawl metrics including successful crawls, errors, average duration, and data volume
🔐 Secure & Efficient: Type-safe TypeScript, efficient data pipelining, and streamlined architecture

🛠️ Tech Stack

Framework: Next.js 14 (App Router)
Language: TypeScript
Styling: Tailwind CSS
AI: Google Gemini
Data Processing: Robust crawl policies and structured output processing
Architecture: Server Actions, Server Components, and intelligent data pipelining

📂 Project Structure

```
YourCrawl/
├── app/                   # Next.js App Router: Pages, Layouts, Routes
├── components/            # Reusable React components and UI elements
├── lib/                   # Core logic, utilities, AI integration, crawl policies
│   ├── crawl-policy.ts    # robots.txt parsing and crawl policy enforcement
│   ├── ai-parser.ts       # Gemini AI integration for data structuring
│   └── utils.ts           # Utility functions and helpers
├── public/                # Static assets
├── styles/                # Global styles and CSS
├── server/                # API endpoints and server-side logic
├── types/                 # TypeScript type definitions
└── server.ts              # Application entry point
```

🚀 Getting Started

Prerequisites

Node.js: >= 20.x
npm/yarn/pnpm: Package manager
Google Gemini API Key: For AI-powered data structuring

Installation

Clone the repository:
```
git clone <repository-url>
cd YourCrawl
```

Install dependencies:
```
npm install
# or
yarn install
# or
pnpm install
```

Configuration

Create a .env.local file in the project root:
```
cp .env.example .env.local
```
Add your Google Gemini API key to the .env.local file:
```
GOOGLE_API_KEY="[GCP_API_KEY]"
```

Running the Application

Start the development server:
```
npm run dev
# or
yarn dev
# or
pnpm dev
```
Open the application in your browser: http://localhost:3000

Build and Run (Production)

Build the application for production:
```
npm run build
# or
yarn build
# or
pnpm build
```
Start the production server:
```
npm start
# or
yarn start
# or
pnpm start
```

🛠️ Development Commands

| Command | Description |
| --- | --- |
| `npm run dev` | Start development server |
| `npm run build` | Build for production |
| `npm run start` | Start production server |
| `npm run lint` | Run ESLint and TypeScript checks |
| `npm run format` | Format code with Prettier |

🏗️ Architecture Overview

🔌 Server Components & Actions

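```
// app/page.tsx
// Pages in the App Router are Server Components by default, so no directive is
// needed here. 'use server' belongs in the module that defines the Server Action
// (assumed here to be lib/crawl-engine.ts), not in the page file.
import CrawlDashboard from '@/components/CrawlDashboard'
import { performCrawl } from '@/lib/crawl-engine'

export default async function HomePage() {
  // A Server Action can be passed to a Client Component as a prop.
  return <CrawlDashboard performCrawl={performCrawl} />
}
```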

⚙️ AI-Powered Data Structuring

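```
// lib/ai-parser.ts
import { GoogleGenerativeAI } from '@google/generative-ai'

export async function structureDataWithAI(htmlContent: string): Promise<unknown> {
  // Fail early with a clear message instead of passing undefined to the SDK.
  const apiKey = process.env.GOOGLE_API_KEY
  if (!apiKey) {
    throw new Error('GOOGLE_API_KEY is not set')
  }

  const genAI = new GoogleGenerativeAI(apiKey)
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' })

  const prompt = `
    Extract and structure the following HTML content into clean JSON:
    HTML Content: ${htmlContent}

    Return only the JSON object, no explanations.
  `

  const result = await model.generateContent(prompt)
  const text = result.response.text()

  try {
    // The JSON shape depends on the page, so callers should validate it before use.
    return JSON.parse(text)
  } catch {
    throw new Error('Gemini response was not valid JSON')
  }
}
```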
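Model replies sometimes wrap the JSON in extra text or a Markdown fence, which makes a plain `JSON.parse` throw. The sketch below shows one tolerant way to pull the JSON out before parsing. The helper name `parseModelJson` is hypothetical and not part of YourCrawl; it assumes the reply contains a single top-level object or array.

```typescript
// Hypothetical helper, not part of the YourCrawl source.
// Extracts the outermost JSON object or array from a model reply and parses it.
export function parseModelJson(reply: string): unknown {
  // Find where the JSON payload starts: the first '{' or '[' in the reply.
  const starts = [reply.indexOf('{'), reply.indexOf('[')].filter((i) => i >= 0)
  if (starts.length === 0) {
    throw new Error('No JSON object or array found in model reply')
  }
  const start = Math.min(...starts)

  // The payload ends at the last matching closing bracket.
  const closer = reply[start] === '{' ? '}' : ']'
  const end = reply.lastIndexOf(closer)
  if (end < start) {
    throw new Error('Model reply contains unterminated JSON')
  }

  // JSON.parse still throws if the extracted slice is malformed.
  return JSON.parse(reply.slice(start, end + 1))
}
```

The function returns `unknown` on purpose, so the caller has to narrow or validate the result before reading fields from it.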

📋 Robots.txt Compliance

```
// lib/crawl-policy.ts
import { RobotsTxtFile } from '@robotstxt/robotstxt'

// Fetch and parse the robots.txt of the site that hosts `url`.
export async function getCrawlPolicy(url: string): Promise<RobotsTxtFile> {
  const robotsTxtPath = new URL('/robots.txt', url).toString()
  const response = await fetch(robotsTxtPath)
  
  if (response.ok) {
    const text = await response.text()
    return new RobotsTxtFile(text, url)
  }
  
  // No usable robots.txt: fall back to an empty policy.
  return new RobotsTxtFile('', url)
}

// Check whether `userAgent` may fetch `url` under the parsed policy.
export function isAllowed(policy: RobotsTxtFile, url: string, userAgent: string): boolean {
  return policy.isAllowed(userAgent, url)
}
```
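To make the allow/disallow decision less opaque, here is a minimal sketch of the longest-match rule described in RFC 9309: the most specific matching path wins, and an Allow rule wins a tie. It is an illustration only, not the parser library's implementation. The `Rule` type and `isPathAllowed` function are hypothetical, and wildcard (`*`) and end-anchor (`$`) patterns are deliberately left out.

```typescript
// Hypothetical types and function for illustration; not part of YourCrawl.
export interface Rule {
  allow: boolean // true for an Allow line, false for a Disallow line
  path: string   // the path prefix from the rule, e.g. '/private'
}

// Decide whether `path` may be crawled under the rules of one user-agent group.
export function isPathAllowed(rules: Rule[], path: string): boolean {
  let best: Rule | undefined

  for (const rule of rules) {
    // An empty path matches nothing, and non-matching prefixes are ignored.
    if (rule.path === '' || !path.startsWith(rule.path)) continue

    const isLonger = best === undefined || rule.path.length > best.path.length
    const winsTie =
      best !== undefined && rule.path.length === best.path.length && rule.allow

    if (isLonger || winsTie) best = rule
  }

  // With no matching rule, crawling is allowed.
  return best === undefined ? true : best.allow
}
```

With the rules `Disallow: /private` and `Allow: /private/press`, a request for `/private/press/2024` is allowed because the Allow rule is the longer match, while `/private/data` is blocked.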

🎨 Frontend Dashboard

The dashboard provides:

Real-time Stats: Track successful crawls, errors, and average duration
Analytics: Visualize crawl performance with charts
Crawl History: View past crawl results and performance
Live Results: Monitor and interact with ongoing crawls
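The figures shown on the dashboard can be derived from a list of per-page results. The sketch below is one way to compute them. The `CrawlResult` and `CrawlStats` shapes and the `summarizeCrawls` function are assumptions made for illustration; the actual types used by the dashboard may differ.

```typescript
// Hypothetical result shape; the real YourCrawl types may differ.
export interface CrawlResult {
  url: string
  ok: boolean        // true when the page was fetched and processed
  durationMs: number // time spent on this page
  bytes: number      // size of the downloaded content
}

export interface CrawlStats {
  successful: number
  errors: number
  averageDurationMs: number
  totalBytes: number
}

// Aggregate per-page results into the headline numbers for the dashboard.
export function summarizeCrawls(results: CrawlResult[]): CrawlStats {
  const successful = results.filter((r) => r.ok).length
  const totalDuration = results.reduce((sum, r) => sum + r.durationMs, 0)
  const totalBytes = results.reduce((sum, r) => sum + r.bytes, 0)

  return {
    successful,
    errors: results.length - successful,
    // Avoid dividing by zero when nothing has been crawled yet.
    averageDurationMs: results.length === 0 ? 0 : totalDuration / results.length,
    totalBytes,
  }
}
```

Keeping the aggregation in a pure function like this makes it straightforward to reuse for both live results and crawl history.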

🔍 Advanced Features

Intelligent Crawl Strategy

The crawler implements:

Robots.txt Verification: Automatically fetches and parses robots.txt
User-Agent Rotation: Allows setting custom user-agents for different crawl policies
Crawl Delays: Respects Crawl-delay directives to avoid overwhelming servers
Sitemaps: (Optional) Supports sitemap discovery and parsing for comprehensive crawling

Data Pipelining

The system uses an efficient data pipeline:

Robots.txt Check → HTML Fetch → Content Extraction → AI Structuring → Result Storage

Each step can be independently monitored and optimized, making the system highly maintainable.
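One way to get per-step monitoring is to run the stages through a small runner that times each one. The sketch below is a generic illustration under that assumption. `Step`, `StepTiming`, and `runPipeline` are hypothetical names, and the real stages (fetching, extraction, AI structuring, storage) would be plugged in as `run` functions.

```typescript
// Hypothetical pipeline runner for illustration; not part of YourCrawl.
export interface Step {
  name: string
  // Each step receives the previous step's output; it may be sync or async.
  run: (input: unknown) => unknown | Promise<unknown>
}

export interface StepTiming {
  step: string
  durationMs: number
}

export interface PipelineResult {
  output: unknown
  timings: StepTiming[]
}

// Run the steps in order, recording how long each one takes.
export async function runPipeline(steps: Step[], input: unknown): Promise<PipelineResult> {
  const timings: StepTiming[] = []
  let value = input

  for (const step of steps) {
    const startedAt = Date.now()
    try {
      value = await step.run(value)
    } catch (err) {
      // Name the failing step so errors can be traced to a single stage.
      const reason = err instanceof Error ? err.message : String(err)
      throw new Error(`Step "${step.name}" failed: ${reason}`)
    }
    timings.push({ step: step.name, durationMs: Date.now() - startedAt })
  }

  return { output: value, timings }
}
```

Because every stage shares the same shape, a slow or failing stage shows up by name in the timings or in the error message, and a single stage can be swapped out without touching the others.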

🧪 Testing

Run the development server:
```
npm run dev
```

🔐 Security & Compliance

Respect robots.txt: Built-in compliance with crawl directives
Rate Limiting: Implemented through crawl delay policies
User-Agent Handling: Proper user-agent identification for policy enforcement
Error Handling: Graceful error handling and retry mechanisms
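Rate limiting through crawl delays can be sketched as a per-host scheduler that spaces requests to the same host by a fixed interval. The class below is a minimal illustration under that assumption. `HostRateLimiter` is a hypothetical name, and the current time is passed in explicitly so the logic stays deterministic and easy to test.

```typescript
// Hypothetical per-host rate limiter for illustration; not part of YourCrawl.
export class HostRateLimiter {
  private readonly defaultDelayMs: number
  // Earliest timestamp (ms) at which the next request to each host may start.
  private readonly nextAllowedAt = new Map<string, number>()

  constructor(defaultDelayMs: number) {
    this.defaultDelayMs = defaultDelayMs
  }

  // Reserve the next request slot for `host` and return how long to wait (ms).
  // `delayMs` overrides the default, e.g. with a Crawl-delay from robots.txt.
  reserve(host: string, now: number, delayMs?: number): number {
    const gap = delayMs ?? this.defaultDelayMs
    const earliest = this.nextAllowedAt.get(host) ?? now
    const startAt = Math.max(earliest, now)

    // The following request to this host must wait one more gap.
    this.nextAllowedAt.set(host, startAt + gap)
    return startAt - now
  }
}
```

A crawler would call `reserve(host, Date.now())` before each fetch and sleep for the returned number of milliseconds. Requests to different hosts do not delay each other.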

🤝 Contributing

Contributions are welcome! Please feel free to submit a pull request.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

📧 Contact

Project: YourCrawl
