YourCrawl is a high-performance, enterprise-grade web crawler built with Next.js and TypeScript. It's designed to intelligently crawl, extract, and structure data from websites while respecting robots.txt, managing crawl policies, and providing actionable insights.
- Blazing Fast Performance: Optimized for high-speed data extraction with intelligent concurrency management
- AI-Powered Data Structuring: Uses Gemini AI to automatically structure messy HTML content into clean JSON
- Enterprise Compliance: Respects robots.txt, implements crawl delays, and handles crawl politeness
- Modern Dashboard: Interactive frontend with real-time stats, charts, and result visualization
- Real-Time Analytics: Visualizes crawl metrics including successful crawls, errors, average duration, and data volume
- Secure & Efficient: Type-safe TypeScript, efficient data pipelining, and streamlined architecture
- Framework: Next.js 14 (App Router)
- Language: TypeScript
- Styling: Tailwind CSS
- AI: Google Gemini
- Data Processing: Robust crawl policies and structured output processing
- Architecture: Server Actions, Server Components, and intelligent data pipelining
YourCrawl/
├── app/                  # Next.js App Router: Pages, Layouts, Routes
├── components/           # Reusable React components and UI elements
├── lib/                  # Core logic, utilities, AI integration, crawl policies
│   ├── crawl-policy.ts   # robots.txt parsing and crawl policy enforcement
│   ├── ai-parser.ts      # Gemini AI integration for data structuring
│   └── utils.ts          # Utility functions and helpers
├── public/               # Static assets
├── styles/               # Global styles and CSS
├── server/               # API endpoints and server-side logic
├── types/                # TypeScript type definitions
└── server.ts             # Application entry point
- Node.js: >= 20.x
- npm/yarn/pnpm: Package manager
- Google Gemini API Key: For AI-powered data structuring
- Clone the repository: `git clone <repository-url>`, then `cd YourCrawl`
- Install dependencies: `npm install` (or `yarn install` / `pnpm install`)
- Create a `.env.local` file in the project root: `cp .env.example .env.local`
- Add your Google Gemini API key to the `.env.local` file: `GOOGLE_API_KEY="[GCP_API_KEY]"`
- Start the development server: `npm run dev` (or `yarn dev` / `pnpm dev`)
- Open the application in your browser: http://localhost:3000
- Build the application for production: `npm run build` (or `yarn build` / `pnpm build`)
- Start the production server: `npm start` (or `yarn start` / `pnpm start`)
| Command | Description |
|---|---|
| `npm run dev` | Start development server |
| `npm run build` | Build for production |
| `npm run start` | Start production server |
| `npm run lint` | Run ESLint and TypeScript checks |
| `npm run format` | Format code with Prettier |
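// app/page.tsx
// App Router pages are Server Components by default, so no directive is needed.
// A file-level 'use server' would mark every export as a Server Action, which a page is not.
import CrawlDashboard from '@/components/CrawlDashboard'
// If CrawlDashboard is a Client Component, performCrawl must be a Server Action
// (declared in a 'use server' module) to be passed to it as a prop.
import { performCrawl } from '@/lib/crawl-engine'

export default async function HomePage() {
  return <CrawlDashboard performCrawl={performCrawl} />
}

// lib/ai-parser.ts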
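import { GoogleGenerativeAI } from '@google/generative-ai'

// Sends raw HTML to Gemini and parses the model's reply as JSON.
export async function structureDataWithAI(htmlContent: string): Promise<unknown> {
  const apiKey = process.env.GOOGLE_API_KEY
  if (!apiKey) throw new Error('GOOGLE_API_KEY is not set')
  const genAI = new GoogleGenerativeAI(apiKey)
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' })
  const prompt = `
Extract and structure the following HTML content into clean JSON:
HTML Content: ${htmlContent}
Return only the JSON object, no explanations.
`
  const result = await model.generateContent(prompt)
  // The reply may be wrapped in a Markdown code fence; strip it before parsing.
  const text = result.response.text().replace(/^\s*`{3}(?:json)?\s*|\s*`{3}\s*$/g, '')
  return JSON.parse(text)
}

// lib/crawl-policy.ts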
import { RobotsTxtFile } from '@robotstxt/robotstxt'

// Fetches and parses robots.txt for the site that hosts `url`.
export async function getCrawlPolicy(url: string): Promise<RobotsTxtFile> {
  const robotsTxtPath = new URL('/robots.txt', url).toString()
  const response = await fetch(robotsTxtPath)
  if (response.ok) {
    const text = await response.text()
    return new RobotsTxtFile(text, url)
  }
  // No readable robots.txt: fall back to an empty policy.
  return new RobotsTxtFile('', url)
}

// Checks whether `userAgent` may fetch `url` under the parsed policy.
export function isAllowed(policy: RobotsTxtFile, url: string, userAgent: string): boolean {
  return policy.isAllowed(userAgent, url)
}

The dashboard provides:
- Real-time Stats: Track successful crawls, errors, and average duration
- Analytics: Visualize crawl performance with charts
- Crawl History: View past crawl results and performance
- Live Results: Monitor and interact with ongoing crawls
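
The stats listed above could be derived from raw crawl records along these lines. This is a minimal sketch, not the dashboard's actual code: the `CrawlResult` shape, its field names, and `summarizeCrawls` are all hypothetical.

```typescript
// Hypothetical shape of one crawl record; the field names are assumptions.
interface CrawlResult {
  url: string
  ok: boolean
  durationMs: number
  bytes: number
}

interface CrawlStats {
  successful: number
  errors: number
  averageDurationMs: number
  totalBytes: number
}

// Aggregates per-crawl records into the headline numbers a dashboard would show.
function summarizeCrawls(results: CrawlResult[]): CrawlStats {
  const successful = results.filter((r) => r.ok).length
  const totalDuration = results.reduce((sum, r) => sum + r.durationMs, 0)
  const totalBytes = results.reduce((sum, r) => sum + r.bytes, 0)
  return {
    successful,
    errors: results.length - successful,
    // Guard against division by zero when no crawls have run yet.
    averageDurationMs: results.length === 0 ? 0 : totalDuration / results.length,
    totalBytes,
  }
}
```

Keeping the aggregation in a pure function like this makes it easy to reuse for both the live view and the crawl history.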
The crawler implements:
- Robots.txt Verification: Automatically fetches and parses robots.txt
- User-Agent Rotation: Allows setting custom user-agents for different crawl policies
- Crawl Delays: Respects Crawl-delay directives to avoid overwhelming servers
- Sitemaps: (Optional) Supports sitemap discovery and parsing for comprehensive crawling
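
To illustrate the crawl-delay behavior, here is a simplified sketch of reading a `Crawl-delay` value from robots.txt text and turning it into a wait time. It is a hand-rolled parser written for illustration only; `parseCrawlDelay` and `waitTimeMs` are hypothetical names, and the project's own `crawl-policy.ts` may work differently.

```typescript
// Returns the Crawl-delay (in seconds) that applies to `userAgent`, or null if none.
// A group naming the agent takes precedence over the wildcard (*) group.
function parseCrawlDelay(robotsTxt: string, userAgent: string): number | null {
  const ua = userAgent.toLowerCase()
  let groupAgents: string[] = []
  let inRules = false
  let wildcardDelay: number | null = null
  let specificDelay: number | null = null

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim() // drop comments
    const sep = line.indexOf(':')
    if (sep === -1) continue
    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      // A User-agent line after rule lines starts a new group.
      if (inRules) {
        groupAgents = []
        inRules = false
      }
      groupAgents.push(value.toLowerCase())
      continue
    }
    inRules = true
    if (field !== 'crawl-delay' || value === '') continue
    const seconds = Number(value)
    if (!Number.isFinite(seconds) || seconds < 0) continue

    const matchesAgent = groupAgents.some((a) => a !== '*' && a.length > 0 && ua.includes(a))
    if (matchesAgent) {
      if (specificDelay === null) specificDelay = seconds
    } else if (groupAgents.includes('*')) {
      if (wildcardDelay === null) wildcardDelay = seconds
    }
  }
  return specificDelay !== null ? specificDelay : wildcardDelay
}

// Milliseconds still to wait before the next request to the same host.
function waitTimeMs(lastRequestAtMs: number, nowMs: number, delaySeconds: number): number {
  return Math.max(0, lastRequestAtMs + delaySeconds * 1000 - nowMs)
}
```

A crawler would typically track `lastRequestAtMs` per host and sleep for `waitTimeMs(...)` before each fetch.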
The system uses an efficient data pipeline:
HTML Fetch → Robots.txt Check → Content Extraction → AI Structuring → Result Storage
Each step can be independently monitored and optimized, making the system highly maintainable.
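
One way to make each step independently observable is to run the stages through a small generic runner that records per-stage timings. The sketch below is an assumption about how such a pipeline might be wired, not the project's implementation; `Stage`, `StageTiming`, and `runPipeline` are hypothetical names.

```typescript
// A pipeline stage: a name for monitoring plus a sync or async transform.
interface Stage {
  name: string
  run: (input: unknown) => unknown | Promise<unknown>
}

interface StageTiming {
  name: string
  ms: number
}

// Runs the stages in order, feeding each output into the next stage
// and recording how long each stage took.
async function runPipeline(
  stages: Stage[],
  input: unknown,
): Promise<{ output: unknown; timings: StageTiming[] }> {
  const timings: StageTiming[] = []
  let current = input
  for (const stage of stages) {
    const started = Date.now()
    current = await stage.run(current)
    timings.push({ name: stage.name, ms: Date.now() - started })
  }
  return { output: current, timings }
}
```

The five steps above would each become one `Stage`, so a slow or failing step shows up by name in `timings` or in the thrown error's stack.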
Run the development server: `npm run dev`

- Respect robots.txt: Built-in compliance with crawl directives
- Rate Limiting: Implemented through crawl delay policies
- User-Agent Handling: Proper user-agent identification for policy enforcement
- Error Handling: Graceful error handling and retry mechanisms
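
As an illustration of the retry mechanisms mentioned above, the following sketch retries a failing async task with exponential backoff. The helper names (`backoffDelayMs`, `withRetry`) and the default values are assumptions chosen for the example, not settings taken from the project.

```typescript
// Delay before the next attempt: base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** (attempt - 1)
}

// Runs `task` up to `maxAttempts` times, waiting longer after each failure.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      if (attempt < maxAttempts) {
        const delay = backoffDelayMs(attempt, baseDelayMs)
        await new Promise<void>((resolve) => setTimeout(resolve, delay))
      }
    }
  }
  throw lastError
}
```

A fetch step could then be wrapped as `withRetry(() => fetchPage(url))`, where `fetchPage` stands in for whatever function performs the request.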
Contributions are welcome! Please feel free to submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Project: YourCrawl
