Transform any web interface (SPAs, dynamic dashboards, or complex content layers) into semantically structured, LLM-optimized Markdown with human-level intelligence. Outperforms FireCrawl, Jina Reader, and other paid solutions while running entirely on your local machine.
A modern Bun-powered workspace with a comprehensive scraper, a Hono API server, and MCP integration.
| Feature | SniffHunt | FireCrawl (/extract) | Jina Reader | Others |
|---|---|---|---|---|
| Cost | Free & open source | $99-799/month | Usage-based | $50-1000/month |
| Privacy | 100% local | Cloud-based | Cloud-based | Cloud-based |
| AI Intelligence | Cognitive DOM modeling | Basic extraction | Text-only | Limited |
| Interactive Content | Full UI interaction | Static only | Static only | Limited |
| LLM Optimization | Purpose-built | Generic output | Generic output | Basic |
While FireCrawl, Jina, and others use basic text extraction, SniffHunt employs cognitive modeling to understand context and semantics.
Handles complex SPAs and dynamic interfaces that cause other tools to fail completely.
Navigates tabs, modals, and dropdowns like a human user, not just scraping static HTML.
Generates markdown specifically formatted for optimal LLM consumption and context understanding.
Runs entirely in your local environment, unlike cloud-based tools that process your data externally.
What You'll Learn: How to install and configure SniffHunt, start the API server and web interface, set up MCP integration for AI tools, and see basic usage examples for each component.
Before we begin, make sure you have:
- Bun >= 1.2.15 (Install Bun)
- Google Gemini API Key (Get free key from Google AI Studio)
API Key Required: You'll need a Google Gemini API key for the AI-powered content extraction. The free tier is generous and perfect for getting started.
```bash
git clone https://github.com/mpmeetpatel/sniffhunt-scraper.git
cd sniffhunt-scraper
bun install
```

This installs all dependencies for the entire workspace, including all apps.
```bash
cp .env.example .env
```

Edit the `.env` file and add your Gemini API key:

```bash
# Required
GOOGLE_GEMINI_KEY=your_actual_api_key_here

# Optional (you can provide multiple keys for load balancing and to avoid rate limiting)
GOOGLE_GEMINI_KEY1=your_alternative_key_1
GOOGLE_GEMINI_KEY2=your_alternative_key_2
GOOGLE_GEMINI_KEY3=your_alternative_key_3

# Optional (defaults shown)
PORT=8080
MAX_RETRY_COUNT=2
RETRY_DELAY=1000
PAGE_TIMEOUT=10000
CORS_ORIGIN=*
```

Choose your preferred way to use SniffHunt:
Perfect for interactive use and web application integration.
```bash
bun run dev:server
```

The server will start on http://localhost:8080.
```bash
# In a new terminal
bun run dev:web
```

Open http://localhost:6001 in your browser for the web interface.
```bash
# Test the API
curl -X POST http://localhost:8080/scrape-sync \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "mode": "normal"}'
```

Integrate SniffHunt directly with Claude Desktop, Cursor, or other MCP-compatible AI tools.
```bash
bun run setup:mcp
```

This builds the MCP server and makes it globally available.
Add this to your MCP client configuration (e.g., Cursor, Windsurf, VSCode, Claude Desktop):
```json
{
  "mcpServers": {
    "sniffhunt-scraper": {
      "command": "npx",
      "args": ["-y", "sniffhunt-scraper-mcp-server"],
      "env": {
        "GOOGLE_GEMINI_KEY": "your-api-key-here"
      }
    }
  }
}
```

Restart your AI client and try asking:
Scrape https://anu-vue.netlify.app/guide/components/alert.html & grab the 'Outlined Alert Code snippets'
The AI will automatically use SniffHunt to extract the content!
React-based web interface with modern UI and real-time scraping capabilities.
- Complete Setup: Follow the Quick Start Guide for initial setup
- API Server Running: The web interface requires the API server
- Environment Configured: Ensure the `.env` file is properly set up
```bash
# Step 1: Start the API server (Terminal 1)
bun run dev:server

# Step 2: Start the web interface (Terminal 2)
bun run dev:web
```

Access Points:

- Web Interface: http://localhost:6001
- API Server: http://localhost:8080
Perfect for automation, scripting, and one-off extractions.
```bash
# Scrape any website
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html

# Output saved as:
# scraped.raw.md or scraped.md (name auto-generated based on mode and query)
# scraped.html

# Use normal mode for static sites
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode normal

# Use beast mode for complex sites
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets" --mode beast

# Add a semantic query for focused extraction and a custom output filename
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets" --output my-content
```

Model Context Protocol server for AI integrations.
Model Context Protocol (MCP) is a standardized way for AI applications to access external tools and data sources. SniffHunt's MCP server allows AI models to scrape and extract web content as part of their reasoning process.
MCP Benefits:
- Direct AI Integration: Use scraping within AI conversations
- Tool Calling: AI models can scrape websites automatically
- Context Enrichment: Provide real-time web data to AI models
- Standardized Interface: Works with any MCP-compatible AI client
Before setting up MCP integration, ensure you have:
- SniffHunt Installed: Complete the Quick Start Guide first
- API Key Configured: Google Gemini API key in your `.env` file
- MCP Client: Claude Desktop, Cursor, or another MCP-compatible AI tool
```bash
# Build and set up the MCP server from the root directory
bun run setup:mcp
```

This command:

- Builds the MCP server with all scraping capabilities
- Registers it globally for `npx` so any AI client can use it (locally only; nothing is published to npm or any other package registry)
- Creates the binary that MCP clients can execute

What happens internally:

- Compiles the MCP server from `apps/mcp/src/`
- Builds dependencies and scraper functionality
- Makes `sniffhunt-scraper-mcp-server` available globally
Add this configuration to your MCP client:
```json
{
  "mcpServers": {
    "sniffhunt-scraper": {
      "command": "npx",
      "args": ["-y", "sniffhunt-scraper-mcp-server"],
      "env": {
        "GOOGLE_GEMINI_KEY": "your_actual_api_key_here"
      }
    }
  }
}
```

Important Notes:

- Replace `your_actual_api_key_here` with your real Google Gemini API key
- Environment variables are passed directly to the MCP server process
After adding the configuration:
- Close your AI client completely
- Restart the application
- Verify the MCP server is loaded (look for SniffHunt tools in your AI client)
Your AI client should now have access to SniffHunt scraping capabilities. Test by asking:
Try These Examples:
- Can you scrape https://news.ycombinator.com and summarize the top stories?
- Can you scrape https://anu-vue.netlify.app/guide/components/alert.html and grab code snippets for outlined alerts?
The AI will automatically use SniffHunt to fetch and process the content!
Scrape and extract content from any website.
Parameters:
- `url` (required): Target URL to scrape
- `mode` (optional): `normal` or `beast` (default: `beast`)
- `userQuery` (optional): Natural language description of desired content
Example Usage in AI Chat:
```
User: "Can you scrape https://news.ycombinator.com and get the top 5 stories?"
AI: I'll scrape Hacker News for you and extract the top stories.
[Uses scrape_website tool with url="https://news.ycombinator.com" and userQuery="top 5 stories"]
```
The MCP tool returns data in the standard MCP format. The actual response structure:
```json
{
  "content": [
    {
      "type": "text",
      "text": {
        "success": true,
        "url": "https://example.com",
        "mode": "beast",
        "processingTime": 2.34,
        "markdownLength": 12450,
        "htmlLength": 45230,
        "hasEnhancedError": false,
        "enhancedErrorMessage": null,
        "markdown": "# Page Title\\n\\nExtracted content in markdown format...",
        "html": "<html>Raw HTML content...</html>"
      }
    }
  ]
}
```

Response Fields:

- `success`: Boolean indicating if scraping was successful
- `url`: The scraped URL
- `mode`: Scraping mode used (`normal` or `beast`)
- `processingTime`: Time taken for scraping, in seconds
- `markdownLength`: Length of the extracted markdown content
- `htmlLength`: Length of the raw HTML content
- `hasEnhancedError`: Boolean indicating if enhanced error info is available
- `enhancedErrorMessage`: Human-readable error message (if any)
- `markdown`: Cleaned, structured content in markdown format
- `html`: Raw HTML content from the page
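As an illustration, a client could unwrap the markdown from this response shape as follows. This is a minimal sketch (the `extractMarkdown` helper is ours, not part of SniffHunt), and it hedges on whether `text` arrives as an object or a serialized JSON string, since MCP clients differ:

```javascript
// Sketch (ours): unwrap the markdown payload from a SniffHunt MCP tool
// response shaped like the example above.
function extractMarkdown(response) {
  const item = response.content?.find((entry) => entry.type === "text");
  if (!item) throw new Error("no text content in MCP response");
  // Depending on the client, `text` may arrive as an object or a JSON string.
  const payload = typeof item.text === "string" ? JSON.parse(item.text) : item.text;
  if (!payload.success) {
    throw new Error(payload.enhancedErrorMessage ?? "scrape failed");
  }
  return payload.markdown;
}

// Example with the documented response shape:
const sample = {
  content: [
    {
      type: "text",
      text: { success: true, url: "https://example.com", mode: "beast", markdown: "# Page Title" },
    },
  ],
};
console.log(extractMarkdown(sample)); // "# Page Title"
```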
Hono-based API server with streaming and sync endpoints.
Before starting the server, ensure you have:
- Environment Setup: A `.env` file in the root directory with your Google Gemini API key
- Dependencies Installed: Run `bun install` from the root directory
```bash
# Start from root directory (automatically loads .env)
bun run dev:server
```

Benefits:

- Automatically loads environment variables from the root `.env`
- Consistent with other workspace commands
- No need to navigate to subdirectories
```bash
# Alternative: Start from server directory
cd apps/server
bun dev
```

The server will start on http://localhost:8080 by default.
```bash
# Health check
curl http://localhost:8080/health
```

Expected response:

```json
{
  "status": "healthy",
  "service": "SniffHunt Scraper API",
  "version": "1.0.0",
  "timestamp": "xxxxx"
}
```

Returns API health status and configuration validation.
Response:

```json
{
  "status": "healthy",
  "service": "SniffHunt Scraper API",
  "version": "1.0.0",
  "timestamp": "xxxxx"
}
```

Real-time streaming extraction with progress updates.
Request Body:

```json
{
  "url": "https://anu-vue.netlify.app/guide/components/alert.html",
  "mode": "normal" | "beast",
  "query": "natural language content description"
}
```

Example:

```bash
curl -N http://localhost:8080/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://anu-vue.netlify.app/guide/components/alert.html", "mode": "beast"}'
```

Response: Server-Sent Events (SSE) stream with real-time updates.
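A client can consume the stream with `fetch` and a small SSE line parser. The sketch below is ours, not part of SniffHunt: `parseSSE` only implements the standard SSE framing (`data:` lines separated by blank lines), and the commented usage assumes the server is running locally; adapt it to the actual event payloads the server emits.

```javascript
// Minimal SSE parser sketch (ours): splits a raw stream buffer into the
// payloads of `data:` lines, one entry per event.
function parseSSE(buffer) {
  return buffer
    .split("\n\n")                        // events are separated by blank lines
    .flatMap((event) => event.split("\n"))
    .filter((line) => line.startsWith("data:"))
    .map((line) => line.slice(5).trim()); // drop the "data:" prefix
}

// Hypothetical usage against the streaming endpoint (server must be running):
// const res = await fetch("http://localhost:8080/scrape", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({ url: "https://example.com", mode: "beast" }),
// });
// for (const payload of parseSSE(await res.text())) console.log(payload);

console.log(parseSSE("data: loading\n\ndata: extracting\n\n")); // ["loading", "extracting"]
```

A production client would read the body incrementally with `res.body.getReader()` rather than buffering the whole stream; the parser above is deliberately minimal.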
Standard synchronous extraction for simple integrations.
Request Body:

```json
{
  "url": "https://anu-vue.netlify.app/guide/components/alert.html",
  "mode": "normal" | "beast",
  "query": "natural language content description"
}
```

Parameters:

- `url` (required): Target URL for content extraction
- `mode` (optional): Extraction strategy
  - `normal`: Standard content extraction (default)
  - `beast`: Interactive interface handling with AI intelligence
- `query` (optional): Natural language description for semantic filtering
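The same request can be issued programmatically. The sketch below is an illustration (the `buildScrapeRequest` wrapper is ours, not part of SniffHunt); it assembles the documented body fields for `/scrape-sync`:

```javascript
// Illustrative helper (ours): build fetch options for the documented
// /scrape-sync request body: url (required), mode and query (optional).
function buildScrapeRequest(url, mode = "normal", query) {
  const body = { url, mode };
  if (query) body.query = query; // optional semantic filter
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Hypothetical usage (requires the API server running locally):
// const res = await fetch("http://localhost:8080/scrape-sync",
//   buildScrapeRequest("https://example.com", "beast", "grab code snippets"));
// const { success, content, metadata } = await res.json();
// if (success) console.log(metadata.title, content.length);

const req = buildScrapeRequest("https://example.com", "beast");
console.log(req.method, JSON.parse(req.body).mode); // POST beast
```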
Response Format:

```json
{
  "success": true,
  "content": "# Extracted Content\n\nMarkdown-formatted content here...",
  "metadata": {
    "title": "Page Title",
    "url": "https://anu-vue.netlify.app/guide/components/alert.html",
    "mode": "beast",
    "extractionTime": 3.2,
    "contentLength": 15420
  }
}
```

Normal mode:

- Best for: Static content, blogs, documentation
- Performance: Fast extraction
- Capabilities: Basic content extraction (still better than paid services, even in normal mode)
Beast mode:

- Best for: SPAs, dynamic dashboards, interactive interfaces
- Performance: Intelligent extraction with AI processing
- Capabilities:
  - UI interaction (clicks, scrolls, navigation)
  - Modal and popup handling
  - Dynamic content loading
  - Semantic content understanding
Mode Selection: Use `normal` mode for standard websites and `beast` mode for complex web applications that require interaction or have dynamic runtime content.
Use the `query` parameter to extract specific content:
```bash
# Extract Avatar code snippets
curl -X POST http://localhost:8080/scrape-sync \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://anu-vue.netlify.app/guide/components/avatar.html",
    "mode": "beast",
    "query": "Grab the Avatar Code snippets"
  }'

# Extract API reference and code examples
curl -X POST http://localhost:8080/scrape-sync \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://anu-vue.netlify.app/guide/components/alert.html",
    "mode": "normal",
    "query": "Grab API reference and code examples"
  }'
```

Command-line interface for direct web scraping operations.
Before starting the CLI scraper, ensure you have:
- Environment Setup: A `.env` file in the root directory with your Google Gemini API key
- Dependencies Installed: Run `bun install` from the root directory
```bash
# Recommended: Run from root directory
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html

# Alternative: Run from scraper directory
cd apps/scraper
bun cli.js https://anu-vue.netlify.app/guide/components/alert.html
```

```bash
# Scrape a basic website
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html

# Output will be saved to a markdown file
# Output: example-com-20240115-143022.md

# Scrape with beast mode for complex sites
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode beast

# Scrape with a custom query for semantic filtering
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets"

# Combine options
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode beast --query "Grab the Outlined Alert Code snippets" --output custom-name
```

The target URL to scrape. Must be the first argument.
```bash
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html
```

Choose the scraping strategy:

- `normal` (default): Fast extraction for static content
- `beast`: AI-powered extraction for interactive content
```bash
# Normal mode (default)
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode normal

# Beast mode for SPAs and dynamic content
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --mode beast
```

Natural language description of desired content for semantic filtering.
```bash
# Extract specific content
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets"
```

Specify a custom output filename.
```bash
# Custom filename
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --output my-content
```

Display help information and available options.
```bash
bun run cli:scraper --help

# Full syntax
bun run cli:scraper <URL> [OPTIONS]

# Options:
#   -m, --mode <mode>     Scraping mode: normal|beast (default: normal)
#   -q, --query <query>   Natural language content filter
#   -o, --output <file>   Output filename (default: auto-generated)
#   -h, --help            Show help information
```

React-based web interface with modern UI and real-time scraping.
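For batch jobs, the documented flags can be assembled programmatically. The sketch below is ours (not part of SniffHunt): `cliArgs` builds an argv array from the options listed above, and the commented usage assumes a configured workspace with Bun available.

```javascript
// Sketch (ours): build an argv array for the SniffHunt CLI from the
// documented options, e.g. for batch scraping several URLs.
function cliArgs(url, { mode, query, output } = {}) {
  const args = ["run", "cli:scraper", url];
  if (mode) args.push("--mode", mode);     // normal | beast
  if (query) args.push("--query", query);  // semantic filter
  if (output) args.push("--output", output);
  return args;
}

// Hypothetical batch usage (requires the workspace set up):
// for (const url of urls) Bun.spawnSync(["bun", ...cliArgs(url, { mode: "beast" })]);

console.log(cliArgs("https://example.com", { mode: "beast", output: "out" }));
// ["run", "cli:scraper", "https://example.com", "--mode", "beast", "--output", "out"]
```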
Before starting the web interface:
- Complete Setup: Follow the Quick Start Guide for initial setup
- API Server Running: The web interface requires the API server
- Environment Configured: Ensure the `.env` file is properly set up
```bash
# Step 1: Start the API server (Terminal 1)
bun run dev:server

# Step 2: Start the web interface (Terminal 2)
bun run dev:web
```

Access Points:

- Web Interface: http://localhost:6001
- API Server: http://localhost:8080
Benefits:
- Consistent workspace environment
- Automatic environment variable loading
- Coordinated development setup
```bash
# Alternative: Start from web directory
cd apps/web
bun dev
```

Note: This method also loads the root `.env` file automatically.
- API Server Health: Visit http://localhost:8080/health
- Web Interface: Open http://localhost:6001 in your browser
- Test Scraping: Try scraping https://example.com with normal mode
Type or paste the website URL you want to scrape (for example):
https://anu-vue.netlify.app/guide/components/alert.html
Normal Mode - For standard websites
Beast Mode - For complex applications
Use natural language to specify what content you want (for example):
- "Grab code snippets & API Reference"
Click the "Extract Content" button and watch real-time progress:
- Connecting: Establishing a connection to the target site
- Loading: Page loading and rendering
- Analyzing: AI-powered content understanding (Beast mode)
- Extracting: Converting to markdown format
- Complete: Content ready for use
```bash
# Check if API server is running
curl http://localhost:8080/health
```

Should return:

```json
{
  "status": "healthy",
  "service": "SniffHunt Scraper API",
  "version": "1.0.0"
}
```

```bash
curl -X POST http://localhost:8080/scrape-sync \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://anu-vue.netlify.app/guide/components/alert.html",
    "mode": "normal",
    "query": "Grab the Outlined Alert Code snippets"
  }'
```

```bash
bun run cli:scraper https://anu-vue.netlify.app/guide/components/alert.html --query "Grab the Outlined Alert Code snippets"
```

- Open http://localhost:6001
- Enter URL: https://anu-vue.netlify.app/guide/components/alert.html
- Select mode: "Normal"
- Add query: "Grab the Outlined Alert Code snippets"
- Click "Extract Content"
Error: "API key not configured"

Solution: Ensure `GOOGLE_GEMINI_KEY` is set in your `.env` file with a valid Gemini API key.

Error: "Port 8080 already in use"

Solution: Close any other process that is using port 8080, or change the port in your `.env` file (pick a free port; note that the web interface already uses 6001):

```bash
PORT=8081
```

Error: "Browser not found"

Solution: Install the Playwright browser dependencies (run this command from the root directory of the project):

```bash
cd apps/scraper && bunx playwright-core install --with-deps --only-shell chromium
```

- Content Research: Extract structured data from any website
- AI Workflows: Provide real-time web content to LLM applications
- Data Mining: Automated content extraction for analysis
- Documentation: Convert web content to markdown for documentation
- API Integration: RESTful endpoints for programmatic access
- Bug Reports: GitHub Issues
- Discussions: GitHub Discussions
- Direct Support: For enterprise integrations and custom requirements
- Star the Repository: GitHub
- Upvote on Peerlist:
- Support Development: Buy Me Coffee
- Personal Use: Free
- Commercial Use: Contact for licensing
- Code redistribution/reselling: Not permitted
- License: See the LICENSE file for details.
Privacy & Compliance: SniffHunt is a true privacy-first solution that runs entirely on your infrastructure, ensuring your data never leaves your control.