Lead Finder — a lightweight web scraper for discovering potential leads (researchers, professionals, and organizations). Fast, configurable, and designed for streaming results to a web frontend.
How it works — at a glance:
- 🔍 Querying search engines (DuckDuckGo via the `ddgs` library) to discover candidate pages (see the sketch after this list)
- 🕷️ Crawling internal links with Scrapy and Playwright to extract lead details
- 🌐 Serving a frontend built with React + TypeScript and a backend powered by FastAPI
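For a concrete picture of the discovery step, here is a minimal sketch using the `ddgs` package. The `discover_urls` helper is illustrative, not code from this repo, and the result keys may vary by library version:

```python
# Minimal discovery sketch: query DuckDuckGo via ddgs and collect candidate URLs.
from ddgs import DDGS

def discover_urls(query: str, max_results: int = 10) -> list[str]:
    """Return candidate page URLs for a search query (illustrative helper)."""
    hits = DDGS().text(query, max_results=max_results)
    # Each hit is typically a dict with "title", "href", and "body" keys.
    return [hit["href"] for hit in hits]

if __name__ == "__main__":
    for url in discover_urls("machine learning researcher profile"):
        print(url)
```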
Lead Finder searches for potential leads (researchers, professionals, and organizations) across academic and professional platforms. It has two main components: a backend service for scraping and processing, and a frontend UI for interacting with results.
### Backend

- Purpose: Handles web scraping, crawling, and lead processing
- Key Features:
- Searches DuckDuckGo for initial results
- Optional Scrapy crawling for one-off page extraction
- Optional deep Playwright crawling for in-depth site exploration (e.g., LinkedIn, ResearchGate)
- Lead enrichment and processing
- Server-Sent Events (SSE) for real-time progress streaming (see the sketch after this list)
- Google Sheets export integration
- OAuth authentication for Google services
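The SSE endpoint might look roughly like the following sketch. The `/scrape/stream` path appears in the file overview below, but the phase names and event payloads are assumptions:

```python
# Hedged sketch of an SSE progress endpoint using FastAPI's StreamingResponse.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def progress_events(query: str):
    """Yield one SSE frame per pipeline phase: a 'data: <json>' line plus a blank line."""
    phases = ["searching", "crawling", "processing", "done"]  # assumed names
    for i, phase in enumerate(phases, start=1):
        await asyncio.sleep(0)  # stand-in for real search/crawl work
        payload = {"phase": phase, "progress": i / len(phases), "query": query}
        yield f"data: {json.dumps(payload)}\n\n"

@app.get("/scrape/stream")
async def scrape_stream(q: str):
    return StreamingResponse(progress_events(q), media_type="text/event-stream")
```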
### Frontend

- Purpose: User interface for searching, displaying results, and managing leads
- Key Features:
- Real-time progress updates via SSE
- Results table with filtering
- CSV and Google Sheets export
- Settings for search domains and result limits
- Responsive design
### Workflow

- 👤 User enters a search query in the frontend
- 🔍 Backend performs DuckDuckGo search to get initial URLs
- 🕷️ Optional crawling phases: Scrapy for quick extraction, Playwright for deep crawling
- ⚙️ Leads are processed and enriched with contact information
- 📊 Results are streamed back to the frontend via SSE (example below)
- 📤 Users can export results to CSV or Google Sheets
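To watch the stream described above without running the frontend, you can consume it directly. A sketch using `httpx`; the `q` parameter name and payload shape are assumptions:

```python
# Smoke-test the SSE stream from the command line using httpx.
import json

import httpx

with httpx.stream(
    "GET",
    "http://localhost:8000/scrape/stream",
    params={"q": "biomedical researcher"},  # parameter name is an assumption
    timeout=None,  # the stream stays open while the pipeline runs
) as resp:
    for line in resp.iter_lines():
        if line.startswith("data: "):
            print(json.loads(line[len("data: "):]))
```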
### Tech Stack

- Backend: Python, FastAPI, Scrapy, Playwright, DuckDuckGo
- Frontend: React, TypeScript, CSS
- Deployment: Render (backend); Vercel / Netlify (frontend)
### Key Backend Files

- `main.py`: FastAPI app with the SSE `/scrape/stream`, `/scrape`, and `/process` endpoints
- `src/handlers/handle.py`: Core orchestration for scraping and processing
- `src/utils/duck.py`: DuckDuckGo search integration
- `src/utils/scrapy_ok.py`: Scrapy spider for page extraction
- `src/utils/playwright_deep.py`: Deep site crawling with Playwright
- `src/utils/profile.py`: Heuristics for detecting profile-like URLs (sketched below)
- `src/handlers/auth_google.py` & `src/handlers/google_export.py`: Google OAuth and Sheets export
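The actual heuristics in `src/utils/profile.py` are not reproduced here; as a rough sketch, profile detection might match URL paths against known profile-page shapes. The patterns below are illustrative:

```python
# Illustrative profile-URL heuristic: match URL paths against common
# profile-page shapes on professional and academic sites.
import re
from urllib.parse import urlparse

PROFILE_PATH_PATTERNS = [
    r"^/in/[^/]+",    # linkedin.com/in/<handle>
    r"^/profile/",    # researchgate.net/profile/<name>
    r"/authors?/",    # publisher author pages
    r"^/people/",     # institutional directories
]

def looks_like_profile(url: str) -> bool:
    """Return True if the URL path matches a known profile-page shape."""
    path = urlparse(url).path
    return any(re.search(pattern, path) for pattern in PROFILE_PATH_PATTERNS)
```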
### Key Frontend Files

- `src/LeadFinder.tsx`: Main component with SSE client and UI state
- `src/components/ResultsTable.tsx`: Results display and processing triggers
- `src/components/SearchBar.tsx`: Search input and action buttons
- `src/components/ProgressBar.tsx`: Progress visualization
### Running Locally

Backend:

```bash
cd backend
pip install -r requirements.txt
python -m main  # Runs on port 8000
```

Frontend:

```bash
cd frontend
npm install   # or bun install
npm run dev   # Runs on port 3000, expects backend on 8000
```

Tests:

```bash
pytest  # From the repo root, runs backend and frontend tests
```

### Configuration

Environment variables (see the sketch below the list):

- `USE_PLAYWRIGHT_DEEP`: Enable/disable deep crawling
- `ALLOW_LINKEDIN_DEEP`: Allow LinkedIn deep crawling
- `CRAWL_TIMEOUT`: Timeout for crawling operations
- `DEEP_TIMEOUT_S`: Timeout for deep crawling, in seconds
- `DEEP_MAX_PAGES`: Maximum number of pages to crawl deeply
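A minimal sketch of how these variables might be read; the defaults and units here are assumptions, not the repo's actual values:

```python
# Illustrative configuration loading; defaults are assumptions.
import os

USE_PLAYWRIGHT_DEEP = os.getenv("USE_PLAYWRIGHT_DEEP", "false").lower() == "true"
ALLOW_LINKEDIN_DEEP = os.getenv("ALLOW_LINKEDIN_DEEP", "false").lower() == "true"
CRAWL_TIMEOUT = float(os.getenv("CRAWL_TIMEOUT", "15"))    # assumed seconds
DEEP_TIMEOUT_S = float(os.getenv("DEEP_TIMEOUT_S", "60"))  # seconds
DEEP_MAX_PAGES = int(os.getenv("DEEP_MAX_PAGES", "10"))
```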
### Supported Domains

- 🧾 `pubmed` — Academic publications
- 💼 `linkedin` — Professional profiles
- ➕ Custom domains can be added via settings
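Domain settings plausibly translate into DuckDuckGo `site:` filters; a hedged sketch, where the domain-to-host mapping is an assumption:

```python
# Illustrative query builder: restrict a search to the selected domains
# using DuckDuckGo's site: operator.
DOMAIN_HOSTS = {
    "pubmed": "pubmed.ncbi.nlm.nih.gov",
    "linkedin": "linkedin.com",
}

def build_query(query: str, domains: list[str]) -> str:
    """Append OR-ed site: filters for each selected domain."""
    sites = [f"site:{DOMAIN_HOSTS[d]}" for d in domains if d in DOMAIN_HOSTS]
    return f"{query} ({' OR '.join(sites)})" if sites else query

# build_query("cancer genomics", ["pubmed"])
# -> 'cancer genomics (site:pubmed.ncbi.nlm.nih.gov)'
```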
### Notes

- Playwright is lazy-loaded to avoid import-time failures and reduce startup friction (see the sketch after this list)
- Defensive error handling is used for all external dependencies and network operations
- Server-side heuristics prioritize profile-like URLs to improve result quality
- Google Sheets export uses OAuth for secure access
- The crawler avoids sites that explicitly block automation by default
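The lazy-loading note above might translate into a pattern like this; the function names are illustrative, and only the import-on-first-use idea is the point:

```python
# Lazy-loading sketch: import Playwright only when deep crawling runs,
# so a missing install degrades gracefully instead of breaking startup.
def get_playwright():
    """Return Playwright's async launcher, or None if it isn't installed."""
    try:
        from playwright.async_api import async_playwright
    except ImportError:
        return None
    return async_playwright

async def deep_crawl(url: str) -> str | None:
    launcher = get_playwright()
    if launcher is None:
        return None  # deep crawling is simply skipped
    async with launcher() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html
```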
### Deployment

- The backend is deployed to Render: https://leads-59wq.onrender.com
- The frontend can be deployed to Vercel, Netlify, or similar platforms
- Ensure CORS and environment variables are configured appropriately for production
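For the CORS part, a minimal FastAPI setup might look like this; the origin below is a placeholder for your deployed frontend URL:

```python
# Minimal CORS configuration sketch for the FastAPI app.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.vercel.app"],  # placeholder origin
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```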