Lead Finder

Lead Finder — a lightweight web scraper for discovering potential leads (researchers, professionals, and organizations). Fast, configurable, and designed for streaming results to a web frontend.

How It Works

At a glance:

  1. 🔍 Querying search engines (DuckDuckGo via the ddgs library) to discover candidate pages (sketched after this list)
  2. 🕷️ Crawling internal links with Scrapy and Playwright to extract lead details
  3. 🌐 Serving a frontend built with React + TypeScript and a backend powered by FastAPI
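
The discovery step can be reproduced in a few lines with ddgs. A minimal sketch — the query string and result limit here are illustrative, not the project's actual defaults:

from ddgs import DDGS

# Each result is a dict, typically with "title", "href", and "body" keys.
results = DDGS().text("machine learning researcher site:researchgate.net", max_results=10)
for r in results:
    print(r["href"], "-", r["title"])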

Project Overview

Lead Finder searches for potential leads (researchers, professionals, and organizations) across academic and professional platforms. It has two main components: a backend service for scraping and processing, and a frontend UI for interacting with results.

Backend — FastAPI (Server) 🔧

  • Purpose: Handles web scraping, crawling, and lead processing
  • Key Features:
    • Searches DuckDuckGo for initial results
    • Optional Scrapy crawling for one-off page extraction
    • Optional deep Playwright crawling for in-depth site exploration (e.g., LinkedIn, ResearchGate)
    • Lead enrichment and processing
    • Server-Sent Events (SSE) for real-time progress streaming (see the sketch after this list)
    • Google Sheets export integration
    • OAuth authentication for Google services
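
The SSE progress streaming can be served with a plain StreamingResponse. Below is a minimal sketch of an endpoint shaped like /scrape/stream; the pipeline stages and event payloads are illustrative, not the actual implementation:

import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def scrape_events(query: str):
    # Illustrative stages; the real pipeline emits search, crawl,
    # and enrichment progress as it happens.
    for stage in ("searching", "crawling", "processing", "done"):
        payload = json.dumps({"stage": stage, "query": query})
        yield f"data: {payload}\n\n"  # one SSE frame: "data: ..." plus a blank line
        await asyncio.sleep(0.1)

@app.get("/scrape/stream")
async def scrape_stream(q: str):
    return StreamingResponse(scrape_events(q), media_type="text/event-stream")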

Frontend — React + TypeScript (Client) 🎯

  • Purpose: User interface for searching, displaying results, and managing leads
  • Key Features:
    • Real-time progress updates via SSE
    • Results table with filtering
    • CSV and Google Sheets export
    • Settings for search domains and result limits
    • Responsive design

Architecture

Data Flow

  1. 👤 User enters a search query in the frontend
  2. 🔍 Backend performs DuckDuckGo search to get initial URLs
  3. 🕷️ Optional crawling phases: Scrapy for quick extraction, Playwright for deep crawling
  4. ⚙️ Leads are processed and enriched with contact information
  5. 📊 Results are streamed back to the frontend via SSE (see the client sketch below)
  6. 📤 Users can export results to CSV or Google Sheets
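
Any SSE-capable client can consume the stream from step 5; here it is sketched in Python with httpx (the base URL and the "q" query parameter name are assumptions for illustration):

import httpx

with httpx.stream("GET", "http://localhost:8000/scrape/stream",
                  params={"q": "biotech researchers"}, timeout=None) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            print(line[len("data: "):])  # one JSON progress event per SSE frame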

Key Technologies

  • Backend: Python, FastAPI, Scrapy, Playwright, ddgs (DuckDuckGo search)
  • Frontend: React, TypeScript, CSS
  • Deployment: Render (backend); Vercel / Netlify (frontend)

Key Files & Responsibilities

Backend

  • main.py: FastAPI app with SSE /scrape/stream, /scrape, and /process endpoints
  • src/handlers/handle.py: Core orchestration for scraping and processing
  • src/utils/duck.py: DuckDuckGo search integration
  • src/utils/scrapy_ok.py: Scrapy spider for page extraction
  • src/utils/playwright_deep.py: Deep site crawling with Playwright
  • src/utils/profile.py: Heuristics for detecting profile-like URLs (sketched after this list)
  • src/handlers/auth_google.py & google_export.py: Google OAuth and Sheets export
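
For a sense of what the profile heuristics might look like, here is an illustrative sketch — the marker list is hypothetical, not the actual contents of src/utils/profile.py:

from urllib.parse import urlparse

# Path segments that often indicate a person's profile page (hypothetical list).
PROFILE_MARKERS = ("/in/", "/profile/", "/people/", "/author/", "/researcher/")

def looks_like_profile(url: str) -> bool:
    path = urlparse(url).path.lower()
    return any(marker in path for marker in PROFILE_MARKERS)

assert looks_like_profile("https://www.linkedin.com/in/jane-doe")
assert not looks_like_profile("https://www.linkedin.com/jobs")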

Frontend

  • src/LeadFinder.tsx: Main component with SSE client and UI state
  • src/components/ResultsTable.tsx: Results display and processing triggers
  • src/components/SearchBar.tsx: Search input and action buttons
  • src/components/ProgressBar.tsx: Progress visualization

Development Setup

Backend

cd backend
pip install -r requirements.txt
playwright install chromium  # only needed for the optional Playwright deep crawl
python -m main  # Runs on port 8000

Frontend

cd frontend
npm install  # or bun install
npm run dev  # Runs on port 3000, expects backend on 8000

Testing

pytest  # From repo root, runs backend and frontend tests

Configuration

Environment Variables

  • USE_PLAYWRIGHT_DEEP: Enable/disable deep crawling
  • ALLOW_LINKEDIN_DEEP: Allow LinkedIn deep crawling
  • CRAWL_TIMEOUT: Timeout for crawling operations
  • DEEP_TIMEOUT_S: Timeout in seconds for deep crawling
  • DEEP_MAX_PAGES: Maximum pages to crawl deeply
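
A typical pattern for reading these variables with safe fallbacks — the default values below are placeholders, not the project's actual defaults:

import os

USE_PLAYWRIGHT_DEEP = os.getenv("USE_PLAYWRIGHT_DEEP", "false").lower() == "true"
ALLOW_LINKEDIN_DEEP = os.getenv("ALLOW_LINKEDIN_DEEP", "false").lower() == "true"
CRAWL_TIMEOUT = float(os.getenv("CRAWL_TIMEOUT", "30"))    # placeholder default
DEEP_TIMEOUT_S = float(os.getenv("DEEP_TIMEOUT_S", "60"))  # placeholder default
DEEP_MAX_PAGES = int(os.getenv("DEEP_MAX_PAGES", "10"))    # placeholder default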

Search Domains

  • 🧾 pubmed — Academic publications
  • 💼 linkedin — Professional profiles
  • ➕ Custom domains can be added via settings

Security & Best Practices

  • Playwright is lazy-loaded to avoid import-time failures and reduce startup friction (see the sketch after this list)
  • Defensive error handling is used for all external dependencies and network operations
  • Server-side heuristics prioritize profile-like URLs to improve result quality
  • Google Sheets export uses OAuth for secure access
  • The crawler avoids sites that explicitly block automation by default
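
Lazy-loading Playwright usually means deferring the import until a deep crawl is actually requested, so the API server starts even if browser binaries are missing. A common pattern, illustrative rather than the repository's exact code:

def run_deep_crawl(url: str) -> str:
    # Importing inside the function keeps Playwright optional at startup.
    try:
        from playwright.sync_api import sync_playwright
    except ImportError:
        raise RuntimeError("Playwright not installed; deep crawl unavailable")

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html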

Deployment

  • The backend is deployed to Render: https://leads-59wq.onrender.com
  • The frontend can be deployed to Vercel, Netlify, or similar platforms
  • Ensure CORS and environment variables are configured appropriately for production
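
A minimal CORS setup for the FastAPI backend (the frontend origin below is a placeholder):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.vercel.app"],  # placeholder origin
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)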
