A self-hosted microservice application for tracking automated job discovery, contact enrichment, and cold email generation.
Python 3.11 FastAPI Scrapy Docker
I built this because managing engineering applications at product-based startups manually was slow and unscalable. As a third-year student approaching placement season, I needed a way to find relevant roles across multiple scattered platforms, uncover who the actual hiring managers were, and draft personalized outreach emails without spending hours on repetitive clicking. This system automates the top of the job-hunt funnel so I can focus on interviewing.
Live demo: https://arachnode.vercel.app/
Video walkthrough: https://youtu.be/GiibnmC7kiY
The dashboard provides a unified view of the entire pipeline: active job listings filtered by status and tech stack on the main screen, a tabular view of all discovered contacts with their verification status, and an email generation pane where draft outreach templates are queued. Four headline metrics — total jobs, applied jobs, discovered contacts, and drafted emails — sit at the top.
- Crawls tech-focused directories (Wellfound, YC Jobs, Remotive) using Scrapy.
- Scrapes structured job platforms (Naukri, LinkedIn, Internshala) via Playwright browser automation.
- Normalizes job events and aggregates them into a central PostgreSQL database.
- Deduplicates job postings based on normalized company and role.
- Queries the GitHub API and uses OSINT techniques to find technical stakeholders and recruiters for specific companies.
- Validates discovered email addresses via SMTP checks.
- Generates context-aware cold emails using locally hosted LLM models (Ollama/Mistral) and Jinja2 templates.
- Enqueues draft outbound emails for manual review and Gmail sending.
- Proxies all backend services through a unified API gateway.
- Visualizes job funnel metrics via an interactive, single-page HTML dashboard.
- Maintains atomic state tracking for each job application (new, applied, ignored).
- Runs the entire discovery-to-draft pipeline in the background using APScheduler cron jobs.
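The company-plus-role deduplication above can be sketched roughly as follows. This is a minimal, illustrative version, not the actual pipeline code — the field names (`company`, `role`) are assumptions based on the rest of this README:

```python
import hashlib

def dedup_key(job: dict) -> str:
    """Build a stable key from normalized company + role.

    Lower-casing and whitespace collapsing make 'Supabase  ' and
    'supabase' hash to the same key.
    """
    company = " ".join(job.get("company", "").lower().split())
    role = " ".join(job.get("role", "").lower().split())
    return hashlib.sha256(f"{company}|{role}".encode()).hexdigest()

def dedupe(jobs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (company, role) pair."""
    seen: set[str] = set()
    unique = []
    for job in jobs:
        key = dedup_key(job)
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```

In the real service the same key would typically back a unique index in Postgres, so duplicates are rejected at insert time rather than in application code.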
The system uses an event-driven microservice architecture with Python across the stack. I chose Redis Streams over a heavier broker like Kafka because it provides sufficient consumer group semantics with minimal operational overhead for a self-hosted personal tool. The loosely coupled services allow background crawling and contact discovery to scale and fail independently without blocking the API gateway or the web dashboard.
┌────────────────────────┐
┌─────────────────────┐ Jobs via REST POST │ │
│ ├──────────────────────────►│ Gateway Service │
│ Scheduler Service │ │ (:8080) │
│ (APScheduler) │ Trigger operations │ │
└──────────┬──────────┘ └──────┬────────┬────────┘
│ │ │
│ Trigger spiders via POST REST │ │ REST
│ │ │
┌──────────▼──────────┐ ┌──────▼────────▼────────┐
│ │ Jobs via POST │ │
│ Platform Scraper ├──────────────────────────►│ Aggregator Service │
│ (:8001) │ │ (:8000) │
└─────────────────────┘ └──────┬─────────────────┘
│ ▲
┌─────────────────────┐ Jobs via Stream │ │ Store
│ │ ┌──────▼────────▼────────┐
│ Crawler Service ├───────[ Redis ]──────────►│ PostgreSQL │
│ (Scrapy spider) │ Stream │ Database DB │
└─────────────────────┘ └──────┬────────┬────────┘
│ │
┌─────────────────────┐ ┌──────▼────────▼────────┐
│ │ Trigger via REST POST │ │
│ Email Service ◄───────────────────────────┤ Contact Discovery │
│ (:8003) │ │ (:8002) │
└─────────────────────┘ └────────────────────────┘
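The crawler-to-aggregator hand-off over Redis Streams can be sketched like this with redis-py consumer groups. The stream name, group name, and payload shape are illustrative assumptions, not the project's actual identifiers:

```python
import json

STREAM = "jobs"        # assumed stream name
GROUP = "aggregator"   # assumed consumer group

def encode_job(job: dict) -> dict:
    """Flatten a job dict into the string fields a Redis stream entry stores."""
    return {"payload": json.dumps(job)}

def decode_entry(fields: dict) -> dict:
    """Reverse of encode_job; redis-py returns bytes keys/values by default."""
    raw = fields.get(b"payload", fields.get("payload"))
    if isinstance(raw, bytes):
        raw = raw.decode()
    return json.loads(raw)

def consume(host: str = "localhost") -> None:
    """Blocking consumer loop (requires a running Redis; sketch only)."""
    import redis  # imported lazily so the pure helpers above need no server
    r = redis.Redis(host=host)
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists
    while True:
        resp = r.xreadgroup(GROUP, "worker-1", {STREAM: ">"}, count=10, block=5000)
        for _stream, entries in resp:
            for entry_id, fields in entries:
                job = decode_entry(fields)
                # ... dedupe + INSERT into Postgres would go here ...
                r.xack(STREAM, GROUP, entry_id)
```

Consumer groups give the aggregator at-least-once delivery with explicit `XACK`, which is the "sufficient consumer group semantics" trade-off mentioned above.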
| Service | Language/Framework | Port | Responsibility |
|---|---|---|---|
| Crawler | Python / Scrapy | None | Navigates startup directories, extracts job listings, and publishes items to a Redis stream. |
| Platform Scraper | Python / FastAPI / Playwright | 8001 | Handles on-demand JavaScript-heavy browser scraping for platforms like LinkedIn. |
| Aggregator | Python / FastAPI / asyncpg | 8000 | Consumes the Redis stream, deduplicates entries, saves to Postgres, and serves job queries. |
| Contact Discovery | Python / FastAPI / httpx | 8002 | Uses OSINT and GitHub APIs to uncover potential recruiter/manager emails per company. |
| Email Generator | Python / FastAPI / Ollama | 8003 | Renders Jinja2 templates and interfaces with Ollama to draft contextual cold emails. |
| Gateway | Python / FastAPI / Vanilla JS | 8080 | Proxies frontend requests, composes complex workflows, and serves the static dashboard. |
| Scheduler | Python / APScheduler | None | Periodically triggers crawlers, contact discovery, and automated email drafting. |
| Technology | Purpose |
|---|---|
| FastAPI | Handles all internal REST APIs and the unified Gateway proxy logic. |
| Scrapy | Crawls structured and unstructured startup directories with high concurrency. |
| Playwright | Renders JavaScript-heavy career pages (LinkedIn, Wellfound, Naukri). |
| asyncpg | Provides direct async connections to PostgreSQL for high-performance writes. |
| Redis Streams | Acts as a lightweight event broker decoupling crawling from persistence. |
| PostgreSQL | Stores unified job listings, contact profiles, and email drafts. |
| APScheduler | Manages chronologically scheduled automation cycles in a background process. |
| Jinja2 | Populates string templates for standard cold email structures. |
| Ollama | Drafts highly customized outreach emails locally using open-weights reasoning LLMs. |
| httpx | Performs asynchronous outbound API calls to GitHub and other OSINT sources. |
| BeautifulSoup4 | Parses static HTML response blobs extracting email patterns and target links. |
| smtplib | Validates discovered email domains and sends approved outreach via Gmail SMTP. |
| Docker | Containerizes services for isolated execution environments. |
| docker-compose | Orchestrates the multi-container stack, network routing, and volumes. |
| Chart.js | Renders job metrics and funnel conversion graphs on the dashboard. |
| pytest | Executes unit, integration, and end-to-end tests across the monorepo. |
| testcontainers | Spins up ephemeral Postgres and Redis instances automatically for integration testing. |
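The smtplib-based email validation in the table above can be sketched as a best-effort `RCPT TO` probe. This is a simplified illustration: a production checker would first resolve the domain's MX record, whereas this sketch connects to the bare domain, which many providers reject or greylist:

```python
import smtplib

def domain_of(email: str) -> str:
    """Everything after the last '@', lower-cased."""
    return email.rsplit("@", 1)[-1].lower()

def smtp_check(email: str, helo_host: str = "example.com", timeout: float = 10.0) -> bool:
    """Probe whether a mail server accepts the recipient (network call; sketch only)."""
    try:
        with smtplib.SMTP(domain_of(email), 25, timeout=timeout) as smtp:
            smtp.helo(helo_host)
            smtp.mail("probe@" + helo_host)
            code, _ = smtp.rcpt(email)
            # 250/251 mean the server would accept delivery for this address.
            return code in (250, 251)
    except (OSError, smtplib.SMTPException):
        return False
```

Note that a `False` result is only a weak signal — catch-all domains accept everything, and some servers refuse probes entirely.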
- Python 3.11+
- Docker 24+ and docker-compose v2
- Redis 7 (or run via Docker)
- Ollama (optional, for AI-enhanced emails)
- A Gmail account with an App Password (for sending emails)
Note: If Ollama is not installed, the email generator gracefully degrades to standard parameterized Jinja2 templates; contact discovery and job aggregation work unchanged.
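The graceful-degradation path can be sketched as "try Ollama, fall back to a template." The sketch below uses stdlib `urllib` in place of the service's httpx client and a plain format string standing in for the Jinja2 template; the request shape follows Ollama's `/api/generate` API, but the prompt and template text are assumptions:

```python
import json
import urllib.error
import urllib.request

FALLBACK_TEMPLATE = (
    "Hi {contact},\n\nI came across the {role} opening at {company} "
    "and would love to apply.\n\nBest,\n{sender}"
)

def draft_email(context: dict, base_url: str = "http://localhost:11434") -> str:
    """Ask Ollama for a draft; degrade to the static template if unreachable."""
    prompt = (
        f"Write a short cold email applying for the {context['role']} "
        f"role at {context['company']}, addressed to {context['contact']}."
    )
    body = json.dumps({"model": "mistral", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)["response"]
    except (urllib.error.URLError, OSError, KeyError):
        # Ollama unreachable or malformed reply: fall back to the template.
        return FALLBACK_TEMPLATE.format(**context)
```

The same try/except shape is all the "graceful degradation" mentioned above requires: the caller never needs to know which path produced the draft.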
- Clone the repository and navigate into the root directory.

```bash
git clone https://github.com/vaibhav-sharma/jobCrawler.git
cd jobCrawler
```

- Copy the example environment file and fill in the required variables.

```bash
cp .env.example .env
```

- Install the Playwright Chromium browser locally if you plan to run or test the scraper outside Docker.

```bash
playwright install chromium
```

- Bring up the full stack via docker-compose.

```bash
docker compose up --build -d
```

Visit http://localhost:8080 to open the dashboard.
| Variable | Required | Default | Description |
|---|---|---|---|
| Core | | | |
| `POSTGRES_USER` | Yes | `jobuser` | Username for the PostgreSQL container. |
| `POSTGRES_PASSWORD` | Yes | `jobpass` | Password for the PostgreSQL container. |
| `POSTGRES_DB` | Yes | `jobsdb` | Database name inside PostgreSQL. |
| `GATEWAY_PORT` | No | `8080` | Port exposed on localhost for the UI and Gateway. |
| Crawler | | | |
| `JOBSEEKER_ROLE` | No | `Backend Engineer` | Primary job role targeted during searches. |
| `JOBSEEKER_STACK` | No | `Python,FastAPI` | Comma-separated stack used to filter matching jobs. |
| `GMAIL_ADDRESS` | No | | Sender's email address for outbound applications. |
| `GMAIL_APP_PASSWORD` | No | | Gmail app password used to authorize smtplib. |
| `YOUR_NAME` | No | `Applicant` | Sender name attached to drafted cold emails. |
| `YOUR_GITHUB_URL` | No | | Included in the footer context for outreach prompts. |
| Ollama | | | |
| `OLLAMA_BASE_URL` | No | `http://host.docker.internal:11434` | URL for reaching the host's Ollama API from inside containers. |
Trigger an immediate scrape of the configured platforms via the gateway proxy:

```bash
curl -X POST http://localhost:8080/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"role": "Backend Engineer", "platforms": ["internshala"]}'
```

This kicks off a Playwright-based scrape in the scraper service. Jobs retrieved this way bypass Redis and are POSTed directly to the aggregator for synchronous feedback.
Execute the entire discovery-to-email funnel for an existing job ID in your database:

```bash
curl -X POST http://localhost:8080/api/workflow/apply \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "aa1f4bc0-5c08-4531-9c8a-721fb1afe033",
    "template": "cold_outreach",
    "roles": ["Engineering Manager", "Technical Recruiter"]
  }'
```

Example JSON response:
```json
{
  "job": {
    "id": "aa1f4bc0-5c08-4531-9c8a-721fb1afe033",
    "company": "Supabase",
    "role": "Postgres Engineer",
    "status": "new"
  },
  "contacts": [
    {
      "id": "b22f4bc0...",
      "name": "Jane Doe",
      "email": "jane@supabase.com",
      "verified": "verified"
    }
  ],
  "draft_email": {
    "id": "c33f4bc0...",
    "subject": "Backend Engineer application — Supabase",
    "body": "Hi Jane,\n\nI was browsing open roles...",
    "status": "draft"
  }
}
```

APScheduler runs the pipeline unattended. Scrapes execute every 8 hours, loading raw listings into the database. Once every 24 hours (offset by 4 hours so runs never coincide with a scrape), the contact discovery worker picks up unprocessed companies and queries GitHub/OSINT sources for recruiter contacts. Finally, a drafting pass generates personalized emails for newly verified addresses and leaves them queued for review in the dashboard.
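The stagger between the 8-hour scrape cycle and the daily runs can be sketched with plain cron-hour arithmetic. The specific hours below are illustrative assumptions, not the project's actual schedule:

```python
SCRAPE_EVERY_H = 8   # scrapes fire every 8 hours from midnight
DISCOVERY_HOUR = 4   # daily contact discovery, offset from scrapes
DRAFTING_HOUR = 5    # daily email drafting, after discovery

def daily_fire_hours() -> dict[str, list[int]]:
    """Hours of the day at which each task starts."""
    return {
        "scrape": list(range(0, 24, SCRAPE_EVERY_H)),  # 0, 8, 16
        "discovery": [DISCOVERY_HOUR],
        "drafting": [DRAFTING_HOUR],
    }

def overlaps() -> set[int]:
    """Hours where two tasks would start simultaneously."""
    seen: set[int] = set()
    clashes: set[int] = set()
    for fire_list in daily_fire_hours().values():
        for h in fire_list:
            (clashes if h in seen else seen).add(h)
    return clashes
```

With these values `overlaps()` is empty; moving discovery to hour 8 would collide with a scrape run, which is exactly what the 4-hour offset avoids.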
├── aggregator-service/ # Persists scraped data; serves jobs REST API
│ ├── main.py # FastAPI route declarations
│ ├── db.py # asyncpg query interfaces
│   ├── consumer.py              # Background Redis stream consumer
│ └── Dockerfile
├── contact-discovery-service/ # Performs OSINT queries discovering employees
│   ├── main.py                  # Maps company names to GitHub API queries
│ ├── storage.py # Local sync Postgres queries for caching contacts
│ ├── verifier.py # Connects to domains tracking SMTP responses
│ └── Dockerfile
├── crawler-service/ # Emits background job entities into Redis stream
│   ├── crawler/spiders/         # Directory spiders (yc.py, remotive.py)
│ ├── scrapy.cfg # Base application scrapy configuration definition
│ └── Dockerfile
├── email-generator-service/ # Local LLM drafting integrations
│ ├── main.py
│ ├── ollama_client.py # Interfaces locally configured Mistral
│ ├── templates/ # Contains structured Jinja2 text templates
│ └── Dockerfile
├── gateway/ # Application proxy endpoints and user interface
│ ├── main.py # API fanout routers linking isolated tasks
│ ├── proxy.py # httpx abstractions executing proxy mappings
│ ├── dashboard.html # Single-page vanilla JS UI tracking state
│ └── Dockerfile
├── scheduler/ # Centralized task execution interval timers
│ ├── main.py
│ ├── tasks.py # Wraps Gateway APIs in timed interval routines
│ ├── logger.py # Intercepts log formats translating properties to JSON
│ └── Dockerfile
├── docker-compose.yml # Core network mapping execution environments
└── README.md
To work on a single service in isolation, run it locally in a virtual environment:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
cd crawler-service
pip install -r requirements.txt
export REDIS_HOST=localhost
export JOBSEEKER_ROLE="Backend Engineer"
scrapy crawl remotive
```

Run the test suites with pytest:

- `pytest tests/unit` — unit tests of business logic against mock objects.
- `pytest tests/integration` — inter-service stream flows, using testcontainers to spin up throwaway Postgres and Redis instances.
- `pytest tests/contracts` — schema checks asserting services exchange uniform HTTP payloads.
- `pytest tests/e2e` — full API flows through the gateway, simulating typical dashboard usage.
- Navigate to `crawler-service/crawler/spiders/`.
- Create a class extending `scrapy.Spider` with the target domain configuration.
- Implement `parse` to extract structured job data from the HTML.
- Yield dictionaries matching the common normalization pipeline.

```python
import scrapy

class NewStartupSpider(scrapy.Spider):
    name = 'new_startup'
    start_urls = ['https://newstartupdomain.com/jobs']

    def parse(self, response):
        for job in response.css('.job-posting'):
            yield {
                'company': job.css('.co-name::text').get(),
                'role': job.css('.title::text').get(),
                'url': response.urljoin(job.css('a::attr(href)').get()),
            }
```

To add a new Playwright-scraped platform instead:

- Within `scraper-service/scripts`, add a new class extending the abstract `BaseScraper`.
- Implement the `Playwright` navigation for the platform's JavaScript-heavy pages.
- Register the new platform identifier in the `/api/scrape` handler.
The endpoints below are Gateway proxy routes; the services' internal routes are not exposed publicly.
Returns tracked job listings from the aggregator database.

| Param | Type | Default | Description |
|---|---|---|---|
| `role` | str | `None` | Case-insensitive substring match on job role titles. |
| `stack` | str | `None` | Comma-separated list; jobs must match the given stack items. |
| `status` | str | `None` | Filter by status: `new`, `applied`, or `ignored`. |
| `sort` | str | `latest` | Sort order, by database ingestion time. |
| `limit` | int | `50` | Maximum rows returned (cap 500). |
```bash
curl -X GET "http://localhost:8080/api/jobs?status=new&limit=2"
```

```json
[
  {
    "id": "c1f7a0...",
    "company": "Example Startup",
    "role": "Backend Engineer",
    "status": "new",
    "posted_at": "2026-04-03T20:21:00"
  },
  ...
]
```

Returns summary metrics describing the overall state of the pipeline.
| Param | Type | Default | Description |
|---|---|---|---|
| N/A | N/A | N/A | Takes no query parameters. |

```bash
curl -X GET "http://localhost:8080/api/stats"
```

```json
{
  "total_jobs": 124,
  "sources": {"ycombinator": 80, "remotive": 44},
  "statuses": {"new": 100, "applied": 24, "ignored": 0}
}
```

Triggers a synchronous scrape that bypasses the Redis stream and writes results directly.
| Param | Type | Default | Description |
|---|---|---|---|
| (Body) | JSON | Required | Specifies the `platforms`, `role`, and optional stack strings driving the scrape. |
```bash
curl -X POST "http://localhost:8080/api/scrape" \
  -H "Content-Type: application/json" \
  -d '{"role": "Backend Engineer", "platforms": ["naukri"]}'
```

```json
{
  "status": "success",
  "scraped": 15,
  "inserted": 5,
  "duplicates": 10
}
```

Retrieves discovered contacts, optionally filtered by company.
| Param | Type | Default | Description |
|---|---|---|---|
| `company` | str | `None` | String match on the target company name. |
```bash
curl -X GET "http://localhost:8080/api/contacts?company=Supabase"
```

```json
[
  {
    "id": "e331bc...",
    "name": "Jane Doe",
    "company": "Supabase",
    "role": "Recruiter",
    "email": "jane@supabase.com",
    "verified": "verified"
  },
  ...
]
```

Synchronously initiates a GitHub/OSINT lookup to discover potential contacts for a company.
| Param | Type | Default | Description |
|---|---|---|---|
| (Body) | JSON | Required | Supplies the target company name and an optional list of role filters. |
```bash
curl -X POST "http://localhost:8080/api/discover" \
  -H "Content-Type: application/json" \
  -d '{"company": "Supabase"}'
```

```json
{
  "status": "success",
  "found": 2,
  "details": "Triggered asynchronous ingestion process assessing 2 generic profile identifiers."
}
```

Returns drafted outreach emails generated by the local LLM.
| Param | Type | Default | Description |
|---|---|---|---|
| `job_id` | str | `None` | Return only emails drafted for the given job. |
```bash
curl -X GET "http://localhost:8080/api/emails?job_id=aa1f4bc0-5c08-4531-9c8a-721fb1afe033"
```

```json
[
  {
    "id": "a90bb...",
    "subject": "Backend Engineer role — Supabase",
    "status": "draft",
    "generated_at": "2026-04-03T21:14:00"
  },
  ...
]
```

Renders a template into a draft email, using the configured LLM host or the Jinja2 fallback.
| Param | Type | Default | Description |
|---|---|---|---|
| (Body) | JSON | Required | Supplies `job_id`, `contact_id`, and `template` to configure rendering. |
```bash
curl -X POST "http://localhost:8080/api/generate" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "aa1f4bc0...", "template": "followup"}'
```

```json
{
  "email_id": "b1aa2...",
  "subject": "Checking in on my backend application",
  "body": "Hi Jane,\n\nFollowing up on my previous message...",
  "status": "draft"
}
```

Runs contact discovery and email drafting end-to-end for one job, returning a single composite result.
| Param | Type | Default | Description |
|---|---|---|---|
| (Body) | JSON | Required | Supplies the `job_id` of the job to run the workflow against. |
```bash
curl -X POST "http://localhost:8080/api/workflow/apply" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "aa1f4bc0..."}'
```

```json
{
  "job": {
    "id": "aa1f4bc0...",
    "company": "Supabase"
  },
  "contacts": [...],
  "draft_email": {...}
}
```

Built:
- Web crawler (Wellfound, YC, Remotive)
- Platform scraper (Naukri, LinkedIn, Internshala)
- Job aggregation with deduplication
- Contact discovery via OSINT
- Cold email generation (Jinja2 + Ollama)
- API gateway with dashboard
- Automated scheduling
Planned:
- Resume parsing, so email drafts adapt to keywords extracted from the CV.
- Asynchronous SMTP sending with delivery validation, removing the manual Gmail step.
- Containerized agents that fill application forms headlessly via LLM-mapped fields.
- Historical metric trends on the dashboard to track outreach over time.
- Redis Streams consumer groups across multiple worker instances for distributed processing.
This tool only parses publicly accessible HTML, much as a browser or search indexer would. It respects robots.txt and rate-limits its requests to avoid straining target sites. Contact discovery exists solely for targeted professional outreach: every email requires manual review before sending, which keeps the pipeline from degenerating into bulk spam.
I built this to manage my own placement hunt, but contributions are welcome: additional crawler targets, parsers for new job platforms, and new Jinja2 email templates are all valuable. Please file bug reports as issues or open a PR through the usual GitHub flow.
MIT © Vaibhav Sharma 2026
Note: The MIT license covers this tool's source code. Users are independently responsible for deploying it ethically and complying with the data-handling policies of any scraped platform.