sneharao/AI_Job_Outreach

AI Job Retrieval & Outreach Engine

A production-ready job search automation system. It analyzes your resume, finds matching jobs across multiple boards, researches each company's product, and drafts personalized outreach messages, all in a single real-time pipeline.


🧠 System Architecture

The project consists of a FastAPI backend running a LangGraph workflow and a React frontend that displays real-time updates via Server-Sent Events (SSE).

1. The Core Workflow (LangGraph)

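The "brain" of the project is a directed graph where each node handles a specific step. After the scraper returns its list of jobs, the graph loops over them one at a time, running each job through the scoring, research, outreach, and persistence nodes.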

| Node Name | Purpose | Key Inputs | Key Outputs | LLM Used? |
|---|---|---|---|---|
| resume_analyzer | Extract skills & impact from PDF. | resume_text | resume_profile | ✅ (Sonnet) |
| job_scraper | Find jobs via Tavily/XING & extract JD. | job_title, location | raw_jobs (List) | ✅ (Haiku) |
| match_scorer | Score match (0-100) based on resume. | profile, job[idx] | match_result | ✅ (Sonnet) |
| product_analyzer | Research company & find product gaps. | company, JD | product_analysis | ✅ (Sonnet) |
| hr_finder | Find recruiters/hiring managers on LinkedIn. | company, JD | hr_contact | ✅ (GPT-4o-mini) |
| outreach_gen | Draft Email & LinkedIn messages. | All previous data | outreach_draft | ✅ (Sonnet) |
| results_persister | Save analyzed job to SQLite database. | All job data | completed_results | ❌ (No) |
| advance_job | Move to next job or end search. | current_index | current_index + 1 | ❌ (No) |
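
The per-job loop can be pictured with a small plain-Python sketch. It deliberately uses `asyncio` instead of LangGraph so it runs without extra dependencies, and the node bodies are placeholders rather than the repository's code. Skipping jobs below the threshold before persistence is an assumption based on the run_results description, and `run_pipeline` is a hypothetical name.

```python
import asyncio
from typing import Any

MATCH_THRESHOLD = 60  # taken from the Configuration & Limits section


async def match_scorer(profile: dict, job: dict) -> int:
    # Placeholder: the real node calls an LLM; here the score is read from the job.
    return job["score"]


async def product_analyzer(job: dict) -> dict:
    return {"company": job["company"], "gaps": []}  # placeholder research


async def hr_finder(job: dict) -> dict:
    return {"company": job["company"], "contacts": []}  # placeholder lookup


async def run_pipeline(profile: dict, raw_jobs: list[dict]) -> list[dict]:
    """Walk raw_jobs by index, mirroring match_scorer -> ... -> advance_job."""
    completed_results: list[dict] = []
    current_index = 0
    while current_index < len(raw_jobs):
        job = raw_jobs[current_index]
        score = await match_scorer(profile, job)
        if score >= MATCH_THRESHOLD:
            # Company research and recruiter lookup run concurrently.
            product, hr = await asyncio.gather(product_analyzer(job), hr_finder(job))
            result: dict[str, Any] = {
                "job": job,
                "overall_score": score,
                "product_analysis": product,
                "hr_contact": hr,
            }
            completed_results.append(result)  # stands in for results_persister
        current_index += 1  # stands in for advance_job
    return completed_results
```

In the actual graph, the loop-back would presumably be a conditional edge from advance_job rather than a `while` loop; the sketch only shows the control flow.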

💾 Data Flow & Persistence

The system is designed to be efficient and safe to interrupt. It stores data at three key points:

  1. Resume Profile: Saved to user_profile table immediately after analysis.
  2. Job Results: Saved to run_results table at the end of every loop iteration. If the search is interrupted, you still have the results for jobs processed so far.
  3. Research Caches: product_cache and hr_cache store company-specific research. If two different jobs come from the same company, the research is performed (and paid for) only once.
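
As an illustration of the caching step, here is a minimal get-or-compute sketch against product_cache using the standard `sqlite3` module. The function names (`open_cache`, `get_or_research`), the `research` callback, and the strip-and-lowercase normalization are assumptions for the example, not the project's actual code.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Callable


def open_cache(path: str) -> sqlite3.Connection:
    """Open the database and make sure the cache table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product_cache ("
        "company TEXT PRIMARY KEY, product_json TEXT, created_at TEXT)"
    )
    return conn


def get_or_research(
    conn: sqlite3.Connection, company: str, research: Callable[[str], dict]
) -> dict:
    """Return cached research for a company, calling `research` only on a miss."""
    key = company.strip().lower()  # assumed normalization
    row = conn.execute(
        "SELECT product_json FROM product_cache WHERE company = ?", (key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    analysis = research(company)  # the expensive LLM/Tavily step in the real system
    conn.execute(
        "INSERT INTO product_cache (company, product_json, created_at) VALUES (?, ?, ?)",
        (key, json.dumps(analysis), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return analysis
```

The same pattern would apply to hr_cache with `hr_json` in place of `product_json`.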

🗄️ Database Schema (SQLite)

The project uses a local SQLite database (path configured via SQLITE_DB_PATH) for persistent storage and caching.

1. user_profile

Stores the analyzed resume and user metadata. This is a single-row table (ID=1).

| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary Key (always 1). |
| email | TEXT | User's email address. |
| resume_text | TEXT | Raw extracted text from the PDF. |
| profile_json | TEXT | Structured JSON of skills, experience, and goals. |
| role_category | TEXT | The detected professional category (e.g., "Software Engineer"). |
| updated_at | TEXT | ISO timestamp of the last update. |

2. run_results

Stores every job that passed the match threshold in the current session.

| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary Key (Autoincrement). |
| result_json | TEXT | Complete job object (JD, analysis, outreach drafts). |
| overall_score | REAL | The match score (0-100). |
| saved_at | TEXT | Timestamp when the job was persisted. |

3. product_cache

Caches company research to avoid redundant LLM and Tavily API calls.

| Column | Type | Description |
|---|---|---|
| company | TEXT | Primary Key (Normalized lowercase company name). |
| product_json | TEXT | Analysis of product name, description, and gaps. |
| created_at | TEXT | Timestamp of the initial research. |

4. hr_cache

Caches recruiter contact information for companies.

| Column | Type | Description |
|---|---|---|
| company | TEXT | Primary Key (Normalized lowercase company name). |
| hr_json | TEXT | Found LinkedIn profiles, names, and titles of HR staff. |
| created_at | TEXT | Timestamp of the initial research. |
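
For reference, the four tables described in this section could be created roughly as follows. This is a sketch inferred from the column lists, not the project's migration code: the `CHECK (id = 1)` constraint is one assumed way to enforce the single-row rule, and `init_db` is a hypothetical helper name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS user_profile (
    id INTEGER PRIMARY KEY CHECK (id = 1),  -- assumed single-row guard
    email TEXT,
    resume_text TEXT,
    profile_json TEXT,
    role_category TEXT,
    updated_at TEXT
);
CREATE TABLE IF NOT EXISTS run_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    result_json TEXT,
    overall_score REAL,
    saved_at TEXT
);
CREATE TABLE IF NOT EXISTS product_cache (
    company TEXT PRIMARY KEY,
    product_json TEXT,
    created_at TEXT
);
CREATE TABLE IF NOT EXISTS hr_cache (
    company TEXT PRIMARY KEY,
    hr_json TEXT,
    created_at TEXT
);
"""


def init_db(path: str) -> sqlite3.Connection:
    """Open the SQLite file (e.g. the SQLITE_DB_PATH value) and create missing tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```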

🌐 Frontend & Real-Time Updates

The frontend uses Zustand for state management and @microsoft/fetch-event-source for SSE.

  • Real-time Streaming: The backend pushes events (node_complete, scrape_progress, result_saved) to the frontend.
  • No Polling: The UI updates as soon as a job finishes, without querying the database.
  • Persisted UI: Your resume data is saved in your browser's localStorage, but the job results are ephemeral (cleared on refresh).
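
To make the streaming format concrete, the helper below builds one SSE frame in the standard `event:` / `data:` wire format. The event names come from the list above; the payload fields and the `format_sse` helper itself are hypothetical.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Events frame; a blank line terminates the frame."""
    # json.dumps emits a single line by default, so one data: field is enough.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Example frame the backend might push after a node finishes (payload is illustrative).
frame = format_sse("node_complete", {"node": "match_scorer"})
```

A FastAPI endpoint would typically yield such strings from a generator wrapped in a streaming response with the `text/event-stream` media type.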

⚙️ Configuration & Limits

To keep costs low and results high-quality, the following limits are in place:

  • Job Limit: Maximum 8 jobs processed per search.
  • Recency: Only jobs posted in the last 30 days are accepted.
  • Language: Only English job descriptions are processed (detected before LLM calls).
  • Threshold: Only jobs with a match score >= 60 trigger the research & outreach phase.
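
These limits could be expressed as a few constants and predicates, as in the sketch below. The job dictionary fields (`posted_at`, `language`), the function names, and the order in which the filters and the job cap are applied are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_JOBS = 8          # maximum jobs processed per search
MAX_AGE_DAYS = 30     # recency window
MATCH_THRESHOLD = 60  # minimum score for research & outreach


def accept_job(posted_at: datetime, language: str, now: datetime) -> bool:
    """Cheap checks that can run before any LLM call."""
    return language == "en" and now - posted_at <= timedelta(days=MAX_AGE_DAYS)


def select_jobs(jobs: list[dict], now: datetime) -> list[dict]:
    """Filter by recency and language, then cap the list at MAX_JOBS."""
    accepted = [j for j in jobs if accept_job(j["posted_at"], j["language"], now)]
    return accepted[:MAX_JOBS]


def should_research(score: float) -> bool:
    """Only sufficiently good matches trigger the research & outreach phase."""
    return score >= MATCH_THRESHOLD
```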

🛠️ Developer Guide

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • API Keys: OpenRouter, Resend, Tavily, Proxycurl

Backend Setup

cd backend
pip install -r requirements.txt
cp .env.example .env # Add your keys here
python main.py

Frontend Setup

cd frontend
npm install
npm run dev

Testing

We have dedicated tests for core logic and integration:

  • pytest tests/test_graph_nodes.py: Node logic tests.
  • pytest tests/test_duplicates.py: Verifies that duplicate jobs are correctly filtered by URL and Ground Truth (Company/Title).

🚀 Key Optimizations

  1. LLM Routing: Heavy analysis uses Claude 3.5 Sonnet, while fast data extraction uses Claude 3.5 Haiku. Simple decisions use GPT-4o-mini.
  2. Multi-Pass Deduplication:
    • Pass 1 (Search Pass): Immediately filters out duplicate URLs and identical company/title strings from raw search results to save on LLM extraction costs.
    • Pass 2 (Ground Truth Pass): After the LLM extracts the actual company and title from the job page, the system performs a second check. This catches the same job posted on different platforms (e.g., Greenhouse vs LinkedIn) with different URLs.
  3. Stable Job IDs: Job IDs are generated using MD5 hashes of normalized URLs. This ensures that IDs remain identical across server restarts, preventing duplicate results in the frontend if the user refreshes.
  4. Parallelism: Product research and HR searching happen simultaneously using LangGraph's fan-out capability.
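
The stable-ID and deduplication ideas above can be sketched as follows. The exact normalization rules (lowercasing the host, dropping the fragment and trailing slash, lowercasing company and title) are assumptions, and `normalize_url`, `job_id`, and `dedupe` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def job_id(url: str) -> str:
    """MD5 is deterministic, so the ID survives server restarts."""
    return hashlib.md5(normalize_url(url).encode("utf-8")).hexdigest()


def ground_truth_key(company: str, title: str) -> tuple[str, str]:
    return (company.strip().lower(), title.strip().lower())


def dedupe(jobs: list[dict]) -> list[dict]:
    """Drop jobs whose URL-based ID or company/title pair was already seen.

    The same function can serve both passes: once on raw search results,
    and again after the LLM has extracted the actual company and title.
    """
    seen_ids: set[str] = set()
    seen_keys: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for job in jobs:
        jid = job_id(job["url"])
        key = ground_truth_key(job["company"], job["title"])
        if jid in seen_ids or key in seen_keys:
            continue
        seen_ids.add(jid)
        seen_keys.add(key)
        unique.append({**job, "id": jid})
    return unique
```

Matching on company and title is what catches the cross-platform case: two postings with different URLs still collapse to one entry once their extracted company and title agree.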
