AI Job Retrieval & Outreach Engine

A production-ready job search automation system. It analyzes your resume, finds matching jobs across multiple boards, researches each company's product, and drafts personalized outreach messages, all in a single real-time pipeline.

🧠 System Architecture

The project consists of a FastAPI backend running a LangGraph workflow and a React frontend that displays real-time updates via Server-Sent Events (SSE).

1. The Core Workflow (LangGraph)

The "brain" of the project is a directed graph where each node handles a specific step. After the resume is analyzed and jobs are scraped, the graph loops over the jobs it found and processes one job per iteration.

| Node Name | Purpose | Key Inputs | Key Outputs | LLM Used? |
| --- | --- | --- | --- | --- |
| `resume_analyzer` | Extract skills & impact from PDF. | `resume_text` | `resume_profile` | ✅ (Sonnet) |
| `job_scraper` | Find jobs via Tavily/XING & extract JD. | `job_title`, `location` | `raw_jobs` (List) | ✅ (Haiku) |
| `match_scorer` | Score match (0-100) based on resume. | `profile`, `job[idx]` | `match_result` | ✅ (Sonnet) |
| `product_analyzer` | Research company & find product gaps. | `company`, `JD` | `product_analysis` | ✅ (Sonnet) |
| `hr_finder` | Find recruiters/hiring managers on LinkedIn. | `company`, `JD` | `hr_contact` | ✅ (GPT-4o-mini) |
| `outreach_gen` | Draft Email & LinkedIn messages. | All previous data | `outreach_draft` | ✅ (Sonnet) |
| `results_persister` | Save analyzed job to SQLite database. | All job data | `completed_results` | ❌ (No) |
| `advance_job` | Move to next job or end search. | `current_index` | `current_index + 1` | ❌ (No) |
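
The per-job loop and the score gate can be sketched in plain Python to show the control flow. This is an illustrative sketch, not the project's code: the function names `route_after_score` and `route_after_advance`, the `overall_score` key inside `match_result`, and the assumption that low-scoring jobs skip straight to `advance_job` are all hypothetical. The threshold value comes from the Configuration & Limits section.

```python
from typing import TypedDict

MATCH_THRESHOLD = 60  # value from "Configuration & Limits"; the constant name is hypothetical


class PipelineState(TypedDict):
    raw_jobs: list[dict]   # output of job_scraper
    current_index: int     # index of the job being processed
    match_result: dict     # output of match_scorer for the current job


def route_after_score(state: PipelineState) -> str:
    """Pick the node that follows match_scorer (assumed routing)."""
    score = state["match_result"].get("overall_score", 0)
    return "product_analyzer" if score >= MATCH_THRESHOLD else "advance_job"


def advance_job(state: PipelineState) -> PipelineState:
    """Return a copy of the state that points at the next job."""
    return {**state, "current_index": state["current_index"] + 1}


def route_after_advance(state: PipelineState) -> str:
    """Loop back to match_scorer while jobs remain, otherwise end the run."""
    if state["current_index"] < len(state["raw_jobs"]):
        return "match_scorer"
    return "end"
```

In a LangGraph graph, router functions like these would typically be registered with `add_conditional_edges`; that wiring is omitted here.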

💾 Data Flow & Persistence

The system is designed to be efficient and "resume-safe." It stores data at three key points:

- Resume Profile: Saved to the `user_profile` table immediately after analysis.
- Job Results: Saved to the `run_results` table at the end of every loop iteration. If the search is interrupted, you still have the results for the jobs processed so far.
- Research Caches: `product_cache` and `hr_cache` store company-specific research. If two different jobs come from the same company, the research runs (and is paid for) only once.
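
The research cache can be pictured as a read-through lookup keyed by the normalized company name. The sketch below is an assumption about how such a lookup could work, not the project's implementation: `normalize_company`, `get_product_analysis`, and the `research` callback are hypothetical names, and it expects a `product_cache` table with the columns listed in the schema section.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Callable


def normalize_company(name: str) -> str:
    """Lowercase and collapse whitespace so 'Acme  GmbH' and 'acme gmbh' share one key."""
    return " ".join(name.lower().split())


def get_product_analysis(
    conn: sqlite3.Connection,
    company: str,
    research: Callable[[str], dict],
) -> dict:
    """Return cached research for a company, calling `research` only on a cache miss."""
    key = normalize_company(company)
    row = conn.execute(
        "SELECT product_json FROM product_cache WHERE company = ?", (key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no paid API call

    result = research(company)  # cache miss: run the expensive research once
    conn.execute(
        "INSERT INTO product_cache (company, product_json, created_at) VALUES (?, ?, ?)",
        (key, json.dumps(result), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return result
```

The same pattern would apply to `hr_cache` with `hr_json` in place of `product_json`.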

🗄️ Database Schema (SQLite)

The project uses a local SQLite database (path configured via `SQLITE_DB_PATH`) for persistent storage and caching.

1. `user_profile`

Stores the analyzed resume and user metadata. This is a single-row table (ID=1).

| Column | Type | Description |
| --- | --- | --- |
| `id` | INTEGER | Primary Key (always 1). |
| `email` | TEXT | User's email address. |
| `resume_text` | TEXT | Raw extracted text from the PDF. |
| `profile_json` | TEXT | Structured JSON of skills, experience, and goals. |
| `role_category` | TEXT | The detected professional category (e.g., "Software Engineer"). |
| `updated_at` | TEXT | ISO timestamp of the last update. |

2. `run_results`

Stores every job that passed the match threshold in the current session.

| Column | Type | Description |
| --- | --- | --- |
| `id` | INTEGER | Primary Key (Autoincrement). |
| `result_json` | TEXT | Complete job object (JD, analysis, outreach drafts). |
| `overall_score` | REAL | The match score (0-100). |
| `saved_at` | TEXT | Timestamp when the job was persisted. |

3. `product_cache`

Caches company research to avoid redundant LLM and Tavily API calls.

| Column | Type | Description |
| --- | --- | --- |
| `company` | TEXT | Primary Key (Normalized lowercase company name). |
| `product_json` | TEXT | Analysis of product name, description, and gaps. |
| `created_at` | TEXT | Timestamp of the initial research. |

4. `hr_cache`

Caches recruiter contact information for companies.

| Column | Type | Description |
| --- | --- | --- |
| `company` | TEXT | Primary Key (Normalized lowercase company name). |
| `hr_json` | TEXT | Found LinkedIn profiles, names, and titles of HR staff. |
| `created_at` | TEXT | Timestamp of the initial research. |

🌐 Frontend & Real-Time Updates

The frontend uses Zustand for state management and @microsoft/fetch-event-source for SSE.

- Real-time Streaming: The backend pushes events (`node_complete`, `scrape_progress`, `result_saved`) to the frontend.
- No Polling: The UI updates automatically when a job is finished, without querying the database.
- Persisted UI: Your resume data is saved in your browser's `localStorage`, but the job results are ephemeral (cleared on refresh).
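
On the wire, each Server-Sent Event is a small text frame: an `event:` line, a `data:` line, and a blank line that ends the message. The helper below sketches how the backend might serialize one; `format_sse` and the payload fields (`node`, `job_index`) are hypothetical, while the event names are the ones listed above.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialize one SSE message: event name, single-line JSON data, terminating blank line."""
    # Compact separators keep the JSON on one line, as the SSE data field expects.
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


# Example frame for a finished node (payload fields are illustrative).
frame = format_sse("node_complete", {"node": "match_scorer", "job_index": 0})
```

A streaming endpoint would yield frames like this one at a time with the `text/event-stream` content type.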

⚙️ Configuration & Limits

To keep costs low and results high-quality, the following limits are in place:

- Job Limit: Maximum 8 jobs processed per search.
- Recency: Only jobs posted in the last 30 days are accepted.
- Language: Only English job descriptions are processed (detected before LLM calls).
- Threshold: Only jobs with a match score >= 60 trigger the research & outreach phase.
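
The limits above amount to a few cheap checks that run before any expensive step. A minimal sketch, assuming each job is a dict with hypothetical `language` and `posted_at` keys and that the 8-job cap is applied after filtering (the README does not say in which order the checks run):

```python
from datetime import datetime, timedelta, timezone

MAX_JOBS = 8          # jobs processed per search
MAX_AGE_DAYS = 30     # recency window
MATCH_THRESHOLD = 60  # minimum score for research & outreach


def is_recent(posted_at: datetime, now: datetime) -> bool:
    """True if the job was posted within the last MAX_AGE_DAYS days."""
    return now - posted_at <= timedelta(days=MAX_AGE_DAYS)


def select_jobs(jobs: list[dict], now: datetime) -> list[dict]:
    """Keep recent English postings, then cap the list at MAX_JOBS."""
    kept = [
        job for job in jobs
        if job["language"] == "en" and is_recent(job["posted_at"], now)
    ]
    return kept[:MAX_JOBS]


def should_research(score: float) -> bool:
    """True if the match score is high enough to trigger research & outreach."""
    return score >= MATCH_THRESHOLD
```

Language detection itself is not shown; the sketch assumes it has already populated `language`.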

🛠️ Developer Guide

Prerequisites

- Python 3.11+
- Node.js 18+
- API Keys: OpenRouter, Resend, Tavily, Proxycurl

Backend Setup

    cd backend
    pip install -r requirements.txt
    cp .env.example .env  # Add your keys here
    python main.py

Frontend Setup

    cd frontend
    npm install
    npm run dev

Testing

We have dedicated tests for core logic and integration:

- `pytest tests/test_graph_nodes.py`: Node logic tests.
- `pytest tests/test_duplicates.py`: Verifies that duplicate jobs are correctly filtered by URL and Ground Truth (Company/Title).

🚀 Key Optimizations

- LLM Routing: Heavy analysis uses Claude 3.5 Sonnet, while fast data extraction uses Claude 3.5 Haiku. Simple decisions use GPT-4o-mini.
- Multi-Pass Deduplication:
  - Pass 1 (Search Pass): Immediately filters out duplicate URLs and identical company/title strings from raw search results to save on LLM extraction costs.
  - Pass 2 (Ground Truth Pass): After the LLM extracts the actual company and title from the job page, the system performs a second check. This catches the same job posted on different platforms (e.g., Greenhouse vs LinkedIn) with different URLs.
- Stable Job IDs: Job IDs are generated using MD5 hashes of normalized URLs. This ensures that IDs remain identical across server restarts, preventing duplicate results in the frontend if the user refreshes.
- Parallelism: Product research and HR searching happen simultaneously using LangGraph's fan-out capability.
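
The stable IDs and the two deduplication passes can be sketched as follows. The normalization rules (lowercase scheme and host, drop the fragment and trailing slash, keep the query string) and all function names are assumptions, since the README does not spell them out. Pass 1 here checks URLs only; the company/title string check on raw results is left out for brevity.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and trailing slash (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def job_id(url: str) -> str:
    """MD5 hex digest of the normalized URL; identical across restarts for the same URL."""
    return hashlib.md5(normalize_url(url).encode("utf-8"), usedforsecurity=False).hexdigest()


def dedupe_by_url(raw_jobs: list[dict]) -> list[dict]:
    """Pass 1: drop search results whose normalized URL was already seen."""
    seen, unique = set(), []
    for job in raw_jobs:
        jid = job_id(job["url"])
        if jid not in seen:
            seen.add(jid)
            unique.append({**job, "id": jid})
    return unique


def dedupe_by_ground_truth(extracted_jobs: list[dict]) -> list[dict]:
    """Pass 2: drop jobs whose extracted (company, title) pair was already seen."""
    seen, unique = set(), []
    for job in extracted_jobs:
        key = (
            " ".join(job["company"].lower().split()),
            " ".join(job["title"].lower().split()),
        )
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique
```

Running pass 1 before extraction is what saves LLM calls; pass 2 can only run afterwards because it depends on the extracted company and title.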
