LeadForge — AI-Powered Lead Scoring, Enrichment & Outreach

An end-to-end, reproducible pipeline that sources, normalises, enriches, and scores 291 qualified enterprise AI/ML leads, then generates personalised outreach for the top 50: from raw data export to ready-to-send email and LinkedIn DM sequences.

Built for: P95.AI — an AI inference optimization platform
Target buyer: CTOs, VPs Engineering, Heads of AI at companies running LLMs in production
GitHub: github.com/aayush2724/LeadForge
Website: leadforge-876cf6-11q5x.thinkroot.app

Results at a Glance

Metric	Value
Total active leads	291
Hot tier (score 65+)	135
Warm tier (score 40–64)	57
Cold tier (score <40)	99
Personalised emails generated	50
LinkedIn DMs generated	50
A/B variants designed	40 (top 20 leads × 2 variants)
Pipeline stages	11 (11/11 passing)
Sources	Apollo · LinkedIn Sales Nav · Crunchbase · GitHub · BuiltWith · Seed lists

Project Overview

LeadForge is an intelligent lead qualification and personalised outreach system built for P95.AI. It combines multi-platform lead sourcing, Clay-powered enrichment, a 9-signal ICP scoring rubric, and GPT-4o outreach generation into a single reproducible Python pipeline, orchestrated via n8n. The CSV exports from Apollo.io and LinkedIn Sales Navigator and the Clay import remain manual steps.

The system produces an enriched, scored, and outreach-ready lead database, and can be re-run monthly on fresh data with one command:

python pipeline.py

Alternatively, trigger it end-to-end with a single click from the n8n workflow.

Problem Statement

P95.AI has a powerful product — an AI inference optimization layer that cuts LLM serving costs 30–45% with zero model changes. The challenge is finding and reaching the right enterprise buyers efficiently.

Three core problems:

No signal, no context — Raw lead lists lack tech stack data, hiring signals, and buying intent. Without enrichment, every outreach message is a guess.
Manual qualification doesn't scale — Human lead scoring is slow, inconsistent, and expensive. SDRs spend 70%+ of their time researching instead of selling.
Generic outreach fails — Non-personalised cold emails get ignored, flagged as spam, and permanently damage sender reputation. A CTO running Ray for distributed ML won't respond to a generic pitch.

LeadForge solves all three with a data-driven, signal-aware pipeline.

Pipeline Architecture

Raw Sources
    │
    ├── Apollo.io export        → scripts/normalize_apollo.py
    ├── LinkedIn Sales Nav      → scripts/normalize_linkedin.py
    ├── Seed lists              → scripts/normalize_seeds.py
    └── GitHub / Crunchbase /   → scripts/normalize_engineer_sources.py
        BuiltWith
                │
                ▼
    scripts/compile_leads.py
    (merge + dedup by domain)
                │
                ▼
    scripts/prefilter.py
    (hard disqualifiers + competitor detection)
                │
                ▼
    scripts/quota_check.py
    (vertical + geo distribution validation)
                │
                ▼
    Clay enrichment (manual import)
    → data/enriched_leads.csv
                │
                ▼
    scripts/enrich_3b.py
    (hiring signals + funding patch)
                │
                ▼
    scripts/scoring_engine.py
    (115-point ICP score per lead)
                │
                ▼
    data/scored_leads.csv
    (291 active, Hot/Warm/Cold)
                │
        ┌───────┴───────┐
        ▼               ▼
Phase 5 Outreach    Phase 6 A/B Test
phase5_outreach.csv  phase6_ab_variants.csv
(50 leads)           (top 20 × 2 variants)
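
The merge + dedup step in the diagram can be sketched roughly as follows. This is a minimal illustration, not the contents of compile_leads.py: the `domain` and `name` field names, the first-seen-wins rule, and the choice to skip leads with no domain are all assumptions. It uses only the standard library so it runs without installing anything, although the pipeline itself lists pandas.

```python
from urllib.parse import urlparse


def normalize_domain(value: str) -> str:
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    value = value.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat a bare domain as a host
    host = urlparse(value).netloc.split(":")[0]
    return host[4:] if host.startswith("www.") else host


def merge_and_dedup(*sources: list) -> list:
    """Concatenate lead lists, keeping the first lead seen per domain."""
    seen = set()
    merged = []
    for source in sources:
        for lead in source:
            domain = normalize_domain(lead.get("domain", ""))
            if not domain or domain in seen:
                continue  # assumption: leads without a domain are skipped
            seen.add(domain)
            merged.append({**lead, "domain": domain})
    return merged


# Hypothetical rows standing in for two normalized source files.
apollo = [{"name": "A. Rao", "domain": "https://www.example.com/about"}]
linkedin = [
    {"name": "B. Lee", "domain": "Example.com"},
    {"name": "C. Diaz", "domain": "sample.io"},
]
leads = merge_and_dedup(apollo, linkedin)
print([lead["domain"] for lead in leads])
```

Because sources are processed in argument order, the order in which files are passed decides which duplicate survives.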

Stage Summary

Stage	Script	Output
1. Normalize Apollo	`normalize_apollo.py`	`data/raw/apollo_normalized.csv`
2. Normalize LinkedIn	`normalize_linkedin.py`	`data/raw/linkedin_normalized.csv`
3. Normalize Seeds	`normalize_seeds.py`	`data/raw/seeds_normalized.csv`
4. Normalize Engineer Sources	`normalize_engineer_sources.py`	`data/raw/engineer_normalized.csv`
5. Merge + Dedupe	`compile_leads.py`	`data/raw_leads.csv`
6. Pre-filter	`prefilter.py`	`data/raw_leads_rejected.csv`
7. Quota Check	`quota_check.py`	`data/sourcing_qa_report.md`
8. Phase 3A API Enrichment	`enrich_pipeline.py`	`data/enriched_leads.csv`
9. Phase 3B Intent Enrichment	`enrich_3b.py`	`data/enriched_leads.csv`
10. Lead Scoring	`scoring_engine.py`	`data/scored_leads.csv`
11. Outreach Generation	`generate_linkedin_dms.py`	`data/phase5_outreach.csv`
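
Since each stage reads the previous stage's output file, a runner would plausibly execute the scripts in order and stop at the first failure. The sketch below shows that pattern with `subprocess.run`; it is an assumption about how such a runner could look, not a copy of pipeline.py, and the `run_stages` helper and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def run_stages(stages: list) -> list:
    """Run (name, command) pairs in order, stopping at the first failure."""
    results = []
    for name, cmd in stages:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        ok = proc.returncode == 0
        results.append({
            "stage": name,
            "success": ok,
            "output": (proc.stdout if ok else proc.stderr).strip(),
        })
        if not ok:
            break  # later stages depend on this stage's output file
    return results


# Stand-in commands so the sketch runs anywhere; in the repo each command
# would presumably look like [sys.executable, "scripts/normalize_apollo.py"].
demo = [
    ("normalize", [sys.executable, "-c", "print('normalized')"]),
    ("merge", [sys.executable, "-c", "import sys; sys.exit('merge failed')"]),
    ("score", [sys.executable, "-c", "print('scored')"]),
]
for result in run_stages(demo):
    print(result)
```

In this demo the third stage never runs because the second one exits with a non-zero status.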

ICP Framework

Full definition in icp_framework.md.

Target Persona:

Title: CTO, VP Engineering, Head of AI/ML, Director of Engineering
Company size: 200–5,000 employees
Funding: Series B through D, or bootstrapped with >$20M ARR
Must-haves: LLMs in production + cloud infrastructure
Verticals: SaaS (primary), FinTech, HealthTech, Cybersec, Logistics
Geo: US (primary), EU/UK (secondary), India seed-only

Hard Disqualifiers:

Competitor platforms: Baseten, Modal, Anyscale, Fireworks, Together, Replicate, RunPod, HuggingFace
Under 50 employees
Government or defense
No discernible LLM workload
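
As an illustration, the four hard disqualifiers above could be expressed as a single check that returns a rejection reason. The field names (`company`, `employees`, `industry`, `uses_llm`), the name-matching rule, and the order of checks are assumptions made for this sketch; prefilter.py may work differently.

```python
from typing import Optional

COMPETITORS = {
    "baseten", "modal", "anyscale", "fireworks",
    "together", "replicate", "runpod", "huggingface",
}
EXCLUDED_INDUSTRIES = {"government", "defense"}
MIN_EMPLOYEES = 50


def company_key(name: str) -> str:
    """Lowercase, drop spaces and dots, strip a trailing 'ai'."""
    key = name.lower().replace(" ", "").replace(".", "")
    return key[:-2] if key.endswith("ai") and len(key) > 2 else key


def disqualification_reason(lead: dict) -> Optional[str]:
    """Return the first matching disqualifier, or None if the lead passes."""
    if company_key(lead.get("company", "")) in COMPETITORS:
        return "competitor"
    if lead.get("employees", 0) < MIN_EMPLOYEES:
        return "under 50 employees"
    if lead.get("industry", "").lower() in EXCLUDED_INDUSTRIES:
        return "government or defense"
    if not lead.get("uses_llm", False):
        return "no discernible LLM workload"
    return None


# Hypothetical leads for illustration only.
samples = [
    {"company": "Together AI", "employees": 300, "industry": "Software", "uses_llm": True},
    {"company": "TinyCo", "employees": 12, "industry": "Software", "uses_llm": True},
    {"company": "ExampleCo", "employees": 800, "industry": "Software", "uses_llm": True},
]
for lead in samples:
    print(lead["company"], "->", disqualification_reason(lead))
```

Returning the reason rather than a boolean makes it straightforward to write rejected leads to a file together with why they were dropped.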

Tier Thresholds:

Hot: 65+ points
Warm: 40–64 points
Cold: under 40 points
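
The thresholds map directly onto a small function. The `tier_for` name and the sample scores are hypothetical; only the cut-offs come from the list above.

```python
from collections import Counter


def tier_for(score: int) -> str:
    """Map an ICP score to a tier using the thresholds listed above."""
    if score >= 65:
        return "Hot"
    if score >= 40:
        return "Warm"
    return "Cold"


# Hypothetical scores, chosen to sit on and around the boundaries.
scores = [115, 88, 65, 64, 40, 39]
print(Counter(tier_for(s) for s in scores))
```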

Lead Scoring Model

Each lead is scored 0–115 across 9 signals:

Signal	Max Points	Logic
Contact title	25	CTO=25, VP Eng=20, Head/Dir=18
Uses LLM in production	20	TRUE=20
Funding stage	20	Series C=20, Series B=15, Series D=12
Employee count	15	501–2000=15, 201–500=12, 2001–5000=10
Active ML hiring	10	Hiring ML engineers detected
Kubernetes in stack	8	TRUE=8
Geo tier	8	US=8, EU/UK=6, India=4
Ray / WandB in stack	6	TRUE=6
GitHub AI repos	3	Active public AI/infra repos

Top scoring lead: Vaibhav Nivargi — Moveworks CTO — 115/115
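
To make the rubric concrete, here is a minimal sketch that adds up the nine signals from the table. The point values are taken from the table; the field names, the title buckets as dictionary keys, and the rule that any unlisted value scores 0 are assumptions, so scoring_engine.py may differ in detail.

```python
TITLE_POINTS = {"CTO": 25, "VP Eng": 20, "Head/Dir": 18}
FUNDING_POINTS = {"Series C": 20, "Series B": 15, "Series D": 12}
GEO_POINTS = {"US": 8, "EU/UK": 6, "India": 4}


def employee_points(count: int) -> int:
    """Employee-count bands from the table; anything else scores 0."""
    if 501 <= count <= 2000:
        return 15
    if 201 <= count <= 500:
        return 12
    if 2001 <= count <= 5000:
        return 10
    return 0


def score_lead(lead: dict) -> int:
    """Sum the nine signals; missing or unlisted values contribute 0."""
    return (
        TITLE_POINTS.get(lead.get("title_bucket"), 0)
        + (20 if lead.get("uses_llm") else 0)
        + FUNDING_POINTS.get(lead.get("funding_stage"), 0)
        + employee_points(lead.get("employees", 0))
        + (10 if lead.get("hiring_ml") else 0)
        + (8 if lead.get("kubernetes") else 0)
        + GEO_POINTS.get(lead.get("geo"), 0)
        + (6 if lead.get("ray_or_wandb") else 0)
        + (3 if lead.get("github_ai_repos") else 0)
    )


# Hypothetical leads, not rows from scored_leads.csv.
best_case = {
    "title_bucket": "CTO", "uses_llm": True, "funding_stage": "Series C",
    "employees": 1200, "hiring_ml": True, "kubernetes": True,
    "geo": "US", "ray_or_wandb": True, "github_ai_repos": True,
}
mid_case = {
    "title_bucket": "VP Eng", "uses_llm": True, "funding_stage": "Series B",
    "employees": 300, "kubernetes": True, "geo": "EU/UK",
}
print(score_lead(best_case), score_lead(mid_case))  # 115 81
```

The best case confirms that the nine maxima add up to 115 (25 + 20 + 20 + 15 + 10 + 8 + 8 + 6 + 3).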

Outreach Generation

Personalised outreach is generated for the top 50 Hot leads using per-lead signal context (detected tech stack and hiring signals).

Cold email structure:

Hook based on detected tech stack signal (Ray → GPU efficiency, Kubernetes → infra scale)
Hiring signal adds urgency line
150–200 word target, soft CTA for 15-minute demo

LinkedIn DM structure:

Capped at 300 characters
Same stack-personalised hook, condensed to one punch line + ask
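
The hook selection and the 300-character cap can be illustrated with a plain template. This is a stand-in for the GPT-4o generation step, not a reproduction of it: the wording, the fallback hook, the Ray-before-Kubernetes priority, and the truncation rule are all assumptions.

```python
DM_LIMIT = 300

# Checked in order, so Ray takes priority when both are present (assumption).
HOOKS = [
    ("ray", "GPU efficiency"),
    ("kubernetes", "infra scale"),
]


def pick_hook(stack) -> str:
    """Choose a hook theme from the detected tech stack."""
    tools = {tool.lower() for tool in stack}
    for tool, theme in HOOKS:
        if tool in tools:
            return theme
    return "inference cost"  # hypothetical fallback when no signal matches


def build_dm(first_name: str, company: str, stack, hiring_ml: bool) -> str:
    """Assemble a short DM and enforce the 300-character cap."""
    text = f"Hi {first_name}, {pick_hook(stack)} at {company} caught my eye."
    if hiring_ml:
        text += " Your ML hiring suggests inference demand is growing."
    text += " Open to a 15-minute chat about cutting LLM serving costs?"
    if len(text) > DM_LIMIT:
        text = text[: DM_LIMIT - 3].rstrip() + "..."
    return text


dm = build_dm("Sam", "ExampleCo", {"Ray", "Kubernetes"}, hiring_ml=True)
print(len(dm), dm)
```

Checking the length after assembly, rather than trusting the template, keeps the cap enforced even when a company name is unusually long.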

Sample email — Vaibhav Nivargi, Moveworks:

Subject: GPU efficiency at Moveworks scale

Hi Vaibhav,

Noticed Moveworks runs Ray for distributed ML — GPU efficiency at scale
is a real challenge. You're also hiring ML engineers, which tells me
inference demand is growing.

I'm building P95.AI, an inference optimization layer that cuts LLM
serving costs 30–45% with zero model changes.

Worth a 15-minute call to see if it fits your stack?

A/B Testing Strategy

Top 20 leads (score 88+) receive two message variants each.

	Variant A — Pain-Led	Variant B — Social Proof
Hook	Specific GPU/cost pain tied to their stack	Peer company quantified result
Subject example	"Your Ray cluster is probably leaving 30% GPU capacity on the table"	"How a Series C AI company cut inference costs 40% in 3 weeks"
Target reply rate	12%	10%
Target open rate	45%	50%
Hypothesis	High-urgency, problem-aware buyers respond better	Curious/research-mode buyers engage with social proof

Full hypothesis documentation: ab_test_hypotheses.md

A winner is declared when one variant leads by at least 5 percentage points in reply rate over a 2-week window, with a minimum of 10 sends per variant.
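
The decision rule can be written down as a short function. This sketch assumes the 5-point margin is inclusive and that the send and reply counts were already collected within the 2-week window; the `declare_winner` name and its parameters are hypothetical.

```python
from typing import Optional


def declare_winner(
    a_sends: int, a_replies: int,
    b_sends: int, b_replies: int,
    min_sends: int = 10, margin_pp: float = 5.0,
) -> Optional[str]:
    """Return 'A' or 'B' if one variant leads by the margin, else None."""
    if min(a_sends, b_sends) < min_sends:
        return None  # not enough data to call the test
    rate_a = 100 * a_replies / a_sends
    rate_b = 100 * b_replies / b_sends
    if rate_a - rate_b >= margin_pp:
        return "A"
    if rate_b - rate_a >= margin_pp:
        return "B"
    return None


print(declare_winner(20, 3, 20, 2))  # A: 15% vs 10%
print(declare_winner(20, 2, 20, 2))  # None: no gap
print(declare_winner(8, 4, 20, 1))   # None: variant A has under 10 sends
```

At 20 sends per variant a single reply moves the rate by 5 points, so results near the threshold are worth reading with care.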

Tech Stack

Tool	Purpose
Python 3.13	Pipeline scripting, data processing
pandas	Data normalization, dedup, scoring
Clay.com	Lead enrichment backbone — tech stack, hiring, funding, LinkedIn signals
Apollo.io	Primary lead sourcing (80 leads)
LinkedIn Sales Navigator	High-recency contact sourcing (60 leads)
Crunchbase	Funding and firmographic data
GitHub API	Engineering signal detection — AI/infra repos
BuiltWith	Tech stack detection — Kubernetes, Snowflake, Ray, WandB
GPT-4o (OpenAI)	Personalised outreach generation
n8n	Full pipeline automation and orchestration
python-dotenv	API key management
tqdm + rich	Progress tracking and CLI output

Quick Start

# 1. Clone the repo
git clone https://github.com/aayush2724/LeadForge
cd LeadForge

# 2. Create virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # Mac/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure API keys
cp .env.template .env
# Edit .env and add your keys

# 5. Run the full pipeline
python pipeline.py

Note: All output files are pre-built and committed. You can skip directly to reviewing data/scored_leads.csv without running any scripts.

Required API Keys

Key	Required for	Where to get
`APOLLO_API_KEY`	Apollo sourcing	app.apollo.io → Settings → API
`OPENAI_API_KEY`	GPT-4o outreach generation	platform.openai.com
`GITHUB_TOKEN`	GitHub AI repo enrichment	github.com → Settings → Tokens
`CLAY_API_KEY`	Clay enrichment	clay.com → Account → API
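
A quick pre-flight check can report which of the keys above are unset before any API call is made. The sketch reads from `os.environ`; if the project loads `.env` through python-dotenv, calling its `load_dotenv()` first would populate the same mapping. The `missing_keys` helper is hypothetical.

```python
import os
from typing import Mapping

REQUIRED_KEYS = {
    "APOLLO_API_KEY": "Apollo sourcing",
    "OPENAI_API_KEY": "GPT-4o outreach generation",
    "GITHUB_TOKEN": "GitHub AI repo enrichment",
    "CLAY_API_KEY": "Clay enrichment",
}


def missing_keys(env: Mapping[str, str] = os.environ) -> list:
    """Return the required keys that are absent or blank in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


for key in missing_keys():
    print(f"Missing {key} (needed for: {REQUIRED_KEYS[key]})")
```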

Automated Pipeline (n8n)

The full pipeline is automated via n8n, allowing one-click end-to-end execution without touching the command line.

Setup

Step 1 — Install n8n

npm install -g n8n

Step 2 — Set environment variable

On Windows (run PowerShell as Administrator):

[System.Environment]::SetEnvironmentVariable("NODE_FUNCTION_ALLOW_BUILTIN", "child_process", "User")
[System.Environment]::SetEnvironmentVariable("NODE_FUNCTION_ALLOW_EXTERNAL", "child_process", "User")

On Mac/Linux:

export NODE_FUNCTION_ALLOW_BUILTIN=child_process
export NODE_FUNCTION_ALLOW_EXTERNAL=child_process

Step 3 — Start n8n from the project root

cd LeadForge
n8n start

Step 4 — Import the workflow

Open http://localhost:5678
Go to Workflows → Import
Select workflows/leadforge_pipeline.json

Step 5 — Execute

Click Execute Workflow
All 12 pipeline nodes run in sequence automatically

Workflow Nodes

Manual Trigger
      ↓
Compile Leads           ← compile_leads.py
      ↓
Normalize Apollo        ← normalize_apollo.py
      ↓
Normalize LinkedIn      ← normalize_linkedin.py
      ↓
Normalize Seeds         ← normalize_seeds.py
      ↓
Normalize Gaps          ← normalize_gaps.py
      ↓
Enrich Pipeline         ← enrich_pipeline.py
      ↓
Prefilter Competitors   ← prefilter.py
      ↓
Score Leads             ← scoring_engine.py
      ↓
Validate Emails         ← validate_emails.py
      ↓
Generate LinkedIn DMs   ← generate_linkedin_dms.py
      ↓
Sanity Check            ← validate_row.py

Each node outputs { success: true, output: "..." } — visible in the n8n execution log for full auditability.

The node chain above reflects the workflow shown in the n8n editor screenshot.

Running the Workflow

The workflow executes all 12 nodes in sequence (the manual trigger plus 11 script nodes). A sample execution from the n8n web UI at http://localhost:5678 is captured in docs/n8n_workflow_execution.png.

Total execution time for full pipeline: ~8–10 minutes (varies based on API rate limits and enrichment latency).

Tip: To monitor live execution, open the "Executions" tab in the n8n web UI. All logs are visible in real-time as nodes complete.

Data Sourcing Guide

Lead sourcing from Apollo.io and LinkedIn Sales Navigator requires manual CSV export due to platform API restrictions on free/standard plans. All exported files are included in data/raw/ so the full pipeline can be re-run without re-sourcing.

To re-source fresh leads:

Apollo.io (target: 80 leads)

Go to apollo.io → Search → People
Job titles: CTO, VP Engineering, Head of AI, Director of Engineering
Headcount: 200–5,000
Industries: Computer Software, Financial Services, Healthcare
Keywords: machine learning, LLM, AI inference, GPU, generative AI
Exclude: baseten.co, modal.com, anyscale.com, fireworks.ai, together.ai, replicate.com
Export CSV → save to data/raw/apollo_pass1_*.csv

LinkedIn Sales Navigator (target: 60 leads)

Go to Sales Navigator → Lead Filters
Same job titles as above
Seniority: VP, CXO, Director
Headcount: 200–5,000
Activity: Posted on LinkedIn
Geography: US, UK, Germany, France, Netherlands, India
Save list → export → save to data/raw/linkedin_pass*.csv

After re-sourcing:

python pipeline.py
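
Before re-running, it can help to confirm that the fresh exports landed where the normalizers expect them. The sketch below looks for files matching the naming patterns from the steps above; the `find_exports` and `missing_sources` helpers are hypothetical, and the demo uses a temporary directory in place of data/raw/.

```python
import tempfile
from pathlib import Path

PATTERNS = {
    "apollo": "apollo_pass*.csv",
    "linkedin": "linkedin_pass*.csv",
}


def find_exports(raw_dir: Path) -> dict:
    """List export file names per source, matched by naming pattern."""
    return {
        source: sorted(path.name for path in raw_dir.glob(pattern))
        for source, pattern in PATTERNS.items()
    }


def missing_sources(raw_dir: Path) -> list:
    """Return the sources that have no export file in raw_dir."""
    return [src for src, files in find_exports(raw_dir).items() if not files]


# Demo against a temporary directory; point raw_dir at data/raw in the repo.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "apollo_pass1_us_saas_fintech.csv").touch()
    (raw / "apollo_normalized.csv").touch()  # pipeline output, not an export
    print(find_exports(raw))
    print(missing_sources(raw))
```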

Key Output Files

File	Rows	Description
`data/raw_leads.csv`	297	All normalised leads pre-filter
`data/raw_leads_rejected.csv`	6	Disqualified leads with reasons
`data/enriched_leads.csv`	297	Post-Clay enrichment master list
`data/scored_leads.csv`	291	Scored + tiered, all signals
`data/scoring_report.md`	—	Score distribution + top leads
`data/sourcing_qa_report.md`	—	Vertical/geo quota validation
`data/phase5_outreach.csv`	50	Email subject, body, LinkedIn DM
`data/phase6_ab_variants.csv`	40	A/B variant messages for top 20
`ab_test_hypotheses.md`	—	A/B test design + success metrics
`icp_framework.md`	—	Full ICP definition + scoring rubric
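
The row counts in this table and the tier counts in Results at a Glance should reconcile: 297 raw leads minus 6 rejected gives 291 scored, and 135 + 57 + 99 also gives 291. A small check along these lines could be run after each refresh; the helper names are hypothetical, and `count_rows` assumes each CSV has a header row.

```python
import csv
import io


def count_rows(csv_text: str) -> int:
    """Count data rows in CSV text, excluding the header."""
    return sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))


def reconcile(raw: int, rejected: int, scored: int, tiers: dict) -> list:
    """Return a list of human-readable mismatches (empty if consistent)."""
    problems = []
    if raw - rejected != scored:
        problems.append(f"raw - rejected = {raw - rejected}, scored = {scored}")
    if sum(tiers.values()) != scored:
        problems.append(f"tier total = {sum(tiers.values())}, scored = {scored}")
    return problems


# Counts as reported in this README.
print(reconcile(297, 6, 291, {"Hot": 135, "Warm": 57, "Cold": 99}))  # []
```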

Directory Structure

LeadForge/
├── README.md
├── icp_framework.md
├── ab_test_hypotheses.md
├── pipeline.py
├── fix_violations.py
├── requirements.txt
├── .env.template
│
├── workflows/
│   └── leadforge_pipeline.json     ← n8n workflow (import to re-run)
│
├── data/
│   ├── SCHEMA.md
│   ├── raw_leads.csv
│   ├── raw_leads_rejected.csv
│   ├── enriched_leads.csv
│   ├── enriched_leads_3a_backup.csv
│   ├── scored_leads.csv
│   ├── scored_leads_validated.csv
│   ├── phase5_outreach.csv
│   ├── phase6_ab_variants.csv
│   ├── scoring_report.md
│   ├── sourcing_qa_report.md
│   ├── enrichment_run_log.md
│   ├── enrichment_3b_log.md
│   └── raw/
│       ├── apollo_normalized.csv
│       ├── apollo_pass1_us_saas_fintech.csv
│       ├── apollo_pass2_eu_uk.csv
│       ├── apollo_pass3_healthtech_cybersec.csv
│       ├── apollo_pass4_flex.csv
│       ├── linkedin_normalized.csv
│       ├── linkedin_pass1_us.csv
│       ├── linkedin_pass2_eu.csv
│       ├── linkedin_pass3_india.csv
│       ├── seeds_normalized.csv
│       ├── seeds_raw.csv
│       ├── engineer_normalized.csv
│       ├── aayush_normalized.csv
│       ├── builtwith_raw.csv
│       ├── crunchbase_raw.csv
│       ├── github_raw.csv
│       ├── cyber_normalized_gap.csv
│       ├── cyber_people_gap.csv
│       ├── ecommerce_normalized_gap.csv
│       ├── ecommerce_people_gap.csv
│       ├── healthtech_normalized_gap.csv
│       ├── healthtech_people_gap.csv
│       ├── logistics_normalized_gap.csv
│       └── logistics_people_gap.csv
│
├── scripts/
│   ├── compile_leads.py
│   ├── normalize_apollo.py
│   ├── normalize_linkedin.py
│   ├── normalize_seeds.py
│   ├── normalize_engineer_sources.py
│   ├── normalize_gaps.py
│   ├── prefilter.py
│   ├── quota_check.py
│   ├── enrich_pipeline.py
│   ├── enrich_3b.py
│   ├── scoring_engine.py
│   ├── generate_linkedin_dms.py
│   ├── validate_emails.py
│   ├── validate_row.py
│   └── enrichers/
│       ├── __init__.py
│       ├── apollo_enricher.py
│       ├── crunchbase_enricher.py
│       ├── github_enricher.py
│       └── jobs_enricher.py
│
├── docs/
│   ├── team_roles.md
│   ├── clay_setup.md
│   ├── n8n_workflow_execution.png
│   └── clay_screenshots/
│
└── logs/
