Beyond keywords. Beyond filters. The AI brain for modern hiring.
Traditional keyword filters miss hidden gems: candidates whose potential shows up in behavioral signals and career trajectories rather than in keyword overlap. This system is a multi-signal predictive ranking engine built to surface them.

    JD (raw text)
         │
         ▼
    [LLM JD Parser]──────────────────────────────────────┐
    (Gemini Flash)                                       │
         │                                               │
         ▼                                               ▼
    [Embedding Engine]                       [Structured Requirements]
    (all-mpnet-base-v2)                      required_skills, seniority,
         │                                   min/max experience, domain
         ▼
    [Vector DB ANN Retrieval]  ← Candidate Pool (pre-indexed embeddings)
    (ChromaDB + HNSW)
         │
         ▼  Top-200 candidates
    [Multi-Signal Ranking Engine]
         ├── Semantic Score    (40%) - embedding cosine similarity
         ├── Skill Match Score (25%) - required + nice-to-have skill overlap
         ├── Behavioral Score  (20%) - recency decay, intent signals, engagement
         └── Career Score      (15%) - velocity, trajectory, hidden gem bonus
         │
         ▼  Top-20
    [LLM Re-Ranker + Explainer]  ← Gemini Flash generates match explanation
    (optional, adds explainability)
         │
         ▼
    [Ranked Output]
    ranked_output.csv | REST API | Streamlit Demo UI

Setup and run:

    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    pip install -r requirements.txt

    cp .env.example .env
    # Add your Gemini API key to .env
    GEMINI_API_KEY=your_key_here

    # Launch the Streamlit demo UI
    streamlit run frontend/streamlit_app.py

    # Launch the REST API
    uvicorn backend.api.main:app --reload
    # Swagger UI: http://localhost:8000/docs

    # Or run with Docker
    docker-compose up

The semantic score uses sentence-transformers/all-mpnet-base-v2 to encode both the JD and candidate profiles into 768-dimensional vectors. Cosine similarity captures semantic meaning beyond keyword overlap: "built NLP pipelines" matches "natural language processing experience" even with zero shared keywords.
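
As a rough illustration of the cosine comparison described above, the sketch below scores toy 3-dimensional vectors in plain Python. The `cosine_similarity` helper and the sample vectors are hypothetical; in the actual pipeline the inputs would be the 768-dimensional embeddings produced by all-mpnet-base-v2.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


# Toy stand-ins for a JD embedding and two candidate embeddings.
jd_vec = [0.9, 0.1, 0.3]
close_candidate = [0.8, 0.2, 0.4]
far_candidate = [-0.1, 0.9, -0.2]

print(cosine_similarity(jd_vec, close_candidate))
print(cosine_similarity(jd_vec, far_candidate))
```

The first candidate points in nearly the same direction as the JD vector and scores close to 1, while the second scores slightly below 0. Vector length does not affect the result, only direction.
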
The skill match score uses fuzzy skill matching with alias normalization (tf → tensorflow, k8s → kubernetes). It separates required skills (weighted 3x) from nice-to-haves.
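
One way the alias normalization and 3x weighting could fit together is sketched below. Only the two aliases and the 3x weight come from the text. The function names, the use of `difflib` for fuzzy matching, the 0.85 cutoff, and the 1x weight for nice-to-haves are assumptions made for illustration.

```python
from difflib import get_close_matches

# The two aliases below come from the text; a real table would be larger.
ALIASES = {"tf": "tensorflow", "k8s": "kubernetes"}
REQUIRED_WEIGHT = 3.0      # required skills count 3x, as stated above
NICE_TO_HAVE_WEIGHT = 1.0  # assumption: nice-to-haves count 1x


def normalize(skill: str) -> str:
    """Lowercase, trim, and map known aliases to a canonical name."""
    key = skill.strip().lower()
    return ALIASES.get(key, key)


def has_skill(target: str, candidate_skills: list[str], cutoff: float = 0.85) -> bool:
    """Fuzzy membership test; the 0.85 cutoff is an illustrative guess."""
    return bool(get_close_matches(target, candidate_skills, n=1, cutoff=cutoff))


def skill_match_score(required, nice_to_have, candidate) -> float:
    """Weighted fraction of JD skills found in the candidate profile (0-1)."""
    cand = [normalize(s) for s in candidate]
    total = matched = 0.0
    for skills, weight in ((required, REQUIRED_WEIGHT), (nice_to_have, NICE_TO_HAVE_WEIGHT)):
        for skill in skills:
            total += weight
            if has_skill(normalize(skill), cand):
                matched += weight
    return matched / total if total else 0.0


score = skill_match_score(
    required=["TensorFlow", "Kubernetes"],
    nice_to_have=["Docker"],
    candidate=["tf", "k8s", "python"],
)
print(round(score, 3))
```

In this example both required skills match through their aliases and the nice-to-have does not, so the score is 6 of a possible 7 weight units.
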
The behavioral score converts activity data into intent scores using exponential decay:
- Recency: score = exp(-0.023 × days_since_active), so a candidate active today scores 1.0 and one active 30 days ago scores ≈ 0.5
- Engagement: Applications, profile views, job clicks
- Intent: Open-to-work status, resume recency, active application behavior
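
The recency formula can be checked with a few lines of Python. This is a minimal sketch: the function and constant names are hypothetical, and only the 0.023 decay rate comes from the text.

```python
import math

DECAY_RATE = 0.023  # per day, from the formula above; roughly ln(2) / 30


def recency_score(days_since_active: float) -> float:
    """Exponential decay: 1.0 for activity today, about 0.5 after 30 days."""
    if days_since_active < 0:
        raise ValueError("days_since_active must be non-negative")
    return math.exp(-DECAY_RATE * days_since_active)


for days in (0, 7, 30, 90):
    print(days, round(recency_score(days), 3))
```

A rate of 0.023 per day corresponds to a half-life of about 30 days, so the score halves roughly every month of inactivity.
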
The career score detects growth velocity and hidden potential:
- Velocity: Reached current seniority in fewer years than expected → fast-tracker bonus
- Progression: Promotions count, distinct title changes
- Hidden Gem Bonus: Open-source contributions (+6%), side projects (+4%), publications (+5%), recent career switcher (+8%)
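
The hidden gem bonus could be applied along the lines of the sketch below. The four bonus values come from the list above. The key names, the additive combination, and the clamp to the 0-1 range are assumptions, and the velocity and progression components are left out because their formulas are not given here.

```python
# Bonus values come from the list above; the key names are hypothetical.
HIDDEN_GEM_BONUSES = {
    "open_source": 0.06,
    "side_projects": 0.04,
    "publications": 0.05,
    "recent_career_switch": 0.08,
}


def hidden_gem_bonus(signals: set[str]) -> float:
    """Sum the bonuses for every hidden-gem signal present on a profile."""
    return sum(b for name, b in HIDDEN_GEM_BONUSES.items() if name in signals)


def career_score(base_score: float, signals: set[str]) -> float:
    """Add the bonus to a base trajectory score and cap the result at 1.0.

    Assumption: bonuses are additive and the score is clamped to [0, 1].
    """
    return min(1.0, max(0.0, base_score + hidden_gem_bonus(signals)))


print(round(career_score(0.60, {"open_source", "publications"}), 2))
```
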
The top 20 candidates are then sent to Gemini Flash for contextual re-scoring and human-readable match explanations. This optional step adds explainability: each re-ranked candidate comes with a one-sentence reason.
data/ranked_output.csv columns:
| Column | Description |
|---|---|
| rank | Final rank (1 = best fit) |
| candidate_id | Unique identifier |
| name | Candidate name |
| current_title | Current job title |
| composite_score | Final weighted score (0-1) |
| semantic_score | JD-profile semantic similarity |
| skill_match_score | Skill overlap score |
| behavioral_score | Activity/intent signal score |
| career_score | Career trajectory score |
| confidence | HIGH / MEDIUM / LOW |
| match_explanation | LLM-generated reason for ranking |
| matched_skills | Skills that matched the JD |
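
To show how the composite_score and rank columns relate to the four signal columns, here is a minimal sketch of a weighted sum using the 40/25/20/15 weights from the pipeline diagram. The function names and the sample candidates are hypothetical, and the repo's ranking_engine.py may combine the signals differently.

```python
# Weights as listed in the pipeline diagram; they sum to 1.0.
WEIGHTS = {
    "semantic_score": 0.40,
    "skill_match_score": 0.25,
    "behavioral_score": 0.20,
    "career_score": 0.15,
}


def composite_score(signals: dict[str, float]) -> float:
    """Weighted sum of the four per-signal scores, each expected in [0, 1]."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())


def rank_candidates(candidates: list[dict]) -> list[dict]:
    """Return new rows sorted by composite score, with a 1-based rank added."""
    scored = [{**c, "composite_score": composite_score(c)} for c in candidates]
    scored.sort(key=lambda row: row["composite_score"], reverse=True)
    return [{**row, "rank": i} for i, row in enumerate(scored, start=1)]


# Hypothetical candidates with made-up signal values.
pool = [
    {"candidate_id": "c1", "semantic_score": 0.82, "skill_match_score": 0.60,
     "behavioral_score": 0.90, "career_score": 0.40},
    {"candidate_id": "c2", "semantic_score": 0.75, "skill_match_score": 0.90,
     "behavioral_score": 0.50, "career_score": 0.70},
]
for row in rank_candidates(pool):
    print(row["rank"], row["candidate_id"], round(row["composite_score"], 3))
```

Here c2 ranks first despite a lower semantic score, because its stronger skill and career signals outweigh the gap.
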

Comparison with simpler approaches:
| Feature | Keyword Filter | Cosine Similarity Only | This System |
|---|---|---|---|
| Semantic understanding | ❌ | ✅ | ✅ |
| Behavioral signals | ❌ | ❌ | ✅ |
| Career trajectory | ❌ | ❌ | ✅ |
| Hidden gem detection | ❌ | ❌ | ✅ |
| Explainable ranking | ❌ | ❌ | ✅ |
| Sub-second retrieval | ✅ | ✅ | ✅ (HNSW ANN) |
| Tuneable weights | ❌ | ❌ | ✅ |

Tech stack:
| Component | Technology |
|---|---|
| LLM / Parsing | Gemini 1.5 Flash |
| Embeddings | sentence-transformers/all-mpnet-base-v2 |
| Vector DB | ChromaDB (HNSW-based ANN) |
| Backend API | FastAPI + Uvicorn |
| Demo UI | Streamlit |
| Containerization | Docker + docker-compose |

Project structure:

    intelligent-candidate-discovery/
    ├── README.md
    ├── CLAUDE.md                   ← AI-assistant context for the repo
    ├── requirements.txt
    ├── Makefile                    ← make install · test · run-api · demo · index · rank
    ├── Dockerfile
    ├── docker-compose.yml
    ├── .env.example
    ├── intelligent_candidate_discovery_architecture.svg             ← MVP pipeline
    ├── intelligent_candidate_discovery_production_architecture.svg  ← Production target
    ├── data/
    │   └── ranked_output.csv       ← Submission file (output)
    ├── data_tier/
    │   ├── vector_store.py         ← ChromaDB HNSW ANN wrapper
    │   └── fixtures/
    │       ├── sample_candidates.json
    │       └── sample_jd.txt
    ├── backend/
    │   ├── api/main.py             ← FastAPI: POST /rank · /index · GET /health
    │   ├── core/
    │   │   ├── jd_parser.py        ← Gemini Flash · keyword fallback
    │   │   ├── candidate_parser.py ← Raw record → internal schema
    │   │   └── ranking_engine.py   ← 4-signal weighted fusion
    │   └── services/
    │       └── embedder.py         ← sentence-transformers wrapper
    ├── frontend/
    │   └── streamlit_app.py        ← Demo UI
    ├── shared/
    │   └── config.py               ← All tuneable parameters
    ├── docs/
    │   ├── architecture.md         ← MVP layers + module map
    │   └── methodology.md          ← Per-signal scoring formulas
    └── tests/
        ├── test_jd_parser.py
        ├── test_ranking_engine.py
        ├── test_behavioral_scorer.py
        └── test_skill_matcher.py

Two architecture diagrams ship in the repo:
| Diagram | Purpose |
|---|---|
| intelligent_candidate_discovery_architecture.svg | The MVP pipeline running in this repo |
| intelligent_candidate_discovery_production_architecture.svg | The production target: 18 MVP components running in a 10-service docker-compose stack · 8 production-shaped stubs · 6 honest deferrals (ATS webhooks, scrapers, HRIS CDC, BigQuery, Feast, drift pipelines) |
See docs/architecture.md for the layer-by-layer walkthrough.
Track 01: The Data & AI Challenge — Intelligent Candidate Discovery
Redrob AI × Hack2Skill | 42-Day Challenge | ₹10 Lakh Prize Pool
Built by Omkar — solving India's talent discovery problem with AI