SalaryLens is an intelligent salary prediction & exploration platform that uses machine learning to deliver data-driven compensation insights for software engineers worldwide. It is trained on the Stack Overflow Developer Survey 2023, using 30,000+ real-world data points.
Every developer eventually asks: "Am I being paid fairly?"
The global tech job market is a maze of conflicting salary data. Developers routinely face:
flowchart LR
A["YOU"] --> B["Country\nEducation\nExperience"]
B --> C["SalaryLens\nML Engine"]
C --> D["Predicted\nSalary"]
SalaryLens turns that uncertainty into a precise, data-backed answer, instantly.
mindmap
  root((SalaryLens))
    Predict
      14 Countries
      4 Education Levels
      0-50 Years Experience
      Instant USD Estimate
    Explore
      Country Distribution Pie Chart
      Mean Salary by Country Bar Chart
      Salary vs Experience Line Chart
    ML Engine
      Decision Tree Regressor
      GridSearchCV Tuned
      Label Encoded Features
    Data Pipeline
      65K Raw Responses
      30K Cleaned Records
      Outlier Capping
      Category Consolidation
Predict Page: Get Your Estimate
How it works: Select your country, education level, and years of experience, click Calculate Salary, and get your predicted compensation instantly.
Interface highlights:
- Dropdown with 14 supported countries
- Education level from "Less than a Bachelors" to "Post grad"
- Smooth slider for experience (0-50 years)
- Collapsible explainer: "Knowing the market rate for your skills can help you negotiate better salaries"
Explore: Where Are Developers?
Distribution highlights from the Stack Overflow 2023 Survey:
| Country | Share of respondents |
|---|---|
| United States | 36.9% |
| United Kingdom | 10.8% |
| India | 9.7% |
| Germany | 6.5% |
| France | 6.0% |
| Canada | 5.0% |
| Brazil | 4.1% |
| Netherlands | 3.4% |
| Spain | 3.3% |
| Poland | 3.2% |
| Australia | 2.9% |
| Italy | 2.8% |
| Sweden | 2.6% |
| Russia | 2.6% |
Insight: The US alone accounts for more than a third of all respondents, which gives the model very strong signal for American salary predictions.
Explore: Salary by Country
| Tier | Countries | Avg. Salary Range |
|---|---|---|
| Top | United States | $120,000+ |
| High | Australia, Canada, Germany | $65K - $77K |
| Mid | UK, Netherlands, Sweden, France | $50K - $70K |
| Emerging | India, Brazil, Poland, Russia | $28K - $40K |
Insight: A developer in the US earns ~4x what the same profile earns in India or Brazil; geography remains the single biggest salary factor.
Explore: Salary by Experience
Career stages decoded:
| Stage | Years | Typical Salary | Growth Rate |
|---|---|---|---|
| Entry | 0-3 | $55K - $70K | Fastest growth |
| Growth | 4-10 | $75K - $95K | Steep upward curve |
| Plateau | 11-35 | $95K - $120K | Steady, gradual |
| Senior | 36-50 | $120K - $180K | Peaks with volatility |
Insight: The steepest salary jump happens in the first 10 years. After that, specialization, leadership, and geography matter more than raw experience.
flowchart LR
subgraph DATA ["Data Layer"]
  A[Stack Overflow\nSurvey 2023\n~65K responses]
end
subgraph PROCESS ["Processing"]
  B[Filter Full-Time\nEmployees]
  C[Handle Nulls\n& Outliers]
  D[Encode Categories\nLabelEncoder]
end
subgraph ML ["ML Layer"]
  E[Train 3 Models\nLinReg / DTree / RF]
  F[GridSearchCV\nHyperparameter Tuning]
  G[Best Model\nDecision Tree]
end
subgraph APP ["App Layer"]
  H[Streamlit\nWeb Interface]
  I[Prediction]
  J[Exploration]
end
A --> B --> C --> D --> E --> F --> G --> H
H --> I
H --> J
| # | Stage | What Happens | Key Details |
|---|---|---|---|
| 1 | Ingest | Load raw CSV from Stack Overflow | 65,000 survey responses |
| 2 | Filter | Keep only full-time employed devs | Drops part-time, freelance, unemployed |
| 3 | Clean | Remove nulls, cap salary outliers | Range: $10K - $250K |
| 4 | Consolidate | Group rare countries into "Other" | Threshold: 400+ responses to keep |
| 5 | Engineer | Standardize education into 4 tiers | Map 15+ raw categories to 4 |
| 6 | Encode | LabelEncoder for country & education | Numeric representation for ML |
| 7 | Train | Fit 3 regression algorithms | Compare RMSE across models |
| 8 | Tune | GridSearchCV on Decision Tree | Optimize max_depth parameter |
| 9 | Serialize | Pickle model + encoders | saved_steps.pkl for production |
| 10 | Deploy | Serve via Streamlit | Two-page app: Predict + Explore |
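The consolidation stage in the table can be illustrated with a small standard-library sketch. This is not the project's code (which is private): the function name `consolidate_countries` is hypothetical, and the demo uses a threshold of 2 rather than 400 so the effect is visible on a tiny list.

```python
from collections import Counter


def consolidate_countries(countries, threshold=400):
    """Relabel countries with fewer than `threshold` responses as "Other"."""
    counts = Counter(countries)
    return [c if counts[c] >= threshold else "Other" for c in countries]


# Tiny illustration: threshold of 2 instead of the 400 used in the pipeline.
sample = ["India", "India", "Spain", "Fiji", "Spain", "India"]
grouped = consolidate_countries(sample, threshold=2)

# The pipeline then removes the "Other" bucket entirely.
kept = [c for c in grouped if c != "Other"]

print(grouped)
print(kept)
```

Counting first and relabeling second keeps the rule independent of row order, which matters when the same threshold is applied to a full survey export.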
SalaryLens/
│
├── app.py                      # Entry point, sidebar navigation
│   ├── predict_page.py         # Prediction UI + model inference
│   └── explore_page.py         # Data viz: pie, bar, line charts
│
├── SalaryPrediction.ipynb      # Full ML experimentation notebook
├── saved_steps.pkl             # Serialized model + label encoders
└── survey_results_public.csv   # Stack Overflow raw dataset
graph TD
A["app.py"] -->|Sidebar: Predict| B["predict_page.py"]
A -->|Sidebar: Explore| C["explore_page.py"]
B --> D["saved_steps.pkl"]
D --> E["DecisionTreeRegressor"]
D --> F["LabelEncoder × 2"]
E --> G["Predicted Salary"]
C --> H["survey_results_public.csv"]
H --> I["Pie: Country Distribution"]
H --> J["Bar: Salary by Country"]
H --> K["Line: Salary by Experience"]
style A fill:#58a6ff,stroke:#1f6feb,color:#0d1117
style G fill:#3fb950,stroke:#238636,color:#0d1117
style I fill:#d2a8ff,stroke:#8b5cf6,color:#0d1117
style J fill:#d2a8ff,stroke:#8b5cf6,color:#0d1117
style K fill:#d2a8ff,stroke:#8b5cf6,color:#0d1117
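The path from the pickled bundle to a model-ready feature row might look roughly like the sketch below. It uses only the standard library: the dictionary keys `le_country` and `le_education` are assumptions, the encoders are replaced by plain label-to-code dicts (codes follow sorted label order, as scikit-learn's `LabelEncoder` does), and an in-memory buffer stands in for `saved_steps.pkl`.

```python
import io
import pickle


def fit_codes(labels):
    """Hypothetical stand-in for a fitted label encoder: label -> integer code."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


bundle = {
    "le_country": fit_codes(["United States", "India", "Germany"]),
    "le_education": fit_codes(["Less than a Bachelors", "Post grad"]),
    # A fitted estimator would normally be stored alongside, e.g. under "model".
}

# Round-trip through pickle using an in-memory buffer (no file is written).
buffer = io.BytesIO()
pickle.dump(bundle, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# Feature row in the order the model expects: country, education, experience.
features = [
    restored["le_country"]["India"],
    restored["le_education"]["Post grad"],
    5.0,
]
print(features)
```

Storing the encoders next to the model keeps the integer codes used at prediction time identical to the ones seen during training. As with any pickle file, only load bundles from a trusted source.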
Three algorithms went head-to-head on the same data:
| Model | RMSE (USD) | Verdict |
|---|---|---|
| Linear Regression | ~$30,500 | ⚪ Solid baseline, underfits complex patterns |
| Decision Tree | $30,428 | 🟢 Winner: best bias-variance tradeoff after tuning |
| Random Forest | $29,487 | 🟡 Lowest training error, but overfitting risk |
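For readers unfamiliar with the metric in the table, RMSE is the square root of the mean squared difference between actual and predicted salaries, so it is expressed in the same unit (USD) as the target. A minimal sketch with made-up numbers (not taken from the survey):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only.
actual = [60_000, 90_000, 120_000]
predicted = [50_000, 100_000, 120_000]
print(round(rmse(actual, predicted)))  # about 8165
```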
Decision Tree Regressor was selected, tuned via GridSearchCV with `max_depth` ∈ {None, 2, 4, 6, 8, 10, 12} using `neg_mean_squared_error` scoring.
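A tuning run of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the project's notebook: the data is synthetic (three numeric columns standing in for encoded country, encoded education, and years of experience), and `cv=5` and `random_state=0` are choices made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features (assumption, not survey data).
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(0, 14, size=n),  # encoded country
    rng.integers(0, 4, size=n),   # encoded education
    rng.uniform(0, 50, size=n),   # years of experience
])
y = 30_000 + 4_000 * X[:, 0] + 5_000 * X[:, 1] + 1_000 * X[:, 2] + rng.normal(0, 5_000, size=n)

# Same depth grid and scoring as described in the text.
param_grid = {"max_depth": [None, 2, 4, 6, 8, 10, 12]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
# The score is a negated MSE, so flip the sign before taking the square root.
cv_rmse = float(np.sqrt(-search.best_score_))
print(best_depth, round(cv_rmse))
```

Because scikit-learn maximizes scores, MSE is exposed in negated form; converting `best_score_` back to RMSE makes it comparable with the figures in the model table.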
Why not Random Forest?
While Random Forest achieved a lower RMSE on the training set ($29,487 vs $30,428), the Decision Tree with optimized depth showed better generalization to unseen data. The small RMSE gap (~$1K) didn't justify the added complexity and overfitting risk of the ensemble approach for this feature set of only 3 input variables.
Countries supported:
| United States | India | United Kingdom | Germany | Canada |
| Brazil | France | Spain | Australia | Netherlands |
| Poland | Italy | Russia | Sweden | |
Education tiers supported:
| Code | Level | Includes |
|---|---|---|
| L1 | Less than a Bachelors | High school, associate degree, bootcamp, self-taught |
| L2 | Bachelor's Degree | Any undergraduate degree |
| L3 | Master's Degree | Graduate-level education |
| L4 | Post Grad | Professional or doctoral degree (PhD, MD, JD) |
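Collapsing the raw survey answers into these four tiers could be done with simple keyword matching, as in the sketch below. The function name `clean_education` and the keyword rules are assumptions for illustration; the example input strings are paraphrased, and the real mapping in the private source may differ.

```python
def clean_education(raw):
    """Map a raw education answer to one of the four tiers (keyword-based sketch)."""
    text = raw.lower()
    if "bachelor" in text:
        return "Bachelor's Degree"
    if "master" in text:
        return "Master's Degree"
    if "professional" in text or "doctoral" in text:
        return "Post Grad"
    # Everything else (high school, associate degree, bootcamp, self-taught).
    return "Less than a Bachelors"


examples = [
    "Bachelor's degree (B.A., B.S., B.Eng., etc.)",
    "Master's degree (M.A., M.S., M.Eng., MBA, etc.)",
    "Other doctoral degree",
    "Associate degree",
]
tiers = [clean_education(e) for e in examples]
print(tiers)
```

Matching on a lowercase keyword rather than the full answer text makes the mapping tolerant of punctuation differences, such as straight versus curly apostrophes in the survey export.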
Full data stats & cleaning steps
| Metric | Value |
|---|---|
| Raw responses | ~65,000 |
| After cleaning | ~30,000 |
| Salary range | $10,000 - $250,000 (USD) |
| Countries | 14 (400+ response threshold) |
| Education tiers | 4 (consolidated from 15+) |
| Experience range | 0 - 50 years |
| Employment filter | Full-time only |
| Survey year | 2023 |
| Source | Stack Overflow Annual Developer Survey |
Cleaning pipeline:
- Selected 5 key columns: Country, EdLevel, YearsCodePro, Employment, Salary
- Dropped rows with null salary → kept 34,756 rows
- Removed remaining nulls → 34,000+ clean rows
- Filtered for full-time employment only → 30,019 rows
- Grouped countries with < 400 responses into "Other", then removed "Other"
- Capped salaries to $10K-$250K to remove extreme outliers
- Mapped experience strings ("Less than 1 year", "More than 50 years") to floats
- Consolidated 15+ education categories into 4 clean tiers
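Two of the steps above, the experience mapping and the salary range, can be sketched in plain Python. The helper names are hypothetical, the values 0.5 and 50.0 for the two string answers are assumptions, and treating the salary cap as dropping out-of-range rows is one reading of the list rather than a confirmed detail.

```python
def clean_experience(value):
    """Convert a YearsCodePro answer to a float (string cases are assumed values)."""
    if value == "More than 50 years":
        return 50.0
    if value == "Less than 1 year":
        return 0.5
    return float(value)


def within_salary_range(salary, low=10_000, high=250_000):
    """True when the salary falls inside the $10K-$250K window."""
    return low <= salary <= high


rows = [
    {"YearsCodePro": "Less than 1 year", "Salary": 42_000},
    {"YearsCodePro": "7", "Salary": 95_000},
    {"YearsCodePro": "More than 50 years", "Salary": 400_000},
]

# Keep only in-range salaries, then convert experience to a float.
cleaned = [
    {"years": clean_experience(r["YearsCodePro"]), "salary": r["Salary"]}
    for r in rows
    if within_salary_range(r["Salary"])
]
print(cleaned)
```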
Note: This is a showcase repository; the full source code is private. Interested in collaborating? Reach out.
# Clone (requires private access)
git clone https://github.com/shanskarBansal/SalaryLens.git
cd SalaryLens
# Install dependencies
pip install streamlit pandas scikit-learn matplotlib numpy
# Place the Stack Overflow Developer Survey 2023 CSV in project root
# Download from: https://survey.stackoverflow.co/
# Launch the app
streamlit run app.py
Requirements:
python >= 3.8
streamlit >= 1.28
pandas >= 1.5
scikit-learn >= 1.2
matplotlib >= 3.6
numpy >= 1.23
| Status | Feature | Description |
|---|---|---|
| 🟢 | Salary Prediction | Core ML prediction engine |
| 🟢 | Data Exploration | Interactive charts & visualizations |
| 🟢 | Multi-country Support | 14 countries covered |
| 🟡 | Job Role Filtering | Predict by role: Frontend, Backend, DevOps, etc. |
| 🟡 | Company Size Factor | Startup vs Enterprise salary adjustments |
| 🔴 | REST API | Expose predictions as an API endpoint |
| 🔴 | XGBoost / LightGBM | Upgrade to gradient-boosted models |
| 🔴 | Auto-updating Data | Live integration with latest SO surveys |
| 🔴 | Mobile-First UI | Responsive design for all devices |
🟢 Done · 🟡 Planned · 🔴 Future
| Developer |
|---|
| Harsh Bir |
| Priyanshu Dayal |
| Shanskar Bansal |
| Saloni Thakur |
The complete source code for SalaryLens is maintained in a private repository. This showcase repo demonstrates the platform's capabilities, architecture, methodology, and results.
Want access or interested in collaborating? Reach out via GitHub: @shanskarBansal
- Stack Overflow: Developer Survey 2023 dataset
- Streamlit: Python web framework
- Scikit-Learn: ML algorithms & tools
- Python: Language & ecosystem




