πŸ“Š Data Science Job Market Analysis: Skills, Salaries & Trends

πŸ“Œ Project Overview

This project performs an end-to-end exploratory data analysis (EDA) on a global Data Science job market dataset to uncover insights related to:

  • πŸ’° Salary trends
  • 🧠 Experience-level impact on compensation
  • 🌍 Location-based hiring patterns
  • 🏒 Company size and work setting influence
  • πŸ“ˆ Market evolution over time

🎯 Objectives

  • Clean and standardize raw job market data using Pandas
  • Perform detailed exploratory analysis using Pandas
  • Visualize trends using Matplotlib & Seaborn
  • Answer business-style analytical questions using MySQL
  • Identify high-paying roles, in-demand experience levels, and geographic trends

🧰 Tech Stack

| Category | Tools |
| --- | --- |
| Data Cleaning | Pandas |
| Data Analysis | Pandas, SQL |
| Database | MySQL |
| Visualization | Matplotlib, Seaborn |
| Environment | Jupyter Notebook |

πŸ“‚ Dataset Description

The dataset contains global job postings related to Data Science and AI roles with features such as:

  • work_year
  • job_role
  • job_category
  • experience_level
  • employment_type
  • work_setting
  • company_location
  • company_size
  • salary_in_usd

πŸ’‘ Only USD-normalized salaries were retained to ensure consistent salary analysis.

🧹 Data Cleaning (Pandas)

Data cleaning was performed entirely using Pandas, following real-world data quality practices:

βœ” Cleaning Steps

  • Removed redundant salary columns (salary, salary_currency)
  • Dropped invalid salary values (negative or zero)
  • Handled missing values in categorical features
  • Removed duplicate job postings
  • Standardized categorical values (experience level, employment type)
  • Converted messy job titles into structured job roles
  • Ensured consistent country naming
  • Reordered columns for analytical clarity

βœ” Result

A clean, analysis-ready dataset with consistent structure and realistic values.
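The cleaning steps above can be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's notebook: a small toy DataFrame stands in for the raw Kaggle CSV, and the experience-level code mapping is an assumption about how the raw values are encoded.

```python
import pandas as pd

# Toy stand-in for the raw CSV (in the project: pd.read_csv on the Kaggle file)
df = pd.DataFrame({
    "work_year": [2023, 2023, 2023, 2024],
    "experience_level": ["EN", "SE", "SE", None],
    "employment_type": ["FT", "FT", "FT", "FT"],
    "salary": [50000, 150000, 150000, -1],
    "salary_currency": ["USD", "USD", "USD", "USD"],
    "salary_in_usd": [50000, 150000, 150000, -1],
})

# Remove redundant salary columns, keeping only the USD-normalized figure
df = df.drop(columns=["salary", "salary_currency"])

# Drop invalid salaries (zero or negative) and duplicate postings
df = df[df["salary_in_usd"] > 0].drop_duplicates()

# Handle missing categoricals and standardize experience-level values
# (the EN/MI/SE/EX codes here are assumed; adjust to the actual raw data)
df["experience_level"] = df["experience_level"].fillna("Unknown")
level_map = {"EN": "Entry-level", "MI": "Mid-level", "SE": "Senior", "EX": "Executive"}
df["experience_level"] = df["experience_level"].replace(level_map)

print(df)
```

In the toy frame, the negative-salary row is dropped, the duplicated Senior posting is deduplicated, and the remaining codes are mapped to readable labels.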

πŸ“Š Exploratory Data Analysis & Visualization (Pandas)

Comprehensive EDA was conducted using Pandas, supported by Matplotlib and Seaborn for visualization.

πŸ” Key Analysis Areas

  • Overall salary distribution and skewness
  • Experience level vs salary comparison
  • Job role-based salary analysis
  • Remote vs hybrid vs in-person work trends
  • Company size impact on compensation
  • Geographic job demand and salary variations
  • Market segmentation using pivot tables and heatmaps
  • Year-over-year salary and demand trends

These visualizations helped translate raw numbers into clear, interpretable insights.
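Two of the analysis areas above, experience level vs. salary and pivot-table segmentation, can be sketched like this. A toy frame again stands in for the cleaned dataset; the column names follow the Dataset Description.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset
df = pd.DataFrame({
    "work_year": [2022, 2022, 2023, 2023, 2023, 2024],
    "experience_level": ["Entry-level", "Senior", "Entry-level",
                         "Senior", "Senior", "Senior"],
    "work_setting": ["Remote", "In-person", "Remote",
                     "Remote", "Hybrid", "In-person"],
    "salary_in_usd": [60000, 150000, 65000, 160000, 140000, 170000],
})

# Experience level vs. salary: central tendency per group
by_level = df.groupby("experience_level")["salary_in_usd"].agg(["median", "mean", "count"])

# Market segmentation: median salary by year and work setting
# (in the notebook a pivot like this feeds a Seaborn heatmap, e.g. sns.heatmap(pivot, annot=True))
pivot = df.pivot_table(values="salary_in_usd",
                       index="work_year",
                       columns="work_setting",
                       aggfunc="median")

print(by_level)
print(pivot)
```

The grouped table answers "how much does seniority pay?" directly, while the pivot exposes year-by-setting patterns that a heatmap then makes visual.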

πŸ—„οΈ SQL Analysis (MySQL)

The cleaned dataset was loaded into MySQL to answer business-oriented analytical questions using SQL.

πŸ“Œ SQL Insights Covered

  • Average & median salary by experience level
  • Salary comparison across job roles
  • Country-wise job distribution
  • Highest paying roles per country
  • Remote vs in-person salary comparison
  • Role + experience level salary aggregation
  • Identification of high-paying, high-demand roles

This step demonstrates the ability to translate analytical questions into efficient SQL queries.
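As an illustration of one of the insights above (average salary by experience level), the query below runs against an in-memory SQLite database so the example is self-contained; the project itself uses MySQL, but this `GROUP BY` aggregation is standard SQL and runs unchanged on the MySQL `job_market` table. The sample rows are made up for demonstration.

```python
import sqlite3

# In-memory SQLite stands in for the project's MySQL database
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_market (
        job_role TEXT,
        experience_level TEXT,
        salary_in_usd INTEGER
    )
""")
conn.executemany(
    "INSERT INTO job_market VALUES (?, ?, ?)",
    [
        ("Data Scientist", "Senior", 160000),
        ("Data Scientist", "Entry-level", 70000),
        ("Data Engineer", "Senior", 150000),
        ("Data Engineer", "Senior", 140000),
    ],
)

# Average salary by experience level, highest-paid first
rows = conn.execute("""
    SELECT experience_level,
           ROUND(AVG(salary_in_usd)) AS avg_salary_usd,
           COUNT(*) AS postings
    FROM job_market
    GROUP BY experience_level
    ORDER BY avg_salary_usd DESC
""").fetchall()
print(rows)
```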

πŸš€ Future Enhancements

  • Interactive Streamlit dashboard
  • Skill extraction using NLP
  • Regional salary normalization
  • Predictive salary modeling

βš™οΈ How to Run This Project

1️⃣ Clone the Repository

```bash
git clone https://github.com/mrunmayee3108/Data-Science-Job-Market-Analysis.git
cd Data-Science-Job-Market-Analysis
```

2️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

3️⃣ Run Data Cleaning & EDA (Pandas)

Open Jupyter Notebook and run all cells:

```bash
jupyter notebook
```

The notebook includes:

  • Data cleaning using Pandas
  • Exploratory Data Analysis (EDA)
  • Visualizations using Matplotlib & Seaborn

πŸ—„οΈ Load Cleaned Data into MySQL (Recommended CLI Method)

⚠️ Note: MySQL Workbench may fail for large CSV files. The MySQL Command Line Client is faster and more reliable.

Step 1 β€” Navigate to MySQL bin folder (Windows)

```
cd "C:\Program Files\MySQL\MySQL Server 8.0\bin"
```

Step 2 β€” Enable Local File Import

```bash
mysql -u your_username -p -D your_database_name -e "SET GLOBAL local_infile = 1;"
```

Enter your MySQL password when prompted.

Step 3 β€” Log in with local-infile enabled

```bash
mysql -u your_username -p your_database_name --local-infile=1
```

This opens the MySQL shell (mysql>).

Step 4 β€” Load the Cleaned CSV File

πŸ“Œ Important: Use forward slashes in file paths on Windows.

```sql
LOAD DATA LOCAL INFILE 'C:/path/to/cleaned_job_market_data.csv'
INTO TABLE job_market
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
```

Explanation:

  • LOAD DATA LOCAL INFILE → bulk import command
  • IGNORE 1 ROWS → skips the header row
  • ENCLOSED BY '"' → handles quoted values
  • If the CSV was saved with Windows line endings, change LINES TERMINATED BY '\n' to '\r\n'

🚀 In practice, this method is typically far faster than MySQL Workbench's table import wizard.

Step 5 β€” Verify Import

```sql
SELECT COUNT(*) FROM job_market;
```

πŸ‘₯ Contributing

Pull requests are welcome.

πŸ“„ License

MIT License.

πŸ™ Acknowledgments

  • Kaggle (Dataset)

⭐ Support

If you like this project, consider giving the repository a ⭐ star on GitHub!

Author: Mrunmayee Sachin Potdar
