This project performs an end-to-end exploratory data analysis (EDA) on a global Data Science job market dataset to uncover insights related to:
- Salary trends
- Experience-level impact on compensation
- Location-based hiring patterns
- Company size and work setting influence
- Market evolution over time
- Clean and standardize raw job market data using Pandas
- Perform detailed exploratory analysis using Pandas
- Visualize trends using Matplotlib & Seaborn
- Answer business-style analytical questions using MySQL
- Identify high-paying roles, in-demand experience levels, and geographic trends
| Category | Tools |
|---|---|
| Data Cleaning | Pandas |
| Data Analysis | Pandas, SQL |
| Database | MySQL |
| Visualization | Matplotlib, Seaborn |
| Environment | Jupyter Notebook |
The dataset contains global job postings related to Data Science and AI roles with features such as:
- `work_year`
- `job_role`
- `job_category`
- `experience_level`
- `employment_type`
- `work_setting`
- `company_location`
- `company_size`
- `salary_in_usd`

Only USD-normalized salaries were retained to ensure consistent salary analysis.
Data cleaning was performed entirely using Pandas, following real-world data quality practices:
- Removed redundant salary columns (`salary`, `salary_currency`)
- Dropped invalid salary values (negative or zero)
- Handled missing values in categorical features
- Removed duplicate job postings
- Standardized categorical values (experience level, employment type)
- Converted messy job titles into structured job roles
- Ensured consistent country naming
- Reordered columns for analytical clarity
A clean, analysis-ready dataset with consistent structure and realistic values.
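
A few of the cleaning steps listed above can be sketched in Pandas as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, and the experience-level codes (`EN`, `MI`, `SE`, `EX`) and the `"Unknown"` fill value are assumptions about how the raw data is encoded. Job-title restructuring and country-name standardization are left out.

```python
import pandas as pd

# Made-up sample standing in for the raw dataset (illustrative values only).
raw = pd.DataFrame({
    "work_year": [2023, 2023, 2022, 2023, 2022],
    "job_role": ["Data Scientist", "Data Scientist", "ML Engineer",
                 "Data Analyst", "Data Analyst"],
    "experience_level": ["SE", "SE", "MI", None, "EN"],
    "salary": [150000, 150000, 90000, 60000, 0],
    "salary_currency": ["USD", "USD", "EUR", "USD", "USD"],
    "salary_in_usd": [150000, 150000, 97000, 60000, 0],
})

# Assumed mapping from short codes to readable labels.
LEVEL_NAMES = {"EN": "Entry-level", "MI": "Mid-level",
               "SE": "Senior", "EX": "Executive"}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the USD-normalized salary column.
    out = df.drop(columns=["salary", "salary_currency"])
    # Drop zero or negative salaries.
    out = out[out["salary_in_usd"] > 0]
    # Fill missing categorical values with an explicit placeholder.
    out = out.assign(experience_level=out["experience_level"].fillna("Unknown"))
    # Remove exact duplicate postings.
    out = out.drop_duplicates()
    # Standardize category labels.
    out = out.assign(experience_level=out["experience_level"].replace(LEVEL_NAMES))
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Each step returns a new DataFrame instead of modifying `raw` in place, so the original data stays available for comparison.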
Comprehensive EDA was conducted using Pandas, supported by Matplotlib and Seaborn for visualization.
- Overall salary distribution and skewness
- Experience level vs salary comparison
- Job role-based salary analysis
- Remote vs hybrid vs in-person work trends
- Company size impact on compensation
- Geographic job demand and salary variations
- Market segmentation using pivot tables and heatmaps
- Year-over-year salary and demand trends
Visualizations accompanied each analysis and helped translate raw numbers into clear, interpretable insights.
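
As a rough sketch of the analyses above, the snippet below computes median salary per experience level, a pivot table for market segmentation, and the skewness of the salary distribution. The data is invented for illustration, and the column names are taken from the feature list; the notebook's own code may differ.

```python
import pandas as pd

# Invented sample rows (illustrative only).
jobs = pd.DataFrame({
    "experience_level": ["Entry-level", "Entry-level", "Senior",
                         "Senior", "Senior", "Mid-level"],
    "work_setting": ["Remote", "In-person", "Remote",
                     "Remote", "In-person", "Hybrid"],
    "salary_in_usd": [60000, 50000, 150000, 170000, 140000, 95000],
})

# Experience level vs salary: median is less sensitive to outliers than mean.
median_by_level = (
    jobs.groupby("experience_level")["salary_in_usd"].median().sort_values()
)

# Market segmentation: mean salary for each level / work-setting combination.
# Combinations with no postings show up as NaN.
segments = jobs.pivot_table(
    index="experience_level",
    columns="work_setting",
    values="salary_in_usd",
    aggfunc="mean",
)

# Skewness of the overall salary distribution.
salary_skew = jobs["salary_in_usd"].skew()

print(median_by_level)
print(segments)
print(salary_skew)
```

A pivot table shaped like `segments` can be passed directly to `seaborn.heatmap` to produce the heatmap view mentioned in the list.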
The cleaned dataset was loaded into MySQL to answer business-oriented analytical questions using SQL.
- Average & median salary by experience level
- Salary comparison across job roles
- Country-wise job distribution
- Highest paying roles per country
- Remote vs in-person salary comparison
- Role + experience level salary aggregation
- Identification of high-paying, high-demand roles
This step demonstrates the ability to translate analytical questions into efficient SQL queries.
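
The project's queries run against MySQL and are not reproduced here. To show the general shape of two of the questions above without needing a database server, the sketch below uses Python's built-in `sqlite3` module with an in-memory table. SQLite is a stand-in, not the project's setup; the rows are invented, and the table and column names follow the `job_market` table and feature list. The aggregate queries shown use standard SQL that MySQL also accepts.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE job_market (
        job_role TEXT,
        experience_level TEXT,
        company_location TEXT,
        salary_in_usd INTEGER
    )
    """
)

# Invented sample rows (illustrative only).
rows = [
    ("Data Scientist", "Senior", "United States", 160000),
    ("Data Scientist", "Senior", "United States", 140000),
    ("Data Analyst", "Entry-level", "India", 30000),
    ("ML Engineer", "Mid-level", "Germany", 90000),
    ("Data Analyst", "Entry-level", "United States", 70000),
]
conn.executemany("INSERT INTO job_market VALUES (?, ?, ?, ?)", rows)

# Average salary and posting count by experience level.
avg_by_level = conn.execute(
    """
    SELECT experience_level,
           ROUND(AVG(salary_in_usd), 2) AS avg_salary,
           COUNT(*) AS postings
    FROM job_market
    GROUP BY experience_level
    ORDER BY avg_salary DESC
    """
).fetchall()

# Country-wise job distribution.
country_counts = conn.execute(
    """
    SELECT company_location, COUNT(*) AS postings
    FROM job_market
    GROUP BY company_location
    ORDER BY postings DESC, company_location
    """
).fetchall()

print(avg_by_level)
print(country_counts)
conn.close()
```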
- Interactive Streamlit dashboard
- Skill extraction using NLP
- Regional salary normalization
- Predictive salary modeling
git clone https://github.com/mrunmayee3108/Data-Science-Job-Market-Analysis.git
cd Data-Science-Job-Market-Analysis
pip install -r requirements.txt

Open Jupyter Notebook and run all cells:

jupyter notebook

The notebook includes:
- Data cleaning using Pandas
- Exploratory Data Analysis (EDA)
- Visualizations using Matplotlib & Seaborn
Note: MySQL Workbench may fail for large CSV files. The MySQL Command Line Client is faster and more reliable.

cd "C:\Program Files\MySQL\MySQL Server 8.0\bin"
mysql -u your_username -p -D your_database_name -e "SET GLOBAL local_infile = 1;"

Enter your MySQL password when prompted.

mysql -u your_username -p your_database_name --local-infile=1

This opens the MySQL shell (mysql>).

Important: Use forward slashes in file paths on Windows.

LOAD DATA LOCAL INFILE 'C:/path/to/cleaned_job_market_data.csv'
INTO TABLE job_market
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

Explanation:
- `LOAD DATA LOCAL INFILE` → bulk import command
- `IGNORE 1 ROWS` → skips the header row
- `ENCLOSED BY '"'` → handles quoted values
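
For the import to parse correctly, the CSV file has to match the clauses in the `LOAD DATA` statement: comma-separated fields, double-quote enclosure, `\n` line endings, and one header row. The sketch below shows one way to write such a file with Python's standard `csv` module. The helper name `write_for_load_data` and the sample row are hypothetical, and the notebook may export its data differently.

```python
import csv
import io


def write_for_load_data(rows, header, handle):
    """Write rows in the layout the LOAD DATA statement expects."""
    writer = csv.writer(
        handle,
        delimiter=",",               # FIELDS TERMINATED BY ','
        quotechar='"',               # ENCLOSED BY '"'
        quoting=csv.QUOTE_MINIMAL,   # quote only fields that need it
        lineterminator="\n",         # LINES TERMINATED BY '\n'
    )
    writer.writerow(header)          # skipped on import by IGNORE 1 ROWS
    writer.writerows(rows)


# Demonstrated on an in-memory buffer; the sample row is invented.
buffer = io.StringIO()
write_for_load_data(
    rows=[(2023, "Data Scientist", "Korea, Republic of", 120000)],
    header=["work_year", "job_role", "company_location", "salary_in_usd"],
    handle=buffer,
)
text = buffer.getvalue()
print(text)
```

When writing to a real file, open it with `open(path, "w", newline="")` so that Python does not translate `\n` into `\r\n` on Windows, which would no longer match `LINES TERMINATED BY '\n'`.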
This method is 10–20× faster than MySQL Workbench.

SELECT COUNT(*) FROM job_market;

Pull requests are welcome.
MIT License.
- Kaggle (Dataset)
If you like this project, consider giving the repository a star on GitHub!
Author: Mrunmayee Sachin Potdar