📖 **Table of Contents**
- Project Overview
- End Goals
- System Workflow Overview
- U.S.-Centric Dataset Bias
- Job Title Standardization Challenges
- Skill Fragmentation & Scope Limitation
- Sprint-Based Development Approach
- Sprint Overview
- Data Sources & Licensing
- Data Cleaning & Processing
- Job Title Normalization & Mapping
- ETL Workflow
- Methodology
- Challenges & Solutions
- Database Choice & Schema
- API Functionality
- Frontend Implementation
- Job Categorization Index Page
- Job-Specific Dashboard
- Data Source & Licensing
- Bias & Limitations
- Attribution & Design Influence
- Ethical Data Practices
- Expanding Job Categorization & Global Listings
- Live Data Updates & Web Scraping
- Custom Search Filters & Advanced User Features
- Improved Skill Extraction & NLP Integration
- User Authentication & Profile-Based Recommendations
- Pre-Requisites
- Installation Steps
- Database Setup
- Running the Project
- Platform-Specific Considerations
- Key Takeaways
- Future Enhancements
The Data Employment Skills Dashboard is designed to help job seekers, career changers, and data professionals stay competitive by identifying the most in-demand skills in Data Science job postings. Job titles and required skills are often inconsistent across postings, making it difficult to understand what truly matters. Our tool analyzes, ranks, and visualizes these skills, providing clear, data-driven insights to support smarter career decisions.
Our interactive dashboard allows users to:
- 🔍 Identify high-demand skills based on real job postings.
- 📊 Explore job titles and skill rankings in an intuitive interface.
- 📈 Compare their skillset with industry trends to guide career growth.
This project extracts, categorizes, and ranks job skills, making it easier for users to identify skill gaps and target relevant professional development opportunities.
- ✅ Provide job seekers with a clear understanding of in-demand skills.
- ✅ Enable interactive exploration of job titles and skill rankings.
- ✅ Offer real-world job market insights using real job postings as a data source.
The following process map outlines how the Python scripts, MongoDB interactions, and system operations work together to transform raw job postings into actionable insights on the dashboard.
Figure 1: Program Process Map
A high-level overview of system operations, covering data extraction, storage, API integration, and visualization.

This diagram illustrates the full process, from data collection to interactive visualizations, ensuring that job seekers can easily access and interpret job market trends.
- Figure 2: Job Postings by Location

- The dataset is heavily skewed toward U.S.-based job postings, making insights less applicable to international job markets.
- The visualization above illustrates the geographic bias in the dataset.
- Figure 3: Job Title Frequency Distribution

- One of the biggest challenges in job market data analysis is job title standardization.
- ~80% of job titles in the dataset appear only once, making standardization and categorization difficult.
- The line graph above represents the distribution of job title occurrences, highlighting the long-tail nature of job postings.
- Figure 4: Skill Frequency Distribution

- Job postings frequently list skills in inconsistent formats, leading to skill fragmentation.
- Considering dataset limitations, time constraints, and our current skill levels, we decided to narrow the scope to 14 representative job titles and focus our skill extraction on the top 20 skills per job category.
- Our approach ensures that the data remains relatable, actionable, and as clean as possible, reducing misinterpretation.
- Unlike more advanced projects, we did not use NLP, machine learning, or other complex AI-driven techniques, as those are beyond our current capabilities. Instead, we relied on Regex-based skill grouping to maintain accuracy and clarity.
Data_Employment_Skills_Dashboard/
│
├── backend/
│ ├── api_endpoints.py
│
├── data/
│ ├── cleaned_job_postings.csv
│ ├── cleaned_job_skills.csv
│ ├── job_postings.csv
│ ├── job_postings.json
│ ├── job_skills.csv
│ ├── job_summary.csv
│ ├── keyword_classifications.json
│ ├── title_classifications.json
│
├── executable/
│ ├── api_endpoints_executable.py
│ ├── data_cleaning_executable.py
│ ├── database_creation_executable.py
│ ├── run_scripts.py
│ ├── requirements.txt
│ ├── skill_extraction_executable.py
│
├── frontend/
│ ├── Js/
│ │ ├── api-import.js
│ │ ├── chart-rendering.js
│ │ ├── dashboard-script.js
│ │ ├── skillList.js
│ ├── dashboard.html
│ ├── dashboard-style.css
│ ├── ethics.html
│ ├── index.html
│ ├── script.js
│ ├── style.css
│
├── members/
│ ├── Jose_Traboulsi/.. --> Individual EDA and Subsequent Code
│ ├── Karla_Lopez/.. --> Individual EDA and Subsequent Code
│ ├── Sean_Schallberger/.. --> Individual EDA and Subsequent Code
│
├── notebooks/
│ ├── data_cleaning.ipynb
│ ├── database_creation.ipynb
│ ├── skill_extraction.ipynb
│
├── resources/
│ ├── Documentation/
│ │ ├── Data_Skills_Dashboard_Project_Outline.pdf
│ │ ├── Database_Diagram.png
│ │ ├── Database_Diagram_Extended.png
│ │ ├── Project 3 Roadmap.pdf
│ │ ├── Project_3_Process_Map.jpg
│ │ ├── Project_3_Process_Map.png
│ ├── eda-related images...
│
└── README.md
This project follows a sprint-based workflow, where tasks are distributed among team members for efficiency. Each sprint consists of multiple tickets, allowing parallel work on different tasks. To maintain efficiency, team members worked on separate files, manually merging updates at the end of each sprint.
Collaboration & Version Control:
- GitHub was used for version control to track changes.
- Team members documented challenges and findings to facilitate troubleshooting.
- Exploratory Data Analysis (EDA) was divided into multiple parts, ensuring thorough dataset investigation.
| Sprint | Focus Area | Key Tasks |
|---|---|---|
| Sprint 1 | Data Collection & Cleaning | Load dataset, assess quality, standardize job titles, remove duplicates |
| Sprint 2 | Skill Extraction & Processing | Extract & rank top 20 skills per job title, map skills to job roles |
| Sprint 3 | Database & API Development | Store cleaned data in MongoDB, develop Flask-based API endpoints |
| Sprint 4 | Dashboard Development | Build interactive visualizations for job skill insights |
- Ticket 1.1: Load and explore dataset for job market representation.
- Ticket 1.2: Assess data consistency, missing values, and duplicates.
- Ticket 1.3: Identify dataset biases & ethical concerns.
- Ticket 2: Standardize job titles, remove duplicate postings, map experience levels.
- Ticket 3: Extract skills from job descriptions.
- Ticket 4: Rank and categorize skills based on demand & relevance.
- Scope Consideration:
- Limited extraction to top 20 skills per job category to keep insights actionable.
- No use of NLP, ML, or complex AI due to current skill constraints.
Figure 1: Skill Frequency Distribution
A bar chart showing the most frequently mentioned skills across job postings.

- Ticket 5: Upload cleaned job & skill data to MongoDB.
- Ticket 6: Develop Flask API endpoints for skill aggregation & recommendations.
Figure 2: Industry Keyword Distribution
A bar chart categorizing job postings based on industry-specific keywords.

- Ticket 7: Build interactive skill ranking dashboard using JavaScript.
- Visualization Features:
- Job Market Overview → Top skills across job roles & industries.
- Personalized Skill Assessment → Compare user skills with job market demands.
- Interactive Filters → Adjust by experience level, job location, or role.
- Charts & Visuals → Bar charts, heatmaps, word clouds for insights.
If time and resources permit, additional AWS integrations may be implemented for scalability:
- MongoDB Atlas → Cloud database storage.
- AWS Lambda + API Gateway → Serverless backend deployment.
- AWS S3 + CloudFront → Hosting interactive dashboard for global access.
✅ Final Deliverables:
- MongoDB database storing job postings & skill data.
- Flask API for job searches & skill recommendations.
- Interactive dashboard for skill ranking & filtering.
- GitHub repository with documented code & README.
This project utilizes publicly available datasets from Kaggle, ensuring proper attribution and compliance with data usage policies. The dataset is licensed under the Open Data Commons Attribution License (ODC-By) v1.0, allowing us to freely use, modify, and share the data as long as proper credit is given.
- Data Science Job Postings & Skills (2024)
Dataset source: Kaggle
License: Open Data Commons Attribution License (ODC-By) v1.0
- `job_postings.csv` → Contains job titles, descriptions, locations, etc.
- `job_skills.csv` → Contains extracted skills from job postings.
One of the biggest challenges in job market data analysis is ensuring clean and structured data. Before analysis, we:
- Removed incomplete entries to improve dataset reliability.
- Identified and merged duplicate postings while consolidating relevant job data.
Figure 1: Job Titles Distribution
A bar chart showing the distribution of job postings by title.

Standardizing job titles was essential due to variations in how they appear across postings. To ensure consistency:
- Mapped variations to a standard format (e.g., "Sr. Data Analyst" → "Senior Data Analyst").
- Classified job levels into Junior, Mid, and Senior categories using regex rules (see the sketch below).
- Filtered irrelevant or duplicated job postings by removing listings that lacked proper skill descriptions or contained redundant data.
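To make the approach concrete, here is a minimal Python sketch of regex-based title normalization. The mapping rules shown are illustrative stand-ins, not the project's actual tables:

```python
import re
from typing import Tuple

# Illustrative rules only -- not the project's actual mapping tables.
TITLE_REPLACEMENTS = {
    r"\bSr\b\.?": "Senior",
    r"\bJr\b\.?": "Junior",
}

SENIORITY_PATTERNS = {
    "Senior": re.compile(r"\b(senior|lead|principal|staff)\b", re.I),
    "Junior": re.compile(r"\b(junior|entry[- ]level|intern)\b", re.I),
}

def normalize_title(raw_title: str) -> Tuple[str, str]:
    """Return a (standardized title, seniority level) pair."""
    title = raw_title.strip()
    for pattern, replacement in TITLE_REPLACEMENTS.items():
        title = re.sub(pattern, replacement, title, flags=re.I)
    for level, pattern in SENIORITY_PATTERNS.items():
        if pattern.search(title):
            return title, level
    return title, "Mid"  # default when no seniority marker is present

print(normalize_title("Sr. Data Analyst"))  # ('Senior Data Analyst', 'Senior')
```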
To ensure structured and usable data, we implemented an Extract, Transform, Load (ETL) process:
- Extract relevant data from `job_postings.csv` and `job_skills.csv`.
- Transform job titles and skill lists using standardization techniques.
- Load cleaned data into MongoDB (details in Section 5; a condensed sketch of the full flow follows below).
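The sketch below condenses this ETL flow, assuming hypothetical column and database names; the full pipeline is implemented across the notebooks and executable scripts:

```python
import pandas as pd
from pymongo import MongoClient

# Extract: read the raw Kaggle CSVs from the repo's data/ folder.
postings = pd.read_csv("data/job_postings.csv")
skills = pd.read_csv("data/job_skills.csv")

# Transform: drop incomplete rows, remove duplicates, standardize titles.
# NOTE: "job_title" is an assumed column name for illustration.
postings = postings.dropna(subset=["job_title"]).drop_duplicates()
postings["job_title"] = postings["job_title"].str.strip()

# Load: write the cleaned records into a local MongoDB instance.
client = MongoClient("mongodb://localhost:27017/")
db = client["jobs_db"]  # hypothetical database name
db.job_postings.insert_many(postings.to_dict("records"))
```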
Our skill extraction process was designed to:
- Extract relevant skills from job descriptions.
- Categorize skills based on relevance and industry standards.
- Rank skills by demand to highlight key industry trends.
- Many job postings use different terminology for the same skill (e.g., "SQL Programming" vs. "SQL").
- We applied regex-based text normalization to standardize skill names (illustrated below).
- No NLP or machine learning techniques were used, as they were beyond our current capabilities.
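A minimal sketch of that normalization, with hypothetical synonym rules rather than the project's actual groupings:

```python
import re

# Hypothetical synonym rules -- the project's real groupings differ.
SKILL_SYNONYMS = {
    re.compile(r"\bsql( programming| server)?\b", re.I): "SQL",
    re.compile(r"\b(ms |microsoft )?excel\b", re.I): "Excel",
    re.compile(r"\bpower ?bi\b", re.I): "Power BI",
}

def normalize_skill(raw_skill: str) -> str:
    """Collapse common variants of a skill name into one canonical label."""
    for pattern, canonical in SKILL_SYNONYMS.items():
        if pattern.search(raw_skill):
            return canonical
    return raw_skill.strip()

print(normalize_skill("SQL Programming"))  # SQL
```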
Figure 2: Skill Frequency Distribution
A bar chart visualizing the most frequently mentioned skills in job postings.

Given time constraints and dataset limitations, we:
- Focused on 14 representative job titles.
- Extracted only the top 20 most frequently mentioned skills per job category (ranking sketched below).
- Prioritized accuracy and clarity by ensuring minimal misinterpretation of skill labels.
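Once skill names are normalized, ranking reduces to a frequency count. A sketch, assuming a `job_skills` list field on each posting and the `normalize_skill` helper from the previous sketch:

```python
from collections import Counter
from typing import Dict, List, Tuple

def top_skills(postings: List[Dict], n: int = 20) -> List[Tuple[str, int]]:
    """Rank normalized skills for one job category by posting frequency."""
    counts = Counter(
        normalize_skill(skill)                       # helper from the sketch above
        for posting in postings
        for skill in posting.get("job_skills", [])   # assumed field name
    )
    return counts.most_common(n)
```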
Figure 3: Bottom 10 Least Common Skills
A table listing the least common skills, indicating potential niche job requirements.

We chose MongoDB for its flexibility in handling nested data structures and unstructured job postings. Given the variety of skills and job titles, a relational database would require excessive normalization.
Additionally, a NoSQL approach allows us to efficiently store, query, and scale our data as job market insights evolve.
The MVP database schema was designed to meet core data visualization needs, ensuring clean and efficient storage of job postings, job titles, and skill rankings.
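For example, a single job-category document can nest its posting count and skill rankings directly, so one query serves the dashboard. Field names and values here are hypothetical; the actual MVP schema is diagrammed in Figure 2 below:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["jobs_db"]  # hypothetical db name

# One nested document per job category -- skills ride along with the title,
# so no join tables are needed. Field names and values are illustrative.
db.job_categories.insert_one({
    "job_title": "Senior Data Analyst",
    "seniority": "Senior",
    "posting_count": 87,
    "top_skills": [
        {"skill": "SQL", "mentions": 64},
        {"skill": "Python", "mentions": 51},
    ],
})
```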
Figure 2: MVP Database Schema
A diagram illustrating our minimum viable database structure, optimized for visualization queries.

For future scalability, we designed an extended schema to support multiple data pipelines, broader job classifications, and deeper skill groupings.
Figure 3: Extended Database Schema
A more detailed entity-relationship diagram, outlining potential improvements for handling additional job categories, locations, and structured skill hierarchies.

To make job market insights accessible, our API enables:
- Job Title Searches → Retrieve skill rankings for specific job titles.
- Filtered Results → Search based on experience level, location, or industry.
- JSON Responses → Structured outputs for seamless integration with front-end applications.
This setup allows users to explore data-driven career insights dynamically while keeping the database structure optimized for fast queries.
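A minimal sketch of what one such endpoint might look like, reusing the hypothetical `job_categories` collection from the schema example above (the project's actual routes are implemented in `backend/api_endpoints.py`):

```python
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017/")["jobs_db"]  # hypothetical db name

@app.route("/api/skills/<job_title>")
def skills_for_title(job_title):
    """Return the ranked skill list for a job title, optionally filtered."""
    query = {"job_title": job_title}
    level = request.args.get("level")  # e.g. ?level=Senior
    if level:
        query["seniority"] = level
    doc = db.job_categories.find_one(query, {"_id": 0})
    if doc is None:
        return jsonify({"error": "job title not found"}), 404
    return jsonify(doc)

if __name__ == "__main__":
    app.run(debug=True)
```

Under this sketch, a request such as `GET /api/skills/Senior%20Data%20Analyst?level=Senior` would return the stored skill ranking as JSON.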
The dashboard is built using JavaScript (D3.js, Plotly.js) and is designed to be interactive and user-friendly. The user journey consists of three main pages:
Instead of a dropdown menu, the homepage features a clean index page listing all 14 job categories, each with a brief description.
Each job title is visually differentiated by colored bars, which indicate the generally accepted earliest seniority-based entry points for that role. Clicking on a job title opens a dedicated dashboard page with data insights specific to that role.
Figure 1: Job Categorization Index Page
This page allows users to select a data job category, each visually distinguished by seniority level indicators.

Once a job category is selected, users are directed to a job-specific dashboard containing:
✅ Job Market Insights:
- A metric box displaying the total number of job postings collected for the selected job title.
✅ Education Requirements Overview:
- Three pie charts displaying the percentage of job postings that mention:
- A bachelor's degree
- A master’s degree
- A PhD
✅ Skill Ranking & Customization:
- Top 10 Most Commonly Listed Skills for the selected job title, displayed in a checklist format.
- Dynamic Skill Gap Bar Chart:
- As users check off skills they already possess, the bar chart dynamically updates, showing the remaining top 10 skills they lack for that role.
Figure 2: Job-Specific Dashboard
An interactive view of job listings, education requirements, skill rankings, and dynamic skill gaps.

To ensure transparency in the dataset and methodology, we documented data sources, limitations, and potential areas for future improvements in a dedicated Project Ethics & Considerations section.
This page outlines:
- Data Source & Licensing → Kaggle dataset under ODC-By 1.0 License
- Scope & Limitations → Focused on 14 job categories with 50-100 postings per category
- Technical Considerations → Built using MongoDB, Flask API, and JavaScript
- Future Enhancements → Exploring real-time job tracking & custom search filters
Figure 3: Ethics & Considerations Page
This page ensures transparency in data licensing, scope, and future improvements.

The dashboard integrates multiple interactive visualizations to provide a clear and actionable overview of job market expectations:
- Metric Box displaying the total number of job postings collected for the selected job title.
- Education Requirement Pie Charts breaking down the percentage of postings requiring different degree levels.
- Top 10 Skill Checklist for each role, allowing users to track skills they already have.
- Dynamic Bar Chart showing the top 10 skills missing from the user’s checked-off list.
This structured dashboard provides data-driven career insights, helping users navigate industry trends, skill requirements, and educational pathways with clarity.
Ensuring transparency and responsible data usage is a critical part of this project. Below, we outline key ethical considerations, dataset limitations, and attribution to external sources that influenced our work.
- This project uses publicly available job postings data from Kaggle, licensed under the Open Data Commons Attribution License (ODC-By) v1.0.
- The dataset may contain biases, as it is heavily U.S.-centric and may not fully represent global job markets.
- We do not claim ownership over the dataset and encourage users to validate findings before making career decisions.
- Geographical Bias: Most job postings originate from the U.S., limiting insights for international users.
- Job Title Standardization: Since job titles vary significantly across companies, standardization methods may not capture every nuance.
- Skill Fragmentation: Many job postings list skills in inconsistent formats, leading to challenges in accurate extraction and ranking.
- No AI/ML Used: Skill extraction was performed using Regex-based text normalization, avoiding potential ethical concerns around biased AI models.
- The structure of our hub/index design was influenced by a mini project from Masynctech, which explains how to create an interactive hub layout for project navigation.
- Video reference: Masynctech - Interactive Hub Mini Project.
- Our project builds upon this design and site layout, customizing it for a simplified job skills dashboard while maintaining an intuitive, user-friendly experience.
- 📌 No Personally Identifiable Information (PII): Our dataset does not include sensitive user data.
- 🔄 Transparency: We document our methodology, data sources, and limitations openly.
- 🤝 Community Collaboration: Future iterations may incorporate user feedback to enhance fairness and usability.
By keeping these ethical considerations in mind, we aim to provide valuable, data-driven insights while respecting responsible data use and transparency. 🚀
As a Minimum Viable Product (MVP), the current Data Employment Skills Dashboard provides a structured and interactive way for job seekers to explore in-demand skills and job trends. However, several areas can be enhanced for greater functionality, accuracy, and usability.
- Current Scope: Focused on 14 U.S.-based job categories.
- Next Step: Expand to include more specialized roles and international job postings, ensuring broader market coverage.
- Challenge: Requires additional datasets and standardization for consistency.
- Current Scope: Static dataset from Kaggle’s 2024 LinkedIn job postings.
- Next Step:
- Integrate real-time job tracking via web scraping.
- Add Live API connections to pull fresh job postings.
- Implement automated data refresh cycles for up-to-date insights.
- Challenge:
- Handling data inconsistencies and job posting duplication.
- Ensuring ethical web scraping practices and data licensing compliance.
- Current Scope: Users can explore preset job categories but cannot apply detailed filters.
- Next Step:
- Add custom filters for:
  - ✅ Experience Level → Junior, Mid-Level, Senior.
  - ✅ Industry Specialization → e.g., AI, Cloud, Data Engineering.
  - ✅ Location Preferences → U.S. vs. International.
- Implement personalized dashboards where users can save job preferences and compare their skill match against multiple roles.
- Current Scope: Uses Regex-based skill extraction for ranking.
- Next Step:
- Introduce Natural Language Processing (NLP) to:
  - ✅ Identify alternative skill names & synonyms.
  - ✅ Cluster similar skill groups (e.g., "SQL" vs. "SQL Programming").
  - ✅ Improve accuracy in skill categorization.
- Potentially use Machine Learning (ML) for trend prediction and personalized recommendations.
- Challenge:
- Requires higher computational power and additional dataset training.
- Current Scope: Open-access dashboard with no user login or customization.
- Next Step:
- Implement user profiles for saved searches, tracked skills, and career suggestions.
- Add LinkedIn or GitHub API integration for auto-recommendations based on user profiles.
- Develop a "Skill Roadmap" feature that suggests learning paths based on job market trends.
- Challenge:
- Requires database expansion and secure authentication protocols.
Beyond incremental improvements, the goal is to evolve this dashboard into a full-scale career intelligence tool, offering:
✅ Real-time skill tracking based on job market fluctuations.
✅ AI-driven career path recommendations personalized to users.
✅ Seamless integration with e-learning platforms (e.g., Coursera, Udemy, LinkedIn Learning) for upskilling suggestions.
These enhancements will ensure the dashboard remains relevant, data-driven, and user-centric, providing meaningful career insights for job seekers and professionals alike.
Before running the project, ensure you have:
- Python 3.8+ installed.
- MongoDB installed manually (must be configured separately).
- Git installed for repository cloning.
- Necessary Python packages (installed automatically by the executable or manually from `requirements.txt`).
1️⃣ Clone the repository:
```bash
git clone https://github.com/Oceanpro00/Data_Employment_Skills_Dashboard.git
cd Data_Employment_Skills_Dashboard
```
2️⃣ Create and activate a virtual environment:
```bash
conda create --name dev python=3.8
conda activate dev
```
3️⃣ Install dependencies (Optional Step):
- If you plan to run the project using `run_scripts.py`, the script will automatically install dependencies from `requirements.txt`, so you do not need to do this manually.
- If you plan to run individual notebooks yourself, you must install dependencies manually using:
```bash
pip install -r requirements.txt
```
Since this project uses MongoDB as the backend database, ensure you have it installed and running locally.
💪 Start MongoDB (Default Port 27017):
```bash
mongod --dbpath ./data/mongodb
```
💪 Load the cleaned dataset into MongoDB:
```bash
python database_creation_executable.py
```
To ensure proper execution, open your terminal or command prompt inside the `executable` folder before running the script; otherwise, the script may not function correctly.
Once dependencies are installed and MongoDB is running, the entire data pipeline and API setup can be executed with one command:
```bash
python run_scripts.py
```
This script will:
✅ Install dependencies (if not already installed).
✅ Run all preprocessing scripts in the correct order.
✅ Start the Flask API for skill data retrieval.
✅ Launch the frontend dashboard in a browser.
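Conceptually, the runner is a sequence of subprocess calls. A hedged Python sketch of the idea (script names come from the `executable/` folder; the ordering and the Flask default URL are assumptions):

```python
import subprocess
import sys
import webbrowser

# Run each preprocessing stage in order, stopping on the first failure.
for script in (
    "data_cleaning_executable.py",
    "skill_extraction_executable.py",
    "database_creation_executable.py",
):
    subprocess.run([sys.executable, script], check=True)

# Start the Flask API, then open the dashboard in the default browser.
subprocess.Popen([sys.executable, "api_endpoints_executable.py"])
webbrowser.open("http://127.0.0.1:5000/")
```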
💻 Mac/Linux Users:
- Ensure MongoDB is installed (`brew install mongodb-community@7.0`).
- Use Terminal for executing commands.
🖥 Windows Users:
- Manually install MongoDB and start `mongod` before running scripts.
- Use PowerShell or Command Prompt to execute scripts.
The Data Employment Skills Dashboard provides job seekers with valuable insights into the most in-demand skills within Data Science roles. By leveraging real-world job postings, this tool enables users to explore job market trends, assess their skillset against industry standards, and make informed career decisions.
Key takeaways from this project include:
- Standardizing job titles and skill extraction to reduce inconsistencies in job descriptions.
- Providing an interactive dashboard that allows users to visualize and explore job market trends dynamically.
- Utilizing MongoDB, Flask, and JavaScript to create a scalable and efficient system for querying job skills data.
While this project has successfully demonstrated how data science techniques can be used to analyze job market trends, there is always room for improvement. Future enhancements may include:
- Expanding the dataset to include international job postings.
- Integrating machine learning techniques to predict future skill demands.
- Automating data collection through web scraping and real-time job API integration.
This project serves as a proof of concept for data-driven job market analysis, with potential real-world applications in career development and workforce planning. 🚀