This project cleans and prepares a layoffs dataset using SQL by removing duplicates, standardizing text, handling null/blank values, and dropping unnecessary columns. The result is a consistent, high-quality dataset ready for reporting, analysis, and visualization.
## Overview

This project demonstrates data cleaning techniques using SQL on a layoffs dataset. The goal is to transform raw, messy data into a clean, usable form for reporting and analysis.

### Key Steps
- Remove duplicate records.
- Standardize text data (e.g., fixing inconsistent company/industry names).
- Handle null and blank values.
- Convert data types (especially dates).
- Drop unnecessary columns.
By the end of this process, we have a cleaned table (`layoffs_staging2`) ready for further analysis and reporting.
## Files

- `Data Cleaning project.sql` → The SQL script containing all steps of the data cleaning process.
## Tools

- Microsoft SQL Server (T-SQL)
- Database: `layoffs`
## Data Cleaning Steps

### Removing Duplicates

- Used a CTE (Common Table Expression) with `ROW_NUMBER()` to identify duplicate rows based on company, industry, location, date, etc.
- Deleted rows where `row_num > 1`.
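The duplicate-removal step can be sketched as below. The table name and partition columns follow the description above, but the exact column list is an assumption about the dataset's schema:

```sql
-- Sketch: flag duplicates with ROW_NUMBER(), then delete the extras.
-- Partition columns are assumed from the dataset description.
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry,
                            total_laid_off, percentage_laid_off, [date]
               ORDER BY company
           ) AS row_num
    FROM layoffs_staging2
)
DELETE FROM duplicate_cte   -- T-SQL allows DELETE through an updatable CTE
WHERE row_num > 1;
```

Partitioning by every descriptive column means a row only gets `row_num > 1` when another row matches it on all of those columns, i.e., it is a true duplicate.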
### Standardizing Text Data

- Trimmed extra spaces in company names.
- Corrected inconsistent industry names (e.g., `crypto currency` → `crypto`).
- Fixed country names with trailing spaces or punctuation (e.g., `United States.` → `United States`).
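A sketch of those standardization fixes in T-SQL (column and value names assumed from the examples above; note that the `TRIM(... FROM ...)` form requires SQL Server 2017+):

```sql
-- Remove stray whitespace around company names
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Collapse industry-name variants into one label
UPDATE layoffs_staging2
SET industry = 'crypto'
WHERE industry = 'crypto currency';

-- Strip trailing periods from country names
UPDATE layoffs_staging2
SET country = TRIM('.' FROM country)
WHERE country LIKE 'United States%';
```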
### Handling Null and Blank Values

- Removed rows with invalid or blank values (e.g., when `industry` contained dates instead of industry names).
- Ensured fields like `date` are in valid formats.
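One way to express that cleanup; the exact conditions here are assumptions based on the example given (an `industry` value that parses as a date is treated as invalid):

```sql
-- Drop rows whose industry is NULL, blank, or actually a date string
DELETE FROM layoffs_staging2
WHERE industry IS NULL
   OR industry = ''
   OR TRY_CONVERT(DATE, industry) IS NOT NULL;  -- a date in 'industry' is invalid
```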
### Converting Data Types

- Used `TRY_CONVERT(DATE, [date], 101)` to convert date strings into a proper `DATE` format.
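Sketch of the date conversion. Style `101` is the U.S. `mm/dd/yyyy` format, and `TRY_CONVERT` returns `NULL` instead of raising an error when a value cannot be parsed:

```sql
-- Normalize the string values in place first...
UPDATE layoffs_staging2
SET [date] = TRY_CONVERT(DATE, [date], 101);

-- ...then change the column's type itself
ALTER TABLE layoffs_staging2
ALTER COLUMN [date] DATE;
```

Converting the values before altering the column type avoids conversion errors during the `ALTER TABLE`, since any unparseable strings have already become `NULL`.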
### Dropping Unnecessary Columns

- Removed helper columns like `row_num` after data cleaning was complete.
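Dropping the helper column once cleanup is done is a single statement:

```sql
-- row_num was only needed to identify duplicates; remove it from the final table
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```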
## What I Learned

- How to use CTEs and window functions for duplicate removal.
- Practical text standardization techniques with `TRIM()` and `UPDATE`.
- Best practices for handling missing or invalid data.
- Converting and validating date fields in SQL.
- Structuring a data cleaning workflow in SQL.