This repository contains my submission for Task 1: Data Cleaning and Preprocessing, completed as part of my Data Analyst Internship.
The objective of this task is to clean and prepare a raw dataset by handling missing values, checking for duplicates, standardizing column names, converting date formats, and verifying data types.
- Dataset Name: Netflix Movies and TV Shows
- Source: Kaggle
- File: `netflix_titles.csv`
Task-1-Data-Cleaning-and-Preprocessing/
│
├── dataset/
│ ├── netflix_titles.csv
│ └── cleaned_netflix_titles.csv
│
├── code/
│ └── data_cleaning.py
│
├── notebook/
│ └── data_cleaning.ipynb
│
├── report/
│ └── data_cleaning_summary.txt
│
└── README.md
- Handled missing values:
  - `director` → Unknown
  - `cast` → Unknown
  - `country` → Unknown
  - `rating` → Not Rated
  - `duration` → Unknown
- Checked for duplicate records:
  - Found 0 duplicate rows
- Standardized column names:
  - Converted all column headers to lowercase
  - Replaced spaces with underscores
- Converted `date_added` column:
  - Changed to datetime format using Pandas
- Verified data types:
  - Ensured all columns had appropriate data types after cleaning
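
The cleaning steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the contents of `code/data_cleaning.py`: the function name `clean_titles` and the tiny demo frame are hypothetical, and the sketch assumes column names are standardized first so that the fill values match, and that date strings may carry stray whitespace.

```python
import pandas as pd

# Placeholder values for missing entries, taken from the list above.
FILL_VALUES = {
    "director": "Unknown",
    "cast": "Unknown",
    "country": "Unknown",
    "rating": "Not Rated",
    "duration": "Unknown",
}


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning steps to a copy of df."""
    out = df.copy()

    # Standardize headers: lowercase, spaces replaced with underscores.
    out.columns = out.columns.str.lower().str.replace(" ", "_", regex=False)

    # Fill missing values only for the columns that are actually present.
    present = {col: val for col, val in FILL_VALUES.items() if col in out.columns}
    out = out.fillna(value=present)

    # Drop exact duplicate rows (the Netflix file had none).
    out = out.drop_duplicates()

    # Convert date_added to datetime; unparseable values become NaT.
    # Assumption: values are strings, possibly with leading/trailing spaces.
    if "date_added" in out.columns:
        out["date_added"] = pd.to_datetime(
            out["date_added"].str.strip(), errors="coerce"
        )

    return out


# Made-up demo rows, not taken from the real dataset.
raw = pd.DataFrame(
    {
        "Show ID": ["s1", "s2", "s2"],
        "Director": [None, "Jane Doe", "Jane Doe"],
        "Rating": ["PG", None, None],
        "Date Added": ["September 25, 2021", " August 4, 2017", " August 4, 2017"],
    }
)
cleaned = clean_titles(raw)

# Verify data types after cleaning.
print(cleaned.dtypes)
```

For the real file, the same function could be applied to the result of `pd.read_csv("dataset/netflix_titles.csv")`, assuming the repository layout shown earlier.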
| Metric | Value |
|---|---|
| Original rows | 8807 |
| Final rows | 8807 |
| Duplicate rows removed | 0 |
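
Metrics like those in the table can be derived by comparing the raw and cleaned frames. The snippet below is a small sketch under assumptions: `cleaning_summary` is a hypothetical helper name, and the toy frame is invented for illustration (it is not the Netflix data, which had 0 duplicates).

```python
import pandas as pd


def cleaning_summary(raw: pd.DataFrame, cleaned: pd.DataFrame) -> dict:
    """Hypothetical helper: row counts before/after and duplicates in the raw data."""
    return {
        "Original rows": len(raw),
        "Final rows": len(cleaned),
        "Duplicate rows removed": int(raw.duplicated().sum()),
    }


# Invented toy data with one exact duplicate row.
toy = pd.DataFrame(
    {"title": ["A", "B", "B"], "type": ["Movie", "TV Show", "TV Show"]}
)
summary = cleaning_summary(toy, toy.drop_duplicates())
print(summary)
```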
- Python 3
- Pandas
- Jupyter Notebook
- Clone this repository:
git clone https://github.com/Kushankumarag/Task-1-Data-Cleaning-and-Preprocessing.git
- Install dependencies:
pip install pandas
- Run the script:
python code/data_cleaning.py
- Identified and treated missing values across multiple columns
- Checked for duplicate records (none were found)
- Standardized column names for consistency
- Converted date columns to proper datetime format
- Improved overall dataset quality for downstream analysis
Kushan Kumar
Data Analyst Internship | Task 1 Submission