Skip to content

Latest commit

 

History

History
75 lines (45 loc) · 2.82 KB

File metadata and controls

75 lines (45 loc) · 2.82 KB

Machine-Learning

Data Preprocessing in Machine Learning

Introduction to Data Preprocessing

Data preprocessing is a crucial step in machine learning pipelines. It involves transforming raw data into a format suitable for machine learning algorithms.

Importing Libraries

First, let's import the necessary libraries for data preprocessing:

  • NumPy: For numerical computations.
  • Matplotlib: For data visualization.
  • Pandas: For data manipulation and analysis.

Importing the Dataset

We start by importing the dataset using Pandas. Here's a glimpse of our dataset:

print(X)  print(y)

This code reads a CSV file ('Data.csv') into a DataFrame and separates features (X) from the target variable (y).

Handling Missing Data

Missing data is a common issue in datasets. We use the SimpleImputer class from scikit-learn to replace missing values. The strategy used here is to replace missing values with the mean of the respective column.

print(X)

Now, missing values are imputed with the mean of the respective columns.

Encoding Categorical Data

Machine learning models require numerical inputs. Hence, categorical variables need to be encoded. We use one-hot encoding for categorical features and label encoding for the target variable.

print(X)  print(y)

Now, categorical data is encoded into numerical format suitable for machine learning algorithms.

Splitting the Dataset

To evaluate the performance of machine learning models, we split the dataset into training and testing sets. The data is split into 80% training and 20% testing sets.

print(X_train)  print(X_test)  print(y_train)  print(y_test)

Now, we have separate training and testing sets for both features and target variables.

Feature Scaling

Feature scaling is essential for many machine learning algorithms. It ensures that all features have the same scale, preventing some features from dominating others.

print(X_train)  print(X_test)

Feature scaling is applied to standardize the features, making them comparable across different scales.

Conclusion

Data preprocessing is a crucial step in building machine learning models. It involves handling missing data, encoding categorical variables, splitting the dataset, and scaling features. These techniques ensure that the data is well-prepared for training machine learning algorithms, leading to more accurate models.

References: