Data Preprocessing
Kevin G Menear edited this page May 10, 2023 · 3 revisions
This documentation contains a collection of data preprocessing functions that can be used for converting and encoding columns in a pandas DataFrame.
- Convert Timedelta Columns
- Convert Datetime Columns
- Convert String to Integer Columns
- Convert Hours to Seconds Columns
- Convert Requested Memory String to Float
- Label Encode Columns
- One-hot Encode Columns
- One-hot Encode with 'Other'
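As a rough orientation, the numeric conversions in the list above can be approximated with plain pandas calls. The sketch below is not the implementation documented on this page; the DataFrame and its column names (`elapsed`, `submit`, `ncpus`, `walltime_hours`) are hypothetical, and it assumes naive timestamps should be read as UTC.

```python
import pandas as pd

# Hypothetical job-log data; the column names are illustrative only.
df = pd.DataFrame({
    "elapsed": pd.to_timedelta(["0 days 00:01:30", "0 days 01:00:00"]),
    "submit": pd.to_datetime(["1970-01-01 00:01:00", "1970-01-02 00:00:00"]),
    "ncpus": ["4", "16"],
    "walltime_hours": [0.5, 2.0],
})

# Timedelta -> whole seconds.
df["elapsed"] = df["elapsed"].dt.total_seconds().astype("int64")

# Datetime -> whole seconds since the UNIX epoch
# (assumption: naive timestamps are interpreted as UTC).
epoch = pd.Timestamp("1970-01-01")
df["submit"] = (df["submit"] - epoch) // pd.Timedelta(seconds=1)

# Numeric strings -> integers.
df["ncpus"] = pd.to_numeric(df["ncpus"]).astype("int64")

# Hours (float) -> whole seconds, rounded to the nearest second.
df["walltime_hours"] = (df["walltime_hours"] * 3600).round().astype("int64")
```

Each step overwrites the column in place, which is only one possible design; whether the documented functions mutate `df` or return a copy is not stated here.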
def convert_timedelta_columns(df, columns):
"""
Convert Timedelta columns to an integer number of seconds.
Parameters:
- df (pandas.DataFrame): The DataFrame where the Timedelta columns will be converted.
- columns (List[str]): A list of the columns to be converted.
"""

def convert_datetime_columns(df, columns):
"""
Convert Datetime columns to an integer number of seconds since the UNIX epoch.
Parameters:
- df (pandas.DataFrame): The DataFrame where the Datetime columns will be converted.
- columns (List[str]): A list of the columns to be converted.
"""

def convert_string_to_int_columns(df, columns):
"""
Convert strings to integers in pandas DataFrame columns.
Parameters:
- df (pandas.DataFrame): The DataFrame where the columns will be converted.
- columns (List[str]): A list of the columns to be converted.
"""

def convert_hours_to_seconds_columns(df, columns):
"""
Convert columns of hours (float) to an integer number of seconds.
Parameters:
- df (pandas.DataFrame): The DataFrame where the columns will be converted.
- columns (List[str]): A list of the columns to be converted.
"""

def convert_req_mem_string_to_float(df):
"""
Normalize the requested memory column to a number of megabytes.
Parameters:
- df (pandas.DataFrame): The DataFrame where the column will be converted.
"""

def label_encode_columns(df, columns):
"""
Label encode a categorical column in a DataFrame and save the encoding as a list in a new column.
Parameters:
- df (pandas.DataFrame): The DataFrame where the categorical columns will be encoded.
- columns (List[str]): A list of the columns to be encoded.
"""

def one_hot_encode_columns(df, columns):
"""
One-hot encode a categorical column in a DataFrame and save the encoding as a list in a new column.
Parameters:
- df (pandas.DataFrame): The DataFrame where the categorical columns will be encoded.
- columns (List[str]): A list of the columns to be encoded.
"""

def onehot_with_other(df, columns, n_values):
"""
One-hot encode a categorical column in a DataFrame and save the encoding as a list in a new column.
This function differs from the one_hot_encode_columns function in that not every distinct value in a
column is given a separate feature. Rather, only the n most prevalent values in each column are
given a feature, and the rest are grouped together as 'other'.
Parameters:
- df (pandas.DataFrame): The DataFrame where the categorical columns will be encoded.
- columns (List[str]): A list of the columns to be encoded.
- n_values (List[int]): A list giving n, the number of most prevalent values to keep, for each column (see description above).
Returns:
- encoded_columns (dict): A dictionary with keys as the original column names and values as lists of
encoded column names generated after one-hot encoding.
"""
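
The top-n-plus-'other' grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the function's actual body: the name `onehot_with_other_sketch`, the `<column>_<value>` naming of the indicator columns, the in-place modification of `df`, and the sample `partition` data are all hypothetical.

```python
import pandas as pd


def onehot_with_other_sketch(df, columns, n_values):
    """Illustrative only: one-hot encode the top-n values per column, pooling the rest."""
    encoded_columns = {}
    for column, n in zip(columns, n_values):
        # The n most frequent values keep their own indicator column.
        top_values = df[column].value_counts().index[:n]
        # Every other value is relabelled 'other' before encoding.
        pooled = df[column].where(df[column].isin(top_values), "other")
        dummies = pd.get_dummies(pooled, prefix=column)
        # Assumption: indicator columns are added to df in place.
        for name in dummies.columns:
            df[name] = dummies[name]
        encoded_columns[column] = list(dummies.columns)
    return encoded_columns


# Hypothetical data: 'gpu' and 'short' are the two most frequent values.
jobs = pd.DataFrame(
    {"partition": ["gpu", "gpu", "gpu", "short", "short", "long", "debug"]}
)
result = onehot_with_other_sketch(jobs, ["partition"], [2])
```

Here `result` maps `"partition"` to the three generated column names, and the `long` and `debug` rows both fall under `partition_other`. Ties at the cut-off are resolved by whatever order `value_counts` returns, which a real implementation may want to pin down explicitly.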