
Data Preprocessing

Kevin G Menear edited this page May 10, 2023 · 3 revisions

Data Preprocessing Functions

This page documents a collection of data preprocessing functions for converting and encoding columns in a pandas DataFrame.

Table of Contents

  1. Convert Timedelta Columns
  2. Convert Datetime Columns
  3. Convert String to Integer Columns
  4. Convert Hours to Seconds Columns
  5. Convert Requested Memory String to Float
  6. Label Encode Columns
  7. One-hot Encode Columns
  8. One-hot Encode with 'Other'

Convert Timedelta Columns

def convert_timedelta_columns(df, columns):
    """
    Convert Timedelta columns to an integer number of seconds.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the Timedelta columns will be converted.
    - columns (List[str]): A list of the columns to be converted.
    """
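
Only the signature and docstring are given here, so the following is a minimal sketch of one way such a conversion could work, not the project's implementation. The name `timedelta_columns_to_seconds` and the choice to return a converted copy are assumptions, and the sketch assumes the columns contain no missing values.

```python
import pandas as pd


def timedelta_columns_to_seconds(df, columns):
    """Hypothetical sketch: return a copy with Timedelta columns as whole seconds."""
    out = df.copy()
    for col in columns:
        # total_seconds() yields floats; cast to int64 (assumes no NaT values)
        out[col] = out[col].dt.total_seconds().astype("int64")
    return out


df = pd.DataFrame(
    {"elapsed": pd.to_timedelta(["0 days 00:01:30", "1 days 00:00:00"])}
)
result = timedelta_columns_to_seconds(df, ["elapsed"])
print(result["elapsed"].tolist())  # [90, 86400]
```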

Convert Datetime Columns

def convert_datetime_columns(df, columns):
    """
    Convert Datetime columns to an integer number of seconds since the UNIX epoch.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the Datetime columns will be converted.
    - columns (List[str]): A list of the columns to be converted.
    """
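
As an illustration of the epoch conversion described above, here is a hedged sketch that subtracts the epoch and floor-divides by one second. The function name is hypothetical, the columns are assumed to be timezone-naive with no missing values, and returning a copy is a choice made for this example.

```python
import pandas as pd


def datetime_columns_to_epoch_seconds(df, columns):
    """Hypothetical sketch: return a copy with datetime columns as epoch seconds."""
    out = df.copy()
    epoch = pd.Timestamp("1970-01-01")
    one_second = pd.Timedelta(seconds=1)
    for col in columns:
        # (timestamp - epoch) is a Timedelta; floor-dividing by 1s gives whole seconds
        out[col] = ((out[col] - epoch) // one_second).astype("int64")
    return out


df = pd.DataFrame(
    {"submit": pd.to_datetime(["1970-01-01 00:01:00", "1970-01-02 00:00:00"])}
)
result = datetime_columns_to_epoch_seconds(df, ["submit"])
print(result["submit"].tolist())  # [60, 86400]
```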

Convert String to Integer Columns

def convert_string_to_int_columns(df, columns):
    """
    Convert strings to integers in pandas DataFrame columns.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the columns will be converted.
    - columns (List[str]): A list of the columns to be converted.
    """
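
One possible way to perform this conversion is `pandas.to_numeric` followed by an integer cast. This is a sketch under assumptions: the function name is hypothetical, every value is assumed to be a valid integer string, and invalid values raise rather than being coerced.

```python
import pandas as pd


def string_columns_to_int(df, columns):
    """Hypothetical sketch: return a copy with string columns parsed as int64."""
    out = df.copy()
    for col in columns:
        # errors="raise" surfaces unparseable values instead of hiding them as NaN
        out[col] = pd.to_numeric(out[col], errors="raise").astype("int64")
    return out


df = pd.DataFrame({"cpus": ["4", "16", "128"]})
result = string_columns_to_int(df, ["cpus"])
print(result["cpus"].tolist())  # [4, 16, 128]
```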

Convert Hours to Seconds Columns

def convert_hours_to_seconds_columns(df, columns):
    """
    Convert columns of hours (float) to an integer number of seconds.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the columns will be converted.
    - columns (List[str]): A list of the columns to be converted.
    """
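
The arithmetic here is hours × 3600. The sketch below shows one way to apply it; the function name and the decision to round to the nearest second before casting are assumptions, since the rounding behavior is not stated above.

```python
import pandas as pd


def hours_columns_to_seconds(df, columns):
    """Hypothetical sketch: return a copy with hour columns as whole seconds."""
    out = df.copy()
    for col in columns:
        # 1 hour = 3600 seconds; round first so 0.9999... does not truncate down
        out[col] = (out[col] * 3600).round().astype("int64")
    return out


df = pd.DataFrame({"wallclock_req": [0.5, 1.0, 2.25]})
result = hours_columns_to_seconds(df, ["wallclock_req"])
print(result["wallclock_req"].tolist())  # [1800, 3600, 8100]
```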

Convert Requested Memory String to Float

def convert_req_mem_string_to_float(df):
    """
    Normalize the requested memory column to a number of megabytes.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the column will be converted.
    """
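
Neither the column name nor the input string format is specified above, so the sketch below rests entirely on assumptions: a hypothetical `req_mem` column holding strings such as `"512M"` or `"4G"`, unit suffixes K, M, G, and T, and binary (1024-based) conversion factors. It illustrates the normalization idea only.

```python
import re

import pandas as pd

# Assumed conversion factors to megabytes (binary multiples)
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}
_PATTERN = re.compile(r"\s*([0-9]*\.?[0-9]+)\s*([KMGT])B?\s*", re.IGNORECASE)


def req_mem_to_megabytes(value):
    """Hypothetical helper: parse a string like '4G' into megabytes as a float."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized memory string: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_MB[unit.upper()]


df = pd.DataFrame({"req_mem": ["512M", "4G", "1.5G", "2048K"]})
df["req_mem_mb"] = df["req_mem"].map(req_mem_to_megabytes)
print(df["req_mem_mb"].tolist())  # [512.0, 4096.0, 1536.0, 2.0]
```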

Label Encode Columns

def label_encode_columns(df, columns):
    """
    Label encode a categorical column in a DataFrame and save the encoding as a list in a new column.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the categorical columns will be encoded.
    - columns (List[str]): A list of the columns to be encoded.
    """
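
Label encoding maps each distinct category to an integer code. The sketch below uses `pandas.factorize` for this, which may differ from the encoder the function actually uses; the function name, the `_label` column suffix, and the returned mapping are all assumptions.

```python
import pandas as pd


def label_encode_sketch(df, columns):
    """Hypothetical sketch: add '<col>_label' integer codes for each column."""
    out = df.copy()
    mappings = {}
    for col in columns:
        # sort=True assigns codes in sorted category order, so results are stable
        codes, categories = pd.factorize(out[col], sort=True)
        out[f"{col}_label"] = codes
        mappings[col] = list(categories)
    return out, mappings


df = pd.DataFrame({"partition": ["gpu", "cpu", "gpu", "bigmem"]})
result, mappings = label_encode_sketch(df, ["partition"])
print(mappings["partition"])               # ['bigmem', 'cpu', 'gpu']
print(result["partition_label"].tolist())  # [2, 1, 2, 0]
```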

One-hot Encode Columns

def one_hot_encode_columns(df, columns):
    """
    One-hot encode a categorical column in a DataFrame and save the encoding as a list in a new column.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the categorical columns will be encoded.
    - columns (List[str]): A list of the columns to be encoded.
    """
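
To make "save the encoding as a list in a new column" concrete, here is a sketch that builds indicator columns with `pandas.get_dummies` and stores each row's vector as a Python list. The function name and the `_onehot` suffix are hypothetical, and the real function may lay out its output differently.

```python
import pandas as pd


def one_hot_as_list_column(df, columns):
    """Hypothetical sketch: add '<col>_onehot' holding each row's one-hot vector."""
    out = df.copy()
    for col in columns:
        # One indicator column per category, ordered by sorted category name
        dummies = pd.get_dummies(out[col], dtype=int)
        vectors = dummies.to_numpy().tolist()
        out[f"{col}_onehot"] = pd.Series(vectors, index=out.index)
    return out


df = pd.DataFrame({"partition": ["gpu", "cpu", "gpu"]})
result = one_hot_as_list_column(df, ["partition"])
print(result["partition_onehot"].tolist())  # [[0, 1], [1, 0], [0, 1]]
```

In this sketch the vector positions follow sorted category order, so `cpu` is index 0 and `gpu` is index 1.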

One-hot Encode with 'Other'

def onehot_with_other(df, columns, n_values):
    """
    One-hot encode a categorical column in a DataFrame and save the encoding as a list in a new column.
    This function differs from the one_hot_encode_columns function in that not every distinct value of a
    feature is given a separate feature. Rather, only the n most prevalent values in each column are
    given a feature, and the rest are grouped together as 'other'.
    
    Parameters:
    - df (pandas.DataFrame): The DataFrame where the categorical columns will be encoded.
    - columns (List[str]): A list of the columns to be encoded.
    - n_values (List[int]): A list of the n values, one per column (see description above).
    
    Returns:
    - encoded_columns (dict): A dictionary with keys as the original column names and values as lists of
    encoded column names generated after one-hot encoding.
    """
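
The top-n grouping described in the docstring can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name is hypothetical, ties in frequency are broken by `value_counts` order, the original column is dropped, and the sketch returns the new DataFrame alongside the `encoded_columns` dictionary.

```python
import pandas as pd


def onehot_with_other_sketch(df, columns, n_values):
    """Hypothetical sketch: one-hot encode the top-n values, lump the rest as 'other'."""
    out = df.copy()
    encoded_columns = {}
    for col, n in zip(columns, n_values):
        # value_counts() sorts by frequency, so head(n) is the n most prevalent values
        top = out[col].value_counts().head(n).index
        collapsed = out[col].where(out[col].isin(top), "other")
        dummies = pd.get_dummies(collapsed, prefix=col, dtype=int)
        out = pd.concat([out.drop(columns=col), dummies], axis=1)
        encoded_columns[col] = list(dummies.columns)
    return out, encoded_columns


df = pd.DataFrame({"user": ["a", "a", "a", "b", "b", "c", "d"]})
result, encoded_columns = onehot_with_other_sketch(df, ["user"], [2])
print(encoded_columns)  # {'user': ['user_a', 'user_b', 'user_other']}
print(result["user_other"].tolist())  # [0, 0, 0, 0, 0, 1, 1]
```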
