Preprocessing with Scikit-learn Pipelines (2/3)

Elevate Your Preprocessing with Custom Transformers

Posted by Fabian Hruby on June 12, 2024

Introduction

Building on our understanding of data preprocessing with the Scikit-learn Pipeline from the first part of this series, let's explore transforming categorical data. This part goes beyond basic techniques by introducing custom transformers for tailored data cleaning. Our introductory example was straightforward, focusing on missing numerical values. However, handling categorical variables can be more complex, especially when replacing missing values and cleaning messy data.

In this part of the series, we will look at transforming categorical data using scikit-learn's FunctionTransformer and explore the use of custom transformers to define specific preprocessing functions tailored to your data.

Setting the Goals for Our Next Transformations

Speaking of data, let's take a look at our data after applying the transformations from part 1:



Our columns age, experience_in_years and salary_in_dollar look well prepared after we imputed missing values and scaled them with scikit-learn's MinMaxScaler.
However, the highlighted values in the columns job_title and city need attention.

For example, models such as neural networks only work with numerical data, so we need to convert our categorical data, such as job_title and city, into numbers. Although the values in the postal_code column are already integers, they represent categories, so we will bin them into groups before applying one-hot encoding. This approach will reduce the dimensionality of the feature space while retaining meaningful information.
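To put rough numbers on the dimensionality argument, here is a small sketch with made-up postal codes (the counts are illustrative, not from our dataset):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
postal_codes = pd.Series(rng.integers(1, 100_000, size=500), name="postal_code")

# One-hot encoding the raw codes yields one column per distinct code ...
print(pd.get_dummies(postal_codes).shape[1])  # close to 500 columns

# ... while binning first caps the number of columns at the number of bins
binned = pd.cut(postal_codes, bins=range(0, 110_000, 10_000))
print(pd.get_dummies(binned).shape[1])  # 10 columns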

However, before converting the strings to numbers and binning our postal codes, we must address the special characters in the job_title column. Ignoring these characters would result in treating "Data A*nalyst" and "Data Analyst" as different strings, which is undesirable. Therefore, we need to clean up the data in the job_title column.
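As a quick illustration of the problem (with a made-up two-row DataFrame), one-hot encoding uncleaned job titles produces two separate columns for what is really a single category:

import pandas as pd

toy = pd.DataFrame({"job_title": ["Data Analyst", "Data A*nalyst"]})
print(pd.get_dummies(toy).columns.tolist())
# ['job_title_Data A*nalyst', 'job_title_Data Analyst']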

So here are our preprocessing goals:

  • Remove special characters from the job_title column
  • Create bins and assign each postal code to a bin
  • One-hot encode the job_title, city, and postal_code_bin columns (created in the previous step)

Applying FunctionTransformer and Custom Transformer

To clean the job_title column and bin the values in the postal_codes column, we will leverage scikit-learn's FunctionTransformer and build our own custom transformer, respectively. Incorporating these transformations within our preprocessing pipeline allows us to streamline the data preprocessing steps. As the blog title says, we'll be using some tools from the scikit-learn library, although this is not the most robust solution as we won't be using the ColumnTransformer. Stay tuned for part three of this series where we'll introduce the ColumnTransformer for a more elegant approach.

Using Scikit-learn's FunctionTransformer

Here is the complete script, some of which you may already know from the first part of this series:
"""Script for preprocessing data"""

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

from preprocessing import (
    PostalCodeBinTransformer,
    one_hot_encode_columns,
    remove_special_chars,
)

# Path to our raw data
DATA_PATH = "data/data.xlsx"

# Reading in data and converting the postal_code column to an integer
raw_data = pd.read_excel(DATA_PATH, converters={"postal_code": int})

# Selecting the columns in which missing values should be imputed
cols_to_impute = ["age", "experience_in_years", "salary_in_dollar"]

# Creating our Pipeline object with a SimpleImputer and MinMaxScaler from sklearn
pipeline_num = Pipeline(
    [
        ("MeanImputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("MinMaxScaler", MinMaxScaler()),
    ]
)

# Creating our special_char_remover with scikit-learn's FunctionTransformer
special_char_remover = FunctionTransformer(
    remove_special_chars,
    kw_args={"column_name": "job_title"}
).set_output(transform="pandas")

# Name the columns which should be one hot encoded
columns_to_encode = ["job_title", "city", "postal_code_bin"]

# Creating our one hot encoding with scikit-learn's FunctionTransformer
one_hot_encoder = FunctionTransformer(
    one_hot_encode_columns,
    kw_args={"columns_to_encode": columns_to_encode},
)

# Creating our second Pipeline object with our wrapped
# preprocessing functions and our custom transformer
pipeline_cat = Pipeline(
    [
        ("SpecialCharRemover", special_char_remover),
        ("PostalCodeBinner", PostalCodeBinTransformer()),
        ("OneHotEncoder", one_hot_encoder),
    ]
)

# Copying our raw_data and imputing missing values with the column's mean
imputed_data = raw_data.copy()
imputed_data[cols_to_impute] = pipeline_num.fit_transform(raw_data[cols_to_impute])

# Remove the special characters from the job_title column
# Add column postal_code_bin based on the postal_code column
# One-hot encode the columns in the list columns_to_encode
cleaned_data = pipeline_cat.fit_transform(imputed_data)

And here is the code from our imported functions and our custom transformer from the preprocessing.py file:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


def remove_special_chars(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Removes special characters from a column.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing the column to be cleaned.
    column_name : str
        The name of the column whose values should be cleaned of special characters.

    Returns
    -------
    pd.DataFrame
        The DataFrame where the specified column values are cleaned from special characters.
    """
    # Define the set of special characters to remove
    pattern = r"[*/!_\-.,;?$%&^°]"
    df[column_name] = df[column_name].str.replace(pattern, "", regex=True)
    return df


def one_hot_encode_columns(
    df: pd.DataFrame, columns_to_encode: list[str]
) -> pd.DataFrame:
    """
    One-hot encodes specified categorical columns in a DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing the categorical columns to be
        one-hot encoded. NumPy arrays are rejected, because
        pd.get_dummies needs labeled columns.

    columns_to_encode : list[str]
        A list of column names in the DataFrame to be one-hot encoded.

    Returns
    -------
    pd.DataFrame
        The DataFrame with the specified columns one-hot encoded.
    """
    if isinstance(df, pd.DataFrame):
        encoded_df = pd.get_dummies(df, columns=columns_to_encode)
        return encoded_df
    elif isinstance(df, np.ndarray):
        raise ValueError("Input must be a DataFrame for one-hot encoding.")
    else:
        raise TypeError("Unsupported input type. Expected DataFrame or numpy array.")


class PostalCodeBinTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.bins = [
            0, 10000, 20000, 30000, 40000,
            50000, 60000, 70000, 80000, 90000, 100000,
        ]
        self.labels = [
            "0-10000", "10000-20000", "20000-30000", "30000-40000",
            "40000-50000", "50000-60000", "60000-70000", "70000-80000",
            "80000-90000", "90000-100000",
        ]

    def fit(self, X: pd.DataFrame, y=None):
        # Nothing to learn from the data; fit only satisfies the
        # scikit-learn transformer interface
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        postal_codes = X["postal_code"].astype(int)
        X["postal_code_bin"] = pd.cut(
            postal_codes, bins=self.bins, labels=self.labels
        )
        return X

The following step-by-step explanation will focus on how we use scikit-learn's FunctionTransformer to integrate our remove_special_chars and one_hot_encode_columns functions into the pipeline. We will also use our custom PostalCodeBinTransformer to bin the values of the postal_code column into groups:

Imputing and Scaling Numerical Data

"""Script for preprocessing data"""

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

from preprocessing import remove_special_chars

# Path to our raw data
DATA_PATH = "data/data.xlsx"

# Reading in data and converting the postal_code column to an integer
raw_data = pd.read_excel(DATA_PATH, converters={"postal_code": int})

# Selecting the columns in which missing values should be imputed
cols_to_impute = ["age", "experience_in_years", "salary_in_dollar"]

# Creating our Pipeline object with a SimpleImputer and MinMaxScaler from sklearn
pipeline_num = Pipeline(
    [
        ("MeanImputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("MinMaxScaler", MinMaxScaler()),
    ]
)

The first part of the script preprocesses the numeric data. It imputes missing values in the age, experience_in_years and salary_in_dollar columns with the mean value of each column and then scales the data between 0 and 1 using a MinMaxScaler. Refer to the first part of this series for more details.
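To see these two steps in isolation, here is a minimal, self-contained sketch with made-up numbers: the missing age is imputed with the column mean (30.0), and the result is then scaled to the range [0, 1]:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

demo_pipeline = Pipeline(
    [
        ("MeanImputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("MinMaxScaler", MinMaxScaler()),
    ]
)

print(demo_pipeline.fit_transform(toy))
# [[0. ]
#  [0.5]
#  [1. ]]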

Transforming our Functions with Scikit-learn's FunctionTransformer

remove_special_chars

# Selecting the column from which special characters should be removed
special_char_remover = FunctionTransformer(
    remove_special_chars,
    kw_args={"column_name": "job_title"}
).set_output(transform="pandas")
  • FunctionTransformer allows us to apply custom functions to data within a pipeline. In this case, we are transforming our imported remove_special_chars function.
  • remove_special_chars refers to the function imported earlier that removes special characters from a DataFrame column.
  • The argument kw_args={"column_name": "job_title"} allows us to specify keyword arguments – in this case we want to set the column_name argument to "job_title" – for our custom remove_special_chars function to ensure the right column is cleaned.
  • The .set_output(transform="pandas") method is used to specify the output type of the transformer. By default, Scikit-learn transformers return NumPy arrays. However, when working with DataFrames, it's often more convenient to maintain the DataFrame format to preserve column names and other metadata. This is especially useful when you need to apply further transformations or inspect the transformed data.
  • In summary, this code creates a reusable transformer based on a custom function that can easily be included in a scikit-learn pipeline to clean special characters from a specific column (see the short sketch below).
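Here is a short sketch of the transformer on a tiny, made-up DataFrame (assuming the preprocessing.py module from above is importable):

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

from preprocessing import remove_special_chars

toy = pd.DataFrame({"job_title": ["Data A*nalyst", "Data- Engineer!"]})

demo_remover = FunctionTransformer(
    remove_special_chars,
    kw_args={"column_name": "job_title"},
).set_output(transform="pandas")

print(demo_remover.fit_transform(toy)["job_title"].tolist())
# ['Data Analyst', 'Data Engineer']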

one_hot_encode_columns

# Name the columns which should be one hot encoded
columns_to_encode = ["job_title", "city", "postal_code_bin"]

# Creating our one hot encoding with scikit-learn's FunctionTransformer
one_hot_encoder = FunctionTransformer(
    one_hot_encode_columns,
    kw_args={"columns_to_encode": columns_to_encode},
)

  • Similarly, we use FunctionTransformer to convert our one_hot_encode_columns function into a transformer for one-hot encoding of specified categorical columns (a short example follows).
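For a quick impression of its output, here is the underlying function applied directly to a tiny, made-up DataFrame (only a city column, for brevity):

import pandas as pd

from preprocessing import one_hot_encode_columns

toy = pd.DataFrame({"city": ["Berlin", "Hamburg", "Berlin"]})

encoded = one_hot_encode_columns(toy, columns_to_encode=["city"])
print(encoded.columns.tolist())
# ['city_Berlin', 'city_Hamburg'] -- one indicator column per category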
Building the Categorical Data Preprocessing Pipeline

Let's now move on to creating the second pipeline that handles the categorical data. This pipeline contains our custom preprocessing functions and transformers to ensure that our categorical data is effectively cleaned, binned and one-hot encoded.
Here is the code to create this pipeline:

# Creating our second Pipeline object with our wrapped
# preprocessing functions and our custom transformer
pipeline_cat = Pipeline(
    [
        ("SpecialCharRemover", special_char_remover),
        ("PostalCodeBinner", PostalCodeBinTransformer()),
        ("OneHotEncoder", one_hot_encoder),
    ]
)

1. Removing Special Characters from the job_title column

("SpecialCharRemover", special_char_remover)

This step uses our transformer special_char_remover, which cleans the job_title column by removing unwanted special characters. This ensures that variations in job titles caused by special characters do not lead to redundant categories.

2. Creating Bins for Postal Codes

("PostalCodeBinner", PostalCodeBinTransformer())

This custom transformer, PostalCodeBinTransformer, bins postal codes into predefined groups. This reduces the high-dimensional feature space that would result from one-hot encoding individual postal codes. Let's look at how this custom transformer is implemented:

class PostalCodeBinTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.bins = [
            0, 10000, 20000, 30000, 40000,
            50000, 60000, 70000, 80000, 90000, 100000,
        ]
        self.labels = [
            "0-10000", "10000-20000", "20000-30000", "30000-40000",
            "40000-50000", "50000-60000", "60000-70000", "70000-80000",
            "80000-90000", "90000-100000",
        ]

    def fit(self, X: pd.DataFrame, y=None):
        # Nothing to learn from the data; fit only satisfies the
        # scikit-learn transformer interface
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        postal_codes = X["postal_code"].astype(int)
        X["postal_code_bin"] = pd.cut(
            postal_codes, bins=self.bins, labels=self.labels
        )
        return X

In the context of creating custom transformers in scikit-learn, both a fit and a transform method are necessary to conform to the interface expected by the scikit-learn pipeline. Even if the transformer doesn't require fitting, the presence of these methods ensures compatibility with the pipeline's operation.
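Stripped of our specifics, the minimal skeleton looks like this (a generic sketch; names like MyTransformer and some_param are placeholders, not part of this blog's code):

from sklearn.base import BaseEstimator, TransformerMixin


class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, some_param=None):
        # Only store constructor parameters here; no computation
        self.some_param = some_param

    def fit(self, X, y=None):
        # Learn anything needed from X, then always return self
        return self

    def transform(self, X):
        # Return the transformed data
        return X

Inheriting from TransformerMixin additionally provides fit_transform for free, and BaseEstimator supplies get_params/set_params, which scikit-learn uses for cloning and hyperparameter search.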

Initialization (__init__ method)
The __init__ method is used to initialize the transformer. In the PostalCodeBinTransformer, it defines the bins and labels that will be used to categorize postal codes.

Fitting (fit method)
The fit method is intended to prepare the transformer based on the training data. However, for the PostalCodeBinTransformer, no preparation based on the data is needed, so the method simply returns self.
Even though the fit method doesn't do anything in this transformer, it is required because scikit-learn expects every transformer to have this method. By convention it also accepts an optional target argument (y=None), since pipelines pass the target along to each step during fitting. This ensures the transformer can be integrated seamlessly into a pipeline, where fit will be called before transform.

Transformation (transform method)
The transform method is where the actual data transformation happens. For the PostalCodeBinTransformer, this involves binning the postal codes into predefined groups.
In the transform method, the following happens:

  • Conversion to integer: The postal codes are converted to integers to ensure they can be properly binned.
  • Binning: The pd.cut function is used to bin the postal codes into the specified ranges (bins).
  • Adding a new column: A new column, postal_code_bin, is added to the DataFrame to store the bin labels.

By defining both the fit and the transform method, the PostalCodeBinTransformer becomes a well-formed scikit-learn transformer, ready to be used in a preprocessing pipeline. This adherence to the expected interface allows for seamless integration and consistent behavior within the scikit-learn framework.
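As a quick sanity check outside the pipeline, the transformer can also be applied on its own (the postal codes below are made up for illustration):

import pandas as pd

from preprocessing import PostalCodeBinTransformer

toy = pd.DataFrame({"postal_code": [1069, 50667, 80331]})

binned = PostalCodeBinTransformer().fit_transform(toy)
print(binned["postal_code_bin"].tolist())
# ['0-10000', '50000-60000', '80000-90000']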

3. One-Hot Encoding Categorical Columns

("OneHotEncoder", one_hot_encoder)

This step uses our transformer one_hot_encoder, which one-hot encodes the specified categorical columns (job_title, city and postal_code_bin). One-hot encoding converts each categorical variable into a set of binary indicator columns, a numerical form that machine learning algorithms can work with.

Applying the Pipeline

After defining the pipeline, we can apply it to our data to preprocess it:

# Copying our raw_data and imputing missing values with the column's mean
imputed_data = raw_data.copy()
imputed_data[cols_to_impute] = pipeline_num.fit_transform(raw_data[cols_to_impute])

# Remove the special characters from the job_title column
# Add column postal_code_bin based on the postal_code column
# One-hot encode the columns in the list columns_to_encode
cleaned_data = pipeline_cat.fit_transform(imputed_data)

We first create a copy of the raw data and then apply the numerical data preprocessing pipeline (pipeline_num) to impute missing values and scale the specified numerical columns.

Next, we apply the categorical data preprocessing pipeline (pipeline_cat) to the imputed data. This step removes special characters from the job_title column, bins the postal codes, and one-hot encodes the categorical columns.
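One caveat worth noting, sketched below under the assumption that a hypothetical new batch of data (new_data.xlsx) arrives later: a fitted pipeline should be applied to new data with transform rather than fit_transform, so that the imputation means and scaling ranges learned from the original data are reused instead of recomputed.

# Hypothetical new batch; file name and columns assumed for illustration
new_batch = pd.read_excel("data/new_data.xlsx", converters={"postal_code": int})

new_batch[cols_to_impute] = pipeline_num.transform(new_batch[cols_to_impute])
new_cleaned = pipeline_cat.transform(new_batch)

Keep in mind that pd.get_dummies derives its columns from the categories present in each batch, so a batch with unseen categories can yield a different set of columns; scikit-learn's OneHotEncoder avoids this by remembering the categories seen during fitting, which is one more reason the approach previewed for part three is more robust.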

To get a better understanding of what happens before one-hot encoding, here is an intermediate state of the data:


1. The column job_title no longer contains any special characters.
2. An additional column, postal_code_bin, has been added, which we can use for our one-hot encoding in the next step.

After running our code and looking at our transformed data, everything looks fine:


As the one-hot-encoded columns for the job titles and cities are not visible in the screenshot above, here are two more screenshots for the sake of completeness:

One-hot-encoded values of the job_title column.

One-hot-encoded values of the city column.

By integrating these preprocessing steps into a single pipeline, we streamline the data cleaning and transformation process, making it more efficient and less prone to error. This setup also ensures that our data is consistently processed in the same way every time, which is critical for reproducible and reliable machine learning workflows.

Conclusion

In this part of the series, we have extended our preprocessing pipeline to handle categorical data by removing special characters, binning postal codes, and applying one-hot encoding. These steps are crucial for preparing data for machine learning models, such as neural networks, which require numerical input. Stay tuned for the third and final part of this series, where we will introduce scikit-learn's ColumnTransformer for a more elegant and robust approach to preprocessing mixed data types.