Preprocessing with Scikit-learn Pipelines (3/3)

Introduction

Welcome back to our blog series on Scikit-Learn Pipelines! After exploring the fundamentals of pipelines in part 1 and discussing the necessity of robust preprocessing steps for various data types in part 2, today we're diving deep into a tool that elevates data preprocessing in Scikit-Learn to a new level: the ColumnTransformer.

In the previous article, we saw the importance of applying different preprocessing steps to different columns in our dataset. We sometimes resorted to manually selecting columns and creating separate pipelines for numerical and categorical features. While this works, it can quickly become cluttered and error-prone, especially as our dataset grows and becomes more complex. This is precisely where the ColumnTransformer comes in to offer us an elegant and robust solution.

The ColumnTransformer: A Conductor for Your Data

The ColumnTransformer is a class in Scikit-Learn that allows us to apply different transformers to different groups of columns in our DataFrame.

ColumnTransformer

Organize and Improve Readability: It centralizes all preprocessing logic, making our code much cleaner and easier to understand.
Enhance Robustness: It enables us to apply specific transformations to precisely the columns they are intended for, thereby avoiding potential errors caused by accidentally applying a transformation to the wrong data type.
Improve Maintainability: Changes to the preprocessing of specific columns can be made in isolation without affecting the rest of the pipeline.

Now, let's take a look at a concrete example to see the power of the ColumnTransformer in action.

Important Note

The code shown is intentionally kept simple to illustrate the functionality of the ColumnTransformer. In the real world, it would be advisable to include additional checks to ensure, for example, that the data types of the columns meet expectations. More comprehensive error handling would also further enhance the robustness of the code. Furthermore, in the interest of clarity and keeping the focus squarely on the ColumnTransformer, some best practices might have been intentionally overlooked to maintain simplicity. The primary goal of this article was to highlight the strength and elegance of the ColumnTransformer without getting bogged down in overly complex code.

Application Example: Comprehensive Data Preprocessing with ColumnTransformer

Consider the following code, which demonstrates how we can use the ColumnTransformer to apply various preprocessing steps to a sample dataset:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    MinMaxScaler,
    OneHotEncoder,
)

from preprocessing import PostalCodeBinTransformer, remove_special_chars

# Path to the raw data
DATA_PATH = "data/data.xlsx"

# Read data and convert postal_code to integer
raw_data = pd.read_excel(DATA_PATH, converters={"postal_code": int})

# Define numerical columns
COLS_TO_ENCODE = ["age", "experience_in_years", "salary_in_dollar"]

# Pipeline for numerical data: imputation and scaling
numerical_pipeline = Pipeline(
    [("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]
)

# Transformer for special characters in job titles
special_char_remover = FunctionTransformer(
    remove_special_chars,
    validate=False,
    feature_names_out="one-to-one",
).set_output(transform="pandas")


# ColumnTransformer for all preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, COLS_TO_ENCODE),
        ("job_title", special_char_remover, ["job_title"]),
        ("postal_code", PostalCodeBinTransformer(), ["postal_code"]),
        ("city", "passthrough", ["city"]),
    ],
    remainder="drop",  # Remove other columns (e.g., first_name)
    verbose_feature_names_out=False,
)

# One-Hot Encoding for categorical features
ohe_transformer = ColumnTransformer(
    transformers=[
        (
            "ohe",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            ["job_title", "city", "postal_code_bin"],
        )
    ],
    remainder="passthrough",  # Keep numerical data
    verbose_feature_names_out=False,
).set_output(transform="pandas")


preprocessed_data = preprocessor.fit_transform(raw_data)
preprocessed_df = pd.DataFrame(
    preprocessed_data, columns=preprocessor.get_feature_names_out()
)

cleaned_data = ohe_transformer.fit_transform(preprocessed_df)

# Display results
pd.set_option("display.max_columns", 8)
pd.set_option("display.expand_frame_repr", False)
print(cleaned_data.head())

Let's break down this code step by step:

1. Importing Libraries:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    MinMaxScaler,
    OneHotEncoder,
)

from preprocessing import PostalCodeBinTransformer, remove_special_chars

We import necessary libraries from pandas for data manipulation and various modules from sklearn for preprocessing, including ColumnTransformer,SimpleImputer, Pipeline, FunctionTransformer, MinMaxScaler, and OneHotEncoder. We also import the custom transformer PostalCodeBinTransformer and the function remove_special_chars from the preprocessing.py file.

2. Defining Data Path and Reading Data:

# Path to the raw data
DATA_PATH = "data/data.xlsx"

# Read data and convert postal_code to integer
raw_data = pd.read_excel(DATA_PATH, converters={"postal_code": int})

We define the path to our raw data file (data.xlsx) and read the data into a pandas DataFrame. The converters argument in pd.read_excel ensures that the postal_code column is read as an integer.

3. Defining Numerical Columns:

# Define numerical columns
    COLS_TO_ENCODE = ["age", "experience_in_years", "salary_in_dollar"]

We create a list of column names that we consider numerical features.

4. Creating a Pipeline for Numerical Data:

# Pipeline for numerical data: imputation and scaling
numerical_pipeline = Pipeline(
    [("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]
)

Here, we define a Scikit-Learn Pipeline to handle the preprocessing of our numerical features. This pipeline consists of two steps:

SimpleImputer(strategy="mean"): This step will replace any missing values in the numerical columns with the mean of that column.
MinMaxScaler(): This step will scale the numerical features to a range between 0 and 1.

5. Creating a Transformer for Special Characters in Job Titles:

# Transformer for special characters in job titles
special_char_remover = FunctionTransformer(
    remove_special_chars,
    validate=False,
    feature_names_out="one-to-one",
).set_output(transform="pandas")

We use FunctionTransformer to apply the remove_special_chars function (defined in preprocessing.py) to the job_title column.

remove_special_chars: This function takes a pandas Series as input and removes any special characters based on a defined regular expression.
validate=False We set this to False as we are confident in the function's input.
feature_names_out="one-to-one" This ensures that the output column name is the same as the input column name.
.set_output(transform="pandas") This ensures that the output of the transformer is a pandas DataFrame.

Explanation of preprocessing.py:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


def remove_special_chars(series: pd.Series) -> pd.Series:
    """
    Remove special characters from a pandas Series.

    This function removes special characters from a given pandas Series using 
    a predefined regular expression pattern. The targeted special characters 
    include symbols such as `*`, `/`, `!`, `_`, `-`, `.`, `,`, `;`, `?`, `$`, 
    `%`, `&`, `^`, and others.

    Parameters
    ----------
    series : pd.Series
        A pandas Series containing string data from which special characters 
        will be removed.

    Returns
    -------
    pd.Series
        A pandas Series with all special characters removed.
    """
    pattern = "[*/!_\\-.,;?$%&$^\xc2\xb0]"
    cleaned_series = series.replace(pattern, "", regex=True)
    return cleaned_series


class PostalCodeBinTransformer(BaseEstimator, TransformerMixin):
    """
    A custom transformer that bins postal codes into predefined ranges.

    This transformer takes a DataFrame containing a postal code column and 
    assigns each postal code to a predefined range (bin). The ranges are 
    intervals of 10,000, and each bin is labeled accordingly.

    Attributes
    ----------
    bins : list of int
        A list defining the upper limits of each bin. These bins represent 
        ranges such as 0-10000, 10000-20000, etc.
    labels : list of str
        A list of labels corresponding to the ranges defined by `bins`. Each 
        label represents a range in string format, e.g., "0-10000".

    Methods
    -------
    fit(X, y=None)
        Fit the transformer (no operation needed).
    transform(X)
        Transform the input DataFrame by binning postal codes.
    """

    def __init__(self):
        """
        Initialize the PostalCodeBinTransformer with predefined bins and labels.

        The transformer divides postal codes into intervals of 10,000 and assigns 
        a label to each interval. For example, postal codes in the range 0-10000 
        are labeled as "0-10000".
        """
        self.bins = [
            0,
            10000,
            20000,
            30000,
            40000,
            50000,
            60000,
            70000,
            80000,
            90000,
            100000,
        ]
        self.labels = [
            "0-10000",
            "10000-20000",
            "20000-30000",
            "30000-40000",
            "40000-50000",
            "50000-60000",
            "60000-70000",
            "70000-80000",
            "80000-90000",
            "90000-100000",
        ]

    def fit(self, X: pd.DataFrame, y=None):
        """
        Fit the transformer to the data (no operation needed).

        This method is provided for compatibility with the scikit-learn 
        transformer interface. No actual fitting is required for this transformer.

        Parameters
        ----------
        X : pd.DataFrame
            The input DataFrame.
        y : None
            Ignored. Exists for compatibility with scikit-learn.

        Returns
        -------
        PostalCodeBinTransformer
            The transformer instance itself.
        """
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Transform the input DataFrame by binning postal codes.

        The postal codes in the 'postal_code' column of the input DataFrame are 
        assigned to predefined bins based on their values. A new column 
        'postal_code_bin' is added to the DataFrame, containing the corresponding 
        bin labels.

        Parameters
        ----------
        X : pd.DataFrame
            The input DataFrame containing a 'postal_code' column to be binned.

        Returns
        -------
        pd.DataFrame
            The transformed DataFrame with an additional 'postal_code_bin' column 
            representing the binned postal code ranges.
        """
        postal_codes = X["postal_code"].astype(int)
        X["postal_code_bin"] = pd.cut(
            postal_codes, bins=self.bins, labels=self.labels
        )
        return X.drop(columns=["postal_code"])

    def get_feature_names_out(self, input_features=None):
        """
        Get the names of the output features.

        This method is part of the scikit-learn interface and returns the names 
        of the features generated by the transformer.

        Parameters
        ----------
        input_features : None
            Ignored. Exists for compatibility with scikit-learn.

        Returns
        -------
        list of str
            A list containing the name of the output feature, 
            i.e., ['postal_code_bin'].
        """
        return ["postal_code_bin"]

remove_special_chars(series): This function uses a regular expression to remove a predefined set of special characters from each string in the input pandas Series.
PostalCodeBinTransformer: This is a custom transformer that inherits from BaseEstimator and TransformerMixin

__init__(self): Initializes the transformer with predefined bins (upper limits of postal code ranges) and labels for these bins.
fit(self, X, y=None): This method is required by the Scikit-Learn API but doesn't perform any operation in this transformer. It simply returns the transformer object itself.
transform(self, X): This method takes a DataFrame X as input, extracts the postal_code column, converts it to integer type, and then uses pd.cut to bin the postal codes into the predefined ranges. It adds a new column postal_code_bin to the DataFrame containing the bin labels and then drops the original postal_code column.
get_feature_names_out(self, input_features=None): This method is used to define the names of the output features generated by this transformer. In this case, it outputs a single column named postal_code_bin.

6. Creating the First ColumnTransformer for Initial Preprocessing:

# ColumnTransformer for all preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, COLS_TO_ENCODE),
        ("job_title", special_char_remover, ["job_title"]),
        ("postal_code", PostalCodeBinTransformer(), ["postal_code"]),
        ("city", "passthrough", ["city"]),
    ],
    remainder="drop",  # Remove other columns (e.g., first_name)
    verbose_feature_names_out=False,
)

Here, we instantiate the ColumnTransformer named preprocessor. The transformers argument takes a list of tuples, where each tuple defines a transformation to be applied to a specific set of columns:

("numerical", numerical_pipeline, COLS_TO_ENCODE): Applies the numerical_pipeline (imputation and scaling) to the columns listed in COLS_TO_ENCODE. The string "numerical" is the name given to this transformer.
("job_title", special_char_remover, ["job_title"]): Applies the special_char_remover (our FunctionTransformer) to the job_title column.
("postal_code", PostalCodeBinTransformer(), ["postal_code"]): Applies our custom PostalCodeBinTransformer to the postal_code column.
("city", "passthrough", ["city"]): Applies the "passthrough" strategy to the city column, meaning this column will be kept without any transformation.
remainder="drop": This specifies what to do with any columns in the input DataFrame that are not explicitly mentioned in the transformers list. In this case, we are dropping them (e.g., a potential first_name column).
verbose_feature_names_out=False: This option controls whether the output feature names should be verbose (including the transformer name). We set it to False for cleaner output.

7. Creating the Second ColumnTransformer for One-Hot Encoding:

# One-Hot Encoding for categorical features
ohe_transformer = ColumnTransformer(
    transformers=[
        (
            "ohe",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            ["job_title", "city", "postal_code_bin"],
        )
    ],
    remainder="passthrough",  # Keep numerical data
    verbose_feature_names_out=False,
).set_output(transform="pandas")

We create another ColumnTransformer named ohe_transformer to perform One-Hot Encoding on the categorical features.

("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["job_title", "city", "postal_code_bin"]): Applies the OneHotEncoder to the job_title, city, and postal_code_bin columns.
handle_unknown="ignore" This tells the encoder to ignore unknown categories encountered during the transform phase.
sparse_output=False This ensures that the output of the One-Hot Encoder is a dense NumPy array rather than a sparse matrix.
remainder="passthrough" This time, we set remainder to "passthrough", meaning that any columns not specified in the transformers list (which will be our processed numerical columns from the first ColumnTransformer) will be passed through without any transformation.
.set_output(transform="pandas"): This ensures the output is a pandas DataFrame.

8. Fitting and Transforming the Data:

preprocessed_data = preprocessor.fit_transform(raw_data)
preprocessed_df = pd.DataFrame(
    preprocessed_data, columns=preprocessor.get_feature_names_out()
)

cleaned_data = ohe_transformer.fit_transform(preprocessed_df)

First, we fit the preprocessor to our raw_data and then transform it to get the preprocessed data. The get_feature_names_out() method of the preprocessor is used to get the names of the output columns, which we then use to create a pandas DataFrame preprocessed_df.

Next, we fit the ohe_transformer to the preprocessed_df and transform it to get our final cleaned_data.

9. Displaying Results:

# Display results
pd.set_option("display.max_columns", 8)
pd.set_option("display.expand_frame_repr", False)
print(cleaned_data.head())

Finally, we set some pandas display options to make the output more readable and print the head of the cleaned_data DataFrame.

The final result stored in the `cleaned_data` dataframe

Elegance and Robustness

As you can see, the ColumnTransformer allows us to clearly define and bundle the preprocessing steps for different column types within a single structure. This is not only more organized than manually selecting and transforming columns, as we might have hinted at in part 2, but also more robust. We ensure that each transformation is applied only to the intended columns, minimizing the risk of errors due to incompatible data types.

The use of Pipelines within the ColumnTransformer for the numerical columns further demonstrates how we can create complex preprocessing chains for individual column groups, highlighting the flexibility and power of this tool.

Conclusion

The ColumnTransformer is an indispensable tool in the world of Scikit-Learn Pipelines. It offers an elegant and robust way to apply different preprocessing steps to different columns in your dataset. By centrally organizing the preprocessing logic, it improves the readability, maintainability, and reliability of your machine learning workflow.

Preprocessing with Scikit-learn Pipelines 3/3

The Elegance and Robustness of ColumnTransformer