Preprocessing with Scikit-learn Pipelines (1/3)

Motivation and Standard Transformers

Posted by Fabian Hruby on May 15, 2024

Introduction

You have probably come across the phrase "garbage in, garbage out" when reading articles or books on machine learning. It refers to the fact that high-quality data is the key to successfully developing a model that really adds value. That's where data preprocessing comes in: it is a fundamental step in the machine learning lifecycle with a significant impact on model performance and efficiency.

In this three-part series, we'll take a look at the world of data preprocessing using scikit-learn pipelines. Part 1 explores the "why" and "how" of preprocessing, focusing on standard transformers and their limitations. Part 2 will explore more advanced techniques for handling categorical data and building custom transformers, while part 3 will introduce scikit-learn's ColumnTransformer, which is suitable for applying transformations to heterogeneous datasets (datasets with different data types).

Why do we even want to preprocess our data?

Clean and properly scaled data is critical to successfully training machine learning models - and here's why:

  • Missing or Incorrect Values

    Data collection isn't always perfect and you may come across missing or incorrect values. These missing and incorrect values can confuse your model and lead to inaccurate predictions. Preprocessing techniques such as imputation (filling in missing values) or deletion ensure that your data is valid and ready for model building.
  • Different Scales

    Imagine that one feature in your data represents income in dollars (from 0 to millions), and another represents age in years (0 to 100). If you feed this directly into a model, the income feature will overwhelm the age feature due to its larger scale. This can cause the model to prioritize income over age, even though age may be a more important factor for your prediction. Preprocessing techniques such as scaling or normalization address this by putting all features on a similar scale.

By handling missing values and ensuring consistent scales, preprocessing helps your machine learning model focus on the actual relationships within your data. This leads to cleaner, more accurate and more interpretable results.

In addition to these, there are other reasons to preprocess data, such as:

  • Encoding Categorical Features

    Many models can't understand text labels directly. Preprocessing techniques such as one-hot encoding transform categorical features (such as "red", "blue", "green") into numerical representations that the model can work with; a short sketch after this list shows what this looks like in code.
  • Handling Outliers

    Extreme outliers can skew your model's predictions. Preprocessing techniques such as capping address these outliers without discarding valuable data points.
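To make both techniques concrete, here is a minimal, self-contained sketch using scikit-learn's OneHotEncoder and quantile-based capping with pandas. The column names and the 5th/95th-percentile thresholds are illustrative assumptions, not part of the data used in this series (and sparse_output requires scikit-learn >= 1.2):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data with a categorical column and a numeric column containing an outlier
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "income": [30_000, 45_000, 52_000, 2_000_000],  # 2,000,000 is an extreme outlier
})

# One-hot encoding: each category becomes its own 0/1 column
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["color"]])
print(encoder.get_feature_names_out())  # ['color_blue' 'color_green' 'color_red']
print(encoded)

# Capping (winsorizing): clip values to the 5th and 95th percentiles
lower, upper = df["income"].quantile([0.05, 0.95])
df["income_capped"] = df["income"].clip(lower, upper)
print(df[["income", "income_capped"]])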

Enter the Pipeline: A Streamlined Approach with Scikit-learn

Scikit-learn describes a pipeline as follows:

A sequence of data transformers with an optional final predictor.

Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.

[Figure: diagram of a scikit-learn Pipeline]
In order to understand how a pipeline works and how you can utilize it, we will use a very simple script. I'll explain the different parts step by step to give you a good understanding of what's going on.

But first, let's have a quick look at the data we will be working with in this preprocessing series:


[Screenshot: sample data with erroneous numeric values highlighted in yellow rectangles and further issues highlighted in red]

In this post, we will focus on the erroneous data in the yellow rectangles. The data points highlighted in red are the ones we will be looking at in part two.

As you can see, we are only focusing on the numerical columns. The problem with these data points is obvious: we have missing data, but we will find a way to fix this – with a simple preprocessing pipeline from scikit-learn's Pipeline class.

Here is the code that we will discuss step by step to gain insight into a standard preprocessing pipeline:

"""Script for preprocessing data"""

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Path to our raw data
DATA_PATH = "data/data.xlsx"

# Reading in data, parsing the postal_code column as integers
raw_data = pd.read_excel(DATA_PATH, converters={"postal_code": int})

# Selecting the columns in which missing values should be imputed
cols_to_impute = ["age", "experience_in_years", "salary_in_dollar"]

# Creating our Pipeline object with a SimpleImputer and MinMaxScaler from sklearn
pipeline = Pipeline([
    ("MeanImputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("MinMaxScaler", MinMaxScaler()),
])

# Copying our raw_data and imputing missing values with each column's mean
imputed_data = raw_data.copy()
imputed_data[cols_to_impute] = pipeline.fit_transform(raw_data[cols_to_impute])


Here's a step-by-step breakdown of the above (simple) pipeline for imputing missing values and scaling numeric features:

1. Import Necessary Libraries

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

2. Load and Prepare Data

DATA_PATH = "data/data.xlsx"
raw_data = pd.read_excel(DATA_PATH, converters={"postal_code": int})
  • Replace data/data.xlsx with the path to your actual data file.
  • The converters parameter in pd.read_excel ensures the postal_code column is read as integers.
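If you don't have an Excel file like this at hand, you can follow along with a small hand-made DataFrame instead; the values below are made up and only mirror the columns used in this post:

import numpy as np
import pandas as pd

# Toy substitute for data/data.xlsx (illustrative values, including some NaNs)
raw_data = pd.DataFrame({
    "postal_code": [10115, 20095, 80331, 50667],
    "age": [25, np.nan, 41, 33],
    "experience_in_years": [2, 7, np.nan, 10],
    "salary_in_dollar": [40_000, 55_000, 72_000, np.nan],
})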

3. Select Columns to Impute and Scale

cols_to_impute = ["age", "experience_in_years", "salary_in_dollar"]

4. Simple Imputation for Missing Values

pipeline = Pipeline([
    ("MeanImputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("MinMaxScaler", MinMaxScaler()),
])

Here it is: the construction of our Pipeline object, named pipeline.

Important Note: While selecting columns within the pipeline might seem convenient for this example, it is generally not recommended. This approach becomes inflexible when you have datasets with varying column types. In the third part of this series we'll explore scikit-learn's ColumnTransformer, a more versatile approach for handling heterogeneous datasets.

As you can see, the pipeline object is created by passing a list of tuples to scikit-learn's Pipeline. The first element of each tuple is the name of the step, which you can choose freely. The second element is the transformation you want to apply to your data. Note that all these steps must be "transforms", which means that they must implement a fit and a transform method. We will take a closer look at this when we talk about custom transformers in the second part of this series.
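As a quick, purely illustrative sanity check, you can confirm that both of our steps expose this interface:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Every step in a Pipeline (except, optionally, the last) must be a
# transformer, i.e. it must implement both fit and transform:
for step in (SimpleImputer(), MinMaxScaler()):
    print(type(step).__name__, hasattr(step, "fit"), hasattr(step, "transform"))
# SimpleImputer True True
# MinMaxScaler True True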

In our case, we are using two transformations:

  • SimpleImputer

    The simple imputer does what its name says: it imputes (missing) values based on a (simple) chosen strategy ("mean", "median", "most_frequent", "constant"). In our case, the strategy is "mean", which means that any missing value will be replaced by the mean of the column in which it appears. I won't go into further details or other imputers here, as that would be beyond the scope of this article, but you can dig deeper in the scikit-learn documentation on imputation.

    The missing_values parameter allows you to specify how missing values are represented, if you need to. Although "mean" and np.nan are the default values for strategy and missing_values, I wanted to include them to show you that you can tweak the SimpleImputer to suit your needs.

  • MinMaxScaler

    Min-max scaling is a widely used technique for scaling values into a fixed range between 0 and 1. The mathematical formula is as follows:

    $$x_{scaled} = \frac{x - X_{min}}{X_{max} - X_{min}}$$

    x_scaled: the scaled value, ranging between 0 and 1
    x: the value to be scaled
    X_min: the minimum value in the column
    X_max: the maximum value in the column

    MinMaxScaler is only one of many scaling techniques available in scikit-learn. The choice of scaler depends on your specific data and the desired outcome.
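To see what each step does in isolation, here is a small standalone demo on a single toy column; the numbers are made up for illustration:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [np.nan], [40.0]])

# Step 1: the imputer replaces NaN with the column mean, (10 + 20 + 40) / 3 = 23.33
imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(imputed.ravel())  # [10. 20. 23.33 40.]

# Step 2: the scaler maps the minimum (10) to 0 and the maximum (40) to 1
scaled = MinMaxScaler().fit_transform(imputed)
print(scaled.ravel())  # [0. 0.333 0.444 1.]

Chaining both transformers in a Pipeline, as in our script above, produces exactly the same result as running them one after the other.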

5. Impute and Scale (on a copy)

imputed_data = raw_data.copy()
imputed_data[cols_to_impute] = pipeline.fit_transform(raw_data[cols_to_impute])

We create a copy of the raw data, named imputed_data, to avoid modifying the original data. Then we use the pipeline to transform the columns containing missing values (cols_to_impute).
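If you would like to verify the result in code, a quick check could look like this; note that pipeline.named_steps lets you reach the fitted imputer on its own:

# No missing values should remain in the transformed columns ...
print(imputed_data[cols_to_impute].isna().sum())

# ... and each transformed column should now range from 0 to 1
print(imputed_data[cols_to_impute].min())
print(imputed_data[cols_to_impute].max())

# The fitted imputer step can also be called directly to inspect the
# intermediate state (imputed, but not yet scaled):
intermediate = pipeline.named_steps["MeanImputer"].transform(raw_data[cols_to_impute])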

There you have it: your missing values are imputed

[Screenshot: intermediate data with imputed values]

This screenshot has been added for demonstration purposes. Normally, you wouldn't see this intermediate step.

... and then scaled

[Screenshot: final data after scaling]

As you can see, your transformations worked!

  • We have no missing values
  • Our columns are scaled so that the smallest unscaled value is now 0 and the largest unscaled value is now 1. All other values lie between 0 and 1.
Details about fit, transform and fit_transform

Even though I said I wouldn't go into detail about fit and transform in this post, I would still like to add a brief note. To apply our pipeline to our data, we used the fit_transform method, so let me briefly explain what it is all about. fit_transform simply combines the two methods fit and transform for convenience. Internally, the fit method is executed first; it learns the properties of the data that are then used by the transform method. In our case, this was "learning" the minimum and maximum values of our columns, as these are needed by our MinMaxScaler (see the formula above). Other transformations learn other properties; StandardScaler, for example, learns the mean and the standard deviation.

Conclusion

Preprocessing pipelines in scikit-learn provide a structured and efficient way to prepare your data for machine learning. By chaining preprocessing steps together, you ensure consistency and clarity in your data transformations. This first part lays the foundation for understanding pipelines. In the next part of this series, we'll look at more advanced techniques and explore how to use function transformers and build custom transformers to solve more complex data preprocessing challenges. In the third part, we'll explore scikit-learn's ColumnTransformer, a more versatile approach to handling heterogeneous datasets.