Clean & preprocess data (opt. feature engineering)
Design network architecture
Train and validate model
Hyperparameter tuning
Test and evaluate performance
Deploy to production
Monitor and iterate
(We won’t cover the italicised items in this course, though they are very important.)
Overview
Every dataset needs to be transformed in various ways before it can be fed into a machine learning model.
Preprocessing refers to the transformations applied to the raw data before it is fed into the network.
Some steps are specific to neural networks (e.g. standardising continuous variables), others are necessary for every model (e.g. encoding categorical variables, handling missing data).
Note that this isn’t foolproof. If you have ‘postcode’ as a feature, it may be read in as a number when it should be stored as a string and treated as a categorical variable.
Note that we really cannot allow duplicate rows to be shared between the training, validation, or test sets. If there are, we have data leakage between the sets: the model is evaluated on examples it has effectively already seen, so the validation and test scores give an overly optimistic picture of how well it generalises.
Checking for leaks
We can look for rows that appear in more than one of the datasets.
train_val_overlap = pd.merge(X_train_raw, X_val_raw, how='inner')
train_test_overlap = pd.merge(X_train_raw, X_test_raw, how='inner')
val_test_overlap = pd.merge(X_val_raw, X_test_raw, how='inner')
print(f"Number of duplicated rows between train and val: {len(train_val_overlap)}")
print(f"Number of duplicated rows between train and test: {len(train_test_overlap)}")
print(f"Number of duplicated rows between val and test: {len(val_test_overlap)}")
Number of duplicated rows between train and val: 0
Number of duplicated rows between train and test: 0
Number of duplicated rows between val and test: 0
An earlier check is to make sure there are no duplicate rows within the dataset itself, before splitting it.
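For instance, a quick check on the full dataset before splitting (a sketch; here df stands in for whichever raw dataframe you loaded):

num_dups = df.duplicated().sum()   # count rows that are exact copies of an earlier row
print(f"Number of duplicated rows: {num_dups}")
df = df.drop_duplicates()          # optionally drop them before splitting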
That is because the neural network expects continuous inputs to be normalised to help with the training procedure.
Basically, the network's initial starting point is a set of random weights and biases chosen on the assumption of (roughly) standard normal inputs, which keeps the inner neuron values from overflowing or underflowing.
Note
Notice that the output of X_train here looks a bit different to X_train_raw earlier? We’ll touch on this in a couple of slides.
Scikit-learn preprocessing methods
fit: learn the parameters of the transformation
transform: apply the transformation
fit_transform: learn the parameters and apply the transformation
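For example, a minimal sketch of this pattern using StandardScaler (assuming the X_train_raw, X_val_raw and X_test_raw splits from this example):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train_raw)                   # learn each column's mean and standard deviation (training set only)
X_train = scaler.transform(X_train_raw)   # apply the learned transformation
X_val = scaler.transform(X_val_raw)       # reuse the training-set parameters on the validation set
X_test = scaler.transform(X_test_raw)     # ... and on the test set
# scaler.fit_transform(X_train_raw) combines the first two steps.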
Tells sklearn that any transformed data should stay as pandas dataframes.
Defines the SimpleImputer. This transformer deals with missing values; the default strategy is "mean", meaning that missing values in each column will be replaced with that column's mean.
Finds the means of training set columns, and replaces train set missing values by those means.
Replace missing values in the validation set by the means of the training set columns.
Replace missing values in the test set by the means of the training set columns.
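The annotated cell itself isn't reproduced here; a sketch consistent with the annotations above (assuming scikit-learn ≥ 1.2 for set_config):

import sklearn
from sklearn.impute import SimpleImputer

sklearn.set_config(transform_output="pandas")   # transformed data stays as pandas dataframes

imp = SimpleImputer()                           # default strategy="mean"
X_train = imp.fit_transform(X_train_raw)        # learn the training-column means and fill the training NAs
X_val = imp.transform(X_val_raw)                # fill validation NAs using the training means
X_test = imp.transform(X_test_raw)              # fill test NAs using the training means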
From SimpleImputer’s docs: “Replace missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value… default=‘mean’”
X_test

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
4712   3.2500      39.0  4.503205   1.073718      1109.0  1.777244     34.06    -118.36
2151   1.9784      37.0  4.988584   1.038813      1143.0  2.609589     36.78    -119.78
...       ...       ...       ...        ...         ...       ...       ...        ...
6823   4.8750      42.0  5.347985   1.058608       829.0  3.036630     34.09    -118.10
11878  2.7054      52.0  5.741214   1.060703       905.0  2.891374     33.99    -117.38

4128 rows × 8 columns
Example 2: Continuous and Categorical Variables
French motor dataset
# Download the dataset if we don't have it already.
if not Path("french-motor.csv").exists():
    freq = sklearn.datasets.fetch_openml(data_id=41214, as_frame=True).frame
    freq.to_csv("french-motor.csv", index=False)
else:
    freq = pd.read_csv("french-motor.csv")
freq.info()
Checks if the dataset does not already exist within the current directory.
Downloads the dataset via sklearn. The fetch_openml function fetches datasets hosted on the OpenML platform; every dataset there has a unique ID, so it can be fetched by providing that ID. The data_id of the French motor dataset is 41214.
Save it as a CSV file, so we don’t need to redownload it (slow!) next time.
If it already exists, then read the dataset from the CSV file.
The data
freq
            IDpol  ClaimNb  Exposure Area  VehPower  VehAge  DrivAge  BonusMalus VehBrand   VehGas  Density Region
0             1.0      1.0   0.10000    D       5.0     0.0     55.0        50.0      B12  Regular   1217.0    R82
1             3.0      1.0   0.77000    D       5.0     0.0     55.0        50.0      B12  Regular   1217.0    R82
2             5.0      1.0   0.75000    B       6.0     2.0     52.0        50.0      B12   Diesel     54.0    R22
...           ...      ...       ...  ...       ...     ...      ...         ...      ...      ...      ...    ...
678010  6114328.0      0.0   0.00274    D       6.0     2.0     45.0        50.0      B12   Diesel   1323.0    R82
678011  6114329.0      0.0   0.00274    B       4.0     0.0     60.0        50.0      B12  Regular     95.0    R26
678012  6114330.0      0.0   0.00274    B       7.0     6.0     29.0        54.0      B12   Diesel     65.0    R72

678013 rows × 12 columns
Data dictionary
IDpol: policy number (unique identifier)
ClaimNb: number of claims on the given policy
Exposure: total exposure in yearly units
Area: area code (categorical, ordinal)
VehPower: power of the car (categorical, ordinal)
VehAge: age of the car in years
DrivAge: age of the (most common) driver in years
BonusMalus: bonus-malus level between 50 and 230 (with reference level 100)
VehBrand: car brand (categorical, nominal)
VehGas: diesel or regular fuel car (binary)
Density: number of inhabitants per km² in the city where the driver lives
Region: regions in France (prior to 2016)
We have $\{ (\mathbf{x}_i, y_i) \}_{i=1, \dots, n}$ for $\mathbf{x}_i \in \mathbb{R}^{p}$ and $y_i \in \mathbb{N}_0$. Assume the distribution
$$Y_i \sim \mathsf{Poisson}(\lambda(\mathbf{x}_i)).$$
We have $\mathbb{E} Y_i = \lambda(\mathbf{x}_i)$. The NN takes $\mathbf{x}_i$ and predicts $\mathbb{E} Y_i$.
Splits the dataset into train and test sets. Setting random_state to a specific number makes the train-test split reproducible. freq.drop("ClaimNb", axis=1) removes the "ClaimNb" column (the target) from the features.
Resets the index of the train set and drops the previous index column. Since the rows get shuffled during the train-test split, we reset the index so that it starts from 0 again.
data["column_name"].value_counts() gives the count of each category of a categorical variable. In this dataset, Area is assumed to have a natural ordering, whereas VehBrand, VehGas and Region are treated as unordered. The two kinds of categorical variable therefore have to be treated differently.
Nominal categorical variables
These are categorical variables which you cannot order in a ‘default’ sensible way.
from sklearn.preprocessing import OneHotEncoder

gas = X_train_raw[["VehGas"]]
Collinearity is not a big deal for neural networks. However, if you want to reuse the same X_train data for a (generalised) linear regression, then you probably want to dummy-encode here, to avoid needing two different preprocessing regimes for the two competing models.
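For example, the dummy-encoding variant would look something like this (a sketch; the exact cell may differ):

enc = OneHotEncoder(sparse_output=False, drop="first")   # drop="first" turns one-hot into dummy encoding
X_train = enc.fit_transform(gas)   # a single 0/1 column, VehGas_Regular (Diesel is the dropped baseline)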
Ordinal categorical variables
These are categorical variables which have a natural order to them.
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit(X_train_raw[["Area"]])
enc.categories_
OrdinalEncoder assigns a numerical value to each category of the ordinal variable. The nice thing about OrdinalEncoder is that it preserves the information about the ordinal relationships in the data.
Imports the OrdinalEncoder from sklearn.preprocessing library
Create an OrdinalEncoder object named enc
Selects the ordinal column (Area) from X_train_raw and fits the ordinal encoder
Lists the categories the encoder found for each ordinal column (here, the levels A–F of Area)
for i, area in enumerate(enc.categories_[0]):
    print(f"The Area value {area} gets turned into {i}.")
The Area value A gets turned into 0.
The Area value B gets turned into 1.
The Area value C gets turned into 2.
The Area value D gets turned into 3.
The Area value E gets turned into 4.
The Area value F gets turned into 5.
In other words,
print(" < ".join(enc.categories_[0]) +", and ")print(" = ".join(f"{r} - {l}"for l, r inzip(enc.categories_[0][:-1], enc.categories_[0][1:])))
A < B < C < D < E < F, and
B - A = C - B = D - C = E - D = F - E
Ordinal encoded values
Note that fitting an ordinal encoder (enc.fit) only establishes the mapping between numerical values and the levels of the ordinal variable. To actually convert the values in the ordinal columns, we must also apply enc.transform. The following lines of code show how we consistently apply the transform to both the train and test sets; to avoid inconsistencies in encoding, we fit the encoder on the train set only.
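A sketch of that step (the variable names are assumptions):

X_train = enc.transform(X_train_raw[["Area"]])   # uses the mapping learned from the training set
X_test = enc.transform(X_test_raw[["Area"]])     # the same mapping, applied to the test set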
We could train on this X_train, or on the previous X_train, but each only contains one feature from the dataset. How do we add the continuous variables back in? Use a sklearn column transformer for that.
Imports the make_column_transformer function, which can carry out data preparation selectively on chosen columns
Starts defining the column transformer object
Apply dummy encoding to the nominal categorical columns
Apply ordinal encoding to the ordinal categorical column
Standardise the remaining columns (which are all numerical)
Fits and transforms the train set using this new column transformer
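The annotated cell isn't reproduced above; a sketch that is consistent with the annotations and with the X_train output below (the exact call in the lecture may differ slightly):

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

ct = make_column_transformer(
    # dummy encoding (one-hot with the first level dropped) for the nominal columns
    (OneHotEncoder(sparse_output=False, drop="first"), ["VehGas", "VehBrand", "Region"]),
    # ordinal encoding for the ordered column
    (OrdinalEncoder(), ["Area"]),
    # standardise whatever is left (all numerical columns)
    remainder=StandardScaler(),
)
X_train = ct.fit_transform(X_train_raw)
X_test = ct.transform(X_test_raw)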
X_train_raw
        Exposure Area  VehPower  VehAge  DrivAge  BonusMalus VehBrand   VehGas  Density Region
0           0.01    A       4.0     0.0     70.0        50.0      B12  Regular     37.0    R72
...          ...  ...       ...     ...      ...         ...      ...      ...      ...    ...
508508      0.42    D       6.0     1.0     37.0        54.0       B2   Diesel   1284.0    R25

508509 rows × 10 columns
X_train
        onehotencoder__VehGas_Regular  onehotencoder__VehBrand_B10  onehotencoder__VehBrand_B11  onehotencoder__VehBrand_B12  onehotencoder__VehBrand_B13  onehotencoder__VehBrand_B14  onehotencoder__VehBrand_B2  onehotencoder__VehBrand_B3  onehotencoder__VehBrand_B4  onehotencoder__VehBrand_B5  ...  onehotencoder__Region_R91  onehotencoder__Region_R93  onehotencoder__Region_R94  ordinalencoder__Area  remainder__Exposure  remainder__VehPower  remainder__VehAge  remainder__DrivAge  remainder__BonusMalus  remainder__Density
0                                 1.0                          0.0                          0.0                          1.0                          0.0                          0.0                         0.0                         0.0                         0.0                         0.0  ...                        0.0                        0.0                        0.0                   0.0            -1.424715            -1.196660          -1.245210            1.732277              -0.623438           -0.442987
...                               ...                          ...                          ...                          ...                          ...                          ...                         ...                         ...                         ...                         ...  ...                        ...                        ...                        ...                   ...                  ...                  ...                ...                 ...                    ...                 ...
508508                            0.0                          0.0                          0.0                          0.0                          0.0                          0.0                         1.0                         0.0                         0.0                         0.0  ...                        0.0                        0.0                        0.0                   3.0            -0.299951            -0.221191          -1.068474           -0.602450              -0.367372           -0.127782

508509 rows × 39 columns
The default column names that the column transformer gives X_train are a bit obscure/verbose. We can pass the verbose_feature_names_out=False flag to get a cleaner X_train dataframe.
An important thing to notice here is that the order of the columns has changed: they are rearranged according to the order in which we specify the transformations inside the column transformer.
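To get the cleaner names, the column transformer can be rebuilt with the flag set (again only a sketch, not the exact lecture cell):

ct = make_column_transformer(
    (OneHotEncoder(sparse_output=False, drop="first"), ["VehGas", "VehBrand", "Region"]),
    (OrdinalEncoder(), ["Area"]),
    remainder=StandardScaler(),
    verbose_feature_names_out=False,  # gives "VehGas_Regular" rather than "onehotencoder__VehGas_Regular"
)
X_train = ct.fit_transform(X_train_raw)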
for i, area in enumerate(enc_wrong.categories_[0]):
    print(f"The Area value {area} gets turned into {i}.")
The Area value Extremely high gets turned into 0.
The Area value High gets turned into 1.
The Area value Low gets turned into 2.
The Area value Medium gets turned into 3.
The Area value Very high gets turned into 4.
The Area value Very low gets turned into 5.
In other words,
print(" < ".join(enc_wrong.categories_[0]))
Extremely high < High < Low < Medium < Very high < Very low
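Presumably enc_right was created by passing the desired order explicitly, something like the following sketch (area_df is a hypothetical dataframe whose Area column holds these text labels):

area_order = ["Very low", "Low", "Medium", "High", "Very high", "Extremely high"]
enc_right = OrdinalEncoder(categories=[area_order])   # specify the order instead of the alphabetical default
enc_right.fit(area_df[["Area"]])                      # area_df is hypothetical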
for i, area in enumerate(enc_right.categories_[0]):
    print(f"The Area value {area} gets turned into {i}.")
The Area value Very low gets turned into 0.
The Area value Low gets turned into 1.
The Area value Medium gets turned into 2.
The Area value High gets turned into 3.
The Area value Very high gets turned into 4.
The Area value Extremely high gets turned into 5.
In other words,
print(" < ".join(enc_right.categories_[0]))
Very low < Low < Medium < High < Very high < Extremely high
Always take a look at the missing data in, say, Excel. Sometimes pandas reads in a category like “None” and interprets that as a missing value, when it’s really a valid category. This happened to me just the other day.
Look for sparse categories. One common trick is to lump together all the small categories (e.g. < 5% or 10% each) into a new combined ‘OTHER’ or ‘RARE’ category.
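For example, a sketch of the lumping trick using the VehBrand column from the earlier example (the 5% threshold is illustrative):

counts = X_train_raw["VehBrand"].value_counts(normalize=True)
rare = counts[counts < 0.05].index                 # brands making up under 5% of the training data
X_train_raw["VehBrand"] = X_train_raw["VehBrand"].where(
    ~X_train_raw["VehBrand"].isin(rare), "OTHER")  # replace rare brands with a combined "OTHER" level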
Make a plan for preprocessing each column
Take categorical columns $\hookrightarrow$ one-hot vectors
binary columns $\hookrightarrow$ do nothing (already 0 & 1)
remaining continuous columns $\hookrightarrow$ impute missing values, then standardise (as in the code below)
from sklearn.pipeline import make_pipeline

nom_vars = ["gender", "ever_married", "Residence_type",
            "work_type", "smoking_status"]

ct = make_column_transformer(
    (OneHotEncoder(sparse_output=False, handle_unknown="ignore"), nom_vars),
    ("passthrough", ["hypertension", "heart_disease"]),
    remainder=make_pipeline(SimpleImputer(), StandardScaler()),
    verbose_feature_names_out=False)

X_train = ct.fit_transform(X_train_raw)
X_val = ct.transform(X_val_raw)
X_test = ct.transform(X_test_raw)

for name, X in zip(("train", "val", "test"), (X_train, X_val, X_test)):
    num_na = pd.DataFrame(X).isna().sum().sum()
    print(f"The {name} set has shape {X.shape} & with {num_na} NAs.")
The train set has shape (3066, 20) & with 0 NAs.
The val set has shape (1022, 20) & with 0 NAs.
The test set has shape (1022, 20) & with 0 NAs.
Imports make_pipeline from sklearn.pipeline. make_pipeline chains preprocessing steps into a single transformer; in the example above it is used to first treat missing values and then scale the numerical values
Stores categorical variables in nom_vars
Specifies the one-hot encoding for all the nominal categorical variables. We set sparse_output=False to return a dense array rather than a sparse matrix. handle_unknown specifies how the encoder should treat categories it did not see during fitting. By setting handle_unknown="ignore", we tell the encoder to encode unseen categories as all zeros rather than raising an error; without this, an unseen category would interrupt the model's operation after deployment
Passes through hypertension and heart_disease without any preprocessing
Makes a pipeline that first applies SimpleImputer() to replace missing values with the mean and then applies StandardScaler() to scale the numerical values
Prints out the shape of each set and its number of missing values, to confirm that SimpleImputer() has worked
Handling unseen categories
X_train_raw["gender"].value_counts()
gender
Female 1802
Male 1264
Name: count, dtype: int64
X_val_raw["gender"].value_counts()
gender
Female 615
Male 406
Other 1
Name: count, dtype: int64
Because of the way the data was split, the one-hot encoder never saw the third category ("Other") when it was fitted. This could break the model's operation at prediction time. To avoid such problems, we could give instructions manually on how to handle unseen categories. An example is given below.
ind = np.argmax(X_val_raw["gender"] == "Other")
X_val_raw.iloc[ind-1:ind+3][["gender"]]
However, to give such instructions for handling unseen categories, we would first have to know what the possible categories could be, and decide what value to assign them if they come up at prediction time. One easy way to handle this is to use handle_unknown="ignore" during encoding, as mentioned before. Alternatively, we can list the possible categories explicitly when creating the encoder, as sketched below.
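A sketch of that manual approach (listing the levels ourselves rather than letting the encoder learn them from the training data):

enc = OneHotEncoder(
    categories=[["Female", "Male", "Other"]],   # every level "gender" can take
    sparse_output=False)
enc.fit(X_train_raw[["gender"]])   # "Other" still gets a column, even though it never appears in training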
Handling unseen categories II
This was caused by fitting the encoder on the X_train which, by bad luck, didn’t have the “Other” category. Why not just read the entire dataset before splitting to collect all the values the categorical variables could take?
It’s a very mild form of data leakage, though it may be justifiable in some cases. If we made the datasets, we’d have a better idea of whether it makes sense on a case-by-case basis. We struggle here as we (in this course) are just analysing datasets which we didn’t collect.