Categorical Variables

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Author

Patrick Laub

Package imports

import random
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

Preprocessing

Preprocessing data is essential in creating a successful neural network. Proper preprocessing ensures the data is in a format conducive to learning.

Keras model methods

compile: specify the loss function and optimiser
fit: learn the parameters of the model
predict: apply the model
evaluate: apply the model and calculate a metric

random.seed(12)
model = Sequential()
model.add(Dense(1, activation="relu"))
model.compile("adam", "poisson")
model.fit(X_train, y_train, verbose=0)
y_pred = model.predict(X_val, verbose=0)
print(model.evaluate(X_val, y_val, verbose=0))

4.944334506988525

Scikit-learn model methods

fit: learn the parameters of the model
predict: apply the model
score: apply the model and calculate a metric

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print(model.score(X_val, y_val))

-0.6668505979514436
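For a LinearRegression, the metric returned by score is the coefficient of determination R², which is negative when the model predicts worse than a constant equal to the mean of the validation targets. The following is a minimal sketch on synthetic data (the arrays are made up for illustration and are not the X_train and X_val used above) that reproduces score by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data, not the dataset used in the lecture.
rng = np.random.default_rng(0)
coefs = np.array([1.0, -2.0, 0.5, 0.0])
X_train = rng.normal(size=(50, 4))
y_train = X_train @ coefs + rng.normal(size=50)
X_val = rng.normal(size=(20, 4))
y_val = X_val @ coefs + rng.normal(size=20)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_val - y_pred) ** 2)
ss_tot = np.sum((y_val - y_val.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_val, y_val))
```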

Scikit-learn preprocessing methods

fit: learn the parameters of the transformation
transform: apply the transformation
fit_transform: learn the parameters and apply the transformation

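The two versions below are equivalent: the first calls fit and then transform on the train set, while the second combines them into a single fit_transform call.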

scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

print(X_train_sc.mean(axis=0))
print(X_train_sc.std(axis=0))
print(X_val_sc.mean(axis=0))
print(X_val_sc.std(axis=0))

[-5.95e-18  7.93e-18  3.77e-17 -7.14e-17]
[1. 1. 1. 1.]
[-0.34  0.07 -0.27 -0.82]
[1.01 0.66 1.26 0.89]

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

print(X_train_sc.mean(axis=0))
print(X_train_sc.std(axis=0))
print(X_val_sc.mean(axis=0))
print(X_val_sc.std(axis=0))

[-5.95e-18  7.93e-18  3.77e-17 -7.14e-17]
[1. 1. 1. 1.]
[-0.34  0.07 -0.27 -0.82]
[1.01 0.66 1.26 0.89]

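It is important to fit the scaler using only the data from the train set, so that no information from the validation or test sets leaks into the preprocessing. This is why the validation set's means and standard deviations above are close to, but not exactly, 0 and 1.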
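To make this concrete, here is a small sketch on synthetic data (the arrays are hypothetical, not the lecture's splits). It shows that the fitted scaler stores the training means and standard deviations, and that transform reuses them on the validation set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic splits.
rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

scaler = StandardScaler()
scaler.fit(X_train)

# The fitted parameters are the training means and standard deviations.
print(scaler.mean_)
print(scaler.scale_)

# transform() applies those training statistics to any other split.
X_val_sc = scaler.transform(X_val)
X_val_manual = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_val_sc, X_val_manual))
```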

Summary of the splitting

Dataframes & arrays

X_test.head(3)

	x1	x2	x3	x4
83	0.075805	-0.677162	0.975120	-0.147057
53	0.954002	0.651391	-0.315269	0.758969
70	0.113517	0.662131	1.586017	-1.237815

X_test_sc

array([[ 0.13, -0.64,  0.89, -0.4 ],
       [ 1.15,  0.67, -0.44,  0.62],
       [ 0.18,  0.68,  1.52, -1.62],
       [ 0.77, -0.82, -1.22,  0.31],
       [ 0.06,  1.46, -0.39,  2.83],
       [ 2.21,  0.49, -1.34,  0.51],
       [-0.57,  0.53, -0.02,  0.86],
       [ 0.16,  0.61, -0.96,  2.12],
       [ 0.9 ,  0.2 , -0.23, -0.57],
       [ 0.62, -0.11,  0.55,  1.48],
       [ 0.  ,  1.57, -2.81,  0.69],
       [ 0.96, -0.87,  1.33, -1.81],
       [-0.64,  0.87,  0.25, -1.01],
       [-1.19,  0.49, -1.06,  1.51],
       [ 0.65,  1.54, -0.23,  0.22],
       [-1.13,  0.34, -1.05, -1.82],
       [ 0.02,  0.14,  1.2 , -0.9 ],
       [ 0.68, -0.17, -0.34,  1.  ],
       [ 0.44, -1.72,  0.22, -0.66],
       [ 0.73,  2.19, -1.13, -0.87],
       [ 2.73, -1.82,  0.59, -2.04],
       [ 1.04, -0.13, -0.13, -1.36],
       [-0.14,  0.43,  1.82, -0.04],
       [-0.24, -0.72, -1.03, -1.15],
       [ 0.28, -0.57, -0.04, -0.66]])

Note

By default, a scikit-learn transformer returns a NumPy array even when you pass it a DataFrame.

Keep as a DataFrame

From scikit-learn 1.2:

from sklearn import set_config
set_config(transform_output="pandas")

imp = SimpleImputer()
imp.fit(X_train)
X_train_imp = imp.fit_transform(X_train)
X_val_imp = imp.transform(X_val)
X_test_imp = imp.transform(X_test)

Imports the set_config function from sklearn.
Sets the configuration so that transformers return pandas DataFrames instead of NumPy arrays.
Defines the SimpleImputer, which fills in missing values. The default strategy is the mean, meaning that missing values in each column are replaced with that column's mean.
Fits the imputer on the train set (imputation comes before scaling). This call is redundant here, because fit_transform on the next line fits the imputer again.
Fits and transforms the train set.
Transforms the validation set.
Transforms the test set.
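As a small illustration of mean imputation, the sketch below uses hypothetical toy data with a few missing values. The fill values are learned from the train set only and then reused on the test set:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values.
X_train = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "x2": [10.0, 20.0, np.nan]})
X_test = pd.DataFrame({"x1": [np.nan, 4.0], "x2": [np.nan, 40.0]})

imp = SimpleImputer()  # strategy="mean" is the default
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)

# The learned fill values are the training column means.
print(imp.statistics_)

# Missing test values are filled with the training means, not the test means.
print(np.asarray(X_test_imp))
```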

X_test_imp

	x1	x2	x3	x4
83	0.075805	-0.677162	0.975120	-0.147057
53	0.954002	0.651391	-0.315269	0.758969
...	...	...	...	...
42	-0.245388	-0.753736	-0.889514	-0.815810
69	0.199060	-0.600217	0.069802	-0.385314

25 rows × 4 columns

French Motor Claims Dataset

French motor dataset

Download the dataset if we don’t have it already.

from pathlib import Path
from sklearn.datasets import fetch_openml

if not Path("french-motor.csv").exists():
    freq = fetch_openml(data_id=41214, as_frame=True).frame
    freq.to_csv("french-motor.csv", index=False)
else:
    freq = pd.read_csv("french-motor.csv")

freq

	IDpol	ClaimNb	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.0	1.0	0.10000	D	5.0	0.0	55.0	50.0	B12	Regular	1217.0	R82
1	3.0	1.0	0.77000	D	5.0	0.0	55.0	50.0	B12	Regular	1217.0	R82
2	5.0	1.0	0.75000	B	6.0	2.0	52.0	50.0	B12	Diesel	54.0	R22
...	...	...	...	...	...	...	...	...	...	...	...	...
678010	6114328.0	0.0	0.00274	D	6.0	2.0	45.0	50.0	B12	Diesel	1323.0	R82
678011	6114329.0	0.0	0.00274	B	4.0	0.0	60.0	50.0	B12	Regular	95.0	R26
678012	6114330.0	0.0	0.00274	B	7.0	6.0	29.0	54.0	B12	Diesel	65.0	R72

678013 rows × 12 columns

Imports the Path class from pathlib.
Imports the fetch_openml function from the sklearn.datasets module. fetch_openml downloads datasets hosted on the OpenML platform. Every dataset has a unique ID and can be fetched by providing it; the data_id of the French motor dataset is 41214.
Checks whether the dataset is missing from the current working directory.
Fetches the dataset from OpenML.
Saves the dataset as a .csv file.
If the file already exists, reads the dataset from the .csv file instead.

Data dictionary

IDpol: policy number (unique identifier)
ClaimNb: number of claims on the given policy
Exposure: total exposure in yearly units
Area: area code (categorical, ordinal)
VehPower: power of the car (categorical, ordinal)
VehAge: age of the car in years
DrivAge: age of the (most common) driver in years

BonusMalus: bonus-malus level between 50 and 230 (with reference level 100)
VehBrand: car brand (categorical, nominal)
VehGas: diesel or regular fuel car (binary)
Density: density of inhabitants per km² in the city of the living place of the driver
Region: regions in France (prior to 2016)

The model

We have data \{ (\mathbf{x}_i, y_i) \}_{i=1, \dots, n} for \mathbf{x}_i \in \mathbb{R}^{47} and y_i \in \mathbb{N}_0.

Assume the distribution Y_i \sim \mathsf{Poisson}(\lambda(\mathbf{x}_i)).

We have \mathbb{E} Y_i = \lambda(\mathbf{x}_i). The NN takes \mathbf{x}_i and predicts \mathbb{E} Y_i.
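The Poisson assumption is what motivates the "poisson" loss. The negative log-likelihood of one observation is \lambda - y \log \lambda + \log(y!), and the last term does not depend on the model, so it can be dropped. The sketch below is a NumPy illustration under that reading: poisson_loss and poisson_nll are hypothetical helper names, and Keras's own implementation additionally adds a small epsilon inside the logarithm.

```python
import math
import numpy as np

def poisson_loss(y_true, y_pred):
    """Mean of y_pred - y_true * log(y_pred): the Poisson NLL without log(y!)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(y_pred - y_true * np.log(y_pred))

def poisson_nll(y_true, y_pred):
    """Full mean negative log-likelihood, including the log(y!) term."""
    log_factorials = np.array([math.lgamma(y + 1) for y in y_true])
    return poisson_loss(y_true, y_pred) + np.mean(log_factorials)

# Hypothetical claim counts and two candidate sets of predicted means.
y = np.array([0, 1, 0, 2])
lam_good = np.array([0.2, 0.9, 0.1, 1.8])
lam_bad = np.array([2.0, 0.1, 3.0, 0.2])

# Predictions closer to the observed counts give a lower loss.
print(poisson_loss(y, lam_good), poisson_loss(y, lam_bad))

# The dropped term shifts both losses by the same constant.
print(poisson_nll(y, lam_good) - poisson_loss(y, lam_good))
print(poisson_nll(y, lam_bad) - poisson_loss(y, lam_bad))
```

Because the two losses differ only by a constant, minimising either one gives the same fitted \lambda.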

Ordinal Variables

Subsample and split

freq = freq.drop("IDpol", axis=1).head(25_000)

X_train, X_test, y_train, y_test = train_test_split(
  freq.drop("ClaimNb", axis=1), freq["ClaimNb"], random_state=2023)

# Reset each index to start at 0 again.
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

Drops the "IDpol" column and keeps only the first 25,000 rows of the dataset.
Splits the dataset into train and test sets. Setting random_state to a specific number makes the train-test split reproducible. freq.drop("ClaimNb", axis=1) removes the "ClaimNb" column to give the features, and freq["ClaimNb"] is the target.
Resets the index of the train and test sets and discards the previous index. Since the rows are shuffled during the train-test split, we reset the index so that it starts from 0 again.
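The effect of the split and the index reset can be seen on a hypothetical toy frame (eight made-up rows standing in for the claims data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the claims data.
toy = pd.DataFrame({
    "ClaimNb": [0, 1, 0, 0, 2, 0, 1, 0],
    "VehAge": [1, 5, 3, 8, 2, 7, 4, 6],
})

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop("ClaimNb", axis=1), toy["ClaimNb"], random_state=2023)

# By default, 25% of the rows go to the test set.
print(len(X_train), len(X_test))

# The shuffled split keeps the original row labels...
shuffled_labels = X_train.index.tolist()
print(shuffled_labels)

# ...until reset_index(drop=True) relabels the rows 0, 1, 2, ...
X_train = X_train.reset_index(drop=True)
print(X_train.index.tolist())
```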

What values do we see in the data?

X_train["Area"].value_counts()
X_train["VehBrand"].value_counts()
X_train["VehGas"].value_counts()
X_train["Region"].value_counts()

Area
C    5507
D    4113
A    3527
E    2769
B    2359
F     475
Name: count, dtype: int64

VehBrand
B1     5069
B2     4838
B12    3708
       ... 
B13     336
B11     284
B14     136
Name: count, Length: 11, dtype: int64

VehGas
Regular    10773
Diesel      7977
Name: count, dtype: int64

Region
R24    6498
R82    2119
R11    1909
       ... 
R21      90
R42      55
R43      26
Name: count, Length: 22, dtype: int64

The value_counts() method, called as data["column_name"].value_counts(), gives the count of each category of a categorical variable. In this dataset, Area is assumed to have a natural ordering and VehGas is binary, so each can be represented by a single integer column. VehBrand and Region are nominal: they are not considered to have a natural ordering. Therefore, the two sets of categorical variables have to be treated differently.

Ordinal & binary categories are easy

from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(X_train[["Area", "VehGas"]])
oe.categories_

[array(['A', 'B', 'C', 'D', 'E', 'F'], dtype=object),
 array(['Diesel', 'Regular'], dtype=object)]

OrdinalEncoder assigns a numerical value to each category of an ordinal variable. The nice thing about OrdinalEncoder is that it can preserve the information about ordinal relationships in the data. By default the categories are sorted, which gives the intended order here (A to F). Furthermore, this encoding uses less memory than encodings that create one column per category, since each variable stays as a single column.

Imports the OrdinalEncoder from the sklearn.preprocessing module.
Defines the OrdinalEncoder object as oe.
Selects the two columns to encode from X_train and fits the ordinal encoder.
Returns the categories found in each column, in the order in which they will be encoded.
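The sorted default order suits Area, but it will not match the natural order of every ordinal variable. As a sketch with a hypothetical Size column (not part of the French motor data), the intended order can be passed explicitly through the categories argument:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column whose alphabetical order differs from its natural order.
df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Default: categories are sorted, so "large" would be encoded as 0.
default_oe = OrdinalEncoder().fit(df)
print(default_oe.categories_)

# Explicit order: small < medium < large.
ordered_oe = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit(df)
codes = np.asarray(ordered_oe.transform(df))
print(codes.ravel())
```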

for i, area in enumerate(oe.categories_[0]):
    print(f"The Area value {area} gets turned into {i}.")

The Area value A gets turned into 0.
The Area value B gets turned into 1.
The Area value C gets turned into 2.
The Area value D gets turned into 3.
The Area value E gets turned into 4.
The Area value F gets turned into 5.

for i, gas in enumerate(oe.categories_[1]):
    print(f"The VehGas value {gas} gets turned into {i}.")

The VehGas value Diesel gets turned into 0.
The VehGas value Regular gets turned into 1.

Ordinal encoded values

Note that fitting an ordinal encoder (oe.fit) only establishes the mapping between category levels and numerical values. To actually convert the values in the columns, we must also call oe.transform. The following lines of code show how we apply the same transform to both the train and test sets. To avoid inconsistencies in the encoding, we call oe.fit on the train set only.

X_train_ord = oe.transform(X_train[["Area", "VehGas"]])
X_test_ord = oe.transform(X_test[["Area", "VehGas"]])

X_train[["Area", "VehGas"]].head()

	Area	VehGas
0	C	Diesel
1	C	Regular
2	E	Regular
3	D	Diesel
4	A	Regular

X_train_ord.head()

	Area	VehGas
0	2.0	0.0
1	2.0	1.0
2	4.0	1.0
3	3.0	0.0
4	0.0	1.0
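Because the mapping is fixed when the encoder is fitted, a category that appears only in the test set makes transform raise an error under the default handle_unknown="error". The sketch below uses hypothetical toy data and shows one possible way to handle this, mapping unseen categories to -1 with handle_unknown="use_encoded_value":

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "G" never appears in the train set.
train = pd.DataFrame({"Area": ["A", "B", "C"]})
test = pd.DataFrame({"Area": ["B", "G"]})

strict = OrdinalEncoder().fit(train)
raised = False
try:
    strict.transform(test)
except ValueError as err:
    raised = True
    print("Default encoder failed:", err)

# Unseen categories are mapped to the chosen placeholder value.
lenient = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
lenient.fit(train)
test_codes = np.asarray(lenient.transform(test))
print(test_codes.ravel())
```

Whether -1 is a sensible code for an unseen level depends on the model, so this is an option to consider rather than a recommendation.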

Train on ordinal encoded values

If we would like to see whether we can train a neural network only on the ordinal variables, we can try the following code.

random.seed(12)
model = Sequential([
  Dense(1, activation="exponential")
])

model.compile(optimizer="adam", loss="poisson")

es = EarlyStopping(verbose=True)
hist = model.fit(X_train_ord, y_train, epochs=100, verbose=0,
    validation_split=0.2, callbacks=[es])
hist.history["val_loss"][-1]

Epoch 22: early stopping

0.7821308374404907

Sets the random seed for reproducibility.
Constructs a neural network with one Dense layer containing one neuron and an exponential activation function.
Compiles the model by specifying the optimizer and loss function.
Defines the early stopping object. By default it monitors the validation loss, so it only works if we have a validation set: without one there is no validation loss to monitor.
Fits the model using only the encoded columns as input data. The argument validation_split=0.2 tells Keras to hold out the last 20% of the input data as the validation set. This is an alternative way of defining the validation set.
Returns the validation loss at the final epoch of training.
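To picture what validation_split=0.2 does, the sketch below performs the equivalent slicing by hand in NumPy. It assumes, as described above, that the validation rows are the final 20% of the data in their given order; the arrays are hypothetical.

```python
import numpy as np

# Hypothetical inputs and targets with 10 rows.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

validation_split = 0.2
split_at = int(len(X) * (1 - validation_split))

# The first 80% of rows are used for fitting, the last 20% for validation.
X_fit, X_val = X[:split_at], X[split_at:]
y_fit, y_val = y[:split_at], y[split_at:]

print(len(X_fit), len(X_val))
print(y_val)
```

One consequence is that if the rows are sorted in some meaningful way, the validation set will not be a random sample, so shuffling beforehand (as train_test_split does) matters.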

What about adding the continuous variables back in? Use a sklearn column transformer for that.

Preprocess ordinal & continuous

from sklearn.compose import make_column_transformer

ct = make_column_transformer(
  (OrdinalEncoder(), ["Area", "VehGas"]),
  ("drop", ["VehBrand", "Region"]),
  remainder=StandardScaler()
)

X_train_ct = ct.fit_transform(X_train)

Imports the make_column_transformer function, which applies different preprocessing steps to different columns.
Starts defining the column transformer object.
Selects the ordinal columns and applies ordinal encoding.
Drops the nominal columns.
Applies the StandardScaler transformation to the remaining numerical columns.
Fits and transforms the train set using the defined column transformer object.

X_train.head(3)

	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.00	C	6.0	2.0	66.0	50.0	B2	Diesel	124.0	R24
1	0.36	C	4.0	10.0	22.0	100.0	B1	Regular	377.0	R93
2	0.02	E	12.0	8.0	44.0	60.0	B3	Regular	5628.0	R11

X_train_ct.head(3)

	ordinalencoder__Area	ordinalencoder__VehGas	remainder__Exposure	remainder__VehPower	remainder__VehAge	remainder__DrivAge	remainder__BonusMalus	remainder__Density
0	2.0	0.0	1.126979	-0.165005	-0.844589	1.451036	-0.637179	-0.366980
1	2.0	1.0	-0.590896	-1.228181	0.586255	-1.548692	2.303010	-0.302700
2	4.0	1.0	-1.503517	3.024524	0.228544	-0.048828	-0.049141	1.031432

X_train_ct.head(3) returns a dataset whose column names are prefixed with the name of the transformer that produced them (for example, ordinalencoder__Area). To avoid that, we can pass the verbose_feature_names_out=False argument. The following code shows how this argument results in a cleaner-looking X_train_ct dataset.

Preprocess ordinal & continuous II

from sklearn.compose import make_column_transformer

ct = make_column_transformer(
  (OrdinalEncoder(), ["Area", "VehGas"]),
  ("drop", ["VehBrand", "Region"]),
  remainder=StandardScaler(),
  verbose_feature_names_out=False
)
X_train_ct = ct.fit_transform(X_train)

X_train.head(3)

	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.00	C	6.0	2.0	66.0	50.0	B2	Diesel	124.0	R24
1	0.36	C	4.0	10.0	22.0	100.0	B1	Regular	377.0	R93
2	0.02	E	12.0	8.0	44.0	60.0	B3	Regular	5628.0	R11

X_train_ct.head(3)

	Area	VehGas	Exposure	VehPower	VehAge	DrivAge	BonusMalus	Density
0	2.0	0.0	1.126979	-0.165005	-0.844589	1.451036	-0.637179	-0.366980
1	2.0	1.0	-0.590896	-1.228181	0.586255	-1.548692	2.303010	-0.302700
2	4.0	1.0	-1.503517	3.024524	0.228544	-0.048828	-0.049141	1.031432

An important thing to notice here is that the order of the columns has changed. They are arranged according to the order in which we specify the transformations inside the column transformer, with the remainder columns last.

Glossary

column transformer
nominal variables
ordinal variables