Categorical Variables

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Patrick Laub

Preprocessing

Lecture Outline

Preprocessing
French Motor Claims Dataset
Ordinal Variables

Keras model methods

compile: specify the loss function and optimiser
fit: learn the parameters of the model
predict: apply the model
evaluate: apply the model and calculate a metric

random.seed(12)
model = Sequential()
model.add(Dense(1, activation="relu"))
model.compile("adam", "poisson")
model.fit(X_train, y_train, verbose=0)
y_pred = model.predict(X_val, verbose=0)
print(model.evaluate(X_val, y_val, verbose=0))

4.944334506988525

Scikit-learn model methods

fit: learn the parameters of the model
predict: apply the model
score: apply the model and calculate a metric

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print(model.score(X_val, y_val))

-0.6668505979514436
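
For scikit-learn regressors such as LinearRegression, score returns the coefficient of determination R², which is negative whenever the model predicts worse than the constant mean of the validation targets. The sketch below recomputes R² by hand and compares it with score. It uses synthetic data and a made-up coefficient vector, not the lecture's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5, 0.0])
X_train = rng.normal(size=(80, 4))
y_train = X_train @ beta + rng.normal(size=80)
X_val = rng.normal(size=(20, 4))
y_val = X_val @ beta + rng.normal(size=20)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

# R^2 = 1 - SS_res / SS_tot, computed on the validation set.
ss_res = np.sum((y_val - y_pred) ** 2)
ss_tot = np.sum((y_val - y_val.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_val, y_val))
```

The two printed numbers should agree up to floating-point error.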

Scikit-learn preprocessing methods

fit: learn the parameters of the transformation
transform: apply the transformation
fit_transform: learn the parameters and apply the transformation

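The two versions below are equivalent: the first calls fit and then transform on the training set, the second uses fit_transform.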

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

print(X_train_sc.mean(axis=0))
print(X_train_sc.std(axis=0))
print(X_val_sc.mean(axis=0))
print(X_val_sc.std(axis=0))

[-5.95e-18  7.93e-18  3.77e-17 -7.14e-17]
[1. 1. 1. 1.]
[-0.34  0.07 -0.27 -0.82]
[1.01 0.66 1.26 0.89]

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

print(X_train_sc.mean(axis=0))
print(X_train_sc.std(axis=0))
print(X_val_sc.mean(axis=0))
print(X_val_sc.std(axis=0))

[-5.95e-18  7.93e-18  3.77e-17 -7.14e-17]
[1. 1. 1. 1.]
[-0.34  0.07 -0.27 -0.82]
[1.01 0.66 1.26 0.89]
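
The validation means and standard deviations above are not exactly 0 and 1 because the scaler only ever learns from the training set. A minimal sketch of what fit stores and what transform computes, using synthetic data as a stand-in for the lecture's X_train and X_val:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and validation sets.
rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(75, 4))
X_val = rng.normal(loc=5.0, scale=2.0, size=(25, 4))

scaler = StandardScaler().fit(X_train)

# fit stores the training-set column means and standard deviations.
print(scaler.mean_)
print(scaler.scale_)

# transform subtracts those means and divides by those standard deviations.
manual_val = (X_val - scaler.mean_) / scaler.scale_
X_val_sc = scaler.transform(X_val)
print(np.allclose(manual_val, X_val_sc))
```

Refitting the scaler on the validation or test set would leak information about those sets into the preprocessing, so only transform is called on them.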

Summary of the splitting

Dataframes & arrays

X_test.head(3)

	x1	x2	x3	x4
83	0.075805	-0.677162	0.975120	-0.147057
53	0.954002	0.651391	-0.315269	0.758969
70	0.113517	0.662131	1.586017	-1.237815

X_test_sc

array([[ 0.13, -0.64,  0.89, -0.4 ],
       [ 1.15,  0.67, -0.44,  0.62],
       [ 0.18,  0.68,  1.52, -1.62],
       [ 0.77, -0.82, -1.22,  0.31],
       [ 0.06,  1.46, -0.39,  2.83],
       [ 2.21,  0.49, -1.34,  0.51],
       [-0.57,  0.53, -0.02,  0.86],
       [ 0.16,  0.61, -0.96,  2.12],
       [ 0.9 ,  0.2 , -0.23, -0.57],
       [ 0.62, -0.11,  0.55,  1.48],
       [ 0.  ,  1.57, -2.81,  0.69],
       [ 0.96, -0.87,  1.33, -1.81],
       [-0.64,  0.87,  0.25, -1.01],
       [-1.19,  0.49, -1.06,  1.51],
       [ 0.65,  1.54, -0.23,  0.22],
       [-1.13,  0.34, -1.05, -1.82],
       [ 0.02,  0.14,  1.2 , -0.9 ],
       [ 0.68, -0.17, -0.34,  1.  ],
       [ 0.44, -1.72,  0.22, -0.66],
       [ 0.73,  2.19, -1.13, -0.87],
       [ 2.73, -1.82,  0.59, -2.04],
       [ 1.04, -0.13, -0.13, -1.36],
       [-0.14,  0.43,  1.82, -0.04],
       [-0.24, -0.72, -1.03, -1.15],
       [ 0.28, -0.57, -0.04, -0.66]])

Note

By default, when you pass a DataFrame to a scikit-learn transformer, it returns a NumPy array.

Keep as a DataFrame

From scikit-learn 1.2:

from sklearn import set_config
set_config(transform_output="pandas")

from sklearn.impute import SimpleImputer

imp = SimpleImputer()
X_train_imp = imp.fit_transform(X_train)
X_val_imp = imp.transform(X_val)
X_test_imp = imp.transform(X_test)

X_test_imp

	x1	x2	x3	x4
83	0.075805	-0.677162	0.975120	-0.147057
53	0.954002	0.651391	-0.315269	0.758969
...	...	...	...	...
42	-0.245388	-0.753736	-0.889514	-0.815810
69	0.199060	-0.600217	0.069802	-0.385314

25 rows × 4 columns

French Motor Claims Dataset

French motor dataset

Download the dataset if we don’t have it already.

from pathlib import Path
import pandas as pd
from sklearn.datasets import fetch_openml

if not Path("french-motor.csv").exists():
    freq = fetch_openml(data_id=41214, as_frame=True).frame
    freq.to_csv("french-motor.csv", index=False)
else:
    freq = pd.read_csv("french-motor.csv")

freq

	IDpol	ClaimNb	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.0	1.0	0.10000	D	5.0	0.0	55.0	50.0	B12	Regular	1217.0	R82
1	3.0	1.0	0.77000	D	5.0	0.0	55.0	50.0	B12	Regular	1217.0	R82
2	5.0	1.0	0.75000	B	6.0	2.0	52.0	50.0	B12	Diesel	54.0	R22
...	...	...	...	...	...	...	...	...	...	...	...	...
678010	6114328.0	0.0	0.00274	D	6.0	2.0	45.0	50.0	B12	Diesel	1323.0	R82
678011	6114329.0	0.0	0.00274	B	4.0	0.0	60.0	50.0	B12	Regular	95.0	R26
678012	6114330.0	0.0	0.00274	B	7.0	6.0	29.0	54.0	B12	Diesel	65.0	R72

678013 rows × 12 columns

Data dictionary

IDpol: policy number (unique identifier)
ClaimNb: number of claims on the given policy
Exposure: total exposure in yearly units
Area: area code (categorical, ordinal)
VehPower: power of the car (categorical, ordinal)
VehAge: age of the car in years
DrivAge: age of the (most common) driver in years

BonusMalus: bonus-malus level between 50 and 230 (with reference level 100)
VehBrand: car brand (categorical, nominal)
VehGas: diesel or regular fuel car (binary)
Density: density of inhabitants per km² in the city of the living place of the driver
Region: regions in France (prior to 2016)

The model

We have $\{ (\mathbf{x}_i, y_i) \}_{i=1, \dots, n}$ for $\mathbf{x}_i \in \mathbb{R}^{47}$ and $y_i \in \mathbb{N}_0$.

Assume the distribution $Y_i \sim \mathsf{Poisson}(\lambda(\mathbf{x}_i))$.

Then $\mathbb{E} Y_i = \lambda(\mathbf{x}_i)$. The neural network takes $\mathbf{x}_i$ as input and predicts $\mathbb{E} Y_i$.
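
The Poisson assumption is what motivates a Poisson loss. As a sketch (the symbol $\ell_i$ is introduced here for illustration), the negative log-likelihood of one observation is:

```latex
\ell_i
  = -\log \mathbb{P}(Y_i = y_i)
  = -\log\!\left( \frac{\lambda(\mathbf{x}_i)^{y_i} \, \mathrm{e}^{-\lambda(\mathbf{x}_i)}}{y_i!} \right)
  = \lambda(\mathbf{x}_i) - y_i \log \lambda(\mathbf{x}_i) + \log(y_i!)
```

The last term does not depend on $\lambda$, so minimising the average of $\lambda(\mathbf{x}_i) - y_i \log \lambda(\mathbf{x}_i)$ over the training set gives the same fit as maximising the Poisson likelihood.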

Ordinal Variables

Subsample and split

from sklearn.model_selection import train_test_split

freq = freq.drop("IDpol", axis=1).head(25_000)

X_train, X_test, y_train, y_test = train_test_split(
  freq.drop("ClaimNb", axis=1), freq["ClaimNb"], random_state=2023)

# Reset each index to start at 0 again.
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
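
When no test_size is given, train_test_split holds out 25% of the rows, so 25,000 policies give 18,750 training rows. A toy illustration with 100 made-up rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical columns x and y) standing in for the claims data.
toy = pd.DataFrame({"x": range(100), "y": [i % 3 for i in range(100)]})

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop("y", axis=1), toy["y"], random_state=2023)

print(len(X_train), len(X_test))

# The split shuffles the rows, so the index is out of order until it is reset.
X_train = X_train.reset_index(drop=True)
print(list(X_train.index[:3]))
```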

What values do we see in the data?

X_train["Area"].value_counts()
X_train["VehBrand"].value_counts()
X_train["VehGas"].value_counts()
X_train["Region"].value_counts()

Area
C    5507
D    4113
A    3527
E    2769
B    2359
F     475
Name: count, dtype: int64

VehBrand
B1     5069
B2     4838
B12    3708
       ... 
B13     336
B11     284
B14     136
Name: count, Length: 11, dtype: int64

VehGas
Regular    10773
Diesel      7977
Name: count, dtype: int64

Region
R24    6498
R82    2119
R11    1909
       ... 
R21      90
R42      55
R43      26
Name: count, Length: 22, dtype: int64

Ordinal & binary categories are easy

from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(X_train[["Area", "VehGas"]])
oe.categories_

[array(['A', 'B', 'C', 'D', 'E', 'F'], dtype=object),
 array(['Diesel', 'Regular'], dtype=object)]

for i, area in enumerate(oe.categories_[0]):
    print(f"The Area value {area} gets turned into {i}.")

The Area value A gets turned into 0.
The Area value B gets turned into 1.
The Area value C gets turned into 2.
The Area value D gets turned into 3.
The Area value E gets turned into 4.
The Area value F gets turned into 5.

for i, gas in enumerate(oe.categories_[1]):
    print(f"The VehGas value {gas} gets turned into {i}.")

The VehGas value Diesel gets turned into 0.
The VehGas value Regular gets turned into 1.
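
By default OrdinalEncoder sorts the categories, which happens to give the right order for Area (A to F). When the natural order is not alphabetical, the order can be passed explicitly through the categories argument. A sketch with a hypothetical Size column that is not part of the French motor dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column whose alphabetical order is misleading.
sizes = pd.DataFrame({"Size": ["Low", "High", "Medium", "Low"]})

# Default behaviour: categories are sorted, so High < Low < Medium.
auto = OrdinalEncoder().fit(sizes)
print(list(auto.categories_[0]))

# Explicit order: Low < Medium < High.
explicit = OrdinalEncoder(categories=[["Low", "Medium", "High"]]).fit(sizes)
codes = np.asarray(explicit.transform(sizes)).ravel()
print(codes)
```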

Ordinal encoded values

X_train_ord = oe.transform(X_train[["Area", "VehGas"]])
X_test_ord = oe.transform(X_test[["Area", "VehGas"]])

X_train[["Area", "VehGas"]].head()

	Area	VehGas
0	C	Diesel
1	C	Regular
2	E	Regular
3	D	Diesel
4	A	Regular

X_train_ord.head()

	Area	VehGas
0	2.0	0.0
1	2.0	1.0
2	4.0	1.0
3	3.0	0.0
4	0.0	1.0

Train on ordinal encoded values

random.seed(12)
model = Sequential([
  Dense(1, activation="exponential")
])

model.compile(optimizer="adam", loss="poisson")

es = EarlyStopping(verbose=True)
hist = model.fit(X_train_ord, y_train, epochs=100, verbose=0,
    validation_split=0.2, callbacks=[es])
hist.history["val_loss"][-1]

Epoch 22: early stopping

0.7821308374404907
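
A single Dense unit with an exponential activation computes exp(w·x + b), so this network has the same form as a Poisson regression with a log link. The NumPy sketch below mirrors the forward pass and the Poisson loss. The inputs, weights and bias are made-up values, and Keras's own loss additionally adds a small epsilon inside the logarithm.

```python
import numpy as np

def poisson_loss(y_true, lam):
    """Mean of lam - y * log(lam): the Poisson negative log-likelihood up to a constant."""
    return np.mean(lam - y_true * np.log(lam))

# Hypothetical ordinal-encoded inputs (Area, VehGas) and claim counts.
X = np.array([[2.0, 0.0], [2.0, 1.0], [4.0, 1.0], [3.0, 0.0]])
y = np.array([0.0, 1.0, 0.0, 0.0])

# Hypothetical weights and bias; a trained model would learn these.
w = np.array([0.1, -0.2])
b = -1.0

lam = np.exp(X @ w + b)  # the exponential keeps every prediction positive
loss = poisson_loss(y, lam)
print(lam, loss)
```

Among constant predictions, the loss is smallest when the prediction equals the sample mean of y, which is a quick sanity check on the formula.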

What about adding the continuous variables back in? Use a sklearn column transformer for that.

Preprocess ordinal & continuous

from sklearn.compose import make_column_transformer

ct = make_column_transformer(
  (OrdinalEncoder(), ["Area", "VehGas"]),
  ("drop", ["VehBrand", "Region"]),
  remainder=StandardScaler()
)

X_train_ct = ct.fit_transform(X_train)

X_train.head(3)

	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.00	C	6.0	2.0	66.0	50.0	B2	Diesel	124.0	R24
1	0.36	C	4.0	10.0	22.0	100.0	B1	Regular	377.0	R93
2	0.02	E	12.0	8.0	44.0	60.0	B3	Regular	5628.0	R11

X_train_ct.head(3)

	ordinalencoder__Area	ordinalencoder__VehGas	remainder__Exposure	remainder__VehPower	remainder__VehAge	remainder__DrivAge	remainder__BonusMalus	remainder__Density
0	2.0	0.0	1.126979	-0.165005	-0.844589	1.451036	-0.637179	-0.366980
1	2.0	1.0	-0.590896	-1.228181	0.586255	-1.548692	2.303010	-0.302700
2	4.0	1.0	-1.503517	3.024524	0.228544	-0.048828	-0.049141	1.031432

Preprocess ordinal & continuous II

from sklearn.compose import make_column_transformer

ct = make_column_transformer(
  (OrdinalEncoder(), ["Area", "VehGas"]),
  ("drop", ["VehBrand", "Region"]),
  remainder=StandardScaler(),
  verbose_feature_names_out=False
)
X_train_ct = ct.fit_transform(X_train)

X_train.head(3)

	Exposure	Area	VehPower	VehAge	DrivAge	BonusMalus	VehBrand	VehGas	Density	Region
0	1.00	C	6.0	2.0	66.0	50.0	B2	Diesel	124.0	R24
1	0.36	C	4.0	10.0	22.0	100.0	B1	Regular	377.0	R93
2	0.02	E	12.0	8.0	44.0	60.0	B3	Regular	5628.0	R11

X_train_ct.head(3)

	Area	VehGas	Exposure	VehPower	VehAge	DrivAge	BonusMalus	Density
0	2.0	0.0	1.126979	-0.165005	-0.844589	1.451036	-0.637179	-0.366980
1	2.0	1.0	-0.590896	-1.228181	0.586255	-1.548692	2.303010	-0.302700
2	4.0	1.0	-1.503517	3.024524	0.228544	-0.048828	-0.049141	1.031432

Glossary

column transformer
nominal variables
ordinal variables