Lab: Distributional Regression

ACTL3143 & ACTL5111 Deep Learning for Actuaries

CANN

  1. Find the coefficients \boldsymbol{\beta}_{\text{GLM}} of the GLM with a link function g(\cdot).
  2. Find the weights \boldsymbol{w}_{\text{CANN}} of a neural network \mathcal{M}_{\text{CANN}}:\mathbb{R}^{d_{\boldsymbol{x}}}\to\mathbb{R}.
  3. Given a new instance \boldsymbol{x}, we have \mathbb{E}[Y|\boldsymbol{x}] = g^{-1}\Big( \langle\boldsymbol{\beta}_{\text{GLM}}, \boldsymbol{x}\rangle + \mathcal{M}_{\text{CANN}}(\boldsymbol{x};\boldsymbol{w}_{\text{CANN}})\Big).
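
A minimal Keras sketch of step 3, assuming a log link so that g^{-1} = exp (the names, layer sizes, and the frozen-GLM wiring are illustrative; the lab's own code instead outputs the GLM and network log-mean terms as separate columns and combines them in the loss):

    import numpy as np
    from tensorflow.keras.layers import Input, Dense, Add, Activation
    from tensorflow.keras.models import Model

    # Illustrative dimensions and fitted GLM coefficients (assumed values).
    d_x = 5
    beta_glm = np.random.normal(size=(d_x, 1)).astype('float32')

    inputs = Input(shape=(d_x,))

    # Fixed GLM linear predictor <beta_GLM, x>; its weights stay frozen.
    glm_layer = Dense(1, use_bias=False, trainable=False, name='glm_part')
    glm_logmu = glm_layer(inputs)

    # Trainable neural-network correction M_CANN(x; w_CANN).
    hidden = Dense(32, activation='relu')(inputs)
    cann_logmu = Dense(1, name='cann_part')(hidden)

    # Apply g^{-1} to the sum; with a log link, g^{-1} = exp.
    mu = Activation('exponential')(Add()([glm_logmu, cann_logmu]))

    CANN_sketch = Model(inputs, mu)
    glm_layer.set_weights([beta_glm])  # plug in the fitted GLM coefficients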

MDN

A typical MDN structure.
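
As a rough sketch of that structure (layer sizes, the feature count, and the number of components K are illustrative, not taken from the slides), the network ends in three heads: the mixture weights and the two parameter vectors of the gamma components. The loss then assembles the mixture distribution from these outputs, as in the exercise code below.

    from tensorflow.keras.layers import Input, Dense, Concatenate
    from tensorflow.keras.models import Model

    K = 3                        # number of mixture components (illustrative)
    inputs = Input(shape=(10,))  # illustrative number of features

    hidden = Dense(64, activation='relu')(inputs)

    pis = Dense(K, activation='softmax')(hidden)      # mixture weights, sum to 1
    alphas = Dense(K, activation='softplus')(hidden)  # gamma shape parameters
    betas = Dense(K, activation='softplus')(hidden)   # gamma rate parameters

    MDN_sketch = Model(inputs, Concatenate()([pis, alphas, betas]))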

Exercises

CANN

  1. Train a CANN model that predicts the mean as follows: \mathbb{E}[Y|\boldsymbol{x}] = g^{-1}\Big( \textcolor{orange}{0.9} \cdot\langle\boldsymbol{\beta}_{\text{GLM}}, \boldsymbol{x}\rangle + \textcolor{orange}{0.1} \cdot \mathcal{M}_{\text{CANN}}(\boldsymbol{x};\boldsymbol{w}_{\text{CANN}})\Big), where g^{-1}(\cdot)=\exp(\cdot). Hint: check slides 20, 23, 24, and 25, and change the following line of code on slide 24.

    def CANN_negative_log_likelihood(y_true, y_pred):
        ...
        mu = tf.math.exp(CANN_logmu + GLM_logmu)
  2. Recompute the dispersion parameter using the adjusted model. Hint: use the code from slide 25 and change the following line of code:

    mus = np.exp(np.sum(CANN.predict(X_train), axis = 1))

MDN

  1. Increase the number of mixture components to 5. You can use the code from slide 33.

  2. Change the distributional assumption from gamma to inverse gamma for the mixture density network model. Hint: adjust the following code using TensorFlow Probability's tfd.InverseGamma distribution; a sketch of the change follows the original code.

    mixture_distribution = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=pis),
        components_distribution=tfd.Gamma(alphas, betas))
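
    A possible adjustment (a sketch only; it assumes the second output head, betas, is reinterpreted as the scale parameter of the inverse gamma):

    mixture_distribution = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=pis),
        # Inverse gamma components: concentration and scale replace shape and rate
        components_distribution=tfd.InverseGamma(concentration=alphas, scale=betas))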
  3. Report the average negative log-likelihood loss (test data) using the new MDN. Hint: slides 36 and 39.
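
    Assuming the MDN is compiled with the mixture negative log-likelihood as its loss, one way to obtain this (with MDN_new as an illustrative name for the refitted model) is:

    # Average negative log-likelihood on the test data
    test_nll = MDN_new.evaluate(X_test, y_test, verbose=0)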

Extension

  1. Compute the CRPS for the CANN and MDN models trained in the exercises above.
  2. Build a Mixture Density Network (MDN), where the first component is a gamma distribution, the second component is a log-normal distribution, and the third component is an inverse gamma distribution.
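
    Since tfd.MixtureSameFamily requires all components to come from the same family, a mixture over different families can be built with tfd.Mixture instead. A rough sketch, assuming each component's parameters and the mixing probabilities come from separate output heads with illustrative names:

    # pis has shape (batch, 3); each parameter tensor has shape (batch, 1) and is
    # squeezed so every component is a batch of scalar distributions.
    mixture = tfd.Mixture(
        cat=tfd.Categorical(probs=pis),
        components=[
            tfd.Gamma(concentration=alpha1[..., 0], rate=beta1[..., 0]),
            tfd.LogNormal(loc=mu2[..., 0], scale=sigma2[..., 0]),
            tfd.InverseGamma(concentration=alpha3[..., 0], scale=beta3[..., 0])])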

Monte Carlo Dropout

Dropout

For Monte Carlo (MC) dropout, we intentionally leave dropout switched on when making predictions, so that repeated forward passes produce a spread of predictions from which uncertainty can be estimated.
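
In Keras this is usually done by calling the model with training=True, so the Dropout layers keep sampling new masks at prediction time. A minimal sketch for a model with numeric outputs (model, X_test, and the number of passes are placeholders):

    import numpy as np

    # Each forward pass re-samples the dropout masks, giving a different prediction.
    mc_predictions = np.stack(
        [model(X_test.values, training=True).numpy() for _ in range(100)])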

Deep Ensembles

  • Train D neural networks with different random weight initialisations, independently and in parallel. The trained weights are \boldsymbol{w}^{(1)}, \ldots, \boldsymbol{w}^{(D)}.
  • Ensemble the outputs when making predictions, i.e., take the average of the outputs from the individual neural networks, as sketched below.
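
A rough sketch of the idea (build_model, the seeds, and the data names are placeholders rather than the lecture code):

    import numpy as np
    import tensorflow as tf

    D = 5
    ensemble = []
    for d in range(D):
        tf.random.set_seed(d)     # a different random initialisation per member
        model = build_model()     # placeholder: the same architecture each time
        model.fit(X_train, y_train, epochs=10, verbose=0)
        ensemble.append(model)

    # Ensemble prediction: average the D individual predictions.
    preds = np.mean([m.predict(X_test, verbose=0) for m in ensemble], axis=0)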

Exercises

Monte Carlo Dropout

  1. Construct a neural network MCDropout_LN that outputs the parameters of a gamma distribution with the following structure and specification:

    • Use

      random.seed(1); tf.random.set_seed(1)
    • Adam optimiser with the default learning rate

    • validation split of 0.2 while training

    • two hidden layers with 64 neurons in each layer, and

    • a constant dropout rate of 0.2.

    Hint: the following code can be helpful

    # Output the parameters of the gamma distribution
    outputs = Dense(2, activation = 'softplus')(x)
    
    # Construct the Gamma distribution on the last layer
    distributions = tfp.layers.DistributionLambda(
          lambda t: tfd.Gamma(concentration=t[..., 0:1],
                              rate=t[..., 1:2]))(outputs)
    
    # Model 
    MCDropout_LN = Model(inputs, distributions)
    
    # Loss Function
    def gamma_loss(y_true, y_pred):
        return -y_pred.log_prob(y_true)
    
    # Then use the loss function when compiling the model
    MCDropout_LN.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                         loss=gamma_loss)
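
    The snippet above assumes inputs and x already exist; a sketch of the preceding layers matching the specification (the relu activation is an assumption) is:

    from tensorflow.keras.layers import Input, Dense, Dropout

    inputs = Input(shape=(X_train.shape[1],))

    # Two hidden layers with 64 neurons each and a constant dropout rate of 0.2
    x = Dense(64, activation='relu')(inputs)
    x = Dropout(0.2)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.2)(x)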
  2. Apply MC dropout 2000 times and store the parameter estimates for the first instance in the test dataset using the model MCDropout_LN. Hint: slide 61, and replace

    predicted_distributions = gamma_bnn(X_test[9:10].values)

    with

    predicted_distributions = MCDropout_LN(X_test[:1].values, training = True)
  3. Calculate the aleatoric and epistemic uncertainty for this instance using the corresponding equations from the lecture. Hint: slide 64.
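
    A sketch of the standard decomposition for exercises 2 and 3, assuming the predictive distribution is the gamma output of MCDropout_LN: the aleatoric part is the average of the predicted variances across the MC passes, and the epistemic part is the variance of the predicted means.

    import numpy as np

    n_mc = 2000
    means, variances = [], []
    for _ in range(n_mc):
        dist = MCDropout_LN(X_test[:1].values, training=True)
        means.append(dist.mean().numpy().flatten()[0])
        variances.append(dist.variance().numpy().flatten()[0])

    aleatoric = np.mean(variances)   # average predicted variance
    epistemic = np.var(means)        # variance of the predicted means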

Deep Ensembles

  1. Reuse the code demonstrated in the lecture and calculate the aleatoric and epistemic uncertainty for the first instance in the test dataset using the same equations. Hint: slides 66, 67, and 68.

Extension

  1. Prove the result on slide 55.
  2. Replace the variational distribution with a mixture of Gaussians for the BNN introduced in the lecture.