We want to look at severity, but the covariates are stored in the frequency dataset, so we need to merge the two datasets, after dropping the ClaimNb column from the frequency data.
```python
import pandas as pd

sev_no_covars = pd.read_parquet('freMTPL2sev.parquet')
freq_df = pd.read_parquet('freMTPL2freq.parquet')

# Pull out just the covariates from the frequency dataset
covariates = freq_df.drop(columns=['ClaimNb'])

# Merge the severity data with the policyholder covariates
severity = pd.merge(sev_no_covars, covariates, on='IDpol', how='left')
severity = severity.dropna()
severity
```
|       | IDpol   | ClaimAmount | Exposure | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas  | Area | Density | Region               |
|-------|---------|-------------|----------|----------|--------|---------|------------|----------|---------|------|---------|----------------------|
| 0     | 1552    | 995.20      | 0.59     | 11.0     | 0.0    | 39.0    | 56.0       | B12      | Diesel  | D    | 778.0   | Picardie             |
| 1     | 1010996 | 1128.12     | 0.95     | 4.0      | 1.0    | 49.0    | 50.0       | B12      | Regular | E    | 2354.0  | Ile-de-France        |
| 2     | 4024277 | 1851.11     | 0.71     | 4.0      | 2.0    | 32.0    | 106.0      | B12      | Regular | D    | 570.0   | Nord-Pas-de-Calais   |
| 3     | 4007252 | 1204.00     | 0.78     | 4.0      | 1.0    | 49.0    | 57.0       | B12      | Regular | C    | 288.0   | Midi-Pyrenees        |
| 4     | 4046424 | 1204.00     | 0.86     | 12.0     | 0.0    | 37.0    | 50.0       | B12      | Diesel  | F    | 27000.0 | Ile-de-France        |
| ...   | ...     | ...         | ...      | ...      | ...    | ...     | ...        | ...      | ...     | ...  | ...     | ...                  |
| 26634 | 3254353 | 1200.00     | 0.07     | 4.0      | 13.0   | 53.0    | 50.0       | B1       | Regular | D    | 824.0   | Languedoc-Roussillon |
| 26635 | 3254353 | 1800.00     | 0.07     | 4.0      | 13.0   | 53.0    | 50.0       | B1       | Regular | D    | 824.0   | Languedoc-Roussillon |
| 26636 | 3254353 | 1000.00     | 0.07     | 4.0      | 13.0   | 53.0    | 50.0       | B1       | Regular | D    | 824.0   | Languedoc-Roussillon |
| 26637 | 2222064 | 767.55      | 0.43     | 6.0      | 0.0    | 67.0    | 50.0       | B2       | Diesel  | C    | 142.0   | Languedoc-Roussillon |
| 26638 | 2254065 | 1500.00     | 0.28     | 7.0      | 2.0    | 36.0    | 60.0       | B12      | Diesel  | D    | 1732.0  | Rhone-Alpes          |

26444 rows × 12 columns
Now we will try to predict claim severity, i.e. `ClaimAmount`, from the remaining covariates.
Data dictionary
- `IDpol`: policy number (unique identifier)
- `Area`: area code (categorical, ordinal)
- `BonusMalus`: bonus-malus level, between 50 and 230 (reference level 100)
- `Density`: number of inhabitants per km² in the city where the driver lives
- `DrivAge`: age of the (most common) driver, in years
- `Exposure`: total exposure, in yearly units
- `Region`: region in France (pre-2016 classification)
- `VehAge`: age of the car, in years
- `VehBrand`: car brand (categorical, nominal)
- `VehGas`: fuel type, diesel or regular (binary)
- `VehPower`: power of the car (categorical, ordinal)
- `ClaimAmount`: size of the individual claim (target)
GLM
- Fit a gamma GLM using statsmodels with a log link function.
- Report the average negative log-likelihood loss on the test set.
- Compute the dispersion parameter using the code from the slides, and compare it to statsmodels' implementation.
CANN
- Fit a CANN model.
- Report the average negative log-likelihood loss on the test set.
- Recompute the dispersion parameter using the adjusted model. Hint: use the code from the slides and change the following line of code.
- Report the average negative log-likelihood loss on the test set for both models.
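One common CANN formulation keeps the fitted GLM's linear predictor as a frozen skip connection and trains a neural-network correction on top of it with the gamma deviance as the loss. Below is a toy NumPy sketch of that idea on simulated data: the offset `eta_glm` (here a simple log-linear least-squares fit) stands in for the gamma GLM from the previous part, and the tiny hand-rolled network is purely illustrative, not the slides' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 2000, 3, 8
X = rng.normal(size=(n, d))
mu_true = np.exp(1.0 + 0.3 * X[:, 0] + 0.5 * np.tanh(X[:, 1] * X[:, 2]))
y = rng.gamma(2.0, mu_true / 2.0)

# "GLM part": log-linear least-squares fit as a stand-in for the fitted gamma GLM
A = np.c_[np.ones(n), X]
beta = np.linalg.lstsq(A, np.log(y), rcond=None)[0]
eta_glm = A @ beta                          # fixed offset / skip connection

def gamma_deviance(y, mu):
    return 2.0 * np.mean((y - mu) / mu - np.log(y / mu))

# One-hidden-layer correction network trained on the gamma deviance
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
w2 = np.zeros(h); b2 = 0.0                  # start as the pure GLM
lr = 0.05
for _ in range(2000):
    Z = np.tanh(X @ W1 + b1)
    eta = eta_glm + Z @ w2 + b2             # CANN: GLM score + NN adjustment
    mu = np.exp(eta)
    g = (1.0 - y / mu) / n                  # d(mean deviance)/d eta, up to a constant
    gw2 = Z.T @ g; gb2 = g.sum()
    gZ = np.outer(g, w2) * (1.0 - Z ** 2)
    gW1 = X.T @ gZ; gb1 = gZ.sum(axis=0)
    w2 -= lr * gw2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

dev_glm = gamma_deviance(y, np.exp(eta_glm))
dev_cann = gamma_deviance(y, np.exp(eta_glm + np.tanh(X @ W1 + b1) @ w2 + b2))
print(dev_glm, dev_cann)                    # the CANN should not be worse in-sample
```

Because the correction starts at zero, the CANN begins exactly at the GLM and can only refine it; the dispersion is then re-estimated from the CANN's fitted means via the same Pearson formula as before.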
Other Metrics
- Compare the previous models using point-wise metrics such as MAE and MSE. Which model is best according to these metrics?
- Compare the models using distributional metrics such as the CRPS and the log-likelihood. Which model is best now? Did the ranking change compared to the point-wise metrics?
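A sketch of how these metrics could be computed. Here `y` and the two predicted-mean vectors are simulated stand-ins for the real test set and the fitted models, the shared dispersion `phi` is an assumption, and the CRPS is approximated by Monte Carlo rather than the closed-form gamma expression:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.gamma(2.0, 500.0, size=1000)                # "observed" claim amounts
mu_a = np.full_like(y, y.mean())                    # model A: constant mean
mu_b = y * rng.normal(1.0, 0.1, size=y.size)        # model B: close to the truth
phi = 0.5                                           # assumed common dispersion

def mae_mse(y, mu):
    return np.mean(np.abs(y - mu)), np.mean((y - mu) ** 2)

def avg_loglik(y, mu, phi):
    return np.mean(stats.gamma.logpdf(y, a=1.0 / phi, scale=mu * phi))

def crps_gamma_mc(y, mu, phi, n_samples=2000):
    """Monte Carlo CRPS: E|X - y| - 0.5 E|X - X'| with X, X' iid draws
    from the predictive Gamma(shape=1/phi, scale=mu*phi)."""
    k = 1.0 / phi
    X1 = rng.gamma(k, mu * phi, size=(n_samples, y.size))
    X2 = rng.gamma(k, mu * phi, size=(n_samples, y.size))
    per_obs = np.abs(X1 - y).mean(axis=0) - 0.5 * np.abs(X1 - X2).mean(axis=0)
    return per_obs.mean()

mae_a, mse_a = mae_mse(y, mu_a)
mae_b, mse_b = mae_mse(y, mu_b)
ll_a, ll_b = avg_loglik(y, mu_a, phi), avg_loglik(y, mu_b, phi)
crps_a, crps_b = crps_gamma_mc(y, mu_a, phi), crps_gamma_mc(y, mu_b, phi)
print("A:", mae_a, mse_a, ll_a, crps_a)
print("B:", mae_b, mse_b, ll_b, crps_b)
```

Lower MAE, MSE, and CRPS are better, while higher log-likelihood is better; rankings can differ because the point-wise metrics judge only the predicted mean, whereas the distributional metrics also reward a well-calibrated spread.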
Monte Carlo Dropout
- Construct any style of neural network to predict the claim severity. Then add dropout to some part of the network.
- Apply Monte Carlo dropout 2000 times and store the test-set predictions. Make a histogram of the average test negative log-likelihoods.
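The MC-dropout step can be sketched as follows. The network below is a tiny untrained ReLU net with hand-rolled inverted dropout (in the exercise the trained severity network would be used instead, with dropout left active at prediction time, e.g. `model(X, training=True)` in Keras), and the data, dispersion, and one-mask-per-pass simplification are all illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d, h = 500, 3, 32
X_test = rng.normal(size=(n, d))
y_test = rng.gamma(2.0, np.exp(1.0 + 0.3 * X_test[:, 0]) / 2.0)

# Toy one-hidden-layer network weights (untrained, illustrative)
W1 = rng.normal(scale=0.3, size=(d, h)); b1 = np.zeros(h)
w2 = rng.normal(scale=0.3, size=h);      b2 = 1.0
p_drop, phi = 0.2, 0.5                     # dropout rate and assumed dispersion
k = 1.0 / phi

def predict_mu_with_dropout(X):
    Z = np.maximum(X @ W1 + b1, 0.0)       # ReLU hidden layer
    keep = rng.random(h) >= p_drop         # fresh dropout mask on every call
    # One shared mask per pass for simplicity; frameworks draw per-sample masks
    Z = Z * keep / (1.0 - p_drop)          # inverted-dropout rescaling
    return np.exp(Z @ w2 + b2)             # positive predicted mean

# 2000 stochastic forward passes; store one average test NLL per pass
avg_nlls = np.array([
    -np.mean(stats.gamma.logpdf(y_test, a=k,
                                scale=predict_mu_with_dropout(X_test) * phi))
    for _ in range(2000)
])

# Histogram of the 2000 average negative log-likelihoods:
# import matplotlib.pyplot as plt; plt.hist(avg_nlls, bins=50); plt.show()
```

The spread of the histogram reflects the model uncertainty induced by the dropout masks.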