Distributional Regression

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Author

Patrick Laub

Show the package imports
import os
import random
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
import pandas as pd

import keras
from keras.callbacks import EarlyStopping
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Concatenate
from keras.initializers import Constant

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn import set_config
set_config(transform_output="pandas")

import scipy.stats as stats
import statsmodels.api as sm

Introduction

Why actuaries avoid neural networks

They’re not inherently interpretable, so we just have to look at the inputs and outputs of the black box.

Neural network models typically just output a prediction without any sense of its confidence.

Therefore we cannot trust the neural network models, which is a dealbreaker.

Now let’s focus on actuarial problems.

Car insurance

Claim size prediction

👤 Driver Age   🚗 Car Age   🏎️ Car Type      💵 Cost
25              3            🚙 Sedan         $1,200
40              5            🚐 SUV           $2,500
19              1            🏎️ Sports Car    $3,800
60              10           🚘 Hatchback     $800

What’s wrong? Not enough rows? Not enough columns?

These claim size predictions are not genuine future predictions: the next claim from a 40-year-old with a 5-year-old SUV will probably not be exactly $2,500. That is only a best guess.

Code
data = [
    (25, 3, "🚙", 1200),  # (Driver Age, Car Age, Type of Car, Repair Cost)
    (40, 5, "🚐", 2500),
    (19, 1, "🏎️", 3800),
    (60, 10, "🚘", 800)
]

# Colour cycle used for the per-customer plots below
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

x = np.linspace(0, 5000, 1000)

def draw_gamma_distribution(mean, shape, color, customer_id):
  scale = mean / shape
  plt.plot(x, stats.gamma.pdf(x, a=shape, scale=scale), label=f"Customer {customer_id+1}", color=color)
  plt.title(f"Predicted Claim Size Distribution")
  plt.xlabel("Claim size")
  plt.ylabel("Density")

Distributional regression

Customer 1 = (25, 3, 🚙)

Code
draw_gamma_distribution(1200, 2, colors[0], 0)

What we are trying to predict is fundamentally unpredictable. Rather than making a single 100%-confident prediction of a customer’s next claim, a distributional regression model returns a range of possible claim sizes, namely a distribution of predicted claim sizes for that customer.

Based on the distribution output, actuaries can obtain risk measures such as expected claim size, variance, Value at Risk, etc.
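For instance, once a Gamma distribution has been predicted for a customer, scipy can turn it into risk measures. A minimal sketch, assuming the mean of 1,200 and shape of 2 used in the plot below:

# Sketch: risk measures from a predicted Gamma claim-size distribution
# (assumes mean 1200 and shape 2, matching the plotted example).
mean, shape = 1200, 2
claim_dist = stats.gamma(a=shape, scale=mean / shape)

expected_claim = claim_dist.mean()       # expected claim size
claim_variance = claim_dist.var()        # variance of the claim size
value_at_risk_95 = claim_dist.ppf(0.95)  # 95% Value at Risk (the 95% quantile)
print(expected_claim, claim_variance, value_at_risk_95)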

Distributional regression

Customer 2 = (40, 5, 🚐)

Code
draw_gamma_distribution(2500, 2, colors[1], 1)

Distributional regression

Customer 3 = (19, 1, 🏎️)

Code
draw_gamma_distribution(3800, 2, colors[2], 2)

Distributional regression

Customer 4 = (60, 10, 🚘)

Code
draw_gamma_distribution(800, 2, colors[3], 3)

Distributional regression

All customers

Code
# Plot all on the same graph
plt.figure()
for i, (age, car_age, car_type, cost) in enumerate(data):
  mean = cost
  shape = 2  # Example shape parameter for the gamma distribution
  scale = mean / shape  # Scale parameter to achieve the correct mean

  plt.plot(x, stats.gamma.pdf(x, a=shape, scale=scale), label=f"Customer {i+1}", color=colors[i])

plt.title(f"Predicted Claim Size Distribution")
plt.xlabel("Claim size")
plt.ylabel("Density")
plt.legend()

Traditional vs distributional regression

Traditional:

(👤 35, 🚗 12, 🏎️ Sport) \longrightarrow Model \longrightarrow 1582.48

Distributional:

(👤 35, 🚗 12, 🏎️ Sport) \longrightarrow Model \longrightarrow a predicted distribution of claim sizes

While traditional models give a single prediction (expected value), distributional regression models give a distribution of predictions.

Baseline: a generalised linear model

A gamma GLM with a log link function:

\begin{aligned} Y | \mathbf{X} &\sim \mathrm{Gamma}(\ldots, \ldots) \\ \mathbb{E}[Y | \mathbf{X}] &= \exp\Bigl\{ \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Car Age} + \beta_3 \cdot \text{Type} \Bigr\} \end{aligned}

A simple model, easy to train and interpret, but…

Important: GLMs can be
  1. Bad at regression
  2. Bad at distributional regression

GLMs can be bad at regression because they are restricted to taking a linear combination of the covariates.

GLMs can be bad at distributional regression because they are restricted to a limited number of probability distributions (Gamma, Poisson, etc.).

Example 1: Non-monotonicity

Code
ages = np.linspace(18, 80, 200)  # Driver ages from 18 to 80
claim_sizes = 1000 * (1 - np.exp(-((ages - 35) / 20)**2)) 

# Create the updated plot
# plt.figure(figsize=(8, 6))
plt.plot(ages, claim_sizes, linewidth=2)
plt.title("Effect of Driver Age on Claim Sizes")
plt.xlabel("Driver Age")
plt.ylabel("Claim Size Effect ($)")
# plt.grid(alpha=0.3)
# plt.axvline(35, color='gray', linestyle='--', label='New Middle Age Peak')
# plt.legend(fontsize=10)
plt.show()

The relationship between an input and the output is not always linear, so we need a way to capture these non-linear relationships. You can manually engineer features before feeding them into the GLM, but this quickly becomes time-consuming and doesn’t compare to the automated, flexible feature engineering that neural networks perform.

GLMs cannot (easily) do this \longrightarrow Use a neural network

Example 2: Multi-modality

Code
x = np.linspace(0, 20*500, 500)  # Range restricted to positive real numbers

# Adjust the components to ensure a bimodal mixture
pdf1 = stats.lognorm.pdf(x, s=0.3, scale=np.exp(1)*500)  # Sharper log-normal distribution
pdf2 = stats.gamma.pdf(x, a=10, scale=0.8*500)  # Narrower gamma distribution
mixture_pdf = 0.3 * pdf1 + 0.7 * pdf2  # Weighted mixture of the two components

# Create the updated plot
plt.figure();
plt.plot(x, pdf1, label='Parent driving', linestyle='--', color='blue')
plt.plot(x, pdf2, label='Kid driving', linestyle='--', color='green')
plt.plot(x, mixture_pdf, label='Mixture Distribution', color='red', linewidth=2)
plt.title("Claim Size Distribution")
plt.legend()
plt.xlabel("Claim Size")
plt.ylabel("Density")

# Turn off the y axis
# plt.gca().axes.get_yaxis().set_visible(False)

None

The distribution of the output cannot always be easily represented with the limited distributions available for GLMs. For example, if the distribution of the claim size is multi-modal… Gamma/Weibull/Pareto distributions cannot capture this.

Note: Mixture distribution

A mixture distribution is, simply put, a mixture of two or more distributions. If f_1(\cdot) is the distribution of claim sizes when a parent is driving and f_2(\cdot) is the distribution when a kid is driving, then the distribution of all claim sizes is:

f(x) = wf_1(x) + (1-w)f_2(x)

The weight parameter w would be chosen based on the expected proportion of the time that a parent/kid is driving.
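As an illustration only (reusing the parent/kid components plotted above), sampling from a mixture amounts to first choosing which component applies and then drawing from it:

# Sketch: simulate claim sizes from the mixture by first choosing the driver.
w = 0.3  # probability that the parent is driving (the mixture weight above)
rng = np.random.default_rng(0)
parent_driving = rng.random(10_000) < w
claims = np.where(
    parent_driving,
    stats.lognorm.rvs(s=0.3, scale=np.exp(1) * 500, size=10_000, random_state=rng),
    stats.gamma.rvs(a=10, scale=0.8 * 500, size=10_000, random_state=rng),
)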

Key idea

An example of distributional forecasting over the All Ordinaries Index
  • Earlier machine learning models focused on point estimates.
  • However, in many applications, we need to understand the distribution of the response variable.
  • Each prediction becomes a distribution over the possible outcomes

In this example, the distribution of predictions is much more concentrated around the mean for earlier predictions. As we look further into the future, naturally our confidence decreases and the variance of predictions increases.

We will focus on regression not classification

Classifiers already give us a probability, which is a big step up compared to regression models.

However, neural networks’ “probabilities” can be overconfident.

We already saw a case of this.

See Guo et al. (2017), On Calibration of Modern Neural Networks.

Traditional Regression

Notation

  • scalars are denoted by lowercase letters, e.g., y,
  • vectors are denoted by bold lowercase letters, e.g., \boldsymbol{y} = (y_1, \ldots, y_n) ,
  • random variables are denoted by capital letters, e.g., Y
  • random vectors are denoted by bold capital letters, e.g., \boldsymbol{X} = (X_1, \ldots, X_p) ,
  • matrices are denoted by bold uppercase non-italics letters, e.g., \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} .

Regression notation

  • n is the number of observations, p is the number of features,
  • the true coefficients are \boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p),
  • \beta_0 is the intercept, \beta_1, \ldots, \beta_p are the coefficients,
  • \boldsymbol{\beta}^* is the estimated coefficient vector,
  • \boldsymbol{x}_i = (1, x_{i1}, x_{i2}, \ldots, x_{ip}) is the feature vector for the ith observation,
  • y_i is the response variable for the ith observation,
  • \hat{y}_i is the predicted value for the ith observation,
  • probability density functions (p.d.f.), probability mass functions (p.m.f.), cumulative distribution functions (c.d.f.).

Traditional Regression

Multiple linear regression assumes the data-generating process is

Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i

where the \varepsilon_i \sim \mathcal{N}(0, \sigma^2) are independent error terms.

We estimate the coefficients \beta_0, \beta_1, \ldots, \beta_p by minimising the sum of squared residuals or mean squared error

\text{RSS} := \sum_{i=1}^n (y_i - \hat{y}_i)^2 , \quad \text{MSE} := \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 ,

where \hat{y}_i is the predicted value for the ith observation.

Visualising the distribution of each Y

Code
# Generate sample data for linear regression
np.random.seed(0)
X_toy = np.linspace(0, 10, 10)
np.random.shuffle(X_toy)

beta_0 = 2
beta_1 = 3
y_toy = beta_0 + beta_1 * X_toy + np.random.normal(scale=2, size=X_toy.shape)
sigma_toy = 2  # Assuming a standard deviation for the normal distribution

# Fit a simple linear regression model
coefficients = np.polyfit(X_toy, y_toy, 1)
predicted_y = np.polyval(coefficients, X_toy)

# Plot the data points and the fitted line
plt.scatter(X_toy, y_toy, label='Data Points')
plt.plot(X_toy, predicted_y, color='red', label='Fitted Line')

# Draw the normal distribution bell curve sideways at each data point
for i in range(len(X_toy)):
    mu = predicted_y[i]
    y_values = np.linspace(mu - 4*sigma_toy, mu + 4*sigma_toy, 100)
    x_values = stats.norm.pdf(y_values, mu, sigma_toy) + X_toy[i]
    plt.plot(x_values, y_values, color='blue', alpha=0.5)

plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()

The error terms \varepsilon_i \sim \mathcal{N}(0, \sigma^2) are represented by the blue density curves drawn sideways along the fitted line.

The probabilistic view

From a probability point of view, we are fitting a distributional model: since \varepsilon is random, Y is random too.

Y_i \sim \mathcal{N}(\mu_i, \sigma^2)

where \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, and \sigma^2 is assumed known.

The \mathcal{N}(\mu, \sigma^2) normal distribution has p.d.f.

f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) .

The likelihood function is

L(\boldsymbol{\beta}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right) \Rightarrow \ell(\boldsymbol{\beta}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2 .

Perform maximum likelihood estimation to find \boldsymbol{\beta}.
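As a quick numerical check (a sketch, reusing the toy data and sigma_toy from the previous slide), minimising this normal negative log-likelihood recovers the least-squares coefficients:

from scipy.optimize import minimize

def normal_nll(beta):
    mu = beta[0] + beta[1] * X_toy
    return -np.sum(stats.norm.logpdf(y_toy, loc=mu, scale=sigma_toy))

mle = minimize(normal_nll, x0=[0.0, 0.0])
print(mle.x)               # maximum likelihood estimates of (beta_0, beta_1)
print(coefficients[::-1])  # np.polyfit returned (slope, intercept)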

The predicted distributions

Code
y_pred = np.polyval(coefficients, X_toy[:4])

fig, axes = plt.subplots(4, 1, figsize=(5.0, 3.0))

x_min = y_pred[:4].min() - 4*sigma_toy
x_max = y_pred[:4].max() + 4*sigma_toy
x_grid = np.linspace(x_min, x_max, 100)

# Plot each normal distribution with different means vertically
for i, ax in enumerate(axes):
    mu = y_pred[i]
    y_grid = stats.norm.pdf(x_grid, mu, sigma_toy)
    ax.plot(x_grid, y_grid)
    ax.set_ylabel(f'$f(y ; \\boldsymbol{{x}}_{{{i+1}}})$')
    ax.set_xticks([y_pred[i]], labels=[r'$\mu_{' + str(i+1) + r'}$'])
    ax.plot(y_toy[i], 0, 'rx', clip_on=False)

plt.tight_layout();

\sigma^2 is known, fixed and the same for all observations. For each observation, the shape of the distribution doesn’t change, only the location.

The machine learning view

The negative log-likelihood \text{NLL}(\boldsymbol{\beta}) := -\ell(\boldsymbol{\beta}) is to be minimised:

\text{NLL}(\boldsymbol{\beta}) = \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2 .

As \sigma^2 is fixed, minimising NLL is equivalent to minimising MSE:

\begin{aligned} \boldsymbol{\beta}^* &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \text{NLL}(\boldsymbol{\beta}) \\ &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \frac{n}{2}\log(2\pi) + \frac{n}{2}\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2 \\ &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \frac{1}{n} \sum_{i=1}^n \Bigl( y_i - \hat{y}_i(\boldsymbol{x}_i; \boldsymbol{\beta}) \Bigr)^2 \\ &= \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\,\, \text{MSE}\bigl( \boldsymbol{y}, \hat{\boldsymbol{y}}(\boldsymbol{\mathbf{X}}; \boldsymbol{\beta}) \bigr). \end{aligned}

Generalised Linear Model (GLM)

The GLM is often characterised by the mean prediction:

\mu(\boldsymbol{x}; \boldsymbol{\beta}) = g^{-1} \left(\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right)

where g is the link function.

Common GLM distributions for the response variable include:

  • Normal distribution with identity link (just MLR)
  • Bernoulli distribution with logit link (logistic regression)
  • Poisson distribution with log link (Poisson regression)
  • Gamma distribution with log link

Logistic regression

A Bernoulli distribution with parameter p has p.m.f.

f(y)\ =\ \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases} \ =\ p^y (1 - p)^{1 - y}.

Our model is Y|\boldsymbol{X}=\boldsymbol{x} follows a Bernoulli distribution with parameter

\mu(\boldsymbol{x}; \boldsymbol{\beta}) = \frac{1}{1 + \exp\left(-\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right)} = \mathbb{P}(Y=1|\boldsymbol{X}=\boldsymbol{x}).

The likelihood function, using \mu_i := \mu(\boldsymbol{x}_i; \boldsymbol{\beta}), is

L(\boldsymbol{\beta}) \ =\ \prod_{i=1}^n \begin{cases} \mu_i & \text{if } y_i = 1 \\ 1 - \mu_i & \text{if } y_i = 0 \end{cases} \ =\ \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} .

Binary cross-entropy loss

L(\boldsymbol{\beta}) = \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} \Rightarrow \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \Bigl( y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \Bigr).

The negative log-likelihood is

\text{NLL}(\boldsymbol{\beta}) = -\sum_{i=1}^n \Bigl( y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \Bigr).

The binary cross-entropy loss is basically identical: \text{BCE}(\boldsymbol{y}, \boldsymbol{\mu}) = - \frac{1}{n} \sum_{i=1}^n \Bigl( y_i \log(\mu_i) + (1 - y_i) \log(1 - \mu_i) \Bigr).
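As a sanity check (a sketch with made-up values), Keras’s built-in binary cross-entropy agrees with the Bernoulli NLL divided by n, up to a small clipping epsilon inside Keras:

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # made-up labels
mu_hat = np.array([0.9, 0.2, 0.6, 0.4])   # made-up predicted probabilities

manual_bce = -np.mean(y_true * np.log(mu_hat) + (1 - y_true) * np.log(1 - mu_hat))
keras_bce = keras.ops.mean(keras.losses.binary_crossentropy(y_true, mu_hat))
print(manual_bce, float(keras_bce))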

Poisson regression

A Poisson distribution with rate \lambda has p.m.f. f(y) = \frac{\lambda^y \exp(-\lambda)}{y!}.

Our model is Y|\boldsymbol{X}=\boldsymbol{x} is Poisson distributed with parameter

\mu(\boldsymbol{x}; \boldsymbol{\beta}) = \exp\left(\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right) .

The likelihood function is

L(\boldsymbol{\beta}) = \prod_{i=1}^n \frac{ \mu_i^{y_i} \exp(-\mu_i) }{y_i!} \Rightarrow \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \Bigl( -\mu_i + y_i \log(\mu_i) - \log(y_i!) \Bigr).

Poisson loss

The negative log-likelihood is

\text{NLL}(\boldsymbol{\beta}) = \sum_{i=1}^n \Bigl( \mu_i - y_i \log(\mu_i) + \log(y_i!) \Bigr) .

The Poisson loss is

\text{Poisson}(\boldsymbol{y}, \boldsymbol{\mu}) = \frac{1}{n} \sum_{i=1}^n \Bigl( \mu_i - y_i \log(\mu_i) \Bigr).
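Again as a sketch with made-up counts, this matches Keras’s built-in Poisson loss (up to a small epsilon Keras adds inside the logarithm):

y_true = np.array([0.0, 2.0, 1.0, 4.0])   # made-up observed counts
mu_hat = np.array([0.5, 1.8, 1.2, 3.0])   # made-up predicted means

manual_poisson = np.mean(mu_hat - y_true * np.log(mu_hat))
keras_poisson = keras.losses.poisson(y_true, mu_hat)
print(manual_poisson, float(keras_poisson))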

Gamma regression

A Gamma distribution with mean \mu and dispersion \phi has p.d.f. f(y; \mu, \phi) = \frac{(\mu \phi)^{-\frac{1}{\phi}}}{\Gamma\left(\frac{1}{\phi}\right)} y^{\frac{1}{\phi} - 1} \mathrm{e}^{-\frac{y}{\mu \phi}}

Our model is Y|\boldsymbol{X}=\boldsymbol{x} is Gamma distributed with a dispersion of \phi and a mean of \mu(\boldsymbol{x}; \boldsymbol{\beta}) = \exp\left(\left\langle \boldsymbol{\beta}, \boldsymbol{x} \right\rangle\right).

The likelihood function is L(\boldsymbol{\beta}) = \prod_{i=1}^n \frac{(\mu_i \phi)^{-\frac{1}{\phi}}}{\Gamma\left(\frac{1}{\phi}\right)} y_i^{\frac{1}{\phi} - 1} \exp\left(-\frac{y_i}{\mu_i \phi}\right)

\Rightarrow \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ -\frac{1}{\phi} \log(\mu_i \phi) - \log \Gamma\left(\frac{1}{\phi}\right) + \left(\frac{1}{\phi} - 1\right) \log(y_i) - \frac{y_i}{\mu_i \phi} \right].

Gamma loss

The negative log-likelihood is

\text{NLL}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ \frac{1}{\phi} \log(\mu_i \phi) + \log \Gamma\left(\frac{1}{\phi}\right) - \left(\frac{1}{\phi} - 1\right) \log(y_i) + \frac{y_i}{\mu_i \phi} \right].

Since \phi is a nuisance parameter, \text{NLL}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ \frac{1}{\phi} \log(\mu_i) + \frac{y_i}{\mu_i \phi} \right] + \text{const} \propto \sum_{i=1}^n \left[ \log(\mu_i) + \frac{y_i}{\mu_i} \right].

Note

As \log(\mu_i) = \log(y_i) - \log(y_i / \mu_i), we could write an alternative version \text{NLL}(\boldsymbol{\beta}) \propto \sum_{i=1}^n \left[ \log(y_i) - \log\Bigl(\frac{y_i}{\mu_i}\Bigr) + \frac{y_i}{\mu_i} \right] \propto \sum_{i=1}^n \left[ \frac{y_i}{\mu_i} - \log\Bigl(\frac{y_i}{\mu_i}\Bigr) \right].
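This alternative form is, up to an affine transformation, the Gamma deviance. A sketch with made-up values, using sklearn.metrics.mean_gamma_deviance for comparison:

from sklearn.metrics import mean_gamma_deviance

y = np.array([800.0, 1200.0, 2500.0, 3800.0])   # made-up observations
mu = np.array([900.0, 1500.0, 2000.0, 3000.0])  # made-up predicted means

nll_form = np.mean(y / mu - np.log(y / mu))
print(2 * (nll_form - 1))          # affine transform of the NLL form above
print(mean_gamma_deviance(y, mu))  # identical value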

All of these examples show the overlap between the world of probability and the world of statistical machine learning. They are both trying to solve the same problem, and in fact use the same methods to solve those problems.

Why do actuaries use GLMs?

  • GLMs are interpretable.
  • GLMs are flexible (can handle different types of response variables).
  • We get the full distribution of the response variable, not just the mean.

This last point is particularly important for analysing worst-case scenarios.

Stochastic Forecasts

Stock price forecasting

Code
def lagged_timeseries(df, target, window=30):
    lagged = pd.DataFrame()
    for i in range(window, 0, -1):
        lagged[f"T-{i}"] = df[target].shift(i)
    lagged["T"] = df[target].values
    return lagged


stocks = pd.read_csv("../Time-Series-And-Recurrent-Neural-Networks/aus_fin_stocks.csv")
stocks["Date"] = pd.to_datetime(stocks["Date"])
stocks = stocks.set_index("Date")
_ = stocks.pop("ASX200")
stock = stocks[["CBA"]]
stock = stock.ffill()

# Compute daily log returns
stock_log = np.log(stock / stock.shift(1)).dropna()

# Helper functions for converting log returns to prices
def log_to_price(log_returns, initial_price):
    cumulative_log_returns = log_returns.cumsum()
    return initial_price * np.exp(cumulative_log_returns)

def get_last_price(stock_df, cutoff_date):
    last_known_date = stock_df.loc[:cutoff_date].index[-1]
    return stock_df.loc[last_known_date, "CBA"]

# Create lagged features from log returns
df_lags = lagged_timeseries(stock_log, "CBA", 40)

# Split the data in time
X_train = df_lags.loc[:"2024"]
X_test = df_lags.loc["2025":]

# Remove any with NAs and split into X and y
X_train = X_train.dropna()
X_test = X_test.dropna()

y_train = X_train.pop("T")
y_test = X_test.pop("T")

lr = LinearRegression()
lr.fit(X_train, y_train);
Code
stocks.plot();

Code
stock_log.plot()
plt.ylabel("Daily Log Return")
plt.title("CBA Daily Log Returns");

Noisy auto-regressive forecast

Remember the auto-regressive forecast: based on the stock’s log return history up to yesterday, predict today’s stock price.

What about tomorrow’s stock price? \longrightarrow Assume today’s prediction actually occurred, and use that to predict tomorrow’s price.

This time, add noise to the forecasts.

def noisy_autoregressive_forecast(model, X_val, sigma, suppress=False):
    """
    Generate a multi-step forecast using the given model.
    """
    multi_step = pd.Series(index=X_val.index, name="Multi Step")

    # Initialize the input data for forecasting
    input_data = X_val.iloc[0].values.reshape(1, -1)

    for i in range(len(multi_step)):
        # Ensure input_data has the correct feature names
        input_df = pd.DataFrame(input_data, columns=X_val.columns)
        if suppress:
            next_value = model.predict(input_df, verbose=0)
        else:
            next_value = model.predict(input_df)

1        next_value += np.random.normal(0, sigma)

        multi_step.iloc[i] = next_value

        # Append that prediction to the input for the next forecast
        if i + 1 < len(multi_step):
            input_data = np.append(input_data[:, 1:], next_value).reshape(1, -1)

    return multi_step
1
Add random normal noise with a specified standard deviation to the prediction.

In this model, we assume that \hat{y}_{t+1} \sim \mathcal{N}(\mu_{t+1}, \sigma^2), where \mu_{t+1} is the model’s point prediction.

This is equivalent to \hat{y}_{t+1} = \mu_{t+1} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,\sigma^2)

which is what we see in the code.

Original forecast

lr_forecast = noisy_autoregressive_forecast(lr, X_test, 0)
last_price = get_last_price(stock, cutoff_date="2024-12")
price_forecast = log_to_price(lr_forecast, last_price)
Code
stock.loc[price_forecast.index, "AR Log-Return"] = price_forecast

def plot_forecasts(stock):
    stock.loc["2024-12":"2025"].plot()
    plt.axvline("2025", color="black", linestyle="--")
    plt.ylabel("Stock Price ($)")
    plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))

plot_forecasts(stock)

residuals = y_train.loc["2015":] - lr.predict(X_train.loc["2015":])
sigma = np.std(residuals)

With noise

np.random.seed(1)
lr_noisy_forecast = noisy_autoregressive_forecast(lr, X_test, sigma)
price_noisy = log_to_price(lr_noisy_forecast, last_price)
Code
stock.loc[price_noisy.index, "AR Noisy Log-Return"] = price_noisy
plot_forecasts(stock)

This noisy forecast seems much more realistic than the non-noisy AR forecast.

With noise

np.random.seed(2)
lr_noisy_forecast = noisy_autoregressive_forecast(lr, X_test, sigma)
price_noisy = log_to_price(lr_noisy_forecast, last_price)
Code
stock.loc[price_noisy.index, "AR Noisy Log-Return"] = price_noisy
plot_forecasts(stock)

With noise

np.random.seed(3)
lr_noisy_forecast = noisy_autoregressive_forecast(lr, X_test, sigma)
price_noisy = log_to_price(lr_noisy_forecast, last_price)
Code
stock.loc[price_noisy.index, "AR Noisy Log-Return"] = price_noisy
plot_forecasts(stock)

Many noisy forecasts

num_forecasts = 300
forecasts = []
for i in range(num_forecasts):
    log_sim = noisy_autoregressive_forecast(lr, X_test, sigma)
    price_sim = log_to_price(log_sim, last_price)
    forecasts.append(price_sim)
noisy_forecasts = pd.concat(forecasts, axis=1)
noisy_forecasts.index = X_test.index
Code
noisy_forecasts.loc["2024-12":"2025"].plot(legend=False, alpha=0.4)
plt.ylabel("Stock Price");

This plot contains 300 simulated forecasts of the stock price over time. We see that the variance of the predictions increases with time. This makes sense, as we are accumulating the random components over time.

95% “prediction intervals”

# Calculate quantiles for the forecasts
lower_quantile = noisy_forecasts.quantile(0.025, axis=1)
upper_quantile = noisy_forecasts.quantile(0.975, axis=1)
mean_forecast = noisy_forecasts.mean(axis=1)
Code
# Plot the mean forecast
plt.figure(figsize=(8, 3))

plt.plot(stock.loc["2024-12":"2025"].index, stock.loc["2024-12":"2025"]["CBA"], label="CBA")

plt.plot(mean_forecast, label="Mean")

# Plot the quantile-based shaded area
plt.fill_between(mean_forecast.index, 
                 lower_quantile, 
                 upper_quantile, 
                 color="grey", alpha=0.2)

# Plot settings
plt.axvline(pd.Timestamp("2025-01-01"), color="black", linestyle="--")
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.tight_layout();

The simulations give a 95% prediction interval for the stock price in six months’ time of roughly $120 to $200.

Residuals

y_pred = lr.predict(X_train)
residuals = y_train - y_pred
residuals -= np.mean(residuals)
residuals /= np.std(residuals)
stats.shapiro(residuals)
/Users/z3535837/miniforge3/envs/ai/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py:531: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 6582.
ShapiroResult(statistic=np.float64(0.9305463633130846), pvalue=np.float64(5.938320016172636e-48))
Code
plt.hist(residuals, bins=60, density=True)
x = np.linspace(-3, 3, 100)
plt.xlim(-3, 3)
plt.plot(x, stats.norm.pdf(x, 0, 1));

How do we choose the value for sigma? Here, sigma was chosen to be the standard deviation of the residuals of the original AR (non-noisy) forecast model.

The plot above shows the histogram of the standardised residuals and the standard normal curve. Based on this plot and the Shapiro test, it appears that the residuals are not normally distributed.

Q-Q plot and P-P plot

Code
sm.qqplot(residuals, line="45");

Code
sm.ProbPlot(residuals).ppplot(line="45");

The distribution of the residuals has heavier tails than the standard normal distribution.

Residuals against time

Code
plt.plot(y_train.index, residuals)
plt.xlabel("Date")
plt.ylabel("Standardised Residuals")
plt.tight_layout();

Heteroskedasticity!

The variance of the standardised residuals changes over time. Higher variance occurs during significantly volatile economic periods (e.g. the global financial crisis in 2007-2009 and COVID).

Retrain on less data

Code
# Drop 2008-2009 and also 2020-2021
mask = ~((y_train.index.year >= 2008) & (y_train.index.year <= 2009)) & ~((y_train.index.year >= 2020) & (y_train.index.year <= 2021))
X2 = X_train.loc[mask]
y2 = y_train.loc[mask]

lr2 = LinearRegression()
lr2.fit(X2, y2)

res2 = y2 - lr2.predict(X2)
res2 = (res2 - np.mean(res2)) / np.std(res2)

plt.plot(y2.index, res2)
plt.xlabel("Date")
plt.ylabel("Standardised Residuals")
plt.tight_layout();

Refit the model without these crisis years.

Residual diagnostics

stats.shapiro(res2)
/Users/z3535837/miniforge3/envs/ai/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py:531: UserWarning:

scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 5565.
ShapiroResult(statistic=np.float64(0.9797545738507005), pvalue=np.float64(1.8584817751011823e-27))
Code
plt.hist(res2, bins=60, density=True)
x = np.linspace(-3, 3, 100)
plt.xlim(-3, 3)
plt.plot(x, stats.norm.pdf(x, 0, 1));

Code
sm.qqplot(res2, line="45");

Code
sm.ProbPlot(res2).ppplot(line="45");

While the distribution looks closer, it is still significantly far from normally distributed. At this point, we should question the validity of the normal distribution assumption we made.

GLMs and Neural Networks

French motor claim sizes

As freMTPL2sev just has Policy ID & severity, we merge with freMTPL2freq which has Policy ID, # Claims, and other covariates.

sev = pd.read_csv('freMTPL2sev.csv')
cov = pd.read_csv('freMTPL2freq.csv').drop(columns=['ClaimNb'])
1sev = pd.merge(sev, cov, on='IDpol', how='left').drop(columns=["IDpol"]).dropna()
sev
1
Merges the severity dataframe sev with the covariates in cov by matching on the IDpol column. Setting how='left' keeps every row of the left dataset sev and brings in the matching covariates from cov. The policy ID column and any rows with missing values are then dropped.
ClaimAmount Exposure VehPower VehAge DrivAge BonusMalus VehBrand VehGas Area Density Region
0 995.20 0.59 11.0 0.0 39.0 56.0 B12 Diesel D 778.0 Picardie
1 1128.12 0.95 4.0 1.0 49.0 50.0 B12 Regular E 2354.0 Ile-de-France
... ... ... ... ... ... ... ... ... ... ... ...
26637 767.55 0.43 6.0 0.0 67.0 50.0 B2 Diesel C 142.0 Languedoc-Roussillon
26638 1500.00 0.28 7.0 2.0 36.0 60.0 B12 Diesel D 1732.0 Rhone-Alpes

26444 rows × 11 columns

Preprocessing

X_train, X_test, y_train, y_test = train_test_split(
1  sev.drop("ClaimAmount", axis=1), sev["ClaimAmount"], random_state=2023)
ct = make_column_transformer((OrdinalEncoder(), ["Area", "VehGas"]),
2    ("drop", ["VehBrand", "Region"]), remainder=StandardScaler())
3X_train = ct.fit_transform(X_train)
4X_test = ct.transform(X_test)
plt.hist(y_train[y_train < 5000], bins=30);
1
Split the data into train and test sets
2
Apply ordinal encoding to Area and VehGas variables, and apply standard scaling to all remaining numerical variables. To simplify things, VehBrand and Region variables are dropped from the dataframe.
3
Fit the column transformer to the train set and apply it
4
Apply the column transformer to the test set

Plotting the empirical distribution of the target variable ClaimAmount helps us understand the inherent variability in the data. The distribution appears to be multi-modal.

Doesn’t prove that Y | \boldsymbol{X} = \boldsymbol{x} is multimodal

Code
# Make some example where the distribution is multimodal because of a binary covariate which separates the means of the two distributions
np.random.seed(1)

fig, axes = plt.subplots(3, 1, figsize=(5.0, 3.0), sharex=True)

x_min = 0
x_max = y_train.max()
x_grid = np.linspace(x_min, x_max, 100)

# Simulate some data from an exponential distribution which has Pr(X < 1000) = 0.9
n = 100
p = 0.1
lambda_ = -np.log(p) / 1000 
mu = 1 / lambda_
y_1 = np.random.exponential(scale=mu, size=n)

# Pick a truncated normal distribution with a mean of 1100 and std of 250 (truncated to be positive)
mu = 1100
sigma = 100
y_2 = stats.truncnorm.rvs((0 - mu) / sigma, (np.inf - mu) / sigma, loc=mu, scale=sigma, size=n)

# Combine y_1 and y_2 for the final histogram
y = np.concatenate([y_1, y_2])

# Determine common bins
bins = np.histogram_bin_edges(y, bins=30)


# Plot each normal distribution with different means vertically
for i, ax in enumerate(axes):
    if i == 0:
        ax.hist(y_1, bins=bins, density=True, color=colors[i+1])
        ax.set_ylabel(f'$f(y | x = 1)$')

    elif i == 1:
        ax.hist(y_2, bins=bins, density=True, color=colors[i+1])
        ax.set_ylabel(f'$f(y | x = 2)$')

    else:
        ax.hist(y, bins=bins, density=True)
        ax.set_ylabel(f'$f(y)$')

plt.tight_layout();

While the distribution of claim sizes across the entire dataset appears multimodal, that does not mean the distribution of claim sizes for a particular type of customer is multimodal. Two different categories of customers can each have a unimodal distribution, and the combination (“mixture”) of the two categories can be multimodal.

The following section illustrates how embedding a GLM in a neural network architecture helps us quantify the uncertainty around the network’s predictions. The idea is to first fit a GLM, then combine the GLM’s predictions with the neural network’s output inside a custom loss function. This embedding also gives us a way to compute a dispersion parameter \phi_{\text{CANN}} for the neural network.

The idea of a GLM is to take a linear combination of the covariates \boldsymbol{x} and coefficients \boldsymbol{\beta}, apply a non-linear transformation g^{-1} to that linear combination, and set the result equal to the conditional mean of the response variable Y given an instance \boldsymbol{x}. The non-linear transformation provides added flexibility.

Gamma GLM

Suppose a fitted Gamma GLM model has

  • a log link function g(x)=\log(x) and
  • regression coefficients \boldsymbol{\beta}=(\beta_0, \beta_1, \beta_2, \beta_3).

Then, it estimates the conditional mean of Y given a new instance \boldsymbol{x}=(1, x_1, x_2, x_3) as follows: \mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = g^{-1}(\langle \boldsymbol{\beta}, \boldsymbol{x}\rangle) = \exp\big(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \big).

A GLM can model any other exponential family distribution using an appropriate link function g.

Gamma GLM loss

If Y|\boldsymbol{X}=\boldsymbol{x} is a Gamma r.v. with mean \mu(\boldsymbol{x}; \boldsymbol{\beta}) and dispersion parameter \phi, we can minimise the negative log-likelihood (NLL) \text{NLL} \propto \sum_{i=1}^{n}\log \mu (\boldsymbol{x}_i; \boldsymbol{\beta})+\frac{y_i}{\mu (\boldsymbol{x}_i; \boldsymbol{\beta})} + \text{const}, i.e., we ignore the dispersion parameter \phi while estimating the regression coefficients.

Fitting Steps

Step 1. Use an iterative second-derivative method (e.g. Fisher scoring / IRLS) to find the regression coefficients: \boldsymbol{\beta}^* = \underset{\boldsymbol{\beta}}{\text{arg\,min}} \ \sum_{i=1}^{n}\log \mu (\boldsymbol{x}_i; \boldsymbol{\beta})+\frac{y_i}{\mu (\boldsymbol{x}_i; \boldsymbol{\beta})}

Step 2. Estimate the dispersion parameter: \phi = \frac{1}{n-p}\sum_{i=1}^{n}\frac{\bigl(y_i-\mu(\boldsymbol{x}_i; \boldsymbol{\beta}^*)\bigr)^2}{\mu(\boldsymbol{x}_i; \boldsymbol{\beta}^* )^2}

(Here, p is the number of coefficients in the model. If this p doesn’t include the intercept, then the scaling should be \frac{1}{n-(p+1)}.)

Code: Gamma GLM

In Python, we can fit a Gamma GLM as follows:

import statsmodels.api as sm

# Add a column of ones to include an intercept in the model
X_train_design = sm.add_constant(X_train)

# Create a Gamma GLM with a log link function
gamma_glm = sm.GLM(y_train, X_train_design,                   
            family=sm.families.Gamma(sm.families.links.Log()))

# Fit the model
gamma_glm = gamma_glm.fit()
gamma_glm.params
const                    7.786576
ordinalencoder__Area    -0.073226
                           ...   
remainder__BonusMalus    0.157204
remainder__Density       0.010539
Length: 9, dtype: float64
# Dispersion Parameter
mus = gamma_glm.predict(X_train_design)
residuals = y_train - mus
dof = (len(y_train)-X_train_design.shape[1])
phi_glm = np.sum(residuals**2/mus**2)/dof
print(phi_glm)
59.633631237358316

The above example of fitting a Gamma distribution assumes a constant dispersion, meaning that the dispersion of the claim amount is the same for all policyholders. If we believe the constant-dispersion assumption is too strong, we can use a double GLM. Fitting a GLM is the traditional way of modelling a claim amount.

ANN can feed into a GLM

Combining GLM & ANN.

By zooming in on the last hidden layer of the NN and the output neuron, we are essentially looking at a GLM. The last hidden layer of neurons plays the role of the input “covariates”. The value of the output neuron before applying the activation function is the linear predictor, a linear combination of those covariates using the fitted weights and bias. The activation function (the inverse link function) converts the linear predictor into the prediction \hat{y}. The only difference is that the inputs are not the raw data values; they are processed versions of the data. We can then view the earlier part of the NN as an automated feature engineering process.

Combined Actuarial Neural Network

CANN

The Combined Actuarial Neural Network is a novel actuarial neural network architecture proposed by Schelldorfer and Wüthrich (2019). We summarise the CANN approach as follows:

  • Find coefficients \boldsymbol{\beta} of the GLM with a link function g(\cdot).
  • Find weights \boldsymbol{w} = (\boldsymbol{w}^{(1)}, \dots, \boldsymbol{w}^{(K+1)}) of a neural network s:\mathbb{R}^{p}\to\mathbb{R}.
  • Given a new instance \boldsymbol{x}, we have

\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = g^{-1}\Big( \langle\boldsymbol{\beta}, \boldsymbol{x}\rangle + s(\boldsymbol{x};\boldsymbol{w})\Big).

If just a sequential dense network, then

\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = g^{-1}\Big( \langle\boldsymbol{\beta}, \boldsymbol{x}\rangle + \Big\langle \boldsymbol{w}^{(K+1)}, \bigl( \boldsymbol{z}^{(K)} \circ \dots \circ \boldsymbol{z}^{(1)}\bigr)(\boldsymbol{x}) \Big\rangle \Big).

The neural network is shifting the predictions given by the GLM.

Shifting the predicted distributions

Code
# Ensure reproducibility
random.seed(1)

# Make a 4x1 grid of plots
fig, axes = plt.subplots(4, 1, figsize=(5.0, 3.0), sharex=True)

# Define the x-axis
x_min = 0
x_max = 5000
x_grid = np.linspace(x_min, x_max, 100)

# Plot a few Gamma distribution pdfs with different means.
# Then plot Gamma distributions with shifted means and the same dispersion parameter.
glm_means = [1000, 3000, 2000, 4000]
cann_means = [1500, 1400, 3000, 5000]
for i, ax in enumerate(axes):
    ax.plot(x_grid, stats.gamma.pdf(x_grid, a=2, scale=glm_means[i]/2), label=f'GLM')
    ax.plot(x_grid, stats.gamma.pdf(x_grid, a=2, scale=cann_means[i]/2), label=f'CANN')
    ax.set_ylabel(f'$f(y | x_{i+1})$')
    if i == 0:
        ax.legend(["GLM", "CANN"], loc="upper right", ncol=2)

While the assumed distribution of the target variable remains the same, the neural network shifts the parameters of the distribution (e.g. mean and variance) to give a new shape under the CANN.

Architecture

Building CANNs is relatively straightforward.

The CANN architecture.

One copy of the variables is processed through many dense hidden layers (NN component), while a second copy skips the layers and gets combined directly into the generation of the output neuron (GLM component).

Code: Architecture

random.seed(1)
inputs = Input(shape=X_train.shape[1:])

# GLM part (won't be updated during training)
glm_weights = gamma_glm.params.iloc[1:].values.reshape((-1, 1))
1glm_bias = gamma_glm.params.iloc[0]
glm_part = Dense(1, activation='linear', trainable=False,
                     kernel_initializer=Constant(glm_weights),
2                     bias_initializer=Constant(glm_bias))(inputs)

# Neural network part
x = Dense(64, activation='leaky_relu')(inputs)                          
3nn_part = Dense(1, activation='linear')(x)

# Combine GLM and CANN estimates
4mu = keras.ops.exp(glm_part + nn_part)
cann = Model(inputs, mu)
1
Fit a GLM directly to the train data, and extract the fitted coefficients.
2
Add a Dense layer with just one neuron, to store the model output (before inverse link function) from the GLM. The linear activation is used to make sure that the output is a linear combination of inputs. The weights are set to be non-trainable, hence the values obtained during GLM fitting will not change during the neural network training process. kernel_initializer=Constant(glm_weights) and bias_initializer=Constant(glm_bias) ensures that weights are initialized with the optimal values estimated from GLM fit.
3
Specify the neural network layers (1 dense hidden layer and the output neuron).
4
Add the GLM contribution to the neural network output and exponentiate to get the mean estimate.

Since this CANN predicts Gamma distributions, we use the Gamma NLL loss function.

def cann_negative_log_likelihood(y_true, y_pred):
    return keras.ops.mean(keras.ops.log(y_pred) + y_true/y_pred)

Code: Model Training

1cann.compile(optimizer="adam", loss=cann_negative_log_likelihood)
hist = cann.fit(X_train, y_train,
    epochs=100, 
    callbacks=[EarlyStopping(patience=10)],  
    verbose=0,
    batch_size=64, 
2    validation_split=0.2)
1
Compiles the model with adam optimizer and the custom loss function
2
Fits the model (with a validation split defined inside the fit function)

Find the dispersion parameter.

mus = cann.predict(X_train, verbose=0).flatten()
residuals = y_train - mus
dof = (len(y_train)-(X_train.shape[1] + 1))
phi_cann = np.sum(residuals**2/mus**2) / dof
print(phi_cann)
31.04317327177946

Mixture Density Network

Mixture Distribution

One intuitive way to capture uncertainty with neural networks is to estimate the parameters of the target distribution instead of predicting the value itself. For example, suppose we want to predict y coming from a Gaussian distribution. The most common approach is to predict \hat{y} directly using a single neuron in the output layer. An alternative is to estimate the parameters (\mu and \sigma) of the distribution of y using two neurons in the output layer. Estimating the parameters of the distribution rather than a point estimate of y gives us an idea of the uncertainty. However, assuming a particular distributional form can be too restrictive; for example, the actual distribution of y may be bimodal or multimodal. In such situations, assuming a mixture distribution is more natural.

Define a finite set of weights \boldsymbol{\pi}=(\pi_{1}, \ldots, \pi_{K}) such that \pi_k \ge 0 for k \in \{1, \ldots, K\} and \sum_{k=1}^{K}\pi_k=1, and draw a component indicator from a \text{Multinomial}(1, \boldsymbol{\pi}) distribution. Given component random variables Y_1, \ldots, Y_{K}, the mixture Y takes the value Y_k when component k is drawn, i.e., Y = \begin{cases} Y_1 & \text{w.p. } \pi_1, \\ \vdots & \vdots\\ Y_K & \text{w.p. } \pi_K. \end{cases}

Mixture Distribution

Let f_k and F_k be the p.d.f. and the c.d.f of Y_k|\boldsymbol{X} for all k \in \{1, \ldots, K\}.

The random variable Y|\boldsymbol{X}, which mixes Y_k|\boldsymbol{X}’s with weights \pi_k’s, has the density function f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) = \sum_{k=1}^{K}\pi_k(\boldsymbol{x}) f_{k}(y|\boldsymbol{x}), and the cumulative distribution function F_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) = \sum_{k=1}^{K}\pi_k(\boldsymbol{x}) F_{k}(y|\boldsymbol{x}).
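As a small sketch with made-up parameters, these formulas can be evaluated directly with scipy for a two-component Gamma mixture:

pis = np.array([0.4, 0.6])                 # made-up mixing weights
comp_1 = stats.gamma(a=2.0, scale=500.0)   # made-up component 1
comp_2 = stats.gamma(a=8.0, scale=400.0)   # made-up component 2

y_val = 2000.0
mixture_pdf = pis[0] * comp_1.pdf(y_val) + pis[1] * comp_2.pdf(y_val)
mixture_cdf = pis[0] * comp_1.cdf(y_val) + pis[1] * comp_2.cdf(y_val)
print(mixture_pdf, mixture_cdf)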

Mixture Density Network

A mixture density network (MDN) \text{MDN}(\boldsymbol{x}) outputs each distribution component’s mixing weights and parameters of Y given the input features \boldsymbol{x}, i.e., \text{MDN}(\boldsymbol{x})=(\boldsymbol{\pi}(\boldsymbol{x};\boldsymbol{w}^*), \boldsymbol{\theta}(\boldsymbol{x};\boldsymbol{w}^*)), where \boldsymbol{w}^* is the networks’ weights found by minimising the following negative log-likelihood loss function, i.e., \boldsymbol{w}^* = \underset{\boldsymbol{w}}{\text{arg\,min}} \ \mathcal{L}(\mathcal{D}, \boldsymbol{w})= - \sum_{i=1}^{n} \log f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}, \boldsymbol{w}), where \mathcal{D}=\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{n} is the training dataset.

Mixture Density Network

An MDN that outputs the parameters for a K component mixture distribution. \boldsymbol{\theta}_k(\boldsymbol{x}; \boldsymbol{w}^*)= (\theta_{k,1}(\boldsymbol{x}; \boldsymbol{w}^*), \ldots, \theta_{k,|\boldsymbol{\theta}_k|}(\boldsymbol{x}; \boldsymbol{w}^*)) consists of the parameter estimates for the kth mixture component.

Model Specification

Suppose there are two types of claims:

  • Type I: Y_1|\boldsymbol{X}=\boldsymbol{x}\sim \text{Gamma}(\alpha_1(\boldsymbol{x}), \beta_1(\boldsymbol{x})) and,
  • Type II: Y_2|\boldsymbol{X}=\boldsymbol{x}\sim \text{Gamma}(\alpha_2(\boldsymbol{x}), \beta_2(\boldsymbol{x})).

The density of the actual claim amount Y|\boldsymbol{X}=\boldsymbol{x} follows \begin{align*} f_{Y|\boldsymbol{X}}(y|\boldsymbol{x}) &= \pi_1(\boldsymbol{x})\cdot \frac{\beta_1(\boldsymbol{x})^{\alpha_1(\boldsymbol{x})}}{\Gamma(\alpha_1(\boldsymbol{x}))}\mathrm{e}^{-\beta_1(\boldsymbol{x})y}y^{\alpha_1(\boldsymbol{x})-1} \\ &\quad + (1-\pi_1(\boldsymbol{x}))\cdot \frac{\beta_2(\boldsymbol{x})^{\alpha_2(\boldsymbol{x})}}{\Gamma(\alpha_2(\boldsymbol{x}))}\mathrm{e}^{-\beta_2(\boldsymbol{x})y}y^{\alpha_2(\boldsymbol{x})-1}. \end{align*} where \pi_1(\boldsymbol{x}) is the probability of a Type I claim given \boldsymbol{x}.

Output

The aim is to find the optimum weights \boldsymbol{w}^* = \underset{w}{\text{arg\,min}} \ \mathcal{L}(\mathcal{D}, \boldsymbol{w}) for the Gamma mixture density network \text{MDN}(\boldsymbol{x}) that outputs the mixing weights, shapes and scales of Y given the input features \boldsymbol{x}, i.e., \begin{align*} \text{MDN}(\boldsymbol{x}) = ( &\pi_1(\boldsymbol{x}; \boldsymbol{w}^*), \pi_2(\boldsymbol{x}; \boldsymbol{w}^*), \\ &\alpha_1(\boldsymbol{x}; \boldsymbol{w}^*), \alpha_2(\boldsymbol{x}; \boldsymbol{w}^*), \\ &\beta_1(\boldsymbol{x}; \boldsymbol{w}^*), \beta_2(\boldsymbol{x}; \boldsymbol{w}^*) ). \end{align*}

Architecture

We demonstrate the structure of a Gamma MDN that outputs the parameters for a Gamma mixture with two components.

Code: Architecture

The following code implements the gamma MDN from the previous slide.

1import torch
import torch.nn as nn

random.seed(1)
torch.manual_seed(1)

2class GammaMDN(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, n_components=2):
        super().__init__()
        self.hidden = nn.Sequential(
3            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
4        self.pi_layer = nn.Linear(hidden_dim, n_components)
        self.alpha_layer = nn.Linear(hidden_dim, n_components)
        self.beta_layer = nn.Linear(hidden_dim, n_components)

    def forward(self, x):
        h = self.hidden(x)
        pis = torch.softmax(self.pi_layer(h), dim=-1)
        alphas = torch.exp(self.alpha_layer(h))
        betas = torch.exp(self.beta_layer(h))
5        return pis, alphas, betas

gamma_mdn = GammaMDN(X_train.shape[1])
1
Import PyTorch
2
Defines the MDN as a PyTorch nn.Module
3
Specifies the hidden layers of the neural network
4
Specifies the output heads. softmax is used for \pi values as they must sum to 1. exp is used for \alpha’s and \beta’s as they must be positive.
5
Returns the mixing weights, shape parameters, and rate parameters separately.

Loss Function

The negative log-likelihood loss function is given by

\mathcal{L}(\mathcal{D}, \boldsymbol{w}) = - \frac{1}{n} \sum_{i=1}^{n} \log \ f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}, \boldsymbol{w}) where the f_{Y|\boldsymbol{X}}(y_i|\boldsymbol{x}, \boldsymbol{w}) is defined by \begin{align*} &\pi_1(\boldsymbol{x};\boldsymbol{w})\cdot \frac{\beta_1(\boldsymbol{x};\boldsymbol{w})^{\alpha_1(\boldsymbol{x};\boldsymbol{w})}}{\Gamma(\alpha_1(\boldsymbol{x};\boldsymbol{w}))}\mathrm{e}^{-\beta_1(\boldsymbol{x};\boldsymbol{w})y}y^{\alpha_1(\boldsymbol{x};\boldsymbol{w})-1} \\ & \quad + (1-\pi_1(\boldsymbol{x};\boldsymbol{w}))\cdot \frac{\beta_2(\boldsymbol{x};\boldsymbol{w})^{\alpha_2(\boldsymbol{x};\boldsymbol{w})}}{\Gamma(\alpha_2(\boldsymbol{x};\boldsymbol{w}))}\mathrm{e}^{-\beta_2(\boldsymbol{x};\boldsymbol{w})y}y^{\alpha_2(\boldsymbol{x};\boldsymbol{w})-1} \end{align*}

This loss function is already quite involved, and for more complicated mixture distributions writing and implementing the loss by hand becomes impractical.

Code: Loss & training

torch.distributions to the rescue.

torch.distributions does all the hard work of mathematically describing the mixture distribution.

1from torch.distributions import Gamma, MixtureSameFamily, Categorical

2def gamma_mixture_nll(pis, alphas, betas, y_true):
    mixture = MixtureSameFamily(
        mixture_distribution=Categorical(probs=pis),
3        component_distribution=Gamma(concentration=alphas, rate=betas))
4    return -mixture.log_prob(y_true).mean()
1
Import distribution classes from torch.distributions
2
The loss function takes the MDN outputs (mixing weights, shapes, rates) and observed values
3
Specify the mixture distribution using the predicted model components
4
Calculate the mean negative log-likelihood given the observed data
# Convert data to PyTorch tensors
X_train_t = torch.tensor(X_train.values, dtype=torch.float32)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32)

# Split into train and validation (80/20)
n_val = int(0.2 * len(X_train_t))
X_val_mdn, y_val_mdn = X_train_t[-n_val:], y_train_t[-n_val:]
X_trn_mdn, y_trn_mdn = X_train_t[:-n_val], y_train_t[:-n_val]

1optimizer = torch.optim.Adam(gamma_mdn.parameters())
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X_trn_mdn, y_trn_mdn),
    batch_size=64, shuffle=True)

best_val_loss, patience_counter, best_state = float('inf'), 0, None

2for epoch in range(100):
    gamma_mdn.train()
    for xb, yb in loader:
        pis, alphas, betas = gamma_mdn(xb)
        loss = gamma_mixture_nll(pis, alphas, betas, yb)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    gamma_mdn.eval()
    with torch.no_grad():
        pis_v, alphas_v, betas_v = gamma_mdn(X_val_mdn)
        val_loss = gamma_mixture_nll(pis_v, alphas_v, betas_v, y_val_mdn)

3    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = {k: v.clone() for k, v in gamma_mdn.state_dict().items()}
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= 10: break

gamma_mdn.load_state_dict(best_state)
1
Uses the Adam optimiser
2
Fits the model with a manual training loop, evaluating on a validation split each epoch
3
Early stopping with patience of 10 epochs, restoring the best weights

Metrics for Distributional Regression

How do we assess the distributional regression model?

Proper Scoring Rules

Proper scoring rules provide a summary measure for the performance of the probabilistic predictions. They are useful in comparing performances across models.

Definition

A scoring rule is the equivalent of a loss function for distributional regression.

Denote S(F, y) to be the score given to the forecasted distribution F and an observation y \in \mathbb{R}.

Definition

A scoring rule is called proper if \mathbb{E}_{Y \sim Q} S(Q, Y) \le \mathbb{E}_{Y \sim Q} S(F, Y) for all F and Q distributions.

It is called strictly proper if equality holds only if F = Q.

Note

Y \sim Q: the true distribution of Y is Q.

Example Proper Scoring Rules

Logarithmic Score (NLL)

The logarithmic score is defined as \mathrm{LogS}(f, y) = - \log f(y), where f is the predictive density.

Continuous Ranked Probability Score (CRPS)

The continuous ranked probability score is defined as \mathrm{crps}(F, y) = \int_{-\infty}^{\infty} (F(t) - {1}_{t\ge y})^2 \ \mathrm{d}t, where F is the predicted c.d.f.
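A minimal sketch of the CRPS, computed by numerically integrating the definition for a normal predictive distribution (values are made up):

from scipy.integrate import quad

def crps_normal(mu, sigma, y):
    integrand = lambda t: (stats.norm.cdf(t, mu, sigma) - (t >= y)) ** 2
    # Split the integral at y, where the indicator function jumps
    left, _ = quad(integrand, mu - 10 * sigma, y)
    right, _ = quad(integrand, y, mu + 10 * sigma)
    return left + right

print(crps_normal(mu=0.0, sigma=1.0, y=0.5))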

Likelihoods

Code
y_pred = np.polyval(coefficients, X_toy[:4])
y_pred[2] *= 1.1
sigma_preds = sigma_toy * np.array([1.0, 3.0, 0.5, 0.5])
fig, axes = plt.subplots(1, 4, figsize=(5.0, 3.0), sharey=True)

x_min = y_pred[:4].min() - 4*sigma_toy
x_max = y_pred[:4].max() + 4*sigma_toy
x_grid = np.linspace(x_min, x_max, 100)

# Plot each normal distribution with different means vertically
for i, ax in enumerate(axes):
    y_grid = stats.norm.pdf(x_grid, y_pred[i], sigma_preds[i])
    ax.plot(x_grid, y_grid)
    ax.plot([y_toy[i], y_toy[i]], [0, stats.norm.pdf(y_toy[i], y_pred[i], sigma_preds[i])], color='red', linestyle='--')
    ax.scatter([y_toy[i]], [stats.norm.pdf(y_toy[i], y_pred[i], sigma_preds[i])], color='red', zorder=10)
    ax.set_title(f'$f(y ; \\boldsymbol{{x}}_{{{i+1}}})$')
    ax.set_xticks([y_pred[i]], labels=[r'$\mu_{' + str(i+1) + r'}$'])
    # ax.set_ylim(0, 0.25)

    # Turn off the y axes
    ax.yaxis.set_visible(False)
    
plt.tight_layout();

In this example, the observed value is marked in red and each curve is the predicted density. For overconfident predictions (low variance), we can be rewarded greatly if we are right (\mu_4), but also penalised heavily if the confident prediction is wrong (\mu_3). For underconfident predictions (high variance), the reward does not change much between observations, and we are penalised for not being confident enough.

When fitting the model by trying to maximise the likelihood, you want to balance the trade-off between making an accurate prediction and making a confident prediction.

Code: NLL

def gamma_nll(mean, dispersion, y):
    # Calculate shape and scale parameters from mean and dispersion
    shape = 1 / dispersion; scale = mean * dispersion

    # Create a gamma distribution object
    gamma_dist = stats.gamma(a=shape, scale=scale)
    
    return -np.mean(gamma_dist.logpdf(y))

# GLM
X_test_design = sm.add_constant(X_test)
mus = gamma_glm.predict(X_test_design)
nll_glm = gamma_nll(mus, phi_glm, y_test)

# CANN
mus = cann.predict(X_test, verbose=0).flatten()
nll_cann = gamma_nll(mus, phi_cann, y_test)

# MDN
gamma_mdn.eval()
with torch.no_grad():
    X_test_t = torch.tensor(X_test.values, dtype=torch.float32)
    y_test_t = torch.tensor(y_test.values, dtype=torch.float32)
    pis, alphas, betas = gamma_mdn(X_test_t)
    nll_mdn = gamma_mixture_nll(pis, alphas, betas, y_test_t).item()
print(f'GLM: {round(nll_glm, 2)}')
print(f'CANN: {round(nll_cann, 2)}')
print(f'MDN: {round(nll_mdn, 2)}')
GLM: 11.02
CANN: 10.44
MDN: 8.67

The above results show that MDN provides the lowest value for the Logarithmic Score (NLL). Low values for NLL indicate better fits. One possible reason for the better performance of the MDN model (compared to the Gamma model) is the added flexibility from multiple modelling components. The multiple modelling components in the MDN model, together, can capture the inherent variation in the data better.

Aleatoric and Epistemic Uncertainty

Uncertainty in deep learning refers to the level of doubt one would have about the predictions made by an AI-driven algorithm. Identifying and quantifying different sources of uncertainty that could exist in AI-driven algorithms is therefore important to ensure a credible application.

Categories of uncertainty

There are two major categories of uncertainty in statistical or machine learning:

  • Aleatoric uncertainty: the inherent variability associated with the data generating process.
  • Epistemic uncertainty: the lack of knowledge, limited data information, parameter errors and model errors.

Sources of uncertainty

There are many sources of uncertainty in statistical or machine learning models.

  • Parameter error stems primarily from a lack of data.
  • Model error stems from assuming wrong distributional properties of the data.
  • Data uncertainty arises due to the lack of confidence we may have about the quality of the collected data. Noisy data, inconsistent data, data with missing values or data with missing important variables can result in data uncertainty.

If you decide to predict the claim amount of an individual using a deep learning model, which source(s) of uncertainty are you dealing with?

  1. The inherent variability of the data-generating process \rightarrow aleatoric uncertainty.
  2. Parameter error \rightarrow epistemic uncertainty.
  3. Model error \rightarrow epistemic uncertainty.
  4. Data uncertainty \rightarrow epistemic uncertainty.

Traditional regularisation

Regularisation is performed when fitting model parameters to avoid overfitting to the train data.

Say all the m weights (excluding biases) are in the vector \boldsymbol{\theta}. If we change the loss function to \text{Loss}_{1:n} = \frac{1}{n} \sum_{i=1}^n \text{Loss}_i + \lambda \sum_{j=1}^{m} \left| \theta_j \right|

this would be using L^1 regularisation. A loss like

\text{Loss}_{1:n} = \frac{1}{n} \sum_{i=1}^n \text{Loss}_i + \lambda \sum_{j=1}^{m} \theta_j^2

is called L^2 regularisation.

Regularisation in Keras

We can perform regularisation in Keras with the same intuition as for the context of regression.

Code
from sklearn.datasets import fetch_california_housing
features, target = fetch_california_housing(as_frame=True, return_X_y=True)

NUM_FEATURES = len(features.columns)

X_main, X_test, y_main, y_test = train_test_split(
    features, target, test_size=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_main, y_main, test_size=0.25, random_state=1
)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)
from keras.regularizers import L1, L2

def l1_model(regulariser_strength=0.01):
  random.seed(123)
  model = Sequential([
      Dense(30, activation="leaky_relu",
        kernel_regularizer=L1(regulariser_strength)),
      Dense(1, activation="exponential")
  ])

  model.compile("adam", "mse")
  model.fit(X_train_sc, y_train, epochs=4, verbose=0)
  return model

def l2_model(regulariser_strength=0.01):
  random.seed(123)
  model = Sequential([
      Dense(30, activation="leaky_relu",
        kernel_regularizer=L2(regulariser_strength)),
      Dense(1, activation="exponential")
  ])

  model.compile("adam", "mse")
  model.fit(X_train_sc, y_train, epochs=10, verbose=0)
  return model

This code defines functions that specify and fit a NN under the two regularisation methods.

Weights before & after L^2

model = l2_model(0.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 0

model = l2_model(1.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 0

Many of the weights are much closer to 0 (but never exactly 0).

Weights before & after L^1

model = l1_model(0.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 0

model = l1_model(1.0)
weights = model.layers[0].get_weights()[0].flatten()
print(f"Number of weights almost 0: {np.sum(np.abs(weights) < 1e-5)}")
plt.hist(weights, bins=100);
Number of weights almost 0: 42

Many of the weights have been driven to (essentially) zero. This effectively removes the corresponding connections between neurons inside the NN.

Early-stopping regularisation

A form of regularisation that we’ve already been doing is early stopping. Training stops when the validation loss reaches its minimum, which helps prevent overfitting to the training data.

A very different way to regularize iterative learning algorithms such as gradient descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping… It is such a simple and efficient regularization technique that Geoffrey Hinton called it a “beautiful free lunch”.

Alternatively, you can try building a model with slightly more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent it from overfitting too much. Vincent Vanhoucke, a scientist at Google, has dubbed this the “stretch pants” approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.
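As a reminder of what this looks like in Keras, here is a minimal sketch reusing the California-housing arrays prepared above; the layer sizes and the patience of 10 epochs are arbitrary illustrative choices.

from keras.callbacks import EarlyStopping

random.seed(123)
model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dense(1, activation="exponential")
])
model.compile("adam", "mse")

# Stop training once the validation loss has not improved for 10 epochs,
# then roll back to the weights from the best epoch seen.
es = EarlyStopping(patience=10, restore_best_weights=True)
model.fit(X_train_sc, y_train, epochs=1_000,
    validation_data=(X_val_sc, y_val), callbacks=[es], verbose=0);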

Dropout

Dropout

Dropout is one of the most popular methods for reducing the risk of overfitting. It randomly selects a proportion of the neurons and deactivates them during each training iteration, which reduces overfitting and improves the generalisation ability of the model.

An example of neurons dropped during training.

Dropout quote #1

It’s surprising at first that this destructive technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would be forced to adapt its organization; it could not rely on any single person to work the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them.

Dropout quote #2

The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.

Code: Dropout

Dropout is just another layer in Keras.

The following code applies dropout after each hidden layer in the neural network, with a dropout rate of 0.2 for each layer. The Dropout layer also has a seed option, which can be used to ensure reproducibility.

from keras.layers import Dropout

random.seed(2); 

model = Sequential([
    Dense(30, activation="leaky_relu"),
    Dropout(0.2),
    Dense(30, activation="leaky_relu"),
    Dropout(0.2),
    Dense(1, activation="exponential")
])

model.compile("adam", "mse")
model.fit(X_train_sc, y_train, epochs=4, verbose=0);

Code: Dropout after training

Making predictions is the same as any other model:

model.predict(X_train_sc.head(3),
                  verbose=0)
array([[1.0587903],
       [1.281435 ],
       [0.9994641]], dtype=float32)
model.predict(X_train_sc.head(3),
                  verbose=0)
array([[1.0587903],
       [1.281435 ],
       [0.9994641]], dtype=float32)

Dropout has no impact on these predictions because the Dropout layers are only active during training. Once the model has been trained (i.e. the weights and biases are fixed), all neurons contribute to the predictions, so there is no dropping out at prediction time and repeated calls to predict return identical results.

We can make the model think it is still training:

model(X_train_sc.head(3),
    training=True).numpy()
array([[1.082524  ],
       [0.74211484],
       [1.1583111 ]], dtype=float32)
model(X_train_sc.head(3),
    training=True).numpy()
array([[1.0132376 ],
       [1.2697871 ],
       [0.78005785]], dtype=float32)

By setting training=True, dropout is applied at prediction time as well, so the predictions for the same inputs differ from run to run. This is known as Monte Carlo dropout and can be used to generate a distribution of predictions.
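A minimal sketch of this idea (the choice of 100 Monte Carlo samples is arbitrary): call the model repeatedly with training=True and summarise the resulting predictions.

# Monte Carlo dropout: repeat the stochastic forward pass many times.
n_samples = 100
mc_preds = np.stack([
    model(X_train_sc.head(3), training=True).numpy()
    for _ in range(n_samples)
])  # shape: (n_samples, 3, 1)

# Summarise the distribution of predictions for each of the 3 observations.
print("Mean prediction:", mc_preds.mean(axis=0).flatten())
print("Std of predictions:", mc_preds.std(axis=0).flatten())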

Dropout limitations

  • Increased training time: since dropout introduces noise into the training process, the model typically needs longer to train.
  • Sensitivity to the dropout rate: the performance of dropout is highly dependent on the chosen dropout rate.

Ensembles

Ensembles

Combine many models to get better predictions.

Ensemble learning: combine predictions from multiple models.

Deep Ensembles

Train M neural networks with different random initial weights independently (even in parallel).

def build_model(seed):
    random.seed(seed)  # set the random seed used for this single NN
    # Specify, compile and fit the NN as usual
    model = Sequential([
        Dense(30, activation="leaky_relu"),
        Dense(1, activation="exponential")
    ])
    model.compile("adam", "mse")

    es = EarlyStopping(restore_best_weights=True, patience=5)
    model.fit(X_train_sc, y_train, epochs=1_000,
        callbacks=[es], validation_data=(X_val_sc, y_val), verbose=False)
    return model

The following code fits 3 neural networks with different starting weights and biases.

M = 3
seeds = range(M)
models = []
for seed in seeds:
    models.append(build_model(seed))

Deep Ensembles II

Say the trained weights for the NNs are \boldsymbol{w}^{(1)}, \ldots, \boldsymbol{w}^{(M)}, then we get predictions \bigl\{ \hat{y}(\boldsymbol{x}; \boldsymbol{w}^{(m)}) \bigr\}_{m=1}^{M}

y_preds = []
for model in models:
    y_preds.append(model.predict(X_test_sc, verbose=0))

y_preds = np.array(y_preds)
y_preds
array([[[3.2678914 ],
        [0.76575136],
        [2.3880696 ],
        ...,
        [2.4143035 ],
        [2.227389  ],
        [1.1000473 ]],

       [[3.180622  ],
        [0.7190427 ],
        [2.5627623 ],
        ...,
        [2.3693144 ],
        [2.3057966 ],
        [1.0491788 ]],

       [[3.098881  ],
        [0.77898645],
        [2.6084518 ],
        ...,
        [2.6023817 ],
        [2.356631  ],
        [1.2005999 ]]], dtype=float32)

Now that we have a prediction from each NN for every test observation, we can look at the distribution of the predictions for a given input. This distribution helps us understand how uncertain our predictions are. Some architectures are more sensitive to the starting weights than others; this would be reflected in the prediction uncertainty.
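For example, a simple way to summarise the ensemble (a sketch; equally weighting the M models is the usual choice) is to take the mean of the predictions as the point estimate and their standard deviation as a rough measure of uncertainty.

# y_preds has shape (M, number of test observations, 1).
ensemble_mean = y_preds.mean(axis=0).flatten()
ensemble_std = y_preds.std(axis=0).flatten()

print("Ensemble means (first 3):", ensemble_mean[:3])
print("Ensemble stds (first 3): ", ensemble_std[:3])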

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow"))
Python implementation: CPython
Python version       : 3.11.12
IPython version      : 9.3.0

keras     : 3.8.0
matplotlib: 3.10.0
numpy     : 2.0.2
pandas    : 2.2.2
seaborn   : 0.13.2
scipy     : 1.13.1
torch     : 2.6.0
tensorflow: 2.18.0

Glossary

  • aleatoric and epistemic uncertainty
  • combined actuarial neural network
  • dead ReLU
  • deep ensembles
  • distributional forecasts
  • dropout
  • generalised linear model
  • mixture density network
  • mixture distribution
  • Monte Carlo dropout
  • proper scoring rule