import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import random
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.initializers import Constant
Lab: Backpropagation
ACTL3143 & ACTL5111 Deep Learning for Actuaries
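Backpropagation computes the gradient of the loss with respect to every parameter of a neural network by applying the chain rule in a backward pass, from the output layer towards the input layer. Gradient descent then uses these gradients to update the weights and biases.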
Linear Regression via Batch Gradient Descent
Let \boldsymbol{\theta}^{(t)}=(w^{(t)}, b^{(t)}) be the parameter estimates at the tth iteration. Let \mathcal{D}= \{(x_i, y_i)\}_{i=1}^{N} represent the training batch. Let the mean squared error (MSE) be the loss/cost function \mathcal{L}.
Finding the Gradients
- Step 1: Write down \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)}) and \hat{y}(x_i; \boldsymbol{\theta}^{(t)}) \begin{align*} \mathcal{L}(\mathcal{D},\boldsymbol{\theta}^{(t)}) &=\frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big)^2 \\ \hat{y}(x_i; \boldsymbol{\theta}^{(t)}) &= w^{(t)}x_i + b^{(t)} \end{align*}
- Step 2: Derive \frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} and \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial \boldsymbol{\theta}^{(t)}} \begin{align*} \frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} & = 2 \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big) \\ \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial w^{(t)}} & = x_i \\ \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial b^{(t)}} & = 1 \end{align*}
- Step 3: Derive \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial \boldsymbol{\theta}^{(t)}} \begin{align*} \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial w^{(t)}} &= \frac{1}{N}\sum_{i=1}^{N}\frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial w^{(t)}} = \frac{2}{N} \sum_{i=1}^{N} \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big) \cdot x_i \tag{1} \\ \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial b^{(t)}} &= \frac{1}{N}\sum_{i=1}^{N}\frac{\partial \mathcal{L}(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}), y_i)}{\partial \hat{y}(x_i; \boldsymbol{\theta}^{(t)})} \frac{\partial\hat{y}(x_i; \boldsymbol{\theta}^{(t)})}{\partial b^{(t)}} = \frac{2}{N} \sum_{i=1}^{N} \big(\hat{y}(x_i; \boldsymbol{\theta}^{(t)}) - y_i \big) \cdot 1 \tag{2} \end{align*}
Then, we initialise \boldsymbol{\theta}^{(0)} = (w^{(0)}, b^{(0)}) and apply gradient descent for t=0, 1, 2, \ldots \begin{align} w^{(t+1)} &= w^{(t)} - \eta \cdot \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial w}\bigg|_{w^{(t)}} \\ b^{(t+1)} &= b^{(t)} - \eta \cdot \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta}^{(t)})}{\partial b}\bigg|_{b^{(t)}} \end{align} using the derivatives from Equation 1 and Equation 2, where \eta is a chosen learning rate.
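The update rule above can be sketched in NumPy. The five data points below are hypothetical (generated from y = 3x + 1 purely for illustration), and the learning rate of 0.05, the iteration count and the helper name `batch_gradient_descent` are assumptions rather than part of the lab.

```python
import numpy as np

def batch_gradient_descent(x, y, w0, b0, eta, num_iterations):
    """Fit y ≈ w * x + b by batch gradient descent on the MSE loss."""
    w, b = w0, b0
    for _ in range(num_iterations):
        residual = (w * x + b) - y          # y_hat - y for every observation
        grad_w = 2 * np.mean(residual * x)  # Equation 1
        grad_b = 2 * np.mean(residual)      # Equation 2
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Hypothetical data generated from y = 3x + 1 (an assumption for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1

# One iteration from (w, b) = (1, 0), then a long run
w_one, b_one = batch_gradient_descent(x, y, w0=1.0, b0=0.0, eta=0.05, num_iterations=1)
w_hat, b_hat = batch_gradient_descent(x, y, w0=1.0, b0=0.0, eta=0.05, num_iterations=5000)
print(w_one, b_one)
print(w_hat, b_hat)
```

Both gradients are computed from the same residuals before either parameter is updated, which is what makes this a simultaneous (batch) update rather than a coordinate-wise one.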
Exercise
- Use the backpropagation algorithm to find \boldsymbol{\theta}^{(3)} with \boldsymbol{\theta}^{(0)}= (w^{(0)} = 1, b^{(0)} = 0). The dataset \mathcal{D} is as follows:
That is, the true model would be y_i = 3 x_i + 1, i.e., w = 3, b = 1. Implement batch gradient descent.
Neural Network
For a neural network with H hidden layers:
- L_0 is the input layer (the zeroth layer). L_k represents the kth hidden layer for k\in \{1, 2, \ldots, H\}. L_{H+1} is the output layer (the (H+1)th layer).
- \phi^{(k)} represents the activation function for the kth hidden layer, with k\in \{1, 2, \ldots, H\}. \phi^{(H+1)} represents the activation function for the output layer.
- \boldsymbol{w}^{(k)}_j represents the weights connecting the activated neurons \boldsymbol{a}^{(k-1)} from the (k-1)th layer to the jth neuron in the kth layer, where k\in \{1, \ldots, H+1\} and j\in \{1, \ldots, q_{k}\}. Here q_{k} denotes the number of neurons in the kth layer, and \boldsymbol{a}^{(0)} = \boldsymbol{z}^{(0)} =\boldsymbol{x} by definition.
- b^{(k)}_j represents the bias for the jth neuron in the kth hidden layer.
Gradients For the Output Layer
The gradient for \boldsymbol{w}_1^{(H+1)}, i.e., the weights connecting the neurons in the Hth (last) hidden layer to the first neuron of the output layer, is given by: \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \boldsymbol{w}^{(H+1)}_1} = \frac{\partial \mathcal{L}(\mathcal{D}, \boldsymbol{\theta})}{\partial \hat{y}_1} \frac{\partial \hat{y}_1}{\partial z^{(H+1)}_1 } \frac{\partial z^{(H+1)}_1}{\partial \boldsymbol{w}^{(H+1)}_1} \tag{3} where
- \hat{y}_1=a^{(H+1)}_1= \phi^{(H+1)} (z^{(H+1)}_1)
- z^{(H+1)}_1 = \langle \boldsymbol{a}^{(H)}, \boldsymbol{w}_1^{(H+1)} \rangle + b^{(H+1)}_1.
- \langle \cdot, \cdot \rangle represents the inner product.
Example 1
- From input layer L_0 to the first hidden layer L_1: \begin{align*} a^{(1)}_1 &= \phi^{(1)}\big(w^{(1)}_{1, 1}x_1 + w^{(1)}_{2, 1}x_2 + w^{(1)}_{3, 1} x_3 + b^{(1)}_1\big) = \phi^{(1)} (\langle \boldsymbol{w}^{(1)}_{1}, \boldsymbol{x} \rangle + b^{(1)}_1 )\\ a^{(1)}_2 &= \phi^{(1)}\big(w^{(1)}_{1, 2}x_1 + w^{(1)}_{2, 2}x_2 + w^{(1)}_{3, 2} x_3 + b^{(1)}_2\big) = \phi^{(1)} (\langle \boldsymbol{w}^{(1)}_{2}, \boldsymbol{x} \rangle + b^{(1)}_2) \end{align*}
- From the first hidden layer L_1 to the output layer L_2: \begin{align*} \hat{y} &= \phi^{(2)}\big(w^{(2)}_{1, 1} a^{(1)}_1 + w^{(2)}_{2, 1} a^{(1)}_2 + b^{(2)}_1\big) = \phi^{(2)}( \langle \boldsymbol{w}^{(2)}_{1}, \boldsymbol{a}^{(1)} \rangle + b^{(2)}_1) \end{align*}
- \phi^{(1)}(z)= S(z) (sigmoid function) and \phi^{(2)}(z) = \exp(z) (exponential function).
Let \boldsymbol{\theta}^{(t)}=(\boldsymbol{w}^{(t)}, \boldsymbol{b}^{(t)})= \Big(\boldsymbol{w}^{(t, 1)}_1, \boldsymbol{w}^{(t, 1)}_2, \boldsymbol{w}^{(t, 2)}_1, b^{(t,1)}_1, b^{(t,1)}_2, b^{(t,2)}_1\Big) be the parameter estimates at the tth iteration. For illustration, we assume the bias terms \big(b^{(t,1)}_1, b^{(t,1)}_2, b^{(t,2)}_1\big) are all zero.
- For \boldsymbol{w}_1^{(2)}, apply Equation 3
- For \boldsymbol{w}^{(1)}_1, apply Equation 4
- For \boldsymbol{w}^{(1)}_2, apply Equation 4
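Equation 4 is not reproduced in this section. As a hedged sketch, assuming the squared-error loss \ell = (\hat{y} - y)^2 for a single observation (\boldsymbol{x}, y), zero biases, and the activations of Example 1, the chain rule gives the following per-observation gradients. The symbols \ell, \delta^{(2)}_1 and \delta^{(1)}_j are notation introduced here for illustration; the batch gradient is the average of these quantities over the N observations.

```latex
% Pre-activations and activations (biases set to zero)
z^{(1)}_j = \langle \boldsymbol{w}^{(1)}_j, \boldsymbol{x} \rangle, \quad
a^{(1)}_j = S(z^{(1)}_j), \quad
z^{(2)}_1 = \langle \boldsymbol{w}^{(2)}_1, \boldsymbol{a}^{(1)} \rangle, \quad
\hat{y} = \exp(z^{(2)}_1), \quad j \in \{1, 2\}

% Output layer: d exp(z)/dz = exp(z)
\delta^{(2)}_1
  = \frac{\partial \ell}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(2)}_1}
  = 2(\hat{y} - y) \exp(z^{(2)}_1),
\qquad
\frac{\partial \ell}{\partial \boldsymbol{w}^{(2)}_1} = \delta^{(2)}_1 \, \boldsymbol{a}^{(1)}

% Hidden layer: S'(z) = S(z)(1 - S(z))
\delta^{(1)}_j
  = \delta^{(2)}_1 \, w^{(2)}_{j, 1} \, S(z^{(1)}_j)\big(1 - S(z^{(1)}_j)\big),
\qquad
\frac{\partial \ell}{\partial \boldsymbol{w}^{(1)}_j} = \delta^{(1)}_j \, \boldsymbol{x}
```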
Implementing Backpropagation in Python
See Week_4_Lab_Notebook.ipynb for more details. The required packages/functions are those imported at the top of this lab.
True weights:
w1_1 = np.array([[0.25], [0.5], [0.75]])
w1_2 = np.array([[0.75], [0.5], [0.25]])
w2_1 = np.array([[2.0], [3.0]])
Some synthetic data to work with:
# Generate 10000 random observations of 3 numerical features
np.random.seed(0)
X = np.random.randn(10000, 3)

# Sigmoid activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hidden Layer 1
z1_1 = X @ w1_1 # The first neuron before activation
z1_2 = X @ w1_2 # The second neuron before activation
a1_1 = sigmoid(z1_1) # The first neuron after activation
a1_2 = sigmoid(z1_2) # The second neuron after activation

# Output Layer
z2_1 = np.concatenate((a1_1, a1_2), axis=1) @ w2_1 # Pre-activation of the output
a2_1 = np.exp(z2_1) # Output

# The actual values
y = a2_1
From Scratch
# Initialised weights
w1_1_hat = np.array([[0.2], [0.6], [1.0]])
w1_2_hat = np.array([[0.4], [0.8], [1.2]])
w2_1_hat = np.array([[1.0], [2.0]])

losses = []
num_iterations = 5000
for _ in range(num_iterations):
    # Compute Forward Passes
    # Hidden Layer 1
    z1_1_hat = X @ w1_1_hat # The first neuron before activation
    z1_2_hat = X @ w1_2_hat # The second neuron before activation
    a1_1_hat = sigmoid(z1_1_hat) # The first neuron after activation
    a1_2_hat = sigmoid(z1_2_hat) # The second neuron after activation
    a1_hat = np.concatenate((a1_1_hat, a1_2_hat), axis=1)

    # Output Layer
    z2_1_hat = a1_hat @ w2_1_hat # The output before activation
    y_hat = np.exp(z2_1_hat).reshape(len(y), 1) # The output

    # Track the Losses
    loss = (y_hat - y)**2
    losses.append(np.mean(loss))

    # Compute Deltas
    delta2_1 = 2 * (y_hat - y) * np.exp(z2_1_hat)
    delta1_1 = w2_1_hat[0] * delta2_1 * sigmoid(z1_1_hat) * (1-sigmoid(z1_1_hat))
    delta1_2 = w2_1_hat[1] * delta2_1 * sigmoid(z1_2_hat) * (1-sigmoid(z1_2_hat))

    # Compute Gradients (one row per observation)
    d2_1_hat = delta2_1 * a1_hat
    d1_1_hat = delta1_1 * X
    d1_2_hat = delta1_2 * X

    # Learning Rate
    eta = 0.0005

    # Apply Batch Gradient Descent (average the per-observation gradients)
    w2_1_hat -= eta * np.mean(d2_1_hat, axis=0).reshape(2, 1)
    w1_1_hat -= eta * np.mean(d1_1_hat, axis=0).reshape(3, 1)
    w1_2_hat -= eta * np.mean(d1_2_hat, axis=0).reshape(3, 1)

print(w1_1_hat)
print(w1_2_hat)
print(w2_1_hat)
[[0.24985576]
[0.5000211 ]
[0.75018656]]
[[0.74987578]
[0.49998626]
[0.25009692]]
[[1.99874327]
[3.00125615]]
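One way to gain confidence in the deltas used in the loop is a finite-difference gradient check: perturb each weight slightly, recompute the loss, and compare the slope with the analytic gradient. The sketch below is self-contained and uses the same network and weights as this section, but the helper names (`mse_loss`, `analytic_gradients`, `numerical_gradient`), the sample of 200 observations and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mse_loss(X, y, w1_1, w1_2, w2_1):
    """Forward pass of the Example 1 network (no biases) followed by the MSE."""
    a1 = np.concatenate((sigmoid(X @ w1_1), sigmoid(X @ w1_2)), axis=1)
    y_hat = np.exp(a1 @ w2_1)
    return np.mean((y_hat - y) ** 2)

def analytic_gradients(X, y, w1_1, w1_2, w2_1):
    """Batch gradients built from the same deltas as the from-scratch loop."""
    z1_1, z1_2 = X @ w1_1, X @ w1_2
    a1 = np.concatenate((sigmoid(z1_1), sigmoid(z1_2)), axis=1)
    z2_1 = a1 @ w2_1
    y_hat = np.exp(z2_1)
    delta2_1 = 2 * (y_hat - y) * np.exp(z2_1)
    delta1_1 = w2_1[0] * delta2_1 * sigmoid(z1_1) * (1 - sigmoid(z1_1))
    delta1_2 = w2_1[1] * delta2_1 * sigmoid(z1_2) * (1 - sigmoid(z1_2))
    g1_1 = np.mean(delta1_1 * X, axis=0).reshape(3, 1)
    g1_2 = np.mean(delta1_2 * X, axis=0).reshape(3, 1)
    g2_1 = np.mean(delta2_1 * a1, axis=0).reshape(2, 1)
    return g1_1, g1_2, g2_1

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences of the scalar function f with respect to w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# A small synthetic sample generated from the true weights of this section
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w1_1 = np.array([[0.25], [0.5], [0.75]])
w1_2 = np.array([[0.75], [0.5], [0.25]])
w2_1 = np.array([[2.0], [3.0]])
y = np.exp(np.concatenate((sigmoid(X @ w1_1), sigmoid(X @ w1_2)), axis=1) @ w2_1)

# Check the gradients at the initial weights used in the loop
w1_1_hat = np.array([[0.2], [0.6], [1.0]])
w1_2_hat = np.array([[0.4], [0.8], [1.2]])
w2_1_hat = np.array([[1.0], [2.0]])

g1_1, g1_2, g2_1 = analytic_gradients(X, y, w1_1_hat, w1_2_hat, w2_1_hat)
n1_1 = numerical_gradient(lambda w: mse_loss(X, y, w, w1_2_hat, w2_1_hat), w1_1_hat)
n1_2 = numerical_gradient(lambda w: mse_loss(X, y, w1_1_hat, w, w2_1_hat), w1_2_hat)
n2_1 = numerical_gradient(lambda w: mse_loss(X, y, w1_1_hat, w1_2_hat, w), w2_1_hat)

max_abs_diff = max(np.max(np.abs(g1_1 - n1_1)),
                   np.max(np.abs(g1_2 - n1_2)),
                   np.max(np.abs(g2_1 - n2_1)))
print(max_abs_diff)
```

If the printed difference is tiny relative to the size of the gradients, the hand-derived deltas agree with the numerical slope of the loss. A large discrepancy would usually point to a missing factor in one of the chain-rule terms.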
From Keras
# An initialiser for the weights in the neural network
init1 = Constant([[0.2, 0.4], [0.6, 0.8], [1.0, 1.2]])
init2 = Constant([[1.0, 2.0]])

# Build a neural network
# `use_bias` (whether to include bias terms for the neurons or not) is True by default
# `kernel_initializer` adjusts the initialisations of the weights
x = Input(shape=X.shape[1:], name="Inputs")
a1 = Dense(2, "sigmoid", use_bias=False,
    kernel_initializer=init1)(x)
y_hat = Dense(1, "exponential", use_bias=False,
    kernel_initializer=init2)(a1)
model = Model(x, y_hat)

# Choosing the optimiser and the loss function
model.compile(optimizer="adam", loss="mse")

# Model Training
# We don't implement early stopping to make the results comparable to the previous section
hist = model.fit(X, y, epochs=5000, verbose=0, batch_size=len(y))

# Print out the weights
print(model.get_weights())
[array([[0.3025748 , 0.80548114],
[0.49333417, 0.5067073 ],
[0.6842524 , 0.2076197 ]], dtype=float32), array([[2.5133712, 2.5152776],
[2.4867477, 2.4848893]], dtype=float32)]