As you have learned, a neural network consists of a set of weights and biases, and the network learns by adjusting these values so that we minimise the network’s loss. Mathematically, we aim to find the optimum weights and biases (\boldsymbol{w}^{*}, \boldsymbol{b}^{*}) : \begin{align}
(\boldsymbol{w}^{*}, \boldsymbol{b}^{*}) = \underset{\boldsymbol{w}, \boldsymbol{b}}{\text{arg min}}\ \mathcal{L}(\mathcal{D}, (\boldsymbol{w}, \boldsymbol{b})) \nonumber
\end{align} where \mathcal{D} denotes the training data set and \mathcal{L}(\cdot, \cdot) is the user-defined loss function.
Gradient descent is the method by which we update the weights and biases. We introduce three variants: stochastic, batch, and mini-batch gradient descent (a short code sketch comparing them follows the definitions below).
Stochastic gradient descent updates the weights and biases once for each observation in the data set.
Batch gradient descent updates the values repeatedly by averaging the gradients across all the observations.
Mini-Batch gradient descent updates the values repeatedly by averaging the gradients across a group of the observations (the ‘mini-batch’, or just ‘batch’).
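To make the distinction concrete, below is a minimal sketch of a single pass (epoch) over the data under each scheme. The helpers grad_w and grad_b are hypothetical placeholders that return the loss gradients for the supplied observations; they are not defined elsewhere in this section.
def one_epoch(X, y, w, b, eta, grad_w, grad_b, batch_size=None):
    # batch_size=None  -> batch gradient descent: one update using all observations
    # batch_size=1     -> stochastic gradient descent: one update per observation
    # batch_size=k > 1 -> mini-batch gradient descent: one update per group of k
    n = X.shape[0]
    step = n if batch_size is None else batch_size
    for start in range(0, n, step):
        Xb, yb = X[start:start + step], y[start:start + step]
        dw = grad_w(Xb, yb, w, b)   # gradient of the loss on this batch w.r.t. w
        db = grad_b(Xb, yb, w, b)   # gradient of the loss on this batch w.r.t. b
        w, b = w - eta * dw, b - eta * db
    return w, b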
Example: Mini-Batch Gradient Descent for Linear Regression
Notation:
\mathcal{L}(\mathcal{D}, (\boldsymbol{w}, {b})) denotes the loss function.
\hat{y}(\boldsymbol{x}_i) denotes the predicted value for the i th observation \boldsymbol{x}_i \in \mathbb{R}^{1 \times p} , where p represents the dimension of the input.
\boldsymbol{w} \in \mathbb{R}^{p \times 1} denotes the weights.
N denotes the batch size.
The model is
\hat{y}_i = \hat{y}(\boldsymbol{x}_i) = \boldsymbol{x}_i \boldsymbol{w} + b, \quad i = 1, \ldots, n.
Let’s set p=2 and consider the true weights and bias as
\boldsymbol{w}_{\text{True}} = \begin{pmatrix} 1.5 \\ 1.5 \end{pmatrix}, b_{\,\text{True}} = 0.1.
Let’s create a small toy data set to train on:
import numpy as np

# Make up (arbitrarily) 12 observations with two features.
X = np.array([[ 1,    2  ],
              [ 3,    1  ],
              [ 1,    1  ],
              [ 0,    1  ],
              [ 2,    2  ],
              [-2,    3  ],
              [ 1,    2  ],
              [-1,   -0.5],
              [ 0.5,  1.2],
              [ 2,    1  ],
              [-2,    3  ],
              [-1,    1  ]])

# True weights and bias used to generate the targets.
w_true = np.array([[1.5], [1.5]])
b_true = 0.1
y = X @ w_true + b_true
print(X); print(y)
[[ 1. 2. ]
[ 3. 1. ]
[ 1. 1. ]
[ 0. 1. ]
[ 2. 2. ]
[-2. 3. ]
[ 1. 2. ]
[-1. -0.5]
[ 0.5 1.2]
[ 2. 1. ]
[-2. 3. ]
[-1. 1. ]]
[[ 4.6 ]
[ 6.1 ]
[ 3.1 ]
[ 1.6 ]
[ 6.1 ]
[ 1.6 ]
[ 4.6 ]
[-2.15]
[ 2.65]
[ 4.6 ]
[ 1.6 ]
[ 0.1 ]]
If the batch size is N=3 , the first batch of observations is
\boldsymbol{X}_{1:3} = \begin{pmatrix}
1 & 2 \\
3 & 1 \\
1 & 1\\
\end{pmatrix} ,
\boldsymbol{y}_{1:3} =
\begin{pmatrix}
4.6 \\
6.1 \\
3.1 \\
\end{pmatrix}.
For simplicity, we will denote \boldsymbol{X}_{1:3} as \boldsymbol{X} and \boldsymbol{y}_{1:3} as \boldsymbol{y} .
Step 1: Write down \mathcal{L}(\mathcal{D}, (\boldsymbol{w}, {b})) and \hat{\boldsymbol{y}} \begin{align}
\mathcal{L}(\mathcal{D}, (\boldsymbol{w}, {b})) =\frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}(\boldsymbol{x}_i) - y_i \big)^2 = \frac{1}{N} (\hat{\boldsymbol{y}} - \boldsymbol{y})^{\top}(\hat{\boldsymbol{y}} - \boldsymbol{y}),
\end{align} where \begin{align}
\hat{y}(\boldsymbol{x}_i) &= \boldsymbol{x}_i\boldsymbol{w} + b, \\
\hat{\boldsymbol{y}} &= \boldsymbol{X} \boldsymbol{w} + {b}\boldsymbol{1} = \begin{pmatrix}
\hat{y}(\boldsymbol{x}_1) \\
\hat{y}(\boldsymbol{x}_2) \\
\hat{y}(\boldsymbol{x}_3)
\end{pmatrix}.
\\ \nonumber
\end{align} where \boldsymbol{1} is a length-3 column vector of ones.
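As a quick numerical check of Step 1 (not part of the derivation), we can evaluate this loss on the first batch at the initial parameters chosen in Step 4 below; the names X_check, y_check, w0, and b0 are illustrative only.
# Quick check of the Step 1 loss on the first batch (illustrative).
X_check = X[:3]
y_check = y[:3]
w0 = np.array([[1.0], [1.0]])   # initial weights, as chosen in Step 4
b0 = 0.0                        # initial bias, as chosen in Step 4
y_hat_check = X_check @ w0 + b0
print(np.mean((y_hat_check - y_check) ** 2))   # approximately 2.7267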
Step 2: Derive \frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} , \frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{w}} , and \frac{\partial \boldsymbol{\hat{y}}}{\partial {b}} \begin{align}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} &= \frac{2}{N} (\hat{\boldsymbol{y}} - \boldsymbol{y}), \\
\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{w}} &= \boldsymbol{X}, \\
\frac{\partial \boldsymbol{\hat{y}}}{\partial {b}} &= \boldsymbol{1} . \\ \nonumber
\end{align}
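To see where these Jacobians come from, note that elementwise \hat{y}_i = \sum_{j=1}^{p} x_{ij} w_j + b , so \begin{align}
\left(\frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{w}}\right)_{ij} = \frac{\partial \hat{y}_i}{\partial w_j} = x_{ij}, \qquad \frac{\partial \hat{y}_i}{\partial b} = 1. \nonumber
\end{align} Stacking these entries over i and j gives \boldsymbol{X} and \boldsymbol{1} respectively.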
Step 3: Derive \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}} and \frac{\partial \mathcal{L}}{\partial {b}}
\begin{align}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{w}} &= \left( \frac{\partial \boldsymbol{\hat{y}}}{\partial \boldsymbol{w}} \right)^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} =
\boldsymbol{X}^{\top} \left( \frac{2}{N} (\hat{\boldsymbol{y}} - \boldsymbol{y}) \right)
= \frac{2}{N} \boldsymbol{X}^{\top} (\hat{\boldsymbol{y}} - \boldsymbol{y}), \\
\frac{\partial \mathcal{L}}{\partial {b}} &= \left( \frac{\partial \boldsymbol{\hat{y}}}{\partial {b}} \right)^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{\hat{y}}} = \boldsymbol{1}^{\top} \left( \frac{2}{N} (\hat{\boldsymbol{y}} - \boldsymbol{y}) \right) = \frac{2}{N} \boldsymbol{1}^{\top}(\hat{\boldsymbol{y}} - \boldsymbol{y}). \\ \nonumber
\end{align}
Step 4: Initialise the weights and biases. Evaluate the gradients.
\begin{align}
\boldsymbol{w}^{(0)} = \begin{pmatrix} 1\\ 1 \end{pmatrix}, {b}^{(0)} = 0. \nonumber
\end{align} Subsequently, \begin{align}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{w}}\bigg|_{\boldsymbol{w}^{(0)}} &= \frac{2}{3}
\underbrace{\begin{pmatrix}
1 & 3 & 1 \\
2 & 1 & 1 \\
\end{pmatrix}}_{\boldsymbol{X}^\top} \Bigg[
\underbrace{\begin{pmatrix}
3 \\
4 \\
2
\end{pmatrix}}_{\hat{\boldsymbol{y}}}
-
\underbrace{ \begin{pmatrix}
4.6 \\
6.1 \\
3.1
\end{pmatrix}}_{\boldsymbol{y}}
\Bigg] =
\begin{pmatrix}
-6.000 \\
-4.267
\end{pmatrix},
\\
\frac{\partial \mathcal{L}}{\partial {b}}\bigg|_{{b}^{(0)}} &= \frac{2}{3}
\underbrace{\begin{pmatrix}
1 & 1 & 1
\end{pmatrix}}_{\boldsymbol{1}^{\top}} \Bigg[
\underbrace{\begin{pmatrix}
3 \\
4 \\
2
\end{pmatrix}}_{\hat{\boldsymbol{y}}}
-
\underbrace{\begin{pmatrix}
4.6 \\
6.1 \\
3.1
\end{pmatrix}}_{\boldsymbol{y}}
\Bigg] = -3.200.\\ \nonumber
\end{align}
# The batch is the first three observations (number of rows == batch size).
X_batch = X[:3]
y_batch = y[:3]
N = X_batch.shape[0]

# Initial weights and bias.
w = np.array([[1], [1]])
b = 0

# Gradients
y_hat = X_batch @ w + b
dw = 2 / N * X_batch.T @ (y_hat - y_batch)
db = 2 / N * np.sum(y_hat - y_batch)
print(dw); print(db)
[[-6. ]
[-4.26666667]]
-3.1999999999999993
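As an optional sanity check (not part of the original walkthrough), we can compare these analytic gradients with one-sided finite-difference approximations; the helper batch_loss and the step size eps are illustrative choices.
# Optional sanity check: finite-difference approximation of the gradients.
def batch_loss(w, b):
    residual = X_batch @ w + b - y_batch
    return float(np.mean(residual ** 2))

eps = 1e-6
dw_fd = np.zeros_like(dw)
for j in range(w.shape[0]):
    w_plus = w.astype(float)        # astype returns a copy, so w itself is untouched
    w_plus[j, 0] += eps
    dw_fd[j, 0] = (batch_loss(w_plus, b) - batch_loss(w, b)) / eps
db_fd = (batch_loss(w, b + eps) - batch_loss(w, b)) / eps
print(dw_fd); print(db_fd)          # should closely match dw and db above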
Step 5: Pick a learning rate \eta and update the weights and biases.
\begin{align}
\eta &= 0.1,\\
\boldsymbol{w}^{(1)} &= \boldsymbol{w}^{(0)} - \eta \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}}\bigg|_{\boldsymbol{w}^{(0)}}
=
\begin{pmatrix}
1.600 \\
1.427
\end{pmatrix}, \\
{b}^{(1)} &= {b}^{(0)} - \eta \frac{\partial \mathcal{L}}{\partial {b}}\bigg|_{{b}^{(0)}}
= 0.320.
\end{align}
# Specify a learning rate and update the weights and bias.
eta = 0.1
w = w - eta * dw
b = b - eta * db
print(w); print(b)
[[1.6 ]
[1.42666667]]
0.31999999999999995
Next Step: Update until convergence.
# Loss function (mean squared error)
def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)


def lr_gradient_descent(X, y, batch_size=32, eta=0.1, w=None, b=None, max_iter=100, tol=1e-08):
    """
    Gradient descent optimisation for linear regression with random batch updates.

    Parameters:
        X: numpy array of shape (n, p) - input features
        y: numpy array of shape (n, 1) - targets
        batch_size: int - number of observations per batch (default=32)
        eta: float - learning rate (default=0.1)
        w: numpy array of shape (p, 1) - initial weights (default=ones)
        b: float - initial bias (default=zero)
        max_iter: int - maximum number of iterations (default=100)
        tol: float - tolerance for the stopping criterion (default=1e-08)

    Returns:
        w, b - optimised weights and bias
    """
    N, p = X.shape
    if w is None:
        w = np.ones((p, 1))
    if b is None:
        b = 0
    prev_error = np.inf
    batch_size = min(N, batch_size)
    num_batches = N // batch_size
    for iteration in range(max_iter):
        # Shuffle the observations so each epoch uses different batches.
        indices = np.arange(N)
        np.random.shuffle(indices)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for batch in range(num_batches):
            start = batch * batch_size
            end = start + batch_size
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            y_hat = X_batch @ w + b
            # Stop when the batch loss changes by less than tol between updates.
            error = mse(y_hat.squeeze(), y_batch.squeeze())
            if np.abs(error - prev_error) < tol:
                return w, b
            prev_error = error
            # Gradient step on this batch.
            dw = 2 / batch_size * X_batch.T @ (y_hat - y_batch)
            db = 2 / batch_size * np.sum(y_hat - y_batch)
            w -= eta * dw
            b -= eta * db
    return w, b
# Default initialisation
w_updated, b_updated = lr_gradient_descent(X, y, batch_size=3, max_iter=1000)
print(w_updated)
print(b_updated)
[[1.4999708 ]
[1.49981673]]
0.10030691487733752
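The fitted parameters land close to the values used to generate the data; a quick (illustrative) comparison:
# Compare the fitted parameters with the generating values.
print(np.abs(w_updated - w_true).max())   # small deviation from w_true
print(abs(b_updated - b_true))            # small deviation from b_true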
Different Learning Rates and Initialisations
See more details in maths-of-neural-networks.ipynb .
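As a minimal sketch of that kind of experiment (the specific learning rates and starting point below are illustrative choices, not values taken from the notebook), we can rerun the optimiser under a few settings:
# Illustrative only: rerun the optimiser with a few learning rates from a new starting point.
np.random.seed(0)   # fix the batch shuffling so runs are comparable
for eta in [0.01, 0.05, 0.1]:
    w0 = np.zeros((2, 1))   # a different initialisation from the default of ones
    w_fit, b_fit = lr_gradient_descent(X, y, batch_size=3, eta=eta,
                                       w=w0, b=0.0, max_iter=1000)
    print(f"eta={eta}: w={w_fit.ravel()}, b={b_fit:.4f}")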
Exercises
Apply stochastic gradient descent to the example given above.
Apply batch gradient descent to logistic regression, following the steps above.