We have a dataset \{ (\boldsymbol{x}_i, y_i) \}_{i=1}^n, which we assume consists of i.i.d. observations.
| Brand | Mileage | # Claims |
|-------|---------|----------|
| BMW   | 101 km  | 1        |
| Audi  | 432 km  | 0        |
| Volvo | 3 km    | 5        |
| ⋮     | ⋮       | ⋮        |
The goal is to predict y for some new covariates \boldsymbol{x}.
Time series data
We have a sequence \{ (\boldsymbol{x}_t, y_t) \}_{t=1}^T of observations taken at regular time intervals.
| Date  | Humidity | Temp. |
|-------|----------|-------|
| Jan 1 | 60%      | 20 °C |
| Jan 2 | 65%      | 22 °C |
| Jan 3 | 70%      | 21 °C |
| ⋮     | ⋮        | ⋮     |
The task is to forecast future values based on the past.
Attributes of time series data
Temporal ordering: The order of the observations matters.
Trend: The general direction of the data.
Noise: Random fluctuations in the data.
Seasonality: Patterns that repeat at regular intervals.
Note
Question: What will be the temperature in Berlin tomorrow? What information would you use to make a prediction?
Australian financial stocks
# First, install yfinance if you haven't already:
# pip install yfinance
import yfinance as yf
import pandas as pd
from datetime import date

# 1. Define the ASX tickers (Yahoo uses ".AX" for Australian stocks
#    and "^AXJO" for the S&P/ASX 200 index)
tickers = {
    "ANZ": "ANZ.AX",
    "BOQ": "BOQ.AX",
    "CBA": "CBA.AX",
    "NAB": "NAB.AX",
    "QBE": "QBE.AX",
    "SUN": "SUN.AX",
    "WBC": "WBC.AX",
    "ASX200": "^AXJO",
}

# 2. Choose your date range
start = "1999-01-01"
end = date.today().isoformat()  # e.g. "2025-06-29"

# 3. Download the data
# This returns a Panel-like DataFrame: columns are tickers, rows are trading dates.
raw_data = yf.download(
    tickers=list(tickers.values()),
    start=start,
    end=end,
    progress=False,
)
raw_data.to_csv("asx_raw_data.csv")  # Optional: save raw data for inspection
data = raw_data["Close"]

# 4. Rename columns to the simple names
data.rename(columns={v: k for k, v in tickers.items()}, inplace=True)
cols = ["ANZ", "ASX200", "BOQ", "CBA", "NAB", "QBE", "SUN", "WBC"]
data = data[cols]

# 5. (Optional) Inspect the first few rows
print(data.head())

# 6. Save to CSV
output_path = "asx_daily_close_prices.csv"
data.to_csv(output_path)
print(f"Saved daily close prices to {output_path}")
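The snippets below refer to a `stock` price DataFrame and a `stock_log` series of CBA daily log returns that are not defined in this excerpt. A plausible construction from the CSV saved above (the variable names are assumptions made here, not part of the original code) would be:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical reconstruction: load the saved close prices and compute CBA daily log returns.
stock = pd.read_csv("asx_daily_close_prices.csv", index_col=0, parse_dates=True)
stock_log = np.log(stock["CBA"]).diff().dropna()  # r_t = ln(P_t) - ln(P_{t-1})
```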
# Distribution of log returns
stock_log.hist(bins=50, alpha=0.7)
plt.xlabel("Daily Log Return")
plt.ylabel("Frequency")
plt.title("Distribution of CBA Daily Log Returns");
def log_to_price(log_returns, initial_price):
    """Convert log returns to raw prices given an initial price."""
    # Use cumulative sum of log returns for numerical stability
    # P_t = P_0 * exp(sum of log returns from 1 to t)
    cumulative_log_returns = log_returns.cumsum()
    prices = initial_price * np.exp(cumulative_log_returns)
    return prices


def get_last_price(stock_df, cutoff_date):
    """Get the last known price before the forecast period starts."""
    last_known_date = stock_df.loc[:cutoff_date].index[-1]
    return stock_df.loc[last_known_date, "CBA"]
Persistence forecast
Predict the next value to be the same as the current value.
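A minimal sketch of this baseline over the 2025 holdout period, assuming the `stock` DataFrame defined earlier; the name `persistence_prices` is chosen to match the MSE comparison further below:

```python
# Persistence baseline: forecast every day in 2025 to equal the last observed 2024 price.
actual_prices = stock.loc["2025":, "CBA"]
last_known_price = stock.loc[:"2024", "CBA"].iloc[-1]
persistence_prices = pd.Series(last_known_price, index=actual_prices.index)
```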
# Plot the trend line over the top of the 'recent_log_returns' data
recent_log_returns.plot()
plt.axhline(trend_log, color="red", linestyle="--", label="Trend (mean log return)")
plt.ylabel("Daily Log Return");
Trend fitted
# Create trend forecast for the recent period to show fitted trend
recent_trend_log = pd.Series(trend_log, index=recent_log_returns.index)
trend_start_price = get_last_price(stock, cutoff_date=recent_log_returns.index[0].strftime("%Y-%m-%d"))
recent_trend_prices = log_to_price(recent_trend_log, trend_start_price)
If we look at the mean squared error (MSE) of the two models:
# Calculate MSE using the actual forecasts we computed
actual_prices = stock.loc["2025":, "CBA"]
persistence_mse = mean_squared_error(actual_prices, persistence_prices)
trend_mse = mean_squared_error(actual_prices, trend_prices)
persistence_mse, trend_mse
(254.04075100411256, 99.28717915684173)
Use the history
Now let’s work with log returns instead of raw prices to create lagged features:
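The `df_lags` frame used below is not built in this excerpt. A hedged sketch is given here; the number of lags and the target column name "T" are assumptions chosen to be consistent with the later code:

```python
# Hypothetical construction of the lagged-features frame:
# column "T-k" holds the log return k days ago, column "T" holds today's value (the target).
num_lags = 40  # assumed number of lagged features
df_lags = pd.DataFrame({f"T-{k}": stock_log.shift(k) for k in range(num_lags, 0, -1)})
df_lags["T"] = stock_log
```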
# Split the data in time
X_train = df_lags.loc[:"2021"]
X_val = df_lags.loc["2022-01":"2024-12"]  # 2022-2024
X_test = df_lags.loc["2025":]  # 2025

# Remove any rows with NAs and split into X and y
X_train = X_train.dropna()
X_val = X_val.dropna()
X_test = X_test.dropna()

y_train = X_train.pop("T")
y_val = X_val.pop("T")
y_test = X_test.pop("T")
def autoregressive_forecast(model, X_val, suppress=False):
    """
    Generate a multi-step forecast using the given model.
    """
    multi_step = pd.Series(index=X_val.index, name="Multi Step")

    # Initialize the input data for forecasting
    input_data = X_val.iloc[0].values.reshape(1, -1)

    for i in range(len(multi_step)):
        # Ensure input_data has the correct feature names
        input_df = pd.DataFrame(input_data, columns=X_val.columns)
        if suppress:
            next_value = model.predict(input_df, verbose=0)
        else:
            next_value = model.predict(input_df)
        multi_step.iloc[i] = next_value

        # Append that prediction to the input for the next forecast
        if i + 1 < len(multi_step):
            input_data = np.append(input_data[:, 1:], next_value).reshape(1, -1)

    return multi_step
“It’s tough to make predictions, especially about the future.”
Neural network forecasts
Simple feedforward neural network
model = Sequential([
    Rescaling(1/0.02),
    Dense(32, activation="leaky_relu"),
    Dense(1),  # Linear activation for log returns
])
model.compile(optimizer="adam", loss="mean_absolute_error")
if Path("aus_fin_fnn_model.keras").exists():
    model = keras.models.load_model("aus_fin_fnn_model.keras")
else:
    es = EarlyStopping(patience=15, restore_best_weights=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=500, callbacks=[es], verbose=0)
    model.save("aus_fin_fnn_model.keras")
model.summary()
A recurrent neural network is a type of neural network that is designed to process sequences of data (e.g. time series, sentences).
A recurrent neural network is any network that contains a recurrent layer.
A recurrent layer is a layer that processes a sequence one element at a time while maintaining an internal state.
An RNN can have one or more recurrent layers.
Weights are shared over time; this allows the model to be used on arbitrary-length sequences.
Applications
Forecasting: revenue forecast, weather forecast, predict disease rate from medical history, etc.
Classification: given a time series of the activities of a visitor on a website, classify whether the visitor is a bot or a human.
Event detection: given a continuous data stream, identify the occurrence of a specific event. Example: Detect utterances like “Hey Alexa” from an audio stream.
Anomaly detection: given a continuous data stream, detect anything unusual happening. Example: Detect unusual activity on the corporate network.
Origin of the name of RNNs
A recurrence relation is an equation that expresses each element of a sequence as a function of the preceding ones. More precisely, in the case where only the immediately preceding element is involved, a recurrence relation has the form
u_n = \psi(n, u_{n-1}) \quad \text{ for } \quad n > 0.
Example: Factorial n! = n (n-1)! for n > 0 given 0! = 1.
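As a quick illustration, the factorial recurrence maps directly onto a recursive function:

```python
def factorial(n):
    """n! defined by the recurrence n! = n * (n-1)! with base case 0! = 1."""
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(5))  # 120
```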
Diagram of an RNN cell
The RNN processes the elements of the sequence one by one, while keeping a memory of what came before.
The following figure shows how the recurrent neural network combines an input X_l with the state A_l carried over from the previous steps to produce the output O_l. RNNs have a cyclic information-processing structure that enables them to pass information sequentially from previous inputs. RNNs can capture dependencies and patterns in sequential data, making them useful for analysing time series data.
Schematic of a recurrent neural network. E.g. SimpleRNN, LSTM, or GRU.
A SimpleRNN cell
Diagram of a SimpleRNN cell.
All the outputs before the final one are often discarded.
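In Keras this is controlled by the `return_sequences` argument of the recurrent layers: by default only the final output is returned, while `return_sequences=True` keeps one output per time step. A small shape check (the sizes here are illustrative only):

```python
import numpy as np
from keras.layers import SimpleRNN

x = np.random.rand(8, 10, 3)  # 8 sequences, 10 time steps, 3 features per step

last_only = SimpleRNN(4)(x)                         # final output only
all_steps = SimpleRNN(4, return_sequences=True)(x)  # one output per time step

print(last_only.shape)  # (8, 4)
print(all_steps.shape)  # (8, 10, 4)
```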
LSTM internals
Simple RNN structures suffer from the vanishing gradient problem and hence struggle to learn long-term dependencies. LSTMs are designed to overcome this problem. They have a more complex structure (an internal memory cell and gating mechanisms) and can better regulate the flow of information.
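In Keras, an LSTM layer is used just like a SimpleRNN layer; only the internal gating differs. A minimal sketch (the layer sizes are illustrative, not the model used later in these notes):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

model_lstm = Sequential([
    LSTM(16),   # gated recurrent layer (input, forget, and output gates)
    Dense(1),   # single output value
])
model_lstm.compile(optimizer="adam", loss="mean_absolute_error")
```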
GRU internals
GRUs are simpler than LSTMs and hence computationally more efficient.
Diagram of a GRU cell.
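One way to see the difference in size is to compare parameter counts for layers with the same number of units; a rough sketch (the exact counts depend on the Keras version and layer options):

```python
from keras.models import Sequential
from keras.layers import Input, LSTM, GRU

# Same input shape and number of units for both layer types.
for layer_cls in (LSTM, GRU):
    model = Sequential([Input(shape=(12, 1)), layer_cls(16)])
    print(layer_cls.__name__, "parameters:", model.count_params())
```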
Stock prediction with recurrent networks
SimpleRNN
from keras.layers import SimpleRNN, Reshape

model = Sequential([
    Rescaling(1/0.02),
    Reshape((-1, 1)),
    SimpleRNN(64, activation="tanh"),
    Dense(1),  # Linear activation for log returns
])
model.compile(optimizer="adam", loss="mean_absolute_error")
es = EarlyStopping(patience=15, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=500, callbacks=[es], verbose=0)
model.summary()
At each time step, a simple recurrent neural network (RNN) takes an input vector x_t, combines it with the information in the previous hidden state y_{t-1}, and produces an output vector y_t. The hidden state helps the network remember the context of the earlier inputs, enabling it to make informed predictions about what comes next in the sequence. In a simple RNN, the output at time t-1 is reused as the hidden state at time t.
SimpleRNN (in batches)
The difference between processing one sequence at a time and processing in batches lies in how the network handles its input sequences. With batch processing, the model processes multiple (b) input sequences simultaneously: the training data is grouped into batches, and the weights are updated based on the average error across the entire batch. Batch processing often results in more stable weight updates, as the model learns from a diverse set of examples in each batch, reducing the impact of noise in individual sequences.
Say we operate on batches of size b, then \boldsymbol{Y}_t \in \mathbb{R}^{b \times d}.
The main equation of a SimpleRNN, given \boldsymbol{Y}_0 = \boldsymbol{0}, is \boldsymbol{Y}_t = \psi\bigl( \boldsymbol{X}_t \boldsymbol{W}_x + \boldsymbol{Y}_{t-1} \boldsymbol{W}_y + \boldsymbol{b} \bigr) . Here,
\begin{aligned}
&\boldsymbol{X}_t \in \mathbb{R}^{b \times m}, \boldsymbol{W}_x \in \mathbb{R}^{m \times d}, \\
&\boldsymbol{Y}_{t-1} \in \mathbb{R}^{b \times d}, \boldsymbol{W}_y \in \mathbb{R}^{d \times d}, \text{ and } \boldsymbol{b} \in \mathbb{R}^{d}.
\end{aligned}
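A small NumPy sketch of this recurrence (random weights, \psi = \tanh; the dimensions follow the notation above):

```python
import numpy as np

b, m, d, T = 4, 3, 5, 10              # batch size, input features, hidden size, time steps
rng = np.random.default_rng(0)

X = rng.normal(size=(T, b, m))        # the sequence of input batches X_1, ..., X_T
W_x = rng.normal(size=(m, d))
W_y = rng.normal(size=(d, d))
bias = rng.normal(size=(d,))

Y = np.zeros((b, d))                  # Y_0 = 0
for t in range(T):
    Y = np.tanh(X[t] @ W_x + Y @ W_y + bias)  # Y_t = psi(X_t W_x + Y_{t-1} W_y + b)

print(Y.shape)  # (4, 5), i.e. (b, d)
```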
Selects the first two slices along the first dimension. Since the tensor has dimensions (4, 3, 2), X[:2] selects the first two slices (indices 0 and 1) along the first dimension and returns a sub-tensor of shape (2, 3, 2).
Selects the last two slices along the first dimension. The first dimension (axis 0) has size 4, so X[2:] selects the last two slices (indices 2 and 3) and returns a sub-tensor of shape (2, 3, 2).
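For example, with a NumPy array of shape (4, 3, 2):

```python
import numpy as np

X = np.arange(24).reshape(4, 3, 2)  # tensor of shape (4, 3, 2)
print(X[:2].shape)  # (2, 3, 2): slices 0 and 1 along the first axis
print(X[2:].shape)  # (2, 3, 2): slices 2 and 3 along the first axis
```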
from keras.layers import SimpleRNN                                   # <1>
random.seed(1234)                                                    # <2>
model = Sequential([SimpleRNN(output_size, activation="sigmoid")])   # <3>
model.compile(loss="binary_crossentropy", metrics=["accuracy"])      # <4>
hist = model.fit(X, y, epochs=500, verbose=False)                    # <5>
model.evaluate(X, y, verbose=False)                                  # <6>
1. Imports the SimpleRNN layer from the Keras library.
2. Sets the seed for the random number generator to ensure reproducibility.
3. Defines a simple RNN with one output node and sigmoid activation function.
4. Specifies binary crossentropy as the loss function (usually used in classification problems), and "accuracy" as the metric to be monitored during training.
5. Trains the model for 500 epochs and saves the output as hist.
6. Evaluates the model to obtain a value for the loss and accuracy.
[3.1487133502960205, 0.5]
The predicted probabilities on the training set are:
Categories of recurrent neural networks: sequence to sequence, sequence to vector, vector to sequence, encoder-decoder network.
Input and output sequences
Sequence to sequence: useful for predicting time series, e.g. using the prices over the last N days to output the prices shifted one day into the future (i.e. from N-1 days ago to tomorrow).
Sequence to vector: ignore all outputs in the previous time steps except for the last one. Example: give a sentiment score to a sequence of words corresponding to a movie review.
Input and output sequences
Vector to sequence: feed the network the same input vector over and over at each time step and let it output a sequence. Example: given an image as input, generate a caption for it. The image is treated as an input vector (the pixels of an image do not form a sequence), while the caption is a sequence of words describing the image. A dataset of images paired with their descriptions is used to train the RNN.
The Encoder-Decoder: The encoder is a sequence-to-vector network. The decoder is a vector-to-sequence network. Example: Feed the network a sequence in one language. Use the encoder to convert the sentence into a single vector representation. The decoder decodes this vector into the translation of the sentence in another language.
Recurrent layers can be stacked.
Deep RNN unrolled through time.
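A hedged Keras sketch of stacking (the unit counts are illustrative): every recurrent layer except the last needs `return_sequences=True` so that the next layer receives a full sequence rather than a single vector.

```python
from keras.models import Sequential
from keras.layers import GRU, Dense

model_stacked = Sequential([
    GRU(32, return_sequences=True),  # passes its whole output sequence to the next layer
    GRU(16),                         # returns only its final output
    Dense(1),
])
model_stacked.compile(optimizer="adam", loss="mae")
```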
CoreLogic Hedonic Home Value Index
Australian House Price Indices
Note
I apologise in advance for not being able to share this dataset with anyone (it is not mine to share).
Keras has a built-in method for converting a time series into subsequences/chunks.
from keras.utils import timeseries_dataset_from_array

integers = range(10)
dummy_dataset = timeseries_dataset_from_array(
    data=integers[:-3],
    targets=integers[3:],
    sequence_length=3,
    batch_size=2,
)

for inputs, targets in dummy_dataset:
    for i in range(inputs.shape[0]):
        print([int(x) for x in inputs[i]], int(targets[i]))
# Num. of input time series.
num_ts = changes.shape[1]

# How many prev. months to use.
seq_length = 6

# Predict the next month ahead.
ahead = 1

# The index of the first target.
delay = seq_length + ahead - 1
# Which suburb to predict.
target_suburb = changes["Sydney"]

train_ds = timeseries_dataset_from_array(
    changes[:-delay],
    targets=target_suburb[delay:],
    sequence_length=seq_length,
    end_index=num_train,
)
This model has 3191 parameters.
Epoch 57: early stopping
Restoring model weights from the end of the best epoch: 7.
CPU times: user 2.71 s, sys: 293 ms, total: 3 s
Wall time: 2.68 s
Plot the model
from keras.utils import plot_model

plot_model(model_dense, show_shapes=True)
Assess the fits
model_dense.evaluate(X_val, y_val, verbose=0)
1.1644607782363892
y_pred = model_dense.predict(X_val, verbose=0)
plt.plot(y_val, label="Sydney")
plt.plot(y_pred, label="Dense")
plt.xlabel("Time")
plt.ylabel("Change in HPI (%)")
plt.legend(frameon=False);
This model has 2951 parameters.
Epoch 62: early stopping
Restoring model weights from the end of the best epoch: 12.
CPU times: user 3.59 s, sys: 1.45 s, total: 5.04 s
Wall time: 3.31 s
Plot the model
plot_model(model_simple, show_shapes=True)
Assess the fits
model_simple.evaluate(X_val, y_val, verbose=0)
1.2507916688919067
y_pred = model_simple.predict(X_val, verbose=0)
plt.plot(y_val, label="Sydney")
plt.plot(y_pred, label="SimpleRNN")
plt.xlabel("Time")
plt.ylabel("Change in HPI (%)")
plt.legend(frameon=False);
Epoch 57: early stopping
Restoring model weights from the end of the best epoch: 7.
CPU times: user 3.8 s, sys: 377 ms, total: 4.17 s
Wall time: 3.54 s
Assess the fits
model_gru.evaluate(X_val, y_val, verbose=0)
0.7435100078582764
y_pred = model_gru.predict(X_val, verbose=0)
plt.plot(y_val, label="Sydney")
plt.plot(y_pred, label="GRU")
plt.xlabel("Time")
plt.ylabel("Change in HPI (%)")
plt.legend(frameon=False);
Epoch 56: early stopping
Restoring model weights from the end of the best epoch: 6.
CPU times: user 5.24 s, sys: 487 ms, total: 5.72 s
Wall time: 4.64 s
Assess the fits
model_two_grus.evaluate(X_val, y_val, verbose=0)
0.7989509105682373
y_pred = model_two_grus.predict(X_val, verbose=0)
plt.plot(y_val, label="Sydney")
plt.plot(y_pred, label="2 GRUs")
plt.xlabel("Time")
plt.ylabel("Change in HPI (%)")
plt.legend(frameon=False);
This model has 3317 parameters.
Epoch 75: early stopping
Restoring model weights from the end of the best epoch: 25.
CPU times: user 3.53 s, sys: 387 ms, total: 3.91 s
Wall time: 3.5 s
This model has 3257 parameters.
Epoch 70: early stopping
Restoring model weights from the end of the best epoch: 20.
CPU times: user 4.17 s, sys: 409 ms, total: 4.58 s
Wall time: 3.7 s
Epoch 74: early stopping
Restoring model weights from the end of the best epoch: 24.
CPU times: user 4.91 s, sys: 455 ms, total: 5.37 s
Wall time: 4.39 s
Epoch 70: early stopping
Restoring model weights from the end of the best epoch: 20.
CPU times: user 4.69 s, sys: 435 ms, total: 5.12 s
Wall time: 4.09 s
Epoch 67: early stopping
Restoring model weights from the end of the best epoch: 17.
CPU times: user 5.81 s, sys: 547 ms, total: 6.36 s
Wall time: 5.22 s