Exercise: Sydney Temperature Forecasting

ACTL3143 & ACTL5111 Deep Learning for Actuaries

In this task you will forecast tomorrow’s maximum temperature using Bureau of Meteorology data for Sydney Airport. The initial dataset is available here.

DALL-E’s rendition of this Sydney Airport maximum temperature forecasting task.

The data

Start by reading the data dictionary for the dataset.

Then load up the necessary packages.

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error as mse
from sklearn import set_config

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

set_config(transform_output="pandas")

Load the data and inspect it.

df = pd.read_csv("https://laub.au/ai/data/DC02D_Data_066037_9999999910249598.txt", low_memory=False).iloc[:-1]
df
dc Station Number Year Month Day Precipitation in the 24 hours before 9am (local time) in mm Quality of precipitation value Number of days of rain within the days of accumulation Accumulated number of days over which the precipitation was measured Precipitation since last observation at 00 hours Local Time in mm ... Total cloud amount at 09 hours in eighths Quality of total cloud amount at 09 hours Local Time Total cloud amount at 12 hours in eighths Quality of total cloud amount at 12 hours Local Time Total cloud amount at 15 hours in eighths Quality of total cloud amount at 15 hours Local Time Total cloud amount at 18 hours in eighths Quality of total cloud amount at 18 hours Local Time Total cloud amount at 21 hours in eighths Quality of total cloud amount at 21 hours Local Time
0 dc 66037 1991 1 1 0.0 Y ... 7 Y 7 Y 7 Y 7 Y 5 Y
1 dc 66037 1991 1 2 0.2 Y 1 1 ... 5 Y 2 Y 5 Y 5 Y 1 Y
2 dc 66037 1991 1 3 0.0 Y ... 2 Y 1 Y 1 Y 1 Y 0 Y
3 dc 66037 1991 1 4 0.0 Y ... 0 Y 1 Y 2 Y 2 Y 1 Y
4 dc 66037 1991 1 5 0.0 Y ... 2 Y 7 Y 8 Y 8 Y 8 Y
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11652 dc 66037 2022 11 26 0.4 N 1 0.0 ... 5 N 2 N 2 N 2 N
11653 dc 66037 2022 11 27 0.2 N 1 0.0 ... 7 N 8 N 4 N 4 N 7 N
11654 dc 66037 2022 11 28 7.0 N 1 3.2 ... 6 N 5 N 6 N 3 N 2 N
11655 dc 66037 2022 11 29 0.2 N 1 0.0 ... 1 N 1 N 1 N 3 N 7 N
11656 dc 66037 2022 11 30 0.2 N 1 0.0 ... 7 N 7 N 7 N 1 N 7 N

11657 rows × 120 columns

Preprocessing

Ensure that today’s maximum temperature is stored as floating point numbers.

df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"] = df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"].astype(float)

Create the target variable by shifting the temperature data back by one day, so that each row holds the following day's maximum, and delete the final day, which has no target.

df["Tomorrow's Max Temperature"] = df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"].shift(-1)
df = df.iloc[:-1]

Take a look at a subset of the data.

df[["Year", "Month", "Day", "Maximum temperature in 24 hours after 9am (local time) in Degrees C", "Tomorrow's Max Temperature"]]
       Year  Month  Day  Maximum temperature in 24 hours after 9am (local time) in Degrees C  Tomorrow's Max Temperature
0      1991      1    1                                                                 28.0                        29.5
1      1991      1    2                                                                 29.5                        31.2
2      1991      1    3                                                                 31.2                        33.2
3      1991      1    4                                                                 33.2                        36.8
4      1991      1    5                                                                 36.8                        26.7
...     ...    ...  ...                                                                  ...                         ...
11651  2022     11   25                                                                 25.1                        22.8
11652  2022     11   26                                                                 22.8                        28.2
11653  2022     11   27                                                                 28.2                        23.3
11654  2022     11   28                                                                 23.3                        24.0
11655  2022     11   29                                                                 24.0                        21.8

11656 rows × 5 columns
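The shift used to build the target treats the next row as the next calendar day, which only holds if the record has no missing days. A minimal sketch of how one might check this, using a toy frame in place of df (the toy values and the names toy, dates, gap_to_next and has_next_day are illustrative and not part of the exercise):

```python
import pandas as pd

# Toy stand-in for the Year/Month/Day columns of df; 3 Jan 1991 is deliberately missing.
toy = pd.DataFrame({
    "Year": [1991, 1991, 1991, 1991],
    "Month": [1, 1, 1, 1],
    "Day": [1, 2, 4, 5],
})

# Assemble a proper date from the three columns (lower-cased to match what to_datetime expects).
dates = pd.to_datetime(toy[["Year", "Month", "Day"]].rename(columns=str.lower))

# Gap between each row and the row after it; it should be exactly one day.
gap_to_next = dates.shift(-1) - dates
has_next_day = gap_to_next == pd.Timedelta(days=1)

print(has_next_day.tolist())
```

Rows where has_next_day is False have a shifted target that is not really the next day's maximum, so one option is to drop them before modelling.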

Try plotting the data to see if there are any trends.
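One possible way to do this is to plot the daily series with a rolling mean (to show the seasonal cycle) alongside yearly averages (to show any long-run drift). The sketch below runs on synthetic data so that it is self-contained; the frame toy, the 30-day window and the seasonal shape are assumptions, and in the exercise you would use df instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

TEMP = "Maximum temperature in 24 hours after 9am (local time) in Degrees C"

# Synthetic stand-in for df (a seasonal cycle plus noise), only so the sketch runs on its own.
rng = np.random.default_rng(0)
days = pd.date_range("1991-01-01", periods=4 * 365, freq="D")
seasonal = 22 + 5 * np.cos(2 * np.pi * days.dayofyear.to_numpy() / 365.25)
toy = pd.DataFrame({
    "Year": days.year, "Month": days.month, "Day": days.day,
    TEMP: seasonal + rng.normal(0, 3, len(days)),
})

# Build a date index so the x-axis is in calendar time.
dates = pd.to_datetime(toy[["Year", "Month", "Day"]].rename(columns=str.lower))
series = pd.Series(toy[TEMP].to_numpy(), index=pd.DatetimeIndex(dates), name=TEMP)

rolling_30d = series.rolling(30, min_periods=1).mean()    # smooths day-to-day noise
yearly_mean = series.groupby(series.index.year).mean()    # one point per calendar year

fig, (ax_daily, ax_yearly) = plt.subplots(2, 1, figsize=(8, 6))
series.plot(ax=ax_daily, alpha=0.4, label="Daily maximum")
rolling_30d.plot(ax=ax_daily, label="30-day rolling mean")
ax_daily.set_ylabel("Degrees C")
ax_daily.legend()
yearly_mean.plot(ax=ax_yearly, marker="o")
ax_yearly.set_xlabel("Year")
ax_yearly.set_ylabel("Mean daily maximum (Degrees C)")
fig.tight_layout()
```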

Forecast using Sydney Airport’s weather data

# TODO: Split the data into training, validation and test sets.
# TODO: Consider a different imputation for some variables.
# E.g. it may be the case that some missing values (like precipitation) are actually 0.
# Another idea may be to simply throw out some columns with too many missing values.
# TODO: Rescale the data
# TODO: Fit a neural network model
# TODO: Report the RMSE on the validation set (if comparing multiple NNs) and on the test set (for the final/best model).
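For the last two steps, the following is a minimal sketch of fitting a small network with early stopping and reporting RMSE. It uses synthetic, already-scaled arrays in place of the real splits, and the architecture (one hidden layer of 32 units), the patience of 10, the epoch cap and the centring of the target around its training mean are all illustrative choices rather than recommendations from the exercise.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import set_random_seed

set_random_seed(0)  # makes weight initialisation and batch order repeatable

# Synthetic, already-scaled stand-ins for the training/validation/test splits.
rng = np.random.default_rng(0)

def make_split(n):
    X = rng.normal(size=(n, 3)).astype("float32")
    y = (22 + 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, n)).astype("float32")
    return X, y

X_train, y_train = make_split(600)
X_val, y_val = make_split(200)
X_test, y_test = make_split(200)

# Centre the target using the training mean so the network starts near the right level.
y_mean = y_train.mean()

model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(32, activation="relu"),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for 10 epochs and keep the best weights.
es = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(
    X_train, y_train - y_mean,
    validation_data=(X_val, y_val - y_mean),
    epochs=100, callbacks=[es], verbose=0,
)

def rmse(X, y):
    """Root mean squared error in degrees C, after undoing the centring."""
    preds = model.predict(X, verbose=0).ravel() + y_mean
    return float(np.sqrt(np.mean((y - preds) ** 2)))

val_rmse = rmse(X_val, y_val)     # use this to compare candidate networks
test_rmse = rmse(X_test, y_test)  # report this once, for the final model only
print(f"Validation RMSE: {val_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

Taking the square root puts the error back in degrees Celsius, which is easier to interpret than the MSE used as the training loss.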

Forecast using multiple Sydney weather stations’ data

Download the full dataset. It is in a similar format to the Sydney Airport dataset, but with one file per weather station. Incorporate the data from the other weather stations into your model, without leaking future data into your forecasts.
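One way to combine stations, sketched here on toy frames, is to suffix each station's columns with its station number and join them on the calendar date, so that every feature in a row describes the same day. The second station number (99999), its values and the helper station_features are hypothetical; the real files would be read with pd.read_csv as for the airport data.

```python
import pandas as pd

TEMP = "Maximum temperature in 24 hours after 9am (local time) in Degrees C"
KEYS = ["Year", "Month", "Day"]

# Toy stand-ins for two station files; station 99999 and its readings are made up.
airport = pd.DataFrame({
    "Station Number": 66037,
    "Year": 1991, "Month": 1, "Day": [1, 2, 3, 4],
    TEMP: [28.0, 29.5, 31.2, 33.2],
})
other = pd.DataFrame({
    "Station Number": 99999,
    "Year": 1991, "Month": 1, "Day": [1, 2, 4],  # 3 Jan is missing at this station
    TEMP: [27.1, 28.4, 32.0],
})

def station_features(station_df):
    """Index by date and suffix every remaining column with the station number."""
    station = int(station_df["Station Number"].iloc[0])
    feats = station_df.drop(columns="Station Number").set_index(KEYS)
    return feats.add_suffix(f" [{station}]")

# Align on the same calendar day; the left join keeps every airport day
# and leaves NaN where the other station has no record.
wide = station_features(airport).join(station_features(other), how="left")

# Only the airport target is shifted. All features stay at day t for every station.
wide["Tomorrow's Max Temperature"] = wide[f"{TEMP} [66037]"].shift(-1)
wide = wide.iloc[:-1]
print(wide)
```

Joining on the date keeps the other stations' readings on the same day as the airport's, so no column other than the target refers to tomorrow. The gaps introduced by the join (such as 3 January here) then go through the same imputation step as any other missing value.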