Exercise: Sydney Temperature Forecasting

ACTL3143 & ACTL5111 Deep Learning for Actuaries

In this exercise, you will forecast tomorrow’s maximum temperature using Bureau of Meteorology data for Sydney Airport. The initial dataset is available here.

DALL-E’s rendition of this Sydney Airport maximum temperature forecasting task.

The data

Start by reading the data dictionary for the dataset.

Then load up the necessary packages.

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error as mse
from sklearn import set_config

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

set_config(transform_output="pandas")

Load the data and inspect it.

df = pd.read_csv("https://laub.au/ai/data/DC02D_Data_066037_9999999910249598.txt", low_memory=False).iloc[:-1]
df

	dc	Station Number	Year	Month	Day	Precipitation in the 24 hours before 9am (local time) in mm	Quality of precipitation value	Number of days of rain within the days of accumulation	Accumulated number of days over which the precipitation was measured	Precipitation since last observation at 00 hours Local Time in mm	...	Total cloud amount at 09 hours in eighths	Quality of total cloud amount at 09 hours Local Time	Total cloud amount at 12 hours in eighths	Quality of total cloud amount at 12 hours Local Time	Total cloud amount at 15 hours in eighths	Quality of total cloud amount at 15 hours Local Time	Total cloud amount at 18 hours in eighths	Quality of total cloud amount at 18 hours Local Time	Total cloud amount at 21 hours in eighths	Quality of total cloud amount at 21 hours Local Time
0	dc	66037	1991	1	1	0.0	Y				...	7	Y	7	Y	7	Y	7	Y	5	Y
1	dc	66037	1991	1	2	0.2	Y	1	1		...	5	Y	2	Y	5	Y	5	Y	1	Y
2	dc	66037	1991	1	3	0.0	Y				...	2	Y	1	Y	1	Y	1	Y	0	Y
3	dc	66037	1991	1	4	0.0	Y				...	0	Y	1	Y	2	Y	2	Y	1	Y
4	dc	66037	1991	1	5	0.0	Y				...	2	Y	7	Y	8	Y	8	Y	8	Y
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
11652	dc	66037	2022	11	26	0.4	N		1	0.0	...	5	N	2	N	2	N			2	N
11653	dc	66037	2022	11	27	0.2	N		1	0.0	...	7	N	8	N	4	N	4	N	7	N
11654	dc	66037	2022	11	28	7.0	N		1	3.2	...	6	N	5	N	6	N	3	N	2	N
11655	dc	66037	2022	11	29	0.2	N		1	0.0	...	1	N	1	N	1	N	3	N	7	N
11656	dc	66037	2022	11	30	0.2	N		1	0.0	...	7	N	7	N	7	N	1	N	7	N

11657 rows × 120 columns

Preprocessing

Ensure that today’s maximum temperature is stored as floating point numbers.

df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"] = df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"].astype(float)

Create the target variable by shifting the temperature column back by one day, so that each row holds the following day’s maximum, and drop the final day, which has no target.

df["Tomorrow's Max Temperature"] = df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"].shift(-1)
df = df.iloc[:-1]

Take a look at a subset of the data.

df[["Year", "Month", "Day", "Maximum temperature in 24 hours after 9am (local time) in Degrees C", "Tomorrow's Max Temperature"]]

	Year	Month	Day	Maximum temperature in 24 hours after 9am (local time) in Degrees C	Tomorrow's Max Temperature
0	1991	1	1	28.0	29.5
1	1991	1	2	29.5	31.2
2	1991	1	3	31.2	33.2
3	1991	1	4	33.2	36.8
4	1991	1	5	36.8	26.7
...	...	...	...	...	...
11651	2022	11	25	25.1	22.8
11652	2022	11	26	22.8	28.2
11653	2022	11	27	28.2	23.3
11654	2022	11	28	23.3	24.0
11655	2022	11	29	24.0	21.8

11656 rows × 5 columns

Try plotting the data to see if there are any trends.
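One possible way to look for trends is sketched below. It assumes `df` has the `Year`, `Month` and `Day` columns and the maximum temperature column shown above. The helper name `plot_max_temperature` and the 365-day rolling window are illustrative choices, not part of the exercise.

```python
import pandas as pd
import matplotlib.pyplot as plt

MAX_TEMP = "Maximum temperature in 24 hours after 9am (local time) in Degrees C"


def plot_max_temperature(df, window=365):
    """Plot the daily maximum and its rolling mean (hypothetical helper)."""
    # Assemble proper dates from the separate Year/Month/Day columns.
    dates = pd.to_datetime(df[["Year", "Month", "Day"]].rename(columns=str.lower))
    temps = pd.Series(df[MAX_TEMP].to_numpy(dtype=float), index=dates)

    fig, ax = plt.subplots(figsize=(10, 4))
    temps.plot(ax=ax, alpha=0.3, label="Daily maximum")
    # A long rolling mean averages out the seasonal cycle so slow trends stand out.
    smoothed = temps.rolling(window, min_periods=1).mean()
    smoothed.plot(ax=ax, label=f"{window}-day rolling mean")
    ax.set_xlabel("Date")
    ax.set_ylabel("Maximum temperature (°C)")
    ax.legend()
    return ax
```

Calling `plot_max_temperature(df)` and then `plt.show()` should display the seasonal cycle along with any slower drift in the rolling mean.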

Forecast using Sydney Airport’s weather data

# TODO: Split the data into training, validation and test sets.
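Because this is a time series, a random shuffle would let the model train on days that come after the validation and test days. A minimal sketch of a chronological split follows, using `train_test_split` with `shuffle=False`. The helper name `chronological_split` and the 60/20/20 proportions are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def chronological_split(features, target, val_frac=0.2, test_frac=0.2):
    """Split rows in time order into train/validation/test (hypothetical helper)."""
    # Work with absolute row counts so the two splits add up exactly.
    n_test = round(len(features) * test_frac)
    n_val = round(len(features) * val_frac)

    # The most recent rows become the test set.
    X_main, X_test, y_main, y_test = train_test_split(
        features, target, test_size=n_test, shuffle=False
    )
    # The latest rows of what remains become the validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        X_main, y_main, test_size=n_val, shuffle=False
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

This relies on the rows already being sorted by date, as they are in the table above.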

# TODO: Consider a different imputation for some variables.
# E.g. it may be the case that some missing values (like precipitation) are actually 0.
# Another idea may be to simply throw out some columns with too many missing values.

# TODO: Rescale the data

# TODO: Fit a neural network model
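As a starting point, a small feed-forward network with early stopping might look like the sketch below. The layer sizes, the optimiser, the patience and the helper names `build_model` and `fit_model` are arbitrary assumptions to experiment with, not a recommended architecture.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping


def build_model(num_features):
    """Small regression network with one linear output (illustrative sizes)."""
    model = Sequential([
        Input(shape=(num_features,)),
        Dense(32, activation="relu"),
        Dense(16, activation="relu"),
        Dense(1),  # Linear output, since temperature is a real-valued target.
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def fit_model(model, X_train, y_train, X_val, y_val, epochs=200):
    """Train until the validation loss stops improving (hypothetical helper)."""
    early_stop = EarlyStopping(patience=10, restore_best_weights=True)
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=epochs,
        callbacks=[early_stop],
        verbose=0,
    )
```

The validation set is used here only to decide when to stop training; the test set stays untouched until the final model is chosen.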

# TODO: Report the RMSE on the validation set (if comparing multiple NNs) and on the test set (for the final/best model).
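The RMSE is the square root of the mean squared error, so it is in degrees Celsius like the target. A short sketch follows, with hypothetical helper names `rmse` and `persistence_rmse`. The second function scores the naive forecast that tomorrow’s maximum equals today’s, which is suggested here only as a yardstick for judging whether a network is adding anything.

```python
import numpy as np
from sklearn.metrics import mean_squared_error as mse


def rmse(y_true, y_pred):
    """Root mean squared error; flattens (n, 1) network outputs to length n."""
    return float(np.sqrt(mse(np.ravel(y_true), np.ravel(y_pred))))


def persistence_rmse(today_max, tomorrow_max):
    """RMSE of the naive forecast 'tomorrow's maximum equals today's'."""
    return rmse(tomorrow_max, today_max)
```

For example, `rmse(y_val, model.predict(X_val))` would give the validation RMSE of a fitted Keras model, assuming `y_val` and `X_val` come from your own split.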

Forecast using multiple Sydney weather stations’ data

Download the full dataset. It is in a similar format to the Sydney Airport dataset, but with one file per weather station. Incorporate the data from other weather stations into your model, without leaking future data into your forecasts.
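One way to bring in another station is to join its observations onto the Sydney Airport rows by date, as sketched below. The helper name `add_station_features` and the column prefix are hypothetical, and the sketch assumes each file has one row per day with the same `Year`, `Month` and `Day` columns. Joining on the same date means each row only sees other stations’ values from that day, never from the day being forecast, provided the other stations’ columns are not shifted backwards in time.

```python
import pandas as pd


def add_station_features(main, other, station_name, date_cols=("Year", "Month", "Day")):
    """Left-join another station's same-day observations (hypothetical helper)."""
    date_cols = list(date_cols)
    # Prefix the other station's columns so they do not clash with the main ones.
    renamed = other.rename(
        columns={c: f"{station_name}: {c}" for c in other.columns if c not in date_cols}
    )
    # A left join keeps exactly one row per Sydney Airport day; days the other
    # station did not report become missing values to be imputed later.
    return main.merge(renamed, on=date_cols, how="left", validate="one_to_one")
```

The target column should still come only from the Sydney Airport data, and any imputation or scaling of the new columns should again be fitted on the training period alone.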