import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error as mse
from sklearn import set_config
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
set_config(transform_output="pandas")
Exercise: Sydney Temperature Forecasting
ACTL3143 & ACTL5111 Deep Learning for Actuaries
In this task, you will forecast tomorrow’s maximum temperature using Bureau of Meteorology data for Sydney Airport. The initial dataset is available here.
The data
Start by reading the data dictionary for the dataset.
Then load up the necessary packages.
Load the data and inspect it.
df = pd.read_csv("https://laub.au/ai/data/DC02D_Data_066037_9999999910249598.txt", low_memory=False).iloc[:-1]
df
dc | Station Number | Year | Month | Day | Precipitation in the 24 hours before 9am (local time) in mm | Quality of precipitation value | Number of days of rain within the days of accumulation | Accumulated number of days over which the precipitation was measured | Precipitation since last observation at 00 hours Local Time in mm | ... | Total cloud amount at 09 hours in eighths | Quality of total cloud amount at 09 hours Local Time | Total cloud amount at 12 hours in eighths | Quality of total cloud amount at 12 hours Local Time | Total cloud amount at 15 hours in eighths | Quality of total cloud amount at 15 hours Local Time | Total cloud amount at 18 hours in eighths | Quality of total cloud amount at 18 hours Local Time | Total cloud amount at 21 hours in eighths | Quality of total cloud amount at 21 hours Local Time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | dc | 66037 | 1991 | 1 | 1 | 0.0 | Y | ... | 7 | Y | 7 | Y | 7 | Y | 7 | Y | 5 | Y | |||
1 | dc | 66037 | 1991 | 1 | 2 | 0.2 | Y | 1 | 1 | ... | 5 | Y | 2 | Y | 5 | Y | 5 | Y | 1 | Y | |
2 | dc | 66037 | 1991 | 1 | 3 | 0.0 | Y | ... | 2 | Y | 1 | Y | 1 | Y | 1 | Y | 0 | Y | |||
3 | dc | 66037 | 1991 | 1 | 4 | 0.0 | Y | ... | 0 | Y | 1 | Y | 2 | Y | 2 | Y | 1 | Y | |||
4 | dc | 66037 | 1991 | 1 | 5 | 0.0 | Y | ... | 2 | Y | 7 | Y | 8 | Y | 8 | Y | 8 | Y | |||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11652 | dc | 66037 | 2022 | 11 | 26 | 0.4 | N | 1 | 0.0 | ... | 5 | N | 2 | N | 2 | N | 2 | N | |||
11653 | dc | 66037 | 2022 | 11 | 27 | 0.2 | N | 1 | 0.0 | ... | 7 | N | 8 | N | 4 | N | 4 | N | 7 | N | |
11654 | dc | 66037 | 2022 | 11 | 28 | 7.0 | N | 1 | 3.2 | ... | 6 | N | 5 | N | 6 | N | 3 | N | 2 | N | |
11655 | dc | 66037 | 2022 | 11 | 29 | 0.2 | N | 1 | 0.0 | ... | 1 | N | 1 | N | 1 | N | 3 | N | 7 | N | |
11656 | dc | 66037 | 2022 | 11 | 30 | 0.2 | N | 1 | 0.0 | ... | 7 | N | 7 | N | 7 | N | 1 | N | 7 | N |
11657 rows × 120 columns
Preprocessing
Ensure that today’s maximum temperature is stored as floating point numbers.
df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"] = df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"].astype(float)
Create the target variable by shifting the temperature data by one day (and delete the final day which has no target).
df["Tomorrow's Max Temperature"] = df["Maximum temperature in 24 hours after 9am (local time) in Degrees C"].shift(-1)
df = df.iloc[:-1]
Take a look at a subset of the data.
df[["Year", "Month", "Day", "Maximum temperature in 24 hours after 9am (local time) in Degrees C", "Tomorrow's Max Temperature"]]
| | Year | Month | Day | Maximum temperature in 24 hours after 9am (local time) in Degrees C | Tomorrow's Max Temperature |
|---|---|---|---|---|---|
| 0 | 1991 | 1 | 1 | 28.0 | 29.5 |
| 1 | 1991 | 1 | 2 | 29.5 | 31.2 |
| 2 | 1991 | 1 | 3 | 31.2 | 33.2 |
| 3 | 1991 | 1 | 4 | 33.2 | 36.8 |
| 4 | 1991 | 1 | 5 | 36.8 | 26.7 |
| ... | ... | ... | ... | ... | ... |
| 11651 | 2022 | 11 | 25 | 25.1 | 22.8 |
| 11652 | 2022 | 11 | 26 | 22.8 | 28.2 |
| 11653 | 2022 | 11 | 27 | 28.2 | 23.3 |
| 11654 | 2022 | 11 | 28 | 23.3 | 24.0 |
| 11655 | 2022 | 11 | 29 | 24.0 | 21.8 |
11656 rows × 5 columns
Try plotting the data to see if there are any trends.
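One possible way to look for trends is sketched below. It assumes the `Year`, `Month` and `Day` columns hold integers that form valid dates. The helper name `plot_temperature_trends`, the 365-day window and the two-panel layout are illustrative choices, not part of the exercise.

```python
import pandas as pd
import matplotlib.pyplot as plt

TEMP_COL = "Maximum temperature in 24 hours after 9am (local time) in Degrees C"


def plot_temperature_trends(df, temp_col=TEMP_COL):
    """Plot daily maxima with a rolling mean, plus the mean for each calendar month.

    Hypothetical helper: assumes integer Year/Month/Day columns.
    """
    # Build a date index from the separate Year/Month/Day columns.
    dates = pd.to_datetime(df[["Year", "Month", "Day"]].rename(columns=str.lower))
    temps = pd.Series(df[temp_col].to_numpy(dtype=float), index=dates)

    # A 365-day rolling mean averages over the seasonal cycle,
    # which makes any slower drift easier to see.
    rolling = temps.rolling(365, min_periods=180).mean()
    # Averaging by calendar month summarises the seasonal pattern.
    monthly = temps.groupby(temps.index.month).mean()

    fig, (ax_daily, ax_monthly) = plt.subplots(1, 2, figsize=(12, 4))
    ax_daily.plot(temps.index, temps.to_numpy(), linewidth=0.3, label="Daily maximum")
    ax_daily.plot(rolling.index, rolling.to_numpy(), label="365-day rolling mean")
    ax_daily.set_ylabel("Degrees C")
    ax_daily.legend()
    ax_monthly.bar(monthly.index, monthly.to_numpy())
    ax_monthly.set_xlabel("Month")
    ax_monthly.set_ylabel("Mean maximum (Degrees C)")
    return fig, monthly
```

Calling `plot_temperature_trends(df)` on the loaded data frame returns the figure and the monthly averages, so the seasonal pattern can also be inspected as numbers.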
Forecast using Sydney Airport’s weather data
# TODO: Split the data into training, validation and test sets.
# TODO: Consider a different imputation for some variables.
# E.g. it may be the case that some missing values (like precipitation) are actually 0.
# Another idea may be to simply throw out some columns with too many missing values.
# TODO: Rescale the data
# TODO: Fit a neural network model
# TODO: Report the RMSE on the validation set (if comparing multiple NNs) and on the test set (for the final/best model).
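A minimal sketch of the modelling and evaluation steps is given below, assuming the features have already been imputed and rescaled and that the splits are in time order. The architecture (one hidden layer of 32 units), the patience of 10 epochs, and the helper names `fit_network` and `rmse` are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import set_random_seed


def fit_network(X_train, y_train, X_val, y_val, epochs=200, seed=1):
    """Fit a small feed-forward regression network with early stopping."""
    set_random_seed(seed)  # make the weight initialisation repeatable
    X_train, X_val = (np.asarray(a, dtype="float32") for a in (X_train, X_val))
    y_train, y_val = (np.asarray(a, dtype="float32") for a in (y_train, y_val))

    model = Sequential([
        Input(shape=(X_train.shape[1],)),
        Dense(32, activation="relu"),
        Dense(1),  # single linear output: tomorrow's maximum temperature
    ])
    model.compile(optimizer="adam", loss="mse")

    # Stop once the validation loss stalls and keep the best weights seen.
    stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=epochs, callbacks=[stop], verbose=0)
    return model


def rmse(model, X, y):
    """Root mean squared error of the model's predictions, in the target's units."""
    preds = model.predict(np.asarray(X, dtype="float32"), verbose=0).ravel()
    return float(np.sqrt(np.mean((np.asarray(y, dtype="float32") - preds) ** 2)))
```

With several candidate networks, `rmse(model, X_val, y_val)` can be compared across them, and `rmse(model, X_test, y_test)` computed once for the chosen model. A useful reference point is the RMSE of simply predicting that tomorrow's maximum equals today's.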
Forecast using multiple Sydney weather stations’ data
Download the full dataset. It is in a similar format to the Sydney Airport dataset, but with one file per weather station. Incorporate the data from other weather stations into your model, without leaking future data into your forecasts.
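One way to combine stations is to join each station's observations onto the Sydney Airport rows by date, as sketched below. The helper `add_station_features`, the `stations` dictionary, and the column suffix format are hypothetical. The sketch assumes each station frame has at most one row per date and that the chosen feature columns are known by the time the forecast is made. Joining on the same `Year`, `Month` and `Day` means a row only ever sees other stations' data from that day, while the target remains the airport's next-day maximum.

```python
import pandas as pd

DATE_KEYS = ["Year", "Month", "Day"]


def add_station_features(base, stations, feature_cols):
    """Left-join same-day observations from other stations onto the base frame.

    base:         the Sydney Airport frame (one row per day, with the target).
    stations:     hypothetical dict mapping a station label to its data frame.
    feature_cols: columns to copy across from every other station.
    """
    merged = base.copy()
    for name, station_df in stations.items():
        # Suffix the copied columns so that each station's values stay distinct.
        renamed = {col: f"{col} ({name})" for col in feature_cols}
        extra = station_df[DATE_KEYS + list(feature_cols)].rename(columns=renamed)
        # A left join keeps every airport day; days that a station did not
        # record become missing values, to be imputed after the split.
        merged = merged.merge(extra, on=DATE_KEYS, how="left", validate="one_to_one")
    return merged
```

To avoid leakage further down the pipeline, the split should stay chronological, and any imputation or rescaling should still be fitted on the training period only. Filling gaps by interpolating or back-filling across dates would pull later observations into earlier rows.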