Exercise: Victorian Car Crash Severity

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Your task is to predict whether a given car crash will be a high-severity or a low-severity incident. You will use a dataset of car crashes in Victoria to which the police were called to assist. The dataset is available here (original source).

DALL-E’s rendition of this Victorian car crash severity classification task.

Your network must use entity embedding for at least one of the categorical variables (e.g. DCA_CODE) and train on a mix of categorical and numerical features. The target variable is the binary indicator that SEVERITY is greater than 2. Report the accuracy of your classifier and give a confusion matrix.
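As a rough illustration of what entity embedding means, the sketch below treats an embedding as a lookup table with one row of weights per category; looking up a row gives the same result as multiplying a one-hot vector by that table. The sizes (5 categories, 2 dimensions), the random weights and the variable names are made up for illustration. In a Keras model the table would be the weight matrix of a `keras.layers.Embedding` layer and would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(42)

num_categories = 5   # hypothetical; e.g. the number of distinct codes
embedding_dim = 2    # hypothetical; a tuning choice

# The embedding table: one row of weights per category.
# Random here; in a real network these weights are learned.
embedding_table = rng.normal(size=(num_categories, embedding_dim))

# Categories must be integer-coded as 0, 1, ..., num_categories - 1.
codes = np.array([3, 0, 3, 1])

# Embedding lookup: pick the row belonging to each code.
embedded = embedding_table[codes]

# Equivalent (but slower): one-hot encode, then matrix multiply.
one_hot = np.eye(num_categories)[codes]
embedded_via_one_hot = one_hot @ embedding_table

print(embedded.shape)
print(np.allclose(embedded, embedded_via_one_hot))
```

The practical consequence is that the categorical column fed to an embedding layer has to be integer-coded from zero, which is what an ordinal encoding provides.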

Questions:

The data

Start by reading the data dictionary for the dataset.

import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

set_config(transform_output="pandas")
df_raw = pd.read_csv("https://laub.au/ai/data/ACCIDENT.csv", low_memory=False)
df_raw
ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME ACCIDENT_TYPE Accident Type Desc DAY_OF_WEEK Day Week Description DCA_CODE DCA Description DIRECTORY ... NO_PERSONS NO_PERSONS_INJ_2 NO_PERSONS_INJ_3 NO_PERSONS_KILLED NO_PERSONS_NOT_INJ POLICE_ATTEND ROAD_GEOMETRY Road Geometry Desc SEVERITY SPEED_ZONE
0 T20060000010 13/01/2006 12:42:00 1 Collision with vehicle 6 Friday 113 RIGHT NEAR (INTERSECTIONS ONLY) MEL ... 6 0 1 0 5 1 1 Cross intersection 3 60
1 T20060000018 13/01/2006 19:10:00 1 Collision with vehicle 6 Friday 113 RIGHT NEAR (INTERSECTIONS ONLY) MEL ... 4 0 1 0 3 1 2 T intersection 3 70
2 T20060000022 14/01/2006 12:10:00 7 Fall from or in moving vehicle 7 Saturday 190 FELL IN/FROM VEHICLE MEL ... 2 1 0 0 1 1 5 Not at intersection 2 100
3 T20060000023 14/01/2006 11:49:00 1 Collision with vehicle 7 Saturday 130 REAR END(VEHICLES IN SAME LANE) MEL ... 2 1 0 0 1 1 2 T intersection 2 80
4 T20060000026 14/01/2006 10:45:00 1 Collision with vehicle 7 Saturday 121 RIGHT THROUGH MEL ... 3 0 3 0 0 1 5 Not at intersection 3 50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
203703 T20200019239 1/11/2020 12:11:00 1 Collision with vehicle 0 Sunday 142 LEAVING PARKING MEL ... 4 1 0 0 3 1 5 Not at intersection 2 50
203704 T20200019247 1/11/2020 15:30:00 4 Collision with a fixed object 1 Sunday 171 LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... MEL ... 2 2 0 0 0 1 5 Not at intersection 2 999
203705 T20200019250 1/11/2020 18:00:00 1 Collision with vehicle 0 Sunday 116 LEFT NEAR (INTERSECTIONS ONLY) MEL ... 2 1 0 0 1 1 1 Cross intersection 2 60
203706 T20200019253 1/11/2020 12:00:00 6 Vehicle overturned (no collision) 1 Sunday 180 OFF CARRIAGEWAY ON RIGHT BEND VCD ... 1 1 0 0 0 1 5 Not at intersection 2 80
203707 T20200019417 4/11/2020 1:30:00 4 Collision with a fixed object 3 Wednesday 171 LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... MEL ... 1 1 0 0 0 1 5 Not at intersection 2 80

203708 rows × 28 columns

Preprocessing

# Drop observations whose categorical variables take very rare values
# (10 or fewer observations in the dataset).
# This is a crude solution and can surely be improved.
df_simple = df_raw.copy()

sparse_categories = ["DCA_CODE", "LIGHT_CONDITION", "ROAD_GEOMETRY"]

for cat in sparse_categories:
    # Keep only rows whose category value occurs more than 10 times.
    counts = df_simple[cat].map(df_simple[cat].value_counts())
    df_simple = df_simple[counts > 10]

df_simple
ACCIDENT_NO ACCIDENTDATE ACCIDENTTIME ACCIDENT_TYPE Accident Type Desc DAY_OF_WEEK Day Week Description DCA_CODE DCA Description DIRECTORY ... NO_PERSONS NO_PERSONS_INJ_2 NO_PERSONS_INJ_3 NO_PERSONS_KILLED NO_PERSONS_NOT_INJ POLICE_ATTEND ROAD_GEOMETRY Road Geometry Desc SEVERITY SPEED_ZONE
0 T20060000010 13/01/2006 12:42:00 1 Collision with vehicle 6 Friday 113 RIGHT NEAR (INTERSECTIONS ONLY) MEL ... 6 0 1 0 5 1 1 Cross intersection 3 60
1 T20060000018 13/01/2006 19:10:00 1 Collision with vehicle 6 Friday 113 RIGHT NEAR (INTERSECTIONS ONLY) MEL ... 4 0 1 0 3 1 2 T intersection 3 70
2 T20060000022 14/01/2006 12:10:00 7 Fall from or in moving vehicle 7 Saturday 190 FELL IN/FROM VEHICLE MEL ... 2 1 0 0 1 1 5 Not at intersection 2 100
3 T20060000023 14/01/2006 11:49:00 1 Collision with vehicle 7 Saturday 130 REAR END(VEHICLES IN SAME LANE) MEL ... 2 1 0 0 1 1 2 T intersection 2 80
4 T20060000026 14/01/2006 10:45:00 1 Collision with vehicle 7 Saturday 121 RIGHT THROUGH MEL ... 3 0 3 0 0 1 5 Not at intersection 3 50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
203703 T20200019239 1/11/2020 12:11:00 1 Collision with vehicle 0 Sunday 142 LEAVING PARKING MEL ... 4 1 0 0 3 1 5 Not at intersection 2 50
203704 T20200019247 1/11/2020 15:30:00 4 Collision with a fixed object 1 Sunday 171 LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... MEL ... 2 2 0 0 0 1 5 Not at intersection 2 999
203705 T20200019250 1/11/2020 18:00:00 1 Collision with vehicle 0 Sunday 116 LEFT NEAR (INTERSECTIONS ONLY) MEL ... 2 1 0 0 1 1 1 Cross intersection 2 60
203706 T20200019253 1/11/2020 12:00:00 6 Vehicle overturned (no collision) 1 Sunday 180 OFF CARRIAGEWAY ON RIGHT BEND VCD ... 1 1 0 0 0 1 5 Not at intersection 2 80
203707 T20200019417 4/11/2020 1:30:00 4 Collision with a fixed object 3 Wednesday 171 LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... MEL ... 1 1 0 0 0 1 5 Not at intersection 2 80

203692 rows × 28 columns
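One possible refinement of the row-dropping approach, sketched here on a toy column rather than the crash data, is to pool rare categories into a single "other" level so that no observations are lost. The helper name `lump_rare`, the threshold and the placeholder code `-1` are all hypothetical choices, not part of the exercise.

```python
import pandas as pd


def lump_rare(series: pd.Series, min_count: int, other=-1) -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other`."""
    # Map each value to the number of times it occurs in the column.
    counts = series.map(series.value_counts())
    # Keep values that are frequent enough; overwrite the rest.
    return series.where(counts >= min_count, other)


toy = pd.Series([100, 100, 100, 110, 110, 199])
lumped = lump_rare(toy, min_count=2)
print(lumped.tolist())  # [100, 100, 100, 110, 110, -1]
```

Whether pooling is preferable to dropping depends on how different the rare categories are from each other; with only a handful of rows affected, either choice is unlikely to change the results much.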

drop = ["ACCIDENT_NO", "ACCIDENTDATE", "ACCIDENTTIME", "Accident Type Desc", "Day Week Description", "DCA Description",
        "DIRECTORY", "EDITION", "PAGE", "GRID_REFERENCE_X", "GRID_REFERENCE_Y",
        "Light Condition Desc", "NODE_ID", "Road Geometry Desc"]

df = df_simple.drop(columns=drop)

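categorical_variables = ["ACCIDENT_TYPE", "DCA_CODE", "LIGHT_CONDITION", "ROAD_GEOMETRY"]
# Every remaining column is treated as numerical.
# Note: this list still includes SEVERITY, which is used to build the target and is removed from the features below.
numerical_variables = [col for col in df.columns if col not in categorical_variables]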
print(categorical_variables)
print(numerical_variables)
['ACCIDENT_TYPE', 'DCA_CODE', 'LIGHT_CONDITION', 'ROAD_GEOMETRY']
['DAY_OF_WEEK', 'NO_OF_VEHICLES', 'NO_PERSONS', 'NO_PERSONS_INJ_2', 'NO_PERSONS_INJ_3', 'NO_PERSONS_KILLED', 'NO_PERSONS_NOT_INJ', 'POLICE_ATTEND', 'SEVERITY', 'SPEED_ZONE']
# Print the number of unique categories
for cat in categorical_variables:
    print(f"{cat}: {df[cat].nunique()}")
ACCIDENT_TYPE: 9
DCA_CODE: 80
LIGHT_CONDITION: 7
ROAD_GEOMETRY: 7
# Print each categorical variable's codes alongside their descriptions.
# These are taken from df_raw, so categories removed by the rare-value filter above still appear.
categorical_descriptions = ["Accident Type Desc", "DCA Description", "Light Condition Desc", "Road Geometry Desc"]

for cat, desc in zip(categorical_variables, categorical_descriptions):
    df_cat = df_raw[[cat, desc]].drop_duplicates().sort_values(by=[cat]).reset_index(drop=True)
    display(df_cat)
    print()
ACCIDENT_TYPE Accident Type Desc
0 1 Collision with vehicle
1 2 Struck Pedestrian
2 3 Struck animal
3 4 Collision with a fixed object
4 5 collision with some other object
5 6 Vehicle overturned (no collision)
6 7 Fall from or in moving vehicle
7 8 No collision and no object struck
8 9 Other accident



DCA_CODE DCA Description
0 100 PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIG...
1 101 PED EMERGES FROM IN FRONT OF PARKED OR STATION...
2 102 FAR SIDE. PED HIT BY VEHICLE FROM THE LEFT ...
3 103 PED PLAYING/LYING/WORKING/STANDING ON CARRIAGE...
4 104 PED WALKING WITH TRAFFIC
... ... ...
76 192 STRUCK TRAIN
77 193 STRUCK RAILWAY CROSSING FURNITURE
78 194 PARKED CAR RUN AWAY
79 198 OTHER ACCIDENTS NOT CLASSIFIABLE ELSEWHERE ...
80 199 UNKNOWN-NO DETAILS ON MANOEUVRES OF ROAD-USERS...

81 rows × 2 columns

LIGHT_CONDITION Light Condition Desc
0 1 Day
1 2 Dusk/Dawn
2 3 Dark Street lights on
3 4 Dark Street lights off
4 5 Dark No street lights
5 6 Dark Street lights unknown
6 9 Unknown

ROAD_GEOMETRY Road Geometry Desc
0 1 Cross intersection
1 2 T intersection
2 3 Y intersection
3 4 Multiple intersection
4 5 Not at intersection
5 6 Dead end
6 7 Road closure
7 8 Private property
8 9 Unknown

# Binary target: True when SEVERITY is greater than 2.
target = df["SEVERITY"] > 2
features = df.drop("SEVERITY", axis=1)

Classification task

This is for you to complete.
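
Whatever model you fit, the reporting step might look like the following sketch, which computes the accuracy and confusion matrix with scikit-learn. The labels and predicted probabilities here are made up, and the 0.5 cutoff is an assumed default rather than a requirement of the exercise.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3])

# Turn probabilities into class predictions using a 0.5 cutoff.
y_pred = (y_prob > 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(accuracy)  # 0.75
print(cm)
```

Accuracy alone can be misleading when the two classes are unbalanced, which is one reason the confusion matrix is requested alongside it.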