Exercise: Victorian Car Crash Severity

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Your task is to predict whether a specific car crash will be a high-severity or a low-severity incident. You will use a dataset of car crashes in Victoria where the police were called to assist. The dataset is read from https://laub.au/ai/data/ACCIDENT.csv in the code below.

DALL-E’s rendition of this Victorian car crash severity classification task.

The network must use entity embedding for at least one of the categorical variables (e.g. DCA_CODE) and train on a mix of categorical and numerical features. The target variable is the binary indicator that SEVERITY > 2. Report your classifier's accuracy and give a confusion matrix.
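To make "entity embedding" concrete: an embedding layer is a trainable lookup table with one row per category, which is equivalent to multiplying a one-hot vector by a weight matrix. A minimal NumPy sketch of that idea follows. The codes are made up, the weights are random rather than trained, and the names `codes`, `embedding_dim` and `weights` are illustrative only, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical integer-encoded categories (e.g. ordinal-encoded DCA_CODE values).
codes = np.array([0, 2, 1, 2])
num_categories = 3
embedding_dim = 2

# An embedding layer stores one trainable row per category.
# Here the rows are random; in a network they are learned during training.
weights = rng.normal(size=(num_categories, embedding_dim))

# Looking up an embedding is just row indexing ...
embedded = weights[codes]

# ... which matches one-hot encoding followed by a matrix product.
one_hot = np.eye(num_categories)[codes]
embedded_via_one_hot = one_hot @ weights

print(embedded.shape)  # (4, 2)
print(np.allclose(embedded, embedded_via_one_hot))  # True
```

The lookup form avoids building the wide one-hot matrix, which matters for a variable with many levels such as DCA_CODE.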

Questions:

How did you preprocess your variables?
What neural network architecture did you use?
Did you try multiple options for the embedding dimension? Did any work better than others? (E.g. plot x = embedding dimension against y = validation accuracy)
If your entity embedding dimension was low (1, 2 or 3), can you make a scatterplot of the categories and their learned embeddings?

The data

Start by reading the data dictionary for the dataset.

import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

set_config(transform_output="pandas")

df_raw = pd.read_csv("https://laub.au/ai/data/ACCIDENT.csv", low_memory=False)
df_raw

	ACCIDENT_NO	ACCIDENTDATE	ACCIDENTTIME	ACCIDENT_TYPE	Accident Type Desc	DAY_OF_WEEK	Day Week Description	DCA_CODE	DCA Description	DIRECTORY	...	NO_PERSONS	NO_PERSONS_INJ_2	NO_PERSONS_INJ_3	NO_PERSONS_KILLED	NO_PERSONS_NOT_INJ	POLICE_ATTEND	ROAD_GEOMETRY	Road Geometry Desc	SEVERITY	SPEED_ZONE
0	T20060000010	13/01/2006	12:42:00	1	Collision with vehicle	6	Friday	113	RIGHT NEAR (INTERSECTIONS ONLY)	MEL	...	6	0	1	0	5	1	1	Cross intersection	3	60
1	T20060000018	13/01/2006	19:10:00	1	Collision with vehicle	6	Friday	113	RIGHT NEAR (INTERSECTIONS ONLY)	MEL	...	4	0	1	0	3	1	2	T intersection	3	70
2	T20060000022	14/01/2006	12:10:00	7	Fall from or in moving vehicle	7	Saturday	190	FELL IN/FROM VEHICLE	MEL	...	2	1	0	0	1	1	5	Not at intersection	2	100
3	T20060000023	14/01/2006	11:49:00	1	Collision with vehicle	7	Saturday	130	REAR END(VEHICLES IN SAME LANE)	MEL	...	2	1	0	0	1	1	2	T intersection	2	80
4	T20060000026	14/01/2006	10:45:00	1	Collision with vehicle	7	Saturday	121	RIGHT THROUGH	MEL	...	3	0	3	0	0	1	5	Not at intersection	3	50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
203703	T20200019239	1/11/2020	12:11:00	1	Collision with vehicle	0	Sunday	142	LEAVING PARKING	MEL	...	4	1	0	0	3	1	5	Not at intersection	2	50
203704	T20200019247	1/11/2020	15:30:00	4	Collision with a fixed object	1	Sunday	171	LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL...	MEL	...	2	2	0	0	0	1	5	Not at intersection	2	999
203705	T20200019250	1/11/2020	18:00:00	1	Collision with vehicle	0	Sunday	116	LEFT NEAR (INTERSECTIONS ONLY)	MEL	...	2	1	0	0	1	1	1	Cross intersection	2	60
203706	T20200019253	1/11/2020	12:00:00	6	Vehicle overturned (no collision)	1	Sunday	180	OFF CARRIAGEWAY ON RIGHT BEND	VCD	...	1	1	0	0	0	1	5	Not at intersection	2	80
203707	T20200019417	4/11/2020	1:30:00	4	Collision with a fixed object	3	Wednesday	171	LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL...	MEL	...	1	1	0	0	0	1	5	Not at intersection	2	80

203708 rows × 28 columns

Preprocessing

# Drop observations whose categorical values are very rare (10 or fewer observations in the dataset).
# This is a crude solution and can surely be improved.
df_simple = df_raw.copy()

sparse_categories = ["DCA_CODE", "LIGHT_CONDITION", "ROAD_GEOMETRY"]

for cat in sparse_categories:
    # Keep only rows whose level of `cat` appears more than 10 times.
    df_simple = df_simple[df_simple[cat].map(df_simple[cat].value_counts()) > 10]

df_simple

	ACCIDENT_NO	ACCIDENTDATE	ACCIDENTTIME	ACCIDENT_TYPE	Accident Type Desc	DAY_OF_WEEK	Day Week Description	DCA_CODE	DCA Description	DIRECTORY	...	NO_PERSONS	NO_PERSONS_INJ_2	NO_PERSONS_INJ_3	NO_PERSONS_KILLED	NO_PERSONS_NOT_INJ	POLICE_ATTEND	ROAD_GEOMETRY	Road Geometry Desc	SEVERITY	SPEED_ZONE
0	T20060000010	13/01/2006	12:42:00	1	Collision with vehicle	6	Friday	113	RIGHT NEAR (INTERSECTIONS ONLY)	MEL	...	6	0	1	0	5	1	1	Cross intersection	3	60
1	T20060000018	13/01/2006	19:10:00	1	Collision with vehicle	6	Friday	113	RIGHT NEAR (INTERSECTIONS ONLY)	MEL	...	4	0	1	0	3	1	2	T intersection	3	70
2	T20060000022	14/01/2006	12:10:00	7	Fall from or in moving vehicle	7	Saturday	190	FELL IN/FROM VEHICLE	MEL	...	2	1	0	0	1	1	5	Not at intersection	2	100
3	T20060000023	14/01/2006	11:49:00	1	Collision with vehicle	7	Saturday	130	REAR END(VEHICLES IN SAME LANE)	MEL	...	2	1	0	0	1	1	2	T intersection	2	80
4	T20060000026	14/01/2006	10:45:00	1	Collision with vehicle	7	Saturday	121	RIGHT THROUGH	MEL	...	3	0	3	0	0	1	5	Not at intersection	3	50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
203703	T20200019239	1/11/2020	12:11:00	1	Collision with vehicle	0	Sunday	142	LEAVING PARKING	MEL	...	4	1	0	0	3	1	5	Not at intersection	2	50
203704	T20200019247	1/11/2020	15:30:00	4	Collision with a fixed object	1	Sunday	171	LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL...	MEL	...	2	2	0	0	0	1	5	Not at intersection	2	999
203705	T20200019250	1/11/2020	18:00:00	1	Collision with vehicle	0	Sunday	116	LEFT NEAR (INTERSECTIONS ONLY)	MEL	...	2	1	0	0	1	1	1	Cross intersection	2	60
203706	T20200019253	1/11/2020	12:00:00	6	Vehicle overturned (no collision)	1	Sunday	180	OFF CARRIAGEWAY ON RIGHT BEND	VCD	...	1	1	0	0	0	1	5	Not at intersection	2	80
203707	T20200019417	4/11/2020	1:30:00	4	Collision with a fixed object	3	Wednesday	171	LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL...	MEL	...	1	1	0	0	0	1	5	Not at intersection	2	80

203692 rows × 28 columns
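The preprocessing comment above calls row-dropping a crude solution. One alternative, sketched here on a toy column, is to keep every row and pool the rare levels into a single placeholder level. The helper `lump_rare`, the threshold `min_count` and the placeholder value `-1` are assumptions for illustration; none of them is used elsewhere in this exercise.

```python
import pandas as pd


def lump_rare(series: pd.Series, min_count: int, other=-1) -> pd.Series:
    """Replace levels seen fewer than `min_count` times with `other`."""
    # Map each row to the frequency of its own level.
    counts = series.map(series.value_counts())
    # Keep frequent levels, overwrite the rest with the placeholder.
    return series.where(counts >= min_count, other)


# Toy codes: 113 appears three times, 130 twice, 190 once.
toy = pd.Series([113, 113, 113, 130, 130, 190])
lumped = lump_rare(toy, min_count=2)

print(lumped.tolist())  # [113, 113, 113, 130, 130, -1]
```

This keeps the row count unchanged, at the cost of treating all rare levels as one category.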

drop = ["ACCIDENT_NO", 'ACCIDENTDATE', 'ACCIDENTTIME', "Accident Type Desc", "Day Week Description", "DCA Description",
        "DIRECTORY", "EDITION", "PAGE", "GRID_REFERENCE_X", "GRID_REFERENCE_Y",
        "Light Condition Desc", "NODE_ID", "Road Geometry Desc"]

df = df_simple.drop(drop, axis=1)

categorical_variables = ["ACCIDENT_TYPE", "DCA_CODE", "LIGHT_CONDITION", "ROAD_GEOMETRY"]
# Exclude the categorical columns and the target (SEVERITY) from the numerical features.
numerical_variables = [col for col in df.columns if col not in categorical_variables + ["SEVERITY"]]

print(categorical_variables)
print(numerical_variables)

['ACCIDENT_TYPE', 'DCA_CODE', 'LIGHT_CONDITION', 'ROAD_GEOMETRY']
['DAY_OF_WEEK', 'NO_OF_VEHICLES', 'NO_PERSONS', 'NO_PERSONS_INJ_2', 'NO_PERSONS_INJ_3', 'NO_PERSONS_KILLED', 'NO_PERSONS_NOT_INJ', 'POLICE_ATTEND', 'SPEED_ZONE']

# Print the number of unique categories
for cat in categorical_variables:
    print(f"{cat}: {df[cat].nunique()}")

ACCIDENT_TYPE: 9
DCA_CODE: 80
LIGHT_CONDITION: 7
ROAD_GEOMETRY: 7

# Print the unique values of each categorical variable with their descriptions.
# These come from df_raw, so the rare levels filtered out above still appear here.
categorical_descriptions = ["Accident Type Desc", "DCA Description", "Light Condition Desc", "Road Geometry Desc"]

for cat, desc in zip(categorical_variables, categorical_descriptions):
    df_cat = df_raw[[cat, desc]].drop_duplicates().sort_values(by=[cat]).reset_index(drop=True)
    display(df_cat)
    print()

	ACCIDENT_TYPE	Accident Type Desc
0	1	Collision with vehicle
1	2	Struck Pedestrian
2	3	Struck animal
3	4	Collision with a fixed object
4	5	collision with some other object
5	6	Vehicle overturned (no collision)
6	7	Fall from or in moving vehicle
7	8	No collision and no object struck
8	9	Other accident

	DCA_CODE	DCA Description
0	100	PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIG...
1	101	PED EMERGES FROM IN FRONT OF PARKED OR STATION...
2	102	FAR SIDE. PED HIT BY VEHICLE FROM THE LEFT ...
3	103	PED PLAYING/LYING/WORKING/STANDING ON CARRIAGE...
4	104	PED WALKING WITH TRAFFIC
...	...	...
76	192	STRUCK TRAIN
77	193	STRUCK RAILWAY CROSSING FURNITURE
78	194	PARKED CAR RUN AWAY
79	198	OTHER ACCIDENTS NOT CLASSIFIABLE ELSEWHERE ...
80	199	UNKNOWN-NO DETAILS ON MANOEUVRES OF ROAD-USERS...

81 rows × 2 columns

	LIGHT_CONDITION	Light Condition Desc
0	1	Day
1	2	Dusk/Dawn
2	3	Dark Street lights on
3	4	Dark Street lights off
4	5	Dark No street lights
5	6	Dark Street lights unknown
6	9	Unknown

	ROAD_GEOMETRY	Road Geometry Desc
0	1	Cross intersection
1	2	T intersection
2	3	Y intersection
3	4	Multiple intersection
4	5	Not at intersection
5	6	Dead end
6	7	Road closure
7	8	Private property
8	9	Unknown

target = (df["SEVERITY"] > 2)
features = df.drop("SEVERITY", axis=1)

Classification task

This is for you to complete.
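Whatever model you build, the reporting step asked for in the task description could be sketched as follows. The labels and probabilities are made up to stand in for a fitted network's output on a held-out split, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels (1 means SEVERITY > 2) and predicted probabilities.
y_true = np.array([1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.4, 0.2, 0.7, 0.6, 0.8])

# Threshold the probabilities to get class predictions.
y_pred = (y_prob > 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)

# Rows are true classes and columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"Accuracy: {acc:.3f}")  # Accuracy: 0.667
print(cm)
```

Reading the matrix row by row shows which class the errors come from, which accuracy alone does not reveal.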