import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
set_config(transform_output="pandas")
Exercise: Victorian Car Crash Severity
ACTL3143 & ACTL5111 Deep Learning for Actuaries
Your task is to predict whether a specific car crash will be a high severity or a low severity incident. You will use a dataset of car crashes in Victoria where the police were called to assist. The dataset is available here (original source).
The network must use an entity embedding for at least one of the categorical variables (e.g. DCA_CODE) and train on a mix of categorical and numerical features. The target variable is the binary indicator SEVERITY > 2. Report the accuracy of your classifier and give a confusion matrix.
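As a rough illustration of what an entity embedding does (a minimal NumPy sketch, not the network you are asked to build): each category gets one row in a trainable matrix, and the layer simply looks up that row. The matrix `E` below is random rather than learned, and the names `code_to_index`, `E` and `network_input` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A handful of DCA codes; a real model would cover every code in the training set.
categories = [113, 121, 130, 190]
embedding_dim = 2

# The embedding "layer" is a lookup table with one row per category.
# Random here; in a network these entries are learned by gradient descent.
E = rng.normal(size=(len(categories), embedding_dim))

# Raw codes must first be mapped to row indices 0, 1, ..., n_categories - 1.
code_to_index = {code: i for i, code in enumerate(categories)}
batch_codes = [113, 190, 113, 130]
idx = np.array([code_to_index[c] for c in batch_codes])

# Looking up rows is equivalent to multiplying a one-hot encoding by E.
embedded = E[idx]
one_hot = np.eye(len(categories))[idx]
assert np.allclose(one_hot @ E, embedded)

# The embedded categorical feature is concatenated with the numerical features
# (a single made-up numerical column here) before the dense layers.
numeric = np.array([[60.0], [100.0], [70.0], [80.0]])
network_input = np.hstack([embedded, numeric])
print(network_input.shape)
```

The two rows for code 113 receive identical embeddings, which is the point: the network learns one vector per category rather than one weight per one-hot column.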
Questions:
- How did you preprocess your variables?
- What neural network architecture did you use?
- Did you try multiple options for the embedding dimension? Did any work better than others? (E.g. plot x = embedding dimension against y = validation accuracy)
- If your entity embedding dimension was low (1, 2 or 3) can you make a scatterplot of the categories & their learned embeddings?
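For the last question, it helps to first line up each category with its learned coordinates. The sketch below assumes a 2-dimensional embedding for LIGHT_CONDITION; the `weights` array is invented for illustration (in practice you would extract it from your trained embedding layer), and it assumes row i of the weights corresponds to the i-th category in your encoder's ordering.

```python
import numpy as np
import pandas as pd

# Four of the light-condition codes and their descriptions.
categories = [1, 2, 3, 5]
descriptions = ["Day", "Dusk/Dawn", "Dark Street lights on", "Dark No street lights"]

# Hypothetical learned embedding weights, shape (n_categories, embedding_dim).
# These numbers are invented; they are NOT results from a trained model.
weights = np.array([
    [0.8, -0.1],
    [0.2, 0.3],
    [-0.4, 0.5],
    [-0.9, 0.7],
])

# One row per category: its code, its description and its coordinates.
table = pd.DataFrame(weights, columns=["dim_1", "dim_2"])
table.insert(0, "LIGHT_CONDITION", categories)
table.insert(1, "description", descriptions)
print(table)
```

With matplotlib installed, `table.plot.scatter(x="dim_1", y="dim_2")` draws the points, and the `description` column can be used to label them.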
The data
Start by reading the data dictionary for the dataset.
df_raw = pd.read_csv("https://laub.au/ai/data/ACCIDENT.csv", low_memory=False)
df_raw
| ACCIDENT_NO | ACCIDENTDATE | ACCIDENTTIME | ACCIDENT_TYPE | Accident Type Desc | DAY_OF_WEEK | Day Week Description | DCA_CODE | DCA Description | DIRECTORY | ... | NO_PERSONS | NO_PERSONS_INJ_2 | NO_PERSONS_INJ_3 | NO_PERSONS_KILLED | NO_PERSONS_NOT_INJ | POLICE_ATTEND | ROAD_GEOMETRY | Road Geometry Desc | SEVERITY | SPEED_ZONE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | T20060000010 | 13/01/2006 | 12:42:00 | 1 | Collision with vehicle | 6 | Friday | 113 | RIGHT NEAR (INTERSECTIONS ONLY) | MEL | ... | 6 | 0 | 1 | 0 | 5 | 1 | 1 | Cross intersection | 3 | 60 |
1 | T20060000018 | 13/01/2006 | 19:10:00 | 1 | Collision with vehicle | 6 | Friday | 113 | RIGHT NEAR (INTERSECTIONS ONLY) | MEL | ... | 4 | 0 | 1 | 0 | 3 | 1 | 2 | T intersection | 3 | 70 |
2 | T20060000022 | 14/01/2006 | 12:10:00 | 7 | Fall from or in moving vehicle | 7 | Saturday | 190 | FELL IN/FROM VEHICLE | MEL | ... | 2 | 1 | 0 | 0 | 1 | 1 | 5 | Not at intersection | 2 | 100 |
3 | T20060000023 | 14/01/2006 | 11:49:00 | 1 | Collision with vehicle | 7 | Saturday | 130 | REAR END(VEHICLES IN SAME LANE) | MEL | ... | 2 | 1 | 0 | 0 | 1 | 1 | 2 | T intersection | 2 | 80 |
4 | T20060000026 | 14/01/2006 | 10:45:00 | 1 | Collision with vehicle | 7 | Saturday | 121 | RIGHT THROUGH | MEL | ... | 3 | 0 | 3 | 0 | 0 | 1 | 5 | Not at intersection | 3 | 50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
203703 | T20200019239 | 1/11/2020 | 12:11:00 | 1 | Collision with vehicle | 0 | Sunday | 142 | LEAVING PARKING | MEL | ... | 4 | 1 | 0 | 0 | 3 | 1 | 5 | Not at intersection | 2 | 50 |
203704 | T20200019247 | 1/11/2020 | 15:30:00 | 4 | Collision with a fixed object | 1 | Sunday | 171 | LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... | MEL | ... | 2 | 2 | 0 | 0 | 0 | 1 | 5 | Not at intersection | 2 | 999 |
203705 | T20200019250 | 1/11/2020 | 18:00:00 | 1 | Collision with vehicle | 0 | Sunday | 116 | LEFT NEAR (INTERSECTIONS ONLY) | MEL | ... | 2 | 1 | 0 | 0 | 1 | 1 | 1 | Cross intersection | 2 | 60 |
203706 | T20200019253 | 1/11/2020 | 12:00:00 | 6 | Vehicle overturned (no collision) | 1 | Sunday | 180 | OFF CARRIAGEWAY ON RIGHT BEND | VCD | ... | 1 | 1 | 0 | 0 | 0 | 1 | 5 | Not at intersection | 2 | 80 |
203707 | T20200019417 | 4/11/2020 | 1:30:00 | 4 | Collision with a fixed object | 3 | Wednesday | 171 | LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... | MEL | ... | 1 | 1 | 0 | 0 | 0 | 1 | 5 | Not at intersection | 2 | 80 |
203708 rows × 28 columns
Preprocessing
# Drop observations whose categorical values are very rare (10 or fewer observations in the dataset).
# This is a crude solution and can surely be improved.
df_simple = df_raw.copy()
sparse_categories = ["DCA_CODE", "LIGHT_CONDITION", "ROAD_GEOMETRY"]
for cat in sparse_categories:
    df_simple = df_simple[df_simple[cat].map(df_simple[cat].value_counts()) > 10]
df_simple
| ACCIDENT_NO | ACCIDENTDATE | ACCIDENTTIME | ACCIDENT_TYPE | Accident Type Desc | DAY_OF_WEEK | Day Week Description | DCA_CODE | DCA Description | DIRECTORY | ... | NO_PERSONS | NO_PERSONS_INJ_2 | NO_PERSONS_INJ_3 | NO_PERSONS_KILLED | NO_PERSONS_NOT_INJ | POLICE_ATTEND | ROAD_GEOMETRY | Road Geometry Desc | SEVERITY | SPEED_ZONE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | T20060000010 | 13/01/2006 | 12:42:00 | 1 | Collision with vehicle | 6 | Friday | 113 | RIGHT NEAR (INTERSECTIONS ONLY) | MEL | ... | 6 | 0 | 1 | 0 | 5 | 1 | 1 | Cross intersection | 3 | 60 |
1 | T20060000018 | 13/01/2006 | 19:10:00 | 1 | Collision with vehicle | 6 | Friday | 113 | RIGHT NEAR (INTERSECTIONS ONLY) | MEL | ... | 4 | 0 | 1 | 0 | 3 | 1 | 2 | T intersection | 3 | 70 |
2 | T20060000022 | 14/01/2006 | 12:10:00 | 7 | Fall from or in moving vehicle | 7 | Saturday | 190 | FELL IN/FROM VEHICLE | MEL | ... | 2 | 1 | 0 | 0 | 1 | 1 | 5 | Not at intersection | 2 | 100 |
3 | T20060000023 | 14/01/2006 | 11:49:00 | 1 | Collision with vehicle | 7 | Saturday | 130 | REAR END(VEHICLES IN SAME LANE) | MEL | ... | 2 | 1 | 0 | 0 | 1 | 1 | 2 | T intersection | 2 | 80 |
4 | T20060000026 | 14/01/2006 | 10:45:00 | 1 | Collision with vehicle | 7 | Saturday | 121 | RIGHT THROUGH | MEL | ... | 3 | 0 | 3 | 0 | 0 | 1 | 5 | Not at intersection | 3 | 50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
203703 | T20200019239 | 1/11/2020 | 12:11:00 | 1 | Collision with vehicle | 0 | Sunday | 142 | LEAVING PARKING | MEL | ... | 4 | 1 | 0 | 0 | 3 | 1 | 5 | Not at intersection | 2 | 50 |
203704 | T20200019247 | 1/11/2020 | 15:30:00 | 4 | Collision with a fixed object | 1 | Sunday | 171 | LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... | MEL | ... | 2 | 2 | 0 | 0 | 0 | 1 | 5 | Not at intersection | 2 | 999 |
203705 | T20200019250 | 1/11/2020 | 18:00:00 | 1 | Collision with vehicle | 0 | Sunday | 116 | LEFT NEAR (INTERSECTIONS ONLY) | MEL | ... | 2 | 1 | 0 | 0 | 1 | 1 | 1 | Cross intersection | 2 | 60 |
203706 | T20200019253 | 1/11/2020 | 12:00:00 | 6 | Vehicle overturned (no collision) | 1 | Sunday | 180 | OFF CARRIAGEWAY ON RIGHT BEND | VCD | ... | 1 | 1 | 0 | 0 | 0 | 1 | 5 | Not at intersection | 2 | 80 |
203707 | T20200019417 | 4/11/2020 | 1:30:00 | 4 | Collision with a fixed object | 3 | Wednesday | 171 | LEFT OFF CARRIAGEWAY INTO OBJECT/PARKED VEHICL... | MEL | ... | 1 | 1 | 0 | 0 | 0 | 1 | 5 | Not at intersection | 2 | 80 |
203692 rows × 28 columns
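One possible improvement on dropping rows is to lump rare categories into a single "other" level, so no observations are lost. This is only a sketch of that idea: the helper `lump_rare`, the placeholder code `-1` and the toy series are all assumptions, not part of the exercise code.

```python
import pandas as pd

def lump_rare(series: pd.Series, min_count: int, other=-1) -> pd.Series:
    """Replace values occurring fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare_values = counts[counts < min_count].index
    # Keep values that are not rare; swap the rare ones for the placeholder.
    return series.where(~series.isin(rare_values), other)

# Toy data: codes 194 and 198 each appear only once.
toy_codes = pd.Series([113, 113, 113, 130, 130, 194, 198])
lumped = lump_rare(toy_codes, min_count=2)
print(lumped.tolist())  # [113, 113, 113, 130, 130, -1, -1]
```

If you use something like this, consider computing the counts on the training set only, so the definition of "rare" does not depend on validation or test rows.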
drop = ["ACCIDENT_NO", "ACCIDENTDATE", "ACCIDENTTIME", "Accident Type Desc", "Day Week Description", "DCA Description",
        "DIRECTORY", "EDITION", "PAGE", "GRID_REFERENCE_X", "GRID_REFERENCE_Y",
        "Light Condition Desc", "NODE_ID", "Road Geometry Desc"]
df = df_simple.drop(drop, axis=1)
categorical_variables = ["ACCIDENT_TYPE", "DCA_CODE", "LIGHT_CONDITION", "ROAD_GEOMETRY"]
numerical_variables = [col for col in df.columns if col not in categorical_variables]
print(categorical_variables)
print(numerical_variables)
['ACCIDENT_TYPE', 'DCA_CODE', 'LIGHT_CONDITION', 'ROAD_GEOMETRY']
['DAY_OF_WEEK', 'NO_OF_VEHICLES', 'NO_PERSONS', 'NO_PERSONS_INJ_2', 'NO_PERSONS_INJ_3', 'NO_PERSONS_KILLED', 'NO_PERSONS_NOT_INJ', 'POLICE_ATTEND', 'SEVERITY', 'SPEED_ZONE']
# Print the number of unique categories
for cat in categorical_variables:
    print(f"{cat}: {df[cat].nunique()}")
ACCIDENT_TYPE: 9
DCA_CODE: 80
LIGHT_CONDITION: 7
ROAD_GEOMETRY: 7
# Print out the unique values for each categorical variable and their descriptions
categorical_descriptions = ["Accident Type Desc", "DCA Description", "Light Condition Desc", "Road Geometry Desc"]
for cat, desc in zip(categorical_variables, categorical_descriptions):
    df_cat = df_raw[[cat, desc]].drop_duplicates().sort_values(by=[cat]).reset_index(drop=True)
    display(df_cat)
    print()
| ACCIDENT_TYPE | Accident Type Desc |
---|---|---|
0 | 1 | Collision with vehicle |
1 | 2 | Struck Pedestrian |
2 | 3 | Struck animal |
3 | 4 | Collision with a fixed object |
4 | 5 | collision with some other object |
5 | 6 | Vehicle overturned (no collision) |
6 | 7 | Fall from or in moving vehicle |
7 | 8 | No collision and no object struck |
8 | 9 | Other accident |
| DCA_CODE | DCA Description |
---|---|---|
0 | 100 | PED NEAR SIDE. PED HIT BY VEHICLE FROM THE RIG... |
1 | 101 | PED EMERGES FROM IN FRONT OF PARKED OR STATION... |
2 | 102 | FAR SIDE. PED HIT BY VEHICLE FROM THE LEFT ... |
3 | 103 | PED PLAYING/LYING/WORKING/STANDING ON CARRIAGE... |
4 | 104 | PED WALKING WITH TRAFFIC |
... | ... | ... |
76 | 192 | STRUCK TRAIN |
77 | 193 | STRUCK RAILWAY CROSSING FURNITURE |
78 | 194 | PARKED CAR RUN AWAY |
79 | 198 | OTHER ACCIDENTS NOT CLASSIFIABLE ELSEWHERE ... |
80 | 199 | UNKNOWN-NO DETAILS ON MANOEUVRES OF ROAD-USERS... |
81 rows × 2 columns
| LIGHT_CONDITION | Light Condition Desc |
---|---|---|
0 | 1 | Day |
1 | 2 | Dusk/Dawn |
2 | 3 | Dark Street lights on |
3 | 4 | Dark Street lights off |
4 | 5 | Dark No street lights |
5 | 6 | Dark Street lights unknown |
6 | 9 | Unknown |
| ROAD_GEOMETRY | Road Geometry Desc |
---|---|---|
0 | 1 | Cross intersection |
1 | 2 | T intersection |
2 | 3 | Y intersection |
3 | 4 | Multiple intersection |
4 | 5 | Not at intersection |
5 | 6 | Dead end |
6 | 7 | Road closure |
7 | 8 | Private property |
8 | 9 | Unknown |
target = (df["SEVERITY"] > 2)
features = df.drop("SEVERITY", axis=1)
Classification task
This is for you to complete.
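For the reporting step, one option is scikit-learn's `accuracy_score` and `confusion_matrix`. The labels and predicted probabilities below are invented purely to show the calls; replace them with your test-set targets and your model's predictions. The 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predicted probabilities for eight crashes.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3])

# Convert probabilities to class predictions with a 0.5 threshold.
y_pred = (y_prob > 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
# Rows are the true classes (0, 1); columns are the predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
print(cm)
```

It is worth comparing the accuracy against the proportion of the majority class in the test set, since a classifier that always predicts the majority class already reaches that level.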