In an effort to reduce the frequency of car collisions in a community, an algorithm must be developed to predict the severity of an accident given the current weather, road and visibility conditions. When conditions are bad, this model will alert drivers to remind them to be more careful.
Our target variable will be SEVERITYCODE because it is used to measure the severity of an accident from 0 to 5 within the dataset. Attributes used to weigh the severity of an accident are the weather, road and visibility conditions.
Severity codes are as follows:
- 0: Little to no Probability (Clear Conditions)
- 1: Very Low Probability - Chance of Property Damage
- 2: Low Probability - Chance of Injury
- 3: Mild Probability - Chance of Serious Injury
- 4: High Probability - Chance of Fatality
In its original form, this data is not fit for analysis. For one, there are many columns that we will not use for this model. Also, most of the features are of type object when they should be of a numerical type.
We must use label encoding to convert the features to our desired data type.
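Here is a minimal sketch of label encoding with pandas, assuming the categorical features are stored as `object` columns. The column name `WEATHER`, the new column `WEATHER_CAT`, and the sample values are illustrative assumptions, not taken from the dataset. Converting a column to the `category` dtype and reading `.cat.codes` gives small integer codes.

```python
import pandas as pd

# Hypothetical sample; the real column names and values may differ.
df = pd.DataFrame({"WEATHER": ["Clear", "Raining", "Overcast", "Clear"]})

# Convert the object column to a categorical dtype, then take its integer codes.
df["WEATHER"] = df["WEATHER"].astype("category")
df["WEATHER_CAT"] = df["WEATHER"].cat.codes

print(df.dtypes)
print(df)
```

With only a handful of categories, pandas stores the codes as `int8`, which is a numerical type the models can work with.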
With the new columns, we can now use this data in our analysis and ML models!
Now let's check the data types of the new columns in our data. Moving forward, we will only use the new columns for our analysis.
Our target variable SEVERITYCODE is only 42% balanced. In fact, class 1 is nearly three times the size of class 2.
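One way to fix an imbalance like this is to down-sample the majority class with sklearn's `resample` so that both classes end up the same size. The sketch below uses a small made-up frame; the `feature` column, the row counts, and the random seed are assumptions for illustration only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced sample: class 1 is three times the size of class 2.
df = pd.DataFrame({"SEVERITYCODE": [1] * 9 + [2] * 3, "feature": range(12)})

majority = df[df["SEVERITYCODE"] == 1]
minority = df[df["SEVERITYCODE"] == 2]

# Sample the majority class without replacement, down to the minority size.
majority_down = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=123,  # arbitrary seed, only for reproducibility
)

# Recombine and check that both classes now have the same count.
balanced = pd.concat([majority_down, minority])
print(balanced["SEVERITYCODE"].value_counts())
```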
Perfectly balanced (as all things should be!)
Our data is now ready to be fed into machine learning models.
We will use the following models:
K-Nearest Neighbor (KNN)
KNN will help us predict the severity code of an outcome by finding the k most similar data points (its nearest neighbors) and assigning the class that is most common among them.
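To make the idea concrete, here is a small hand-rolled sketch of the nearest-neighbor vote. It is not the model used later in this notebook (that one comes from sklearn); the helper `knn_predict` and the toy data are made up for illustration, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count the classes among them
    return votes.most_common(1)[0][0]


# Toy data (made up): two features, severity classes 1 and 2.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([1, 1, 2, 2, 2])

print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3))
```

The new point sits next to three class 2 points, so the vote returns class 2.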
Decision Tree
A decision tree model gives us a layout of all possible outcomes so we can fully analyze the consequences of a decision. In this context, the decision tree maps the possible outcomes of different weather conditions.
Logistic Regression
Because our dataset only provides us with two severity code outcomes, our model will only predict one of those two classes. This makes our problem binary, which is a good fit for logistic regression.
Let's get started!
- Define X and y
- Normalize the dataset
- Train-Test Split
- K-Nearest Neighbor
Finding the best k value
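The steps listed above can be sketched as follows. This is a hedged reconstruction, not the notebook's original cells: the feature column names, the synthetic data, the `test_size`, the `random_state` values, and the range of k values tried are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the balanced, label-encoded data (all values made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "WEATHER_CAT": rng.integers(0, 5, n),    # hypothetical column name
    "ROADCOND_CAT": rng.integers(0, 4, n),   # hypothetical column name
    "LIGHTCOND_CAT": rng.integers(0, 3, n),  # hypothetical column name
    "SEVERITYCODE": rng.integers(1, 3, n),   # classes 1 and 2
})

# Define X and y
X = df[["WEATHER_CAT", "ROADCOND_CAT", "LIGHTCOND_CAT"]].to_numpy(dtype=float)
y = df["SEVERITYCODE"].to_numpy()

# Normalize the dataset (zero mean, unit variance per feature)
X = StandardScaler().fit_transform(X)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Finding the best k: record the test accuracy for k = 1 .. Ks - 1
Ks = 10
mean_acc = np.zeros(Ks - 1)
for n_neighbors in range(1, Ks):
    model = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    mean_acc[n_neighbors - 1] = metrics.accuracy_score(y_test, model.predict(X_test))

k = mean_acc.argmax() + 1
print("Best k:", k, "with accuracy:", mean_acc.max())
```

The order here follows the steps as listed. Fitting the scaler on the training split only would keep test-set statistics out of the normalization.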
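```python
# Train Model & Predict
from sklearn.neighbors import KNeighborsClassifier

# mean_acc holds the accuracy for each candidate k, starting at k = 1
k = mean_acc.argmax() + 1
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
Kyhat = neigh.predict(X_test)
Kyhat[0:5]
```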
array([2, 2, 1, 1, 2])
- Decision Tree
```python
# Building the Decision Tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

colDataTree = DecisionTreeClassifier(criterion="entropy", max_depth=7)
colDataTree.fit(X_train, y_train)
predTree = colDataTree.predict(X_test)
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, predTree))
```
DecisionTrees's Accuracy: 0.5664365709048206
```python
# Train Model & Predict
DTyhat = colDataTree.predict(X_test)
print(predTree[0:5])
print(y_test[0:5])
```
[2 2 1 1 2]
[2 2 1 1 1]
- Logistic Regression
```python
# Building the LR Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

LR = LogisticRegression(C=6, solver='liblinear').fit(X_train, y_train)

# Train Model & Predict
LRyhat = LR.predict(X_test)
yhat_prob = LR.predict_proba(X_test)
```
Here is the summary of the scores reported in the evaluation step:
At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that we could have fed through an algorithm, so label encoding was used to create new columns of type int8, a numerical data type.
After solving that issue we were presented with another - imbalanced data. As mentioned earlier, class 1 was nearly three times larger than class 2. The solution to this was down-sampling the majority class with sklearn's resample tool. We down-sampled to match the minority class exactly with 58188 values each.
Once we analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are well suited to this project, logistic regression made the most sense because the target is binary.
Evaluation metrics used to test the accuracy of our models were the Jaccard index, the F1-score and, for logistic regression, log loss. Trying different values of k, max depth and the hyperparameter C helped us improve accuracy.
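As a rough sketch, these three metrics can be computed with sklearn as shown below. The labels, predictions and probabilities are made up for illustration and are not results from this notebook; the choice of `pos_label=2` and `average="macro"` is also an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score, log_loss

# Made-up labels and predictions, for illustration only.
y_true = np.array([2, 2, 1, 1, 1, 2])
y_pred = np.array([2, 2, 1, 1, 2, 2])

# Made-up predicted probabilities, columns ordered as classes [1, 2].
y_prob = np.array([
    [0.2, 0.8], [0.3, 0.7], [0.9, 0.1],
    [0.6, 0.4], [0.4, 0.6], [0.1, 0.9],
])

# pos_label=2 treats the injury class as positive (the default would be class 1).
jaccard = jaccard_score(y_true, y_pred, pos_label=2)
# Macro averaging weights both classes equally.
f1 = f1_score(y_true, y_pred, average="macro")
# Log loss needs probabilities, so it only applies to logistic regression here.
ll = log_loss(y_true, y_prob)

print("Jaccard index:", jaccard)
print("F1-score:", f1)
print("Log loss:", ll)
```

Log loss is reported only for logistic regression because it scores predicted probabilities rather than hard class labels.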
Based on historical data from weather conditions pointing to certain classes, we can conclude that particular weather conditions have some impact on whether travel could result in property damage (class 1) or injury (class 2).
Thank you for reading! 😊