As a student at the Flatiron School, for my module 3 project I teamed up with Abzal Seitkaziyev to try and predict the winner of horse races.
We started by getting our data from Kaggle, which hosted a data set scraped from the Hong Kong Jockey Club website, covering races from 2014 to 2017.
The first step I took was to look through the data and identify columns that I didn't think would provide useful information. After dropping a large number of columns, I checked the remaining ones for null values.
Fortunately, after removing the unwanted columns, only one column contained null values. Since there were relatively few of them, I simply removed those rows from the data set.
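The cleanup described above might look something like the following sketch in pandas. The column names here are illustrative, not the actual columns from the Kaggle data set:

```python
import pandas as pd

# Hypothetical subset of the race data; column names are illustrative only.
df = pd.DataFrame({
    "horse_id": [1, 2, 3, 4],
    "finish_time": [83.2, None, 84.1, 82.7],
    "jockey": ["A", "B", "C", "D"],
})

# Drop columns judged uninformative, then remove rows with remaining nulls.
df = df.drop(columns=["jockey"])
df = df.dropna()

print(len(df))  # 3 rows remain after dropping the one null finish_time
```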
However, there were other issues with the data that needed to be addressed. One was that some missing values didn't show up as missing because they had been entered as '---'. I searched the columns for these placeholder values and removed those rows as well.
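Placeholder strings like '---' slip past `dropna` because pandas doesn't treat them as null. A minimal sketch of the fix, with a made-up `win_odds` column standing in for the real data:

```python
import numpy as np
import pandas as pd

# Illustrative data: '---' marks a missing value pandas does not flag as null.
df = pd.DataFrame({"win_odds": ["4.5", "---", "6.1"]})

# Convert the placeholder to a real NaN, then drop it like any other null.
df["win_odds"] = df["win_odds"].replace("---", np.nan)
df = df.dropna(subset=["win_odds"]).astype({"win_odds": float})

print(df["win_odds"].tolist())  # [4.5, 6.1]
```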
Feature engineering ended up being by far my main focus during the project. In hindsight, I should have planned ahead and set a limit on how much time I would spend engineering new features; because I didn't, I ended up rushing through the encoding and modeling portions of the project in order to finish on time.
While I won't go through every feature that I engineered, I will comment on a couple of the main things I focused on. The first thing I knew I needed to do was make sure that I wasn't using future information to predict the outcome of races.
For example, how fast a horse had run in the past was clearly an important component of the model. However, we had to make sure to exclude a horse's performance in a given race when predicting that race, because that is not information we would have going in. Therefore, when calculating the fastest a horse had run up to that point, I made sure to exclude the time from the race being predicted.
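One way to compute a "best speed so far" feature that excludes the current race is to shift each horse's history by one row before taking a cumulative max. This is a sketch with made-up speeds, not the project's actual feature code:

```python
import pandas as pd

# Illustrative race history for two horses, sorted by date; speeds are made up.
df = pd.DataFrame({
    "horse_id": [1, 1, 1, 2, 2],
    "race_date": pd.to_datetime(
        ["2014-01-05", "2014-02-10", "2014-03-15", "2014-01-05", "2014-02-10"]
    ),
    "speed": [15.2, 15.8, 15.5, 14.9, 15.1],
})
df = df.sort_values(["horse_id", "race_date"])

# shift(1) drops the current race, so only strictly prior races feed the max.
df["best_prior_speed"] = (
    df.groupby("horse_id")["speed"].transform(lambda s: s.shift(1).cummax())
)

print(df["best_prior_speed"].tolist())  # [nan, 15.2, 15.8, nan, 14.9]
```

A horse's first race gets `NaN`, which is honest: there is no prior information for it.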
Additionally, in order to use the categorical variables in my model, I knew that I had to encode them. I ultimately decided on target encoding, but I knew I had to be careful because of the possibility of target leakage. Since target encoding uses information about the target, encoding naively can bias the predictions: you want to make predictions without any knowledge of the result, because in the future, when actually using the model, I would not have access to that information. Therefore, I used a pipeline to make sure that leakage did not occur.
When I got to the modeling stage, I first tried a basic version of a number of models to see which performed best out of the box: a decision tree, a random forest, logistic regression, a support vector machine, AdaBoost, and a gradient boosting classifier.
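A baseline comparison like this can be sketched as a loop over default-configured classifiers; the synthetic data here is a stand-in for the prepared race features, not the real data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared race data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Score every model with the same cross-validated AUC for a fair comparison.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    print(f"{name}: {results[name]:.3f}")
```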
The metric I judged the models on was area under the ROC curve (AUC). I chose this metric because of what we were ultimately using the model for: I plan on using this model to bet on races, so I think the most important things to keep in mind are the true positive and false positive rates. In other words, how often will I win when the model tells me I should place a bet on a particular horse? AUC captures exactly this trade-off.
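AUC can be read as the probability that a randomly chosen winner gets a higher predicted score than a randomly chosen loser. A tiny example with hypothetical model probabilities:

```python
from sklearn.metrics import roc_auc_score

# 1 = horse won, 0 = horse lost; scores are hypothetical model outputs.
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.3, 0.4, 0.1]

# Winners score 0.9 and 0.3; losers score 0.2, 0.4, 0.1. Of the 6
# winner/loser pairs, 5 are ranked correctly (0.3 < 0.4 is the miss).
print(roc_auc_score(y_true, y_score))  # about 0.83, i.e. 5/6
```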
Ultimately, I was able to achieve an AUC of approximately 0.78 using the logistic regression model.
I think I can improve on my AUC score fairly easily by working more with the models. I spent most of my time engineering features and didn't have as much time as I would have liked to tune the models and check which features should be included and which shouldn't.
Furthermore, I would like to collect more data, since the Kaggle dataset only included races from 2014-2017.
I would also like to see if I can get better results by trying different types of bets instead of simply picking the winning horse.