DEV Community

timhugele

Predicting Hong Kong Horse Racing Outcomes

As a student at the Flatiron School, I teamed up with Abzal Seitkaziyev for my module 3 project to try to predict the winners of horse races.

Data Collection

We started with a Kaggle data set sourced from the Hong Kong Jockey Club website. The races included were from 2014 to 2017.

Data Cleaning

My first step was to look through the data for columns that I didn't think would be useful. After dropping a large number of columns, I checked the remaining ones for null values.

Fortunately, after removing unwanted columns there was only one column with null values. Since there were relatively few of them, I simply removed those rows from the data set.

However, there were other issues with the data that needed to be dealt with. One was that some missing values didn't show up as null because they had been entered as '---'. I searched the columns for these placeholder values and removed them as well.
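As a minimal sketch of that cleanup step, assuming pandas and a made-up table (the column names `horse_id`, `finish_time`, and `draw` are illustrative, not the actual Kaggle schema), the '---' placeholder can be converted to a real missing value so the usual null handling catches it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the column names are illustrative, not the Kaggle schema.
races = pd.DataFrame(
    {
        "horse_id": [1, 2, 3, 4],
        "finish_time": ["83.5", "---", "84.1", "82.9"],
        "draw": ["3", "7", "---", "1"],
    }
)

# Turn the '---' placeholder into a real missing value, then drop those rows.
cleaned = races.replace("---", np.nan).dropna().reset_index(drop=True)

# The affected columns were read as strings, so convert them back to numbers.
numeric_cols = ["finish_time", "draw"]
cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric)

print(cleaned)
```

When the data comes straight from a CSV file, passing `na_values="---"` to `pd.read_csv` should achieve the same conversion at load time.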

Feature Engineering

Feature engineering ended up being by far my main focus during the project. In hindsight, I should have planned ahead and set a limit on how much time I would spend engineering new features. Because I didn't, I ended up having to rush through the encoding and modeling portions of the project in order to finish on time.

While I won't go through all of the features that I engineered, I will comment on a couple of the main things that I focused on. The first thing that I knew I needed to do was to make sure that I wasn't using future information to predict the outcome of races.

For example, how fast a horse ran was clearly an important component in creating the model. However, we had to make sure to remove the speed from a particular race from the prediction process for that race, because that is not information that we will have going in. Therefore, when I was calculating the fastest that a horse had run up until that point, I made sure to exclude the time from that race.
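One way to build such a feature, sketched here with pandas on invented data (the columns `horse_id`, `race_date`, and `speed` are assumptions for illustration, not the project's actual columns), is to shift each horse's history by one race before taking the running maximum:

```python
import pandas as pd

# Hypothetical results table: one row per horse per race.
results = pd.DataFrame(
    {
        "horse_id": [1, 1, 1, 2, 2],
        "race_date": pd.to_datetime(
            ["2014-01-05", "2014-02-10", "2014-03-15", "2014-01-05", "2014-03-15"]
        ),
        "speed": [16.2, 16.8, 16.5, 15.9, 16.1],
    }
)

# Order each horse's races chronologically so "earlier" is well defined.
results = results.sort_values(["horse_id", "race_date"]).reset_index(drop=True)

# shift(1) moves every speed one race later, so the running maximum
# for a given race only ever sees races that happened before it.
results["best_prior_speed"] = results.groupby("horse_id")["speed"].transform(
    lambda s: s.shift(1).cummax()
)

print(results)
```

A horse's first race has no history, so its `best_prior_speed` is missing and has to be imputed or handled separately before modeling.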

Additionally, in order to use the categorical variables in my model, I knew that I had to encode them. I ultimately decided to use target encoding, but I knew that I had to be careful because of the possibility of target leakage. Since target encoding uses information about the target in the encoding process, it can bias the predictions if the encoding is computed on the same rows that the model is evaluated on. Predictions have to be made without any knowledge of the result, because when I use the model in the future I won't have access to that information. Therefore, I used a pipeline to make sure that leakage did not occur.
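The sketch below illustrates the idea with scikit-learn. It is not the encoder used in the project: `MeanTargetEncoder` is a hypothetical, deliberately minimal class written for this example, and the `jockey` data is synthetic. Because the encoder sits inside a `Pipeline`, cross-validation refits it on each training fold, so the held-out rows never contribute to their own encoding.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class MeanTargetEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical minimal encoder: replaces each category with the mean
    target value seen for that category in the data passed to fit."""

    def fit(self, X, y):
        X = pd.DataFrame(X)
        y = pd.Series(np.asarray(y), index=X.index)
        self.global_mean_ = float(y.mean())
        self.maps_ = {col: y.groupby(X[col]).mean() for col in X.columns}
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        # Categories not seen during fit fall back to the overall mean.
        encoded = {
            col: X[col].map(self.maps_[col]).fillna(self.global_mean_)
            for col in X.columns
        }
        return pd.DataFrame(encoded, index=X.index).to_numpy(dtype=float)


# Synthetic data: jockey "A" wins more often than the others.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({"jockey": rng.choice(["A", "B", "C", "D"], size=n)})
win_prob = np.where(X["jockey"] == "A", 0.4, 0.1)
y = (rng.random(n) < win_prob).astype(int)

model = Pipeline([("encode", MeanTargetEncoder()), ("clf", LogisticRegression())])

# The encoder is fit on the training folds only, then applied to the held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Encoding the whole data set up front and then cross-validating would let every row's own outcome influence its encoded value, which is exactly the leakage the pipeline avoids.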

Modeling

When I got to the modeling, I first tried a basic version of a number of models to see which one performed best right off the bat. I tried a decision tree, a random forest, logistic regression, a support vector machine, AdaBoost, and a gradient boosted classifier.
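A baseline comparison of this kind could look roughly like the following scikit-learn sketch. The data is synthetic (generated with `make_classification`), the hyperparameters are mostly library defaults, and the scores it prints say nothing about the real racing data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for "did this horse win?" (about 10% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated AUC for each untuned model.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in sorted(auc_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {auc:.3f}")
```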

The metric that I judged the models on was area under the curve (AUC). I went with this metric because of what we were ultimately using the model for. I plan on using this model to bet on races, and therefore think that the most important things to keep in mind are the number of true positive and false positive results. In other words, how often will I win when the model tells me I should place a bet on a particular horse? The area under the curve metric reflects both of these factors.
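Assuming the metric here is the area under the ROC curve (the post does not say which curve), it can be read as the probability that a randomly chosen winner gets a higher score from the model than a randomly chosen non-winner. A small pure-Python sketch of that pairwise definition, on invented labels and scores:

```python
def roc_auc(labels, scores):
    """Fraction of (winner, non-winner) pairs in which the winner has the
    higher score; ties count as half a pair."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = the horse won, scores are model outputs.
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1]

print(roc_auc(labels, scores))  # 6 of the 8 winner/non-winner pairs are ordered correctly: 0.75
```

A score of 0.5 corresponds to random ordering and 1.0 to every winner being ranked above every non-winner.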

Results

Ultimately, I was able to achieve an area under the curve score of approximately 0.78 using the logistic regression model.

Future Work

I think that I can improve on my area under the curve score fairly easily by working more with the models. I spent most of my time engineering features and didn't have as much time as I would have liked to tune the models and check which features should be included and which shouldn't.

Furthermore, I would like to collect more data since the Kaggle dataset only included races from 2014-2017.

I would also like to see if I can get better results by trying different types of bets instead of simply picking the winning horse.

Top comments (4)

Horseracedatabase

For Hong Kong there is a website with all the data for the Sha Tin and Happy Valley courses from 1979 until today, with races, results, horses, and jockey/trainer stats available to download in several formats such as MySQL dumps, CSV, and JSON. They also offer a weekly update for the current season: horseracedatabase.com
