DEV Community


Predicting Hong Kong Horse Racing Outcomes

As a student at the Flatiron School, for my module 3 project I teamed up with Abzal Seitkaziyev to try and predict the winner of horse races.

Data Collection

We started by getting our data from Kaggle, which had a data set from the Hong Kong Jockey Club website. The races included were from 2014 to 2017.

Data Cleaning

The first step that I undertook was looking at the data to see whether any columns held information that I didn't think would be useful. After dropping a large number of the columns, I checked the remaining ones for null values.

Fortunately, after removing unwanted columns there was only one column with null values. Since there were relatively few of them, I simply removed those rows from the data set.

However, there were some other issues with the data that needed to be dealt with. One was that there were missing values that didn't show up as missing because they were input as '---'. As a result, I searched columns for these types of missing values and removed them as well.
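The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the project's actual code: the column names (`horse_id`, `finish_time`, `draw`) are hypothetical stand-ins for the Kaggle columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the race data; the column names are hypothetical.
runs = pd.DataFrame({
    "horse_id": [1, 2, 3, 4],
    "finish_time": ["83.4", "---", "84.1", "82.9"],
    "draw": [3, 7, None, 1],
})

# Turn the '---' placeholder into a real missing value so pandas can see it.
cleaned = runs.replace("---", np.nan)

# Count missing values per column, now including the former placeholders.
missing_per_column = cleaned.isna().sum()

# Drop every row that has a missing value, then restore the numeric dtype
# of the column that had been read as text because of the placeholder.
cleaned = cleaned.dropna().reset_index(drop=True)
cleaned["finish_time"] = cleaned["finish_time"].astype(float)
```

Converting the placeholder first means a single `isna()` check covers both kinds of missing value, instead of searching each column for the string separately.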

Feature Engineering

Feature engineering ended up being by far my main focus during the project. In hindsight, I should have planned ahead and set a limit on how much time I would spend engineering new features. Because I didn't, I ended up having to rush through the encoding and modeling portions of the project in order to finish in time.

While I won't go through all of the features that I engineered, I will comment on a couple of the main things that I focused on. The first thing that I knew I needed to do was to make sure that I wasn't using future information to predict the outcome of races.

For example, how fast a horse ran was clearly an important component in creating the model. However, we had to make sure to remove the speed from a particular race from the prediction process for that race, because that is not information that we will have going in. Therefore, when I was calculating the fastest that a horse had run up until that point, I made sure to remove the time from that race from consideration.
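One way to build a "fastest time so far" feature that excludes the current race is to shift each horse's times by one race before taking a running minimum. The sketch below assumes hypothetical columns `horse_id`, `race_date`, and `finish_time`, and it ignores race distance for simplicity; it is not necessarily how the project computed the feature.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
runs = pd.DataFrame({
    "horse_id": [1, 1, 1, 2, 2],
    "race_date": pd.to_datetime(
        ["2015-01-01", "2015-02-01", "2015-03-01", "2015-01-01", "2015-03-01"]
    ),
    "finish_time": [84.0, 82.5, 83.0, 85.0, 84.2],
})

# Order each horse's races chronologically so "prior" has a meaning.
runs = runs.sort_values(["horse_id", "race_date"]).reset_index(drop=True)

# shift(1) hides the current race's time; expanding().min() then takes the
# best (lowest) time over earlier races only. A horse's first race gets NaN.
runs["best_prior_time"] = runs.groupby("horse_id")["finish_time"].transform(
    lambda times: times.shift(1).expanding().min()
)
```

A horse's first recorded race has no history, so the feature is missing there and has to be filled or handled separately before modeling.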

Additionally, in order to use the categorical variables in my model, I knew that I had to encode them. I ultimately decided to use target encoding, but I knew that I had to be careful because of the possibility of target leakage. Since target encoding uses information about the target in the encoding process, it can bias the predictions if the encoding is computed on the same rows that the model is evaluated on. You want to make predictions without any knowledge of the result, because in the future when using the model, I would not have access to that information. Therefore, I used a pipeline in order to make sure that leakage did not occur.
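To illustrate why a pipeline helps, here is a minimal sketch using scikit-learn. The post does not say which encoder was used, so `MeanTargetEncoder` below is a hypothetical, hand-rolled stand-in, and the `jockey` and `trainer` columns and the synthetic target are made up for the example. Because the encoder sits inside the pipeline, cross-validation refits it on the training folds only, so the held-out fold's outcomes never feed into its own encoding.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical minimal encoder: category -> mean target seen in fit()."""

    def fit(self, X, y):
        X = pd.DataFrame(X)
        y = pd.Series(np.asarray(y), index=X.index)
        self.global_mean_ = y.mean()
        # One lookup table per column, learned from the training rows only.
        self.maps_ = {col: y.groupby(X[col]).mean() for col in X.columns}
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        # Categories never seen in fit() fall back to the overall mean.
        encoded = {
            col: X[col].map(self.maps_[col]).fillna(self.global_mean_)
            for col in X.columns
        }
        return pd.DataFrame(encoded, index=X.index).to_numpy(dtype=float)


# Synthetic data: jockey "A" wins more often than the others.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "jockey": rng.choice(["A", "B", "C", "D"], size=n),
    "trainer": rng.choice(["T1", "T2", "T3"], size=n),
})
y = (rng.random(n) < np.where(X["jockey"] == "A", 0.4, 0.1)).astype(int)

# The encoder is a pipeline step, so each CV split fits it on training folds.
model = make_pipeline(MeanTargetEncoder(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
```

Encoding the whole data set once before splitting would let every row's own result influence its encoded value, which is the leakage described above.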


When I got to the modeling, I first tried a basic version of a number of models in order to see which one performed best right off the bat. I tried a decision tree, a random forest, logistic regression, a support vector machine, AdaBoost, and a gradient boosting classifier.

The metric that I judged the models on was area under the curve. I went with this metric because of what we were ultimately using the model for. I plan on using this model in order to bet on races and therefore think that the most important things to keep in mind are the number of true positive and false positive results. In other words, how often will I win when the model tells me I should place a bet on a particular horse? The area under the curve metric takes both of these factors into account.
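A baseline comparison along these lines could look like the sketch below. It assumes that "area under the curve" means the area under the ROC curve, which is what scikit-learn's `roc_auc` scorer computes, and it runs on synthetic imbalanced data from `make_classification` rather than the race data. All models use near-default settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: roughly 10% positives, since only one horse wins a race.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# The six model families named in the post, at (near) default settings.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated ROC AUC for each model.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

best_model_name = max(auc_by_model, key=auc_by_model.get)
for name, auc in sorted(auc_by_model.items(), key=lambda item: -item[1]):
    print(f"{name}: {auc:.3f}")
```

Scoring every model with the same folds and the same metric keeps the comparison fair before any tuning is done.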


Ultimately, I was able to achieve an area under the curve score of approximately 0.78 using the logistic regression model.

Future Work

I think that I can improve on my area under the curve score pretty easily by working more with the models. I spent most of my time engineering features and didn't have as much time as I would have liked to work with the models and check which features should be included and which shouldn't.

Furthermore, I would like to collect more data since the Kaggle dataset only included races from 2014-2017.

I would also like to see if I can get better results by trying different types of bets instead of simply picking the winning horse.

Discussion (6)

Guy Freeman

The official horse race data from the Hong Kong Jockey Club (HKJC) website is now available for free as an SQLite database, as JSON and as CSV, and can also be explored and queried through the frontend.

akselne

Great initiative. Are there any plans to keep this up to date? It seems like the last data entry was from 5/5/2021.

Is this an official source of data, or is this a private project? It sure would be nice if the racing authorities provided this data for everyone for free in this format. I guess HKJC will be the first to do this. Maybe from this initiative?

P.S. I did find a few records lacking from 08/04/2021 HV and 11/04/2021 ST.

Guy Freeman

Hi, the data is indeed kept up-to-date at the new URL. The data is also cleaner now, e.g. with a column for race date. A table of the raw data is also available.

This is data scraped from the Hong Kong Jockey Club's website. Unfortunately they don't seem to provide the data in a clean way such as CSV files or an API.

NB. The records you said were lacking don't seem to be lacking any more :) e.g. races on 2021-04-11 are here:

adamhaynes

I fully understand that betting on racing might be a difficult choice sometimes. In this case I can recommend that you look at this page, where you can also find an online casino that will help you check whether luck is really on your side or not. Then you may continue making bets on horse racing, as this is really interesting.

horseracedatab1

For Hong Kong there is a website with all the data for the Sha Tin and Happy Valley courses from 1979 until today, with races, results, horses, and jockey/trainer stats, available to download in several formats such as MySQL dumps, CSV, JSON, etc. They also offer a weekly update for the current season:

austinggerald

I like horse racing so much and I would like to read an article about it. Therefore, if someone has recommendations, share with me, please. I will be glad to get it.