Making a March Madness Bracket

As a student at the Flatiron School, I was given the opportunity to come up with my own idea for a Module 4 Project. As a lifelong basketball fan, I thought a good project would be to try to use Data Science to generate the best possible March Madness Bracket. Fortunately, every year on Kaggle there is a competition to see who can create the best bracket. As a result, I had the benefit of seeing what other, more experienced Data Scientists had already done. While helpful, studying their work ended up taking much more of my time than I should have allowed.

Research/Data Collection

My first step was going to the 2019 Men's March Madness Kaggle competition page and looking at the data. This is where my problems began.

When I looked at the data set, I quickly became overwhelmed by the amount of data. There were approximately 70 CSV files included with the competition. Being a beginner in data science, I had a difficult time figuring out where to start.

I then decided to go through some of the previous Kaggle competitions and look at what the top-performing models looked like. However, many of the first models I looked at were written in R (which I don't have any experience with yet). After spending too much time trying to understand R notebooks full of code that was foreign to me, I ended up focusing on the code of one poster in particular, which can be found here.

Because I was so intimidated by the many data sets, I ended up going through all of the code in this particular competition submission, trying to follow how he prepared his data. After spending too much time doing this, I ultimately decided to use the CSV files that he created out of the data provided by the competition.

Modeling

Kaggle accepted two different types of final results. The first was a prediction for the outcome of every possible matchup from 2014-2018. The second was a prediction for every possible matchup from 2019. I decided to focus on the latter. The metric required for the competition was logloss.
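To give a sense of how many predictions "every possible matchup" means for one season, here is a minimal sketch that enumerates every pair from a 68-team tournament field. The team IDs are placeholders I made up for illustration, not the IDs used in the competition files.

```python
from itertools import combinations

# Placeholder team IDs (hypothetical); the competition data uses its own IDs.
team_ids = list(range(1, 69))  # 68 teams in the tournament field

# Every unordered pair of teams is one possible matchup to predict.
matchups = list(combinations(team_ids, 2))
n_matchups = len(matchups)

print(n_matchups)  # 68 * 67 / 2 = 2278
```

Each of those pairs needs its own win probability, whether or not the two teams ever actually meet.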

Now that I had my data, I started trying to run some models. I liked some of the modeling ideas that I got from the previously mentioned notebook that I had been learning from, so I used them in my approach.

The first idea was to get a couple of benchmark logloss results. I therefore found the logloss for a model which gives every matchup a 50/50 likelihood, which produced a logloss of about 0.69. I then tried a second benchmark model to see how a model based purely on the betting markets would do. I used a logistic regression model, and this produced a logloss of 0.55.
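The 0.69 figure for the 50/50 benchmark is not specific to this data: a prediction of 0.5 always costs -ln(0.5), which is about 0.693, no matter who wins. Here is a minimal sketch of the logloss calculation in plain Python. The outcomes are made up for illustration, and the function is my own simplified version rather than the competition's scoring code.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up game results: 1 means the first team won, 0 means it lost.
outcomes = [1, 0, 1, 1, 0]

# Benchmark: give every matchup a 50/50 likelihood.
coin_flip = [0.5] * len(outcomes)
baseline = log_loss(outcomes, coin_flip)

print(round(baseline, 4))  # 0.6931, i.e. ln(2)
```

Any model worth submitting should score below this number, and a confident wrong prediction is punished much more heavily than a cautious one.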

For my final result, I ran a logistic regression model based on the Adjusted Offensive and Defensive efficiencies of each team. This produced a logloss of 0.52 (on 2016 data).
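To illustrate the general idea of a logistic regression on efficiency features, here is a small self-contained sketch. It is not my actual project code: the numbers are invented toy data, the choice to use the difference between the two teams' efficiencies as features is an assumption made for the example, and the model is fit with a hand-written gradient descent instead of a library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.05, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean logloss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Probability that team A beats team B."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy rows (invented): [offensive efficiency diff, defensive efficiency diff],
# each computed as team A minus team B. Lower defensive efficiency is better.
X = [[8.0, -5.0], [3.0, -1.0], [-4.0, 2.0], [-7.0, 6.0],
     [5.0, 1.0], [-2.0, -3.0], [1.0, 4.0], [-6.0, -1.0]]
y = [1, 1, 0, 0, 1, 1, 0, 0]  # 1 means team A won

w, b = fit_logistic(X, y)
p_strong = predict_proba(w, b, [6.0, -4.0])  # A better on both ends
p_weak = predict_proba(w, b, [-6.0, 4.0])    # A worse on both ends
print(round(p_strong, 3), round(p_weak, 3))
```

The predicted probabilities, rather than hard win/loss picks, are what get scored by logloss.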

Conclusions

While I was able to produce a model that performed better than the betting odds, the model didn't really do as well as it appears. The model ultimately picked the higher seed in every matchup in the 2019 tournament, and only picked an upset in 11 of the 2278 possible matchups. While this does disclose some information (seeds were better indicators of success than betting markets), I would have liked to have produced a result superior to just picking the higher seed every time.

Future Work

For future work I would like to go back and create the data set myself. I ended up relying on someone else's data due to the slow progress that I was making, and the fact that I had an approaching deadline for finishing my project.

I would also like to use more features than just Adjusted Offensive and Defensive Efficiency. I used these because they were the features that worked best for the notebook that I was learning from, but I would at least like to test out many more features to see if I can come up with better results. Specifically, I would like to use the Massey Ordinals, which collect many different rankings of College Basketball teams, see which ones tend to make the best predictions, and implement those in my model.

I would also like to engineer a TrueSkill feature. This would produce my own rankings of how good each team in the tournament is.

And finally, I would also like to create a feature that finds the distance between each school and the site where it is playing, to see whether teams that play close to home experience a home court advantage.
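One way such a distance feature could be computed is with the haversine formula, which gives the great-circle distance between two latitude/longitude points. This is only a sketch of the idea: the coordinates below are hypothetical placeholders, and the competition data would still need to be joined to real campus and arena locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical coordinates, for illustration only.
campus = (40.0, -83.0)
arena = (39.8, -86.2)

travel_km = haversine_km(*campus, *arena)
print(round(travel_km, 1))
```

The difference between the two teams' travel distances could then be added as one more column next to the efficiency features.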

Lessons Learned

I realized that, in the future, I need to think through my approach to a project before diving in. Diving straight in is what I did with this project, and I ended up wasting a significant amount of time and producing a project that I was disappointed with.