Predictive modeling and statistics have always interested me. Before getting into coding I studied finance, so I've been exposed to many statistical concepts, but mainly through the lens of portfolio theory. For most people (myself included), though, statistics and their importance are much easier to digest through the world of sports. Basketball in particular is still very much in the middle of an analytics revolution, with content on advanced statistics more readily available than ever. For a while I've had the idea of creating some sort of model in the back of my mind, but until recently I could only have built it in Excel, which would not have been as satisfying as building something with actual code. Now that I can write code, I decided to use this first post to see what I could do in Ruby from scratch: find a stats framework on the basketball side of things, then see if I could re-create some semblance of it within a program.
So which stats are particularly useful as predictors of team success? Are there any inefficiencies that the market hasn't accounted for? Obviously these questions go infinitely deep, so to start I wanted to find a commonly accepted framework that wasn't too overwhelming for someone like me who is not fully versed in the world of advanced statistics. I came across Dean Oliver's Four Factors of basketball success (you can read more about them here: https://www.basketball-reference.com/about/factors.html).
The four factors break down four areas in which basketball games are won, weighted by their significance. The factors are:
Shooting (40%)- measured with eFG% (effective field goal percentage)
Turnovers (25%)- measured with TOV% (turnover percentage)
Rebounding (20%)- measured with ORB% and DRB% (offensive and defensive rebound percentage)
Free Throws (15%)- measured with FT/FGA (free throws made per field goal attempt)
(Note: the "four" factors actually comprise eight different stats, since each factor is measured on both offense and defense. Rebounding is the only factor that uses an explicitly different stat on each side: ORB% on offense and DRB% on defense.)
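To make the stats behind the factors a bit more concrete, here is a rough sketch of how they can be computed from box-score totals. The formulas are the ones given on the Basketball Reference page linked above; the method name `four_factors`, the hash keys, and all of the numbers are made up for illustration and are not from any real team.

```ruby
# Four factors for one team, computed from box-score totals.
# Formulas follow the Basketball Reference definitions.
# Results are returned as fractions (0.529), not percentages (52.9).
def four_factors(team, opponent)
  {
    # Effective FG%: a made three counts as 1.5 made field goals.
    efg_pct: (team[:fg] + 0.5 * team[:fg3]) / team[:fga],
    # Turnovers per play, where a play is a shot, a trip to the line, or a turnover.
    tov_pct: team[:tov] / (team[:fga] + 0.44 * team[:fta] + team[:tov]),
    # Share of available offensive rebounds the team grabbed.
    orb_pct: team[:orb].to_f / (team[:orb] + opponent[:drb]),
    # Share of available defensive rebounds the team grabbed.
    drb_pct: team[:drb].to_f / (team[:drb] + opponent[:orb]),
    # Free throws made per field goal attempt.
    ft_rate: team[:ft].to_f / team[:fga]
  }
end

# Made-up totals for a single game (illustration only).
team     = { fg: 40, fg3: 10, fga: 85, ft: 18, fta: 24, tov: 14, orb: 10, drb: 33 }
opponent = { orb: 9, drb: 35 }

factors = four_factors(team, opponent)
factors.each { |name, value| puts format("%-8s %.3f", name, value) }
```

Running the same method with the two hashes swapped would give the defensive side of the picture, which is where the eight stats come from.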
So with that framework to go off of, the next step was to actually grab some data to work with. I couldn't find a free API that provided all of the stats I was looking for, so I turned to Basketball Reference (https://www.basketball-reference.com/), which is an amazing resource for every stat you could think of, going back 50+ years. Even better, they provide a Miscellaneous Stats table that specifically includes the four factors. I exported the table as a CSV and read it into Ruby, and wrote a method to pull the column headers from the table and use them as labels in a hash of all the statistics.
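A minimal sketch of that CSV step, using Ruby's standard `csv` library, might look something like this. The inline string stands in for the exported file, and the column names, team names, and values are placeholders I made up (the real export has many more columns). `header_to_key` is a hypothetical helper name, not the method from my actual program.

```ruby
require "csv"

# Stand-in for the exported file. Columns and numbers are made up.
raw = <<~ROWS
  Team,W,eFG%,TOV%,ORB%,FT/FGA
  Team A,55,0.540,12.1,24.0,0.210
  Team B,30,0.495,13.8,21.5,0.188
ROWS

# Turn a column header like "eFG%" into a symbol key like :efg_pct.
def header_to_key(header)
  header.downcase
        .gsub("%", "_pct")
        .gsub(/[^a-z0-9_]+/, "_")
        .to_sym
end

# headers: true treats the first line as column labels;
# converters: :numeric turns "55" into 55 and "0.540" into 0.54.
table = CSV.parse(raw, headers: true, converters: :numeric)

# One hash per team, keyed by the cleaned-up header names.
rows = table.map do |row|
  row.to_h.transform_keys { |header| header_to_key(header) }
end

rows.each { |row| puts row.inspect }
```

For a real file, `CSV.read(path, headers: true, converters: :numeric)` returns the same kind of table, so the rest of the sketch would stay the same.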
Now how does this data get modeled? Can we test whether the four factors are actually a legitimate way of looking at how games are won? I'm going to reiterate here that I am in no way a statistician, so I once again consulted the internet to find an example of the four factors in action, and came across another blog that demonstrated some of the math used in the statistical analysis (all credit to Justin Jacobs: https://squared2020.com/2017/09/05/introduction-to-olivers-four-factors/). The post provided a great example of how to use the data: run a linear regression to model the relationship between the factors and a team's total wins. With an idea of where to take the data, I then tried to duplicate the result within Ruby.
To start I pulled down a new table, using data for all the teams from the 2016-17 season so I could compare my results to Justin's blog. I stripped the data in my hash down to just the stats relevant to the four factors and added each team's win total. Then I found a gem called 'eps' (https://github.com/ankane/eps) that would let me perform a linear regression in Ruby. I ran the regression on my hash using wins as the dependent variable and came out with figures that were nearly identical to Justin's (the slight variation is due to minor differences in the data used). To summarize the results: all of the coefficients pointed in the right directions (i.e. offensive points scored impacted wins positively and turnovers impacted them negatively) and the r² was 0.896, meaning the four factors explained roughly 90% of the variation in team win totals across the league. Seriously, read the blog post if any of this interests you, since it has a much more thorough description with graphs and the math shown. The most important part for me was that the model was able to accurately reflect reality!
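If you're curious what a linear regression is doing under the hood, here is a bare-bones sketch in plain Ruby with no gems. To be clear, this is not how eps is implemented and not the code I ran; it is ordinary least squares solved through the normal equations, and `solve` and `fit_ols` are names I made up for this sketch. The data is also made up: I built it so that wins follow the exact rule -40 + 200 * eFG% - 2 * TOV%, which means the fit should recover those numbers with an r² of essentially 1. Real data is noisier, so r² lands below 1.

```ruby
# Solve the linear system A x = b with Gauss-Jordan elimination and partial pivoting.
def solve(a, b)
  n = b.length
  m = a.each_with_index.map { |row, i| row.map(&:to_f) + [b[i].to_f] }
  n.times do |col|
    # Swap in the row with the largest value in this column for stability.
    pivot = (col...n).max_by { |r| m[r][col].abs }
    m[col], m[pivot] = m[pivot], m[col]
    pivot_value = m[col][col]
    raise "singular matrix" if pivot_value.abs < 1e-12
    m[col] = m[col].map { |v| v / pivot_value }
    # Clear this column out of every other row.
    n.times do |r|
      next if r == col
      factor = m[r][col]
      m[r] = m[r].each_with_index.map { |v, j| v - factor * m[col][j] }
    end
  end
  m.map(&:last)
end

# Ordinary least squares: find beta that solves (X'X) beta = X'y.
def fit_ols(rows, target:, features:)
  x = rows.map { |r| [1.0] + features.map { |f| r[f].to_f } } # leading 1.0 is the intercept term
  y = rows.map { |r| r[target].to_f }
  k = features.length + 1
  xtx = Array.new(k) { |i| Array.new(k) { |j| x.sum { |row| row[i] * row[j] } } }
  xty = Array.new(k) { |i| x.zip(y).sum { |row, yi| row[i] * yi } }
  beta = solve(xtx, xty)

  # r squared = 1 - (unexplained variation / total variation).
  predictions = x.map { |row| row.zip(beta).sum { |xi, bi| xi * bi } }
  mean_y = y.sum / y.length
  ss_res = y.zip(predictions).sum { |yi, pi| (yi - pi)**2 }
  ss_tot = y.sum { |yi| (yi - mean_y)**2 }

  {
    intercept: beta[0],
    coefficients: features.zip(beta.drop(1)).to_h,
    r_squared: 1 - ss_res / ss_tot
  }
end

# Made-up teams: wins = -40 + 200 * efg_pct - 2 * tov_pct exactly.
teams = [
  { efg_pct: 0.50, tov_pct: 13, wins: 34 },
  { efg_pct: 0.52, tov_pct: 12, wins: 40 },
  { efg_pct: 0.55, tov_pct: 14, wins: 42 },
  { efg_pct: 0.48, tov_pct: 15, wins: 26 },
  { efg_pct: 0.53, tov_pct: 11, wins: 44 }
]

model = fit_ols(teams, target: :wins, features: [:efg_pct, :tov_pct])
puts model.inspect
```

The signs are the part worth looking at: the shooting coefficient is positive and the turnover coefficient is negative, which is the same "pointing in the right direction" check I did on the real results.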
To wrap this up with some final thoughts:
It was humbling to dive into the world of statistics like this and see just how deep even the most 'simple' stats and frameworks can go. The reality of these analytics is much like the stock market, with everyone looking for inefficiencies, whether it's someone in a team's analytics department deciding on a draft pick or a professional bettor looking for opportunities in the lines. Just like coding/development, this is one of those topics that feels infinitely deep, but trying to bridge the gap and bring the model into reality with my own code was very fun. Better yet is the room for improvement, even in just this tiny thing I built. A few weeks from now I'll look back on this and be able to do so much more.
I don't think I'm the next Billy Beane, and I certainly wouldn't put any money on the line based on what this model spits out (it doesn't even DO anything!), but gathering statistical data like this and making use of it is definitely something I plan to work on more in the future. Just in this exercise I came across so many new concepts I'd like to learn and implement (e.g. machine learning). But those are for another time. I will definitely be revisiting this in a few weeks at the latest and will have improvements ready.
Thanks for reading and please share any feedback!