DEV Community

jwi5

Understanding the Train-Test Split with NBA Data

In data analysis and machine learning, evaluating models is essential, and a key method for this is the "train-test split." This technique lets us estimate how a model will perform on data it has never seen. In this guide, I will explain why the train-test split matters and walk through a tutorial using an NBA dataset (https://www.kaggle.com/datasets/nathanlauga/nba-games/). Our goal is to show how we can train a linear regression model that attempts to predict an NBA player's 'plus-minus' from the points scored by that player. Plus-minus is a measure of how a team performs, in terms of point differential, while a certain player is on the court.

More on the Train-Test Split

A train-test split divides our dataset into two parts: data that trains our model, and data that tests the model's performance after it has been trained. Because the model will not have seen the testing data until after training, evaluating it on this independent subset gives us a better sense of how well it is likely to perform on new, previously unseen data, which is the goal of predictive modeling with machine learning. Just as previewing exam questions before taking the test can misrepresent a student's true understanding of the material, letting a model see the testing data during training can lead to overly optimistic estimates of its ability to handle new data. The train-test split serves as a safeguard, providing a more reliable measure of the model's future performance.
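
To make the idea concrete, here is a minimal sketch of what a split does under the hood, using only Python's standard library. The function name `manual_train_test_split` and the (points, plus-minus) pairs are made up for illustration; this is not how scikit-learn implements it, just the same idea in a few lines.

```python
import random

def manual_train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows, then hold out the last test_fraction of them."""
    rng = random.Random(seed)        # seeded so the split is reproducible
    shuffled = list(rows)            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# Made-up (points, plus_minus) pairs, for illustration only
rows = [(12, 3), (25, -4), (8, 7), (30, 10), (15, -2), (20, 1), (5, -6), (18, 4)]

train_rows, test_rows = manual_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 6 2
```

Every row ends up in exactly one of the two lists, so the model can be trained on one and judged on the other.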

Tutorial: Correlating Points Scored with 'Plus/Minus' in NBA

The first thing I have to do is import train_test_split from scikit-learn, a machine learning library for Python. I then create an X feature variable from the points column of the data frame and a y dependent, or 'target', variable from the plus-minus column, as I am trying to predict plus-minus from points scored. I pass these to train_test_split to create our X and y train and test variables. I also set random_state, which seeds the shuffling so that we get the same train and test data points every time the code is run. Any integer works, as long as you keep using the same one; I use 42 because it is a popular convention. Besides the arrays to split, train_test_split also accepts test_size, train_size, shuffle, and stratify. If neither size is given, test_size defaults to 0.25, so 25% of the rows go into the testing set and the remaining 75% into the training set. You can read more about these parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.
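
Here is a hedged sketch of that step. To keep it runnable without downloading anything, it builds a small synthetic data frame instead of loading the Kaggle files, and the column names `PTS` and `PLUS_MINUS` are my assumption about how the points and plus-minus columns are labeled; adjust them to match your data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; column names are assumed, values are random
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "PTS": rng.integers(0, 40, size=200),
    "PLUS_MINUS": rng.normal(0, 10, size=200).round(),
})

X = df[["PTS"]]        # double brackets keep X two-dimensional, as scikit-learn expects
y = df["PLUS_MINUS"]   # the target can stay one-dimensional

# No test_size given, so 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)  # (150, 1) (50, 1)
```

Passing `test_size=0.2` would hold out 20% instead, and rerunning with the same `random_state` reproduces the same split.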

Now, because I want my model to use linear regression, I also import LinearRegression from scikit-learn and instantiate it. I then use '.fit()' to fit the model to our training data, which allows the model to learn the relationship between a player's points and their plus-minus statistic.

After fitting our model, I use .predict() on our X_test data to try to predict our y values, which in this case are the plus-minus stats.
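
The fit and predict steps might look like the sketch below. Again the data is synthetic: the points values are random and the weak slope of 0.1 linking them to plus-minus is invented purely so the example runs end to end, so the printed numbers say nothing about real NBA players.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the 0.1 slope and the noise level are made up
rng = np.random.default_rng(0)
points = rng.integers(0, 40, size=200)
plus_minus = 0.1 * points + rng.normal(0, 10, size=200)

X = points.reshape(-1, 1)   # one feature column, so shape (200, 1)
y = plus_minus

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()       # instantiate the estimator
model.fit(X_train, y_train)      # learn slope and intercept from training data only
y_pred = model.predict(X_test)   # predict plus-minus for the held-out players

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```

Note that the test set is only ever passed to `.predict()`, never to `.fit()`, which is what keeps the evaluation honest.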

Finally, we can look at our mean squared error and r-squared score to see how our model performed. As we can see, the r-squared is a very small number, and the mean squared error is fairly large, which tells us that our regression model is not a good fit for the data that we are using. This was somewhat expected: plus-minus depends on everyone on the court at a given time, including the 9 other players, and points scored are not necessarily related to efficiency or overall positive impact.
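
For a feel of what these two numbers mean, here is a small sketch using scikit-learn's mean_squared_error and r2_score on made-up values. The "model" here is a deliberately naive baseline that always predicts the average plus-minus; it is an assumption for illustration, not the regression from this tutorial.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up plus-minus values, for illustration only
y_test = np.array([3.0, -4.0, 7.0, 10.0, -2.0, 1.0])

# Naive baseline: ignore the input and always predict the mean of y_test
y_pred = np.full_like(y_test, y_test.mean())

mse = mean_squared_error(y_test, y_pred)  # average of the squared errors
r2 = r2_score(y_test, y_pred)             # share of variance explained

print("MSE:", mse)
print("R^2:", r2)
```

A baseline that always predicts the mean gets an r-squared of 0, so an r-squared close to 0 means the model explains little more than that baseline would. With a fitted model, the same two calls take the real `y_test` and the output of `.predict()`.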
