Benjamin Bai

SK Learn Practice: NCAA Big East

I wanted to practice using sklearn, so I put together a short outline for preparing the data, running a few basic models (Linear Regression, Ridge Regression, Lasso Regression), and comparing the results on the following dataset.

Step 1: EDA
With the target column 'wins' in mind, I dropped the columns that were not relevant (e.g., id, year, school). I also dropped columns with incomplete data because those stats have only been tracked relatively recently (e.g., offensive rating, net rating).

Before: [image: DataFrame before dropping columns]

After: [image: DataFrame after dropping columns]
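As a rough illustration of the column cleanup described in this step, here is a minimal sketch with pandas. The tiny DataFrame, the column names other than 'wins', and the 50% missing-value threshold are all assumptions made for the example; the real dataset and the exact columns dropped are in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values and most column names are made up.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2019, 2020, 2021],
    "school": ["A", "B", "C"],
    "wins": [20, 15, 25],
    "points": [2300, 2100, 2500],
    "offensive_rating": [None, 101.2, None],
    "net_rating": [None, None, 5.1],
})

# Identifier-like columns that are not useful for predicting wins.
id_cols = ["id", "year", "school"]

# Columns that are mostly missing (assumed threshold: more than 50% NaN).
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.5]

df_clean = df.drop(columns=id_cols + sparse_cols)
print(df_clean.columns.tolist())
```

Computing the missing-value share per column avoids hard-coding the list of sparse columns, but listing them by hand works just as well for a small dataset.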

Step 2: Train Test Split
Next comes the train test split. It is important to split the data before fitting any transformation (such as scaling), because a transformation fitted on the full dataset uses statistics from the test rows, which is a form of data leakage.
Train Test Doc
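A minimal sketch of the split, assuming an 80/20 ratio and a fixed random_state for reproducibility. The synthetic X and y below are placeholders for the cleaned features and the 'wins' column, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 5 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```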

Step 3: Scale
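StandardScaler is fit on the training data only and then used to transform both the training and the testing data, so every feature ends up with roughly zero mean and unit variance without any test-set statistics leaking in. This matters mainly for Ridge and Lasso, because their penalties depend on the size of the coefficients and therefore on the scale of each feature. Note that standardizing changes the scale of the data, not the shape of its distribution, so it does not make the data normally distributed.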
StandardScaler Doc
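The fit-on-train, transform-on-both pattern looks roughly like this. The data is synthetic and the variable names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder features on an arbitrary scale.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test rows (no refitting).
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

The scaled test set will usually not have exactly zero mean and unit variance, which is expected since it was scaled with the training statistics.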

Step 4: Set Up Models
For this particular example, I chose to run a Linear Regression, Ridge Regression, and Lasso Regression. These are three of the most basic and most frequently used linear models, and the two regularized variants are a good fit when working with a larger number of features.

LinearRegression serves as our simple baseline model (ordinary least squares, no regularization)

Lasso applies L1 regularization on top of linear regression (penalty on the sum of the absolute values of the coefficients)

Ridge applies L2 regularization on top of linear regression (penalty on the sum of the squared coefficients)
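The three models above can be set up as follows. This is only a sketch: alpha=1.0 is scikit-learn's default regularization strength and not necessarily what the notebook uses, and the training data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic placeholder data standing in for the scaled training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 5))
y_train = X_train @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=80)

models = {
    "OLS": LinearRegression(),   # no penalty
    "Lasso": Lasso(alpha=1.0),   # L1 penalty, assumed alpha
    "Ridge": Ridge(alpha=1.0),   # L2 penalty, assumed alpha
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.coef_.round(2))
```

Comparing the printed coefficients is a quick way to see how each penalty shrinks them relative to plain OLS. In practice alpha is usually tuned, for example with cross-validation.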

Step 5: Evaluate and Select Final Model

OLS: [image: RMSE and R2 output]
Lasso: [image: RMSE and R2 output]
Ridge: [image: RMSE and R2 output]

The Ridge model had the lowest RMSE and the highest R2 of the three on the test data, so we would go with Ridge Regression as our final model.
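A comparison like the one above can be sketched end to end as below. Everything here runs on synthetic data with assumed alpha values, so the winning model in this sketch says nothing about the real dataset. RMSE is computed as the square root of mean_squared_error.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (not the NCAA dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# alpha values are assumptions for illustration.
models = {"OLS": LinearRegression(), "Lasso": Lasso(alpha=0.1), "Ridge": Ridge(alpha=1.0)}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}
    print(f"{name}: RMSE={rmse:.3f}, R2={results[name]['r2']:.3f}")

# Pick the model with the lowest test RMSE.
best = min(results, key=lambda n: results[n]["rmse"])
print("Best by RMSE:", best)
```

With a single train test split the gaps between models can be small and noisy, so cross-validation is a reasonable next step before committing to one model.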

Notebook Link
