I wanted to practice using sklearn, so I put together a short outline for preparing the data, running a few basic models (Linear Regression, Ridge Regression, Lasso Regression), and comparing their results on the following dataset.
Step 1: EDA
With the target column of 'wins' in mind, I dropped the columns that were not relevant (e.g. id, year, school) as well as some columns with incomplete information, since those stats have only been tracked relatively recently (e.g. offensive rating, net rating).
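As a rough sketch of this step, the snippet below drops identifier-like columns and columns that are mostly missing. The table, the snake_case column names, the extra 'points' and 'rebounds' features, and the 30% missing-value threshold are all made up for illustration and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in table; column names and values are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "year": [2001, 2002, 2019, 2020],
    "school": ["A", "B", "C", "D"],
    "points": [70.1, 65.4, 80.2, 77.8],
    "rebounds": [35.0, 31.2, 38.9, 36.4],
    "offensive_rating": [np.nan, np.nan, 108.3, 104.9],
    "net_rating": [np.nan, np.nan, 6.1, 3.2],
    "wins": [18, 12, 27, 22],
})

# Share of missing values per column helps spot stats that were
# only tracked in recent seasons.
missing_share = df.isna().mean()

# Columns treated as not relevant for predicting wins.
id_like = ["id", "year", "school"]

# Columns missing in more than 30% of rows (arbitrary cutoff for this sketch).
sparse = missing_share[missing_share > 0.3].index.tolist()

clean = df.drop(columns=id_like + sparse)
print(clean.columns.tolist())
```

Checking `missing_share` before dropping makes the "incomplete information" decision explicit instead of relying on memory of which stats are new.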
Step 2: Train Test Split
Next comes the train test split. It is important to split the data into training and test sets before applying any transformation that learns from the data (such as scaling). Otherwise, information from the test set can leak into training and make the evaluation look better than it really is.
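A minimal sketch of the split, using synthetic data in place of the real features and 'wins' target. The 80/20 ratio and the random_state value are assumptions for illustration, not settings taken from the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the 'wins' target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Hold out 20% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```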
Train Test Doc
Step 3: Scale
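Fitting and transforming the training data with StandardScaler, and only transforming the testing data with that same fitted scaler, puts every feature on a comparable scale (mean 0, standard deviation 1) without leaking test-set statistics into training. This matters most for Ridge and Lasso, whose penalties depend on the size of the coefficients and therefore on the scale of the features. Note that StandardScaler does not make the data normally distributed; it only shifts and rescales each feature.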
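The fit-on-train, transform-on-test pattern might look like the sketch below. The three synthetic features with very different scales are assumptions chosen only to make the effect of scaling visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[50, 0.4, 1000], scale=[10, 0.05, 300], size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
X_train_s = scaler.fit_transform(X_train)
# Reuse those training statistics on the test rows; do not refit here.
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
print(X_test_s.mean(axis=0).round(3), X_test_s.std(axis=0).round(3))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1. That is expected, because it was scaled with the training statistics.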
StandardScaler Doc
Step 4: Set Up Models
For this particular example, I chose to run a Linear Regression, Ridge Regression, and Lasso Regression, as these are 3 of the most basic and most frequently used linear models. We are also looking at a larger number of features, which is where the regularization in Ridge and Lasso can help.
LinearRegression serves as our simple baseline model (ordinary least squares, no regularization)
Lasso applies L1 Regularization on top of Linear Regression (penalizes the sum of the absolute values of the coefficients)
Ridge applies L2 Regularization on top of Linear Regression (penalizes the sum of the squared coefficients)
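One way to set up the three models side by side is sketched here. The alpha values and the synthetic data are assumptions for illustration; the original analysis may have used different settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic, already roughly standardized features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

# alpha controls the penalty strength; these values are arbitrary.
models = {
    "linear": LinearRegression(),  # no penalty
    "ridge": Ridge(alpha=1.0),     # L2 penalty on squared coefficients
    "lasso": Lasso(alpha=0.1),     # L1 penalty on absolute coefficients
}

for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```

Comparing the printed coefficients shows the practical difference: Ridge shrinks coefficients toward zero, while Lasso can set some of them to exactly zero.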
Step 5: Evaluate and Select Final Model
The Ridge model had the lowest RMSE and the highest R2 of the three, so we would go with the Ridge Regression model as our final model.
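The comparison can be sketched end to end as below. It runs on synthetic data with assumed alpha values, so which model comes out on top here says nothing about the result reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features and the 'wins' target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0, 0.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=200)

# Split first, then scale using training statistics only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# alpha values are arbitrary choices for this sketch.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    # RMSE is the square root of the mean squared error on the test set.
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    r2 = float(r2_score(y_test, pred))
    results[name] = {"rmse": rmse, "r2": r2}
    print(f"{name:>6}  RMSE={rmse:.3f}  R2={r2:.3f}")

# Pick the model with the lowest test RMSE.
best = min(results, key=lambda n: results[n]["rmse"])
print("lowest RMSE:", best)
```

On a single fixed test set, the model with the lowest RMSE is also the one with the highest R2, because R2 is one minus the squared error divided by a quantity that depends only on the test targets.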