Ian Greengross

Data is now a little less hard

Previously I wrote about how Data is hard. Well, thanks to my mentor, Data is now a little less hard.

Over these last few weeks, I have become an 'expert' at culling out data with whole-column operations rather than the good old 'for' loop. For example, to get the data on how successful an NFL running back is on first-and-10, I used the following code:

all_rbs = (
    data.loc[(data['play_type'] == 'run') & (data['down'] == 1) & (data['ydstogo'] == 10)]
    .groupby(by='rusher_player_name')[['epa', 'success', 'yards_gained']]
    .mean()
)

Which gave me some output that looked like this:

rusher_player_name   epa        success   yards_gained
Joe.Runningback      0.042594   0.57182   4.63

Except that the output had one row like this for every running back in the dataset, which covers every NFL play from an entire season.

However, given my newfound merge skills, I was able to take that aggregate and merge it into my main data table, which has one record per running back. That gave me three new columns describing how each running back performed on 1st and 10: Expected Points Added per play, success rate, and average yards per attempt. A rough sketch of that merge is below.
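Here is roughly what that merge looks like, using the all_rbs aggregate from above. The rb_table name and the renamed columns are just placeholders for illustration, not the actual names in my project:

# rename the aggregate columns so they describe the situation
all_rbs = all_rbs.rename(columns={'epa': 'epa_1st_10',
                                  'success': 'success_1st_10',
                                  'yards_gained': 'ypa_1st_10'})

# a left join keeps every running back in the main table,
# even one without a qualifying 1st-and-10 carry
rb_table = rb_table.merge(all_rbs, how='left',
                          left_on='rusher_player_name',
                          right_index=True)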

I was able to repeat this process many times over so that my final data table had 72 points of data (or features) per running back.
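To give a flavor of what "many times over" means, here is an illustrative loop. The situations, cutoffs, and rb_table below are made up for the example; the real project uses its own splits:

# loop over a few down-and-distance situations and merge each
# set of aggregates into the main table with a descriptive suffix
situations = {
    '1st_10': (data['down'] == 1) & (data['ydstogo'] == 10),
    '2nd_long': (data['down'] == 2) & (data['ydstogo'] >= 8),
    '3rd_short': (data['down'] == 3) & (data['ydstogo'] <= 2),
}

for label, mask in situations.items():
    agg = (
        data.loc[(data['play_type'] == 'run') & mask]
        .groupby(by='rusher_player_name')[['epa', 'success', 'yards_gained']]
        .mean()
        .add_suffix('_' + label)
    )
    rb_table = rb_table.merge(agg, how='left',
                              left_on='rusher_player_name',
                              right_index=True)

Three columns per situation, so a couple dozen situations adds up to 72 features pretty quickly.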

And, now, data was a little less hard. That is, until we moved forward to the first step in turning that data into a model that can actually predict something.

Once again, my mentor took my hand and taught me the basics so that I could start using the data.

Rather than just run all of the features through a regression (or any other) analysis, he suggested that I first take each feature and make a regression plot against our target (a running back's average salary) to see whether the feature has any meaning in the first place. Since I had no idea what to do, he sent me some links and talked me through the general idea. That led to me writing this code, all by myself:

from scipy import stats
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# squared Pearson correlation between a feature and the target
def r2(x, y):
    return stats.pearsonr(x, y)[0] ** 2


all_rb = pd.read_excel('rb_all_data_a.xlsx')
all_rb.set_index('Index', inplace=True)

# columns 18 through 89 are the 72 feature columns
cols_we_want = list(all_rb.columns[18:90])

# one regression plot per feature, each saved as its own .png
# (stat_func works on the seaborn version I'm using; it was removed
# from jointplot in later releases)
for col in cols_we_want:
    sns.jointplot(x=col, y='Average_Salary', kind="reg", stat_func=r2, data=all_rb)
    plt.savefig(f'{col}.png')
    plt.close()

And, bingo, bango, bongo, in about 20 lines of code, I had 72 regression plots with r^2 listed on each plot, so I could start to weed out the features that would likely prove insignificant.

Now that the features have been cut down, I'm excited to learn the next steps so that we can build a prediction model.

Principal Component Analysis, here we come....
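I haven't gotten there yet, but from the reading I've done so far, the scikit-learn version looks roughly like this. This is just a preview under my own assumptions, not the final approach:

# just a preview of the idea: standardize the 72 features, then let
# PCA compress them into a handful of components
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

feature_cols = all_rb.columns[18:90]          # the same 72 feature columns
X = all_rb[feature_cols].fillna(0)            # PCA can't handle missing values

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to scale
pca = PCA(n_components=0.95)                  # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # far fewer than 72 columns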

To be continued.....
