DEV Community

Cover image for Analysis And Prediction On HR Data Set For Beginners
Karthik Bhandary
Karthik Bhandary

Posted on

Analysis And Prediction On HR Data Set For Beginners

Are you a newbie when it comes to Data Analysis and Data modelling? If yes, then you are in the right place.
In this blog, we are going to be performing some Exploratory Data Analysis on the HR Dataset available in Kaggle. We’ll also be using RandomForest to predict who left their company. This is a beginner-friendly dataset and it is easy to work with. With that out of the box, let’s get into the juicy stuff.

IMPORTING

The first thing we always do when we start with our work is to import the libraries. You don’t need to import every single library that you’ll be using in the notebook right at the beginning itself. You can start with the bare minimum and those include:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Enter fullscreen mode Exit fullscreen mode

Just the above are enough when you start working on a problem. From then on you can just add them at the start or you can just import them where ever you are in the notebook.
Now that we are done with importing let’s load the data.

df = pd.read_csv("../input/hr-analytics/HR_comma_sep.csv")
df.head()
Enter fullscreen mode Exit fullscreen mode

Image by author

CHECKING FOR THE NEATNESS OF THE DATA

Now that the data is loaded. Let’s take a look at how the data is, i.e, the data types, the no of NaN values etc., We can do that by using the .info() method

df.info()
Enter fullscreen mode Exit fullscreen mode

Image by author
As you can see above, all the cols are filled with no NaN values. float, int and object are the data types of the columns.

for col in df.columns:
    print(f"{col} - ", df[col].unique())
    print()

Enter fullscreen mode Exit fullscreen mode

Image by author
There seem to be no weird values in the data, which is very good. Let’s look at the metrics.

df.describe()
Enter fullscreen mode Exit fullscreen mode

Image by author
Now that we took a look at the data, Let’s get into my favourite part, EDA.

EXPLORATORY DATA ANALYSIS (EDA)

In this, we will be visualizing the data. By doing this we will be able to further understand how the data is and if there is any work that is to be done.

sns.set()#sets the style of the plot.
fig = plt.figure(figsize=(12,6))#Used to display the plot
sns.barplot(x='Department', y='satisfaction_level', hue='salary', data=df, ci=None)
plt.title("Satisfaction_level Vs Department", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author
The ones with higher salaries are more satisfied. But product_mng said hold my cup. It seems like the ones with low salaries are more satisfied than the ones with high salaries. The same is for the land.

fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='last_evaluation', data=df, ci=None)
g.bar_label(g.containers[0])
plt.title("Last Evaluation Vs Department", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author
Everybody has a similar level of evaluation done.

fig = plt.figure(figsize=(12,6))
sns.barplot(x='Department', y='number_project', data=df, ci=None)
plt.title("Number Project Vs Department", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author

fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='Work_accident', data=df, ci=None)
g.bar_label(g.containers[0])
plt.title("Work Accident Vs Department", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author

fig = plt.figure(figsize=(12,6))
g = sns.barplot(x='Department', y='satisfaction_level', hue='left', data=df, ci=None)
g.bar_label(g.containers[0])
g.bar_label(g.containers[1], rotation=90)
plt.title("Left Vs Department with satisfaction level", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author

fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='time_spend_company', data=df, ci=None, color='r', marker='o')
plt.title("Time spent per each Department", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author

fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='average_montly_hours', data=df, ci=None, color='g', marker='o')
plt.title("Avg Hours spent in the company per Department", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author

The chart looks hilarious lol🤣 Don’t know why🤷‍♂️

fig = plt.figure(figsize=(12,6))
sns.lineplot(x='Department', y='promotion_last_5years', data=df, ci=None, color='black', marker='o')
plt.title("Promotions of Last 5 years in the company per Department", size=15)
plt.show()

Enter fullscreen mode Exit fullscreen mode

Image by author
The management and the marketing department have more promotions when compared to the others.

fig = plt.figure(figsize=(12,6))
salary = df['salary'].value_counts()
sns.lineplot(x=salary.values, y=salary.index, ci=None, color='orange', marker='o')
plt.title("Salary (Counts) in the company", size=15)
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author
Now that we’ve seen how the data is let’s get into modelling.
Before getting into that let’s take a look at the correlation.

sns.heatmap(df.corr(), center=0, linewidth=1, annot=True, fmt='.2f')
plt.show()
Enter fullscreen mode Exit fullscreen mode

Image by author
Here if we observe we can see that satisfaction_level and left columns are highly negatively correlated, which I think is quite obvious.

MODELLING

We need to split our data into a training set and test set so that the model doesn’t remember the data by heart. Before doing that let us drop the categorical columns ‘Development’ and ‘salary’.

df = df.drop(['Department', 'salary'], axis=1)
Enter fullscreen mode Exit fullscreen mode

Here we are passing in a list of col names that we want to drop. By specifying axis=1 we are telling it to remove the column.
You can do it like this as well

df.drop(['Department', 'salary'], axis=1, inplace=True)
Enter fullscreen mode Exit fullscreen mode

By doing this you don’t need to specifically assign it df, since it is going to do it in place.
Now that we got rid of them let's split our data.
First, separate the target var(the one we want to predict) and the rest of the data.

X = df.drop('left', axis=1) # we will predict who left
y = df['left']
Enter fullscreen mode Exit fullscreen mode

Then we pass the X, y into the following function.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Enter fullscreen mode Exit fullscreen mode

From the name of the function, we can say that it is going to be splitting our data into training and test sets.
Now we just need to fit the training data to the model of our choice. Since we are trying to predict we can go with RandomForest.
RandomForest can do both prediction and regression.
we are also going to be using GridSearchCV to find the best params for our model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param = [
    {'n_estimators': [100, 200, 300, 400, 450, 500], 
     'max_depth': [3, 4, 6, 8, 10, 12], 
     'max_leaf_nodes': [15, 20, 25]}, 
]
rf = RandomForestClassifier()
gs_rf = GridSearchCV(rf, param, cv = 5, n_jobs = -1, verbose = 1)
gs_rf.fit(X_train, y_train)
rf_best = gs_rf.best_estimator_
pred = gs_rf.predict(X_test)
gs_rf.best_estimator_
Enter fullscreen mode Exit fullscreen mode

Image by author
So finally we are finished with the modelling. The RandomForest gave us 98% accuracy. Wow really nice. Well done if you followed until the end.

CONCLUSION

In this blog we have seen:

  • A basic workflow of things
  • How to implement RandomForest
  • How to implement GridSearchCV

You can check out my Kaggle notebook here and give it an upvote if you found it helpful

I really hope that you found this analysis helpful and interesting. If you liked my work then don’t forget to follow me on Medium and YouTube, for more content on productivity, self-improvement, Coding, and Tech. Also, check out my works on Kaggle and follow me on LinkedIn. I also write in HackerNoon, you can check that out as well.

Top comments (0)