Catherine Kawara
Exploratory Data Analysis - Ultimate Guide

Exploratory data analysis (EDA) is an essential step in any data analysis project. It involves examining and analyzing data to uncover patterns, relationships, and insights. EDA is an iterative process that begins with simple data exploration and progresses to more complex analysis.

The main objective of EDA is to gain a deeper understanding of the data and use this understanding to guide further analysis and modeling. EDA can help in making informed decisions and generating new hypotheses, which can lead to new insights and discoveries.

Steps of EDA

1. Understanding your variables

Before beginning data cleaning, it is important first to understand the variables in your dataset. Understanding the variables will help you identify potential data quality issues and determine the appropriate cleaning techniques to use.

Reading data

We will use the IT Salary Survey for the EU region (2018-2020) dataset.

import pandas as pd

data = pd.read_csv("./Data/IT Salary Survey EU  2020.csv")

# get the first five rows
data.head()

data.head() output

# get the last five rows
data.tail()

data.tail() output

data.shape

This returns the number of rows and the number of columns in the dataset. The output is (1253, 23), meaning our dataset has 1253 rows and 23 columns.

data.describe()

It summarizes the count, mean, standard deviation, quartiles, minimum, and maximum for each numeric variable.

data.describe() output

# get the name of all the columns in our dataset
data.columns

data.columns output

# check for unique values
data.nunique()

data.nunique() output

# check the unique values of a specific column
data['Seniority'].unique()

unique specific values output

2. Cleaning the data

Before you start exploring your data, it's essential to clean and preprocess the data. Data cleaning involves identifying and correcting errors and inconsistencies in the data, such as missing values, duplicate records and incorrect data types.
By doing so, we can ensure that our analysis is based on high-quality data and that any insights or conclusions drawn from the data are accurate and reliable.

Renaming columns
First, we rename our columns with shorter, more relevant names.

data.columns = ["Year", "Age", "Gender","City","Position","Experience","German_Experience","Seniority","Main_language","Other_Language","Yearly_salary","Yearly_bonus","Last_year_salary","Last_year_bonus","Vacation_days","Employment_status","Сontract_duration","Language","Company_size","Company_type","Job_loss_COVID","Kurzarbeit","Monetary_Support"]
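Note that assigning to data.columns relies on the columns being in a known order. When only some names change, a rename() mapping is a safer alternative. Here is a minimal sketch on a toy frame (the long column names are illustrative stand-ins for the survey's originals):

```python
import pandas as pd

# Toy frame standing in for the survey data (column names are illustrative)
df = pd.DataFrame({
    "Total years of experience": [5, 10],
    "Yearly brutto salary (without bonus and stocks) in EUR": [60000, 80000],
})

# rename() with a mapping only touches the listed columns, so it is
# robust to column order, unlike assigning a full list to df.columns
df = df.rename(columns={
    "Total years of experience": "Experience",
    "Yearly brutto salary (without bonus and stocks) in EUR": "Yearly_salary",
})

print(list(df.columns))  # ['Experience', 'Yearly_salary']
```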

Next, we remove any columns we don't need.

data.drop([
        'German_Experience',  # German language level isn't used in this analysis
        'Last_year_salary',   # analysis covers only 2020
        'Last_year_bonus',    # analysis covers only 2020
        'Job_loss_COVID',     # analysis isn't related to COVID
        'Kurzarbeit',         # analysis isn't related to COVID
        'Monetary_Support',   # analysis isn't related to COVID
        ], axis=1, inplace=True)

Dealing with missing and duplicate values

Duplicates are records that are identical in all variables. Duplicates and missing values can affect the statistical analysis and visualization. One approach to handling this is to remove them from the dataset.

# check for missing values
data.isna().sum()

Missing values

# Drop missing and duplicate values
data = data.dropna(subset=['Age', 'Gender', 'Position', 'Experience', 'Seniority', 'Main_language', 'Сontract_duration', 'Yearly_bonus'])
data = data.drop_duplicates()

We can run data.isna().sum() again to confirm that no missing values remain.
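To see the whole check end to end, here is a quick sanity check on a toy frame (the values are illustrative, not the survey data):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value and one duplicate row
df = pd.DataFrame({
    "Age": [30, np.nan, 25, 25],
    "Seniority": ["Senior", "Junior", "Middle", "Middle"],
})

df = df.dropna(subset=["Age"]).drop_duplicates()

# isna().sum() counts missing values per column; summing again gives the total
print(df.isna().sum().sum())  # 0
print(len(df))                # 2 rows left: the NaN row and the duplicate are gone
```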

Handling data type issues

Data type issues can arise from errors in data collection or data entry. For example, a variable may be stored as a string instead of a numerical value. One approach to handling data type issues is to convert the variable to the appropriate data type. In our case, we will extract the year from the timestamp column.

# extract the year from the submission timestamp
data['Year'] = pd.to_datetime(data['Year']).dt.year
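For the string-instead-of-number case mentioned above, pd.to_numeric with errors="coerce" is a common fix. A sketch on a hypothetical salary column (not from the survey):

```python
import pandas as pd

# Toy salary column stored as strings, with one unparseable entry
salaries = pd.Series(["60000", "72000", "n/a"])

# errors="coerce" turns values that cannot be parsed into NaN instead of
# raising, so they can then be handled like any other missing data
numeric = pd.to_numeric(salaries, errors="coerce")

print(numeric.dtype)               # float64
print(int(numeric.isna().sum()))   # 1 unparseable value became NaN
```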

3. Analyzing relationships between variables

Correlation Matrix
Correlation measures the strength and direction of the relationship between two variables. A correlation matrix, therefore, is a table that shows the correlation coefficients between many pairs of variables.

# calculate the correlation matrix (numeric columns only)
correlation = data.corr(numeric_only=True)

We will use sns.heatmap() to plot a correlation matrix of all the numeric variables in our dataset.

import seaborn as sns

# plot the heatmap
sns.heatmap(correlation, xticklabels=correlation.columns, yticklabels=correlation.columns, annot=True)

plot result
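To read the matrix: each cell holds a Pearson coefficient between -1 and 1, where values near 1 mean the two variables rise together. A tiny example with made-up numbers (not the survey data):

```python
import pandas as pd

# Toy numeric frame: salary rises perfectly linearly with experience
df = pd.DataFrame({
    "Experience":    [1, 3, 5, 7, 9],
    "Yearly_salary": [45000, 55000, 65000, 75000, 85000],
})

corr = df.corr(numeric_only=True)

# A perfectly linear relationship gives a coefficient of 1.0
print(round(corr.loc["Experience", "Yearly_salary"], 2))  # 1.0
```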

Countplot

A countplot is a bar plot in which each bar represents the number (or count) of occurrences of a category in a categorical variable.
Countplots are a useful tool for exploring the distribution of categorical data and identifying patterns or trends.

sns.countplot(x='Gender', data=data)

countplot

The hue parameter

The hue parameter is used to display the frequency of the data based on a second categorical variable. In our case, Contract_duration.

sns.countplot(x='Gender', hue='Сontract_duration', data=data)

The resulting plot shows the frequency of gender, grouped by the contract duration.

adding hue
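The same grouped counts the hued countplot draws can be inspected numerically with pd.crosstab. A sketch on toy data (the values are illustrative, not the survey's):

```python
import pandas as pd

# Toy data mirroring the two categorical columns
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Male"],
    "Contract_duration": ["Unlimited", "Temporary", "Unlimited",
                          "Unlimited", "Unlimited", "Unlimited"],
})

# Rows are Gender, columns are Contract_duration, cells are counts —
# exactly the numbers behind each hued bar
counts = pd.crosstab(df["Gender"], df["Contract_duration"])
print(counts)
```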

Histogram

A histogram is a graphical representation of the distribution of a continuous variable, typically showing the frequencies of values in intervals or "bins" along a numerical range.

bins = [15, 20, 25, 30, 35, 40, 45, 50]
pd.cut(data['Age'], bins=bins).value_counts(normalize=True).mul(100).round(1).plot(kind='bar')

age histogram
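The chart above is built from the percentage of ages falling in each (lower, upper] bin, and those numbers can be inspected directly before plotting. A sketch with toy ages (not the survey data):

```python
import pandas as pd

# Toy ages, binned the same way as the plot above
ages = pd.Series([22, 24, 27, 29, 31, 33, 38, 44])
bins = [15, 20, 25, 30, 35, 40, 45, 50]

# pd.cut assigns each age to a (lower, upper] interval;
# value_counts(normalize=True).mul(100) turns counts into percentages
pct = pd.cut(ages, bins=bins).value_counts(normalize=True).mul(100).round(1)
print(pct.sort_index())  # the percentages sum to 100
```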

Conclusion

EDA is a critical step in any data analysis project. It allows you to understand the data, identify patterns and relationships, and extract insights that can help you make informed decisions.
There are several other types of visualizations that we didn't cover that you can use depending on the dataset, but by following the steps outlined in this article, you can conduct an effective EDA and extract valuable insights from your data.

The rest of the code where I perform more EDA on this dataset can be found on GitHub.
Till next time, Happy coding!✌️
