Junn Hope

Posted on Feb 18, 2023

A beginner's Guide to Exploratory Data Analysis using Python

#eventdriven #python #datascience #tutorial

Exploratory Data Analysis (EDA) is an approach to analyze and summarize data sets in order to understand their main characteristics and detect patterns, relationships, and anomalies. EDA is typically performed at the beginning of a data analysis project and is used to gain insight into the data and identify potential problems with the data quality or the analytical approach.

EDA can involve a range of statistical and visualization techniques, such as summary statistics, histograms, box plots, scatter plots, and correlation matrices. The purpose of EDA is to uncover the main features of the data, such as its distribution, range, central tendency, variability, and any outliers or missing values. This information can be used to guide the development of more sophisticated statistical models or machine learning algorithms.

Some common tasks involved in EDA include examining the data for missing values, exploring relationships between variables, identifying outliers or anomalies, and detecting patterns or trends in the data. EDA can also involve data transformations, such as normalization or scaling, to help make the data more amenable to analysis. Ultimately, the goal of EDA is to gain a better understanding of the data and its underlying structure in order to make more informed decisions about how to analyze it.

In this blog post, we will cover some basic concepts and techniques to help you get started with EDA. We will perform EDA on the IT Salary Survey dataset, which is available at https://raw.githubusercontent.com/junn-hope/LuxAcademyBootcamp/main/IT_SalarySurvey_EU2020.csv. The IT Salary Survey dataset contains information about the salaries of IT professionals in Europe, as well as their job titles, years of experience, and other demographic information.

Importing the Data

The first step in any data analysis project is to import the data into your programming environment. The IT Salary Survey dataset is available in CSV format, which can be imported using the pandas library in Python:

import pandas as pd
import numpy as np
import matplotlib as plt
import seaborn as sns
from ydata_profiling import ProfileReport
import ipywidgets as widgets
pd.options.display.float_format = "{:,.5f}".format


url = 'https://raw.githubusercontent.com/junn-hope/LuxAcademyBootcamp/main/IT_SalarySurvey_EU2020.csv'
df = pd.read_csv(url)

The read_csv function reads the CSV file from the specified URL and creates a pandas DataFrame.
After importing the data, we can start exploring it using various techniques.

Understanding the Dataset

Once you have imported the dataset, it's important to get a basic understanding of its structure and properties. Here are some questions to ask about the dataset:

How many rows and columns does the dataset have?
What are the names of the columns?
What is the data type of each column?
Are there any missing values?

We can use basic functions like shape, columns, info, and describe to answer the questions we have posed above:

# print the shape of the data (number of rows and columns)
print(df.shape)

# print the names of the columns
print(df.columns)

# print information about the data, including data types and number of non-null values
print(df.info())

# print summary statistics for the numeric columns
print(df.describe())

# print the sum of all null values
print(df.isnull().sum())

The output is as shown:

RangeIndex: 1253 entries, 0 to 1252
Data columns (total 23 columns):
 #   Column                                                                                                                   Non-Null Count  Dtype  
---  ------                                                                                                                   --------------  -----  
 0   Timestamp                                                                                                                1253 non-null   object 
 1   Age                                                                                                                      1226 non-null   float64
 2   Gender                                                                                                                   1243 non-null   object 
 3   City                                                                                                                     1253 non-null   object 
 4   Position                                                                                                                 1247 non-null   object 
 5   Total years of experience                                                                                                1237 non-null   object 
 6   Years of experience in Germany                                                                                           1221 non-null   object 
 7   Seniority level                                                                                                          1241 non-null   object 
 8   Your main technology / programming language                                                                              1126 non-null   object 
 9   Other technologies/programming languages you use often                                                                   1096 non-null   object 
 10  Yearly brutto salary (without bonus and stocks) in EUR                                                                   1253 non-null   float64
 11  Yearly bonus + stocks in EUR                                                                                             829 non-null    object 
 12  Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country                 885 non-null    float64
 13  Annual bonus+stocks one year ago. Only answer if staying in same country                                                 614 non-null    object 
 14  Number of vacation days                                                                                                  1185 non-null   object 
 15  Employment status                                                                                                        1236 non-null   object 
 16  Сontract duration                                                                                                        1224 non-null   object 
 17  Main language at work                                                                                                    1237 non-null   object 
 18  Company size                                                                                                             1235 non-null   object 
 19  Company type                                                                                                             1228 non-null   object 
...
 21  Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week                        373 non-null    float64
 22  Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR  462 non-null    object 
dtypes: float64(4), object(19)
memory usage: 225.3+ KB

From the output of these functions, we can see that the dataset contains 1,253 rows and 23 columns. The column names can be found by running df.columns as seen below:

print(df.columns)

Whose output is:

Index(['Timestamp', 'Age', 'Gender', 'City', 'Position ',
       'Total years of experience', 'Years of experience in Germany',
       'Seniority level', 'Your main technology / programming language',
       'Other technologies/programming languages you use often',
       'Yearly brutto salary (without bonus and stocks) in EUR',
       'Yearly bonus + stocks in EUR',
       'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
       'Annual bonus+stocks one year ago. Only answer if staying in same country',
       'Number of vacation days', 'Employment status', 'Сontract duration',
       'Main language at work', 'Company size', 'Company type',
       'Have you lost your job due to the coronavirus outbreak?',
       'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
       'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR'],
      dtype='object')

The dataset contains both categorical and numerical data, and there are some missing values within some columns.

An easy way of understanding the data would be to run a ProfileReport by the ydata_profiling library package that was earlier imported.

profile = ProfileReport(df, title ="IT Survey Profile Report", html={'style':{'full_width':True}})
profile.to_notebook_iframe()

## the iframe() exports the Jupyter Notebook to HTML

Cleaning the Data

Before we can begin analyzing the data, we need to clean it. This involves dealing with missing values, removing duplicates, and correcting any errors in the data.

Dealing with Missing Values

Missing values are a common problem in real-world datasets. Before we can analyze the dataset, we need to deal with these missing values. There are several ways to do this, including:

Dropping rows or columns with missing values
Imputing missing values with a statistical measure (e.g., mean or median)
Imputing missing values with a machine learning algorithm

Let's begin by dealing with the missing values.
We first seek to understand the percentage of missing values across the columns:

df_length = len(df)
missing_percentages = df.isna().sum().sort_values(ascending= False)/df_length

missing_percentages

With the output being:

Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week                         0.70231
Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR   0.63128
Annual bonus+stocks one year ago. Only answer if staying in same country                                                  0.50998
Yearly bonus + stocks in EUR                                                                                              0.33839
Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country                  0.29370
Other technologies/programming languages you use often                                                                    0.12530
Your main technology / programming language                                                                               0.10136
Number of vacation days                                                                                                   0.05427
Years of experience in Germany                                                                                            0.02554
Сontract duration                                                                                                         0.02314
Age                                                                                                                       0.02155
Company type                                                                                                              0.01995
Have you lost your job due to the coronavirus outbreak?                                                                   0.01596
Company size                                                                                                              0.01437
Employment status                                                                                                         0.01357
Total years of experience                                                                                                 0.01277
Main language at work                                                                                                     0.01277
Seniority level                                                                                                           0.00958
Gender                                                                                                                    0.00798
Position                                                                                                                  0.00479
City                                                                                                                      0.00000
Yearly brutto salary (without bonus and stocks) in EUR                                                                    0.00000
Timestamp                                                                                                                 0.00000
dtype: float64

We can then use the fillna() method in pandas to replace missing values with a specified value.
For example, if we want to replace missing values in the 'Gender' column with 'Unknown', we can do the following:

df['Gender'] = df['Gender'].fillna('Unknown')

df['Salary one year ago'] = df['Salary one year ago'].fillna(0)

We seek to also fill missing values within the Age column by using the median age of the columns

##filled missing values in Age with the median value while making sure to skip Null values

median_age = df['Age'].median(skipna= True)
df['Age'] = df['Age'].fillna(median_age)

Correcting Errors

Finally, we need to correct any errors in the data. In this dataset, we can see that some of the values in the 'Experience' column are negative, which is clearly an error. We can correct this by taking the absolute value of the column:

df['Total years of experience'] = df['Total years of experience'].abs()

Dropping columns

We can use the drop function in pandas to remove the Timestamp column:

data = df.drop(columns=['Timestamp'])

Exploring the Dataset

Now that we have a clean dataset, we can start exploring it. Let's take a look at some basic statistics for the dataset:

# compute summary statistics for the dataset
print(df.describe())

      Age     Yearly_brutto_salary   Annual_brutto_salary   Shorter_working week 
count  1,226.00000  1,253.00000 885.00000   373.00000
mean   32.50979 80,279,042.57872    632,245.87232   12.96783
std    5.66380  2,825,061,107.59049 16,805,081.75171    15.27517
min    20.00000 10,001.00000    11,000.00000    0.00000
25%    29.00000 58,800.00000    55,000.00000    0.00000
50%    32.00000 70,000.00000    65,000.00000    0.00000
75%    35.00000 80,000.00000    75,000.00000    30.00000
max    69.00000 99,999,999,999.00000    500,000,000.00000   40.00000

The output of this command shows that the Annual brutto salary in the dataset is 632,245.87232 Euros, with a median value of 75,000.00000 Euros.

Visualizing the Data

Visualizing the data is a great way to get an initial sense of the patterns and relationships in the dataset. We can use libraries like matplotlib or seaborn in Python to create visualizations. Here are some examples of how to create visualizations for the IT Salary Survey dataset:

# create a histogram of the salaries
sns.histplot(df['Current Salary'], kde=False, bins=20)
plt.title('Histogram of Salaries')
plt.xlabel('Salary (EUR)')
plt.ylabel('Count')
plt.show()

# create a box plot of the salaries by gender
sns.boxplot(x='Gender', y='Current Salary', data=data)
plt.title('Box Plot of Salaries by Gender')
plt.xlabel('Gender')
plt.ylabel('Salary (EUR)')
plt.show()

# create a scatter plot of the salaries by years of experience
sns.scatterplot

I hope this tutorial was of much help to you as a beginner. Feel free to reach out for any contributions or collaborations.

Happy coding,
Junn Hope

DEV Community