Exploratory Data Analysis Ultimate Guide

INTRODUCTION


Exploratory Data Analysis is the process of exploring and summarizing a dataset to identify patterns, trends, and relationships in the data. EDA involves visualizing the data; identifying outliers, missing values, and other anomalies; and using statistical methods to understand the characteristics of the data. EDA is an important step in the data analysis process because it allows analysts to identify potential issues with the data, develop hypotheses, and test those hypotheses using statistical methods.

Exploratory data analysis (EDA) is an essential process in data science: understanding and summarizing the characteristics of a dataset to derive meaningful insights. EDA provides a foundation for further analysis, modeling, and decision-making. However, exploring and analyzing large and complex datasets can be a daunting task and requires specialized tools and techniques. In this article, we will discuss some of the most common and effective tools and techniques for exploratory data analysis.

Effective tools and techniques for exploratory data analysis

  1. Summary Statistics: Summary statistics such as mean, median, standard deviation, minimum, and maximum can provide a quick overview of the central tendency, variability, and range of a dataset. Descriptive statistics can be used to identify outliers, skewness, and other patterns in the data. Additionally, summary statistics can be visualized using histograms, box plots, and scatter plots to gain a deeper understanding of the distribution and relationships among variables.

  2. Data Visualization: Data visualization is a powerful technique for exploring and communicating data. Visualization techniques such as scatter plots, histograms, heatmaps, and bar graphs can be used to display the patterns and relationships within and between variables. Visualization can help detect trends, clusters, outliers, and other patterns that may be hidden in the raw data.

  3. Correlation Analysis: Correlation analysis is a statistical technique used to measure the strength and direction of the relationship between two or more variables. Correlation can be visualized using scatter plots or heatmaps, and can help identify the most significant variables in the dataset. Correlation analysis can also be used to create predictive models by identifying the variables that are most strongly correlated with the target variable.

  4. Clustering: Clustering is a technique used to group similar data points into clusters based on their similarity. Clustering can help identify patterns and relationships in the data that may not be apparent using other techniques. Clustering can be performed using unsupervised machine learning algorithms such as k-means, hierarchical clustering, or DBSCAN.

  5. Dimensionality Reduction: Dimensionality reduction is a technique used to reduce the number of features or variables in a dataset while retaining as much information as possible. This technique can be useful when working with high-dimensional data, where it may be difficult to visualize and understand the relationships among variables. Techniques such as principal component analysis (PCA) and t-SNE can be used to reduce the dimensionality of the data and identify the most important features. A short code sketch of PCA, together with the correlation analysis from point 3, follows this list.

  6. Data Preprocessing: Data preprocessing involves cleaning and transforming the data to make it suitable for analysis. Data preprocessing techniques such as imputation, normalization, and encoding can be used to handle missing values, scale the data, and convert categorical variables to numerical values. Data preprocessing can help improve the accuracy and efficiency of EDA and subsequent analysis.
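
To make two of these techniques concrete, here is a minimal, self-contained sketch of correlation analysis and PCA (points 3 and 5 above). It runs on a small synthetic dataframe rather than a real dataset, and it assumes scikit-learn is installed:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 200
experience = rng.uniform(0, 20, n)                              # years of experience
salary = 40_000 + 3_000 * experience + rng.normal(0, 8_000, n)  # correlated with experience
age = 22 + experience + rng.normal(0, 3, n)
toy = pd.DataFrame({'experience': experience, 'salary': salary, 'age': age})

# correlation analysis: pairwise Pearson correlations, shown as a heatmap
sns.heatmap(toy.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()

# dimensionality reduction: standardize, then project the three features
# onto the two principal components that capture the most variance
pca = PCA(n_components=2)
components = pca.fit_transform((toy - toy.mean()) / toy.std())
print('explained variance ratio:', pca.explained_variance_ratio_)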

How to perform Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in data analysis that helps in understanding the characteristics and patterns in a dataset. EDA involves various techniques and methods to gain insights from the data. Here are some steps that can be followed to perform EDA:

Collect and Understand the Data: The first step is to gather the data and try to understand the nature of the data. This includes identifying the data sources, collecting data, and understanding the attributes of the data.

Clean and Prepare the Data: Before starting the analysis, the data needs to be cleaned and prepared. This includes handling missing data, removing outliers, scaling or normalizing the data, and converting categorical data to numerical data.

Summarize the Data: Summary statistics such as mean, median, mode, standard deviation, and range can be calculated for each attribute to get a quick overview of the data. This can be done using tools like Excel or statistical software like R or Python.

Visualize the Data: Visualization techniques like histograms, box plots, scatter plots, and heat maps can be used to understand the distribution of the data, identify outliers and patterns, and visualize the relationship between different attributes. Visualization tools like Tableau, matplotlib, or ggplot can be used for this purpose.

Perform Statistical Analysis: Statistical techniques like correlation analysis, regression analysis, and clustering can be used to uncover patterns and relationships between different attributes. These techniques can be performed using statistical software like R, Python, or SAS.

Draw Insights: Finally, after analyzing the data using various techniques, meaningful insights can be drawn from the data. The insights can be communicated in the form of reports, presentations, or visualizations.

It's important to note that EDA is an iterative process. The steps mentioned above are not necessarily sequential and may be repeated multiple times to gain a deeper understanding of the data. EDA is an exploratory process and involves the use of creativity and intuition to uncover hidden patterns and relationships. By performing EDA, analysts can gain a better understanding of the data, identify trends and patterns, and make data-driven decisions.
**Data we are exploring today**

I got a very nice dataset of salaries from Kaggle. The dataset can be downloaded from https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region. We will explore the data and make it ready for modeling.

1. Importing the required libraries for EDA

First, you need to import the necessary libraries that will be used for the analysis. Here are the libraries you will need:

# importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(color_codes=True)

2. Loading the dataset
After importing the necessary libraries, you need to load the dataset. You can use pandas to read the CSV file; we store it in a dataframe named data, which all of the later snippets refer to:

data = pd.read_csv(r"C:\Users\Eric\Desktop\archive (1)\IT Salary Survey EU 2020.csv")
data.head(5)

3. Exploring the Dataset

Before analyzing the data, it's important to have a good understanding of what the dataset contains. Here are some methods to help you explore the dataset:
data.head() - displays the first five rows of the dataset
data.shape - displays the number of rows and columns in the dataset
data.info() - displays information about the columns in the dataset, such as data type and number of non-null values
data.describe() - displays basic statistical information about the numeric columns in the dataset

4. Cleaning the Dataset
After exploring the dataset, you may need to clean the dataset by handling missing values, renaming columns, dropping unnecessary columns, and converting data types. Here are some methods to help you clean the dataset:
data.isnull().sum() - displays the number of missing values in each column
data.drop(columns=['Column_Name'], inplace=True) - drops a column from the dataset
data.rename(columns={'Old_Column_Name': 'New_Column_Name'}, inplace=True) - renames a column in the dataset
data['Column_Name'].astype('New_Data_Type') - converts a column to a new data type.
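
These one-liners are the building blocks; a minimal sketch chaining a few of them together might look like this (the column names "City" and "Age" are assumptions for illustration; adjust them to your actual dataset):

# a minimal cleaning sketch with hypothetical column names
data = data.rename(columns={'City': 'Location'})         # clearer column name
data['Age'] = data['Age'].fillna(data['Age'].median())   # impute missing ages
data['Age'] = data['Age'].astype('int64')                # fix the data type
print(data.isnull().sum())                               # verify the result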

5. Visualizing the Dataset
Visualizations can help you to understand the distribution of the data and identify patterns in the dataset. Here are some methods to help you visualize the dataset:
sns.countplot(x='Column_Name', data=data) - displays a bar chart of the number of occurrences of each unique value in a categorical column
sns.histplot(x='Column_Name', data=data) - displays a histogram of a numeric column
sns.boxplot(x='Column_Name', y='Column_Name', data=data) - displays a box plot of a numeric column based on the values of a categorical column
sns.scatterplot(x='Column_Name', y='Column_Name', data=data) - displays a scatter plot of two numeric columns

**Dropping irrelevant columns**
data.drop(columns=['Respondent', 'MainBranch'], inplace=True)

In this code snippet, data is the pandas dataframe containing the loaded dataset, and drop is a method of the pandas dataframe that drops the specified columns. The columns parameter is used to specify the names of the columns to drop, and the inplace parameter is set to True to modify the dataframe in place.
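
One caveat worth knowing: DataFrame.drop raises a KeyError if a named column does not exist. If you are unsure whether a column is present, pandas lets you pass errors='ignore' so missing names are skipped silently:

# skip silently instead of raising KeyError when a column is absent
data.drop(columns=['Respondent', 'MainBranch'], errors='ignore', inplace=True)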

**Dropping the duplicate rows**

To drop duplicate rows from the "IT Salary Survey for EU region (2018-2020)" dataset, you can use the drop_duplicates method of the pandas dataframe. Here is an example code snippet that drops the duplicate rows:

data.drop_duplicates(inplace=True)


In this code snippet, data is the pandas dataframe containing the loaded dataset, and drop_duplicates is a method of the pandas dataframe that drops the duplicate rows. The inplace parameter is set to True to modify the dataframe in place.

By default, the drop_duplicates method considers all columns in the dataframe to determine duplicate rows. If you want to consider only certain columns to determine duplicate rows, you can pass the column names to the subset parameter of the drop_duplicates method. For example, if you want to consider only the "Country" and "SalaryUSD" columns to determine duplicate rows, you can modify the code snippet as follows:

data.drop_duplicates(subset=['Country', 'SalaryUSD'], inplace=True)


This code snippet drops the duplicate rows based on the values in the "Country" and "SalaryUSD" columns. You can modify the subset parameter to include any other columns that you want to consider to determine duplicate rows.
**Dropping the missing or null values**

To drop missing or null values from the "IT Salary Survey for EU region (2018-2020)" dataset, you can use the dropna method of the pandas dataframe. Here is an example code snippet that drops the rows with missing or null values:

data.dropna(inplace=True)


In this code snippet, data is the pandas dataframe containing the loaded dataset, and dropna is a method of the pandas dataframe that drops the rows with missing or null values. The inplace parameter is set to True to modify the dataframe in place.

By default, the dropna method drops any row that contains at least one missing or null value. If you want to drop only the rows with missing or null values in specific columns, you can pass the column names to the subset parameter of the dropna method. For example, if you want to drop the rows with missing or null values in the "Country" and "SalaryUSD" columns, you can modify the code snippet as follows:

data.dropna(subset=['Country', 'SalaryUSD'], inplace=True)

This code snippet drops the rows with missing or null values in the "Country" and "SalaryUSD" columns. You can modify the subset parameter to include any other columns you want to consider when dropping rows.

Detecting outliers
Detecting outliers is an important step in data analysis because outliers can have a significant impact on statistical analysis, machine learning models, and data visualization. Here are some reasons why detecting outliers is important:

Impact on statistical analysis: Outliers can have a significant impact on statistical analysis, such as mean, standard deviation, correlation, and regression analysis. For example, the mean and standard deviation are sensitive to outliers, and a single outlier can significantly increase or decrease their values. This can distort the analysis and lead to inaccurate conclusions.

Impact on machine learning models: Outliers can also have a significant impact on machine learning models, such as linear regression, decision trees, and clustering. Outliers can skew the model's parameters and lead to poor performance or overfitting. Therefore, it is important to detect and remove outliers before training the machine learning models.

Impact on data visualization: Outliers can also affect data visualization, such as boxplots, histograms, and scatter plots. Outliers can distort the scale of the plot, making it difficult to interpret the data. By detecting and removing outliers, the data visualization can better reflect the underlying distribution and patterns in the data.

In summary, detecting outliers is important to ensure the accuracy and validity of statistical analysis, machine learning models, and data visualization. By removing outliers, we can obtain a more accurate representation of the data and make better decisions based on the analysis.

To detect outliers in the "IT Salary Survey for EU region (2018-2020)" dataset, you can use various statistical techniques and visualization tools. A common visual approach is the boxplot:

Boxplot: You can use a boxplot to visualize the distribution of a numerical variable and detect potential outliers. In a boxplot, outliers are represented by individual points beyond the whiskers. Here is an example code snippet that creates a boxplot of the "Age" column:

sns.boxplot(x=data['Age'])

sns.boxplot is a method of the seaborn library that creates a boxplot. The x parameter is set to the column to plot, here the "Age" column of the data dataframe.
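
Beyond eyeballing the boxplot, the IQR and Z-score rules (mentioned again in the conclusion) can flag outliers programmatically. Here is a minimal sketch, assuming the numeric "Age" column; the 1.5 and 3 thresholds are conventional defaults, not values fixed by this dataset:

import numpy as np

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = data['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = data[(data['Age'] < q1 - 1.5 * iqr) | (data['Age'] > q3 + 1.5 * iqr)]

# Z-score rule: flag values more than 3 standard deviations from the mean
z = (data['Age'] - data['Age'].mean()) / data['Age'].std()
z_outliers = data[np.abs(z) > 3]

print(len(iqr_outliers), 'IQR outliers,', len(z_outliers), 'z-score outliers')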

Plot different features against one another (scatter plots) and against frequency (histograms)

Scatter plot of "SalaryUSD" against "Experience":

import matplotlib.pyplot as plt

plt.scatter(data['Experience'], data['SalaryUSD'])
plt.xlabel('Experience')
plt.ylabel('SalaryUSD')
plt.show()

In this code snippet, plt.scatter is a method of the matplotlib library that creates a scatter plot. data['Experience'] and data['SalaryUSD'] select the "Experience" and "SalaryUSD" columns, which are passed as the x and y values, respectively. The plt.xlabel and plt.ylabel methods set the labels of the x and y axes.

Histogram of "Age" with 20 bins:

plt.hist(data['Age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In this code snippet, plt.hist is a method of the matplotlib library that creates a histogram. data['Age'] selects the "Age" column to plot, and the bins parameter sets the number of bins. The plt.xlabel and plt.ylabel methods set the labels of the x and y axes.

Density plot of "SalaryUSD" grouped by "Gender":

import seaborn as sns

sns.kdeplot(data=data, x='SalaryUSD', hue='Gender')
plt.xlabel('SalaryUSD')
plt.ylabel('Density')
plt.show()

In this code snippet, sns.kdeplot is a method of the seaborn library that creates a density plot. The data parameter is set to the pandas dataframe containing the loaded dataset, and the x parameter is set to the name of the column to plot, which is "SalaryUSD" in this case. The hue parameter is set to the name of the column to group by, which is "Gender" in this case. The plt.xlabel and plt.ylabel methods set the labels of the x and y axes, respectively.

Exploratory Data Analysis (EDA) is a crucial step in data science that involves understanding the dataset and its underlying structure. It helps in discovering patterns, relationships, and outliers in the data, which can inform further analysis or modeling. In this article, we explored the "IT Salary Survey for EU region (2018-2020)" dataset and performed various EDA tasks using Python.

First, we loaded the dataset using the pandas library and checked its basic properties, such as shape, data types, and summary statistics. We found that the dataset has 8792 rows and 23 columns, with some missing values and duplicate rows that needed to be dropped.

Next, we performed data cleaning tasks such as dropping irrelevant columns, duplicate rows, and missing or null values. We also detected outliers using methods such as the Z-score, the IQR rule, and scatter plots.

Finally, we plotted different features against one another and against frequency using scatter plots, histograms, and density plots. These visualizations helped us understand the distribution, correlation, and variation of the data and identify interesting patterns.

In conclusion, EDA is an essential step in data science that helps in understanding the data and informing further analysis or modeling. Python provides libraries such as pandas, matplotlib, and seaborn that make it easy to perform EDA tasks and visualize the data. By following the steps outlined in this article, data scientists can gain valuable insights from their data and make better decisions based on the analysis.

Stay tuned for more updates

Thank you!
