# Introduction

Exploratory data analysis (EDA) is an essential step in the data science process. It combines statistical methods and data visualization techniques to uncover patterns and trends, understand relationships, and gain meaningful insights from data, so that the problem is well understood and decisions are well informed. The amount of data generated worldwide is growing rapidly: one widely cited estimate is that by 2025 the world will generate 463 exabytes of data per day. Data scientists and data analysts therefore need to understand the EDA process.

Whether you’re an experienced data scientist or a beginner, this blog will walk you through the exciting process of EDA and how to use Python to perform this essential task.

So buckle up, it’s time to dive into the world of data and unearth some hidden insights.

“Data is like garbage. You’d better know what you are going to do with it before you collect it.” — Mark Twain.

### Why Exploratory data analysis (EDA) is an essential task.

• EDA helps data practitioners understand and gain insights from data before applying machine learning and statistical techniques.
• EDA helps identify patterns, anomalies, and relationships within the data so as to make informed decisions and develop effective strategies.
• EDA aims to detect faulty points in the data, such as errors or missing values, so that they can be corrected.

#### Exploratory data analysis (EDA) steps.

To carry out EDA, the following steps need to be taken:

• Importing necessary libraries.
• Viewing the dataset.
• Checking for duplicates.
• Data preparation (handling missing values and outliers).
• Analyzing the data (univariate, bivariate and multivariate analysis).
• Visualization.

Wondering where to get datasets for practice? We got you covered. Take a look at the following resources.

Let's now perform EDA on the following sample dataset from Kaggle by working through these steps:

Dataset

Importing necessary Libraries

Python is a versatile programming language with robust libraries for data analysis and machine learning, including Pandas, NumPy, seaborn, Matplotlib, Plotly, TensorFlow, and Keras.
We will make use of the following modules:

``````import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
``````

Get to understand other Python libraries for data analysis by reading through the following article.

Data Science Libraries Every Data scientist should know.

The pandas library can be used to read various datasets from multiple formats such as csv, txt, Excel, JSON, SQL, etc.

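``````# Reading a dataset from a SQL database
# (other formats have their own readers: pd.read_csv, pd.read_excel, pd.read_json)
import sqlite3
import pandas as pd
# 'example.db' and TableName are placeholders; point them at your own database and table
connection = sqlite3.connect('example.db')
df = pd.read_sql('SELECT * FROM TableName', connection)
``````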
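To try the readers without any external files, here is a minimal sketch that feeds pandas small in-memory inputs. The toy values, the table name `ads`, and the in-memory SQLite database are assumptions made for illustration; they are not part of the dataset used in this post.

```python
import io
import sqlite3

import pandas as pd

# Toy data, made up for illustration
csv_text = "TV,Radio,Sales\n230.1,37.8,22.1\n44.5,39.3,10.4\n"
json_text = (
    '[{"TV": 230.1, "Radio": 37.8, "Sales": 22.1},'
    ' {"TV": 44.5, "Radio": 39.3, "Sales": 10.4}]'
)

# CSV: read_csv accepts a path, a URL, or a file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record
df_json = pd.read_json(io.StringIO(json_text))

# SQL: write the frame to an in-memory SQLite table, then query it back
connection = sqlite3.connect(":memory:")
df_csv.to_sql("ads", connection, index=False)
df_sql = pd.read_sql("SELECT * FROM ads", connection)
connection.close()

print(df_csv.shape, df_json.shape, df_sql.shape)
```

All three readers return an ordinary DataFrame, so every step that follows works the same way regardless of where the data came from.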

Our dataset is in CSV format, so we load it into a DataFrame named data as follows:

``````data=pd.read_csv('https://raw.githubusercontent.com/PhylisKorir/SalesPrediction-using-Linear-Regression/main/Electronic_sales.csv')
data
``````

Viewing the DataFrame.

We can quickly find out how many rows and columns there are in our dataset by using the shape attribute. It returns a tuple containing the number of rows and the number of columns.

Shape of the Data

``````data.shape
``````

Previewing the Dataset.

Preview the first 5 rows.

``````data.head()
``````

Preview the last five rows.

``````data.tail()
``````

The head() and tail() methods preview the first five rows and the last five rows of the dataset, respectively.
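As a quick illustration of these calls, here is a small sketch on a hypothetical eight-row DataFrame (the column names and values are made up, not taken from the sales dataset):

```python
import pandas as pd

# Hypothetical toy data: 8 rows, 2 columns
toy = pd.DataFrame({
    "TV": [10, 20, 30, 40, 50, 60, 70, 80],
    "Sales": [1, 2, 3, 4, 5, 6, 7, 8],
})

n_rows, n_cols = toy.shape   # shape is an attribute, so no parentheses
first_five = toy.head()      # default is n=5
last_three = toy.tail(3)     # pass n to change the number of rows

print(n_rows, n_cols)        # 8 2
print(first_five.index.tolist())
print(last_three.index.tolist())
```

Both head() and tail() take an optional row count, which is handy when five rows are too few or too many to get a feel for the data.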

Column names

``````data.columns
``````

The DataFrame.columns attribute holds the names of all the columns of a pandas DataFrame. It is an Index object containing the column labels.

Concise info of dataset

``````data.info()
``````

The info() method allows us to obtain additional information about the dataset, such as the names of the columns, the data type of each column, and the number of non-null values.

Data types

``````data.dtypes
``````

The DataFrame.dtypes attribute returns a pandas Series holding the data type of each column.

Descriptive statistics.

``````data.describe()
``````
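To see what the inspection calls above return without downloading anything, here is a small sketch on a made-up DataFrame. The column names `TV` and `Channel` and their values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame with one numeric and one text column (values are made up)
toy = pd.DataFrame({
    "TV": [10.0, 20.0, 30.0, 40.0],
    "Channel": ["a", "b", "a", "b"],
})

print(toy.columns.tolist())   # ['TV', 'Channel']
print(toy.dtypes)             # one dtype per column

summary = toy.describe()      # on a mixed frame, only numeric columns are summarized by default
print(summary)
```

Here the text column `Channel` is left out of the summary, which is why describe() on a real dataset may show fewer columns than data.columns does.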

The describe() method in pandas generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. For numeric columns it provides the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile and maximum of the data.

Checking Duplicates

``````data.duplicated().sum()
``````

The duplicated() method returns a boolean Series that marks as True every row that repeats an earlier row. Calling sum() on that Series counts the True values, which gives the number of duplicate rows in the DataFrame.
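Here is a minimal sketch of the duplicate check on a toy DataFrame with one repeated row. The data is made up for illustration; drop_duplicates() and nunique() are shown as related pandas calls, not as steps the original analysis necessarily ran.

```python
import pandas as pd

# Toy data: the third row repeats the first one
toy = pd.DataFrame({"TV": [10, 20, 10, 30], "Sales": [1, 2, 1, 3]})

dup_mask = toy.duplicated()        # boolean Series; True marks a repeat of an earlier row
n_duplicates = dup_mask.sum()      # each True counts as 1
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row
unique_per_column = toy.nunique()  # number of distinct values in each column

print(n_duplicates)
print(len(deduped))
print(unique_per_column.to_dict())
```

Note the parentheses on duplicated(): it is a method, so it has to be called before sum() can be chained onto its result.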

Unique Values

DATA CLEANING
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It involves handling missing values and handling outliers.

Missing values

We will begin by scanning the dataset for missing values. We can do this with the isna() method, which returns a DataFrame of boolean values indicating whether each field is null. Chaining the sum() method then counts the missing values in each column. Our dataset has no null values, so it is ready for analysis.
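Since the sales dataset has nothing missing, here is a small sketch on a toy DataFrame with deliberately injected NaN values, to show what the check reports when gaps do exist. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with two deliberately missing entries
toy = pd.DataFrame({
    "TV": [10.0, np.nan, 30.0],
    "Radio": [1.0, 2.0, np.nan],
    "Sales": [5.0, 6.0, 7.0],
})

missing_per_column = toy.isna().sum()          # count of NaN values in each column
total_missing = int(missing_per_column.sum())  # count across the whole frame

print(missing_per_column.to_dict())
print(total_missing)
```

A column that shows 0 here needs no attention; any non-zero count is a cue to decide between dropping and filling those rows.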

It is common for datasets to contain errors, missing values, outliers, or other inconsistencies. In case your dataset has missing values, this blog (Handling Missing Values) will guide you through handling them.

"No data is clean, but most is useful."~ Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ

Analyzing Data

With our cleansed dataset we can go ahead and begin the task of exploring the data.

Univariate Analysis.
Univariate analysis is a form of exploratory data analysis (EDA) that involves the examination of a single variable. It is used to summarize the data and gain insight into the data's distribution, central tendency, and variability. It can be used to answer questions such as what the range of the data is, what the most common value is, and whether there are any outliers. It is also used to identify any trends or patterns in the data.

We can visualize this information using a boxplot from Seaborn.

``````# Visualizing each variable's distribution and spotting outliers with box plots
fig, axs = plt.subplots(3, figsize=(5, 5))
# plt.subplots(3) returns an array of three Axes, so each plot gets its own
sns.boxplot(x=data['TV'], ax=axs[0])
sns.boxplot(x=data['Newspaper'], ax=axs[1])
sns.boxplot(x=data['Radio'], ax=axs[2])
plt.tight_layout()
plt.show()
``````

With the above box plots we can see the distribution of the data in each variable.
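Points drawn beyond a box plot's whiskers are the ones usually treated as outlier candidates. By default seaborn's whiskers reach at most 1.5 times the interquartile range (IQR) past the quartiles, so the same rule can be sketched numerically. The toy values below are made up, with 100 planted as an obvious outlier; this is an illustrative check, not a step from the original analysis.

```python
import pandas as pd

# Toy data; 100 is a deliberately planted outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100], name="TV")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR rule: anything outside these bounds is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr)
print(outliers.tolist())
```

Whether a flagged point is an error or a genuine extreme value is a judgment call that depends on the data, so it is worth inspecting such rows before removing them.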

Bivariate analysis

Bivariate analysis involves analyzing data with two variables or columns. This is usually a way to explore the relationships between these variables and how they influence each other, if at all.
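One common way to put a number on a two-variable relationship is Pearson's correlation coefficient. The sketch below computes it by hand and compares the result with pandas' built-in Series.corr(). The two toy series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])   # toy predictor
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])  # toy response, roughly 2 * x

# Pearson's r: co-variation of x and y, scaled by the spread of each
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_pandas = x.corr(y)  # Pearson is the default method

print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 means a straight line describes the pair poorly, even if some other pattern is present.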

``````# Let's see how Sales are related with other variables using scatter plot.
sns.pairplot(data, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()
``````

Multivariate analysis

Multivariate analysis is a type of data analysis that involves examining more than two variables at once in order to better understand the relationships between them. It is a powerful tool for exploratory data analysis, as it allows researchers to identify patterns and trends in large datasets that would otherwise be difficult to spot. It also allows researchers to explore the relationships between multiple variables and determine which ones are most important in predicting the outcome of interest.

``````# Let's see the correlation between different variables.
sns.heatmap(data.corr(), cmap="Greens", annot = True)
plt.show()
``````

As is visible from the scatter plot and the heatmap, the variable TV seems to be the most strongly correlated with Sales.
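A heatmap reading like this can be cross-checked by sorting each variable's correlation with the target. The sketch below does so on synthetic data that merely borrows the column names from this post: the values are randomly generated, with Sales constructed to depend mostly on TV, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors (random values, not the real advertising data)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Synthetic target: driven mostly by TV, slightly by Radio, not at all by Newspaper
sales = 0.05 * tv + 0.02 * radio + rng.normal(0, 1, n)

synthetic = pd.DataFrame(
    {"TV": tv, "Radio": radio, "Newspaper": newspaper, "Sales": sales}
)

# Take the Sales column of the correlation matrix, drop its self-correlation, and rank
corr_with_sales = (
    synthetic.corr()["Sales"].drop("Sales").sort_values(ascending=False)
)
print(corr_with_sales)
```

On the real data, the same three lines applied to data instead of synthetic would give the ranking that the heatmap shows in color.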

Conclusions

Comments

Chris Greening:
Hey Phylis, thanks for sharing!

A really neat tool I found a while back for `pandas` is the `ydata-profiling` library. It takes your `DataFrame` and outputs a full report about the data with summary stats, visualizations, etc. It's a really fantastic EDA tool!

NOTE: the tool used to be called `pandas-profiling`, but they recently changed it to `ydata-profiling`, just FYI.

Phylis Jepchumba:
Will check it out!

Matt Curcio:
YES, love Pandas-Profiling or Ydata-Profiling. Great starter.

Phylis Jepchumba:
Thank you

Rodgers Nyangweso:
Great piece here, thanks. Very helpful.