Mugi Mugendi

Exploratory Data Analysis

Exploratory data analysis (EDA) is an essential step in the data science process. It helps to uncover patterns, trends and correlations that are not easily visible in a dataset. EDA is especially important if you are dealing with large datasets or if you need to find relationships between variables. In Python, it is possible to use the pandas library to work with data frames, create visualizations and carry out correlation tests. By leveraging data frames, we can easily explore our dataset and gain insights into how different variables interact with each other. Moreover, we can build models based on the insights from our exploratory analysis. This will help us make better predictions or decisions based on our datasets. In this guide, we will cover the essential techniques and tools for EDA in Python.

STEPS IN EXPLORATORY DATA ANALYSIS

Importing and Loading Data

The first step in any data analysis project is to import and load the data. Python has many libraries for reading data from various sources, such as CSV, Excel, SQL databases, and more. Some popular libraries for loading data include pandas, NumPy, and SciPy.

For example, to load a CSV file in pandas, you can use the following code:

import pandas as pd
df = pd.read_csv('data.csv')

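Other sources follow the same pattern. The sketch below is only an illustration: the CSV text, the `people` table, and the column names are made up, and an in-memory SQLite database stands in for a real SQL server so that the example runs without any external files. `pd.read_excel` works similarly for spreadsheets but needs an Excel engine such as openpyxl installed.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content held in memory so the example needs no file on disk.
csv_text = "name,age,city\nAmina,34,Nairobi\nBrian,29,Mombasa\nChao,41,Kisumu\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical SQL source: an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
df_csv.to_sql("people", conn, index=False)

# Load only the rows and columns we need with a SQL query.
df_sql = pd.read_sql_query(
    "SELECT name, age FROM people WHERE age > 30 ORDER BY age", conn
)
conn.close()

print(df_csv.shape)
print(df_sql)
```

Filtering in the query rather than after loading can save memory when the source table is large.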

Understanding the Data

Understand the data: its shape, rows (samples), columns (features), feature types, null values, and similar properties.
Get introductory details about the data: check a few basics such as the number of rows, the number of columns, and the data type of each column.

Get statistical insight into the data: look at summary statistics such as count, mean, standard deviation, minimum, median, and maximum.
Here are some of the methods used:

data.head()  # view the first few rows
data.tail()  # view the last few rows
data.describe()  # summary statistics for the numeric columns
data.shape  # (number of rows, number of columns)
data.columns  # column names
data.nunique()  # number of unique values in each column
data['feature'].unique()  # unique values of one column ('feature' is a placeholder name)
data.isnull().sum()  # number of null values in each column
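
To see what these methods return, it can help to run them on a tiny dataset. The sketch below uses a made-up DataFrame (the `age`, `city`, and `income` columns are assumptions for illustration, not part of any real dataset) and adds `dtypes`, `info()`, and `value_counts()`, which cover the feature-type checks described above.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with mixed types and a few missing values.
data = pd.DataFrame({
    "age": [34, 29, np.nan, 41, 29],
    "city": ["Nairobi", "Mombasa", "Kisumu", None, "Mombasa"],
    "income": [52000.0, 48000.0, 61000.0, 75000.0, 48000.0],
})

print(data.shape)                   # (rows, columns)
print(data.dtypes)                  # data type of each column
data.info()                         # column types plus non-null counts in one report
print(data.describe())              # summary statistics for the numeric columns
print(data.isnull().sum())          # missing values per column
print(data["city"].value_counts())  # frequency of each category
```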

Cleaning and Preprocessing Data

Clean the data of irregularities, uninformative features, and noisy outliers. This involves removing or filling in missing values, handling outliers, scaling the data, and more. Pandas provides many methods for cleaning and preprocessing data, such as dropna(), fillna(), replace(), apply(), and more.

For example, to remove missing values from a DataFrame, you can use the following code:

df.dropna(inplace=True)

data.isnull().sum()  # number of missing values for each variable
data.dropna(axis=0, inplace=True)  # remove rows that contain NULL entries, if any exist
data[column] = data[column].fillna(data[column].mean())  # or fill NULL entries with the mean, the median, or a constant
data.duplicated().sum()  # total number of duplicate rows
data.drop_duplicates(inplace=True)  # remove duplicate rows
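
Handling outliers and scaling are mentioned above but not shown. One possible approach, sketched here with a made-up `income` column, is the 1.5 × IQR rule for filtering outliers followed by min-max scaling. Both are common choices rather than the only ones, and the right thresholds depend on your data.

```python
import pandas as pd

# Hypothetical numeric column with one obvious outlier (500).
data = pd.DataFrame({"income": [48, 50, 52, 51, 49, 53, 500]})

# IQR rule: keep values within 1.5 * IQR of the first and third quartiles.
q1 = data["income"].quantile(0.25)
q3 = data["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = data[data["income"].between(lower, upper)].copy()

# Min-max scaling: rescale the remaining values to the range [0, 1].
col = cleaned["income"]
cleaned["income_scaled"] = (col - col.min()) / (col.max() - col.min())

print(cleaned)
```

Dropping outliers is not always appropriate. It is worth checking whether an extreme value is an error or a genuine observation before removing it.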

Visualizing Data

Visualization is a crucial part of EDA, as it allows us to see patterns and relationships that might not be apparent from numerical summaries alone. It converts raw data into a visual form, such as a graph, which makes the data easier to understand and makes useful insights easier to extract.
Python has many libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and more.

For example, to create a scatter plot using Matplotlib, you can use the following code:


import matplotlib.pyplot as plt
plt.scatter(df['x'], df['y'])
plt.show()

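A scatter plot shows how two variables relate, while histograms and box plots are common first looks at a single variable. The following sketch draws all three side by side with Matplotlib. The data and the output file name `eda_plots.png` are placeholders, and the `Agg` backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; replace with your own DataFrame.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],
})

fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(12, 4))

ax_hist.hist(df["y"], bins=4)         # distribution of a single variable
ax_hist.set_title("Histogram of y")

ax_box.boxplot(df["y"])               # median, quartiles, and outliers
ax_box.set_title("Box plot of y")

ax_scatter.scatter(df["x"], df["y"])  # relationship between two variables
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("y against x")

fig.tight_layout()
fig.savefig("eda_plots.png")  # hypothetical output file name
```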

Here's an introductory tutorial on Matplotlib
Here's one on seaborn

Exploring Relationships

Once we have summarized and visualized the data, the next step is to explore relationships between variables. This involves calculating correlations, creating heatmaps, and more. Pandas provides many methods for exploring relationships, such as corr(), pivot_table(), and more.

For example, to calculate the correlation matrix for a DataFrame, you can use the following code:


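# numeric_only=True (pandas 1.5+) skips non-numeric columns, which would otherwise raise an error in pandas 2.x
print(df.corr(numeric_only=True))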

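Heatmaps and pivot tables, mentioned above, can be sketched in the same way. The example below uses a made-up dataset (the `city`, `age`, and `income` columns are assumptions), computes the correlation matrix, summarizes income by city with `pivot_table()`, and draws a basic heatmap with Matplotlib's `imshow`. Seaborn's `heatmap` function is a popular alternative if that library is installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset: one categorical column and two numeric columns.
df = pd.DataFrame({
    "city": ["Nairobi", "Nairobi", "Mombasa", "Mombasa", "Kisumu", "Kisumu"],
    "age": [25, 40, 30, 50, 35, 45],
    "income": [32, 58, 41, 79, 50, 72],
})

# Pearson correlation between the numeric columns only (needs pandas 1.5+).
corr = df.corr(numeric_only=True)
print(corr)

# Mean income per city.
by_city = df.pivot_table(values="income", index="city", aggfunc="mean")
print(by_city)

# Basic heatmap of the correlation matrix.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax)
```

Keep in mind that a high correlation only indicates a linear association between two variables. It does not show that one causes the other.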

In this guide, we have covered the essential techniques and tools for exploratory data analysis in Python. By using these techniques, you can gain valuable insights from your data and improve the performance of your models. Remember that EDA is an iterative process, and you should always be exploring and testing new ideas.
Below is a link to my GitHub with an example of EDA in Python.
GITHUB
