Understanding Your Data: The Essentials of Exploratory Data Analysis

#python #datascience #machinelearning

Introduction

Exploratory data analysis (EDA) is the process data scientists use to analyze and investigate data sets and summarize their main characteristics, often by applying data visualization methods. It is a vital step in data science because it helps you understand the data you are dealing with and draw sound conclusions from it. EDA serves as a bridge between data collection and building machine learning models.

Prerequisites

What Is Exploratory Data Analysis
Data Preprocessing and Feature Engineering in Data Science
Types of Exploratory Data Analysis
EDA Python Libraries

1. What Is Exploratory Data Analysis

Exploratory data analysis (EDA) is a critical initial step in the data science workflow. It involves using Python libraries to inspect, summarize, and visualize data to uncover trends, patterns, and relationships.
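As a rough illustration of the "inspect and summarize" part, the sketch below runs a few common Pandas calls on a tiny dataset. The DataFrame and its column names (`age`, `country`, `income`) are invented for this example and are not from any real source.

```python
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
df = pd.DataFrame({
    "age": [23, 35, 31, None, 52],
    "country": ["KE", "UG", "KE", "TZ", "KE"],
    "income": [1200.0, 2500.0, 1800.0, 2100.0, 4000.0],
})

print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # count of missing values per column
print(df.describe())    # summary statistics for the numeric columns

# Keep a few facts in variables so they can be reused later.
n_rows, n_cols = df.shape
missing_age = int(df["age"].isna().sum())
```

On a real project you would typically load the data with something like `pd.read_csv(...)` first and then run the same calls.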

2. Data Preprocessing and Feature Engineering in data science.

Data preprocessing and feature engineering are crucial steps in preparing datasets for effective model training.

Data Preprocessing

Data preprocessing involves cleaning and preparing raw data so that it can be analyzed effectively and so that models trained on it can be validated reliably. Data preprocessing includes:

Handling missing values.
Detecting outliers in a dataset.
Encoding categorical variables. Variables such as gender or country name need to be converted into a numerical format for machine learning algorithms. Techniques like one-hot encoding or label encoding transform categorical variables into a format that algorithms can understand.
Checking for duplicate entries in a dataset.
Performing a train/test split, which divides the dataset into one set for training the model and another for testing it.
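The steps listed above can be sketched with Pandas alone. This is a minimal example under stated assumptions: the data is made up, the median is just one possible way to fill missing values, the 1.5 × IQR rule is one common outlier heuristic, and the 75/25 split is done with `DataFrame.sample` rather than a dedicated machine learning library.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "age": [25, 32, np.nan, 41, 29, 32, 38, 27, 45, 32],
    "income": [2100, 2500, 2300, 2700, 2200, 2500, 2600, 2400, 50000, 2500],
})

# 1. Remove duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Fill missing ages with the median (one simple strategy among many).
df["age"] = df["age"].fillna(df["age"].median())

# 3. Flag outliers in income using the 1.5 * IQR rule.
q1, q3 = df["income"].quantile(0.25), df["income"].quantile(0.75)
iqr = q3 - q1
df["is_outlier"] = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# 4. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["gender"])

# 5. Train/test split: hold out 25% of the rows for testing.
test = encoded.sample(frac=0.25, random_state=42)
train = encoded.drop(test.index)

print(encoded.head())
print(len(train), "training rows,", len(test), "test rows")
```

In practice the split is often done before imputation so that statistics such as the median are computed on the training set only; the order here is kept simple for readability.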

Feature Engineering

Feature engineering is a critical task that significantly influences the outcome of a model. It involves crafting new features from existing data, a task known as creating derived features. For example, extracting the day of the week from a date or creating interaction terms between existing features can provide valuable information.
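Both examples mentioned above can be sketched in a few lines of Pandas. The column names (`order_date`, `price`, `quantity`) and the values are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical order data, invented for illustration.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-10"]),
    "price": [10.0, 20.0, 5.0],
    "quantity": [3, 1, 4],
})

# Derived feature from a date: day of the week and a weekend flag.
df["day_of_week"] = df["order_date"].dt.day_name()
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

# Interaction term between two existing features.
df["revenue"] = df["price"] * df["quantity"]

print(df)
```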

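Dimensionality reduction is another feature engineering method. High-dimensional datasets may suffer from the curse of dimensionality, leading to increased computational complexity and potential overfitting. Techniques such as principal component analysis (PCA) reduce the number of features while retaining most of the variance in the data.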
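One widely used dimensionality reduction technique is principal component analysis (PCA). The sketch below implements it with NumPy's singular value decomposition on synthetic data; the data, the random seed, and the choice of keeping 2 components are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples with 5 features that are really
# noisy mixtures of only 2 underlying (latent) variables.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

# PCA via SVD: center the data, then decompose it.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first k principal components.
k = 2
X_reduced = X_centered @ Vt[:k].T

print(explained_variance_ratio.round(4))
print(X_reduced.shape)
```

Because the synthetic features were built from two latent variables, the first two components capture almost all of the variance, which is the situation where reducing dimensions loses little information.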
Handling outliers is also important, because outliers can distort model training. Techniques such as trimming, winsorizing, or transforming features can mitigate the impact of outliers on model performance.
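The three outlier techniques named above can be compared side by side on a small series. The values and the 5th/95th percentile cut-offs are assumptions chosen for illustration; appropriate thresholds depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one obvious outlier (250).
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 250], name="value")

lower, upper = s.quantile(0.05), s.quantile(0.95)

# Trimming: drop observations outside the chosen percentile range.
trimmed = s[(s >= lower) & (s <= upper)]

# Winsorizing: keep every observation but cap extremes at the percentiles.
winsorized = s.clip(lower=lower, upper=upper)

# Transforming: a log transform compresses large values.
log_transformed = np.log1p(s)

print(len(s), len(trimmed), len(winsorized))
print(winsorized.max(), round(log_transformed.max(), 3))
```

Trimming changes the number of rows, while winsorizing and transforming keep the row count and only change the values.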

3. Types of Exploratory Analysis.

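EDA methods are commonly grouped by the number of variables involved and by whether the method is graphical:

Univariate (Non-Graphical).
Univariate (Graphical).
Bivariate, which is the two-variable case of multivariate analysis.
Multivariate (Non-Graphical).
Multivariate (Graphical).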

Univariate non-graphical

This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
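A minimal sketch of univariate non-graphical analysis is to compute summary statistics for a single column. The `scores` series below is invented for this example.

```python
import pandas as pd

# Hypothetical exam scores, invented for illustration.
scores = pd.Series([55, 61, 61, 70, 72, 75, 80, 88, 93], name="score")

summary = {
    "count": int(scores.count()),
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().tolist(),
    "std": scores.std(),
    "min": scores.min(),
    "max": scores.max(),
    "skew": scores.skew(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For a categorical variable, `value_counts()` plays the same descriptive role by listing how often each category occurs.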

Univariate graphical

Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of univariate graphics include:

Stem-and-leaf plots, which show all data values and the shape of the distribution.
Histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
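A histogram and a box plot can be drawn with Matplotlib as sketched below. The data is randomly generated for illustration, and the `Agg` backend is selected only so the script also runs on machines without a display. Note that Matplotlib's default box plot whiskers extend to 1.5 × IQR; passing `whis=(0, 100)` is used here so the whiskers reach the minimum and maximum and the plot matches the five-number summary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, generated for illustration only.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=200)

# Five-number summary: min, Q1, median, Q3, max.
five_number = np.percentile(values, [0, 25, 50, 75, 100])
print(five_number.round(2))

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: each bar is the count of values falling in that bin.
counts, bin_edges, _ = ax_hist.hist(values, bins=10, edgecolor="black")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")

# Box plot with whiskers drawn at the minimum and maximum.
ax_box.boxplot(values, whis=(0, 100))
ax_box.set_title("Box plot")
ax_box.set_ylabel("value")

fig.tight_layout()
# fig.savefig("univariate_plots.png")  # or plt.show() in an interactive session
```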

Multivariate Non-graphical

Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
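Cross-tabulation and summary statistics across variables can be sketched with Pandas as follows. The small DataFrame and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical customer data, invented for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "purchased": ["yes", "no", "yes", "yes", "no", "no"],
    "age": [25, 32, 47, 51, 23, 36],
    "income": [2100, 2500, 3900, 4200, 1900, 2800],
})

# Cross-tabulation: counts for each combination of two categorical variables.
table = pd.crosstab(df["gender"], df["purchased"])
print(table)

# Statistic for two numeric variables: Pearson correlation.
corr = df["age"].corr(df["income"])
print(round(corr, 3))

# Statistic for one numeric variable per level of a categorical variable.
mean_income = df.groupby("gender")["income"].mean()
print(mean_income)
```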

Multivariate graphical

Multivariate graphical EDA uses graphics to display relationships between two or more variables. The most commonly used graphic is a grouped bar plot (or bar chart), with each group representing one level of one variable and each bar within a group representing a level of the other variable.

Other common types of multivariate graphics include:

Scatter plot: plots data points on a horizontal and a vertical axis to show how strongly one variable is related to another.
Multivariate chart: a graphical representation of the relationships between factors and a response.
Run chart: a line graph of data plotted over time.
Bubble chart: a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Heat map: a graphical representation of data where values are depicted by color.
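Two of the charts above, a scatter plot and a heat map, can be sketched with Matplotlib. The data is synthetic and the column names (`hours_studied`, `exam_score`, `hours_slept`) are invented; the heat map here shows the correlation matrix, which is one common use rather than the only one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, generated for illustration only.
rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, size=50)
df = pd.DataFrame({
    "hours_studied": hours,
    "exam_score": 40 + 5 * hours + rng.normal(0, 5, size=50),
    "hours_slept": rng.uniform(4, 9, size=50),
})

corr = df.corr()  # pairwise Pearson correlations

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(11, 4))

# Scatter plot of two numeric variables.
ax_scatter.scatter(df["hours_studied"], df["exam_score"])
ax_scatter.set_xlabel("hours_studied")
ax_scatter.set_ylabel("exam_score")
ax_scatter.set_title("Scatter plot")

# Heat map of the correlation matrix: color encodes the value.
image = ax_heat.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_heat.set_yticks(range(len(corr.columns)))
ax_heat.set_yticklabels(corr.columns)
ax_heat.set_title("Correlation heat map")
fig.colorbar(image, ax=ax_heat, label="correlation")

fig.tight_layout()
# fig.savefig("multivariate_plots.png")  # or plt.show() in an interactive session
```

Seaborn offers higher-level helpers for the same charts; plain Matplotlib is used here to keep the dependencies minimal.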

4. EDA Python Libraries.

Popular Python libraries for EDA include:

Pandas for data manipulation.
Matplotlib and Seaborn for visualizations.
Plotly for interactive plots.
Dask for scalable computing.

Example:

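# Data manipulation and numerical computing
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings to keep the output readable
# (note that this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')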

These libraries enhance data analysis by offering powerful tools for summarizing, visualizing, and managing large datasets effectively.