When working with a new dataset, it's important to explore the data to understand its structure, patterns, and anomalies. This process, known as Exploratory Data Analysis (EDA), helps you get familiar with the data before diving into modeling or drawing conclusions.
Exploratory data analysis is one of the basic and essential steps of a data science project; data scientists are often said to spend around 70% of their time on EDA.
Key aspects of EDA include:
Distribution of Data: Examining the distribution of data points to understand their range, central tendencies (mean, median), and dispersion (variance, standard deviation).
Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to visualize relationships within the data and distributions of variables.
Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence statistical analyses and might indicate data entry errors or unique cases.
Correlation Analysis: Checking the relationships between variables to understand how they might affect each other. This includes computing correlation coefficients and creating correlation matrices.
Handling Missing Values: Detecting and deciding how to address missing data points, whether by imputation or removal, depending on their impact and the amount of missing data.
Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
Types of Exploratory Data Analysis (EDA)
1. Univariate Analysis
Definition: Focuses on analyzing a single variable at a time.
Purpose: To understand the variable's distribution, central tendency, and spread.
Techniques (a short sketch follows this list):
- Descriptive statistics (mean, median, mode, variance, standard deviation).
- Visualizations (histograms, box plots, bar charts, pie charts).
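As a quick illustration, here is a minimal univariate sketch, assuming a pandas DataFrame df loaded from a hypothetical CSV file with a hypothetical numeric column 'Temperature':

# Minimal univariate sketch (hypothetical file and column names)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("your_data.csv")       # assumed input file
print(df['Temperature'].describe())     # mean, std, quartiles, min/max
print(df['Temperature'].mode())         # most frequent value(s)
plt.hist(df['Temperature'], bins=30)    # shape of the distribution
plt.title('Distribution of Temperature')
plt.show()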
2. Bivariate Analysis
Definition: Examines the relationship between two variables.
Purpose: To understand how one variable affects or is associated with another.
Techniques (a short sketch follows this list):
- Scatter plots.
- Correlation coefficients (Pearson, Spearman).
- Cross-tabulations and contingency tables.
- Visualizations (line plots, scatter plots, pair plots).
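A minimal bivariate sketch, reusing the df and imports from the sketch above and again assuming hypothetical 'Temperature' and 'Humidity' columns:

# Minimal bivariate sketch (hypothetical column names)
plt.scatter(df['Temperature'], df['Humidity'], alpha=0.5)
plt.xlabel('Temperature')
plt.ylabel('Humidity')
plt.show()
# Pearson correlation coefficient between the two variables
print(df['Temperature'].corr(df['Humidity'], method='pearson'))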
3. Multivariate Analysis
Definition: Investigates interactions between three or more variables.
Purpose: To understand the complex relationships and interactions in the data.
Techniques (a short sketch follows this list):
- Multivariate plots (pair plots, parallel coordinates plots).
- Dimensionality reduction techniques (PCA, t-SNE).
- Cluster analysis.
- Heatmaps and correlation matrices.
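A minimal multivariate sketch using seaborn, continuing with the same df and hypothetical column names:

# Minimal multivariate sketch (hypothetical column names)
import seaborn as sns

cols = ['Temperature', 'Humidity', 'Pressure']
sns.pairplot(df[cols])                    # pairwise scatter plots and distributions
plt.show()
sns.heatmap(df[cols].corr(), annot=True)  # correlation heatmap
plt.show()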
4. Descriptive Statistics
Definition: Summarizes the main features of a data set.
Purpose: To provide a quick overview of the data.
Techniques:
- Measures of central tendency (mean, median, mode).
- Measures of dispersion (range, variance, standard deviation).
- Frequency distributions.
5. Graphical Analysis
Definition: Uses visual tools to explore data.
Purpose: To identify patterns, trends, and data anomalies through visualization.
Techniques:
- Charts (bar charts, histograms, pie charts).
- Plots (scatter plots, line plots, box plots).
- Advanced visualizations (heatmaps, violin plots, pair plots).
How to Perform Exploratory Data Analysis?
In this article, I'll demonstrate the process using a sample weather dataset. This will be a hands-on approach, so we'll walk through each step with simple explanations.
Let's get started!
Step 1: Loading the Data
Importing Libraries
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pandas is used for data manipulation and analysis.
numpy provides support for large, multi-dimensional arrays and matrices.
matplotlib is a plotting library for creating static, animated, and interactive visualizations.
seaborn is built on top of matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.
Loading the data
The first step in EDA is loading your data into a DataFrame.
We can do this using pandas.
# Load the dataset
data_frame_name = pd.read_csv("File_path_to_your_csv_file")
The argument in the parentheses is the path to your CSV file. Alternatively, you can assign the file path to a variable first and pass that in:
# Load the dataset
file_path = "File_path_to_your_csv_file"
data_frame_name = pd.read_csv(file_path)
You can get a quick overview of the data and its structure by using the .head() method. It gives us a glance at the first few rows to understand the basic structure: what columns are present, how the data is organized, and any initial impressions you might have about the values. By default, it shows the first 5 rows, including column names.
# Display the first five rows
data_frame_name.head()
# Alternatively you can use a print function. Gives you the same results
print(data_frame_name.head())
But wait, what if you want the last rows instead? Then we use the .tail() method.
# Display the last five rows
data_frame_name.tail()
# Alternatively you can use a print function. Gives you the same results
print(data_frame_name.tail())
Hmm, how about both the head and tail of the dataframe? We can do that by simply calling our dataframe.
# Display the first five and last five rows
data_frame_name
# Alternatively you can use a print function. Gives you the same results
print(data_frame_name)
Step 2: Checking the Data's Structure
Understanding the structure means knowing the number of rows, columns, and data types present in the dataset. This can give you clues about the kind of analysis you'll be able to perform.
# Check the structure of the dataset
data_frame_name.info()
This command will show the total number of entries (rows), the number of columns, and the data type of each column. It also highlights how many non-null values are present in each column.
You can also get the number of rows and columns separately using .shape:
# Display the number of rows and columns in the DataFrame (rows,cols)
data_frame_name.shape
and the column names using .columns:
# Display the column names of the DataFrame
data_frame_name.columns
and the data types using .dtypes
# Display the data types of each column
data_frame_name.dtypes
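If a column's type looks wrong, say, dates stored as generic object strings, you can convert it. A hedged sketch, assuming a hypothetical 'Date' column:

# Convert a hypothetical 'Date' column from object (string) to datetime
data_frame_name['Date'] = pd.to_datetime(data_frame_name['Date'])
data_frame_name.dtypes  # 'Date' should now appear as datetime64[ns]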
Step 3: Summarizing the Data
Next, you'll want to get a summary of the numerical columns. This provides an overview of the data's central tendency, dispersion, and the shape of the distribution. We do that using the .describe() method. This command gives you a quick statistical summary of each numeric column, including the mean, standard deviation, minimum, and maximum values, which helps in identifying any outliers or unusual distributions.
# Get a summary of numerical columns
data_frame_name.describe()
Step 4: Identifying Missing Values
Missing values can be tricky—they might represent gaps in data collection, or they might be errors. It's essential to identify and decide how to handle them.
# Check for missing values
data_frame_name.isnull().sum()
Luckily, our dataset does not have any missing values.
If any columns have missing values, you'll need to decide whether to remove them or fill them with an appropriate value (like the mean or median).
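If missing values were present, either option might look like this (hypothetical column name):

# Fill missing values in a numeric column with the column median (hypothetical name)
data_frame_name['Temperature'] = data_frame_name['Temperature'].fillna(
    data_frame_name['Temperature'].median()
)
# Or drop any rows that contain missing values
data_frame_name = data_frame_name.dropna()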
We can also check for duplicate rows:
# Check for and count duplicate rows
data_frame_name.duplicated().sum()
Luckily, our dataset does not have any duplicates.
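If duplicate rows did show up, they could be removed like this:

# Drop duplicate rows, keeping the first occurrence of each
data_frame_name = data_frame_name.drop_duplicates()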
Step 5: Visualizing the Data
Visualization helps you see patterns, trends, and relationships in the data that might not be obvious from raw numbers. Common ways to visualize data include:
- Histograms: Display the distribution of numerical data.
- Scatter plots: Show the relationship between two numerical variables.
- Bar charts: Compare categorical data.
- Line charts: Visualize data over time.
Let's start with a simple histogram to understand the distribution of a particular column, such as temperature:
# Plot a histogram of the 'Column_name' column
plt.hist(data_frame_name['Column_name'], bins=30, edgecolor='black')  # you can change the edge color to anything
plt.title('Distribution of Column_name')
plt.xlabel('Column_name') # Label for the x-axis (horizontal axis)
plt.ylabel('Frequency') # Label for the y-axis (vertical axis)
plt.show()
- bins=30: divides the data into 30 bins (intervals) for counting frequencies. You can adjust this number.
- edgecolor='black': adds black outlines to the bars for better visual separation.
Let's change the edgecolor to white and add gridlines:
# Plot a histogram of the 'Column_name' column
plt.hist(data_frame_name['Column_name'], bins=30, edgecolor='white')  # you can change the edge color to anything
plt.title('Distribution of Column_name')
plt.xlabel('Column_name') # Label for the x-axis (horizontal axis)
plt.ylabel('Frequency') # Label for the y-axis (vertical axis)
plt.grid(True) # Add gridlines for better readability
plt.show()
Try using different columns and different visualization methods.
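For instance, here is a hedged sketch of a scatter plot and a line chart, assuming hypothetical 'Temperature', 'Humidity', and 'Date' columns:

# Scatter plot: relationship between two numerical columns (hypothetical names)
plt.scatter(data_frame_name['Temperature'], data_frame_name['Humidity'], alpha=0.5)
plt.xlabel('Temperature')
plt.ylabel('Humidity')
plt.title('Temperature vs Humidity')
plt.show()

# Line chart: a value over time (assumes a datetime 'Date' column)
plt.plot(data_frame_name['Date'], data_frame_name['Temperature'])
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.title('Temperature over Time')
plt.show()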
Step 6: Finding Correlations
Correlation analysis helps you understand how different variables relate to each other. This is especially useful if you plan to build a predictive model later.
# Compute correlation matrix
correlation_matrix = data_frame_name.corr()
# Display the correlation matrix
correlation_matrix
But wait, we get an error. Why is that? The .corr() method works only on numerical data, and our dataset also contains non-numeric columns. How do we handle this? We extract the numerical columns into a new DataFrame, then call .corr() on that instead.
# Extract numerical features for correlation
numerical_data = data_frame_name.select_dtypes(include=['number'])
# Compute correlation matrix on numerical data
correlation_matrix = numerical_data.corr()
# Display the correlation matrix
correlation_matrix
This matrix shows how each pair of columns relates. Values close to 1 or -1 indicate a strong relationship, while values near 0 suggest little to no relationship.
A way to visualize it is using heatmaps. Heatmaps are a type of data visualization that use color to represent the values in a matrix or table.
sns.heatmap(data, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Heatmap Title")
plt.show()
- data: the 2D dataset (e.g., a correlation matrix) that you want to visualize.
- annot=True: displays the numerical value within each cell of the heatmap.
- cmap='coolwarm': sets the color palette for the heatmap. 'coolwarm' is a common choice, but you can explore other options.
- fmt=".2f": formats the displayed numerical values to two decimal places.
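Applied to the correlation matrix we computed above, that looks like:

# Visualize our correlation matrix as a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()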
Exploratory Data Analysis Tools
Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language. For EDA, Python (with libraries like pandas and seaborn) makes it easy to, for example, identify missing values in a dataset so you can decide how to handle them before machine learning.
R: An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. R is widely used among statisticians and data scientists for statistical analysis and visualization.
Remember:
An efficient EDA lays the foundation of a successful machine learning pipeline.
EDA is not just about statistics; it's about understanding the story your data tells.
Visualization is key to uncovering patterns and anomalies.
Domain knowledge is essential for interpreting findings effectively.
By mastering EDA, you lay a strong foundation for building predictive models, making data-driven decisions, and gaining valuable insights from your data.