Exploratory Data Analysis (EDA) is one of the fundamental steps in a Data Science project. In this article we will look at what EDA is, how it is applied, and why it matters in the Data Science world.
What is Exploratory Data Analysis?
Exploratory Data Analysis is a technique used by Data Scientists and Analysts to investigate datasets and summarize their main characteristics, mostly with the help of data visualization tools such as matplotlib.
EDA helps us identify errors in a dataset, understand its patterns, and detect outliers. This step is useful because it helps ensure that the results drawn from the dataset are valid.
Steps in Exploratory Data Analysis
1. Understand the Data and Problem
The first step is to look at the dataset we are dealing with and understand the problem we are trying to solve. Here we set out clear objectives for what we want to achieve.
2. Data Collection
Here we import the dataset into the environment we are working in. For example, to load a CSV file with pandas we use the following commands:
import pandas as pd
df = pd.read_csv('weather_data.csv')
We then inspect the dataset, checking the rows and columns and looking for missing data or errors.
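The inspection described above could look roughly like the sketch below. The CSV content, column names and city names are made up for illustration, standing in for weather_data.csv so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for weather_data.csv (columns and values are assumptions).
csv_text = """date,city,temperature,humidity
2024-01-01,Nairobi,24.5,60
2024-01-02,Nairobi,,65
2024-01-03,Mombasa,30.1,80
2024-01-03,Mombasa,30.1,80
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first few rows
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type of each column

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = df.duplicated().sum()
print(missing_per_column)
print(duplicate_rows)
```

With a real file, the only change would be passing the file path to pd.read_csv instead of the in-memory text.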
3. Data Cleaning
In data cleaning we look at a few things:
Remove any duplicates in the dataset
Check for missing values, then impute or remove them
Fix any apparent errors in the dataset
Convert columns to appropriate data types
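A minimal sketch of these cleaning steps with pandas is shown below. The small DataFrame is hypothetical, and filling missing values with the median is just one possible imputation choice, not the only valid one.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing temperature,
# dates stored as strings and humidity stored as text.
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-03"],
    "temperature": [24.0, None, 30.0, 30.0],
    "humidity": ["60", "65", "80", "80"],
})

# 1. Remove duplicate rows.
clean = raw.drop_duplicates().copy()

# 2. Impute missing temperatures with the column median (an assumed strategy).
clean["temperature"] = clean["temperature"].fillna(clean["temperature"].median())

# 3. Convert columns to appropriate data types.
clean["date"] = pd.to_datetime(clean["date"])
clean["humidity"] = clean["humidity"].astype(int)

print(clean)
print(clean.dtypes)
```

Whether to impute or drop missing rows depends on how much data is missing and why, so the choice above should be revisited for each dataset.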
4. Data Visualization
Now that we have explored and cleaned our data, we can present our findings graphically so that they can be understood by anyone who is not familiar with the dataset in its raw form.
Some of the chart types we can use include:
Bar charts
Box plots
Scatter plots
Heatmaps
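As a rough illustration of two of the chart types listed above, the sketch below draws a bar chart and a box plot with matplotlib. The city labels, temperature values and output file name are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical temperature readings for two made-up cities.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "temperature": [24.0, 25.0, 23.0, 30.0, 31.0, 29.0],
})

mean_temp = df.groupby("city")["temperature"].mean()

fig, (ax_bar, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: one bar per city showing the mean temperature.
ax_bar.bar(list(mean_temp.index), mean_temp.values)
ax_bar.set_title("Mean temperature by city")

# Box plot: spread of the readings within each city.
groups = [g["temperature"].to_numpy() for _, g in df.groupby("city")]
ax_box.boxplot(groups)
ax_box.set_xticklabels(list(mean_temp.index))
ax_box.set_title("Temperature spread by city")

fig.tight_layout()
fig.savefig("bar_and_box.png")
```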
Types of Exploratory Data Analysis
There are three main types of EDA:
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
a). Univariate Analysis
Involves looking at one variable at a time. This can help you understand its distribution and identify outliers. We can use a histogram to present this graphically.
Example of a univariate analysis:
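A small sketch of a univariate analysis, assuming a hypothetical temperature column with one planted outlier. The 1.5 x IQR rule used to flag outliers is one common convention, chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical readings; the value 40 is a deliberately planted outlier.
temperature = pd.Series([21, 22, 22, 23, 23, 23, 24, 24, 25, 40], name="temperature")

# Summary statistics for the single variable.
print(temperature.describe())

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = temperature.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = temperature[(temperature < q1 - 1.5 * iqr) | (temperature > q3 + 1.5 * iqr)]
print(outliers)

# Histogram of the variable's distribution.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(temperature, bins=10)
ax.set_xlabel("temperature")
ax.set_ylabel("count")
fig.savefig("univariate_hist.png")
```

In the histogram the outlier shows up as an isolated bar far to the right of the rest of the values.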
b). Bivariate Analysis
Involves looking at two variables at a time. This can help you identify the relationship between them. Graphically, we can use a scatter plot to represent this data.
Example of a bivariate analysis:
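A minimal sketch of a bivariate analysis, assuming hypothetical temperature and humidity columns. The Pearson correlation is added here as a numeric complement to the scatter plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical paired readings (values are made up for illustration).
df = pd.DataFrame({
    "temperature": [20, 22, 24, 26, 28, 30],
    "humidity": [85, 80, 76, 70, 66, 60],
})

# Pearson correlation between the two variables.
corr = df["temperature"].corr(df["humidity"])
print(corr)

# Scatter plot: each point is one observation.
fig, ax = plt.subplots()
ax.scatter(df["temperature"], df["humidity"])
ax.set_xlabel("temperature")
ax.set_ylabel("humidity")
fig.savefig("bivariate_scatter.png")
```

In this made-up data humidity falls as temperature rises, so the points slope downward and the correlation is strongly negative.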
c). Multivariate Analysis
Involves taking three or more variables to help identify the relationships between them. Graphically, we can use a pair plot to represent this data.
Example of a multivariate analysis:
Tools used in Exploratory Data Analysis
Different tools can be used for EDA, for example Python and R. In this article we focus on Python.
Libraries used for EDA in Python include:
Pandas
NumPy
Matplotlib
Seaborn
Conclusion
In conclusion, EDA is an important part of working on any data problem. To reach valid and conclusive results, we should perform EDA as one of the key steps in building a solution to real-life problems.