Exploratory Data Analysis, commonly known as EDA, is a crucial step in data analysis. It lets us examine and explore the characteristics of a dataset so that we understand the data before modelling it or drawing conclusions from it.
This step is made up of several processes:
- Data visualization, which is the graphical representation of the dataset using graphs, charts, and other plots. This helps us identify trends in the data and see how the values in different columns relate to each other. Python libraries such as seaborn and Matplotlib are widely used to visualize data.
import seaborn as sns
# df is assumed to be a pandas DataFrame that has already been loaded
sns.displot(df, x='Temp_C')
sns.displot(df, x='Dew_Point_Temp_C', kde=True)
The snippet above creates histograms of the temperature and dew point temperature columns. In the second call, kde=True overlays a kernel density estimate curve on the histogram.
- Descriptive analysis, which is the statistical summary of the different columns: values such as the count, mean, standard deviation, quartiles, and the minimum and maximum of each column. The range can be read off as the difference between the maximum and the minimum. The following code produces this summary:
df.describe(include="all")
- Cleaning the dataset, which is the removal of errors from the data. At this stage we deal with missing and inconsistent values, which makes it easier to get accurate information from the data.
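As a minimal sketch of what the cleaning stage can look like, the snippet below works on a small made-up DataFrame. The Temp_C column name comes from the plots above, while the Weather column, the sample values, and the choice to fill missing temperatures with the median are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
raw = pd.DataFrame({
    "Temp_C": [1.5, np.nan, 3.0, 3.0, 40.0],
    "Weather": ["Fog", "fog ", "Rain", "Rain", None],
})

# Count missing values per column before changing anything.
missing_before = raw.isna().sum()

cleaned = raw.copy()
# Make inconsistent labels comparable: trim whitespace and unify the case.
cleaned["Weather"] = cleaned["Weather"].str.strip().str.title()
# Fill missing numeric values with the column median (one possible choice).
cleaned["Temp_C"] = cleaned["Temp_C"].fillna(cleaned["Temp_C"].median())
# Drop rows that still lack a label, then drop exact duplicate rows.
cleaned = (
    cleaned.dropna(subset=["Weather"])
    .drop_duplicates()
    .reset_index(drop=True)
)

print(missing_before)
print(cleaned)
```

Whether to fill, drop, or keep a missing value depends on the dataset, so the steps here are one reasonable option rather than a fixed recipe.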
- Identifying and handling outliers. Outliers are anomalous data points that deviate significantly from the other values in the dataset. It is important to identify them because they can skew the results and make it difficult to draw meaningful conclusions. There are various methods for identifying outliers, and one of the most common is the box plot. Points that fall beyond the whiskers of the plot (not merely outside the box) are flagged as outliers and are candidates for removal.
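The rule a box plot conventionally uses to flag outliers can also be applied directly in code: a point is flagged when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below assumes a small made-up series of temperature readings; the values are hypothetical.

```python
import pandas as pd

# Hypothetical temperature readings; 45.0 is planted as an obvious outlier.
temps = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 45.0], name="Temp_C")

# Quartiles and interquartile range.
q1 = temps.quantile(0.25)
q3 = temps.quantile(0.75)
iqr = q3 - q1

# Fences used by the conventional 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask that is True for values outside the fences.
is_outlier = (temps < lower) | (temps > upper)
outliers = temps[is_outlier]
filtered = temps[~is_outlier]

print(outliers)
print(filtered)
```

Dropping the flagged rows is only one option. If an extreme value is a genuine measurement rather than an error, it may be better to keep it and use statistics that are less sensitive to it, such as the median.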
- Understanding the distribution of the data. Knowing whether a column is roughly symmetric or skewed helps determine which statistical methods are appropriate for the analysis.
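One quick numeric check of a column's distribution is its skewness, which pandas exposes as Series.skew(). The sketch below compares two made-up series. The helper name describe_shape and the 0.5 cut-off are assumptions for illustration (a rough rule of thumb, not a standard).

```python
import pandas as pd

# Hypothetical data: one symmetric series and one with a long right tail.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 20.0])


def describe_shape(values: pd.Series, threshold: float = 0.5) -> str:
    """Label a series by its sample skewness.

    The threshold is an illustrative rule of thumb, not a fixed standard.
    """
    skew = values.skew()
    if skew > threshold:
        return "right-skewed"
    if skew < -threshold:
        return "left-skewed"
    return "roughly symmetric"


for name, series in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name, describe_shape(series), "mean:", series.mean(), "median:", series.median())
```

For the right-skewed series the mean is pulled well above the median, which is a hint that the median or a rank-based method may summarize the column better than the mean.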
In conclusion, EDA is very important because it helps us learn more about the data and identify patterns in it, which in turn helps us choose appropriate statistical methods for the analysis.