Introduction
According to Wikipedia, exploratory data analysis (EDA) is an approach to analyzing data that identifies general patterns in it. These patterns include outliers and features of the data that might be unexpected.
We use EDA to:
- Summarize the main characteristics of the data
- Gain a better understanding of the dataset
- Uncover relationships between variables and identify the variables that matter most for the problem we are trying to solve
The important EDA techniques that we will discuss are:
- Descriptive Statistics
- Grouping data with Group By to transform the dataset
- Correlation
- Advanced Correlation
1. Descriptive Statistics
This is a branch of statistics that involves summarizing, organizing, and presenting data meaningfully and concisely.
Before exploring our dataset, we must first import the libraries we will use and load the data. Mine was an Excel sheet, so I loaded it as shown below.
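As a minimal sketch of the loading step, assuming pandas is installed: the file name `weather_data.xlsx` is hypothetical, and the small stand-in DataFrame uses made-up values so the snippet runs without the file. Only the column names `Temp_C` and `weather_conditions` come from this post.

```python
import pandas as pd

# Hypothetical file name; replace it with the path to your own workbook.
# pd.read_excel also needs an Excel engine such as openpyxl installed.
# data = pd.read_excel("weather_data.xlsx")

# Stand-in DataFrame with made-up values, so the snippet runs without the file.
data = pd.DataFrame({
    "Temp_C": [-1.8, 0.5, 3.2, 7.9, 12.4, 15.1],
    "weather_conditions": ["Fog", "Fog", "Rain", "Clear", "Clear", "Clear"],
})

print(data.head())   # first rows of the dataset
print(data.dtypes)   # data type of each column
```

Checking `head()` and `dtypes` right after loading is a quick way to confirm that the columns were read with the types you expect.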
Before building models, we need to first understand the data we have. This is why we do descriptive statistics. The following functions are important as they help with the process:
**data.describe**
- Gives a statistical summary of all the numerical columns in the dataset. I added include='all' so that the summary also covers the categorical variables.
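Here is a small sketch of both calls. The DataFrame is a made-up stand-in (only the column names come from this post), so substitute your own data.

```python
import pandas as pd

# Made-up stand-in data; replace with your own DataFrame.
data = pd.DataFrame({
    "Temp_C": [-1.8, 0.5, 3.2, 7.9, 12.4, 15.1],
    "weather_conditions": ["Fog", "Fog", "Rain", "Clear", "Clear", "Clear"],
})

# Default: count, mean, std, min, quartiles and max for numeric columns only.
numeric_summary = data.describe()

# include="all" adds unique, top and freq for the non-numeric columns.
full_summary = data.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of a text column) show up as NaN.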
**value_counts**
- Counts how many times each category occurs in a categorical column. Categorical variables are those that can be divided into different groups and have discrete values. When you run it on a column, it shows the count for each category in your dataset.
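A short sketch of value_counts on a categorical column, again using made-up stand-in values under the column name `weather_conditions`:

```python
import pandas as pd

# Made-up stand-in data; replace with your own DataFrame.
data = pd.DataFrame({
    "weather_conditions": ["Fog", "Fog", "Rain", "Clear", "Clear", "Clear"],
})

# Number of rows per category, most frequent first.
counts = data["weather_conditions"].value_counts()

# normalize=True returns the share of each category instead of the raw count.
shares = data["weather_conditions"].value_counts(normalize=True)

print(counts)
print(shares)
```

The normalized version is handy when you want to compare category proportions between datasets of different sizes.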
Boxplots
A box plot is a great way to visualize numeric data, since it shows the spread of the data at a glance. A box plot shows a set of five descriptive statistics of the data: the minimum and maximum values (excluding the outliers), the median, and the first and third quartiles. Optionally, it can also show the mean value.
It is the right choice if you're interested only in these statistics, without digging into the real underlying data distribution. Here is how I generated my boxplot. You can substitute the details from your own dataset.
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'Temp_C' is the column for temperature
plt.figure(figsize=(8, 6))
sns.boxplot(y=data['Temp_C'])
plt.title('Boxplot of Temperature (°C)')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()
Scatter Plot
We use scatter plots to visualize the relationship between two continuous variables. Each value in the dataset is represented by a dot.
Let's draw a simple scatter plot where x represents the age of the car and y the speed of the car. (Remember to import all the libraries to be used; for this, you'll need Matplotlib.)
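import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]  # independent variable: age of the car
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # dependent variable: speed of the car
plt.scatter(x, y)
plt.xlabel("Age of the car")
plt.ylabel("Speed of the car")
plt.show()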
2. Group By
Group By works with categorical variables. It groups the data according to the categories and applies a function, such as the mean or the sum, to each group. It helps to aggregate data efficiently.
It makes the task of splitting the DataFrame on some criterion really easy and efficient.
In my dataset below, I needed the mean of all the elements in my dataset for each of the weather_conditions.
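A minimal sketch of that kind of grouping, assuming a DataFrame with a `weather_conditions` column and a numeric `Temp_C` column. The values are made up for illustration and are not my actual data.

```python
import pandas as pd

# Made-up stand-in data; replace with your own DataFrame.
data = pd.DataFrame({
    "Temp_C": [-1.8, 0.5, 3.2, 7.9, 12.4, 15.1],
    "weather_conditions": ["Fog", "Fog", "Rain", "Clear", "Clear", "Clear"],
})

# Split the rows by category, then take the mean temperature of each group.
mean_by_condition = data.groupby("weather_conditions")["Temp_C"].mean()

print(mean_by_condition)
```

Other aggregations such as `sum()`, `median()` or `count()` can be swapped in for `mean()` in the same way.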
Heatmap Plot
A heatmap is a table-style data visualization type where each numeric data point is depicted based on a selected color scale and according to the data point's magnitude within the dataset.
These plots illustrate potential hot and cold spots of the data that may require special attention.
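import matplotlib.pyplot as plt

# Assuming 'df_pivot' is a pivot table of numeric values built from your DataFrame
plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()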
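The heatmap code expects a pivot table named `df_pivot`. One way it might be built is to group by two categorical columns and pivot the result, as sketched here. The `month` column and all the values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Made-up stand-in data; 'month' is a hypothetical second categorical column.
data = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Jan", "Feb"],
    "weather_conditions": ["Fog", "Clear", "Fog", "Clear", "Fog", "Clear"],
    "Temp_C": [-2.0, 1.0, 0.0, 4.0, -4.0, 6.0],
})

# Mean temperature for every (month, weather condition) pair.
grouped = data.groupby(["month", "weather_conditions"], as_index=False)["Temp_C"].mean()

# Reshape: one row per month, one column per weather condition.
df_pivot = grouped.pivot(index="month", columns="weather_conditions", values="Temp_C")

print(df_pivot)
```

Each cell of `df_pivot` then becomes one colored cell in the heatmap.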
3. Correlation
Correlation is a statistical metric that measures the extent to which different variables are interdependent. We use it to check how strongly the variables in our dataset move together.
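As a quick sketch, pandas can compute the pairwise correlation of all numeric columns with `DataFrame.corr()`. The example reuses the car age and speed values from the scatter plot; the column names `age` and `speed` are my own labels for them.

```python
import pandas as pd

# Car age and speed values from the scatter plot example.
cars = pd.DataFrame({
    "age": [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6],
    "speed": [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86],
})

# Pairwise correlation matrix; the default method is Pearson.
corr_matrix = cars.corr()

print(corr_matrix)
```

The diagonal is always 1, since every variable is perfectly correlated with itself, and the matrix is symmetric.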
4. Advanced Correlation
We can measure the strength of the correlation between continuous numerical variables by using the Pearson correlation.
The Pearson correlation method gives you two values:
- Correlation coefficient: a value close to 1 shows a large positive correlation, a value close to -1 implies a large negative correlation, and a value close to 0 implies no linear correlation between the variables.
- P-value: tells us how certain we are about the correlation that we calculated. More precisely, it is the probability of seeing a correlation at least this strong if the variables were in fact uncorrelated. As a rule of thumb, a p-value less than 0.001 gives us strong certainty about the correlation coefficient that we calculated, a value between 0.001 and 0.05 gives us moderate certainty, a value between 0.05 and 0.1 gives us weak certainty, and a p-value larger than 0.1 gives us no certainty of correlation at all.
We can say that there is a strong correlation when the correlation coefficient is close to 1 or -1 and the p-value is less than 0.001.
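A minimal sketch of getting both values, assuming SciPy is installed: `scipy.stats.pearsonr` returns the correlation coefficient and the p-value together. The data is the car age and speed example from the scatter plot section.

```python
from scipy import stats

# Car age (x) and speed (y) values from the scatter plot example.
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = stats.pearsonr(x, y)

print("Pearson correlation coefficient:", r)
print("p-value:", p_value)
```

Reading the two numbers together is what matters: the coefficient tells you the direction and strength of the relationship, and the p-value tells you how much to trust it.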