Hey data science enthusiast 😄, you have probably worked with many datasets and are always curious to extract deep insights from them so that your model can be framed easily and work effectively. To get those deep insights we have to analyze the dataset in detail, and that is how EDA (Exploratory Data Analysis) came into the picture.
EDA stands for Exploratory Data Analysis. It refers to the process of studying and exploring datasets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. It’s all about getting to know your data intimately before you dive into any serious analysis or modeling.
The foremost goals of EDA are -:
1. **Understanding the Data’s Structure and Composition** -: EDA helps us grasp the basic layout of our dataset — its dimensions, variables, and overall structure. By familiarizing ourselves with the data’s anatomy, we lay the foundation for deeper analysis and exploration.
2. **Identifying Anomalies and Outliers** -: One of the key goals of EDA is to spot any irregularities or outliers hiding within the data. These outliers can skew our analysis and lead to erroneous conclusions. By identifying and addressing them early on, we ensure the integrity and reliability of our insights.
3. **Uncovering Patterns and Relationships** -: EDA is all about connecting the dots — uncovering hidden patterns, trends, and relationships within the data. Whether it’s a correlation between variables or a seasonal trend in sales figures, EDA helps us make sense of the data’s underlying structure and dynamics.
4. **Communicating Insights Effectively** -: Last but not least, EDA is about communicating our findings effectively. Whether it’s through visualizations, reports, or presentations, EDA empowers us to convey complex insights in a clear and compelling manner. By telling the story behind the data, we inspire action and drive meaningful change.
How is EDA performed?
To get started with EDA you first need a structured dataset, then you need to import the important libraries, and then you can inspect and visualize the data. So let’s break it down into steps.
Step 1 ->
Importing the important libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('dark_background')
pd.set_option('display.max_columns', 200)
- Pandas -> used for data manipulation and analysis.
- Numpy -> used for numerical computing. It provides support for large, multidimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
- Matplotlib -> used for producing various types of plots, charts, and other visualizations.
- Seaborn -> a data visualization library built on top of Matplotlib in Python.
Step 2 ->
Getting the dataset into a DataFrame.
df=pd.read_csv('/kaggle/input/youtube-2023-trending-videos/Youtube_2023_trending_videos.csv',lineterminator='\n')
Step 3 ->
Visualizing data.
Getting the first 5 rows of the DataFrame -:
df.head()
Getting complete information about the columns of the DataFrame -:
df.info()
Checking the dimensions of the data -:
df.shape
Retrieving summary statistics for the complete dataset -:
df.describe()
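Note that df.describe() summarizes only the numeric columns by default. If you also want a quick summary of the text columns (for example channelTitle), the include parameter can be passed as well; a minimal sketch:
# Summary of object (text) columns: count, unique values, top value and its frequency.
df.describe(include='object')
# Or summarize every column at once, numeric and categorical together.
df.describe(include='all')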
Step 4 ->
Cleaning data.
Dropping unnecessary columns to reduce the DataFrame’s dimensions -:
df=df.drop(['video_id','thumbnail_link','description','channelId','tags'],axis=1)
Checking for duplicate values and removing them -:
df.drop_duplicates(inplace=True)
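As a quick sanity check, you can also count how many fully duplicated rows are present; after drop_duplicates this should come out to zero:
# Count fully duplicated rows; expected to be 0 after drop_duplicates.
print('Duplicate rows:', df.duplicated().sum())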
Cleaning the data column-wise -:
So here we are either going to normalize columns or represent each value in a more structured and efficient way. For example, the publishedAt column stores the date and the time separated by a 'T', and we only want to keep the date part.
def handle_published_at(value):
    # Keep only the date part (everything before the 'T' in the timestamp).
    value = str(value).split('T')
    value = value[0]
    return value
df['publishedAt']=df['publishedAt'].apply(handle_published_at)
df['publishedAt']
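An alternative to writing a custom function is pandas’ built-in datetime handling, which parses the timestamps and keeps only the date component. This is just a sketch of the same idea:
# Parse the timestamps and keep only the date part.
df['publishedAt'] = pd.to_datetime(df['publishedAt']).dt.date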
Step 5 ->
Analyzing the data.
Univariate Analysis
Univariate analysis involves examining the distribution and characteristics of a single variable.
So let’s suppose we need to find how many times each channel has had its videos trending:
channelTitle=df['channelTitle'].value_counts(ascending=False)
channelTitle
So here channelTitle contains the list of all the channels and their frequency.
Now sometimes the number of distinct values can be very large, which makes univariate analysis of such a dataset hard to read, so here we manipulate the data to make the analysis easier.
channeltitle = channelTitle[channelTitle < 50]

def handle_channel(value):
    # 'value in channeltitle' checks the Series index, which holds the channel names.
    # Channels with fewer than 50 trending videos are grouped under 'Others'.
    if value in channeltitle:
        return 'Others'
    else:
        return value

df['channelTitle'] = df['channelTitle'].apply(handle_channel)
df['channelTitle'].value_counts()
So here what we have done is identify all the channels with frequency less than 50 and aggregate them into 'Others' using handle_channel().
Now we can easily plot a count plot for channelTitle.
plt.figure(figsize=(16, 10))
ax = sns.countplot(x='channelTitle', data=df[df.channelTitle != 'Others'])
plt.xticks(rotation=90)
This code creates a countplot using Seaborn, displaying the count of observations for each unique value in the ‘channelTitle’ column of the DataFrame df, excluding the category 'Others'. The plt.figure(figsize = (16,10)) line sets the size of the figure to 16 inches in width and 10 inches in height. The plt.xticks(rotation=90) line rotates the x-axis labels by 90 degrees for better readability.
Bivariate Analysis
Bivariate analysis involves analyzing the relationship between two variables. It aims to understand how the value of one variable changes with respect to the value of another. Bivariate analysis is essential for identifying patterns, trends, correlations, and dependencies between variables.
The analysis can be done between the following combinations (a small example follows this list) -:
Numerical V/S Numerical
Numerical V/S Categorical
Categorical V/S Categorical
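For instance, a Numerical V/S Numerical relationship can be explored with a scatter plot. The sketch below assumes the dataset has view_count and likes columns (check df.columns for the exact names in your copy of the data):
# Scatter plot of likes against view count (numerical vs numerical).
plt.figure(figsize=(10, 6))
sns.scatterplot(x='view_count', y='likes', data=df)
plt.show()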
Multivariate Analysis
Multivariate analysis involves analyzing the relationships between multiple variables simultaneously. It explores how changes in one variable are associated with changes in other variables. Multivariate analysis techniques are used to understand complex relationships, patterns, and interactions among variables in a dataset.
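A common way to look at many numerical variables together is a correlation heatmap. The sketch below simply selects the numeric columns and plots their pairwise correlations:
# Pairwise correlations between the numeric columns, shown as a heatmap.
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()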
We can do a lot more with this dataset, but that’s it for this post.
You can download the dataset from here 👉https://www.kaggle.com/datasets/nehagupta09/youtube-2023-trending-videos
You can see the complete EDA here 👉https://www.kaggle.com/code/akshatsharma0610/youtube-india-trending-videos-dataset-eda/notebook
I hope you have understood this article. Don’t forget to follow me. 😄