Once data has been collected and stored, it must be analysed to derive meaningful insights from it. This is where exploratory data analysis (EDA) comes into play. As the name suggests, we are 'exploring' the data, i.e. getting a general overview of it.
The data collected may be text, videos or images, and it is often stored in an unstructured manner. Rarely will you find data that is 100% clean, i.e. without any anomalies. Additionally, data may come in various formats such as Excel, CSV (comma-separated values), JSON and Parquet.
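Pandas provides a reader for each of these formats. As a small sketch (the file names and values here are made up for illustration), a tiny dataset can be written out and read back in two of the formats mentioned above:

```python
import pandas as pd

# A tiny illustrative dataset (hypothetical values)
df = pd.DataFrame({"name": ["Ann", "Ben"], "score": [88, 92]})

# Round-trip through two common formats; files land in the working directory
df.to_csv("scores.csv", index=False)
df.to_json("scores.json", orient="records")

from_csv = pd.read_csv("scores.csv")    # pd.read_excel / pd.read_parquet
from_json = pd.read_json("scores.json") # work the same way for those formats

print(from_csv.equals(df))  # the CSV round-trip preserved the data
```

Excel and Parquet support works the same way via `pd.read_excel()` and `pd.read_parquet()`, though those readers need an extra engine package (e.g. openpyxl or pyarrow) installed.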
In the world of data, EDA goes hand in hand with data manipulation and data cleaning. Practitioners in the industry emphasize the importance of cleaning data to remove 'junk', as anomalies can negatively impact results and predictions. Structured data, usually in tabular format, can be analysed using several techniques and tools (like Excel, Power BI, SQL), but we will focus on Python for this illustration.
EDA using Python
The Python programming language is one of the most widely used tools in EDA owing to its versatility, which allows it to be applied across multiple industries, be it finance, education, healthcare, mining or hospitality, among others.
Two libraries, namely Pandas and NumPy, are highly effective in this regard and work across the board (whether you are using Anaconda/Jupyter Notebook, Google Colab, or an IDE like Visual Studio Code).
Below are the common steps, with the code to run at each one, when performing EDA:
First, import the Python libraries necessary for manipulation/analysis:
import pandas as pd
import numpy as np
Secondly, load the dataset
df = pd.read_excel('File path')
Note: df is the conventional variable name for the result; pd.read_excel() is the function that converts the tabular data into a DataFrame (use pd.read_csv() for CSV files).
Once loaded, you can preview the data using the code:
df.head()
This will show the first 5 rows of the dataset.
Alternatively, you can simply run df, which will show a few rows from both the top and bottom of the dataset, as well as all the columns therein.
Thirdly, check the datatypes of all the columns using:
df.info()
Note: Datatypes include integers (whole numbers), floats (decimals) and objects (typically text/qualitative data).
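To make the mapping between values and datatypes concrete, here is a small sketch (column names and values are made up) showing what Pandas infers for each kind of column:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 40],                     # whole numbers -> int64
    "height_m": [1.7, 1.6, 1.8],             # decimals -> float64
    "city": ["Nairobi", "Lagos", "Accra"],   # text -> object
})

# df.dtypes lists one datatype per column;
# df.info() shows the same plus non-null counts and memory usage
print(df.dtypes)
```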
At this step, it's advisable to get summary statistics of the data using:
df.describe()
This will give you stats like the count, Mean, Standard Deviation, Maximum/Minimum values and the Quartiles. (The Mode is not included; you can get it separately with df.mode().)
Fourthly, identify whether null values exist in the dataset using:
df.isnull().sum()
Note: df.isnull() on its own returns a table of True/False values; chaining .sum() gives the count of missing values per column, which is easier to read.
This can then be followed by checking for duplicates (repeated rows), where .sum() again gives a count:
df.duplicated().sum()
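The null and duplicate checks above, plus two common fixes, can be sketched on a tiny made-up dataset:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":  ["Ann", "Ben", "Ben", "Cara"],
    "score": [88, 92, 92, np.nan],  # one missing value, one duplicate row
})

print(df.isnull().sum())      # 'score' has 1 missing value
print(df.duplicated().sum())  # 1 fully duplicated row

# Common fixes: drop duplicate rows, fill missing values (here, with the mean)
cleaned = df.drop_duplicates()
cleaned = cleaned.fillna({"score": cleaned["score"].mean()})
print(cleaned["score"].isnull().sum())  # 0 missing values remain
```

Whether to fill missing values or drop the rows entirely depends on the dataset; filling with the mean is just one common option.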
Other key aspects of EDA are checking how the various variables in a dataset relate with each other (Correlation) and their distribution.
Correlation can be positive or negative and ranges from -1 to 1. Its code is:
df.corr()
Note: A correlation figure close to 1 indicates a strong positive relationship, one close to -1 indicates a strong negative relationship, and one close to 0 indicates little or no linear relationship.
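A sketch with made-up columns that are perfectly positively and negatively related shows both ends of the range (the numeric_only flag tells newer versions of Pandas to skip any text columns):

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score":         [50, 60, 70, 80, 90],  # rises with hours  -> corr ~ 1.0
    "errors":        [10, 8, 6, 4, 2],      # falls with hours -> corr ~ -1.0
})

corr = df.corr(numeric_only=True)
print(corr.loc["hours_studied", "score"])
print(corr.loc["hours_studied", "errors"])
```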
Distribution checks look at how symmetrical or asymmetrical (skewed) the data is. Common distributions include the normal, binomial, Bernoulli and Poisson distributions.
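Pandas can quantify skewness directly with .skew(). As a sketch on made-up income figures, where one large outlier pulls the tail to the right:

```python
import pandas as pd

# Most values are small with one large outlier -> right (positive) skew
df = pd.DataFrame({"income": [30, 32, 35, 31, 200]})

print(df["income"].skew())  # positive value -> right-skewed
# A symmetric (e.g. normal-looking) sample would give a skew near 0
```

Plotting a histogram with df['income'].hist() is the usual visual companion to this number.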
In summary, exploratory data analysis is an important process for gaining a better understanding of the data, and it lays the groundwork for better visualizations and model building.