Overview
Imagine you are handed a treasure chest of data, vast and brimming with untold stories. How would you unlock its secrets? The key lies in the art of Exploratory Data Analysis (EDA) and the magic of data visualization. In this article, you and I will set out on a journey together, packing your backpack with the tools to not only open that chest but also uncover the treasures hidden within: insights that can revolutionize your decision-making. Welcome to EDA and the captivating realm of data visualization.
What is Exploratory Data Analysis?
Exploratory Data Analysis, as defined by Prasad Patil in his article on Towards Data Science, refers to 'the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations' (Patil, 2018, https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15). More specifically, it is a technique for investigating data and summarizing its most prominent insights using various statistical and visualization techniques. Quite simply, it is all about examining the data before making any assumptions or drawing any conclusions.
Why do we use Data Visualization in Exploratory Data Analysis?
Raw data can appear complex and perplexing to the average observer. Exploratory Data Analysis (EDA) aims to unravel this complexity and communicate insights effectively. Data visualization is a crucial component of EDA: it simplifies data in a way that enhances comprehension, so that decision-makers within organizations can quickly discern trends and make informed choices. In essence, the purpose of data visualization is to distill information into a compelling narrative about the data of interest. It also highlights anomalies in the data, such as outliers, and helps decision-makers make sense of them.
Common Data Visualization Techniques
There are a myriad of options for analyzing data visually, so it is imperative to choose the right chart type for the data and for the message to be conveyed. A good starting point is to determine whether the data is categorical or continuous. Categorical data is efficiently visualized using bar charts, stacked bar charts, grouped bar charts, pie charts, and doughnut charts. Continuous data, on the other hand, is efficiently visualized using line charts, area charts, boxplots, and scatter plots, to name a few.
A real-world example of using the right data visualization techniques could be:
If we wanted to visualize the five (5) different types of crimes observed in Country X within a dataset, we could use a pie chart, as this is categorical data.
If we wanted to visualize the relationship between two continuous variables, like the diameter of several people in cm and their respective heights in cm, we could use a scatter plot, as this is continuous data.
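As a rough illustration of matching the chart type to the data type, the sketch below draws a bar chart for a categorical column and a scatter plot for two continuous columns. All values are made up for illustration (the crime types and body measurements are hypothetical), and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, invented purely for illustration
crimes = pd.Series(["Robbery", "Larceny", "Robbery", "Assault", "Larceny", "Robbery"])
people = pd.DataFrame({
    "diameter_cm": [30.0, 34.5, 28.0, 36.0],
    "height_cm": [160.0, 175.0, 155.0, 182.0],
})

fig, (ax_cat, ax_cont) = plt.subplots(1, 2, figsize=(10, 4))

# Categorical data: count each category, then draw one bar per category
crime_counts = crimes.value_counts()
crime_counts.plot(kind="bar", ax=ax_cat, title="Crimes by type")

# Continuous data: one point per person
people.plot(kind="scatter", x="diameter_cm", y="height_cm",
            ax=ax_cont, title="Diameter vs. height")

fig.tight_layout()
# In an interactive session, call plt.show() here to display the figure.
```

Swapping `kind="bar"` for `kind="pie"` would give the pie chart described in the first example.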
Tools and Libraries for Data Visualization
There are various tools and libraries for visualizing data. Python, a programming language widely used in data science and data analysis, offers libraries such as Matplotlib and Seaborn, which can visualize both categorical and continuous data. Common standalone tools include Microsoft Excel, Tableau, and Power BI by Microsoft. Excel is a widely used spreadsheet application that can create simple visualizations with built-in chart types such as column charts, line charts, and pie charts. Tableau and Power BI are data visualization tools that connect to a variety of data sources and let the user build interactive visualizations. To get comfortable with these tools, you can experiment with them in your spare time, or take courses on a platform such as Udemy if you want to learn them more formally or get ahead in your career.
Step-by-Step Guide to EDA with Data Visualization
Here is a step-by-step guide to EDA with data visualization in Python, using a dataset of crimes in Jamaica.
First, we import the libraries needed for the EDA process:
# importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Next, we read the data to be visualized into a pandas DataFrame. The data can come as an Excel spreadsheet, a CSV file, or many other formats. In this case, we load an Excel file from Google Drive, which assumes the code is running in Google Colab.
# Mount Google Drive (this import is available only inside Google Colab)
from google.colab import drive
drive.mount('/content/gdrive')
crimes_data = pd.read_excel(r'/content/gdrive/My Drive/crimes_in_jamaica/Crimes_in_Jamaica.xlsx')
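Since the data may arrive as either a CSV file or an Excel spreadsheet, a small loader can pick the matching pandas reader from the file extension. This is only a sketch: `load_table` is a hypothetical helper name, not part of pandas, and reading Excel files additionally requires an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical helper: read a CSV or Excel file into a DataFrame."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)  # needs an Excel engine such as openpyxl
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

With this in place, `load_table('crimes.csv')` and `load_table('crimes.xlsx')` would both return a DataFrame (the file names here are placeholders).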
To view the first five (5) rows of the DataFrame, call head():
# View the data
crimes_data.head()
To view summary statistics for the numeric columns of the crime DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%), call describe():
crimes_data.describe()
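By default, describe() summarizes only numeric columns, so it helps to also check column types, missing values, and duplicate IDs before cleaning. The sketch below does this on a tiny stand-in DataFrame; the rows are invented for illustration, and only the column names are taken from the cleaning code in this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for crimes_data; the values are invented
sample = pd.DataFrame({
    "CRIMEID": [1, 2, 2, 3],
    "LOCATION": ["Kingston", None, None, "St. Andrew"],
    "NUMBER_OF_VICTIMS": [1.0, 2.0, 2.0, np.nan],
})

# Data type of each column
print(sample.dtypes)

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows whose CRIMEID repeats an earlier row
duplicate_ids = sample.duplicated(subset=["CRIMEID"]).sum()
print(f"Duplicate CRIMEID rows: {duplicate_ids}")
```

These counts tell us what the cleaning step needs to handle before any charts are drawn.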
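The next step is to clean the dataset: parse the dates into a consistent format, remove records dated in the future, drop duplicate rows, drop rows with missing values, and remove negative victim counts.
# Work on a copy so the raw data stays untouched
crimes_cleaned = crimes_data.copy()
# Parse the dates first, then drop records dated in the future
crimes_cleaned['DATE'] = pd.to_datetime(crimes_cleaned['DATE'], format='%Y/%m/%d')
today = pd.Timestamp.today().normalize()
crimes_cleaned = crimes_cleaned[crimes_cleaned['DATE'] <= today]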
column_name = 'CRIMEID'
# Counting the number of duplicates in the specified column
duplicates_count = crimes_cleaned.duplicated(subset=[column_name]).sum()
# Dropping duplicates based on a specified column
crimes_cleaned = crimes_cleaned.drop_duplicates(subset=[column_name])
print(f"Number of duplicates in {column_name}: {duplicates_count}")
print(f"Number of rows after dropping duplicates: {crimes_cleaned.shape[0]}")
# Removing all 10 rows with missing values in the 'NUMBER_OF_VICTIMS' column (Complete case analysis)
crimes_cleaned = crimes_cleaned.dropna(subset=['NUMBER_OF_VICTIMS'])
# Removing all 50 rows with missing values in the 'LOCATION' column (Complete case analysis)
crimes_cleaned = crimes_cleaned.dropna(subset=['LOCATION'])
# Remove the negative values for the Number of Victims in the crimes as they are inaccurate.
crimes_cleaned = crimes_cleaned[crimes_cleaned['NUMBER_OF_VICTIMS'] >= 0]
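To see the cleaning steps work end to end, they can be bundled into one function and run on a small table. This is a sketch under assumptions: `clean_crimes` is a hypothetical helper, the six rows are invented, the `today` argument is added only to make the result reproducible, and the final `reset_index` is an extra convenience that is not part of the steps above.

```python
import pandas as pd


def clean_crimes(df, today=None):
    """Hypothetical helper bundling the cleaning steps into one call."""
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    out = df.copy()
    # Parse dates, then drop records dated in the future
    out["DATE"] = pd.to_datetime(out["DATE"], format="%Y/%m/%d")
    out = out[out["DATE"] <= today]
    # Keep the first row for each CRIMEID
    out = out.drop_duplicates(subset=["CRIMEID"])
    # Complete case analysis on the two columns used later
    out = out.dropna(subset=["NUMBER_OF_VICTIMS", "LOCATION"])
    # Negative victim counts are inaccurate
    out = out[out["NUMBER_OF_VICTIMS"] >= 0]
    return out.reset_index(drop=True)


# Invented rows: one duplicate ID, one future date, one missing location,
# and one negative victim count
raw = pd.DataFrame({
    "CRIMEID": [1, 2, 2, 3, 4, 5],
    "DATE": ["2020/01/05", "2020/02/10", "2020/02/10",
             "2100/01/01", "2021/03/15", "2021/04/01"],
    "LOCATION": ["Kingston", "St. Andrew", "St. Andrew",
                 "Kingston", None, "Kingston"],
    "NUMBER_OF_VICTIMS": [1, 2, 2, 1, 3, -1],
})

cleaned = clean_crimes(raw, today="2024-01-01")
print(f"Rows before: {len(raw)}, rows after: {len(cleaned)}")
```

Comparing row counts before and after each step is a quick sanity check that the cleaning removed only what was intended.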
Now on to the data analysis...
To visualize the NUMBER_OF_VICTIMS column, we can create a boxplot with a single line of code:
crimes_cleaned[['NUMBER_OF_VICTIMS']].boxplot()
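By default, Matplotlib boxplots draw any point lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as an individual outlier marker. The sketch below computes those bounds by hand so the flagged values can be listed; the victim counts are invented for illustration.

```python
import pandas as pd

# Hypothetical victim counts, invented for illustration
victims = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 15], name="NUMBER_OF_VICTIMS")

# Quartiles and interquartile range
q1 = victims.quantile(0.25)
q3 = victims.quantile(0.75)
iqr = q3 - q1

# Points outside these bounds are the ones a default boxplot marks as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = victims[(victims < lower_bound) | (victims > upper_bound)]

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(outliers.tolist())
```

Listing the outliers this way makes it easier to decide whether they are data-entry errors or genuine extreme cases.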
Additionally, we can show how the crimes are distributed across locations using a pie chart.
crimes_cleaned['LOCATION'].value_counts().plot(kind="pie", autopct="%.2f")
plt.ylabel("LOCATION")
plt.show()
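The percentages printed on the pie slices can also be produced as a table, which is handy when there are many locations or when exact figures are needed. The sketch below uses value_counts with normalize=True on an invented LOCATION column; the location values are placeholders.

```python
import pandas as pd

# Hypothetical LOCATION values, invented for illustration
locations = pd.Series(
    ["Kingston", "Kingston", "St. Andrew", "St. James", "Kingston", "St. Andrew"],
    name="LOCATION",
)

# Absolute counts and percentage share of each location
counts = locations.value_counts()
shares = (locations.value_counts(normalize=True) * 100).round(2)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

The `percent` column corresponds to the labels that `autopct="%.2f"` places on the pie chart.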
To analyze the distribution of the number of victims in the cleaned dataset, a histogram can be used.
crimes_cleaned['NUMBER_OF_VICTIMS'].plot.hist(bins=32, edgecolor='k')
plt.show()
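A histogram splits the value range into equal-width bins and counts the observations in each one. As a minimal sketch, NumPy's histogram function returns those counts and bin edges directly; the victim counts are invented, and 7 bins are used instead of 32 only to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical victim counts, invented for illustration
victims = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 15], name="NUMBER_OF_VICTIMS")

# Equal-width bins between the minimum and maximum values
bin_counts, bin_edges = np.histogram(victims, bins=7)

# Print each bin's range alongside how many values fall in it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], bin_counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Empty bins between the bulk of the data and an isolated value are a quick hint that the distribution is skewed or contains outliers.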
Conclusion
In this journey through Exploratory Data Analysis (EDA) and the world of data visualization, we've explored the power of these essential tools. EDA, as defined by Prasad Patil, bridges the gap between raw data and valuable insights, helping us discover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and visualizations.
Data visualization is the heart of EDA, simplifying complex data and empowering decision-makers to understand trends and outliers. It transforms data into a compelling narrative.
Throughout our exploration, we've learned about common visualization techniques for different data types and explored tools like Matplotlib, Seaborn, Excel, Tableau, and Power BI.
In our step-by-step guide, we analyzed crime data in Jamaica, showcasing how EDA and visualization can bring data to life.
Remember, EDA and data visualization are more than tools; they're gateways to uncovering stories within data. Armed with these skills, you can revolutionize decision-making and embark on countless data-driven journeys.