DEV Community

Nick Kimani


Exploratory Data Analysis using Data Visualization Techniques

Data is everywhere in this new age. However, data cannot make sense of itself, which is why we need exploratory data analysis (EDA).

EDA is the process of making sense of data. It involves analysing data to gain insights and to identify relationships and patterns. By performing EDA, an analyst can detect outliers and anomalies, view distributions of data (e.g. sales per region), detect trends, and come up with insights useful to stakeholders.

EDA relies on visualization techniques, which are implemented with a variety of tools.

Visualization Tools for EDA

Python - Python is a general-purpose programming language that supports object-oriented programming and is widely used in data analysis. Several Python libraries are dedicated to data visualization.
They include:

  • Matplotlib
  • Seaborn
  • Plotly

This is how to import them to your script or notebook:

Importing Libraries

Note: Plotly creates interactive visualizations, whereas the other two create static visualizations by default.

Visualization Software - These are applications fully dedicated to data visualization and analysis tasks.
They include:

  • Tableau
  • Power BI
  • Qlik
  • Plotly

Visualization Techniques for EDA

In this section we look at popular visualization techniques (plots and graphs) for performing EDA. The techniques discussed can generally be implemented in any of the tools mentioned above.

Before that, it is important to know the two broad types of data these techniques work with:

  1. Categorical data
  2. Continuous data

Bar Graphs

A bar graph visualizes categorical data using rectangular bars whose heights or lengths represent the size of each category. It suits scenarios such as comparing sales per city or per product.

Bar Graph
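
As a rough sketch, a sales-per-city bar graph could be drawn with Matplotlib as follows. The city names and sales figures are made up for illustration, and the `Agg` backend is selected only so the script also runs without a display (in a notebook that line can be dropped):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption: no display available
import matplotlib.pyplot as plt

# Hypothetical data: total sales per city (one bar per category).
sales_per_city = {"Nairobi": 120, "Mombasa": 85, "Kisumu": 60}

fig, ax = plt.subplots()
bars = ax.bar(list(sales_per_city.keys()), list(sales_per_city.values()))
ax.set_xlabel("City")
ax.set_ylabel("Sales")
ax.set_title("Sales per city")
```

Each bar's height equals the category's value, so the tallest bar immediately identifies the city with the highest sales.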

Pie Charts
Like the bar graph, a pie chart is used for comparing categorical data. It is a circle divided into slices, where each category's slice is sized in proportion to its share of the whole.

Pie Chart
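
The proportions behind a pie chart are simple arithmetic. The sketch below, using made-up product sales, computes each category's share of the total and the corresponding slice angle:

```python
# Hypothetical data: sales per product.
sales_per_product = {"Phones": 50, "Laptops": 30, "Tablets": 20}

total = sum(sales_per_product.values())

# Share of the whole for each category, and the slice angle out of 360 degrees.
shares = {name: value / total for name, value in sales_per_product.items()}
angles = {name: value * 360 / total for name, value in sales_per_product.items()}

for name in sales_per_product:
    print(f"{name}: {shares[name]:.0%} of sales, {angles[name]:.0f} degrees")
```

Plotting libraries do this calculation internally; for instance, Matplotlib's `plt.pie` accepts the raw values and sizes the slices itself.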

Histograms
A histogram is used for checking the distribution of continuous data such as sales or profits. It can be viewed as the equivalent of a bar graph for continuous data: it groups the data into ranges, called bins, and displays the count of records in each bin as a bar.

Histogram
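
To make the idea of bins concrete, here is a small sketch that counts records per bin by hand. The sample values, the bin width of 10 and the helper name `histogram_counts` are all assumptions chosen for illustration:

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall into each bin [lower, lower + bin_width)."""
    counts = {}
    for v in values:
        index = (v - start) // bin_width      # which bin the value belongs to
        lower = start + index * bin_width     # lower edge of that bin
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sales figures.
sales = [12, 15, 22, 27, 31, 35, 38, 41, 58]

counts = histogram_counts(sales, bin_width=10, start=10)
print(counts)  # {10: 2, 20: 2, 30: 3, 40: 1, 50: 1}
```

Each key is the lower edge of a bin and each value is the height of that bin's bar. In practice a call such as Matplotlib's `plt.hist(sales, bins=5)` performs the binning and the drawing in one step.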

Box and Whisker plots
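A box and whisker plot is also used to check the distribution of continuous data, and it is especially popular for detecting outliers. The 'box' displays the 1st quartile (Q1), the 2nd quartile (the median) and the 3rd quartile (Q3). Under the common convention, each whisker reaches out to the furthest record that lies within 1.5 × the interquartile range (IQR = Q3 - Q1) of the box, that is, no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR. Any records beyond the whiskers are considered to be outliers.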

Box and Whisker
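
The numbers behind a box plot can be worked out with Python's standard library. This sketch uses a made-up sample and the 1.5 × IQR rule; note that tools differ slightly in how they compute quartiles, and the `inclusive` method used here is just one common choice:

```python
from statistics import quantiles

# Hypothetical sample; 40 is deliberately far from the rest.
values = [2, 4, 5, 7, 8, 9, 11, 12, 40]

# Quartiles (Q1, median, Q3) and the interquartile range.
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a record counts as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme records still inside the limits.
inside = [v for v in values if lower_limit <= v <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(q1, median, q3)              # 5.0 8.0 11.0
print(lower_limit, upper_limit)    # -4.0 20.0
print(whisker_low, whisker_high)   # 2 12
print(outliers)                    # [40]
```

Matplotlib's `plt.boxplot(values)` applies the same 1.5 × IQR rule by default and draws the outliers as individual points.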

Scatterplots
Scatterplots are used in bivariate analysis to check whether two variables may be correlated. They are typically used with continuous data.

Scatterplot

For example, the figure on the left shows two variables that are likely to have a high positive correlation, whereas the figure on the right suggests little or no correlation between the variables.
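
A visual impression from a scatterplot can be backed up with a number. The sketch below computes the Pearson correlation coefficient by hand for two made-up pairs of variables; the helper name `pearson_r` and the data are assumptions for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]        # tends to rise with ad_spend
unrelated = [4, 1, 5, 1, 4]    # no linear relationship with ad_spend

print(round(pearson_r(ad_spend, sales), 2))      # 0.77
print(round(pearson_r(ad_spend, unrelated), 2))  # 0.0
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate a weak one. The points themselves can be drawn with, for example, `plt.scatter(ad_spend, sales)`.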

Wordcloud
A word cloud is a technique for visualizing text data. The size of each word is dictated by its frequency, with larger words signifying higher frequency. It is popular for analysing comments and for sentiment analysis.
For example, a telecom company might use one to find out what its customers think about its products.

Wordcloud
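
The frequencies that drive a word cloud can be computed in a few lines. This is a minimal sketch with made-up customer comments and a tiny, hand-picked stop-word list; real projects usually use a much fuller list:

```python
import re
from collections import Counter

# Hypothetical customer comments.
comments = [
    "The network is slow and the network drops calls",
    "Great data bundles, but the network is slow",
    "Customer care was helpful",
]

# Very small illustrative stop-word list (common words that carry no meaning here).
stopwords = {"the", "is", "and", "but", "was", "a"}

words = []
for comment in comments:
    # Lowercase the text and keep only runs of letters.
    for word in re.findall(r"[a-z]+", comment.lower()):
        if word not in stopwords:
            words.append(word)

frequencies = Counter(words)
print(frequencies.most_common(2))  # [('network', 3), ('slow', 2)]
```

A word cloud simply scales each word by its count, so "network" would be drawn largest. The third-party `wordcloud` package can render such a frequency table, for instance via `WordCloud().generate_from_frequencies(frequencies)`.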

Correlation Matrix
A correlation matrix is commonly shown as a heatmap: the correlation value between each pair of variables determines the colour assigned to that pair's cell. It is useful for finding patterns and relationships in large datasets, as it saves the time that would otherwise be spent inspecting each pair of variables individually.

Correlation Matrix

Conclusion
Exploratory data analysis (EDA) is an important part of making sense of data and extracting valuable, actionable insights from it. The good news is that there are many ways to perform EDA, from the tools to the techniques. It is important to understand the use cases of the various techniques, because applying them to unsuitable cases leads to errors in interpretation or illogical visualizations.
