Why use data vis
When you work with a new data source, especially a large one, visualizing the data helps you understand it better.
The data analysis process is usually done in five steps:
- Extract - Obtain the data from a spreadsheet, SQL, the web, etc.
- Clean - Here we could use exploratory visuals.
- Explore - Here we use exploratory visuals.
- Analyze - Here we might use either exploratory or explanatory visuals.
- Share - Here is where explanatory visuals live.
Types of data
To be able to choose an appropriate plot for a given measure, it is important to know what data you are dealing with.
Qualitative aka categorical types
Nominal qualitative data
Labels with no order or rank associated with the items themselves.
Examples: Gender, marital status, menu items
Ordinal qualitative data
Labels that have an order or ranking.
Examples: letter grades, rating
Quantitative aka numeric types
Discrete quantitative values
Numbers that cannot be split into smaller units.
Examples: Pages in a Book, number of trees in a park
Continuous quantitative values
Numbers that can be split into smaller units.
Examples: Height, Age, Income, Work hours
Summary Statistics
Numerical Data
Mean: The average value.
Median: The middle value when the data is sorted.
Mode: The most frequently occurring value.
Variance/Standard Deviation: Measures of spread or dispersion.
Range: Difference between the maximum and minimum values.
Categorical Data
Frequency: The count of occurrences of each category.
Mode: The most frequent category.
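As a minimal sketch, the statistics above can be computed with Python's standard library. The sample values and variable names (`pages`, `ratings`) are made up for illustration.

```python
import statistics
from collections import Counter

# Hypothetical numerical sample: number of pages in eight books.
pages = [120, 250, 250, 310, 95, 400, 250, 180]

mean = statistics.mean(pages)          # average value
median = statistics.median(pages)      # middle value of the sorted data
mode = statistics.mode(pages)          # most frequently occurring value
variance = statistics.variance(pages)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(pages)      # square root of the sample variance
value_range = max(pages) - min(pages)  # maximum minus minimum

# Hypothetical categorical sample: survey ratings.
ratings = ["good", "bad", "good", "ok", "good", "ok"]
frequency = Counter(ratings)                           # count per category
most_common_category = frequency.most_common(1)[0][0]  # mode of the categories

print(mean, median, mode, value_range)
print(dict(frequency), most_common_category)
```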
Visualizations
Visualizations let you get insights into a new data source quickly and make connections between variables easier to see.
Summary statistics alone (min, max, mean, median and mode) can be misleading. Anscombe's Quartet shows this: its four datasets have nearly identical means and standard deviations, but their distributions are very different.
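To make the Anscombe's Quartet point concrete, here is a small sketch that computes the mean and sample variance of the y values of the four datasets. The numbers are the commonly published values of the quartet; the variable names are my own.

```python
import statistics

# y values of the four Anscombe datasets (commonly published values).
anscombe_y = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

# Mean and sample variance per dataset, rounded for display.
summary = {
    name: (statistics.mean(ys), statistics.variance(ys))
    for name, ys in anscombe_y.items()
}

for name, (mean_y, var_y) in summary.items():
    print(f"{name}: mean={mean_y:.2f}, variance={var_y:.2f}")
```

The printed statistics are almost the same for all four datasets, so only a plot reveals how differently the points are distributed.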
In data visualization, we have two types:
- Exploratory data visualization: We use this to get insights about the data. It does not need to be visually appealing.
- Explanatory data visualization: These visualizations need to be accurate, insightful and visually appealing because they are presented to an audience.
Chart Junk, Data Ink Ratio and Design Integrity
Chart Junk
To let readers take in the information in a plot without distraction, avoid chart junk such as:
- Heavy grid lines
- Pictures in the visuals
- Shading
- 3D components
- Ornaments
- Superfluous texts
Data Ink Ratio
The less chart junk a visual contains, the higher its data-ink ratio, which is the share of the "ink" in the visual that is used to convey the message of the data. The higher that share, the better.
Design Integrity
The Lie Factor is calculated as:
$$
\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}}
$$
The size of an effect is a relative change: the difference between two values divided by the first value. The lie factor is therefore the relative change shown in the graphic divided by the actual relative change in the data. Ideally it should be 1. If it is not, there is a mismatch between the way the data is presented and the actual change.
In the example taken from the wiki, the lie factor is 3 when you compare the pixel size of each doctor figure, which represents the number of doctors in California.
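The calculation can be sketched in a few lines. The function names and the numbers below are hypothetical and are not the figures from the doctors example; they only show how a graphic that exaggerates a change yields a lie factor above 1.

```python
def relative_change(first: float, second: float) -> float:
    """Relative change from the first value to the second."""
    return (second - first) / first


def lie_factor(shown_first: float, shown_second: float,
               data_first: float, data_second: float) -> float:
    """Relative change drawn in the graphic divided by the one in the data."""
    return (relative_change(shown_first, shown_second)
            / relative_change(data_first, data_second))


# Hypothetical case: the data grows from 10 to 15 (+50%),
# but the drawn element grows from 20 px to 50 px (+150%).
factor = lie_factor(shown_first=20, shown_second=50,
                    data_first=10, data_second=15)
print(factor)
```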
Tidy data
Make sure your data is cleaned properly and ready to use. In tidy data:
- each variable is a column
- each observation is a row
- each type of observational unit is a table
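As an illustration of these rules, the sketch below reshapes a small "wide" table, where the years are spread across columns, into tidy form with pandas' `melt`. The table, its values and the column names are made up.

```python
import pandas as pd

# Hypothetical wide table: one column per year, so the variable "year"
# is hidden in the column headers.
wide = pd.DataFrame({
    "city": ["Berlin", "Madrid"],
    "2022": [10.1, 15.3],
    "2023": [10.9, 16.0],
})

# Tidy form: each variable (city, year, avg_temp_c) is a column and
# each observation (one city in one year) is a row.
tidy = wide.melt(id_vars="city", var_name="year", value_name="avg_temp_c")
print(tidy)
```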
Univariate Exploration of Data
This refers to the analysis of a single variable (or feature) in a dataset.
Bar Chart
- always start the value axis at 0 so that the bar lengths are directly comparable.
- sort nominal data by frequency.
- don't sort ordinal data by frequency: keep the inherent order of the categories, because how the counts are spread across that order matters more than which category is the most frequent.
- if you have a lot of categories, use a horizontal bar chart with the categories on the y-axis to make the labels easier to read.
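The following sketch applies these rules to nominal data with matplotlib: the categories are sorted by frequency, drawn as horizontal bars, and the value axis starts at 0. The `orders` list is invented sample data.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical nominal data: menu items ordered in a restaurant.
orders = ["pizza", "pasta", "pizza", "salad", "soup",
          "pizza", "pasta", "salad", "pizza"]

# Sort nominal categories by frequency, most frequent first.
counts = Counter(orders).most_common()
labels = [label for label, _ in counts]
values = [value for _, value in counts]

fig, ax = plt.subplots()
ax.barh(labels, values)   # horizontal bars: categories on the y-axis
ax.invert_yaxis()         # put the most frequent category at the top
ax.set_xlim(left=0)       # the value axis starts at 0
ax.set_xlabel("Number of orders")
```

For ordinal data you would skip the sorting step and pass the categories in their natural order instead.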
Histogram
- the quantitative version of a bar chart, used to plot numeric values.
- values are grouped into continuous bins, and one bar is plotted for each bin.
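A minimal matplotlib sketch of this binning, assuming made-up height values and hand-picked bin edges of 10 cm:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical heights in cm.
heights_cm = [152, 158, 161, 163, 167, 168, 171, 172,
              174, 175, 178, 181, 183, 186, 192]

# Explicit, contiguous bin edges: [150, 160), [160, 170), ..., [190, 200].
bin_edges = [150, 160, 170, 180, 190, 200]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(heights_cm, bins=bin_edges, edgecolor="white")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Count")

print(counts)  # number of values that fall into each bin
```

Choosing the bin edges yourself is worth the extra line: bins that are too wide hide the shape of the data, and bins that are too narrow make it noisy.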
KDE - Kernel Density Estimation
- a smooth estimate of the distribution: a kernel, often a Gaussian (normal) curve, is placed on every data point and the kernels are summed to estimate the density at each point.
- KDE plots can reveal trends and the shape of the distribution more clearly, especially for data that is not uniformly distributed.
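To show what happens under the hood, here is a hand-written sketch of a Gaussian KDE using only the standard library. The data and the bandwidth of 0.5 are assumptions chosen for illustration; plotting libraries usually pick the bandwidth for you.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x: float, data: list[float], bandwidth: float) -> float:
    """Density estimate at x: average of one kernel centred on each data point."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)


data = [1.0, 1.5, 2.0, 4.0, 4.2]  # hypothetical sample
bandwidth = 0.5                   # assumed smoothing parameter

# Evaluate the estimate on a grid from -5 to 12 in steps of 0.05.
step = 0.05
grid = [i * step for i in range(-100, 241)]
density = [kde(x, data, bandwidth) for x in grid]

# A density should integrate to roughly 1 (simple Riemann sum).
area = sum(density) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, a smaller one follows the individual points more closely.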
Pie Chart and Donut Plot
- data needs to be in relative frequencies
- pie charts work best with at most 3 slices. With more wedges the chart becomes hard to read and the amounts are hard to compare; a bar chart is then the better choice.
Bivariate Exploration of Data
Analyzes the relationship between two variables in a dataset.
Clustered Bar Charts
- displays the relationship between two categorical variables. The bars are organized in clusters based on the level of the first variable.
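One way to build such a chart, sketched with pandas and matplotlib: count every combination of the two variables with `pd.crosstab` and plot the result as bars. The column names and values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Hypothetical data with two categorical variables.
df = pd.DataFrame({
    "vehicle_class": ["compact", "compact", "compact",
                      "suv", "suv", "suv", "suv"],
    "transmission": ["manual", "manual", "automatic",
                     "automatic", "automatic", "manual", "automatic"],
})

# Rows: levels of the first variable, columns: levels of the second.
counts = pd.crosstab(df["vehicle_class"], df["transmission"])

# One cluster per vehicle class, one bar per transmission type.
ax = counts.plot(kind="bar")
ax.set_ylabel("Count")
print(counts)
```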
Scatterplots
- each data point is plotted individually as a point, its x-position corresponding to one feature value and its y-position corresponding to the second.
- if the plot suffers from overplotting (too many datapoints overlap): you can use transparency and jitter (every point is moved slightly from its true value)
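Both remedies can be sketched with matplotlib. The data is randomly generated with integer values so that many points overlap, and the jitter width of 0.2 is an arbitrary assumption.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

random.seed(0)

# Hypothetical discrete data: many points share the same coordinates.
x = [random.randint(1, 5) for _ in range(300)]
y = [xi + random.randint(0, 2) for xi in x]

# Jitter: move every point slightly away from its true value.
jitter = 0.2
x_jittered = [xi + random.uniform(-jitter, jitter) for xi in x]
y_jittered = [yi + random.uniform(-jitter, jitter) for yi in y]

fig, ax = plt.subplots()
# Transparency: overlapping points add up to darker regions.
ax.scatter(x_jittered, y_jittered, alpha=0.3)
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Jitter only changes where the points are drawn, so keep the original values for any calculation.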
Heatmaps
- the 2D version of a histogram.
- data points are placed with their x-position corresponding to one feature value and their y-position corresponding to the second.
- the plotting area is divided into a grid, the points that fall into each cell are counted, and the counts are indicated by color.
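A sketch of that grid counting with NumPy and matplotlib, using randomly generated data and an assumed 10 x 10 grid:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: two correlated numeric features.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = x + rng.normal(0, 0.5, 1000)

# Divide the plotting area into a 10 x 10 grid and count points per cell.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

fig, ax = plt.subplots()
# histogram2d puts x on the first axis, pcolormesh expects rows = y.
mesh = ax.pcolormesh(x_edges, y_edges, counts.T, cmap="viridis")
fig.colorbar(mesh, ax=ax, label="Count")
ax.set_xlabel("x")
ax.set_ylabel("y")
```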
Violin plots
- show the relationship between quantitative (numerical) and qualitative (categorical) variables on a lower level of abstraction.
- the distribution is plotted like a kernel density estimate, so the shape of the distribution on each categorical level is clearly visible.
- to display the key statistics at the same time, you can embed a box plot in a violin plot.
Box plots
- it also plots the relationship between quantitative (numerical) and qualitative (categorical) variables on a lower level of abstraction.
- compared to the violin plot, the box plot leans more on the summarization of the data, primarily just reporting a set of descriptive statistics for the numeric values on each categorical level.
- it visualizes the five-number summary of the data: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Key elements of a boxplot:
Box: The central part of the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). This contains the middle 50% of the data.
Median Line: Inside the box, a line represents the median (Q2, 50th percentile) of the dataset.
Whiskers: Lines extending from the box, known as "whiskers," show the range of the data that lies within 1.5 times the IQR from Q1 and Q3. They typically extend to the smallest and largest values within this range.
Outliers: Any data points that lie more than 1.5 times the IQR below Q1 or above Q3 are considered outliers and are often represented by individual dots or marks beyond the whiskers.
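The elements above can be computed by hand, which helps when reading a box plot. This sketch uses the standard library and a made-up sample; note that quartile conventions differ slightly between tools, and the `"inclusive"` method used here is just one of them.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = sorted([2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 30])

# Quartiles (one common convention; plotting libraries may differ slightly).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Everything outside the fences is drawn as an individual outlier.
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```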
Combined Violin and Box Plot
The violin plot shows the density across different categories, and the boxplot provides the summary statistics
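With plain matplotlib, one possible way to combine the two is to draw both plot types at the same positions on one axes. The two groups below are randomly generated, and the narrow box width of 0.15 is a styling assumption.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

random.seed(1)

# Hypothetical numeric values for two categories.
groups = {
    "A": [random.gauss(50, 5) for _ in range(100)],
    "B": [random.gauss(60, 10) for _ in range(100)],
}
values = list(groups.values())
positions = [1, 2]

fig, ax = plt.subplots()
# Violins show the density of each group.
violins = ax.violinplot(values, positions=positions, showextrema=False)
# Narrow boxes inside the violins show the summary statistics.
boxes = ax.boxplot(values, positions=positions, widths=0.15)

ax.set_xticks(positions)
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Value")
```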
Faceting
- the data is divided into disjoint subsets, most often by the levels of a categorical variable. For each of these subsets, the same plot type is rendered on the other variables, for example several histograms side by side, one per category level.
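A faceted histogram can be sketched with `plt.subplots`: one panel per category level, all sharing the same axes so that the panels stay comparable. The scores and class names are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical test scores, split by a categorical variable (the class).
scores = {
    "class A": [55, 62, 68, 71, 74, 78, 81, 85],
    "class B": [35, 42, 48, 51, 58, 63, 66, 90],
    "class C": [70, 72, 75, 79, 83, 88, 91, 95],
}
bins = list(range(0, 101, 10))  # the same bins in every panel

# One panel per subset; shared axes make the facets directly comparable.
fig, axes = plt.subplots(1, len(scores), sharex=True, sharey=True,
                         figsize=(9, 3))
for ax, (name, values) in zip(axes, scores.items()):
    ax.hist(values, bins=bins)
    ax.set_title(name)
    ax.set_xlabel("Score")
axes[0].set_ylabel("Count")
```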
Line plot
- used to plot the trend of one numeric variable against a second variable.
Quantile-Quantile (Q-Q) plot
- is a type of plot used to compare the distribution of a dataset with a theoretical distribution (like a normal distribution) or to compare two datasets to check if they follow the same distribution.
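The points of a normal Q-Q plot can be computed with the standard library, as in this sketch. The sample is made up, and the plotting positions `(i - 0.5) / n` are one common convention among several.

```python
from statistics import NormalDist

# Hypothetical sample, sorted so that it lines up with the quantiles.
sample = sorted([4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.4, 7.0, 7.9])
n = len(sample)

# Plotting positions: the probability assigned to the i-th smallest value.
probabilities = [(i - 0.5) / n for i in range(1, n + 1)]

# Theoretical quantiles of the standard normal distribution.
standard_normal = NormalDist()
theoretical = [standard_normal.inv_cdf(p) for p in probabilities]

# Each pair is one point of the Q-Q plot: (theoretical, observed).
qq_pairs = list(zip(theoretical, sample))
for t, s in qq_pairs:
    print(f"{t:6.2f}  {s:4.1f}")
```

If the sample follows a normal distribution, these points lie close to a straight line when plotted.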
Swarm plot
- Similar to a scatterplot, each data point is plotted with a position according to its value on the two variables being plotted. Instead of randomly jittering points as in a normal scatterplot, points are placed as close to their actual value as possible without allowing any overlap.
Spider plot
- compares multiple variables across different categories on a radial grid. Also known as a radar chart.
Useful links
My sample notebook
Libs used for the sample plots:
- Matplotlib: a versatile library for visualizations, but it can take some code effort to put together common visualizations.
- Seaborn: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.
- pandas: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
Further reading:
- Anscombe's Quartet: Same stats for the data, but different distribution: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
- Chartjunk: https://en.wikipedia.org/wiki/Chartjunk
- Data Ink Ratio: https://infovis-wiki.net/wiki/Data-Ink_Ratio
- Lie factor: https://infovis-wiki.net/wiki/Lie_Factor
- Tidy data: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
- Colorblind-friendly visualizations: https://www.tableau.com/blog/examining-data-viz-rules-dont-use-red-green-together