Popular Data Science Plots and When to Use Them

hoganbyun
・6 min read

When working in Data Science, being able to investigate and answer questions is only half of your responsibilities. No matter how well you can manipulate data and code difficult techniques, your findings are no good if you can't communicate them clearly. To do so, you will probably end up using matplotlib for a lot of your plotting. In this blog post, I will show you some of the most common types of plots and the situations to use them in, along with some dos and don'ts that will help keep your plots easy to understand.

Line Plot

When to Use One
A line plot is probably the simplest plot out there. It plots points with x and y values on the chart and draws a line through them, connecting each point to the next. A common use for a line plot is visualizing time-series data, that is, displaying changes of some variable over time. Take a look at this example:

[Figure: line plot of speed (mph) over time (sec)]

We can clearly see that this plot is measuring the speed (mph) of some object, say a car, over time (sec). From the information that the plot conveys, we can see that the car accelerated early and eventually started to decelerate later on.

How to Code a Line Plot

import matplotlib.pyplot as plt
# Jupyter-only magic that renders plots inline; leave it out of a plain .py script
%matplotlib inline

x = [1,2,3,4,5,6]
y = [4,3,5,2,3,1]

plt.plot(x, y)
plt.xlabel("Week")
plt.ylabel("Pounds Lost")
plt.title("Client Pounds Lost During Training")

plt.show()

This example will yield the following graph:

[Figure: line plot of pounds lost by week]

Bar Plot

When to Use One
Bar plots, like line plots, can be used to track changes over time. Another use for bar plots is visualizing differences between groups. For example, here is a plot from my recent project:

[Figure: bar plot of average ROI by budget tier]

You can see that the x-axis represents budget tiers in increments of $1.5 million. The height of each bar is the average ROI of all movies that belong to that budget tier. In this case, a bar plot is especially useful because it clearly shows that the $6 million budget tier yields the highest ROI, on average. Bar plots are also useful for comparing a metric across categories that aren't numeric. An example would be comparing the number of award-winning movies from each movie studio.

How to Code a Bar Plot

import matplotlib.pyplot as plt
%matplotlib inline

x = ['ATL', 'BOS', 'DAL', 'MEM', 'SAC', 'WAS']
free_agents = [1, 3, 5, 4, 2, 6]

plt.bar(x, free_agents)
plt.xlabel("Team")
plt.ylabel("Free Agents Signed")
plt.title("Free Agents Signed in 2020")

plt.show()

This example will yield the following graph:

[Figure: bar plot of free agents signed by team]

Box (and Whisker) Plot

When to Use One
The Box and Whisker plot is an ideal choice when you want to convey information from a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), maximum). Here, the median is the middle value of a sorted sample. For example, in a list of [1,2,3,4,5], the median would be 3. If there is an even number of values, the median is the average of the two middle values. The first and third quartiles are the 25th and 75th percentiles, respectively. The Interquartile Range (IQR) is calculated as Q3 - Q1. On the plot, the whiskers do not always reach the true minimum and maximum: by default, matplotlib extends them to the most extreme data points that still fall between Q1 - 1.5*IQR and Q3 + 1.5*IQR, and draws anything beyond those limits as an individual outlier point. These plots are especially useful for displaying how skewed a sample is and for highlighting outliers. Consider the following example:

[Figure: horizontal box plot of Player A's points scored per game]

The box that you see indicates three values. The middle line is the median. Here, we can see that the median is between 20 and 30. The right border of the box is Q3 while the left is Q1. The ends of the "whiskers" connected to the box mark the lowest and highest values that are not outliers. We also see one outlier, the 55-point game where the player shot extremely well. The code for this example is below.

How to Code a Box and Whisker Plot

import matplotlib.pyplot as plt
%matplotlib inline

x = [22,25,15,33,31,27,18,19,22,37,55,16,24,25,26,25]

plt.boxplot(x, vert=False)
plt.xlabel("Points")
plt.title("Player A: Points Scored Per Game")

plt.show()
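If you want to check the numbers behind the box and whiskers, here is a minimal sketch that computes them by hand for the same sample. It assumes NumPy is installed and uses NumPy's default (linear interpolation) percentile method; the variable names are my own and are not part of matplotlib.

```python
import numpy as np

# Same points-per-game sample used in the box plot code above
points = [22, 25, 15, 33, 31, 27, 18, 19, 22, 37, 55, 16, 24, 25, 26, 25]

# Quartiles via linear interpolation (NumPy's default percentile method)
q1, median, q3 = np.percentile(points, [25, 50, 75])
iqr = q3 - q1

# Fences: any point beyond these limits counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = [p for p in points if lower_fence <= p <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [p for p in points if p < lower_fence or p > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

For this sample, the only value outside the fences is the 55-point game, which matches the single outlier on the plot.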

Scatter Plot

When to Use One
A scatter plot is used when you have numerical data that comes in pairs (e.g., Age vs. Running Speed). It plots each data point onto an x-y plane, giving the viewer a good picture of how the data is distributed. Scatter plots are particularly useful when trying to discern whether two variables may be related. Take a look at this example:

[Figure: scatter plot of age vs. max speed (mph)]

In this example, the scatter plot clearly shows that as age increases, max speed tends to go down. Each point represents a different person that was timed. The code for this is shown below.

How to Code a Scatter Plot

import matplotlib.pyplot as plt
%matplotlib inline

age = [18,20,20,24,25,26,29,33,31,32,36,44,44,46,48,55,57,63,64,67,66,62]
max_speed = [19,18,22,16,19,21,17,16,19,16,14,16,13,13,11,12,9,10,8,7,7,8]

plt.scatter(age, max_speed)
plt.xlabel("Age")
plt.ylabel("Max Speed (mph)")
plt.title("Age vs. Max Speed")

plt.show()

BONUS: Regression Plot

Lastly, the regression plot is sort of an extension of the scatter plot. It takes in each data point and calculates the line that "fits" the sample best. What this means is that it will display a line cutting through the data, indicating the approximate slope or "trend" of the sample. A linear fit also has an associated r-value (the correlation coefficient, between -1 and 1), which indicates how strongly and in which direction two variables are linearly correlated. The closer r is to 1 or -1, the more correlated the variables are; a value near 0 means little linear relationship. For example:

[Figure: scatter plot of age vs. max speed with a fitted regression line]

Here, we used the same scatter plot as an example. You can see that there is now a line crossing through the data. This line gives us a good estimate of what speed to expect for a certain age. For example, judging from the line, we can approximate that a 40-year-old will reach a max speed of just under 15 mph. The code is written below. In this case, we had to use Seaborn (a plotting library built on top of matplotlib) for its regression plot functionality.

How to Code a Regression Plot

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

age = [18,20,20,24,25,26,29,33,31,32,36,44,44,46,48,55,57,63,64,67,66,62]
max_speed = [19,18,22,16,19,21,17,16,19,16,14,16,13,13,11,12,9,10,8,7,7,8]

sns.regplot(x = age, y = max_speed)
plt.xlabel("Age")
plt.ylabel("Max Speed (mph)")
plt.title("Age vs. Max Speed")

plt.show()
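The plot shows the line but not the numbers behind it. One way to get the slope, the r-value, and a prediction is to fit the line yourself. The sketch below assumes NumPy is installed and swaps in np.polyfit and np.corrcoef as a stand-in; it is not a description of how Seaborn computes its line internally.

```python
import numpy as np

age = [18,20,20,24,25,26,29,33,31,32,36,44,44,46,48,55,57,63,64,67,66,62]
max_speed = [19,18,22,16,19,21,17,16,19,16,14,16,13,13,11,12,9,10,8,7,7,8]

# Least-squares straight line: max_speed ~ slope * age + intercept
slope, intercept = np.polyfit(age, max_speed, deg=1)

# Pearson correlation coefficient r, which always lies between -1 and 1
r = np.corrcoef(age, max_speed)[0, 1]

# Speed that the fitted line predicts for a 40-year-old
predicted_at_40 = slope * 40 + intercept

print(f"slope={slope:.3f}, intercept={intercept:.2f}, r={r:.3f}")
print(f"predicted max speed at age 40: {predicted_at_40:.1f} mph")
```

Because max speed falls as age rises, r comes out negative for this sample, and it is close to -1 because the points sit near the line.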

Tips When Plotting

Now that we have covered some commonly used plots in data science, let's go over a few tips to keep in mind.

  • Keep your visualizations from getting too "busy" by highlighting only the relevant information that you want the audience to see. In the case below, I've highlighted bars in green, blue, and red depending on what information I want to convey, as opposed to showing the graph with every bar being the same color.

[Figure: bar plot with selected bars highlighted in green, blue, and red]

  • Avoid pie charts, as they are often hard to read when the slices are very close in size. In these cases, bar charts are much preferable. Below is the same data represented in pie and bar format. Note that it is much easier to figure out what is larger and smaller in the bar graph.

[Two figures: the same data as a pie chart and as a bar chart]

  • Make sure that your graphs are scaled properly, avoiding excess "white-space" on the graph where possible. Take a look at the two examples below and the difference proper axis scaling makes.

[Two figures: the same data plotted with poor and with proper axis scaling]
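The highlighting and scaling tips can be sketched on the free agent data from the bar plot section. This is only one possible approach: the choice of which bar to highlight (the team with the most signings), the colors, and the y-axis limits are my own assumptions for illustration, and the Jupyter-only %matplotlib inline line is left out so the code also runs as a plain script.

```python
import matplotlib.pyplot as plt

teams = ['ATL', 'BOS', 'DAL', 'MEM', 'SAC', 'WAS']
free_agents = [1, 3, 5, 4, 2, 6]

# Gray for context, one accent color for the bar we want noticed
# (assumption for this sketch: the team that signed the most free agents)
top = max(free_agents)
colors = ['tab:green' if n == top else 'lightgray' for n in free_agents]

fig, ax = plt.subplots()
ax.bar(teams, free_agents, color=colors)

# Start the y-axis at zero and leave only a little headroom above the tallest bar
ax.set_ylim(0, top + 1)

ax.set_xlabel("Team")
ax.set_ylabel("Free Agents Signed")
ax.set_title("Free Agents Signed in 2020")

plt.show()
```

Setting the limits explicitly keeps the bars filling most of the plot area instead of floating under a large band of empty space.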

Summary

Now that you have had a rundown of some of the most commonly used plots in data science, along with some tips to make your graphs more digestible, you are ready to go out and plot your data into effective charts to show your findings!
