The notebooks of my Data Visualization Series are here:
Let's talk about time series.
Anyone interested in data visualization should know and understand time series, but what on earth is a time series?
In a nutshell, a time series chart is any chart that shows a trend over time, and it is often a line chart.
Here's an example of a time series:
Time series charts in Python are usually built with Matplotlib, like the one shown above; as you can see, it's a combination of a line chart and a scatter plot.
Part 1: When should you use charts and time series?
Let's suppose you are a marketing manager for an online store.
You started selling some kind of popular product recently, and you want to see what kind of customers are buying this product.
So you started analyzing the sales data, and you found this piece of data:
When you look at the sales volume on a particular Sunday, it turns out that there are more male buyers than female buyers: about 450 units sold to male customers versus about 300 to female customers.
You might conclude that this product is more popular with males than females. With this information in mind, you might then start targeting male customers in your marketing strategy. But here's the question: "Using this graph alone, can you actually conclude that this product is more popular with male customers than female customers?"
Short answer: NOT NECESSARILY.
First of all, there are only about 800 units sold in total here, meaning the sample size is quite small. And even if the difference is statistically significant, it's possible that male customers buy this product more than female customers only on Sundays.
In order to make an analysis much more robust, one approach is to plot a line chart over time, and make a time series for male and female customers.
After doing that, you might see a chart like this:
After seeing a chart like this, you can be more confident in your conclusion that male customers buy this product more than female customers, because the difference is consistent over time.
However, it is possible that you ended up with a chart like this:
Then you wouldn't be able to draw the same conclusion anymore.
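As a rough sketch of this idea, the snippet below plots one line per customer group and checks whether the gap holds in every period. The weekly figures, the column names, and the `consistent` variable are all made up for illustration; they are not the store's real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up weekly sales figures, for illustration only.
sales = pd.DataFrame(
    {
        "week": [1, 2, 3, 4, 5, 6],
        "male": [450, 430, 470, 440, 460, 455],
        "female": [300, 320, 290, 310, 305, 315],
    }
)

# One line per customer group on the same axes.
fig, ax = plt.subplots()
ax.plot(sales["week"], sales["male"], marker="o", label="Male")
ax.plot(sales["week"], sales["female"], marker="o", label="Female")
ax.set_xlabel("Week")
ax.set_ylabel("Units sold")
ax.set_title("Units sold by customer group (made-up data)")
ax.legend()

# True only if male sales exceed female sales in every week shown.
consistent = bool((sales["male"] > sales["female"]).all())
print(consistent)
```

If the two lines crossed each other from week to week, `consistent` would be `False`, which corresponds to the second kind of chart described above.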
Summarizing the main reasons why time series and line charts are so useful:
They are a consistent way to examine a trend over time.
If you have a particular hypothesis that you want to test, or an experiment that you're running, time series and line charts allow you to test it under a variety of conditions.
They make your analysis much more statistically robust and reduce misinterpretation of your data.
Part 2: Creating Line Charts with Matplotlib
I am going to compare the GDP Per Capita growth in the US and China, using the same 'dataii.csv' file.
Since I want to compare the trend of GDP Per Capita over time, I'm going to create a time series with a line chart.
The important columns for this exercise are:
Also, I'm going to use the iloc syntax to select an item in a Pandas Series, and I'll be multiplying and dividing it by a scalar.
First, let's load the data into a Pandas DataFrame:
Time to examine how the GDP Per Capita in the US has grown over time:
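A minimal sketch of this step could look like the following. Since the contents of 'dataii.csv' are not reproduced here, the snippet reads a tiny made-up stand-in from a string; the 'country' column name is an assumption, while year, gdpPerCapita and lifeExpectancy are the names used in this series.

```python
import io

import pandas as pd

# Stand-in for 'dataii.csv': made-up rows, for illustration only.
# The 'country' column name is an assumption.
csv_text = """country,year,gdpPerCapita,lifeExpectancy
United States,1997,30000,76.0
United States,2007,40000,78.0
China,1997,2000,70.0
China,2007,5000,73.0
"""

# With the real file this would be: data = pd.read_csv('dataii.csv')
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```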
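One way to sketch this is to filter the rows for the US and plot GDP Per Capita against year. The values below are made up, and the 'country' column name and the 'United States' label are assumptions about how the dataset is laid out.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only; 'country' is an assumed column name.
data = pd.DataFrame(
    {
        "country": ["United States"] * 3 + ["China"] * 3,
        "year": [1987, 1997, 2007] * 2,
        "gdpPerCapita": [10000, 20000, 40000, 500, 1000, 5000],
    }
)

# Keep only the US rows, then plot GDP per capita against year.
us = data[data["country"] == "United States"]

plt.figure()
plt.plot(us["year"], us["gdpPerCapita"])
plt.title("GDP Per Capita in the US (made-up data)")
plt.xlabel("Year")
plt.ylabel("GDP Per Capita")
```

Outside a notebook, a call to `plt.show()` at the end displays the figure.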
Now I'll grab the data for China, compare it with the US data, and plot it:
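A possible sketch of this comparison, again with made-up values and an assumed 'country' column, selects each country separately and draws both lines on one figure:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only; 'country' is an assumed column name.
data = pd.DataFrame(
    {
        "country": ["United States"] * 3 + ["China"] * 3,
        "year": [1987, 1997, 2007] * 2,
        "gdpPerCapita": [10000, 20000, 40000, 500, 1000, 5000],
    }
)

us = data[data["country"] == "United States"]
china = data[data["country"] == "China"]

# Both countries share the same axes, using the raw GDP per capita values.
plt.figure()
plt.plot(us["year"], us["gdpPerCapita"], label="United States")
plt.plot(china["year"], china["gdpPerCapita"], label="China")
plt.xlabel("Year")
plt.ylabel("GDP Per Capita")
plt.legend()
```

When the raw values differ this much in scale, the smaller series looks almost flat, which is one reason to compare growth rather than the values themselves.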
Now, for comparing the growth itself:
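One common way to compare growth, sketched here as an assumption about what the calculation looks like, is to express each year's value as a percentage of the first year's value. This is where `iloc` and the scalar arithmetic come in: `iloc[0]` selects the first item by position, regardless of the index labels. The numbers are made up.

```python
import pandas as pd

# Made-up GDP per capita values indexed by year, for illustration only.
us_gdp = pd.Series([10000, 20000, 40000], index=[1987, 1997, 2007])
china_gdp = pd.Series([500, 1000, 5000], index=[1987, 1997, 2007])

# Divide by the first value (a scalar picked with iloc), then multiply by 100
# so that the first year equals 100 for both countries.
us_growth = us_gdp / us_gdp.iloc[0] * 100
china_growth = china_gdp / china_gdp.iloc[0] * 100

print(us_growth.tolist())     # [100.0, 200.0, 400.0]
print(china_growth.tolist())  # [100.0, 200.0, 1000.0]
```

Because the index here holds years, `us_gdp[0]` would look for a label `0` and fail, whereas `us_gdp.iloc[0]` always returns the first element.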
Now, to plot the final chart, I'm going to call the 'plt.plot()' function twice, putting US and China growth on the same graph instead of the raw GDP Per Capita values:
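A hedged sketch of that final plot is shown below. The GDP values are made up, and the growth series are assumed to be each country's GDP Per Capita expressed as a percentage of its first year.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
years = [1987, 1997, 2007]
us_gdp = pd.Series([10000, 20000, 40000])
china_gdp = pd.Series([500, 1000, 5000])

# Growth relative to the first year (first year = 100).
us_growth = us_gdp / us_gdp.iloc[0] * 100
china_growth = china_gdp / china_gdp.iloc[0] * 100

# Two calls to plt.plot() draw both series on the same figure.
plt.figure()
plt.plot(years, us_growth, label="United States")
plt.plot(years, china_growth, label="China")
plt.title("GDP Per Capita growth (first year = 100, made-up data)")
plt.xlabel("Year")
plt.ylabel("GDP Per Capita growth")
plt.legend()
```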
And the final graph is the following:
Part 3: When to use Scatter Plots
In a nutshell, scatter plots provide a convenient way to visualize how two numeric variables are related in your data.
Here's a glimpse of a scatter plot that shows how weight and height are related across a hundred people:
Part 4: Creating Scatter Plots with Matplotlib
We're going to examine how to create scatter plots with Matplotlib.
Suppose, as an example, you need to find out how GDP Per Capita and life expectancy are related to each other in different countries.
To do this with the 'dataii.csv' file, our countries dataset, the columns we'll need are lifeExpectancy and gdpPerCapita, as well as year, so we can find the relationship between life expectancy and GDP Per Capita for any given year.
For this part I'm going to be importing the NumPy library, since I'm going to need the 'log10()' function:
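As a small illustration of what `np.log10()` does, the sketch below applies it to a few made-up GDP Per Capita values. Taking the base-10 logarithm compresses values that span several orders of magnitude, which is presumably why it is useful for this kind of data.

```python
import numpy as np

# Made-up GDP per capita values spanning several orders of magnitude.
gdp_per_capita = np.array([100.0, 1000.0, 10000.0])

# log10 maps each power of ten to its exponent.
log_gdp = np.log10(gdp_per_capita)
print(log_gdp)  # [2. 3. 4.]
```

`np.log10()` also accepts a Pandas Series, so it can be applied directly to a DataFrame column.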
Let's first examine how GDP Per Capita and life expectancy are related in 2007:
To create a scatter plot of gdpPerCapita against lifeExpectancy from the 'data2007' data with plt, just type:
And the resulting plot is the following:
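A minimal sketch of this call is given below. It assumes that 'data2007' is the subset of rows where year equals 2007; the rows and the 'country' column are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows, for illustration only; 'country' is an assumed column name.
data = pd.DataFrame(
    {
        "country": ["A", "B", "C", "A", "B", "C"],
        "year": [2002, 2002, 2002, 2007, 2007, 2007],
        "gdpPerCapita": [900, 9000, 30000, 1000, 10000, 35000],
        "lifeExpectancy": [54.0, 70.0, 78.0, 55.0, 71.0, 79.0],
    }
)

# Assumed definition of data2007: only the rows for the year 2007.
data2007 = data[data["year"] == 2007]

# x-axis: GDP per capita, y-axis: life expectancy.
plt.figure()
plt.scatter(data2007["gdpPerCapita"], data2007["lifeExpectancy"])
plt.xlabel("GDP Per Capita")
plt.ylabel("Life expectancy")
plt.title("2007 (made-up data)")
```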