In this step-by-step tutorial, we're going to cover how to animate a meaningful visualization from a real world dataset with almost 700,000 data points.
If you would rather clone and follow this tutorial as a Jupyter Notebook, you can do so at this link
Otherwise, we'll be animating ~60 years of United States energy consumption data in only 10 lines of Python!
To prove it to you, here is the code from start to finish:
import pandas as pd
import plotly.express as px
# Importing the dataset
df = pd.read_csv("https://www.eia.gov/state/seds/sep_use/total/csv/use_all_btu.csv")
# Filter out aggregate United States and Washington DC rows
df = df[~df.isin(["US", "DC"])]
# Filter rows that contain total consumption per energy source
total_df = df[df["MSN"].str.match("[A-Z]{2}TC[A-Z]")]
# Melt from wide to long format
total_df = total_df.melt(id_vars=["Data_Status", "State", "MSN"], var_name="Year", value_name="BTU")
# Sum energy consumption per state per year
summed_df = total_df.groupby(["State", "Year"], as_index=False).sum()
# Animate the bar plot
fig = px.bar(summed_df, x="State", y="BTU", animation_frame="Year", range_y=(0, summed_df["BTU"].max()), color="State", title="United States total energy consumption (BTU)")
fig.update_xaxes(categoryorder="total ascending")
fig.show()
The .csv we will be working with is publicly available on the US Energy Information Administration government website.
Prerequisite packages:
And now... time to code!
Importing the dataset
Importing our dataset is relatively straightforward; we're able to pass the URL of a .csv to pandas.read_csv and it will import it into a pandas.DataFrame
.
df = pd.read_csv("https://www.eia.gov/state/seds/sep_use/total/csv/use_all_btu.csv")
Removing aggregate US and Washington DC rows
In this case, our dataset contains rows for both aggregate United States (US
) and Washington DC (DC
) data.
Since we are only concerned with energy consumption per state, we're going to filter the US
and DC
rows out.
df = df[~df.isin(["US", "DC"])]
Understanding the MSN column
Do NOT worry if you don't fully understand this next section! It has no bearing on the code, we're just going to develop some important domain knowledge before moving forward.
Looking through documentation provided on the website, we can see that the "MSN"
column stands for Mnemonic Series Names. These are codes that contain information on the type of energy source, the sector, and the unit.
Understanding MSN by example
In the context of our data, row 0
's MSN value is ABICB.
Let's break it down:
- AB: Aviation gasoline blending components
- IC: Industrial sector
- B: British thermal units (BTU)
Thus, that row corresponds to Aviation gasoline blending components consumed by the industrial sector in british thermal units (BTU).
When the 3rd and 4th characters of an MSN are TC, this means the row corresponds to the total energy consumption across all sectors (i.e. residential, industrial, etc.) for that resource.
Looking back at the Aviation gasoline blending components, an MSN of ABTCB would thus be
- AB: Aviation gasoline blending components
- TC: Total of all sectors
- B: British thermal units (BTU)
Filtering rows that contain total consumption per energy source
To get all rows that have TC as the 3rd and 4th characters (and thus the total), we can use a regular expression with the Series.str.match
method.
The pattern we will use is "[A-Z]{2}TC[A-Z]"
.
Let's break it down:
- [A-Z]{2}: The 1st and 2nd letter can be any uppercase letters
- TC: The 3rd and 4th letters must be TC
- [A-Z]: The 5th character can be any uppercase letter
total_df = df[df["MSN"].str.match("[A-Z]{2}TC[A-Z]")]
Melting years into a Year column
Now that we have the rows we want to visualize, we're going to unpivot our DataFrame
from wide to long format. This will massage the year columns into rows as two new columns: "Year"
and "BTU"
.
This allows us to select and filter our data much easier now as well as pass it through to plotly.express
which is expecting a long format DataFrame
.
total_df = total_df.melt(id_vars=["Data_Status", "State", "MSN"], var_name="Year", value_name="BTU")
Summing all energy sources together
Now that we have the total consumption per energy source, state, and year, we can sum them all together grouped by each state and each year.
This will leave us with the total energy consumption per state per year which is what we're looking to visualize.
summed_df = total_df.groupby(["State", "Year"], as_index=False).sum()
Creating the animated bar plot
Our data is now ready to be visualized! plotly.express.bar
has an animation_frame
argument which allows us to pass what column to animate our data on.
For this example, we will be animating one frame per "Year"
of data.
fig = px.bar(
summed_df,
x="State",
y="BTU",
color="State",
animation_frame="Year",
title="United States total energy consumption (BTU)",
range_y=(
0,
summed_df["BTU"].max()
),
).update_xaxes(categoryorder="total ascending")
fig.show()
Additional exercises
Great work! Now that we have this data, there is a ton of other insights you can find. I recommend messing around with different MSN codes, groupings, etc. and see what you can learn!
If you have any questions, feel free to message me or reach out in the comments below! 👇
Top comments (4)
Hi Chris! I use matplotlib and seaborn for data visulization. Recently I read introduction of plotly. It can animate the figure. Thank you for your example and introduction. I'd like to learn plotly and may discuss the use of plotly with you later haha.
Anytime! Plotly is a lot of fun to use - it's been a while since I've used it but I had a lot of fun building data dashboards with it back a year or two ago
This is great. Thanks for sharing!
Hey, thanks so much for checking it out! 😄