In this tutorial, we will be creating a county-level geographic bubble map of the active COVID-19 cases in the United States. First of all, let us understand what a Bubble Map is!
What is a Bubble Map?
Bubble maps are a kind of geographic visualization that has its roots in bubble charts. In a bubble chart, the bubbles are plotted on a Cartesian plane; in a bubble map, they are plotted over geographic regions. The size of each bubble is proportional to the value of a particular variable for that area. This makes bubble maps a useful way to compare magnitudes across a geographic region.
Building a Bubble Map Using Plotly
Let us dive straight into the tutorial now. Throughout this tutorial, we will also do some basic exploratory data analysis and data cleaning.
1. Importing Libraries
The first step is to import the libraries we will need throughout this tutorial. We will be using pandas, the popular Python data analysis library, and Plotly, our data visualization library. From Plotly, we specifically need the graph_objects module.
import pandas as pd
import plotly.graph_objects as go
2. Loading Our Dataset
Next, we import our dataset and store it into a DataFrame. The dataset I am using is by Johns Hopkins University and can be found here. When this code was written, the dataset for the 6th March 2021 was the last dataset that included the active COVID-19 cases count. It seems like Johns Hopkins removed the active and recovered cases data for datasets after 6th March 2021.
df = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-06-2021.csv",
dtype={"FIPS": str})
df.head()
This is what our DataFrame looks like -
3. Exploratory Data Analysis and Data Cleaning
Now you can see that this DataFrame has data for other countries as well. Since we are focusing only on the United States data for this tutorial, let's filter to only the US data and update our DataFrame.
df = df[df.Country_Region == "US"]
df.head()
Now, our updated DataFrame looks like this -
Let us now explore the data further. First, let us find the length of the DataFrame. Since we are going to make a bubble map of the active COVID-19 cases in the US, let us also check the maximum and minimum values in the Active column. The Active column contains the data for the active COVID-19 cases.
len(df)
df.Active.max()
df.Active.min()
Depending on the dataset you are using, you will get different values for the above statements.
Wait, the minimum value is negative. How can active cases be a negative number? Surely there must be something wrong. Let us see which row in the DataFrame has this value. Furthermore, let us also check which other rows have an active cases value less than 0.
df[df.Active == df.Active.min()]
df[df.Active < 0]
Ah! You can see that some unassigned rows have these values. This data needs to be cleaned from our DataFrame as it would serve us no purpose. So, we will keep only the rows with a value greater than 0 in the Active column. Note that this also drops rows with exactly 0 active cases, which would not produce a visible bubble anyway. We will also take another look at the length of the DataFrame and the minimum and maximum values in the Active column.
df = df[df.Active > 0]
df.head()
len(df)
df.Active.max(), df.Active.min()
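To make the effect of the filter concrete, here is a small sketch on made-up data (the toy DataFrame below is an assumption for illustration and is not taken from the Johns Hopkins file). It compares the strict filter used in this tutorial with a variant that keeps zero-case rows.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Admin2": ["A", "B", "Unassigned", "C"],
    "Active": [120, 0, -35, 7],
})

# The filter used in this tutorial: drops negative AND zero values
strictly_positive = toy[toy.Active > 0]

# Alternative: drops only the negative values, keeps zero-case rows
non_negative = toy[toy.Active >= 0]

print(len(strictly_positive), len(non_negative))
```

Either choice works for the bubble map, since a row with 0 active cases would be drawn with a bubble of size 0.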
Let us check for missing values in other columns before moving ahead, specifically the Admin2, Lat, and Long_ columns. The Admin2 column specifies the county name. The Lat and Long_ columns specify the latitude and longitude values for these counties. These columns will feature heavily while we work on the code for the bubble map.
df.isna().sum()
So we get the number of missing values in each column of our DataFrame. Our Admin2 column has 5 missing values, while the Lat and Long_ columns have 36 missing values each. Let us drop the rows that have missing values in the Admin2, Lat, or Long_ columns. These rows would not serve any purpose while plotting our bubble map anyway. We will also verify that they have been removed.
df.dropna(subset=['Lat', 'Long_', 'Admin2'], inplace=True)
df.isna().sum()
Fantastic! Our three main columns - Admin2, Lat, and Long_ - do not have any missing values.
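As a quick sanity check of what the subset argument does, the sketch below uses a made-up DataFrame (the rows and the extra Recovered column values are assumptions for illustration). Only the listed columns are considered, so a row with a missing value elsewhere is kept.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Admin2": ["County A", None, "County C"],
    "Lat": [32.5, 33.0, None],
    "Long_": [-86.6, -87.0, -87.7],
    "Recovered": [None, 1.0, 2.0],
})

# Drop rows that lack a county name or coordinates;
# missing values in other columns (here Recovered) are ignored
cleaned = toy.dropna(subset=["Lat", "Long_", "Admin2"])

print(cleaned.isna().sum())
```

This version returns a new DataFrame instead of using inplace=True; both approaches give the same rows.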
4. Sorting And Rearranging Data
Next, let us sort our DataFrame in descending order of active cases. Since the sorting rearranges the indexes of the DataFrame, we will also reset the indexes of our newly sorted DataFrame.
df = df.sort_values(by=["Active"], ascending=False)
df.reset_index(drop=True, inplace=True)
df.head()
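The next step relies on the index running 0, 1, 2, ... in order of descending active cases, so it is worth seeing what the sort and reset do together. A minimal sketch on made-up data (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Admin2": ["A", "B", "C"], "Active": [5, 300, 40]})

# After sorting alone, the index labels would be 1, 2, 0
sorted_only = toy.sort_values(by="Active", ascending=False)

# Resetting gives a fresh 0..n-1 index that matches the row positions
ranked = sorted_only.reset_index(drop=True)

print(list(sorted_only.index), list(ranked.index))
```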
5. Setting Value Limit Intervals
We need to set some levels or limits to group the range of COVID-19 cases by specifying an upper bound and a lower bound of active COVID cases. For this, we create a list called stages. This stages list will be used for our bubble map's legend.
1-100 cases will be one range, 101-1000 cases will be another range, and so on.
After that, we will store the start and end row positions of the rows that fall in each of these ranges as a list of tuples called limits.
stages = ["400000+", "300001-400000", "200001-300000", "100001-200000", "50001-100000", "10001-50000",
"1001-10000", "101-1000", "1-100"]
# Create tuples of row indexes for the above ranges
tuple1 = (0, df[df.Active > 400000].index[-1]+1)
tuple2 = (tuple1[1], df[(df.Active > 300000) & (df.Active <=400000)].index[-1]+1)
tuple3 = (tuple2[1], df[(df.Active > 200000) & (df.Active <=300000)].index[-1]+1)
tuple4 = (tuple3[1], df[(df.Active > 100000) & (df.Active <=200000)].index[-1]+1)
tuple5 = (tuple4[1], df[(df.Active > 50000) & (df.Active <=100000)].index[-1]+1)
tuple6 = (tuple5[1], df[(df.Active > 10000) & (df.Active <=50000)].index[-1]+1)
tuple7 = (tuple6[1], df[(df.Active > 1000) & (df.Active <=10000)].index[-1]+1)
tuple8 = (tuple7[1], df[(df.Active > 100) & (df.Active <=1000)].index[-1]+1)
tuple9 = (tuple8[1], df[df.Active <=100].index[-1]+1)
limits = [tuple1, tuple2, tuple3, tuple4, tuple5, tuple6, tuple7, tuple8, tuple9]
limits
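So, tuple1 holds the start and end positions of all rows with more than 400,000 active cases. tuple2 holds the positions of all rows with more than 300,000 but at most 400,000 active cases. And so on. Because the DataFrame is sorted in descending order and its index has been reset, each range is a contiguous block of rows.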
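One caveat: the index[-1] lookups above assume that every range contains at least one row. If a range is empty for your dataset, that lookup raises an IndexError. As a rough sketch of a more forgiving alternative, the hypothetical helper slice_limits below (not part of the original code) walks a descending list of values and returns an empty slice such as (1, 1) for a range with no rows. The sample values are made up.

```python
def slice_limits(sorted_desc, lower_bounds):
    """Return one (start, end) position pair per range.

    sorted_desc: values sorted from largest to smallest.
    lower_bounds: exclusive lower bound of each range, largest first.
    """
    limits = []
    start = 0
    for bound in lower_bounds:
        end = start
        # Advance while the values still exceed this range's lower bound
        while end < len(sorted_desc) and sorted_desc[end] > bound:
            end += 1
        limits.append((start, end))
        start = end
    return limits

# Made-up values, already sorted in descending order
active = [450000, 90000, 60000, 800, 50, 3]
bounds = [400000, 300000, 200000, 100000, 50000, 10000, 1000, 100, 0]

limits = slice_limits(active, bounds)
print(limits)
```

With the real data, the first argument could be df.Active.tolist() after the sort in the previous step.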
6. Time to Plot our Bubble Map!
Besides sizing each bubble in proportion to the variable's value, it is also important to choose the right colour for it. Aesthetics make a lot of difference in data visualizations. We will set a list of colours. I chose shades of red from the following link - http://www.workwithcolor.com/red-color-hue-range-01.htm. Note that the number of colours should be equal to the number of tuples we have in the limits variable.
colors = ["#CC0000","#CE1620","#E34234","#CD5C5C","#FF0000", "#FF1C00", "#FF6961", "#F4C2C2", "#FFFAFA"]
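Since the legend labels and the colours are matched purely by position, a quick length check before plotting can catch a mismatch early. The helper check_lengths below is a hypothetical name used for illustration; it is not part of the original code.

```python
stages = ["400000+", "300001-400000", "200001-300000", "100001-200000",
          "50001-100000", "10001-50000", "1001-10000", "101-1000", "1-100"]
colors = ["#CC0000", "#CE1620", "#E34234", "#CD5C5C", "#FF0000",
          "#FF1C00", "#FF6961", "#F4C2C2", "#FFFAFA"]

def check_lengths(**named_lists):
    """Raise ValueError if the given lists differ in length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch: {lengths}")
    return True

check_lengths(stages=stages, colors=colors)

# Pair each legend label with its colour by position
legend = list(zip(stages, colors))
print(legend[0])
```

The limits list from the previous step could be passed to the same check as a third keyword argument.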
Note that if you are using a Jupyter notebook, the code below should go in a single cell. I have split it up in this blog post to make it easier to explain.
fig = go.Figure()
stage_counter = 0
for i in range(len(limits)):
    lim = limits[i]
    df_sub = df[lim[0]:lim[1]]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['Long_'],
        lat = df_sub['Lat'],
        text = df_sub['Admin2'],
        marker = dict(
            size = df_sub['Active']*0.002,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{}'.format(stages[stage_counter])))
    stage_counter = stage_counter+1
Okay, here starts the complex part.
First, we set our stage_counter (the variable that tracks which stage we are on) to 0.
Next comes the for loop, which runs 9 times, once for every tuple in the limits variable. During each iteration, we extract a slice of our DataFrame into df_sub. The new DataFrame df_sub contains the rows whose positions fall in the range specified by that tuple. During our first iteration, df_sub will contain the rows with indexes 0, 1, 2 and 3. In the same iteration, we plot the bubbles for those rows using the latitude and longitude values given for each county in the Lat and Long_ columns. We set the 'text' parameter to the county's name (the value in the Admin2 column) so that once the visualization is ready, we can hover over a bubble to see the name of the county. Next, we make the size of the bubble proportional to the active COVID-19 cases by multiplying the value in the Active column by 0.002; because sizemode is 'area', it is the bubble's area that scales with this value. You may use a different multiplier. This one seemed apt to me for my visualization. We also specify the colour of the bubble. The 'name' parameter specifies the trace name, which appears as the legend item and on hover. For the first iteration, this value will be the first item in the stages list, i.e., "400000+". Finally, before we move to the next iteration, we increment stage_counter by 1.
If you are confused by the parameters in the above code snippet, check out this documentation.
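The 0.002 multiplier was chosen by eye. As an alternative sketch, Plotly's bubble chart documentation suggests deriving the marker's sizeref attribute from the largest value and a desired maximum marker size when sizemode is 'area'. The helper name area_sizeref, the 40-pixel target, and the sample values below are assumptions for illustration and are not part of the original code.

```python
def area_sizeref(sizes, max_marker_px=40):
    """Scaling factor for sizemode='area', following the formula
    sizeref = 2 * max(sizes) / (desired maximum marker size ** 2)."""
    return 2.0 * max(sizes) / (max_marker_px ** 2)

# Made-up active case counts
active = [450000, 90000, 800]

sizeref = area_sizeref(active, max_marker_px=40)
print(sizeref)
```

The result could then be used as marker = dict(size = df_sub['Active'], sizemode = 'area', sizeref = sizeref) in place of the multiplier. It should be computed once from the full df rather than from each df_sub, so that every trace shares the same scale.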
fig.update_layout(
    title_text = 'Active Covid-19 Cases In The United States By Geography',
    title_x=0.5,
    showlegend = True,
    legend_title = 'Range Of Active Cases',
    geo = dict(
        scope = 'usa',
        landcolor = 'rgb(217, 217, 217)',
        projection=go.layout.geo.Projection(type = 'albers usa'),
    )
)
Next, we focus on the aesthetics of our bubble map visualization. We set the title of the bubble map and its position (title_x=0.5 means center aligned) and the title of the legend. Since we are making a bubble map about the US COVID-19 Active cases, we specify the bubble map scope as 'usa'. For aesthetics, I changed the US landmass colour to grey using the 'landcolor' parameter.
If you have any queries about this code snippet, this plotly documentation will help you!
Finally, we save our graph on our local machine and then display it in our Jupyter notebook.
fig.write_image("Active-Covid19-Cases-US-bubblemap.png", scale=2)
fig.show()
And our bubble map is ready!
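One practical note: static image export through fig.write_image depends on an extra package (Kaleido in recent Plotly versions), so the call can fail on a fresh install. Below is a minimal sketch of a fallback, assuming a hypothetical helper save_figure that is not part of the original code. It catches any exception because the exact error type can vary between Plotly versions.

```python
def save_figure(fig, stem):
    """Try to save a PNG; fall back to an interactive HTML file.

    fig is expected to offer write_image and write_html,
    as Plotly figures do. Returns the path that was written.
    """
    try:
        fig.write_image(f"{stem}.png", scale=2)
        return f"{stem}.png"
    except Exception as exc:  # broad on purpose, see note above
        print(f"PNG export failed ({exc}); writing HTML instead.")
        fig.write_html(f"{stem}.html")
        return f"{stem}.html"

# Usage with the figure built above:
# save_figure(fig, "Active-Covid19-Cases-US-bubblemap")
```

The HTML file keeps the hover labels and legend interactive, which a PNG cannot do.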
Conclusion
You can find the code for this tutorial on my GitHub.
Thanks a lot for reading my tutorial! If you have any questions, feel free to ask me! You can also follow me on Twitter or connect with me on LinkedIn. I would also love to get some feedback on my code and my post!