Have you ever wondered how to create an age distribution graph using Python, Pandas and Seaborn? If so, keep reading to find out how!
Figure 1: Here is the graph we'll learn to build in this tutorial
Setup
First, here is the GitHub repo for this tutorial: Kaggle Titanic Project
We'll be working with the contents in the file age-distribution-graph.ipynb
for this tutorial.
Note: We'll be working with Jupyter Notebook for this tutorial, so if you don't have it installed, you can get it from the official Jupyter website.
Development
After opening up age-distribution-graph.ipynb
you'll notice that the code is divided up into blocks that can be run individually.
Let's go through each code block one by one:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
warnings.filterwarnings("ignore")
Here we are importing all the necessary libraries for the histogram that we're about to build. We'll be using Seaborn to create the histogram with its histplot function (more on that function in their docs page).
The warnings.filterwarnings("ignore") line adds an "ignore" entry to the front of Python's ordered list of warning filters. Because no category or message pattern is given, the entry matches every warning, so none are printed for the rest of the session (more on warnings.filterwarnings() in the official docs page).
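To make the effect of that filter concrete, here is a minimal sketch using only the standard library. The noisy() helper is hypothetical and exists only to emit two warnings; the second half shows a narrower alternative that silences a single category, which may be preferable if you still want to see other warnings.

```python
import warnings


def noisy():
    # Hypothetical helper: emits one deprecation warning and one user warning.
    warnings.warn("old API", DeprecationWarning)
    warnings.warn("check your input", UserWarning)
    return "done"


# Blanket suppression, as in the notebook: every warning is ignored.
with warnings.catch_warnings(record=True) as caught_all:
    warnings.filterwarnings("ignore")
    noisy()

# Narrower alternative: ignore only DeprecationWarning, keep everything else.
with warnings.catch_warnings(record=True) as caught_some:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisy()

print(len(caught_all))                              # 0
print([w.category.__name__ for w in caught_some])   # ['UserWarning']
```

The catch_warnings() context manager is used here only so the sketch can inspect what was recorded and restore the filters afterwards; the notebook applies the filter globally instead.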
Next, we add the following code block:
def read_data():
    train_data = pd.read_csv("data/train.csv")
    test_data = pd.read_csv("data/test.csv")
    return train_data, test_data

train_data, test_data = read_data()
Here we're defining the read_data() function, which loads the training and test data from their .csv files into two Pandas DataFrame objects (more on DataFrame in the official docs).
Now the train_data variable contains the training data and the test_data variable contains the testing data.
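Before plotting, it can help to check what pd.read_csv() actually returned. The sketch below is an assumption-laden stand-in: it reads a few made-up rows from an in-memory string instead of data/train.csv, so it runs without the repo, but the same checks apply to train_data.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/train.csv with a few Titanic-style columns.
csv_text = """PassengerId,Survived,Age
1,0,22
2,1,38
3,1,
4,0,35
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)        # DataFrame
print(sample.shape)                 # (4, 3)
print(sample["Age"].isna().sum())   # 1 (the empty Age field becomes NaN)
```

The missing-value check matters for this tutorial because the Age column in the Titanic data has gaps, and rows without an age cannot be placed in any bin of the histogram.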
Next we can add the following code:
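def survived_age_table(feature):
    # Plot a histogram of the given column, split and colored by the Survived column.
    sns.histplot(data=train_data, x=feature, hue='Survived', palette=['yellow', 'green']).set_title(f"{feature} Vs Survived")
    plt.legend(labels=['Died', 'Survived'])
    plt.show()

survived_age_table('Age')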
This function is responsible for creating the age distribution graph. Here are some more details about it:
- First we create the histogram by calling sns.histplot() (more on this function can be found in their official docs).
- The data parameter takes an input data structure, which is a pandas.DataFrame in our case.
- The x parameter specifies the variable to be counted, which in this case is the Age column.
- Assigning a variable to the hue parameter, Survived in our case, is an instance of conditional subsetting: the rows are split by the value of Survived, and a separate histogram with its own color is rendered for each value in the same graph.
- The palette parameter is a way to choose the colors to use when mapping the hue variable.
- Finally, we can set the title of the histogram via set_title().
- The plt.legend() function is a way to customize the labels displayed in the legend box located in the top right of the histogram.
- Lastly, plt.show() displays our histogram.
And here is our finished histogram:
Figure 2: Our Finished Histogram
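If you want to sanity-check a plot like this one, the counts behind each bar can be tabulated with Pandas alone. The sketch below uses made-up rows and hypothetical 20-year bins; sns.histplot() picks its own bin edges by default, so the numbers will only line up with a plot if you pass the same edges to both.

```python
import pandas as pd

# Hypothetical sample rows; the real values live in data/train.csv.
sample = pd.DataFrame({
    "Age": [4, 22, 25, 31, 38, 47, 58, 63],
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Group ages into 20-year bins: [0, 20), [20, 40), [40, 60), [60, 80).
age_bins = pd.cut(sample["Age"], bins=[0, 20, 40, 60, 80], right=False)

# One row per age bin, one column per Survived value (0 = died, 1 = survived).
counts = pd.crosstab(age_bins, sample["Survived"])
print(counts)
```

Each row of the resulting table corresponds to one position on the x axis, and each column corresponds to one hue color.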
Thanks for following along and I hope this article was helpful to you.
Conclusion
Well, that's it for this post! If you have any questions or concerns, please feel free to leave a comment on this post and I will get back to you when I find the time.
If you found this article helpful please share it and make sure to follow me on Twitter and GitHub, connect with me on LinkedIn and subscribe to my YouTube channel.