Hugo Estrada S.

Posted on Jan 15, 2021

Data Visualization with Python pt. i

#python #datascience

First things first: all the code I cover in this lecture is right here:

https://github.com/hugoestradas/Data_Visualisation_with_Python.git

Part 1: Using Matplotlib for the Very First Time

Matplotlib is a popular data visualization library for Python.

I'm going to use it because it's fairly easy to use and, among the many Python data visualization libraries, it's the most commonly used one.

With Matplotlib you'll be able to create many different types of charts, such as line charts, bar charts, scatter plots and histograms.

Let's see how to create a line chart with Matplotlib:

The first thing to do is to import the Matplotlib module:
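A minimal sketch of that import, using the conventional `plt` alias for the `pyplot` submodule:

```python
# Import the pyplot interface under its conventional alias.
import matplotlib.pyplot as plt
```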

Now to start plotting data I'll use the following line:
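A minimal sketch of such a line, assuming `pyplot` is imported as `plt`. The first list holds the x values and the second list holds the y values:

```python
import matplotlib.pyplot as plt

# x values first, y values second.
plt.plot([1, 2, 3], [1, 4, 9])
```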

This line says "put 1, 2 and 3 on the 'x' axis; and 1, 4 and 9 on the 'y' axis".
To show this plot, the following line is needed:
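As a sketch, the call in question is `plt.show()`, placed after the plotting call. The data is the same 1, 2, 3 against 1, 4, 9 example:

```python
import matplotlib.pyplot as plt

# plot() returns a list with one Line2D object per plotted line.
lines = plt.plot([1, 2, 3], [1, 4, 9])

# Render the current figure. In a Jupyter notebook the plot is usually
# displayed at the end of the cell even without this call.
plt.show()
```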

It is possible to add labels for the 'x' and 'y' axes and a title for the whole plot:
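A short sketch using `plt.xlabel`, `plt.ylabel` and `plt.title`. The label and title strings are placeholders chosen for illustration:

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])

# The strings below are placeholders; use whatever describes your data.
plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Squares of x")
```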

The whole cell should look like this, and the end plot should be the following:

It's also possible to plot multiple lines on the same plot:
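One way to do it, sketched below, is to call `plt.plot` once per line before showing the figure. The second series (cubes of x) is a hypothetical example:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]

# Each plot() call adds another line to the same axes.
plt.plot(x, [1, 4, 9])    # squares of x
plt.plot(x, [1, 8, 27])   # cubes of x (hypothetical second series)
```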

And the plot looks like this:

To clarify which line is which, you can name them using the "plt.legend" function:
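A sketch of a legend for two lines. The names "squares" and "cubes" are placeholders that match the hypothetical series plotted here:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
plt.plot(x, [1, 4, 9])
plt.plot(x, [1, 8, 27])

# Names are matched to the lines in the order they were plotted.
legend = plt.legend(["squares", "cubes"])
```

An alternative is to pass `label="squares"` to each `plt.plot` call and then call `plt.legend()` with no arguments.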

And the plot looks like this:

It is possible to export the plot as an image as well:
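A sketch using `plt.savefig`. The file name `squares_plot.png` is a hypothetical choice, and the file is written to the current working directory:

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 4, 9])

# The output format is inferred from the file extension.
# "squares_plot.png" is a hypothetical file name.
plt.savefig("squares_plot.png")
```

If you also call `plt.show()`, it is safer to call `plt.savefig` first, since some backends discard the figure once it has been shown.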

Part 2: Using Pandas

Pandas is a Python library that helps you import, organize and process data. Its dataframes are similar to the data frames of the R language.

Let's create a dataframe in Pandas, select data with Boolean indexing and finally plot using the same Pandas dataframe:

This is the data I'll be using:

To create a dataframe to store this data, I'm going to create some dummy data as a dictionary with three keys: 'year', 'attendees' and 'average age'.
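A sketch of that step. The three keys come from the text, while the numbers are placeholder values picked for illustration:

```python
import pandas as pd

# Placeholder values for illustration; substitute your own figures.
data = {
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
}

# Each key becomes a column and each list becomes that column's values.
pd.DataFrame(data)
```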

And after executing the cell, the displayed dataframe should look like this one:

I can assign this newly created dataframe to a variable called 'df' (the standard variable name for a dataframe in Pandas):
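Sketched with the same placeholder values (not real figures), the assignment could look like this:

```python
import pandas as pd

# Placeholder values for illustration.
data = {
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
}

df = pd.DataFrame(data)
df  # as the last expression in a notebook cell, this displays the table
```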

And the result should be the same:

There are three columns in this dummy dataframe. You can select a single column out of it, for example:
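A sketch of selecting the 'year' column with square brackets, on a dataframe built from placeholder values:

```python
import pandas as pd

# Placeholder values for illustration.
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# Square brackets with a column name return that single column.
years = df["year"]
print(type(years))
```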

The type of this new data is something called a "Pandas Series".

It's similar to a regular Python list and also to a NumPy array, if you're familiar with the NumPy library.

Knowing this, you can apply an inequality operation on the series with df['year'] < 2010:
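A sketch of that comparison, again on placeholder values. The comparison is applied to every element of the series at once:

```python
import pandas as pd

# Placeholder values for illustration.
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# True for 2008, False for 2012 and 2016.
print(df["year"] < 2010)
```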

This returns a series of Boolean values:

Let's store the output into a variable:

Using the Boolean Series you can select only the part of the data where the year is earlier than 2010. This is called "Boolean Indexing":
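A sketch covering both steps: storing the Boolean series in a variable and then using it to filter the rows. The variable names and the data values are hypothetical:

```python
import pandas as pd

# Placeholder values for illustration.
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# Store the Boolean series in a variable (hypothetical name).
earlier_than_2010 = df["year"] < 2010

# Passing the Boolean series in square brackets keeps only the True rows.
early_events = df[earlier_than_2010]
```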

Imagine that you want to examine how the number of attendees has changed for the last three events.

To best figure this out, you might want to plot the number of attendees against the year:
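A sketch of such a plot, passing two dataframe columns straight to `plt.plot`. The values are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration.
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# First argument goes on the x axis, second on the y axis.
plt.plot(df["year"], df["attendees"])
```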

This line puts the year on the x axis and the attendees on the y axis, and the result is the following:

If you want to plot the number of attendees and the average age on the same plot, you can just call 'plt.plot()' multiple times:
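A sketch with two `plt.plot` calls sharing the same x axis, plus a legend to tell the lines apart. The values are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration.
df = pd.DataFrame({
    "year": [2008, 2012, 2016],
    "attendees": [112, 321, 729],
    "average age": [24, 43, 31],
})

# Two calls, one line each, drawn on the same axes.
plt.plot(df["year"], df["attendees"])
plt.plot(df["year"], df["average age"])

# Legend names follow the order of the plot() calls.
legend = plt.legend(["attendees", "average age"])
```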

Part 3: Importing Data with Pandas

For this example, the sample data that I'm going to use is the following .csv file:

It is a list of countries and their basic demographics for each year, with years ranging from 1952 to 2007 in five-year steps.

To import this .csv file, make sure that you either know the path of the file or that both the notebook and the .csv file are located in the same folder within the Jupyter instance.
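A sketch of the import with `pd.read_csv`. To keep it runnable on its own, it first writes a tiny stand-in file; the file name, column names and rows are all made up for illustration and are not the real dataset:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the real file (hypothetical columns and rows).
sample_csv = "country,year,population\nAtlantis,1952,1000\nAtlantis,1957,1200\n"

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "countries_sample.csv")
    with open(csv_path, "w") as handle:
        handle.write(sample_csv)

    # read_csv accepts an absolute path or a path relative to the
    # notebook's working directory.
    data = pd.read_csv(csv_path)
```

When the .csv file sits in the same folder as the notebook, the bare file name is enough as the path.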

This dataset is pretty small, but in real-world scenarios, if you want to have a glimpse of the data you're dealing with, all you need to do is use the 'head()' method:
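A sketch of `head()` on a small made-up table. The column names and rows are hypothetical stand-ins for the real file, read from an in-memory string so the example runs on its own:

```python
import io

import pandas as pd

# Twelve made-up rows: one per five-year step from 1952 to 2007.
rows = ["country,year,population"]
rows += [f"Atlantis,{1952 + 5 * i},{1000 + 100 * i}" for i in range(12)]
data = pd.read_csv(io.StringIO("\n".join(rows)))

first_rows = data.head()     # the first 5 rows by default
first_three = data.head(3)   # or pass the number of rows you want
```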