Hugo Estrada S.

Data Science With Docker pt. II: Analyze your First Data Set.

Alt Text

In my previous lecture, I provided a guide on how to build your first scientific environment using Docker Compose. Now I'll explain how to start diving into the mysteries of Data Science using that environment. In this lecture you'll learn the basics of Python for data analysis using the most popular Python packages, and you'll solve your first data analysis requirement.

Again, here's the repo with the materials for this tutorial:

https://github.com/hugoestradas/Data-Science_with_Docker


We'll be doing some data exploration and data analysis, so go directly to the jupyter-notebooks folder.

Part 1: Hello Jupyter (again)

Let's talk about the Jupyter Notebook. No matter which image you selected for your scientific environment, the graphical interface is the same.

Alt Text

The Jupyter interface is pretty straightforward: easy to understand and easy to use. There are three major tabs on the left side of the GUI: Files, Running and Clusters. To the right there are "Quit" and "Logout" buttons; below them is the area for uploading or creating new files, and a refresh button to update the contents of your workspace, which are listed at the center of the window. Right now, what you're seeing is the content of the Files tab.

Moving to the Running tab: when you first start the notebook server, you'll find that there's nothing in this section.

Alt Text

The final section corresponds to the Clusters tab. Again, since this is the first time you're opening the server, there won't be anything executing. But you might notice something interesting in this tab: "Clusters tab is now provided by IPython parallel. See 'IPython parallel' for installation details."
What on earth does that mean? Well, let me explain. The Jupyter Notebook is a web IDE that allows the creation of files with executable code and data visualizations. To achieve this, Jupyter works with something called a kernel, which is the backbone of this web-based application. The kernel is in charge of executing the code written in whatever language you prefer. In our case, since we're going to be working with Python, the IPython kernel is the default kernel. If you'd like to read more about the IPython kernel, here's the official Jupyter documentation:
https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html#kernel

With a basic understanding of what the kernel is and what it does, let's create your first notebook.

Alt Text

Go to the New button and click on Python 3. This will launch a blank space for code. A new term you need to understand when working in a Jupyter Notebook is the concept of a "cell". Each cell and its contents (code, attached images, etc.) is stored along with some metadata, forming a single unit of code within a group of code units. This is reflected in the Notebook's own file extension: ".ipynb". These files are essentially JSON, where each section corresponds to a single code unit (or cell).

Alt Text

To better understand what's going on under the hood with all these kernel, cell and notebook concepts, let's write some Python code. Since we want to analyze data, I'm going to import the three main libraries for this purpose: Pandas, NumPy and Matplotlib.

Alt Text

In a nutshell:

• Pandas stands for "Python Data Analysis Library", and its name is also derived from the term "panel data", an econometric term for data sets that include observations over multiple time periods for the same individuals.
• NumPy is a general-purpose array-processing package. Its name means "Numerical Python" and it's used for linear algebra operations. At its core is the NumPy array, a multi-dimensional data structure that can be used to represent vectors and matrices.
• Matplotlib is the plotting library; its pyplot module is a collection of command-style functions that make Matplotlib work like MATLAB. Each pyplot function makes a change to a figure: creates a plotting area, plots some lines in it, decorates the plot with labels, and so on.
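
If you've never touched these libraries before, here's a tiny, self-contained example, unrelated to the data set we'll use below, just to show each one doing its job:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: a small array and a vectorized operation
revenue = np.array([1200.0, 950.5, 3100.25])
print(revenue.mean())  # 1750.25

# Pandas: the same numbers as labeled columns in a DataFrame
df = pd.DataFrame({"Region": ["Europe", "Asia", "Africa"], "Revenue": revenue})
print(df.head())

# Matplotlib: a quick bar chart of the DataFrame
plt.bar(df["Region"], df["Revenue"])
plt.ylabel("Revenue")
plt.show()
```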

Before we get to our valuable data-science code, we need to upload our source of information to the Jupyter server. So let's upload the .csv file.

Alt Text

The .csv file is compressed in a .zip archive. That's totally fine, since Python supports this type of file. Now, back in our first notebook, we'll add one more library: the zipfile module. This will allow us to access the content of the .csv without unzipping it.

Alt Text

I'm adding a short alias to each individual library so I don't have to re-type the entire name each time I need to access it. Now, to execute the cell you are currently in, you can either press the "Run" button or use the keyboard shortcut SHIFT + ENTER.
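
A rough sketch of what that import cell might look like (zipfile ships with Python's standard library; np and pd are simply the conventional aliases):

```python
import zipfile            # standard library, lets us read the zipped CSV later

import numpy as np        # np and pd are just the conventional short aliases
import pandas as pd
import matplotlib
```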

Now let's create your first DataFrame. This is what Pandas is used for, and it offers two main data structures: the Pandas DataFrame and the Pandas Series. A Pandas DataFrame is nothing but an in-memory, Excel-sheet-like representation of your data in Python. It provides various functionalities to analyze, change, and extract valuable information from the given dataset. In addition, since it lives in memory, it's really easy to work with this type of data structure, offering speed and convenience.

Alt Text
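
If you'd rather type than squint at the screenshot, a cell along these lines does the job; note that the archive name below is a placeholder, so use the .zip you actually uploaded:

```python
# The archive name is a placeholder -- replace it with the file you uploaded.
with zipfile.ZipFile("sales_records.zip") as archive:
    with archive.open(archive.namelist()[0]) as csv_file:  # the CSV inside the zip
        df = pd.read_csv(csv_file)

# Alternatively, pandas can read a zip that contains a single CSV directly:
# df = pd.read_csv("sales_records.zip", compression="zip")
```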

Now it's time to explore what kind of information we have. This .csv file contains the sales records of a large company with operations all over the world. The file contains an "Order ID" field, "Region", "Country", "Total Revenue" and many other useful fields.

To take a glimpse of the data, use the head() function. It returns the first n rows of the object based on position. It is useful for quickly checking that your DataFrame has the right type of data in it, or just for getting familiar with the data you are dealing with.

Alt Text

For a more detailed description of the DataFrame, like the type of each column, how many records we have, its memory footprint, etc., you can use the info() function, which prints a concise summary of the DataFrame. It comes in really handy when doing exploratory analysis, to get a quick overview of the dataset.
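
In code, that exploration boils down to two calls:

```python
print(df.head())   # first 5 rows (df.head(10) would show the first 10)
df.info()          # concise summary: columns, dtypes, non-null counts, memory usage
```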

Alt Text

Okay, by this point you should have a pretty good understanding of the data you are dealing with. And if you remember, we needed to write some code to understand the way Jupyter and its kernel load and execute the code we write into notebooks. So now is a good time to save our first notebook. For that, go to the File menu and click on "Save as". Or you can simply type the name you want to give this notebook by clicking on the "Untitled" name at the top of the window.

Alt Text

Either option you choose is going to save the current state of your notebook. Now, when you go take a look at the file itself in your scientific environment, you'll notice that the file in which you just typed a couple of lines of code does not look like what you've seen in Jupyter. This is because the kernel is the one that takes the file and, when executing the code, also takes on the task of rendering this cell format for your code, creating a more organized and clean workspace.
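
If you're curious, you can even inspect that underlying JSON from a notebook cell; here's a small sketch (the notebook name is a placeholder, use whatever you just named yours):

```python
import json

# "Untitled.ipynb" is a placeholder -- use the name you gave your notebook.
with open("Untitled.ipynb") as f:
    notebook = json.load(f)

print(notebook["nbformat"])                    # notebook format version
for cell in notebook["cells"]:
    print(cell["cell_type"], cell["source"])   # each cell's type and its contents
```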

Alt Text

Now, let's go to your first data-science requirement. Let's say you are given the task of providing a visualization of the revenue by region. Sounds easy, so let's go ahead and try to solve it.

First of all, we know that the file you were given already has records specifying a total amount for a specific order, along with its region. So all we have to do is find a way of grouping the total revenue by region. To start, we need to know how many regions we have and treat each region as a unique value. I'm going to approach this by creating and counting classes, and then marking each class (in this case, each region) as unique.

This is NumPy's moment to shine, because we're going to work with different arrays.

Alt Text
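
One way this classes-and-counts step can look, assuming the column is named "Region" as in this dataset:

```python
# classes -> the distinct regions; counts -> how many records fall in each one
classes, counts = np.unique(df["Region"], return_counts=True)
print(classes)
print(counts)
```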

Great! Now I've stored each region and its corresponding total in new structures (classes and counts, respectively). But let's say I want a new DataFrame just for this data. We can do it as follows:

Alt Text

One thing worth mentioning here is that with a small switch in this line of code I can get the median revenue instead, by changing .sum() to .median(). That's why we use Python for data science, computer science and AI: its really simple syntax turns out to be very powerful when dealing with this kind of scenario.
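
A sketch of that grouping with Pandas (again assuming the "Region" and "Total Revenue" column names):

```python
# Group the full DataFrame by region and total the revenue for each group;
# swap .sum() for .median() to get the median revenue per region instead.
revenue_by_region = df.groupby("Region")["Total Revenue"].sum().reset_index()
revenue_by_region
```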

Okay, but let's remember that we're doing data exploration here, and the requirement indicates that we have to present these results not in a detailed table, but as a visualization chart. So let's plot the data we just discovered.
For this we need to explicitly tell Matplotlib that we want the pyplot interface available.

Alt Text

Okay, now I'm going to use the classes and counts variables to create a chart of this data, using a for loop and the zip function. I'm going to place the classes along the x-axis, and each bar's height will depend on the number of occurrences that region has in the DataFrame.
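
Roughly, that plotting cell might look like this:

```python
from matplotlib import pyplot as plt   # make the pyplot interface available

# One bar per region: zip() pairs each entry in classes with its count.
for region, count in zip(classes, counts):
    plt.bar(region, count)

plt.xticks(rotation=45)        # tilt the region names so they don't overlap
plt.ylabel("Number of records")
plt.show()
```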

Alt Text

Part 2: Data Persistence

So far you've created your first DataFrame in Python using Pandas, you classified and narrowed down the data you need to fulfill the requirement using NumPy, and you created your first plot in Python using Matplotlib and your first sub-DataFrame. But let's remember that all this was possible because the DataFrame is still in memory, meaning that when you close your notebook or stop the Jupyter kernel all this data will be lost. And that's definitely not good. Luckily for you, our scientific environment already comes with a relational database capable of storing this structure in a table. And since in the previous lecture you already configured the communication between Jupyter Notebooks and PostgreSQL (our database of choice), all we need to do is create a connection string and store our DataFrame in a new table in PostgreSQL.

Alt Text
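
As a sketch, the connection string is a SQLAlchemy URL; the user, password, host and database name below are placeholders, so take the real values from your compose file:

```python
from sqlalchemy import create_engine

# Placeholder connection string -- the credentials, host and database name
# come from the docker-compose YAML of the previous lecture.
engine = create_engine("postgresql://postgres:postgres@postgres:5432/postgres")
```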

In the previous lecture we configured pgweb as the graphical interface for our PostgreSQL database. To access it, just open a new tab in your browser and type the IP of your environment followed by :8081.

Alt Text

The credentials for the postgres user we'll be using are in the YAML file. Now we need to insert our DataFrame into a table stored in Postgres. To achieve this I'm going to use Pandas' .to_sql() function.
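
A minimal sketch of that call, using the table name we'll query later:

```python
# Write the DataFrame into a new "sales" table, replacing it if it already exists.
df.to_sql("sales", engine, if_exists="replace", index=False)
```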

Alt Text

This is going to take some time; remember that SQL databases are not particularly known for their speed, and in contrast with DataFrames they're quite slow. Wait until the hourglass is no longer visible in your Jupyter tab.

Alt Text

Go back to pgweb, and you'll find the "sales" table. Now, if you go ahead and execute a query to count the records, you'll find that the .to_sql() function did indeed insert the half a million records into the table.

Alt Text

If you want to dig more into SQL and replicate what we just did in Python, there's a simple query for this:

select "Region", SUM("Total Revenue") as "Total Revenue" from "sales" group by "Region" order by "Total Revenue"

Alt Text

Part 3: Data Visualization - Superset

We've used Python to understand the data we were given, and we've stored the data in a relational database so we can keep it whenever we need it. Now it's time to create a high-level dashboard for our study, and this is where we need Apache Superset.

Our scientific environment already has a Docker image for Superset. Type the IP of your environment, and this time use port 8088. The credentials for Superset are: the user is admin and the password is superset.

Think of Apache Superset as the Power BI of the open-source world. Following a minimalist approach, the Superset GUI is very clean and simple.

Alt Text

In order to create the dashboard we were asked for, we need to connect Superset with Postgres, or we can upload the .csv. Either solution works for us.
To add a new database source, you need to go to the Sources tab and click on Databases.

Alt Text

At the top right corner there's a plus button; click on it to add a new connection.

Alt Text

From there, all that's left is to copy and paste the connection string we used in our Jupyter notebook, pick a name for your database within Superset and test the connection to see if it actually works.

Alt Text

If you're eager to know more about Superset, I encourage you to try the SQL Lab tab, which provides its own SQL editor, and run the previous query we used.

Now, the other way we can create the dashboard is by uploading the .csv file. For that you need to go to the Sources tab and click on the Upload CSV option.

Alt Text

From there, type a name for the table and select the file from your local system, then click the Save button at the bottom of the page; no need to change anything in the configuration. Again, this will take some time since the file size is over 50 MB.

Alt Text

Next up, click on the Sources tab and then the Tables section. You'll find the table you just uploaded from the csv file. Click the edit icon for that record to set the "Region" and "Total Revenue" fields as groupable.

Alt Text

Alt Text

Awesome, now it's time to actually make that dashboard. Click on the Dashboards tab and then add a new one by clicking the plus button on the right corner.

Alt Text

Type a name for your dashboard and save it.

Alt Text

Go to the Charts tab and add a new one by clicking on the plus button in the right corner. Now select the datasource (either the csv file or the database), don't worry about the visualization type for the moment, and click the Create New Chart button.

Alt Text

You should see a screen like the one in the image below.

Alt Text

Now, if you click on the Visualization Type on the left side of the window, you'll find lots of graphs! Superset has over 50 types of charts, and depending on the type of data you are dealing with, you can use any of them.

Alt Text

To fulfill this requirement with an interactive, pleasant, user-friendly touch, I'm going to select the "Sunburst" chart, which is a minimalist circular plot that shows the different percentages of the analysis whenever the user passes the cursor over each section of the graph. To achieve this, all I have to do is go to the visualization type section, select the Sunburst chart, then in the Hierarchy option select Region, and in the Primary Metric section select Total Revenue as SUM(). Then click on "Run Query".

The result should look like this one:

Alt Text

And finally, we just need to save the chart and add it to the dashboard we previously created.

Alt Text

Click on "Save and Go to Dashboard" and that's it: you have an interactive dashboard with the sales by region completed!

Alt Text

Part 4: Final thoughts and conclusions.

Okay, in this lecture you learned the basics of the Jupyter Notebook, the three main libraries for data analysis in Python, how to deal with DataFrames and their pros and cons, how to persist data using Jupyter Notebook and PostgreSQL, and how to create interactive dashboards with graphical data using Superset.

Hope you find this useful!
