Renata Maçãs

Posted on Sep 18, 2023

Python + Google Colab Tutorial for Data Analysis

#python #community #datascience #googlecolab

Introduction

Data analysis is essential for creating public policies and for making existing ones more efficient. Here, we will talk about how numbers and data can be allies of public policy. This matters because we want the representatives of a democratic state to do things that work, right?

Public policies are basically government plans to make society better. They can be about health, education, money, or even fun things like culture. Sometimes, we all help think about them!

The idea is that these public policies follow the rules written in the 1988 Constitution, which is like the manual of laws here in Brazil. But how do we know what to do and where to invest our money? That's where data comes in.

Data is like clues that help us understand what is happening in society. They show us things like how much money people earn, whether they have access to services like health and education, and even if everyone has the same opportunities.

For example, we have the Brazilian Institute of Geography and Statistics (IBGE). It collects information about everything, from how many people live in a city to how long it takes people to get to work.

Transparency is crucial here. We need to make sure that everyone can see and understand this data because it helps keep things fair. There are even laws, like the Access to Information Act, that ensure you can request this information from the government. And we also have the General Data Protection Law (LGPD), which protects your personal information.

So, in summary, data is like valuable tips for creating better public policies. And it's important that everyone can access them and that our personal data is protected. After all, we are all on this journey towards a fairer society!

Tutorial

Let's perform a simple Data Analysis using Python, Pandas, Matplotlib, and Google Colab! :)

Let's access Google Colab:

Click on this link: https://colab.google/

You need to have a Google account.

A new page will open in your browser with your 'Notebook'.
The cool thing is that this platform gives us a ready-made virtual environment for working with code, and we can store the files in various locations: on your computer, GitHub, Google Drive, etc.

Let's change the name of the file:

At the top of the file, just click on the file name with the .ipynb extension and change it to 'lesson1'.
If you want, you can save this file in Drive, GitHub, etc.

Let's set up our project:

On the left side of the notebook there is a 'Files' option. Click on it and a folder panel will appear, which is our project!
In Google Colab we already have an environment prepared for data analysis, but you can use other tools, such as Jupyter or Anaconda; it all depends on what we are going to analyze. :)

Let's copy or drag the .csv file to the 'sample_data' folder, or integrate Colab with Google Drive.
To mount Google Drive in Colab, create a cell (code block) in the notebook with the following content:

from google.colab import drive
drive.mount('/content/drive')

By doing this, a message like this will appear:

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=[value]
Enter your authorization code:

Access the URL above, choose a Google account, copy the generated token, and paste it in Colab, then press enter.
After doing this, the cell will update, and the following message will appear:

··········
Mounted at /content/drive

Send the file to Google Drive.
With the drive mounted, go to your Google Drive and upload the file.

Example: Datasets/imdb-reviews-pt-br.csv.

Open the dataset in Colab. With the file in Google Drive, create a cell with the following code:

import pandas as pd

df = pd.read_csv('/content/drive/My Drive/Datasets/imdb-reviews-pt-br.csv')
df.head()

You should be able to see the first rows of your dataset! :)

Now let's access the website that provides open data for our analysis.

Let's access the INEP website at the following link: https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem
Clicking on the link will download the .zip file with all the data.
Unzip the .zip file and check which folders and file types it contains.
For the tutorial, we will select the Enem microdata.
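If you prefer to unzip and inspect the download from inside the notebook, the step above can be sketched with Python's standard zipfile module. This is only a minimal illustration: the archive name, folder names, and file contents below are hypothetical stand-ins created on the fly, not the real INEP package.

```python
import tempfile
import zipfile
from pathlib import Path


def list_zip_contents(zip_path):
    """Return the names of all entries in a .zip archive without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def extract_zip(zip_path, target_dir):
    """Extract every entry into target_dir and return the extracted file paths."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return sorted(p for p in Path(target_dir).rglob("*") if p.is_file())


# Build a tiny stand-in archive (hypothetical names); the real download is much larger.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo_microdata.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("DADOS/demo.csv", "A;B\n1;2\n")
    archive.writestr("LEIA-ME/readme.txt", "demo")

entries = list_zip_contents(demo_zip)            # folder/file names inside the archive
extracted = extract_zip(demo_zip, workdir / "unzipped")
print(entries)
```

Listing the entries first is a cheap way to find the .csv file you need before extracting everything.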

Now let's program! If you don't know anything about Python, no problem, just copy and paste the code block below. Still, it's essential to study Python or another programming language if you intend to advance in a career in Data Science. :)

Create a variable to store the data to be imported with Pandas:

microdata = pd.read_csv('file-path', sep=";", encoding='ISO-8859-1')

After copying the code and pasting it into Google Colab, click the "run" button within the code block and watch the magic happen!
When you run the command, you get a DataFrame, the Pandas data structure that holds the data it has read; in other words, a table with rows and columns. :)
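To see why the sep and encoding arguments matter, here is a small sketch that reads the same bytes with and without sep=";". The two rows are made-up stand-ins for a semicolon-separated, ISO-8859-1 encoded file; they are not real Enem records.

```python
import io

import pandas as pd

# Stand-in for a semicolon-separated, ISO-8859-1 encoded file (hypothetical rows).
raw_bytes = "NO_MUNICIPIO_PROVA;TP_SEXO\nSão Paulo;F\nBelém;M\n".encode("ISO-8859-1")

# sep=";" splits the columns; encoding tells Pandas how to decode accented letters.
sample = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="ISO-8859-1")

# With the default sep="," the whole header line is read as one single column.
wrong_sep = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")

print(sample.shape)     # rows and columns when the separator is right
print(wrong_sep.shape)  # everything collapses into one column
```

If your DataFrame shows up with a single, very wide column, the separator is the first thing to check.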

From here, we start the analysis:

At this point, it's time to ask questions.
What do we want to analyze?
What questions can the data and indices answer, or what hypotheses can they suggest?
How can this result impact decisions in both the public and private sectors?

To continue, let's perform a brief exploratory analysis based on the following question:

How can we organize the columns to gain insights?
Let's type the following code to check which columns the data has:

microdata.columns.values

This command returns an array with the names of all the columns.
Since today we are analyzing only one dataset, let's select just a few columns; that is, we'll create a smaller DataFrame for this analysis.

First we list the columns we want (here, the three columns used in the rest of this tutorial), then we use the filter method to keep only those columns:

columnSelect = ['NO_MUNICIPIO_PROVA', 'TP_FAIXA_ETARIA', 'TP_SEXO']
microdataSelect = microdata.filter(items=columnSelect)

microdataSelect.head()
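One detail worth knowing about filter: as far as I know, it silently skips names that do not exist in the DataFrame, while plain indexing with the same list raises an error. A minimal sketch with a made-up table (the rows, the NU_INSCRICAO column, and the NOT_A_COLUMN label are hypothetical):

```python
import pandas as pd

# Tiny stand-in table; values are invented for illustration only.
demo = pd.DataFrame({
    "NU_INSCRICAO": [1, 2, 3],
    "NO_MUNICIPIO_PROVA": ["Recife", "Manaus", "Recife"],
    "TP_FAIXA_ETARIA": [2, 3, 2],
    "TP_SEXO": ["F", "M", "F"],
})

wanted = ["NO_MUNICIPIO_PROVA", "TP_FAIXA_ETARIA", "TP_SEXO", "NOT_A_COLUMN"]

# filter(items=...) keeps the labels that exist and skips the missing one.
selected = demo.filter(items=wanted)

# Plain indexing with the same list fails because of the missing label.
try:
    demo[wanted]
    raised = False
except KeyError:
    raised = True

print(list(selected.columns))
print(raised)
```

So if a column you expected is missing after filter, check the spelling of its name against microdata.columns.values.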

Let's analyze the distribution of students by municipality.
You can use this link for reference on how to calculate statistics with Pandas:

https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html

Or you can check the Pandas documentation here:

https://pandas.pydata.org/docs/reference/index.html

In other words, we want to know how many rows there are for each municipality.
By selecting the column and using Pandas methods, we can also sort the data and look up the municipality we want, for example:

columnSelect = microdataSelect['NO_MUNICIPIO_PROVA']

columnSelect

and

columnSelect.value_counts()
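Sorting the counts and looking up one municipality could look like the sketch below. The city names are invented sample values, and the variable names are only for illustration.

```python
import pandas as pd

# Invented sample values standing in for the NO_MUNICIPIO_PROVA column.
municipalities = pd.Series(
    ["Recife", "Manaus", "Recife", "Belém", "Recife", "Manaus"],
    name="NO_MUNICIPIO_PROVA",
)

counts = municipalities.value_counts()   # rows per municipality, largest first
recife_rows = counts.get("Recife", 0)    # look up one municipality (0 if absent)
alphabetical = counts.sort_index()       # same counts, ordered by name
top_two = counts.head(2)                 # the two most frequent municipalities

print(counts)
print(recife_rows)
```

Using get with a default avoids an error when the municipality you search for has no rows in the data.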

columnSelectAge = microdataSelect['TP_FAIXA_ETARIA']

columnSelectAge.value_counts()

And now, to visualize the data, we'll use the Matplotlib library.

https://matplotlib.org/stable/api/

Pandas uses Matplotlib to draw its charts. If you have not imported it yet, here's how to import it:

import matplotlib

Run the command:

columnSelectAge.hist()

If we want a more detailed view of the distribution, we can use the 'bins' parameter, which sets how many bars (intervals) the histogram is divided into:

columnSelectAge.hist(bins=30)
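To get a feel for what 'bins' does, the sketch below draws the same values with two different settings side by side. The numbers are invented stand-ins for the coded age-range column, and the figure layout is just one possible choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in values for the coded age-range column.
ages = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 6, 8, 11, 13], name="TP_FAIXA_ETARIA")

# Two plots in one figure so the effect of bins is easy to compare.
fig, (ax_few, ax_many) = plt.subplots(1, 2, figsize=(8, 3))

ages.hist(bins=5, ax=ax_few)     # few wide bars: coarse overview
ages.hist(bins=20, ax=ax_many)   # many narrow bars: finer detail, more gaps

ax_few.set_title("bins=5")
ax_many.set_title("bins=20")
fig.tight_layout()
```

In Colab the figure is displayed below the cell when it finishes running. Too few bins hide detail and too many leave empty gaps, so it is worth trying a couple of values.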

Now let's do the analysis by gender:

columnGender = microdataSelect['TP_SEXO']
columnGender.hist()
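For a column with only a few categories, counting the values and drawing a bar chart is a common alternative to a histogram. This is a swapped-in technique, not the method used above, and the sample values are invented.

```python
import pandas as pd

# Invented sample values standing in for the TP_SEXO column.
gender = pd.Series(["F", "M", "F", "F", "M"], name="TP_SEXO")

gender_counts = gender.value_counts()               # absolute number of rows per category
gender_share = gender.value_counts(normalize=True)  # same counts as proportions (sum to 1)

# One bar per category; Pandas draws it with Matplotlib.
ax = gender_counts.plot(kind="bar", title="Rows per TP_SEXO value")

print(gender_counts)
print(gender_share)
```

The proportions are often easier to compare across years or regions than the raw counts.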

Spoiler!

Oops, the Enem only records male and female. Couldn't this data support a public policy that offers more gender options?

Conclusion:

Now, it's time to delve deeper into Data Analysis. Let's explore more, ask different questions, and try new technologies. Remember, what we saw was a basic tutorial to show how to use public data in simple or complex analyses.

The use of data is essential for effective public policies. It helps government and non-governmental organizations make informed decisions, aligned with the real needs of society, and also evaluate policy performance after implementation. Let's continue exploring data analysis to better serve the needs of society!

References:

[PT-BR] Free Courses from the Federal Government for Data Science: https://www.gov.br/governodigital/pt-br/capacita/ciencia-de-dados

[PT-BR] Research on Data Science and Education: https://www.institutounibanco.org.br/iniciativas/centro-de-pesquisa-transdisciplinar-em-educacao-cpte/ciencia-de-dados-na-educacao/

[PT-BR] Data Science in Public Policies: An Education Experience (2022). National School of Public Administration (Brazil); De Toni, Jackson (Editor); Dorneles, Rachel (Editor): https://repositorio.enap.gov.br/handle/1/7472
