To my dear readers: today I discovered Google Colab, a tool that can be very handy when working with huge datasets. In my case, datasets larger than 10 gigabytes count as huge, and I would rather not have my computer fan overworking. There are no prerequisites for this article beyond basic knowledge of computers and working on the internet.
What is Google Colab?
Google Colab is a tool that allows you to write and execute Python in your browser, with zero configuration required. It gives you free access to GPUs and makes sharing your code easy.
Colab is essentially the Google Suite version of a Jupyter Notebook.
Google Colab can be used by students, Artificial Intelligence researchers, Machine Learning engineers, Data Scientists, and Data Engineers.
All you need is a good internet connection: open your favorite browser (Brave is mine), type "google colab", and click on the first link.
Google Colab is easy to use: you can write your Python code, run it, share it with others, install packages easily, and share documents. However, uploading a file or folder to Google Colab can be quite a hassle.
How to Upload a File or a Folder to Google Colab
Most people download a CSV file, upload it into Google Colab, and read/load the data frame. After a while they have to repeat everything, because the data is no longer stored there. This article solves that issue.
In this article, I will show you how to use PyDrive to read a file in CSV format directly from your Google Drive using Python3 in the Google Colab environment.
Step One: Install PyDrive
The first step is to install PyDrive in our Colab notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
Since we are in the Colab environment, the pip command begins with an exclamation mark (!), which is the standard way to run shell commands in a notebook.
Step Two: Authenticate and Authorize.
We need to authenticate and create a PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
When you run the above code, it will prompt you to give Google Colab permission to access your Drive. Click "Allow" and proceed.
Step Three: Generate a Shareable Link
Once you have completed verification, go to Google Drive:
- find your file and click on it;
- click on the “share” button;
- generate a shareable link with "Get link".
The link will be copied to your clipboard; paste it into a string variable in Colab.
Step Four: Get the File ID
Do not share your link with others, to keep unauthorized users from accessing your file. The link below is just a demonstration, to help you see which part is the file id you need.
##https://drive.google.com/file/d/25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X/view?usp=sharing
your_file = drive.CreateFile({'id':'25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X'})
Assign the id to a variable, your_file, using drive.CreateFile({'id': 'id_value'}).
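If you would rather not copy the id out of the link by hand, a small helper can pull it out for you. This is my own sketch, not part of PyDrive; extract_drive_id is a hypothetical name, and it assumes links of the form https://drive.google.com/file/d/FILE_ID/view?usp=sharing.

```python
def extract_drive_id(share_link):
    """Pull the file id out of a Drive shareable link (hypothetical helper)."""
    # The id is the path segment that follows "/d/"
    parts = share_link.split("/d/")
    if len(parts) < 2:
        raise ValueError("not a Drive file link")
    return parts[1].split("/")[0]

link = "https://drive.google.com/file/d/25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X/view?usp=sharing"
file_id = extract_drive_id(link)
print(file_id)  # 25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X
```

You can then pass the result straight to drive.CreateFile({'id': file_id}).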
Step Five: Load the File and Show Results
I was uploading a CSV file, so let's check whether our process succeeded by loading the CSV file and printing some output.
Indicate the name of the CSV file you want to load into memory.
your_file.GetContentFile('matches.csv')
I use Pandas to turn this into a DataFrame and display its head. I import pyforest, a package that makes many Python packages, including pandas, available to me.
import pyforest
df = pd.read_csv('matches.csv', delimiter=';' )
df.head()
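If you do not have matches.csv handy, you can still check that the delimiter=';' argument behaves as expected on a toy sample. The team/goals data below is invented purely for illustration:

```python
import io

import pandas as pd

# Made-up semicolon-delimited sample, standing in for matches.csv
sample = "team;goals\nArsenal;2\nChelsea;1\n"
df = pd.read_csv(io.StringIO(sample), delimiter=";")
print(df.shape)          # (2, 2): two rows, two columns
print(list(df.columns))  # ['team', 'goals']
```

If the delimiter does not match the file, pandas will lump each row into a single column, which is a quick thing to check with df.shape.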
As you can see from the output above, the CSV file was loaded successfully and we were able to operate on the data using pandas.
Now you know how to upload files and folders into Google Colab. This saves you from doing everything locally on your machine, and you can work comfortably with huge datasets.
We are still learning data engineering together. I previously wrote an article on installing Apache PySpark in Ubuntu; you can read it here. Installing PySpark in our local environment was indeed involving.
In Google Colab, I only have to run the following command to install PySpark and the py4j library:
!pip install pyspark==3.3.0 py4j==0.10.9.5
Then I can move on to using Apache PySpark in my work. To learn about Apache PySpark, read it here.
This was a short, comprehensive article about a challenge I faced and solved. Feel free to leave your comments and suggestions.