Provision the VM Instance on GCP

Cris Crawford

As promised, here are the steps I took to provision my VM Instance on GCP. The reason for doing this is that it's faster than running the assignments and the project on your computer, and it doesn't take up space on your computer. This post is what I did after I watched the video 1.4.1, which you can access from the course repo at https://github.com/DataTalksClub/data-engineering-zoomcamp.

Previously, I created a VM Instance and set up ssh to access it from my home computer. I describe that in my previous post, https://dev.to/cmcrawford2/install-a-vm-instance-on-gcp-1hkk.

You can edit a config file to make connecting with ssh easier. On your computer, in the ~/.ssh directory, edit a file named "config". You may already have one from GitHub. If not, open a new file in that directory (I used VSCode). Add the following lines:

Host <easy name to remember>
    HostName <the External IP address>
    User <your name>
    IdentityFile ~/.ssh/<your private key>

I entered:

Host de-zoomcamp
    HostName 34.145.251.35
    User cris
    IdentityFile ~/.ssh/gcp

Now, rather than typing ssh -i ~/.ssh/<your private key> <your name>@<external IP>, you can just type ssh de-zoomcamp.

The first thing I did was to set up Anaconda on the VM. I went to https://www.anaconda.com/download and navigated to the bottom of the page. On the right were the links for downloading to Ubuntu. I clicked on the first one and it started downloading to my computer, which I didn't want. So I killed the download, control-clicked the link instead, and copied the link address. Then on the VM, I typed wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh (don't take my word for the exact version - I'm copying from my notes; use the link you copied).

Then type bash Anaconda3-2021.11-Linux-x86_64.sh and you will be prompted to read the license agreement, and then asked to accept it. Type yes. It will start installing.

Then type "yes" again to run the Anaconda initializer. Now you can see in .bashrc that Anaconda has added some code at the end. Type source .bashrc to run it (this has the same effect as logging out and logging back in). You should see (bash) in front of the prompt. Type python. You should get a >>> prompt. Type import pandas as pd and pd.__version__ and you should see pandas installed. Type ctrl-d to exit.

Next we have to install docker. First update apt-get: sudo apt-get update. Then install docker: sudo apt-get install docker.io. After docker is installed, you will also need to fix permissions so you can run docker without sudo. I googled "docker run without sudo" and found instructions at https://docs.docker.com/engine/install/linux-postinstall/. You'll need to create a docker group: sudo groupadd docker, then add yourself to the group: sudo usermod -aG docker $USER, and then log out and log back in (the docs also mention newgrp docker) so that the group change takes effect. Then you should be able to run docker run hello-world and see the hello-world welcome message.
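
Collected in one place, the Docker setup looks like this (remember the group change only takes effect after you log back in):

sudo apt-get update
sudo apt-get install docker.io
sudo groupadd docker
sudo usermod -aG docker $USER
# log out and back in (or run: newgrp docker), then test
docker run hello-world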

You can configure VSCode to run on the VM. In VSCode, choose extensions from the menu on the left and enter "remote ssh" in the search bar at the top of the menu. Install the Microsoft extension that says "Open any folder on a remote machine..." Open the command palette (shift-command-p) and look for "connect to host". You should be able to find "de-zoomcamp" that you set up in the config file. Once you click this, you're working in the virtual machine.

Now let's install the course repo. Go to https://github.com/DataTalksClub/data-engineering-zoomcamp. Click the green "code" button, and copy the HTTPS link to clone the repo using the web url. On the VM running in the terminal window, type git clone https://github.com/DataTalksClub/data-engineering-zoomcamp.git

Next, install docker-compose. Find the repo where it lives, https://github.com/docker/compose/releases, and find the latest version that will run on linux: docker-compose-linux-x86_64. Again, if you just click on this, it will start downloading to your computer. You should use ctrl-click and copy the link. Then on the VM running in the terminal window, type:

mkdir bin
cd bin/
wget https://github.com/docker/compose/releases/download/v2.24.1/docker-compose-linux-x86_64 -O docker-compose

Then type chmod +x docker-compose to make the file executable.

Now we have to put our new bin directory in the path. Type nano .bashrc and at the bottom of the file add the line export PATH="${HOME}/bin:${PATH}". Then hit ^O to save and ^X to exit, and type source .bashrc. Now you should be able to run which docker-compose and docker-compose version. Next, navigate to data-engineering-zoomcamp/01-docker-terraform/2_docker_sql and type docker-compose up -d (-d is detached mode, so you can still work in the terminal), and you should see the docker containers spinning up. Leave this running for the next step.
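
For reference, the line to append and the checks afterward:

# add to the end of ~/.bashrc
export PATH="${HOME}/bin:${PATH}"

# then reload and verify
source ~/.bashrc
which docker-compose
docker-compose version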

Next we'll install pgcli. Run pip install pgcli. Then run pgcli -h localhost -U root -d ny_taxi. You need to have docker-compose running in order to see the database. I had some random error messages about "keyring", but otherwise it worked. You should be able to type \dt and see the list of tables, which is empty. You can exit.
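
The same steps as commands (pgcli will prompt for the password defined in the course's docker-compose.yaml):

pip install pgcli
pgcli -h localhost -U root -d ny_taxi
# at the pgcli prompt:
# \dt   lists the tables
# \q    quits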

We are going to see how to forward the port on the VM to our local computer so that we can look at postgres locally. Go to VSCode running in the VM. Open the terminal (ctrl-~). One of the tabs is PORTS. Select this and add ports 5432 (postgres), 8080 (pgAdmin), and 8888 (jupyter notebook). Now if you go to localhost:8080 you can log in to pgAdmin using admin@admin.com and password root. You'll be looking at the data on GCP.
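
If you'd rather not use VSCode for the forwarding, plain ssh can map the same ports from your own computer (this is an alternative, not what the video shows), using the host alias from the config file:

ssh -L 5432:localhost:5432 -L 8080:localhost:8080 -L 8888:localhost:8888 de-zoomcamp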

In the terminal connected to the VM, navigate to data-engineering-zoomcamp/01-docker-terraform/2_docker_sql and run jupyter notebook. You'll get a message that there's no browser, but the port is mapped to your computer, so open your browser and go to localhost:8888. You'll see the directory on the VM. Open the python notebook "upload-data.ipynb" and start running the code, cell by cell. You'll need to wget the yellow_tripdata_2021-01.csv data. I had to use the copy on the course's data repo: find the file at https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow, copy the link to the file (it ends with .gz), wget it on the VM, and run gunzip on it. Now the jupyter notebook should run. I also had to install psycopg2-binary using pip install psycopg2-binary. I don't know why, as this was not in the video, but once I did, everything ran.
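
The download step looks something like this (copy the exact .gz link from the releases page; the URL below follows the usual GitHub release pattern and may change):

wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
gunzip yellow_tripdata_2021-01.csv.gz
pip install psycopg2-binary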

Finally, we'll install terraform. Google "download terraform" and go to https://developer.hashicorp.com/terraform/install. We will just download the binary. Find the Linux binaries and ctrl-click on the AMD64 download to get the link. In your terminal window for the VM, go to the ~/bin directory and type wget https://releases.hashicorp.com/terraform/1.7.0/terraform_1.7.0_linux_amd64.zip. To unpack it you will need unzip, so type sudo apt-get install unzip and then run unzip on the downloaded file. Terraform is already executable, and we have bin in the path from before, so you can type terraform -version and see the version number.
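
Put together (substitute whatever link and version you copied from the install page):

cd ~/bin
wget https://releases.hashicorp.com/terraform/1.7.0/terraform_1.7.0_linux_amd64.zip
sudo apt-get install unzip
unzip terraform_1.7.0_linux_amd64.zip
rm terraform_1.7.0_linux_amd64.zip   # optional: remove the zip once it's unpacked
terraform -version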

If you watched the data-engineering-zoomcamp videos for running terraform on your computer, you know that you should have created a service account on Google Cloud and that you need json credentials to authorize it. You will need to transfer the json file to the VM. For this we'll use sftp. On a terminal window on your computer, navigate to the directory where the .json file lives (not the VM), and type sftp de-zoomcamp. Now type

mkdir .gc
cd .gc
put <filename>.json

where <filename>.json is the name of your key file. (Mine is keys.json.) Remember, the key in the json file is secret, so don't show it to anyone.

Now, in the VM terminal window, authorize the service account on Google Cloud. First define the credentials, then activate the service account with the gcloud SDK.

export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/keys.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS

Now you can go to the terraform folder in the course repo and run the commands "terraform init", "terraform plan", "terraform apply" and "terraform destroy". You will have to change the variables to your own service account. I won't go over how to do that here as it's covered in the terraform videos.
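
The sequence, run from inside the terraform directory of the repo (the exact path is an assumption here and may differ; look for the terraform folder under the week 1 material):

cd ~/data-engineering-zoomcamp/01-docker-terraform/1_terraform_gcp/terraform
terraform init
terraform plan
terraform apply
terraform destroy   # when you're done, to tear the resources back down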

To shut down the virtual machine, type sudo shutdown now in the VM terminal. Or you can go to the Google Cloud console, go to "Compute Engine->VM instances" and select the dot menu next to the VM Instance. Choose STOP to stop running the VM Instance. Then you will be charged only for storage. If you don't want to be charged at all, you can choose DELETE, which will remove the instance and everything that you put on it.
