
Hideki Okamoto


Cloud GPU instance with PyTorch and TensorFlow easy setup in 10 minutes

(10/29/2024 Update) This article has been updated to reflect the release of the RTX 4000 Ada GPU instance. Some screenshots and descriptions still reference the older RTX 6000 GPU, so interpret them accordingly.

PyTorch from Meta (formerly Facebook) and TensorFlow from Google are highly popular deep learning frameworks. GPUs are essential for developing and training deep learning models, but building an environment that makes a GPU available to these frameworks, and that lets you use both PyTorch and TensorFlow side by side, is time-consuming. This article shows how to set up an environment with GPU-enabled PyTorch and TensorFlow on Akamai Connected Cloud (formerly Linode) in 10 minutes. The procedure makes it easy to set up a dedicated deep learning environment in the cloud, even if you are unfamiliar with setting up a Linux server.

What is Akamai Connected Cloud (formerly Linode)?

Akamai Connected Cloud (ACC, formerly Linode) is an IaaS platform that Akamai acquired in February 2022. ACC offers simple, predictable pricing: the price of each plan includes SSD storage and a fixed allowance of network transfer, both of which are often expensive with other cloud providers. Prices do not differ by region, which makes cloud spending easier to forecast. For example, a virtual machine with 16GB of memory and an NVIDIA RTX 4000 Ada costs $350 per month (as of October 2024), including 500GB of SSD storage and 1TB of network transfer. If you use an instance for only part of a month, hourly billing means you are charged only for the time the machine exists on your account.
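To make the hourly billing idea concrete, here is a minimal sketch in Python. The $350 monthly price comes from this article; the hourly rate of $0.52 and the rule that the charge never exceeds the monthly price are assumptions for illustration, so check the current pricing page before relying on them.

```python
def estimate_charge(hours_used: float, hourly_rate: float, monthly_cap: float) -> float:
    """Estimate the charge in dollars for one billing month.

    Assumption: usage is billed per hour and capped at the monthly price.
    """
    if hours_used < 0:
        raise ValueError("hours_used must be non-negative")
    return round(min(hours_used * hourly_rate, monthly_cap), 2)


MONTHLY_PRICE = 350.0        # from the article (as of October 2024)
ASSUMED_HOURLY_RATE = 0.52   # hypothetical value, not taken from the article

# A short experiment of 8 hours versus leaving the instance up all month.
print(estimate_charge(8, ASSUMED_HOURLY_RATE, MONTHLY_PRICE))
print(estimate_charge(1000, ASSUMED_HOURLY_RATE, MONTHLY_PRICE))
```

Under these assumptions, a short training run costs a few dollars, which is why deleting the instance after use matters.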

You can use the Cloud Estimator tool provided by ACC to compare prices with other cloud providers.

The goal of this article

In this article, we will set up Docker Engine Utility for NVIDIA (nvidia-docker), a container virtualization platform that supports NVIDIA GPUs, on a GPU instance on ACC. On top of it, we will deploy NGC Containers, the deep learning containers officially provided by NVIDIA. By using StackScripts, ACC's deployment automation feature, you can set up the environment in about 10 minutes with almost no prior knowledge of ACC, Docker, or NGC Containers.

Tools - StackScripts

NVIDIA Container Toolkit

The environment built with this procedure includes a sample Jupyter Notebook for OpenAI Whisper, a speech recognition model that has been widely praised for its high recognition accuracy. With it, even those who do not develop deep learning models themselves can experience the benefits of GPU instances.

Voice Recognition with OpenAI Whisper

If you provide Object Storage credentials to the StackScript, the PyTorch and TensorFlow containers will automatically mount the external Object Storage, which you can use to retrieve training data or to store your deep learning models. Using Object Storage is optional, so you can skip it.

Set up a GPU instance with PyTorch and TensorFlow

First, open the StackScript I have prepared for you from the following link. This StackScript will automatically install nvidia-docker, PyTorch, and TensorFlow. (You must be logged into your ACC account to access the link.) If you can't open this StackScript for some reason, I have uploaded the contents of this StackScript to GitHub for you.

deeplearning-gpu
https://cloud.linode.com/stackscripts/1102035

Click "Deploy New Linode"

StackScripts have a feature called UDF (User Defined Fields) that automatically creates an input form for the parameters required for deployment. This StackScript asks for the login credentials of a non-root user who can SSH into the virtual machine and, optionally, an Access Key for mounting Object Storage as external storage. If you want to mount Object Storage, create a bucket and obtain an Access Key in advance.

The regions where both Object Storage and RTX 4000 Ada GPU instances are available are as follows as of October 2024.

  • Seattle, WA, US
  • Chicago, IL, US
  • Paris, FR
  • Osaka, JP


Since GPU instances are available only in a limited number of regions, select the virtual machine type first, then the region. In this example, I selected Dedicated 32 GB + RTX6000 GPU x1 in the Singapore region.

Name the virtual machine, enter the root password, and click "Create Linode".

The screen will switch to the virtual machine management dashboard. Wait a few minutes until the virtual machine status changes from PROVISIONING to RUNNING. The IP address of the virtual machine you just created is displayed on the same screen, so take note of it.

The virtual machine is now booted. The installation of nvidia-docker and the NGC Containers proceeds automatically in the background. Wait about 10 minutes for the installation to complete before proceeding to the next step.

Starting a container

Now let's log in to the virtual machine via SSH. If the setup process performed by StackScript is complete, the following message will appear when you log in. If you do not see this message, log out and wait a few minutes before logging in again. If you have inadvertently started a virtual machine that does not have a GPU, you will get the message "GPU is not available. This StackScript should be used for GPU instances." In that case, please start a GPU instance and redo the procedure from the beginning.

% ssh root@45.118.XX.XX
root@45.118.XX.XX's password:

(snip)

##############################################################################
You can launch a Docker container with each of the following commands:

pytorch: Log into an interactive shell of a container with Python and PyTorch.
tensorflow: Log into an interactive shell of a container with Python and TensorFlow.
pytorch-notebook: Start Jupyter Notebook with PyTorch as a daemon. You can access it at http://[Instance IP address]/
tensorflow-notebook: Start Jupyter Notebook with TensorFlow as a daemon. You can access it at http://[Instance IP address]/

Other commands:
stop-all-containers: Stop all running containers.
##############################################################################

The following five commands are available on the machine created by this StackScript.

Command               Usage
--------------------  ---------------------------------------------------------------------------
pytorch               Start a container with PyTorch installed and enter its interactive shell
tensorflow            Start a container with TensorFlow installed and enter its interactive shell
pytorch-notebook      Start Jupyter Notebook with PyTorch installed as a daemon
tensorflow-notebook   Start Jupyter Notebook with TensorFlow installed as a daemon
stop-all-containers   Stop all running containers
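Once you are inside a container started with the pytorch or tensorflow command, a quick way to confirm that the framework can see the GPU is a short Python check. The following is a sketch, not part of the StackScript; the function name gpu_report is hypothetical. It reports None for a framework that is not installed in the current container, which is expected because each container ships only one of the two.

```python
import importlib


def gpu_report() -> dict:
    """Report whether each framework can see a GPU (None = not installed)."""
    report = {}
    try:
        torch = importlib.import_module("torch")
        report["pytorch"] = bool(torch.cuda.is_available())
    except ImportError:
        report["pytorch"] = None
    try:
        tf = importlib.import_module("tensorflow")
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        report["tensorflow"] = None
    return report


print(gpu_report())
```

A value of False for the installed framework usually means the container was started without GPU access or the instance has no GPU.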

Each container has the directories /workspace/HOST-VOLUME/ and /workspace/OBJECT-STORAGE/, which mount a host machine directory and the external Object Storage respectively. Containers started with the above commands are removed when they stop (the --rm option of docker run is set), so place any files you want to keep in /workspace/HOST-VOLUME/ or /workspace/OBJECT-STORAGE/.
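Because anything written elsewhere in the container disappears when it stops, it can help to resolve the output directory explicitly in your own scripts. The sketch below assumes the two mount points named above; the helper pick_output_dir and the file name notes.txt are hypothetical. Outside the container it falls back to a temporary directory so the code still runs.

```python
import os
import tempfile
from pathlib import Path


def pick_output_dir(candidates=("/workspace/HOST-VOLUME", "/workspace/OBJECT-STORAGE")) -> Path:
    """Return the first existing, writable candidate directory.

    Falls back to a fresh temporary directory, which is NOT persistent.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir() and os.access(path, os.W_OK):
            return path
    return Path(tempfile.mkdtemp(prefix="not-persistent-"))


out_dir = pick_output_dir()
(out_dir / "notes.txt").write_text("saved outside the container's writable layer\n")
print(out_dir)
```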
Let's start Jupyter Notebook with PyTorch as a daemon and run OpenAI Whisper, a speech recognition model. Run the pytorch-notebook command from the console.

root@45-118-XX-XXX:~# pytorch-notebook
[I 04:36:22.823 NotebookApp] http://hostname:8888/?token=0ee3290287b3bd90f2e8e3ab447965d3e074267f0d60420b
        http://hostname:8888/?token=0ee3290287b3bd90f2e8e3ab447965d3e074267f0d60420b

Jupyter Notebook should now be running. If you get the error "Bind for 0.0.0.0:80 failed: port is already allocated.", stop the existing container first with the stop-all-containers command. If you get the above output without any problem, replace hostname in the URL with the IP address you noted when creating the virtual machine, delete :8888, and open the resulting URL in a web browser. The token changes each time the container is started.
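The URL rewrite described above (swap hostname for the instance IP and drop :8888) can be sketched with Python's standard library. The function name browser_url is hypothetical, and 203.0.113.10 is a documentation-only address standing in for your instance's IP.

```python
from urllib.parse import urlsplit


def browser_url(jupyter_url: str, instance_ip: str) -> str:
    """Replace host and port with the instance IP, keeping path and token."""
    parts = urlsplit(jupyter_url)
    # Setting netloc to the bare IP drops ":8888", so the browser uses port 80.
    return parts._replace(netloc=instance_ip).geturl()


print(browser_url("http://hostname:8888/?token=EXAMPLE", "203.0.113.10"))
```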

Click on Voice Recognition with OpenAI Whisper.ipynb in HOST-VOLUME to open it.
Click Cell->Run All in the menu to run OpenAI Whisper. The first time you run it, it will take a few minutes to download dependent software and deep learning models.

If the execution completes without problems, the last cell will show the result of the speech recognition: "I'm getting them for $12 a night."
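If you want to try Whisper on your own audio outside the sample notebook, a minimal sketch with the openai-whisper Python package might look like the following. This is not the notebook's code; the function name transcribe_file, the model size "base", and the example file name are assumptions.

```python
import os


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one audio file with OpenAI Whisper and return the text."""
    if not os.path.isfile(audio_path):
        raise FileNotFoundError(audio_path)
    # Imported lazily so the path check above works without the package.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)   # downloads the model on first use
    result = model.transcribe(audio_path)
    return result["text"]


# Example usage (hypothetical file name):
# print(transcribe_file("/workspace/HOST-VOLUME/sample.wav"))
```

Larger model sizes generally trade speed for accuracy, so "base" is only a starting point.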

Congratulations! You now have GPU-enabled PyTorch and TensorFlow.

Deleting the instance

When you have finished with the virtual machine, you can delete it by clicking "Delete" in the ACC Management Console. The contents of /workspace/HOST-VOLUME/ (/root/shared/ on the host OS) will be deleted along with it, so move any files you want to keep to another location first.

You are charged even for powered-off virtual machines. Delete virtual machines that you do not want to be charged for.

Platform - Billing

Access control for the instance

SSH access to the virtual machines created with the above procedure requires password or public key authentication, and access to Jupyter Notebook requires token authentication. If you also want to restrict access by client IP address, refer to the following article to apply a firewall to port 22 (SSH) and port 80 (HTTP).
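Before writing firewall rules, it can be useful to check which client addresses a set of allowed CIDR ranges would actually admit. The sketch below only illustrates the matching logic with Python's ipaddress module; it does not call the Cloud Firewall API, and the function name and example ranges are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Documentation-only address ranges used as stand-ins for an office network.
office_ranges = ["203.0.113.0/24", "2001:db8::/32"]
print(is_allowed("203.0.113.7", office_ranges))
print(is_allowed("198.51.100.9", office_ranges))
```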

Cloud Firewall - Get Started

For more advanced access control, Akamai's zero-trust solution, Enterprise Application Access, can be used for integration with external Identity Providers and SSO support.

Enabling HTTPS

Follow the steps below to enable HTTPS in Jupyter Notebook for production use.

Running a public notebook server

The five commands listed above are defined as aliases for docker commands in /root/.bash_profile. When HTTPS is enabled, the argument of the -p option of the docker command used by the pytorch-notebook and tensorflow-notebook commands should also be modified to the appropriate port such as 443. And finally, execute ufw allow 443/tcp so that the firewall allows port 443.
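To illustrate the port change, here is a sketch that assembles a docker run argument list with the host port as a parameter. It does not reproduce the actual aliases in /root/.bash_profile: the function name, the IMAGE:TAG placeholder, and the exact set of options are assumptions, based only on the details mentioned in this article (--rm, Jupyter listening on 8888, and /root/shared mounted as /workspace/HOST-VOLUME).

```python
def notebook_run_args(image: str, host_port: int, container_port: int = 8888,
                      host_dir: str = "/root/shared") -> list:
    """Build a docker run argument list that publishes Jupyter on host_port."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-p", f"{host_port}:{container_port}",      # host port : container port
        "-v", f"{host_dir}:/workspace/HOST-VOLUME",  # keep files on the host
        image,
    ]


# IMAGE:TAG is a placeholder; use the image name from your own alias.
print(" ".join(notebook_run_args("IMAGE:TAG", 443)))
```

Changing the published port alone does not enable TLS; Jupyter itself still needs a certificate configured as described in the linked guide.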
