If you're exploring your options for Deep Learning on AWS, you've likely considered using Deep Learning AMIs (Amazon Machine Images) to simplify your setup. Although pre-configured environments can be a good starting point and look like a no-brainer, they have several limitations that will haunt you in the long run.
The Bloatware Problem
Deep Learning Amazon Machine Images (DLAMIs) come pre-installed with a plethora of applications, frameworks, and libraries. You will not need many of them in your production environment, and sometimes not even in development.
Outdated Drivers and Toolkits
Your favorite deep learning framework releases a new version that offers a valuable addition to your application. You are eager to start using it, only to discover that the toolkit and drivers baked into your AMI are outdated. Now you're locked into older drivers, an older toolkit, and an older framework, and you miss out on the new feature you were so excited about.
Dependency Hell
Installing required modules or libraries for your application can be challenging with these DLAMIs. You may encounter an issue where the module you are attempting to install requires version 2 of XYZ, but you only have version 1.5 installed. The fix should be as simple as updating XYZ. However, upon attempting that, you find that another application, ABC, requires the old version of XYZ. When you try to remove ABC, which your application does not need, yet another application turns out to depend on it, and the chain of dependencies seems never-ending.
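One common way to sidestep this chain on a bare-bones AMI is to give each application its own isolated Python environment, so upgrading a library for one app cannot break another app pinned to an older version. A minimal sketch (paths and names here are illustrative, not from the script):

```shell
# Each application gets its own virtual environment; packages installed
# inside it never conflict with the system Python or other apps' envs.
ENV_DIR="$(mktemp -d)/myapp"
python3 -m venv "$ENV_DIR"       # fresh, empty environment
. "$ENV_DIR/bin/activate"
python3 -m pip --version         # this pip installs only into $ENV_DIR
deactivate
```

Inside the environment, you are free to install XYZ version 2 without touching whatever XYZ version ABC needs system-wide.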
Limited Architecture Support
Suppose you want to leverage cost-effective instances like g5g.xlarge for deep learning inference. In that case, you're out of luck: few Deep Learning AMIs support them, and the ones that do ship an older OS or outdated build tools. For ARM-based instances especially, your choices are minimal. For example, the only DLAMI available for this instance family is the NVIDIA DLAMI, built on top of the older Ubuntu 20.04.
Solution
Frustrated with these limitations ourselves, we've developed an automated, customizable script that sets up a high-performing deep learning environment on AWS EC2. The script downloads the latest NVIDIA drivers, CUDA 12.2, and the cuDNN library, and uses the latest Amazon Linux 2023 as its base AMI. That base offers several advantages, including support for the Linux 6 kernel and more recent versions of GCC and other build tools and utilities. The script then clones PyTorch and compiles it from source to ensure you have the latest CUDA device support.
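To make the shape of such a setup concrete, here is a rough sketch of the provisioning steps a script like this automates on Amazon Linux 2023. The CUDA runfile name and the exact commands are assumptions for illustration, not the actual script; the sketch only prints its plan unless you set APPLY=1:

```shell
# Dry-run helper: echo each step unless APPLY=1 is set in the environment.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

run sudo dnf groupinstall -y "Development Tools"     # GCC and other build tools
run sudo dnf install -y "kernel-devel-$(uname -r)"   # headers for the driver build
# NVIDIA driver + CUDA 12.2 toolkit via the combined runfile installer
# (runfile name is an assumption based on the CUDA 12.2.0 release)
run wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
run sudo sh cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit
# cuDNN is a separate download from NVIDIA's redistributable archive
run git clone --recursive https://github.com/pytorch/pytorch   # then build from source
```

Because every step is scripted, the environment is reproducible: you can rebuild it on a fresh instance, or a newer instance family, whenever drivers or toolkits move on.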
Performance and Cost Benefits
The customization allows for a lean, performance-optimized setup with a minimal footprint.
Because the script compiles PyTorch from source after cloning it from the official repository, the build can apply hardware-specific optimizations and use up-to-date code, resulting in better performance and security compared to the pre-built PyTorch packages.
And if your compute tasks can tolerate interruptions, or you can design your application with failover in mind, you can take advantage of Spot Instances, which are incredibly cost-effective at as low as $0.152 per hour these days.
Want the Full Step-By-Step Guide? Dive In Here!
For a comprehensive guide addressing these problems and access to this game-changing script, check out our complete guide at Deep Learning on AWS Graviton2, NVIDIA Tensor T4G for as Low as Free with CUDA 12.2.
Resources
Dark Truth about AWS Deep Learning AMIs: The Silent Roadblock to Your Success! Learn How to Break Free at https://mirzabilal.com/why-your-aws-deep-learning-ami-is-holding-you-back-and-how-to-fix.
Deep Learning with AWS Graviton2 and Nvidia Tensor T4g: Discover Cutting-Edge Insights! Read this article and more at https://mirzabilal.com/deep-learning-with-aws-graviton2-nvidia-tensor-t4g-for-as-low-as-free-with-cuda-12-2-56d8457a6f6d.
Building FFmpeg from Source: Master the Compilation Process! Follow the detailed instructions on https://mirzabilal.com/how-to-install-ffmpeg-on-linux-from-source.
CPU vs GPU Benchmark for Video Transcoding on AWS: Debunking the CPU-GPU myth! See for yourself at https://mirzabilal.com/cpu-vs-gpu-for-video-transcoding-challenging-the-cost-speed-myth
Unlock GPU Acceleration for FFmpeg: Harness Hardware Power on AWS! Get the guide at https://mirzabilal.com/how-to-install-ffmpeg-with-harware-accelaration-on-aws.
Finding and Downloading Nvidia GPU Drivers: Get Started with Your GPU Journey! Visit NVIDIA.
Download CUDA Toolkit: https://developer.nvidia.com/cuda-toolkit.
Download cuDNN: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa.
Explore AWS Instance Prices and Specs: https://instances.vantage.sh/?selected=g5g..x|g5g.x.