Running language models on AWS

#devops #opensource #aws #ai

The goal

Hi all! In my previous post I wrote about my experience running a LLaMA model on my personal MacBook. Soon I wanted more than that, so I decided to spin up an instance on AWS to try out more powerful models there! Here is a complete guide that should work for everyone. It will cost something, but should fit a very tight budget (around 10 bucks for one try). The guide assumes only very basic knowledge of the cloud stack or AWS; nevertheless, I tried to make it complete enough for everyone.

AWS account

If you don't have an AWS account, it's pretty straightforward to create one. Just click Sign Up on the AWS home page and follow the instructions. Then sign in to the console and you should see the console home page. I advise you to set up MFA for your root account immediately after creating it. Follow the instructions provided here

Creating an EC2 instance

When you are on the home page, search for "EC2" in the bar at the very top. That's a weird (like everything else) name for "virtual machines" on AWS.

Then go to Instances in the menu on the left and click the orange Launch Instances button on the right.

You should see something like this:

Here you should input the following details:

Name and tags: any name you prefer
Application and OS Images: select Ubuntu, then in the drop-down menu select Ubuntu Server 22.04 or 20.04. Make sure you choose 64-bit (x86)
Instance type: this is one of the most important parts, where you select the CPU and RAM for your machine. I use r5d.2xlarge with 8 vCPUs and 64 GB of RAM (which should be enough even for very large models), but you can select anything you want. Just make sure that the type has enough resources to run whatever you want to run
Key pair: this is for the SSH connection. You can select an existing one or create a new one. Creating one downloads the private key to your computer; later you can use it to access the machine remotely via SSH
Network settings: For the sake of simplicity, you can leave the default settings, like this:

Of course, if you understand what you are doing, feel free to implement routing and security rules to your liking

Configure storage: you can expand your root volume to make it more spacious, or attach a new volume. I use a separate volume with 150 GB. Keep in mind that volumes cost something like $0.08 per GB per month. Before proceeding, please check the exact pricing for your type of volume
Advanced details: the most interesting part. Here you can check "Request Spot Instances". This is a very cheap option for renting unused AWS capacity (up to a 90% discount). It comes with a small disadvantage: the provisioned resources can be revoked at any time, and your instance will be shut down. In reality, this happens very rarely, so if you don't plan on running the instance 24/7, feel free to choose this option. Here you can see my setup:

Make sure the request type is set to Persistent; otherwise you won't be able to stop the instance.
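To put the storage price in perspective, here is the arithmetic for the 150 GB volume as a small shell sketch. The $0.08 rate is only the ballpark figure quoted above, not an official price, and monthly_cost is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough monthly storage cost: size (GB) x price per GB-month.
# Both numbers are the ballpark figures from this post; check current pricing.
volume_gb=150
price_per_gb_month=0.08

monthly_cost() {
  # awk does the floating-point multiplication that plain bash cannot.
  awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

echo "Storage: \$$(monthly_cost "$volume_gb" "$price_per_gb_month") per month"
```

With these numbers the volume comes to about $12 per month, and storage is billed whether or not the instance is running.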

Aaaand, that's it! Just click Launch instance, and in a minute or two your VM will be ready!
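If you prefer the command line, roughly the same launch can be expressed with the AWS CLI. This is only a sketch: it assumes the AWS CLI is installed and configured, the AMI ID and key pair name are placeholders, and it covers just the image, instance type, key pair and spot options (storage and networking are left at their defaults). By default it only prints the command.

```shell
#!/usr/bin/env bash
# Sketch: the console choices above expressed as one AWS CLI call.
# Prints the command by default; set DRY_RUN=0 to really launch (and pay for) it.
DRY_RUN="${DRY_RUN:-1}"

# Placeholders (assumptions): look up the Ubuntu AMI ID for your region
# and use the name of your own key pair.
ami_id="ami-xxxxxxxxxxxxxxxxx"
key_name="my-key"

launch_instance() {
  # Persistent spot request that stops (rather than terminates) on interruption.
  local spot='MarketType=spot,SpotOptions={SpotInstanceType=persistent,InstanceInterruptionBehavior=stop}'
  set -- aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type r5d.2xlarge \
    --key-name "$key_name" \
    --instance-market-options "$spot"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

launch_instance
```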

We are almost ready. The initial goal was running llama.cpp-compatible models, so let's get there.

Configuring the VM

By default, the VM is pretty clean and empty, so we have to install some tools before we can work with it.

First, we need to get into the machine. There are various ways to do that; we will use the easiest one: EC2 Instance Connect.

Click on your newly created instance and then click Connect on the right. You will see a window with different connection methods; click Connect under the EC2 Instance Connect tab. That's it: a terminal window with an SSH session will load.
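If you would rather connect from your own terminal, the key pair downloaded during launch can be used instead. A minimal sketch, assuming an Ubuntu image (whose default user is ubuntu) and placeholder values for the key path and address; ssh_command is a hypothetical helper that only prints the command to run:

```shell
#!/usr/bin/env bash
# Sketch: build the ssh command for an Ubuntu instance.
# "ubuntu" is the default user on Ubuntu images; key path and host are placeholders.
ssh_command() {
  local key_file="$1" host="$2"
  printf 'ssh -i %s ubuntu@%s\n' "$key_file" "$host"
}

# ssh refuses private keys that other users can read, so restrict the file first:
#   chmod 400 my-key.pem
ssh_command "my-key.pem" "203.0.113.10"
```

Replace 203.0.113.10 with the public IP or DNS name shown on the instance page.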

If you attached a separate storage volume (not the root volume), you have to format it, mount it, and make the mount persist across reboots:

Find the device name of the NVMe drive by running this in the shell:

lsblk

Format it with the following command (all data on this drive will be lost), replacing /dev/nvme1n1 with your actual volume name:

sudo mkfs -t ext4 /dev/nvme1n1

Create a directory (here it's /data) and mount the drive:

sudo mkdir /data 
sudo mount /dev/nvme1n1 /data

To make the mount persist across reboots, add the following entry to /etc/fstab:

/dev/nvme1n1 /data ext4 defaults,nofail 0 0

For example, open the file in nano:

sudo nano /etc/fstab

Then paste /dev/nvme1n1 /data ext4 defaults,nofail 0 0 on a new line at the end of the file, press Ctrl+X, type Y, and press Enter to confirm the file name. This saves and closes the file.

Validate the /etc/fstab entries:

sudo mount -a

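It should not produce any errors. If it does, remove the line you just added (not the rest of the file) and troubleshoot before rebooting, because a broken /etc/fstab can keep the instance from booting. But usually, it should be fine.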
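One caveat: the entry above refers to the drive by device name, and NVMe device names such as /dev/nvme1n1 are not guaranteed to stay the same across reboots. Referring to the filesystem by its UUID is a more robust variant. A minimal sketch, where fstab_line is a hypothetical helper and the UUID is a made-up example:

```shell
#!/usr/bin/env bash
# Sketch: build an /etc/fstab line keyed on the filesystem UUID instead of
# the device name. fstab_line is a hypothetical helper, not a system tool.
fstab_line() {
  local uuid="$1" mount_point="$2"
  printf 'UUID=%s %s ext4 defaults,nofail 0 0\n' "$uuid" "$mount_point"
}

# On the VM, read the real UUID of the formatted drive with:
#   sudo blkid -s UUID -o value /dev/nvme1n1
# The value below is a made-up example.
fstab_line "0a1b2c3d-1111-2222-3333-444455556666" /data
```

The printed line can be used in /etc/fstab in place of the /dev/nvme1n1 entry.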

Now you have your storage mounted at /data. By default only the root user has write access to it, so we make it writable for everyone:

sudo chmod 777 /data

We will need some build tools to compile llama.cpp, so run this:

sudo apt update
sudo apt install -y build-essential
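To confirm that the compiler toolchain is in place, a quick check along these lines can help. It is only a sketch: missing_tools is a hypothetical helper, and the list assumes the usual contents of build-essential on Ubuntu (gcc, g++ and make).

```shell
#!/usr/bin/env bash
# Sketch: report which of the given commands are missing from PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# gcc, g++ and make are pulled in by build-essential on Ubuntu.
missing=$(missing_tools gcc g++ make)
if [ -z "$missing" ]; then
  echo "build tools are installed"
else
  # Unquoted on purpose so the names print on one line.
  echo "missing:" $missing
fi
```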

We are done! You now have a complete virtual machine ready to run almost any large language model. Don't forget to stop your instance when you are done working with it. You are charged only for running time and storage.
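Stopping can also be done from the command line. A small sketch, assuming the AWS CLI is installed and configured; the instance ID is a placeholder, and by default the script only prints the command:

```shell
#!/usr/bin/env bash
# Sketch: stop the instance from the command line instead of the console.
# Prints the command by default; set DRY_RUN=0 to really call the AWS CLI.
DRY_RUN="${DRY_RUN:-1}"
instance_id="i-0123456789abcdef0"  # placeholder, copy yours from the Instances page

stop_instance() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "aws ec2 stop-instances --instance-ids $1"
  else
    aws ec2 stop-instances --instance-ids "$1"
  fi
}

stop_instance "$instance_id"
```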

If you want a fixed DNS name and IP address, use an Elastic IP. Otherwise, both change every time you stop and start the instance.

Now all you have to do is to clone llama.cpp, compile it, download the model to your VM and start chatting! To do this, you can follow the steps outlined in my previous blog post.

If you have any questions or issues, feel free to ask in the comment section!