Hyunho Richard Lee for Meadowrun


Run Your Own DALL·E Mini (Craiyon) Server on EC2

In case you’ve been under a rock for the last few months, DALL·E is an ML model from OpenAI that generates images from text prompts. DALL·E Mini (renamed to Craiyon) by Boris Dayma et al. is a less powerful but open version of DALL·E, and there’s a hosted version at craiyon.com for everyone to try.

If you’re anything like us, though, you’ll feel compelled to poke around the code and run the model yourself. We’ll do that in this article using Meadowrun, an open-source library that makes it easy to run Python code in the cloud. For ML models in particular, we just added a feature for requesting GPU machines in a recent release. We’ll also feed the images generated by DALL·E Mini into additional image processing models (GLID-3-xl and SwinIR) to improve the quality of our generated images. Along the way we’ll deal with the speedbumps that come up when running open-source ML models on EC2.

Running dalle-playground

For the first half of this article, we’ll show how to run saharmor/dalle-playground, which wraps the DALL·E Mini code in an HTTP API and provides a simple web page to generate images via that API.

dalle-playground provides a Jupyter notebook that you can run in Google Colab. If you’re doing anything more than kicking the tires, though, you’ll run into the dynamic usage limit in Colab’s free tier. You could upgrade to Colab Pro ($9.99/month) or Colab Pro+ ($49.99/month), but we’ll get this functionality for pennies on the dollar by using AWS directly!

Prerequisites

First, you’ll need an AWS account. If you’ve never used GPU instances in AWS before, you’ll probably need to increase your quotas. AWS accounts have quotas in each region that limit how many CPUs of a particular instance type you can run at once. There are 4 quotas for GPU instances:

  • Running On-Demand G and VT instances
  • Running On-Demand P instances
  • All G and VT Spot Instance Requests
  • All P Spot Instance Requests

These are all set to 0 for a new AWS account, so if you try to run the code below, you’ll get this message from Meadowrun:

Unable to launch new g4dn.xlarge spot instances due to the L-3819A6DF
quota which is set to 0. This means you cannot have more than 0 CPUs
across all of your spot instances from the g, vt instance families.
This quota is currently met. Run `aws service-quotas
request-service-quota-increase --service-code ec2 --quota-code
L-3819A6DF --desired-value X` to set the quota to X, where X is
larger than the current quota. (Note that terminated instances
sometimes count against this limit:
https://stackoverflow.com/a/54538652/908704 Also, quota increases are
not granted immediately.)

If you’re giving this a go, we recommend requesting a quota increase now, either by running the command in that message or by finding the relevant quota in the Service Quotas console (if you use the console, make sure you’re in the same region as your AWS CLI, as given by aws configure get region). AWS seems to have a human in the loop for granting quota increases, and in our experience it can take a day or two for an increase to be granted.

Second, we’ll need a local Python environment with Meadowrun, and then we’ll install Meadowrun in our AWS account. Here’s an example using pip in Linux:

$ python3 -m venv meadowrun-venv
$ source meadowrun-venv/bin/activate
$ pip install meadowrun
$ meadowrun-manage-ec2 install --allow-authorize-ips

Running DALL·E Mini

Now that we have that out of the way, it’s easy to run the dalle-playground backend!

import asyncio
import meadowrun

async def run_dallemini():
    return await meadowrun.run_command(
        "python backend/app.py --port 8080 --model_version mini",
        meadowrun.AllocCloudInstance("EC2"),
        meadowrun.Resources(
            logical_cpu=1,
            memory_gb=16,
            max_eviction_rate=80,
            gpu_memory=4,
            flags="nvidia"
        ),
        meadowrun.Deployment.git_repo(
            "https://github.com/hrichardlee/dalle-playground",
            interpreter=meadowrun.PipRequirementsFile("backend/requirements.txt", "3.9")
        ),
        ports=8080
    )

asyncio.run(run_dallemini())

A quick tour of this snippet:

  • run_command tells Meadowrun to run python backend/app.py --port 8080 --model_version mini on an EC2 instance. This starts the dalle-playground backend on port 8080, using the mini version of DALL·E Mini. The mini version is 27 times smaller than the mega version of DALL·E Mini, which makes it less powerful but easier to run.
  • The next few lines tell Meadowrun what the requirements for our job are: 1 CPU, 16 GB of main memory, and we’re okay with spot instances up to an 80% probability of eviction (aka interruption). The instance types we’ll be using do tend to get interrupted, so if that becomes a problem we can change this to 0% which tells Meadowrun we want an on-demand instance. We also ask for an Nvidia GPU that has at least 4GB of GPU memory which is what’s needed by the mini model.
  • Next, we tell Meadowrun we want the code in the https://github.com/hrichardlee/dalle-playground repo, and that it should construct a pip environment from the backend/requirements.txt file in that repo. We were almost able to use the saharmor/dalle-playground repo as-is, but we had to make one change to add the jax[cuda] package to requirements.txt. In case you haven’t seen it before, jax is a machine-learning library from Google, roughly equivalent to TensorFlow or PyTorch. It combines Autograd for automatic differentiation with XLA (accelerated linear algebra) for JIT-compiling numpy-like code to run on Google’s TPUs or on Nvidia GPUs via CUDA. The CUDA support requires explicitly selecting the [cuda] extra when we install the package.
  • Finally, we tell Meadowrun that we want to open port 8080 on the machine that’s running this job so that we can access the backend from our current IP address. Be careful with this! dalle-playground doesn’t use TLS and it’s not a good idea to give everyone with your IP address access to this interface forever.

To walk through selected parts of the output from this command:

Launched a new instance for the job:
ec2-3-138-184-193.us-east-2.compute.amazonaws.com: g4dn.xlarge (4.0
CPU, 16.0 GB, 1.0 GPU), spot ($0.1578/hr, 61.0% chance of
interruption), will run 1 workers

Here Meadowrun tells us everything we need to know about the instance it started for this job and how much it will cost us (only 15¢ per hour!).

Building python environment in container  eccac6...

Next, Meadowrun is building a container based on the contents of the requirements.txt file we specified. This takes a while, but Meadowrun caches the image in ECR for you so this only needs to happen once (until your requirements.txt file changes). Meadowrun also cleans up the image if you don’t use it for a while.

--> Starting DALL-E Server. This might take up to two minutes.

Here we’ve gotten to the code in dalle-playground, which needs to do a few minutes of initialization.

--> DALL-E Server is up and running!

And now we’re up and running!

Now we’ll need to run the front end on our local machine (if you don’t have npm, you’ll need to install node.js):

git clone https://github.com/saharmor/dalle-playground
cd dalle-playground/interface
npm install
npm start

You’ll want to construct the backend URL in a separate editor, e.g. http://ec2-3-138-184-193.us-east-2.compute.amazonaws.com:8080, and then copy/paste it into the web app. Typing it into the page directly causes unnecessary requests to the partially typed URL, which fail slowly.
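If you’d rather test the backend without the web app, you can also call the API directly. Here’s a minimal sketch; the /dalle route and the request/response shapes reflect our reading of dalle-playground’s backend/app.py, so double-check them against the version you’re running:

import base64
import requests

# The hostname comes from Meadowrun's "Launched a new instance" output
backend_url = "http://ec2-3-138-184-193.us-east-2.compute.amazonaws.com:8080"

# Ask the backend to generate two images for a prompt
response = requests.post(
    f"{backend_url}/dalle",
    json={"text": "batman praying in the garden of gethsemane", "num_images": 2},
)
response.raise_for_status()

# The backend returns a JSON list of base64-encoded images
for i, image_b64 in enumerate(response.json()):
    with open(f"generated_{i}.jpg", "wb") as f:
        f.write(base64.b64decode(image_b64))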

Time to generate some images!

DALL·E Mini (mini version): batman praying in the garden of gethsemane

DALL·E Mini (mini version): olive oil and vinegar drizzled on a plate in the shape of the solar system

It was pretty easy to get this working, but this model isn’t really doing what we’re asking it to do. For the first set of images, we clearly have a Batman-like figure, but he’s not really praying and I’m not sure he’s in the garden of Gethsemane. For the second set of images, it looks like we’re either getting olive oil or a planet, but we’re not getting both of them in the same image, let alone an entire system. Let’s see if the “mega” version of DALL·E Mini can do any better.

Running DALL·E Mega

DALL·E Mega is a larger version of DALL·E Mini, meaning it has the same architecture but more parameters. Theoretically we can just replace --model_version mini with --model_version mega_full in the previous snippet and get the mega version. When we do this, though, the dalle-playground initialization code takes about 45 minutes.

We don’t need any real profiling to figure this one out: if you just kill the process after it’s been running for a while, the stack trace clearly shows that the culprit is the from_pretrained function, which downloads the pretrained model weights from Weights and Biases (aka wandb). Weights and Biases is an MLOps platform that helps you keep track of the code, data, and analyses that go into training and evaluating an ML model; for the purposes of this article, it’s where we go to download pretrained model weights. If we look at the specification for the artifacts we’re downloading from wandb and browse to the web view for the mega version, we can see that the main file we need is about 10GB. And if we ssh into the EC2 instance that Meadowrun creates to run this command and run iftop, we can see that we’re getting a leisurely 35 Mbps from wandb.

We don’t want to wait 45 minutes every time we run DALL·E Mega, and it’s painful to see a powerful GPU machine sipping 35 Mbps off the internet while almost all of its resources sit idle. So we made some tweaks to dalle-playground to cache the artifacts in S3: cache_in_s3.py effectively calls wandb.Api().artifact("dalle-mini/dalle-mini/mega-1:latest").download() and then uploads the artifacts to S3. To follow along, you’ll first need to create an S3 bucket and give the Meadowrun EC2 role access to it:

aws s3 mb s3://meadowrun-dallemini
meadowrun-manage-ec2 grant-permission-to-s3-bucket meadowrun-dallemini

Remember that S3 bucket names need to be globally unique, so you won’t be able to use the exact same name we’re using here.
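For reference, here’s roughly what cache_in_s3.py does. This is a simplified sketch (the real script is in the s3cache branch of our repo; the function name and the assumption that the artifact directory is flat are ours):

import os
import boto3
import wandb

def cache_pretrained_weights(bucket: str, prefix: str) -> None:
    # Download the pretrained DALL·E Mega weights from Weights and Biases.
    # download() returns the local directory the artifact files were written to.
    artifact_dir = wandb.Api().artifact(
        "dalle-mini/dalle-mini/mega-1:latest"
    ).download()

    # Upload each file to S3 so future jobs can pull it from S3 at EC2 speeds
    # instead of from wandb
    s3 = boto3.client("s3")
    for file_name in os.listdir(artifact_dir):
        s3.upload_file(
            os.path.join(artifact_dir, file_name), bucket, f"{prefix}/{file_name}"
        )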

Then we can use Meadowrun to kick off the long-running download job on a much cheaper machine—note that we’re only requesting 2 GB of memory and no GPUs for this job:

import asyncio
import meadowrun

async def cache_pretrained_model_in_s3():
    return await meadowrun.run_command(
        "python backend/cache_in_s3.py --model_version mega_full --s3_bucket meadowrun-dallemini --s3_bucket_region us-east-2",
        meadowrun.AllocCloudInstance("EC2"),
        meadowrun.Resources(logical_cpu=1, memory_gb=2, max_eviction_rate=80),
        meadowrun.Deployment.git_repo(
            "https://github.com/hrichardlee/dalle-playground",
            branch="s3cache",
            interpreter=meadowrun.PipRequirementsFile(
                "backend/requirements_for_caching.txt", "3.9"
            )
        )
    )

asyncio.run(cache_pretrained_model_in_s3())

We’ve also changed the model code to download files from S3 instead of wandb. We’re downloading the files into the special /var/meadowrun/machine_cache folder which is shared across Meadowrun-launched containers on a machine. That way, if we run the same container multiple times on the same machine, we won’t need to redownload these files.
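The download side looks something like this sketch (again simplified, with our own function name; the real code is in the s3cache branch):

import os
import boto3

# This folder is shared across Meadowrun-launched containers on a machine
MACHINE_CACHE = "/var/meadowrun/machine_cache"

def download_if_missing(bucket: str, region: str, key: str) -> str:
    # If a previous container on this machine already downloaded this file,
    # reuse it rather than fetching it from S3 again
    local_path = os.path.join(MACHINE_CACHE, key.replace("/", "_"))
    if not os.path.exists(local_path):
        boto3.client("s3", region_name=region).download_file(bucket, key, local_path)
    return local_path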

Once that’s in place, we can run the mega version and it will start up relatively quickly:

import asyncio
import meadowrun

async def run_dallemega():
    return await meadowrun.run_command(
        "python backend/app.py --port 8080 --model_version mega_full --s3_bucket meadowrun-dallemini --s3_bucket_region us-east-2",
        meadowrun.AllocCloudInstance("EC2"),
        meadowrun.Resources(
            logical_cpu=1,
            memory_gb=32,
            max_eviction_rate=80,
            gpu_memory=12,
            flags="nvidia"
        ),
        meadowrun.Deployment.git_repo(
            "https://github.com/hrichardlee/dalle-playground",
            branch="s3cache",
            interpreter=meadowrun.PipRequirementsFile("backend/requirements.txt", "3.9")
        ),
        ports=8080
    )

asyncio.run(run_dallemega())

A few things to note about this snippet:

  • We’re asking Meadowrun to use the s3cache branch of our git repo, which includes the changes to allow caching/retrieving the artifacts in S3.
  • We’ve increased the requirements to 32 GB of main memory and 12 GB of GPU memory, which the larger model requires.
  • The first time we run, Meadowrun builds a new image because we added the boto3 package for fetching our cached files from S3.

One last note—Meadowrun’s install sets up an AWS Lambda that runs periodically and cleans up your instances automatically if you haven’t run a job for a while. To be extra safe, you can also manually clean up instances with:

meadowrun-manage-ec2 clean

Here’s what we get:

DALL·E Mega (full version of DALL·E Mini): batman praying in the garden of gethsemane

DALL·E Mega (full version of DALL·E Mini): olive oil and vinegar drizzled on a plate in the shape of the solar system

Much better! For the first set of images, I’m not sure Batman is praying in all of them, but he’s definitely Batman and he’s definitely in the garden of Gethsemane. For the second set, we have a plate now, some olive oil and vinegar, and it definitely looks like more of a solar system. The images aren’t quite on par with OpenAI’s DALL·E yet, but they are noticeably better! Unfortunately there’s not much more we can do to improve the translation of text to image short of training our own 12-billion-parameter model, but we’ll try tacking on a diffusion model to improve the finer details in the images. We’ll also add a model for upscaling the images, as they’re only 256x256 pixels right now.

Building an image generation pipeline

For the second half of this article, we’ll use meadowdata/meadowrun-dallemini-demo which contains a notebook for running multiple models as sequential batch jobs to generate images using Meadowrun. The combination of models is inspired by jina-ai/dalle-flow.

  • DALL·E Mini: The model we’ve been focusing on in the first half of this article. This post is a good guide to how OpenAI’s DALL·E 2 is built. To simplify, DALL·E is a combination of two models. The first model is trained on images and learns how to “compress” images to vectors and then “decompress” those vectors back into the original images. The second model is trained on image/caption pairs and learns how to turn captions into image vectors. After training, we can put new captions into the second model to produce an image vector, and then we can feed that image vector into the first model to produce a novel image.
  • GLID-3-xl: A diffusion model. Diffusion models are trained by taking images, blurring (aka diffusing) them, and training the model on original/blurred image pairs. The model learns to reconstruct the original unblurred version from the blurred version. Diffusion models can be used for a variety of tasks, but in this case we’ll use GLID-3-xl to fill in the finer details in our images.
  • SwinIR: A model for upscaling images (aka image restoration). Image restoration models are trained by taking images and downscaling them. The model learns to produce the original higher resolution image from the downscaled image.

To run this pipeline, in addition to the prerequisites from the first half of this article, we’ll get the meadowrun-dallemini-demo git repo and the local dependencies, then launch a Jupyter notebook server:

git clone https://github.com/meadowdata/meadowrun-dallemini-demo
cd meadowrun-dallemini-demo
# assuming you are already in a virtualenv from before
pip install -r local_requirements.txt
jupyter notebook

We’ll then need to open the main notebook in Jupyter, and edit S3_BUCKET_NAME and S3_BUCKET_REGION to match the bucket we created in the first half of this article.
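For example, with the bucket we created in the first half (your values will differ, since bucket names are globally unique):

# Near the top of the notebook, point these at your bucket
S3_BUCKET_NAME = "meadowrun-dallemini"
S3_BUCKET_REGION = "us-east-2"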

The code in the notebook is similar to the first half of this article so we won’t go over it in depth. A few notes on what the rest of the code in the repo is doing:

  • We’ve adapted the sample code that comes with all of our models to use our S3 cache and provide easy-to-use interfaces in dalle_wrapper.py, glid3xl_wrapper.py, and swinir_wrapper.py.
  • We’re referencing our three models directly as git repos (because they’re not available as packages on PyPI) in model_requirements.txt, but we had to make a few changes to make these repos work as pip packages. Pip looks for a setup.py file in the git repo to figure out which files from the repo need to be installed into the environment, as well as what the dependencies of that repo are. GLID-3-xl and latent-diffusion (another diffusion model that GLID-3-xl depends on) had setup.py files that needed tweaks to include all of the code needed to run the models. SwinIR didn’t have a setup.py file at all, so we added one (there’s a minimal sketch after this list). Finally, all of these setup.py files needed additional dependencies, which we just added to the model_requirements.txt file.
  • All of these models are pretty challenging to run on anything other than Linux, which is why we’ve split out the local_requirements.txt from the model_requirements.txt. Even if you’re running on Windows or Mac, you shouldn’t have any trouble running this notebook—Meadowrun takes care of creating the hairy model environment on an EC2 instance running Linux.
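As promised above, here’s a minimal sketch of the kind of setup.py we added to SwinIR (illustrative, not the exact file we committed):

from setuptools import setup, find_packages

setup(
    name="swinir",
    version="0.0.1",
    # find_packages discovers the package directories that pip should copy
    # into the environment when installing this repo as a git dependency
    packages=find_packages(),
    # dependencies could be declared here via install_requires; we added
    # them to model_requirements.txt instead
)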

And a couple more notes on Meadowrun:

  • Because we’re running these models as batch jobs instead of as services, Meadowrun will reuse a single EC2 instance to run them.
  • If you’re feeling ambitious, you could even use meadowrun.run_map to run these models in parallel on multiple GPU machines, along the lines of the sketch below.
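Here’s a rough sketch of what that could look like, assuming run_map accepts the same deployment and resource arguments as run_command (check the Meadowrun docs for the exact signature; generate_for_prompt is a hypothetical per-prompt function):

import asyncio
import meadowrun

async def run_in_parallel(prompts):
    # One task per prompt, each allocated GPU resources, potentially
    # running on several EC2 instances at once
    return await meadowrun.run_map(
        lambda prompt: generate_for_prompt(prompt),
        prompts,
        meadowrun.AllocCloudInstance("EC2"),
        meadowrun.Resources(
            logical_cpu=1, memory_gb=16, max_eviction_rate=80,
            gpu_memory=4, flags="nvidia"
        ),
    )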

Let’s see some results!

  • The notebook asks for a text prompt and has DALL·E Mini generate 8 images:

DALL·E Mini: batman praying in the garden of gethsemane

  • We select one of the images and GLID-3-xl produces 8 new images based on our chosen image.

Images generated by GLID-3-xl based on image 6 above

  • Finally, we select one of these images and have SwinIR upscale it from 256x256 to 1024x1024 pixels:

Image 3 from above upscaled by SwinIR

Not terrible, although we did provide some human help at each stage!

Here’s what OpenAI’s DALL·E generates from the same prompt:

And here’s one more comparison:

DALL·E Mini: “olive oil and vinegar drizzled on a plate in the shape of the solar system”, upscaled by SwinIR

All of this underscores how impressive OpenAI’s DALL·E is. That said, DALL·E Mini is very fun to play with, is truly open, and will only get better as it continues to train.

Closing remarks

This post demonstrates how to use Meadowrun for running GPU computations like ML inference in EC2. Meadowrun takes care of details like finding the cheapest available GPU instance types, as well as making sure CUDA and Nvidia’s Container Runtime (previously known as Nvidia Docker) are installed in the right places.

It’s pretty cool that we can point Meadowrun at a repo like dalle-playground, tell it what resources the code needs, and get it running with almost no fuss. One of the most annoying things in software is getting other people’s code to work, and it’s great to see that the Python and ML ecosystems have made a ton of progress in this regard. Thanks to better package management tools, MLOps tools like Hugging Face and Weights and Biases, as well as Meadowrun (if we do say so ourselves), it’s easier than ever to build on the work of others.

To stay updated on Meadowrun, star us on GitHub or follow us on Twitter!
