Stable Diffusion Inference using FastAPI and load testing using Locust

Digital art and NFTs have become incredibly valuable as the metaverse grows. To take advantage of this opportunity, we at Qolaba decided to investigate how to build API endpoints for Stable Diffusion and how they behave under load from various numbers of concurrent users. In this article, we walk through this experiment and discuss our findings.

Table of contents:

  • What is Stable Diffusion?
  • Inference of Stable Diffusion using FastAPI
  • Load testing using Locust
  • Conclusion

What is Stable Diffusion?

Stable Diffusion is a machine learning model that uses a diffusion process to generate images from text. It works in both text-to-image and image-to-image modes. The majority of contemporary AI art found online is generated with Stable Diffusion: with just a text prompt and an open-source tool, anyone can produce striking images. The new version of Stable Diffusion offers a number of additional features, which are described on the Stability AI blog at Stable Diffusion 2.0 Release — Stability AI.
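To illustrate the image-to-image mode, below is a minimal sketch using the diffusers library; the input file name, prompt, and parameter values are only examples, and the exact argument names (for instance image vs. init_image) can differ between diffusers versions.

# Minimal image-to-image sketch with diffusers; file name and prompt are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a detailed oil painting of a castle",
    image=init_image,   # starting image that gets transformed
    strength=0.75,      # how strongly the original image is altered
    guidance_scale=7.5,
)
result.images[0].save("castle.jpg")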

Inference of Stable Diffusion using FastAPI

  • We carried out the experiment on a workstation with the specifications listed below.

Workstation Stats

  • The necessary Python packages for Stable Diffusion have to be installed before we can begin the inference procedure. We can do that by following the instructions provided in this Link. Make sure PyTorch is installed with a matching CUDA version before beginning the installation (a quick check is shown after the install command below).
pip install diffusers transformers accelerate scipy safetensors
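Since the pipeline will run on the GPU, it is worth verifying that PyTorch can actually see CUDA before going further; a quick sanity check could look like this.

# Quick sanity check that PyTorch was installed with CUDA support.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))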
  • Once the required packages are installed, we can move on to serving Stable Diffusion through FastAPI.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from typing import Optional
from concurrent.futures import ThreadPoolExecutor
import io, uvicorn, gc
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

app = FastAPI()
app.POOL: ThreadPoolExecutor = None

@app.on_event("startup")
def startup_event():
    # A single-worker pool serializes access to the GPU, so concurrent requests
    # queue up instead of running the pipeline simultaneously on one device.
    app.POOL = ThreadPoolExecutor(max_workers=1)

@app.on_event("shutdown")
def shutdown_event():
    app.POOL.shutdown(wait=False)

# Load the Stable Diffusion 2.1 pipeline once and keep it on the GPU.
model_id = "stabilityai/stable-diffusion-2-1"
pipe_nsd = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe_nsd.scheduler = DPMSolverMultistepScheduler.from_config(pipe_nsd.scheduler.config)
pipe_nsd = pipe_nsd.to("cuda")

@app.post("/getimage_nsd")
def get_image_nsd(
    prompt: Optional[str] = "dog",
    height: Optional[int] = 512,
    width: Optional[int] = 512,
    num_inference_steps: Optional[int] = 50,
    guidance_scale: Optional[float] = 7.5,
    negative_prompt: Optional[str] = None,
):
    # Run the pipeline on the worker thread and wait for the generated images.
    image = app.POOL.submit(
        pipe_nsd,
        prompt,
        height=height,
        width=width,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        negative_prompt=negative_prompt,
    ).result().images

    # Release any cached GPU memory between requests.
    gc.collect()
    torch.cuda.empty_cache()

    # Stream the first generated image back as a JPEG.
    filtered_image = io.BytesIO()
    image[0].save(filtered_image, "JPEG")
    filtered_image.seek(0)
    return StreamingResponse(filtered_image, media_type="image/jpeg")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9000)
  • To start the local server that exposes the Stable Diffusion API endpoint, we can run the above code.
python <filename>.py
  • Once the API endpoint is up, we can try it out using the FastAPI interactive API docs. For that, we can go to http://127.0.0.1:9000/docs, specify the input parameters, and click Execute to generate an image.

FastAPI Interactive API docs
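Besides the interactive docs, the endpoint can also be called programmatically. Below is a minimal client sketch, assuming the requests library is installed; the prompt and output file name are only examples.

# Minimal client sketch: call /getimage_nsd and save the returned JPEG.
import requests

params = {
    "prompt": "a corgi astronaut, digital art",
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
}
response = requests.post("http://127.0.0.1:9000/getimage_nsd", params=params, timeout=600)
response.raise_for_status()

with open("output.jpg", "wb") as f:
    f.write(response.content)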

Load testing using Locust

We can use Locust, a Python-based load testing framework, to carry out the load testing. With Locust we can write tests that simulate a large number of concurrent users, which helps pinpoint weaknesses in how the application handles load and where performance breaks down. We can install Locust with the command below.

pip3 install locust

To run the load test, we use the Locust script below. Each request randomly picks an image size of 512x512, 768x768, or 1024x1024 to mimic a real-world scenario.

from locust import HttpUser, task
import random

class HelloWorldUser(HttpUser):
    host = "http://127.0.0.1:9000"

    @task(1)
    def hello_world(self):
        # Randomly pick one of the three tested image sizes for each request.
        size = random.choice([512, 768, 1024])
        url = (
            "/getimage_nsd?prompt=dog"
            + "&height=" + str(size)
            + "&width=" + str(size)
            + "&num_inference_steps=50&guidance_scale=7.5&negative_prompt=%20"
        )
        self.client.post(url)
locust -f <filename>.py

Once Locust is running, we can open its web UI in a browser and set the concurrent user count, spawn rate, and host as needed.

WebUI of Locust
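Alternatively, Locust can be started headless with the user count and spawn rate given on the command line; the values and flags below are illustrative and may vary slightly between Locust versions.

locust -f <filename>.py --headless -u 10 -r 1 --run-time 10m --host http://127.0.0.1:9000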

For this article, we ran the load test with 5 and 10 concurrent users. The maximum response times were 146 and 307 seconds, respectively.

10 concurrent users
5 concurrent users

Conclusion

In the load test, the highest response time was 307 s with 10 concurrent users and 146 s with 5 concurrent users, which is far too high. To solve this problem, we could use Docker and a Kubernetes load balancer to create multiple endpoints on different GPUs, splitting the overall load and improving response time. In addition, we could experiment with batching requests in FastAPI (a rough sketch of the batching idea follows below) or running several workers so that multiple requests are processed concurrently. Nevertheless, in my opinion, the second idea won't make a significant difference: when several generations run simultaneously on a single GPU, the number of iterations per second for each process drops, so individual requests take longer and the overall response time grows as well.
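As a rough illustration of the batching idea, the diffusers pipeline itself accepts a list of prompts, so several queued requests could in principle be merged into a single call. The sketch below only shows the pipeline side; collecting and dispatching queued FastAPI requests is left out, and the prompts are just examples.

# Sketch: generate several prompts in one batched pipeline call.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompts = ["a dog", "a cat", "a castle at sunset"]
result = pipe(prompts, num_inference_steps=50, guidance_scale=7.5)
for i, image in enumerate(result.images):
    image.save(f"batched_{i}.jpg")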

While running inference, I also discovered another problem: the overall GPU memory requirement is very high because we tested three distinct image sizes (512x512, 768x768, and 1024x1024), and the GPU allocates memory for each configuration. To work around this, we can recreate the Stable Diffusion pipeline before each image generation and delete it afterwards (a sketch of this idea follows the figure below). Although the total response time will be longer with this technique, the cost will be lower since the inference can run on a GPU with less VRAM.

GPU Consumption during Load Testing
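A rough sketch of the recreate-and-delete idea might look like the helper below; the function name is made up for illustration, and reloading the weights on every call is exactly what makes the response time longer.

# Sketch: build the pipeline per request and free GPU memory afterwards,
# trading response time for a smaller VRAM footprint.
import gc
import torch
from diffusers import StableDiffusionPipeline

def generate_once(prompt: str, height: int = 512, width: int = 512):
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")
    try:
        image = pipe(prompt, height=height, width=width).images[0]
    finally:
        # Drop the pipeline and release cached GPU memory.
        del pipe
        gc.collect()
        torch.cuda.empty_cache()
    return image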

Thank you 🙏

