Introduction
There are several production-grade self-hosted options for running a Django app server behind an nginx web server using the WSGI protocol, such as uWSGI, mod_wsgi, and gunicorn. I have been using gunicorn for many projects over many years. In all that time I never questioned its performance or configuration details, but recently I wanted to test gunicorn's actual performance relative to worker count and find the optimal configuration for ECS Fargate. In contrast to on-prem servers, where I know the actual number of physical cores, AWS only lets me configure the number of logical cores (via vCPUs). I will illustrate how I tested the performance using gunicorn, django, and locust.
Standalone WSGI Containers - Flask Documentation (1.1.x)
Gunicorn
Gunicorn is a Python WSGI HTTP server for UNIX.
$ gunicorn django-sample.wsgi:application -w ${WORKER_COUNT} --threads ${THREAD_COUNT} -b 0.0.0.0:8000
You generally use the command above to run the application. According to the design document, the recommendation is (2 x $num_cores) + 1 as the number of workers to start off with. The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
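As a sketch, that rule of thumb can be computed at startup, assuming `multiprocessing.cpu_count()` (which reports logical cores, i.e. vCPUs) is an acceptable stand-in for $num_cores:

```python
import multiprocessing

def recommended_workers() -> int:
    """Gunicorn's suggested starting point: (2 x num_cores) + 1.

    Note: cpu_count() reports logical cores (vCPUs), which on
    hyperthreaded instances is twice the physical core count.
    """
    return (2 * multiprocessing.cpu_count()) + 1

print(recommended_workers())  # e.g. 5 on a 2-vCPU instance
```

The value can then be passed to gunicorn via `-w`, e.g. `-w $(python -c 'import multiprocessing; print(2*multiprocessing.cpu_count()+1)')`.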
Gunicorn - WSGI server - Gunicorn 20.0.4 documentation
EC2 vCPU
I need to determine the number of CPU cores on the server. On a traditional Linux server, you can determine the number of cores with the following command.
According to AWS, not all vCPUs are made the same. For T2 instances, 1 vCPU = 1 physical core. For all others, 1 vCPU = 1 logical core.
$ cat /proc/cpuinfo    # check the "cpu cores" field
The number of vCPUs for the instance is the number of CPU cores multiplied by the threads per core. To specify a custom number of vCPUs, you must specify a valid number of CPU cores and threads per core for the instance type.
vCPUs = physical cores * threads per core
According to this blog post, there was a 34% drop in multi-threaded performance on logical cores.
What's in a vCPU: State of Amazon EC2 in 2018 - Credera
You can configure threads per core during instance launch.
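To sanity-check this on a running instance, `os.cpu_count()` reports the logical CPU count, i.e. vCPUs (physical cores x threads per core):

```python
import os

# os.cpu_count() returns the number of logical CPUs (vCPUs)
# visible to the OS: physical cores x threads per core.
vcpus = os.cpu_count()
print("vCPUs:", vcpus)
```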
Methodology
Architecture
The test runs on a 3-tier web architecture: the Django application runs behind nginx through gunicorn, and the Django app queries a MySQL DB in a private subnet.
Instances used
- t3.micro - 1 CPU
- t3.micro - 2 vCPU (1 core * 2 threads)
- t3.xlarge - 2 CPU
- t3.xlarge - 4 vCPU (2 cores * 2 threads)
- t3.2xlarge - 4 CPU
Locust
Locust - A modern load testing framework
I will use locust to test and analyze the performance under load. The following locustfile will spawn multiple users concurrently, each hitting an endpoint backed by a simple SQL query.
from locust import HttpUser, TaskSet, between, task

class ReadPosts(TaskSet):
    @task(1)
    def read_posts(self):
        response = self.client.get("/api/v1/posts/", name="list posts")
        print("Response status code:", response.status_code)
        print("Response content:", response.text)

class WebsiteUser(HttpUser):
    tasks = [ReadPosts]
    wait_time = between(1.0, 2.0)
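Note that wait_time = between(1.0, 2.0) caps the load each user can generate: a user waits 1.5 s between requests on average, so the theoretical steady-state request rate is roughly users / (avg_wait + avg_response). A rough estimate, assuming a hypothetical 50 ms average response time:

```python
def estimated_rps(users: int, avg_wait: float = 1.5, avg_response: float = 0.05) -> float:
    """Approximate steady-state requests/sec in a closed-loop locust test.

    Each user completes one request per (avg_wait + avg_response) seconds;
    avg_wait is the midpoint of between(1.0, 2.0), avg_response is assumed.
    """
    return users / (avg_wait + avg_response)

print(round(estimated_rps(100), 1))  # ~64.5 req/s for 100 users
```

This is why the concurrent user count, not raw RPS, is the knob being turned in this test.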
Additional setup
- Tested the django app with DEBUG = False
- Latency is not considered since the test was executed within the VPC
- Used serverless Aurora RDS to avoid a bottleneck on DB I/O
Test procedure
- Used htop to record the CPU usage of each core and the memory usage
- Incremented the number of concurrent users by 100 at a time to find the two load thresholds
Result
Full results:
Gunicorn performance test result
Observations
- A single worker cannot fully utilize the CPUs
- The recommended number of workers (2 * cores + 1) generally performs well, as expected.
- There is not much performance difference between physical and logical cores.
- A single thread per worker seems to give slightly better performance
- Therefore, considering price, it is efficient to use instances with multiple threads per core when memory usage is not an issue
- t3.micro 1 core 2 threads - $0.0104 per Hour
- t3.large 2 cores 1 thread - $0.0832 per Hour
Questions
- What would be an efficient CPU usage cap? 60%? 90%?
- Are there any other metrics I should have taken into account in the experiments?
ECS fargate performance
Since I gained a general idea of how gunicorn performed on EC2 instances, I took it further and tested how gunicorn performs on an ECS Fargate setup with regard to the vCPUs allocated. Please refer to the GitHub repo for the task definitions I have used.
Since I cannot manage the instances in a Fargate setup, I needed to assign CPU units to each task to compare with EC2. It was impossible to run htop on Fargate since I have no access to the instances. ECS provides Container Insights, but it is neither as accurate nor as real-time as running htop inside the instances would be. I only recorded the fail rate against the concurrent user count.
I did not assign CPU units to each individual container, since the two tasks in the previous test shared CPU resources on a single machine.
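For illustration, task-level CPU and memory are set at the top of a Fargate task definition; a minimal sketch (the names and values here are assumptions, not the actual definitions from the repo):

```json
{
  "family": "django-sample",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "django-sample:latest",
      "portMappings": [{ "containerPort": 8000 }]
    }
  ]
}
```

On Fargate, "cpu": "1024" corresponds to 1 vCPU; container-level cpu units inside containerDefinitions can be left unset so the containers share the task-level allocation.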
Results
The results do not seem to be as consistent as with the EC2 cores. It seems that higher worker counts yield proportionally better performance.
Afterthoughts
I was not entirely satisfied with the result since I was not able to isolate all the variables.
- Performance will probably depend on app behavior (memory usage, I/O pattern)
- A long-running, I/O-heavy app will probably have different results
- The test results would have been more accurate with a CPU resource configuration in the docker-compose file (https://docs.docker.com/compose/compose-file/#resources)
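For reference, a sketch of what that compose-level CPU limit might look like (the service name and values are assumptions, not taken from the actual setup):

```yaml
services:
  web:
    image: django-sample:latest
    deploy:
      resources:
        limits:
          cpus: "2.0"      # cap the container at 2 CPUs
          memory: 1024M
```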
Next Todo
- ASGI benchmark on different vCPU options
- Try async workers combined with the threads option
Reference
PEP 3333 -- Python Web Server Gateway Interface v1.0.1
uWSGI vs. Gunicorn, or How to Make Python Go Faster than Node
A Performance Analysis of Python WSGI Servers: Part 2 | Blog | AppDynamics
Better performance by optimizing Gunicorn config
A Guide to ASGI in Django 3.0 and its Performance
Quick start - Locust 1.1.1 documentation
What's in a vCPU: State of Amazon EC2 in 2018 - Credera
How Amazon ECS manages CPU and memory resources | Amazon Web Services