Introduction
There are several production-grade self-hosted options for running a Django app server behind an nginx web server using the WSGI protocol, such as uWSGI, mod_wsgi, and gunicorn. I have been using gunicorn for many projects over many years. In all that time I never questioned its performance or configuration details, but recently I wanted to test gunicorn's actual performance relative to worker count and find the optimal configuration for ECS Fargate. In contrast to on-prem servers, where I know the actual number of physical cores, AWS only lets me configure the number of logical cores (via vCPUs). I will illustrate how I tested the performance using gunicorn, django, and locust.
Standalone WSGI Containers - Flask Documentation (1.1.x)
Gunicorn
Gunicorn is a Python WSGI HTTP server for UNIX.
$ gunicorn django-sample.wsgi:application -w ${WORKER_COUNT} --threads ${THREAD_COUNT} -b 0.0.0.0:8000
You generally use the command above to run the application. According to the design document, the recommendation is (2 x $num_cores) + 1 as the number of workers to start off with. The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
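As a sketch, that rule of thumb can be computed at startup, assuming `multiprocessing.cpu_count()` (which reports logical cores, i.e. vCPUs) is an acceptable stand-in for $num_cores:

```python
import multiprocessing

def recommended_workers() -> int:
    """Gunicorn's suggested starting point: (2 x num_cores) + 1.

    Note: cpu_count() reports logical cores (vCPUs), which on
    hyperthreaded instances is twice the physical core count.
    """
    return (2 * multiprocessing.cpu_count()) + 1

print(recommended_workers())  # e.g. 5 on a 2-vCPU instance
```

The value can then be passed to gunicorn via `-w`, e.g. `-w $(python -c 'import multiprocessing; print(2*multiprocessing.cpu_count()+1)')`.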
Gunicorn - WSGI server - Gunicorn 20.0.4 documentation
EC2 vCPU
I need to determine the number of CPU cores on the server. On a traditional Linux server, you can determine the number of cores with the following command.
According to AWS, not all vCPUs are made the same. For T2 instances, 1 vCPU = 1 physical core. For all others, 1 vCPU = 1 logical core.
$ cat /proc/cpuinfo    # check the "cpu cores" field
The number of vCPUs for the instance is the number of CPU cores multiplied by the threads per core. To specify a custom number of vCPUs, you must specify a valid number of CPU cores and threads per core for the instance type.
vCPUs = physical cores * threads per core
According to this blog post, there was a 34% drop in multi-threaded performance on logical cores.
What's in a vCPU: State of Amazon EC2 in 2018 - Credera
You can configure threads per core during instance launch.
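To sanity-check this on a running instance, `os.cpu_count()` reports the logical CPU count, i.e. vCPUs (physical cores x threads per core):

```python
import os

# os.cpu_count() returns the number of logical CPUs (vCPUs)
# visible to the OS: physical cores x threads per core.
vcpus = os.cpu_count()
print("vCPUs:", vcpus)
```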
Methodology
Architecture
The test runs on a 3-tier web architecture: the Django application runs behind nginx through gunicorn, and the Django app queries a MySQL DB in a private subnet.
Instances used
- t3.micro - 1 CPU
- t3.micro - 2 vCPU (1 core * 2 threads)
- t3.xlarge - 2 CPU
- t3.xlarge - 4 vCPU (2 cores * 2 threads)
- t3.2xlarge - 4 CPU
Locust
Locust - A modern load testing framework
I will use locust to test and analyze the performance under load. The following locustfile will spawn multiple users concurrently, each hitting an endpoint backed by a simple SQL query.
from locust import HttpUser, TaskSet, between, task

class ReadPosts(TaskSet):
    @task(1)
    def read_posts(self):
        response = self.client.get("/api/v1/posts/", name="list posts")
        print("Response status code:", response.status_code)
        print("Response content:", response.text)

class WebsiteUser(HttpUser):
    tasks = [ReadPosts]
    wait_time = between(1.0, 2.0)
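Note that wait_time = between(1.0, 2.0) caps the load each user can generate: a user waits 1.5 s between requests on average, so the theoretical steady-state request rate is roughly users / (avg_wait + avg_response). A rough estimate, assuming a hypothetical 50 ms average response time:

```python
def estimated_rps(users: int, avg_wait: float = 1.5, avg_response: float = 0.05) -> float:
    """Approximate steady-state requests/sec in a closed-loop locust test.

    Each user completes one request per (avg_wait + avg_response) seconds;
    avg_wait is the midpoint of between(1.0, 2.0), avg_response is assumed.
    """
    return users / (avg_wait + avg_response)

print(round(estimated_rps(100), 1))  # ~64.5 req/s for 100 users
```

This is why the concurrent user count, not raw RPS, is the knob being turned in this test.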
Additional setup
- Tested the django app with DEBUG = False
- Latency is not considered since the test was executed within the VPC
- Used serverless Aurora RDS to avoid a bottleneck on DB I/O
Test procedure
- Used htop to record the CPU usage of each core and the memory usage
- Incremented the number of concurrent users by 100 at a time to find the two load thresholds
Result
Full results:
Gunicorn performance test result
Observations
- A single worker cannot fully utilize the CPUs
- The recommended number of workers (2 * cores + 1) generally performs well, as expected.
- There is not much performance difference between physical and logical cores.
- A single thread per worker seems to give slightly better performance
- Therefore, considering price, it is efficient to use instances with multiple threads per core when memory usage is not an issue
- t3.micro 1 core 2 threads - $0.0104 per Hour
- t3.large 2 cores 1 thread - $0.0832 per Hour
Questions
- What would be an efficient CPU usage cap? 60%? 90%?
- Are there any other metrics I should have taken into account in the experiments?
ECS fargate performance
Since I gained a general idea of how gunicorn performed on EC2 instances, I took it further and tested how gunicorn performs on an ECS Fargate setup with regard to the vCPUs allocated. Please refer to the GitHub repo for the task definitions I have used.
Since I cannot manage the instances in a Fargate setup, I needed to assign CPU units to each task to compare with EC2. It was impossible to run htop on Fargate since I have no access to the instances. ECS provides Container Insights, but it is neither as accurate nor as real-time as running htop inside the instances would be. I only recorded the fail rate against the concurrent user count.
I did not assign CPU units to each individual container, since the two tasks in the previous test shared CPU resources on a single machine.
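For illustration, task-level CPU and memory are set at the top of a Fargate task definition; a minimal sketch (the names and values here are assumptions, not the actual definitions from the repo):

```json
{
  "family": "django-sample",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "django-sample:latest",
      "portMappings": [{ "containerPort": 8000 }]
    }
  ]
}
```

On Fargate, "cpu": "1024" corresponds to 1 vCPU; container-level cpu units inside containerDefinitions can be left unset so the containers share the task-level allocation.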
Results
The results do not seem to be as consistent as with the EC2 cores. It seems that higher worker counts yield proportionally better performance.
Afterthoughts
I was not entirely satisfied with the result since I was not able to isolate all the variables.
- Performance will probably depend on app behavior (memory usage, I/O pattern)
- A long-running, I/O-heavy app will probably have different results
- The test results would have been more accurate with a CPU resource configuration in the docker-compose file (https://docs.docker.com/compose/compose-file/#resources)
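For reference, a sketch of what that compose-level CPU limit might look like (the service name and values are assumptions, not taken from the actual setup):

```yaml
services:
  web:
    image: django-sample:latest
    deploy:
      resources:
        limits:
          cpus: "2.0"      # cap the container at 2 CPUs
          memory: 1024M
```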
Next Todo
- ASGI benchmark on different vCPU options
- Try async workers combined with the threads option
Reference
PEP 3333 -- Python Web Server Gateway Interface v1.0.1
uWSGI vs. Gunicorn, or How to Make Python Go Faster than Node
A Performance Analysis of Python WSGI Servers: Part 2 | Blog | AppDynamics
Better performance by optimizing Gunicorn config
A Guide to ASGI in Django 3.0 and its Performance
Quick start - Locust 1.1.1 documentation
What's in a vCPU: State of Amazon EC2 in 2018 - Credera
How Amazon ECS manages CPU and memory resources | Amazon Web Services