jessielin

Experimenting with Workload Performance Impact by Number of Connections

Motivation:

As a Cockroach Enterprise Architect, I often help customers tune workload performance before projects launch in Production, and connection pool sizing is one of the knobs I get asked about most often.

The Cockroach Labs documentation on sizing connection pools states:

Many workloads perform best when the maximum number of active connections is between 2 and 4 times the number of CPU cores in the cluster.

The number above applies to ACTIVE connections, meaning connections with a statement currently being executed, not total connections. Hence customers sometimes ask me about the optimal connection pool size for their application.
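
For example, for the 16-core cluster used in the experiments below, that rule of thumb suggests roughly 16 × 2 = 32 to 16 × 4 = 64 active connections.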

The short answer, as you might have guessed, is "it depends". A longer answer is that it depends on the workload characteristics (CPU/IO/network) and SLA requirements. In this blog post, I'd like to articulate the reasoning and run experiments with two workloads to illustrate the concept. I hope it helps you tune your workload on CockroachDB and get into Production quickly and successfully.

Overview:

We selected CockroachDB's built-in TPCC and KV workloads and three scenarios to demonstrate how workload characteristics affect the choice of connection pool size. We ran each scenario with different numbers of connections and compared resource utilization and performance metrics to show how to best choose the pool size. When that parameter is chosen well, we achieve a balance of resource usage, latency, and throughput.

You'll find:

  1. The optimal connection pool size and the number of active connections in use depend on the workload.
  2. What metrics to watch and how to find the sweet spot for connection pool size.

Experiment Design:

The Production Checklist explains:

Creating the appropriate size pool of connections is critical to gaining maximum performance in an application. Too few connections in the pool will result in high latency as each operation waits for a connection to open up. But adding too many connections to the pool can also result in high latency as each connection thread is being run in parallel by the system. The time it takes for many threads to complete in parallel is typically higher than the time it takes a smaller number of threads to run sequentially.

From the above, we expect to see the following: when not enough connections are in use, the cluster has few in-flight queries (or active connections) and low CPU usage, and the workload has high latency because operations wait for connections. When the application opens too many connections, the cluster has too many in-flight queries or active connections, CPU is overloaded, and latency is high because queries wait on CPU.

The good news is that all the metrics above can be easily monitored in CockroachDB's built-in DB Console. You can run experiments and find the optimal connection pool size for your workload.
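
How the cap is set depends on your driver or pool library, but most expose it directly. As a minimal sketch (not part of the experiments below, which use the cockroach workload tool), here is how a Go application using the standard database/sql package with the pgx driver might bound its pool; the DSN, user, and haproxy host are placeholders:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver for database/sql
)

func main() {
	// Hypothetical DSN: point it at haproxy (or any node) in front of your cluster.
	dsn := "postgresql://app_user@haproxy-host:26257/defaultdb?sslmode=verify-full"

	db, err := sql.Open("pgx", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cap the pool. A common starting point is 2-4x the cluster's total vCPUs
	// (32-64 for the 16-vCPU cluster in this post); adjust from there based on
	// the latency and CPU metrics discussed below.
	db.SetMaxOpenConns(64)
	db.SetMaxIdleConns(64)
	db.SetConnMaxLifetime(30 * time.Minute)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connection pool configured")
}
```

Whatever tool or framework you use, the key is that the cap is explicit and that you re-run your load test whenever you change it.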

Also note that the recommended number of Active Connections assumes workloads are CPU bound, which is true for many OLTP workloads. In this experiment we chose TPCC as the first workload to test, since it is a popular OLTP benchmark.

On the other hand, not all workloads are CPU bound. In the experiment, we use KV, a simple key-value access workload, as an example of an I/O-intensive workload. CockroachDB uses an LSM tree to quickly flush data to disk, so the storage engine itself is less likely to make a workload I/O bound, but if disk IOPS is not provisioned properly, it can become an issue.

Lastly, to simulate a network-bound workload, we deployed a client on the other side of the US to compare connection pool configurations and performance.

Implementation:

We set up a 4-node cluster in AWS us-east-2 using m5d.xlarge instances with local SSDs, plus another node in the same region to run the workloads with the cockroach workload tool, connecting to the cluster via haproxy installed on that node. In the third scenario, the client is deployed in us-west-1.
CockroachDB version: v22.2.6

Testing / validation:

TPCC test with 500 warehouses

Since the cluster has 16 cores, following the guideline above, we started with 32 connections, then increased it to 64. At this stage, the number of open transactions and active connections is far lower than the number of connections. Efficiency (a metric the TPCC benchmark measures) is high, and CPU utilization is low. Though P50 and P90 are low, PMax is 50 times higher than P90. The extreme tail latency is a sign of connection starvation. Interestingly, the number of active connections is higher than in the other scenarios.

Next, we increased the number of connections to 125: Efficiency increased and latency dropped, most noticeably PMax, which dropped by 90%. We further increased it to 250 connections, with diminishing returns: Efficiency and P50 stayed flat, but tail latency increased slightly.

However, 500 connections seems to be the sweet spot, where CPU utilization and throughput are among the highest, while latency is the lowest across the board.

Lastly, if we further increase the number of connections to 1000, CPU utilization increases, but Efficiency drops and tail latency rises. This indicates the application uses the additional connections to send queries, but they queue up and not much more gets done.

If this were a real workload, we could explore further by trying pool sizes of 700 or 400, but for this exercise, 500 connections is the best choice. It may be relevant that the number of warehouses we tested is also 500. Nevertheless, the number of open transactions and active connections is still far less than the baseline of 32 (2 times the number of CPU cores). This may indicate that the client is becoming the bottleneck, as a single m5d.xlarge node may not be able to send more queries.
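
The numbers above come from the cockroach workload tool, but if you want to run the same kind of sweep against your own schema and queries, a rough harness in Go might look like the following sketch. The DSN, the SELECT 1 placeholder statement, and the list of pool sizes are all assumptions to replace with your own:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"sort"
	"sync"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

const stepDuration = 30 * time.Second

// runStep drives the database with `conns` concurrent workers for stepDuration
// and returns the observed per-statement latencies.
func runStep(db *sql.DB, conns int) []time.Duration {
	db.SetMaxOpenConns(conns)
	db.SetMaxIdleConns(conns)

	ctx, cancel := context.WithTimeout(context.Background(), stepDuration)
	defer cancel()

	var (
		mu   sync.Mutex
		lats []time.Duration
		wg   sync.WaitGroup
	)
	for i := 0; i < conns; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				start := time.Now()
				// Placeholder statement: substitute a query representative of your workload.
				if _, err := db.ExecContext(ctx, "SELECT 1"); err != nil {
					return
				}
				mu.Lock()
				lats = append(lats, time.Since(start))
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return lats
}

func main() {
	// Hypothetical DSN: point it at haproxy (or any node) in front of the cluster.
	db, err := sql.Open("pgx", "postgresql://app_user@haproxy-host:26257/defaultdb?sslmode=verify-full")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sweep pool sizes and report QPS, P50, and P99 for each step.
	for _, conns := range []int{32, 64, 125, 250, 500, 1000} {
		lats := runStep(db, conns)
		if len(lats) == 0 {
			log.Printf("conns=%d: no successful statements", conns)
			continue
		}
		sort.Slice(lats, func(i, j int) bool { return lats[i] < lats[j] })
		qps := float64(len(lats)) / stepDuration.Seconds()
		fmt.Printf("conns=%4d  qps=%8.0f  p50=%v  p99=%v\n",
			conns, qps, lats[len(lats)/2], lats[len(lats)*99/100])
	}
}
```

Watching the printed P50/P99 at each step alongside CPU utilization in the DB Console is essentially the procedure followed in this post.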

KV test

Compared with TPCC, KV is a much simpler workload based on key-value pairs. We expect KV to have higher throughput and to be more I/O bound.


Similarly, we started with 32 connections. QPS is much higher and latency much lower than for the TPCC workload, as the queries are much simpler. At this stage, the number of active connections is at 30 and CPU utilization is already over 80%; both metrics are higher than for TPCC. We also notice the I/O throughput is about twice as high as TPCC's, which confirms the expectation above. But PMax is 100 times higher than P50, which is a warning sign.

Next, we increased the number of connections to 64: the number of active connections jumped to 62, and CPU utilization and throughput increased slightly. P50 almost doubled, but PMax dropped dramatically, indicating that doubling the connection pool size is effective in reducing PMax.

We further increased the number of connections to 125: CPU utilization increased dramatically, P50 almost doubled again, and tail latency increased 10x. The picture was similar with 250 connections.

In this exercise, the best connection pool size is likely around 64. We also see that the number of open transactions and active statements is almost the same as the number of connections, close to 64, which is 4 times the number of CPUs.

KV test - network bottleneck

Lastly, to simulate a network-bound workload, we moved the driver node to the us-west region, as opposed to the same region as in scenarios 1 and 2.
Since we know the network latency will be much higher, using more connections will likely address the bottleneck. We started with 250 connections, but unlike scenario 2, the number of open transactions is only 75 and the number of active connections is 63, both far less than the number of connections. QPS is 5K, half of scenario 2, even though there are plenty of connections available to send more work. When we increase the number of connections above 500, QPS gets close to the same level as scenario 2, while CPU utilization is still lower than in the previous example. Also note that the number of active statements is now 85-90% of the number of open transactions.

Inter-region network latency is about 50 ms round-trip, and application latency is slightly over 50 ms, as expected. Also note that tail latency more than doubled when the number of connections was increased to 2000. Given the ~50 ms latency between us-east and us-west, the sweet spot for the number of connections is likely about 500, where mean latency is low, tail latency is reasonable, and CPU utilization and throughput are reasonably high.
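
This also matches a quick Little's Law estimate: in-flight requests ≈ throughput × latency, so sustaining roughly 10K QPS when each query carries ~50 ms of cross-country round-trip requires on the order of 10,000 × 0.05 ≈ 500 requests in flight, which is right where the sweet spot landed.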

Takeaways:

  1. Ultimately, performance tuning means addressing bottlenecks iteratively: as we add more connections, the bottleneck may move to another part of the system, CPU in this case. So we want to find a sweet spot when multiple metrics are considered together, and err slightly on the side of over-provisioning to handle unexpected peaks.
  2. Understand your workload characteristics (CPU/network/IO) and experiment with the number of connections to tune the workload, and check out the Production Checklist and other documents and blog posts to build confidence that the workload is ready for Production launch!
