How I calculate capacity for Systems Design

#design #algorithms #interview

Two of the most influential programming books I have read are Programming Pearls and More Programming Pearls. In More Programming Pearls, there is a nugget of wisdom in chapter 7 called "The Envelope is Back". It discusses the need to have certain formulas memorized, and the ability to make quick calculations on the fly regarding scale.

As a Software Architect™, I've built systems that scale to nearly a billion transactions each day. To reason about scale in my head, I use a very small trick (or two) that works well (for me).

I memorize what a million of whatever I need to reason about looks like, in two different ways:

One million in scale
One million in quantity

One Million in Scale

Developers would often come to me and shout: "We need to support 2 million transactions each day with this API... HELP!" (Or something like it). It rarely rattled me. Because I understood the math of scale. I'd just answer: "Great, so we need to support 24 calls per second. No problem!".

Back of the envelope calculations aren't meant to be precise. Only approximate. Something easily computed in our head. One million transactions per day is:

~42k per hour
~700 per minute
~12 per second

Bottlenecks rarely happen at the scale of minute or hour. So the only number I need to know is that 1 million of something per day is ~10-12 per second. That's it.

Once we know that, we can determine what scale we need to support easily enough:

1 million per day = ~12/second (12 * 1)
5 million per day = ~60/second (12 * 5)
30 million per day = ~360/second (12 * 30)
100 million per day = ~1200/second (12 * 100)
etc.
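
For anyone who wants to check the mental math, here is a minimal Python sketch of the conversion. The function names are hypothetical, made up for illustration; the only real inputs are the 86,400 seconds in a day and the ~12/second rule of thumb from above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Exact average rate per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def per_second_estimate(per_day: float, factor: int = 12) -> float:
    """Back-of-the-envelope rate: `factor` per second for each million per day."""
    return per_day / 1_000_000 * factor


for millions in (1, 5, 30, 100):
    daily = millions * 1_000_000
    exact = per_second(daily)
    estimate = per_second_estimate(daily)
    print(f"{millions}M/day: exact {exact:.1f}/s, estimate ~{estimate:.0f}/s")
```

The exact figure for 1 million per day is about 11.6/second, so multiplying by 12 slightly overestimates, which is the safe direction for capacity planning.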

That's only a partial reality. We usually think in terms of users. A user action or session might generate many transactions. So we need to reason about the scale slightly differently.

One million users per day is:

1 million x approximate number of transactions per user session

If a typical user generates 50 API calls during their session, then we can use our back of the envelope skills to reason that we must support:

12 * 50 = ~600 txn/second.
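
The same reasoning as a small Python sketch. The function name and parameters are hypothetical, and the default factor of 12 is the rule of thumb from above, not a measured value.

```python
def txn_per_second(users_per_day: int, txns_per_user: int, factor: int = 12) -> int:
    """Estimate transactions/second from daily users and transactions per user.

    `factor` is the rule-of-thumb rate per second for each million per day.
    """
    millions_of_txns = users_per_day * txns_per_user / 1_000_000
    return round(millions_of_txns * factor)


# 1 million users/day, 50 API calls per session
print(txn_per_second(1_000_000, 50))             # 600
# Same load with the easier factor of 10
print(txn_per_second(1_000_000, 50, factor=10))  # 500
```

The exact answer is 50,000,000 / 86,400, or roughly 579/second, so both factors land in the right ballpark.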

It's okay to use ~10 as a factor, also. We are not trying to be engineeringly precise here, just approximate. For numbers that would cause me too much thinking, I'll use 10 instead of 12 as a factor.

Calculating Peak Times

If supporting only the average were enough, we'd be done with this part of the post. But we often have to think about Peak Times. Certain parts of the day see more traffic than usual. We also have to support that.

My personal view is that we always MUST support the average first. It's the usual condition. But we then need to support the expected peak times. For this, we need another metric to memorize. I call it the 10% per hour rule. Remember, we're thinking in terms of 1 million. If 10% of that traffic happened during 1 hour, (or 30% of it during a 3 hour window), how much traffic per second is that?

100k transactions in 1 hour = ~28/second, which I round up to ~30/second.

So we can derive the following metrics:

1m/day app @ 10% peak for 1 hour: 100k in 1 hour = ~30/second.
1m/day app @ 30% peak for 1 hour: 300k in 1 hour = 30 x 3 = ~90/second.
1m/day app @ 30% peak for 3 hours: 300k over 3 hours = (30 x 3) / 3 = ~30/second.

So a 1m/day app would need to sustain ~12/second on average, and with a 10% peak for 1 hour it would need to sustain ~30/second for that hour. And so on.
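
Here is a minimal Python sketch of the peak calculation, using exact division rather than the rounded 30/second rule. The function name and parameters are hypothetical, chosen for illustration.

```python
SECONDS_PER_HOUR = 3600


def peak_per_second(per_day: float, peak_share: float, peak_hours: float) -> float:
    """Rate per second when `peak_share` of daily traffic lands within `peak_hours`."""
    return per_day * peak_share / (peak_hours * SECONDS_PER_HOUR)


for share, hours in ((0.10, 1), (0.30, 1), (0.30, 3)):
    rate = peak_per_second(1_000_000, share, hours)
    print(f"{share:.0%} of 1M/day over {hours}h: {rate:.1f}/s")
```

The exact values come out to roughly 28, 83, and 28 per second. The rounded figures of 30 and 90 run a little high, which leaves some headroom.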

Got 10 million txns/day? Multiply those numbers by 10. And so on.

Quick Recap

1 million/day = ~12/sec
Assuming 1 million users/day and 50 txns per user session, that's 50 million/day or ~600/sec

Always think in terms of per second

Per day	Txns per user	Avg/sec	1hr peak @ 10%	1hr peak @ 30%
1m	1	~12	~30	~90
1m	10	~120	~300	~900
1m	100	~1200	~3000	~9000

Pretty much, that's all I need to remember. Everything else is a matter of multiplying by powers of 10.
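
As a sanity check, the recap table can be regenerated from the two memorized numbers alone. This is a sketch under that assumption; the constant and function names are hypothetical.

```python
AVG_PER_SEC_PER_MILLION = 12  # rule of thumb: 1 million/day is ~12/sec
PEAK_PER_SEC_PER_100K = 30    # rule of thumb: 100k in one hour is ~30/sec


def recap_row(millions_per_day: int, txns_per_user: int) -> tuple[int, int, int]:
    """Return (average/sec, 1hr peak @ 10%, 1hr peak @ 30%) from the rules of thumb."""
    scale = millions_per_day * txns_per_user
    average = scale * AVG_PER_SEC_PER_MILLION
    peak_10 = scale * PEAK_PER_SEC_PER_100K  # 10% of each million is 100k in the hour
    peak_30 = peak_10 * 3                    # 30% in one hour is three times that
    return average, peak_10, peak_30


for txns_per_user in (1, 10, 100):
    avg, p10, p30 = recap_row(1, txns_per_user)
    print(f"1m\t{txns_per_user}\t~{avg}\t~{p10}\t~{p30}")
```

Scaling to 10 million per day is just `recap_row(10, txns_per_user)`, which multiplies every column by 10.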

Sustained vs dynamic throughput?

It's rare that scale is sustained. It usually ebbs and flows. Sometimes, though, there isn't a peak. It's constant through the day. Examples of this would be:

Monitoring an engine
Monitoring a water filtration system
The data on a flight control system
Time of Day server
Etc.

You might think that a Time of Day server only experiences its peak at 1 am when most computers query it. That's not the case. This kind of service becomes active at 1 am local time in each of the 24 time zones. It likely experiences its peak at the beginning of each hour as each new time zone makes its query. I would classify this as a sustained scale. Meaning, we must treat each hour like a peak time.

One Million in Quantity

Quantity is different. We're actually talking about capacity here. The following is a simple chart to help with computing different quantities. And again, we're thinking in terms of 1 million:

An Int32 = 4 bytes. That's 4 million bytes (~4 MB) per million.
An Int64 = 8 bytes. That's 8 million bytes (~8 MB) per million.
A float = 4 bytes (single precision) or 8 bytes (double precision).
A JavaScript boolean = ~4 bytes (this depends on the engine).
An Object is the size of its own metadata and that of all its members, + the size of the data for each member.
A UTF-8 Char in English is usually 1 byte.
A UTF-8 Char in other languages is 1-4 bytes. Most Chinese characters are 3 bytes. So if you support other languages, think of UTF-8 chars in terms of 2 or 3 bytes. I'd assume 3.
@ 3 bytes per char, that's 3 million bytes per million chars.
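
These sizes are easy to verify. The sketch below uses Python's standard `struct` module for the fixed-width types and `str.encode` for UTF-8. It measures raw data sizes only, not the per-object overhead a language runtime adds, and the `utf8_bytes` helper is a hypothetical name.

```python
import struct

ONE_MILLION = 1_000_000

# "<" selects struct's standard sizes, independent of the platform.
sizes = {
    "int32": struct.calcsize("<i"),    # 4 bytes
    "int64": struct.calcsize("<q"),    # 8 bytes
    "float32": struct.calcsize("<f"),  # 4 bytes
    "float64": struct.calcsize("<d"),  # 8 bytes
}

for name, size in sizes.items():
    print(f"1M x {name}: {size * ONE_MILLION:,} bytes")


def utf8_bytes(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One character each: ASCII letter, accented Latin letter, Chinese character.
for ch in ("a", "\u00e9", "\u4e2d"):
    print(f"{ch!r}: {utf8_bytes(ch)} byte(s)")
```

Characters outside the Basic Multilingual Plane, such as most emoji, take 4 bytes in UTF-8, so assuming 3 bytes per char is an estimate rather than a hard upper bound.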

BONUS: Know your bottlenecks.

Just because you've determined your bandwidth can support 120/sec doesn't mean your entire system can. Parts of the system that might not be able to keep up: