Many conversations about distributed systems come back to one topic in particular: size. When people talk about size, they’re usually referring to the size of their system itself. And when it comes to distributed systems, it turns out, size really does matter.
The field of computing itself has a surprisingly long history; when it comes to distributed systems, however, the story is a little different. Distributed computing as we know it today started springing up in the late 1960s and early 1970s, when computers began operating on networks (think Ethernet!). Within a decade, distributed computing was becoming more widespread and more widely researched. Interestingly, systems began becoming “distributed” with the creation of one massive distributed system: the Internet. By the 1980s, computers were becoming more ubiquitous, and access to the Internet (and other networks) was becoming more commonplace. Without going into too much historical detail here, we can just say that, steadily, there were more machines, and they were all becoming more interconnected with one another.
This kind of “growth” has a name of its own in the realm of distributed systems. In fact, many engineers (even non-distributed systems engineers) use it fairly loosely these days. I’m talking, of course, about scalability. So, let’s dive into what a scalable distributed system is, and the different ways in which a system can grow!
The idea of scalability can mean many different things depending on the system we’re dealing with, as well as which part of the system is growing. But speaking more generally, we can say that the scalability of a system reflects its ability to handle tasks as the system grows in size. In the same vein, a scalable system is one that can continue to do its job and perform effectively as it grows.
But what is it that is “growing”, exactly? What part of the system is increasing here?
When we refer to how a system grows, we could be talking about any of several different aspects of the system increasing. In other words, a system can very easily grow in different directions.
For example, we could say that a system can grow by the number of users who are using or interacting with it. If a system that has 5,000 users working within it suddenly experiences an uptick in users and sees 50,000 users…well, that is certainly one kind of growth!
However, a system could also grow in terms of its own resources. For example, a system could receive a high amount of traffic one day and end up with far more data to process. Alternatively, a system could experience high usage during a certain period, which might lead to a significantly higher number of processes running at a given time. Yet another example: a single system could grow in the number of physical machines within it; it could require more actual machines or servers (not just processes!). Whether the resource in question is data, a process, or a machine, each of these represents a way in which a system might need to grow. And of course, as we discovered earlier, even just an onslaught of users is a kind of growth in a system, too.
In truth, distributed systems in the real world rarely grow in just one direction, or in a vacuum. Instead, they tend to grow in several directions at the same time. For example, growth in users may cause the amount of data to grow as well. And because of that larger data size, we may add more servers or databases to handle it.
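To make that chain of growth concrete, here is a small back-of-the-envelope sketch. It reuses the 5,000 and 50,000 user counts from earlier. Every other figure (`SECONDS_BETWEEN_REQUESTS`, `DATA_PER_USER_MB`, `SERVER_CAPACITY_RPS`) is a hypothetical assumption chosen only for illustration. `plan` is a made-up helper, not part of any real tool.

```python
import math

# Hypothetical figures, chosen only to illustrate how growth compounds.
SECONDS_BETWEEN_REQUESTS = 5   # assume each user sends one request every 5 s
DATA_PER_USER_MB = 50          # assume each user accounts for 50 MB of stored data
SERVER_CAPACITY_RPS = 500      # assume one server handles 500 requests per second


def plan(users: int) -> dict:
    """Estimate request load, stored data, and server count for a user base."""
    requests_per_sec = users / SECONDS_BETWEEN_REQUESTS
    data_gb = users * DATA_PER_USER_MB / 1024
    servers = math.ceil(requests_per_sec / SERVER_CAPACITY_RPS)
    return {"users": users, "rps": requests_per_sec, "data_gb": data_gb, "servers": servers}


for n in (5_000, 50_000):
    print(plan(n))
```

With these assumed numbers, a tenfold jump in users means ten times the data (about 244 GB to about 2,441 GB) and ten times the servers (2 to 20). Real systems rarely scale this neatly. Still, the sketch shows how growth along one dimension spills into the others.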
A real-life distributed system tends to grow in more than one way.
Given that there are so many different ways a system can grow, it makes sense to talk about the scalability of a system in terms of the way it has grown, too. As it turns out, there are a few terms that help us describe more concretely how a system scales.
We can measure the scalability of a distributed system in three main ways: size scalability, geographical scalability, and administrative scalability. These three ways of measuring how a system scales are often referred to as scalability dimensions.
Let’s explore each one in a bit more depth.
The first scalability dimension lies at the crossroads of two things we’ve already discussed: users and resources (whatever types of resources they may be). We know that both of these facets can increase in number; when they do, we say that the size of the system has increased.
When a distributed system increases in size, we need to consider its size scalability: what the system will do in order to keep performing efficiently as its size (in users or resources) increases. A size-scalable system remains performant and usable for its users, regardless of how many users or resources it has.
For example, let’s say that our distributed system receives high traffic on a specific day; because of the high traffic, we may realize that the number of databases or servers in our system just can’t handle the heavy influx of requests.
We might choose to grow the size of our database, or instead opt to add more servers to process the larger stream of data for a few hours. However, if our system truly scales in size, adding more nodes should not disrupt its operations or performance. In other words, if we add more nodes to handle an uptick in resources or users, we shouldn’t degrade the performance or efficiency of the system while doing so. The system should continue to work as expected for our users…and it certainly shouldn’t get slower for them!
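One common way to add servers without interrupting traffic is to put them behind a load balancer. The sketch below uses round-robin routing, which is just one possible strategy swapped in for illustration. `ServerPool`, `add_server`, `route`, and `spread` are hypothetical names, not a real library's API. The sketch shows a third server joining mid-stream and the per-server load dropping, with no pause in routing.

```python
from collections import Counter
from itertools import count


class ServerPool:
    """Hypothetical round-robin load balancer (an illustration, not a real library)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._ticks = count()  # ever-increasing request counter

    def add_server(self, name):
        # The new server is simply appended; routing never pauses,
        # and it starts receiving requests on the next call to route().
        self.servers.append(name)

    def route(self):
        # Rotate through whatever servers are currently in the pool.
        return self.servers[next(self._ticks) % len(self.servers)]


def spread(pool, n_requests):
    """Count how many of n_requests each server receives."""
    return Counter(pool.route() for _ in range(n_requests))


pool = ServerPool(["server-a", "server-b"])
before = spread(pool, 600)
pool.add_server("server-c")
after = spread(pool, 600)
print(before)
print(after)
```

With two servers, 600 requests split 300/300. After `server-c` joins, the next 600 split 200/200/200. A real system would also need health checks, connection draining, and shared state, all of which this toy leaves out.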
So far, we’ve talked about adding more resources or users into a functioning system. But there’s more to the story! As a system grows, we may find that adding more nodes into the mix doesn’t entirely solve everything; in fact, sometimes it can present unforeseen consequences of its own.
At first glance, it might seem easy enough to add more nodes or resources into a system and call it a day. However, this is a slightly naive approach to an otherwise complex problem. Geographical and administrative scalability are two more nuanced dimensions of a distributed system’s scalability. Whereas size scalability is an obvious way to think about growing a distributed system, these two dimensions are less apparent. But we would be remiss to skip them!
So let’s dig right in.
As a system begins to grow, it might seem easy to simply throw more nodes at it and assume that our problem of scalability is solved. Easy, right? Not exactly (nothing is ever that simple).
When our system begins to grow significantly in size, we have to consider another question: where exactly are our resources? And that’s where geographical scalability comes into play. A geographically scalable system is one that can continue to function efficiently and remain accessible to its users no matter the distance between the users and the resources of the system.
At first, this may seem like an odd or potentially ridiculous facet to consider, but stay with me here. Let’s say that, in order to make our system size scalable, we add more nodes in order to handle a higher load of users or resources. If those nodes are servers, those servers could be located in one place, or they could be located in different places. If we’re adding something as significant as a whole data center of servers, our new data center could be much further away from our preexisting one(s).
The seemingly simple act of adding another node actually complicates our ultimate goal: a system that continues to perform efficiently and remains easily accessible to our users. Depending on where our new and old nodes are located, and how far our nodes are from our resources and users, we could very well be adding new nodes to our system without ever accounting for the distance between them. The distance between resources in a system is the literal physical space between them: the geographical distance.
But why does geographical distance matter? Because it takes time for the pieces of a system to send data and communicate with one another. Two nodes in the same room can send each other information much faster than two nodes on opposite sides of the world. Unfortunately, this is just a limitation of…physics (or maybe geography?). No matter how fast your system can communicate and send data or messages internally, there are physical limits that even really great wifi can’t solve. In a geographically scalable system, adding new nodes shouldn’t significantly slow down communication between them; in other words, adding a node should take into account its physical location relative to the rest of the system.
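A quick calculation shows why distance sets a floor that no amount of engineering can remove. This sketch assumes light in optical fiber travels at roughly 200,000 km/s, about two-thirds of its speed in a vacuum. It also ignores routing, queuing, and processing delays, so it gives a lower bound, not a prediction. `min_round_trip_ms` is a hypothetical helper.

```python
# Assumption: light in optical fiber covers roughly 200,000 km per second.
FIBER_SPEED_KM_PER_S = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time between two nodes, in milliseconds.

    Counts only signal propagation (there and back); real networks add
    routing, queuing, and processing delays on top, so actual times are higher.
    """
    return 2 * distance_km * 1000 / FIBER_SPEED_KM_PER_S


same_room = min_round_trip_ms(0.01)    # two nodes about 10 m apart
far_apart = min_round_trip_ms(20_000)  # roughly halfway around the Earth
print(f"same room: {same_room:.4f} ms, halfway around the world: {far_apart:.0f} ms")
```

Two nodes 10 meters apart have a floor of about 0.0001 ms per round trip. Nodes roughly 20,000 km apart (about halfway around the Earth) can never do better than 200 ms. A protocol that needs several round trips per request pays that cost each time, which is why node placement matters.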
Finally, as a system grows in size and location, one last question remains: who is going to oversee the newly added parts of this ever-expanding system? Enter administrative scalability. In an administratively scalable system, adding more nodes shouldn’t require a significant increase in the effort of managing them. This is particularly important if a system is used by multiple administrators (either managers or organizations that run the system while sharing it with others), each of whom expects to be able to use the system even while others are sharing it.
Administrative scalability is another one of those “easy-to-forget” scalability dimensions. Just as it is easy for us to think that we can simply throw a node at a problem in order to make our system scale, it is also easy for us to forget that adding resources to a system means taking on some form of future effort to maintain them.
In an administratively scalable system, adding more resources into a system as it grows shouldn’t require a large increase in administrative overhead. If we were to add a whole data center of nodes in order to make our system scale, then we must be able to account for the amount of manual work — human engineering and management time — that it will take to maintain those new nodes and keep them running (recall that we must keep them running in order to keep our system performant for our users!).
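A toy cost model helps show why administrative overhead deserves attention. The sketch compares two setups. The first is purely manual upkeep, whose cost grows linearly with node count. The second pays a fixed cost for shared tooling plus a small per-node cost. Tooling is one assumed approach, not the only one. All of the minute figures and the `monthly_admin_hours` helper are hypothetical.

```python
# Toy model: all figures below are hypothetical, chosen only for illustration.
MANUAL_MINUTES_PER_NODE = 120   # assumed hands-on upkeep per node per month
TOOLING_FIXED_MINUTES = 2400    # assumed monthly cost of running shared tooling
TOOLED_MINUTES_PER_NODE = 3     # assumed residual per-node effort with tooling


def monthly_admin_hours(nodes: int, tooled: bool) -> float:
    """Estimated human hours per month needed to keep `nodes` machines running."""
    if tooled:
        minutes = TOOLING_FIXED_MINUTES + nodes * TOOLED_MINUTES_PER_NODE
    else:
        minutes = nodes * MANUAL_MINUTES_PER_NODE
    return minutes / 60


for n in (10, 100, 1000):
    manual = monthly_admin_hours(n, tooled=False)
    tooled = monthly_admin_hours(n, tooled=True)
    print(f"{n:>5} nodes: manual {manual:7.1f} h, tooled {tooled:5.1f} h")
```

Under these assumptions, manual upkeep is cheaper at 10 nodes (20 vs. 40.5 hours a month). At 1,000 nodes, however, manual upkeep balloons to 2,000 hours while the tooled setup needs 90. The exact numbers are invented; the shapes of the two curves are the point.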
Many times, when developers talk about “making something scale”, they’re only really thinking about the size scalability dimension of a system. However, things can be measured in more ways than one, and the same is true of distributed systems. Size scalability is one aspect of the problem, but it’s important to consider geographical and administrative scalability, too! And as we dig deeper into distributed systems, we’ll see that each of these more nuanced forms of scaling presents its own problems to solve as well.
But that’s a problem for another day!
In my readings on the dimensions of scalability, I found that there was still so much to learn. As a system grows larger, the questions to answer and situations to consider start to multiply! This post just scratches the surface of system scalability; if you’re interested in further reading, check out these materials below.
- A Brief Introduction to Distributed Systems, Maarten van Steen & Andrew S. Tanenbaum
- On System Scalability, Charles Weinstock & John Goodenough
- Distributed Systems for Fun and Profit, Mikito Takada
- An Introduction to Distributed Systems (Lecture Notes), Dr. Tong Lai Yu