Terms like "graceful degradation" are coming into vogue alongside the emerging discipline of chaos engineering and evolving understandings of what it means to make a distributed system fault-tolerant.
Take, for example, the case of Netflix, where the term "chaos engineering" got its start. If they experience a hiccup and can't stream their normal catalog of movie titles, they might display a polite error message and instead offer a more limited selection of "featured titles". If your personalized recommendations can't be retrieved as they normally would, the platform might opt to show you the 50 most popular movies as chosen by viewers. The point is that, to their credit, Netflix will rarely "fail hard" but instead will "fail soft" by gracefully degrading the functionality they provide to ensure a reasonable user experience.
In short, systems that degrade gracefully are those that prioritize availability over completeness. This is a hard but often necessary tradeoff when evaluating service quality.
Yet prioritizing availability over other core functionality is nothing new to the distributed systems community. The CAP theorem, otherwise known as Brewer's Theorem, has long shown us that in the presence of network partitions within a distributed system, you must choose between that system's availability (its ability to serve clients) and its data consistency (all parts of the system agreeing on the same values at the same time). Therefore, it should be no surprise that a related paper by Armando Fox and Eric Brewer lays out a foundation for following through on this sort of tradeoff.
In "Harvest, Yield, and Scalable Tolerant Systems", Fox and Brewer say quite a lot in just a few short pages. Leaving aside the fascinating statements that the authors make regarding the interpretation of the CAP theorem (strong and weak!), the paper introduces the super-useful concepts of harvest and yield.
Fox and Brewer borrow the two terms from engineering and define them as follows:
"We assume that clients make queries to servers, in which case there are at least two metrics for correct behavior: yield, which is the probability of completing a request, and harvest, which measures the fraction of the data reflected in the response, i.e. the completeness of the answer to the query."
So now we have two definitions:
Yield: a measure of a distributed system's ability to provide an answer.
Harvest: a measure of that answer's completeness.
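To make the two definitions concrete, here is a minimal Python sketch that computes both metrics from a small request log. The `Response` record and its field names are hypothetical, invented purely for illustration; the paper does not prescribe any such structure.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical record of one client request (field names are illustrative)."""
    completed: bool       # did the system return any answer at all?
    parts_answered: int   # pieces of data reflected in the answer
    parts_total: int      # pieces of data a complete answer would reflect


def yield_of(log: list[Response]) -> float:
    """Yield: fraction of requests that were completed (assumes a non-empty log)."""
    return sum(r.completed for r in log) / len(log)


def harvest_of(response: Response) -> float:
    """Harvest: fraction of the data reflected in a single response."""
    return response.parts_answered / response.parts_total


log = [
    Response(completed=True, parts_answered=4, parts_total=4),   # full answer
    Response(completed=True, parts_answered=3, parts_total=4),   # degraded answer
    Response(completed=False, parts_answered=0, parts_total=4),  # no answer
    Response(completed=True, parts_answered=4, parts_total=4),   # full answer
]

print(yield_of(log))       # 0.75: three of four requests were answered
print(harvest_of(log[1]))  # 0.75: the degraded answer reflects 3 of 4 parts
```

Note that the two numbers are independent: the second request counts fully toward yield even though its harvest is below 1.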
To illustrate this concept, let's consider a simplified scenario involving a constellation of four microservices feeding data to the fictional online store for Captain Carolina's Dive Shop. Now, Captain Carolina cares just as much about designing effective software systems as she does about selling scuba lessons or surfboards, so her engineers have split things up into four logical units:
Search service: Provides a searchable accounting of the store's product inventory.
Shopping cart service: Aggregates orders based on product inventory.
Account service: Manages everything from first-time marketing traffic through customer preferences and contact details.
Recommendation engine: Based on browsing habits and purchase history, provides the sort of "you might like" suggestions someone would come to expect from an online store.
Every user who comes to the site is a potential customer, so we want to provide them with the best experience possible. However, even with backing-service SLAs sitting squarely in the four-nines region, no service works perfectly all the time. Therefore, we can use the dual concepts of harvest and yield to reason about the behavior of Captain Carolina's storefront.
Since things have been broken up into the aforementioned microservices, any one of them, or even several, can fail and the site will still be up, as long as they do not all fail at once. If the shopping cart fails, visitors can still search the catalog. If the search service fails, users can still browse. If the account service fails, users can always purchase as a guest. If recommendations can't be provided, the general operation of the store is not impeded; we merely sacrifice a chance to upsell.
By isolating and compartmentalizing logic, we have also isolated and compartmentalized failure. Furthermore, in this example we have traded harvest (completeness) for yield (availability).
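One way to picture this compartmentalization is a page renderer that calls each service independently and simply leaves out any section whose service fails. The sketch below is an assumption-laden illustration, not Captain Carolina's actual code: the four stub functions, the `SERVICES` table, and `render_storefront` are all hypothetical names, and the recommendation outage is simulated by raising an exception.

```python
from typing import Callable


def search_catalog() -> list[str]:
    return ["wetsuit", "surfboard", "scuba lessons"]


def load_cart() -> list[str]:
    return []


def load_account() -> dict[str, str]:
    return {"name": "guest"}


def recommend() -> list[str]:
    # Simulated outage of the recommendation engine.
    raise ConnectionError("recommendation engine unavailable")


# Hypothetical registry mapping each page section to the call that fills it.
SERVICES: dict[str, Callable[[], object]] = {
    "search": search_catalog,
    "cart": load_cart,
    "account": load_account,
    "recommendations": recommend,
}


def render_storefront(services: dict[str, Callable[[], object]]):
    """Call every service; omit the sections that fail instead of failing the page."""
    page: dict[str, object] = {}
    failed: list[str] = []
    for name, call in services.items():
        try:
            page[name] = call()
        except Exception:
            # Broad catch for brevity; a real system would be more selective.
            failed.append(name)
    return page, failed


page, failed = render_storefront(SERVICES)
print(sorted(page))  # ['account', 'cart', 'search']
print(failed)        # ['recommendations']
```

The page still renders with three of its four sections, which is exactly the trade of harvest for yield described above.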
While Fox and Brewer remind us that yield is usually measured in nines of uptime, they don't provide much guidance on harvest. In the world of microservices, you might define harvest as the fraction of services that are responding:
Harvest = Responding Services / Total Services
Or more compactly:
H = R / S
Therefore, if just one of Captain Carolina's services went down, we could conclude that H = 3/4 = 0.75. In fact, our yield would only be compromised if all of our services became unavailable at once, that is, when H = 0/4 = 0!
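The arithmetic is simple enough to check by hand, but a short sketch makes the relationship between the two metrics explicit. The function names `harvest` and `request_answered` are hypothetical, and the rule that a request is answered whenever at least one service responds is this article's simplified model rather than anything taken from the paper.

```python
def harvest(responding: int, total: int) -> float:
    """H = R / S: the fraction of services that responded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return responding / total


def request_answered(responding: int) -> bool:
    """Simplified model: the site still answers while any service responds."""
    return responding > 0


# Walk from a fully healthy store (R = 4) down to a total outage (R = 0).
for r in range(4, -1, -1):
    print(f"R={r} H={harvest(r, 4):.2f} answered={request_answered(r)}")
```

Under this model, harvest falls in steps of 0.25 as services drop out, while the request only goes unanswered in the final row, where R = 0.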
This article uses microservices as an example even though a microservices architecture isn't for every product, team, or problem. Generally, I hope that the paired terms of harvest and yield help you to have better conversations about service quality in distributed systems. If you liked anything about this post, please go ahead and read the paper!