
Marco Salis


Why clients should fail fast on API contract violations

(repost from my Medium article on 28/03/21)

My view on error management in client/server communications when clients rely on data from a RESTful API.

TL;DR: don't "make up" data or fall back to arbitrary defaults on a client when the data returned by an API violates its contract. Nothing is worse, for the user, than being shown incorrect or incomplete data on screen without warning. A fail-fast approach and an effective policy for notifying errors are the only way to provide a solid, reliable and professional-behaving client to users. Any other strategy might show fewer errors to a user and give a false impression of stability, but it will only erode their trust in the data they see and ultimately increase frustration. Silently swallowing and eagerly managing server data errors on the client also reduces scalability and leads to an increase in development time and technical debt in the long term.

On API contracts

Most RESTful API technologies provide ways to define what data a client should expect in a response/request payload. It's no coincidence that most of them also provide ways to enforce a more or less strict contract, so that both parties "speak the same language" and fail when they don't (and so that it's clear which changes are safe and which are backwards incompatible).

Without the need to get into the extensive and controversial topic of which technology is better (or even simply listing all the available ones), it is safe to say that some are better than others when it comes to preventing client/server communication issues.

Protocol buffers (and the many systems derived from them) provide ways to auto-generate data models from the same shared schema, so that an incompatible update will most likely cause a compilation error on the client side.

On the other hand JSON, one of the most universally (ab)used data interchange formats, has schema definitions (such as JSON Schema) that allow specifying whether a field is required or optional, and a myriad of tools and frameworks to document and enforce that. But let's face it: most developers and companies are still using the good old "manual mapping", based on faith, scarce documentation and an unrelenting instinct for self-flagellation and resignation.
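A JSON Schema, for instance, can make required fields explicit. Here is a minimal sketch for a hypothetical account payload (the field names are made up for illustration):

```json
{
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "balance": { "type": "number" }
  },
  "required": ["id", "balance"]
}
```

A payload missing `balance` fails validation outright, rather than being left to each client's interpretation.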
That's where the issues come in, and there isn't as much discussion on the topic as there should be.

Consuming API data on clients

A good deal of client applications, mobile or desktop, ultimately just do one thing: fetch data from a server and present it to the user. Depending on the type and complexity of the application, this can be done in a straightforward, no-frills way, or using some sort of caching strategy (ETags & HTTP caches, DBs, disk and so on) and offline state management.

In this scenario, data stored locally in the client is a representation of what the server provided at a certain point in time (and, consequently, more or less up to date from the user's perspective). But it is the "server entity" that contains the real state, and is in charge of delivering it to multiple clients and managing its consistency through time.

As a consequence, when it comes to data, clients should always rely on the server as the single source of truth, and avoid messing with data or "making up" for any missing information for the sole sake of not showing failure.
It is ultimately counterproductive to alter data in a client's cache, try to guess unexpected missing contents, or go to any lengths to show incomplete/incorrect data to the user (more on this later).

Defensive programming vs data consistency

"But hey, I've always been taught to code defensively and take into account corner cases and error states!"

Yes indeed, you should. Error management is one of the most crucial aspects of code quality, and managing errors/exceptions reliably and consistently is a requirement and a sign of well written code.
However, that does not, in any way, imply that clients should never fail or never present error messages to the user.

What the best practice actually says is that given any input, our code should react in a predictable way and do its best to recover gracefully from unexpected states. Nobody ever said that clients should mess with the input to force it onto a success path in their own logic.
The application must do its best to allow the user to continue what they were doing without losing work or being shown inconsistent data.


Fail fast

Should an app crash when getting unexpected input then? Of course not! When faced with an API that violates its contracts, however, a client shouldn't try to eagerly recover by setting arbitrary defaults and modifying the data locally, but instead:

  • Avoid presenting incomplete data to the user (if available, show cached data, always making it clear to the user that what's shown might not be the most up to date).
  • Provide the user with a clear error message, reassuring them that the error has been reported and, when it makes sense, offering them a way to retry (e.g. via a button, or a pull to refresh action).
  • Promptly log the error to an error reporting system, along with as many details as possible, so that it can be detected as quickly as possible (and make sure it includes enough information for the API team to fix the issue).
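The three steps above can be sketched as follows. This is a hypothetical Kotlin example: `ContractViolation`, `ErrorReporter` and `ScreenState` are illustrative names, not from any specific framework.

```kotlin
// Hypothetical error type for a response that breaks the API contract.
class ContractViolation(message: String) : RuntimeException(message)

sealed class ScreenState {
    data class Content(val items: List<String>) : ScreenState()
    // Carries a user-facing message; the UI can pair it with a retry button.
    data class Error(val message: String) : ScreenState()
}

object ErrorReporter {
    // Stand-in for an error reporting system (Crashlytics, Sentry, ...):
    // report with as much detail as possible so the API team can act on it.
    fun report(t: Throwable) = println("Reported: ${t.message}")
}

fun onResponse(parsed: Result<List<String>>): ScreenState =
    parsed.fold(
        onSuccess = { ScreenState.Content(it) },
        onFailure = { t ->
            ErrorReporter.report(t) // log promptly, with details
            // no incomplete data; a clear, reassuring message instead
            ScreenState.Error("Something went wrong on our side. We've been notified; please try again.")
        }
    )
```

The key point is that the failure path produces an explicit, user-visible state rather than a half-populated `Content`.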

A client should never fall back to a default value when server data is missing, unless said value has been agreed upon beforehand and is specified as part of the API contract.

```kotlin
data class BankAccountResponse(
    val balance: Float? = 0f // would anybody ever do this??
)
```
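By contrast, the fail-fast version simply declares the field the way the contract does: required and non-nullable, with no default. A sketch, where `mapBankAccount` is a hypothetical manual mapper:

```kotlin
// Fail-fast version: the contract says 'balance' is required, so the model
// declares it non-nullable, with no default value.
data class BankAccountResponse(
    val balance: Float
)

// Hypothetical manual mapper: it throws on a contract violation instead of
// inventing a zero balance.
fun mapBankAccount(json: Map<String, Any?>): BankAccountResponse {
    val balance = json["balance"] as? Number
        ?: error("API contract violation: required field 'balance' is missing")
    return BankAccountResponse(balance.toFloat())
}
```

A missing `balance` now fails loudly at mapping time, where it can be reported, instead of silently propagating a fake account balance to the UI.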

Why "fail fast" is better

One could simply bring up a few undeniable truths in most client/server communications: a server deployment is usually much faster than a client-side one, especially when it comes to mobile applications (just think of the hours/days for an Apple App Store approval).

An API issue fixed on the server side is fixed at its source, as opposed to "patched" somewhere else. It is also fixed once, instead of having to be sorted out (often differently) on each client. So why would we want to "fix" it on the client?
There is much more to it, however, and the reasons below should turn this into an extremely compelling argument.

Let's consider a scenario where a GET endpoint of a RESTful API using JSON returns a response with a missing required field, hence violating its contract.
Let's then compare two different client-side approaches to the issue, taking into account the technical effort, the impact on scalability and, of course, the impact on the user.

The "eager" approach

Clients set arbitrary defaults or support a null value for the required field. Since the above is considered a valid state, no error messages are shown to the user.

Technical / Scalability impact

  • Mapping data models to classes requires lots of second-guessing and interpretation. Default values must be thought of, and the client's behaviour when dealing with unexpected nulls needs to be either set as a default (meaning it can't easily be overridden per endpoint) or chosen on a case-by-case basis (meaning no consistency throughout the codebase).
  • Default values and null states for fields become a client-side affair, hence they need to be unit tested. This quickly becomes a heavy burden on large response models or codebases with hundreds of them.
  • Response models provide no indication of the API's intent: they need to be constantly compared to the API specs (if any exist) and, unless they're heavily documented, a developer modifying them or trying to figure out why a specific default value was set needs to second-guess the class author's intentions (most likely getting it wrong). It becomes almost impossible to change a default value without the fear of breaking something unexpectedly.
  • Caching data becomes more complicated: when defaults are set and null values are accepted everywhere, there is no certainty as to whether the data stored in the client is "genuine" data coming from the source, or an artificial value. Developers are tempted to manually alter single cached values on business entities (which is not a great idea in itself, for reasons that are beyond the scope of this article), increasing the gap between the real data and what the user is presented with.
  • Handling scores of nullable fields means an increase in the total number of states to manage in the client. This often creates a ripple effect where all layers of the codebase are impacted, from the data to the business, presentation and view layers: each new state combination must be taken into account, and tested.
  • Different clients, using different languages, technologies and frameworks, will inevitably manage null and default values differently; every client will have to fix different API-related bugs and, unless communication between teams is nothing short of perfect, information will be lost and some bugs won't be fixed (or will be fixed much later).
  • API-related bugs are much more difficult for QA or integration testing to spot: since the client "hides" them (showing no explicit error), they slip through regular tests. If error reporting is not checked regularly and extremely carefully, those bugs might lie there undetected, wreaking havoc in the future when something in the API specs changes.
  • Clients appear to accept and deal with any type of data the API feeds them; this naturally leads to a false sense of security for the API teams, which will be happy to "cut some corners" when making changes to an API. It also inherently lowers the importance of a stable, well-documented API with strong contracts.

Impact on user

  • All the above-mentioned technical reasons contribute towards exponentially growing technical debt, especially on medium/large sized codebases. In the long run this undoubtedly causes a significant slowdown in the client's feature throughput, as developers try to manage technical debt and fix subtle API-related bugs.
  • Occasionally, the client will manage to successfully hide an API issue (e.g. a required field is suddenly removed by the API, but the client is not currently using it, or is using it on a different screen). With time, you will realise that this is the exception rather than the rule. The few error messages saved will be vastly overshadowed by the times this approach hides a real API issue, or shows incorrect and incomplete data to the user.
  • The accumulation of undetected error states frustrates the user over time. Even subtle differences will eventually be noticed, especially when different clients (e.g. web and mobile) are used. The user will lose confidence in the application, eventually abandoning it or, in the best case, being forced to constantly refresh the data manually.
  • Without a clear and repeatable error message, it is difficult for the user to report the issue. They don't know whether the issue is a glitch or somehow caused by how they used the client, and they're much less likely to go to the trouble of creating a support ticket. Add up several small inconsistencies and undetected bugs over a long period, and the support team is more likely to get a stream of ragey messages than clear, helpful, rational steps to reproduce.

The fail-fast approach

Clients don't show incomplete data, nor do they set defaults. A clear, meaningful error message or state is shown to the user for the missing field and the resulting inability to show the data (all or part of it).

Technical / Scalability impact

  • Mapping data models to classes is a straightforward affair. Nullability of fields is clearly documented by the API. As a consequence, little to no default values are necessary (and only when they are already part of the API documentation).
  • Data response models accurately reflect the API documentation: they are a reference to the API itself, rarely requiring additional explanations or testing. In the best case scenario, these can be auto-generated using a tool to further lower the chance of human error.
  • Change logs on response models solely represent a modification in the API specs (often requiring an explicit, mutually agreed and documented reason from the API team).
  • QA and/or integration tests catch API issues more easily. A visible, clear error message ensures a human can promptly realise something is wrong through simple exploratory testing, while client/server integration tests can detect that data is not presented correctly by the client. Either way, the issue will be flagged much earlier and fixed more quickly.
  • Clients handle the API consistently: with no client-side patches, the API contract is the only specification needed. It is also easier to realise when one of the clients is not following it correctly because a valid API response generates errors.
  • API teams are naturally pushed to respect the contracts and avoid backwards incompatible changes on a production API. It is only a natural reaction as they see the results of the issue first-hand, and they will sometimes even be grateful to the client for spotting a potentially tricky issue before it does any major damage. What's more, failing fast helps API teams detect which types of changes are more likely to affect clients, so they can improve their testing process (and API consistency).

Impact on user

  • Seeing an error message and, sometimes, being unable to see the data they need is inevitably frustrating for the user. On the other hand, they can be confident that, when there is no error, the information the client is presenting is rock solid, complete and up to date (caching tip: never show outdated data without notifying the user of the possibility).
  • The user will appreciate the honesty. There is clearly an issue, but they are given reassurance that the developers have accounted for it, that it has been somehow reported and that they will (hopefully) work on a fix soon. In a way, the user feels like they are on the same side as the developers. Give it a witty, catchy copy ("team of highly trained monkeys", anyone?) and they will even feel "complicit".
  • An error message or state, especially if recurring, can be easily remembered by a user, or even screenshotted. This makes it a lot easier to report, and they are much more likely to do so if they can "prove it happened". They will feel confident that, even without having to write lengthy steps to reproduce and the need for endless customer support message exchanges, the issue could be fixed (whether that's actually true is a different story).

The exception

It is fair to say that, in some limited circumstances, the "eager" approach is the only viable solution. It is the case with legacy or third party APIs that need to be used without documentation or a proper contract. When there is no assurance about the intent of the authors of the API, it is necessary to eagerly accept any and all values possible.

It is fundamental to realise that this choice will have a serious impact on technical debt and possibly turn into a maintenance and stability nightmare, and it should only be considered as a last resort.

If a new API (whether third-party or developed internally) does not provide documentation and a contract for clients to rely on, it is very likely a recipe for disaster. You should consider, when possible, temporarily halting client-side development and having a serious conversation with the providers of that API.

Fail fast, but "granularly"

One of the main objectives of UX is to ensure that the user gets the smoothest, most consistent and predictable experience possible from the application they're using. With this being one of the utmost priorities in every technical choice we make, it's important to point out that an API error should not necessarily lead to an empty screen with a big error message, preventing the user from doing anything on a certain feature or screen.

That is only the last resort, to be taken when the missing data represents the "core" of a certain screen. If the user would miss the most important information on a business entity, or could be misled into thinking the entity is in a different state than the real one, no data should be shown. An error screen with a clear message, a reassurance that the issue will be looked at soon and a button to retry the loading will serve the user much better.

In any other circumstance, the more granularly the "fail fast" strategy is applied, the better. If a screen presents data from different RESTful requests, developers should do their best to show all the valid data and contain the error messages or states to the specific section impacted by the error. It is vital that users can still use the sections of the screen not impacted by the API error while the bug is being fixed.
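One way to contain failures per section is to model each section's state independently, so one failed request only degrades the part of the screen it feeds. A sketch (the section names and `SectionState` type are hypothetical):

```kotlin
// Each section of the screen tracks its own state; an error in one
// request only affects the section it feeds.
sealed class SectionState<out T> {
    data class Content<T>(val data: T) : SectionState<T>()
    data class Error(val message: String) : SectionState<Nothing>()
}

// Hypothetical screen backed by two independent requests.
data class DashboardState(
    val profile: SectionState<String>,
    val transactions: SectionState<List<String>>
)

fun <T> Result<T>.toSection(): SectionState<T> =
    fold(
        onSuccess = { SectionState.Content(it) },
        onFailure = { SectionState.Error("This section couldn't be loaded. Tap to retry.") }
    )
```

With this shape, a contract violation in the transactions endpoint leaves the profile section fully usable, and vice versa.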

Equally importantly, this usually applies to limiting the scope of the failure within the API response as well: if we are using a GET endpoint to retrieve a list of entities, we only want to fail on the elements that don't respect the contract. The others can be presented, as long as the user is notified about the error and the fact that the list is potentially incomplete or not up to date. Many JSON parsing frameworks, for example, allow customising error-handling settings, or at least provide ways to override the parsing logic for element arrays.
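The per-element idea can be sketched like this, with a hypothetical `parseAccount` standing in for your JSON framework's per-element hook:

```kotlin
data class Account(val id: String, val balance: Float)

// Hypothetical per-element parser: succeeds for elements that respect the
// contract, fails for the ones that don't.
fun parseAccount(raw: Map<String, Any?>): Result<Account> = runCatching {
    Account(
        id = raw["id"] as? String
            ?: error("contract violation: 'id' missing"),
        balance = (raw["balance"] as? Number)?.toFloat()
            ?: error("contract violation: 'balance' missing")
    )
}

// Keep the valid entries and collect the failures separately, so the
// failures can be reported and the user told the list may be incomplete.
fun parseAccounts(raw: List<Map<String, Any?>>): Pair<List<Account>, List<Throwable>> {
    val results = raw.map(::parseAccount)
    return results.mapNotNull { it.getOrNull() } to results.mapNotNull { it.exceptionOrNull() }
}
```

The valid accounts go to the UI; the collected failures go to the error reporting system and trigger the "this list may be incomplete" notice.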


Conclusions

Trying to eagerly fix any kind of server data issue on the client side might appear to be the best option in the short term: the user can see data, even if potentially incomplete or incorrect, and the client can sometimes recover and avoid showing any error state. However, this approach is not scalable, as it forces clients to handle (and therefore test) states they shouldn't need to, making them harder to maintain and extend as the number of endpoints to support grows. It also gives a false sense of security to those who maintain the API: they'll grow confident that clients can handle whatever they throw at them and, as a natural consequence, they'll be less strict on quality, respecting contracts and backwards compatibility.

With time, development speed and stability of the client will decline, technical debt will grow and users will experience more subtle bugs and inconsistencies in the data (especially when using multiple clients). Their trust in the client will crumble, leading them to constantly question the integrity of the data they are shown, forcing them to refresh it manually, or driving them to abandon the client altogether out of frustration. They will feel that the application is "lying" to them, hiding errors by pretending to recover from them.

On the other hand, a fail fast approach is honest with the user. Error messages due to data issues suck, and they should be avoided with a strict API contract and a solid policy for client/server communication. However, they are still the lesser evil: a well worded, witty, clear error message also promptly informs the user that the data they're seeing is outdated or incomplete; it prevents them from making mistakes by using that data, it makes them feel informed, and on the same side as the developers. It will give them satisfaction when they are (hopefully quickly) fixed, and give them a feeling that the developers are there to respond to their frustrations.
