Leveraging third-party APIs is a great way to add features and functionality to your software products, but each one introduces new risks that need to be managed. Unexpected problems can lead to feature breakage or even a total outage, which means serious consequences for users and lost revenue for your business. With enough preparation, you can anticipate and solve these issues. Here are three things your software team needs to do when calling third-party APIs:
When calling a third-party API, it is important to have basic visibility into:
- How many calls are being made
- How long those calls are taking
- How many errors are being returned from a call
- Headers and bodies for requests and responses
Most API providers don’t give you this kind of visibility, but having the above records is critical for maintaining a robust and available service.
Many APIs enforce rate limits or quotas, which often go unnoticed while call volumes are low in development, but appear unexpectedly after going into production. Having visibility into the number of calls is also important to keep an eye on the health of the system. If calls drop to zero or suddenly spike, it could be an indication of a problem elsewhere that needs to be addressed.
Once a problem is noticed, an engineer will often need to inspect the call log to get to the root cause. If the problem has never been encountered before (for example a rate limit, expired credential or internal server error), an engineer will learn about this by inspecting the response body. If the body is not already recorded, they will need to add logging and do an emergency production deployment before being able to further diagnose the issue. Having comprehensive logging in place from the start will reduce the time to resolution and eliminate emergency build deployments.
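As a rough illustration of the logging described above, the sketch below wraps a call so that its duration, status, errors, headers and bodies are all recorded. Everything named here is hypothetical: `logged_call`, `CallRecord`, `Response`, the in-memory `CALL_LOG` and the list of sensitive headers are assumptions made for the example, and `send` stands in for whatever HTTP client you actually use.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("third_party_api")

# Header names whose values are kept out of the log (an assumed list).
SENSITIVE_HEADERS = {"authorization", "x-api-key"}


@dataclass
class Response:
    """Hypothetical minimal response shape; adapt it to your HTTP client."""
    status: int
    headers: Dict[str, str]
    body: str


@dataclass
class CallRecord:
    api: str
    duration_s: float
    status: Optional[int]   # None when no response was received
    error: Optional[str]    # repr of the exception, if one was raised
    request_headers: Dict[str, str]
    request_body: str
    response_headers: Dict[str, str]
    response_body: str


# In-memory store for illustration only; a real system would send records
# to its logging or metrics backend instead.
CALL_LOG: List[CallRecord] = []


def redact(headers: Dict[str, str]) -> Dict[str, str]:
    """Replace the values of sensitive headers before they are logged."""
    return {
        name: "<redacted>" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def logged_call(
    api: str,
    send: Callable[[Dict[str, str], str], Response],
    headers: Dict[str, str],
    body: str,
) -> Response:
    """Call send(headers, body) and record the outcome, success or failure."""
    start = time.monotonic()
    response: Optional[Response] = None
    error: Optional[str] = None
    try:
        response = send(headers, body)
        return response
    except Exception as exc:
        error = repr(exc)
        raise  # the caller still sees the original exception
    finally:
        # Runs on both the success and the failure path.
        record = CallRecord(
            api=api,
            duration_s=time.monotonic() - start,
            status=response.status if response is not None else None,
            error=error,
            request_headers=redact(headers),
            request_body=body,
            response_headers=response.headers if response is not None else {},
            response_body=response.body if response is not None else "",
        )
        CALL_LOG.append(record)
        logger.info("api_call %s", record)
```

Because the response body is captured on every call, a first-time failure such as a rate-limit message can be read straight from the record rather than after a new deployment. Counting records per API gives the call volume, and `duration_s` and `status` feed latency and error tracking.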
Frequently, API providers will encounter service degradation resulting in increased call response times or intermittent errors. Once comprehensive logging is in place, it is important to create alerts so that issues are identified before they are reported by users. Without monitoring, many teams assume they are not having issues and are surprised to find that problems were simply going unnoticed.
At a minimum, alerts should be put in place for:
- 95th percentile latency above threshold
- Error rate above threshold
Latency thresholds can be determined on a per-API basis, but a good default for most providers is one second. Errors are determined by the status code or a connection failure. Once these are logged, an alert should be created so you know as soon as one of your integrations is failing.
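The two alert conditions above could be evaluated over a recent window of logged calls along these lines. This is a minimal sketch under stated assumptions: the `Call` shape, the `evaluate_alerts` function, the nearest-rank percentile method, the 5% error-rate threshold, and the choice to count any status of 400 or above as an error are all illustrative. Only the one-second latency default comes from the text.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Call:
    duration_s: float
    status: Optional[int]  # None represents a connection failure


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def is_error(call: Call) -> bool:
    # Assumption: connection failures and any 4xx/5xx status count as errors.
    return call.status is None or call.status >= 400


def evaluate_alerts(
    calls: List[Call],
    latency_threshold_s: float = 1.0,    # default suggested in the text
    error_rate_threshold: float = 0.05,  # assumed value, tune per API
) -> List[str]:
    """Return a message for each alert condition that is currently firing."""
    if not calls:
        return []
    alerts = []
    p95 = percentile([c.duration_s for c in calls], 95)
    if p95 > latency_threshold_s:
        alerts.append(
            f"p95 latency {p95:.2f}s exceeds {latency_threshold_s:.2f}s"
        )
    error_rate = sum(is_error(c) for c in calls) / len(calls)
    if error_rate > error_rate_threshold:
        alerts.append(
            f"error rate {error_rate:.1%} exceeds {error_rate_threshold:.1%}"
        )
    return alerts
```

In practice this check would run on a schedule against the last few minutes of calls for each integration, with the returned messages routed to whatever paging or chat tool the team already uses.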
Automatic retries are often overlooked when an integration is first being developed. The calls work in development, but once they are deployed to production, intermittent failures are seen in application logs. API calls fail for many reasons, and can often be immediately retried successfully. Implementing an automatic retry around API calls that checks for retriable conditions can significantly reduce the impact of intermittent issues with API integrations.
Each API call should be wrapped in logic that:
- Makes the API call
- Checks for a retriable error
- Repeats the call up to a set number of times after a delay
It is important to think about the maximum number of retries and the delay between each call. While a fixed delay between retries can be used, it is often preferable to use an exponential backoff, increasing the delay between calls. If exponential backoff is used, it is important to set a maximum delay, as exponential functions grow very quickly. For example, with a one-second initial delay that doubles after each attempt, the tenth retry would wait more than eight minutes.
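One possible shape for such a wrapper is sketched below, using exponential backoff with a capped delay. The names `call_with_retries`, `backoff_delay` and `is_retriable`, the default of three retries, the half-second base delay, the 30-second cap and the doubling factor are all assumptions chosen for illustration, as is treating connection and timeout errors as retriable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_delay: float = 0.5,
                  max_delay: float = 30.0) -> float:
    """Delay before retry number attempt + 1: doubles each time, capped."""
    return min(max_delay, base_delay * 2 ** attempt)


def is_retriable(exc: Exception) -> bool:
    # Assumption: only connection failures and timeouts are worth retrying.
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_retries(
    call: Callable[[], T],
    retriable: Callable[[Exception], bool] = is_retriable,
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Make the call, retrying retriable failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            # Give up when out of retries or when the error is not retriable.
            if attempt == max_retries or not retriable(exc):
                raise
            sleep(backoff_delay(attempt, base_delay, max_delay))
    raise ValueError("max_retries must be zero or greater")
```

With these defaults the waits are 0.5, 1 and 2 seconds, and the cap keeps a larger retry budget from producing very long pauses. Retrying is only safe when repeating the request has no unwanted side effects, so the `retriable` check should be adjusted for each API.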