Hacking your Product Support Strategy

Warren Parad ・ Originally published at Medium ・ 11 min read

There’s an unknown number calling in the middle of the night, and there’s only ever one reason that happens. You missed the email and the app notification that one of your services’ non-production environments is down. Another team working in a different timezone had no idea it was the middle of the night for you, so they sent some emails and followed up by triggering your on-call rotation. Apparently your non-production server, which they were using for their testing, stopped working. They thought it was important, and thought perhaps this was a problem in your production environment as well, so they escalated, and here you are, now partially awake.

Perhaps worse, as part of a customer support strategy you were carrying around a special phone for the support rotation, and you are now, not totally awake, supposed to run some scripts to ensure that your customer’s environment is still working correctly. There was an unexpected surge in their data center, and their production servers went down for 30 minutes. You’re the lucky person who gets to validate that everything is alright.

Spoiler alert, everything is fine, but you’ve been asked to handle this anyway. And why?

On the other side of the spectrum, there were days you began to peruse your emails and found that a critical service had been down for almost a full day. Unfortunately, you had users who were active while you were busy sleeping, and they aren’t happy right now.

The team that works next to you has a red flashing light that turns on whenever there is a problem, and yet when it does, everyone sees it except for them.

These sorts of issues may have happened to you before, and they certainly can happen again. The source is a poorly crafted support strategy for your products. At some point in the software development life cycle, if you are lucky, your product has made it out of a prototype and into an MVP. Perhaps it is even past that, and now your team is responsible for monitoring and alerting. However, exactly how to do this isn’t obvious, even though everyone seems to know what needs to be done: throw in some logging statements with level INFO or ERROR, and we’ll see them when we look back through the logs later. Or your team is part of a larger organizational structure and there is a recommended reporting tool: you just send your messages there, and another team exists specifically to triage them.

It would be nice if it were that easy, but fundamentally it isn’t enough. Just as you think about how the programming language and the construction of the API support your service and product, so too should you be thinking about the Service Level Agreement (SLA). Now that usage has increased and you are thinking about the long-term sustainability of the product, if you are being diligent you’ll be thinking about what your SLA should be. A good SLA includes, among other things, uptime expectations, error rates, and public status reporting. However, that is just the external aspect. What’s even more important are the internal changes your team makes to the service and infrastructure to support higher reliability.

A team that is on top of issues with their product service will:

  • Know when the service is deviating from the norm, for example higher rates of 4XXs or 5XXs being returned to callers (also known as anomaly detection)
  • Be able to identify problems that the service wasn’t designed to handle; in a good MVP not every corner case has been handled, but when one comes up your team is notified
  • Swarm together when there is an issue to quickly rectify the problem
  • Identify when clients of your service are hitting a problem before they do
  • Identify when you should be taking action and when you should be ignoring a problem
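
To make the first point a little more concrete, here is a minimal sketch of what "deviating from the norm" could mean for 5XX rates. The function name `isAnomalous`, the three-standard-deviation rule, and the sample numbers are all my assumptions for illustration; a real monitoring tool will have its own, usually more robust, approach.

```javascript
// Hypothetical sketch: the function name, the 3-sigma rule, and the input
// shape (a list of past error rates) are assumptions, not a specific tool's API.
function isAnomalous(baselineRates, currentRate, sigmas = 3) {
  const mean = baselineRates.reduce((sum, rate) => sum + rate, 0) / baselineRates.length;
  const variance =
    baselineRates.reduce((sum, rate) => sum + (rate - mean) ** 2, 0) / baselineRates.length;
  const stdDev = Math.sqrt(variance);

  // Only upward deviations are flagged: fewer errors than usual is not a problem.
  return currentRate > mean + sigmas * stdDev;
}

// Share of responses that were 5XX in past windows, compared with the current window.
const baseline = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013];
console.log(isAnomalous(baseline, 0.011)); // false, within the usual range
console.log(isAnomalous(baseline, 0.080)); // true, far above the usual range
```

The point is only that "the norm" has to be measured before a deviation from it can be noticed.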

There are lots of ways to accomplish these, but many of them come with increased cost rather than streamlined support. I know many experienced developers and teams who have suffered waking up in the middle of the night to support an issue which wasn’t a real problem, as well as those that had real issues and didn’t know about them.

To handle the listed issues, just littering your code with log statements is not the right approach. Each problem should be deliberately considered, and should include an expected warning indicator and a remedy: when X goes wrong, we do Y. How will you know X is wrong, and how will you know to do Y? For my teams, I use logging, monitoring, and alerting. If you don’t have these in place, and instead someone not on your team reports problems to you, there is an issue. You may have fallen into the Support Team trap.

The Logging Levels

To be able to track, log, monitor, and alert at the right times, have a deliberate strategy. This is the one my teams use:

TRACK: There is an action being taken in a service, and you are interested in knowing whether it is happening at all. You are interested in scale: 1 or 10 or 100, that’s the interesting aspect. You can’t use logging to know exact numbers (I’ll get to why in a later section).

INFO: Any action being taken that is useful for debugging purposes later. Perhaps your system is hitting a critical problem; do you have enough information to know what is going on? INFO should never be logged on a code path that runs 100% of the time, unless it is specific request or response information. Logging “INFO: We got some data correctly.” is a waste of space, and it is noise. However, logging “INFO: We decided to fallback here because we didn’t have the data we thought we would.” is really helpful, especially if there is a follow-up issue.

WARN: Any suspected problem. This is something that you would likely review on a weekly basis and attempt to reduce. Perhaps there is a rare situation and you are wondering if it is happening at all, or there is a new feature you just introduced that might not be working correctly. If this is a problem but you shouldn’t be woken up in the middle of the night for it, this is the right log level.

ERROR: A real problem, an unexpected exception flow: for instance an Exception was thrown, an error was returned, or perhaps a 5XX came back on an API. One of these isn’t a problem; it could be a temporary issue. But at ~10 of these per minute, sustained, you could fire off your on-call rotation.

CRITICAL: This is a problem you didn’t catch, but should be catching. If you get even one of these, on-call should be triggered. This is more than just a single failure which may never happen again; this is a database corruption, or something worse.
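
As a rough illustration, a logger following these levels could look something like the sketch below. Only the level names come from the list above; `createLogger`, the `sink` callback, and the shape of the record are assumptions I made up for the example, not the API of any particular logging library.

```javascript
// Hypothetical sketch: the level names come from the article; createLogger,
// the sink callback, and the record shape are assumptions for illustration.
const LEVELS = ['TRACK', 'INFO', 'WARN', 'ERROR', 'CRITICAL'];

function createLogger(sink = (record) => console.log(JSON.stringify(record))) {
  const logger = {};
  for (const level of LEVELS) {
    // Each level becomes a method: logger.track(...), logger.info(...), and so on.
    logger[level.toLowerCase()] = (title, data = {}) => {
      const record = { ...data, level, title, time: new Date().toISOString() };
      sink(record);
      return record;
    };
  }
  return logger;
}

// Usage: the level is a deliberate choice made at the call site.
const log = createLogger();
log.track('Report export requested');
log.info('Falling back to defaults because the profile data was missing', { userId: 'user-123' });
log.error('Failed to update serviceTwo', { error: new Error('timeout').message });
```

Emitting structured records rather than free-form strings is a design choice in this sketch: it makes the level and the attached data easy to filter on once the logs reach an aggregator.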

Now that logging is done, next is what to do with these messages. If you aren’t following a pattern and are instead guessing which logging level to pick, your strategy for support won’t work. You’ll be alerted at the wrong times, and everyone will start to ignore your reporting system. It will fill up with Known Errors and excluded search results. Your team will think none of it needs to be supported, and you are now just wasting the resources spent on logging altogether.

The Framework

Having the logging in place is great, but it doesn’t necessarily cause the right things to happen. In addition to just putting some log statements in your service, you’ll need to send that data to a remote tracker, identify interesting patterns, and then alert your team to take action. The first step is having a log aggregator. This is the sink for all the logs, and it allows you to search and filter. How many TRACK events have you gotten? What sorts of WARNings are there? If your service has elevated 403s, why is that? What else are those users doing? If you are logging correctly, then finding out this information becomes trivial. Once you are able to identify the patterns, forwarding that information to an alerting system should be simple. Any number of these exist. They can trigger some emails, a text notification, or something more disruptive. You could even update a public page on the overall status of your system.
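
The rule an alerting system ends up evaluating can be quite small. The sketch below encodes the thresholds from the logging levels (any CRITICAL pages, ERRORs page only when sustained at roughly 10 per minute); the function name `shouldPage`, the three-minute definition of "sustained", and the record shape with a `timeMs` timestamp are assumptions of mine, and in practice this logic would live in your alerting tool’s configuration rather than in your own code.

```javascript
// Hypothetical alert rule: the function name, the default thresholds, and the
// record shape ({ level, timeMs }) are assumptions for illustration only.
const MINUTE_MS = 60 * 1000;

function shouldPage(records, nowMs, { errorsPerMinute = 10, sustainedMinutes = 3 } = {}) {
  const windowMs = sustainedMinutes * MINUTE_MS;
  const recent = records.filter((r) => nowMs - r.timeMs >= 0 && nowMs - r.timeMs < windowMs);

  // A single recent CRITICAL record is enough to trigger on-call.
  if (recent.some((r) => r.level === 'CRITICAL')) {
    return true;
  }

  // Count ERROR records in each of the most recent one-minute buckets.
  const counts = new Array(sustainedMinutes).fill(0);
  for (const r of recent) {
    if (r.level === 'ERROR') {
      counts[Math.floor((nowMs - r.timeMs) / MINUTE_MS)] += 1;
    }
  }

  // Page only if every bucket crossed the threshold, so a one-off spike is ignored.
  return counts.every((count) => count >= errorsPerMinute);
}
```

WARN records never page in this sketch; they are left for the weekly review.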

A Support Approach

Some organizations will swear by having a dedicated support team, which can handle the first level of triage. Everyone thinks that they know how to utilize this team, but most will have different approaches. If you are well aware of your Support team and frequently talk with them, this is a red flag. If your team depends on an external one to survive, you are doing it wrong. A dedicated support team doesn’t exist so you can ignore your problems and have someone else take care of them. They exist to point support problems in the right direction. They will never understand the intricate details of any service well enough to resolve problems effectively. They can triage, but even that comes at a cost. No matter the size of their team, they cannot keep up with the changes that are constantly permeating our products and services.

One misguided attempt to resolve this knowledge burden may be to create unified tools that everyone uses to track service issues and reliability problems across services and teams. You create overhead for every team by making them depend on such a framework to publicize internal details about how a service works. Rather than having a team focus on resolving their problems, we’ll be creating a burden on two teams, one to publish information unnecessarily and one to consume it. This also has an equally problematic impact on your software, because now there are two ways to interact with every team and service: one through the standardized documented approach, and one through the backchannel alley (which probably also has to be documented). Creating this gutter diverts attention away from fixing the root cause and toward a secondary approach which doesn’t actually improve the product or service. It seeks to improve the support triage approach, and not what your customer sees. Additionally, this path relies on manual actions and runbooks which are not the norm, rather than putting into place more reliable and consistent solutions.

The Alternative

Some developers may say "that's not my job or responsibility", and perhaps it was never framed that way. That means the first step is a mindset shift. Effective teams are made up of cross-functional individuals, not just "developers" or "testers". Instead of investing in improved support, invest in improved reliability, so that support is not necessary. When a team invests in the reliability of their service, they remove the need for a separate support team, because there are no longer many issues (though there is always something you didn’t predict or handle). You don’t need technology to help you be reactive and triage problems if the problems don’t exist. Additionally, knowledge doesn’t have to be unnecessarily shared between two teams, the “development” and the “support” team. There is only one team.

Common failure modes

Inversion of monitoring

What you shouldn’t do is invert the structure here. Don’t have an external service which watches your service to identify when there is a problem. That service doesn’t need to exist, and if it does, it puts unnecessary load on your production environment, not to mention it won’t even be correct. Let’s say you have an "Uptime pinger": it hits a special endpoint in your service, which your service uses to identify if there is a problem and report back success or failure. How often should this pinger hit your environment: once a second, a minute, an hour? Given that it is an external service, it can do whatever it wants. Since you are exposing this route, now everyone can hit it. You’ll be encouraged to put a cache in front of the endpoint, which subverts its purpose, since a cached response no longer reflects the current state of the service. Additionally, how well can it know what the problem is? Can it identify how many null reference exceptions or 500s have occurred in your service, or whether the database is under high load? Not likely. Instead, by logging the errors and having your monitoring tools investigate, you’ll know exactly what the problems are. If you choose to expose that information via another mechanism you can, and it won’t affect your production environment.

What if X happens for another team, how will I know?

Alternatively, another team is asking, “How do I know your service is down?” The answer is: you don’t need to know. If you have sufficient logging, monitoring, and alerting in place, you will know the moment a problem happens and will respond. Working to get everything in order as quickly as possible is the best outcome. Having another team emailing you and trying to contact you, just in case you didn’t know, is disruptive noise. Additionally, when that team has a problem with one of their services, they already know as well. You don’t need a special mechanism to identify that. They are doing their job and you are doing yours.

Logging isn’t reliable but it doesn’t need to be.

Occasionally someone gets the idea to use logging as a production tracking mechanism, as if it were a database. Don’t do that. It creates a pit of failure, and your service will still be unreliable. This frequently happens during transaction-like actions across services. In the world of microservices, you might want to make changes to two different services, so you write code like this:

try {
  await serviceOne.change();
  await serviceTwo.change();
} catch (error) {
  // The log line is the only record that something may need manual cleanup.
  log.error('[ACTION REQUIRED] There is a possible problem here', error);
  throw error;
}

If you are slightly more aware of the multiple problems, you might have made some adjustments to even know which change failed. But what happens if your logging of the error fails? Would you write this?

try {
  await serviceOne.change();
  await serviceTwo.change();
} catch (error) {
  // Anti-pattern: nesting try/catch to guard against the logger itself failing.
  try {
    try {
      try {
        log.error('[ACTION REQUIRED] There is a possible problem here', error);
      } catch { }
    } catch { }
  } catch (e2) {
    log.error('[ACTION REQUIRED] There is a possible problem here', error);
    throw e2;
  }
  throw error;
}

Logging should only be used in situations where knowing the exact number of failures is not relevant. Knowing that there were 1 or 10 doesn’t matter, and more importantly the data associated with the problem is also irrelevant. If you need a mechanism to work 100% of the time, it needs to be your datastore, with error handling that’s a first-class citizen in your service. That means returning a 5XX when the datastore isn’t working.
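
One way to read that advice is sketched below: the progress of the two-service change is written to the datastore, the caller gets a 5XX when the datastore or a change fails, and the log statement is demoted to best effort. Everything named here (`handleChange`, `datastore.createOperation`, `datastore.updateOperation`, the state names, and the response shape) is hypothetical and only illustrates the idea; it is not a complete transaction mechanism.

```javascript
// Hypothetical sketch: handleChange, the datastore methods, the state names,
// and the response shape are all assumptions, not a real library's API.
async function handleChange(request, { datastore, serviceOne, serviceTwo, log }) {
  let operationId;
  try {
    // Record the intent first. If this fails, nothing has been changed yet.
    operationId = await datastore.createOperation({ input: request.body, state: 'PENDING' });
  } catch {
    // The datastore is down: tell the caller, rather than hoping a log line survives.
    return { statusCode: 503, body: { title: 'Datastore unavailable, retry later' } };
  }

  try {
    await serviceOne.change(request.body);
    await datastore.updateOperation(operationId, { state: 'SERVICE_ONE_DONE' });
    await serviceTwo.change(request.body);
    await datastore.updateOperation(operationId, { state: 'COMPLETED' });
    return { statusCode: 200, body: { operationId } };
  } catch (error) {
    // Logging is best effort; the stored operation state is the source of truth.
    try {
      log.error('Change did not complete', { operationId, error: error.message });
    } catch { /* a failed log write must not hide the real failure */ }
    return { statusCode: 500, body: { operationId, title: 'Change did not complete' } };
  }
}
```

With the state stored, a later cleanup job or a retry from the caller can find unfinished operations by querying the datastore, which is something a lost log line can never offer.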

Logging is expensive

Sometimes we attempt to limit the information present in a log statement. Stop yourself, and ask why you are doing that. While it is possible that you are actually logging too much per call (i.e. all the user information as part of every INFO), when an ERROR fires you want as much information as possible. A message “The Service is on fire” is not helpful unless you know which service, and how big of a fire. If the error never happens, you never pay for the log message… win-win.

To prevent logging duplicate information, creating a requestId or correlationIdentifier shared across log statements is incredibly useful. On every request, log the initial data along with a unique ID. Then for the remaining messages, log only the requestId along with the specific information for the error. Seeing “There was a problem” 100 times without any additional data or request metadata is incredibly annoying.
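
A small sketch of the requestId pattern follows. The helper names `newRequestId` and `createRequestLogger`, the request fields, and the `sink` callback are assumptions for the example, and the ID generator is a stand-in; in a real service you would likely use a proper UUID.

```javascript
// Hypothetical sketch: the helper names, the request fields, and the sink
// callback are assumptions made for this example.
function newRequestId() {
  // Stand-in for a real UUID generator; good enough for an illustration.
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

function createRequestLogger(request, sink = (record) => console.log(JSON.stringify(record))) {
  const requestId = newRequestId();

  // Log the full request context exactly once, together with the requestId.
  sink({
    level: 'INFO',
    title: 'Request received',
    requestId,
    method: request.method,
    path: request.path,
    userId: request.userId,
  });

  // Every later statement carries only the requestId plus its own details.
  const write = (level) => (title, data = {}) => sink({ ...data, level, title, requestId });
  return { requestId, info: write('INFO'), warn: write('WARN'), error: write('ERROR') };
}

// Usage: the error record stays small but can be joined to the request by requestId.
const requestLog = createRequestLogger({ method: 'POST', path: '/orders', userId: 'user-123' });
requestLog.error('Failed to reserve inventory', { sku: 'sku-42', reason: 'timeout' });
```

Searching the aggregator for one requestId then returns the initial request data and every message that followed it.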

The Next Step

The next step is to put it all into action, of course. Take a look at your services’ situations and determine what needs to be improved. If you don’t have time to improve the services because there are too many fires, then you might have a different problem. The bottom line is that reliability is core to improving support, because being proactive for your users is what they want.
