As companies transition to operating with a DevOps culture and mindset, individuals may find that their roles expand, change and shift. With DevOps, the traditional approach of handing work over to a quality assurance (QA) or operations (Ops) person is thrown out the window. Silos between these different functions are broken down, and once-clear role definitions become blurry.
One of the key elements of DevOps is autonomy. Teams should be able to deliver on their objectives without depending on other teams to test, deliver, or operate their code. This means that people in traditional Development, QA, or Ops roles may find they need to dive deeper into other areas of the software development lifecycle.
Recently I was approached by an individual who found themself needing to know more about quality assurance and testing strategies. They wanted to learn how to test their microservice and were soliciting advice from various groups and individuals throughout the company.
I was asked:
How do we build confidence (because ultimately this is what tests do, build a developers confidence) for developers of Microservice A that its dependency, Microservice B, will do what it’s supposed to?
Microservice architectures contain many smaller, independently deployable applications that work together to create a larger application. In order to work together, microservices often have dependencies on other microservices in the ecosystem. With this in mind, their question made a lot of sense. However, their approach to the issue gave me the impression that they were thinking about testing from a monolithic perspective - where the entire system is part of a unit that you can test end to end.
...but microservice architectures aren't like that! Microservice B may be built by another team, or even a third party solution like an API from another company. You can't control Microservice B. Instead, the focus should be on what we can control - our microservice, Microservice A.
Thinking about microservice architectures, the question should instead be:
How do we build confidence in our microservice (Microservice A) such that our solution will react gracefully to the state of Microservice B?
If Microservice B can't do what it needs to do, for whatever reason, how does Microservice A respond? Different from the traditional testing strategy, quality assurance for microservices can be broken into 3 parts:
- Testing - ensuring your service does what it is supposed to do and only what it's supposed to do
- Monitoring - checking in on metrics for things you know might happen, and alarming when action needs to be taken
- Observability - having the tools in place to debug when something you didn't expect to happen, happens.
Testing an application builds confidence that it will do what it's supposed to do, and only what it's supposed to do when it is deployed to production. There is a lot of terminology when it comes to types of tests, and it's very easy to confuse them. Call them what you'd like but essentially you need a set of tests that validate your application's functionality including the backend code, and the connective infrastructure.
Small tests, often called unit tests, can be helpful for testing your runtime code. They isolate and test only one independent "unit" - generally a method or function. These small tests are typically cheap to build and quick to run, which means they can provide fast feedback to developers when run locally or as part of an automated pipeline.
However, many microservices are built in the Cloud and use different pieces of cloud infrastructure (hopefully taking advantage of infrastructure as code). These infrastructure resources need to be configured properly such that they are secure and they can work together to enable your application. This means that testing the runtime code isn't enough. We need to make sure that the microservice can be deployed properly to the cloud, that it runs, and can do what it is supposed to do (and only what it's supposed to do!). While there are ways to mock these resources for local testing, I have found this to be tedious and unhelpful as you still haven't tested that the real resources can work together.
Instead, development workflows should be set up such that every developer can deploy their version of the microservice on-demand, in an environment that's safe for testing, fast iteration, and learning. Every time a developer pushes code to their branch, it should update their deployed version of the application. Rather than messing around with mocked resources, a developer can experiment with real resources in the cloud and see how they really interact with each other to build confidence in the application. Automated tests can also be run against this version of the application as part of the deployment pipeline. This will help reduce manual work, and create consistency in the testing done in pull requests for each change being merged to production.
Tests should also be run post-deployment, to ensure the active application is still working properly in each environment that an update has been deployed to. Your application may be deployed to Alpha, Staging, or other pre-production environments in addition to your customer-facing Production environment. The same automated tests should be run in all environments to ensure consistency, and catch issues early. If automated post-deployment tests fail in your Staging environment the deployment pipeline can stop before rolling out the change to production, preventing your customers from being impacted by the issue.
Even with a plethora of testing, issues can still arise in production. Quality Assurance doesn't stop when the change is deployed. We need to maintain a high standard of quality throughout the software development lifecycle, which includes while it is actively being used. A team that truly owns the quality of their microservice needs to know when issues happen and be prepared to respond to them. With alarms to monitor their application in place, teams will be notified when something they predicted might happen actually happens.
Monitoring doesn't just magically get set up for an application though, teams have to put in the time to think about possible issues that might arise with the application, create indicators for those issues, and responses to those indicators. This generally includes things like creating metrics or service level indicators to monitor specific situations such as the duration of an API call, or the number of errors and invocations for a Lambda. With a metric in place, there needs to be an alarm to notify someone when that metric enters an undesirable state and action needs to be taken - perhaps a sustained spike in the API call duration, or a drop in availability of the application. The person responding to the alarm needs to know what's going on and how to respond to it. This means that alarms should contain information such as:
- Details about the metric (what's going on?)
- Details about the environment such as AWS account, region, etc. (where's the issue happening?)
- Details or links to observability tools like logs, metrics, dashboard, etc. (where can I find more information to help debug?)
- Details or link to a runbook that describes the procedure to follow for managing the incident (what action should I take?)
Remember that alarms can wake people up in the middle of the night, so they need to be actionable. If you're having trouble writing the runbook describing what actions to take when an alarm goes off it might indicate that an alarm is unnecessary and the metric is something you want to passively monitor or check in on from time to time instead. Alarms thresholds may also need tweaking - I've never got them 100% right the first time personally, so don't worry if they aren't perfect but do listen to feedback and adjust as you go.
Observability comes in when you need to investigate something you didn't anticipate happening. Observability uses tools like logs, metrics, dashboards, and traces - anything that can provide more insight into how the system is behaving. When an incident occurs, you need visibility in order to debug the issue and find out what's contributing to it. The more visibility you build into the system prior to an incident, the easier it will be to debug.
There is a balance to be struck here, so be sure to think about permissions and who should have access to specific information. Personally Identifiable Information (PII) and customer data should be handled with care and not be exposed in logs, traces, or other observability tools.
Remember how we talked about shifting roles and team autonomy? When transitioning to working with a DevOps mindset and full team ownership for microservices, automation will become your best friend. Taking advantage of Infrastructure as Code (IaC) will help your team provision and maintain the infrastructure with less guess work. (Even your dashboards and alarms can be created as code!) Utilizing automated deployment pipelines for deploying your development branches and production code will ensure consistency in the deployment process, the level of testing and quality. Automating your tests frees developers to add value to the product in other ways... I can keep going on, but I think you get the point.
It can feel like a lot to take on as your team transitions to a DevOps mindset and owning their application completely - and it is! Quality assurance is no longer handled by another team, it is up to your team to ensure that your application is tested and operating properly at all times. Try to take it one step at a time. Each iterative improvement you make will get your application closer to your quality and operational goals!