Nir Sharma for Squadcast

Posted on Dec 7, 2020 • Originally published at squadcast.com

How to SRE without an SRE on your team

#sre #bestpractices

Are terms like “Error budgets” and SLOs roadblocks on your way to adopting SRE practices for your organisation? Our latest blog talks of "How to SRE without an SRE on your team", where we look at some of the most elementary SRE concepts that you can start implementing right away! We help you pick SLOs, identify toil and touch base on Automation for SREs along with few best practices to get you started on your SRE journey.

An organisation with mature Site Reliability Engineering (SRE) principles may conjure images of engineers with years of experience in DevOps and System Administration, having a suite of specialised tools and experts dissecting each service outage. For an organisation that is thinking of implementing SRE principles this is an intimidating image and may seem unattainable. The truth is everyone can get started on their SRE journey by following a few elementary principles, which are outlined here. While we are not claiming that this is the only way to go forward when you don't have an SRE as a job title or role in your team, it's a good place to start.

In this blog, we go over some of the most basic steps you can take in your journey towards SRE adoption. To unlock the full value of SRE practices however requires a deeper commitment than just investing in the tools or training. There is a need to have an organisation wide cultural changes as well. We also look at some of the ways you can implement SRE principles including learning about error budgets, automation, and more. At the heart of great SRE practice lies a willingness to break down existing silos and to communicate and coordinate across cross-functional teams along with automating redundant processes.

Now that you have decided to adopt some SRE practices the question comes up: Where do I start? To start off we look at some of the most useful concepts and practices in SRE and how it makes deployments seamless and your engineers happy.

Error Budgets

An error budget can be defined as the maximum amount of time your system can be down without facing consequences. These consequences can be derived from external legal agreements (SLAs) or even internal organisational goals(SLOs). Error budgets are important because they allow your development and IT operations to ensure that the price of having new features in the product does not result in downtime and inconvenience for your users. For example, if your product is running faultlessly your developers can spend the error budget to create and deploy more features. If you have run out of error budget for the month then publishing new features takes a backseat till operational issues have been resolved.

Deciding the error budget will depend on the kind of service you are providing. Are you running a banking platform that needs to be available almost 24/7? Are you running an OTT platform specialising in live streaming sports? An effective error budget must also take into account the external problems that you won't have control over - which includes internet connectivity going down, your remote machines becoming the victim of a DDOS attack and similar unforeseen problems. Your error budget will be derived from the SLOs that you have picked. In the next section of this blog, we look at how you can pick SLOs that would be most helpful for your organisation.

An error budget is calculated with the following formula.

Error Budget = (1 - SLO of the service)

For instance a 99.9% SLO service has a 0.1% error budget.

While error budgets are usually used as an unambiguous criteria for SREs to accept changes from the development team, they are still useful for organisations without SREs. Development teams can use the error budget to decide whether or not to deploy changes or to decide whether to work on riskier features or not.

Measuring your service (SLOs,SLIs and SLAs)

The adoption of SRE practices should ideally evolve organically from your existing processes. The first thing to keep in mind is “Which metrics are most important to my customers?”. If you are a customer facing website, the most common would be uptime, latency and volume. Those important measurements can become your SLIs (service level indicators) and SLOs(service level objectives). SLOs should ideally shape up as a natural outcome of customer requirements. It is ofcourse tempting to let your IT staff determine your SLOs but for a more sustainable solution the SLOs must come from things that directly impact your customers. The easiest way to pick SLOs is to work backwards from your business objectives. This metric must be something that you can improve upon and something that does not depend on external factors. It is always better to start with fewer SLOs and pick the ones that you feel are most important for your product. The vital thing to remember here is that the SLO should be something that is quantifiable.

“SLOs have a formidable use as metric-based indicators that show you what needs to be improved in your systems, its capabilities, and where you can get your best “bang for buck” when it comes to focusing your work efforts. However, SLOs must be influenced by data, and that data can only come from your customers. A lot of IT professionals tend to think that they know the best metrics, and they do; the only problem is that they are the best metrics for monitoring systems, not for improving customer satisfaction.” says Adam Hammond, DevOps Engineer at Megaport, in his extensive blog called “Choosing SLOs that users need, not the ones you want to provide"

Finally, when you define your SLOs, remember that a good SLO should be S.M.A.R.T.

Specific: the SLO should clearly and explicitly state what it measures (e.g. we want to measure availability by testing whether a request can be to the backend server, and not that we want it to be up).
Measurable: the SLO should be something that can be calculated easily (for example, the latency of the disk should be less than 5ms, not the disk should be quick to load data).
Achievable: you should be able to fulfill the SLOs (e.g. if a service has a SLO of 95 percent , you cannot promise 100 percent).
Relevant: your SLO should correspond to the user experience(e.g. an appropriate metric for a web server is response time, not CPU activity).
Timebound: an SLO should cover a timeframe that is suitable for when your users operate your system(e.g. if your users only use your system between 9 AM to 5 PM, a 24-hour SLO will be counterproductive).

Additional reading: A narrative case study -“How small changes to your SLOs can be SMART for your business"

SLOs are a great way to generate metrics about what is important to the business or it’s customers. Those insights are critical even if you don’t have a separate SRE team.

Toil

If you have been running a production system then toil will be familiar to you. Google’s SRE book defines Toil as “The kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” All organisations that are scaling up their product have had to deal with toil at some point of their expansion. The first stop to tackling toil in your organisation is to identify it properly. Sometimes this can be challenging as toil is often disguised as seemingly important work. A significant aspect of good SRE implementation is recognising toil early and automating it away. There is a cultural aspect to toil as well. Many teams may have included “busy work” in their schedules without realising its potential for long term damage. Naturally all production systems require human intervention to run optimally but the number of people managing them should not be growing linearly with the addition of every new user, virtual machine or service. Toil has a detrimental effect on your engineering team's morale and productivity as well.

Here are some examples of toil that is commonly faced by growing organisations-

Any task your engineers are performing manually. For example, you may have a script that automates some tasks but manually executing the script every time is still “toil”.
Lacking any enduring value - things that constitute toil do not add any lasting value to your project.
Repetitive - Things that are repeated frequently. It can include manually configuring production servers before every deployment.

An effective way of identifying toil includes doing a survey of your engineers. Here are some sample questions you can ask your engineering team to pinpoint problem areas.

Q: Approximately speaking over the past month, how much of your time was spent on toil?

Q: What according to you were the top five sources of toil?

Q: Is there toil that can be automated but you are not able to due to cultural reasons?

Without a separate SRE team, toil is doubly dangerous -- not only is your limited manpower being wasted on value added activity, they’re probably wasting more time than an experienced SRE would since it’s work that they don’t specialize in.

Automation for SREs

Automation is a substantial aspect of good SRE practices. It is used to automate those tasks that have been identified as toil in the system. This can include running scripts when certain events occur, monitoring clusters, automating full-scale code deployments(Infrastructure as code) and auto configuring virtual machines in the cloud. SREs seek to automate to regulate their workload and to ensure that their workload does not increase linearly with the addition of users or machines they are maintaining.

Automation helps in three ways:

By being orders of magnitude faster than the equivalent manual effort thus saving precious manpower.
By being reliable, once stable, automation is not subject to human error and can be relied upon to correctly execute every single time.
By being repeatable. Automation can ensure operations are 100% consistent across your infrastructure. Consistent behaviour is cheaper and easier to manage.

With good automation in place, you can delay the need for specialist SREs and infrastructure people.

Conclusion: Next steps

Here are some other best practices that are good to follow:

Have proper rollbacks in place - In case of a faulty deployment which breaks things it is easier to rollback to when things were working fine. Having proper rollbacks in place will save you hours which will otherwise result in loss of engineering time and productivity.
Manage stress and burnout - Dealing with on-call stress, unhappy customers and teams is part and parcel of the SRE journey. There is always pressure on on-call engineers to resolve issues quickly however, sometimes resolution can take days. The onus is on you to create a stress-free environment for your team with the help of SRE practices and building a culture of reliability that takes pride in shipping great software.
Don’t update code if you won’t be available for the next two days - Again self-explanatory, if things go wrong it will be much harder to get your system running again. Once code has been deployed its best to see what is working smoothly in production.
Keep your customers in the loop - Customer facing teams like marketing, sales and support must always be aware of the limitations of your product. This ensures that your customers have realistic expectations from the product. Any website having over thousand users has experienced breakdowns. These breakdowns usually prompt angry emails from your customers especially if the site is down for unscheduled maintenance or if the breakdown in service is of a critical nature.

Adopting SRE best practices can be something teams and organisations of all sizes can begin with. If you are anticipating rapid expansion of your product or being plagued by frequent breakdowns, exploring the SRE process makes sense. It is important to remember that SRE is a journey and what works for others may not be the best solution for you. While it’s vital to learn from others, every organisation has its own unique journey towards adopting SRE. Post adopting a dedicated incident management platform it becomes easier to track the problem areas in your digital infrastructure. This includes learning by bringing in important metrics like MTTA (Mean time to acknowledge), MTTR (Mean time to resolve) and others. This transition to a dedicated platform is the next big step on your SRE journey. More than just a set of metrics and practices, SRE envisions a culture where problems are resolved in a “blameless” environment. A culture where issues can be raised and fixed in a transparent manner.

This article is inspired by a talk originally given at LISA'19 by Squadcast with the same title "How to SRE without an SRE on your team".

Some useful resources for SRE:

What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via Twitter and let us know your thoughts.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

DEV Community