Some job titles in the tech industry require prior knowledge to understand, including what their responsibilities are.
I often find myself trying to explain what I do for a living to people outside the tech industry.
How do you explain SRE then? In this post, I’ll try to describe it in simple terms.
SREs (Site Reliability Engineers) are developers with operations responsibilities. They are in charge of the production environment and keep it up and running.
A business, by definition, sells a product or a service.
Many of them run their business online these days. You can order almost anything through the internet. Google and Amazon are very popular worldwide services. They are available to you (almost) no matter where you are.
You, the client, consume a service. You use Google to search for interesting stuff or things you need (say, a nice restaurant). You read the news on your favourite news site or shop online at Amazon.
These companies serve you through the Internet. They seem to be online 24/7, 365 days a year. Pause for a second and think about it. Isn’t that magical? It’s like a store that is always open, but easier to access: you don’t need to leave your house to enter it.
Now that we have defined what an (online) service is, we can cover the four core concepts SREs are usually accountable for. I say usually because responsibilities may differ between companies.
What does this mean anyway?
As SREs, we design the infrastructure for the product. We decide which hardware to use, do the capacity planning with room to grow as needed, etc.
One requirement is to make it reliable and resilient so that service downtime is kept to a minimum.
We try to eliminate any single points of failure along the way (from hardware to application). Always have redundancy in your infrastructure, so if something fails - be it hardware, network or software - the system can quickly recover from it.
SREs know that things break. They are the ones who get called when something critical is not working.
It is our job to recognize possible failures along the way and mitigate them ahead of time when that's possible.
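To make the redundancy idea a bit more concrete, here is a tiny sketch in Python. Everything in it is made up for illustration (`call_with_failover`, `broken_server` and `healthy_server` are hypothetical names, and real systems usually do this with load balancers rather than a loop), but it shows the principle: if one copy fails, another one answers.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order and return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # Only reached when every replica failed, i.e. redundancy was not enough.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical stand-ins for two copies of the same service.
def broken_server() -> str:
    raise ConnectionError("server A is down")


def healthy_server() -> str:
    return "hello from server B"


if __name__ == "__main__":
    # Server A fails, so the request is answered by server B instead.
    print(call_with_failover([broken_server, healthy_server]))
```

With a single server in the list, the same failure would take the whole service down, which is exactly the single point of failure we try to avoid.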
Online services are composed of multiple applications or features. Today, many applications run on distributed systems. We need visibility into what’s going on.
To get it, we use monitoring systems that expose the service’s health. These usually look like the dashboards from a control room in the movies.
Using these systems we define alerts - for example, we know how to recognize unhealthy patterns in the application, or hardware failures. The alert system sends us a notification when things break (by email, SMS, phone, etc.), instead of having someone watch the dashboards all day long and yell :)
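An alert is, at its core, a rule plus a notification. The sketch below is a simplified illustration in Python, not how any particular monitoring product works: the rule (share of HTTP 5xx responses above a threshold), the function names and the `notify` callback are all assumptions I picked for the example.

```python
from typing import Callable, Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def check_alert(
    status_codes: Sequence[int],
    threshold: float,
    notify: Callable[[str], None],
) -> bool:
    """Call notify() and return True when the error rate exceeds the threshold."""
    rate = error_rate(status_codes)
    if rate > threshold:
        notify(f"error rate {rate:.0%} is above {threshold:.0%}")
        return True
    return False


if __name__ == "__main__":
    # Two of the four recent responses are errors, so the alert fires.
    # print() stands in for the email/SMS/phone call.
    check_alert([200, 200, 500, 503], threshold=0.25, notify=print)
```

The point is that the computer evaluates the rule all day long, so nobody has to stare at the dashboard.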
Once in a while, these services release new features and security updates, like a change to the UI or the addition of new buttons. These changes require a software update that happens behind the scenes, most of the time without interrupting users.
Changes to the system introduce some risk, but they also introduce new features and bug fixes clients are waiting for. This leads us to -> deployment strategy.
We define a software deployment strategy so that software updates (installing new software) succeed, and so that we can recover quickly when they do not.
With every change we make to the system, we always keep in mind: “how do we recover from this if something goes wrong?”
We then combine the two procedures (software deployment and recovery) into a “playbook”, which can be thought of as a task list to execute. Finally, we automate it to ease the process.
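Here is what a very small automated playbook could look like in Python. It is only a sketch under my own assumptions: `install` and `is_healthy` are hypothetical callbacks standing in for whatever your real deployment and health-check tooling does, and real strategies (gradual rollouts, for instance) are more involved.

```python
from typing import Callable


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    install: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Install new_version; if the service is unhealthy afterwards, reinstall
    current_version. Returns the version left running."""
    install(new_version)           # step 1: deploy
    if is_healthy():               # step 2: verify
        return new_version
    install(current_version)       # step 3: recover by rolling back
    return current_version


if __name__ == "__main__":
    installed = []
    # Simulate a bad release: the health check fails after the update.
    running = deploy_with_rollback(
        "1.0", "1.1", install=installed.append, is_healthy=lambda: False
    )
    print(running, installed)
```

Notice that the recovery step is written before anything goes wrong, which is the whole idea of asking "how do we recover from this?" up front.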
This concept is, in my opinion, the most important one. Once we formalize a procedure in our daily work and find ourselves repeating it, we want to automate it. This lets us spend our time on more important things (research, development) rather than doing the same task over and over. Let’s abstract that.
When we have a “problem” or a “task” on our desk, we prefer to solve it only once. Coding it makes that possible. So, as SREs, we always prefer to code things rather than perform them manually, even though this takes more of our time the first time we solve the problem.
For example, using our monitoring and alert systems, we get notified when things do not work as expected. We can use these notifications to trigger some code that handles the issue. A simple example: if the application crashes (becomes unavailable), start it again automatically. This brings the service back up for customers, and we can debug what happened later.
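The restart example can be sketched in a few lines of Python. The names here (`ensure_running`, `FakeApp`) are invented for the illustration, and in practice a process supervisor usually does this job for you, but the logic is the same: check, restart if needed, check again.

```python
from typing import Callable


def ensure_running(
    is_alive: Callable[[], bool],
    start: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Restart the application until it is alive, up to max_attempts times.
    Returns True if the application is running at the end."""
    for _ in range(max_attempts):
        if is_alive():
            return True
        start()
    return is_alive()


class FakeApp:
    """Hypothetical application used only to demonstrate the function."""

    def __init__(self) -> None:
        self.alive = False  # pretend the application has just crashed
        self.starts = 0

    def is_alive(self) -> bool:
        return self.alive

    def start(self) -> None:
        self.starts += 1
        self.alive = True


if __name__ == "__main__":
    app = FakeApp()
    # The alert would trigger this call instead of waking someone up.
    print(ensure_running(app.is_alive, app.start), app.starts)
```

The cap on attempts is a deliberate choice: if restarting does not help, it is better to stop and call a human than to loop forever.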
System reliability - simply make sure the service is online and serving customers. Grow the infrastructure as needed while staying within budget. A lot of things happen behind the scenes to make it so.
Monitoring - we design, develop and integrate tools that give us visibility into what’s going on in the system: multiple graphs and counters that help us know its status.
Automate everything - this goes without saying. Once you have automated a task, you only spend 'thinking' time on the 'problem' once.
If I had to describe the SRE role in a short paragraph, it would probably be this:
An SRE is responsible for keeping the service up and for providing the ability to release software faster while reducing the risk involved, using tools and deployment strategies. To achieve that, we write code 🧞
I hope that the next time you meet someone with an SRE title, you will know a little better what their role is all about.