Site Reliability Engineering (SRE) is a relatively new term in the software industry. It is a software engineering approach designed for improved system management and problem-solving. Think of it as a new form of system administration.
In SRE, a software engineer is in charge of tasks that are usually performed by the operations team. Site reliability engineering involves ensuring the availability, latency, performance, capacity, scalability, and deployment of software systems by the engineers themselves.
In this approach, software development meets operations. Companies using SRE hire people with software development experience to solve infrastructure and operational problems.
A site reliability engineer excels at the production side of software. They are expected to ensure that software is delivered and deployed reliably. Additionally, SREs are responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
The SRE model hinges on effective standardization and automation. Engineers are tasked with ideating and implementing methods to enhance and automate operational tasks, thus streamlining development and deployment processes.
Like system administrators, SREs must have some software development experience, but their primary strengths are network engineering, troubleshooting, deployment, and configuration. They must also be effective multitaskers, as they must ensure that multiple system components work together and deliver results consistently.
For greater clarity, let’s look at the average day of a site reliability engineer:
- Attending calls to fix/build deployment infrastructure.
- Ensuring that binaries and configurations are reproducible and suitable for integration and deployment environments, to maximize system availability.
- Managing configurations of cloud resources for automated deployments.
- Monitoring software infrastructure, tracking tickets, and checking logs to mitigate risks and resolve existing problems.
- Planning releases and future deployments.
- Participating in sprint planning, code review, code development, and architecture design to foresee risks and offer input on best practices.
- Planning software deployments with immutable infrastructure using CI/CD.
Bear in mind that due to its relatively recent origin, the SRE role is highly subjective when it comes to specific responsibilities. At some companies, SREs play a key role in software development and programming, while at others they might be expected to focus specifically on the operations side.
To land a site reliability engineering job, study the questions listed below. Prepare for a wide range of topics as SRE interviews usually cover multiple areas and/or disciplines, testing the candidate for their skills in programming, incident response, support, architecture, networking, problem-solving, and general behavior.
- What is an incident command system?
- What are the different shells and which are the most commonly used?
- Describe what happens when you type www.google.com in your web browser and hit enter.
- What is SSH and how does it work?
- What is an error budget?
- What is toil reduction and how is it achieved?
- Describe the boot process of a Linux System.
- What is the Standard C library?
- What is the GNU Project?
- Write a Python script for basic analysis of some debug logs.
- What is the benefit of a protocol like QUIC?
- When would you use UDP for a long-distance VPN connection?
- What protocol is usually used within corporate networks?
- Describe or name some TCP congestion control algorithms.
- What is IP fragmentation?
- What are Service Level Indicators (SLI) and how are they relevant in SRE?
- What is defense in depth?
- Why is DNS monitoring important?
- What’s the difference between encoding, encrypting, and hashing?
- What’s the difference between HTTPS, SSL, and TLS?
- What is 2FA?
- Describe or name some security headers.
- You take over a service that has no monitoring whatsoever. Which monitoring strategy would you use, to start with?
- How do you monitor failures that are local to a region?
- What is white-box monitoring?
- What is black-box monitoring?
- How do you change the priority of a running process?
- How do you make variables in a shell script available after the script exits?
- Explain TTD, TTR, and the importance of measuring them.
- Design the system architecture of a LinkedIn profile page.
- Design the system architecture of Twitter.
- Design a web-crawling system.
- What are some load-balancing strategies or techniques?
- Describe a situation where you strongly disagreed with a proposal. How did you tackle the situation?
- How can you use a REST API to get JSON data?
- Write code to parse a log file and process the data.
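For the two log-related questions above, a minimal Python sketch could look like the following. It is only one possible answer, not a reference solution. The line format (date, time, level, message), the name `analyze_log`, and the sample lines are all assumptions made for illustration, since real interview logs will look different.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format:
#   "2024-05-01 12:00:03 ERROR db: connection timed out"
LINE_RE = re.compile(r"^(\S+ \S+) (DEBUG|INFO|WARNING|ERROR|CRITICAL) (.*)$")


def analyze_log(lines):
    """Count lines per level, count repeated error messages,
    and count the lines that do not match the assumed format."""
    levels = Counter()
    errors = Counter()
    unparsed = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        match = LINE_RE.match(line)
        if match is None:
            unparsed += 1
            continue
        _timestamp, level, message = match.groups()
        levels[level] += 1
        if level in ("ERROR", "CRITICAL"):
            errors[message] += 1
    return levels, errors, unparsed


if __name__ == "__main__":
    # Hypothetical sample data; in practice, pass an open file object instead.
    sample = [
        "2024-05-01 12:00:01 INFO service started",
        "2024-05-01 12:00:03 ERROR db: connection timed out",
        "2024-05-01 12:00:04 ERROR db: connection timed out",
        "not a log line",
    ]
    levels, errors, unparsed = analyze_log(sample)
    print("lines per level:", dict(levels))
    print("most common errors:", errors.most_common(3))
    print("unparsed lines:", unparsed)
```

Because `analyze_log` accepts any iterable of strings, the same function works on a list in a test and on a file opened with `open(path)`, which reads line by line and keeps memory use flat on large logs. In an interview, it is worth saying out loud what you do with lines that do not match, since silently dropping them can hide the very problem you are looking for.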
Bear in mind that these questions provide a guide and structure around which interviewees can educate themselves. They are a starting point from which to approach your preparation for bagging a coveted SRE job. Since the SRE role is new and requires specialized capabilities, expect to spend a couple of months brushing up on old lessons, studying newer facets of domain knowledge, and developing the technical and people skills required to thrive in this position.
Put in the requisite effort, and the rewards will be worthwhile.