Site Reliability Engineering, popularly referred to as SRE, is a role in Computer Science Engineering whose main purpose is to provision, maintain, monitor, and manage infrastructure so that applications achieve maximum uptime and reliability. SRE is an emerging role, but the tasks an SRE performs have existed ever since the first application was developed. The software developers' scope ends with writing the code for the application. Everything from there on is the duty of an SRE: setting up the infrastructure and the services that run on it, providing the required network connectivity, providing a platform for the application to run on, and making sure every part of the application is up and running reliably 24x7. We can think of Site Reliability Engineers as the strong bridge between the users and a reliable application.
Let us look at 8 ways in which you can become a better SRE at work. SRE involves not just the various technologies you deal with and keep running, but also several non-technical characteristics.
1. SRE is all about the right Mindset
a. No blame game
b. Thirst to solve
As SREs, we deal with multiple components and act as a bridge between the users and the application. Even when the application is well written, a bigger responsibility falls upon the SRE to keep the application and the services it uses up and running. In this process, there might be situations where an SRE makes a mistake that causes a disruption or even an outage. When this happens, the first reaction shouldn't be to blame anyone for the outage. Instead, the following steps should be performed.
i. Fix the issue
ii. Write an RCA (Root Cause Analysis) that explains why the issue occurred in the first place. The names can be kept anonymous.
iii. Mention the first aid and the fix for the issue
iv. Discuss how the issue can be prevented the next time
v. Set an ETA for the fix
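To make the steps above concrete, here is a minimal sketch of how an RCA could be captured as a structured record, so that an incomplete write-up is easy to spot. The class name, the field names, and the missing_sections helper are my own illustrative assumptions and not part of any standard RCA template.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RootCauseAnalysis:
    """A blameless RCA record. All field names are illustrative assumptions."""

    summary: str        # what happened, without naming individuals
    root_cause: str     # why the issue occurred in the first place
    first_aid: str      # the immediate mitigation that was applied
    permanent_fix: str  # the long-term fix for the issue
    prevention: List[str] = field(default_factory=list)  # how to prevent a repeat
    fix_eta: Optional[date] = None  # when the permanent fix is expected

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that have not been filled in yet."""
        text_fields = ("summary", "root_cause", "first_aid", "permanent_fix")
        missing = [name for name in text_fields if not getattr(self, name).strip()]
        if not self.prevention:
            missing.append("prevention")
        if self.fix_eta is None:
            missing.append("fix_eta")
        return missing
```

A record like this deliberately has no field for who caused the outage, which keeps the write-up focused on the fix and the prevention.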
Another aspect is having the right mindset to solve problems. As an SRE, you have the responsibility to optimize the infrastructure, fix issues, build automation and monitoring tools, and more, all of which requires a lot of problem-solving skill. Unless you have the thirst to solve problems, you will only feel more stressed out or, even worse, cause issues.
2. Communication
a. Overcommunication is not a problem
b. Be kind and show empathy
Are you performing a production activity, or even a change on stage, that could affect other teams? Have you made progress in the project that you are working on? Always keep the necessary stakeholders in sync. Write emails and send Slack messages well in advance of the production activity, just before it, and after it. It might sound like over-communication, but trust me, as the company scales, you need to keep everyone relevant to the component that you are working on in sync. This way, if they have to take any action from their side, they will do it, and if they face any issues post-activity, they'll know who the right person to get in touch with is.
One other important characteristic to have as a human being is to be kind and show empathy. This applies to all levels of engineering, on either side of the conversation, period. Whether someone asks a silly question, makes a mistake, or behaves rudely with you, you should never mirror that behavior.
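The "well in advance, just before, and after" pattern can be written down as a small schedule so that no announcement is forgotten. The sketch below is only an illustration: the function name, the labels, and the default lead times of one day and fifteen minutes are assumptions of mine, so pick whatever suits your team.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def announcement_schedule(
    start: datetime,
    end: datetime,
    advance_notice: timedelta = timedelta(days=1),  # assumed default
    reminder: timedelta = timedelta(minutes=15),    # assumed default
) -> List[Tuple[str, datetime]]:
    """Return (label, send time) pairs for one production activity."""
    if end < start:
        raise ValueError("the activity cannot end before it starts")
    return [
        ("advance notice", start - advance_notice),  # well in advance
        ("reminder", start - reminder),              # just before
        ("completion", end),                         # right after
    ]
```

Each entry can then be turned into an email or a Slack message by hand or by whatever tooling the team already uses.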
3. Stay synced with the team
a. Do not miss team meetings
b. Prevent duplication of work
c. Do not compete, but contribute
In this work from home (WFH) period, the team meeting is often the only opportunity you have to speak to your teammates. What makes it special is that you get to stay synced with your team on what everyone is working on, whether they are blocked on any tasks, and how you can contribute to their tasks. You can also use this opportunity to share what you are working on and get help if necessary. This also prevents duplication of work.
4. Shadow teammates on tasks and issues
The best way to learn is by doing it hands-on, and the best way to begin is by watching how it is done. I also believe that the best way to retain what you have learned is by performing it repeatedly, and that includes watching your teammates perform the activities. Having several eyes on an activity also helps ensure that it is done without mistakes.
5. No Spoon-feeding, do homework
Do not expect your teammates and seniors to teach you every detail. Read the documentation, watch tutorials, read engineering blogs, practice on your own, and suggest improvements. Even a well-built system can have more efficient solutions that you can propose.
6. Be attentive and cautious on production
I've heard of people pretending to work while watching web series. They might be proud of their multitasking skills, but as far as I know, there is no such thing as multitasking at work while watching a web series, and I highly recommend that you do not do that. If you are interested in watching a series, I would suggest you use that motivation to focus on the work, finish the tasks quickly, and reward yourself with a couple of episodes later in the evening.
Attention is a core necessity of life, and the same holds true for an SRE. Be attentive to the commands you run, the alerts you get, the trends the charts show, and the logs of the services and applications. Prepare for activities well in advance and let the actual activity be a no-brainer copy-paste, so that you can pay attention to other indications during the activity.
7. Think before you hit enter
Do not underestimate sudo privileges. A lot of us have a habit of switching to sudo mode as soon as we log in to a machine, which is unnecessary. Even if the command you are running looks harmless, make sure to get the process and commands reviewed by your teammates, seniors, or the subject experts. This will save you from outages.
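One way to build this habit is a small check that makes you pause before an elevated or destructive command. The sketch below is an assumption-laden illustration: the needs_review function and the RISKY_PROGRAMS list are hypothetical and far from exhaustive, and the check is no replacement for an actual review by a teammate.

```python
import shlex

# Illustrative, not exhaustive: programs that deserve a second pair of eyes.
RISKY_PROGRAMS = {"rm", "dd", "mkfs", "shutdown", "reboot"}


def needs_review(command: str) -> bool:
    """Return True if the command uses sudo or mentions a risky program."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    uses_sudo = tokens[0] == "sudo"
    # Compare only the last path component, so "/bin/rm" matches "rm".
    # Checking every token is deliberately conservative: it can raise
    # false alarms, but it also catches "xargs rm" style commands.
    names = {token.rsplit("/", 1)[-1] for token in tokens}
    return uses_sudo or bool(names & RISKY_PROGRAMS)
```

For example, needs_review("sudo systemctl restart nginx") and needs_review("/bin/rm -rf /tmp/cache") both return True, while needs_review("ls -l /var/log") returns False.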
8. Keep version control systems in sync
Whether it is an NGINX config or any other service config, make sure to keep a copy in a version control system that is isolated from the machine, and keep it in sync. No one hopes for a machine to become unresponsive, but when one becomes unusable all of a sudden, you can bring up alternate machines with the same configs as the previous ones. Keeping the version control system in sync also helps in automation.
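A simple way to notice when a machine and the repository have drifted apart is to compare file hashes. This is a minimal sketch under a few assumptions: repo_dir is a local checkout that contains only config files (not the repository root with its .git directory), live_dir mirrors its layout on the machine, and the function names are my own.

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(live_dir: Path, repo_dir: Path) -> List[str]:
    """Return relative paths of repo files that are missing or differ in live_dir."""
    drifted = []
    for repo_file in sorted(repo_dir.rglob("*")):
        if not repo_file.is_file():
            continue  # skip directories
        relative = repo_file.relative_to(repo_dir)
        live_file = live_dir / relative
        if not live_file.is_file() or file_digest(live_file) != file_digest(repo_file):
            drifted.append(relative.as_posix())
    return drifted
```

Note that this only checks files that exist in the repository. A config that was created on the machine and never committed would need the same walk in the other direction.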
Listen to the Podcast with more examples and explanation
Read about what Site Reliability Engineering is and the 4 main things that Site Reliability Engineers take part in: Link to the Article
Check out my YouTube Channel here: Developer Tharun - YouTube
Thank you for reading the article.