This article discusses the critical skills that make an expert DevOps engineer. Since people join our company from different jobs, with different competencies and levels of knowledge, we first tried to create a universal roadmap for the growth and development of a DevOps engineer. It didn't work out the way we wanted, so we went another way and compiled a list of the skills and competencies needed to work at our company. The result was a three-level system, where each level consists of a questionnaire and criteria for the candidate; in other words, a first version of grading and certification. However, this system did not solve our problem either. Later we found a great tool: the self-assessment skill matrix. We put it into practice for DevOps and later shaped it into our own skill matrix. After that, we held a session where each of us set current and desired six-month grades. We used Miro, but Google Sheets works just as well.
Because DevOps engineers at Mad Devs go deep into the operational and infrastructure side, we have defined a knowledge stack that lets our current and future employees start working as junior "MadOps".
To get started, you need at least a middle sysadmin level. For further growth, you also need an understanding of the following skills and principles:
- preparation and operation of the service in production;
- log analysis;
- creating fault tolerance;
- disaster recovery;
- scripting and automation;
- configuration management.
The Linux kernel, its subsystems, and the utilities around them are at the heart of everything. What you need to know:
- Processes, devices, disk partitions, lvm, file systems, namespaces, and cgroups;
- Boot loaders, startup process, systemd, and units;
- Netfilter network subsystem, user utilities: iptables, Shorewall, tc, etc., basic knowledge of network protocols;
- Virtualization - primarily KVM, also need to know the types of virtualization and other technologies;
- How to set up and work with basic services: dhcpd, NFS, sshd, DNS (bind), mail (Postfix, Sendmail), web (Nginx, Apache, Caddy, Traefik, etc.), databases (MySQL, PostgreSQL);
- Basic bash/python scripting;
- Basic troubleshooting.
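The basic scripting skills above can be illustrated with a minimal sketch; the function name and the 80% threshold are our own choices, not a prescribed tool:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report filesystems whose usage exceeds a threshold (percent).

check_disk_usage() {
    local limit="${1:-80}"
    # `df -P` gives stable POSIX output; column 5 is capacity %, column 6 is mount point.
    df -P | awk -v limit="$limit" 'NR > 1 {
        gsub(/%/, "", $5)                      # strip the % sign
        if ($5 + 0 >= limit)
            printf "WARN: %s is %s%% full\n", $6, $5
    }'
}

check_disk_usage "${1:-80}"
```

A script like this is a typical first step toward automating the "analysis before escalation" part of troubleshooting: it can be dropped into cron or a monitoring agent as-is.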
Even though Docker is losing ground, we cannot exclude it from the list of necessary skills: it is difficult to imagine anything replacing it for local use for several more years. As for k8s, official support for Docker as a container runtime is due to stop completely with the release of 1.23.
It should also be mentioned that Docker was the technology that brought containerization to the masses. While containerization technology itself had been around for a long time, its users were mostly “geeks.”
What you need to know:
- Differences between containerization and virtualization;
- Which Linux kernel components are necessary for containers to work;
- How to run docker containers using public docker images;
- Be able to write your own Dockerfiles based on best practices (layer order, caches, multi-stage builds, etc.);
- Prepare docker-compose files to speed up and simplify local development;
- How the network works in docker;
- Security practices for docker and dockerized applications;
- How to switch to dockerless tools if necessary. For example, buildkit, buildah, kaniko, etc.
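The best practices mentioned above (layer order, caching, multi-stage builds) can be sketched in a Dockerfile for a hypothetical Go service; the base images, paths, and binary name are illustrative assumptions:

```dockerfile
# Illustrative only. Stage 1: build with the full toolchain.
FROM golang:1.21 AS builder
WORKDIR /src
# Copy dependency manifests first so this layer stays cached
# until go.mod/go.sum actually change.
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /bin/app ./cmd/app

# Stage 2: minimal runtime image; no compiler, smaller attack surface.
FROM gcr.io/distroless/static-debian12
COPY --from=builder /bin/app /app
USER nonroot
ENTRYPOINT ["/app"]
```

The same layering logic (stable layers first, application code last, a slim final stage running as a non-root user) applies regardless of the language inside the container.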
Among the great variety of tools (Pulumi, CloudFormation, AWS CDK, etc.) that bring the IaC (Infrastructure as Code) approach to the masses, we chose Terraform as the main tool for describing the infrastructure.
It's essential to know about:
- Terraform is not a silver bullet and cannot replace absolutely all tools. To configure virtual machines, it is better to use the following tools: a) Packer; b) Ansible/Chef/Puppet/Salt; c) Whatever you want (bash?).
- Terraform is not a multi-cloud management tool; it can be called one only with a huge stretch. Code that manages AWS cannot deploy the same infrastructure in GCP: each provider has its own set of resources, and these resources are named differently. However, Terraform frees us from learning a new syntax and new code-organization approaches for each cloud/provider, which greatly speeds up writing, maintaining, and transferring code between engineers.
- Ability to read someone else's Terraform code: you can read and understand the code used in public modules (input/output parameters, logic, resources used); fluent usage of public modules;
- Ability to describe the infrastructure of the project in the form of readable, maintainable and reusable code;
- Writing your own modules and understanding how to use them;
- Understanding how to organize the structure of the project;
- Manually work with the state file: importing existing resources into code, deleting objects, and moving objects between resources (for example, from a resource to a module);
- Mad Devs has an open-source project, maddevsio/aws-eks-base on GitHub: a boilerplate with Terraform configurations for the rapid deployment of a Kubernetes cluster, supporting services, and the underlying infrastructure in AWS. A person who has mastered Terraform at a sufficient level is expected to actively participate in discussing improvements and to periodically contribute changes.
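As a sketch of fluent public-module usage, the snippet below consumes the well-known terraform-aws-modules/vpc registry module; the environment name, zones, and CIDRs are illustrative, not a recommended layout:

```hcl
# Illustrative only: a readable, reusable call to a public registry module.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "staging"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]

  enable_nat_gateway = true
}
```

Moving an already-deployed VPC under such a module is then a state operation, e.g. `terraform state mv aws_vpc.main 'module.vpc.aws_vpc.this[0]'`; the exact target address depends on the module's internals, which is one more reason the "read someone else's code" skill matters.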
It is now impossible to imagine a project that wants to reduce time-to-market without losing quality and does not use CI/CD (Continuous Integration / Continuous Delivery / Continuous Deployment) processes. Therefore, it is vital to understand the concepts and apply them correctly. Our task is often to write a pipeline around the development and source code flow; let's be clear that we don't force the flow onto the pipeline but adjust the pipeline to the flow. Which CI/CD system is used hardly matters nowadays, because they all have pretty much the same functionality. But remember that edge cases exist, and knowing the strengths and weaknesses of a particular system will let you make the right choice at the right time.
Necessary knowledge in this field:
- Understanding of the concepts of CI, Continuous Delivery, and Continuous Deployment: know what each one is and what the differences are;
- Writing simple and readable pipelines;
- Ability to transfer the development flow to the CI/CD pipeline, which may include complex logic:
  - manual steps,
  - triggering other jobs and services;
- Pipeline optimization. Ability to find bottlenecks, speed up, and optimize in terms of cost;
- Knowledge of various strategies for rolling out a new release and the ability to implement them:
  - Rolling update,
  - Blue-green and canary deployments;
- GitOps: what it is, when it is better to apply it, and what tools are better to use;
- Knowledge of tooling: integrating into pipeline steps the analysis of infrastructure and application code, the scanning of images and systems for vulnerabilities, and security checks of public endpoints.
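For illustration, a "simple and readable" pipeline with a manual deploy gate might look like this, written here in GitLab CI syntax; the job names, images, and deploy script are our assumptions:

```yaml
# Illustrative .gitlab-ci.yml sketch, not a prescribed setup.
stages: [test, build, deploy]

test:
  stage: test
  image: golang:1.21
  script:
    - go vet ./...
    - go test ./...

build:
  stage: build
  image: docker:24
  services: [docker:24-dind]
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  when: manual          # the manual step mentioned above
  script:
    - ./scripts/deploy.sh "$CI_COMMIT_SHORT_SHA"
```

The same three-stage shape maps directly onto other systems (GitHub Actions, Jenkins, etc.), which is why the concepts matter more than the specific tool.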
Each of the major cloud providers offers over 100 services. There is not enough time to know them all in detail, and a considerable number of the services are quite niche and may never come up in your work.
What is necessary to know:
- How to set up a network: this may include VPCs, security groups and ACLs, topology and subnets, peerings, VPNs, etc.;
- Virtual machines;
- Storages: block and object storages;
- Container deployment services: ECS, AppRunner, Beanstalk, AppEngine, Web Apps, etc;
- Database services (both relational and not);
- Managed Kubernetes cluster services;
- Load Balancers, CDNs, WAFs.
When building a cloud infrastructure, it is also helpful:
- Understand and know the various PaaS, IaaS, and SaaS offerings. This knowledge can significantly speed up the start of a project and avoid unnecessary steps;
- Be able to migrate to the cloud from on-premise and between clouds. It is necessary to correctly calculate capacity and cost, choose the required services, and develop and implement a migration plan;
- Constantly keep the Cost optimization paradigm in mind and apply cost reduction practices (spots, reserved, preemptible nodes, better and more efficient services or self-hosted solutions);
- Understand a Well-architected framework and be able to build infrastructure around it;
- Know how to build an infrastructure that meets certain compliances (iso 27001, PCI, GDPR, HIPAA) and is ready for audits;
- Be able to effectively manage an extensive infrastructure (with a monthly bill of $10k and above).
Wherever possible (and that is 99.9999999% of projects), we use managed solutions from cloud providers, which shapes the nature of our work with k8s. Most of the time we act as cluster users, not cluster administrators, which is why the list of necessary expertise is based on the user experience:
- Be able to distinguish the managed offerings of the different vendors: GKE, EKS, AKS. Know their advantages and disadvantages.
- Understand, and be able to work with and debug, the main objects: Pod, Deployment, ReplicaSet, Job/CronJob, DaemonSet, StatefulSet.
- Know the Service types and what an Ingress is.
- Be able to work with Configmaps, Secrets, sealed secrets, and external secrets.
- Understand the differences between sidecar and init containers and when to apply each.
- Cluster autoscaling. Use different types of nodes and pools for cost-optimization.
- Apply advanced pod scheduling techniques: nodeSelector, affinity, antiAffinity, topologySpread.
- Pod/namespace resource management.
- Understand and configure RBAC and Network Policies.
- Know the differences between validating and mutating admission controllers, and be able to write your own solutions if necessary.
- Implement ServiceMesh where needed.
- Widespread application/implementation of Security practices. Use OPA (Open Policy Agent) if necessary.
- Basic understanding of the architecture: what are the components, what are they responsible for, and how are they interconnected.
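Several of the scheduling and resource-management techniques above can be sketched in a single Deployment manifest; the labels, image, instance type, and limits are illustrative assumptions:

```yaml
# Illustrative fragment: node selection, zone spreading, and resource requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: t3.large   # pin to a cheaper node pool
      topologySpreadConstraints:                      # spread replicas across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: api}
      containers:
        - name: api
          image: registry.example.com/api:1.0.0
          resources:
            requests: {cpu: 100m, memory: 128Mi}
            limits: {memory: 256Mi}
```

Requests feed both the scheduler and the cluster autoscaler, so getting them roughly right is a prerequisite for the cost-optimization work mentioned earlier.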
Since Helm is a tool for Kubernetes, all requirements are connected to k8s knowledge, for example:
- “Reading” public Helm charts: which variables can be used, where they are substituted, and which k8s manifests the chart consists of.
- Create your own charts, using loops, conditions, and functions where necessary to reduce the amount of code. Templates must stay readable.
- Write Umbrella charts if needed.
- How to customize/patch public charts (i.e., adding new objects).
- Experience with tools like helm-diff and helmfile.
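A sketch of a chart template combining a condition and a loop, assuming the conventional `app.fullname` helper and a `.Values.ingress.*` layout of our own invention:

```yaml
# Illustrative Helm template fragment (templates/ingress.yaml).
{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "app.fullname" . }}
spec:
  rules:
    {{- range .Values.ingress.hosts }}
    - host: {{ .host | quote }}
      http:
        paths:
          - path: {{ .path | default "/" }}
            pathType: Prefix
            backend:
              service:
                name: {{ include "app.fullname" $ }}
                port:
                  number: {{ $.Values.service.port }}
    {{- end }}
{{- end }}
```

Note the `$` inside the `range` block: it restores the root scope, a detail that trips up most first-time chart authors.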
One of the most critical components of modern systems is Observability. It is impossible to efficiently deliver changes to the user and efficiently manage resources without well-tuned observability tools.
We often hear only about “Monitoring” and “Logging.” Observability is a broader concept that includes monitoring, logging and tracing.
- Ability to work with popular monitoring systems such as Prometheus, VictoriaMetrics, etc., and the components around them (i.e., numerous exporters);
- Ability to work with widespread logging systems/stacks: ELK, EFK, Loki, Datadog, etc.;
- Experience with popular tracing systems: Jaeger, APM, etc.; Errors tracking and performance monitoring: Sentry, NewRelic, etc.;
- Knowledge of how to make custom dashboards for Grafana based on the requirements;
- The skill to parse and filter logs in whichever logging systems are in use.
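As a small example of what "well-tuned" monitoring looks like in practice, here is a Prometheus alerting rule; the metric name `http_requests_total` and the 5% threshold are assumptions for illustration:

```yaml
# Illustrative Prometheus rule file fragment.
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
```

Rules like this are what turn raw metrics into the feedback loop discussed in the SRE section below: an alert that pages only when the error budget is actually at risk.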
It is tough to create a clear list of requirements here because we are not security specialists but rather implementers. So here are the general points:
- Adhere to the Least Privileges principles when working with users, service accounts, and granting rights.
- Over the past few decades, the infrastructure building process has changed dramatically. The old (and wrong) idea was "secure by default" inside your private network; the newer approaches are closer to Zero Trust (we do not trust anyone or anything). Therefore, one should try to adhere to this concept wherever possible, both inside and outside your infrastructure.
- Know ISO 27001, HIPAA, PCI DSS, GDPR, CIS Benchmark, and OWASP standards.
An important element of our work is the development and implementation of solutions whose goal is to simplify development, reduce costs, switch to newer, more efficient, safer technology, and so on. Several necessary delivery skills follow from this:
- Ability to decompose tasks into atomic subtasks;
- Ability to estimate your effort;
- Ability to specify requirements;
- Ability to build a Roadmap and move along it;
- Ability to find and apply “effective solutions” to emerging problems and challenges;
- Documentation management;
- Independent research, development and presentation of PoC;
- Implementing maintainable and customizable production-ready solutions.
Everyone knows that DevOps and SRE are primarily about culture and practices: DevOps comes from development and is aimed at delivering features to the client, while SRE comes from operations and is aimed at stability. Our requirements are pretty basic:
- Have a good understanding of the SDLC, with the Agile model being of primary interest;
- Know what Delivery Pipeline and Feedback Loop are. Be able to build/optimize these processes together with the team, to select an adequate tool for each step;
- Understand and be able to build an incident management process:
- Logging and categorization,
- Notification and escalation,
- Finding and eliminating the root cause,
- Playbook writing;
- Be able to write postmortems to systematically improve stability and quality;
- Be able to develop and implement a Disaster Recovery plan acceptable to the business requirements.
In addition to a broad technical outlook and a solid set of automation skills, it is extremely important for a good DevOps engineer to develop soft skills: the personal qualities that help connect and synchronize the work of all participants and departments into a single whole. There is no doubt that well-developed soft skills are an important element of both personal growth and career progression (and sometimes a fundamental one).
Most often, engineers are private people, but times change, and it is impossible to work alone. A DevOps engineer is the link between operations, development, and managers, constantly communicating with the team to help achieve a common goal.
What stands out among the skills:
- Self-education. Nowadays, when technologies change every day, it is impossible to rely on knowledge gained 10 years ago (unless it is fundamentals such as TCP/IP). You have to constantly improve and learn something new; without self-study, it is impossible to quickly improve your hard skills.
- Communication skills. The DevOps workflow is mostly built on teamwork, communication, problem escalation, etc. Within such communication, you can also test and sharpen your hard skills. Furthermore, pay attention to how you formulate your thoughts when setting goals and tasks: your team should receive clear and understandable explanations.
- Self-organization. The ability to work independently without constant mentoring. This is not about the moment when you have just started your duties and do not even know which direction to work in; but the sooner you can work without a permanent mentor, the faster the leveling-up process will go.
- Mentorship. You don't have to be a Senior Engineer to mentor someone. The ability to teach others is a good way to consolidate and systematize your own existing knowledge. It also helps to develop communication skills.
- Commitment. You need to be able to achieve your goals, whether alone or in a team. You won't always land on a project with a whole team of DevOps engineers, so you need to be able to set and reach goals on your own.
- Fluency in English. Most knowledge sources are written in English, technologies are created in English, and work in development and operations is carried out in English. In our company, we created a dedicated matrix for soft-skills assessment, where all the necessary skills are highlighted.
To summarize: what "DevOps engineer" means differs from company to company, making it difficult to compile a single list of competencies. Even over a 10-year career, there is not enough time to study all the directions and pitfalls. It is also worth considering which services companies use: some rely on cloud services, while others run their own or rented hardware. Therefore, the required knowledge depends on which company you want to work for. For exactly this case, we have compiled our DevOps engineer skills matrix to simplify the process for applicants and employees.
Previously published at maddevs.io/blog.