Two years ago, my company reached 500+ employees.
That's when we noticed the process for creating and managing systems was slowing down our productivity.
In particular, our SRE team was overwhelmed with developer requests since everything (from provisioning resources to granting access) required their assistance and sign-off.
We wanted to restore our development velocity by empowering teams so they didn't have to rely on the SRE team constantly.
That's when we decided to build a self-serve infrastructure portal to help teams be more autonomous.
This portal would empower teams to create and manage their own repos, CI/CD pipelines, compute services, and other resources. Something like an internal Heroku.
We released this portal company-wide 18 months ago. Since then, we've launched more than 300 services.
Here are 3 lessons I learned.
Journal Every Single Task
Every SRE task must have a ticket. This needs to begin before anything is automated.
In the beginning, our goal was to automate common manual tasks performed by the SRE team, and make these tasks self-serve.
For example, an SRE was required for tasks like deploying a system or creating a GitHub repo. These are the kinds of tasks we wanted to automate.
But finding these common tasks (beyond the painfully obvious ones) proved difficult because of the invisible nature of software. I rambled about this in another post.
How can you figure out what to automate if you don't know what tasks the SRE team performs, and how often they perform them?
And no, you cannot rely on anecdotal evidence or memory. There needs to be something tangible that can be used to make objective, data-driven decisions.
That's why it's important to have a process for handling incoming SRE requests. Every task, no matter how small, must be recorded (we used Jira tickets) so it can be counted and reviewed later on.
That means developers can no longer ask their favourite SRE for a small favour without a written request or "ticket". Same applies to product managers with an urgent business request.
These SRE tickets are crucial because they reveal what an effective infrastructure portal looks like for your organization.
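To make the counting concrete, here is a minimal sketch of how recorded tickets could be ranked to find automation candidates. It is not our actual tooling: the ticket records, the `task_type` and `minutes_spent` fields, and the `rank_automation_candidates` helper are all hypothetical, and it assumes the tickets have already been exported from the tracker into plain dictionaries.

```python
from collections import Counter

# Hypothetical ticket export. The field names are assumptions for this
# sketch, not fields defined by Jira.
tickets = [
    {"key": "SRE-101", "task_type": "create_repo", "minutes_spent": 15},
    {"key": "SRE-102", "task_type": "grant_access", "minutes_spent": 10},
    {"key": "SRE-103", "task_type": "create_repo", "minutes_spent": 20},
    {"key": "SRE-104", "task_type": "deploy_service", "minutes_spent": 45},
    {"key": "SRE-105", "task_type": "grant_access", "minutes_spent": 5},
    {"key": "SRE-106", "task_type": "create_repo", "minutes_spent": 15},
]


def rank_automation_candidates(tickets):
    """Return (task_type, ticket_count, total_minutes) tuples,
    most frequent task first, ties broken by total time spent."""
    counts = Counter()
    minutes = Counter()
    for ticket in tickets:
        counts[ticket["task_type"]] += 1
        minutes[ticket["task_type"]] += ticket["minutes_spent"]
    rows = [(task, counts[task], minutes[task]) for task in counts]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


for task, count, total in rank_automation_candidates(tickets):
    print(f"{task}: {count} tickets, {total} minutes")
```

Ranking by frequency first is only one possible choice. Sorting by total time spent instead would favour rare but expensive tasks.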
Have A Vision Of Your Ideal Architecture
It's not enough to just automate the most common manual tasks that currently exist.
You should have an opinion on what the most effective architecture for your business is.
Individual teams (usually) don't pay attention to the design choices of the rest of the company.
However, in order for a software company to scale efficiently, technological complexity and breadth must be controlled. This helps provide 2 main benefits:
- the opportunity to develop, and leverage existing, expertise and tooling within the company
- a common development environment that improves developer resourcing flexibility among teams
There should be an opinion of what an ideal architecture looks like for your company and business - not just for a subset of development teams.
For us, we decided that event-driven microservices (EDM) would be best, and that has influenced the tooling we have pursued.
For example, when we noticed that many teams needed a state store for their services, we automated the provisioning of Kafka topics before considering databases.
This doesn't mean that EDM is the best design choice for your company (that's up to your business). But whatever that choice is, it should guide your vision for which processes you make self-serve. This will strongly influence how your teams build new systems.
So don't just automate how systems are currently being built, but balance satisfying existing needs with a vision of what the ideal should be.
Self-Serve And Self-Organizing
Once teams have the ability to freely build and create, they will.
It's likely that if your self-serve portal is successful, more people will begin using it to provision resources (and more frequently too).
This means resources (such as repos, CI/CD pipelines, services, etc.) will be created more frequently. And the more "things to keep track of", the more important it'll be to have a strategy to organize everything.
For your company to operate smoothly, everyone should be able to answer (or find the answer to) these questions:
- What resources (or services) are available?
- What resources (or services) does this service depend on?
- Which service does this resource belong to?
Having a naming or tagging strategy is vital for the long term hygiene of your infrastructure.
In our experience, we found that having a "registry" to track resources was useful, but this might not be necessary for everyone.
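As an illustration only, a registry can be as simple as a lookup table that answers the three questions above. The sketch below is not our real registry: the `Resource` and `Registry` classes, their field names, and the example resource names are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    name: str
    kind: str          # e.g. "service", "kafka-topic", "repo"
    belongs_to: str    # the service this resource belongs to
    depends_on: List[str] = field(default_factory=list)


class Registry:
    """Hypothetical in-memory registry of provisioned resources."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def register(self, resource: Resource) -> None:
        # Refuse duplicate names so each name maps to exactly one resource.
        if resource.name in self._resources:
            raise ValueError(f"{resource.name!r} is already registered")
        self._resources[resource.name] = resource

    def available(self) -> List[str]:
        """What resources (or services) are available?"""
        return sorted(self._resources)

    def dependencies_of(self, name: str) -> List[str]:
        """What resources (or services) does this service depend on?"""
        return list(self._resources[name].depends_on)

    def service_of(self, name: str) -> str:
        """Which service does this resource belong to?"""
        return self._resources[name].belongs_to


# Example data: every name here is invented for illustration.
registry = Registry()
registry.register(Resource("orders-api", "service", belongs_to="orders-api"))
registry.register(Resource("orders-events", "kafka-topic", belongs_to="orders-api"))
registry.register(
    Resource("billing-api", "service", belongs_to="billing-api",
             depends_on=["orders-events"])
)
```

In practice the portal itself would call something like `register` whenever it provisions a resource, so the registry stays accurate without anyone updating it by hand.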
But all this is a means to an end. The most important question an organization strategy serves to answer is:
Who is responsible for this resource (or service)?
The answer to this should always be a single person (even for a shared resource) - usually a Team Lead.
Accountability is vital in a self-serve environment. It doesn't mean that someone will be "punished" if something goes wrong (a common misinterpretation), but it does mean we need to know:
- Who governs changes to this resource?
- Who should be notified if this resource is unresponsive?
- Who is monitoring this resource's cost?
Being able to answer these questions helps a self-serve infrastructure remain as effective on day 2 as it was on day 1, and keeps it from turning into spaghetti.
Our self-serve portal caused our resources to proliferate, and it would be impossible for any single person or team to keep track of everything. So, just as we distributed many SRE duties across teams, we did the same with how we organized ourselves, by building auditability and ownership into all our automation.
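One way to keep those questions answerable is to refuse to provision anything that doesn't carry ownership metadata. The following is a rough sketch under stated assumptions, not a description of our portal: the tag names (`owner`, `change_approver`, `alert_contact`, `cost_owner`) and the `provision` function are hypothetical, and the actual resource creation is left out.

```python
# Hypothetical tag names, one per accountability question.
REQUIRED_TAGS = ("owner", "change_approver", "alert_contact", "cost_owner")


def missing_ownership_tags(tags):
    """Return the required tags that are absent or blank."""
    return [tag for tag in REQUIRED_TAGS if not (tags.get(tag) or "").strip()]


def provision(name, tags):
    """Reject requests that lack ownership metadata.

    Real resource creation is omitted. This sketch only returns a
    record of what would have been created.
    """
    missing = missing_ownership_tags(tags)
    if missing:
        raise ValueError(f"cannot provision {name!r}: missing tags {missing}")
    return {"name": name, "tags": dict(tags)}


# Example request with made-up values.
record = provision(
    "orders-events",
    {
        "owner": "team-lead@example.com",
        "change_approver": "team-lead@example.com",
        "alert_contact": "orders-oncall@example.com",
        "cost_owner": "team-lead@example.com",
    },
)
```

Checking at provisioning time means a resource can never exist without an answer to "who is responsible for this?", which is cheaper than chasing owners after the fact.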