DEV Community

Adam Hammond for Squadcast

Posted on • Originally published at squadcast.com

How small changes to your SLOs can be SMART for your business - A narrative case study

In the second part of his "Choosing SLOs that are appropriate for our customers" blog, Adam Hammond narrates a fictional case study through Bill Palmer, one of the protagonists of The Phoenix Project, and shows "How small changes to your SLOs can be SMART for your business".

In our previous blog, we discussed why you need to choose SLOs that are appropriate for your customers. We don’t always write out S M A R T and list our SLOs immediately. The process is organic, and it may take a while. Most businesses have a rigorous reporting and metric gathering regime, and in most situations, you will just need to tweak this to get the desired results.

To elaborate, in this blog we will focus on a fictional company named “Acme Interfaces, Inc” (Acme), which already has Measurable, Achievable, and Timebound SLOs. Bill Palmer, one of the protagonists of The Phoenix Project, is jumping into the newly formed role of CTO. He is going to help Acme reform their SLOs so that they’re Specific and Relevant to their customers’ needs and their business strategy. Despite consistently met internal and external service levels and fantastic feedback scores, customers are rushing away from Acme. It’s Bill’s job to figure out why and restore Acme to its previous glory.


“Bill, we’ve got a problem here but I’m not sure what it is. Steve said you were great at fixing difficult problems; I’m hoping you can do that here. Our sales are down, and our long term customers are getting ready to leave. We need help fast.” Dan looked across at me with a grim expression on his face. Negative analyst briefings were strewn across his desk.

“I’ve had a bit of experience with this. It’s extraordinary - I’ve looked through the service level reports for the last two years: reported internal metrics have been stable.” I put a report from my hand onto the table. “No breaches of external SLAs.”

All requests with a Status Code of 2XX and total volume of requests

I put down another stack of paper. “...and customer satisfaction looks great.” I topped the stack off with a printout of the customer satisfaction dashboard.

NPS for Acme Interfaces, Inc of customers with spending higher than US$250,000

Dan looked at me and exhaled heavily. “You’ve found the same things as we have. Everything looks fine, but everything is not fine. The company was going great under our previous CEO, Nick, but as soon as we diversified our customer base, we started having problems. Our customers want our product until they use it, and we can’t close long-term contracts. When we were a one-customer company, we didn’t have these problems.”

He pulled the twelve-month rolling satisfaction report from the stack I had placed on the table, along with a customer churn report. “What I don’t understand is how we can have an NPS score of 8.2, but our churn is close to 80% after a year of using our product. I need your help.”

I sat and thought about the situation for a little bit. It was definitely a unique situation. “Look, I’ve got some ideas. I need a pre-sales business analyst to help me understand the current business profile and our history a bit better.”

Looking back at me with a bemused expression, Dan picked up the phone. “John, send in Jenny.” He put the phone down. “Well, you’re definitely on the right track. Jenny is our business engagement lead. She can help you out with at least a third of what you’re after. She knows everyone; Jenny can sort you out.”

Jenny opened the doors and walked in, nodded at Dan and then looked at me. “Bill, I take it?”

“That’s me. Ready to jump into it?” I asked. She nodded back at me.


“Bill, I’ll be honest. Our biggest problem is Globex Corporation.”

“You’re saying that our largest customer is our biggest problem?” Bill looked slightly quizzically at Jenny as she sipped her coffee.

“Frankly, yes. As they were our first customer, Nick always prioritised what they wanted. But that’s the problem: they’re large and cumbersome. What they want is not what the market wants. They’ve had decreasing sales year-on-year, and all of their feature requests have been perceived negatively by new customers in direct user surveys.”

Bill scratched his chin. “I’m not saying I don’t believe you, but how do you explain all of our metrics that look great?”

Jenny smiled grimly at Bill. “That’s an easy answer: all of our SLAs have been designed to look after Globex and no one else. Take a look here.” Jenny pulled out a stack of paper that looked very similar to the one that Bill gave to Dan. “Customers over $250,000, we only have one of those: Globex. All of our other customers pull out before larger contracts, or they never go further with our product. Nick tailored every single SLA to Globex; they don’t care about latency because they predominantly use our interfaces for driving their reporting engine. Their reports take a week to run each quarter. All of our metrics are around volume, but none focus on service quality.”

Bill flicked through the papers Jenny had put in front of him, eyeing each one closely. “Well... You’re right, how has this not been picked up before?”

“Nick had a strict ‘clean dashboard’ policy. He didn’t want to see raw data; he just wanted dashboard views. He was unequivocal that as far as he was concerned, data would confuse industry analysts and that they just needed to provide positive results. Coupled with his intense focus on Globex, it just ended up that everything focuses on them and the market responds because they are a significant source of revenue for us. Of course, that has had problems now that Dan has been trying to diversify our client base.”

Bill sat quietly for a few minutes. “So… I think I have a way out of this. Can you please get me some data?”

“Sure, what do you need?” Jenny pulled up a document on her iPad.

“Please send me the engagement reports we have for all of our leads and clients. I want to specifically focus on what problems our customers are trying to solve with our products. Please also have the BI team send me the raw data for customer surveys and a separate data set which shows the NPS of customers that exited our platform. Also have the SRE team send me the stats on response times for requests, as well as the status code breakdown for a rolling twelve-month period.”

Jenny finished typing out the list of statistics and looked back to Bill. “I’ll have these back to you by tomorrow morning.”


“Well Dan, we’ve discovered the problem.”

Dan’s face lit up as Bill sat down in front of him. “You what!? How did you do that? It’s only been two weeks.”

Bill sighed heavily. “Well, we’ve found the problems, but I’m afraid it’s going to require substantial work to fix them.”

Dan’s smile faltered slightly. “How bad is it?”

“It’s quite bad, Dan. All of the stats our commercial team use for our Service Level Objectives are not suited to the market and our template SLA only satisfies the needs of Globex Corporation. I sincerely doubt if anyone except Globex has had a good time using our platform.”

Dan leaned back in his chair. “Don’t pull any punches, Bill. Tell me how bad it is.”

“Our NPS across our entire customer base is four. On average, 11.5% of our requests fail each month, and that figure can reach 23%. Anyone using our platform APIs for real-time activity has found it to be non-functional under load.”

“...But, how can this be? All of our data has been so good. We’re still making our revenue targets. How have we missed this?”

“I said it before; our reporting focuses on Globex Corporation. They use our platform for their extensive quarterly reporting, so all of our SLO reporting has targeted this use case. The problem is our market is not interested in using our product for reporting. They want to use our real-time APIs for generic use cases so they can focus on their core development tasks. Diversifying is not working because our platform is not built for the markets we’re trying to break into.

“Here’s the most prominent example we could find of reporting that looks really great but isn’t. Nick had these SLAs determined based on Globex using the system for reporting.” Bill picked up a sheet and placed it in front of Dan.

Percentage Breakdown of Major Status Codes

“This looks good, but look what happens when we remove the ‘202’ status codes, which just mean the system has accepted a request and is still processing it.”

Percentage Breakdown of Major Status Codes excluding 202s

“As you can see, in reality, we’re barely meeting what would be considered a proper SLO for our systems. If we want to increase our market, we need to make some drastic technology changes now and update our SLOs to meet the expectations of potential customers.”
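Bill's point about the 202s can be illustrated with a minimal sketch. The sample traffic below is invented for illustration and is not Acme's data; `success_rate` is a hypothetical helper that measures the share of 2XX responses, optionally ignoring codes that say nothing about whether the request actually succeeded.

```python
def success_rate(status_codes, exclude=()):
    """Share of counted responses that fall in the 2XX class.

    Codes listed in `exclude` are dropped from both the numerator and the
    denominator, so they can no longer dilute the result.
    """
    counted = [code for code in status_codes if code not in exclude]
    if not counted:
        return 0.0
    ok = sum(1 for code in counted if 200 <= code < 300)
    return ok / len(counted)


# Hypothetical sample: mostly 202s from long-running reporting jobs,
# plus a smaller stream of real-time requests that either succeed or fail.
sample = [202] * 700 + [200] * 240 + [500] * 40 + [504] * 20

with_202 = success_rate(sample)
without_202 = success_rate(sample, exclude={202})

print(f"2XX rate including 202s: {with_202:.1%}")
print(f"2XX rate excluding 202s: {without_202:.1%}")
```

In this made-up sample the same traffic looks far healthier when the 202s are counted as successes than when they are set aside, which is the kind of gap Bill is showing Dan.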

“Well, Bill. You’ve done this before, what do you suggest?”

“Dan, with the help of the SRE Team and Jenny, we’ve been able to build a plan.” Bill pulled out his phone and sent it across to Dan.

“We investigated the cause of most of the errors, and it seems like there are issues with IOPS on the database. The first step is to migrate our storage to at least 5000 provisioned IOPS so we can meet the real-time request demand. The SRE team has already upgraded that. Here’s the rest of our plan to normalise our performance and track our progress. Here is our planned SLA, with both internal and external SLOs to help us meet our customer expectations.”

“Our most obvious problem was our reporting metrics for request failures. We’ve added more specific wording so that reporting requests do not dilute our metrics. To support this, we’ve added an internal SLO for IOPS to be monitored by the SRE Team. We found that after waiting 5 seconds, our software would error when a result wasn’t returned. Increasing IOPS eliminated most of our request failures.”

Dan looked over at Bill, surprised. “What do you mean most?”

“I mean, we’re now down to an average failure rate of 1.2%. We’ve also changed our NPS reporting to include all paying customers, and we’ve lowered our targets because we cannot possibly meet an NPS target of 8, given the reality of the situation.”
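The effect of widening the survey base can be sketched in a few lines. This assumes the score is reported as a simple 0 to 10 average, as the story's figures suggest; the responses, the `annual_spend_usd` field, and every customer name except Globex are hypothetical.

```python
from statistics import mean

# Hypothetical survey responses; only the Globex name comes from the story.
responses = [
    {"customer": "Globex", "annual_spend_usd": 400_000, "score": 9},
    {"customer": "Customer A", "annual_spend_usd": 40_000, "score": 4},
    {"customer": "Customer B", "annual_spend_usd": 60_000, "score": 3},
    {"customer": "Customer C", "annual_spend_usd": 20_000, "score": 5},
    {"customer": "Customer D", "annual_spend_usd": 120_000, "score": 4},
]


def average_score(rows, min_spend_usd=0):
    """Average score of customers spending more than `min_spend_usd`."""
    scores = [r["score"] for r in rows if r["annual_spend_usd"] > min_spend_usd]
    return mean(scores) if scores else None


# The old dashboard only looked at customers spending over US$250,000.
big_spenders_only = average_score(responses, min_spend_usd=250_000)
# The new report includes every paying customer.
all_paying = average_score(responses)

print(f"Customers over US$250,000: {big_spenders_only}")
print(f"All paying customers: {all_paying}")
```

With a spend filter that only one customer passes, the dashboard reports that one customer's opinion. Dropping the filter gives a lower number, but one that describes the whole customer base.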

“That’s fair enough; I’m fine with explaining the difference to industry analysts.” Dan motioned for Bill to continue.

“Finally, we have a new internal SLO: an average NPS of 8 among new customers. That might sound crazy, but we found that almost all of the customers who failed to sign a contract complained predominantly of slow request response times. We think that by keeping our request times below 250 milliseconds, we can retain most of our new customers and they will be promoters.”
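A response-time target like this one only becomes Specific once it says what share of requests must meet it. Here is a minimal sketch of such a check. The 250 ms threshold comes from Bill's plan; the 99% target share, the function names, and the sample latencies are assumptions made for illustration.

```python
def share_within_threshold(latencies_ms, threshold_ms=250.0):
    """Fraction of requests that completed in under `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("no latency samples to evaluate")
    within = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return within / len(latencies_ms)


def meets_latency_slo(latencies_ms, threshold_ms=250.0, target_share=0.99):
    """True when enough requests beat the threshold.

    `target_share` is an assumed value; the story does not state one.
    """
    return share_within_threshold(latencies_ms, threshold_ms) >= target_share


# Hypothetical real-time request latencies in milliseconds.
sample_ms = [120, 180, 240, 260, 90, 5000, 200, 150, 230, 100]

share = share_within_threshold(sample_ms)
print(f"Requests under 250 ms: {share:.0%}")
print(f"SLO met: {meets_latency_slo(sample_ms)}")
```

Tracking the share of fast requests rather than an average keeps a handful of very slow requests, such as the ones that hit the 5 second timeout, from hiding behind a healthy-looking mean.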

Dan was silent for a few moments as his eyes rolled over the documents that Bill had placed in front of him.

“Okay Bill, I like the look of all these changes, what can you guarantee me in terms of customer retention? After all, that’s the biggest problem. I don’t want to move focus away from Globex if it’s not going to increase our operating performance.”

“Well, that’s the thing, Dan. After working with Jenny, we think we can get 80% customer retention on new sales if we implement these targets. Almost all of our customers said they loved the system when it worked. They just need it to be performant, and they will come. Jenny thinks she’ll be able to reach out to some former customers and get them to sign back up with us for a new trial, too. We went out and met the customers, understood what they wanted, and we made sure that these new SLOs were specific and attainable. These aren’t numbers we’ve picked from a hat; this is science. We already had a great system in place for metric measurement and monthly reporting; we just needed to tune it correctly.”

“Okay, you’ve got six months. I want to see all of these SLOs met and the customer retention numbers. If it all works out, we can announce our results the month after and make our new SLA public at the same time.”


Bill stood up in the boardroom, buttoned his coat and walked up to the lectern.

“Thank you all for the opportunity to demonstrate our progress before we announce our results tomorrow.”

“Six months ago, everyone here believed we had rock-solid SLOs, a great SLA with our target clients, and a great reporting system. That was true for our old market, but not for the new. Dan came to me looking to transform our business so that we could diversify our client base and grow as a business.”

“Our first step in solving this problem was to look at the data we’d been using and look for any discrepancies; there were a few. Our main problems were our focus on a single client, and some issues with what we considered to be a successful client request. After changing the parameters of our reporting to match what we wanted our company to be delivering, it was immediately clear that our system was failing a lot more than we thought, and our customers were not having their expectations met.”

“The first thing we did was resolve the root cause of our system failures, which was rather simple. This hadn’t been caught earlier because our SRE team was focused on another set of goals that aligned with assisting Globex Corporation, our primary customer. Simply put, our storage was not keeping up with our systems, and we just needed to upgrade it, which was relatively straightforward.”

“Our second issue, which was our primary focus, took a little longer to resolve fully. Over the past six months, we’ve moved resources away from supporting predominantly reporting interfaces to real-time interfaces, and we’ve adjusted our SLOs to focus on non-reporting response times. With the great work of our developers tuning their code, and the SRE team tuning our web servers, we’ve seen our response time drop below our target of 250 ms in the last two months.”

“These two issues were not our only problems, so we also made sure to include all of our paying customers in our NPS surveys. We put an internal focus on making sure our new customers were fully satisfied with the product, and any small issues they had were prioritised in the development pipeline. This has brought our overall NPS to 8.75, up from only 6 in our first all-customer survey six months ago.”

“Overall, a strong focus on our vision for where we wanted to be, making sure our goals were aligned with that vision, and then re-focusing our existing SLO and SLA reporting has let us expand our revenue by 35% and keep 85% of new customers on our product. We understood our market, listened to our customers, and responded accordingly.”


PRESS RELEASE FOR IMMEDIATE CIRCULATION

“Acme Interfaces, Inc sees explosive growth over the last half, attributes success to Know-Your-Customer approaches and SMART Service Level development.”

After implementing a significant change that empowered our SRE and Sales Teams, we’ve been able to drive 85% new customer retention and see a 35% increase in revenue. Our average Net Promoter Score (NPS) has increased by 0.5 to 8.75, even though we’ve expanded our survey base to all our paying customers. We’re also introducing a new customer-focused SLA today that should provide a performant base for all of our customers to depend on as we move into the future.

A big thanks to our CTO Bill Palmer and our new Head of Business Development Jenny Masters who spearheaded the internal development of our new SLO and SLA offering.

Did you enjoy this piece of content written in a narrative case study format? We would love to hear your thoughts! Leave us a comment or reach out over a DM via Twitter.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.
