Patrick Londa for Blink Ops

Posted on Oct 17, 2022 • Originally published at blinkops.com

Reducing Your Cloud Costs: An Operational Optimization Guide

#devops #cloud

Cloud costs are top of mind as many business leaders and teams are focusing attention on honing their operational efficiency.

In April at CIO.com’s Future of Cloud Summit, Dave McCarthy, research vice president of cloud infrastructure services at IDC, shared that cloud spending represents roughly 30% of current IT budgets. In the 2022 State of Cloud Report by Flexera, 750 surveyed executives estimated that they are wasting 30% of their cloud spend, while also saying that they expect costs to increase 47% over the next year. If you combine those stats (30% of the IT budget goes to cloud, and 30% of that is wasted), there is an efficiency opportunity of roughly 9%, or close to 10%, of IT budgets.

Achieving those cost savings isn’t as easy as flipping a switch. There is wasted spend embedded across multiple resource types, regions, and services. By function, the main categories of cloud spending are compute time, data storage, and data transfer.

In this post, we’ll outline a framework for reviewing your cloud spending today, identifying wasted resources, and reviewing your long-term infrastructure efficiency.

Reviewing Your Current Spending

“What are we currently spending money on?”

To start, you can review your current spend at the account level with the major cloud providers. AWS, Azure, and GCP all have reporting options that enable you to view and filter your spending over a period of time.

In AWS, you can create Cost and Usage Reports. In GCP, you can review your Cloud Billing Report and view spend by “Project” or other filters. In the Azure portal, you can download usage and charges from the “Cost Management + Billing” section.

These views may be useful to get started and see transactional costs, such as from data transfers. In order to get more granular details on your cloud spending, you should leverage resource labels and tags to accurately categorize expenses.

With labels and tags, you can associate resources with specific cost centers, projects, business units, or teams. You can then easily organize your resource data, create custom reports, and run specific queries.
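As a rough illustration of the kind of custom report that tags make possible, here is a minimal Python sketch that totals spend per tag value. The line items, the field names (`cost`, `tags`), and the tag keys are hypothetical stand-ins for rows in a billing export, not a real provider schema.

```python
from collections import defaultdict

# Hypothetical line items, shaped loosely like rows from a billing export.
# Field names and tag keys are assumptions made for this example.
line_items = [
    {"resource": "vm-1", "cost": 120.0, "tags": {"team": "payments", "env": "prod"}},
    {"resource": "vm-2", "cost": 45.5, "tags": {"team": "payments", "env": "test"}},
    {"resource": "disk-7", "cost": 12.25, "tags": {"team": "search", "env": "prod"}},
    {"resource": "ip-3", "cost": 3.6, "tags": {}},  # untagged resource
]

def cost_by_tag(items, tag_key, untagged_label="(untagged)"):
    """Sum cost per value of tag_key; resources missing the tag share one bucket."""
    totals = defaultdict(float)
    for item in items:
        value = item.get("tags", {}).get(tag_key, untagged_label)
        totals[value] += item["cost"]
    return dict(totals)

team_totals = cost_by_tag(line_items, "team")
for team, total in sorted(team_totals.items()):
    print(f"{team}: ${total:.2f}")
```

The "(untagged)" bucket is worth watching: if it is large, the report is hiding spend that nobody has claimed.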

If you do not currently have a mechanism or standard practice around resource tags and labels, you can refer to these how-to guides for setting up mandatory tags:

If you use more than one cloud computing provider, you’ll need to aggregate invoices and usage reports across vendors. In this scenario, having consistent tagging methods across platforms is even more useful as it can offer a consistent way to view your resource usage and expenses.
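One way to get that consistency is to normalize tag keys before you combine reports. The sketch below is an assumption-heavy example: the alias table and key names are hypothetical, and your own naming standard would replace them.

```python
# Hypothetical alias table mapping provider-specific tag keys to one canonical key.
CANONICAL_KEYS = {
    "team": "team",
    "owner-team": "team",
    "costcenter": "cost_center",
    "cost-center": "cost_center",
    "cost_center": "cost_center",
}

def normalize_tags(tags):
    """Lowercase tag keys and map known aliases to canonical names.

    Unknown keys are kept (lowercased) so no information is dropped.
    """
    normalized = {}
    for key, value in tags.items():
        lowered = key.strip().lower()
        normalized[CANONICAL_KEYS.get(lowered, lowered)] = value.strip()
    return normalized

aws_style = normalize_tags({"CostCenter": "FIN-42", "Team": "payments"})
gcp_style = normalize_tags({"cost-center": "FIN-42", "owner-team": "payments"})
print(aws_style == gcp_style)
```

Lowercasing keys is a deliberate choice here, since GCP label keys must be lowercase while AWS and Azure tag keys may use mixed case.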

Once you have a clear sense of your current spending, you can look for opportunities to reduce your expenses.

Eliminating Unnecessary Resources

“What resources are we spending money on and not using at all?”

As projects are spun up and shut down, there are often resources that become unattached and left behind. While they are no longer in use, they are still costing your organization money on a recurring basis.

Ideally, you have an automated way to regularly catch and delete these unattached resources. With a no-code platform like Blink, teams can scale up scheduled automations to continuously detect and remove unnecessary resources.

If you don’t have automations already in place, you can manually review resources in the console and remove unused ones in bulk. It can be time-consuming, but you may be able to reduce your operating costs significantly this way in the short term.

To know what types of resources to review, here are some common examples:

Unattached Disks

Unattached IP Addresses

Old Snapshots
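To make these three checks concrete, here is a minimal Python sketch that filters an inventory for each case. The record shapes (`attachments`, `associated_with`, `created`) are hypothetical; in practice you would fill them from your provider’s API or an inventory export, and the 90-day snapshot age is an arbitrary example threshold.

```python
from datetime import datetime, timedelta, timezone

def unattached_disks(disks):
    """IDs of disks whose attachment list is empty or missing."""
    return [d["id"] for d in disks if not d.get("attachments")]

def unassociated_ips(addresses):
    """IP addresses not associated with any instance or network interface."""
    return [a["ip"] for a in addresses if a.get("associated_with") is None]

def old_snapshots(snapshots, max_age_days, now):
    """IDs of snapshots created before the cutoff (now minus max_age_days)."""
    cutoff = now - timedelta(days=max_age_days)
    return [s["id"] for s in snapshots if s["created"] < cutoff]

# Fixed reference date so the example is repeatable.
now = datetime(2022, 10, 17, tzinfo=timezone.utc)
disks = [
    {"id": "disk-1", "attachments": ["vm-1"]},
    {"id": "disk-2", "attachments": []},
]
addresses = [
    {"ip": "203.0.113.10", "associated_with": None},
    {"ip": "203.0.113.11", "associated_with": "vm-1"},
]
snapshots = [
    {"id": "snap-old", "created": now - timedelta(days=200)},
    {"id": "snap-new", "created": now - timedelta(days=5)},
]

print(unattached_disks(disks))
print(unassociated_ips(addresses))
print(old_snapshots(snapshots, max_age_days=90, now=now))
```

The functions only report candidates. Deleting them is left as a separate, reviewed step.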

Finding and removing idle resources is a clear way to cut your operating costs, but it also is an important practice for maintaining a strong security posture. If you leave resources like unattached IP addresses, idle NAT Gateways, load balancers with no target, or orphaned Secrets lying around, bad actors could find them and take advantage of the information. In this way, resource management is key to reducing costs and reducing risk.

Optimizing and Updating Resources

“How can we optimize our existing resources?”

Now that you’ve reviewed and removed unused resources, you can look at optimizing the resources you are using.

Using the Right Family for the Job

Whether you are creating new resources or evaluating existing ones, it’s important to consider which family of resources best fits your needs. If you’re using general-purpose machines, there might be another more cost-effective machine that is a better fit.

Depending on your usage, you may need more capacity in some specifications than others. For example, if you’re using AWS, there are Compute Optimized instances under the C family (e.g. EC2 C7g instances), which offer optimal price performance for compute-intensive use cases like batch processing workloads and scientific modeling. Other families include Memory Optimized (e.g. EC2 R6a instances) and Storage Optimized (e.g. EC2 Im4gn instances). There are many other families (e.g. IOPS-, network-, and accelerator-optimized), depending on the platform and the specification you want to optimize for.

When considering your performance requirements, you might have use cases like batch jobs or workloads that are fault-tolerant. Azure, GCP, and AWS all have unused capacity that they offer as less expensive, less reliable Spot VMs. Compared to on-demand instances, they are up to 90% less expensive to run.

Updating to New Machines

Within each of these families, there are often newer versions being offered. The newer versions frequently run more efficiently or have higher performance, so it’s a good practice to upgrade to newer versions as much as you can.

One example of this is with EBS volumes. By switching from EBS gp2 volumes to EBS gp3 volumes, you can reduce your storage costs by up to 20%. There are some small performance tradeoffs, but it’s important to keep these types of upgrade opportunities in mind.
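To see what a price difference of that size means for a fleet of volumes, here is a small Python sketch. The per-GB prices are illustrative placeholders chosen to show a 20% gap; check the current pricing for your region before relying on them.

```python
def monthly_storage_cost(total_gb, price_per_gb_month):
    """Monthly cost of provisioned storage at a flat per-GB price."""
    return total_gb * price_per_gb_month

# Illustrative placeholder prices in USD per GB-month (assumptions, not quoted rates).
GP2_PRICE = 0.10
GP3_PRICE = 0.08

volumes_gb = [500, 1000, 250]  # hypothetical volume sizes
total_gb = sum(volumes_gb)

gp2_cost = monthly_storage_cost(total_gb, GP2_PRICE)
gp3_cost = monthly_storage_cost(total_gb, GP3_PRICE)
savings = gp2_cost - gp3_cost

print(f"gp2: ${gp2_cost:.2f}/month, gp3: ${gp3_cost:.2f}/month")
print(f"savings: ${savings:.2f}/month ({savings / gp2_cost:.0%})")
```

This compares storage price only. If a workload needs more IOPS or throughput than the gp3 baseline, provisioning the extra performance adds cost that this sketch does not model.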

Another AWS example is switching from older machines to ones that use the new AWS Graviton2 processors. Instances running on Graviton2 processors vs. Intel processors offer up to 40% better price performance, with specific efficiencies varying by family.

Looking for Low CPU Usage

One way to optimize your spending is by rightsizing resources to match the usage level that you need. For example, you may be running an instance or virtual machine that has more compute capacity than you need.

By reviewing your usage data, you can determine whether you are running at an average CPU usage of, for example, 30% or less. By moving to a smaller instance size or a different instance type, you can reduce your spend slightly per instance, which adds up over time.
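A check like this is easy to sketch in Python once you have CPU samples per instance. The instance IDs and sample values here are made up, and the 30% threshold is just the example figure from the text; in practice the samples would come from your monitoring service.

```python
from statistics import mean

def underutilized(cpu_samples_by_instance, threshold_pct=30.0):
    """Return {instance_id: average_cpu} for instances at or below the threshold.

    Instances with no samples are skipped rather than flagged.
    """
    flagged = {}
    for instance_id, samples in cpu_samples_by_instance.items():
        if not samples:
            continue
        average = mean(samples)
        if average <= threshold_pct:
            flagged[instance_id] = average
    return flagged

# Hypothetical CPU utilization samples, in percent.
cpu_samples = {
    "i-a": [10, 20, 30],
    "i-b": [80, 90],
    "i-c": [],
}

candidates = underutilized(cpu_samples)
print(candidates)
```

Average CPU alone can mislead for bursty workloads, so it is worth also looking at peak usage and memory before downsizing.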

Here are some how-to guides that show examples for each platform:

Using Long-Term Resourcing for Predictable CPU Usage

Another way to optimize your costs is by leveraging reserved instances or committed use discounts. In exchange for predictable computing expectations, the major cloud providers offer resources at a discount with a committed term, such as 1 year or 3 years.
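Whether a commitment pays off depends on how many hours you actually use the resource. A simple model, sketched below in Python, assumes the committed rate is paid for every hour of the term while on-demand is paid only for hours used. Both hourly rates are hypothetical.

```python
def break_even_utilization(on_demand_rate, committed_rate):
    """Fraction of hours a resource must run for the commitment to cost less.

    Assumes the committed rate is billed for every hour of the term,
    while on-demand is billed only for hours actually used.
    """
    return committed_rate / on_demand_rate

def cheaper_option(on_demand_rate, committed_rate, utilization):
    """Compare effective hourly cost at a given utilization (0.0 to 1.0)."""
    on_demand_cost = on_demand_rate * utilization
    return "committed" if committed_rate < on_demand_cost else "on-demand"

# Hypothetical hourly rates in USD.
ON_DEMAND = 0.10
COMMITTED = 0.06

threshold = break_even_utilization(ON_DEMAND, COMMITTED)
print(f"break-even utilization: {threshold:.0%}")
print(cheaper_option(ON_DEMAND, COMMITTED, utilization=0.9))
print(cheaper_option(ON_DEMAND, COMMITTED, utilization=0.4))
```

Under these assumed rates, a machine that runs less than about 60% of the time is cheaper on demand, which is why commitments suit steady, predictable workloads.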

Here are some how-to guides that show examples for each platform:

Starting Nightly Non-Production Scale-Downs

Are there any resources that you can shut down when they are not being used? For example, if your team is working with a test environment during certain work hours, you don’t need to run it 24 hours a day. You can scale it down at night and scale it back up the next morning.

With some automation, pausing and restarting a non-production cluster could be as simple as clicking an approval button in a Slack message, reducing your daily cloud costs.
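The scheduling logic behind a nightly scale-down can be quite small. This Python sketch decides whether an environment should be running at a given moment and estimates what share of the week it stays up. The working hours (08:00 to 19:00, Monday to Friday) are assumptions, and the actual scale-up or scale-down call is left out.

```python
from datetime import datetime

WORKDAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4

def should_be_running(moment, start_hour=8, end_hour=19, workdays=WORKDAYS):
    """True if the environment should be up at this local time."""
    return moment.weekday() in workdays and start_hour <= moment.hour < end_hour

def weekly_running_fraction(start_hour=8, end_hour=19, workdays=WORKDAYS):
    """Share of the 168 hours in a week that the environment stays up."""
    return (end_hour - start_hour) * len(workdays) / (24 * 7)

print(should_be_running(datetime(2022, 10, 17, 9, 0)))   # a Monday morning
print(should_be_running(datetime(2022, 10, 17, 22, 0)))  # the same Monday, late evening
print(should_be_running(datetime(2022, 10, 22, 10, 0)))  # a Saturday
print(f"running {weekly_running_fraction():.0%} of the week")
```

With these assumed hours, the environment runs 55 of 168 hours per week, so roughly two thirds of its compute hours are avoided.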

Here are some examples of how to pause and restart clusters nightly:

AWS: How to Scale Down AWS EKS Clusters Nightly
GCP: How to Pause Your GKE Cluster Nightly
Azure: How to Pause Your AKS Cluster Nightly

Storing and Moving Data Efficiently

“Can we optimize how our data is stored and transferred?”

Storing Only Relevant Data

Your cloud bill is also impacted by how much data you are storing. While it’s useful to collect data to see how your services are running, it likely becomes less useful and relevant over time. Even if you want to maintain as much data as possible, you’ll want to employ a strategy of periodically switching data over to less-costly, long-term storage vehicles, such as Amazon’s S3 Glacier storage.
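A tiering policy like this usually comes down to an age rule. Here is a minimal Python sketch that picks a storage tier based on how long ago an object was last modified. The tier names and the 30-day and 90-day thresholds are hypothetical; providers let you express similar rules as lifecycle policies on the bucket itself.

```python
from datetime import datetime, timedelta, timezone

def storage_tier(last_modified, now, warm_after_days=30, cold_after_days=90):
    """Choose a tier from the object's age: standard, infrequent, or archive."""
    age_days = (now - last_modified).days
    if age_days >= cold_after_days:
        return "archive"
    if age_days >= warm_after_days:
        return "infrequent"
    return "standard"

# Fixed reference date so the example is repeatable.
now = datetime(2022, 10, 17, tzinfo=timezone.utc)
objects = {
    "logs/today.json": now - timedelta(days=10),
    "logs/last-month.json": now - timedelta(days=45),
    "logs/last-year.json": now - timedelta(days=400),
}

for key, modified in objects.items():
    print(key, "->", storage_tier(modified, now))
```

Archive tiers typically trade lower storage prices for slower or more expensive retrieval, so the thresholds should reflect how often old data is actually read.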

Here are some how-to guides for AWS on how to identify data that hasn’t changed in a while and how to reduce logging storage costs.

Optimizing Data Transfers

Data transfers may also account for a significant part of your cloud costs, and vary greatly depending on their source, destination, method of transport, and size.

You can also likely expect charges if you are transferring data across regions or across availability zones. Unless your business case requires it, you should look to avoid data transfers that go across regions and availability zones.

While inbound (or ingress) data transfers from the internet to your cloud provider are generally not charged, outbound (or egress) transfers are charged, with rates that vary by service. You should reduce outbound data transfers from your cloud to external destinations as much as possible.

If you are transferring data across AWS services, for example, you should be utilizing private endpoints. This way, when you are accessing an S3 bucket from an EC2 instance, you can avoid data transfer charges.

The same principle applies to transferring data from your cloud to on-premises locations. Tools like AWS Direct Connect, GCP Direct Peering, and Azure ExpressRoute may offer a lower cost per GB compared to transfers over the internet. Actual savings depend on the amount of data you are moving, and if you are below a certain threshold, a dedicated connection might not make sense.
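That threshold can be estimated with a simple break-even calculation: a dedicated connection adds a fixed monthly fee but lowers the per-GB rate. The Python sketch below uses entirely hypothetical prices, so substitute the rates from your own provider and region.

```python
def break_even_gb(fixed_monthly_fee, internet_rate_per_gb, dedicated_rate_per_gb):
    """Monthly volume (GB) above which the dedicated connection costs less."""
    saving_per_gb = internet_rate_per_gb - dedicated_rate_per_gb
    if saving_per_gb <= 0:
        raise ValueError("dedicated rate must be lower than the internet rate")
    return fixed_monthly_fee / saving_per_gb

def monthly_costs(gb, fixed_monthly_fee, internet_rate_per_gb, dedicated_rate_per_gb):
    """Return (internet_cost, dedicated_cost) for a given monthly volume."""
    internet = gb * internet_rate_per_gb
    dedicated = fixed_monthly_fee + gb * dedicated_rate_per_gb
    return internet, dedicated

# Hypothetical prices in USD (assumptions for illustration only).
FIXED_FEE = 220.0
INTERNET_RATE = 0.09
DEDICATED_RATE = 0.02

threshold_gb = break_even_gb(FIXED_FEE, INTERNET_RATE, DEDICATED_RATE)
print(f"break-even at about {threshold_gb:,.0f} GB per month")
print(monthly_costs(1_000, FIXED_FEE, INTERNET_RATE, DEDICATED_RATE))
print(monthly_costs(10_000, FIXED_FEE, INTERNET_RATE, DEDICATED_RATE))
```

The model ignores setup costs and any charges from a network partner, which would push the break-even volume higher.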

You can read more about the types of data transfer charges in the Cost Optimization pillar of the AWS Well-Architected Framework, or these AWS, GCP, and Azure resources.

Achieving Operational Excellence with Blink Automations

So far, we have covered several areas where you and your team can focus and optimize your costs, but significant savings over time take new processes.

Beyond finding unused resources, you need an automated process for alerting you to cost reduction opportunities, and then making approval for removing resources as easy as clicking a button. If you only rely on scripts, you may accidentally take down environments or orphaned resources that should have been left up.
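The approval step is the part a plain script tends to skip. As a generic illustration (not Blink’s implementation), the Python sketch below deletes a resource only when an approval callback says yes and the resource is not marked as protected. The `approve` and `delete` callbacks and the `keep` tag are hypothetical names.

```python
def remove_with_approval(candidates, approve, delete, protected_tag="keep"):
    """Delete only approved, unprotected resources.

    approve(resource) -> bool and delete(resource) are caller-supplied callbacks.
    Returns (removed_ids, skipped_ids).
    """
    removed, skipped = [], []
    for resource in candidates:
        is_protected = resource.get("tags", {}).get(protected_tag) == "true"
        if not is_protected and approve(resource):
            delete(resource)
            removed.append(resource["id"])
        else:
            skipped.append(resource["id"])
    return removed, skipped

# Hypothetical candidates from an earlier detection step.
candidates = [
    {"id": "disk-2", "tags": {}},
    {"id": "disk-9", "tags": {"keep": "true"}},  # protected, never deleted
    {"id": "snap-old", "tags": {}},
]

deleted_log = []
removed, skipped = remove_with_approval(
    candidates,
    approve=lambda r: r["id"] != "snap-old",  # stand-in for a human decision
    delete=lambda r: deleted_log.append(r["id"]),
)
print("removed:", removed)
print("skipped:", skipped)
```

In a real workflow the `approve` callback would wait on a person, for example a button in a chat message, rather than return immediately.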

With Blink, you can use no-code automations to achieve operational excellence. In the cost optimization context, Blink lets you create and run dozens of common resource checks and send reports to email or Slack channels with simple, actionable options.

By running these Blink automations on a schedule, you’ll be able to confidently ensure that you are achieving operational excellence not just one time, but daily. You can take the same Blink automation approach for other operational excellence categories, like security operations, incident response, troubleshooting, and permissions management.

Get started with a free Blink account or reach out to us directly to hear more.
