
How We Optimized Our Kubernetes (EKS) Infrastructure Cost by More Than 50%

snyadav1994 ・6 min read

At a high level, it’s easy to see why Kubernetes (K8s) is popular: it’s a flexible, scalable, open-source orchestrator that streamlines the task of managing containers.

Drawn by K8s benefits like portability, flexibility, multi-cloud capability, and huge community support, we planned to migrate our stack of microservices to K8s on the AWS platform.

Cost optimization is core to the cloud practices in my organization, so in parallel we built strategies to minimize our cloud cost after migrating 100+ microservices to K8s.

After migrating the entire workload to K8s, we ended up with more than 40 servers (worker nodes) of type c5.4xlarge (16 vCPUs and 32 GB RAM), for which we paid nearly $19,600 per month (at $0.68 per hour per instance).

To optimize cost we came up with the following strategy:

The idea was to scale down non-production environments (e.g. QA/staging) after office hours; in our case the cluster was scaled down at 8 pm and scaled back up at 8 am. On top of this, the cluster stayed scaled down over the weekend.

Doing this reduces the cost to roughly half of what would be incurred with the cluster up 24×7.
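As a back-of-the-envelope sanity check on that claim, here is a sketch of the arithmetic using the figures above (40 nodes at $0.68/hr); a 720-hour month and ~22 weekdays per month are my own approximations:

```shell
# Rough cost comparison: always-on vs. office-hours-only schedule.
# 40 nodes and $0.68/hr come from the article; 24*30 hours per month
# and 22 weekdays per month are approximations for illustration.
awk 'BEGIN {
  rate      = 40 * 0.68          # $/hour for the whole fleet
  always_on = rate * 24 * 30     # cluster up 24x7
  scheduled = rate * 12 * 22     # cluster up only 8am-8pm on weekdays
  printf "always-on: $%.0f/month\n", always_on
  printf "scheduled: $%.0f/month (%.0f%% saving)\n", scheduled, 100 * (1 - scheduled/always_on)
}'
```

This prints roughly $19,584/month always-on versus about $7,181/month on the schedule, a ~63% saving on node-hours alone, before accounting for the extra capacity that autoscaling adds during the day.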

Our initial approach was as follows:

We have our applications deployed across 4 Availability Zones (AZs), and server scaling in the cluster was managed by 4 Auto Scaling groups (ASGs), one per AZ. We took advantage of ASG scheduled actions: at 8 pm, a scheduled action set the min, max, and desired counts of these ASGs to 0, leaving all running pods in a Pending state. The next morning, another scheduled action scaled the cluster back up by setting min, max, and desired to 1, 10, and 1 respectively; the scheduler would then immediately try to place all pending pods on the available worker nodes, which scaled up gradually as demand required.
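The scheduled-action setup described above can be sketched as follows. This is a sketch, not our exact job: the ASG names are placeholders, the recurrence expressions are in UTC, and `put-scheduled-update-group-action` is the standard AWS CLI call for ASG scheduled actions.

```shell
# Sketch of the initial approach: for each ASG, register one scheduled
# action to zero it out at 8 pm on weekdays and one to restore it at 8 am.
schedule_office_hours() {
  local asg
  for asg in "$@"; do
    # Scale everything down at 20:00 Monday-Friday.
    aws autoscaling put-scheduled-update-group-action \
      --auto-scaling-group-name "$asg" \
      --scheduled-action-name "${asg}-night-down" \
      --recurrence "0 20 * * 1-5" \
      --min-size 0 --max-size 0 --desired-capacity 0
    # Bring one node back at 08:00 Monday-Friday; max 10 leaves headroom
    # for the cluster to grow during the day.
    aws autoscaling put-scheduled-update-group-action \
      --auto-scaling-group-name "$asg" \
      --scheduled-action-name "${asg}-morning-up" \
      --recurrence "0 8 * * 1-5" \
      --min-size 1 --max-size 10 --desired-capacity 1
  done
}

# Example (placeholder names): one ASG per availability zone
# schedule_office_hours eks-asg-az-a eks-asg-az-b eks-asg-az-c eks-asg-az-d
```

Because no action runs on Saturday or Sunday, the Friday-night scale-down keeps the cluster at zero over the weekend.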

Challenges faced in this approach:

  1. What if a developer wants to work after office hours or over the weekend to test a single application?

  2. Is it worth scaling up the entire cluster just to test a single application?

  3. As mentioned, scaling down the ASGs leaves all pods in a Pending state; so even if, to save cost, we scale up a single ASG with min 1, max 10, and desired 1 worker nodes, would the developer be lucky enough to get the intended pods placed on those nodes?

These challenges led us to another solution.

We came up with a solution that handles four scenarios:

  1. Scale down the entire k8s cluster.
  2. Scale up the entire k8s cluster.
  3. Scale up user-specified applications.
  4. Scale down user-specified applications.

Flow Chart:

Note: The code snippets below need input parameters such as:

  1. Activity – e.g. scale_up, scale_down.
  2. Namespace – All, or any namespace present in the cluster, e.g. default, kube-system, test-namespace.
  3. Application_Name – populated based on the Namespace chosen by the user; e.g. choosing test-namespace populates all deployments present in test-namespace.

'All' indicates all namespaces present in k8s cluster.

1. Scale down the entire k8s cluster:

If Activity=scale_down and Namespace=All (i.e. scale down all namespaces present in the k8s cluster), then Application_Name=All (as mentioned before, it populates itself based on the chosen Namespace):
a. Since the namespace chosen is All, the script gets all namespaces into an array variable using:

 $ kubectl get namespaces

b. Iterate over all namespaces, get the deployments in each, and scale each deployment down to 0 replicas:

 $ kubectl get deployment -n <namespace-name>
 $ kubectl scale deployment <deployment-name> -n <namespace-name> --replicas 0

c. Scale down all ASGs:

 $ aws autoscaling update-auto-scaling-group --auto-scaling-group-name <ASG-name> --min-size 0 --max-size 0 --desired-capacity 0  

Points to be noted:
i. This branch scales down the entire k8s cluster and its deployments.
ii. Since the pods were scaled down beforehand, none of them is left in a Pending state. (This overcomes the 3rd challenge listed above.)

2. Scale up the entire k8s cluster:

If Activity=scale_up and Namespace=All, then Application_Name=All (as mentioned before, it populates itself based on the chosen Namespace):
a. Scale up all ASGs:

$ aws autoscaling update-auto-scaling-group --auto-scaling-group-name <ASG-name> --min-size 0 --max-size 10 --desired-capacity 1

b. Since Namespace=All, get all namespaces into a variable:

 $ kubectl get namespaces

c. Iterate over the namespaces, get the deployments in each, and scale them back to 1 replica:

$ kubectl get deployment -n <namespace-name>
$ kubectl scale deployment <deployment-name> -n <namespace-name> --replicas 1

Point to be noted:
i. This takes care of scaling up the entire k8s cluster and its deployments.

3. Scale up user-specified applications:

If Activity=scale_up, Namespace=test-namespace, and Application_Name=test-deployment (populated from the deployments in the Namespace chosen by the user):
a. Scale up 2 ASGs (to ensure high availability):

    $ aws autoscaling  update-auto-scaling-group --auto-scaling-group-name <ASG-name> --min-size 0 --max-size 6 --desired-capacity 1

b. Scale up the application chosen by the user:

    $ kubectl scale deployment test-deployment -n test-namespace --replicas 1

Points to be noted:
i. An EC2 instance takes around 3 minutes to spin up and reach a running state, so make sure to add a sleep of 180 seconds before scaling up deployments.
ii. This ensures we don't overprovision infrastructure while testing a single application. (This overcomes the 1st and 2nd challenges from the initial approach.)

4. Scale down user-specified applications:

If Activity=scale_down, Namespace=test-namespace, and Application_Name=test-deployment:
a. Scale down the user-chosen deployment:

    $ kubectl scale deployment test-deployment -n test-namespace --replicas 0

b. Make sure that no pods are running in other namespaces (running pods in other namespaces indicate that other developers are working on their respective applications):

    $ kubectl get namespaces   # get this into an array variable and iterate over it

Iterate over the namespaces to get running pods:

running_pods=$(kubectl get pods --field-selector=status.phase=Running -n <other-namespace> | grep -v NAME | awk '{print $1}')

If the variable running_pods is empty, no pods are running in other namespaces and you are good to follow step c; otherwise skip step c.

c. Scale down the 2 ASGs:

    $ aws autoscaling  update-auto-scaling-group --auto-scaling-group-name <ASG-name> --min-size 0 --max-size 0 --desired-capacity 0

Points to be noted:
i. Step b ensures multiple developers can work on their respective applications after office hours, without one developer's scale-down of the cluster hindering the others.
ii. Eventually the number of nodes in the ASGs shrinks as developers scale down their applications after use; the lower CPU utilization lets the worker nodes scale down, saving us from overprovisioning.

Cost Comparison:

We used these code snippets to build a script specific to our cloud environment, driven by a Jenkinsfile.

The same job is triggered by cron to scale the entire cluster up at 8 am and down at 8 pm, and it automatically posts the scaling activity to the team messenger (Microsoft Teams, Slack, etc.).

And if developers want to test or work on an application after office hours or over the weekend, they use the same job; in that case scenarios 3 and 4 above take care of cluster provisioning, using the parameters defined earlier.
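The routing from job parameters to scenarios can be sketched as a small dispatcher. This is a hypothetical illustration: the real Jenkins job would call the kubectl/aws steps of the matched scenario instead of echoing its name.

```shell
# Hypothetical dispatcher mapping the job parameters (Activity, Namespace,
# Application_Name) onto the four scenarios described above.
dispatch() {
  local activity="$1" namespace="$2" app="$3"
  case "$activity:$namespace" in
    scale_down:All) echo "scenario 1: scale down entire cluster" ;;
    scale_up:All)   echo "scenario 2: scale up entire cluster" ;;
    scale_up:*)     echo "scenario 3: scale up $app in $namespace" ;;
    scale_down:*)   echo "scenario 4: scale down $app in $namespace" ;;
    *) echo "unknown activity: $activity" >&2; return 1 ;;
  esac
}

# Example: dispatch scale_up test-namespace test-deployment
```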

What more can we do?

Along with the strategy above, using EC2 Spot Instances in lower environments is the cherry on top for even better cost optimization.

Conclusion:

To summarize: we scale down infrastructure after office hours in such a way that even if an individual wants to work late, we can provide infrastructure adequate for the application(s) under test, without affecting other developers testing their applications on the same k8s cluster.

Helpful Link:
https://kubernetes.io/docs/reference/kubectl/cheatsheet/
