Back2Basics: Monitoring Workloads on Amazon EKS

Overview

We're down to the last part of this series✨ In this part, we will explore monitoring solutions. Remember the voting app we deployed? We will set up a basic dashboard to monitor each component's CPU and memory utilization. Additionally, we'll test how the application behaves under load.

Back2Basics: A Series

If you haven't read the second part, you can check it out here:

Grafana & Prometheus

To start, let's briefly discuss the solutions we will be using. Grafana and Prometheus are the usual tandem for monitoring metrics, building dashboards, and setting up alerts. Both are open source and can be deployed on a Kubernetes cluster, which is exactly what we will do in a moment.

  • Grafana is open source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics, logs, and traces no matter where they are stored. It provides you with tools to turn your time-series database data into insightful graphs and visualizations. Read more: https://grafana.com/docs/grafana/latest/fundamentals/
  • Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Read more: https://prometheus.io/docs/introduction/overview/
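
For a concrete sense of what that means, a Prometheus time series is identified by a metric name plus a set of labels, and each scrape appends a timestamped sample to it. An illustrative (hypothetical) counter exposed by an application might look like this in Prometheus' text format:

http_requests_total{method="post", code="200"} 1027
http_requests_total{method="post", code="400"} 3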

Architecture: Grafana & Prometheus

Alternatively, you can use an AWS-native service like Amazon CloudWatch, or managed services like Amazon Managed Service for Prometheus and Amazon Managed Grafana. In this part, however, we will only cover self-hosted Prometheus and Grafana running on Amazon EKS.

Let's get our hands dirty!

We will use the same repository as before. First, make sure to uncomment all commented lines in 03_eks.tf, 04_karpenter.tf, and 05_addons.tf to enable Karpenter and the other addons we used in the previous activity.

Second, enable Grafana and Prometheus by adding these lines to terraform.tfvars:

enable_grafana    = true
enable_prometheus = true

Once updated, we have to run tofu init, tofu plan and tofu apply. When prompted to confirm, type yes to proceed with provisioning the additional resources.
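
For reference, run these from the directory that holds the OpenTofu configuration in the repository:

tofu init
tofu plan
tofu apply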

Accessing Grafana

Grafana Login Page

We need credentials to access Grafana. The default username is admin and the auto-generated password is stored in a Kubernetes secret. To retrieve the password, you can use the command below:

kubectl -n grafana get secret grafana -o jsonpath="{.data.admin-password}" | base64 -d
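
If your setup does not expose Grafana through a load balancer or ingress, a quick way to reach the login page is port-forwarding. This sketch assumes the chart's default service name grafana in the grafana namespace (the same namespace as the secret above) and the default service port 80:

kubectl -n grafana port-forward svc/grafana 3000:80

Then open http://localhost:3000 and sign in with admin and the password you just retrieved.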

This is what the home or landing page looks like. The navigation bar on the left lets you move between Grafana's features, including but not limited to Dashboards and Alerting.

Grafana Home Page

It's also worth looking at the Prometheus instance we deployed. You might be asking: does the Prometheus server have its own UI? Yes, it does. You can even run PromQL queries and check the health of scrape targets there. For visualization, however, we will use Grafana.

Prometheus Targets
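
If you want to poke around that UI yourself, you can port-forward the Prometheus service. This assumes the community chart's default service name prometheus-server in the prometheus namespace, the same endpoint we will point Grafana at later:

kubectl -n prometheus port-forward svc/prometheus-server 9090:80

Open http://localhost:9090, check Status > Targets, or try a simple PromQL query like up in the Graph tab.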

Setting up our first data source

Before we can create dashboards and alerts, we first have to configure the data source.

First, expand the Connections menu and click Data Sources.

Grafana: Data Sources

Click Add data source. Then select Prometheus.

Grafana: Prometheus Data Sources

Set the Prometheus server URL to http://prometheus-server.prometheus.svc.cluster.local. Since Prometheus and Grafana reside on the same cluster, we can use the Kubernetes service as the endpoint.

Grafana: Set Prometheus server URL

Leave the other settings at their defaults, then click Save & test.

Grafana: Default Data Source

Now we have our first data source! We will use it to create dashboards in the next few sections.
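
As a side note, this manual step can also be automated. The upstream Grafana Helm chart supports provisioning data sources through its values; a minimal sketch (exact keys depend on the chart version and how this repository wires its values) would look like:

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server.prometheus.svc.cluster.local
        isDefault: true

We'll stick with the UI flow here so the process is easy to follow.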

Grafana Dashboards

Let's start by importing an existing dashboard. You can browse community dashboards here: https://grafana.com/grafana/dashboards/

For example, consider this dashboard - 315: Kubernetes Cluster Monitoring via Prometheus

To import a dashboard, either copy its ID or download its JSON model. In this case, we'll use the dashboard ID 315 and import it into our Grafana instance.

Grafana: Import Dashboard

Select the Prometheus data source we've configured earlier. Then click Import.

Grafana: Import Dashboard

You will then be redirected to the dashboard and it should look like this:

Grafana: Imported Dashboard

Yey🎉 We now have our first dashboard!

Let's Create a Custom Dashboard for our Voting App

Copy this JSON model and import it into our Grafana instance. This is similar to the steps above, but this time, instead of an ID, we'll paste the copied template into the JSON field.

Grafana: Import Voting App Dashboard

Once imported, the dashboard should look like this:

Grafana: Imported Voting App Dashboard

Here we have visualizations for basic metrics such as CPU and memory utilization for each component. Replica count and node count are also part of the dashboard, so we can later observe the behavior of the vote-app component when it autoscales.
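
The exact queries live in the dashboard's JSON model, but to give an idea of what panels like these typically run, here are representative PromQL examples (assuming cAdvisor and kube-state-metrics are being scraped, which the community Prometheus chart sets up by default):

# CPU usage per pod in the voting-app namespace
sum(rate(container_cpu_usage_seconds_total{namespace="voting-app"}[5m])) by (pod)

# Memory (working set) per pod
sum(container_memory_working_set_bytes{namespace="voting-app"}) by (pod)

# Replica count of the vote deployment and overall node count
kube_deployment_status_replicas{namespace="voting-app", deployment="app-vote"}
count(kube_node_info)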

Let's Test!

If you haven't deployed the voting-app, please refer to the command below:

helm -n voting-app upgrade --install app -f workloads/helm/values.yaml thecloudspark/vote-app --create-namespace

You can customize the namespace voting-app and the release name app as needed, but be sure to update the dashboard queries accordingly. I recommend running the command above as-is, keeping voting-app as the namespace and app as the release name.
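
Before heading back to Grafana, you can confirm the release is up:

helm -n voting-app list
kubectl -n voting-app get pods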

Back to our dashboard: When the vote-app has minimal load, it scales down to a single replica (1), as shown below.

Grafana: Voting App Dashboard

Horizontal Pod Autoscaling in Action

The vote-app deployment has a Horizontal Pod Autoscaler (HPA) configured with a maximum of five replicas. This means the voting app will automatically scale up to five pods to handle increased load. We can observe this behavior when we apply the seeder deployment.

Let's put that to the test using the seeder deployment below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: seeder
  namespace: voting-app
spec:
  replicas: 5
...

The seeder deployment simulates real user load by bombarding the vote-app with vote requests. It has five replicas and allows you to specify the target endpoint using an environment variable. In this example, we'll target the Kubernetes service directly instead of the load balancer.

...
        env:
        - name: VOTE_URL
          value: "http://app-vote.voting-app.svc.cluster.local/"
...

To apply, use the command below:

kubectl apply -f workloads/seeder/seeder-app.yaml

After a few seconds, monitor your dashboard. You'll see the vote-app replicas increase to handle the load generated by the seeder.
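
If you prefer the terminal, you can also follow the autoscaler live while the dashboard catches up:

kubectl -n voting-app get hpa --watch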

D:\> kubectl -n voting-app get hpa
NAME                 REFERENCE                        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
app-vote-hpa         Deployment/app-vote              cpu: 72%/80%   1         5         5          12m

Grafana: Voting App Dashboard

Since the vote-app chart's default maximum for the Horizontal Pod Autoscaler (HPA) is five, the replica count for this deployment stops at five.

Stopping the Load and Scaling Down

Once you've observed the scaling behavior, delete the seeder deployment to stop the simulated load:

kubectl delete -f workloads/seeder/seeder-app.yaml

Give the dashboard a few minutes and observe the vote-app scaling down. With no more load, the HPA will reduce the replicas down to a minimum of one. This may also lead to a node being decommissioned by Karpenter as pod scheduling becomes less demanding.

Grafana: Voting App Dashboard

You'll see that the vote-app eventually scales in as there is less load now. As shown above, the node count also changes from two to one, showing the power of Karpenter.

PS D:\> kubectl -n voting-app get hpa
NAME                 REFERENCE                        TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
app-vote-hpa         Deployment/app-vote              cpu: 5%/80%    1         5         2          18m
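
To see the node-level effect from the CLI as well, you can watch the nodes while Karpenter consolidates (the nodeclaims resource assumes a recent Karpenter version; older releases use different CRD names):

kubectl get nodes --watch
kubectl get nodeclaims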

Challenge: Scaling Workloads

We've successfully enabled autoscaling for the vote-app component using Horizontal Pod Autoscaler (HPA). This is a powerful technique to manage resource utilization in Kubernetes. But HPA isn't limited to just one component.

Tip: Explore the ArtifactHub: Vote App configuration in more detail. You'll find additional configurations related to HPA that you can leverage for other deployments.
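
If you want to experiment before touching chart values, the generic kubectl equivalent is a one-liner. Note that app-result below is a hypothetical deployment name; substitute one from your own release:

kubectl -n voting-app autoscale deployment app-result --cpu-percent=80 --min=1 --max=5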

Conclusion

Yey! You've reached the end of the Back2Basics: Amazon EKS Series🌟🚀. This series provided a foundational understanding of deploying and managing containerized applications on Amazon EKS. We covered:

  • Provisioning an EKS cluster using OpenTofu
  • Deploying workloads leveraging Karpenter
  • Monitoring applications using Prometheus and Grafana

While Kubernetes can have a learning curve, hopefully, this series empowered you to take your first steps. Ready to level up? Let me know in the comments what Kubernetes topics you'd like to explore next!

Top comments (5)

Jack (@whimsicalbison)

Thanks for writing this article! I especially liked seeing the use of OpenTofu. I've used Grafana and Prometheus previously, but I'm about to add both of these tools to my local Kubernetes cluster and found your insights very helpful.

I didn't realize that AWS had released a managed version of Grafana and Prometheus. What are the benefits of hosting them yourself versus using the managed version? I assume, like most AWS services, the managed version costs more but removes the burden of management and maintenance, at the expense of having fewer configuration options?

It's simple and straightforward, and you probably covered this in your previous articles, but it would be helpful to mention that Terraform is deploying Grafana and Prometheus using Helm. I was able to look at the source code and figure it out quickly, so it wasn't a big deal.

I haven't looked into it deeply, but I was wondering if you could automate some of the setup steps you're walking us through in the UI via the Helm config. You might have purposely left this out to show the manual process, but the two things that stood out to me were connecting Grafana and Prometheus, as well as loading dashboards.

Again, thanks for this article. Great stuff!

Romar Cablao (@romarcablao)

Hi @whimsicalbison - thanks for your feedback, really appreciate it 🙂

Yes, you are correct: the managed Grafana and Prometheus remove the burden of management and maintenance since these are managed by AWS on our behalf. They also integrate well with other AWS services like IAM. Pricing for Amazon Managed Grafana is per user, while Amazon Managed Service for Prometheus is priced per metrics ingested, queries processed, and metrics stored. Of course, the more users and data we process and store, the higher the price. I personally prefer the OSS route and use low-cost storage like S3 (e.g. Grafana Mimir can use Amazon S3 for storing Prometheus metrics).

Data source and dashboard setup - while this can certainly be automated using Helm chart configuration, I intentionally created a step-by-step guide for this. This way, others can understand the process by navigating through the UI.

Jack (@whimsicalbison)

Thanks. Wasn't aware of Grafana Mimir either... will check that out!

Jason Dunn [AWS] (@jasondunn)

Nicely written article! Very detailed.

Romar Cablao (@romarcablao)

Thank you @jasondunn