Overview
We're down to the last part of this series ✨ In this part, we will explore monitoring solutions. Remember the voting app we deployed? We will set up a basic dashboard to monitor each component's CPU and memory utilization. Additionally, we'll test how the application behaves under load.
If you haven't read the second part, you can check it out here: Back2Basics: Running Workloads on Amazon EKS (Romar Cablao for AWS Community Builders, Jun 19)
Grafana & Prometheus
To start with, let's briefly discuss the solutions we will be using. Grafana and Prometheus are the usual tandem for monitoring metrics, creating dashboards, and setting up alerts. Both are open source and can be deployed on a Kubernetes cluster, just like we will do in a while.
- Grafana is open-source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics, logs, and traces no matter where they are stored. It provides you with tools to turn your time-series database data into insightful graphs and visualizations. Read more: https://grafana.com/docs/grafana/latest/fundamentals/
- Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores its metrics as time-series data, i.e. metrics are stored with the timestamp at which they were recorded, alongside optional key-value pairs called labels. Read more: https://prometheus.io/docs/introduction/overview/
Alternatively, you can use an AWS-native service like Amazon CloudWatch, or managed services like Amazon Managed Service for Prometheus and Amazon Managed Grafana. However, in this part, we will only cover self-hosted Prometheus and Grafana running on Amazon EKS.
Let's get our hands dirty!
Like the previous activity, we will use the same repository. First, make sure to uncomment all commented lines in 03_eks.tf, 04_karpenter.tf, and 05_addons.tf to enable Karpenter and the other addons we used in the previous activity.
Second, enable Grafana and Prometheus by adding these lines in terraform.tfvars:
enable_grafana = true
enable_prometheus = true
Once updated, we have to run tofu init, tofu plan, and tofu apply. When prompted to confirm, type yes to proceed with provisioning the additional resources.
Accessing Grafana
We need credentials to access Grafana. The default username is admin, and the auto-generated password is stored in a Kubernetes secret. To retrieve the password, you can use the command below:
kubectl -n grafana get secret grafana -o jsonpath="{.data.admin-password}" | base64 -d
This is what the home or landing page looks like. The navigation bar on the left side lets you navigate through Grafana's different features, including but not limited to Dashboards and Alerting.
It's also worth noting the Prometheus server we deployed. You might be asking: does the Prometheus server have a UI? Yes, it does. You can even run PromQL queries and check the health of the scrape targets. But we will use Grafana for visualization instead.
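As a quick illustration, here is the kind of PromQL query you could try in the Prometheus UI. It assumes the standard cAdvisor container metrics that the Prometheus community chart scrapes by default; adjust label names to your setup:

```promql
# total CPU usage (in cores) per pod in the voting-app namespace, averaged over 5 minutes
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="voting-app"}[5m]))
```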
Setting up our first data source
Before we can create dashboards and alerts, we first have to configure the data source.
First, expand the Connections menu and click Data Sources.
Click Add data source, then select Prometheus.
Set the Prometheus server URL to http://prometheus-server.prometheus.svc.cluster.local. Since Prometheus and Grafana reside in the same cluster, we can use the Kubernetes service DNS name as the endpoint.
Leave the other settings at their defaults. Once updated, click Save & test.
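If you prefer to skip the UI clicks, the same data source can be provisioned declaratively. Here is a sketch of what the Grafana Helm chart's datasources value could look like; the key names follow Grafana's data source provisioning format, but check your chart version before relying on this:

```yaml
# Hypothetical Helm values fragment for the Grafana chart (verify against your chart version)
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server.prometheus.svc.cluster.local
        isDefault: true
```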
Now we have our first data source! We will use it to create dashboards in the next few sections.
Grafana Dashboards
Let’s start by importing an existing dashboard. Dashboards can be searched here: https://grafana.com/grafana/dashboards/
For example, consider this dashboard - 315: Kubernetes Cluster Monitoring via Prometheus
To import this dashboard, either copy the Dashboard ID or download the JSON model. In this case, use the dashboard ID 315 and import it into our Grafana instance.
Select the Prometheus data source we configured earlier, then click Import.
You will then be redirected to the dashboard and it should look like this:
Yey🎉 We now have our first dashboard!
Let's Create a Custom Dashboard for our Voting App
Copy this JSON model and import it into our Grafana instance. The steps are similar to the ones above, but this time, instead of an ID, we'll paste the copied template into the JSON field.
Once imported, the dashboard should look like this:
Here we have visualizations of basic metrics such as CPU and memory utilization for each component. Replica count and node count are also part of the dashboard, so we can later check the behavior of the vote-app component when it autoscales.
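For reference, panels like these are typically backed by queries along the following lines. This is a sketch only; the exact metric and label names depend on your cAdvisor and kube-state-metrics setup:

```promql
# memory working set per container in the voting-app namespace
sum by (container) (container_memory_working_set_bytes{namespace="voting-app", container!=""})

# replica count per deployment, as reported by kube-state-metrics
kube_deployment_status_replicas{namespace="voting-app"}
```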
Let's Test!
If you haven't deployed the voting-app yet, please refer to the command below:
helm -n voting-app upgrade --install app -f workloads/helm/values.yaml thecloudspark/vote-app --create-namespace
Customize the namespace voting-app and release name app as needed, but update the dashboard queries accordingly. I recommend using the command above as-is, keeping voting-app as the namespace and app as the release name.
Back to our dashboard: when the vote-app has minimal load, it scales down to a single replica, as shown below.
Horizontal Pod Autoscaling in Action
The vote-app deployment has a Horizontal Pod Autoscaler (HPA) configured with a maximum of five replicas. This means the voting app will automatically scale up to five pods to handle increased load. We can observe this behavior when we apply the seeder deployment.
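To make the HPA behavior concrete, here is an illustrative manifest matching what's described above. The target name and thresholds are taken from the HPA output shown later in this post, but the exact fields rendered by the chart's template may differ:

```yaml
# Illustrative HPA sketch; the vote-app chart's actual template may differ
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-vote-hpa
  namespace: voting-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-vote
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```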
Now, let's test how the vote-app handles increased load using a seeder deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seeder
  namespace: voting-app
spec:
  replicas: 5
...
The seeder deployment simulates real user load by bombarding the vote-app with vote requests. It runs five replicas and lets you specify the target endpoint using an environment variable. In this example, we'll target the Kubernetes service directly instead of the load balancer.
...
env:
  - name: VOTE_URL
    value: "http://app-vote.voting-app.svc.cluster.local/"
...
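Conceptually, each seeder pod is just a tight request loop. Here is a minimal Python sketch of the same idea; the form field name is a guess for illustration, and the real seeder image may send its requests differently:

```python
import itertools
import os
import time
import urllib.request

# Target defaults to the in-cluster service, mirroring the seeder's VOTE_URL env var.
VOTE_URL = os.environ.get("VOTE_URL", "http://app-vote.voting-app.svc.cluster.local/")


def vote_payload(option):
    """Build a form-encoded vote body (the 'vote' field name is hypothetical)."""
    return ("vote=%s" % option).encode("utf-8")


def run(requests_to_send):
    """POST votes at VOTE_URL in a loop, alternating between two options."""
    for option in itertools.islice(itertools.cycle(["a", "b"]), requests_to_send):
        req = urllib.request.Request(VOTE_URL, data=vote_payload(option))
        try:
            urllib.request.urlopen(req, timeout=2)
        except OSError:
            pass  # keep generating load even if individual requests fail
        time.sleep(0.01)


if __name__ == "__main__":
    run(1000)
```

Running five replicas of a loop like this is enough sustained CPU pressure on the vote-app to push its utilization past the HPA threshold.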
To apply, use the command below:
kubectl apply -f workloads/seeder/seeder-app.yaml
After a few seconds, monitor your dashboard. You'll see the vote-app replicas increase to handle the load generated by the seeder.
D:\> kubectl -n voting-app get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
app-vote-hpa Deployment/app-vote cpu: 72%/80% 1 5 5 12m
Since the vote-app chart's default maximum for the horizontal pod autoscaler (HPA) is five, the replica count for this deployment stops at five.
Stopping the Load and Scaling Down
Once you've observed the scaling behavior, delete the seeder deployment to stop the simulated load:
kubectl delete -f workloads/seeder/seeder-app.yaml
Give the dashboard a few minutes and observe the vote-app scaling in. With no more load, the HPA reduces the replicas back down to the minimum of one. You may also notice the node count drop from two to one as Karpenter decommissions a node that is no longer needed for pod scheduling - showing the power of Karpenter.
PS D:\> kubectl -n voting-app get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
app-vote-hpa Deployment/app-vote cpu: 5%/80% 1 5 2 18m
Challenge: Scaling Workloads
We've successfully enabled autoscaling for the vote-app component using the Horizontal Pod Autoscaler (HPA). This is a powerful technique for managing resource utilization in Kubernetes. But HPA isn't limited to just one component.
Tip: Explore the ArtifactHub: Vote App configuration in more detail. You'll find additional configurations related to HPA that you can leverage for other deployments.
Conclusion
Yey! You've reached the end of the Back2Basics: Amazon EKS Series 🌟🚀. This series provided a foundational understanding of deploying and managing containerized applications on Amazon EKS. We covered:
- Provisioning an EKS cluster using OpenTofu
- Deploying workloads leveraging Karpenter
- Monitoring applications using Prometheus and Grafana
While Kubernetes can have a learning curve, hopefully, this series empowered you to take your first steps. Ready to level up? Let me know in the comments what Kubernetes topics you'd like to explore next!
Top comments (5)
Thanks for writing this article! I especially liked seeing the use of OpenTofu. I've used Grafana and Prometheus previously, but I'm about to add both of these tools to my local Kubernetes cluster and found your insights very helpful.
I didn't realize that AWS had released a managed version of Grafana and Prometheus. What are the benefits of hosting them yourself versus using the managed version? I assume, like most AWS services, the managed version costs more but removes the burden of management and maintenance, at the expense of having fewer configuration options?
It's simple and straightforward, and you probably covered this in your previous articles, but it would be helpful to mention that Terraform is deploying Grafana and Prometheus using Helm. I was able to look at the source code and figure it out quickly, so it wasn't a big deal.
I haven't looked into it deeply, but I was wondering if you could automate some of the setup steps you're walking us through in the UI via the Helm config. You might have purposely left this out to show the manual process, but the two things that stood out to me were connecting Grafana and Prometheus, as well as loading dashboards.
Again, thanks for this article. Great stuff!
Hi @whimsicalbison - thanks for your feedback, really appreciate it 🙂
Yes, you are correct: the managed Grafana and Prometheus remove the burden of management and maintenance since they are run by AWS on our behalf. They also integrate well with other AWS services like IAM. Pricing for Amazon Managed Grafana is per user, while Amazon Managed Service for Prometheus is priced per metric ingested, query processed, and metric stored. Of course, the more users and data we process and store, the higher the price. I personally prefer the OSS versions with low-cost storage like S3 (e.g. Grafana Mimir can use Amazon S3 for storing Prometheus metrics).
Data source and dashboard setup: while this can certainly be automated using Helm chart configuration, I intentionally wrote a step-by-step guide so that others can understand the process by navigating through the UI.
Thanks. Wasn't aware of Grafana Mimir either... will check that out!
Nicely written article! Very detailed.
Thank you @jasondunn