Nowsath for AWS Community Builders

Troubleshooting: EKS + Helm + Prometheus + Grafana

In this post, I've compiled a list of issues along with their corresponding solutions that I encountered while configuring Prometheus and Grafana with Helm in the existing EKS Fargate cluster setup.

Error 1: Could not determine region from any metadata service. The region can be manually supplied via the AWS_REGION environment variable.

panic: did not find aws instance ID in node providerID string

$ k logs ebs-csi-controller-7f5c959c75-j92jf -n kube-system -c ebs-plugin
I1228 04:31:45.536047       1 driver.go:78] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.25.0"
I1228 04:31:45.536144       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I1228 04:31:58.152468       1 metadata.go:88] "ec2 metadata is not available"
I1228 04:31:58.152491       1 metadata.go:96] "retrieving instance data from kubernetes api"
I1228 04:31:58.153081       1 metadata.go:101] "kubernetes api is available"
E1228 04:31:58.175387       1 controller.go:86] "Could not determine region from any metadata service. The region can be manually supplied via the AWS_REGION environment variable." err="did not find aws instance ID in node providerID string"
panic: did not find aws instance ID in node providerID string
$ kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-ebs-csi-driver,app.kubernetes.io/instance=aws-ebs-csi-driver"
NAME                                  READY   STATUS             RESTARTS       AGE
ebs-csi-controller-7f5c959c75-j92jf   0/6     CrashLoopBackOff   36 (9s ago)    10m
ebs-csi-controller-7f5c959c75-xpv9x   0/6     CrashLoopBackOff   36 (23s ago)   10m
ebs-csi-node-969qs                    3/3     Running            0              10m

Solution:
If you don't specify your cluster's region when installing aws-ebs-csi-driver, the ebs-csi-controller pods will crash, as the default region is set to 'us-east-1'.

helm upgrade --install aws-ebs-csi-driver \
  --namespace kube-system \
  --set controller.region=eu-north-1 \
  --set controller.serviceAccount.create=false \
  --set controller.serviceAccount.name=ebs-csi-controller-sa \
  aws-ebs-csi-driver/aws-ebs-csi-driver

The crash happens because the ebs-plugin container cannot determine the region on its own: "Could not determine region from any metadata service. The region can be manually supplied via the AWS_REGION environment variable."
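After upgrading the chart with the region set, the controller pods should recover. Re-running the earlier pod check should show them as Running instead of CrashLoopBackOff:

$ kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-ebs-csi-driver,app.kubernetes.io/instance=aws-ebs-csi-driver"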


Error 2: Values don't meet the specifications of the schema(s) in the following chart(s)

Error: values don't meet the specifications of the schema(s) in the following chart(s):
prometheus:
- server.remoteRead: Invalid type. Expected: array, given: object
alertmanager:
- extraEnv: Invalid type. Expected: array, given: object

Solution:
These errors are caused by a mismatch between the Prometheus chart version and the values file used in the Helm installation. If you are using a customized prometheus_values.yml file, pin the chart version it was written for with --version. Alternatively, if you are not using a customized file, install the latest chart version with its default values.

helm upgrade -i prometheus prometheus-community/prometheus \
  --namespace prometheus \
  --set alertmanager.persistentVolume.storageClass="gp2",server.persistentVolume.storageClass="gp2" \
  --version 15

I have used version 15 of the Prometheus chart here.
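If you are unsure which chart versions are available, you can list them first (this assumes the prometheus-community repo has already been added to Helm):

$ helm repo update
$ helm search repo prometheus-community/prometheus --versions | head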


Error 3: 0/17 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/17 nodes are available: 17 Preemption is not helpful for scheduling..

$ k get events -n prometheus
LAST SEEN   TYPE      REASON               OBJECT                                                MESSAGE
2m13s       Warning   FailedScheduling     pod/prometheus-alertmanager-c7644896-td8xv            0/17 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/17 nodes are available: 17 Preemption is not helpful for scheduling..
47m         Normal    SuccessfulCreate     replicaset/prometheus-alertmanager-c7644896           Created pod: prometheus-alertmanager-c7644896-td8xv
2m30s       Warning   ProvisioningFailed   persistentvolumeclaim/prometheus-alertmanager         storageclass.storage.k8s.io "prometheus" not found


The prometheus-alertmanager and prometheus-server pods will remain in Pending status.

$ k get po -n prometheus
NAME                                             READY   STATUS    RESTARTS   AGE
prometheus-alertmanager-c7644896-q2nzm           0/2     Pending   0          74s
prometheus-kube-state-metrics-8476bdcc64-f984p   1/1     Running   0          75s
prometheus-node-exporter-r82k7                   1/1     Running   0          74s
prometheus-pushgateway-665779d98f-zh2pf          1/1     Running   0          75s
prometheus-server-6fd8bc8576-csqt8               0/2     Pending   0          75s

Solution:
This is due to the missing 'prometheus' storage class, as clearly shown in the events log. So go ahead and create the storage class as shown below.

EBS_AZ=$(kubectl get nodes \
  -o=jsonpath="{.items[0].metadata.labels['topology\.kubernetes\.io\/zone']}")
echo "
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: prometheus
provisioner: ebs.csi.aws.com
parameters:
  type: gp2
reclaimPolicy: Retain
allowedTopologies:
- matchLabelExpressions:
  - key: topology.ebs.csi.aws.com/zone
    values:
    - $EBS_AZ
" | kubectl apply -f -
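Once the storage class exists, the pending PVCs should bind and the pods should get scheduled. A quick check (using the same names as above):

$ kubectl get storageclass prometheus
$ kubectl get pvc -n prometheus
$ kubectl get po -n prometheus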

Error 4: Failed to provision volume with StorageClass "prometheus": rpc error: code = Internal desc = Could not create volume "pvc-48b7c3d8-d46a-47be-90e7-3d59eb3f5844": could not create volume in EC2: NoCredentialProviders: no valid providers in chain...

$ kubectl get events --sort-by=.metadata.creationTimestamp -n prometheus
LAST SEEN   TYPE      REASON                 OBJECT                                                MESSAGE
30s         Normal    Provisioning           persistentvolumeclaim/prometheus-alertmanager         External provisioner is provisioning volume for claim "prometheus/prometheus-alertmanager"
30s         Normal    Provisioning           persistentvolumeclaim/prometheus-server               External provisioner is provisioning volume for claim "prometheus/prometheus-server"
5s          Warning   ProvisioningFailed     persistentvolumeclaim/prometheus-alertmanager         failed to provision volume with StorageClass "prometheus": rpc error: code = Internal desc = Could not create volume "pvc-b7373f3b-3da9-47ac-8bfb-ad396816ce88": could not create volume in EC2: NoCredentialProviders: no valid providers in chain...
17s         Warning   ProvisioningFailed     persistentvolumeclaim/prometheus-server               failed to provision volume with StorageClass "prometheus": rpc error: code = Internal desc = Could not create volume "pvc-48b7c3d8-d46a-47be-90e7-3d59eb3f5844": could not create volume in EC2: NoCredentialProviders: no valid providers in chain...

Solution:
This issue arises from insufficient permissions assigned to the service account in the cluster, preventing it from provisioning the required persistent volumes.

You need to set the service account details (with the required IAM policies attached to its role) while installing aws-ebs-csi-driver with Helm, as shown below.

helm upgrade --install aws-ebs-csi-driver \
  --namespace kube-system \
  --set controller.region=eu-north-1 \
  --set controller.serviceAccount.create=false \
  --set controller.serviceAccount.name=ebs-csi-controller-sa \
  aws-ebs-csi-driver/aws-ebs-csi-driver
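To confirm the service account is actually linked to an IAM role, check its annotations (the service account name here matches the one used in the Helm command above):

$ kubectl describe sa ebs-csi-controller-sa -n kube-system

The output should include an eks.amazonaws.com/role-arn annotation pointing to the role that carries the AmazonEBSCSIDriverPolicy.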

Error 5: The service account is absent from the EKS cluster, yet it is visible through the eksctl get command.

Solution:
Check whether you added the '--role-only' option when creating the service account with eksctl. With '--role-only', eksctl creates only the IAM role and does not create the Kubernetes service account in the cluster.
If that is the case, delete the service account and recreate it without the '--role-only' option, as shown below.
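To remove the existing service account first, something like this should work (assuming the same names used in the create command that follows):

eksctl delete iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster api-dev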

eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster api-dev \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --override-existing-serviceaccounts --approve

Here, 'api-dev' is the cluster name. Replace it with your cluster name before running the command.
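After recreating it, the service account should be visible both to eksctl and inside the cluster itself (again, replace 'api-dev' with your cluster name):

$ eksctl get iamserviceaccount --cluster api-dev --namespace kube-system
$ kubectl get sa ebs-csi-controller-sa -n kube-system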


Thank you for taking the time to read 👏😊! I will continue to update this post as I encounter new issues. Feel free to mention any unlisted issues in the comment section. 🤝❤️

Check out my post on setting up Prometheus and Grafana with an existing EKS Fargate cluster - Monitoring
