Enabling logs and alerting in AWS EKS cluster - CloudWatch and fluent-bit

#kubernetes #aws #terraform #cloudwatch

Logs are a fundamental component in our environments, these provide useful information that helps to debug in the case of any issue or to identify events that can affect the security of the application. Logs should be enabled in each component, from the infrastructure level to the application level. This brings some challenges like where to store the logs, what kind of events log, how to search events, and what to do with the logs.

In this post I will share my experience enabling and configuring logging in an EKS cluster, creating alerts to send a notification when a specific event appears in the logs.

This is the part #1 in which I show how to enable control plane logging and container logging in an EKS cluster, in the part #2 I will show you how to enable some alerts using the logs groups created.

Before to start is important to mention that logs can contain private data like user information, keys, passwords, etc. For this reason, logs should be encrypted at rest and enable restrictions to access them.

Kubernetes Control Plane logging

Kubernetes architecture can be divided into a control plane and worker nodes, control plane contains the components that manage the cluster, components like etcd, API Server, Scheduler, and Controller Manager. Almost every action done in the cluster pass through the API Server that logs each event.

AWS EKS manages the control plane for us, deploying and operating the necessary components. By default, EKS doesn't have logging enabled and actions from our side are required. Enabling EKS control plane logging is an easy task, you need to know what component log and enable it. You can enable logs for the API server, audit, authenticator, control manager, and scheduler.

In my opinion, audit logs and authenticator are useful because records actions done in our cluster and help us to understand the origin of the actions and requests generated by the IAM authenticator.

By terraform you can use the following code to create a simple cluster and enabling audit,api,authenticator, and scheduler logs.


 Terraform
resource "aws_eks_cluster" "kube_cluster" {
  name                      = "test-cluster"
  role_arn                  = aws_iam_role.role-eks.arn
  version                   = "1.22"
  enabled_cluster_log_types = ["audit", "api", "authenticator","scheduler"]
  vpc_config {
    subnet_ids              = ["sub-1234","sub-5678"]
    endpoint_private_access = true
    endpoint_public_access  = true
  }
}

Logs are stored in AWS CloudWatch logs and the log group is created automatically following this name structure /aws/eks/<cluster-name>/cluster, inside the group you can find the log stream for each component that you enabled



authenticator-123abcd
kube-apiserver-123abcd
kube-apiserver-123abcd

By default, the Log group created by AWS doesn't have encryption and retention days enabled, I recommend creating the logs group by yourself and specifying and KMS Key, and setting some time to expiry the logs, Kubernetes generates a considerable number of logs that will increase the size of the group that can impact the billing.

Kubernetes Containers logging

The steps mentioned above were to enable just logging in the control plane, to send logs generated by the applications running in the containers a log aggregator is necessary, in this case, I will use Fluent-Bit](https://fluentbit.io/how-it-works/)

Fluent-Bit runs as a daemonSet in the cluster and sends logs to CloudWatch Logs. Fluent-Bit creates the log groups using the configuration specified in the kubernetes manifests.

Here is important to mention that AWS has created a docker image for the daemonSet, this can be found in this link.

AWS describes the steps to run the daemonSet, this is done by some commands, but I will use a Kubernetes manifest that can be stored in our repository and then use Argo or Fluxcd to automate deployments.

The following steps show the manifests to create the objects that Kubernetes needs to send containers logs to CloudWatch, you must have access to the cluster and by kubeclt command create the resources (kubeckt apply -f manifest-name.yml).

1. Namespace creation

A K8 namespace is necessary, amazon-cloudwatch name will use for this, you can change the name but make sure to use the same in the following steps.



apiVersion: v1
kind: Namespace
metadata:
  name: amazon-cloudwatch
  labels:
    name: amazon-cloudwatch

2. ConfigMap for aws-fluent-bit general configs

This configMap is necessary to specify some configurations for fluent-bit and for AWS, for instance, the cluster-name AWS use to create the logs group. In this case, I don't want to create an HTTP server for fluent-bit and I will read the logs from the tail, more information about this can be found }(https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/classic-mode/configuration-file).



apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-general-configs ## you can use a different name, make sure to use the same in the following steps
  namespace: amazon-cloudwatch
data:
  cluster.name: ${CLUSTERNAME}
  http.port: ""
  http.server: "Off"
  logs.region: ${AWS_REGION}
  read.head: "Off"
  read.tail: "On"

3.Service Account, Cluster Role and Role Binding

Some permissions are required to send logs from daemonSet to Cloudwatch, you can attach a role to the worker-nodes or use a service account with an IAM role, in this case, I will create an IAM role and associate it with a service account.
The following Terraform code creates a policy and the role.


 Terraform
resource "aws_iam_role" "iam-role-fluent-bit" {
  name                  = "role-fluent-bit-test"
  force_detach_policies = true
  max_session_duration  = 3600
  path                  = "/"
  assume_role_policy    = jsonencode({

{
    Version= "2012-10-17"
    Statement= [
      {
        Effect= "Allow"
        Principal= {
            Federated= "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID}"
        }
        Action= "sts:AssumeRoleWithWebIdentity"
        Condition= {
          StringEquals= {
            "oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID}:aud": "sts.amazonaws.com",
"oidc.eks.${REGION}.amazonaws.com/id/${EKS_OIDCID}:sub": "system:serviceaccount:${AWS_CLOUDWATCH_NAMESPACE}:${EKS-SERVICE_ACCOUNT-NAME}"
          }
        }
      }
    ]
  }

})

}

EKS_OIDCID: is the OpenID Connect for your cluster, you can get it in the cluster information or by terraform outputs.
AWS_CLOUDWATCH_NAMESPACE: is the namespace create in the step 1, in this case amazon-cloudwatch.
ACCOUNT_ID: is the AWS account number where the cluster was created.

The role needs a policy with permissions to create and put logs in cloudwatch, you can use the following code to create the policy and attach it to IAM Role created.


 Terraform

resource "aws_iam_policy" "policy_sa_logs" {
  name        = "policy-sa-fluent-bit-logs"
  path        = "/"
  description = "policy for EKS Service Account fluent-bit "
  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "ec2:DescribeVolumes",
                "ec2:DescribeTags",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams",
                "logs:DescribeLogGroups",
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutRetentionPolicy"
            ],
            "Resource": "arn:aws:logs:${REGION}:${ACCOUNT_ID}:*:*"
        }
    ]
}
EOF
}

######## Policy attachment to IAM role ########

resource "aws_iam_role_policy_attachment" "policy-attach" {
  role       = aws_iam_role.iam-role-fluent-bit.name
  policy_arn = aws_iam_policy.policy_sa_logs.arn
}

Once the role has been created the Service account can be created, you can use the following k8 manifest for that, you should replace the IAM_ROLE variable for the ARN of the role created previously.



apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  annotations:
    eks.amazonaws.com/role-arn: "${IAM_ROLE}"

With the SA ready, you need to create a cluster role and associate that to the SA created, the following manifests can be used for that.



apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-role
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
      - pods/logs
      - nodes
      - nodes/proxy
    verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-role
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: amazon-cloudwatch

4.ConfigMap for fluent-bit configurations

A ConfigMap is used to specify a detailed configuration for Fluent-bit, AWS already defines a configuration, but you can add custom configs, The following link, shows the configurations defined by AWS, if you see, the first objects created in the YAML are the manifest defined in previous steps, in this step you just need to define the ConfigMap with name fluent-bit-config, I don't want to put here all the manifest because is a little long and can complicate the lecture of this post.

With this ConfigMap, Fluent Bit will create the log groups in the below table, you also have the option to create by terraform and specify encryption and retention period (i recommend this way).

CloudWatch Log Group Name	Source of the logs(Path inside the Container)
aws/containerinsights/Cluster_Name/application	All log files in /var/log/containers
/aws/containerinsights/Cluster_Name/host	Logs from /var/log/dmesg, /var/log/secure, and /var/log/messages
/aws/containerinsights/Cluster_Name/dataplane	The logs in /var/log/journal for kubelet.service, kubeproxy.service, and docker.service.

If you analyze the ConfigMap you can see the INPUTS for each source mentioned in the table.



apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit

... OTHER CONFIGS 

### here is the INPUT configurations for application logs  
application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log

... OTHER CONFIGS 

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              $${AWS_REGION}
        log_group_name      /aws/containerinsights/$${CLUSTER_NAME}/application
        log_stream_prefix   $${HOST_NAME}-
        auto_create_group   false
        extra_user_agent    container-insights
        log_retention_days  ${logs_retention_period}

The OUTPUT in the previous manifest defines the CloudWatch log Group configuration that fluent bit will create, as you can see you can specify if the log groups should be created, the prefix for the stream, the name, and the retention period for the logs. If you are using Terraform you should set to false the option auto_create_group

5.DaemonSet Creation

This is the last step :) , AWS also provides the manifest to create the DaemonSet, in this link you can find it in the bottom of the file. As I mentioned, I don't want to put the whole file here, you can copy and paste the content or edit the file if you have custom configurations.



apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
... OTHER CONFIGS

Once you have run the above steps, you can validate that the daemonSet is running well and if everything is ok you should be the Logs groups in the AWS console with some events passed by fluent-bit DaemonSet

References