mikeyGlitz

Posted on Jan 3, 2021 • Edited on Mar 31, 2021

Kubernetes Service Monitoring and Observability

#kubernetes #grafana #linkerd #prometheus

A common problem that I've run across throughout my career is supporting applications which have gone to production. When supporting a service in production, it is important to be able to identify where things went wrong and how to fix them before customers/end users are impacted.
Application logs are usually the first place I check when I'm notified of a production issue. Seems simple enough, right? Open a file, look for the words "exception" or "error", and backtrack from there. The only problem is that this approach to troubleshooting becomes unsustainable in today's environments, where services can be composed of decentralized, distributed, constituent applications.
How would you go about monitoring multiple services (e.g. a microservice architecture) without the added cognitive load of a more complicated troubleshooting process?

Monitoring and Kubernetes

Although there are multiple ways to perform monitoring, depending on which platform you're using and what tools it offers (e.g. GCP, Azure, AWS), I'm going to focus on monitoring a Kubernetes cluster.
Fortunately, when working with a Kubernetes cluster, there are quite a few open source projects which can aid in providing observability and monitoring for services.
For ease of use, I'll be using Linkerd. Linkerd is a service mesh and is fairly easy to set up. Additionally, with Linkerd, you get Prometheus and Grafana included for FREE!

What is a Service Mesh?

In the context of a Kubernetes cluster, a service mesh is a collection of applications, services, and custom resources which provide observability, scalability, and resiliency for applications in your cluster. Linkerd specifically uses a pod sidecar called Linkerd Proxy, which proxies calls to your services and exposes metrics about that traffic. Prometheus, a tool for collecting and querying metrics and managing alerts, gathers these metrics. Grafana then reads the metrics from Prometheus and visualizes them as charts and graphs, which makes them easier to digest.
Additionally, Grafana can use Loki as a data source. Loki aggregates logs and makes them available in Grafana, allowing you to search through them with relative ease.

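The flow for application metrics will resemble the following: application pod (Linkerd Proxy sidecar) -> Prometheus -> Grafana.

Additionally, an application log flow will look like the following: application pod -> Logging Operator (FluentBit and FluentD) -> Loki -> Grafana.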

Setting up Linkerd

Assuming you have your Kubernetes cluster all set up, setting up Linkerd will be relatively straightforward. Linkerd provides a command-line interface (CLI) tool which makes managing Linkerd pretty easy.

Install the CLI following these instructions:

curl -sL https://run.linkerd.io/install | sh
# Add Linkerd to path
export PATH=$PATH:$HOME/.linkerd2/bin
# Deploy Linkerd to your cluster
linkerd install | kubectl apply -f -
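
Before moving on, it can help to confirm that the install succeeded. The sketch below wraps the CLI's built-in linkerd check command in a small shell function; the function name verify_linkerd is hypothetical and only used for illustration.

```shell
# verify_linkerd is a hypothetical helper name, not part of the Linkerd CLI.
# It runs the CLI's built-in health checks against the current kubectl context.
verify_linkerd() {
  if ! command -v linkerd >/dev/null 2>&1; then
    echo "linkerd CLI not found on PATH" >&2
    return 127
  fi
  # Exits non-zero if any control plane check fails
  linkerd check
}
```

Run verify_linkerd after the install command. linkerd check --pre can be run the same way before installing, to validate that the cluster is able to host Linkerd.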

From here, you'll be able to access Grafana from the Linkerd dashboard:

# Open a browser window to the linkerd dashboard
linkerd dashboard

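Kubernetes resources can be added to the Linkerd mesh with the linkerd.io/inject annotation, set either on a Namespace or on the pod template of a Deployment:

# Inject the proxy into every pod created in this namespace
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    linkerd.io/inject: enabled

# Inject the proxy into the pods of a single Deployment
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled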
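
The namespace annotation can also be applied from the command line. The sketch below takes the namespace as an argument (the files namespace used later in this article would be one candidate); mesh_namespace is a hypothetical helper name. Because the proxy is injected when pods are created, existing Deployments are restarted so that their pods pick up the sidecar.

```shell
# mesh_namespace is a hypothetical helper, not a kubectl or linkerd command.
# Usage: mesh_namespace <namespace>
mesh_namespace() {
  if [ -z "${1:-}" ]; then
    echo "usage: mesh_namespace <namespace>" >&2
    return 2
  fi
  # Ask Linkerd's proxy injector to add the sidecar to new pods in the namespace
  kubectl annotate namespace "$1" linkerd.io/inject=enabled --overwrite || return 1
  # Recreate existing pods so the proxy is injected into them as well
  kubectl rollout restart deployment --namespace "$1"
}
```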

Logging Operator

Logging Operator is a project created by Banzai Cloud. It is powered by FluentD and FluentBit, which it uses to perform log discovery.

We begin by deploying the logging operator onto our Kubernetes cluster. Fortunately, there's a Helm chart which makes the deployment easier.

# Create a logging namespace
apiVersion: v1
kind: Namespace
metadata:
  name: logging

# Add the helm repo
helm repo add banzaicloud-stable https://kubernetes-charts.banzaicloud.com
# Install the helm chart
helm upgrade --install --wait --create-namespace --namespace logging logging-operator banzaicloud-stable/logging-operator \
  --set createCustomResource=false
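
Once Helm reports success, the install can be double-checked with kubectl. This is a minimal sketch, assuming the logging namespace used above; verify_logging_operator is a hypothetical helper name. The logging.banzaicloud.io API group is the one used by the operator's custom resources later in this article.

```shell
# verify_logging_operator is a hypothetical helper name.
verify_logging_operator() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found on PATH" >&2
    return 127
  fi
  # The operator pod should be listed as Running
  kubectl get pods --namespace logging || return 1
  # Lists the operator's custom resource types if its CRDs are installed
  kubectl api-resources --api-group=logging.banzaicloud.io
}
```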

Deploying Loki

Now that Logging Operator is in place to set up log discovery across our various sources, it's time to set up Loki and Grafana.
The Grafana community maintains a Helm chart, loki-stack, which helps in building a Loki stack. Here it is deployed using Terraform:

resource "helm_release" "rel_logging_loki" {
  repository = "https://grafana.github.io/helm-charts"
  chart = "loki-stack"
  name = "loki"
  namespace = "logging"

  set {
    name = "promtail.enabled"
    value = "true"
  }
  set {
    name = "loki.enabled"
    value = "true"
  }
}

Linkerd will need to be updated so that Grafana talks to Loki.
To perform the setup, we'll use kustomize, which is built into kubectl, to patch the new Grafana configuration into Linkerd's Grafana instance.

grafana.yml

kind: ConfigMap
apiVersion: v1
metadata:
  name: linkerd-grafana-config
data:
  datasources.yaml: |-
    apiVersion: 1
    datasources:
    - name: prometheus
      type: prometheus
      access: proxy
      orgId: 1
      url: http://linkerd-prometheus.linkerd.svc.cluster.local:9090
      isDefault: false
      jsonData:
        timeInterval: "5s"
      version: 1
      editable: true
    - name: Loki
      type: loki
      access: proxy
      editable: false
      isDefault: true
      url: http://loki.logging:3100
      jsonData:
        maxLines: 300

Set up a kustomization.yml:

resources:
- linkerd.yml
patchesStrategicMerge:
- grafana.yml

Now we can dump our current Linkerd config and run kubectl kustomize to patch the Grafana configuration from the previous step into Linkerd:

linkerd upgrade > linkerd.yml
# Render the kustomization in the current directory and apply the result
kubectl kustomize . | kubectl apply -f -
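
Since this re-applies the whole Linkerd install, it may be worth previewing the change first. The following is a minimal sketch, assuming the kustomization file, linkerd.yml, and grafana.yml sit in the same directory; preview_grafana_patch is a hypothetical helper name.

```shell
# preview_grafana_patch is a hypothetical helper name.
# It renders the kustomization in the given directory (default: current one)
# and shows what would change in the cluster, without applying anything.
preview_grafana_patch() {
  dir="${1:-.}"
  if [ ! -f "$dir/kustomization.yml" ] && [ ! -f "$dir/kustomization.yaml" ]; then
    echo "no kustomization file found in $dir" >&2
    return 2
  fi
  # kubectl diff exits with 1 when differences are found, which is expected here
  kubectl kustomize "$dir" | kubectl diff -f -
}
```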

Setting up Logging Operator to Stream to Loki

The last step we'll have to complete is setting up the logging operator to stream to Loki.
The Logging operator specifies the following custom resources which are used to watch containers and transport logs to a target destination:

Logging - Specifies a logging source
Output - Specifies a destination for log outputs. These resources can also be established cluster-wide as a ClusterOutput
Flow - Connects Logging resources to Output resources and specifies patterns which are used to parse log entries. These resources can also be established cluster-wide as a ClusterFlow.

Specify a ClusterOutput to send logs to Loki:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: loki-output
  namespace: logging
spec:
  loki:
    url: http://loki:3100
    configure_kubernetes_labels: true
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true

For each Pod/Deployment you want to monitor, you're going to need to set up a Logging and a Flow:

# Set up the Logging object
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: files-logger
  namespace: files
spec:
  fluentd: {}
  fluentbit: {}
  controlNamespace: logging
---
# Set up the Flow object
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: files-flow
  namespace: files
spec:
  globalOutputRefs:
  - loki-output
  filters:
    - tag_normaliser: {}
    - parser:
        remove_key_name_field: true
        reserve_data: true
        parse:
          type: multi_format
          patterns:
            - format: regexp
              expression: '/^(?<time>[^\]]*) \[(?<level>[^ ]*)\] (?<source>[^\":]*): (?<message>.*)$/'
              time_key: logtime
              time_format: '%Y-%m-%dT%H:%M:%S.%LZ'
            - format: regexp
              expression: '/^time="(?<time>[^\]]*)" level=(?<level>[^ ]*) msg="(?<message>[^\"]*)"/'
              time_key: time
              time_format: '%Y-%m-%dT%H:%M:%SZ'
            - format: regexp
              expression: '/^level=(?<level>[^ ]*) ts=(?<time>[^\]]*) caller=(?<source>.*) msg="(?<message>[^\"]*)"/'
              time_key: time
              time_format: '%Y-%m-%dT%H:%M:%S.%LZ'
            - format: regexp
              expression: '^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$'
              time_key: time
              time_format: '%d/%b/%Y:%H:%M:%S %z'
  match:
  - select:
      labels:
        app: owncloud

ℹ The Flow object in this example uses FluentD regular expressions to parse log streams. The expressions can be evaluated/debugged using Fluentular.
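
A pattern can also be sanity-checked locally before it is deployed. The sketch below is not a FluentD run: it uses sed with a POSIX extended regular expression that approximates the second pattern from the Flow (sed has no named capture groups, so numbered groups are used instead), and the sample log line is made up for illustration.

```shell
# Made-up sample line in the time="..." level=... msg="..." format
line='time="2021-01-03T12:00:00Z" level=info msg="server started"'

# Numbered groups stand in for the named groups of the FluentD pattern:
# \1 = time, \2 = level, \3 = message
parsed=$(printf '%s\n' "$line" |
  sed -nE 's/^time="([^]]*)" level=([^ ]*) msg="([^"]*)".*/time=\1 level=\2 message=\3/p')

# An empty result means the line did not match the pattern
echo "$parsed"
```

If the fields come out as expected, the corresponding FluentD expression is a reasonable candidate for that log format, though the final check should still happen against FluentD itself.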

The globalOutputRefs section links the Flow resource to one or more ClusterOutput resources by name. The ClusterOutput will route the logs to the indicated destination.

The match section indicates which Kubernetes resources the Flow resource will collect logs from. In this example, the Flow resource will select any pods with the label app: owncloud.

If everything is set up right, the logs will be searchable in Grafana.
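
If nothing shows up in Grafana, one way to narrow things down is to ask Loki directly which labels it has indexed. The sketch below uses Loki's HTTP API and assumes Loki is reachable at the given base URL, for example through kubectl port-forward --namespace logging service/loki 3100:3100 (the service name and namespace are taken from the data source URL used earlier). loki_labels is a hypothetical helper name.

```shell
# loki_labels is a hypothetical helper name.
# Usage: loki_labels [base-url]   (default: http://localhost:3100)
loki_labels() {
  base="${1:-http://localhost:3100}"
  # /ready answers with HTTP 200 once Loki is able to serve requests
  if ! curl --silent --fail --max-time 5 "$base/ready" >/dev/null; then
    echo "Loki is not ready at $base" >&2
    return 1
  fi
  # Lists the label names Loki has seen; an empty list suggests no logs arrived yet
  curl --silent --fail --max-time 5 "$base/loki/api/v1/labels"
}
```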

What's Left?

So far I've covered how to set up Linkerd and stream logs to Grafana. An improvement to this setup would be deploying Prometheus Alertmanager to create and manage alerts, so that you can receive notifications through third-party services like Slack and PagerDuty.

References

Customizing Linkerd's Configuration - https://linkerd.io/2/tasks/customize-install/
Logging Operator Quickstart Guide (Loki) - https://banzaicloud.com/docs/one-eye/logging-operator/quickstarts/loki-nginx/
https://itnext.io/part-4-operations-and-the-cloud-native-stack-in-action-bb17d9f0ff5

Outtakes

This article was roughly 3 months of experimenting in my home-lab. Instead of Grafana and Loki, I had originally attempted to perform log streaming using the Elastic Stack, EFK (Elasticsearch, FluentD, and Kibana).
I'm using Keycloak as an identity provider and had attempted an initial configuration using Kibana and OpenID Connect (OIDC) based authentication; however, the OIDC plugin is only available on the Platinum Tier of Elastic. Disabling the xpack.security.enabled setting broke the Elastic applications.
Attempting to utilize a keycloak-kibana plugin also caused Kibana to fail to start.

With Loki and Grafana, I'm able to set up a Kubernetes Ingress with OAuth2 authentication as a way to secure access to my services.

Top comments (2)

Kohei Ota • Jan 4 '21 • Edited

Pomtail -> Promtail in

:)

mikeyGlitz • Jan 4 '21

Still learning how this works. I'll issue a correction soon:

source: computingforgeeks.com/forward-logs...
Promtail, just like Prometheus, is a log collector for Loki that sends the log labels to Grafana Loki for indexing.
