
Jennifer Luther Thomas

Originally published at Medium.

How purpose-built observability will speed up your Kubernetes troubleshooting

If you remember from my last post, in my previous job I was battling a support case where a user had implemented network security policies for their FME Flow (Server) application and part of the application had stopped working. They were using Open Service Mesh (a lightweight, extensible cloud-native service mesh), which had a mechanism for recommending Kubernetes security policies; however, in doing so it had missed one of the port ranges that FME Flow required. The case took weeks of troubleshooting, involving the customer and resellers/consultants, with escalation to the software vendor (me).

Knowing that FME Flow had been working fine before the network security policies were applied, OSM seemed like the probable cause. But with no insight into the network topology or flow logs to see how those policies had impacted FME, it was difficult to isolate the problem. I had also never used OSM before, which made it hard to know where to start. Eventually the customer figured out that a port range had been missed in the network security policies, so I can't take any credit for the resolution.

If the customer or I had access to Kubernetes observability back then, it would have made it so much easier and faster to troubleshoot the issue.

So what is observability?

Observability refers to the ability to understand the internal state of a system by looking at the external outputs of the system. For me, the main benefit was being able to quickly view my intra-cluster traffic and identify where packets were being denied and why.

In Calico Cloud (which I’m using for this example) this can be found inside the Dynamic Service and Threat Graph.

“Dynamic Service and Threat Graph provides a point-to-point, topographical representation of traffic within your cluster to observe Kubernetes environment behavior, troubleshoot connectivity issues, and identify performance hotspots.”

According to Splunk, in their “The State of Observability 2023” report:

“Observability has become foundational to modern enterprises, providing a way to see into the stunningly complex web of systems that characterizes today’s IT environments”

I would agree with that.

Observability helps:

  • understand the communication patterns within Kubernetes
  • visualize microservice communication
  • quickly see dependencies and interactions
  • identify external services
  • analyze performance
  • speed up troubleshooting
  • increase resilience

The situation

In the FME Flow (an enterprise spatial ETL application) Helm chart you can set a value for fmeserver.portpool. This defines the range of ports that the FME Engines (the 'workers' that do all of the ETL processing) use when connecting to the FME Core, which is essentially the brains of the application.
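
For context, here's a minimal sketch of what that could look like in a values file. The nesting and value format are my assumptions, so check the chart documentation for your version; the 42000-43000 range is the one I use later in this post:

# values.yaml (sketch; nesting and value format are assumptions)
fmeserver:
  portpool: "42000-43000" # port range the FME Engines use to reach the FME Core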

If you apply network security policies (without the port range exposed) after you've already launched FME Flow, you may not notice immediately. In fact, I think the issue mainly manifested when an FME Engine needed to communicate with the Core to retrieve database connection information (if an ETL job was reading from or writing to a database), which was not the easiest thing to troubleshoot. That was the situation the customer brought to me.

If you don’t allow the port range within the cluster before you install the application you may find that the FME Engines can’t register with the FME Core and the problem would be more obvious and critical. If the FME Engines aren’t registering, you’re not going to be processing any data.

This is why I wanted to see how fast I could identify and fix this issue using observability.

Reproducing the scenario

As I’ve been learning more about Calico Cloud and it’s capabilities, I realized the policies that I implemented previously could not only be used in Calico Cloud, but I could easily and quickly view how all of the components of my application communicate with each other and if the policies I’m putting in place are working.

Connecting Calico Cloud to my cluster was a piece of cake. The UI generates a command that you apply to your cluster; then, while you take an extended coffee or snack break, it installs everything in your cluster and you're ready to go!

To reproduce the customer scenario, I had the Policy Board open on one screen.

The Policy Board lets me see which policies I have enforced, whether applied via the CLI or created in the Calico Cloud interface. It very quickly gives insight into which policies are evaluating traffic and whether packets are being allowed or denied.


The Policies Board in Calico Cloud

On my other screen I had the Dynamic Service and Threat Graph open, which allows you to click on any line and see what traffic is flowing between components.

Based on the arrow direction, we can easily see that traffic is flowing from NGINX (the ingress) to FME (the application). The arrow is green, which means the traffic is allowed. Inspecting the traffic shows the protocol, ports and any policies in place in the right-hand sidebar:


An up-close view of the communication between two Kubernetes namespaces
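
As an aside, the policy allowing that NGINX-to-FME hop is written in the same way as the engine policy shown later in this post. Here's a minimal sketch of what such an allow rule could look like; the policy name, the ingress controller namespace (ingress-nginx) and the destination port are assumptions for illustration, not copied from my actual setup:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: application.ingress-to-core # hypothetical name
  namespace: fme
spec:
  tier: application
  selector: safe.k8s.fmeserver.component == "core"
  ingress:
    - action: Allow
      protocol: TCP
      source:
        namespaceSelector: projectcalico.org/name == "ingress-nginx" # assumed ingress namespace
      destination:
        ports:
          - '8080' # placeholder; use the port your core web service actually listens on
  types:
    - Ingress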

If we look inside the fme namespace, we can see traffic flowing between the different components. There is a red line between the engine deployment groups and the engine registration service (core).

Inspecting that traffic flow shows the engine group is trying to communicate with the core on ports 7070 and 42001–42002 (the port range I specified in the helm chart is 42000–43000).


A topological view of microservices communication within the fme namespace

Looking at the right-hand panel we can see that traffic is being denied as it leaves the engine group.

In Calico Cloud, at the bottom of the Dynamic Service and Threat Graph, is a Flows table that lists every flow happening within the cluster. This includes source and destination, ports, policies, action, process ID, etc. To find out which policy is denying the traffic, we can look at the policies that are evaluating the flow from the engine to the engine registration service.

The default deny policy is in place to block any traffic that I haven’t explicitly allowed as part of my zero-trust security posture. I deliberately excluded my port range from any policy so that it would be denied by default to reproduce the customer’s scenario.
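
My actual default-deny policy isn't shown here, but a minimal sketch of one way to express a default deny for the fme namespace in Calico could look like this; the name, tier and order are assumptions for illustration:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: default.default-deny # hypothetical name
  namespace: fme
spec:
  tier: default # assumed tier, evaluated after the application tier
  order: 2000
  selector: all() # select every endpoint in the namespace
  types:
    - Ingress
    - Egress
  # no allow rules, so any traffic that reaches this policy is denied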

To allow traffic to flow as intended, the engine policy needs to be set up so that the engines can communicate with the other FME Flow components.

To apply this policy to the engines, I used this policy label selector:

selector: safe.k8s.fmeserver.component == "engine"
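
For this selector to match, the engine pods need to carry that label. Roughly speaking, and assuming the FME Flow chart sets it on the engine pods, the relevant part of the pod metadata looks like this:

# Pod label that the selector above matches (sketch)
metadata:
  labels:
    safe.k8s.fmeserver.component: engine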

This is the Create Policy UI where I’ve defined ingress and egress rules for the engines:


The ingress and egress rules for the fme engine deployment

And if you’d rather look at the yaml:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: application.engine-fmeserver
  namespace: fme
spec:
  tier: application
  order: 37.5
  selector: safe.k8s.fmeserver.component == "engine"
  serviceAccountSelector: ''
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: safe.k8s.fmeserver.component == "core"
      destination:
        ports:
          - '7500'
  egress:
    - action: Allow
      protocol: TCP
      source: {}
      destination:
        selector: safe.k8s.fmeserver.component == "core"
        ports:
          - '7070'
          - '42000:43000'
    - action: Allow
      protocol: TCP
      source: {}
      destination:
        selector: app.kubernetes.io/name == "postgresql"
        ports:
          - '5432'
  types:
    - Ingress
    - Egress

And almost like magic, the Dynamic Service and Threat Graph turns green:

Another snazzy feature that makes it easy to see which components are talking to each other is clicking on a component in the Dynamic Service and Threat Graph. Clicking on the engine-standard-group deployment shows every inbound and outbound communication on the right-hand side.

Flow Visualization

Another observability feature of Calico Cloud that I’ve recently come to appreciate is Flow Visualization, or “FlowViz”.

At first look, it’s a bit of a WTF moment. This doesn’t tell me anything?!

Instead of the topological view that is easy to comprehend at first glance, Flow Visualization gives a 360-degree view of your cluster, with network traffic represented volumetrically. Moving in from the outside, it represents namespaces, endpoint names and flows. Colour-coded flows quickly let you see whether there's any denied traffic, and next to the visualization is a table that shows allowed and denied traffic by namespace, with connections per second (CPS), packets per second (PPS) and bits per second (BPS), if network performance is your thing.

You can also 'zoom in' to a namespace and easily find denied traffic and which policies are denying (or allowing) it. Clicking on the denied traffic flow (as shown in the gif above) instantly shows which policy is responsible so that it can be fixed.

Conclusion

If the customer had the same visibility into their cluster it would have been very easy to identify the denied traffic and correct the policies. But then I wouldn’t be here telling this story.

Observability using the Dynamic Service and Threat Graph, FlowViz and the Policy Board makes it incredibly easy and fast to apply the correct network security policies to protect your workloads, as well as to see what's actually going on inside your cluster! The massive overhead of writing policies in yaml and taking an iterative approach to testing connection by connection is so last blog. Not only does observability make your cluster look cool; I've found it incredibly valuable.

If you want to give it a go for yourself, sign up for a Calico Cloud trial. There is also a hands-on tutorial to learn how to gain observability and optimize troubleshooting in under an hour.

Stay connected with me on here, X or LinkedIn to follow my journey and more introductory security content!

If you want to see all of my policies for FME Flow reach out and I can share the yaml.
