How to quickly implement proactive network connectivity inspection with no blind spots in large-scale clusters

#kubernetes #sre #cloudnative #patrol

01 What is inspection

Cluster inspection (also called patrolling) is the periodic examination and evaluation of a cluster to ensure its stability, performance, and security. Its main uses are:

Troubleshooting and problem diagnosis: Inspection helps identify faults in the cluster and supports their diagnosis and resolution. Checking the cluster's components, configuration, and operating state reveals potential sources of failure and performance bottlenecks early, so they can be fixed in time.
Performance optimization: Inspection assesses the cluster's performance and resource utilization. Analyzing load, resource allocation, and configuration exposes performance bottlenecks and wasted resources and yields recommendations for improving the cluster's efficiency.
Security audit and compliance: Inspection checks the cluster's security and compliance, including access control, authentication, and data protection. Auditing the security configuration, vulnerability management, and compliance provisions reveals potential security risks and compliance issues so that they can be remediated.
Capacity planning and scalability: Inspection assesses the cluster's capacity utilization and scalability requirements. This helps predict future resource needs, plan scaling strategies, and ensure the cluster has enough capacity to meet business growth and change.
High availability and redundancy: Inspection assesses the cluster's high-availability and redundancy strategies. Examining the failover, backup, and recovery mechanisms reveals potential single points of failure and availability issues and yields recommendations for improving reliability.

02 Pain points of traditional active network inspection

Active inspection is mostly done manually: operators use CLI tools or scripts to inject load into the cluster and observe how it responds. This approach has many shortcomings:

Typing commands by hand becomes impractical when the cluster is large, inspections are frequent, or the inspection procedure is complex.
Implementing inspections as shell scripts raises the skill threshold for O&M personnel, and bugs in the scripts undermine the accuracy of the inspection conclusions.
When multiple load generators are needed to increase the number of requests and connections, each one must be configured and tuned, which raises the cost of preparing the load-test environment.
Test tools that need tuning, combined with inexperience in configuring them, limit the load that can actually be generated, so the tests miss their purpose and produce erroneous conclusions.
Kubernetes applications mostly rely on the product's own inspection capability, confirming cluster status from application metrics, logs, and status. The limited metrics an application emits are not enough for a complete inspection conclusion.
In a large-scale Kubernetes cluster, operators want to confirm Pod network connectivity between all nodes, both to catch a network failure on any single node and to detect occasional packet loss. There are many communication channels to cover, including Pod IP, ClusterIP, NodePort, LoadBalancer IP, and Ingress IP, as well as Pods with multiple NICs and dual-stack IPs. Inspecting all of this manually is inefficient and costly to maintain.
Different targets, such as DNS services, business application services, and disks, require different tools, so O&M personnel need in-depth knowledge of each one, which greatly raises the skill threshold.
Different inspection tools produce reports in different styles, and none of them presents detailed inspection results in a cloud-native way.
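
To give a sense of scale, here is a rough Python sketch of how many directed Pod-to-Pod paths a full-mesh connectivity check has to cover. It assumes one probing Pod per node and counts only direct Pod IP access; the function name `full_mesh_paths` and the node counts are illustrative and not taken from any tool.

```python
def full_mesh_paths(nodes: int, ip_families: int = 1) -> int:
    """Count directed Pod-to-Pod paths in a full mesh.

    Assumption: one probing Pod per node. ip_families is 1 for
    single-stack clusters and 2 for dual-stack clusters.
    """
    if nodes < 0 or ip_families < 1:
        raise ValueError("nodes must be >= 0 and ip_families >= 1")
    # Each of the n Pods probes the other n - 1 Pods, once per IP family.
    return nodes * (nodes - 1) * ip_families


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, full_mesh_paths(n), full_mesh_paths(n, ip_families=2))
```

For 100 nodes this already gives 9,900 directed paths on a single IP family, before ClusterIP, NodePort, and the other channels are added, which is why manual checking does not scale.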

03 Solution: kdoctor

kdoctor is a Kubernetes data-plane testing component that uses active load injection to run functional and performance tests on a cluster. It abstracts the routine needs of O&M personnel so that O&M tasks for network, storage, and applications can be carried out in a cloud-native way, using a CRD-based design that can integrate with observability components.

kdoctor provides the following three main types of inspection:

kdoctor NetReach: Based on the task configuration, checks connectivity within the cluster across Pod IP, ClusterIP, NodePort, LoadBalancer IP, and Ingress IP, including Pods with multiple NICs and dual-stack IPs.
kdoctor AppHttpHealthy: Based on the task configuration, checks the connectivity of specified addresses inside and outside the cluster over HTTP and HTTPS, supporting request methods such as PUT, GET, and POST.
kdoctor NetDns: Based on the task configuration, checks connectivity to specified DNS servers inside and outside the cluster, supporting the udp, tcp, and tcp-tls protocols.

kdoctor addresses the problems of traditional active inspection with the following design:

Inspection tasks are configured by applying a CRD, so the user only needs to specify the inspection target, inspection frequency, load generation parameters, and expected results.
kdoctor reads the task configuration and runs the load-generating agents as a Deployment or DaemonSet, which achieves the effect of multiple load generators.
Depending on the task specification, kdoctor either uses the default agent or creates a new agent to execute the task, which allows both resource reuse and per-task resource isolation.
kdoctor binds the corresponding target resources, such as an Ingress or Service, according to the task configuration. The agent Pods then access one another through the bound resources, and conclusions are drawn from the request results.
kdoctor's load-generating client is performance-tuned to greatly reduce resource consumption while sending requests.
kdoctor outputs inspection reports through logs, an aggregated API, files written to disk, and other channels.
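
The idea of drawing a conclusion from request results can be sketched in Python as follows. This is not kdoctor's actual code: `Expectation`, `ProbeStats`, and `evaluate` are hypothetical names, and the pass rule (success rate at least the expected value, mean delay at most the expected value) is an assumption modeled on the `successRate` and `meanAccessDelayInMs` fields of a task's `expect` section.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    success_rate: float          # e.g. 1 means every request must succeed
    mean_access_delay_ms: float  # upper bound on the mean latency


@dataclass
class ProbeStats:
    request_count: int
    success_count: int
    mean_delay_ms: float


def evaluate(stats: ProbeStats, expect: Expectation) -> bool:
    """Return True when the observed results meet the expectation (assumed rule)."""
    if stats.request_count == 0:
        return False  # no requests were sent, so nothing was verified
    rate = stats.success_count / stats.request_count
    return (
        rate >= expect.success_rate
        and stats.mean_delay_ms <= expect.mean_access_delay_ms
    )
```

Keeping the expectation separate from the measured statistics mirrors the split between what the user declares and what the agents observe.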

04 Installation and Usage

Install kdoctor by following its official documentation.

This article uses NetReach as an example of a cluster connectivity inspection.

Apply the NetReach task below. It runs one round lasting 10 seconds, starting immediately, during which the default agent on each node accesses the other agents over HTTP through their IPv4 ClusterIP, Endpoint, and NodePort addresses. LoadBalancer, Ingress, and Multus interface checks are disabled in this manifest.

cat <<EOF | kubectl apply -f -
apiVersion: kdoctor.io/v1beta1
kind: NetReach
metadata:
  name: reach-task
spec:
  expect:
    meanAccessDelayInMs: 1500
    successRate: 1
  request:
    durationInSecond: 10
    perRequestTimeoutInMS: 1500
    qps: 10
  schedule:
    roundNumber: 1
    roundTimeoutMinute: 1
    schedule: 0 1
  target:
    clusterIP: true
    endpoint: true
    ingress: false
    ipv4: true
    loadBalancer: false
    multusInterface: false
    nodePort: true
EOF
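
When the same task has to be created with different load parameters, the manifest can also be generated. The Python sketch below builds the same object as a dictionary and prints it as JSON, which `kubectl apply -f -` accepts as well as YAML. The function name `build_netreach` and its parameters are hypothetical; the field names and values are copied from the manifest above and are not validated against the kdoctor CRD schema here.

```python
import json


def build_netreach(name: str, qps: int = 10, duration_s: int = 10,
                   timeout_ms: int = 1500) -> dict:
    """Build a NetReach manifest mirroring the YAML above (hypothetical helper)."""
    return {
        "apiVersion": "kdoctor.io/v1beta1",
        "kind": "NetReach",
        "metadata": {"name": name},
        "spec": {
            "expect": {"meanAccessDelayInMs": 1500, "successRate": 1},
            "request": {
                "durationInSecond": duration_s,
                "perRequestTimeoutInMS": timeout_ms,
                "qps": qps,
            },
            # The schedule string is kept exactly as written in the article.
            "schedule": {"roundNumber": 1, "roundTimeoutMinute": 1, "schedule": "0 1"},
            "target": {
                "clusterIP": True,
                "endpoint": True,
                "ingress": False,
                "ipv4": True,
                "loadBalancer": False,
                "multusInterface": False,
                "nodePort": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_netreach("reach-task"), indent=2))
```

Saved under a name of your choice, for example `build_netreach.py`, the output can be piped into `kubectl apply -f -` in the same way as the heredoc.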

View Inspection Tasks

~# kubectl get netreach
NAME         FINISH   EXPECTEDROUND   DONEROUND   LASTROUNDSTATUS   SCHEDULE
reach-task   true     1               1           succeed           0 1

View the inspection task report

The kdoctor controller aggregates the inspection task reports and displays them via an aggregation API.

~# kubectl get kdoctorreport  reach-task -oyaml
apiVersion: system.kdoctor.io/v1beta1
kind: KdoctorReport
metadata:
  creationTimestamp: null
  name: reach-task
spec:
  FailedRoundNumber: null
  FinishedRoundNumber: 1
  Report:
  - EndTimeStamp: "2023-09-21T11:30:33Z"
    NetReachTask:
      Detail:
      - MeanDelay: 50.294117
        Metrics:
          Duration: 15.004307799s
          EndTime: "2023-09-21T11:30:33Z"
          Errors: {}
          Latencies:
            Max_inMx: 0
            Mean_inMs: 50.294117
            Min_inMs: 0
            P50_inMs: 0
            P90_inMs: 0
            P95_inMs: 0
            P99_inMs: 0
          RequestCounts: 102
          StartTime: "2023-09-21T11:30:18Z"
          StatusCodes:
            "200": 102
          SuccessCounts: 102
          TPS: 6.798047691796755
          TotalDataSize: 39295 byte
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentClusterV4IP_10.233.32.45:80
        TargetUrl: http://10.233.32.45:80
        ....
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentPodV4IP_kdoctor-netreach-reach-task-pmndx_10.233.74.96
        TargetUrl: http://10.233.74.96:80
    NodeName: worker-node-1
    PodName: kdoctor-netreach-reach-task-lwbtk
    ReportType: agent test report
    RoundDuration: 15.049239468s
    RoundNumber: 1
    RoundResult: succeed
    StartTimeStamp: "2023-09-21T11:30:18Z"
    TaskName: netreach.reach-task
    TaskType: NetReach
  ReportRoundNumber: 1
  RoundNumber: 1
  Status: Finished
  TaskName: reach-task
  TaskType: NetReach
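
The figures in the report can be cross-checked with a little arithmetic. The Python sketch below uses the values of the first `Detail` entry above and assumes that `TPS` is `RequestCounts` divided by `Duration`, which the reported numbers are consistent with. The helper names `parse_seconds` and `summarize` are hypothetical.

```python
def parse_seconds(duration: str) -> float:
    """Parse a seconds-only duration such as '15.004307799s' into a float."""
    if not duration.endswith("s"):
        raise ValueError(f"unsupported duration format: {duration!r}")
    return float(duration[:-1])


def summarize(metrics: dict) -> dict:
    """Derive throughput and success rate from one report entry."""
    seconds = parse_seconds(metrics["Duration"])
    return {
        "tps": metrics["RequestCounts"] / seconds,
        "success_rate": metrics["SuccessCounts"] / metrics["RequestCounts"],
    }


# Values copied from the first Detail entry of the report above.
metrics = {
    "Duration": "15.004307799s",
    "RequestCounts": 102,
    "SuccessCounts": 102,
    "TPS": 6.798047691796755,
}

summary = summarize(metrics)
print(summary)
```

The computed rate agrees with the reported `TPS`, and both the success rate of 1 and the mean delay of about 50 ms satisfy the task's `expect` section (`successRate: 1`, `meanAccessDelayInMs: 1500`).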

05 Summary

kdoctor is not meant to replace traditional, professional testing tools, nor to be a complete inspection solution. It aims to be a simple, fast, efficient, cloud-native O&M testing tool that fills functional gaps in current O&M testing, reduces the burden on O&M personnel, and feeds inspection results into the product's ecosystem.

DEV Community
