Jim Hatcher

Running Multi-region CockroachDB on k8s -- the internals

If you are familiar with CockroachDB (CRDB), you may have heard that CRDB runs well on Kubernetes (k8s). You may have also heard that CRDB can be deployed in either a single-region or multi-region configuration.

The Cockroach Labs docs site has a great guide for deploying CRDB on k8s in a multi-region environment.

If you follow the guide, you'll have a working multi-region cluster, but you may not understand what exactly you've set up!

That guide has a section that explains some of the differences you'll see between a single-region k8s deployment and the multi-region k8s deployment it describes.

In this blog, I want to explain things in a little more detail. Understanding these fundamentals can be really helpful when you're troubleshooting a cluster that isn't acting quite right.

You gotta use the manifests!

When you're deploying CRDB in a single region, there are three deployment options:

  1. k8s operator
  2. Helm Chart
  3. Manual manifests (i.e., yamls)

However, when you go the multi-region route, the only available option is the manual manifests.

This is not really a CRDB limitation. There is not a strong consensus in the k8s community about how federated k8s clusters should work, so we have our own way of doing it that doesn't really use any k8s primitives to provide coordination across the various clusters.

One CockroachDB cluster, 3 Kubernetes Clusters

So...when you create a multi-region CRDB cluster (let's say 3 regions for the sake of this illustration), you run the CRDB resources on 3 separate k8s clusters, and these 3 k8s clusters know nothing about each other.

Multi-Region CRDB Deployment on k8s

From the CRDB perspective, it is one single, logical cluster; the CRDB nodes don't really know that they're running on k8s -- they just know that there are various CRDB nodes, and as long as there is network connectivity between these nodes, everything will interact as designed.

Cluster Maintenance

There are some implications to this. One implication is that if you want to make changes to your cluster across all three regions, you'll need to interact with the k8s control planes in all three k8s clusters.

If you look at the guide and find the section called "Maintain the Cluster", you'll see steps for scaling, upgrading, and stopping the cluster. Any of these steps requires taking action on all 3 k8s control planes; there is no single control plane that can handle these actions for you. For example, scaling means repeating the same command against each cluster, as sketched below.
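As a minimal sketch (the context and namespace names are placeholders, not values from the guide -- substitute whatever kubectl config get-contexts shows for your three clusters), scaling each region's StatefulSet from 3 to 4 CRDB nodes looks like this:

# Repeat the same scale operation once per k8s cluster (context); names below are placeholders
kubectl scale statefulset cockroachdb --replicas=4 --context=<context-region-1> --namespace=<namespace-region-1>
kubectl scale statefulset cockroachdb --replicas=4 --context=<context-region-2> --namespace=<namespace-region-2>
kubectl scale statefulset cockroachdb --replicas=4 --context=<context-region-3> --namespace=<namespace-region-3>

The same pattern applies to upgrades and shutdowns: whatever you do, you do it three times, once per k8s control plane.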

Network Planning

Another implication is that you should not create overlapping service or pod networks in your k8s clusters.

A k8s cluster has a service network with a pool of available IP addresses which it assigns to newly-created services (things like load balancers). It also has a pod network with a pool of available IP addresses which it assigns to newly-created pods.

When you create a k8s-based deployment of CRDB, a StatefulSet is used. The StatefulSet has a configured number of replicas (the default in the guide is 3). Each of these replicas is created as a pod and gets assigned an IP address from the pod network IP range.

We also create one headless service (which is used internally and doesn't get a cluster IP) and one non-headless service (which gets an IP assigned from the service network IP range). You can see these IPs by running kubectl get pods -o wide and kubectl get services, respectively.
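For instance (assuming the default cockroachdb StatefulSet name from the guide; the namespace is a placeholder for whichever one you deployed into), you can inspect the assigned addresses like this:

# Pod IPs are assigned from the cluster's pod network CIDR
kubectl get pods -o wide --namespace=<namespace>
# The headless service shows no cluster IP; the public service gets one from the service network CIDR
kubectl get services --namespace=<namespace>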

Below is a diagram which shows the network connectivity needs of every CRDB node (running as a k8s pod).
CRDB Node Network Connectivity

You can use a k8s load balancer (running on the service network) to handle incoming SQL and DB Console traffic. CRDB node-to-node connectivity, however, requires that every node (in any region) be able to connect to every other node directly -- and in a k8s deployment, that means every k8s pod running CRDB has to be able to talk directly to every other k8s pod running CRDB.

Our docs give basic examples of creating the various k8s clusters. For instance, in the GKE path, the following commands are given:

gcloud container clusters create cockroachdb1 --zone=$GCE_ZONE1
gcloud container clusters create cockroachdb2 --zone=$GCE_ZONE2
gcloud container clusters create cockroachdb3 --zone=$GCE_ZONE3

In a simple demo deployment, these commands are fine. But for more complex deployments (certainly production deployments), it's worth thinking through whether these clusters will end up having overlapping service and pod networks.

I personally like to use a more explicit command for creating the cluster. For instance, for the GKE path, I would issue the following:

gcloud container clusters create cockroachdb1 \
  --zone=$GCE_ZONE1 --machine-type=$MACHINETYPE --num-nodes=3 \
  --cluster-ipv4-cidr=10.1.0.0/16 \
  --services-ipv4-cidr=10.101.0.0/16
gcloud container clusters create cockroachdb2 \
  --zone=$GCE_ZONE2 --machine-type=$MACHINETYPE --num-nodes=3 \
  --cluster-ipv4-cidr=10.2.0.0/16 \
  --services-ipv4-cidr=10.102.0.0/16
gcloud container clusters create cockroachdb3 \
  --zone=$GCE_ZONE3 --machine-type=$MACHINETYPE --num-nodes=3 \
  --cluster-ipv4-cidr=10.3.0.0/16 \
  --services-ipv4-cidr=10.103.0.0/16

Notice that I'm explicitly giving each cluster a unique cluster (i.e., pod) CIDR block and services CIDR block. I'm also being explicit about the machine type and number of nodes just because it avoids ambiguity.

Note: Because CRDB wants to have this direct pod-to-pod connectivity, we don't like to run CRDB on service meshes (like Istio). Service meshes tend to rely on hiding the pod details and having everything connect through the services layer, which works great for stateless services, but not so great for highly-stateful databases with node-level replication requirements.

DNS Resolution

k8s pods within a given k8s cluster are able to talk to each other by hostname because DNS resolution is handled within k8s. For example, if pod0 needs to talk to pod1, pod0 references pod1's hostname, which gets resolved by the DNS service (either CoreDNS or kube-dns, depending on your flavor of k8s). The DNS service returns the IP address mapped to the hostname, and then pod0 talks to pod1 directly via this IP address.
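You can see this in-cluster resolution with a quick throwaway test pod (a hypothetical check -- the cockroachdb pod and headless service names below assume the defaults from the guide's manifests):

# Resolve a CRDB pod's hostname from inside the same k8s cluster and namespace
kubectl run dns-test --rm -it --image=busybox --restart=Never --namespace=<namespace> -- \
  nslookup cockroachdb-1.cockroachdb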

Since our CRDB cluster involves three k8s clusters, this DNS system breaks down a little bit. It works fine for pods within the same cluster, but pods in different clusters can't find each other out of the box.

To remedy this, we set up a few things in our multi-region implementation:

  1. we create a load balancer in each k8s cluster that exposes that cluster's DNS service outside of the cluster.
  2. in each k8s cluster, we create a ConfigMap that tells that cluster's DNS service how to forward DNS requests to the other clusters' DNS services (via the aforementioned load balancers), based on the format of the hostname.

If you want to see these, look in the "kube-system" namespace and run kubectl get configmap and kubectl get svc.
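Concretely (a sketch -- the ConfigMap name varies by distribution, typically kube-dns on GKE and coredns elsewhere, so adjust accordingly), you can inspect what was set up:

# The forwarding rules live in the DNS service's ConfigMap
kubectl get configmap --namespace=kube-system
kubectl get configmap kube-dns --namespace=kube-system -o yaml
# The internal load balancer that exposes this cluster's DNS service to the other clusters
kubectl get svc --namespace=kube-system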

Using this system of "DNS forwarding", each pod can now resolve the name of every other pod across all the clusters. As long as the pod networks are non-overlapping and routable, we should be able to operate.

Let me walk through an example of this, just to make sure it's clear:

  1. pod1 in region1 wants to connect to pod2 in region2
  2. pod1 reaches out to its DNS service and asks it to resolve the name "pod2.region2". The DNS service is not able to resolve that name directly, but there is a forwarding rule that says for any hostname of the format "*.region2", talk to another DNS server at address 10.10.10.10.
  3. The 10.10.10.10 address is a load balancer that is exposing the DNS service of region2's k8s cluster. The region1 DNS server talks to region2's DNS server via this link and resolves pod2's hostname to 10.2.0.2.
  4. pod1 now knows the IP address of pod2.region2 and creates a connection to 10.2.0.2 directly.
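To confirm this chain end-to-end, you can run a test like the in-cluster one above, but across clusters (again hypothetical -- the context, namespace, and pod names are placeholders for whatever your deployment uses):

# From region 1, resolve a region-2 CRDB pod by its fully-qualified name
kubectl run dns-test --rm -it --image=busybox --restart=Never \
  --context=<context-region-1> --namespace=<namespace-region-1> -- \
  nslookup cockroachdb-0.cockroachdb.<namespace-region-2>.svc.cluster.local

If the lookup returns an IP from the other cluster's pod network, DNS forwarding is working.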

If you're having trouble getting the pods to communicate across regions, there are a few things you can look at for troubleshooting:

  1. You can look at the logs of the CRDB pods and find error messages about connecting to other nodes. Find the logs by running kubectl logs cockroachdb-0 -c cockroach -n <namespace name here>. Sometimes the error message will help you understand whether name resolution has occurred. For instance, if you see an error like: addrConn.createTransport failed to connect to {cockroachdb-2.region2:}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.2.2.2: connect: connection refused", then you can tell that the DNS resolution has happened correctly (since both the hostname and IP address are listed in the error). In this case, you don't have a DNS resolution problem but some other network connectivity issue.
  2. You can also look at the logs of the DNS pods. Run kubectl get pods -n kube-system to find the name of one of the DNS pods, then run kubectl logs <name of dns pod> -n kube-system and look at the log messages.
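If DNS resolves fine but connections still fail, it can also help to test raw TCP connectivity to the remote pod's IP on the CRDB inter-node port. This is a hypothetical check using the busybox image's telnet applet; the IP below is just the one from the example error above:

# "Connected to" means the network path is open; a refusal or timeout points to a firewall/routing
# problem or to the remote CRDB process not listening
kubectl run net-test --rm -it --image=busybox --restart=Never --namespace=<namespace> -- \
  telnet 10.2.2.2 26257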

Network Firewalls

You'll want to make sure that you have firewalls opened between the clusters on port 53 (used for DNS -- depending on your setup, both UDP and TCP) and also TCP/26257 (the port used for CRDB node-to-node communication). Depending on your k8s environment (GKE, EKS, etc.), these settings can be controlled in one or more places. For instance, in EKS (i.e., Amazon), there are security groups associated with the cluster and its nodes that have firewall settings. There is also a NACL (Network Access Control List) on each private subnet that has firewall settings.
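To make this concrete for the GKE path (an illustrative sketch only -- the rule name, network name, and source ranges are placeholders; the CIDRs match the pod networks chosen earlier), a firewall rule might look like:

# Allow DNS (53) and CRDB inter-node traffic (26257) from the three clusters' pod networks
gcloud compute firewall-rules create allow-crdb-multiregion \
  --network=default \
  --allow=tcp:53,udp:53,tcp:26257 \
  --source-ranges=10.1.0.0/16,10.2.0.0/16,10.3.0.0/16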

Summary

CockroachDB and k8s are complementary. When running across multiple k8s regions, there is a little extra work (mostly done by the python script described in our docs) that sets up DNS forwarding and allows k8s pods in different k8s clusters to communicate. This DNS setup, in conjunction with smart network planning, is a good solution for running CRDB in multi-region k8s.
