Repaving CockroachDB Nodes
"Repaving" nodes refers to the replacement of the underlying Linux/Unix nodes that are running CockroachDB -- usually in a cloud deployment. It is a procedure followed by some companies as a way to make sure that they're using the newest (and presumably most secure) versions of O/S images. It is thought to be easier than applying O/S patches. Because CockroachDB is a highly available and resilient distributed database, repaving nodes where CockroachDB is running is a relatively easy process.
Approaches to repaving
There are generally two approaches to repaving CockroachDB nodes:
- decommission nodes and add them back one-at-a-time
- stop each node, detach its block storage, re-attach that storage to a newly created node, and start CockroachDB again
There are pros and cons to each; let's examine them in more detail.
Decommissioning
Let's assume we have a five-node cluster and that we're using the default replication factor of three. This means each data range has three replicas, spread across three of the five nodes. We can decommission a node by running cockroach node decommission <nodeid>. When we do this, the node being decommissioned will start to move its replicas to other nodes within the cluster. This process could take several hours, depending on how much data is stored on each node. During this time, data streaming will occur, which increases CPU, memory, disk, and network usage on the nodes involved (both the node being decommissioned and any nodes -- likely all the others -- that receive its replicas). After the node has finished off-loading its replicas, we can add a new node. This new node will be recognized by the cluster and it will start receiving replicas as the cluster re-balances again. The nice thing about this approach is that the cluster never runs in an "under-replicated state": we always keep three replicas of every range. The overall process can take a while as you move through the nodes one by one.
It's worth noting that this decommissioning method doesn't work on a three-node cluster with RF=3, because there are no additional nodes available to accept new replicas. If you decommission a node in that topology, the replicas have nowhere to go and the decommission will not complete.
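As a rough sketch, one decommission cycle could look like the following, run from any machine that has the certs and the cockroach binary. The node ID (3 here) and the --host address are placeholders for your own values:
#start the decommission and wait for the node's replica count to drain to zero
cockroach node decommission 3 --certs-dir=certs --host=<any-live-node>
#optionally watch progress from another shell
cockroach node status --decommission --certs-dir=certs --host=<any-live-node>
#once the decommission completes, stop the cockroach process on the old host and start a brand-new node with the usual cockroach start ... --join=... command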
Stop and replace
The other common approach is to stop nodes and switch out the underlying hardware. One node at a time, you stop the cockroach process (e.g., kill -9 <cockroach pid>). When a node is stopped, the CockroachDB cluster considers it "suspect," and any ranges with replicas on that node are considered "under-replicated." An under-replicated state implies that quorum is still maintained, so we can still read and write to the range. CockroachDB will do nothing to remedy the under-replicated state for a certain period of time called the "time till store dead" window (the setting is server.time_until_store_dead and the default value is 5 minutes). If the node comes back up during this "time till dead" window, CockroachDB puts the node back into a healthy state by catching its ranges up to a fully consistent state, sending Raft logs from the leaseholder(s).
So, the idea is that we have the "time till dead" timeframe in which to detach the data device from the host, replace the host, re-attach the disk, and restart the CockroachDB process. It is common practice to temporarily extend the "time till dead" window to a more manageable timeframe like 15 or 30 minutes. The advantage of this approach is that no streaming of data replicas has to occur, so it's much quicker to cycle through the cluster nodes. The potential risk is that, while some ranges are in an under-replicated state, losing another node in the cluster (due to an unexpected outage, say) could cost you quorum on some ranges and therefore the ability to read and write to parts of the database.
This risk, however, can be mitigated. One mitigating factor is that this process is usually automated and therefore runs quickly. Another step you can take is to store your critical data with a replication factor (RF) of 5 rather than the default of 3. With RF=5, a quorum is 3 out of 5 (rather than 2 out of 3). In our repaving scenario, some ranges will be under-replicated at 4 out of 5, and to become unavailable they would have to lose 2 more replicas, which is unlikely.
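For example, assuming a hypothetical database named mydb, a zone configuration change like the one below raises its replication factor to 5; the same can be done per table or for the cluster's default zone. The host address is a placeholder for any live node:
#raise the replication factor for the (hypothetical) mydb database to 5
cockroach sql --certs-dir=certs --host=<any-live-node> \
-e "ALTER DATABASE mydb CONFIGURE ZONE USING num_replicas = 5;"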
My personal preference is to run in Production with RF=5 and to use the stop/detach/reattach method -- hopefully in a fully-tested and automated manner.
Example of the "stop/detach/reattach" method in EC2
To demonstrate this method, let's spin up four EC2 instances and build a three-node CockroachDB cluster in AWS. You can follow the general documentation for deploying CockroachDB in AWS EC2 -- docs here (https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-on-aws.html) -- with a few slight deviations. One deviation we'll make is to attach an additional EBS volume to each node to store all data and logs. We can still use the normally attached EBS volume to house the CockroachDB executable and TLS certificates.
I suggest you spin up four EC2 instances and, for each one, record the private IP, the public IP, the public hostname, the private hostname, and the identifiers of the attached volumes. All of this information is available from the AWS console, or via the AWS CLI as sketched after the list below. For instance:
Node name (for all): hatcher-repave-test
Availability zone: us-east-2a
N1
Public IP: 3.145.68.78
Public DNS: ec2-3-145-68-78.us-east-2.compute.amazonaws.com
Private IP: 172.31.11.189
Private DNS: ip-172-31-11-189.us-east-2.compute.internal
Data volume: vol-0821f7fb596be5f25 (100GB)
Root volume: vol-07c96d7b41612d9b7 (8GB)
N2
Public IP: 3.139.85.111
Public DNS: ec2-3-139-85-111.us-east-2.compute.amazonaws.com
Private IP: 172.31.8.166
Private DNS: ip-172-31-8-166.us-east-2.compute.internal
Data volume: vol-03da8d9790bc8e3ca (100GB)
Root volume: vol-00a17f674809e3f40 (8GB)
N3
Public IP: 3.145.180.186
Public DNS: ec2-3-145-180-186.us-east-2.compute.amazonaws.com
Private IP: 172.31.15.100
Private DNS: ip-172-31-15-100.us-east-2.compute.internal
Data volume: vol-0fda8d9790bc8e2ef (100GB)
Root volume: vol-0e3b89cebb10f84df (8GB)
N4 (Note: the fourth node won't need a separate data volume)
Public IP: 3.138.181.122
Public DNS: ec2-3-138-181-122.us-east-2.compute.amazonaws.com
Private IP: 172.31.13.143
Private DNS: ip-172-31-13-143.us-east-2.compute.internal
Volume: vol-0f2c700f0d9f9749b (100GB)
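If you'd rather gather this information with the AWS CLI than click through the console, something like the following should work, assuming the instances are tagged with the name used above; the instance ID in the second command is a placeholder:
#list instance IDs, IPs, and hostnames for the tagged instances
aws ec2 describe-instances --region us-east-2 \
--filters "Name=tag:Name,Values=hatcher-repave-test" \
--query "Reservations[].Instances[].[InstanceId,PublicIpAddress,PrivateIpAddress,PublicDnsName,PrivateDnsName]" \
--output table
#list the volumes attached to a given instance
aws ec2 describe-volumes --region us-east-2 \
--filters "Name=attachment.instance-id,Values=i-0123456789abcdef0" \
--query "Volumes[].[VolumeId,Size]" --output table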
Make sure that the EC2 instances have a network security group that allows inbound connections on TCP/26257 (SQL and intra-cluster traffic), TCP/8080 (DB Console), and TCP/22 (so you can SSH into each node).
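If you manage the security group with the AWS CLI, the ingress rules could be added along these lines; the security group ID and the source CIDR are placeholders, and you should scope the CIDR as tightly as your environment allows:
#open the SQL/intra-cluster port, the DB Console port, and SSH
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 26257 --cidr 10.0.0.0/8
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8080 --cidr 10.0.0.0/8
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 10.0.0.0/8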
Then, for each node, we need to run some Linux setup in order to make sure that our data drive is available for use:
#Check to see the name of the device and whether it has a file system installed
sudo lsblk -f
#it shouldn't have a file system at first, so let's make a file system (assuming the device is called /dev/xvdb)
sudo mkfs -t xfs /dev/xvdb
#verify there is a file system now
sudo lsblk -f
#make the /data directory so we can mount to it
sudo mkdir /data
#manually mount the volume
sudo mount /dev/xvdb /data
#backup the existing fstab file
sudo cp /etc/fstab /etc/fstab.orig
#get the UUID of the volume
sudo blkid
#using the UUID of the volume captured in the previous step, edit the fstab so the volume will auto-mount again after subsequent reboots
# add: UUID=b26331da-9d38-4b0a-af09-3c3808b8313e /data xfs defaults,nofail 0 2
sudo vim /etc/fstab
#verify that the fstab edit was done correctly by unmounting and remounting the volume
sudo umount /data
sudo mount -a
#verify that you see the drive
df -h
#change the owner to ec2-user
sudo chown ec2-user /data
#if you need to change the permission of the drive, do that here
#sudo chmod 400 /data
Now that we have our EBS-based data volume available, we need to create a node cert for each node. We can follow the instructions from the docs to create the CA (certificate authority) cert/key pair and the root user's client cert/key pair; we only need to create one of each for the whole cluster. But we want to create a separate node cert for each of the nodes. The easiest way to do this is to create the CA cert/key and the root client cert/key and copy those files to each of the nodes. In practice, you only need the ca.crt on the nodes (the ca.key should be kept in a safe place off the cluster), and you only need to copy the root client cert/key files to the locations from which you will be authenticating to the cluster. But for this exercise, I recommend copying all four files to each of the nodes.
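For reference, the one-time CA and root client steps from the docs look roughly like this, run on a machine you trust:
#create the directories, the CA cert/key pair, and the root user's client cert/key pair (one time only)
mkdir certs my-safe-directory
cockroach cert create-ca --certs-dir=certs --ca-key=my-safe-directory/ca.key
cockroach cert create-client root --certs-dir=certs --ca-key=my-safe-directory/ca.key
#then copy ca.crt (and, for this exercise, ca.key plus the client.root.* files) to each node, e.g. with scp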
You'll need to copy the cockroach executable to each of the nodes before running the next step -- this is described in the main doc.
Then, run commands like the following on each node to create the node cert/key pair files. If you plan to expose the nodes via an AWS-based load balancer, you can include the LB's IP and its hostname (if you made one) in each of these commands, too.
# node 1 command
cockroach cert create-node \
172.31.11.189 \
3.145.68.78 \
ip-172-31-11-189.us-east-2.compute.internal \
ec2-3-145-68-78.us-east-2.compute.amazonaws.com \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key
# node 2 command
cockroach cert create-node \
172.31.8.166 \
3.139.85.111 \
ip-172-31-8-166.us-east-2.compute.internal \
*.us-east-2.compute.internal \
*.us-east-2.compute.amazonaws.com \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key
# node 3 command
cockroach cert create-node \
172.31.15.100 \
3.145.180.186 \
ip-172-31-15-100.us-east-2.compute.internal \
ec2-3-145-180-186.us-east-2.compute.amazonaws.com \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key
# node 4 command
cockroach cert create-node \
172.31.13.143 \
3.138.181.122 \
ip-172-31-13-143.us-east-2.compute.internal \
ec2-3-138-181-122.us-east-2.compute.amazonaws.com \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key
We're going to start a three-node CockroachDB cluster (even though we have four nodes -- we'll use that fourth node later).
Start CockroachDB on nodes 1-3 by running the start command. Notice that we're explicitly specifying --store=/data so that CockroachDB will store all data and logs on our separately-mounted data volume (this is slightly different from the doc instructions).
#node 1 command
cockroach start \
--certs-dir=certs \
--advertise-addr=172.31.11.189 \
--join=172.31.11.189,172.31.8.166 \
--cache=.25 \
--max-sql-memory=.25 \
--store=/data \
--background
#node 2 command
cockroach start \
--certs-dir=certs \
--advertise-addr=172.31.8.166 \
--join=172.31.11.189,172.31.8.166 \
--cache=.25 \
--max-sql-memory=.25 \
--store=/data \
--background
#node 3 command
cockroach start \
--certs-dir=certs \
--advertise-addr=172.31.15.100 \
--join=172.31.11.189,172.31.8.166 \
--cache=.25 \
--max-sql-memory=.25 \
--store=/data \
--background
#node 4 command (don't run yet)
cockroach start \
--certs-dir=certs \
--advertise-addr=172.31.13.143 \
--join=172.31.11.189,172.31.8.166 \
--cache=.25 \
--max-sql-memory=.25 \
--store=/data \
--background
The docs will have you run an init command (from any one of the nodes), and then your CockroachDB cluster will be up and running. At this point, we could initialize one of the canned CockroachDB workloads to put some data into the cluster, or we could create a simple database and table and manually insert a few records.
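A minimal sketch of that step, assuming we run it against node 1 and use a throwaway database named repave_test (both are just examples):
#initialize the cluster (run once, from any node)
cockroach init --certs-dir=certs --host=172.31.11.189
#create a small table and insert a few rows so there is data to carry through the repave
cockroach sql --certs-dir=certs --host=172.31.11.189 \
-e "CREATE DATABASE IF NOT EXISTS repave_test; CREATE TABLE repave_test.t (id INT PRIMARY KEY, note STRING); INSERT INTO repave_test.t VALUES (1, 'hello'), (2, 'world');"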
Now that we are in a steady state, let's start the repaving part.
(1) Extend the "time till dead" window from the default of 5 minutes to 30 minutes. Execute these statements from a SQL shell connected to one of the running nodes (i.e., cockroach sql --certs-dir=certs --host=172.31.11.189):
set cluster setting server.time_until_store_dead to '30m0s';
show cluster setting server.time_until_store_dead;
(2) Verify that all the nodes are healthy and running. You can also verify that there are currently no under-replicated ranges via a SQL query (below) or in the DB Console's main page.
cockroach node status --certs-dir=certs --host=172.31.11.189
-- Under replicated (should be zero)
SELECT SUM((metrics->>'ranges.underreplicated')::DECIMAL)::INT8 AS ranges_underreplicated
FROM crdb_internal.kv_store_status S
INNER JOIN crdb_internal.gossip_liveness L ON S.node_id = L.node_id
WHERE L.decommissioning <> true;
(3) Stop N3
#kill all the processes running CRDB
killall -9 cockroach
#verify it's not running anymore
ps aux | grep cockroach
(4) Unmount the data drive on N3
sudo umount /data
(5) Detach the volume from N3 using the AWS console or the AWS CLI. Having the volume identifiers recorded is very helpful for this step and the next one.
(6) Attach the volume to N4 using the AWS console or the AWS CLI.
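With the AWS CLI, steps 5 and 6 might look like the following. The volume ID is N3's 100GB data volume from the list above; the instance ID for N4 and the device name are placeholders you'd take from your own environment:
#detach the data volume from N3 and wait for it to become available
aws ec2 detach-volume --volume-id vol-0fda8d9790bc8e2ef
aws ec2 wait volume-available --volume-ids vol-0fda8d9790bc8e2ef
#attach the data volume to N4 (placeholder instance ID and device name)
aws ec2 attach-volume --volume-id vol-0fda8d9790bc8e2ef --instance-id i-0123456789abcdef0 --device /dev/sdb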
(7) Mount the volume in N4.
#Check to see the name of the device and verify that it has a file system installed -- we expect it to already have a file system
sudo lsblk -f
#make the /data directory so we can mount to it
sudo mkdir /data
#manually mount the volume
sudo mount /dev/xvdb /data
#backup the existing fstab file
sudo cp /etc/fstab /etc/fstab.orig
#get the UUID of the volume
sudo blkid
#edit the fstab so we will get the mount again after rebooting
# add: UUID=b26331da-9d38-4b0a-af09-3c3808b8313e /data xfs defaults,nofail 0 2
sudo vim /etc/fstab
#verify that the fstab edit was done correctly by unmounting and remounting the volume
sudo umount /data
sudo mount -a
#verify that you see the drive
df -h
#verify that there is data in the volume
ls -alh /data
(8) Start N4
cockroach start \
--certs-dir=certs \
--advertise-addr=172.31.13.143 \
--join=172.31.11.189,172.31.8.166 \
--cache=.25 \
--max-sql-memory=.25 \
--store=/data \
--background
NOTE: CockroachDB does not require that the replacement node have the same IP as the node it replaces; in this example, they have different IPs. The node's data on the data drive is not tied to the node's IP address.
(9) Verify that all nodes are up, data is available, and we have no under-replicated ranges.
cockroach node status --certs-dir=certs --host=172.31.13.143
-- Under replicated (should be zero)
SELECT SUM((metrics->>'ranges.underreplicated')::DECIMAL)::INT8 AS ranges_underreplicated
FROM crdb_internal.kv_store_status S
INNER JOIN crdb_internal.gossip_liveness L ON S.node_id = L.node_id
WHERE L.decommissioning <> true;
(10) Reset the "time till dead" setting.
set cluster setting server.time_until_store_dead to default;
(11) You can now terminate N3 and delete its root EBS volume (not the data volume, which is now attached to N4) via the AWS console.
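With the AWS CLI, that cleanup might look like this; the instance ID is a placeholder for N3's actual ID, and if the root volume was created with delete-on-termination it is removed automatically:
#terminate the old N3 instance (placeholder instance ID)
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
#if N3's 8GB root volume was not set to delete on termination, remove it explicitly
aws ec2 delete-volume --volume-id vol-0e3b89cebb10f84df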
Summary
Repaving nodes is a highly automatable activity that can be used to satisfy security requirements -- especially in security-sensitive, cloud-based organizations. CockroachDB tolerates this type of operation well because of its distributed, resilient, and highly available design.