DEV Community

leo

Introduction to active/standby switching of openGauss database nodes

Instance active/standby switchover
Scenarios
While openGauss is running, the database administrator may need to manually switch a database node between the primary and standby roles. For example, after a failover has occurred, the original primary and standby roles may need to be restored, or a manual switchover may be needed when a hardware fault is suspected. A cascaded standby cannot be promoted directly to primary. It must first become a standby through switchover or failover, and only then can it be switched to primary.

Note:

An active/standby switchover is a maintenance operation. Make sure openGauss is in a normal state and that all services have completed before you perform it.
When extreme RTO is enabled, cascaded standby nodes are not supported. A standby with extreme RTO enabled does not accept connections, so it cannot synchronize data to a cascaded standby.
After a switchover involving a cascaded standby, the synchronous_standby_names parameter on the primary is not adjusted automatically. You may need to adjust it manually; otherwise write operations on the primary may block.
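As a rough illustration of the manual adjustment mentioned in the last note, the sketch below only prints the commands one might use to inspect and reload synchronous_standby_names. It assumes the gsql client and the gs_guc reload subcommand are available to the omm user. The function name print_sync_standby_commands, the port 5432, and the value dn_6002 are placeholders invented for this example, not values from a real cluster.

```shell
#!/bin/sh
# print_sync_standby_commands DATADIR PORT NEW_VALUE
# Prints (does not run) commands to inspect and then reload
# synchronous_standby_names on the primary.
# Assumption: gsql and gs_guc are on PATH for the omm user.
print_sync_standby_commands() {
    datadir="$1"
    port="$2"
    value="$3"
    # Show the current setting before changing anything.
    echo "gsql -d postgres -p $port -c \"SHOW synchronous_standby_names;\""
    # Reload the parameter with the new value.
    echo "gs_guc reload -D $datadir -c \"synchronous_standby_names='$value'\""
}

# 'dn_6002' is a hypothetical standby name used only for illustration.
print_sync_standby_commands /home/omm/cluster/dn1/ 5432 'dn_6002'
```

Printing rather than executing keeps the change under manual review, which seems appropriate for a parameter that can block writes when set incorrectly.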
Steps
Log in to any node of the database as the operating system user omm, and run the following command to check the active and standby status.

gs_om -t status --detail
Log in to the standby node to be switched to the active node as the operating system user omm, and run the following commands.

gs_ctl switchover -D /home/omm/cluster/dn1/
/home/omm/cluster/dn1/ is the data directory of the standby database node.

Note: A new switchover cannot be started for the same database until the previous one has completed. If a switchover is initiated while services are running, threads on the primary may not stop in time and the switchover may be reported as timed out, although it is still running in the background. It completes once those threads stop. For example, while the primary is dropping a large partitioned table, it may not respond to the switchover signal.

If the primary node fails, run the following command on the standby node.

gs_ctl failover -D /home/omm/cluster/dn1/
After the switchover or failover succeeds, run the following command to record the current primary and standby information.

gs_om -t refreshconf
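The steps above can be strung together in a small wrapper. This is a minimal sketch, not an official tool: it assumes it is run as omm on the standby node that is to become primary, and the variables DATADIR, ACTION, and DRY_RUN, as well as the helper names run and switchover_sequence, are invented for this example. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the manual sequence: check status, switch over (or fail over),
# record the new roles, then check status again.
# DATADIR, ACTION and DRY_RUN are illustrative variables, not openGauss settings.
DATADIR="${DATADIR:-/home/omm/cluster/dn1/}"
ACTION="${ACTION:-switchover}"   # use "failover" when the primary has failed
DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only, 0 = execute them

run() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

switchover_sequence() {
    run gs_om -t status --detail || return 1
    run gs_ctl "$ACTION" -D "$DATADIR" || return 1
    run gs_om -t refreshconf || return 1
    run gs_om -t status --detail
}

switchover_sequence
```

Setting DRY_RUN=0 runs the commands for real and stops at the first one that fails, so refreshconf is not run after a failed switchover.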
Example
Switch the standby instance of the database node to the primary instance.

Query the database status.

gs_om -t status --detail

[ Cluster State ]

cluster_state : Normal
redistributing : No
current_az : AZ_ALL

[ Datanode State ]

node                 node_ip          port   instance                       state
1  pekpopgsci00235   10.244.62.204    5432   6001 /home/wuqw/cluster/dn1/   P Primary Normal
2  pekpopgsci00238   10.244.61.81     5432   6002 /home/wuqw/cluster/dn1/   S Standby Normal
Log in to the standby node and perform the active/standby switchover. If the switchover is run on a cascaded standby, the cascaded standby becomes a standby and the original standby is demoted to a cascaded standby.

gs_ctl switchover -D /home/wuqw/cluster/dn1/
[2020-06-17 14:28:01.730][24438][][gs_ctl]: gs_ctl switchover ,datadir is -D "/home/wuqw/cluster/dn1"
[2020-06-17 14:28:01.730][24438][][gs_ctl]: switchover term (1)
[2020-06-17 14:28:01.768][24438][][gs_ctl]: waiting for server to switchover............
[2020-06-17 14:28:11.175][24438][][gs_ctl]: done
[2020-06-17 14:28:11.175][24438][][gs_ctl]: switchover completed (/home/wuqw/cluster/dn1)
Record the current primary and standby information.

gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.
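To confirm the roles before and after a switchover like the one in this example, the status output can be reduced to name and role pairs. The sketch below assumes the row layout shown in the sample output (a numeric index, the node name, and a role word such as Primary or Standby as its own field); the function name list_roles is made up for illustration, and cascaded standby rows are not treated specially.

```shell
#!/bin/sh
# list_roles: read `gs_om -t status --detail` text on stdin and print
# "<node_name> <role>" for each datanode row.
# Assumption: a datanode row starts with a numeric index, the node name is
# the second field, and the role appears later as the word Primary or Standby.
list_roles() {
    awk '$1 ~ /^[0-9]+$/ {
        for (i = 3; i <= NF; i++)
            if ($i == "Primary" || $i == "Standby") {
                print $2, $i
                break
            }
    }'
}

# Usage on a live cluster (run as omm):
#   gs_om -t status --detail | list_roles
```

Running it once before and once after the switchover and comparing the two results gives a quick check that the roles actually changed.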
Troubleshooting
If a failure occurs during the switchover process, troubleshoot the error based on the log information in the log file. For details, see Log Reference.

Exception handling
Handle exceptions according to the following criteria:

Under heavy service load, a switchover can take a long time. No action is required.

While another standby is being built, the primary must send logs to that standby before it can be demoted, so the switchover takes longer. No action is required, but avoid performing a switchover during a build.

If the connection between the primary and standby is lost during a switchover (for example, because of a network fault or a full disk) and both instances end up as primary, follow the steps below to recover.

Warning: Once a dual-primary state occurs, follow the steps below to restore the normal primary/standby state. Otherwise, data may be lost.

Execute the following command to query the current instance status of the database.

gs_om -t status --detail
If the query result shows both instances as Primary, the state is abnormal.

Decide which node is to be demoted to standby, and run the following command on that node to stop the service.

gs_ctl stop -D /home/omm/cluster/dn1/
Run the following command to start the standby node in standby mode.

gs_ctl start -D /home/omm/cluster/dn1/ -M standby
Record the current primary and standby information.

gs_om -t refreshconf
Check the database status to confirm the recovery of the instance status.
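The dual-primary check and the recovery steps above can be sketched as two small helpers. This is only an illustration under assumptions: datanode rows are assumed to start with a numeric index and to carry the word Primary as a separate field, as in the sample status output, and the function names count_primaries and print_demote_commands are invented here. The second helper prints the commands rather than running them, because choosing which node to demote is a manual decision.

```shell
#!/bin/sh
# count_primaries: read `gs_om -t status --detail` text on stdin and print
# how many datanode rows report the Primary role.
# Assumption: rows start with a numeric index and contain "Primary" as a field.
count_primaries() {
    awk '$1 ~ /^[0-9]+$/ {
        for (i = 3; i <= NF; i++)
            if ($i == "Primary") { n++; break }
    }
    END { print n + 0 }'
}

# print_demote_commands DATADIR
# Prints the recovery commands for the node chosen to become the standby.
# Nothing is executed; review the commands and run them on that node as omm.
print_demote_commands() {
    datadir="$1"
    echo "gs_ctl stop -D $datadir"
    echo "gs_ctl start -D $datadir -M standby"
    echo "gs_om -t refreshconf"
    echo "gs_om -t status --detail"
}

# Usage on a live cluster:
#   n=$(gs_om -t status --detail | count_primaries)
#   [ "$n" -gt 1 ] && print_demote_commands /home/omm/cluster/dn1/
```

A count greater than 1 corresponds to the abnormal state described in the first step.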
