DEV Community

leo

openGauss provides a remote disaster recovery solution based on streaming replication

Disaster Recovery across Regions with Three Centers in Two Regions
Cross-region disaster recovery requires two sets of database instances: a primary database instance and a disaster recovery database instance, generally deployed in two cities that are far apart. Full and incremental data synchronization between the database instances can be implemented directly, with or without relying on storage media. If the primary database instance (that is, the production database instance) suffers a regional failure and its data cannot be recovered at all, consider promoting the disaster recovery database instance to primary so that it takes over the business.

openGauss currently provides a remote disaster recovery solution based on streaming replication.

Remote Disaster Recovery Solution Based on Streaming Replication
Overview
Starting from version 3.1.0, openGauss provides this solution for cross-region disaster recovery with three centers in two regions.

Specifications and Constraints
This section describes the feature specifications and constraints of the solution in detail. Administrators should pay close attention to them.

Specifications
Network latency must be <=10 milliseconds within the primary database instance and within the disaster recovery database instance, and <=100 milliseconds across regions between the primary and disaster recovery database instances. Disaster recovery operates normally only within these limits; beyond them, the link between the primary and disaster recovery database instances will be disconnected.
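
As a rough illustration, the two latency limits above can be turned into a simple check. This is only a sketch: the function name and the idea of feeding it measured round-trip times are assumptions for illustration, not part of openGauss.

```python
# Limits taken from the specification above, in milliseconds.
INTRA_INSTANCE_MAX_MS = 10.0
CROSS_REGION_MAX_MS = 100.0


def latency_violations(intra_ms, cross_ms):
    """Return a list of violations; an empty list means both limits are met.

    intra_ms: measured latency inside one database instance (hypothetical input)
    cross_ms: measured latency between the primary and disaster recovery instances
    """
    problems = []
    if intra_ms > INTRA_INSTANCE_MAX_MS:
        problems.append(
            f"intra-instance latency {intra_ms} ms exceeds {INTRA_INSTANCE_MAX_MS} ms"
        )
    if cross_ms > CROSS_REGION_MAX_MS:
        problems.append(
            f"cross-region latency {cross_ms} ms exceeds {CROSS_REGION_MAX_MS} ms"
        )
    return problems


if __name__ == "__main__":
    for problem in latency_violations(intra_ms=4.2, cross_ms=130.0):
        print(problem)
```

How the latencies are measured (ping, application probes, and so on) is left open here; the point is only that both limits have to hold at the same time.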

Provided that network bandwidth is not a bottleneck and parallel replay is enabled on the disaster recovery database instance, the following table shows the log generation rate of the primary database instance that each hardware specification can support. RPO and RTO are guaranteed only at or below these rates.

Table 1 Log generation rate under typical configurations

Typical configuration      Supported log generation rate of the primary database instance
96U/768G/SATA SSD          <=10MB/s
128U/2T/NVMe SSD           <=40MB/s

If disks of different types are mixed, the specifications of the lowest configuration apply (for example, if a database instance contains both NVMe and SATA disks, refer to the specifications for the SATA disk configuration).
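
The mixed-disk rule can be sketched as a lookup that takes the minimum over all disk types present. Note the simplification: Table 1 ties each rate to a full hardware configuration (CPU, memory, disk), while this sketch keys the rates by disk type only, which is an assumption made for brevity. The function names are hypothetical.

```python
# Supported primary log generation rate in MB/s, from Table 1.
# Keyed by disk type only; the CPU and memory columns are omitted (assumption).
SUPPORTED_LOG_RATE_MBPS = {
    "SATA SSD": 10,
    "NVMe SSD": 40,
}


def supported_log_rate(disk_types):
    """Return the supported rate for an instance using the given disk types.

    A mixed deployment is bounded by its lowest-rated disk type.
    """
    if not disk_types:
        raise ValueError("at least one disk type is required")
    return min(SUPPORTED_LOG_RATE_MBPS[d] for d in disk_types)


def within_spec(observed_mbps, disk_types):
    """True if the observed log generation rate stays inside the supported rate."""
    return observed_mbps <= supported_log_rate(disk_types)


if __name__ == "__main__":
    print(supported_log_rate(["NVMe SSD", "SATA SSD"]))
    print(within_spec(35, ["NVMe SSD"]))
```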

Promoting the disaster recovery database instance to primary may lose a limited amount of data, with RPO<=10 seconds. When the disaster recovery database instance is in the Normal state, the RTO for promotion is less than 10 minutes; when it is in the Degraded state, the RTO for promotion is generally within 20 minutes.

Drill feature: a planned switchover between the primary and disaster recovery database instances loses no data (RPO=0) and has an RTO<=20 minutes, which covers both demoting the primary database instance to a disaster recovery instance and promoting the disaster recovery database instance to primary.
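
For drills, it can help to write these targets down as data and compare measured results against them. The sketch below does that; the scenario names and the helper function are hypothetical, and the "less than 10 minutes" and "generally within 20 minutes" figures are treated as simple upper bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTarget:
    rpo_seconds: int  # maximum tolerated data loss
    rto_minutes: int  # maximum tolerated recovery time


# Targets as stated in the specifications above; scenario names are made up here.
TARGETS = {
    "failover_normal": RecoveryTarget(rpo_seconds=10, rto_minutes=10),
    "failover_degraded": RecoveryTarget(rpo_seconds=10, rto_minutes=20),
    "planned_switchover": RecoveryTarget(rpo_seconds=0, rto_minutes=20),
}


def meets_target(scenario, measured_rpo_seconds, measured_rto_minutes):
    """Compare measured drill results with the documented target for a scenario."""
    target = TARGETS[scenario]
    return (
        measured_rpo_seconds <= target.rpo_seconds
        and measured_rto_minutes <= target.rto_minutes
    )


if __name__ == "__main__":
    print(meets_target("planned_switchover", 0, 15))
    print(meets_target("failover_normal", 8, 12))
```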

Note: In testing, the maximum write speed of a SATA SSD is about 240MB/s, a SAS SSD can exceed 500MB/s, and an NVMe SSD performs even better. If the hardware does not meet these standards, the per-shard log generation rate of the primary database instance must be reduced to guarantee RPO and RTO.

When resources such as file handles and memory are exhausted on the primary or disaster recovery database instance, RPO and RTO cannot be guaranteed.

Feature Constraints
Before a disaster recovery relationship is set up, a disaster recovery user with streaming replication permission must be created on the primary cluster for disaster recovery authentication. The primary and disaster recovery clusters must use the same disaster recovery user name and password. Once the disaster recovery relationship is established, the user password cannot be changed. To change the disaster recovery user name or password, cancel the disaster recovery relationship and set it up again with a new disaster recovery user. The disaster recovery user password cannot contain any of the following characters: | ; & $ < > ` ' " { } ( ) [ ] ~ * ? ! or newline or blank (whitespace) characters.
The primary and disaster recovery clusters must run the same version.
Streaming disaster recovery cannot be set up on clusters that already contain a first standby or a cascaded standby.
When a disaster recovery relationship is set up and the number of cluster replicas is <= 2, most_available_sync is set to on. This parameter does not return to its initial value after disaster recovery is canceled or after a failover, so the cluster remains in maximum availability mode.
When a disaster recovery relationship is set up, synchronous_commit is set to on. The initial value is restored when disaster recovery is canceled or when the disaster recovery cluster is promoted to primary through failover.
The disaster recovery cluster is read-only.
After the disaster recovery cluster is promoted to primary through the failover command, its disaster recovery relationship with the original primary cluster becomes invalid and must be re-established.
Disaster recovery can be set up when both the primary database instance and the disaster recovery database instance are in the Normal state. Disaster recovery can be released on the primary database instance when the primary database instance is in the Normal state and the disaster recovery database instance has already been promoted to primary. Other database instance states are not supported for these operations. When both instances are in the Normal state, the planned switchover command can switch the primary database instance to a disaster recovery database instance and the disaster recovery database instance to a primary database instance. A disaster recovery database instance that is in neither the Normal nor the Degraded state cannot be promoted to primary and cannot continue to provide disaster recovery services; it must be manually repaired or rebuilt.
If the majority of DNs in the disaster recovery cluster fail, or all CMS and DN instances fail, disaster recovery cannot be started, and the disaster recovery cluster can neither be promoted to primary nor continue to serve as a disaster recovery cluster. It must be rebuilt.
If a forced switchover has been performed on the primary cluster, the disaster recovery cluster must be rebuilt.
Both the primary cluster and the disaster recovery cluster support full and incremental backup with the gs_probackup tool. While the disaster recovery relationship is in place, neither cluster can be restored. To restore the primary database instance, first cancel the disaster recovery relationship, then re-establish it after the backup and restore are complete.
After the disaster recovery relationship is established, DN instance ports cannot be modified.
GUC parameters are not synchronized between a primary database instance and a disaster recovery database instance that have established a disaster recovery relationship.
The primary and disaster recovery clusters do not support node replacement, node repair, adding or removing replicas, or DCF mode.
When the disaster recovery database instance has two replicas and one of them is damaged, it can still be promoted to primary to provide services. If the remaining replica is also damaged, data loss is unavoidable.
In the disaster recovery state, only grayscale upgrade is supported, and the original upgrade constraints still apply. The upgrade must follow this order: upgrade the primary cluster, upgrade the disaster recovery cluster, commit the disaster recovery cluster, then commit the primary cluster.
When selecting IP addresses for streaming disaster recovery and streaming replication, separate the intra-cluster network plane from the cross-cluster network plane as far as possible to distribute load and improve security.
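
Two of the constraints above, the password character rule and the matching version rule, are easy to check before attempting to set up disaster recovery. The following is a minimal pre-flight sketch, not an openGauss tool: the function names are hypothetical, and reading "\n" as a newline and "Blank" as any whitespace character is an assumption about the original wording.

```python
# Characters the constraint lists as forbidden in the disaster recovery password.
FORBIDDEN_CHARS = set("|;&$<>`'\"{}()[]~*?!")


def forbidden_in_password(password):
    """Return the sorted list of forbidden characters found in the password.

    Whitespace (including newlines) is also rejected, which is an
    interpretation of the "\\n" and "Blank" entries in the source.
    """
    return sorted({c for c in password if c in FORBIDDEN_CHARS or c.isspace()})


def preflight_problems(primary_version, dr_version, password):
    """Collect violations of the version and password constraints."""
    problems = []
    if primary_version != dr_version:
        problems.append(
            f"version mismatch: primary {primary_version}, disaster recovery {dr_version}"
        )
    bad = forbidden_in_password(password)
    if bad:
        problems.append(f"password contains forbidden characters: {bad!r}")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems("3.1.0", "3.1.1", "pa$$ word!"):
        print(problem)
```

The other constraints (instance states, replica counts, existing standbys) depend on live cluster status and are not covered by this sketch.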
