In my previous blog, I wrote about various cluster mechanisms for high availability. While the High Availability mechanisms in the landscape provide very aggressive RTO and RPO and protect against single-host and single-zone failures, a disaster recovery strategy provides protection against regional failures. This is important for enterprises that run their business globally and need to operate business processes and run SAP workloads even when a single region is unavailable.
As more components get added to a customer's setup for HA and DR purposes, it becomes important to first evaluate the cost and the purpose of disaster recovery. Usually, landscapes are a mix of critical business systems and non-critical supporting environments. Typically, SAP Business Suite systems such as ECC, CRM and SCM are the critical ones, while Solution Manager, BW, Business Objects, Portal and other applications are non-critical. As the probability of a whole region failing is very small, though not zero, critical applications should be covered by a robust DR strategy, while non-critical applications can rely on a backup/restore mechanism. Below are four key mechanisms by which Disaster Recovery can be achieved in AWS:
- Warm Disaster Recovery
- Pilot Light Disaster Recovery
- AWS DRS - AWS Elastic Disaster Recovery - Storage level replication
- Backup/Restore - Storing the native database backups as well as the EC2 instance backups across regions
Let us take a deeper look at each of them and understand the types of SAP systems for which they are a good fit.
Warm Disaster Recovery
The term "warm" refers to a database or a server that is pre-populated with data. In this design, the Disaster Recovery region (or zone, in case of zonal DR) has a fully configured EC2 instance that continuously receives changes from the primary EC2 instance asynchronously, using the database's own replication mechanism. In the case of SAP HANA, this is HANA System Replication (HSR). Since the design includes a secondary EC2 instance of the same size as the primary, its cost is the highest among the available options. The benefit is a very short RTO and a short RPO. The SAP application servers and ASCS/ERS instances hold no data that changes, so the design only requires EC2 instances that can run them on demand. These instances can be kept shut down to save some of the cost and are started when a DR event or a DR test happens.
Pilot Light Disaster Recovery
The term pilot light comes from gas burners, where a small flame is kept permanently alight so that it can ignite the larger burner; the small flame is called the pilot light. Using the same analogy, for a HANA database or any other supported SAP database, a minimally sized EC2 instance is deployed in the DR region to keep replication running. This instance is up to date with the database changes, but it cannot take over production until it is stopped, resized to the same configuration as the actual production instance, and started again. For SAP application servers, the design follows the same strategy as the warm standby. If further cost savings are required, the storage cost of the application layer can also be avoided by deploying the application servers only when a DR event or a DR test happens. To achieve this, either AWS DRS (discussed further below) or scripts based on AWS CloudFormation, Terraform or AWS Launch Wizard can be used.
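The resize step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a complete failover procedure: the helper name `scale_up_pilot_light` is hypothetical, `ec2` is assumed to behave like a boto3 EC2 client, and database-level steps such as the HSR takeover are deliberately left out.

```python
def scale_up_pilot_light(ec2, instance_id, target_type):
    """Stop a pilot-light instance, change its type, and start it again.

    `ec2` is assumed to behave like a boto3 EC2 client, for example
    boto3.client("ec2", region_name="<DR region>").
    The function name and parameters are illustrative only.
    """
    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Resize to the production-sized instance type.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": target_type},
    )

    # Bring the resized instance back up and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return target_type
```

Passing the client in as a parameter keeps the function easy to exercise in a DR test with a stub client before it is pointed at a real account.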
AWS DRS
AWS Elastic Disaster Recovery (abbreviated as AWS DRS) uses storage-level replication for an EC2 instance and, unlike the previous two methods, does not rely on the running applications or databases to perform the replication. The replication itself does not need to be managed and requires no storage expertise: it just needs to be enabled from the AWS console, and the rest is taken care of behind the scenes using AWS native technologies. No recovery instance for the protected server runs in the DR site until a DR event is initiated manually using the AWS console, either for testing or for a real disaster. This saves further cost but also increases the time it takes to bring up the service, as an instance needs to be deployed from the replicated storage.
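Besides the console, a DR drill can also be triggered programmatically. The following is a small sketch, assuming `drs` behaves like a boto3 `drs` client and that the source servers are already replicating; the helper name `start_dr_drill` and the server IDs in the usage note are hypothetical.

```python
def start_dr_drill(drs, source_server_ids, is_drill=True):
    """Launch recovery instances for the given DRS source servers.

    `drs` is assumed to behave like boto3.client("drs") in the DR region.
    With is_drill=True the launch is marked as a test rather than a
    real recovery. The function name is illustrative only.
    """
    response = drs.start_recovery(
        isDrill=is_drill,
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )
    # The launch runs asynchronously as a job; return its ID for tracking.
    return response["job"]["jobID"]
```

A call such as `start_dr_drill(boto3.client("drs", region_name="<DR region>"), ["s-..."])` would start a drill for one server; the returned job ID can then be used to follow the launch.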
Backup/Restore
Backups can be performed for most databases with AWS Backup. Databases that are not supported can instead be backed up to an EBS-based disk and then copied to an S3 bucket that is configured for cross-region replication (S3 CRR). These backups become available in the secondary region with a latency of 15 minutes or more, so this approach is suitable when the DR RTO and RPO requirements are not stringent. In an SAP landscape, this applies to non-critical systems, which usually include SAP Solution Manager, SAP LVM and SAP ATC. Depending on the scenario, some ERP systems can also be considered for this approach.
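As an illustration of the S3 CRR part, the replication rule for a backup bucket can be expressed as a plain dictionary and passed to the S3 API. This is a sketch under assumptions: the helper name, the rule ID and the ARNs are placeholders, versioning must already be enabled on both buckets, and the IAM role must allow S3 to replicate objects.

```python
def build_crr_configuration(role_arn, destination_bucket_arn, prefix=""):
    """Build a cross-region replication configuration for a backup bucket.

    The result is meant to be passed as ReplicationConfiguration to
    put_bucket_replication on a boto3 S3 client. All names used here
    (function, rule ID) are illustrative placeholders.
    """
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "sap-backups-to-dr-region",
                "Priority": 1,
                "Status": "Enabled",
                # An empty prefix replicates every object in the bucket.
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }
```

With a real client this would be applied as `s3.put_bucket_replication(Bucket="<source bucket>", ReplicationConfiguration=build_crr_configuration(...))`. Restricting `prefix` to the database backup folder keeps unrelated objects out of the DR region.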
Below is a table that compares these options against several parameters
The right Disaster Recovery approach is usually a mix of the mechanisms described above. Start by answering the questions below for any customer scenario.
- What is the SAP application type?
- Are there specific RTO/RPO requirements available?
- What is the business impact behind the RTO/RPO values?
- What are the cost implications?
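The questions above can be turned into a rough first-pass filter. The sketch below is purely illustrative: the `SapSystem` fields, the function name and especially the RTO thresholds are made-up placeholders, not AWS or SAP recommendations, and should be replaced with values agreed with the business.

```python
from dataclasses import dataclass


@dataclass
class SapSystem:
    name: str
    business_critical: bool
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable data loss


def suggest_dr_mechanism(system):
    """Suggest a starting point for the DR discussion of one system.

    All thresholds below are hypothetical placeholders.
    """
    # Non-critical systems that tolerate the backup copy latency
    # (15 minutes or more) can rely on backup/restore.
    if not system.business_critical and system.rpo_minutes >= 15:
        return "Backup/Restore"
    if system.rto_minutes <= 30:
        return "Warm Disaster Recovery"
    if system.rto_minutes <= 120:
        return "Pilot Light Disaster Recovery"
    return "AWS DRS"


for s in (
    SapSystem("ECC", True, 15, 5),
    SapSystem("SCM", True, 90, 15),
    SapSystem("Solution Manager", False, 1440, 1440),
):
    print(s.name, "->", suggest_dr_mechanism(s))
```

The result is only a conversation starter; the cost of each option still has to be weighed against the business impact for every system.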
Another important aspect to keep in mind is testing Disaster Recovery. All the approaches described above provide a mechanism to perform Disaster Recovery tests. However, most SAP landscapes do not work in isolation and depend on plenty of other systems, for example file shares, integration engines like BizTalk, job schedulers like Control-M, and many others. There are also key landing zone components, including DNS and Active Directory among others, that need to be functional in the DR region before DR can be validated or executed. Addressing these prerequisites before DR for SAP landscapes is implemented and tested ensures that the test results can be relied on in a real DR scenario.
In the next article, we will look at how the above scenarios can be automatically deployed and tested in AWS with minimal manual intervention.