Planning for disaster recovery is essential in any cloud environment. It cannot be taken lightly, because it underpins the high availability and integrity of the services and applications deployed in the cloud, which is fundamental to the continuity of any business. Disaster recovery (DR) is an organization's ability to promptly restore access to, and the functionality of, its infrastructure after a disruptive event, whether that event is a natural disaster or the result of human action or error.
Let's explore some best practices for disaster recovery on Key Vault and Storage accounts. To keep the discussion clear and concise, we will look at how these apply in the Microsoft Azure architecture. A basic knowledge of one Cloud Service Provider's (CSP) infrastructure should help in understanding the same scenario on other CSPs’ infrastructure.
First, we will highlight the objectives of disaster recovery in a cloud environment, give an overview of Key Vault and Storage as cloud solutions, and present the potential risks that can affect their high availability. More importantly, we will look at the steps for designing a DR plan for the two solutions and how the plan applies to each of them.
Objectives of Disaster Recovery
A disaster recovery plan aims to accomplish the following:
- To ensure business continuity: With DR, organizations will be able to minimize the impact of a disaster on business operations and ensure that critical functions can continue or resume as quickly as possible after a disaster.
- To minimize downtime: The downtime that immediately follows a disruption will be minimal or non-existent, depending on the effectiveness of the disaster recovery plan. This limits financial losses and maintains customers’ confidence.
- To restore data: Essential data and applications will be recovered and restored to their functional state after a disaster to resume normal operations.
- To preserve data integrity: The integrity and accuracy of recovered data will be maintained to avoid potential errors or inconsistencies that could impact business processes.
- To identify and mitigate risks: Potential risks and vulnerabilities will be assessed to enhance preparedness and reduce the likelihood of future disasters or minimize their impact.
- To protect assets and resources: Hardware, software, equipment, and facilities will be protected from excessive damage or loss during a disaster.
- To ensure compliance with regulatory requirements: Disaster recovery plans seek to adhere to laws relating to data privacy, security, and business continuity.
- To enhance organizational resilience: Planning for effective disaster recovery promotes a culture of resilience within the organization, in which members are encouraged to be adaptable and flexible in the face of unforeseen challenges.
Key Vault is a cloud service for securely storing and accessing secrets. These secrets include API keys, passwords, certificates, and cryptographic keys, and access to them is tightly controlled by Key Vault. Detailed documentation on how to implement Key Vault in Azure is available in my earlier blog post. Because the secrets kept in the vault are very sensitive, a good disaster recovery plan for this service is crucial.
Storage plays a vital role in any cloud or on-premises infrastructure. Deployed resources need to be kept securely and in a way that makes them accessible when needed. In cloud computing, storage is the capability of storing and managing data, files, and other resources in storage services provided by a cloud service provider. Cloud storage allows users and organizations to access, upload, download, and modify their data without managing physical infrastructure.
In Azure, Storage services offer high availability, scalability, durability, and security for a variety of data objects in the cloud, and these objects can be accessed globally. The types of Storage services in Azure include Azure Blob Storage, Azure Files, Azure Queue Storage, Azure Table Storage, and Azure Managed Disks. The user-interface tools for interacting with Azure Storage are the Azure portal and Azure Storage Explorer. A descriptive article on how to create a Storage account with a Blob Storage container can be found in my earlier blog post, "Creating a Storage Account with Azure Blob Storage Container".
Potential Risks on Key Vault and Storage
The risks and threats that Key Vault and storage systems may encounter are listed below:
- Natural disasters: fire outbreaks, hurricanes, floods, earthquakes, tsunamis, tornadoes, wildfires, etc.
- Cyber attacks
- Human errors: These are considered below for each of the two resources in turn.
(a) Risk of human error on Key Vault
(i) Unauthorized access to sensitive keys, secrets, and certificates may occur if proper access controls and permissions are not set up.
(ii) Compromising the encryption keys when they are not adequately protected. This may lead to the exposure of
(iii) Malicious or negligent actions by authorized users with access to the Key Vault could lead to data breaches or misuse of sensitive information.
(iv) Not rotating the encryption keys regularly can increase the risk of prolonged exposure of sensitive data.
(v) Insecure management of secrets, such as passwords or API
(vi) When key vault is not designed with redundancy or backup mechanisms it can become a single point of failure for
(b) Risk of human error on Storage
(i) Misconfiguring containers or blobs as publicly accessible can give unauthorized users access to sensitive data.
(ii) Weak access controls and authentication mechanisms for storage accounts can allow unauthorized access.
(iii) Accidental deletion or corruption of data without proper backups can lead to irreversible data loss.
(iv) Users with access to storage accounts might intentionally or unintentionally compromise data security.
(v) Failure to use encryption for data at rest or in transit increases the risk of data exposure if the storage resources are compromised.
(vi) If SAS (Shared Access Signature) tokens are not securely managed, or are issued with overly long expiry times, attackers might abuse them to gain unauthorized access to storage resources.
(vii) Misconfiguration of Cross-Origin Resource Sharing (CORS) might lead to unauthorized cross-origin requests, potentially exposing sensitive data.
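The SAS risk in item (vi) can be checked mechanically. The following is a minimal sketch, not an official Azure tool: it inspects the `se` (signed expiry) and `spr` (signed protocol) query parameters of a SAS URL. The function name `audit_sas_url` and the 24-hour maximum lifetime are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def audit_sas_url(sas_url, now, max_lifetime=timedelta(hours=24)):
    """Return a list of findings for a SAS URL; an empty list means no issue found."""
    params = parse_qs(urlparse(sas_url).query)
    findings = []

    # 'se' is the signed expiry time, an ISO 8601 UTC timestamp.
    expiry_values = params.get("se")
    if not expiry_values:
        findings.append("no expiry (se) found; token may rely on a stored access policy")
    else:
        expiry = datetime.fromisoformat(expiry_values[0].replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)  # SAS times are UTC
        if expiry <= now:
            findings.append("token has already expired")
        elif expiry - now > max_lifetime:  # max_lifetime is an assumed policy value
            findings.append("token lifetime exceeds the allowed maximum")

    # 'spr' is the signed protocol; anything other than 'https' permits plain HTTP.
    if params.get("spr", [""])[0] != "https":
        findings.append("token is not restricted to HTTPS")

    return findings
```

A check like this could run wherever SAS URLs are generated or stored, so that long-lived or HTTP-enabled tokens are flagged before they are handed out.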
Designing a Disaster Recovery (DR) Plan for Key Vault and Storage
In designing a DR plan, the following have to be considered:
a) Assess the risks by identifying potential threats and evaluating their impact.
b) Define the essential metrics: the Recovery Point Objective (RPO), which is the maximum amount of data loss that can be tolerated, measured in time, and the Recovery Time Objective (RTO), which is the maximum time allowed to restore service.
c) Choose the disaster recovery strategy.
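Steps (b) and (c) can be tied together: once RPO and RTO are defined, each candidate strategy can be compared against them. The sketch below is illustrative only. The class names, the strategy names, and every duration in the usage example are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, measured in time
    rto: timedelta  # maximum tolerable downtime


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_data_loss: timedelta     # e.g. backup interval or replication lag
    estimated_recovery_time: timedelta  # time needed to bring the service back


def strategies_meeting(objectives, candidates):
    """Return the names of the strategies that satisfy both RPO and RTO."""
    return [
        s.name
        for s in candidates
        if s.worst_case_data_loss <= objectives.rpo
        and s.estimated_recovery_time <= objectives.rto
    ]
```

For example, with an RPO of one hour and an RTO of four hours, a nightly backup (up to 24 hours of data loss) would be filtered out, while a replication-based strategy with a lag of minutes would pass.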
Implementing Disaster Recovery for Azure Key Vault
The two main ways of implementing a DR plan for Key Vault are:
(i) Backup and Restore
(ii) Multi-region Replication
(i) Backup and Restore
Presently, Azure Key Vault (AKV) doesn't provide a way to back up an entire key vault in a single operation, but each key, secret, and certificate in the vault can be backed up individually. Azure Key Vault also has multiple layers of redundancy, which ensure that keys and secrets remain available to your application.
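Because items are backed up one at a time, a full-vault backup is in practice a loop over the vault's contents. The sketch below shows that loop for secrets. It assumes a `client` object that behaves like `SecretClient` from the `azure-keyvault-secrets` package, that is, one exposing `list_properties_of_secrets()` and `backup_secret(name)`. The function name `backup_all_secrets` is hypothetical.

```python
def backup_all_secrets(client):
    """Back up every secret in a vault, one item at a time.

    `client` is assumed to behave like azure.keyvault.secrets.SecretClient:
    list_properties_of_secrets() yields objects with a .name attribute, and
    backup_secret(name) returns the protected backup as bytes.
    """
    backups = {}
    for props in client.list_properties_of_secrets():
        # Each call backs up a single secret, including its versions.
        backups[props.name] = client.backup_secret(props.name)
    return backups
```

With the real SDK, the client would typically be built from a vault URL and a credential, and the returned bytes would be written to secure storage. Keys and certificates can be handled in the same way with the `backup_key` and `backup_certificate` methods of their respective clients.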
When a disaster occurs, Key Vault maintains availability by automatically failing over requests to a paired region, without any intervention from the user. This is possible because the keys, secrets, and other contents of the key vault are replicated within the region and to a secondary region at least 150 miles away, but within the same geography. This replication maintains the high durability of the Key Vault contents. If individual components within the Key Vault service fail, alternate components within the region automatically take over the request so that there is no downtime.
Note, however, that single-region geographies such as Brazil South and Qatar Central are an exception to the paired-region model. In a single-region geography, zone-redundant storage (ZRS) is used to replicate data three times within the single region.
What happens if an entire region becomes unavailable, which is very rare? Requests are automatically routed (failed over) to the secondary region, and the key vault is in read-only mode while it is failed over. When the primary region is available again, requests are routed back to it. Again, this does not apply in a single-region geography.
Backed-up contents of a key vault can only be restored to a vault in the same geography, because of the cryptographic boundary. For example, East US and West US belong to the same geography, United States. The source and target regions can be different, but they must belong to the same geography, and the vaults must be in the same subscription.
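The restore constraint can be expressed as a simple pre-flight check before a restore is attempted. This is only a sketch: the region-to-geography table below is a small illustrative subset written for this example, and the function name `can_restore` is hypothetical.

```python
# Illustrative subset only; consult Azure's region list for the full mapping.
REGION_GEOGRAPHY = {
    "eastus": "United States",
    "westus": "United States",
    "northeurope": "Europe",
    "westeurope": "Europe",
}


def can_restore(source_region, target_region, source_subscription, target_subscription):
    """Check the two restore constraints: same geography and same subscription."""
    source_geo = REGION_GEOGRAPHY.get(source_region)
    target_geo = REGION_GEOGRAPHY.get(target_region)
    if source_geo is None or target_geo is None:
        raise ValueError("region missing from the illustrative mapping")
    return source_geo == target_geo and source_subscription == target_subscription
```

Under these assumptions, a backup taken in East US can be restored to West US within the same subscription, but not to West Europe.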
You can configure the soft-delete and purge protection features on your key vault to protect your secrets against accidental or malicious deletion. Soft-delete allows deleted vaults and deleted vault contents to be recovered within a retention period, and purge protection prevents them from being permanently deleted before that period ends. It is important to note that Key Vault does not support backing up more than 500 past versions of a key, secret, or certificate object.
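Soft-delete only helps while a deleted item is still inside its retention window, so it is worth knowing the recovery deadline. The sketch below assumes the retention period is configurable between 7 and 90 days, which is the range Azure documents for soft-delete. The function names and the dates in the example are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recovery_deadline(deleted_on, retention_days=90):
    """Return the moment after which a soft-deleted item can no longer be recovered."""
    if not 7 <= retention_days <= 90:
        raise ValueError("retention_days must be between 7 and 90")
    return deleted_on + timedelta(days=retention_days)


def is_recoverable(deleted_on, now, retention_days=90):
    """True while the soft-deleted item is still within its retention window."""
    return now < recovery_deadline(deleted_on, retention_days)


# Example with made-up dates: a secret deleted on 1 January with 30-day retention.
deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(recovery_deadline(deleted, retention_days=30))
```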
(ii) Multi-region Replication
Azure Key Vault is a multi-tenant service that uses a pool of Hardware Security Modules (HSMs). Multi-region replication, a feature of Azure Key Vault Managed HSM, extends a managed HSM pool from a primary region to a secondary region, which improves the availability of critical cryptographic keys if one region becomes unavailable. Once it is configured, both regions are active and replication takes place automatically. The closest available region to the application receives and fulfills each request, thereby maximizing read throughput and minimizing latency. The picture below explains how this works.
When multi-region replication is enabled on a managed HSM in the primary region, a second managed HSM pool, with three load-balanced HSM partitions, is created in the secondary region. When requests are issued to the Traffic Manager global DNS endpoint, the closest available region receives and fulfills the request. For the list of Azure regions that can serve as the primary region for a replicated HSM pool, refer to the Azure Managed HSM documentation.
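The "closest available region" behaviour can be illustrated with a toy selection function. This is a simplified model for intuition only, not how Traffic Manager is implemented. The region names and latency figures are made up.

```python
def pick_region(latency_ms, available):
    """Return the available region with the lowest latency, or None if all are down."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in available}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

With both regions healthy, the lower-latency region serves the request. If that region drops out of the `available` set, the same call returns the other region, which is the failover behaviour described above.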
Implementing Disaster Recovery for Azure Storage
A storage account is used to manage the services that make up Azure Storage and to deploy storage resources such as blob containers, file shares, tables, and queues. In planning disaster recovery for Azure Storage, the cloud concept of redundancy comes into play.
Redundancy means keeping duplicate copies of data, systems, and other resources so that, after a disruptive event, you have immediate and secure access to the copies. In Azure Storage, the redundancy settings for storage accounts are broadly divided into two categories:
(i) Redundancy in the Primary region
(ii) Redundancy in the Secondary Region
Redundancy in the Primary Region
Storage resources in a storage account are always replicated three times in the primary region. To protect against the failure of a single availability zone, zone-redundant storage (ZRS) is recommended.
The diagram above depicts how zone-redundant storage is structured across three availability zones.
Zone-redundant storage (ZRS) has the features listed below:
(a) It copies data synchronously across three Azure availability zones in the primary region.
(b) It offers durability for storage resources of at least 99.9999999999% (12 9's) over a given year.
(c) If a zone becomes unavailable, you can still access the data for both read and write operations.
(d) ZRS in the primary region is recommended for:
(i) Scenarios that require high availability.
(ii) Restricting replication of data to a particular country or region to meet data governance requirements.
(iii) Azure Files workloads.
(e) It provides excellent performance, low latency, and resiliency for your data if it becomes temporarily unavailable within a zone.
(f) ZRS is supported for Standard storage accounts, Premium block blob accounts, and Premium file share accounts.
Point (e) above shows that zone-redundant storage is only effective against disruptive events that occur within a single zone. For regional disaster recovery, geo-zone-redundant storage (GZRS) is recommended. GZRS uses ZRS in the primary region and also geo-replicates the data to a secondary region.
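To get a feel for what a durability figure such as "12 9's" means, it can be turned into an expected number of lost objects per year. The sketch below makes a simplifying assumption: it reads the durability figure as an independent per-object annual loss probability. The function name is hypothetical.

```python
from fractions import Fraction


def expected_annual_loss(object_count, nines):
    """Expected number of objects lost per year for a durability of `nines` nines."""
    loss_probability = Fraction(1, 10 ** nines)  # 1 - durability, kept exact
    return object_count * loss_probability
```

Under that assumption, storing a billion objects at 12 nines of durability gives an expected loss of one thousandth of an object per year, which is roughly one object every thousand years.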
Redundancy in a Secondary Region
Replicating data to a secondary region is ideal for achieving high availability and durability, especially in the event of a complete regional outage or a disaster in which the primary region is irrecoverable. There are two options for copying your data to a secondary region:
(a) Geo-Redundant Storage (GRS)
The picture above shows how data is replicated from a primary region to a secondary region. The features of GRS are stated below:
i. It copies data synchronously three times within a single physical location in the primary region using locally redundant storage (LRS), and then copies it asynchronously to a single physical location in a secondary region.
ii. Data written to the secondary location is also replicated within that location using LRS.
iii. The primary region and the secondary region are usually hundreds of miles apart.
iv. It offers durability for storage resources of at least 99.99999999999999% over a given year.
(b) Geo-Zone-Redundant Storage (GZRS)
The picture depicts geo replication in Geo Zone Redundant Storage.
How does it work?
i. It combines the high availability provided by redundancy across availability zones with protection from regional outages provided by geo-replication.
ii. In a GZRS storage account, data is copied synchronously across three availability zones in the primary region.
iii. Data is also replicated to a secondary geographic region for protection from regional disasters.
iv. For storage solutions, it offers maximum consistency, durability, and availability, together with excellent performance and resilience for disaster recovery.
v. It provides at least 99.99999999999999% durability of objects over a given year.
vi. Only Standard general-purpose v2 storage accounts support GZRS.
vii. Enabling read access to the secondary region makes data always available to be read from the secondary endpoint as well as from the primary endpoint.
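The redundancy options discussed in this section can be summarized as a small decision function. This is a sketch of the selection logic only. The returned strings are the Azure Storage SKU names for each option, while the function name and its parameters are hypothetical.

```python
def choose_redundancy(zone_resilient, region_resilient, read_from_secondary=False):
    """Map resilience requirements to an Azure Storage redundancy SKU name."""
    if read_from_secondary and not region_resilient:
        raise ValueError("read access to a secondary region requires geo-replication")
    if region_resilient:
        # GZRS adds zone redundancy in the primary region on top of geo-replication.
        base = "GZRS" if zone_resilient else "GRS"
        return f"Standard_RA{base}" if read_from_secondary else f"Standard_{base}"
    return "Standard_ZRS" if zone_resilient else "Standard_LRS"
```

For instance, a workload that must survive both a zone failure and a regional outage, and that needs to read from the secondary endpoint, maps to `Standard_RAGZRS`.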
Testing and Monitoring Disaster Recovery procedures
- Backup Testing: Conducting regular tests of the backup and restore process is important to ensure that data can be successfully restored from the GZRS storage. To do this, create a test environment into which data can be restored, and verify its integrity there.
- Failover Testing: Test the failover process to a secondary region to simulate a disaster scenario. Ensure that the application can continue to function seamlessly using the data from the secondary region in the event of a disaster.
- Testing Data Consistency: In a multi-region setup, you need to ensure that the data in both regions is consistent. Regularly run data consistency checks to ensure there are no discrepancies between the data in the primary and secondary regions.
- Load Testing: Conduct load testing after failover to ensure that the secondary region can handle the increased traffic and workload.
- Monitor data replication continuously between the primary and secondary regions. Azure Monitor can be used to track the status of GZRS replication and get alerts if any issues arise.
- Set up health probes and heartbeats to check the availability and responsiveness of your application and storage services in both regions.
- Implement Application Performance Monitoring (APM) tools to monitor the application's performance during normal operation and after failover.
- Configure alerts to notify the relevant members of the disaster recovery team in case of a disaster recovery event. These alerts can be set up to trigger when specific conditions are met, such as data unavailability or replication delays.
- Use automation scripts to regularly test the disaster recovery procedures and monitor the infrastructure's health. This can help identify issues before they escalate.
- Conduct regular incident response drills to test how your team responds to disaster recovery scenarios.
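As one example of the monitoring and alerting items above, the replication delay can be compared against the RPO. Geo-redundant storage accounts expose a Last Sync Time value. The sketch below assumes that value has already been retrieved and only shows the comparison. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_replication_lag(last_sync_time, now, rpo):
    """Compare geo-replication lag with the RPO and return (lag, alert_message)."""
    lag = now - last_sync_time
    if lag > rpo:
        return lag, f"ALERT: replication lag {lag} exceeds RPO {rpo}"
    return lag, None


# Example with made-up timestamps: the last sync was an hour ago, the RPO is 15 minutes.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
lag, alert = check_replication_lag(now - timedelta(hours=1), now, rpo=timedelta(minutes=15))
print(alert)
```

A scheduled job could run a check like this and forward any non-empty alert message to the disaster recovery team.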
Steps for Creating a Disaster Recovery Team
- Select members who are experts in cloud security, storage technologies, key management, and disaster recovery. This should include staff from IT, security, operations, and any other relevant departments.
- All team members must have a comprehensive knowledge of the key vault and storage services provided by your cloud provider to ensure effective disaster recovery planning.
- Conduct a thorough risk assessment specific to key vaults and storage services. Identify potential threats, vulnerabilities, and risks that could impact the availability or security of your data.
- Define your recovery objectives using the metrics: Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for key vaults and storage.
- Develop a disaster recovery plan outlining step-by-step procedures for data backup, replication, recovery, and restoration as described above.
- Establish clear communication and coordination channels among the DR team members and with other relevant stakeholders. Each member must have full documentation of their roles and responsibilities during and after a disaster.
- Do not neglect training on the disaster recovery features and best practices of the key vault and storage services, or awareness of any updates to them.
It is very important to pay attention to disaster recovery procedures for Key Vault and Storage, because these two services are crucial and fundamental in a cloud environment. This article has given an overview of Key Vault and Storage and discussed the potential risks that may interfere with their availability.
It explained how to implement disaster recovery procedures such as backup and restore, multi-region replication, and storage redundancy. It also highlighted how the disaster recovery team tests and monitors the disaster recovery plan.
In all, regularly reviewing and updating the disaster recovery plan as your cloud environment and business needs evolve will help ensure high availability and resilience of Key Vault and Storage, as well as other cloud resources.
I trust this is useful. Please let us have your feedback.