Disaster Recovery (DR) planning is an essential part of a successful cloud strategy. Unplanned outages and data loss can have catastrophic effects on business operations. By implementing a solid disaster recovery plan, customers can ensure that their systems remain operational and that data is consistently backed up and retrievable, minimizing potential downtime and loss.
Whether we are setting up OCI as the DR site for a customer’s data center or setting up DR from one OCI region to another, planning DR on Oracle Cloud Infrastructure means considering geographical distance requirements, data sovereignty, the amount of data to be replicated between the locations, recovery point objectives (RPO), and recovery time objectives (RTO).
Distance and Data Sovereignty – The distance and data sovereignty requirements help define which OCI region or regions the solution needs. If the data has to remain in country, then we cannot use a region in a foreign country for the DR site. As an example, I have seen DR solutions for global companies where the primary site is in one country and the secondary failover site is in another because their customer-facing application is global. In the event of a failover in this type of configuration, the content delivery network is pointed to the secondary region. When considering distance, we also have to consider how quickly we need the data replicated. The greater the distance, the more latency we introduce on the network, which could have a negative impact on recovery point objectives. Customer priorities could dictate distance over recovery point; for example, Ashburn to San Jose is a longer network connection than Ashburn to Chicago.
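To put rough numbers on the distance trade-off, here is a minimal sketch that estimates the minimum round-trip time added by fiber distance alone. It assumes light travels through fiber at about 200,000 km/s, and the path lengths are hypothetical round figures for illustration, not measured OCI routes. Real latency will be higher once routing and equipment are included.

```python
# Rough lower bound on round-trip time caused by fiber distance alone.
# Assumption: light travels through fiber at about 200,000 km/s,
# which is 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(path_km: float) -> float:
    """Minimum round-trip time in milliseconds for a fiber path of path_km."""
    return 2 * path_km / FIBER_KM_PER_MS


# Hypothetical path lengths in km, for illustration only.
routes = {"Ashburn to Chicago": 1_000, "Ashburn to San Jose": 4_000}

for name, km in routes.items():
    print(f"{name}: at least {round_trip_ms(km):.0f} ms round trip")
```

The point is the ratio rather than the absolute values: a path four times longer carries at least four times the propagation delay, which matters most for synchronous or chatty replication.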
Data Replication – OCI has two network egress counters: one for egress to the internet and one for egress between OCI regions. Customers get 10TB a month included with their OCI tenancy, and the two counters are tracked separately: 10TB for egress to the internet and another 10TB for egress between regions. When we plan the data replication, we need to understand how much data will be replicated between the regions so we can calculate any costs if we exceed the included 10TB per month.
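The overage arithmetic can be sketched as follows. This is only an illustration: the 10TB allowance comes from the paragraph above, while `rate_per_gb` is a placeholder I made up and the 1 TB = 1000 GB conversion is an assumption. Check the current OCI price list for the real rate and billing units.

```python
# Estimate billable inter-region egress beyond the included allowance.
INCLUDED_TB_PER_MONTH = 10.0  # included allowance described in the text
GB_PER_TB = 1000              # assumption: decimal units


def billable_egress_tb(replicated_tb: float,
                       included_tb: float = INCLUDED_TB_PER_MONTH) -> float:
    """TB of monthly replication traffic that exceeds the included allowance."""
    return max(0.0, replicated_tb - included_tb)


def monthly_egress_cost(replicated_tb: float, rate_per_gb: float,
                        included_tb: float = INCLUDED_TB_PER_MONTH) -> float:
    """Estimated monthly overage cost; rate_per_gb is a hypothetical rate."""
    return billable_egress_tb(replicated_tb, included_tb) * GB_PER_TB * rate_per_gb


# Hypothetical example: 15 TB replicated per month at a made-up rate.
overage = billable_egress_tb(15.0)
cost = monthly_egress_cost(15.0, rate_per_gb=0.01)
print(f"Billable egress: {overage:.1f} TB, estimated cost: {cost:.2f}")
```

Anything at or under the allowance yields zero, so the useful input here is a realistic estimate of the monthly changed data, not the total size of the data set.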
Recovery Point Objectives (RPO) – this is where the DR discussions start to get fun and the designs start becoming more unique! If a customer has an RPO of 15 minutes, we have to consider the network used for data replication, the distance between the primary and the failover site, and the OCI region the customer is deploying to. The remote peering connection (RPC) between OCI regions does not have a service level agreement (SLA). It is important to note that I have not once encountered a network issue between regions, and I have been working with OCI since before it reached general availability. Still, having an SLA on uptime and bandwidth would provide assurances when designing for tight RPOs. To solve for that, we could either stand up a network connection between the regions with redundant circuits and dedicated bandwidth, or we could design a “near” DR solution. A near DR solution would leverage OCI availability domains, or fault domains in single-AD regions, in line with Oracle Maximum Availability Architecture principles, and provide a local failover architecture for the more stringent RPO and RTO requirements. In the event of a regional failure, the customer could then fail over to a secondary region that has more relaxed RPO and RTO requirements.
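A quick way to sanity-check an RPO target against available bandwidth is sketched below. It uses a simplified model of periodic replication in which the worst-case data loss is roughly the replication interval plus the time to transfer that interval's changes. The change rate, interval, and bandwidth figures are hypothetical, and the model ignores compression, protocol overhead, and latency.

```python
# Simplified worst-case RPO estimate for periodic (interval-based) replication.
# Assumption: worst case = replication interval + time to ship that
# interval's changed data at a sustained bandwidth.

def transfer_minutes(changed_gb: float, bandwidth_mbps: float) -> float:
    """Minutes needed to send changed_gb at bandwidth_mbps (decimal units)."""
    megabits = changed_gb * 8 * 1000  # 1 GB = 8,000 megabits
    return megabits / bandwidth_mbps / 60


def worst_case_rpo_minutes(change_gb_per_hour: float, interval_min: float,
                           bandwidth_mbps: float) -> float:
    """Interval plus the transfer time for the data changed in one interval."""
    changed_gb = change_gb_per_hour * interval_min / 60
    return interval_min + transfer_minutes(changed_gb, bandwidth_mbps)


def meets_rpo(change_gb_per_hour: float, interval_min: float,
              bandwidth_mbps: float, rpo_min: float) -> bool:
    """True if the estimated worst case fits inside the RPO target."""
    return worst_case_rpo_minutes(
        change_gb_per_hour, interval_min, bandwidth_mbps) <= rpo_min


# Hypothetical workload: 60 GB of changes per hour, 15-minute RPO target.
for interval, mbps in [(5, 1000), (5, 100), (10, 100)]:
    worst = worst_case_rpo_minutes(60, interval, mbps)
    ok = meets_rpo(60, interval, mbps, rpo_min=15)
    print(f"interval={interval} min, {mbps} Mbps: "
          f"worst case {worst:.1f} min, meets RPO: {ok}")
```

If the transfer time for one interval exceeds the interval itself, replication falls further behind on every cycle, which is a sign that either dedicated bandwidth or a near DR design is needed for that workload.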

Orchestration of the data replication and DR failover can be done in a couple of different ways. OCI introduced Full Stack Disaster Recovery (FSDR) sometime around October of 2022. The OCI Full Stack Disaster Recovery service manages the transition of infrastructure, platforms, and applications between OCI regions. I am working with FSDR at one of my customers now, and the product has matured since its initial release. There are still some things we need to design around while using the service, such as customer-managed keys for storage and services with which it has not yet been integrated. FSDR works great for compute, storage (as long as you are not using customer-managed keys), and databases such as ATP, ADW, and ExaCS. While FSDR is not integrated with services like Oracle Analytics and Essbase, it can still orchestrate the DR failover for them using custom scripts. Customers that do not want to use FSDR can leverage other methods for data replication and orchestration, such as rsync or third-party tools like Commvault. I just wanted to highlight an OCI cloud-native feature that could be leveraged for the orchestration.
The intention of this blog entry was to discuss the key elements for configuring disaster recovery of applications deployed to OCI. I would have to do another blog entry for architecture considerations related to infrastructure such as connectivity to regions, resource sizing in the DR region, OCI identity domains, etc.
In summary, planning for disaster recovery on a cloud service provider not only mitigates risks but also enhances operational resilience, scalability, and cost efficiency. It’s a crucial step in safeguarding a business’s future.
