Having described in the previous posts the different components required to interconnect multiple data centers for business continuity and disaster recovery, I think it may be useful to provide a series of questions you can ask yourself to better understand which solution best fits your business.
Definition:
Consider 2 main models for Business Continuance:
Disaster Recovery (DR) Plan:
- Disaster Recovery usually means that the service is "restarted" in a new location.
- Time to recover the application is mainly dictated by the mission and business criticality levels.
- Recovery happens manually or automatically after an uncontrolled situation (power outage, natural disaster, etc.).
- Automatic recovery may require additional components to be efficient, such as intelligent DNS, Route Health Injection or LISP.
- Interconnection and load distribution are achieved using intelligent L3 services
- Traditionally, a DR scenario implies that hosts/applications/services are renumbered (i.e. given a new IP address space) according to the new location. Note, however, that LISP now allows IP components to be migrated and located without any IP parameter change (and without LAN extension either).
Disaster Prevention/Avoidance (DP/DA) Plan:
- Interconnection is usually achieved using L2 extension for business continuity (without any interruption of the current active sessions – stateful sessions are maintained transparently), but it can also be achieved using L3 extension techniques (sessions are stopped and immediately re-initiated in the new location)
- Recovery happens automatically following a controlled event (maintenance, machine migration)
For both scenarios described above, network services can be leveraged to provide automatic redirection to the active DC in case of DC failover. These services can improve the path used to reach the active application and/or optimize server-to-server communication. They are not mandatory to address DR or DP, but they are complementary and strongly recommended to accelerate the recovery process. A minimal sketch of the health-check logic behind such a redirection service is given below.
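To make the idea concrete, here is a minimal Python sketch of the kind of health probe an intelligent DNS / GSLB service performs before answering clients with the address of the active DC. The host names and port are hypothetical placeholders, and the probe is reduced to a simple TCP connect; real products use richer checks.

```python
import socket

# Hypothetical VIPs for the same service hosted in two data centers.
DC_ENDPOINTS = [
    ("dc1.example.com", 443),   # primary DC (assumed name)
    ("dc2.example.com", 443),   # backup DC (assumed name)
]

def first_healthy_endpoint(endpoints, timeout=2.0):
    """Return the first endpoint answering a TCP probe, mimicking the
    health check an intelligent DNS / GSLB device performs before
    answering a client query with the address of the active DC."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # probe failed: try the next data center
    return None

if __name__ == "__main__":
    target = first_healthy_endpoint(DC_ENDPOINTS)
    print("Redirect clients to:", target or "no DC reachable")
```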
Questions:
1) What is the impact on your business when the service is unavailable or running in degraded mode?
For most enterprises, mission and business services fit into one of the following criticality levels:
Level | Category | Business impact
C1 | Mission Imperative | Any outage results in immediate impact to revenue generation; no downtime is acceptable under any circumstances.
C2 | Mission Critical | Any outage results in critical impact to revenue generation; downtime is scheduled within short, specific maintenance windows.
C3 | Business Critical | Any outage results in moderate impact to revenue generation; interrupting the service for a short time is acceptable.
C4 | Business Operational | A sustained outage results in minor impact to revenue generation; interrupting the service for a longer period of time is acceptable.
C5 | Business Administrative | A sustained outage has little to no impact on the service.
Once the business framework has been assigned its criticality level, it can be placed in a more generic matrix. The matrix is split into the two main models discussed previously: Disaster Prevention/Avoidance and Disaster Recovery.
Availability | DP/DA RTO (hours) | DP/DA RPO (hours) | DR RTO (hours) | DR RPO (hours) | Criticality Level
Up to 99.999% | ~0 | ~0 | n/a** | n/a | C1
Up to 99.995% | 1 | 0 | 4 | 1 | C2
Up to 99.99% | 4 | 0 | 24 | 1 | C3
Up to 99.9% | 24 | 1 | 48 | 24 | C4
Up to 99.9% | Best Effort | 24 | Best Effort | 1 week | C5
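To relate the availability column to concrete figures, here is a quick back-of-the-envelope conversion of each level into the maximum cumulative downtime it allows per year (365-day year assumed; figures are indicative only):

```python
# Convert the availability levels from the matrix above into the maximum
# cumulative downtime they allow per year (365-day year assumed).
MINUTES_PER_YEAR = 365 * 24 * 60

for level, availability in [("C1", 0.99999), ("C2", 0.99995),
                            ("C3", 0.9999), ("C4", 0.999)]:
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{level}: {availability:.3%} availability "
          f"-> at most {downtime_min:.0f} minutes of downtime per year")
```

This gives roughly 5 minutes per year for C1 and about 9 hours per year for C4, which is consistent with the "n/a" entries in the DR columns for C1: only the prevention/avoidance model can meet that level.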
2) What is the maximum time to recover the critical applications (aka RTO)?
Business continuance requirements differ depending on the business the enterprise drives from its data centers, or on the time of day the recovery happens. The requirement can be "zero" downtime for strict business continuance, or a downtime of a couple of hours may be acceptable.
3) What is the maximum data loss accepted during the recovery process (aka RPO)?
The shortest is usually the best, but a shorter RPO implies higher complexity, higher cost and shorter distances between sites, all of which will weigh in the final decision. It is sometimes acceptable to lose a couple of hours of data between an active DC and its recovery sites, while some businesses impose zero data loss. Synchronous data replication leads to RPO = 0 but has distance limitations and bandwidth implications, as illustrated by the sketch below.
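As an illustration of the bandwidth implication, here is a rough sizing sketch for asynchronous replication; the change rate and RPO figures are hypothetical and would come from the application profile:

```python
# Rough sizing sketch: the replication link must at least absorb the data
# changed during one RPO window, otherwise the recovery point keeps
# drifting further behind.  Figures below are hypothetical.
change_rate_gb_per_hour = 50      # assumed average data change rate
rpo_hours = 1                     # target RPO (e.g. the C2/C3 DR column)

data_per_window_gb = change_rate_gb_per_hour * rpo_hours
min_bandwidth_mbps = data_per_window_gb * 8 * 1024 / (rpo_hours * 3600)

print(f"Data to ship per RPO window : {data_per_window_gb} GB")
print(f"Minimum sustained link rate : {min_bandwidth_mbps:.0f} Mbit/s")
```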
4) Recovery mode?
Cold: all machines need to be powered up for the recovery process (network/compute/storage) and applications are started from backup.
- RTO can take several days or weeks
- RPO may be several hours
This mode is not often deployed for a main backup site, even though it is the least expensive to maintain: the site is idle and of no use during the normal working phase, and the RTO is usually very large (several days/weeks). That said, some large enterprises may deploy this DR solution for a second, small backup site limited to some applications.
Warm: all applications are already installed and ready to start from backup data. Network access needs to be rerouted (manually) to the backup site after the applications have started.
- RTO can take several hours
- RPO may be several minutes to few hours
As with the previous mode, it is expensive due to its inefficiency during the normal working phase. As of today, this mode is not the best choice for a main backup site.
Hot: the backup site is always accessible and some applications are already up and running.
- RTO can take several minutes to hours
- RPO may be zero to few seconds
This mode is very interesting for DR functions, as the infrastructure already deployed for the backup function can be leveraged to support active applications. In addition, LISP can also offer stateless services for business continuity using the existing routed network. A small sketch comparing the three recovery modes against a target RTO/RPO follows below.
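As a recap of these three modes, here is a small sketch that picks the cheapest mode meeting a target RTO/RPO; the bounds are indicative values taken from the descriptions above, not universal constants:

```python
# Indicative best-case RTO/RPO bounds (in hours) for each recovery mode,
# taken from the descriptions above; real figures depend on the
# applications and the infrastructure.
RECOVERY_MODES = [
    ("cold", 48.0, 4.0),   # RTO of days to weeks, RPO of several hours
    ("warm",  4.0, 0.5),   # RTO of several hours, RPO of minutes to hours
    ("hot",   0.5, 0.0),   # RTO of minutes to hours, RPO near zero
]

def cheapest_mode(target_rto_h, target_rpo_h):
    """Return the first (cheapest) mode whose indicative RTO/RPO fit
    within the target, scanning from cold to hot."""
    for name, rto_h, rpo_h in RECOVERY_MODES:
        if rto_h <= target_rto_h and rpo_h <= target_rpo_h:
            return name
    return None

print(cheapest_mode(target_rto_h=24, target_rpo_h=1))   # -> warm
print(cheapest_mode(target_rto_h=1,  target_rpo_h=0))   # -> hot
```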
5) How many data centers to interconnect?
Different scenarios can be considered:
1 Primary DC (active) + 1 Backup DC (standby) – usually Region/Geo Distributed DC
- All applications run on the active DC (primary DC)
- Cold – The backup DC is inactive at all layers (network/compute/storage)
- Warm/Hot – The backup DC is inactive for applications and network, but storage may be active, replicating data in synchronous or asynchronous mode to be ready for the recovery process. In some cases, data warehousing activity takes place on the data held in the secondary data center.
1 Primary DC (Active) + 1 Backup DC (Active) – usually Region/Geo Distributed DC
- Active applications run in both DCs but are not necessarily related to each other
- Each DC is the backup of the other (meaning two-way data replication)
2 Active DCs (Twin DC or Metro Distributed Virtual DC) + 1 Backup DC (Region/Geo Distributed DC)
- Usually the 2 dispersed DCs (metro distance, interconnected using L2) look like a single logical DC from the application & management point of view. The 3rd one is only used for recovery purposes (cold/warm restart using L3).
- This is the preferred solution, as it offers service continuity and disaster prevention without any interruption inside the metro DVDC, plus disaster recovery for uncontrolled situations.
- The DVDC requires VLAN extension for a fully transparent live migration (keeping sessions active) between the 2 sites.
- LISP can improve the original function of disaster recovery by offering disaster prevention on the backup site interconnected using a routed network.
6) If Active/Active DCs, what are the drivers?
Some applications are up and running in the secondary DC in an autonomous fashion (meaning not directly linked to applications running in the primary DC). Applications are often distributed to balance the load (network or compute) between the 2 DCs.
- e.g. application A supporting service ABC runs in DC 1 and application B supporting service XYZ runs in DC 2.
- Each DC can be the backup of the remote site (two-way).
Duplicate applications can also be deployed in both DCs (primary & backup sites) to offer the closest access to service XYZ for remote users (the low response time offered by a geo-dispersed virtual DC may have a huge impact on the business).
- e.g. application A supporting service XYZ is deployed in DC 1 and a duplicate application A-bis supporting the same service XYZ is deployed in DC 2.
- Note that in such a scenario the DB tier is usually shared between the 2 duplicated applications, which may constrain the distance between the two DCs.
Applications (virtual) can migrate to the backup site in case of burst:
- Manually
- Automatically
- Migration can be stateful (zero interruption) or stateless (sessions must be re-established)
Applications (virtual) can migrate to the backup site for maintenance purposes (manual)
Physical devices can be migrated (for an undetermined temporary period)
- Some mainframes have hardcoded IP parameters, making any change difficult and costly; hence they usually require the same subnet to be extended across both sites
- This can be achieved using LAN Extension or L3/LISP
HA clusters are spread over the 2 sites, with 1 member of the cluster active on the primary site and standby member(s) on the secondary site(s) (and vice versa for other HA frameworks)
7) What is the distance between DCs?
The distance can be dictated by:
Software framework (HA clusters, GRID, live migration): the distance is usually imposed by the maximum latency supported between members (consider 500 ms as equivalent to an unlimited distance)
- Some HA clusters support L3 interconnection with unlimited distances
- Some GRID systems are latency sensitive, hence only short distances are supported.
Storage mode and replication methods (shared storage, active/active storage, etc.)
- Some active/active storage methods support around a hundred km using synchronous replication, while some solutions support several thousand km using asynchronous replication.
- Replication modes range from synchronous (roughly 100 km maximum) to asynchronous (unlimited distances).
- In the case of an HA cluster, the quorum disk may impose synchronous replication.
- It is also possible to replace this quorum disk with another method adapted to geo-clusters, such as Majority Node Set, which supports asynchronous mode.
Keep in mind that latency (hence the distance between DCs) is one of the most important criteria for a DCI deployment. It matters first on the storage side (TPS between the server and the disk array) for operation continuity (an active/active storage mode is recommended), and secondly for multi-tier applications spread over multiple DCs (server-to-server communication). Indeed, latency may have a strong impact on response time due to the ping-pong effect (often more than 10 handshakes for a single transaction); hence, when migrating or restarting a framework from site to site, it is important to understand how the workflows are related between the different components (i.e. front-end, middle tier, back-end, storage). Avoid a partial move of a framework and, if possible, optimize the ingress and egress traffic symmetrically through the stateful security and network services. A rough latency-budget sketch follows below.
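As a final illustration of the latency point, here is a back-of-the-envelope sketch assuming light propagates in fibre at roughly 200,000 km/s (about 5 µs per km one way, i.e. 10 µs per km round trip), before any equipment or protocol overhead:

```python
# Propagation delay only: ~10 microseconds of round-trip time per km of
# fibre, ignoring equipment and protocol overhead.
US_PER_KM_RTT = 10

def added_delay_ms(distance_km, round_trips=1):
    """Extra delay introduced by 'distance_km' of fibre for a given
    number of round trips (e.g. synchronous write acknowledgements or
    the multi-tier 'ping-pong' exchanges mentioned above)."""
    return distance_km * US_PER_KM_RTT * round_trips / 1000.0

# One synchronous write acknowledgement over 100 km: ~1 ms of added delay.
print(f"{added_delay_ms(100, 1):.1f} ms")
# A transaction needing 10 server-to-server exchanges over 100 km: ~10 ms.
print(f"{added_delay_ms(100, 10):.1f} ms")
```

This is why synchronous replication is generally bounded to metro distances, while the same 100 km quickly becomes visible in the response time of a multi-tier application whose tiers are split across sites.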