4 – Cloud Computing and Disaster Avoidance

The Need for Distributed Cloud Networking

While Hot Standby disaster recovery solutions running over a traditional routed DCI network remain valid and are often part of enterprise specifications, some applications and emerging services enabled by cloud computing require an RTO and RPO of zero, with fully transparent connectivity between the data centers.

These requirements have already been addressed, notably with the deployment of high-availability clusters stretched between multiple sites. However, the rapid evolution of business continuity for cloud computing requires more reliable and efficient DCI solutions that meet the flexibility and elasticity needs of cloud services: dynamic automation and almost unlimited resources for the virtual server, network, and storage environment, in a fully transparent fashion.

Disaster recovery is critical to cloud computing for two important reasons. First, physical data centers have “layer zero” physical limits and constraints such as rack space, power, and cooling capacity. Second, few applications today are written for a distributed cloud in a way that accounts for the network transports and distances needed to exchange data between different locations.

DCI Solution with Resource Elasticity

To address these concerns, the network interconnection between the remote data centers that host the cloud infrastructure needs to be as resilient as possible and must be able to support new connections wherever resources may be consumed by different services. It should also be optimized to provide direct, immediate, and seamless access to the active resources dispersed across the remote sites.

The need for applications and services to communicate effectively and transparently across the WAN or metro network is critical for businesses and service providers using private, hybrid, or public clouds.

To support cloud services such as Infrastructure as a Service (IaaS), VDI/VXI, and UCaaS in a sturdy, resilient, and scalable network, it is also crucial to provide highly available bandwidth with dynamic connectivity regardless of VM movement. This is true for the network layer as well as for the storage layer.

The latency related to the distances between data centers is another essential element that must be taken into account when deploying the DCI. Each service and function has its own criteria and constraints, so the application requiring the lowest latency is used as the reference to determine the maximum distance between physical resources, as the sketch below illustrates.
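As a rough illustration (a minimal sketch, assuming roughly 5 µs of one-way propagation delay per kilometer of fiber and ignoring equipment and serialization delays; the latency budgets are hypothetical examples), the budget of the most demanding application translates directly into a maximum inter-site distance:

    # Back-of-the-envelope check: translate an application's round-trip
    # latency budget into a maximum fiber distance between data centers.
    # Assumes ~5 us/km one-way propagation in fiber; equipment and
    # serialization delays are ignored (illustrative figures only).

    FIBER_US_PER_KM = 5.0  # one-way propagation delay, microseconds per km

    def max_distance_km(rtt_budget_us: float) -> float:
        """Maximum site separation for a given round-trip latency budget."""
        return (rtt_budget_us / 2) / FIBER_US_PER_KM

    # Hypothetical budgets: synchronous storage replication is usually the
    # most latency-sensitive service, so it sets the reference distance.
    for service, rtt_us in [("synchronous replication", 1_000),   # 1 ms RTT
                            ("VM live migration", 10_000),        # 10 ms RTT
                            ("asynchronous replication", 100_000)]:
        print(f"{service}: ~{max_distance_km(rtt_us):.0f} km maximum")

With a 1 ms round-trip budget, for example, the two sites can be at most about 100 km apart.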

To better understand the evolution of DCI solutions, it is important to highlight one major difference when comparing the active/standby behavior between members of a high availability (HA) cluster and the live migration of VMs spread over multiple locations. Both software frameworks require LAN and SAN extension between locations.


3 – Network Services for Active/Active Access

 

Global Site Load Balancing Services: To accelerate the disaster recovery service and the dynamic distribution of the workload between the primary and secondary data centers, Cisco provides different network services that optimize how user traffic reaches and is distributed to the remote sites, using a Global Site Load Balancing (GSLB) solution. This GSLB solution for traditional Layer 3 interconnection between sites relies on three major technologies:

Intelligent Domain Name System: An intelligent Domain Name System (DNS) known as the Global Site Selector (GSS) redirects end-user requests to the physical location where the application is active and where fewer network resources are consumed. In addition, the GSS can distribute traffic across multiple active sites in several ways: in collaboration with the local services of a server load balancing (SLB) application (for example, a Cisco Application Control Engine (ACE) deployed in each data center informs the GSS of the health of the services it offers), based on the load of the network, or in collaboration with the existing WAN edge routers in the data center (e.g., redirection based on the physical distance between the user and the application), to name the most common functions. User traffic is then distributed accordingly across the routed WAN; a minimal sketch of the selection logic follows.
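The following sketch is hypothetical and greatly simplified (it is not the GSS implementation): the DNS answer carries the virtual IP (VIP) of a healthy site, preferring the least-loaded one. The site names, addresses, and health/load fields are invented for illustration.

    # GSLB-style site selection (hypothetical sketch, not the Cisco GSS
    # implementation): answer each DNS query with the VIP of a healthy
    # site, preferring the least-loaded one.

    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        vip: str        # virtual IP advertised for the application
        healthy: bool   # as reported by the local SLB health probes
        load: float     # 0.0 (idle) .. 1.0 (saturated)

    def select_vip(sites: list[Site]) -> str:
        """Return the VIP that the DNS answer should carry."""
        candidates = [s for s in sites if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy site available")
        return min(candidates, key=lambda s: s.load).vip

    sites = [Site("paris",  "192.0.2.10",    True, 0.7),
             Site("london", "198.51.100.10", True, 0.3)]
    print(select_vip(sites))  # -> 198.51.100.10 (London, least loaded)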

Figure: Data Replication Based on Network Load

Figure: Data Replication in Collaboration with Routers

HTTP Traffic Redirection Between Sites: In case of insufficient resources, the local SLB device returns an HTTP redirection message (HTTP status code 3xx) to the end user, so that the client's web browser is automatically and transparently redirected to the elected backup data center where resources and information are available.
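A minimal sketch of this behavior, using Python's standard http.server module (the backup hostname and the capacity flag are hypothetical placeholders for what a real SLB probes):

    # SLB-style HTTP redirection (illustrative): when local capacity is
    # exhausted, answer with a 3xx redirect so the client's browser
    # transparently retries against the backup data center.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    BACKUP_SITE = "https://dc2.example.com"  # hypothetical backup DC portal
    local_capacity_available = False         # would come from real probes

    class RedirectingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if local_capacity_available:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"served locally\n")
            else:
                self.send_response(307)      # temporary redirect
                self.send_header("Location", BACKUP_SITE + self.path)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), RedirectingHandler).serve_forever()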

Route Health Injection: Route Health Injection (RHI) provides a real-time, very granular distribution of user traffic across multiple sites based on application availability. This method is initiated by an SLB device that informs the upstream router of the presence or absence of selected applications, based on extremely accurate information that is usually tied to the status of the services it supports. The redirection of user requests to a remote site therefore occurs in real time.
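RHI itself is a feature of the SLB device and its upstream router, but the principle can be approximated in a sketch with a scriptable BGP speaker such as ExaBGP, whose process API reads announce/withdraw lines from a helper's standard output. The VIP, prefix, and probe URL below are hypothetical:

    # Approximation of Route Health Injection with ExaBGP's process API:
    # announce a /32 host route for the application VIP while the local
    # health probe succeeds, withdraw it as soon as the probe fails.

    import sys
    import time
    import urllib.request

    VIP = "203.0.113.10"                    # hypothetical application VIP
    PROBE = "http://127.0.0.1:8080/health"  # hypothetical local health URL

    def healthy() -> bool:
        try:
            return urllib.request.urlopen(PROBE, timeout=2).status == 200
        except OSError:
            return False

    announced = False
    while True:
        up = healthy()
        if up and not announced:
            sys.stdout.write(f"announce route {VIP}/32 next-hop self\n")
            announced = True
        elif not up and announced:
            sys.stdout.write(f"withdraw route {VIP}/32 next-hop self\n")
            announced = False
        sys.stdout.flush()
        time.sleep(5)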


2 – Active/Active DC

Active-Standby versus Active-Active

A Hot Standby data center can be used for application recovery or to relieve the primary data center of a heavy workload. Relieving the primary data center of part of its workload in this way is usually referred to as Active-Active DR mode.

Figure: Active-Active DR Mode

One example of Active/Active DR mode involves an application that is active on a single physical server or VM while the network and compute stack are active in two locations. Exceptions to this definition include specific software frameworks such as grid computing, distributed databases (e.g., Oracle RAC®), and some cases of server load balancing (SLB). When the resources are spread over multiple locations running in Active-Active mode, some software functions are active in one location and on standby in the other. Active applications can be located in either site. This approach distributes the workload across several data centers.

It is important to clarify these Active/Active modes by considering the different levels of components, all of which play a part in the final recovery process:

• The network is running at each site, interconnecting all compute, network, and security services, and advertising the local subnets outside each data center. Applications active in the remote data centers are therefore reachable from the traditional routed network without changing or restarting any IP processes.

• All physical compute components are up and running with their respective bare metal operating system or hypervisor software stack.

• Storage is replicated across the different locations and can be seen as Active/Active by various software frameworks. Usually, however, a write command for a specific storage volume is sent to one location at a time, while the same data is mirrored to the remote location.

Let’s take a deeper look at the service itself when it is offered by multiple data centers in Active/Active mode. Assume that application A in data center 1 (e.g., Paris) offers an e-commerce web portal for a specific set of items. The same e-commerce portal offering the same items can also be available and active in a different location (e.g., London), but with a different IP identifier. For the end user, the service is unique and the location transparent, but requests can be distributed by the network services based on proximity criteria established between the end user and the data center hosting the application. The application therefore looks Active/Active, yet the software running on each compute system operates autonomously in the front-end tier; the instances are not related except from a database point of view. Finally, the whole session is maintained on the same servers and in the same location until the session is closed, as the sketch below illustrates.
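A minimal sketch of that request-routing behavior (the site names, addresses, and proximity test are hypothetical): new sessions are steered toward the nearest active site, but an existing session stays pinned to its original site until it closes.

    # Proximity-based distribution with session stickiness (illustrative).

    SITES = {"paris": "192.0.2.10", "london": "198.51.100.10"}
    session_site: dict[str, str] = {}   # session-id -> chosen site

    def nearest_site(client_region: str) -> str:
        # Stand-in for real proximity logic (RTT probes, geo-IP, ...).
        return "paris" if client_region == "fr" else "london"

    def route_request(session_id: str, client_region: str) -> str:
        # New sessions pick the nearest site; existing ones stay pinned.
        site = session_site.setdefault(session_id, nearest_site(client_region))
        return SITES[site]

    print(route_request("s1", "fr"))  # new session -> Paris
    print(route_request("s1", "uk"))  # same session stays in Paris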


1 – Disaster Recovery

Traditional data center interconnection architectures have supported Disaster Recovery (DR) backup solutions for many years.

Disaster Recovery can be implemented in Cold Standby, Warm Standby, and Hot Standby modes. Each option offers different benefits.

  • Cold Standby: The initial DR solutions worked in Cold Standby mode, in which appropriately configured backup resources were located in a safe, remote location (Figure 1). Hardware and software components, network access, and data restoration were implemented manually as needed. This DR mode required restarting applications on the backup site, as well as enabling network redirection to the new data center. The Cold Standby model is easy to maintain and remains valid. However, it requires a substantial delay to evolve from standby mode to full operational capability. The time to recover, also known as the Recovery Time Objective (RTO), can be up to several weeks in this scenario. In addition, the Recovery Point Objective (RPO), which is the maximum amount of data lost during the recovery process, is quite high: it is accepted that several hours of data might be lost in a Cold Standby scenario.

Figure 1: Cold Standby Mode

  • Warm Standby: In Warm Standby mode, the applications at the secondary data center are usually ready to start. Resources and services can be manually activated when the primary data center goes out of service, after which traffic is fully processed at the new location. This solution provides a better RTO and RPO than Cold Standby mode, but it does not offer the transparent operation and zero disruption required for business continuity.
  • Hot Standby: In Hot Standby mode, the backup data center has some applications running actively and is already processing part of the traffic. Data replication between the primary data center and the remote data center is done in real time. Usually the RTO is measured in minutes, and the RPO approaches zero, which means that the data mirrored at the backup site is exactly the same as at the original site. Zero RPO allows applications and services to restart safely. Immediate and automatic resource availability in the secondary data center improves overall application scalability and equipment utilization.
  • Data Replication: The different disaster recovery modes are deployed using Layer 3 interconnections between data centers through a highly available routed WAN. The WAN offers direct access to the applications running at the remote site, with synchronous or asynchronous data mirroring depending on the service level agreement and enterprise business requirements; the sketch after this list contrasts the two modes.
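The RPO difference between the two mirroring modes can be made concrete with a deliberately simplified sketch (real arrays journal, batch, and order writes far more carefully; the volume and journal structures here are illustrative only):

    # Synchronous vs. asynchronous mirroring and their effect on RPO
    # (deliberately simplified, illustrative only).

    class Volume:
        def __init__(self) -> None:
            self.blocks: dict[int, bytes] = {}

    pending: list[tuple[int, bytes]] = []   # journal of unshipped writes

    def write_sync(primary: Volume, replica: Volume,
                   lba: int, data: bytes) -> None:
        """Acknowledge only after both copies commit -> RPO = 0."""
        primary.blocks[lba] = data
        replica.blocks[lba] = data          # remote write in the I/O path

    def write_async(primary: Volume, lba: int, data: bytes) -> None:
        """Acknowledge immediately, ship later -> RPO = unshipped writes."""
        primary.blocks[lba] = data
        pending.append((lba, data))

    def drain(replica: Volume) -> None:
        # Background shipping; whatever is still in `pending` is exactly
        # the data lost if the primary site fails now.
        while pending:
            lba, data = pending.pop(0)
            replica.blocks[lba] = data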

The pressure to support business continuity has motivated many storage vendors to shorten the RTO by offering more efficient data replication solutions that achieve the smallest possible RPO. Examples of highly efficient data replication solutions are host-based and disk-based mirroring.

For example, with the Veritas Volume Replicator® host-based mirroring solution, the host is responsible for duplicating data to the remote site. In the EMC Symmetrix Remote Data Facility® (SRDF) and Hitachi (HDS) TrueCopy® disk-based mirroring solutions, the storage controller is responsible for duplicating the data.

A disaster recovery solution should be selected based on how long the organization can wait for services to be restarted and, above all, how much data it can afford to lose after a failover. Should the business restart in a degraded mode, or must all services be fully available immediately after the switchover? Financial institutions usually require an RTO of less than one hour with an RPO of zero and no degraded mode, a fairly widespread practice that ensures no transactions are lost.


Business Continuity

Business resilience and disaster recovery are core capabilities required of the data center IT infrastructure.  The emergence of cloud computing has put a brighter spotlight on the need to ensure that a robust resilience strategy is in place down to the virtual machine.  This places new and greater demands on the network in order to ensure virtual machine mobility, security and availability while still maintaining the flexibility and agility of a cloud model.

One of the main concepts of cloud computing is to automatically and dynamically provide almost unlimited resources for a given service in a fully virtual environment. However, in order to provide those resources, the complete cloud computing architecture must be built with efficient tools and a sturdy solution design, not only at the compute stack level but also on the network and storage sides, in order to offer Business Continuity without any interruption.
