Since I wrote this article last spring discussing about future integrated DCI features, VXLAN EVPN Multi-site has become real. It is available since NX-OS 7.0(3)I7(1).
There is also a new White paper that discusses deeper about VXLAN EVPN Multi-site Design and deployment
Good read !
++++++++++++++++++++++ Updated Jan 2018 +++++++++++++++++++++++++
DCI is dead, long live to DCI
Some may find the title a bit strange, but, actually, it’s not 100% wrong. It just depends on what the acronym “DCI” stands for. And, actually, a new definition for DCI came out recently with VXLAN EVPN Multi-site, disrupting the way we used to interconnect multiple Data Centres together.
For many years, the DCI acronym has conventionally stood for Data Centre Interconnect.
The “Data Centre Interconnect” naming convention may essentially be used to describe solutions for interconnecting traditional-based DC network solutions, which have been used for many years. I am not denigrating any traditional DCI solutions per se, but the Data Centre networking is evolving very quickly from the traditional hierarchical network architecture to the emerging VXLAN-based fabric model gaining momentum in enterprise adopting it to optimize modern applications, computing resources, save costs and gain operational benefits. Consequently, if these independent DCI technologies may continue to be deployed for extending Layer 2 networks between traditional DC networks, they will tend to be replaced by modern solution allowing a smooth migration to VXLAN EVPN-based fabrics. Nonetheless, for the interconnection of modern VXLAN EVPN standalone (1) Fabrics, a new innovative solution called “VXLAN EVPN Multi-site” – which integrates in a single device the extension of the Layer 2 and Layer 3 services across multiple sites – has been created for a long life. As a consequence, DCI also stands for Data Centre Integration when interconnecting multiple fabrics using VXLAN EVPN Multi-site, thus this strange title.
(1) The term “standalone Fabric” describes a Network-based Fabric. The concept of Multi-Pod, Multi-Fabric and Multi-site discussed in this article doesn’t concerned the enhanced SDN-based Fabric, also known as ACI Fabric. If both rely on VXLAN as the network overlay transport within the Fabric, they both use different solutions for interconnecting multiple Fabrics together.
ACI already offers a more sophisticated and solid integration of Layer 2 and Layer 3 services with its Multi-Pod solution since ACI 2.0 as well as ACI Multi-site since ACI 3.0 (Q3CY2017).
A bit of history:
Cisco brought the MP-BGP EVPN based control-plane for VXLAN in beginning of 2015. After a long process of qualification and testing in 2015, this article has been elaborated to describe the approach of the VXLAN EVPN Multi-Pod. At the time, it offered the best design considerations and recommendations for interconnecting multiple VXLAN EVPN Pods. Afterwards, we ran a similar process for validating the architecture for interconnecting multiple independent VXLAN EVPN Fabrics, of which post 36 with its sub-sections has been published. This second set of technical information and design considerations has mainly been elaborated to clarify the network services and network design required for interconnecting thru an independent DCI technology, multiple VXLAN EVPN Fabrics in conjunction with the Layer 3 Anycast Gateway (1) functions.
(1) The Layer 3 Anycast gateway is distributed among all compute and service leaf nodes and across all VXLAN EVPN Fabrics offering the default gateway for any endpoints at the closest Top of Rack.
Although the sturdiness of this solution of VXLAN EVPN Multi-Fabric is fully reached, this methodical validation design demonstrates the critical software and hardware components required when interconnecting VXLAN EVPN Fabrics with the Layer 3 Anycast Gateway, with manual connectivity’s for independent and dedicated Layer 2 and Layer 3 DCI services, making the whole traditional connectivity a bit more complex than usual.
To summarise, as of August 2017, we had 2 choices when interconnecting multiple VXLAN EVPN Pods or Fabrics:
VXLAN EVPN Multi-PoD
The first option aims to create a single logical VXLAN EVPN Fabric in which multiple Pods are dispersed to different locations using a Layer 3 underlay to interconnect the Pods. Also known as stretched Fabric, this option is more commonly called VXLAN EVPN Multi-Pod Fabric. This model goes without a DCI solution, from a VXLAN overlay point of view, there is no demarcation between a Pod and another. The key benefit of this design is that the same architecture model can be used to interconnect multiple Pods within a building or campus area as well as across metropolitan areas.
If this solution is simple and easy to deploy, and thanks to the flexibility of VXLAN EVPN, it’s lack of sturdiness with the same Data-Plane and Control-Plane shared across all geographically dispersed Pod.
Although with the VXLAN Multi-Pod methodology, the Underlay and Overlay Control-Planes have been optimised – separating the Control-Planes for the Overlay network (iBGP ASN) using eBGP for the external site network – to offer a more robust End-to-End architecture than a basic stretched Fabric, the same VXLAN EVPN Fabric is overextended across different locations. Indeed, the reachability information for all VXLAN Tunnel endpoints (VTEPs) as well as that of endpoints is extended across all VXLAN EVPN Pod (throughout the whole VXLAN Domain).
If this VXLAN EVPN Multi-pod solution simplifies its deployment, the failure domain is kept extended accros all the PoD, and the scalability remains the same as it is in a single VXLAN fabric.
VXLAN EVPN Multi-Fabric
The second option when interconnecting multiple VXLAN EVPN Fabrics in a more solid fashion, is by inserting an autonomous DCI solution, extending Layer 2 and Layer 3 connectivity services between independent VXLAN Fabrics, while maintaining segmentation for the multi-tenancy purposes. Compared to the VXLAN EVPN Multi-Pod Fabric design, the VXLAN EVPN Multi-Fabric architecture offers a higher level of separation for each of the Data Centre Fabrics, from a Data-Plane and Control-Plane point of view. The reachability information for all VXLAN Tunnel endpoints (VTEPs) is contained within each single VXLAN EVPN Fabric. As a result, VXLAN tunnels are initiated and terminated inside each Data Centre Fabric (same location).
The standalone VXLAN EVPN Multi-Fabric model also offers higher scalability than the VXLAN Multi-Pod approach, supporting a larger total number of leaf nodes and endpoints across all Fabrics.
However, the sensitive aspect of this solution is that it requires complex and manual configurations for a dedicated Layer 2 and Layer 3 extension between each VXLAN EVPN Fabric. In addition, the original protocol responsible for Layer 2 and Layer 3 segmentations must handoff at the egress side of the border leaf nodes (respectively, VRF-Lite handoff and Dot1Q handoff). Last but not least, in addition to the dual-box design for dedicated Layer 2 and Layer 3 DCI services, endpoints cannot be locally attached to them.
Although the additional DCI layer has been a key added-value, maintaining the failure domain inside each Data Centre with a physical boundary between DC and DCI components, it has been a choice balanced between sturdiness and CAPEX, sometimes impacting the final choice to deploy a solid DCI solution.
On the same article part 5, Inbound and Outbound Path Optimisation solutions, that can be leveraged for different Data Centre deployments, have also been elaborated. And although VXLAN Multi-Pod and Multi-Fabric solutions may become soon “outdated” for interconnecting modern VXLAN-based Fabrics (see next section), the technical content of path optimisation is still valid for VXLAN Multi-Site or just when trying to better apprehend the full learning and distribution processes for the endpoint reachability information within a VXLAN EVPN Fabric. Hence, don’t ignore them yet, they may still be very useful for education purposes.
If this VXLAN EVPN Multi-fabric solution improves the strength of the end to end solution, there is no integration per se between each VXLAN fabric and the DCI, making its deployment complex to operate and manage.
VXLAN EVPN Multi-site
The above short refresh on the traditional solutions when interconnecting geographically dispersed modern DC Fabrics being said, and while some network vendors (hardware and software/virtual platforms) are trying to offer a suboptimal method of the options discussed previously, actually, these two traditional solutions are obsolete when interconnecting standalone VXLAN EVPN-based Fabrics. Indeed, a new architecture called VXLAN EVPN Multi-Site has become available since August 2017 with the NX-OS software release 7.0(3)I7(1). A related IETF draft is available for further details (https://tools.ietf.org/html/draft-sharma-Multi-Site-evpn) if needed.
For guidelines, please visit the Cisco Nexus 9000 Series NX-OS VXLAN Configuration Guide, Release 7.x
The concept of VXLAN EVPN Multi-Site integrates the Layer 2 and Layer 3 services so that they can be extended within the same box. This model relies on Layer 2 VNI stitching. A new function known as Border Gateway (BGW) is responsible for that integrated extension. As a result, an Intra-Site VXLAN tunnel terminates at a Border Gateway which, from the same VTEP, re-initiates a new VXLAN tunnel toward the remote sites (Inter-Sites). The extension of the L2 and L3 services are transported using the same protocol. This new innovation combines the simplicity and flexibility of the VXLAN Multi-Pod with the sturdiness of the VXLAN Multi-Fabric.
The role of Border Gateway is a unique function offered by the hardware at line rate the because of Cisco ASIC technology (Cloud Scale ASIC). Yet another feature that will never come with a whitebox with Merchant Silicon.
It is not mandatory that the non-Border Gateway network devices within the VXLAN EVPN Fabric support the function of Multi-Site, neither in hardware nor in software. As a result, any VXLAN EVPN Fabric based on the first-generation of N9k in production today will be able to deploy Multi-Site by simply adding the pair of required devices(1) that support the Border Gateway function. This will both integrate and extend the Layer 2 and Layer 3 services.
(1) VXLAN Multi-Site is supported on –EX and –FX Nexus 9k series platforms (ToR and Line-cards) as well as on the Nexus 9364C device.
Nevertheless, in addition, this new approach to the interconnection of multiple sites brings several added values.
VXLAN EVPN Multi-Site offers an “All-In-One” box solution with, if desired, the possibility to locally attach endpoints for other purposes (computes, firewalls, load balancers, WAN edge routers, etc).
For VXLAN EVPN Multi-PoD, in order to attach endpoints to the transit node, it is required that the hardware support the bud-node mode. A bud node is a device that is a VXLAN VTEP device and at the same time an IP transit device for the same multicast group used for VXLAN VNIs.
For VXLAN EVPN Multi-Fabric, a pair of additional boxes must be dedicated for DCI purposes only, with no SVI and no attached endpoint.
This new integrated solution offers a full separation of Data-Plane and Control-Plane per site, maintaining the failure domain to its smallest diameter.
VXLAN EVPN Multi-site overview
VXLAN EVPN Multi-site Border Gateway
The stitching of the Layer 2 and Layer 3 VNIs together (intra-Fabric tunnel ⇔ inter-Fabric tunnel) happens in the same Border node (Leaf or Spine). This Border node role is known as Border Gateway or BGW.
Multiple designs for VXLAN EVPN Multi-site
The approach of VXLAN EVPN Multi-Site is to offer a level of simplicity and flexibility never-before-reached. Indeed, the Border Gateway function can be initiated from existing vPC-based Border leaf nodes, with a maximum of the two usual vPC peer devices, offering seamless migration from traditional to Multi-Site architectures. This vPC-based option allows for locally attached dual-homed endpoints at Layer 2.
However, it is also possible to deploy up to four independent border leaf nodes per site, relying on BGP delivering the function of Border Gateway in a cluster fashion, all active with the same VIP address. With this BGP-based option, Layer 3 services (FW, router, SLB, etc..) can be leveraged on the same switches.
Finally, it is also possible to initiate the Multi-Site function with its Border Gateway role directly from the Spine layer (up to 4 Spine nodes), as depicted in Figure 4.
When the Border Gateway is deployed, however the cluster is built – with 2 or 4 nodes – it will share the same VIP (Virtual IP address) across the associated members. The same Border Gateway VIP is used for the virtual networks initiated inside the VXLAN Fabric as well as for the VXLAN tunnels established with the remote sites.
The architecture of VXLAN EVPN Multi-Site is transport agnostic. It can be established between two sites connected with direct fibre links in a back-to-back fashion, or across an external native Layer 3 WAN/MAN or Layer 3 VPN Core (MPLS VPN), interconnecting multiple Data Centres across any distances. This new solution offers higher reliability and resiliency for the end-to-end design with all redundant devices delivering fast convergence in case of link or device failure.
From an operational, admin and network management point of view, VXLAN OAM tools can be leveraged to understand the physical path taken by the VXLAN tunnel, inside and between the Fabrics, to provide information on path reachability; Overlay to Underlay correlation; interface load and error statistic in the path; latency, etc. – a data lake of information which is usually “hidden” in a traditional network overlay deployment.
VXLAN EVPN Multi-site – Layer 2 Tunnel Stitching
Because the VNI to VNI stitching happens in the same VTEP, there is no requirement to handoff the VRF-Lite, nor the Dot1Q frame. Hence, theoretically, Multi-Site allows for going beyond the 4K Layer 2 segments imposed by traditional DCI solution.
As depicted in Figure 6, the border leaf offers multiple functions, including local endpoint attachment and Border Gateway for integrated Layer 2 and Layer 3 service extension.
vPC must be used for any locally dual-homed endpoints at layer 2. This picture depicts a BGP-based deployment of the Border Gateways with Layer 3 connections toward an external Layer 3 network.
VXLAN EVPN Multi-site with Legacy Data Centers
The VXLAN EVPN Multi-Site solution is not limited to interconnecting modern Greenfield Data Centres. It offers two additional flavours for seamless network connectivity with Brownfield Data Centres.
The first use-case is with a legacy DC alone devoid of any DCI service. For this option, if the enterprise wants to build two new Greenfield DCs and keep the original one, with all sites interconnected together, it is possible to deploy a pair of Border Gateways, dual-homed toward the aggregation layer. This function of Border Gateway mapping Dot1Q to VXLAN is also known as Pseudo BGW. The VLANs to be extended are extended using VXLAN EVPN toward the brand-new VXLAN Multi-Site, as shown in Figure 7.
The second use-case concerns traditional Data Centre networks to interconnect, however these DC are planning to migrate smoothly to VXLAN EVPN fabric. With that migration in mind, it is possible to leverage the Pseudo BGW fonction, leveraging VXLAN EVPN Multi-Site as shown in Figure 8.
Figure 8 illustrates two legacy Data Centre infrastructures interconnected using the Pseudo BGW feature. This is an alternative to OTV when there is a migration planned to a VXLAN EVPN-based fabric. As a result, this option can be leveraged a pair of Pseudo-BGWs inserted in each legacy site to extend Layer-2 and Layer-3 connectivity between sites that will be reused afterward for full Multi-site functions. It offers a slowly phase out the legacy networks replacing them with VXLAN EVPN fabrics. It allows integration with existing legacy default gateway (HSRP, VRRP, etc.) with a smooth migration for a more efficient and integrated Layer 3 Anycast Gateway offering the default gateways for all Endpoints.
Finally this option offers a more solid and integrated solutions than a simple VXLAN EVPN-based solution, with features natively embedded with the Border Gateway functions in order to improve the sturdiness from end-to-end, such as Split Horizon, Designated Forwarded for BUM traffic, Traffic Control of BUM (Disable or Rate-based), Control Plane with Selective Advertisement for L2 & L3 segments.
VXLAN EVPN Multi-site with outside Layer 3 connectivity
Connecting the VXLAN EVPN fabric in conjunction with Multi-site to the outside of the world.
Every DC fabric needs external connectivity to the campus or core network (WAN/MAN).
The VXLAN EVPN fabric can be connected across Layer 3 boundaries using MPLS L3VPN, Virtual Routing and Forwarding (VRF) IP routing (VRF-lite), or Locator/ID Separation Protocol (LISP).
Integration can occur for the Layer 3 segmentation toward a WAN/MAN, native or segmented Layer 3 network. If required, each Tenant from the VXLAN EVPN Fabric can connect to a specific L3 VPN network, like achieved previously with the Multi-Pod and Multi-Fabric models. In this case, in addition to the Multi-Site service, either VRF-Lite are handed off at the Border Leaf nodes to be bound with their external segmentations while the L3 segmentation is also maintained with the remote Fabric, as illustrated in Figure 9.
In deployments with a very large number of extended VRF instances, the number of eBGP sessions or IGP neighbors can cause scalability problems and configuration complexity and usually require a two-box solution.
It is now possible to merge the Layer 3 segmentation with the external Layer 3 connectivity (Provider Edge Router) with a new single-device solution.
Figure 10 depicts a full integration of the Layer 3 segmentation provided by the VXLAN EVPN fabric, extended for external connectivity from the BGW toward an MPLS L3VPN network (or any type of Layer 3 network, including LISP). This model requires that the WAN edge device is Border Provider Edge (BorderPE) feature capable. This feature allow to terminate the VXLAN tunnel (VTEP) in order the stitch the L3 VNI from the VXLAN fabric to a Layer3 VPN network. This feature is supported with the Nexus 7k/M3, ASR1k, and ASR9k.
VXLAN EVPN Multi-site and BUM traffic
Finally, the technology used to replicate BUM traffic is independent per site. One site can run PIM ASM Multicast to replicate the BUM traffic inside the same VXLAN domain regardless what the replication mode used on other sites. The BUM traffic is carried across the Multi-Site overlay using Ingress Replication, while other sites can run whatever is desired by the Network team, Multicast or Ingress Replication. Although Ingress Replication allows BUM transport across any WAN network, VXLAN Multi-Site will support Multicast replication later in a next software release.
To prevent any risks of loop with BUM traffic, a Designated-Forwarder is elected dynamically on a per Layer 2 segment basis. The DF uses its Physical IP address as a unique destination from the BGW cluster.
VXLAN EVPN Multi-site and Rate Limiters
The decapsulation and encapsulation of the respective tunnels happen inside the same VTEP. This method allows for the enabling of traffic policers per tenant, controlling the rate of Storm independently for Broadcast, Unknown Unicast or Multicast traffic, or other policers limiting the bandwidth per tunnel. Therefore, failure containment is improved through the granular control of BUM traffic, protecting remote sites from any broadcast storms.
DC Interconnection versus DC Integration
With VXLAN EVPN Multi-Site coming shortly, we will consider 3 main use cases for interconnecting multiple DC network:
- Interconnecting Legacy to Legacy DC
For Brownfield to Brownfield DC Interconnection, if OTV previously existed for DC interconnection requirements, then, it’s likely to continue using the same DCI approach. There is no technical reason to change to a different DCI solution if the OTV Edge devices are already available. Indeed, OTV offers an easy way to add new DC into its Layer 2 Overlay network.
If a new project to interconnect multiple Brownfield DCs (Classical Ethernet-based DC), it is worth to understand if there is a requirement to migrate in the near future the existing legacy network to a modern fabric infrastructure, as depicted with Figure 8.
Nonetheless, we never want to extend all VLANs from site to site, hence, even if not often elaborated in the DCI documentation, it is important to take the routed traffic inter-site into consideration. The new function of Border Gateway integrated with VXLAN EVPN Multi-site offers a great alternative for interconnecting Data Centre for Layer 2 and Layer 3 requirements, maintaining the in-dependency network transport of each DC, while reducing the failure domain to a single site with a built-in very granular Storm controller for Broadcast, Unknown Unicast or Multicast traffic.
Although OTV 2.5 relies on VXLAN as the encapsulation method, VXLAN continues to evolve with the necessary embedded intelligence leveraging the open standards MP-BGP EVPN control plane to provide a viable, efficient and solid DCI solution comparable with OTV. VXLAN EVPN Multi-site architecture represents this evolution. It offers a single integrated solution for interconnecting multiple traditional Data Centres with embedded and distributed Layer 3 Anycast gateway from the same Border Gateway devices, extending the classical Ethernet frame and routed packets via its Overlay network.
- Interconnecting Legacy DC to modern Fabric
For the second, as depicted in Figure 13, VXLAN EVPN Multi-site can be leveraged to interconnect the Brownfield DC by adding a new pair of N93xx-EX (or FX) series running VXLAN Multi-site with the Pseudo Border Gateway function for controlling the VLAN to VNI transport. On the remote site, the Greenfield DC leverages the Multi-site feature that integrates the Layer 2 and Layer 3 services toward the Brownfield DC, all functions offered in a single-box model.
- Interconnecting modern Fabrics
For Standalone VXLAN EVPN fabrics, the solution is VXLAN EVPN Multi-Site. There is no technical reason to add a dedicated DCI solution inside a modern Fabrics when Layer 2 and Layer 3 services are integrated in the same Border leaf or Border Spine layer, reducing the failure domain to its smallest diameter, that is to say, the link between the host and its Top of Rack switch.
Naming Convention: Pod, Multi-Pod, Multi-Fabric, Multi-Site
The same naming convention for Pod, Multi-Pod, Multi-Fabric and Multi-Site has been elaborated for both VXLAN EVPN Standalone and ACI.
Nonetheless, it is important to specify that the three models discussed in the previous sections differ when comparing VXLAN Standalone with ACI Fabric.
In a VXLAN context deployment, a Pod is seen as a “Fault Domain” area.
The Pod is usually represented by a Leaf-Spine Network Fabric sharing a common Control-plane (MP-BGP) across its whole physical connectivity plant.
A Fabric represents the VXLAN Domain deployed in the same location (same building or Campus) or Geographically Dispersed (Metro distances) sharing a common Control-plane and Data-plane. A single or multiple Pods form therefore a single VXLAN fabric or VXLAN domain.
In VXLAN Stand-alone the Multi-Pod architecture is organized with multiple PoD sharing a common Control-plane (MP-BGP) and Data-Plane across the whole physical connectivity plant across the same VXLAN Domain. The VXLAN Multi-Pod represents a single Availability Zones (VXLAN Fabric) within a Single Region – it is strongly recommended to limit the distances for a single Availability Zone within a Metro area, but technically nothing prevent to deploy a VXLAN Multi-Pod across intercontinental distances. The same VXLAN domain is stretched across multiple locations. The term of Multi-Pod differs from the VXLAN EVPN stretched fabric because the strength of the whole solution has been improved as discussed above.
ACI already offers a more sophisticated and solid integration of Layer 2 and Layer 3 services with its Multi-Pod solution since CY2016.
VXLAN Multi-Fabric deployment offers Multiple Availability Zone (Multiple VXLAN Fabrics) designs, one per VXLAN Domain managed by its own isolated Control-plane with an independent Layer 2 and Layer 3 DCI services. There is no concept of Regions per se, as each VXLAN Fabric runs autonomously.
VXLAN Multi-Site offers the same concept as Multi-Fabric with independent VXLAN Fabric but with the extended Layer 2 and Layer 3 services integrated and fully controlled. As a result, Multiple autonomous VXLAN domains with integrated L2 & L3 services form the VXLAN Region.