VXLAN Multipod geographically dispersed
VXLAN Multipod Overview
Note: Since I wrote this article many years ago, VXLAN EVPN and the ASICs have evolved. You should consider this article obsolete and read post 37 (http://yves-louis.com/DCI/?p=1588), which explains why this solution is no longer recommended. So why is this post still here? The reason is that it explains some packet walks that the reader may find helpful to understand the differences between VXLAN EVPN Multi-pod and VXLAN EVPN Multi-site.
This article focuses on the single VXLAN Multipod Fabric stretched across multiple locations as mentioned in the previous post 31 through the 1st option.
Over a couple of months, I have been working with my friends Patrice and Max on building an efficient and resilient solution based on a VXLAN Multipod fabric stretched across two sites. The full technical white paper is now available for deeper technical details, including the insertion of firewalls, multi-tenancy VXLAN routing and some additional tuning, and it can be accessed from here
One of the key use cases for that scenario is an enterprise selecting VXLAN EVPN as the technology of choice for building multiple greenfield fabric PoDs. It therefore becomes logical to extend the VXLAN overlay between distant PoDs that are managed and operated as a single administrative domain. Its deployment is just a continuation of the work performed to roll out the fabric PoDs, simplifying the provisioning of end-to-end Layer 2 and Layer 3 connectivity.
Technically speaking, and thanks to the flexibility of VXLAN, we could deploy the overlay network on top of almost any form of Layer 3 architecture within a datacenter (CLOS, Multi-layered, Flat, Hub&Spoke, Ring, etc.), as long as VTEP to VTEP communication is always available, with a CLOS (Spine & Leaf model) architecture obviously being the most efficient.
- [Q] However, can we afford to stretch the VXLAN fabric across distant locations as a single fabric, without taking into consideration the risk of losing all the resources in case of a failure?
- [A] The answer mainly depends on the stability and quality of the fiber links established between the distant sites. In that context, it is worth remembering that most DWDM managed services offer an SLA equal to or lower than 99.7%. Another important element to take into consideration is how independent the control plane can be from site to site, reducing the domino effect caused by a breakdown in one PoD.
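To put that SLA figure in perspective, the short calculation below converts an availability percentage into the yearly downtime it still permits. It is only a back-of-the-envelope sketch: the function names are illustrative, a 365-day year is assumed, and the redundant-link figure assumes that the links fail independently, which is rarely fully true for fibers sharing part of a path.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def yearly_downtime_hours(sla_percent: float) -> float:
    """Maximum yearly downtime (in hours) still compliant with an availability SLA."""
    return (100.0 - sla_percent) / 100.0 * HOURS_PER_YEAR


def parallel_availability_percent(sla_percent: float, links: int = 2) -> float:
    """Availability of N redundant links, ASSUMING independent failures."""
    unavailability = (100.0 - sla_percent) / 100.0
    return (1.0 - unavailability ** links) * 100.0


if __name__ == "__main__":
    print(f"{yearly_downtime_hours(99.7):.2f} hours of downtime per year")
    print(f"{parallel_availability_percent(99.7):.4f} % with two independent links")
```

A single link at 99.7% can therefore be down for roughly 26 hours per year while still meeting its SLA, which is why protected optical services and fast failure detection matter so much in this design.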
Anyhow, we need to understand how to protect the whole end-to-end Multipod fabric to avoid propagating any disastrous situation in case of a major failure.
Different models are available for extending multiple PoDs across a single logical VXLAN fabric. Each option depends on how the data plane and routing control plane are stretched or distributed across the different sites to separate the routing functions, as well as on the placement of the transit Layer 3 nodes used to interconnect the different PoDs.
From a physical approach, the transit node connecting the WAN can be initiated using dedicated Layer 3 devices, or leveraged from the existing leaf devices or from the spine layer. Diverse approaches exist to disperse the VXLAN EVPN PoDs across long distances.
We can’t say there is one single approach to geographically disperse the fabric PoDs among different locations. The preferred choice should be made according to the physical infrastructure, including distances and physical transport technology, the enterprise business and service level agreements, and the application criticality levels, weighing the pros and cons of each design.
What we should take into consideration is the following (a non-exhaustive list):
- For VXLAN Multipod purposes, it is important that the fabric learns the endpoints from both sites using a control plane instead of the basic flood & learn.
- The underlay network intra-fabric and inter-fabrics is a pure Layer 3 network used to handle the overlay tunnel transparently.
- The connectivity between sites is initiated from a pair of transit Layer 3 nodes. The transit device used to interconnect the distant fabrics is a pure Layer 3 routing device.
- This transit layer 3 edge function can be also initiated from a pair of leaf nodes used to attach endpoints or a pair of border spine nodes used for layer 3 purposes only.
- Actually, the number of transit devices used for the layer 3 interconnection is not limited to a pair of devices.
- The VXLAN Multipod solution recommends interconnecting the distant fabrics using direct fiber or DWDM (metropolitan-area distances). Technically, however, nothing prevents the use of inter-continental distances (Layer 3), as the VXLAN data plane and the EVPN control plane are not sensitive to latency. The applications are, though, so this is not recommended.
- For geographically dispersed PoDs, it is recommended that the control planes (underlay and overlay) are deployed in a more structured fashion in order to keep them as independent as possible from each location.
- Redundant and resilient MAN/WAN access with fast convergence is a critical requirement, thus network services such as DWDM in protected mode and with remote port shutdown, and BFD should be enabled.
- Storm Control on the geographical links is strongly recommended to reduce the fate sharing between PoDs.
- BPDU Guard must be enabled by default on the edge interfaces to protect against Layer 2 backdoor scenarios.
A quick reminder on the VXLAN Tunnel Endpoint:
VXLAN uses VTEP (VXLAN tunnel endpoint) services to map a particular dot1Q Layer 2 frame (VLAN ID) to a VXLAN segment tunnel (VNI). Each VTEP is connected on one side to the classic Ethernet segment where endpoints are deployed, and on the other side to the underlay Layer 3 network (Fabric) to establish the VXLAN tunnels with other remote VTEPs.
The VTEP performs the following two tasks:
- Receives traffic from locally connected endpoints and encapsulates it into VXLAN packets destined for remote VTEP nodes
- Receives VXLAN traffic originating from remote VTEP nodes, decapsulates it, and forwards it to locally connected endpoints
The VTEP dynamically learns the destination information for VXLAN encapsulated traffic for remote endpoints connected to the fabric by using the MP-BGP EVPN control protocol.
With the information received from the MP-BGP EVPN control plane and used to populate the local forwarding tables, the ingress VTEP, to which the source endpoint is attached, encapsulates the original Ethernet traffic with a VXLAN header and sends it over the Layer 3 network toward the egress VTEP, behind which the destination endpoint sits. The latter then decapsulates the Layer 3 packet to present the original Layer 2 frame to its final destination endpoint.
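The encapsulation step described above can be sketched in a few lines. The 8-byte header layout (a flags byte with the VNI-valid bit, 24 reserved bits, a 24-bit VNI, 8 reserved bits) and UDP port 4789 come from the VXLAN specification (RFC 7348); the VLAN-to-VNI values and the function names are hypothetical, and the outer Ethernet/IP/UDP headers that a real VTEP adds are deliberately left out.

```python
import struct
from typing import Tuple

VXLAN_UDP_PORT = 4789        # IANA-assigned destination UDP port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field carries a valid value

# Hypothetical VLAN-to-VNI mapping configured on one VTEP
VLAN_TO_VNI = {10: 30010, 20: 30020}


def build_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits); word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(vlan_id: int, inner_frame: bytes) -> bytes:
    """Ingress VTEP side: map the VLAN to its VNI and prepend the VXLAN header.

    The outer Ethernet/IP/UDP headers are omitted in this sketch.
    """
    return build_vxlan_header(VLAN_TO_VNI[vlan_id]) + inner_frame


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Egress VTEP side: return (vni, original Layer 2 frame)."""
    if len(payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    flags_word, vni_word = struct.unpack("!II", payload[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8, payload[8:]


if __name__ == "__main__":
    frame = b"original-ethernet-frame"
    vni, inner = decapsulate(encapsulate(10, frame))
    print(vni, inner == frame)
```

The egress VTEP uses the recovered VNI to select the local VLAN before handing the untouched inner frame to the destination endpoint.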
From a data-plane perspective, the VXLAN Multipod system behaves as a single VXLAN fabric network. VXLAN encapsulation is performed end-to-end across PoDs and from leaf nodes. The spines and the transit leaf switches perform only Layer 3 routing of VXLAN encapsulated frames to help ensure proper delivery to the destination VTEP.
Multiple Designs for Multipod deployments
PoDs are usually connected with point-to-point fiber links when they are deployed within the same physical location. The interconnection of PoDs dispersed across different locations uses direct dark-fiber connections (short distances) or DWDM to extend Layer 2 and Layer 3 connectivity end-to-end across locations (Metro to long distances).
Somehow, there is no real constraint with the physical deployment of VXLAN per se as it’s an overlay virtual network running on top of a Layer 3 underlay, as long as the egress VTEPs are all reachable.
As a result, different topologies can be deployed (or even coexist) for the VXLAN Multipod design.
Without providing an exhaustive list of all possible designs, the 1st main option is to attach dedicated transit Layer 3 nodes toward the Spine devices. They are usually called Transit leaf nodes as they often sit at the same Leaf layer and are also often used as computing Leaf nodes (where endpoints are locally attached), or as service leaf nodes (where firewall, security and network services are locally attached), besides extending Layer 3 underlay connectivity toward the remote sites.
Alternatively, nothing prevents interconnecting distant PoDs through the spine layer devices, as no additional function or VTEP is required from the transit spine other than the capability to route VXLAN traffic between leaf switches deployed in separate PoDs. A Layer 3 core layer can also be leveraged for interconnecting remote PoDs.
The VXLAN Multipod design is usually positioned to interconnect data center fabrics that are located at metropolitan-area distances, even though technically, there is no distance limit. Nevertheless, it is crucial that the quality of the DWDM links is optimal. Consequently, it is necessary to lease the optical services from the provider in protected mode and with the remote port shutdown feature, allowing immediate detection of optical link down from end-to-end. Since the network underlay is layer 3 and can be extended across long distances, it may be worth checking the continuity of the VXLAN tunnel established between PoDs. BFD can be leveraged for that specific purpose.
Independent Control Planes
When deploying VXLAN EVPN we need to consider 3 independent layers:
- The layer 2 overlay data plane
- The layer 3 underlay control plane
- The layer 3 overlay control plane
Data Plane: The overlay data plane is the extension of the layer 2 segments established on the top of the layer 3 underlay. It is initiated from VTEP to VTEP, usually enabled at the leaf node layer. In the context of geographically dispersed Multipod, VXLAN tunnels are established between leaf devices belonging to separate sites and within each site as if it was one single large fabric (hence the naming convention of Multipod).
Underlay Control Plane: The underlay control plane is used to exchange reachability information for the VTEP IP addresses. In the most common VXLAN deployments, an IGP (OSPF, IS-IS, or EIGRP) is used for that purpose, although it is not the only option: BGP can also be used. The choice of underlay protocol mainly depends on the enterprise's experience and requirements.
Overlay Control Plane: MP-BGP EVPN is the control plane used in VXLAN deployments for the host reachability and distribution information across the whole Multipod fabric. It is used by the VTEP devices to exchange tenant-specific routing information for endpoints connected to the VXLAN EVPN fabric and to distribute inside the fabric IP prefixes representing external networks. In the context of Multipod dispersed across different rooms (same building or same campus), host reachability information is exchanged between VTEP devices belonging to separate locations as if it was one single fabric.
The design above represents the extension of the 3 main components required to build a VXLAN EVPN Multipod fabric between different rooms. This is a basic architecture using direct fibers between two pairs of transit leaf nodes deployed in each physical Fabric (PoD-1 & PoD-2). The whole architecture offers a single stretched VXLAN EVPN fabric with the same data plane and control plane deployed end-to-end. The remote host population is discovered and updated dynamically on each PoD to form a single host reachability information database.
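As a rough mental model of that single host reachability database, the sketch below keeps one table keyed by (VNI, MAC) and returns the egress VTEP for a destination. The class names, addresses and VTEP IPs are hypothetical, and the model is deliberately simplified: a real MP-BGP EVPN implementation carries this information in BGP routes and handles host moves with sequence numbers, which are ignored here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class HostRoute:
    mac: str
    ip: str
    vni: int
    next_hop_vtep: str  # IP address of the VTEP behind which the host sits


class HostReachabilityTable:
    """Simplified view of the host routes a VTEP learns from the control plane."""

    def __init__(self) -> None:
        self._routes: Dict[Tuple[int, str], HostRoute] = {}

    def learn(self, route: HostRoute) -> None:
        # A newer advertisement for the same (VNI, MAC) replaces the old one,
        # which is how a host move between PoDs shows up in this sketch.
        self._routes[(route.vni, route.mac)] = route

    def withdraw(self, vni: int, mac: str) -> None:
        self._routes.pop((vni, mac), None)

    def egress_vtep(self, vni: int, mac: str) -> Optional[str]:
        route = self._routes.get((vni, mac))
        return route.next_hop_vtep if route else None


if __name__ == "__main__":
    table = HostReachabilityTable()
    # Host first seen behind a VTEP in PoD-1 (hypothetical addresses)
    table.learn(HostRoute("00:00:5e:00:53:01", "192.0.2.10", 30010, "10.1.1.1"))
    # Same host later advertised from a VTEP in PoD-2
    table.learn(HostRoute("00:00:5e:00:53:01", "192.0.2.10", 30010, "10.2.1.1"))
    print(table.egress_vtep(30010, "00:00:5e:00:53:01"))
```

Because every VTEP in every PoD ends up with the same view, a host moving between sites only changes the next-hop VTEP of its entry; this shared state is also what makes the stretched fabric behave as one failure domain.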
That design can be considered a safe network architecture when deployed within the same building or even within the same campus. For long distances (metropolitan-area distances), however, the priority must be to improve the sturdiness of the whole solution by creating independent control planes.
The approach below shows that the data plane is still stretched across remote VTEPs. The VXLAN data plane encapsulation extends the virtual overlay network end-to-end. This implies the creation of VXLAN ‘tunnels’ between VTEP devices that sit in separate sites.
However the underlay and overlay control planes are independent.
Underlay Control Plane: Traditionally in enterprise data centers, OSPF is the IGP of choice for the underlay network. In the context of geographically distant PoDs, each PoD is organised in a separate IGP area, with Area 0 typically deployed for the interpod links. That helps reduce the effects of interpod link bouncing, usually due to a “lower” quality of inter-site connectivity compared to direct fiber links.
However, in order to improve the underlay separation between the different PoDs, BGP can also be considered as an alternative to IGP for the transit routing Inter-sites links.
Overlay Control Plane: In regard to the MP-BGP EVPN overlay control plane, the recommendation is to deploy each PoD in a separate MP-iBGP autonomous system (AS) interconnected through MP-eBGP sessions. When compared to the single Autonomous System model, using MP-eBGP EVPN sessions between distant PoDs simplifies interpod protocol peering.
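The split between intra-PoD and inter-PoD overlay sessions can be illustrated with a tiny model: two BGP speakers in the same autonomous system form an iBGP session, while speakers in different autonomous systems form an eBGP session. The AS numbers (taken from the private range), the node names and the list of peerings are all hypothetical and only serve the illustration; they do not reflect a recommended peering layout.

```python
from typing import Dict, List, Tuple

# Hypothetical node-to-AS assignment: one private AS per PoD
NODE_AS: Dict[str, int] = {
    "pod1-spine1": 65001,
    "pod1-leaf1": 65001,
    "pod1-transit1": 65001,
    "pod2-spine1": 65002,
    "pod2-leaf1": 65002,
    "pod2-transit1": 65002,
}


def session_type(local_node: str, remote_node: str) -> str:
    """Return 'MP-iBGP' when both nodes share an AS, otherwise 'MP-eBGP'."""
    same_as = NODE_AS[local_node] == NODE_AS[remote_node]
    return "MP-iBGP" if same_as else "MP-eBGP"


def classify(peerings: List[Tuple[str, str]]) -> Dict[Tuple[str, str], str]:
    """Label every planned EVPN peering with its session type."""
    return {pair: session_type(*pair) for pair in peerings}


if __name__ == "__main__":
    planned = [
        ("pod1-leaf1", "pod1-spine1"),       # inside PoD-1
        ("pod2-leaf1", "pod2-spine1"),       # inside PoD-2
        ("pod1-transit1", "pod2-transit1"),  # between PoDs
    ]
    for pair, kind in classify(planned).items():
        print(pair, kind)
```

Only the sessions that cross the inter-site links are of the eBGP kind, so each PoD keeps its own internal BGP domain and a problem inside one AS is less likely to ripple into the other.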
Improving simplicity and resiliency
In a typical VXLAN fabric, leaf nodes connect to all the spine nodes, but no leaf-to-leaf or spine-to-spine connections are usually required. The only exception is the vPC peer link connection required between two leaf nodes configured as part of a common vPC domain. When Leaf devices are paired using vPC, this allows the locally attached endpoints to be dual-homed and take advantage of the Anycast VTEP functionality (same VTEP address used by both leaf nodes part of the same vPC domain).
It is not required to dedicate routing devices to the Layer 3 transit function between sites. A pair of leaf nodes (usually named “Border leaf nodes” for their role connecting the fabric to the outside) can be leveraged for the function of “transit leaf nodes”, simplifying the deployment. This coexistence can also be enabled on any pair of computing or service leaf nodes. It doesn’t require any specific function dedicated to the transit role, except that the hardware must support the Bud Node design (well supported on the Nexus series).
The vPC configuration should follow the common vPC best practices. However, it is also important to enable the “peer-gateway” functionality. In addition, depending on the deployment scenario, the topology, and the number of devices and subnets, additional IGP tuning might become necessary to improve convergence. This relies on standard IGP configuration. Please feel free to read the white paper for further details.
IP Multicast for VXLAN Multipod geographically dispersed
One task of the underlay network is to transport Layer 2 multidestination traffic between endpoints connected to the same logical Layer 2 broadcast domain in the overlay network. This type of traffic concerns Layer 2 broadcast, unknown unicast, and multicast traffic (aka BUM).
Two approaches can be used to allow transmission of BUM traffic across the VXLAN fabric:
- Use IP multicast deployment in the underlay network to leverage the replication capabilities of the fabric spines delivering traffic to all the edge VTEP devices.
- If multicast is not an option, it is possible to use the ingress replication capabilities of the VTEP nodes to create multiple unicast copies of the BUM workflow to be sent to each remote VTEP device.
It is important to note that the choice made for the replication process (IP Multicast or Ingress Replication) applies to the whole Multipod deployment, as it acts as a single fabric. IP Multicast is more likely to be enabled for Metro distances, as the Layer 3 over the DWDM links is usually managed by the enterprise itself. However, for Layer 3 managed services (WAN), the provider may well not offer IP Multicast at all, or may offer only a limited number of multicast groups. In that case, Ingress Replication has to be deployed end-to-end across the VXLAN Multipod. Note that choosing Ingress Replication may have an impact on scalability, depending on the total number of leaf nodes (VTEPs). It is worth checking this first.
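The scalability concern with Ingress Replication can be estimated with simple arithmetic: the source VTEP sends one unicast copy of each BUM frame per remote VTEP, whereas with IP Multicast it sends a single copy and the underlay replicates it. The sketch below is an approximation with hypothetical function names; it assumes every VTEP participates in the Layer 2 segment and counts a vPC pair sharing an anycast VTEP address as one VTEP.

```python
def copies_sent_by_source(total_vteps: int, use_multicast: bool) -> int:
    """Number of copies of one BUM frame that the ingress VTEP itself transmits."""
    if total_vteps < 2:
        return 0  # no remote VTEP to reach
    return 1 if use_multicast else total_vteps - 1


def source_uplink_load_bps(bum_rate_bps: float, total_vteps: int,
                           use_multicast: bool) -> float:
    """Uplink bandwidth consumed on the ingress VTEP for a given BUM rate."""
    return bum_rate_bps * copies_sent_by_source(total_vteps, use_multicast)


if __name__ == "__main__":
    vteps = 40            # hypothetical total across both PoDs
    bum_rate = 10e6       # hypothetical 10 Mbit/s of BUM from one leaf
    for mode, mcast in (("ingress replication", False), ("IP multicast", True)):
        load = source_uplink_load_bps(bum_rate, vteps, mcast)
        print(f"{mode}: {load / 1e6:.0f} Mbit/s on the source VTEP uplinks")
```

With Ingress Replication the load on the source VTEP grows linearly with the number of VTEPs in the whole Multipod fabric, and part of those copies also crosses the inter-site links.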
If IP Multicast is the preferred choice, then two approaches can be made:
- Either deploy the same PIM domain with anycast RP across the two distant PoDs if you want to simplify the configuration. That is indeed a viable solution.
- Or, better and recommended, keep the PIM-SM domain independently enabled on each site and leverage MSDP for the peering establishment between rendezvous points across the underlying routed network.
Improving the Sturdiness of the VXLAN Multipod Design
Until functions preventing the creation of Layer 2 loops are embedded in the VXLAN standard, alternative short-term solutions are necessary. The suggested approach consists of the following two processes:
- Prevention of End-to-End Layer 2 Loops: Use edge port protection features in conjunction with VXLAN EVPN to prevent the creation of Layer 2 loops. The best-known feature is BPDU Guard. BPDU Guard is a Layer 2 security tool that should be enabled on all the edge interfaces that connect to endpoints (hosts, firewalls, etc.). It prevents the creation of a Layer 2 loop by disabling the switch port after it receives a spanning-tree BPDU. BPDU Guard can be configured at the global level or the interface level.
- Mitigation of End-to-End Layer 2 Loops: A different mechanism is required to mitigate the effects of any form of broadcast storms. When multicast replication is used to handle the distribution of BUM, it is recommended to enable the storm-control function to rate-limit the received amount of BUM traffic carried via the IP Multicast group of interest. The rate limiter of multicast traffic must be initiated on the ingress interfaces of the transit leaf nodes, hence protecting all networks behind it.
While it is strongly recommended to enable Storm Control for multicast on the ingress transit interfaces, it is just as important to consider the current volume of multicast traffic produced by the enterprise applications across remote PoDs before setting a threshold value. Otherwise, the risk is to do more damage to the user applications than the broadcast storm the threshold is meant to protect against. As a result, the rate-limit value for BUM traffic should be greater than the application data multicast traffic (use the highest value as a reference) plus the percentage of permitted broadcast traffic.
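That sizing rule can be written down as a small helper. This is only a sketch of the arithmetic, with hypothetical names and example figures; the peak application multicast rate has to come from real measurements, and the resulting percentage is a lower bound to set the threshold above, not a recommended value.

```python
def minimum_storm_control_percent(peak_app_multicast_bps: float,
                                  link_speed_bps: float,
                                  permitted_broadcast_percent: float) -> float:
    """Lower bound for the storm-control level, as a percentage of link speed.

    The threshold should sit above the highest observed application multicast
    rate plus the share of broadcast traffic that is considered acceptable.
    """
    if link_speed_bps <= 0:
        raise ValueError("link speed must be positive")
    app_percent = peak_app_multicast_bps * 100 / link_speed_bps
    level = app_percent + permitted_broadcast_percent
    if level > 100:
        raise ValueError("required level exceeds the link capacity")
    return level


if __name__ == "__main__":
    # Hypothetical figures: 500 Mbit/s peak multicast on a 10 Gbit/s
    # inter-site link, with 1% of broadcast traffic tolerated.
    level = minimum_storm_control_percent(500_000_000, 10_000_000_000, 1.0)
    print(f"set the storm-control level above {level:.1f}% of the link")
```

Re-running the calculation whenever the application multicast profile changes helps keep the threshold from silently turning into a limiter on legitimate traffic.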
Optionally, another way to mitigate the effects of a Layer 2 loop inside a given PoD is to also enable storm-control multicast on the spine interfaces toward the leaf nodes. Here, even more than in the previous case, it is crucial to understand the normal application multicast utilisation rate within each PoD before applying a rate-limiting threshold value.
The deployment of a VXLAN Multipod fabric allows extending VXLAN overlay data plane in conjunction with its MP-BGP EVPN overlay control plane for Layer 2 and Layer 3 connectivity across multiple PoDs. Depending on the specific use case and requirements, those PoDs may represent different rooms in the same physical data center, or separate data center sites (usually deployed at Metropolitan-area distances from each other).
The deployment of the Multipod design is a logical choice to extend connectivity between fabrics that are managed and operated as a single administrative domain. However, it is important to consider that it behaves like a single VXLAN fabric. This characteristic has important implications for the overall scalability and resiliency of the design.
A true multisite solution with a valid DCI solution is still the best option for providing separate availability zones across data center locations.