One of the questions that many network managers are asking is “Can I use VxLAN stretched across different locations to interconnect two or more physical DCs and form a single logical DC fabric?”
The answer is that the current standard implementation of VxLAN was designed for an intra-DC fabric infrastructure and would require additional tools, as well as a control plane learning process, to fully address the DCI requirements. Consequently, as of today it is not considered a DCI solution.
To understand this statement, we first need to review the main requirements for a solid and efficient DC interconnect solution, then dissect the VxLAN workflow to see how it measures up against those needs. All of the following requirements for a valid DCI LAN extension have already been discussed in previous posts, so the list below serves as a brief reminder.
DCI LAN Extension requirements
- Failure domain must be contained within a single physical DC
- Leverage protocol control plane learning to suppress the unknown unicast flooding.
- Flooding of ARP requests must be reduced and controlled using rate limiting across the extended LAN.
- Generally speaking, rate limiters for the control plane and data plane must be available to control the broadcast frame rate sent outside the physical DC.
- The threshold must be set carefully with regard to the existing broadcast, unknown unicast and multicast traffic (BUM).
- Redundant paths distributed across multiple edge devices are required between sites, with all paths being active without creating any layer 2 loop
- A built-in loop prevention is a must have (e.g. OTV).
- Otherwise the tools and services available on the DCI platforms to address any form of Multi-EtherChannel split between 2 distinct devices (e.g. EEM, MC-LAG, ICCP, vPC, vPC+, VSS, nV Clustering, etc.) need to be activated.
- Independent control plane on each physical site
- Reduced STP domain confined inside each physical DC.
- STP topology change notifications should not impact the state of any remote link.
- Primary and secondary root bridges or multicast destination tree (e.g. FabricPath) must be contained within the same physical DC.
- A single active controller managing the virtual switches may lose access to the switches located at the remote site.
- No impact on the layer 2 protocol deployed on each site
- Any layer 2 protocols must be supported inside the DC (STP, MST, RSTP, FabricPath, TRILL, VxLAN, NVGRE, etc.) regardless of the protocols used for the DCI and within the remote sites.
- Remove or reduce the hairpinning workflow for long distances
- Layer 3 default gateway isolation such as FHRP isolation, anycast gateway or proxy gateway.
- Intelligent ingress path redirection such as LISP, IGP Assist, Route Health Injection, GSLB.
- Fast convergence
- Sub-second convergence for any common failure (link, interface, line card, supervisor, core).
- Any transport
- The DCI protocol must be transport-agnostic, meaning that it can be initiated on any type of link (dark fiber, xWDM, Sonet/SDH, Layer 3, MPLS, etc.).
In addition to the previous fundamentals, we can enable additional services and features to improve the DCI solution:
- ARP caching to reduce ARP broadcasting outside the physical DC.
- VLAN translation allowing the mapping of one VLAN ID to one or multiple VLAN IDs.
- Transport-agnostic core usage – thus enterprises and service providers are not required to change their existing backbone.
- Usage of IP multicast grouping is a plus, but should not be mandatory. The DCI solution must be able to support a layer 3 backbone without IP multicast, as an enterprise will not necessarily obtain IP multicast service from its provider (especially for a small number of sites to be layer 2 interconnected).
- Path diversity and traffic engineering to offer multiple routes or paths based on selected criteria (e.g. control plane versus data plane) such as IP based, VLAN-based, Flow-based.
- Removal or reduction of hairpinning workflows for metro distances (balancing between the complexity of the tool versus efficiency).
- Control plane learning versus data plane MAC learning.
- Multi-homing going beyond the two tightly coupled DCI edge devices.
- Site independence – the status of a DCI link should not rely on the remote site state.
- Load balancing: Active/Standby, VLAN-based or Flow-based
A Brief Overview of VxLAN
VxLAN is just another L2-over-L3 encapsulation model. It was designed for a new purpose: offering logical layer 2 segments to virtual machines while allowing layer 2 communication within the same VxLAN segment identifier (VNID) over an IP network. VxLAN offers several great benefits. Firstly, as just expressed, the capability to extend layer 2 segments pervasively over an IP network, allowing the same subnet to be stretched across different physical PoDs (Points of Delivery) and therefore preserving the IP scheme in use; a PoD usually delimits the layer 2 broadcast domain. Secondly, it offers the ability to create a huge number of small logical segments, well beyond the traditional dot1Q standard limit of 4K. This improves the agility needed for the cloud, supporting stateful mobility along with the native segmentation required by multi-tenancy.
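The jump in scale comes directly from the header fields: dot1Q carries a 12-bit VLAN ID, while the VxLAN header carries a 24-bit VNID. A trivial sketch of the two ID spaces:

```python
# Segment-ID space comparison: dot1Q (12-bit VLAN ID) versus the
# VxLAN header's 24-bit VNID field.
DOT1Q_VLAN_BITS = 12
VXLAN_VNID_BITS = 24

vlan_segments = 2 ** DOT1Q_VLAN_BITS    # 4,096 VLANs
vxlan_segments = 2 ** VXLAN_VNID_BITS   # 16,777,216 logical segments

print(vlan_segments, vxlan_segments)
```

That 4096x increase is what makes per-tenant micro-segmentation practical at cloud scale.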
The other very important aspect of VxLAN is that it is getting a lot of momentum across many vendors, software and hardware platforms, with enterprises closely monitoring its evolution as the potential new de-facto transport standard inside the new generation of data center network fabric, addressing the host mobility.
However, VxLAN doesn't natively support systems or network services connected to a traditional network (bare-metal servers and virtual machines attached to a VLAN, NAS, physical firewalls, IPS, physical SLBs, DNS, etc.); hence additional components such as the VxLAN VTEP layer 2 and layer 3 gateways are required to communicate beyond the VxLAN fabric.
Many papers and articles have been posted on the web regarding the concept and details of VxLAN, therefore I’m not going to dig deeper into it here. Nevertheless, let’s still review some important VxLAN workflows and behaviors pertinent to this topic, step by step to help address the DCI requirements listed above more efficiently.
Current Implementation of VxLAN
A virtual machine (or more generally speaking an end-system) within the VxLAN network sits in a logical L2 network and communicates with other distant machines using so-called VxLAN Tunnel End-Points (VTEP) enabled on the virtual or physical switch. The VTEP offers two interfaces, one connecting the layer 2 segment, which the virtual machine is attached to, and an IP interface to communicate over the routed layer 3 network with other VTEPs that bridge the same segment ID. The VTEPs encapsulate the original Ethernet traffic with a VxLAN header and send it over a layer 3 network toward the VTEP of interest, which then de-encapsulates the VxLAN header in order to present the original layer 2 packet to its final destination end-point.
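As a rough sketch of that encapsulation step, the 8-byte VxLAN header layout below follows RFC 7348; the outer UDP (port 4789)/IP/Ethernet wrapping toward the remote VTEP is omitted for brevity:

```python
import struct

def vxlan_encapsulate(inner_frame: bytes, vnid: int) -> bytes:
    """Prepend the 8-byte VxLAN header (RFC 7348) to an inner
    Ethernet frame. A real VTEP would then wrap the result in outer
    UDP/IP/Ethernet headers addressed to the remote VTEP."""
    flags = 0x08 << 24                             # I flag set: VNID field is valid
    header = struct.pack("!II", flags, vnid << 8)  # 24-bit VNID + 8 reserved bits
    return header + inner_frame

def vxlan_decapsulate(packet: bytes) -> tuple:
    """Strip the VxLAN header; return (vnid, original inner frame)."""
    flags, vni_field = struct.unpack("!II", packet[:8])
    assert flags >> 24 == 0x08, "VNID-valid flag not set"
    return vni_field >> 8, packet[8:]

# Round trip: what VTEP 1 sends is what VTEP 2 hands to the end-point.
vnid, frame = vxlan_decapsulate(vxlan_encapsulate(b"original l2 frame", 12345))
print(vnid)   # 12345
```

Note the 50 bytes of total overhead (outer Ethernet + IP + UDP + VxLAN header) this adds to every frame, which is why an increased MTU is usually required in the underlay.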
In short, VxLAN aims to extend a logical layer 2 segment over a routed IP network. Hence, while it remains a bad idea, it is not fundamentally surprising that many network managers are tempted to go a bit further and leverage the VxLAN protocol to stretch its segments across geographically distributed data centers.
VxLAN Implementation Models
Two types of overlay network-based VxLAN implementations exist: host-based and network-based.
Confining the Failure Domain to its Physical Location
As of today, both software- and hardware-based VxLAN implementations perform MAC learning based on traditional layer 2 operations: flooding and dynamic MAC address learning. Both offer the same transport and follow the same encapsulation model; however, dedicated ASIC-based encapsulation offers better performance as the number of VxLAN segments and VTEPs grows.
Each VTEP joins an IP multicast group according to its VxLAN segment ID (VNID), thus reducing the flooding to only the VTEP group concerned with the learning process. This IP multicast group is used to carry the VxLAN broadcast, the unknown unicast as well as the multicast traffic (BUM).
Let's walk (at a high level) through the learning process against our original DCI requirements. This walkthrough assumes the learning process starts from scratch.
- An end-system A sends an ARP request for another end-system B that belongs to the same VxLAN segment ID behind the VTEP 2.
- Its local VTEP 1 has no entry for the destination, so it encapsulates the original ARP request and sends it to the segment's IP multicast group, together with its identifiers (IP address and segment ID).
- All VTEPs that belong to the same IP multicast group receive the packet and forward it to their respective VxLAN segments (only if this is the same VNID). They learn the MAC address of the original end-system A as well as its association with its VTEP 1 (IP address). They update their corresponding mapping table.
- The remote end-system B learns the IP <=> MAC binding of the original requestor A and sends an ARP reply with its own MAC address.
- VTEP 2 receives the reply destined for end-system A. It knows the mapping between end-system A's MAC address and the IP address of VTEP 1, so it encapsulates the ARP reply and unicasts the packet to VTEP 1.
- VTEP 1 receives the layer 3 packet from VTEP 2 as a unicast packet, and learns the IP address of VTEP 2 as well as the mapping between end-system B's MAC address and the IP address of VTEP 2.
- VTEP 1 strips off the VxLAN encapsulation and sends the original ARP reply.
- All subsequent traffic between both end-systems A and B is forwarded using unicast.
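The steps above can be condensed into a toy flood-and-learn model; all names and addresses below are illustrative:

```python
class Vtep:
    """Toy model of a VTEP performing data-plane (flood-and-learn)
    MAC learning, as in the steps above."""

    def __init__(self, ip):
        self.ip = ip
        self.mac_table = {}   # inner MAC address -> remote VTEP IP

    def learn(self, src_mac, src_vtep_ip):
        self.mac_table[src_mac] = src_vtep_ip

    def forward(self, dst_mac):
        """Known destination -> unicast to its VTEP; unknown
        destination -> flood to the segment's IP multicast group."""
        return self.mac_table.get(dst_mac, "flood to mcast group")

vtep1, vtep2 = Vtep("10.0.0.1"), Vtep("10.0.0.2")

# A's flooded ARP request teaches every VTEP in the group: MAC A -> VTEP 1.
vtep2.learn("mac-A", vtep1.ip)
# B's unicast reply teaches VTEP 1: MAC B -> VTEP 2.
vtep1.learn("mac-B", vtep2.ip)

print(vtep1.forward("mac-B"))   # 10.0.0.2 (unicast from now on)
print(vtep1.forward("mac-C"))   # flood to mcast group (unknown unicast)
```

The last line is exactly the behavior that worries us for DCI: any unknown destination is flooded to the multicast group, and hence across the interconnection.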
VxLAN uses traditional data plane (learning bridge) behavior to populate its layer 2 mapping tables. While this learning process is suitable within a Clos fabric architecture, it is certainly not efficient for DCI. The risk is that flooding toward unknown destinations can burden the IP multicast group and therefore pollute the whole DCI solution (e.g. overload the inter-site link) instead of containing the failure domain within its physical data center. The issue is that, contrary to other overlay protocols such as EoMPLS or VPLS, it is very challenging to rate-limit a specific IP multicast group responsible for handling VxLAN-based flooding of BUM.
Flooding unknown unicast can saturate the IP multicast group over the DCI link, hence it is not appropriate for DCI solutions.
An enhanced implementation of VxLAN allows the use of a unicast-only mode, in which BUM traffic is replicated to all VTEPs. The best current example is the enhanced mode of VxLAN supported on the Nexus 1000v. To enable this head-end replication, the Nexus 1000v controller (VSM) redistributes all VTEPs to its VEMs, dynamically maintaining each VM entry for a given VxLAN. Consequently, each VEM knows all the VTEP IP addresses spread over the other VEMs for a particular VxLAN. In addition to the VTEP information, the VSM learns and distributes all MAC addresses in a VxLAN to its VEMs. At any given time, all VEM MAC address tables contain all the known MAC addresses in the VxLAN domain. As a result, any packet sent to an unknown destination is dropped.
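A hypothetical sketch of this controller-driven, unicast-only behavior, contrasting it with flood-and-learn (names are illustrative, not the actual VSM/VEM API):

```python
class Vem:
    """Toy model of a virtual switch in unicast-only mode: the
    controller pushes the full VTEP list and MAC table, so there is
    no data-plane flooding at all."""

    def __init__(self):
        self.vtep_peers = set()   # remote VTEP IPs for this VxLAN
        self.mac_table = {}       # MAC -> VTEP IP, pushed by the controller

    def controller_update(self, vtep_peers, mac_table):
        """The controller, not the data plane, populates the tables."""
        self.vtep_peers = set(vtep_peers)
        self.mac_table = dict(mac_table)

    def forward(self, dst_mac, is_bum=False):
        if is_bum:
            # Head-end replication: one unicast copy per known VTEP.
            return [f"unicast copy to {ip}" for ip in sorted(self.vtep_peers)]
        if dst_mac in self.mac_table:
            return [f"unicast to {self.mac_table[dst_mac]}"]
        return []   # unknown unicast is dropped, never flooded

vem = Vem()
vem.controller_update({"10.0.0.2", "10.0.0.3"}, {"mac-B": "10.0.0.2"})
print(vem.forward("mac-B"))   # ['unicast to 10.0.0.2']
print(vem.forward("mac-X"))   # [] -> dropped, not flooded
```

Because every table entry comes from the controller, the trade-off moves from flooding to controller scalability and availability, which is precisely the concern raised next.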
The unicast-only mode currently covers only the set of Virtual Ethernet Modules (VEM) managed by the controller (VSM), with a maximum of 128 supported hosts, even when the Nexus 1000v instance is distributed across two data centers. Currently, a single Nexus 1000v instance controls the whole VxLAN overlay network spanning the two data centers. Although other VxLAN domains could be created using additional VxLAN controllers, they are currently not able to communicate with each other. Therefore, despite improving control over flooding behavior, this feature could potentially be stretched only between two small data centers, and only for extended VxLAN logical segments. Beyond the current scalability limits, other crucial concerns remain with regard to the DCI requirements: only one controller is active for the whole extended VxLAN domain, and remote devices connected to a traditional VLAN won't be extended across sites (discussed below with the VTEP gateway).
Note: some implementations of VxLAN support ARP caching, reducing the amount of flooding over the multicast group for repeated ARP requests. However, unknown unicast is still flooded over the IP multicast group. ARP caching must therefore be considered an improvement for DCI, not a DCI solution on its own: even though it adds great value, it is definitely not sufficient to allow stretching a VxLAN segment across long distances.
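A minimal sketch of the ARP caching idea, assuming the VTEP snoops ARP replies to build its cache (illustrative only):

```python
class ArpCachingVtep:
    """Toy sketch of VTEP-side ARP caching: an ARP request whose
    target is already cached is answered locally instead of being
    flooded to the multicast group. Unknown *unicast* frames are
    still flooded; caching only suppresses repeated ARP requests."""

    def __init__(self):
        self.arp_cache = {}   # target IP -> MAC, built from snooped replies

    def handle_arp_request(self, target_ip):
        if target_ip in self.arp_cache:
            return f"local reply: {self.arp_cache[target_ip]}"
        return "flood to mcast group"

    def snoop_arp_reply(self, sender_ip, sender_mac):
        self.arp_cache[sender_ip] = sender_mac

vtep = ArpCachingVtep()
print(vtep.handle_arp_request("10.1.1.20"))   # first request is still flooded
vtep.snoop_arp_reply("10.1.1.20", "mac-B")
print(vtep.handle_arp_request("10.1.1.20"))   # subsequent requests answered locally
```

The first ARP request for any target still crosses the interconnection, which is why this helps steady-state traffic but does not contain the failure domain.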
Active/Active Redundant DCI Edge Device
As most of the DC networks are running in a “hybrid” model with a mix of physical devices and virtual machines, we need to evaluate the different methods of communication that exist between VxLANs and with a VLAN.
- VxLAN A to VxLAN A. This is the default communication just described above, between two virtual machines belonging to the same logical network segment. It uses the VTEP to extend the same segment ID over the layer 3 network. Although the underlying IP network can offer multiple active paths, only one VTEP for a particular segment can exist on each host. This mode only allows communication within the same VxLAN ID, and is therefore limited to a fully virtualized DC.
- VxLAN Gateway to VLAN. To communicate with the outside of the virtual network overlay, the VxLAN logical segment ID is bridged to a traditional VLAN ID using a VxLAN VTEP layer 2 gateway. This is a one-to-one translation that can be achieved using different tools. On the software side, this service can be provided by a virtual service blade gateway installed on the Nexus 1010/1110, or by the security edge gateway that comes with vCloud networking, to list only two, acting in transparent bridge mode: one active VTEP gateway per VNID-to-VLAN-ID translation, although a single VTEP gateway supports many VNID <=> VLAN ID translations. On the hardware side, the VTEP gateway is embedded in each switch supporting this function, using a dedicated ASIC. Regardless of which software component provides the VxLAN VTEP gateway function, currently only one L2 gateway can be active at a time for each particular VNID <=> VLAN ID mapping, preventing layer 2 loops. With the hardware implementation of VxLAN, on the other hand, for hosts directly attached to a physical leaf, it is possible to run the same L2 gateway mapping locally on each leaf of interest; nonetheless, this is possible only if there is no cascaded layer 2 network attached to the leaf of interest.
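The single-active constraint can be sketched as a simple invariant; the gateway names are hypothetical:

```python
class L2GatewayDomain:
    """Toy model of the single-active constraint on VxLAN layer 2
    gateways: one gateway may hold many VNID <=> VLAN ID mappings,
    but only one gateway may be active for any given mapping, since
    no loop-prevention protocol exists between gateways."""

    def __init__(self):
        self.active_gateway = {}   # (vnid, vlan_id) -> active gateway name

    def activate(self, gateway, vnid, vlan_id):
        key = (vnid, vlan_id)
        if key in self.active_gateway and self.active_gateway[key] != gateway:
            raise RuntimeError(
                f"mapping {key} is already active on {self.active_gateway[key]}; "
                "a second active gateway would create a layer 2 loop")
        self.active_gateway[key] = gateway

domain = L2GatewayDomain()
domain.activate("gw-dc1", 12345, 100)      # first active gateway: OK
try:
    domain.activate("gw-dc2", 12345, 100)  # second gateway for the same mapping
except RuntimeError as exc:
    print(exc)
```

In a DCI scenario this invariant is exactly what forces all VNID <=> VLAN traffic through one site, producing the isolation and hairpinning problems described below.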
Although it is not a viable solution for a production network, the following needs to be stated in order to prevent it from being understood as an alternative: a stopgap option would be to use a virtual machine with a virtual interface attached to each segment side. This would bridge a VNID to a VLAN ID with no VTEP function per se, like a virtual firewall configured in transparent bridging mode. Even though this may work technically for a specific secured bridge between a VxLAN and a VLAN segment, it is important to understand the limitations and risks.
There are some important weaknesses to take into consideration:
- Risk of layer 2 loops from cloning and enabling such a virtual machine, or simply a human mistake connecting the same VxLAN segment to its peer VLAN twice.
- Scalability bottlenecks due to the limited number of supported virtual interfaces (vNICs) per virtual machine: in theory a maximum of five VNID-to-VLAN-ID mappings is feasible, but if we reserve one vNIC for management purposes, only four translations remain. While it is possible to enable dot1Q to increase the number of VNID <=> VLAN ID mappings, limitations remain (100-200 VLANs), which also impact performance and bandwidth.
- Live migration: if the VM running the bridge gateway role live-migrates to a new host, all MAC-to-IP bindings are purged and must be re-learned.
Therefore, based on this non-exhaustive list of caveats, it is preferable not to use this method to bridge a VxLAN with a VLAN segment. To avoid any risk of disruption, it is more desirable to enable the virtual firewall in routed mode.
There is another method of connecting to the outside of the VxLAN fabric, which consists of routing a logical segment to a VLAN or another VxLAN segment using the VxLAN layer 3 gateway. This function can be embedded in the virtual or physical switch, or possibly provided by an external VM acting as a router in routed mode, albeit with the same limitations mentioned above. The layer 3 gateway is out of scope for this article, so let's focus on the VxLAN-gateway-to-VLAN case for the hybrid model.
As mentioned previously, while the VTEP layer 2 gateway can be deployed in high-availability mode, only one is active at a time for a particular logical L2 segment. An additional active VxLAN gateway may create a layer 2 loop, since currently there is no protocol available to block an alternate forwarding path.
That means all translations from VNID to VLAN ID, and vice versa, must hit the same VTEP gateway. While this behavior works well inside a network fabric, when projected onto a DCI scenario, only end-systems in the same data center as the VxLAN gateway can communicate with each other.
In the figure above, VxLAN 12345 is mapped to VLAN 100 in DC 1, on the right. Only Srv 1 can communicate with VM 1 or VM 2. Srv 2 in DC 2 is isolated because it has no path to the VxLAN gateway at the remote site, DC 1.
Theoretically we could create a VxLAN perimeter in each data center, each managed by its own VxLAN controller. This would make it possible to duplicate the VTEP gateway and map the same series of VNID <=> VLAN ID translations on each site.
Each VTEP on each side can share the same IP multicast group used by the VNIDs of interest. Yet currently, due to data plane learning, we still face ARP and BUM flooding. Until a control plane can distribute VTEP and MAC reachability to all virtual and physical edges, and between VxLAN domains, the data center interconnection will suffer from heavy BUM flooding.
Thus the paradox: in order to allow communication between remote machines connected to a particular VLAN and virtual machines attached to the same VxLAN segment through a single active VTEP gateway, a validated DCI solution is required to extend the VLAN in question.
On the minus side, with a DCI VLAN extension and a single active VxLAN gateway for both sites, all traffic exchanged between the VLAN and the VxLAN segment has to hit the VxLAN gateway on the remote site, creating a hairpinning workflow.
Independent Control Plane on each Site
Today, the current standard implementation of VxLAN has no control plane for the learning process. By definition this should be a showstopper for DCI. Although the enhanced mode of VxLAN uses a controller (VSM) to continuously distribute all VTEPs for a given VxLAN to all virtual switches, only one controller is active at a time. This means the same control plane is shared between the two locations, with an inherent risk of partitioning. This will be improved in the near future with a new control plane.
Scalability depends on the vendor and on the way VxLAN is implemented in software or hardware. What can realistically be supported in terms of the number of segments and MAC addresses therefore hinges on a number of conditions:
- How many VxLAN segments can a software engine really support? Usually this is limited to a few thousand; don't expect hundreds of thousands.
- Can I distribute the consumed VxLAN resources such as VxLAN gateways between multiple engines?
- How many IP multicast groups can the network support?
Beyond scalability, the remaining concerns can be summarized as follows:
- VTEP discovery and data plane MAC learning lead to flooding over the interconnection
  - Excessive flood traffic and bandwidth exhaustion.
  - Exposure to security threats.
- No isolation of the L2 failure domain
  - No rate limiters.
- The current implementation of standard VxLAN assumes an IP multicast transport exists in the layer 3 core (inside and outside the DC)
  - IP multicast in the layer 3 backbone may not necessarily be an option.
  - Large amounts of IP multicast between DCs could be a challenge from an operational point of view.
- Only one gateway per VLAN segment
  - More than one VTEP layer 2 gateway will lead to loops.
  - Traffic is hairpinned to the gateway.
- VxLAN transport cannot be relied upon to extend VLAN to VLAN
  - A valid DCI solution is required to close black holes.
- No network resiliency in the L2 network overlay
  - No multi-homing with layer 2 back-door prevention.
  - Most validated DCI solutions offer flow-based or VLAN-based load balancing between two or more DCI edge devices.
- No path diversity
  - It is desirable to handle private and public VLANs using different paths (e.g. HA clusters).
- Can we extend a VxLAN segment over long distances?
VxLAN is designed to run over an IP network, so from a transport protocol point of view there is no restriction in terms of distance.
- Does that mean it can be used natively as a DCI solution, interconnecting two or more DCs and extending the layer 2 segment?
No. VxLAN would require additional tools, as well as a control plane, to address the DCI requirements listed at the beginning of this post, which the current implementation does not provide.
Note: at the time of this writing, work is being done in the IETF nvo3 working group on a VxLAN control plane for the learning and notification processes (BGP with E-VPN), reducing the overall amount of flooding (BUM) currently created by the data plane learning implementation. In addition, an anycast gateway feature will soon be available in physical switches, allowing active VTEP gateways to be distributed on each leaf. Stay tuned!
Until VxLAN evolves with a control plane and implements the tools needed for DCI, the best recommendation for interconnecting two VxLAN network fabrics is to use a traditional DCI solution (OTV, VPLS, E-VPN).