VxLAN/EVPN and Integrated Routing Bridging
As I mentioned in post 28 (Is VxLAN Control Plane a DCI solution for LAN extension), VxLAN/EVPN is taking a big step forward with its Control Plane and could potentially be used to extend Layer 2 segments across multiple sites. However, it is still crucial to keep in mind some weaknesses and gaps with regard to DCI. Neither VxLAN nor VxLAN/EVPN was designed to natively offer a DCI solution (posts 26 and 28).
DCI is not just a Layer 2 extension between two or more sites. DCI/LAN extension aims to offer business continuity and elasticity for the cloud (hybrid cloud). It offers disaster recovery and disaster avoidance services for enterprise business applications; consequently, it must be very robust and efficient. Because it extends a Layer 2 broadcast domain, it is really important to understand the requirements for a solid DCI/LAN extension and how we can leverage the right tools and network services to address some of the shortcomings of the current VxLAN/EVPN implementation in order to offer a solid DCI solution.
In this article we will examine the integrated anycast L3 gateway available with VxLAN/EVPN MP-BGP control plane, which is one of the key DCI requirements for long distances when the first hop default gateway can be duplicated on multiple sites.
Integrated Routing and Bridging
One of the requirements for an efficient DCI deployment is a duplicated Layer 3 default gateway, such as FHRP isolation (post 24, Enabling FHRP filter), which reduces hairpinning workflows while machines migrate from one site to another. In short, the same default gateway (virtual IP address + virtual MAC) exists and is active on both sites. When a virtual machine or a whole multi-tier application performs a hot live migration to a remote location, it is very important that the host continues to use its default gateway locally and processes the communication without any interruption. Without FHRP isolation, multi-tier applications (E-W traffic) suffer in terms of performance, because they must reach their default gateway in the primary data center.
This operation should happen transparently with the new local L3 gateway, so that the round-trip workflow via the original location is eliminated. No configuration is required on the Host, and all active sessions can be maintained statefully.
Modern techniques such as Anycast gateway can offer at least the same efficiency as FHRP isolation, and not only for Inter-Fabric (DCI) workflows but also for Intra-Fabric traffic optimization. This task is achieved in an easier way because it doesn’t require specific filtering to be configured on each site. The function of the Anycast Layer 3 gateway is natively embedded with the BGP EVPN control plane. And last but not least, in conjunction with VxLAN/EVPN there are as many Anycast L3 gateways as Top-of-Rack switches (Leafs with VxLAN/EVPN enabled).
The EVPN IETF draft elaborates the concept of Integrated Routing and Bridging (IRB) based on EVPN to address inter-subnet communication between Hosts or Virtual Machines that belong to different VxLAN segments. This is also known as inter-VxLAN routing. VxLAN/EVPN natively offers the Anycast L3 gateway function through the IRB feature.
VxLAN/EVPN MP-BGP Host learning process
The following section is a brief reminder of the Host reachability discovery and distribution process, to make the different routing modes discussed later easier to understand.
The VTEP function happens on a First Hop Router usually enabled on the ToR where the Hosts are directly attached. EVPN provides a learning process to dynamically discover the local end-points attached to their local VTEP, distributing afterward the information (Host’s MAC and IP reachability) toward all other remote VTEPs through the MP-BGP control plane. Subsequently, all VTEPs know all end-point information that belongs to their respective VNI.
Let's illustrate the description with the following example.
VLAN 100 (blue) maps to VNI 10000
- Host A: 192.168.1.188
- Host C: 192.168.1.187
- Host E: 192.168.1.193
- Host F: 192.168.1.189
VLAN 30 (orange) maps to VNI 5030
- Host B: 192.168.3.187
- Host D: 192.168.3.193
VLAN 3000 (purple) maps to VNI 300010
- Host G: 192.168.32.189
In order to keep the following scenarios simple, each pair of physical Leafs (vPC) will be represented by a single logical switch with the Anycast VTEP stamped to it.
VTEP 1: 10.0.2.11
VLAN 30 and VLAN 3000 are not present in Leaf 1, consequently it is not required to create and map the VLAN 30 to VNI 5030, nor VLAN 3000 to VNI 300010 within VTEP 1. VTEP 1 only needs to get the reachability information for Hosts A, C, E and F (IP & MAC addresses) belonging to VLAN 100. Host A is local to VTEP 1 and Hosts C, E and F are remotely extended using the network overlay VNI 10000.
VTEP 2: 10.0.2.13
VTEP 2 learned Hosts A, C, E and F, attached to VLAN 100, and Hosts B and D, belonging to VLAN 30. Hosts B and C are local.
VTEP 3: 10.0.2.25
VTEP 3 learned Hosts A, C, E and F, attached to VLAN 100, and Hosts B and D, belonging to VLAN 30. Hosts D and E are local.
VTEP 4: 10.0.2.27
VTEP 4 learned Hosts A, C, E, and F on VLAN 100 and Host G on VLAN 3000. Hosts F and G are local.
Notice that only VTEP 4 learns Host G, which belongs to VLAN 3000. Consider Host G to be isolated from the other segments, like for example a database accessible only via routing.
The learning and distribution of Host reachability information is handled by the MP-BGP EVPN control plane with an integrated routing and bridging (IRB) function. Each VTEP learns its local end-points through the data plane (e.g. the source MAC of a unicast packet or an ARP/GARP/RARP) and registers them, with their respective MAC and IP address information, using the Host Mobility Manager (HMM). Subsequently, it distributes this information through the MP-BGP EVPN control plane. In addition to registering its locally attached end-points (HMM), every VTEP populates its Host table with the devices learned through the control plane (MP-BGP EVPN AF).
From Leaf 1 (VTEP1 : 10.0.2.11)
The “show L2route” command above details the local Host A (192.168.1.188), learned by the Host Mobility Manager (HMM), as well as Hosts C (192.168.1.187), E (192.168.1.193) and F (192.168.1.189), learned through BGP together with the next hop (remote VTEP) used to reach them.
 Please notice that the output has been modified to show only what’s relevant to this section.
From Leaf 3 (VTEP2 : 10.0.2.13)
Host B (192.168.3.187) and Host C (192.168.1.187) are local, whereas Host D (192.168.3.193) and Host E (192.168.1.193) are remotely reachable through VTEP 3 (10.0.2.25), and Host F (192.168.1.189) appears behind VTEP 4 (10.0.2.27).
From Leaf 5 (VTEP3 : 10.0.2.25)
Host D (192.168.3.193) and Host E (192.168.1.193) are local. All other Hosts except Host G are reachable via remote VTEPs listed in the Next Hop column.
From Leaf 7 (VTEP4 : 10.0.2.27)
Host F (192.168.1.189) and Host G (192.168.32.189) are local. Out of the whole VxLAN domain, VTEP 4 registers only the Hosts that belong to VLAN 100 and VLAN 3000.
With the Host table information, each VTEP knows how to reach all Hosts within its configured VLAN to L2 VNI mapping.
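The learning behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the switch implementation: the Host addresses, VTEP addresses and VLAN-to-VNI mappings come from the example, while the names `Host`, `VTEP_VLANS` and `host_table` are made up for the sketch. A VTEP keeps an entry only for Hosts whose VLAN is mapped locally to an L2 VNI; local Hosts are marked as learned by HMM and remote ones as learned through BGP.

```python
from dataclasses import dataclass

# VLAN-to-VNI mappings from the example (kept for reference).
VLAN_TO_VNI = {100: 10000, 30: 5030, 3000: 300010}

@dataclass(frozen=True)
class Host:
    name: str
    ip: str
    vlan: int
    vtep: str  # VTEP the Host is locally attached to

HOSTS = [
    Host("A", "192.168.1.188", 100, "VTEP1"),
    Host("C", "192.168.1.187", 100, "VTEP2"),
    Host("E", "192.168.1.193", 100, "VTEP3"),
    Host("F", "192.168.1.189", 100, "VTEP4"),
    Host("B", "192.168.3.187", 30, "VTEP2"),
    Host("D", "192.168.3.193", 30, "VTEP3"),
    Host("G", "192.168.32.189", 3000, "VTEP4"),
]

VTEP_IP = {
    "VTEP1": "10.0.2.11",
    "VTEP2": "10.0.2.13",
    "VTEP3": "10.0.2.25",
    "VTEP4": "10.0.2.27",
}

# VLANs that are mapped to an L2 VNI on each VTEP.
VTEP_VLANS = {
    "VTEP1": {100},
    "VTEP2": {100, 30},
    "VTEP3": {100, 30},
    "VTEP4": {100, 3000},
}

def host_table(vtep):
    """Return {host name: (producer, next hop)} as seen by one VTEP."""
    table = {}
    for h in HOSTS:
        if h.vlan not in VTEP_VLANS[vtep]:
            continue  # no local VLAN-to-VNI mapping, so the Host is not learned
        if h.vtep == vtep:
            table[h.name] = ("HMM", "local")
        else:
            table[h.name] = ("BGP", VTEP_IP[h.vtep])
    return table

for vtep in VTEP_IP:
    print(vtep, host_table(vtep))
```

Under these assumptions, only VTEP 4 ends up with an entry for Host G, and VTEP 1 holds nothing but the four Hosts of VLAN 100.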
Asymmetric or symmetric Layer 3 workflow
The EVPN draft defines two different operational models to route traffic between VxLAN overlays. The first method describes an asymmetric workflow across different subnets and the second method leverages the symmetric approach.
Vendors offering VxLAN with EVPN control plane support may implement either of these operational models (assuming the Hardware/ASIC supports VxLAN routing). Some may choose the asymmetric approach, maybe because it is easier to implement from a software point of view, but it is not as efficient as the symmetric mode and carries some risks that impact scalability. Others will choose the symmetric model, which populates Host information more efficiently and scales better.
Assuming the following scenario:
- Host A (VLAN 100) wants to communicate with Host G in a different subnet (VLAN 3000)
- VLAN 100 is mapped to VNI 10000
- VLAN 30 is mapped to VNI 5030
- VLAN 3000 is mapped to VNI 300010
- We assume all Hosts and VLANs belong to the same Tenant-1 (VRF)
- * L3 VNI for this Tenant of interest is VNI 300001 (for symmetric IRB)
- We assume within the same Tenant-1 that all Hosts from different subnets are allowed to communicate with each other (using inter-VxLAN routing).
- Host A shares the same L2 segment with Hosts C, E, and F spread over remote Leafs
- Host G is Layer 2 isolated from all other Hosts. Therefore a Layer 3 routing transport is required to communicate with Host G.
* explained with Symmetric IRB mode
Asymmetric IRB mode:
When an end-point wants to communicate with another device on a different IP network, it sends the packet with the destination MAC address set to that of its default gateway. Its first-hop router (ingress VTEP) performs the routing lookup and routes the packet to the destination L2 segment. When the egress VTEP receives the packet, it strips off the VxLAN header and bridges the original frame to the VLAN of interest. With asymmetric routing mode, the ingress VTEP (the local VTEP where the source is attached) performs both bridging and routing, whereas the egress VTEP (the remote VTEP where the destination sits) performs only bridging. Consequently, the return traffic takes a different VNI, hence a different overlay tunnel.
In the following example, Host A wants to communicate with Host G and sends the packet toward its default gateway. VTEP 1 sees that the destination MAC is its own address and does a routing lookup for Host G. It finds the destination Host's IP address in its Host table, along with the Next Hop (VTEP 4) to reach it. VTEP 1 routes the packet and encapsulates it into the L2 VNI 300010 with VTEP 4 as the destination IP address. VTEP 4 receives the packet, strips off the VxLAN header and bridges the frame to its VLAN 3000 toward Host G (with the default gateway address as the source MAC).
When Host G responds, VTEP 4 receives the frame in VLAN 3000 (VNI 300010), routes it, and encapsulates it directly into VNI 10000, where Host A is registered. The egress VTEP 1 therefore simply bridges the packet received from VNI 10000 toward Host A in VLAN 100.
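To make the asymmetry visible, here is a minimal Python sketch of the packet walk just described. It is a simplified model with made-up names (`HOSTS`, `asymmetric_walk`); only the addresses, VNIs and VTEP roles come from the example. The point it illustrates is that the VNI carried on the wire is always the L2 VNI of the destination, so the two directions use different VNIs.

```python
# Host A and Host G from the example: L2 VNI of their segment and their VTEP.
HOSTS = {
    "192.168.1.188": {"name": "A", "l2_vni": 10000, "vtep": "VTEP1"},
    "192.168.32.189": {"name": "G", "l2_vni": 300010, "vtep": "VTEP4"},
}

def asymmetric_walk(src_ip, dst_ip):
    """Describe one direction of an inter-subnet flow in asymmetric IRB."""
    src, dst = HOSTS[src_ip], HOSTS[dst_ip]
    return {
        "ingress_vtep": src["vtep"],
        "ingress_ops": ["bridge", "route"],  # ingress does both
        "vni_on_wire": dst["l2_vni"],        # routed into the destination L2 VNI
        "egress_vtep": dst["vtep"],
        "egress_ops": ["bridge"],            # egress only bridges
    }

forward = asymmetric_walk("192.168.1.188", "192.168.32.189")
reverse = asymmetric_walk("192.168.32.189", "192.168.1.188")
print("A -> G uses VNI", forward["vni_on_wire"])
print("G -> A uses VNI", reverse["vni_on_wire"])
```

Note that the lookup `HOSTS[dst_ip]` on the ingress side only works because the ingress VTEP holds an entry for the remote Host, which is precisely the requirement that limits the scalability of this mode.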
The drawback of this asymmetric routing implementation is that it requires a consistent configuration across all VTEPs: each of them must be built with all the VLANs and VNIs involved in routing and bridging communication across the fabric. In addition, each VTEP needs to learn the reachability information (MAC and IP addresses) for all Hosts that belong to all VNIs of interest.
In the above illustration, VTEP 1 needs to be configured with VLAN 100 mapped to VNI 10000, as well as with the VLAN IDs that are mapped to VNIs 5030 and 300010, even though no Host is attached to those VLANs. As the VLAN ID is only locally significant, what is crucial is that the mapping to the VNI of interest exists. Therefore, in this example the Host tables on all leafs are populated with the reachability information for Hosts A, B, C, D, E, F and G. In a large VxLAN/EVPN deployment, this implementation of asymmetric IRB adds some configuration complexity, but above all it may have a significant impact in terms of scalability.
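The scalability impact can be estimated with a rough count. The Python sketch below is only a back-of-the-envelope model with hypothetical names (`HOST_VLAN`, `LOCAL_VLANS`, `table_size`); it assumes that a VTEP holds one Host table entry per Host in every VLAN/VNI configured on it. It compares the asymmetric case, where every VTEP must carry every VLAN/VNI, with a model in which a VTEP only carries the VLANs of its locally attached Hosts, as the symmetric mode allows.

```python
# VLAN of each Host and VLANs with locally attached Hosts, from the example.
HOST_VLAN = {"A": 100, "C": 100, "E": 100, "F": 100, "B": 30, "D": 30, "G": 3000}
LOCAL_VLANS = {
    "VTEP1": {100},
    "VTEP2": {100, 30},
    "VTEP3": {100, 30},
    "VTEP4": {100, 3000},
}
ALL_VLANS = set(HOST_VLAN.values())

def table_size(configured_vlans):
    """Count the Hosts whose VLAN is configured on the VTEP."""
    return sum(1 for vlan in HOST_VLAN.values() if vlan in configured_vlans)

# Asymmetric IRB: every VTEP needs every VLAN/VNI, hence every Host.
asymmetric = {vtep: table_size(ALL_VLANS) for vtep in LOCAL_VLANS}
# Only locally needed VLANs/VNIs are configured.
local_only = {vtep: table_size(vlans) for vtep, vlans in LOCAL_VLANS.items()}

print("asymmetric:", asymmetric)
print("local only:", local_only)
```

With seven Hosts the difference is small, but the asymmetric count grows with the total number of Hosts in the tenant, whereas the second count grows only with the Hosts in the segments actually present on the Leaf.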
Symmetric IRB mode:
Symmetric routing behaves differently in the sense that both the ingress and the egress VTEP provide bridging and routing functions. This introduces a new concept known as the transit L3 VNI. This L3 VNI is dedicated to routing purposes within a tenant VRF and offers L3 segmentation per tenant VRF. Each VRF instance is mapped to a unique L3 VNI in the network. The VLAN on which a packet is received, being part of a given tenant's set of VLANs, determines the VRF context to which that packet belongs. As a result, inter-VxLAN routing is performed through the L3 VNI within a particular VRF instance.
Notice that each VTEP can support several hundred VRFs (depending on the hardware, though).
In the following example, all Hosts (A, B, C, D, E, F and G) belong to the same tenant VRF. When Host A wants to talk to Host G in a different subnet, the local (ingress) VTEP sees that the destination MAC is the MAC of the Default Gateway (AGM, Anycast Gateway MAC). As Host G is not known via Layer 2, the VTEP routes the packet through the L3 VNI 300001. It rewrites the inner destination MAC address with the router MAC address of the egress VTEP 4; this router MAC is unique to each VTEP. Once the remote (egress) VTEP 4 receives the encapsulated VxLAN packet, it strips off the VxLAN header and does a MAC lookup, identifying the destination MAC as its own. Accordingly, it performs an L3 lookup and routes the received packet from the L3 VNI to the destination L2 VNI 300010, which maps to VLAN 3000 where Host G resides. VTEP 4 finally maps VNI 300010 to VLAN 3000 and forwards the frame with Host G as the L2 destination.
For return traffic, VTEP 4 will achieve the same packet walk using the same transit L3 VNI 300001, but in the opposite direction.
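The symmetric packet walk can be sketched the same way. Again, this is an illustrative Python model rather than a description of any vendor's implementation: the VNIs and VTEP roles come from the example, while `ROUTER_MAC` contains placeholder values (the real router MAC addresses are not given in this article) and `symmetric_walk` is a made-up helper.

```python
L3_VNI = 300001  # transit L3 VNI of Tenant-1, from the example

HOSTS = {
    "192.168.1.188": {"name": "A", "l2_vni": 10000, "vtep": "VTEP1"},
    "192.168.32.189": {"name": "G", "l2_vni": 300010, "vtep": "VTEP4"},
}

# Placeholder router MACs: each VTEP has its own, the values are hypothetical.
ROUTER_MAC = {"VTEP1": "rmac-vtep1", "VTEP4": "rmac-vtep4"}

def symmetric_walk(src_ip, dst_ip):
    """Describe one direction of an inter-subnet flow in symmetric IRB."""
    src, dst = HOSTS[src_ip], HOSTS[dst_ip]
    return {
        "ingress_vtep": src["vtep"],
        "ingress_ops": ["bridge", "route"],
        "vni_on_wire": L3_VNI,                     # always the tenant L3 VNI
        "inner_dst_mac": ROUTER_MAC[dst["vtep"]],  # egress VTEP router MAC
        "egress_vtep": dst["vtep"],
        "egress_ops": ["route", "bridge"],         # egress routes, then bridges
        "egress_l2_vni": dst["l2_vni"],
    }

forward = symmetric_walk("192.168.1.188", "192.168.32.189")
reverse = symmetric_walk("192.168.32.189", "192.168.1.188")
print("A -> G uses VNI", forward["vni_on_wire"])
print("G -> A uses VNI", reverse["vni_on_wire"])
```

Both directions carry the same VNI on the wire, and the destination L2 VNI only matters on the egress VTEP, where the segment actually exists.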
The following L2route command displays the Host reachability information that exists in VTEP 1's Host table. It shows neither Host G (192.168.32.189), as no VLAN on Leaf 1 (VTEP 1) maps locally to VNI 300010, nor any information related to VNI 5030, which reduces its Host reachability table to the minimum required information.
Nonetheless, Leaf1 knows how to route the packet to that destination. The Next hop is Leaf_4 via the L3 VxLAN segment ID 300001, as displayed in the following routing information.
The “show IP route” command above shows that Host G is reachable via the VxLAN segment ID 300001 with the next hop as 10.0.2.27 (VTEP 4).
The “show bgp L2vpn evpn” command from Leaf 1 displays the information related to Host G (192.168.32.189), reachable via the next hop, 10.0.2.27 (VTEP 4) using the L3 VNI 300001.
Note also, for information only, that the label field shows both the L2 VNI 300010 to which Host G belongs and the L3 VNI 300001 used to reach Host G.
 Please notice that the output has been modified to display only information relevant to Host G.
Anycast L3 gateway
One of the useful benefits of this design is the ability to offer the default gateway role from the directly attached Leaf (First Hop Router on the ToR). All leafs are configured with the same Default Gateway IP address for a given subnet, as well as the same vMAC address. All routers use the same virtual Anycast Gateway MAC (AGM) for all Default Gateways. This is explicitly configured during the VxLAN/EVPN setup.
Leaf(config)# fabric forwarding anycast-gateway-mac 0001.0001.0001
When a virtual machine performs a “hot live migration” to a Host attached to a different Leaf, it continues to use the same default gateway parameters without interruption.
In the above scenario, the server “WEB” in DC-1 communicates (1) with its database attached to a different Layer 2 network and therefore a different subnet in DC-2. The local Leaf 1 achieves the default gateway role for server WEB as the First Hop Router, which is simply its own ToR with the required VxLAN/EVPN and IRB services up and running.
While server WEB is communicating with the end-user (not represented here) and its database, it does a live migration (2) toward the remote VxLAN fabric where its database server resides.
The MP-BGP EVPN AF control plane notices the movement and the new location of the server WEB. Subsequently, it increases the MAC mobility Sequence number value for that particular end-point and notifies all VTEPs of the new location. As the Sequence number is now higher than the original value, all VTEPs update their Host tables accordingly with the new “next hop” (egress VTEP 4).
Without interruption of the current active sessions, server WEB continues to transparently use its default gateway, locally available from its new physical First Hop Router, now being Leaf 7 and 8 (VTEP 4). The East-West communication between WEB and DB happens locally within DC-2.
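The role of the MAC mobility Sequence number in this migration can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the MAC address of server WEB is hypothetical, the names `HostRoute` and `receive_update` are invented for the example, and only the "higher Sequence number wins" rule described above is modelled (any tie-breaking between equal Sequence numbers is left out).

```python
from dataclasses import dataclass

@dataclass
class HostRoute:
    next_hop: str  # IP address of the VTEP behind which the Host sits
    seq: int = 0   # MAC mobility Sequence number

def receive_update(table, mac, next_hop, seq):
    """Install the route only if it is new or has a higher Sequence number."""
    current = table.get(mac)
    if current is None or seq > current.seq:
        table[mac] = HostRoute(next_hop, seq)
        return True
    return False

WEB_MAC = "0000.0000.00aa"  # hypothetical MAC address for server WEB
remote_table = {}           # Host table of some remote VTEP

# Initial advertisement: WEB sits behind VTEP 1 (10.0.2.11).
receive_update(remote_table, WEB_MAC, "10.0.2.11", 0)
# After the live migration, VTEP 4 (10.0.2.27) advertises a higher number.
receive_update(remote_table, WEB_MAC, "10.0.2.27", 1)
# A late or stale advertisement for the old location is ignored.
stale_accepted = receive_update(remote_table, WEB_MAC, "10.0.2.11", 0)

print(remote_table[WEB_MAC], "stale accepted:", stale_accepted)
```

Because the comparison is made on the Sequence number rather than on arrival order, every VTEP converges on the new location even if updates are received out of order.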