I recommend reading part 1 and part 2 if you missed them 🙂
Thank you, Yves
VXLAN EVPN Multi-Fabric with Distributed Anycast Layer 3 Gateway
Layer 2 and Layer 3 DCI interconnecting multiple VXLAN EVPN Fabrics
A distributed anycast Layer 3 gateway provides significant added value to VXLAN EVPN deployments for several reasons:
- It offers the same default gateway to all edge switches. Each endpoint can use its local VTEP as a default gateway to route traffic outside its IP subnet. The endpoints can do so, not only within a fabric but across independent VXLAN EVPN fabrics (even when fabrics are geographically dispersed), removing suboptimal interfabric traffic paths. Additionally, routed flows between endpoints connected to the same leaf node can be directly routed at the local leaf layer.
- In conjunction with ARP suppression, it reduces the flooding domain to its smallest diameter (the leaf or edge device), and consequently confines the failure domain to that switch.
- It allows transparent host mobility, with the virtual machines continuing to use their respective default gateways (on the local VTEP), within each VXLAN EVPN fabric and across multiple VXLAN EVPN fabrics.
- It does not require you to create any interfabric FHRP filtering, because no protocol exchange is required between Layer 3 anycast gateways.
- It allows better distribution of state (ARP, etc.) across multiple devices.
In the VXLAN EVPN Multi-Fabric with Distributed Anycast Layer 3 Gateway scenario, the border nodes can perform three main functions:
- VLAN and VRF-Lite hand-off to DCI.
- MAN/WAN connectivity to the external Layer 3 network domain.
- Connectivity to Network Services.
For IP subnets that are extended between multiple fabrics, instantiation of the distributed IP anycast gateway on the border nodes is not supported. If the border nodes that extend the Layer 2 network also host the distributed IP anycast gateway, the same gateway MAC and IP addresses become visible on the Layer 2 extension in both fabrics, and unpredictable learning and forwarding can occur. Even if the Layer 2 network is not currently extended between fabrics, it may be extended later, so the use of a distributed IP anycast gateway on the border nodes is not recommended (Figure 1).
The next sections provide more detailed packet walk descriptions. But before proceeding, keep in mind that MP-BGP EVPN can transport Layer 2 information such as MAC addresses as well as Layer 3 information such as host IP addresses (host routes) and IP subnets. For this purpose, it uses two forms of routing advertisement:
- Route type 2: Announces host MAC and IP address information for the endpoints directly connected to the VXLAN fabric. It also carries extended community attributes (such as route-target values, router MAC addresses, and sequence numbers).
- Route type 5: Advertises IP subnet prefixes or host routes (associated, for example, with locally defined loopback interfaces). It also carries extended community attributes (such as route-target values and router MAC addresses).
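To make the distinction concrete, the two advertisement forms can be modeled as simple records. This is only an illustrative sketch: the class names, field names, and all sample values (addresses, VNIs, route targets) are hypothetical, and the records do not reflect the BGP wire encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteType2:
    """Illustrative model of an EVPN MAC/IP advertisement (not the wire format)."""
    mac: str                          # host MAC address
    ip: Optional[str]                 # host IP address; None for a MAC-only route
    l2vni: int                        # Layer 2 segment the host belongs to
    next_hop_vtep: str                # VTEP that advertises the host
    route_target: str                 # extended community
    router_mac: Optional[str] = None  # extended community
    sequence: int = 0                 # extended community (sequence number)


@dataclass(frozen=True)
class RouteType5:
    """Illustrative model of an EVPN IP prefix advertisement."""
    prefix: str                       # IP subnet prefix or host route
    l3vni: int                        # tenant (VRF) routing domain
    next_hop_vtep: str
    route_target: str                 # extended community
    router_mac: str                   # extended community


# Hypothetical sample values for host H1 and subnet_200.
h1_route = RouteType2(mac="0000.1111.0001", ip="10.1.100.11", l2vni=10100,
                      next_hop_vtep="10.0.0.11", route_target="65001:10100",
                      router_mac="00aa.bbcc.0011")
subnet_route = RouteType5(prefix="10.1.200.0/24", l3vni=50000,
                          next_hop_vtep="10.0.0.14", route_target="65001:50000",
                          router_mac="00aa.bbcc.0014")
```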
Learning Process for Endpoint Reachability Information
Endpoint learning occurs on the VXLAN EVPN switch, usually on the edge devices (leaf nodes) to which the endpoints are directly connected. The information (MAC and IP addresses) for locally connected endpoints is then programmed into the local forwarding tables.
This article assumes that a distributed anycast gateway is deployed on all the leaf nodes of the VXLAN fabric. Therefore, there are two main mechanisms for endpoint discovery:
- MAC address information is learned on the leaf node when it receives traffic from the locally connected endpoints. The usual data-plane learning function performed by every Layer 2 switch allows this information to be programmed into the local Layer 2 forwarding tables.
You can display the Layer 2 information in the MAC address table by using the show mac address-table command. You can display the content of the EVPN table (Layer 2 routing information base [L2RIB]) populated by BGP updates by using the show l2route evpn mac command.
- IP addresses of locally connected endpoints are instead learned on the leaf node by intercepting control-plane protocols used for address resolution (ARP, Gratuitous ARP [GARP], and IPv6 neighbor discovery messages). This information is also programmed in the local L2RIB together with the host route information received through route-type-2 EVPN updates and can be viewed by using the show l2route evpn mac-ip command at the command-line interface (CLI).
Note: Reverse ARP (RARP) frames do not contain any local host IP information in the payload, so they can be used only to learn Layer 2 endpoint information and not the IP addresses.
After learning and registering information (MAC and IP addresses) for its locally discovered endpoints, the edge device announces this information to the MP-BGP EVPN control plane using an EVPN route-type-2 advertisement sent to all other edge devices that belong to the same VXLAN EVPN fabric. As a consequence, all the devices learn endpoint information that belongs to their respective VNIs and can then import it into their local forwarding tables.
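The learning behavior described above can be sketched as a toy model. The class, method names, and sample addresses below are hypothetical and only illustrate the logic: any frame teaches a MAC address, an intercepted ARP also teaches the IP binding, and the result is announced as a route-type-2 update.

```python
class Leaf:
    """Toy model of a leaf node's local endpoint learning (illustrative only)."""

    def __init__(self, name, vtep_ip):
        self.name = name
        self.vtep_ip = vtep_ip
        self.mac_table = {}   # (l2vni, mac) -> local interface
        self.host_table = {}  # (l2vni, ip)  -> mac

    def learn_from_frame(self, l2vni, src_mac, interface):
        """Data-plane learning: any received frame teaches the source MAC."""
        self.mac_table[(l2vni, src_mac)] = interface

    def learn_from_arp(self, l2vni, sender_mac, sender_ip, interface):
        """ARP or GARP interception also teaches the MAC-to-IP binding."""
        self.learn_from_frame(l2vni, sender_mac, interface)
        self.host_table[(l2vni, sender_ip)] = sender_mac
        # The binding is then announced to the fabric as a route-type-2 update.
        return {"type": 2, "mac": sender_mac, "ip": sender_ip,
                "l2vni": l2vni, "next_hop": self.vtep_ip}

    def learn_from_rarp(self, l2vni, src_mac, interface):
        """RARP carries no host IP, so only the MAC address is learned."""
        self.learn_from_frame(l2vni, src_mac, interface)
        return {"type": 2, "mac": src_mac, "ip": None,
                "l2vni": l2vni, "next_hop": self.vtep_ip}


# Hypothetical values: leaf L11 intercepts an ARP request sent by H1.
l11 = Leaf("L11", "10.0.0.11")
update = l11.learn_from_arp(10100, "0000.1111.0001", "10.1.100.11", "Eth1/1")
```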
Intra-Subnet Communication Across Fabrics
Communication between two endpoints located in different fabrics within the same stretched Layer 2 segment (IP subnet) can be established by using a combination of VXLAN bridging (inside each fabric) and OTV Layer 2 extension services.
To better understand how endpoint information (MAC and IP addresses) is learned and distributed across fabrics, start by assuming that the source and destination endpoints have not yet been discovered inside each fabric.
The control-plane and data-plane steps required to establish cross-fabric Layer 2 connectivity are highlighted in Figure 2 and described in this section.
- Host H1 connected to leaf L11 in fabric 1 initiates intrasubnet communication with host H6, which belongs to remote fabric 2. As a first action, H1 generates a Layer 2 broadcast ARP request to resolve the MAC and IP address mapping for H6 that belongs to the same Layer 2 segment.
- Leaf L11 receives the ARP packet on local VLAN 100 and maps it to the locally configured L2VNI 10100. Because the distributed anycast gateway is enabled for that Layer 2 segment, leaf L11 learns H1’s MAC and IP address information and distributes it inside the fabric through an EVPN route-type-2 update. This process allows all the leaf nodes with the same L2VNI locally defined to receive and import this information in their forwarding tables (if they are properly configured to do so).
- Note: The reception of the route-type-2 update also triggers a BGP update for H1’s host route from border nodes BL1 and BL2 to remote border nodes BL3 and BL4 through the Layer 3 DCI connection. This process is not shown in Figure 2, because it is not relevant for intrasubnet communication.
- On the data plane, leaf L11 floods the ARP broadcast request across the Layer 2 domain identified by L2VNI 10100 (locally mapped to VLAN 100). The ARP request reaches all the local leaf nodes in fabric 1 that host that L2VNI, including border nodes BL1 and BL2. The replication of multidestination traffic can use either the underlay multicast functions of the fabric or the unicast ingress replication capabilities of the leaf nodes. The desired behavior can be independently tuned on a per-L2VNI basis within each fabric.
- Note: The ARP flooding described in the preceding steps occurs regardless of whether ARP suppression is enabled in the L2VNI, because of the initial assumption that H6 has not been discovered yet.
- The border nodes decapsulate the received VXLAN frame and use the L2VNI value in the VXLAN header to determine the bridge domain to which to flood the ARP request. They then send the request out the Ethernet interfaces that carry local VLAN 100 mapped to L2VNI 10100. The broadcast packet is sent to both remote OTV end devices. The OTV authoritative edge device (AED) responsible for extending that particular Layer 2 segment (VLAN 100) receives the broadcast packet, performs a lookup in its ARP cache table, and finds no entry for H6. As a result, it encapsulates the ARP request and floods it across the OTV overlay network to its remote OTV peers. The remote OTV edge device that is authoritative for that Layer 2 segment opens the OTV header and bridges the ARP request to its internal interface that carries VLAN 100.
- Border nodes BL3 and BL4 receive the ARP request from the AED through the connected OTV inside interface and learn on that interface through the data plane the MAC address of H1. As a result, the border nodes announce H1’s MAC address reachability information to the local fabric EVPN control plane using a route-type-2 update. Finally, the ARP request is flooded across L2VNI 10100, which is locally mapped to VLAN 100.
- All the edge devices on which L2VNI 10100 is defined receive the encapsulated ARP packet, including leaf nodes L23 and L24, which are part of a vPC domain. One of the two leaf nodes is designated to decapsulate the ARP request and bridge it to its local interfaces configured in VLAN 100 and locally mapped to L2VNI 10100, so H6 receives the ARP request.
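The note on ARP suppression in the steps above can be illustrated with a small decision function. This is a simplified sketch with hypothetical names and addresses: it assumes that a leaf with ARP suppression enabled answers locally when the target IP address is already in its host table, and floods otherwise, which is why the first request for the undiscovered H6 is flooded in either case.

```python
def handle_arp_request(host_table, l2vni, target_ip, suppression_enabled=True):
    """Decide what a leaf does with an ARP request received from a local host.

    host_table maps (l2vni, ip) to the MAC address learned for that host.
    """
    if suppression_enabled and (l2vni, target_ip) in host_table:
        # Target already known: the leaf can answer on behalf of the target.
        return ("local-reply", host_table[(l2vni, target_ip)])
    # Target unknown (or suppression disabled): flood across the L2VNI.
    return ("flood", l2vni)


# H6 (hypothetical address 10.1.100.66) has not been discovered yet.
first_request = handle_arp_request({}, 10100, "10.1.100.66")

# Once H6 is known, a later request no longer needs to be flooded.
known_hosts = {(10100, "10.1.100.66"): "0000.6666.0006"}
later_request = handle_arp_request(known_hosts, 10100, "10.1.100.66")
```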
At this point, the forwarding tables of all the devices are properly populated to allow the unicast ARP reply to be sent from H6 back to H1, as shown in Figure 3.
- H6 sends an ARP unicast reply destined to H1’s MAC address.
- The reception of the packet allows leaf nodes L23 and L24 to locally learn H6’s MAC and IP address information and then to generate an MP-BGP EVPN route-type-2 update to the fabric. As shown in the example in Figure 3, border nodes BL3 and BL4 also receive the update and so update their forwarding tables.
- Note: The reception of the route-type-2 update also triggers a BGP update for H6’s host route from border nodes BL3 and BL4 to remote border nodes BL1 and BL2 through the Layer 3 DCI connection. This process is not shown in Figure 3, because it is not relevant for intrasubnet communication.
- Leaf L23 performs a MAC address lookup for H1 and finds as the next hop the anycast VTEP address defined on border nodes BL3 and BL4. It hence VXLAN encapsulates the response and sends it unicast to that anycast VTEP address.
- The receiving border node (BL3 in this example) decapsulates the packet and bridges it to its Ethernet interface on which it previously learned H1’s MAC address (pointing to the OTV AED). The OTV device receives the packet on the internal interface, performs a MAC address lookup for H1, and sends the packet to the remote OTV edge device that is authoritative for that VLAN. The OTV device in fabric 1 then forwards the packet to its internal interface.
- BL1 in this specific example receives the ARP reply from the OTV device and learns the MAC address of H6. Subsequently, this information is announced to the VXLAN EVPN fabric using a route-type-2 BGP update, which is received by all relevant leaf nodes.
- BL1 then performs a MAC address lookup for H1 and finds leaf L11 as its next hop to reach H1. Thus it VXLAN-encapsulates the ARP reply and sends it unicast to the leaf L11 VTEP address.
- Leaf L11 decapsulates the VXLAN header and forwards the ARP reply unicast to H1.
At this stage, the leaf and border nodes in both VXLAN fabrics have fully populated their local forwarding tables with MAC reachability information for both the local and remote endpoints that need to communicate. H1 and H6 hence will always use the Layer 2 DCI connection for intrasubnet communication, as shown in Figure 4.
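The difference between the flooded ARP request and the unicast ARP reply comes down to a Layer 2 lookup. The following is a minimal sketch of that decision; the function name, table layout, MAC addresses, and next-hop labels are hypothetical.

```python
BROADCAST = "ffff.ffff.ffff"


def bridge(mac_table, l2vni, dst_mac):
    """Return the forwarding action for a frame in a given L2VNI.

    mac_table maps (l2vni, mac) to a next hop: a local interface name
    or a remote VTEP address.
    """
    if dst_mac == BROADCAST or (l2vni, dst_mac) not in mac_table:
        # Broadcast or unknown destination: replicate across the L2VNI
        # (underlay multicast or ingress replication).
        return ("flood", l2vni)
    return ("unicast", mac_table[(l2vni, dst_mac)])


# Leaf L23 in fabric 2 after it has imported H1's MAC address, which is
# reachable through the anycast VTEP of border nodes BL3 and BL4.
l23_table = {(10100, "0000.1111.0001"): "anycast-vtep-BL3-BL4",
             (10100, "0000.6666.0006"): "Eth1/6"}

arp_request = bridge({}, 10100, BROADCAST)
arp_reply = bridge(l23_table, 10100, "0000.1111.0001")
```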
Note: The VLAN ID and VNID values used in this article have been kept consistent across the system for simplicity. Technically, the VLAN ID is locally significant per leaf node, and the VNID is locally significant per VXLAN fabric.
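As a small illustration of this local significance, two leaf nodes in the same fabric can use different VLAN IDs for the same Layer 2 segment, as long as both map to the same VNID. The mapping below is hypothetical (VLAN 300 on leaf L12 is not taken from this article).

```python
# Per-leaf VLAN-to-L2VNI mappings (hypothetical values).
vlan_to_vni = {
    "L11": {100: 10100},
    "L12": {300: 10100},   # different local VLAN ID, same Layer 2 segment
}


def same_segment(leaf_a, vlan_a, leaf_b, vlan_b):
    """Two ports share a Layer 2 segment when their VLANs map to the same VNI."""
    return vlan_to_vni[leaf_a][vlan_a] == vlan_to_vni[leaf_b][vlan_b]


shared = same_segment("L11", 100, "L12", 300)
```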
Inter-Subnet Communication Across Fabrics
As previously mentioned, the use of a VXLAN EVPN anycast Layer 3 gateway allows you to reduce traffic hairpinning for routed communication between endpoints. Default gateways can all be active and distributed across different network fabrics on each leaf node. Therefore, be sure to configure the default gateway used by the directly connected endpoints with the same Layer 2 virtual MAC (vMAC) address and Layer 3 virtual IP address. The distributed MAC address of the Layer 3 gateway is the same across all the defined L2VNIs in a VXLAN fabric. It is also known as the anycast gateway vMAC address.
So how does inter-subnet communication occur between two VXLAN EVPN fabrics with a distributed anycast gateway?
The following example describes the packet walk between two endpoints (H1 and H4) belonging to different IP subnets and communicating with each other. To better demonstrate the endpoint discovery and distribution processes, this section presents the full packet walk for the control plane (the learning and distribution of endpoint MAC and IP address information) and data plane (the data packets exchange between endpoints).
Because the VLAN ID is locally significant, this example refers to subnet_100, which is used for the Layer 2 domain on which H1 resides, and subnet_200, which is associated with the Layer 2 domain on which H4 resides.
To better show the routing behavior within the fabric, this example assumes that the first-hop router leaf L11 to which source endpoint H1 is connected has no destination IP subnet_200 locally configured (that is, no VLAN 200 is created on leaf L11).
In this scenario, for H1 to communicate with H4, it needs first to send an ARP request to its default gateway. This event then triggers a series of control-plane updates. Figure 5 shows this scenario.
- Leaf L11 consumes the received ARP request because the destination MAC address is the default gateway (locally defined). This process allows the leaf node to learn the Layer 2 and 3 reachability information for H1.
- Leaf L11 injects this information into the fabric EVPN control plane using a route-type-2 BGP update.
- Border leaf nodes BL1 and BL2 receive the BGP update and generate a Layer 3 update specific to H1’s VRF instance using, in this specific example, an external BGP (eBGP) update across the Layer 3 DCI connection.
- Consequently, border nodes BL3 and BL4 in fabric 2 receive the Layer 3 update from their Layer 3 DCI connection and advertise a route-type-5 update to the local MP-BGP EVPN control plane.
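The control-plane chain in the steps above (route type 2 in fabric 1, a per-VRF Layer 3 update across the DCI, route type 5 in fabric 2) can be sketched as two small transformations. Function names, the VRF name Tenant-1, L3VNI 50000, and the addresses are hypothetical, and the sketch assumes IPv4 host routes.

```python
def export_host_route(route_type_2, vrf):
    """Source-fabric border node: turn an EVPN host entry into a per-VRF /32 route."""
    return {"vrf": vrf, "prefix": route_type_2["ip"] + "/32"}


def import_as_route_type_5(dci_route, l3vni, border_vtep):
    """Remote-fabric border node: re-advertise the DCI route into local EVPN."""
    return {"type": 5, "prefix": dci_route["prefix"],
            "l3vni": l3vni, "next_hop": border_vtep}


# H1 as announced by leaf L11 inside fabric 1.
h1 = {"type": 2, "mac": "0000.1111.0001", "ip": "10.1.100.11", "l2vni": 10100}

# BL1/BL2 send the host route over the Layer 3 DCI connection to BL3/BL4,
# which advertise it inside fabric 2 as a route-type-5 update.
dci_route = export_host_route(h1, vrf="Tenant-1")
fabric2_route = import_as_route_type_5(dci_route, 50000, "anycast-vtep-BL3-BL4")
```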
After the completion of the preceding steps, the forwarding tables of the devices in fabrics 1 and 2 are updated as shown in Figure 5. H1 now can start communicating with H4 as described in the following steps and shown in Figure 6.
Note: The following steps assume that destination H4 is a silent host: that is, it has not sent any ARP message or data packet, and so its MAC and IP address information is not yet known in the EVPN control plane.
- H1, connected to leaf L11 in fabric 1 (subnet_100), generates a data packet destined for remote endpoint H4, which resides in fabric 2 in a different subnet (subnet_200). The traffic is hence sent to its default gateway (that is, the destination MAC address is the anycast gateway MAC address configured in the fabric), which is locally active on leaf L11.
- Leaf L11 has previously received an EVPN route-type-5 update indicating that subnet_200 is reachable through multiple next hops (equal-cost paths). In the specific example shown in Figure 6, those next hops are the anycast VTEP address for leaf nodes L12 and L13 and the anycast VTEP address for leaf nodes L14 and L15, because subnet_200 exists on all these switches. You should, in fact, configure all leaf nodes to announce the locally defined IP subnet prefixes within the fabric to allow the discovery of silent hosts. For more information about this specific design point, refer to the VXLAN design guide at http://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/guide-c07-734107.html.
- Note: The IP prefix associated with subnet_200 is also advertised in fabric 1 (route-type-5 BGP update) by border nodes BL1 and BL2, because they receive this information through the route peering over the Layer 3 DCI connection to remote fabric 2.
- In the specific example above, leaf L11 selects the anycast VTEP address on leaf nodes L14 and L15 as the valid next hop based on the hashing result. It then routes the VXLAN-encapsulated traffic to the leaf nodes through the transit L3VNI for that particular tenant (VRF instance).
- The receiving leaf (L14 in this example) performs a Layer 3 lookup but finds no specific host route entry for H4, so it routes the traffic to locally defined subnet_200. This routing triggers an ARP request destined for H4 that is flooded across the L2VNI associated with subnet_200.
- All the leaf nodes with the L2VNI locally defined receive the ARP request, including border nodes BL1 and BL2. One of the two border nodes decapsulates the VXLAN frame and floods the ARP request across VLAN 200 through its Layer 2 interfaces connected to the OTV devices. The OTV AED for VLAN 200 receives the ARP request and floods it across its extended overlay network to all remote OTV devices. The remote OTV AED in fabric 2 receives the ARP request and forwards it to its inside interface connecting to border nodes BL3 and BL4.
- Border node BL4 in this specific example receives the ARP request, encapsulates the packet, and floods it across fabric 2 in the VXLAN Layer 2 domain associated with subnet_200.
- The receiving leaf nodes flood the ARP request to all the local interfaces on which VLAN 200 is configured. In the specific case of leaf nodes L21 and L22, the former is designated to forward the traffic over the vPC connection to H4, which hence receives the ARP request.
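The hash-based choice between the equal-cost next hops for subnet_200 can be sketched as follows. The hash function here is only a stand-in (real switches use a hardware-specific function), and the flow fields, addresses, and next-hop labels are hypothetical.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick one equal-cost next hop from a hash of the flow identifiers.

    SHA-256 is a stand-in for the hardware hash; the point is only that
    the same flow always maps to the same next hop.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


subnet_200_next_hops = ["anycast-vtep-L12-L13", "anycast-vtep-L14-L15"]
# Flow fields: source IP, destination IP, protocol, source port, destination port.
flow = ("10.1.100.11", "10.1.200.44", 6, 49152, 443)
chosen = pick_next_hop(flow, subnet_200_next_hops)
```

Because the selection depends only on the flow fields, all packets of one flow follow the same path, while different flows can spread across both anycast VTEP addresses.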
Note that when you configure a common anycast gateway vMAC address across VXLAN fabrics, the OTV devices at each site will continuously update their Layer 2 tables, because they may receive on their Layer 2 internal interfaces ARP requests originating from endpoints connected to the local site. As a good practice, you thus should apply a route map on the OTV control plane to avoid communicating the anycast gateway MAC address information to remote OTV edge devices. You can apply a route map to the OTV Intermediate System–to–Intermediate System (IS-IS) control plane, as shown in the following configuration sample.
    mac-list Anycast_GW_MAC_deny seq 10 deny 0001.0001.0001 ffff.ffff.ffff
    mac-list Anycast_GW_MAC_deny seq 20 permit 0000.0000.0000 0000.0000.0000
    route-map Anycast_GW_MAC_filter permit 10
      match mac-list Anycast_GW_MAC_deny
    !
    otv-isis default
      vpn Overlay0
        redistribute filter route-map Anycast_GW_MAC_filter
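To see what the two mac-list entries in this sample do, the matching logic can be modeled in a few lines. This sketch rests on assumptions about the evaluation: entries are checked in sequence order, the first match wins, mask bits set to 1 must match, and an unmatched address is denied. The helper names are hypothetical.

```python
def mac_to_int(mac):
    """Convert a dotted MAC address such as 0001.0001.0001 to an integer."""
    return int(mac.replace(".", ""), 16)


def mac_list_action(entries, mac):
    """Return the action of the first entry whose masked address matches.

    Assumes first-match evaluation in sequence order; mask bits set to 1
    must match, and an address that matches no entry is denied.
    """
    value = mac_to_int(mac)
    for _seq, action, address, mask in sorted(entries):
        mask_bits = mac_to_int(mask)
        if (value & mask_bits) == (mac_to_int(address) & mask_bits):
            return action
    return "deny"


anycast_gw_mac_deny = [
    (10, "deny", "0001.0001.0001", "ffff.ffff.ffff"),     # exact match on the vMAC
    (20, "permit", "0000.0000.0000", "0000.0000.0000"),   # matches any other MAC
]

gateway_result = mac_list_action(anycast_gw_mac_deny, "0001.0001.0001")
host_result = mac_list_action(anycast_gw_mac_deny, "0000.1111.0001")
```

Under these assumptions, only the anycast gateway vMAC address is kept out of the OTV control plane, while every other MAC address is still advertised.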
Figure 7 shows the series of control-plane events triggered by the ARP reply originating from H4.
7. H4 sends an ARP reply to the MAC address that sourced the request (the anycast gateway MAC address identifying leaf L14 in fabric 1). However, because the same global anycast gateway MAC address is identically configured at both sites, local leaf L22 locally consumes the ARP reply. As a consequence, the ARP response will never get back to the original sender in fabric 1. This is not a problem, because, as clarified in the following steps, this process allows H4’s discovery to be triggered in both local fabric 2 and remote fabric 1.
8. The received ARP reply triggers on leaf nodes L21 and L22 the local discovery of H4. As a consequence, H4’s Layer 2 and 3 reachability information is announced to fabric 2 through an MP-BGP EVPN route-type-2 update. All the leaf nodes that have locally configured the L2VNI associated with subnet_200 (in this example, leaf nodes L23 and L24 and border nodes BL3 and BL4) receive this information.
9. Border nodes BL3 and BL4 advertise H4’s host route across the Layer 3 DCI connection to border nodes BL1 and BL2 in fabric 1.
10. Border nodes BL1 and BL2 receive the BGP update from fabric 2 and generate an MP-BGP EVPN route-type-5 update in fabric 1 for H4’s host-route information.
11. Now all the local leaf nodes can import H4’s host route into their forwarding tables, specifying the anycast VTEP address of border nodes BL1 and BL2 as the next hop.
At this point, leaf L11 has learned host route information for H4, which changes the forwarding behavior of the next data packet generated by the locally connected endpoint H1 destined for H4, as described in the next steps and shown in Figure 8:
- H1 generates another data packet destined for H4 and sends it to its local default gateway on leaf L11.
- Leaf L11 performs a Layer 3 lookup for H4, and this time it finds the host route pointing to the anycast VTEP address on border nodes BL1 and BL2 as the next hop. It hence encapsulates and unicasts the data packet to that anycast address.
- The receiving border node decapsulates the VXLAN packet and uses the retrieved L3VNI information to perform a Layer 3 lookup in H4’s routing domain. The result is a host route pointing to a next-hop router reachable through the Layer 3 DCI connection. Hence, traffic is sent to remote border nodes BL3 and BL4 in fabric 2.
- Border nodes BL3 and BL4 receive the traffic destined for H4 and perform a Layer 3 lookup in the appropriate VRF instance. Having previously received a route-type-2 update, the border node encapsulates the packet with a VXLAN header and routes it to the anycast VTEP address defined on leaf nodes L21 and L22 using the transit L3VNI defined for that specific tenant.
- Leaf L21 receives the packet and routes it locally to H4.
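The change in forwarding behavior on leaf L11 is a consequence of longest-prefix matching: once H4's host route is installed, it is preferred over the subnet_200 prefix. The sketch below uses hypothetical addresses and next-hop labels, and for brevity keeps a single next hop per prefix instead of the equal-cost set described earlier.

```python
import ipaddress


def lookup(routing_table, destination):
    """Longest-prefix match: the most specific matching prefix wins."""
    address = ipaddress.ip_address(destination)
    matches = [(ipaddress.ip_network(prefix), next_hop)
               for prefix, next_hop in routing_table.items()
               if address in ipaddress.ip_network(prefix)]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Leaf L11 before H4 is discovered: only the subnet prefix is known.
l11_vrf = {"10.1.200.0/24": "anycast-vtep-L14-L15"}
before = lookup(l11_vrf, "10.1.200.44")

# After the route-type-5 update carrying H4's host route is imported.
l11_vrf["10.1.200.44/32"] = "anycast-vtep-BL1-BL2"
after = lookup(l11_vrf, "10.1.200.44")
```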
Note that routed communication between endpoints connected to separate IP subnets in different VXLAN fabrics always uses the Layer 3 DCI connection between fabrics, whereas the Layer 2 DCI is reserved exclusively for intrasubnet flows. This behavior occurs regardless of whether the IP subnets are stretched or locally defined inside each fabric.