36 – VXLAN EVPN Multi-Fabrics with External Routing Block (part 2)


I recommend reading part 1 first if you missed it 🙂

thank you, yves

VXLAN EVPN Multi-Fabric with External Active/Active Gateways

The first use case is simple. Each VXLAN fabric behaves like a traditional Layer 2 network with a centralized routing block. External devices (such as routers and firewalls) provide default gateway functions, as shown in Figure 1.

Figure 1: External Routing Block IP Gateway for VXLAN/EVPN Extended VLAN

In the Layer 2–based VXLAN EVPN fabric deployment, the external routing block is used to perform routing functions between Layer 2 segments. The same routing block can also be connected to the WAN, both to advertise the public networks from each data center to the outside and to propagate external routes into each fabric.

The routing block consists of a “router-on-a-stick” design (from the fabric’s point of view) built with a pair of traditional routers, Layer 3 switches, or firewalls that serve as the IP gateway. These IP gateways are attached to a pair of vPC border nodes that initiate and terminate the VXLAN EVPN tunnels.

Connectivity between the IP gateways and the border nodes is achieved through a Layer 2 trunk carrying all the VLANs that require routing services.

To improve performance, the routing block can be duplicated in each data center with the same virtual IP and MAC addresses for all relevant SVIs on both sides, depending on the IP gateway platform of choice. With an active default gateway in each data center, east-west server-to-server traffic no longer has to hairpin between sites. To use active-active gateways on both fabrics, however, you must filter communications between gateways that belong to the same First-Hop Routing Protocol (FHRP) group; otherwise the gateways would see each other's hellos across the DCI and only one of them would stay active for the group. With OTV as the DCI solution, the FHRP filter is applied to the OTV control plane.

Figure 2 shows this scenario.
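As a rough illustration of what such an FHRP filter has to match, the Python sketch below classifies frames using the well-known HSRP and VRRP hello destinations and virtual MAC ranges. The `Frame` type and the `allow_across_dci` helper are hypothetical names introduced only for this example. This is not OTV configuration, and a real deployment would rely on the filtering features of the DCI platform.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Well-known virtual MAC ranges: HSRPv1, HSRPv2, VRRP (IPv4).
FHRP_VMAC = re.compile(
    r"^(0000\.0c07\.ac[0-9a-f]{2}"
    r"|0000\.0c9f\.f[0-9a-f]{3}"
    r"|0000\.5e00\.01[0-9a-f]{2})$"
)

HSRP_GROUPS = {"224.0.0.2", "224.0.0.102"}  # HSRPv1 and HSRPv2 hellos (UDP 1985)
VRRP_GROUP = "224.0.0.18"                   # VRRP advertisements (IP protocol 112)


@dataclass(frozen=True)
class Frame:
    """Hypothetical, simplified view of a frame seen at the DCI edge."""
    src_mac: str                     # dotted notation, e.g. "0000.0c07.ac64"
    dst_ip: str
    ip_proto: int                    # 17 = UDP, 112 = VRRP
    udp_dport: Optional[int] = None


def is_fhrp_vmac(mac: str) -> bool:
    """True if the MAC falls in an HSRP or VRRP virtual MAC range."""
    return FHRP_VMAC.match(mac.lower()) is not None


def is_fhrp_hello(frame: Frame) -> bool:
    """True if the frame looks like an HSRP hello or a VRRP advertisement."""
    if frame.ip_proto == 112 and frame.dst_ip == VRRP_GROUP:
        return True
    return (
        frame.ip_proto == 17
        and frame.udp_dport == 1985
        and frame.dst_ip in HSRP_GROUPS
    )


def allow_across_dci(frame: Frame) -> bool:
    """Drop FHRP hellos and anything sourced from a virtual gateway MAC."""
    return not (is_fhrp_hello(frame) or is_fhrp_vmac(frame.src_mac))
```

With this filter in place, each site's gateway pair never hears the remote pair, so both sides stay active for the same virtual IP and MAC addresses, while ordinary host traffic is still allowed across the DCI.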

Note: Although VLANs are locally significant per edge device (or even per port), and the Layer 2 virtual network identifier (VNI) is locally significant in each VXLAN EVPN fabric, the following examples assume that the same Layer 2 VNIs (L2VNIs) are reused on both fabrics. The same VLAN ID was also reused on each leaf node and on the border switches. This approach was used to simplify the diagrams and packet walk. In a real production network, the network manager can use different network identifiers for the Layer 2 and 3 VNIs deployed in the individual fabrics.
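To make the "locally significant" point concrete, here is a minimal Python sketch in which the two fabrics map the same hand-off VLAN to different L2VNIs. The VNI values for fabric 2 (20100, 20200) are hypothetical, and the sketch assumes that the VLAN tag on the DCI hand-off is the common identifier between the sites.

```python
# Per-fabric VLAN -> L2VNI tables. Fabric 1 uses the values from this
# article; the fabric 2 VNIs are hypothetical and chosen to differ.
FABRIC1 = {100: 10100, 200: 10200}
FABRIC2 = {100: 20100, 200: 20200}


def vlan_for(table: dict, vni: int) -> int:
    """Border node decapsulation: find the local VLAN mapped to a VNI."""
    return next(vlan for vlan, mapped in table.items() if mapped == vni)


def cross_fabric_vni(vni: int, src: dict, dst: dict) -> int:
    """VNI used in the destination fabric for a segment of the source fabric.

    The frame leaves the source fabric as an 802.1Q-tagged frame, crosses
    the DCI, and is re-encapsulated with the destination fabric's own VNI.
    """
    return dst[vlan_for(src, vni)]
```

For example, `cross_fabric_vni(10100, FABRIC1, FABRIC2)` yields the fabric 2 VNI for the segment that fabric 1 carries as L2VNI 10100.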

Figure 2: VXLAN EVPN Layer 2 Fabric with External Routing Block

Figure 2 shows the following:

Note: The basic assumption is that H1 and H2 have already populated their ARP tables with the default gateway information, for example because they previously sent ARP requests targeted to the gateway. As a consequence, the gateway also has H1 and H2 information in its local ARP table.

  1. Host H1 connected to leaf L11 in fabric 1 needs to send a data packet to host H2 connected in vPC mode to a pair of leaf nodes, L12 and L13, in the same fabric 1. Because H1 and H2 are part of different IP subnets, H1 sends the traffic to the MAC address of its default gateway, which is deployed on an external Layer 3 device connected to the same fabric 1. The communication between H1 and the default gateway uses the VXLAN fabric as a pure Layer 2 overlay service.
  2. Traffic from H1 that belongs to VLAN 100 is VXLAN encapsulated by leaf L11 with L2VNI 10100 (locally mapped to VLAN 100) and sent to the egress anycast VTEP address defined in border nodes 1 and 2. This address represents the next hop to reach the MAC address of the default gateway. Layer 3 equal-cost multipath (ECMP) is used in the fabric to load-balance the traffic destined for the egress anycast VTEP between the two border nodes. In the example in Figure 2, border node BL1 is selected as the destination.
  3. BL1 decapsulates the VXLAN frame and bridges the original frames destined for the local default gateway onto VLAN 100 (locally mapped to L2VNI 10100).
  4. The default gateway receives the frame destined for H2, performs a Layer 3 lookup, and subsequently forwards the packet to the Layer 2 segment on which H2 resides.
  5. The Layer 2 flow reaches one of the vPC-connected border nodes (BL2 in this example). BL2 uses the received IEEE 802.1Q tag (VLAN 200) to identify the locally mapped L2VNI, 10200, which it uses to VXLAN-encapsulate the frame. BL2 then forwards the data packet to the anycast VTEP address defined on the vPC pair of leaf nodes, L12 and L13, on which the destination H2 is connected.
  6. One of the receiving leaf nodes is designated to decapsulate the VXLAN frame and send the original data packet to H2.
  7. Routed communications between endpoints located at the remote site are kept local within fabric 2. This behavior is possible only because FHRP filtering is enabled on the OTV edge devices. In this example, H4 sends traffic destined for H6 using its local default gateway active in data center DC2. East-west routed traffic can consequently be localized within each fabric, eliminating unnecessary interfabric traffic hairpinning.
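Steps 1 through 6 of the packet walk can be traced with a small Python model. This is only a sketch under stated assumptions: the table contents, the placeholder MAC and VTEP names, and the `pick` hash are hypothetical stand-ins for what the fabric learns through EVPN route type 2 and for the real ECMP and port-channel hashing.

```python
import zlib

VLAN_TO_VNI = {100: 10100, 200: 10200}
VNI_TO_VLAN = {vni: vlan for vlan, vni in VLAN_TO_VNI.items()}

# MAC -> next-hop anycast VTEP, as learned from EVPN route type 2.
# The names below are placeholders, not real addresses.
MAC_TABLE = {"gw-mac": "vtep-border", "h2-mac": "vtep-l12-l13"}
VTEP_MEMBERS = {"vtep-border": ["BL1", "BL2"], "vtep-l12-l13": ["L12", "L13"]}


def pick(key: str, members: list) -> str:
    """Stand-in for ECMP or port-channel hashing: stable choice per flow."""
    return members[zlib.crc32(key.encode()) % len(members)]


def encapsulate(vlan: int, dst_mac: str) -> tuple:
    """Return the L2VNI and the destination VTEP for a bridged frame."""
    return VLAN_TO_VNI[vlan], MAC_TABLE[dst_mac]


def packet_walk(flow: str) -> list:
    steps = []
    # Steps 1-2: L11 bridges H1's frame toward the gateway MAC.
    vni, vtep = encapsulate(100, "gw-mac")
    border = pick(flow, VTEP_MEMBERS[vtep])
    steps.append(f"L11 encapsulates VLAN 100 in L2VNI {vni} toward {border}")
    # Step 3: the border node decapsulates and bridges to the gateway.
    steps.append(f"{border} decapsulates and bridges onto VLAN {VNI_TO_VLAN[vni]}")
    # Step 4: the external gateway routes between the two subnets.
    steps.append("gateway routes the packet from VLAN 100 to VLAN 200")
    # Step 5: the routed frame returns on the vPC trunk tagged with VLAN 200.
    ingress = pick(flow + "/routed", VTEP_MEMBERS["vtep-border"])
    vni, vtep = encapsulate(200, "h2-mac")
    steps.append(f"{ingress} encapsulates VLAN 200 in L2VNI {vni} toward {vtep}")
    # Step 6: one of the vPC leaf nodes decapsulates and delivers to H2.
    leaf = pick(flow, VTEP_MEMBERS[vtep])
    steps.append(f"{leaf} decapsulates and delivers the packet to H2")
    return steps
```

Calling `packet_walk("H1->H2")` returns one line per hop. Which border node and which leaf node are selected depends on the hash, just as the choice of BL1 and BL2 in the text is only one possible outcome.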



11 Responses to 36 – VXLAN EVPN Multi-Fabrics with External Routing Block (part 2)

  1. A Abs says:

    Thanks for your great and detailed blog. Is it possible, instead of using a separate border VTEP plus a separate L3 switch/router, to use a single pair of Nexus 9300-EX/FX to provide both the VTEP functionality and the IP gateway service and routing? I think the ASIC needs to support something called “routing in and out of tunnels (RIOT)” (https://cumulusnetworks.com/blog/vxlan-designs-part-1/) so that VTEP bridging and IP routing occur in a single lookup, and we don’t need separate edge VTEP devices alongside a separate L3 IP router/gateway. I think the central routing block is simple to implement and less complicated for a CSP that hosts VMs and whose traffic is mostly N-S and not E-W.

    • Yves says:

      I left this article 36.xxx here as the packet walk is IMHO helpful to better understand how route types 2 and 5 work inside the fabric, but it is somewhat obsolete now that we have Multi-site support with the CloudScale ASIC.

      To answer your question, yes, this is possible. Check post 37 at http://yves-louis.com/DCI/?p=1588 and look at figures 9 and 10. Let me know if that answers your question.

      Best regards, yves

  2. A Abs says:

    Thanks for the reply. I checked figures 9 and 10 as you mentioned. What I was looking for is intra-VRF (within a VRF/tenant) VTEP plus IP default gateway functions on the same edge/border devices in a single POD (not between sites). I think you named it a Layer 2-only VXLAN fabric. According to post #37 (figures 8, 9, 10) I got my answer: we can have a central routing block with an MP-BGP EVPN control plane for the hosts' default gateway service instead of using the anycast default gateway feature on the ToR/leaf switches.
    One question left to ask: will the ToR/leaf switches with end hosts attached learn both route types 2 and 5, or only type 5 (IP prefix)?
    Thanks again for letting us learn the new DC fabric in depth.

    • Yves says:

      Oh, I see. BTW, the central routing block was a good option at a time when the L3 anycast gateway was not available. For almost 4 years now, we have had the distributed Layer 3 anycast gateway, which is more efficient and more scalable, offering multitenancy from the fabric itself.
      When an external routing block is deployed in conjunction with EVPN (Layer 2 only), the leaf nodes (VTEPs) learn host reachability information using route type 2 only.
      Best regards, yves

      • A Abs says:

        Thanks, Yves. I totally agree with you, but I think the anycast gateway is meant for E-W traffic within the tenant/VRF. For a CSP that hosts only 1 or 2 VMs per customer, I don’t see any value in the anycast gateway. Maybe EVPN MP-BGP is not the right fabric for them.

        • Yves says:

          Forgive me if I’m missing your point here, but the L3 anycast gateway is not just E-W traffic optimization. Having the default gateway for any VM (1 or many) directly at the next physical hop (ToR) is certainly better from a scalability and performance point of view than a centralized routing block.
          Or is there any function that a centralised routing block can bring that the distributed L3 anycast gateway can’t offer, and that I’m missing?

          Thank you, yves

  3. A Abs says:

    My assumption about the anycast gateway was that it is typically used for E-W traffic (for 3-tier Web-App-DB) to decrease the latency of L3 switching and to ease VM mobility when something like vMotion happens and a VM moves between hosts (ESXi) in different racks. From my point of view, a CSP that hosts 1000 VMs for 1000 customers (one VM per customer) doesn’t see any E-W style traffic. All traffic goes N-S to the Internet, to the customer's on-premises site, or to a private cloud. I think Layer 2-only EVPN is simple to use for a small CSP that is still not using automation tools for EVPN. I need to learn more about the use cases of EVPN, as I am really new to it.
    Thank you again,

    • Yves says:

      As mentioned previously, the L3 anycast GW is not just an E-W traffic improvement. Among other things, it distributes the gateway function across every ToR. Each ToR to which hosts/VMs are locally attached therefore routes on its own (L3 VNI) toward the next hop (border leaf) where the external core router is reachable (N-S), improving the whole routing workflow. It also gives you more granular and easier control of multitenancy. Why do you think L2-only EVPN is simpler to use for a small CSP?

  4. A Abs says:

    One big question for me about the anycast gateway: if we want to secure the communication between the tiers of, say, a 3-tier app (Web-App-DB), we eventually need to send traffic to a cluster of stateful FWs sitting behind a pair of service leafs, so the traffic needs to pass through a central block for further inspection. How can we segment a network when the ToRs use the L3 anycast gateway and all traffic routing is performed on the ToR/leaf switches?

    • Yves says:

      Assume the FW cluster is in routed mode. One option is to deploy a VRF (tenant) per traffic side (out-band/in-band) and route data packets toward the FW. To locate the active FW, you can leverage HMM route tracking or (the preferred method) use a recursive next hop to reach the FW of interest. You can also leverage PBR inside the VRF to redirect selected traffic to the external FW.
      Best regards, yves

  5. A Abs says:

    Thank you, Yves, for your great work here; you go really deep into the topics.
    Best regards,
