30 – VxLAN/EVPN and Integrated Routing Bridging



As I mentioned in post 28 – Is VxLAN Control Plane a DCI solution for LAN extension, VxLAN/EVPN is taking a big step forward with its control plane and could potentially be used to extend Layer 2 segments across multiple sites. However, it is still crucial to keep in mind its weaknesses and gaps with respect to DCI. Neither VxLAN nor VxLAN/EVPN was designed to natively offer a DCI solution (posts 26 and 28).

DCI is not just a Layer 2 extension between two or more sites. DCI/LAN extension aims to offer business continuity and elasticity for the cloud (hybrid cloud). It provides disaster recovery and disaster avoidance services for enterprise business applications; consequently, it must be very robust and efficient. Because it extends the Layer 2 broadcast domain, it is really important to understand the requirements for a solid DCI/LAN extension and how we can leverage the right tools and network services to address some of the shortcomings of the current VxLAN/EVPN implementation when it is used as a DCI solution.

In this article we will examine the integrated anycast L3 gateway available with VxLAN/EVPN MP-BGP control plane, which is one of the key DCI requirements for long distances when the first hop default gateway can be duplicated on multiple sites.

Integrated Routing and Bridging

One of the needs for an efficient DCI deployment is a duplicated Layer 3 default gateway, achieved for example with FHRP isolation (24 – Enabling FHRP filter), which reduces hairpinning while machines migrate from one site to another. In short, the same default gateway (virtual IP address + virtual MAC) exists and is active on both sites. When a virtual machine or a whole multi-tier application performs a hot live migration to a remote location, it is very important that the host continues to use its default gateway locally and that communication proceeds without any interruption. Without FHRP isolation, multi-tier applications (E-W traffic) suffer in terms of performance because they must reach their default gateway in the primary data center.

This operation should happen transparently with the new local L3 gateway, while the round trip via the original location is eliminated. No configuration is required on the Host. All active sessions can be maintained statefully.

Modern techniques such as Anycast gateway can offer at least the same efficiency as FHRP isolation, and not only for Inter-Fabric (DCI) workflows but also for Intra-Fabric traffic optimization. This task is achieved in an easier way because it doesn’t require specific filtering to be configured on each site. The function of the Anycast Layer 3 gateway is natively embedded with the BGP EVPN control plane. And last but not least, in conjunction with VxLAN/EVPN there are as many Anycast L3 gateways as Top-of-Rack switches (Leafs with VxLAN/EVPN enabled).

The EVPN IETF draft elaborated the concept of Integrated Routing and Bridging based on EVPN to address inter-subnet communication between Hosts or Virtual Machines that belong to different VxLAN segments. This is also known as inter-VxLAN routing. VxLAN/EVPN offers natively the Anycast L3 gateway function with the Integrated Routing Bridging feature (IRB).

VxLAN/EVPN MP-BGP Host learning process

The following section is a brief reminder of the Host reachability discovery and distribution process, which helps to understand the different routing modes discussed later.

The VTEP function happens on a First Hop Router usually enabled on the ToR where the Hosts are directly attached. EVPN provides a learning process to dynamically discover the local end-points attached to their local VTEP, distributing afterward the information (Host’s MAC and IP reachability) toward all other remote VTEPs through the MP-BGP control plane. Subsequently, all VTEPs know all end-point information that belongs to their respective VNI.

Let’s illustrate this with the following example.


vPC Anycast VTEP

VLAN 100 (blue) maps to VNI 10000

  • Host A:
  • Host C:
  • Host E:
  • Host F:

VLAN 30 (orange) maps to VNI 5030

  • Host B:
  • Host D:

VLAN 3000 (purple) maps to VNI 300010.

  • Host G:

In order to keep the following scenarios simple, each pair of physical Leafs (vPC) will be represented by a single logical switch with the Anycast VTEP stamped to it.


vPC Anycast VTEP – Physical View vs. Logical View


VLAN 30 and VLAN 3000 are not present in Leaf 1, consequently it is not required to create and map the VLAN 30 to VNI 5030, nor VLAN 3000 to VNI 300010 within VTEP 1. VTEP 1 only needs to get the reachability information for Hosts A, C, E and F (IP & MAC addresses) belonging to VLAN 100. Host A is local to VTEP 1 and Hosts C, E and F are remotely extended using the network overlay VNI 10000.


VTEP 2 learned Hosts A, C, E and F, attached to VLAN 100, and Hosts B and D, belonging to VLAN 30. Hosts B and C are local.


VTEP 3 learned Hosts A, C, E and F, attached to VLAN 100, and Hosts B and D, belonging to VLAN 30. Hosts D and E are local.


VTEP 4 learned Hosts A, C, E, and F on VLAN 100 and Host G on VLAN 3000. Hosts F and G are local.

Notice that only VTEP 4 learns Host G, which belongs to VLAN 3000. Consider Host G to be isolated from the other segments, like for example a database accessible only via routing.
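As a rough illustration, the learning scope described above can be sketched in Python. The two dictionaries simply restate the example topology (host attachment and the L2 VNIs mapped on each VTEP); the function names are hypothetical, and this models the outcome only, not how a switch implements it.

```python
# Host -> (attached VTEP, L2 VNI), restating the example topology above.
HOSTS = {
    "A": ("VTEP1", 10000),
    "B": ("VTEP2", 5030),
    "C": ("VTEP2", 10000),
    "D": ("VTEP3", 5030),
    "E": ("VTEP3", 10000),
    "F": ("VTEP4", 10000),
    "G": ("VTEP4", 300010),
}

# L2 VNIs that have a local VLAN mapping on each VTEP.
VNIS = {
    "VTEP1": {10000},
    "VTEP2": {10000, 5030},
    "VTEP3": {10000, 5030},
    "VTEP4": {10000, 300010},
}


def learned_hosts(vtep):
    """Hosts kept in the VTEP's host table: those in a locally mapped L2 VNI."""
    return sorted(h for h, (_, vni) in HOSTS.items() if vni in VNIS[vtep])


def local_hosts(vtep):
    """Hosts directly attached to the VTEP."""
    return sorted(h for h, (owner, _) in HOSTS.items() if owner == vtep)


for vtep in sorted(VNIS):
    print(vtep, "learned:", learned_hosts(vtep), "local:", local_hosts(vtep))
```

Only VTEP 4 ends up with Host G in this model, because it is the only VTEP that maps VNI 300010.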


VLAN to Layer 2 VNI mapping

Host reachability information is learned and distributed by the MP-BGP EVPN control plane, which integrates the routing and bridging (IRB) function. Each VTEP learns its local end-points through the data plane (e.g. the source MAC of a unicast packet or an ARP/GARP/RARP) and registers them, with their respective MAC and IP addresses, using the Host Mobility Manager (HMM). It then distributes this information through the MP-BGP EVPN control plane. In addition to registering its locally attached end-points (HMM), every VTEP populates its Host table with the devices learned through the control plane (MP-BGP EVPN AF).
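The learn-then-distribute sequence can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions: the `Vtep` class, its method names and the `mac-A`/`ip-A` style identifiers are hypothetical placeholders, the "HMM" and "BGP" tags only mirror the two learning sources named in the text, and only the Layer 2 host table is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Vtep:
    name: str
    vnis: set  # L2 VNIs with a local VLAN mapping
    # (vni, mac) -> {"ip": ..., "producer": "HMM" or "BGP", "next_hop": ...}
    host_table: dict = field(default_factory=dict)

    def learn_local(self, vni, mac, ip, fabric):
        """Data-plane learning of a local end-point, then advertise it."""
        self.host_table[(vni, mac)] = {"ip": ip, "producer": "HMM", "next_hop": "local"}
        for peer in fabric:
            if peer is not self:
                peer.receive_update(vni, mac, ip, next_hop=self.name)

    def receive_update(self, vni, mac, ip, next_hop):
        """Control-plane learning: keep the entry only if the L2 VNI is local."""
        if vni in self.vnis:
            self.host_table[(vni, mac)] = {"ip": ip, "producer": "BGP", "next_hop": next_hop}


vtep1 = Vtep("VTEP1", {10000})
vtep4 = Vtep("VTEP4", {10000, 300010})
fabric = [vtep1, vtep4]

vtep1.learn_local(10000, "mac-A", "ip-A", fabric)
vtep4.learn_local(10000, "mac-F", "ip-F", fabric)
vtep4.learn_local(300010, "mac-G", "ip-G", fabric)

for (vni, mac), entry in sorted(vtep1.host_table.items()):
    print(vni, mac, entry)
```

In this model VTEP 1 holds Host A as a local entry and Host F as a BGP-learned entry behind VTEP 4, while Host G is discarded because VNI 300010 is not mapped locally.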

From Leaf 1 (VTEP 1):


L2route from Leaf 1

The “show L2route” command[1] above details the local Host A, learned by the Host Mobility Manager (HMM), as well as Hosts C, E and F, learned through BGP together with the next hop (remote VTEP) to reach them.

[1] Please notice that the output has been modified to show only what’s relevant to this section.

From Leaf 3 (VTEP 2):


L2route from Leaf 3

Host B and Host C are local, whereas Host D and Host E are remotely reachable through VTEP 3, and Host F appears behind VTEP 4.

From Leaf 5 (VTEP 3):


L2route from Leaf 5

Host D and Host E are local. All other Hosts except Host G are reachable via the remote VTEPs listed in the Next Hop column.

From Leaf 7 (VTEP 4):


L2route from Leaf 7

Host F and Host G are local. VTEP 4 registers only the Hosts that belong to VLAN 100 and VLAN 3000 across the whole VxLAN domain.

With the Host table information, each VTEP knows how to reach all Hosts within its configured VLAN to L2 VNI mapping.

Asymmetric or symmetric Layer 3 workflow

The EVPN draft defines two different operational models to route traffic between VxLAN overlays. The first method describes an asymmetric workflow across different subnets and the second method leverages the symmetric approach.

Vendors offering VxLAN with an EVPN control plane may implement either of these operational models (assuming the hardware/ASIC supports VxLAN routing). Some may choose the asymmetric approach, perhaps because it is easier to implement in software, but it is less efficient than the symmetric mode and carries risks that affect scalability. Others will choose the symmetric model, which populates Host information more efficiently and scales better.

Assuming the following scenario:

  • Host A (VLAN 100) wants to communicate with Host G in a different subnet (VLAN 3000)
  • VLAN 100 is mapped to VNI 10000
  • VLAN 30 is mapped to VNI 5030
  • VLAN 3000 is mapped to VNI 300010
  • We assume all Hosts and VLANs belong to the same Tenant-1 (VRF)
  • * L3 VNI for this Tenant of interest is VNI 300001 (for symmetric IRB)
  • We assume within the same Tenant-1 that all Hosts from different subnets are allowed to communicate with each other (using inter-VxLAN routing).
  • Host A shares the same L2 segment with Hosts C, E, and F spread over remote Leafs
  • Host G is Layer 2 isolated from all other Hosts. Therefore a Layer 3 routing transport is required to communicate with Host G.

* explained with Symmetric IRB mode

Asymmetric IRB mode:

When an end-point wants to communicate with another device on a different IP network, it sends the packet with the destination MAC address set to that of its default gateway. Its first-hop router (ingress VTEP) performs the routing lookup and routes the packet to the destination L2 segment. When the egress VTEP receives the packet, it strips off the VxLAN header and bridges the original frame to the VLAN of interest. With asymmetric routing mode, the ingress VTEP (the local VTEP where the source is attached) performs both bridging and routing, whereas the egress VTEP (the remote VTEP where the destination sits) performs only bridging. Consequently, the return traffic takes a different VNI, hence a different overlay tunnel.

In the following example, Host A wants to communicate with Host G and sends the packet toward its default gateway. VTEP 1 sees that the destination MAC is its own address and does a routing lookup for Host G. It finds the destination Host IP in its Host table, along with the next hop, VTEP 4, to reach it. VTEP 1 routes the packet and encapsulates it into the L2 VNI 300010 with VTEP 4 as the destination IP address. VTEP 4 receives the packet, strips off the VxLAN header and bridges the frame to its VLAN 3000 toward Host G (with the default gateway MAC as the source MAC address).

When Host G responds, VTEP 4 receives the frame on VLAN 3000 (mapped to VNI 300010), performs the routing lookup and encapsulates the packet directly into VNI 10000, where Host A is registered. The egress VTEP 1 then bridges the packet received from VNI 10000 toward Host A in VLAN 100.


Asymmetric Routing

The drawback of this asymmetric routing implementation is that it requires a consistent configuration across all VTEPs: each must be configured with all the VLANs and VNIs involved in routing and bridging across the fabric. In addition, each VTEP needs to learn the reachability information (MAC and IP addresses) for all Hosts that belong to all VNIs of interest.

In the above illustration, VTEP 1 needs to be configured with VLAN 100 mapped to VNI 10000, as well as with the VLAN IDs mapped to VNIs 5030 and 300010, even though no Host is attached to those VLANs. Since the VLAN ID is only locally significant, what matters is that the mapping to the VNI of interest exists. Therefore, in this example the Host tables on all leafs are populated with the reachability information for Hosts A, B, C, D, E, F and G. In a large VxLAN/EVPN deployment, this implementation of asymmetric IRB adds some configuration complexity, but above all it may have a significant impact on scalability.
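The asymmetric behaviour, and the configuration requirement that comes with it, can be sketched as follows. This is an illustrative model only: the function name and dictionaries are hypothetical, and the VNI values are those of the scenario above.

```python
HOST_VTEP = {"A": "VTEP1", "G": "VTEP4"}
HOST_VNI = {"A": 10000, "G": 300010}


def asymmetric_route(src, dst, vnis_on_vtep):
    """Return the encapsulation chosen by the ingress VTEP in asymmetric IRB."""
    ingress, egress = HOST_VTEP[src], HOST_VTEP[dst]
    dst_vni = HOST_VNI[dst]
    # The ingress VTEP routes straight into the destination L2 VNI,
    # so that VNI must be mapped locally even without attached hosts.
    if dst_vni not in vnis_on_vtep[ingress]:
        raise LookupError(f"{ingress} has no mapping for L2 VNI {dst_vni}")
    return {"ingress": ingress, "egress": egress, "vni": dst_vni}


ALL_VNIS = {10000, 5030, 300010}
full_config = {"VTEP1": ALL_VNIS, "VTEP4": ALL_VNIS}

forward = asymmetric_route("A", "G", full_config)
reverse = asymmetric_route("G", "A", full_config)
print("A -> G uses VNI", forward["vni"])
print("G -> A uses VNI", reverse["vni"])

# With only the locally needed VNI on VTEP 1, routing toward Host G fails.
minimal_config = {"VTEP1": {10000}, "VTEP4": {10000, 300010}}
try:
    asymmetric_route("A", "G", minimal_config)
except LookupError as err:
    print("not routable:", err)
```

The two directions use different VNIs, and the ingress VTEP can only route if the destination L2 VNI is mapped locally, which is the scalability concern raised above.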

Symmetric IRB

Symmetric routing behaves differently in the sense that both the ingress and egress VTEPs provide bridging and routing functions. This introduces a new concept known as the transit L3 VNI. This L3 VNI is dedicated to routing within a tenant VRF. Indeed, the L3 VNI offers L3 segmentation per tenant VRF. Each VRF instance is mapped to a unique L3 VNI in the network. The tenant VLAN on which a packet is received determines the VRF context to which that packet belongs. As a result, inter-VxLAN routing is performed through the L3 VNI within a particular VRF instance.

Notice that each VTEP can support several hundred VRFs (depending on the hardware, though).

In the following example, all Hosts (A, B, C, D, E, F and G) belong to the same tenant VRF. When Host A wants to talk to Host G in a different subnet, the local (ingress) VTEP sees that the destination MAC is the MAC of the default gateway (AGM – Anycast Gateway MAC). As Host G is not known via Layer 2, it routes the packet through the L3 VNI 300001. It rewrites the inner destination MAC address with the router MAC address of the egress VTEP 4, which is unique to each VTEP. Once the remote (egress) VTEP 4 receives the encapsulated VxLAN packet, it strips off the VxLAN header and does a MAC lookup, identifying the destination MAC as its own. Accordingly, it performs an L3 lookup and routes the received packet from the L3 VNI to the destination L2 VNI 300010, which maps to VLAN 3000 where Host G resides. VTEP 4 finally maps VNI 300010 to VLAN 3000 and forwards the frame with Host G as the L2 destination.


Symmetric Routing

For return traffic, VTEP 4 performs the same packet walk using the same transit L3 VNI 300001, but in the opposite direction.
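The symmetric packet walk can be sketched in the same style. This is a minimal model, not an implementation: the function name is hypothetical and the router MAC values are placeholder strings, since the real addresses are not given here.

```python
L3_VNI = 300001  # transit L3 VNI of Tenant-1, from the scenario above
HOST_VTEP = {"A": "VTEP1", "G": "VTEP4"}
HOST_VNI = {"A": 10000, "G": 300010}
# Placeholder router MACs (one unique value per VTEP); not real addresses.
ROUTER_MAC = {"VTEP1": "rmac-vtep1", "VTEP4": "rmac-vtep4"}


def symmetric_route(src, dst):
    """Describe the encapsulation used for routed traffic in symmetric IRB."""
    ingress, egress = HOST_VTEP[src], HOST_VTEP[dst]
    return {
        "ingress": ingress,
        "egress": egress,
        "vni": L3_VNI,                        # same transit VNI in both directions
        "inner_dst_mac": ROUTER_MAC[egress],  # rewritten by the ingress VTEP
        "egress_l2_vni": HOST_VNI[dst],       # egress routes into this L2 VNI
    }


forward = symmetric_route("A", "G")
reverse = symmetric_route("G", "A")
print(forward)
print(reverse)
```

Both directions carry the same VNI in the VxLAN header; only the inner destination MAC and the final L2 VNI on the egress side differ.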

The following L2route command displays the Host reachability information that exists in VTEP 1’s Host table. It shows neither Host G, since no VLAN on Leaf 1 (VTEP 1) maps locally to VNI 300010, nor any information related to VNI 5030, which keeps its Host reachability table to the minimum required information.


L2route from Leaf 1

Nonetheless, Leaf1 knows how to route the packet to that destination. The Next hop is Leaf_4 via the L3 VxLAN segment ID 300001, as displayed in the following routing information.


Show ip route from Leaf 1

The “show IP route” command above shows that Host G is reachable via the VxLAN segment ID 300001 with VTEP 4 as the next hop.


Show bgp l2vpn evpn from Leaf 1

The “show bgp L2vpn evpn” command[1] from Leaf 1 displays the information related to Host G, reachable via the next hop VTEP 4 using the L3 VNI 300001.

Note also, for information only, that the label shows both the L2 VNI 300010 to which Host G belongs and the L3 VNI 300001 used to reach Host G.

[1] Please notice that the output has been modified to display only information relevant to Host G.
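To make the difference in table population concrete, here is a small comparison of what a VTEP has to hold in each mode. It is a sketch under assumptions: the topology dictionaries restate the example, the function names are hypothetical, and it assumes that under symmetric IRB every Host of the tenant outside the local L2 VNIs is reachable through the VRF routing table only, as shown above for Host G.

```python
# Host -> (attached VTEP, L2 VNI), restating the example topology.
HOSTS = {
    "A": ("VTEP1", 10000), "B": ("VTEP2", 5030), "C": ("VTEP2", 10000),
    "D": ("VTEP3", 5030), "E": ("VTEP3", 10000), "F": ("VTEP4", 10000),
    "G": ("VTEP4", 300010),
}
LOCAL_VNIS = {
    "VTEP1": {10000},
    "VTEP2": {10000, 5030},
    "VTEP3": {10000, 5030},
    "VTEP4": {10000, 300010},
}


def symmetric_tables(vtep):
    """Return (L2 host table entries, Hosts reached only via the L3 VNI)."""
    l2 = sorted(h for h, (_, vni) in HOSTS.items() if vni in LOCAL_VNIS[vtep])
    routed_only = sorted(set(HOSTS) - set(l2))
    return l2, routed_only


def asymmetric_table(vtep):
    """Asymmetric IRB: every VTEP maps every L2 VNI, so it holds every Host."""
    return sorted(HOSTS)


l2, routed = symmetric_tables("VTEP1")
print("symmetric  L2 host table:", l2, "routed via L3 VNI:", routed)
print("asymmetric L2 host table:", asymmetric_table("VTEP1"))
```

For VTEP 1 the symmetric model keeps four entries in the Layer 2 host table, against seven in the asymmetric model.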

Anycast L3 gateway

One of the useful benefits of this approach is the capacity to offer the default gateway role from the directly attached Leaf (First Hop Router instantiated on the ToR). All leafs are configured with the same default gateway IP address for a specific subnet, as well as the same vMAC address. All routers use the same virtual Anycast Gateway MAC (AGM) for all default gateways. This is explicitly configured during the VxLAN/EVPN setup.

Leaf(config)# fabric forwarding anycast-gateway-mac 0001.0001.0001

When a virtual machine processes a “hot live migration” to a Host attached to a different Leaf, it will continue to use the same default gateway parameter without interruption.
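A toy model shows why the migration is transparent to the Host. The classes and the gateway IP placeholder are hypothetical; only the AGM value comes from the configuration line above.

```python
AGM = "0001.0001.0001"        # anycast gateway MAC from the configuration above
GATEWAY_IP = "gw-ip-vlan100"  # placeholder: the real gateway IP is not given here


class Leaf:
    def __init__(self, name):
        self.name = name

    def arp_reply(self, target_ip):
        """Every leaf answers for the gateway IP with the shared anycast MAC."""
        return AGM if target_ip == GATEWAY_IP else None


class VirtualMachine:
    def __init__(self, leaf):
        self.leaf = leaf
        self.arp_cache = {}

    def gateway_mac(self):
        """Resolve the gateway once, then reuse the cached entry."""
        if GATEWAY_IP not in self.arp_cache:
            self.arp_cache[GATEWAY_IP] = self.leaf.arp_reply(GATEWAY_IP)
        return self.arp_cache[GATEWAY_IP]

    def migrate(self, new_leaf):
        """Live migration: the attachment point changes, the ARP cache does not."""
        self.leaf = new_leaf


leaf1, leaf7 = Leaf("Leaf 1"), Leaf("Leaf 7")
vm = VirtualMachine(leaf1)
mac_before = vm.gateway_mac()
vm.migrate(leaf7)
mac_after = vm.gateway_mac()
print(mac_before, mac_after, "first hop is now", vm.leaf.name)
```

The cached gateway entry stays valid after the move because the new leaf owns the same gateway IP and MAC, so the Host needs no reconfiguration and no new ARP resolution.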


Anycast Layer 3 Gateway

In the above scenario, the server “WEB” in DC-1 communicates (1) with its database, which is attached to a different Layer 2 network, and therefore a different subnet, in DC-2. The local Leaf 1 performs the default gateway role for server WEB as the First Hop Router, which is simply its own ToR with the required VxLAN/EVPN and IRB services up and running.

While server WEB is communicating with the end-user (not represented here) and its database, it does a live migration (2) toward the remote VxLAN fabric where its database server resides.

The MP-BGP EVPN AF control plane notices the movement and the new location of server WEB. It then increases the MAC mobility sequence number for that particular end-point and notifies all VTEPs of the new location. As the sequence number is now higher than the original value, all VTEPs update their Host tables with the new “next hop” (egress VTEP 4).
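The sequence-number rule can be sketched as a simple comparison. This is a simplified model: the function name and the `mac-WEB` identifier are hypothetical, and tie-breaking between equal sequence numbers is deliberately left out.

```python
def apply_update(host_table, mac, next_hop, seq):
    """Install an advertisement only if its mobility sequence number is newer."""
    current = host_table.get(mac)
    if current is None or seq > current["seq"]:
        host_table[mac] = {"next_hop": next_hop, "seq": seq}
        return True
    return False


table = {}
apply_update(table, "mac-WEB", "VTEP1", 0)          # initial location in DC-1
moved = apply_update(table, "mac-WEB", "VTEP4", 1)  # advertised after the migration
stale = apply_update(table, "mac-WEB", "VTEP1", 0)  # late advertisement is ignored
print(table["mac-WEB"], moved, stale)
```

A remote VTEP that applies this rule ends up pointing at VTEP 4 regardless of the order in which the old and new advertisements arrive.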

Without interruption of the current active sessions, server WEB continues to transparently use its default gateway, locally available from its new physical First Hop Router, now being Leaf 7 and 8 (VTEP 4). The East-West communication between WEB and DB happens locally within DC-2.





37 Responses to 30 – VxLAN/EVPN and Integrated Routing Bridging

  1. Thanks for this very interesting article (as always).
    I have one comment and one enhancement to propose:

    1) Asymmetric vs Symmetric IRB:
    “The drawback of this asymmetric routing implementation is that all VTEPs need to be configured with all VLANs and VNIs concerned with routing and bridging communication.”
    Why would the VTEPs need to be configured with the remote VLANs? Thanks to EVPN, they can have reachability information regarding all the VMs and when a particular VTEP needs to route a packet to a different L2 VNI, it uses that specific VXLAN ID in asymmetric IRB – not the VLAN – to encapsulate the packet within the VXLAN header.

    “In addition, it needs to learn the reachability information (MAC and IP addresses) for all Hosts that belong to all VNI of interest.”
    Isn’t that what EVPN is all about? I mean all VTEPs exchange their VMs reachability information through MP-BGP, whether we decide to implement asymmetric or symmetric IRB. The BGP VXLAN/EVPN routing table is available and filled with the same information in both cases, isn’t it?

    From where I stand, the only difference between both ways to design IRB is the VXLAN encapsulation and its consequence on the path the packets take in both directions:
    – asymmetric IRB: the source L2 frame is encapsulated with the ***destination L2 VNI*** in the VXLAN header which leads to different paths taken by the traffic in both directions (due to the way the ECMP hash is calculated)
    – symmetric IRB: the source L2 frame is encapsulated with the ***shared L3 VNI*** which leads to an identical path taken by the traffic in both directions.
    Despite some implications with clustered firewalls (degraded performances unless my proposal is implemented by Cisco – cf. https://learningnetwork.cisco.com/message/440977#440977), I don’t see any downside to these asymmetric paths.
    However, I may miss something or oversimplify your point. Please correct me if I’m wrong.

    2) L3 VNI Routing:
    You only spoke about L2 VNI routing, not about L3 VNI routing. I am aware of the scaling consequences, however there are some use cases where this would be very helpful.
    If for example one company decides to map each tenant to one of its BU, and dedicates another tenant to shared services for all the BUs (for example a central database), these tenants must be able to reach them, which implies L3 VNI routing.
    Let’s suppose that the IP addressing ranges are correctly assigned so that the last VRF does not overlap with all other VRFs.
    When we look at the format of the BGP VXLAN EVPN NLRI (https://drive.google.com/file/d/0B6XxNd5c3zV_T0VZQ09sSmttcHc/view?usp=sharing), we see that there are:
    – VM IP & MAC addresses
    – L2 VNI
    – L3 VNI
    – even a DC ID
    Technically, there are no obstacles to that design.
    However, is it possible to perform L3 VNI routing, for example by encapsulating the source L2 frame with the destination L3 VNI in the VXLAN header?
    That would avoid each BU to duplicate all the shared services.

    • Yves says:

      Hi Jean-Christophe, thank you for your questions.

      1) With the Asymmetric mode, only the ingress VTEP (local Leaf where the source is attached) provides routing, hence traffic destined to a different L2 VNI will be routed at the ingress VTEP. Consequently the destination L2 VNI must also exist on the ingress VTEP. I assume the VLAN ID that maps the remote L2 VNI may be required in case of Asymmetric routing, as the ingress VTEP will have to register all remote end-points with the topology ID (VLAN ID) – but to be honest, I won’t be able to verify this assumption, and I’m not sure who is going to implement this mode that we could ask.

      For the host reachability information, the VTEP host table is populated with all the end-points (local + remote) sharing the L2 VNI’s that exist on this VTEP. Because with Asymmetric IRB mode the VTEP must be configured with all the concerned L2 VNI (including the ones to be routed), then all remote end-points (including the ones that belong to the L2 VNIs to be routed) must also exist in the local host table. For a large deployment, this mode may suffer from scalability & perf issue. Your understanding concerning the differences between both mode is therefore correct. Thus the scalability issue discussed for asymmetric IRB. With symmetric IRB, all the traffic to be routed toward a VxLAN segment will “transit” via a shared L3 VNI, hence no need to configure all L2 VNI and no need to register all end-points.

      For the ASA cluster, it’s a fair point. Actually, if I remember correctly, a similar concept has been implemented with the FWSM. It is called Asymmetric routing (ASR). ASR aims to group together a set of trusted ports (within the same FW module), allowing a stateful session to return via a different physical port (belonging to the same trusted group). It is (or it was) also supported for a pair of FWSM configured in resilient mode (HA), but the return traffic was redirected to the owner FW (somewhat like the ASA cluster does today) – I don’t believe it’s a technical issue to implement such asymmetric routing with the ASA cluster; I think that, for security reasons, it would be very risky to allow stateful sessions to be trusted by a FW that’s not the owner of the session. That said, I’m going to escalate your thoughts to the security folks.

      2) L3 VNI Routing:
      JC >> You only spoke about L2 VNI routing, not about L3 VNI routing. I am aware of the scaling consequences, however there are some use cases where this would be very helpful.
      I’m not 100% clear with your comment, sorry. Can you please elaborate a bit more ? L3 VNI is a concept initiated for Symmetric IRB as described in this post ( I thought I spoke about L3 VNI ).
      Let’s focus on Symmetric routing. An L2 VNI belongs to a Tenant. For the same Tenant you can have multiple L2 VNIs. Each Tenant is mapped to a unique L3 VNI, which must exist on all VTEPs (Leafs) concerned by this Tenant. An L2 VNI can be extended over the IP network (WAN) or it can be isolated within a DC. However, the L3 VNI will be extended, hence each Tenant will maintain its L2 and L3 segmentation among all remote sites. Yep, and BTW, isn’t it a great “new” solution offering L3VPN without an MPLS deployment? Beyond that segmentation, if communication is required between Tenants, it will happen only via a routed Firewall.

      Kind regards,


      • 1) Asymmetric vs Symmetric IRB
        “With symmetric IRB, all the traffic to be routed toward a VxLAN segment will “transit” via a shared L3 VNI, hence no need to configure all L2 VNI and no need to register all end-points.”
        But the source leaf must know where to send the “L2 VNI” routed packet.
        For example, let’s say A needs to communicate with B which is located in another L2 VNI; A’s ToR needs to know the IP address of B’s ToR VTEP, so that it is able to encapsulate A’s frame with VXLAN (Dest L2 VNI) – UDP – IP (its own IP as source – B’s ToR IP as destination) – MAC headers
        That will be possible only if A’s ToR knows B’s reachability information, i.e that B is located in that L2 VNI, below that other ToR.
        So all ToRs will need all L2 VNIs and all VMs reachability information in order to be able to route between L2 VNIs.

        2) L3 VNI Routing
        You talked only about routing between L2 VNIs within the same L3 VNI, not between different L3 VNIs, i.e between different tenants.
        I totally agree that this is a great solution – although not completely mature yet – avoiding the complexity of MPLS to connect different DCs.

        • Yves says:

          Hi Jean-Christophe,

          1) The ingress VTEP knows how to route traffic destined to B (send to next hop egress VTEP via L3 VNI) – Look at the [show ip route “B”] and the [show bgp l2vpn evpn “B”] in the captured screens. It doesn’t need to register host “B” in its layer 2 host reachability table [show l2route evpn mac-ip all].

          2) as I said, the example discussed in the post addressed a single tenant (VRF) in a DCI environment. Each Tenant by definition will be isolated from the other Tenants. Each VRF instance is mapped to a unique L3 VNI in the network. As a result, the inter-VxLAN routing is performed throughout the L3 VNI within a particular Tenant (VRF instance), hence talking only about routing between L2 VNI. All the inter-VxLAN segment routing happens within a Tenant via a unique transit L3 VNI per Tenant. That said:
          Traffic between Tenant can be achieved via an external Firewall (service Leaf).
          Traffic to/from outside the VxLAN fabric happens from the Border leaf (not discussed in this post). MP-BGP EVPN imports the BGP routes learned in the IPv4 or IPv6 unicast address family into the L2VPN EVPN address family. Usually each tenant maintains his Layer-3 routing instances separately. Public subnets within each Tenant can be advertised separately to the outside. traffic from outside to a public subnet can be routed throughout the VXLAN fabric.

          I agree that some enterprises who don’t have necessarily the technical background with MPLS or MP-BGP may suffer from the whole complexity of the configuration.
          Fortunately, things are evolving very quickly to help IT mgrs to accelerate the deployment and simplify OPS. There is one great SDN approach solution with the Programmable Fabric ( in addition to ACI/APIC ), aka VTS. VTS is aiming to centralise and automate the overlay provisioning and management across all Nexus platforms.

          Hope that helps, thank you


    • lkrattiger says:


      To your 1st points.
      While a control-plane provides reachability information, the data-plane still requires to be existent. I can not route, if there is no router and I can not bridge, if there is no bridge-domain aka Layer-2 segment.
      The semantics of forwarding stay the same as long as I do not route MACs or bridge IPs subnets 🙂

      The difference between symmetric and asymmetric IRB is the way packets are handled within the respective VTEP. Is it VNI-route-VNI, which will have a different path depending on ingress or egress VTEP, OR is it VNI-route-VNI-route-VNI for symmetric, where both directions within a VTEP are the same from a VNI perspective?
      The physical path in the network is regardless of the symmetric vs. asymmetric IRB and depends on the hashing implementation. I potentially could create a symmetric hashing by using asymmetric IRB.

      To your 2nd point.
      VRF route-leaking is the approach you are referencing to and as per MP-BGP, this should be possible from a protocol design perspective if implemented in the respective Vendor implementation.

  2. I’ve just realized that the part I’ve written about the ECMP hash is not correct: since the Nexus Series implement VXLAN in a way that a hash of the inner frame header is used as the source UDP port, the VNI used in the VXLAN header has no impact on that hash (unless it is not calculated on the 5-tuple anymore). That means that the hashes will be different anyway in both directions whatever the value of the VNI is.

  3. wangweibin says:

    Great article, Yves.
    Symmetric IRB is obviously a much better choice than asymmetric IRB.
    However, Cisco is the only vendor supporting EVPN/VXLAN as well as symmetric IRB VXLAN routing at the moment.
    If we want to enable multiple-vendor VTEPs in the same fabric, the only choice left is non-EVPN, standard VXLAN(rfc7348) based fabric.

    • Yves says:

      Hello, thank you for your feedback. If most of vendors implement the standard VxLAN rfc7348 Multicast-only, from an interoperability point of view, it still requires to support the same IP multicast routing protocol PIM which is also platform dependent.

      That said, I think we should clarify this important point. VxLAN/EVPN fully relies on standard protocols (some are still drafts, though). There is no specific proprietary implementation, even with the IRB mode.
      – RFC 7348 Virtual eXtensible Local Area Network
      – RFC 7432 BGP MPLS based Ethernet VPNs
      – Network Virtualization Overlay Solution using EVPN / draft-ietf-bess-evpn-overlay
      – Integrated Routing and Bridging in EVPN / draft-ietf-bess-evpn-inter-subnet-forwarding
      – IP Prefix Advertisement in E-VPN / draft-rabadan-l2vpn-evpn-evpn-prefix-advertisement
      …just to list the most important standards.

      We have recently demonstrated VXLAN/EVPN interoperability during the last MPLS/SDN World Congress in Paris
      Participating Vendors were Cisco, Juniper, Alcatel Lucent & Ixia
      Independently tested at EANTC with public Whitepaper available at: http://www.eantc.de/showcases/mpls_sdn_2015/intro.html

      In short, we (and all vendors listed above) are all strongly supporting this interoperability with multi-platforms and multi-vendors.

      However, on top of the standard compliancy, each vendors may bring some additional improvement such as dual-homing with vPC Anycast VTEP, Anycast L3 gateway, multi-homing, storm control, loop detection and protection, path-diversity, etc…

      We haven’t finished being bored with VxLAN/EVPN and DCI, I’m telling it to you 🙂

      • lkrattiger says:

        Small correction on the draft side:

        IP Prefix Advertisement in E-VPN follows “draft-ietf-bess-evpn-prefix-advertisement” which was adopted by IETF BESS (BGP Enabled Services) workinggroup. The current draft is based on “draft-rabadan-l2vpn-evpn-evpn-prefix-advertisement”

  4. taozi says:

    Very Good ! Thanks a lot.
    A very good article. Thank you for sharing. Happy New Year 2016.

  5. MauriBCN says:

    Hi Yves,

    One question regarding this model.

    We have customer with 2 datacenters, connection between 2 datacenter we use mpls.

    Datacenter 1 as 65001 and datacenter 2 65002 AS, ebgp session (2 neighbors) for overlay, for underlay we are using ospf and ibgp session.

    With 2 spines and route reflector and leafs connected with vpc, when we got failure in one of the link of ebgp, we do not have vxlan traffic between 2 datacenters.

    As workaround we added another ebgp peer connection as solution, have you seen this before? Probably blackhole traffic?


  6. ManuAtDell says:

    Hi Yves,
    Thanks for a great series of articles – very detailed and informative.

    Can you explain how symmetric IRB works with a silent host?
    To use your example above, in a setup where Host G is silent, how does VTEP1 know where the next-hop is, as “show ip route” or “show bgp l2vpn evpn” would return nothing – what happens then?

    If Host G was in the same VNI as Host A then the ARP request would get encapsulated and sent to relevant VTEPs (using HER or mcast as you detailed in post #28), but that functionality does not exist for the L3 VNI (or does it?).

    Every VXLAN/IRB packet walk I can find online always talks about VTEP 1 doing an L3 lookup and getting the dst VTEP from that, but none talk about what happens if the lookup fails.

    Would appreciate if you could shed some light on what happens in such a case.

    • Yves says:

      Hi Manu,

      What is crucial in VXLAN EVPN is to configure all leaf nodes to announce their locally defined IP subnet prefixes within the EVPN fabric to allow the discovery of silent hosts.
      If you miss this stage, you may face some black-hole situations due to silent hosts.

      First of all, and in short, if Host G belongs to subnet X, then subnet X has been advertised throughout the whole fabric using an EVPN route type 5 update (assuming the statement above is true).
      Assuming that network prefix X exists on multiple leaf nodes, all of those leaf nodes will advertise subnet X with themselves as next hop.
      Every VTEP therefore knows all the L3 next hops for each network prefix, including subnet X, and uses ECMP hashing to load-balance the traffic across the next hops advertising prefix X.

      Assuming Host G is a silent host, the ingress VTEP (call it V1) selects the anycast VTEP address on one of the leaf nodes (say V2) as the valid next hop for subnet X based on the ECMP hashing result.
      V1 then routes the VXLAN encapsulated traffic to leaf node V2 through the transit L3VNI for that particular tenant (VRF instance) – this is symmetric IRB.
      The egress VTEP (V2) that receives the routed data packet performs a Layer 3 lookup but finds no specific host route entry for Host G, so it routes the traffic to locally defined X.
      This routing triggers an ARP request destined for G that is flooded across the L2VNI associated with subnet X.

      1 – If G is locally attached to V2, it will receive the ARP request from its local VLAN and it will send an ARP reply.
      Its first-hop router (the leaf node V2, where it is locally attached with the anycast L3 gateway) will learn Host G from the ARP message and will propagate this host reachability information across the whole VXLAN EVPN fabric using an EVPN route type 2 update.
      Consequently, all VTEPs concerned by this Layer 2 segment will record Host G’s MAC+IP.
      2 – If G is attached to another leaf node, then the ARP request is flooded across the L2 VNI, using the mcast group or ingress replication, to all egress leaf nodes concerned by this data replication (usually the L2 VNI of interest or the tenant, depending on how the mcast group is allocated).
      Host G receives the ARP request and replies; its first-hop router then learns G’s information (MAC+IP) and propagates it as described previously.

      And the loop is closed as now every VTEP of interest knows Host G 🙂
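The walk above can be sketched in a few lines of Python. This is only a minimal model, not how any switch implements it: the prefixes, the VTEP names V2/V3, the `lookup` and `pick_next_hop` helpers and the CRC32 hash are all assumptions made for illustration. It shows that the lookup for a silent host falls back to the type 5 subnet route, and that one of the advertising leaf nodes is then picked per flow.

```python
import ipaddress
import zlib


def lookup(rib, dst):
    """Longest-prefix match: return (prefix, next_hops), or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    matches = [(prefix, hops) for prefix, hops in rib.items() if addr in prefix]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)


def pick_next_hop(next_hops, flow):
    """Pick one next hop per flow. CRC32 is a stand-in for the hardware ECMP hash."""
    ordered = sorted(next_hops)
    return ordered[zlib.crc32(repr(flow).encode()) % len(ordered)]


# Hypothetical tenant VRF table on the ingress VTEP V1 (addresses are made up).
rib_v1 = {
    # Subnet X, learnt from route type 5 updates sent by V2 and V3.
    ipaddress.ip_network("10.1.1.0/24"): ["V2", "V3"],
    # A host that has already been discovered (route type 2, MAC+IP).
    ipaddress.ip_network("10.1.1.10/32"): ["V3"],
}

# Host G (10.1.1.77) is silent: no /32 exists, so the /24 is used instead.
flow = ("10.2.2.5", "10.1.1.77")
prefix, hops = lookup(rib_v1, "10.1.1.77")
egress_vtep = pick_next_hop(hops, flow)
print(prefix, hops, egress_vtep)
```

If the type 5 route for subnet X were missing from `rib_v1`, `lookup` would return `None` for Host G, which is the black-hole situation mentioned at the top of this reply.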

      Does that help ?

      If that is not enough, let me suggest that you sit down for a couple of hours with Jack Daniels and carefully read post 36 (especially part 3).

      thank you, yves

      • DukeNukem3D says:

        Yves, can you please explain this phrase in more detail:
        “What is crucial in VXLAN EVPN is to configure all leaf nodes to announce their locally defined IP subnet prefixes within the EVPN fabric to allow the discovery of silent hosts.
        If you miss this stage, you may face some black-hole situations due to silent hosts.”
        Can you please give some configuration and verification examples of this?

        • Yves says:

          What that means is that all leaf nodes advertise the subnets that are locally configured (based on the L2VNI with SVI). If the EVPN control plane is not aware of a particular endpoint (/32) as destination IP, the data packet is sent to one of the leaf nodes that advertise the subnet of interest (ECMP is used to load-balance among all leaf nodes advertising this subnet). The selected leaf node receives the routed packet destined to the silent host and first does an L3 lookup. Because the endpoint is silent, it has no entry for that endpoint, so it sends an ARP request to its access interfaces (facing hosts) and across the L2 VNI (VXLAN fabric). The silent host receives the ARP request and sends an ARP reply. It is then discovered by its first-hop router (the leaf node where it is locally attached), which advertises its host IP address (/32) to the control plane.

          SVI configuration for the L2 segment (vlan 2100)

          interface Vlan2100
          no shutdown
          vrf member tenant-1
          no ip redirects
          ip address tag 12345
          no ipv6 redirects
          fabric forwarding mode anycast-gateway

          The tag 12345 is associated to the route map “fabric-rmap-redist-subnet”
          route-map fabric-rmap-redist-subnet permit 10
          match tag 12345

          The route map is applied under the concerned VRF, so that all subnets carrying the tag 12345 are advertised across the L2VPN EVPN transport

          router bgp 65511
          vrf tenant-1
          address-family ipv4 unicast
          advertise l2vpn evpn
          redistribute direct route-map fabric-rmap-redist-subnet
          maximum-paths ibgp 2
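
To make the tag logic explicit, here is a minimal Python sketch of what the route-map match does conceptually. It is not NX-OS code: the `DirectRoute` class, the `redistribute_direct` helper and the example prefixes are hypothetical and only illustrate that a directly connected subnet is exported when its tag matches the one in the route-map.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DirectRoute:
    prefix: str                 # directly connected subnet (SVI)
    vrf: str                    # VRF the SVI belongs to
    tag: Optional[int] = None   # tag set on the "ip address ... tag" line, if any


def redistribute_direct(routes, vrf, match_tag):
    """Return the prefixes of one VRF whose tag matches, i.e. the subnets
    that would be exported as EVPN prefix routes in this simplified model."""
    return [r.prefix for r in routes if r.vrf == vrf and r.tag == match_tag]


# Hypothetical directly connected routes on one leaf node.
direct_routes = [
    DirectRoute("192.168.21.0/24", "tenant-1", tag=12345),  # tagged SVI
    DirectRoute("192.168.99.0/24", "tenant-1"),              # untagged: not exported
    DirectRoute("172.16.5.0/24", "tenant-2", tag=12345),     # other VRF
]

advertised = redistribute_direct(direct_routes, "tenant-1", 12345)
print(advertised)
```

An SVI without the tag stays local to the leaf, so its subnet is never announced and silent hosts in it cannot be reached from remote VTEPs.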

          Does that make sense ?

          Thank you, yves

          • DukeNukem3D says:

            It makes sense, thank you, but strangely, I did a similar configuration in my lab and silent host detection does not work. I checked BGP: type 5 routes are in place for the subnets, but the silent host appears only after an ‘arp’ from the attached leaf. I’ve been digging for two days but with no luck.

          • Yves says:

            Hmm, is the silent host detection working or not? You mentioned that silent host detection doesn’t work, and then that the silent host appears after the ARP message.
            Actually, the silent host will wake up only after it receives the ARP request from its first-hop router (where it is locally attached).
            The ARP request should be sent by the leaf node receiving the routed data packet destined to the silent host. That leaf sends the ARP request to its access interfaces (facing hosts, possibly but not necessarily where the silent host is attached) and across the L2VNI (BUM), reaching all other leaf nodes sharing the same L2VNI, including the egress leaf node where the silent host is locally attached. The latter (and all other leaf nodes of interest) will forward the ARP request to its access interfaces facing the silent host, which, in turn, will send an ARP reply.
            The ARP reply will trigger the host detection (HMM) and BGP will advertise it (MAC+IP) across the fabric (EVPN).
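
The sequence above can be modelled roughly as follows. This is a simplified sketch under stated assumptions: the `Leaf` class, the `resolve_silent_host` helper, the leaf names and the addresses are all made up, and the dictionary `evpn_routes` merely stands in for the BGP EVPN table shared by the fabric.

```python
class Leaf:
    def __init__(self, name, attached):
        self.name = name
        self.attached = dict(attached)  # ip -> mac of physically attached hosts
        self.local_hosts = {}           # hosts learnt locally from ARP (HMM-like)

    def flood_arp(self, target_ip):
        """Forward an ARP request to the access ports.
        Return (ip, mac) if an attached host answers, else None."""
        mac = self.attached.get(target_ip)
        if mac is None:
            return None
        self.local_hosts[target_ip] = mac  # learnt from the ARP reply
        return target_ip, mac


def resolve_silent_host(leaves, target_ip, evpn_routes):
    """Flood the ARP request to every leaf sharing the L2VNI. The leaf that
    sees the reply advertises a MAC+IP (route type 2) entry for the host."""
    for leaf in leaves:
        reply = leaf.flood_arp(target_ip)
        if reply is not None:
            ip, mac = reply
            evpn_routes[ip] = {"mac": mac, "next_hop": leaf.name, "route_type": 2}
            return leaf.name
    return None


# Hypothetical fabric: the silent host 10.1.1.77 sits behind Leaf-3.
leaves = [
    Leaf("Leaf-2", {}),
    Leaf("Leaf-3", {"10.1.1.77": "00:00:5e:00:53:77"}),
]
evpn_routes = {}
discovered_on = resolve_silent_host(leaves, "10.1.1.77", evpn_routes)
print(discovered_on, evpn_routes)
```

Nothing is learnt until an ARP request actually reaches the access port of the host, which is why the host only shows up after that first ARP exchange.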

  7. ManuAtDell says:

    Thanks for the swift reply Yves!

    It does more than help, in fact it 100% answers my question, which I now understand was flawed because it’s not a case of “what happens if the lookup fails” but a case of “the lookup should never fail” : if you don’t have a host route, you should always have a (less specific) subnet route.

    I didn’t see any route type-5 in my setup (Titanium simulation on dcloud) so will double check the config and will certainly take the time to read post #36 in its entirety, although I may opt for Jameson or Cointreau as a reading companion instead 😉

    Thanks again.

      • Hi Yves,

        Is it possible to use the SPINE to do VXLAN encapsulation while it is performing the role of RR and RP?

        As you already know, the SVIs on the VTEPs facing the servers are in the same tenant VRF, while the links interconnecting the VTEPs and the SPINES are in the global routing table. In my network design the SPINES connect to the firewall, which means the SPINE is the exit point for my network if the servers need to reach the internet.

        I have a static default route on the SPINE and I am redistributing it into OSPF, which is the IGP running between the VTEPs and the SPINES. I am wondering if I can somehow do route leaking between the global routing table and the tenant VRF on the VTEP?

        kindly advise.

        Haitham Jneid

        • Yves says:

          Hi Haitham,

          1 – Yes, you can enable the VTEP on the SPINE nodes as well (depending on the HW and SW release). That’s what is usually called a Border Spine.
          The spine devices run VXLAN routing and MP-BGP EVPN inside the fabric (like a Border Leaf) to establish the tunnel overlays with the other VTEPs and exchange EVPN routes with them. At the same time, the border spine supports the role of RR and RP, and it runs normal IPv4 or IPv6 unicast routing in the tenant VRF instances with the external routing device on the outside. The routing protocol can be regular eBGP or any IGP of choice. The border spine switch learns external routes and advertises them to the EVPN domain as EVPN routes, so that the other VTEP leaf nodes can also learn about the external routes for sending outbound traffic.

          2 – Today with the current release (N9k) you can do “distributed” VRF route leaking (in short, you need to configure the route leaking on all switch nodes (Spine + leaf node in your case)). In the new release coming these days, you can do centralised route leaking (centralised in the Border spine devices in your case, but also supported on the Border leaf nodes if needed)

          Let me know if that helps or you need more details,

          regards, yves

          • I read in Cisco documentation that configuring Rendezvous Point (RP) on the VTEP is not supported. Therefore if we configure the SPINE to do VXLAN encapsulation, it will become a VTEP and I will not be able to configure an RP in my network.

            Also, is it possible to configure static routing between the border spine and the firewall instead of an IGP or BGP? Let’s say that we have a static default route from the border spine to the firewall as an exit point. Can I redistribute this static route into the VXLAN tenant VRF on the SPINE so that the other VTEPs receive it? This way all the VTEPs can reach the external world.

            Haitham Jneid

          • Yves says:

            Can you point me this Cisco documentation that mentions you can’t initiate the RP on a VTEP device?
            It is possible to enable RP on a VTEP device, but we usually recommend separating these two functions (Border and Spine functions).
            From a DP point of view, there is no difference at all, same performance. From a CP point of view, you have to consider the number of peers, the number of hosts, plus the peering with the outside.
            And from a maintenance point of view (change management, failure scenarios), it’s preferable to keep the Spine to just distribute the overlay network.

            When the FW is using routed mode, you can use a dynamic protocol between the FW and the network devices to exchange routes with the VXLAN fabric and the WAN edge devices. Alternatively, if the FW is in transparent mode or doesn’t support a dynamic protocol, you can use static routing, which is commonly used in deployments.

            best regards, yves

  8. A Abs says:

    Thanks for your great explanation on EVPN. I was struggling to learn MP-BGP EVPN as there are many moving parts, and your blog helped me a lot (even in 2019) to learn about it.

  9. philuxe says:

    This article is brilliant. It clearly explains the difference between asymmetric and symmetric IRB. I can now understand how routing between VNIs works and why you need to import EVPN routes in the VRF IPv4 address family.

    Regarding the below statement :

    “What is crucial in VXLAN EVPN is to configure all leaf nodes to announce their locally defined IP subnet prefixes within the EVPN fabric to allow the discovery of silent hosts.
    If you miss this stage, you may face some black-hole situations due to silent hosts.”

    Is that still true when you run multicast for the BUM traffic?

    thanks in advance

  10. philuxe says:

    Please ignore my question. I did not catch the point initially: from V1’s point of view, traffic to G can only be routed traffic (since no local VNI for that particular BD exists on V1).

    I believe I m going to print that page and stick it somewhere I can see it every day :p

  11. Satish Patel says:

    Hi Yves,

    I am dealing with a very strange issue related to a silent host and I am not sure what to do. Let me explain my network design. I have 2 leafs and a dedicated border-leaf for L3 routing, connected to my ISP for internet access. I am running an OpenStack cloud, so when I spin up a Linux VM with a public IP, that host is pretty much new and doesn’t have any EVPN route in the fabric. When I ping that VM’s public IP from my home laptop, it doesn’t work. When I traceroute the public IP of the VM, I can see my packet hitting the border-leaf, but the border-leaf doesn’t have an EVPN route in its tables, so it discards the packet. (I thought the border-leaf should generate BUM traffic to discover the VM host MAC/IP using an ARP flood.) I don’t have any L2VNI on the border-leaf because it only has an L3 function, so I didn’t bother to create an L2VNI. It has the VRF only.

    Now if I log into the VM and send 1 packet to my gateway, that generates an ARP and an EVPN route; at the same time I can see the EVPN route on the border-leaf, and my home laptop starts pinging that VM. (Is this normal behavior for an EVPN/VxLAN deployment?)

    Do you think that, to solve this issue, I should create the L2VNI on the border-leaf so it can participate in the ARP learning process?

    I have posted my full question with all configuration here: https://community.cisco.com/t5/data-center-switches/evpn-vxlan-very-strange-silent-host-issue/m-p/4266659#M6836

    • Yves says:

      Hi Satish

      Maybe a high-level drawing could help to clarify your scenario.
      Where is the public VM of interest connected? Is VRF-lite configured toward your WAN edge (SP), or is the VM locally attached to the BL?
      Also, are you saying this issue is limited to VMs configured with a “public” address?

      What is crucial in VXLAN EVPN is to configure all leaf nodes to announce their locally defined IP subnet prefixes within the EVPN fabric to allow the discovery of silent hosts.
      If you miss this stage, you may face some black-hole situations due to silent hosts. Also make sure that the BL knows how to reach the destined subnet.

      Let me give you an example. Assuming your VM-public is a silent host connected to subnet X on your BL, the ingress VTEP (say L1, where your laptop is connected) selects the egress VTEP address of the leaf node (say BL) as the next hop to reach subnet X.
      The ingress VTEP (L1) routes the VXLAN encapsulated traffic destined to VM-public’s subnet X toward the border leaf node through the transit L3VNI for that particular VRF instance.
      The egress VTEP (BL) that receives the routed data packet performs a Layer 3 lookup for VM-public but finds no specific host route entry for it, so it routes the traffic to the locally defined subnet X. This routing triggers an ARP request (not a BUM as you mentioned) destined for VM-public that is flooded across the L2VNI associated with subnet X.
      If the VM-public is behind a L3 routed network, traffic is routed toward the next hop L3 gateway (WAN edge) destined to subnet X.

      Does that make sense? yves

      • Satish Patel says:

        I have updated new diagram on my cisco post here: https://community.cisco.com/t5/data-center-switches/evpn-vxlan-very-strange-silent-host-issue/td-p/4266659

        Let me try to explain via the diagram. PC-1 was just born: it is fresh and hasn’t sent a single packet to its gateway, so it is pretty much a silent host, and my EVPN fabric doesn’t know anything about it. (I have verified the EVPN routes and it is not on Leaf-1 or BL-1.) Now let’s say PC-2, which is located somewhere on the internet, tries to ping PC-1. I can see in the trace the packet coming to the BL-1 public interface. BL-1 doesn’t have any Type-5 route for that host because it is silent. How does the border-leaf find this silent host using ARP/BUM? (BL-1 doesn’t have any VTEP for the PC-1 subnet; in that case, how does BL-1 send ARP/BUM?)

        Now if I ping from PC-1 to PC-2, it generates Type-2/3 routes; after that I can see the route in BL-1, and PC-2 can start pinging PC-1.

        As an experiment, I created VLAN/VNI 10100 on the border-leaf for my PC-1 subnet and configured the mcast-group, and now I can see that, from the outside world, PC-2 can ping any silent host located behind Leaf-1 using BUM traffic. That is why I am confused: do I need the VLAN on the border-leaf to discover a silent host located behind another leaf when someone pings it for the first time from the internet?

        interface nve1
        no shutdown
        host-reachability protocol bgp
        source-interface loopback1
        member vni 10100
        member vni 10555 associate-vrf

        • Yves says:

          When the ingress VTEP (BL-1) receives the request destined to PC-1, it routes the data packet (VXLAN encapsulated traffic) destined to PC-1’s subnet toward Leaf-1 through the transit L3VNI used for VRF RED. The egress VTEP (L-1) receives the routed data packet and performs a Layer 3 lookup for PC-1 but finds no specific host route entry for this endpoint, as a result it routes the traffic to locally defined X where PC-1 belongs. That triggers an ARP request destined for PC-1 flooded across the L2VNI associated with subnet X and the host interfaces where the concerned bridge domain is configured. PC-1 receives the ARP request and then it ARP replies, PC-1 reachability information is learnt and distributed across the fabric.
          It sounds like your BL-1 doesn’t receive the initial request from PC-2 when PC-1 is silent.
          When PC-1 is learnt via EVPN, its host route is certainly advertised by BL-1 toward the Internet cloud, which makes the ping work.
          Can you check whether subnet X is properly advertised outside the fabric (toward the Internet)?
          You need to enable the advertisement of EVPN routes within the IPv4/IPv6 address family for the VRF of interest.
          It is crucial to configure all leaf nodes to announce their locally defined IP subnet prefixes within the EVPN fabric, and consequently to have them advertised to the outside, to allow the discovery of silent hosts.

          Best regards, yves

          • Satish Patel says:

            Thanks for the reply,

            Looks like we are on the same page now. When BL-1 receives the packet from PC-2, it will look into the RIB to find a route, and of course it doesn’t have one because the host is silent. How can it send the packet into VRF RED if it doesn’t know on which leaf that host is located (L1, L2, L3, etc.)? (That is what I am trying to understand, so can you explain how BL-1 uses VRF RED to discover the host?)

            I don’t have L3VNI/VLAN configured on BL-1, I have only following config on BL-1

            vrf context RED
            description ** VRF-RED **
            vni 10555
            rd auto
            address-family ipv4 unicast
            route-target both auto
            route-target both auto evpn

            interface Vlan555
            description ** L3VNI-For-IRB **
            no shutdown
            vrf member RED
            ip forward
            ipv6 address use-link-local-only

            interface nve1
            no shutdown
            host-reachability protocol bgp
            source-interface loopback1
            member vni 10555 associate-vrf

            >> Can you check if the subnet X is properly advertised to outside the fabric (toward Internet)?

            This is what I have on my L1 to advertise routes to the public/internet network; EVPN advertises hosts as /32. Do you think I should advertise the full /23 statically?

            vrf RED
            address-family ipv4 unicast
            redistribute direct route-map DIRECT-PERMIT-ALL

            interface Vlan100
            no shutdown
            vrf member RED
            ip address

            I would like to give you my full config, but I am not sure how to share it here.

          • Yves says:

            Good day Satish,

            This is a fair question. Actually, this is the reason I mentioned previously that you need to configure all leaf nodes to announce their locally defined IP subnet prefixes across the EVPN fabric, so that BL-1 knows the next hop to reach subnet X even if it doesn’t have the host route of PC-1.
            When receiving the request from PC-2 destined to PC-1, BL-1 (ingress VTEP) selects the anycast VTEP address of one of the leaf nodes (say Leaf-2 in your example) as the valid next hop for subnet X. This is based on the ECMP hashing result. BL-1 then routes the data packet (VXLAN encapsulated traffic) to Leaf-2 through the transit L3VNI for that particular tenant (RED).
            The egress VTEP (Leaf-2) that receives the routed data packet performs a Layer 3 lookup but finds no specific host route entry for PC-1, so it routes the traffic to locally defined X.
            This routing triggers an ARP request destined for PC-1 that is flooded across the L2VNI associated with subnet X, hitting all leaf nodes where the L2VNI of interest is configured. As a consequence Leaf-1 gets the ARP request which is flooded to all host interfaces where the bridge domain (VLAN) for PC-1 exists.
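
A small Python sketch may help to see why the ARP request can only originate from, and reach, the nodes where the L2VNI is configured. It is purely illustrative: the `l2vni_membership` table, the `flood_targets` and `can_originate` helpers, and the assumption that Leaf-2 also carries VNI 10100 are mine, not taken from your configuration.

```python
def flood_targets(membership, l2vni, sender):
    """VTEPs that receive BUM traffic (e.g. an ARP request) sent by `sender`
    on `l2vni`: every other VTEP where that L2VNI is configured."""
    return sorted(
        vtep for vtep, vnis in membership.items()
        if l2vni in vnis and vtep != sender
    )


def can_originate(membership, vtep, l2vni):
    """A VTEP can only send an ARP request into an L2VNI it is a member of."""
    return l2vni in membership.get(vtep, set())


# Hypothetical L2VNI membership. BL-1 only carries the L3VNI of VRF RED,
# so it has no L2VNI at all in this model.
l2vni_membership = {
    "BL-1": set(),
    "Leaf-1": {10100},
    "Leaf-2": {10100},  # assumption: Leaf-2 also has the PC-1 subnet
}

# BL-1 cannot flood an ARP request for PC-1 itself ...
print(can_originate(l2vni_membership, "BL-1", 10100))
# ... but Leaf-2, reached over the L3VNI via the subnet route, can.
print(flood_targets(l2vni_membership, 10100, "Leaf-2"))
```

In this model BL-1 depends entirely on the subnet route to hand the packet over to a leaf node that is a member of the L2VNI; without that route, or without the L2VNI on BL-1 itself, nobody sends the ARP request.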

            Does that clarify your question ?


  12. Satish Patel says:

    Thank you Yves,

    Yes, I think I am following you. I was a little confused by the terminology: when someone says you need an L3VNI, this is what I have in mind.

    interface Vlan555
    description ** L3VNI-For-IRB **
    no shutdown
    vrf member RED
    ip forward
    ipv6 address use-link-local-only

    But actually the anycast-gateway is your true L3VNI, which is the following for Subnet X.

    interface Vlan100
    description ** Anycast Gateway For Public **
    no shutdown
    mtu 9216
    vrf member RED
    ip address
    ipv6 address 2001:c05:3011::1/64
    ipv6 nd prefix default no-advertise
    ipv6 nd ra route suppress
    no ipv6 redirects
    fabric forwarding mode anycast-gateway

    This is what I did to fix my issue. I created the Vlan100 anycast-gateway on the border-leaf, and now PC-2 is able to find a silent host behind any leaf. (That was the missing part on my BL-1.)
