26 – Is VxLAN (Flood&Learn) a DCI solution for LAN extension?

One of the questions that many network managers are asking is “Can I use VxLAN stretched across different locations to interconnect two or more physical DCs and form a single logical DC fabric?”

The answer is that the current standard implementation of VxLAN was designed for an intra-DC fabric infrastructure and would require additional tools, as well as a control plane learning process, to fully address the DCI requirements. Consequently, as of today it is not considered a DCI solution.

To understand this statement, we first need to review the main requirements to deploy a solid and efficient DC interconnect solution and dissect the workflow of VxLAN to see how it behaves against these needs. All of the following requirements for a valid DCI LAN extension have already been discussed throughout previous posts, so the following serves as a brief reminder.

DCI LAN Extension requirements

Strongly recommended:

  • Failure domain must be contained within a single physical DC
    • Leverage protocol control plane learning to suppress the unknown unicast flooding.
    • Flooding of ARP requests must be reduced and controlled using rate limiting across the extended LAN.
    • Generally speaking, rate limiters for the control plane and data plane must be available to control the broadcast frame rate sent outside the physical DC.
    • The threshold must be set carefully with regard to the existing broadcast, unknown unicast and multicast traffic (BUM).
  • Redundant paths distributed across multiple edge devices are required between sites, with all paths being active without creating any layer 2 loop
    • A built-in loop prevention mechanism is a must-have (e.g. OTV).
    • Otherwise, the tools and services available on the DCI platforms to address any form of Multi-EtherChannel split between two distinct devices (e.g. EEM, MC-LAG, ICCP, vPC, vPC+, VSS, nV Clustering, etc.) need to be activated.
  • Independent control plane on each physical site
    • Reduced STP domain confined inside each physical DC.
    • STP topology change notifications should not impact the state of any remote link.
    • Primary and secondary root bridges or multicast destination tree (e.g. FabricPath) must be contained within the same physical DC.
    • A unique active controller managing the virtual switches may lose access to the other switches located on the remote site.
  • No impact on the layer 2 protocol deployed on each site
    • Any layer 2 protocol must be supported inside the DC (STP, MST, RSTP, FabricPath, TRILL, VxLAN, NVGRE, etc.) regardless of the protocols used for the DCI and within the remote sites.
  • Remove or reduce the hairpinning workflow for long distances
    • Layer 3 default gateway isolation such as FHRP isolation, anycast gateway or proxy gateway.
    • Intelligent ingress path redirection such as LISP, IGP Assist, Route Health Injection, GSLB.
  • Fast convergence
    • Sub-second convergence for any common failure (link, interface, line card, supervisor, core).
  • Any transport
    • The DCI protocol must be transport-agnostic, meaning that it can be initiated on any type of link (dark fiber, xWDM, Sonet/SDH, Layer 3, MPLS, etc.).

Nice to Have

In addition to the previous fundamentals, we can enable additional services and features to improve the DCI solution:

  • ARP caching to reduce ARP broadcasting outside the physical DC.
  • VLAN translation allowing the mapping of one VLAN ID to one or multiple VLAN IDs.
  • Transport-agnostic core usage – thus enterprises and service providers are not required to change their existing backbone.
  • Usage of IP multicast grouping is a plus, but should not be mandatory. The DCI solution must be able to support a non-IP-multicast layer 3 backbone, as an enterprise will not necessarily obtain IP multicast service from its provider (especially for a small number of sites to be layer 2 interconnected).
  • Path diversity and traffic engineering to offer multiple routes or paths based on selected criteria (e.g. control plane versus data plane), such as IP-based, VLAN-based or flow-based.
  • Removal or reduction of hairpinning workflows for metro distances (balancing between the complexity of the tool versus efficiency).
  • Control plane learning versus data plane MAC learning.
  • Multi-homing going beyond the two tightly coupled DCI edge devices.
  • Site independence – the status of a DCI link should not rely on the remote site state.
  • Load balancing: Active/Standby, VLAN-based or Flow-based

A Brief Overview of VxLAN

VxLAN is just another L2-over-L3 encapsulation model. It was designed for a new purpose: offering logical layer 2 segments to virtual machines while allowing layer 2 communication within the same VxLAN segment identifier (VNID) over an IP network. VxLAN offers several great benefits. Firstly, as just expressed, the capability to extend layer 2 segments pervasively over an IP network allows the same subnet to be stretched across different physical PoDs (Points of Delivery), therefore preserving the IP schema in use. A PoD usually delimits the layer 2 broadcast domain. Secondly, it offers the capacity to create a huge number of small logical segments, well beyond the traditional dot1Q standard limited to 4k (the 24-bit VNID allows roughly 16 million segments). This improves the agility needed for the cloud, supporting stateful mobility along with the native segmentation required by multi-tenancy.

The other very important aspect of VxLAN is that it is gaining a lot of momentum across many vendors and software and hardware platforms, with enterprises closely monitoring its evolution as the potential new de facto transport standard inside the new generation of data center network fabrics addressing host mobility.

However, VxLAN doesn’t natively support systems or network services connected to a traditional network (bare-metal servers and virtual machines attached to a VLAN, NAS, physical firewalls, IPS, physical SLBs, DNS, etc.); hence, additional components such as layer 2 and layer 3 VxLAN VTEP gateways are required to communicate beyond the VxLAN fabric.

Many papers and articles have been posted on the web regarding the concept and details of VxLAN, so I’m not going to dig deeper into it here. Nevertheless, let’s review step by step some important VxLAN workflows and behaviors pertinent to this topic, in order to assess them against the DCI requirements listed above.

Current Implementation of VxLAN

A virtual machine (or, more generally speaking, an end-system) within the VxLAN network sits in a logical L2 network and communicates with distant machines using so-called VxLAN Tunnel End-Points (VTEPs) enabled on the virtual or physical switch. The VTEP offers two interfaces: one connecting the layer 2 segment the virtual machine is attached to, and an IP interface to communicate over the routed layer 3 network with other VTEPs that bridge the same segment ID. The VTEPs encapsulate the original Ethernet traffic with a VxLAN header and send it over a layer 3 network toward the VTEP of interest, which then decapsulates the VxLAN header in order to present the original layer 2 packet to its final destination end-point.
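
To make the encapsulation concrete, here is a minimal Python sketch of the 8-byte VxLAN header defined in RFC 7348 (the function names are illustrative, not any vendor's API; the outer UDP/IP/Ethernet headers added between VTEPs are omitted for brevity):

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP destination port for VxLAN

def vxlan_encapsulate(inner_frame: bytes, vnid: int) -> bytes:
    """Prepend the 8-byte VxLAN header (RFC 7348) to the original
    layer 2 frame. The result is then carried in UDP/IP between VTEPs."""
    if not 0 <= vnid < 2**24:
        raise ValueError("the VNID is a 24-bit identifier")
    flags = 0x08  # 'I' bit set: a valid VNID is present
    # 1 flags byte, 3 reserved bytes, then VNI in the upper 24 bits
    header = struct.pack("!B3xI", flags, vnid << 8)
    return header + inner_frame

def vxlan_decapsulate(packet: bytes) -> tuple[int, bytes]:
    """Strip the VxLAN header; return (vnid, original layer 2 frame)."""
    flags, vni_field = struct.unpack("!B3xI", packet[:8])
    assert flags & 0x08, "the I flag must be set"
    return vni_field >> 8, packet[8:]
```

The 24-bit VNID field is what allows roughly 16 million logical segments, versus the 4k ceiling of a 12-bit dot1Q VLAN ID.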

In short, VxLAN aims to extend a logical layer 2 segment over a routed IP network. Hence, although it remains a bad idea, it is not fundamentally surprising that many network managers are tempted to go a step further and leverage the VxLAN protocol to stretch its segments across geographically distributed data centers.

VxLAN Implementation Models

Two types of VxLAN overlay network implementations exist: host-based and network-based.

Confining the Failure Domain to its Physical Location

As of today, both software and hardware-based VxLAN implementations perform the MAC learning process based on traditional layer 2 operations: flooding and dynamic MAC address learning. They both offer the same transport and follow the same encapsulation model; however, dedicated ASIC-based encapsulation offers better performance as the number of VxLAN segments and VTEPs grows.

Each VTEP joins an IP multicast group according to its VxLAN segment ID (VNID), thus reducing the flooding to only the VTEP group concerned with the learning process. This IP multicast group is used to carry the VxLAN broadcast, the unknown unicast as well as the multicast traffic (BUM).

We need to walk through the learning process (at a high level) against our original DCI requirements. This walkthrough assumes that the learning process starts from the ground up.

 

  • An end-system A sends an ARP request for another end-system B that belongs to the same VxLAN segment ID behind the VTEP 2.
  • Its local VTEP 1 has no mapping for the destination, so it encapsulates the original ARP request toward its IP multicast group along with its identifiers (IP address and segment ID).
  • All VTEPs that belong to the same IP multicast group receive the packet and forward it to their respective VxLAN segments (only if this is the same VNID). They learn the MAC address of the original end-system A as well as its association with its VTEP 1 (IP address). They update their corresponding mapping table.
  • The remote end-system B learns the IP <=> MAC binding of the original requestor A and sends an ARP reply with its own MAC address.
  • The VTEP 2 receives the reply to be sent to the end-system A. It knows the mapping of end-system A’s MAC address and the IP address of VTEP 1, then it encapsulates the ARP reply and unicasts the packet to VTEP 1.
  • VTEP 1 receives the layer 3 packet from VTEP 2 via a unicast packet, and learns the IP address of VTEP 2 as well as the mapping between the MAC address of end-system B and the IP address of VTEP 2.
  • VTEP 1 strips off the VxLAN encapsulation and sends the original ARP reply.
  • All subsequent traffic between both end-systems A and B is forwarded using unicast.
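
The steps above can be condensed into a small flood-and-learn simulation (a sketch with hypothetical class names, not any vendor's implementation): the first lookup miss triggers a flood to the per-VNID multicast group, and observed traffic populates the MAC-to-VTEP mapping table.

```python
class VTEP:
    """Sketch of a flood-and-learn VTEP for one VxLAN segment."""
    def __init__(self, ip, vnid, mcast_group):
        self.ip = ip
        self.vnid = vnid
        self.mcast_group = mcast_group   # all VTEPs joined for this VNID
        self.mac_table = {}              # remote MAC -> remote VTEP IP

    def send(self, dst_mac, frame):
        """Forward a frame from a local end-system toward dst_mac."""
        remote = self.mac_table.get(dst_mac)
        if remote is None:
            # Data-plane learning: unknown destination, flood to the
            # IP multicast group (this is the BUM traffic)
            return [("multicast", peer.ip) for peer in self.mcast_group
                    if peer is not self]
        # Destination already learned: plain unicast to the owning VTEP
        return [("unicast", remote)]

    def learn(self, src_mac, src_vtep_ip):
        """Called on reception: bind the source MAC to its VTEP."""
        self.mac_table[src_mac] = src_vtep_ip
```

The DCI problem is visible in `send`: every lookup miss becomes multicast traffic, and nothing confines that flood to the local data center.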

VxLAN uses traditional data plane bridge learning to populate the layer 2 mapping tables. While this learning process is suitable within a Clos fabric architecture, it is certainly not efficient for DCI. The risk is that flooding toward unknown destinations can burden the IP multicast group and therefore pollute the whole DCI solution (e.g. overload the inter-site link) instead of containing the failure domain within its physical data center. The issue is that, contrary to other overlay protocols such as EoMPLS or VPLS, it is very challenging to rate-limit a specific IP multicast group responsible for handling any VxLAN-based flooding of BUM.

 

Flooding unknown unicast can saturate the IP multicast group over the DCI link, hence it is not appropriate for DCI solutions.

Unicast-only Mode

An enhanced implementation of VxLAN allows the use of a unicast-only mode, in which the BUM traffic is replicated to all VTEPs. The best current example is the enhanced mode of VxLAN supported on the Nexus 1000v. To improve on head-end replication, the Nexus 1000v controller (VSM) redistributes all VTEPs to its VEMs, dynamically maintaining each VM entry for a given VxLAN. Consequently, each VEM knows all the VTEP IP addresses spread over the other VEMs for a particular VxLAN. In addition to the VTEP information, the VSM learns and distributes all MAC addresses in a VxLAN to its VEMs. At any given time, all VEM MAC address tables contain all the known MAC addresses in the VxLAN domain. As a result, any packet sent to an unknown destination is dropped.
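
A sketch of the difference, assuming a hypothetical VSM-like controller object: reachability is pushed by the control plane, so a lookup miss results in a drop rather than a flood.

```python
class Controller:
    """Sketch of a VSM-like controller: the single source of truth
    that pushes MAC/VTEP reachability to every virtual switch (VEM)."""
    def __init__(self):
        self.vems = []

    def register(self, vem):
        self.vems.append(vem)

    def publish(self, mac, vtep_ip):
        # Push the binding to every VEM: no flood-and-learn needed
        for vem in self.vems:
            vem.mac_table[mac] = vtep_ip

class VEM:
    """Virtual Ethernet Module whose table is fed by the controller."""
    def __init__(self, vtep_ip):
        self.vtep_ip = vtep_ip
        self.mac_table = {}

    def send(self, dst_mac):
        remote = self.mac_table.get(dst_mac)
        if remote is None:
            return "drop"        # unknown unicast is dropped, not flooded
        return ("unicast", remote)
```

The trade-off, as the next paragraph explains, is that a single active controller now owns the whole domain.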

The unicast-only mode currently concerns only the set of Virtual Ethernet Modules (VEMs) managed by the controller (VSM), with a maximum of 128 supported hosts, even when the Nexus 1000v instance is distributed across two data centers. Currently, one unique Nexus 1000v instance controls the whole VxLAN overlay network spanning the two data centers. Although other VxLAN domains could be created using additional VxLAN controllers, they are currently not able to communicate with each other. Therefore, despite improving control over the flooding behavior, this feature could potentially be stretched only between two small data centers, and only for extended VxLAN logical segments. However, it is important to take into consideration that, in addition to the current scalability limits, other crucial concerns remain with regard to the DCI requirements: only one controller is active for the whole extended VxLAN domain, and remote devices connected to a traditional VLAN won’t be extended across sites (discussed below with the VTEP gateway).

Note: some implementations of VxLAN support an ARP caching function, reducing the amount of flooding over the multicast group for previously seen ARP requests. However, unknown unicast is still flooded over the IP multicast group. ARP caching must therefore be considered an additional improvement for DCI, not a DCI solution on its own; even though it adds great value, it is definitely not sufficient to allow stretching a VxLAN segment across long distances.

Active/Active Redundant DCI Edge Device

As most DC networks run in a “hybrid” model with a mix of physical devices and virtual machines, we need to evaluate the different methods of communication that exist between VxLANs and between a VxLAN and a VLAN.

  • VxLAN A to VxLAN A. This is the default communication just described above between two virtual machines belonging to the same logical network segment. It uses the VTEP to extend the same segment ID over the layer 3 network. Although the underlying IP network can offer multiple active paths, only one VTEP for a particular segment can exist on each host. This mode is limited to communication between strictly identical VxLAN IDs, meaning it is also limited to a fully virtualized DC.
  • VxLAN Gateway to VLAN. To communicate with the outside of the virtual network overlay, the VxLAN logical segment ID is bridged to a traditional VLAN ID using the VxLAN VTEP layer 2 gateway. It is a one-to-one translation that can be achieved using different tools. On the software side, this service can be provided by a virtual service blade gateway installed on the Nexus 1010/1110, or by the security edge gateway that comes with vCloud Network, to list only two, acting in transparent bridge mode: one active VTEP gateway per VNID <=> VLAN ID translation, although a single VTEP gateway supports many such translations. On the hardware side, the VTEP gateway is embedded in each switch supporting this function using a dedicated ASIC. Whichever software component is used for the VxLAN VTEP gateway function, currently only one L2 gateway can be active at a time for each particular VNID <=> VLAN ID mapping, preventing layer 2 loops. On the other hand, with the hardware implementation of VxLAN, for hosts directly attached to a physical leaf, it is possible to run the same L2 gateway mapping locally on each leaf of interest. Nonetheless, this is possible only if there is no cascaded layer 2 network attached to the leaf of interest.
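
The single-active-gateway constraint can be sketched as follows (hypothetical names; it simply models the rule that a second active gateway for the same VNID <=> VLAN ID mapping must be refused, because no protocol exists to prevent the resulting layer 2 loop):

```python
class L2GatewayDomain:
    """Tracks which VTEP L2 gateway is active per VNID <=> VLAN mapping."""
    def __init__(self):
        self.active = {}   # (vnid, vlan) -> name of the active gateway

    def activate(self, vnid, vlan, gateway):
        key = (vnid, vlan)
        if key in self.active and self.active[key] != gateway:
            # A second active gateway would forward the same bridged
            # traffic twice: a layer 2 loop with no loop prevention
            raise RuntimeError(
                f"mapping {key} is already active on {self.active[key]}; "
                "a second active gateway would create an L2 loop")
        self.active[key] = gateway
```

Projected onto two data centers, this is exactly why one site ends up without a local gateway for a given mapping.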

Although it is not a viable solution for a production network, the following needs to be stated in order to prevent it from being understood as an alternative solution: a band-aid option would be to use a virtual machine with a virtual interface attached to each segment side. This would bridge a VNID to a VLAN ID with no VTEP function per se (a virtual firewall configured in transparent bridging mode, for example). Even though this may work technically for a specific secured bridging between a VxLAN and a VLAN segment, it is important to understand the limitations and the risks.

 

There are some important weaknesses to take into consideration:

  • Risk of layer 2 loops, caused by cloning and enabling a virtual machine, or simply by a human mistake connecting the same VxLAN segment to its peer VLAN twice.
  • Scalability bottlenecks due to the limited number of supported virtual interfaces (VNIC) per virtual machine: in theory a maximum of five VNID-to-VLAN-ID mappings is feasible, but if we reserve one VNIC for management purposes, only four possible translations remain. While it is possible to enable dot1Q to increase the number of VNID <=> VLAN ID mappings, some limitations remain (100–200 VLANs), which also impacts performance and bandwidth.
  • Live migration: if the VM running the bridge gateway role is live-migrated to a new host, the entire MAC-to-IP binding table is purged and needs to be re-learned.

Therefore, based on this non-exhaustive list of caveats, it is preferable not to use this method to bridge a VxLAN with a VLAN segment. To avoid any risk of disruption, it is more desirable to enable the virtual firewall in routed mode.

There is another method of connecting to the outside of the VxLAN fabric, which consists of routing a logical segment to a VLAN or to another VxLAN segment using the VxLAN layer 3 gateway. This function can be embedded in the virtual or physical switch, or possibly provided by an external VM acting as a router, with the same limitations mentioned above. This layer 3 gateway is out of scope for this article, so let’s focus on the VxLAN-gateway-to-VLAN case for the hybrid model.

As mentioned previously, while the VTEP layer 2 gateway can be deployed in high-availability mode, only one is active at a time for a particular logical L2 segment. An additional active VxLAN gateway could create a layer 2 loop, since currently no protocol is available to prevent an alternate forwarding path.

That means that all translations from VNID to VLAN ID, and vice versa, must hit the same VTEP gateway. While this behavior works well inside a network fabric, when projected into a DCI scenario only end-systems in the same data center as the VxLAN gateway can communicate with each other.

In the above structure, VxLAN 12345 is mapped to VLAN 100 in DC 1 on the right. Only Srv 1 can communicate with VM2 or VM1. Srv 2 in DC 2 is isolated, because it has no path to reach the VxLAN gateway on the remote site, DC 1.

Theoretically, we could create a VxLAN perimeter in each data center, each managed by its own VxLAN controller. This would allow duplicating the VTEP gateway and mapping the same series of VNID <=> VLAN ID on each site.

 

Each VTEP on each side can share the same IP multicast group used by the VNIDs of interest. Yet currently, due to data plane learning, we are still facing ARP and BUM flooding. Until a control plane can distribute VTEP and MAC reachability to all virtual or physical edges and between VxLAN domains, the data center interconnection will suffer from heavy flooding of BUM.

Thus, the paradox is that, in order to allow communication between remote machines connected to a particular VLAN and a virtual machine attached to the corresponding VxLAN segment through a single active VTEP gateway, a validated DCI solution is required to extend the VLAN in question.

On the minus side, with a DCI VLAN extension and a single active VxLAN gateway for both sites, all traffic exchanged between a VLAN and a VxLAN segment has to hit the VxLAN gateway on the remote site, creating a hairpinning workflow.

Independent Control Plane on each Site

Today, with the current standard implementation of VxLAN, there is no control plane for the learning process. By definition, this should be a showstopper for DCI. Although the enhanced mode of VxLAN uses a controller (VSM) to continuously distribute all VTEPs for a given VxLAN to all virtual switches, only one controller is active at a time. This means that the same control plane is shared between the two locations, with an inherent risk of partitioning. This will be improved in the near future with a new control plane.

Scalability

This depends on the vendor and on whether VxLAN is implemented in software or hardware. Therefore, what can really be achieved in terms of the number of segments and MAC addresses supported depends on a number of conditions.

  • How many VxLAN segments can a software engine really support? Usually this is limited to a few thousand; don’t expect hundreds of thousands.
  • Can I distribute the consumed VxLAN resources such as VxLAN gateways between multiple engines?
  • How many IP multicast groups can the network support?

Key Takeaways

  • VTEP discovery and data plane MAC learning leads to flooding over the interconnection
    • Excessive flood traffic and BW exhaustion.
    • Exposure to security threats.
  • No isolation of L2 failure-domain
    • No rate limiters.
  • Current implementation of standard VxLAN assumes an IP multicast transport exists in the layer 3 core (inside and outside the DC)
    • IP multicast in the layer 3 backbone may not necessarily be an option.
    • Large amounts of IP multicast between DCs could be a challenge from an operational point of view.
  • Only one gateway per VLAN segment
    • More than one VTEP layer 2 gateway will lead to loops.
    • Traffic is hairpinned to the gateway.
  • VxLAN transport cannot be relied upon to extend VLAN to VLAN
    • Valid DCI solution required to close black holes.
  • No network resiliency of the L2 network overlay
  • No multi-homing preventing layer 2 back-door
    • Most validated DCI solutions offer flow-based or VLAN-based load balancing between two or more DCI edge devices.
  • No path diversity
    • It is desirable to handle private VLAN and public VLAN using different paths (HA clusters).


Conclusion

  • Can we extend a VxLAN segment over long distances?

VxLAN is designed to run over an IP network; hence, from a transport protocol point of view, there is no restriction in terms of distance.

  • Does that mean that it can be used natively as a DCI solution interconnecting two or multiple DCs extending the layer 2 segment?

VxLAN would require additional tools, as well as a control plane, to address the DCI requirements listed at the beginning of this post, which cannot be met with the current implementation of VxLAN.

Note: as these lines are written, work is being done in the IETF nvo3 working group on a VxLAN control plane for the learning and notification processes (BGP with E-VPN), reducing the overall amount of flooding (BUM) currently created by the data plane learning implementation. In addition, an anycast gateway feature will soon be available in physical switches, allowing the distribution of active VTEP gateways on each leaf. Stay tuned!

Recommendation

Until VxLAN evolves with a control plane and implements the tools needed for DCI, the best recommendation for interconnecting two VxLAN network fabrics is to use a traditional DCI solution (OTV, VPLS, E-VPN).


11 Responses to 26 – Is VxLAN (Flood&Learn) a DCI solution for LAN extension ?

  1. cstand141 says:

    4 months later –
    what is the status of inter/intra DCI connectivity with VXLAN?

    You write “long distances” – my company wants to do this in a 20-mile radius.
    Is this “close enough”? How do we determine if it is?
    Devices, movement between data centers, size of hardware, circuit speeds?

    thank you,

    • Yves says:

      Good day,

      Actually that was to clarify the maximum distances supported by the VXLAN protocol itself rather than trying to give the minimum distance supported by VXLAN :).
      As said in the conclusion:
      Can we extend the VXLAN segment over long distances? “The VxLAN is aimed at running over an IP network, hence from a transport protocol point of view there is no restriction in term of distances.”
      Does that mean that it can be used natively as a DCI solution interconnecting two or multiple DCs extending the layer 2 segment? : VxLAN would necessitate additional tools as well as a control plane to address the DCI requirements listed at the beginning of this post, which are not applicable with the current implementation of VxLAN.

      In addition, for the second comment I should have said, “Whatever the distance is”.

      In short, it’s not just a question of distance per se; the VXLAN protocol on its own is not sensitive to latency. The main concern is that there is no control plane for the learning process, and the failure domain is extended to the remote sites. And you probably don’t want to take any risk of disrupting your secondary DC if the primary DC fails. And that’s the point.
      Review the recommendations for a valid and solid DCI design listed in the introduction, which should help you make the right decision.
      – Can we use VXLAN to transport an L2 overlay network across 2 miles, 20 miles or <10000 miles? Technically speaking, the answer is yes. But with the current implementation of VXLAN (v1.0), that should not be considered a valid DCI design.

      Somehow we already discussed this in the past with STP.
      – Can I use STP to offer redundant links between two remote sites 10 miles apart? The answer is that, technically, it will work whatever the distance is, until a failure happens and both DCs go down. I know many enterprises that have run STP between sites without any issue for years. But I also know many that have finally faced a tragic outage lasting a couple of days. And that’s one of the main reasons why valid DCI solutions and protocols such as OTV or EVPN were elaborated.

      Having said that, and assuming that some enterprises accept the risk of using VXLAN as an L2 transport between remote sites, how is distance a concern for the protocol itself?
      Usually, when we mention long distances in a DCI/LAN extension scenario, it relates to the risk of application performance being impacted by latency, due to hairpinning workflows (e.g. stateful devices, post 27) or to limitations of the software framework (hypervisor live migration, e.g. shared storage, post 11).
      But in addition to the above, which is true for any kind of DCI solution, the particularity of the VXLAN protocol appears when it is enabled within a hybrid network (where a VM attached to a VXLAN segment needs to communicate and be L2-adjacent with a device attached to a traditional VLAN): with host-based VXLAN, there exists only one active L2 VTEP gateway at a time (post 26-bis).

      And that limitation may have a huge impact in a DCI environment (see, in post 26, VxLAN-single-VTEP-Gateway-and-DCI-v1-hair-pining-.png).
      Fortunately, some ToR switches with ASIC-based VXLAN support allow distributed, simultaneously active L2 VTEP gateways on each leaf.

      Thus, in your use case, the latency for 27 miles is almost 0.5 ms. That’s the latency between two end-points. Compared to some network services such as a firewall, for example (2–5 ms of latency), this 0.5 ms is insignificant. But some other apps may suffer. We need to think about the impact on the applications, especially multi-tier applications, and how they are deployed (e.g. FE & App tiers on a VXLAN segment, DB on a VLAN, and, between tiers, some network services connected to dot1Q).
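
      For reference, the figure above can be checked with a back-of-the-envelope calculation, assuming light travels in fiber at roughly two-thirds of c, i.e. about 5 microseconds per kilometer one way (a rule-of-thumb value, not a measured one):

```python
def fiber_rtt_ms(distance_km: float, us_per_km: float = 5.0) -> float:
    """Round-trip propagation delay over fiber, in milliseconds.
    5 us/km one-way corresponds to light at ~200,000 km/s in glass."""
    return 2 * distance_km * us_per_km / 1000.0

# 27 miles is about 43.5 km; the round trip comes out near 0.43 ms,
# in line with the "almost 0.5 ms" quoted above
rtt = fiber_rtt_ms(27 * 1.609)
```

      Any queuing or serialization delay in intermediate devices comes on top of this pure propagation figure.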

      With VXLAN v2, reinforced with an embedded BGP-EVPN control plane and with distributed L2 and anycast L3 gateways, the situation will be different.

      Hope that clarifies a bit.

      thank you, yves

  2. Pingback: SDN for Dummies – Part Drei | DreezSecurityBlog

    • Yves says:

      …snip… cut&paste from https://dreezman.wordpress.com/2015/05/27/sdn-for-dummies-part-drei/ …snip…

      …snip… VXLAN is the magic tunneling protocol SDN uses to make virtual guests float. In the physical world subnets usually are in 1 physical location (e.g. DMZ is physically located connected to firewall). With VXLAN tunneling, you can have virtual guests all over the world on the same subnet, so they can float and maintain their same network/operational context. …snip…

      There are two main use-cases why you would need to extend the same subnet across two locations:

      – The 1st is to allow hot migration with zero interruption. This use case is limited by distance (synchronous replication for data storage). VxLAN/EVPN MP-BGP can help, but you need to understand its shortcomings (see post 28). VxLAN multicast-only mode (flood&learn) is definitely not recommended to be extended outside the fabric.

      – The 2nd is for management purposes and for cold migration. Enterprises want to maintain the same IP schemas on machines moving to a new location. This use case is not concerned by distance and can apply across the world (unlimited distances). However, you don’t need a layer 2 overlay network extended between the two locations; you can achieve the same requirement using a layer 3 overlay network across multiple subnets. This great alternative is known as LISP ASM (Across Subnet Mode). Question: can this use case be solved with VxLAN/EVPN MP-BGP for subnet extension (notice that VxLAN mcast-only mode, with no control plane, is definitely not an option)? Technically, the answer is yes, but we would need integrated routing and bridging (IRB) and a distributed anycast L3 gateway on both sides, an L2 loop detection/protection service such as BPDU Guard, storm control enabled at the ingress DCI interfaces and inside the fabric, and protection of the control plane and CPU (CoPP). And we need to understand whether IP multicast is available across the world; if not, we need to enable ingress replication between sites. But can we enable IR between fabrics and use multicast inside each fabric from the same platforms (because you don’t want to run IR inside the fabric, for scalability and performance concerns)? So I would not claim that VxLAN is the magic tunnel to offer virtual guests all over the world on the same subnet. It’s an option that we may take into consideration, as long as we understand the caveats, the maturity, the performance and the design impact. IMHO, not the best solution as of today.
      Does that make sense ?
      thank you, yves

  3. taozi says:

    Cool. Thanks for sharing.

  4. bertrand Bordereau says:

    Hello, as far as I know, the DCI solution with VxLAN has been improved.
    I’ve been working on a DCI design with two DCs, with N9Ks as spine and leaf in NX-OS mode, a VxLAN stretched fabric in mind; the two DCs are connected with a DWDM solution.
    On the border leaf, I understand that I need an L3 subnet (private) to exchange host reachability with MP-BGP EVPN between the two DCs.
    I need clarification regarding the data plane: do I need a classical layer 2 extension between border leaves as the data plane? Because I need VM mobility (no IP address changes), my two DCs share the same address space for VMs and end hosts.
    How does the data plane work between both?
    How can I proceed, as L3 over vPC is not supported on the N9K?
    Thanks for the clarification.
    Best regards

    • Yves says:

      Hello Bertrand, sorry for the delay; I have been disconnected for almost 2 months.
      PS: have you been through post 28 ?

      I’ve recently been testing the VXLAN stretched fabric in depth, and we are going to publish a technical white paper very soon.
      Having said that, let me try to reply in a few lines.
      With the stretched fabric, the L2 tunnel (VXLAN, the overlay data plane) is extended across the two sites, from the ingress VTEP initiated on a border leaf node (DC1) to the egress VTEP terminating the VXLAN tunnel on a (local or remote) border leaf node (DC2). However, the underlay data plane is pure layer 3.

      Think of a VXLAN/EVPN fabric in DC1, and another VXLAN/EVPN fabric in DC2. Each VXLAN/EVPN fabric is configured with its own MP-iBGP AS for the control plane. The underlay for each VXLAN fabric is pure Layer 3 (usually an IGP, but not limited to an IGP). Then establish Layer 3 (IGP) connectivity (underlay) between the 2 sites. For that DCI/L3 purpose you can use a dedicated L3 router (connected to the spine nodes) connecting at L3 toward the remote router, or you can leverage a ToR VXLAN leaf node to offer the same L3 function to the outside (call that function the transit L3 node) plus locally attached endpoints. Then establish an MP-eBGP EVPN control plane to interconnect the two fabrics in order to exchange the host reachability information for endpoints distributed across the different locations. You have now stretched your VXLAN/EVPN fabric across the 2 locations.
      Look at this drawing (from post 28): http://yves-louis.com/DCI/wp-content/uploads/2015/03/VxLAN-2.0-EVPN-independent-CP-1.png. L14, L15, L21 & L22 can be pure L3 routers or vPC peer switches (as represented in this design) for locally attached vPC members.
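
      As a rough illustration of that inter-site control plane, the MP-eBGP EVPN peering on a DC1 transit leaf could look like the following NX-OS sketch (the AS numbers, router ID and neighbor loopback address are hypothetical):

      ```
      ! DC1 transit leaf -- fabric iBGP AS 65001, DC2 fabric is AS 65002
      router bgp 65001
        router-id 10.1.0.1
        neighbor 10.2.0.1 remote-as 65002
          description MP-eBGP EVPN peering toward DC2 transit leaf loopback
          update-source loopback0
          ebgp-multihop 5
          address-family l2vpn evpn
            send-community extended
      ```

      The peering is built loopback-to-loopback over the L3 underlay, hence the multihop setting.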

      In summary, to achieve the VXLAN stretched fabric, the transit leaf nodes interconnecting the DC sites simply perform L3 routing functions, allowing the exchange of VXLAN-encapsulated frames between VTEPs deployed in separate DC sites.
      Since host reachability information is exchanged between leaf devices belonging to separate sites, the VXLAN data plane encapsulation is extended end to end (from VTEP to VTEP, from leaf node to leaf node). This is the L2 overlay network required for your mobility, for example. This implies the creation of VXLAN tunnels between VTEP devices deployed in the same DC site or belonging to separate sites, as if it were the same logical VXLAN/EVPN fabric (hence the term ‘stretched fabric’ used here, not to be confused with VXLAN DCI).

      With the VXLAN stretched fabric, you don’t need to establish a DCI LAN extension between sites; the network underlay interconnecting the sites is L3 only. The L2 overlay is established automatically and dynamically when L2 communication is required between two endpoints (attached to a ToR offering the VTEP function).
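
      To make that “automatic and dynamic” overlay concrete, here is a minimal NX-OS sketch of the VTEP (NVE) side on a leaf node, using EVPN host reachability and a multicast group for BUM traffic (the VNI, loopback and group address are hypothetical):

      ```
      feature nv overlay
      feature vn-segment-vlan-based
      nv overlay evpn
      !
      interface nve1
        no shutdown
        source-interface loopback1
        ! learn remote hosts via the MP-BGP EVPN control plane, not flood-and-learn
        host-reachability protocol bgp
        member vni 30001
          mcast-group 239.1.1.1
      ```

      Once both sites advertise their host routes over the EVPN peering, VXLAN tunnels form directly between any pair of VTEPs, local or remote.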

      VMs (from any hypervisor, using GARP or RARP) can therefore hot live migrate over the L2 overlay network (the VXLAN-encapsulated tunnel) established transparently from site to site.

      With the N9K (same with the N5/6/7K) you can use vPC for locally attached endpoints (dot1Q) while the same vPC switch pair is also used to transit L3 packets toward the remote site. This is only possible because the Nexus series supports the bud node function. FYI, a bud node is a device that is a VXLAN VTEP and, at the same time, an IP transit device for the same multicast group used for the VXLAN VNIs. This is performed in hardware, hence not all switches can support bud node functions.
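
      As an illustration of that dual role, a bud-node-capable Nexus leaf might combine a classic vPC attachment for local endpoints with a routed, multicast-enabled underlay uplink on the same box (all names and addresses below are hypothetical):

      ```
      feature vpc
      vpc domain 10
        peer-keepalive destination 172.16.0.2 source 172.16.0.1
      !
      interface port-channel1
        switchport mode trunk
        vpc peer-link
      !
      interface port-channel20
        description dual-homed dot1Q endpoint attached via vPC
        switchport mode trunk
        vpc 20
      !
      interface Ethernet1/49
        description L3 underlay uplink (VXLAN transit toward the remote site)
        no switchport
        ip address 192.168.1.1/30
        ip pim sparse-mode
      ```

      The PIM-enabled routed uplink is what lets the switch forward the VNI multicast group in transit while also terminating VXLAN locally.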

      Hope that clarifies; if not, please let me know.

      kind regards, yves

      • Bertrand Bordereau says:

        Hello Yves, many thanks for the clarification; it helps me to better understand the DCI possibilities with a VXLAN fabric.
        Regarding the last part of the reply, I wonder why to use the N9K or N5K/6K to attach local endpoints and transit Layer 3?
        As these devices do not support L3 routing over vPC?
        If I have two transit leafs in both DC1 & DC2, what is the best option to interconnect them, with dark fiber for example?
        Thanks in advance for one final clarification
        Best Regards
        Bert

        • Yves says:

          Hi Bertrand,

          The transit leaf nodes interconnecting the DC sites via the direct Layer 3 links simply perform Layer 3 routing functions, allowing the exchange of VXLAN-encapsulated frames between VTEPs deployed in separate DC sites. However, depending on the specific requirements of the deployment, those transit devices may also be used to locally attach endpoints using vPC for resiliency (therefore also assuming the role of compute leaf nodes or ToRs) and/or to provide connectivity to the external Layer 3 network domain (assuming the role of transit leaf nodes). The capability of acting as a transit node for VXLAN-encapsulated traffic (Layer 3 only) and at the same time being able to terminate VXLAN tunnels is usually referred to as ‘bud node support’ (hardware/ASIC dependent).
          I see where the confusion comes from. In the drawing I pointed to previously, the dynamic routing is performed by an external core router (CR). A vPC port-channel is used to connect each core router toward the VXLAN fabric to establish active/active load sharing across the 2 border leaf nodes (vPC domain). That vPC would be L2.
          Does this picture help more? http://yves-louis.com/DCI/wp-content/uploads/2015/07/21.png. The interfaces connecting directly to the DWDM links are L3 routed interfaces.
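
          For those DWDM-facing ports, a routed-interface sketch on a transit leaf could be as follows (the addresses and OSPF instance tag are hypothetical):

          ```
          interface Ethernet1/53
            description L3 routed link to the remote DC over DWDM
            no switchport
            mtu 9216                    ! headroom for the VXLAN encapsulation overhead
            ip address 192.168.100.1/30
            ip router ospf UNDERLAY area 0.0.0.0
            no shutdown
          ```

          Extending the underlay IGP over these links is what gives VTEP loopbacks in one site reachability to VTEP loopbacks in the other.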

          Let me know if that helps, yves

  5. Bertrand Bordereau says:

    Hello Yves, that helps me very much; thanks for the clarification.
    I see what the config is: for the control plane interconnect between the two DCs, it is a classical eBGP multihop, dual-homed configuration between the two transit leafs on each site.
    I do not see other options (perhaps the N7K as transit leaf, quite expensive for that 🙂)
    Best Regards
    Bertrand Bordereau

  6. Pingback: March update on ACI / Nexus 9000 news – Thoughts on technology
