32 – VXLAN Multipod stretched across geographically dispersed datacenters

VXLAN Multipod Overview

This article focuses on a single VXLAN Multipod fabric stretched across multiple locations, which is the first option mentioned in the previous post (31).

Over the past couple of months I have been working with my friends Patrice and Max on building an efficient and resilient solution based on a VXLAN Multipod fabric stretched across two sites. The complete technical white paper is now available and goes into deeper technical detail, including the insertion of firewalls, multi-tenancy VXLAN routing and some additional tuning.

One of the key use cases for this scenario is an enterprise that selects VXLAN EVPN as the technology of choice for building multiple greenfield fabric PoDs. It then becomes logical to extend the VXLAN overlay between distant PoDs that are managed and operated as a single administrative domain. The deployment is simply a continuation of the work performed to roll out the fabric PoDs, which simplifies the provisioning of end-to-end Layer 2 and Layer 3 connectivity.

Technically speaking, and thanks to the flexibility of VXLAN, we could deploy the overlay network on top of almost any form of Layer 3 architecture within a datacenter (CLOS, multi-layered, flat, hub-and-spoke, ring, etc.), as long as VTEP-to-VTEP communication is always available. A CLOS (spine-and-leaf) architecture is obviously the most efficient.

  • [Q] However, can we afford to stretch the VXLAN fabric across distant locations as a single fabric without considering the risk of losing all resources in case of a failure?
  • [A] The answer mainly depends on the stability and quality of the fiber links established between the distant sites. In that context, it is worth remembering that most DWDM managed services offer an SLA equal to or lower than 99.7%. Another important element to consider is how independent the control plane can be from site to site, which reduces the domino effect caused by a breakdown in one PoD.
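
To put the 99.7% figure in perspective, the short Python sketch below converts an availability percentage into the downtime it permits per year. The function name and the 365-day year are assumptions made for illustration; they do not come from any provider contract.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_hours(sla_percent: float) -> float:
    """Return the yearly downtime (in hours) permitted by an availability SLA."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("SLA must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    for sla in (99.7, 99.9, 99.99):
        print(f"{sla}% availability -> {allowed_downtime_hours(sla):.2f} h/year")
```

Under these assumptions, a 99.7% SLA tolerates a little over 26 hours of outage per year on the inter-site link, which is why the independence of the control planes between sites matters so much.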

Anyhow, we need to understand how to protect the whole end-to-end Multipod fabric to avoid propagating any disastrous situation in case of a major failure.

Different models are available for extending multiple PoDs across a single logical VXLAN fabric. The options differ in how the data plane and the routing control plane are stretched or distributed across the sites to separate the routing functions, and in the placement of the transit Layer 3 nodes used to interconnect the PoDs.

From a physical approach, the transit node connecting the WAN can be initiated using dedicated Layer 3 devices, or leveraged from the existing leaf devices or from the spine layer. Diverse approaches exist to disperse the VXLAN EVPN PoDs across long distances.

There is no single approach to geographically dispersing the fabric PoDs among different locations. The preferred choice should be made according to the physical infrastructure (including distances and physical transport technology), the enterprise business and service level agreements, and the application criticality levels, weighing the pros and cons of each design.

VXLAN EVPN Multipod Stretched Fabric

What we should keep in mind is the following (a non-exhaustive list):

  • For VXLAN Multipod purposes, it is important that the fabric learns the endpoints from both sites using a control plane instead of the basic flood & learn.
  • The underlay network, both intra-fabric and inter-fabric, is a pure Layer 3 network used to carry the overlay tunnels transparently.
  • The connectivity between sites is established from a pair of transit Layer 3 nodes. The transit device used to interconnect the distant fabrics is a pure Layer 3 routing device.
  • This transit Layer 3 edge function can also be provided by a pair of leaf nodes used to attach endpoints, or by a pair of border spine nodes used for Layer 3 purposes only.
  • The number of transit devices used for the Layer 3 interconnection is not limited to a pair of devices.
  • The VXLAN Multipod solution recommends interconnecting the distant fabrics using direct fiber or DWDM (metropolitan-area distances). Technically, nothing prevents the use of inter-continental distances (Layer 3), since neither the VXLAN data plane nor the EVPN control plane is sensitive to latency, but the applications are, so this is not recommended.
  • For geographically dispersed PoDs, it is recommended that the control planes (underlay and overlay) are deployed in a more structured fashion in order to keep them as independent as possible from each location.
  • Redundant and resilient MAN/WAN access with fast convergence is a critical requirement. Network services such as DWDM in protected mode with remote port shutdown, as well as BFD, should therefore be enabled.
  • Storm Control on the geographical links is strongly recommended to reduce the fate sharing between PoDs.
  • BPDU Guard must be enabled by default on the edge interfaces to protect against Layer 2 backdoor scenarios.

A brief reminder of what a VXLAN Tunnel Endpoint is:

VXLAN uses VTEP (VXLAN tunnel endpoint) services to map a particular dot1Q Layer 2 frame (VLAN ID) to a VXLAN segment (VNI). Each VTEP is connected on one side to the classic Ethernet segment where endpoints are deployed, and on the other side to the underlay Layer 3 network (fabric) to establish the VXLAN tunnels with other remote VTEPs.

The VTEP performs the following two tasks:

  • Receives traffic from locally connected endpoints and encapsulates it into VXLAN packets destined for remote VTEP nodes
  • Receives VXLAN traffic originating from remote VTEP nodes, decapsulates it, and forwards it to locally connected endpoints

The VTEP dynamically learns the destination information for VXLAN encapsulated traffic for remote endpoints connected to the fabric by using the MP-BGP EVPN control protocol.

Using the information received from the MP-BGP EVPN control plane to populate its local forwarding tables, the ingress VTEP (where the source endpoint is attached) encapsulates the original Ethernet traffic with a VXLAN header and sends it over the Layer 3 network toward the egress VTEP (where the destination endpoint sits). The egress VTEP then decapsulates the Layer 3 packet and presents the original Layer 2 frame to the destination endpoint.

From a data-plane perspective, the VXLAN Multipod system behaves as a single VXLAN fabric network. VXLAN encapsulation is performed end-to-end across PoDs and from leaf nodes. The spines and the transit leaf switches perform only Layer 3 routing of VXLAN encapsulated frames to help ensure proper delivery to the destination VTEP.
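
As an illustration of the encapsulation and decapsulation steps, here is a minimal Python sketch that builds and parses the 8-byte VXLAN header defined in RFC 7348 (a flags byte with the I bit set, a 24-bit VNI and reserved fields). The VLAN-to-VNI table and the function names are hypothetical, and the outer Ethernet, IP and UDP headers that a real VTEP adds are deliberately left out.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid (RFC 7348)

# Hypothetical VLAN-to-VNI mapping configured on a leaf node (VTEP)
VLAN_TO_VNI = {100: 30100, 200: 30200}


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) followed by 24 reserved bits.
    # Word 1: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(vlan_id: int, inner_frame: bytes) -> bytes:
    """Ingress VTEP: map the VLAN to its VNI and prepend the VXLAN header."""
    return vxlan_header(VLAN_TO_VNI[vlan_id]) + inner_frame


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Egress VTEP: strip the VXLAN header, return (VNI, original frame)."""
    word0, word1 = struct.unpack("!II", payload[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word1 >> 8, payload[8:]


if __name__ == "__main__":
    packet = encapsulate(100, b"original-ethernet-frame")
    print(decapsulate(packet))
```

The spines and transit nodes never look at this header: they only route the outer IP packet, which is why they do not need any VTEP function.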

VXLAN VTEP-to-VTEP VXLAN Encapsulation

Multiple Designs for Multipod deployments

PoDs are usually connected with point-to-point fiber links when they are deployed within the same physical location. The interconnection of PoDs dispersed across different locations uses direct dark-fiber connections (short distances) or DWDM to extend Layer 2 and Layer 3 connectivity end-to-end across locations (Metro to long distances).

In fact, there is no real constraint on the physical deployment of VXLAN per se, as it is an overlay virtual network running on top of a Layer 3 underlay, as long as the egress VTEPs are all reachable.

As a result, different topologies can be deployed (or even coexist) for the VXLAN Multipod design.

Without providing an exhaustive list of all possible designs, the first main option is to attach dedicated transit Layer 3 nodes to the spine devices. They are usually called transit leaf nodes because they often sit at the leaf layer. Besides extending Layer 3 underlay connectivity toward the remote sites, they are also often used as computing leaf nodes (where endpoints are locally attached) or as service leaf nodes (where firewalls, security and network services are locally attached).

VXLAN Multipod Design Interconnecting Leaf Nodes

Likewise, nothing prevents interconnecting distant PoDs through the spine layer devices, as no additional function or VTEP is required on the transit spine other than the capability to route VXLAN traffic between leaf switches deployed in separate PoDs. A Layer 3 core layer can also be leveraged for interconnecting remote PoDs.

VXLAN Multipod Design Interconnecting Spine Nodes

The VXLAN Multipod design is usually positioned to interconnect data center fabrics that are located at metropolitan-area distances, even though technically, there is no distance limit. Nevertheless, it is crucial that the quality of the DWDM links is optimal. Consequently, it is necessary to lease the optical services from the provider in protected mode and with the remote port shutdown feature, allowing immediate detection of optical link down from end-to-end. Since the network underlay is layer 3 and can be extended across long distances, it may be worth checking the continuity of the VXLAN tunnel established between PoDs. BFD can be leveraged for that specific purpose.

Independent Control Planes

When deploying VXLAN EVPN we need to consider 3 independent layers:

  • The layer 2 overlay data plane
  • The layer 3 underlay control plane
  • The layer 3 overlay control plane

Data Plane: The overlay data plane is the extension of the layer 2 segments established on the top of the layer 3 underlay. It is initiated from VTEP to VTEP, usually enabled at the leaf node layer. In the context of geographically dispersed Multipod, VXLAN tunnels are established between leaf devices belonging to separate sites and within each site as if it was one single large fabric (hence the naming convention of Multipod).

Underlay Control Plane: The underlay control plane is used to exchange reachability information for the VTEP IP addresses. In the most common VXLAN deployments, an IGP (OSPF, IS-IS, or EIGRP) is used for that purpose, but BGP is also an option. The choice of underlay protocol mainly depends on the enterprise's experience and requirements.

Overlay Control Plane: MP-BGP EVPN is the control plane used in VXLAN deployments to distribute host reachability information across the whole Multipod fabric. It is used by the VTEP devices to exchange tenant-specific routing information for endpoints connected to the VXLAN EVPN fabric and to distribute inside the fabric the IP prefixes representing external networks. In the context of a Multipod dispersed across different rooms (same building or same campus), host reachability information is exchanged between VTEP devices belonging to separate locations as if it were one single fabric.

Classical VXLAN Multipod deployment across different rooms within the same building

The design above represents the extension of the three main components required to build a VXLAN EVPN Multipod fabric between different rooms. This is a basic architecture using direct fibers between the two pairs of transit leaf nodes, one pair in each physical fabric (PoD-1 and PoD-2). The whole architecture offers a single stretched VXLAN EVPN fabric with the same data plane and control plane deployed end-to-end. The remote host population is discovered and updated dynamically on each PoD to form a single host reachability information database.

Deployed within the same building, or even within the same campus, this design can be considered a safe network architecture. Over longer (metropolitan-area) distances, however, the priority must be to improve the sturdiness of the whole solution by creating independent control planes.

The approach below shows that the data plane is still stretched across remote VTEPs. The VXLAN data plane encapsulation extends the virtual overlay network end-to-end. This implies the creation of VXLAN ‘tunnels’ between VTEP devices that sit in separate sites.

However, the underlay and overlay control planes are independent.

Resilient VXLAN Multipod deployment for Geographically Dispersed Locations


Underlay Control Plane: Traditionally, in enterprise datacenters, OSPF is the IGP of choice for the underlay network. In the context of geographically distant PoDs, each PoD is organised in a separate IGP area, with Area 0 typically deployed on the interpod links. This helps reduce the effects of interpod link bouncing, which is usually due to the “lower” quality of inter-site connectivity compared to direct fiber links.

However, in order to improve the underlay separation between the different PoDs, BGP can also be considered as an alternative to an IGP for the inter-site transit links.

Overlay Control Plane: In regard to the MP-BGP EVPN overlay control plane, the recommendation is to deploy each PoD in a separate MP-iBGP autonomous system (AS) interconnected through MP-eBGP sessions. When compared to the single Autonomous System model, using MP-eBGP EVPN sessions between distant PoDs simplifies interpod protocol peering.
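
The AS layout described above can be sketched in a few lines of Python: two EVPN peers in the same autonomous system form an MP-iBGP session, while peers in different autonomous systems form an MP-eBGP session. The node names and the private AS numbers (65001, 65002) are hypothetical, and which nodes actually peer across sites remains a design choice that this sketch does not prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricNode:
    name: str
    pod: str
    asn: int  # hypothetical private AS number assigned to the PoD


def evpn_session_type(local: FabricNode, peer: FabricNode) -> str:
    """Same AS -> MP-iBGP (intra-PoD); different AS -> MP-eBGP (inter-PoD)."""
    return "MP-iBGP" if local.asn == peer.asn else "MP-eBGP"


if __name__ == "__main__":
    leaf_a = FabricNode("pod1-leaf1", "PoD-1", 65001)
    spine_a = FabricNode("pod1-spine1", "PoD-1", 65001)
    node_b = FabricNode("pod2-node1", "PoD-2", 65002)
    print(leaf_a.name, "<->", spine_a.name, evpn_session_type(leaf_a, spine_a))
    print(spine_a.name, "<->", node_b.name, evpn_session_type(spine_a, node_b))
```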

Improving simplicity and resiliency

In a typical VXLAN fabric, leaf nodes connect to all the spine nodes, but no leaf-to-leaf or spine-to-spine connections are usually required. The only exception is the vPC peer link connection required between two leaf nodes configured as part of a common vPC domain. When Leaf devices are paired using vPC, this allows the locally attached endpoints to be dual-homed and take advantage of the Anycast VTEP functionality (same VTEP address used by both leaf nodes part of the same vPC domain).

vPC Anycast VTEP

Dedicated routing devices are not required for the Layer 3 transit function between sites. A pair of leaf nodes (usually called “border leaf nodes” for their role connecting the fabric to the outside) can also act as “transit leaf nodes”, which simplifies the deployment. This coexistence can also be enabled on any pair of computing or service leaf nodes. It does not require any specific function dedicated to the transit role, except that the hardware must support the bud node design (well supported on the Nexus series).

vPC Border Leaf Nodes as Transit Leaf Nodes

The vPC configuration should follow the common vPC best practices. It is also important to enable the “peer-gateway” functionality. In addition, depending on the deployment scenario, the topology, and the number of devices and subnets, additional IGP tuning might become necessary to improve convergence. This relies on standard IGP configuration. Please refer to the white paper for further details.

IP Multicast for VXLAN Multipod geographically dispersed

One task of the underlay network is to transport Layer 2 multidestination traffic between endpoints connected to the same logical Layer 2 broadcast domain in the overlay network. This type of traffic concerns Layer 2 broadcast, unknown unicast, and multicast traffic (aka BUM).

Two approaches can be used to allow transmission of BUM traffic across the VXLAN fabric:

  • Use IP multicast deployment in the underlay network to leverage the replication capabilities of the fabric spines delivering traffic to all the edge VTEP devices.
  • If multicast is not an option, it is possible to use the ingress replication capabilities of the VTEP nodes to create multiple unicast copies of the BUM traffic, one sent to each remote VTEP device.

It is important to note that the choice made for the replication process (IP multicast or ingress replication) applies to the whole Multipod deployment, as it acts as a single fabric. IP multicast is more likely to be available for metro distances, as the Layer 3 over the DWDM links is usually managed by the enterprise itself. With Layer 3 managed services (WAN), however, the provider may not offer IP multicast at all, or may offer only a limited number of multicast groups. In that case, ingress replication must be deployed end-to-end across the VXLAN Multipod. Note that ingress replication may affect scalability, depending on the total number of leaf nodes (VTEPs), so this is worth checking first.
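
The scalability concern can be made concrete with a rough Python estimate of how many copies the ingress VTEP has to generate for each BUM frame. This is a simplified sketch that assumes every VTEP in the Multipod participates in the Layer 2 segment and that a vPC pair sharing an anycast VTEP address counts as one VTEP; the function names and the sample figures are hypothetical.

```python
def copies_per_bum_frame(total_vteps: int, ingress_replication: bool) -> int:
    """Copies the ingress VTEP itself sends for one BUM frame."""
    if total_vteps < 2:
        return 0  # no remote VTEP to reach
    # Ingress replication: one unicast copy per remote VTEP.
    # IP multicast: a single copy, replicated further by the underlay.
    return total_vteps - 1 if ingress_replication else 1


def ingress_bum_load_mbps(bum_rate_mbps: float, total_vteps: int) -> float:
    """Traffic injected by the ingress VTEP when using ingress replication."""
    return bum_rate_mbps * copies_per_bum_frame(total_vteps, True)


if __name__ == "__main__":
    vteps = 40  # e.g. two PoDs with 20 VTEPs each (illustrative)
    print(copies_per_bum_frame(vteps, ingress_replication=True))
    print(copies_per_bum_frame(vteps, ingress_replication=False))
    print(ingress_bum_load_mbps(10.0, vteps))
```

With these illustrative numbers, 10 Mbps of BUM traffic entering one leaf turns into 390 Mbps of unicast copies, a cost that grows linearly with every VTEP added to either PoD.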

If IP multicast is the preferred choice, two approaches are possible:

Anycast Rendezvous-Point PIM and MSDP in a Multipod Design

  1. Deploy the same PIM domain with anycast RP across the two distant PoDs if you want to simplify the configuration. That is indeed a viable solution.
  2. Better, and recommended: keep the PIM-SM domain independently enabled on each site and leverage MSDP for the peering between rendezvous points across the underlying routed network.

Improving the Sturdiness of the VXLAN Multipod Design

Until functions that prevent the creation of Layer 2 loops are embedded in the VXLAN standards, alternative short-term solutions are necessary. The suggested approach consists of the following two processes:

  • Prevention of End-to-End Layer 2 Loops: Use edge port protection features in conjunction with VXLAN EVPN to prevent the creation of Layer 2 loops. The best-known feature is BPDU Guard. BPDU Guard is a Layer 2 security tool that should be enabled on all the edge interfaces that connect to endpoints (hosts, firewalls, etc.). It prevents the creation of a Layer 2 loop by disabling the switch port after it receives a spanning-tree BPDU. BPDU Guard can be configured at the global level or the interface level.
VXLAN with BPDU Guard

  • Mitigation of End-to-End Layer 2 Loops: A different mechanism is required to mitigate the effects of any form of broadcast storms. When multicast replication is used to handle the distribution of BUM, it is recommended to enable the storm-control function to rate-limit the received amount of BUM traffic carried via the IP Multicast group of interest. The rate limiter of multicast traffic must be initiated on the ingress interfaces of the transit leaf nodes, hence protecting all networks behind it.
Storm-Control Multicast on Interpod Links

While it is strongly recommended to enable storm control for multicast on the ingress transit interfaces, it is equally important to measure the multicast traffic currently produced by the enterprise applications across remote PoDs before choosing a threshold value. Otherwise, the risk is to do more damage to the user applications than a broadcast storm would. As a result, the rate-limit value for BUM traffic should be greater than the application multicast traffic (use the highest observed value as a reference) plus the percentage of permitted broadcast traffic.
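
The sizing rule above can be written down as a small calculation. The Python sketch below is only an illustration of that arithmetic: the function name, the optional safety margin and the sample values are assumptions, not recommended thresholds, and the result still has to be translated into the storm-control syntax of the platform in use.

```python
def storm_control_threshold_percent(
    peak_app_multicast_mbps: float,
    link_speed_mbps: float,
    permitted_broadcast_percent: float,
    margin_percent: float = 0.0,  # hypothetical extra headroom
) -> float:
    """Lower bound for the BUM rate limit, as a percentage of link bandwidth.

    threshold = peak application multicast (as % of the link)
                + permitted broadcast % + optional margin
    """
    if link_speed_mbps <= 0:
        raise ValueError("link speed must be positive")
    app_percent = 100.0 * peak_app_multicast_mbps / link_speed_mbps
    threshold = app_percent + permitted_broadcast_percent + margin_percent
    return min(threshold, 100.0)  # a rate limit cannot exceed the link


if __name__ == "__main__":
    # Illustrative values: 400 Mbps peak multicast on a 10 Gbps interpod link,
    # with 1% of the link allowed for broadcast.
    print(storm_control_threshold_percent(400.0, 10_000.0, 1.0))
```

In this example the applications alone account for 4% of the link, so any threshold below 5% would start dropping legitimate multicast traffic before it ever limited a storm.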

Optionally, another way to mitigate the effects of a Layer 2 loop inside a given PoD is to also enable storm-control multicast on the spine interfaces facing the leaf nodes. Even more than in the previous case, it is crucial to understand the normal application multicast utilisation rate within each PoD before applying a rate-limiting threshold.


The deployment of a VXLAN Multipod fabric allows extending VXLAN overlay data plane in conjunction with its MP-BGP EVPN overlay control plane for Layer 2 and Layer 3 connectivity across multiple PoDs. Depending on the specific use case and requirements, those PoDs may represent different rooms in the same physical data center, or separate data center sites (usually deployed at Metropolitan-area distances from each other).

The deployment of the Multipod design is a logical choice to extend connectivity between fabrics that are managed and operated as a single administrative domain. However, it is important to consider that it behaves like a single VXLAN fabric. This characteristic has important implications for the overall scalability and resiliency of the design.

A true multisite solution with a valid DCI solution is still the best option for providing separate availability zones across data center locations.




25 Responses to 32 – VXLAN Multipod stretched across geographically dispersed datacenters

  1. MauriBCN says:

    Hi Yves,

    If we want to interconnect 2 datacenters using VXLAN, what is the best choice: stretched fabric, underlay or overlay?

    We currently have OSPF in area 0 in each datacenter, with eBGP.


    • Yves says:

      Hi Mauri,

      Have you been through article 31? http://yves-louis.com/DCI/?p=1269
      There are several criteria to take into consideration, such as distances, bandwidth, scalability, transport dependency, security, etc.
      IMHO, the best solution to interconnect 2 DCs is to use a validated DCI solution (a DCI architecture) such as OTV or PBB EVPN.
      VXLAN is an encapsulation technique, a transport. It is not a DCI architecture per se but, as I explained in this post, it can be diverted for DCI purposes as long as we know how to harden the VXLAN encapsulated transport to make the DCI solid.
      The concept of the VXLAN Multipod stretched fabric is mainly used to extend Layer 2 segments across two greenfield DCs where VXLAN EVPN has already been deployed intra-fabric (within each DC). The whole VXLAN EVPN configuration is thus done at both locations, and you just need to establish the Layer 3 connectivity between the two fabrics. With VXLAN Multipod geographically dispersed across two distant locations, the overlay data plane is extended end-to-end (from a leaf node in site A to a leaf node in site B). You can have the same OSPF area if you wish, but it’s not recommended, for robustness reasons, as mentioned in this post and detailed in the white paper. VXLAN Multipod is not a DCI solution; there is no delineation between the fabrics.
      If the two DCs are built with “classical” Ethernet transport (vPC, STP) or FP, then you can’t build a stretched fabric; the whole architecture becomes dual-site.
      Having said that, if you choose VXLAN EVPN for interconnecting the 2 sites, then I would recommend keeping the 2 DCs as independent as possible: dual-site (even if each site is built with VXLAN EVPN), with independent data planes and independent control planes. As a result, each site is transport independent (FabricPath, vPC, TRILL, STP, VXLAN, ACI, etc.). You then select the VLANs from each DC to be extended toward your DCI layer (VXLAN-EVPN).
      In that case, VXLAN with the MP-BGP AF EVPN control plane is essential. Don’t use Flood & Learn techniques for interconnecting distant DCs. Enable Storm Control Multicast on the ingress DCI interfaces. You can deploy VXLAN EVPN as a Layer 2 fabric interconnecting the 2 DCs, or you can leverage Anycast Layer 3 gateways (and thus reinforce the sturdiness with ARP suppression). Use vPC at the DCI layer to offer a resilient and redundant active/active DCI border layer with Anycast VTEP.

      hope that helps,

      thanks, yves

  2. Noel says:

    Hi Yves,

    Thanks for this very good article. I’m interested in the “VXLAN Multipod Design Interconnecting Spine Nodes”.
    How would you implement the routing between the core and the spines to keep underlay and overlay control planes independent in such scenario ? Would you have the core devices (super-spine/transit leaf ?) as another BGP AS-number (Running EBGP) and no IGP within the core and the spines ?



    • Yves says:

      Hello Noel,

      If you think of the design presented in this post (or figure 6 in the white paper), the “transit leaf” used to interconnect the two PoDs is a pure Layer 3 device; it can be a dedicated router acting as the transit leaf, or it can be your core router connecting to your MAN/WAN. The white paper talks about several options, even though it mainly focuses on the following:
      – Separation of the underlay network with 3 different IGP areas (1 area for each PoD (each site) and 1 area for the Layer 3 inter-PoD connectivity). Note you can use MP-BGP for the inter-site underlay or you can use eBGP for the intra-site underlay.
      – Separation of the overlay control plane with MP-iBGP intra-PoD (each site) and eBGP inter-PoD.

      The validated multiPoD design described in the white paper suggests that you deploy each pod in a separate MP-iBGP AS interconnected through MP-eBGP EVPN sessions. This model using MP-eBGP EVPN sessions simplifies interPoD protocol peering (for more information, you can read the “Overlay Network Deployment Considerations” section of the white paper pointed in this article). The use of MP-eBGP between geographically dispersed PoD’s is required with each fabric deployed as part of a different autonomous system for interconnection to the external routed domain.

      You will find more details in the white paper of Multipod and you can also look at this generic document: http://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/guide-c07-734107.html#_Toc444553378

      Nonetheless, this multiPoD design is flexible and should be adapted to your existing WAN/MAN deployment.
      Having a dedicated L3 DCI for a better separation of underlay and overlay control planes is certainly the best choice though.

      Hope that clarifies a bit more.

      thank you, yves

  3. Noel says:

    Thanks for your quick answer.

    I was referring more to figure 2 of the white paper of this article and how to interconnect the core with the transit spine: a figure 6 that would reflect the figure 2 interconnection.
    I guess the answer is Figure 16 of http://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/guide-c07-734107.html#_Toc444553378
    Where the core becomes the spine of Figure 16 and the transit spine becomes the leaf of Figure 16. Is that correct?


    • Yves says:

      Hi Noel,

      Figure 16 represents the VXLAN intra-fabric where the Spine layer keeps the Spine function for the CLOS fabric, and where the Layer Core (WAN/MAN) of the enterprise is not represented. This is just an alternative leveraging BGP routing for underlay network instead of IGP. And the decision on the choice between BGP or IGP for the underlay mainly relies on the experience of the network team. However back to fig.#2 in the WP, yes theoretically you can have some of the Spine devices running as “transit Spine” toward a Core layer to extend the Layer 3 with MP-eBGP for the Overlay control plane. I don’t see why it will not work, but it may be worth running some design testing and make sure we are not facing any corner cases.

      Please feel free to send me your detailed drawing/thoughts in case I’m misreading your point (ylouis@cisco.com).

      kind regards, yves

  4. Anthony says:

    Hi Yves,

    Thanks for this article. Keep writing…

    I have a question about transit leaf nodes. Why use additional devices (when the transit nodes are dedicated) for interconnecting a multipod fabric, instead of using spine nodes with back-to-back links? Spines are also “pure” layer 3 nodes but with additional features (like BGP RR and Anycast RP).

    Regarding fig 1 of WP, you will have:
    – pod1-sp1 pod2-sp1
    – pod1-sp1 pod3-sp1
    – pod1-sp2 pod2-sp2
    – pod1-sp2 pod3-sp2
    – pod2-sp1 pod3-sp1
    – pod2-sp2 pod3-sp2

    So eBGP and MSDP peering will follow physical connections.

    I know this is not a recommended solution (it’s not mentioned in the WP or in any of the DG or CL libraries that I have read), but with this approach you could simplify the overall design and build a 4-stage fabric with one or two fewer hops (that is, less latency, less buffering, fewer devices to manage, and lower cost …) without compromising on resiliency or communication between VTEPs in different pods.

    Could you please give me your opinion about this design and if there are specific constraints (like multicast especially MSDP SA sync between RP, and so on)?

    I plan to set up this for interconnecting our 3 pods fabric (same campus, in two buildings, one pod per building, and the last shared between these two locations for border connectivity of the overall architecture).

    Thank you in advance for your answers.
    Kind Regards

    • Yves says:

      Hi Anthony,

      Thank you! This is a good comment. Actually, I mentioned this short statement just before figure 1 in the post 32.
      ” //snip// From a physical approach, the transit node connecting the WAN can be initiated using dedicated Layer 3 devices, or leveraged from the existing leaf devices or from the spine layer. Diverse approaches exist to disperse the VXLAN EVPN PoDs across long distances //snip// ”

      However you are correct, this option is not detailed, but this should be supported (personally I haven’t tested this option, but I don’t see any corner cases).
      If you are planning to deploy the 3 PoDs within the same campus, then you can also think of connecting the Spine nodes (PoD_1) to remote Leaf nodes (PoD_2 & PoD_3). This obviously depends on your physical fiber plant. However, this should allow you to use NFM to configure the 3 remote PoD fabrics in a “3-clicks” fashion :). Personally I like the approach where we leverage a pair of existing Leaf Nodes (where Hosts and/or Network services are locally attached) for Layer 3 inter-PoD connectivity.

      Having said that, VXLAN EVPN is very flexible and offers different options that should satisfy almost all requirements.

      Please, feel free to share your experience afterward!

      Kind regards, yves

  5. Peter says:

    Hi Yves,
    I think that there may be plans for MP-BGP EVPN to be used as a way to connect separate ACI fabrics over long distances. I wonder if this might present the possibility that an ACI EPG could be extended to join up with a remote VLAN located on a more traditional NX-OS environment (or maybe even non-Cisco). And also for interoperation of the distributed anycast gateway function. And all this without having to compromise the arrangement by switching on ARP flooding in the ACI BD.
    Or am I expecting too much 🙂
    Best Wishes,

    • Yves says:

      Hi Peter,

      I don’t think you are expecting too much 🙂

      The last ACI release 2.0 offers a new solution to stretch the ACI fabric across multiple locations using a Layer 3 underlay (OSPF) from Spine layer to Spine layer with a BGP EVPN control plane between ACI Pods. This is called ACI Multipod where “Pods” are local within a campus or geographically dispersed (maximum distance supported at release 2.0 is 10 ms RTT between Pods). The ACI Fabric is deployed in Active/Active fashion, offering deployment of various application components across separate ACI Pods (keep in mind that each software framework relies on specific latency concerns when its members are dispersed and may or may not support 10ms). The entire network runs as a single large fabric from an operational perspective (APIC cluster). However, ACI Multi-Pod introduces some enhancements to contain the failure domains inside each Pod, increasing the resiliency of the whole ACI multipod fabric compared to the “basic” stretched fabric (partial mesh). This is achieved by running separate instances of fabric control planes (IS-IS, COOP, MP-BGP) across Pods.
      As a result, ACI configuration or policy applies seamlessly and transparently to any of the APIC nodes across all the distant Pods managed by the single APIC cluster, as if it were the same physical ACI Fabric. This simplifies the operational aspects of the whole ACI solution.

      Back to your note, you can also “attach” remote “stand-alone” Pod (VXLAN based on NX-OS or any other Overlay-based or VLAN-based DC architecture including non-cisco platforms) to the ACI fabric with a DCI solution (OTV, PBB-EVPN, MPLS-EVPN, VXLAN-EVPN..etc..). The VLAN ID’s will be bound to the ACI EPG’s of interest. And you can leverage Anycast Gateways on both sides if you wish, from the ACI fabric and from the VXLAN EVPN Fabric. Note that you will need to filter the duplicate vMAC address of the Anycast gateway using traditional PACL or VACL on the boundary of the ACI fabric (outside the ACI fabric).

      As for the ARP flooding mode, if it is turned off in the relevant ACI BD, you can still leverage ARP gleaning.

      We should publish a technical white paper on ACI Multipod by the end of September. I will let you know as soon as it becomes public.

      Hope that helps,

      Thank you, yves

      • Peter says:

        Hi Yves,
        Thank you for the very informative reply about the ACI Multi-Pod solution.
        Looking further into the future, I understand that there will be an ACI Multi-Site solution. I realise that this may not allow all the ACI environments to operate as a single fabric from the APIC perspective. But would it relax the 10 ms requirement and possibly avoid the need to run OSPF on the underlay? Would a distributed anycast gateway still be possible in this situation, or would this software framework re-introduce the RTD requirement?
        Kind Regards,

        • Yves says:

          Hi Peter,

          ACI Multi-site will replace the traditional Dual-Fabric design. The sites still operate as distinct ACI fabrics, with independent APIC clusters on each site and separate management and policy domains. Inter-site connectivity will be initiated from the Spine layer (as with ACI Multipod). However, it will be possible to synchronize the application policies automatically and to selectively import policy context from any fabric (site) to any target fabric (site), not limited to 2 sites BTW. It may be a bit early to commit, but as far as I’m aware, Multi-site should not be subject to the 10 ms limit and should allow inter-continental deployments (unlimited distances).
          Host reachability information will be advertised across all sites via MP-BGP EVPN. The transit network can be any IP-based network. BTW, with ACI Multipod, OSPF is required only between the Spine and the IPN.

          Kind regards, yves

      • Yves says:

        Hi Peter and all,

        The new White Paper on ACI Multipod is now available


        Have a look and let me know if you have any questions

        Thank you, yves

        • Peter says:

          Hi Yves,
          I am keenly awaiting ACI Multi-site. I wondered if there was any news or updates?
          Best Wishes,

          • Yves says:

            Hi Peter,

            I assume you are talking about the evolution of the Dual-Fabric option that addresses (among many other things) the need to extend policies across multiple sites with independent SDN controllers (APIC clusters), but with a central management point to propagate the configuration toward multiple APIC clusters (ouf!). Unfortunately, the release date for this solution is not public yet, so as of today I can’t provide any update, sorry.

            Just a reminder for other readers: the ACI Multipod solution (a single APIC cluster distributed across multiple locations) has been available since version 2.0. http://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/application-centric-infrastructure/white-paper-c11-737855.html?cachemode=refresh

            And because you asked this question under the post VXLAN Multipod, I may publish soon a new series of articles on interconnecting multiple VXLAN EVPN Fabrics.

            Will keep you posted

            Thank you, yves

  6. yanman says:

    Thanks for the article Yves! I was really glad to see your notes about vPC Anycast VTEPs, or “vPC Border Leaf” switches as I’ve also heard them termed.

    Our small setup is built just like the centre of your diagram called “vPC Border Leaf Nodes as Transit Leaf Nodes”: we have two sites, each with 2 x 9Ks in a vPC pair. Each 9K has L3 connections to its local node and to one remote node, forming a square for the underlay. I’ve statically defined the BGP neighbors as a full mesh, and I’m using ingress replication (no spines). Each node also has a link to its closest of 2 main DCs.

    I’m worried some recent black-hole incidents are due to the topology and it cannot be made workable, would you say this design is not valid?

    I’ve seen a few issues – for tenants we mutually redistribute IGP with EVPN. Due to metrics and bandwidth set, hops across BGP add no weight, meaning often BGP cross-site routes are used instead of local node-to-node in the event of a DC link failure on one.

    Another is that the SVI over the peer-link, used for tenant VRF node-to-node routing, defaults to 1 Gbps and is thus not being utilised. I’m concerned that in some scenarios the Anycast VTEP load-balancing lands traffic at a node that then black-holes it, since it is trying to send traffic onward over BGP.

  7. Hi,
    I’ve studied your great Article.

    If the pods are located in the same physical place, the pod interconnection links are stable. Is it still recommended to build a stretched fabric with a single control plane for underlay and overlay? Is the stability of the inter-pod links the only reason to propose multiple control planes for underlay and overlay? Is there any reason to use iBGP as the single control plane for underlay and overlay across pods?

    Best Regards,

    • Yves says:

      Hi Shahin,

      Even if Pods are located in the same physical location, the final design will drive the sturdiness of the whole VXLAN domain.
      If you trust every HW and SW component, including human intent, then technically you can stretch the fabric with the same CP and DP until you reach the scalability limits (currently 256 leaf nodes).
      If you think that an unexpected event could disrupt your DP and CP (including one coming from external network services or machines, which always happens sooner or later), then you may want to improve the sturdiness of the VXLAN domain.
      You can build a Multipod where each Pod has its own ASN and the Pods are interconnected using eBGP. It’s not perfect though, as the CP and DP are still extended from end to end.
      The best approach IMHO, since last year, is to deploy VXLAN EVPN Multi-site, where you create a physical demarcation at the border edge of the fabric (terminating DP and CP on the same border nodes) and build a hierarchical architecture to interconnect multiple fabrics, even if they are in the same location. This is something I see more and more for very large DCs. With that design, you can rate-limit the BUM traffic and reduce the risk of disruption.

      Reason for iBGP? Simplicity and efficiency! You can use eBGP for the underlay and overlay transport, and this is well documented, but what would be the reason to do so?

      Best regards, yves

  8. A.A. says:

    Hi there… Would the following design be correct ?

    – 2 datacenters
    – OSPF as an IGP underlay (area 1 for DC1, area 2 for DC2, area 0 for the IPN links)
    – iBGP in the underlay (AS 65001 for DC1, AS 65002 for DC2) and eBGP between the two DCs
    – Spine switches as RR in both DCs
    – PIM ASM in both DCs, Spines as rendezvous points in both DCs…MSDP peerings
    – MP-iBGP between Spines and Leaves in each DC
    – Full mesh of MP-eBGP peerings between all Spines

    Does it make sense ?

    PS: I really have a hard time finding detailed configuration guides/examples for NX-OS mode (non-ACI !!!!!) deployment of multipod designs…
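
    One rough way to sanity-check a design like the one listed above is to enumerate the overlay BGP sessions it implies. The Python sketch below is only an illustration: the AS numbers come from the comment, while the device names and the number of spines and leaves are assumptions. It counts the leaf-to-RR MP-iBGP sessions in each DC and the spine-to-spine MP-eBGP sessions between the two DCs; it does not model the underlay, PIM or MSDP.

```python
from itertools import product

# Hypothetical inventory: the AS numbers come from the comment above,
# the device names and counts are assumptions made for illustration.
PODS = {
    "DC1": {"asn": 65001,
            "spines": ["dc1-spine1", "dc1-spine2"],
            "leaves": ["dc1-leaf1", "dc1-leaf2", "dc1-leaf3", "dc1-leaf4"]},
    "DC2": {"asn": 65002,
            "spines": ["dc2-spine1", "dc2-spine2"],
            "leaves": ["dc2-leaf1", "dc2-leaf2"]},
}


def ibgp_sessions(pod):
    """Each leaf peers with every spine (route reflector) of its own DC."""
    return list(product(pod["leaves"], pod["spines"]))


def ebgp_sessions(pod_a, pod_b):
    """Full mesh of MP-eBGP sessions between the spines of two DCs."""
    return list(product(pod_a["spines"], pod_b["spines"]))


ibgp = {name: ibgp_sessions(pod) for name, pod in PODS.items()}
ebgp = ebgp_sessions(PODS["DC1"], PODS["DC2"])

for name, sessions in ibgp.items():
    print(f"{name} (AS {PODS[name]['asn']}): {len(sessions)} MP-iBGP leaf-to-RR sessions")
print(f"Inter-DC: {len(ebgp)} MP-eBGP spine-to-spine sessions")
```

    With two spines per DC the inter-DC mesh stays small (four sessions here), but it grows with the product of the spine counts, which is worth keeping in mind before adding more spines or more sites.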



  9. efthimis says:

    Hello there,

    I have a topology with three sites (triangle). Each site has a pair of Cisco 9396 in vPC (NX-OS standalone mode). The sites are connected with protected DWDM fibers. I read that Multi-site is not supported on these models (9396), so I am thinking of implementing a multi-pod “DCI” between the sites.
    Can you tell me if there is any alternative (better) way to interconnect these sites (without buying new switches)?
    I am planning to use OSPF (area 0) for the underlay (for loopbacks etc.) and also to use iBGP (full mesh, or maybe an RR?) for the peering between the sites, plus Anycast Gateways (I need the same subnets in all sites). I chose to use ingress replication rather than multicast. Is the above design valid?
    Finally, can you clarify if I should peer between the vPC nodes? If yes, are there any best practices/recommendations that I should apply?

    Thank you in advance for your time!

    • admin says:


      You are correct, the N9396 doesn’t support VXLAN EVPN Multi-site.
      We no longer recommend deploying VXLAN EVPN Multipod, as you certainly know, but if you definitely can’t add a pair of 2nd-gen Nexus 9k-EX/FX, IMO there are only two options:
      – VXLAN EVPN Multipod as you mentioned
      + easy to deploy 🙂
      + you extend the failure domain from end to end 🙁
      + you can use ingress replication for the BUM traffic if you wish
      + A best practice for VXLAN EVPN Multipod is to:
        – deploy an independent MP-iBGP overlay domain within each VXLAN pod.
        – define a pair of MP-iBGP route reflectors (RRs) in each pod (traditionally on the spine nodes).
        – deploy each pod in a separate MP-iBGP AS, interconnected through MP-eBGP sessions.
      + I recommend enabling QoS policers on the inter-Pod Layer 3 interfaces, to somewhat mitigate the risk of a global outage in case of broadcast storms
      + Have you seen this WP? http://yves-louis.com/DCI/wp-content/uploads/2015/10/VXLAN-Multipod-geographically-dispersed-white-paper-final.pdf

      – VXLAN EVPN Multi-fabric
      + more solid 🙂
      + but more complicated 🙁
      + the DCI border leaf can’t be used as the default gateway 🙁
      + you need to hand off dot1Q and VRF-lite and use separate links for the L2 extension and the L3 extension
      + enable Multicast Storm Control on the DCI Layer 2 interfaces

      This being said, if possible, I strongly recommend deploying a pair of N9k-EX/FX: VXLAN EVPN Multi-site will be much more efficient and easier to deploy (it is also fully supported by DCNM, in a few clicks, using best practices). Remember that you can reuse the same Border Gateways to attach L3 devices (FW, SLB, external routers, etc.) as well as locally attached hosts at L2 (vPC mode), while they continue to offer the L3 Anycast Gateway function.
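
      As a rough illustration of the three Multipod guidelines listed above (an independent MP-iBGP domain per pod, a pair of RRs per pod, a separate AS per pod), here is a small Python sketch that checks a planned design against them. The plan structure, the names and the AS numbers are all hypothetical; this is not a tool shipped with NX-OS or DCNM.

```python
def check_multipod_plan(pods):
    """Return a list of violations of the Multipod guidelines (empty if none)."""
    problems = []

    # Each pod is expected to run in its own AS (MP-eBGP between pods).
    asns = [pod["asn"] for pod in pods.values()]
    if len(set(asns)) != len(asns):
        problems.append("pods must not share an AS number")

    for name, pod in pods.items():
        rrs = pod["route_reflectors"]
        # A pair of MP-iBGP route reflectors is expected in each pod.
        if len(rrs) < 2:
            problems.append(f"{name}: expected a pair of route reflectors")
        # Route reflectors must be nodes of the pod they serve.
        for rr in rrs:
            if rr not in pod["nodes"]:
                problems.append(f"{name}: route reflector {rr} is not a node of this pod")
    return problems


# Hypothetical plan: pod2 reuses pod1's AS and has a single route reflector.
plan = {
    "pod1": {"asn": 65001,
             "nodes": ["p1-spine1", "p1-spine2", "p1-leaf1"],
             "route_reflectors": ["p1-spine1", "p1-spine2"]},
    "pod2": {"asn": 65001,
             "nodes": ["p2-spine1", "p2-spine2", "p2-leaf1"],
             "route_reflectors": ["p2-spine1"]},
}

issues = check_multipod_plan(plan)
for issue in issues:
    print(issue)
```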

      Let me know if you have more questions

      Best regards, yves

      • efthimis says:


        Thank you for the quick reply and for the recommendations.
        About the Multipod design, can you elaborate on this point (“you extend the failure domain from end to end”)? For example, if I ‘lose’ one site, will I also face problems at the other sites?

        As for “deploy an independent MP-iBGP overlay domain within each VXLAN pod”: if each site has only one pair of 9396 (no CLOS design), is it OK to just establish an iBGP session between the vPC peer switches? Also, should I establish MP-eBGP sessions from each node of one site to the nodes of the other sites?

        Thank you,

        • admin says:

          Hi Efthimis,

          Failure domain can refer to multiple elements. For example, Multipod leverages the same data plane (the same VXLAN tunnel between a VTEP in pod 1 and a VTEP in pod 2) and the same control plane between the 2 locations. As a result, each VTEP discovers local endpoints from HMM, learns remote endpoints from the EVPN control plane, and distributes the host tables across all other VTEPs, including the remote ones. If something goes wrong in either the DP or the CP, the risk is that, for example, a broadcast storm disrupts the remote pod or, if endpoint learning/distribution fails for any reason, that the host table is disrupted everywhere. In addition, the scalability is limited to that of a single VXLAN fabric. The Multipod concept improves the distribution process a little by using an eBGP transport between pods (you can leverage, for example, BGP dampening in case a DCI link bounces), but it’s not perfect. If pod 1 learns something wrong, the error will be propagated to every VTEP of interest. This is what we call the failure domain, and you want to reduce it to its smallest diameter (a pod, for example).
          That said, don’t get me wrong: in a VXLAN EVPN Multipod deployment, if you shut down pod 1, pod 2 can still continue to work independently. Pod 2 doesn’t need pod 1 for its DP and CP; it will use its own RR (and RP in multicast replication mode).
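
          The two points above can be sketched with a toy model in Python. It is deliberately simplistic and every name in it is an assumption: VTEPs are plain objects, and an "EVPN update" is just a loop that installs the entry on every reachable VTEP. It only shows that an entry learned in pod 1 lands in pod 2 as well, and that pod 2 keeps learning its own endpoints when pod 1 is down.

```python
from dataclasses import dataclass, field


@dataclass
class Vtep:
    name: str
    pod: str
    # Host table: MAC address -> name of the VTEP that advertised it.
    host_table: dict = field(default_factory=dict)


def advertise(vteps, origin, mac, pods_up):
    """Toy EVPN update: every reachable VTEP installs the entry as received."""
    if origin.pod not in pods_up:
        return
    for vtep in vteps:
        if vtep.pod in pods_up:
            vtep.host_table[mac] = origin.name


vteps = [Vtep("pod1-leaf1", "pod1"), Vtep("pod1-leaf2", "pod1"),
         Vtep("pod2-leaf1", "pod2"), Vtep("pod2-leaf2", "pod2")]

# A wrong entry learned in pod 1 ends up on every VTEP, including pod 2.
bad_mac = "00:00:5e:00:53:01"
advertise(vteps, vteps[0], bad_mac, pods_up={"pod1", "pod2"})
poisoned = [v.name for v in vteps if bad_mac in v.host_table]

# With pod 1 shut down, pod 2 still learns its own endpoints.
local_mac = "00:00:5e:00:53:02"
advertise(vteps, vteps[2], local_mac, pods_up={"pod2"})
pod2_only = [v.name for v in vteps if local_mac in v.host_table]

print("VTEPs holding the wrong entry:", poisoned)
print("VTEPs that learned the pod 2 endpoint:", pod2_only)
```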

          For your second point, with only one pair of leaf nodes in each site, technically yes, you could establish a single iBGP domain between the two pairs of N9396 and have the same VXLAN domain (same ASN) extended across the two DCs. Then you don’t need to establish the eBGP sessions between the two DCs. However, it’s then not a VXLAN EVPN Multipod design anymore. Do you agree?
          Let me just rephrase what I said previously. Technically your second case works, but as a result you have the same stretched VXLAN fabric extended between the two locations. This is not a recommended design. You can keep this simple physical design, with a pair of (border) leaf nodes in each site for local L2/L3 attachment and L3 external connectivity, and use VXLAN EVPN Multi-site to build the end-to-end solution, which will be much more solid.

          Let me know if this is not clear

          Best regards, yves
