42 – LISP Network Deployment and Troubleshooting

Good day Network experts

It has been a great pleasure and an honor working with Tarique Shakil and Vinit Jain on this book below, deep-diving on this amazing LISP protocol.

I would also like to take this opportunity to thank Max Ardica, Victor Moreno and Marc Portoles Comeras for their invaluable help. I wrote the section on LISP Mobility deployment with traditional and modern data center fabrics (VXLAN EVPN based as well as ACI Multi-Pod/Multi-Site), but this could not have been done without the amazing support of these guys.

Available from Cisco Press
or from Safari book Online

LISP Network Deployment and Troubleshooting

Implement flexible, efficient LISP-based overlays for cloud, data center, and enterprise

The LISP overlay network helps organizations provide seamless connectivity to devices and workloads wherever they move, enabling open and highly scalable networks with unprecedented flexibility and agility.

LISP Network Deployment and Troubleshooting is the definitive resource for all network engineers who want to understand, configure, and troubleshoot LISP on Cisco IOS-XE, IOS-XR and NX-OS platforms. It brings together comprehensive coverage of how LISP works, how it integrates with leading Cisco platforms, how to configure it for maximum efficiency, and how to address key issues such as scalability and convergence.

Focusing on design and deployment in real production environments, three leading Cisco LISP engineers present authoritative coverage of deploying LISP, verifying its operation, and optimizing its performance in widely diverse environments. Drawing on their unsurpassed experience supporting LISP deployments, they share detailed configuration examples, templates, and best practices designed to help you succeed with LISP no matter how you intend to use it.

This book is the Cisco authoritative guide to LISP protocol and is intended for network architects, engineers, and consultants responsible for implementing and troubleshooting LISP network infrastructures. It includes extensive configuration examples with troubleshooting tips for network engineers who want to improve optimization, performance, reliability, and scalability.

This book covers all applications of LISP across various environments including DC, Enterprise, and SP.

  • Review the problems LISP solves, its current use cases, and powerful emerging applications
  • Gain in-depth knowledge of LISP’s core architecture and components, including xTRs, PxTRs, MR/MS, ALT, and control plane message exchange
  • Understand LISP software architecture on Cisco platforms
  • Master LISP IPv4 unicast routing, LISP IPv6 routing, and the fundamentals of LISP multicast routing
  • Implement LISP mobility in traditional data center fabrics, and LISP IP mobility in modern data center fabrics
  • Plan for and deliver LISP network virtualization and support multitenancy
  • Explore LISP in the Enterprise multihome Internet/WAN edge solutions
  • Systematically secure LISP environments
  • Troubleshoot LISP performance, reliability, and scalability
Posted in DCI | Leave a comment

41 – Interconnecting Traditional DCs with VXLAN EVPN Multi-site using DCNM

Good day team,

The same question often arises about how to leverage DCNM to deploy VXLAN EVPN Multi-site between traditional data centers. To clarify, DCNM can definitely help interconnect two or more Classical Ethernet-based DC networks in a very short time, thanks to its automation engine.

As a reminder (see post 37), the VXLAN EVPN Multi-site overlay is initiated from the Border Gateway (BGW) nodes. The BGW function is the key element of the EVPN Multi-site solution, extending Layer 2 and Layer 3 connectivity across distant sites. The BGW nodes can run in two different modes: Anycast BGW mode (up to 6 BGW devices per site), traditionally deployed to interconnect VXLAN EVPN-based fabrics, or vPC BGW mode (up to 2 vPC BGW devices per site), essentially designed to interconnect traditional data centers, although not limited to that use. Indeed, vPC BGW devices can also be leveraged to dual-attach endpoints locally at Layer 2, even when the infrastructure supports VXLAN EVPN end-to-end.
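The per-site limits of the two BGW modes can be captured in a small sanity check. This is only a sketch based on the numbers quoted above; the dictionary and function names are hypothetical and are not part of DCNM or NX-OS.

```python
# Limits taken from the text: Anycast BGW up to 6 per site, vPC BGW up to 2.
# BGW_LIMITS and validate_bgw_site are hypothetical names used for illustration.
BGW_LIMITS = {"anycast": 6, "vpc": 2}


def validate_bgw_site(mode, bgw_count):
    """Return True if the number of BGW devices fits the chosen mode."""
    if mode not in BGW_LIMITS:
        raise ValueError(f"unknown BGW mode: {mode!r}")
    return 1 <= bgw_count <= BGW_LIMITS[mode]


print(validate_bgw_site("vpc", 2))      # a vPC pair per site
print(validate_bgw_site("vpc", 3))      # too many devices for vPC BGW mode
print(validate_bgw_site("anycast", 6))  # maximum for Anycast BGW mode
```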

In this demo, I used several designations to describe the Classic Ethernet-based data center networks, such as traditional or legacy data center network. All these terms mean that these networks are not VXLAN EVPN-based fabrics. In this use case, the vPC BGW node is agnostic to the data center network type, model or switches that make up the data center network infrastructure, as long as it offers Layer 2 (dot1Q) connectivity for Layer 2 extension and, if required, external routed traffic (VRF-Lite). The traditional data center can be built with any Cisco Nexus series (N9k/N7k/N5k/N2k), Catalyst, or simply non-Cisco switches. Any enterprise with two or more data centers built with a hierarchical multi-tier architecture (Core, Aggregation, Access) running a simple Classic Ethernet-based Layer 2 network, with or without a centralized routing block, can deploy a VXLAN EVPN Multi-site solution using DCNM 11 to interconnect its legacy data centers.

DCNM 11 can discover and import the network devices belonging to the traditional data center using CDP (Cisco) or LLDP (vendor-neutral). In the case of NX-OS platforms, DCNM 11 can also be leveraged to push traditional network configuration to these Nexus devices and to monitor them inside an “External Fabric”.

The figure below depicts the lab used to demonstrate step by step how DCNM auto-provisions the underlay and overlay networks between DC-1 and DC-2. The lab consists of two Classic Ethernet-based data center networks. At each location, a pair of BGW switches (Nexus 9000-EX series) has been attached to the aggregation layer using a virtual port-channel. For the physical interconnection between sites, a complex routed network has been set up. It is nonetheless possible to interconnect the two pairs of BGW nodes directly in a back-to-back fashion, using direct fibers for example, either in a square or in a full mesh, which makes the deployment even faster. However, the goal of adding a Layer 3 core between the distant data centers for this video is to make the demo more comprehensive, thereby explaining how DCNM can also auto-provision the Layer 3 underlay for the Inter-Fabric Connectivity (IFC) designed with a complex routed network. Last but not least, a Layer 3 Core network becomes very helpful if you have more than 2 data centers to interconnect.

For an end-to-end architecture with back-to-back connectivity between BGW nodes, you can skip step 2 [Creation of one “External Fabric” that contains the complex Layer 3 network] discussed in the video. Indeed, DCNM will automatically provision the underlay network required between the distant BGW nodes.

The deployment for this DCI architecture can be organized in 2 main stages. The first stage consists of creating the infrastructure with multiple building blocks called “fabrics”, one fabric per network function.

The following demo describes the 1st stage organized in four building blocks:

  1. Creation of two “External Fabric” instances comprising the traditional data center networks.
    • Each brownfield data center network is imported in “monitor” mode. This means that DCNM discovers the switches, including the connectivity between the devices. The existing configuration of the legacy network is imported without any changes.
    • Note that this External Fabric for each data center is optional for the purpose of Multi-site. However, it provides a visual topology of the traditional DC network. Furthermore, DCNM can configure the Classic Ethernet (CE) switches if needed, and it provides device health and link utilization in real time.
  2. Creation of one “External Fabric” that contains the complex Layer 3 network.
    • The Layer 3 network is imported with its current configuration; however, this time it is configured in “managed” mode. The reason is that, as demonstrated in this video, we want DCNM to also auto-provision the network connectivity between the BGW nodes and the Core routers. As a result, DCNM will discover and configure the routed interfaces on both sides of the IFC, on the BGW nodes and on the Core routers. To achieve this automation, the user needs to uncheck the monitor mode and configure the role of the Core routers.
  3. Creation of two “Easy Fabric” instances used to support the BGW Multi-site function.
    • The BGW devices are discovered by DCNM and imported without preserving their configuration. DCNM deploys the whole configuration required for EVPN Multi-site. All parameters are taken from the Easy Fabric settings.
    • You can leave all default parameters, or you can change any of them according to your specific needs or habits. The DCNM policies that automate the deployment rely on Cisco best practices. Nevertheless, during this demo, some of the parameters in the resources tab are updated, particularly the loopback interface address ranges, so that they differ between the two sites. Changing the parameters remains optional.
  4. Creation of the Multi-Site Domain (MSD)
    • In this particular use case, you import the two “Easy Fabric” instances created for the management of the BGW nodes and the “External Fabric” comprising the Layer 3 core.
    • The MSD is responsible for auto-deploying the underlay and overlay networks required for the VXLAN EVPN Multi-site established between the two data centers.
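The four building blocks, and which of them can be skipped, can be summarized in a short planning helper. This is a sketch of the decision logic described in this post only: the function name and parameters are hypothetical, and nothing here calls DCNM.

```python
def stage1_steps(has_l3_core, import_legacy_dcs=True):
    """List the fabrics to create for stage 1, in order.

    has_l3_core: True when a routed core sits between the BGW pairs,
                 False for back-to-back BGW connectivity.
    import_legacy_dcs: the per-DC External Fabric is optional for Multi-site.
    """
    steps = []
    if import_legacy_dcs:
        steps.append("External Fabric per traditional DC (monitor mode)")
    if has_l3_core:
        steps.append("External Fabric for the Layer 3 core (managed mode)")
    steps.append("Easy Fabric per site for the BGW nodes")
    steps.append("Multi-Site Domain (MSD)")
    return steps


# Full demo scenario: legacy DCs imported, routed core between sites.
for number, step in enumerate(stage1_steps(has_l3_core=True), start=1):
    print(number, step)

# Fastest path: back-to-back fibers, no import of the legacy networks.
print(stage1_steps(has_l3_core=False, import_legacy_dcs=False))
```

With back-to-back connectivity and no import of the legacy networks, only the Easy Fabric and MSD steps remain.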

During the deployment of the configurations, different colors will inform you about the status of the devices. Green means that the intent matches the running config and that configuration is in sync between the device of interest and DCNM. When the deployment is pending on a switch, the switch is displayed in blue color.

Now that the multi-site network infrastructure interconnecting the two traditional data centers is deployed, you want to provision the networks and VRFs required for the Day-2 operations.

In this scenario, you have several endpoints spread across the traditional data centers belonging to two different networks. In this example, VLAN 2100 and 2101 under the same Tenant 1 must be extended between the two data centers. You want to allow these two networks across the 2 port-channel interfaces “22” facing the traditional data centers, in order to extend the networks of interest from the traditional data centers across the Multi-site infrastructure. Notice that one physical host is connected to the BGW 1 in single attachment via the interface e1/36 in which the concerned VLANs must also be allowed.
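To make the intent concrete, the snippet below renders the kind of trunk configuration that allows VLANs 2100 and 2101 on the interfaces named above. It is a hand-written sketch that assumes the interfaces are already Layer 2 trunks; the configuration DCNM actually generates from its templates may differ, and the helper name is hypothetical.

```python
def allow_vlans(interface, vlans):
    """Render NX-OS style lines adding VLANs to a trunk's allowed list."""
    vlan_list = ",".join(str(v) for v in sorted(vlans))
    return [
        f"interface {interface}",
        f"  switchport trunk allowed vlan add {vlan_list}",
    ]


# port-channel 22 faces the traditional DC; Ethernet1/36 is the single-attached host.
for intf in ("port-channel22", "Ethernet1/36"):
    print("\n".join(allow_vlans(intf, [2100, 2101])))
```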

The following video describes how to deploy the networks and VRFs. This is the same deployment described in post 39 with end-to-end VXLAN EVPN fabrics, except that, in this particular scenario the auto deployment concerns only the BGW nodes. The Dot1Q Layer 2 traffic uses the traditional network transport across the legacy data centers to hit the destined endpoints.

Finally, a short series of tests demonstrating the successful multi-site deployment and network connectivity extended from DC-1 to DC-2 concludes this demo.

After the Nexus 9000-EX switches were connected to the aggregation layer of the traditional data centers, it took less than 30 minutes to fully deploy the Multi-site infrastructure between the two traditional data centers, including the extension of the Layer 2 and Layer 3 networks.

The BGW function requires specific hardware offering tunnel stitching and hardware-based rate limiters for the tunnel transporting the BUM traffic. As a result, the BGW nodes must be part of the CloudScale ASIC family, from the Nexus 9000-EX series up to any newer Nexus 9000 platform. However, the network devices inside the traditional data centers can be any hardware platform and network operating system, including non-Cisco switches.

I ran through an exhaustive configuration showing further integration with legacy data centers and a complex Layer 3 Core network. However, if you have direct fibers that interconnect two data centers and want to make the deployment faster, you just need to run through steps 3 (Easy Fabric for BGWs) and 4 (MSD), and then deploy networks on demand.

Hope you enjoyed, thank you !


40 – DCNM 11.1 and VRF-Lite connection to an external Layer 3 Network

Another great feature supported by DCNM 11 concerns the extension of Layer 3 network connections from a VXLAN EVPN fabric across an external Layer 3 network using VRF-Lite hand-off from the Border leaf node toward the external Edge router.

There are different options to deploy a VRF-Lite connection to the outside of the VXLAN fabric: either a manual deployment, or the auto-configuration process that automatically configures VRF-Lite on a Border Leaf node toward an external Layer 3 network. VRF-Lite is supported on border devices with the role of Border Leaf node, Border Spine node or Border Gateway node. The latter provides both functions: Layer 3 connectivity from a VXLAN EVPN fabric to an external Layer 3 network, as well as the VXLAN EVPN Multi-site service offering hierarchical and integrated VXLAN EVPN to VXLAN EVPN fabric interconnection.

Before deploying the external VRF-Lite connection, an External Fabric must exist (it must be created if it does not), into which the external router concerned (DCI-WAN-1) should be imported.

Manual VRF Lite Configuration

One key reason for configuring the interfaces manually is, for example, when the Layer 3 network is managed by a different organization, internal or external, such as a Layer 3 service operator, so the network team has no control over its configuration. The first demo follows this scenario and illustrates an end-to-end manual configuration of VRF-Lite connections from the Border Leaf node to an external Edge router. This manual mode is the default in the Fabric settings.

Physical topology for VRF-Lite

Because the Border Leaf nodes, BW1 and BW2, are vPC peer devices, it is a best practice to configure one routed interface per device toward the external Layer 3 network. As a result, one physical link per border node connects to the WAN Edge router, as depicted in the figure above and shown in the demo.

For that particular scenario, the external fabric management can be left in Monitor mode (the default mode under Fabric setup), as we don’t need DCNM to push any configuration to the Edge router.

Fabric Monitor Mode

The role of the targeted external router must be set to “Edge Router” in order to extend the VRF-Lite network. As elaborated in the previous demo, the role of “Core Router” is used for the Multi-site deployment.

External Fabric with WAN edge routers

The interface E1/3 of the Edge router (DCI-WAN-1) connecting the Border Leaf node BW1 of the VXLAN Fabric is manually configured with the associated sub-interface for the VRF Tenant-1. Notice the choice for the Dot1Q tag being “2” in order to be aligned with the 1st sub-interface Dot1Q from the pool (see figure below):

interface Ethernet1/3.2
mtu 9216
encapsulation dot1q 2
vrf member Tenant-1
ip address
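The alignment of the Dot1Q tag with the first value from the pool can be illustrated with a tiny allocator. The pool range 2 to 511 used here is an assumption made for the example (check the value in your own Fabric settings), and the function names are hypothetical.

```python
def next_free_tag(pool_start, pool_end, used):
    """Return the lowest Dot1Q tag in the pool that is not yet in use."""
    for tag in range(pool_start, pool_end + 1):
        if tag not in used:
            return tag
    raise RuntimeError("dot1q pool exhausted")


def subinterface_name(parent, tag):
    """Build the sub-interface name for a parent interface and a tag."""
    return f"{parent}.{tag}"


used = set()
first = next_free_tag(2, 511, used)   # assumed pool range, not taken from the post
used.add(first)
second = next_free_tag(2, 511, used)

print(subinterface_name("Ethernet1/3", first))   # matches the tag used above
print(subinterface_name("Ethernet1/3", second))  # what a second VRF would get
```

Picking the same first value on the manually configured Edge router keeps both ends of the link on the same tag.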

Finally, for connectivity and testing purposes, a Loop-back 100 is configured for the VRF Tenant-1 in the external Edge router:

interface loopback100
vrf member Tenant-1
ip address

Under the Fabric builder topology, it is crucial to check and set the role of the Border leaf node to “Border”.

From the VXLAN Fabric setup, the deployment for the VRF Lite is left to “Manual” (default).

Fabric Setup VRF-lite resources

When both the VXLAN and External Fabrics are ready, VRF-Lite can be deployed as illustrated in the video below.

Manual VRF-Lite deployment

Automatic VRF Lite Configuration

The second demo illustrates the automatic configuration of an external network connection using VRF-Lite. We use the Border Leaf node, BW3, from the VXLAN EVPN Fabric 2, which connects to the Edge Router DCI-WAN-2 belonging to an External Fabric.

DCNM allows the network manager to automatically detect and create the external links. For the automatic configuration of the VRF-Lite IFC (Inter-Fabric Connection) stage to succeed, each physical interface should belong to a Border Leaf node set with the role “Border”, and should be connected to a device with the Edge Router role belonging to an External Fabric. Under the VXLAN Fabric settings, the VRF-Lite deployment must be set to “To External Only”. This mode automatically provides an IP subnet range for the VRF-Lite (point-to-point) interfaces.
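The conditions for automatic IFC creation can be expressed as a simple filter over the discovered links. This is a model of the rules stated above, not DCNM code: the class, the role strings and the mode label "ToExternalOnly" are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    role: str         # e.g. "border", "leaf", "edge_router"
    fabric_type: str  # e.g. "easy" or "external"


def ifc_candidates(links, vrf_lite_mode):
    """Return (border, edge router) pairs eligible for an automatic VRF-Lite IFC."""
    if vrf_lite_mode != "ToExternalOnly":
        return []  # in manual mode nothing is created automatically
    eligible = []
    for local, remote in links:
        if (local.role == "border"
                and remote.role == "edge_router"
                and remote.fabric_type == "external"):
            eligible.append((local.name, remote.name))
    return eligible


bw3 = Device("BW3", "border", "easy")
leaf1 = Device("Leaf1", "leaf", "easy")
wan2 = Device("DCI-WAN-2", "edge_router", "external")
links = [(bw3, wan2), (leaf1, wan2)]

print(ifc_candidates(links, "ToExternalOnly"))
print(ifc_candidates(links, "Manual"))
```

Only the link from BW3 qualifies, because Leaf1 does not carry the “Border” role.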

On the External Fabric, we need to allow DCNM to manage the Edge Router. By default, the External Fabric comes in “Monitor”-only mode. It is crucial to uncheck the Monitor mode under the Fabric settings so that DCNM can push the configuration to the Edge router.

As soon as the roles have been assigned to each device (Border Leaf node and Edge router) and both fabrics have been configured accordingly, DCNM 11 automatically detects and provisions the external links (IFC) for each interface of the Border Leaf node(s) that connects to a device with the Edge Router role. DCNM automatically applies the VRF policy setup to those external links using the policy called “ext_fabric_setup”.

The following demo covers two different stages. First, it shows the deployment of VRF-Lite with the automatic configuration of the Border Leaf node. Secondly, it demonstrates how DCNM can manage an external router, using the Interface Control window to configure a Loopback interface for connectivity testing purposes and a sub-interface for the extension of the VRF-Lite network, and how to push configuration as CLI commands directly from DCNM via a “freeform config” template. In this example, we configure BGP for the VRF Tenant-1 network, including the neighbor information.


39 – DCNM 11.1 and VXLAN EVPN Multi-Site Update

Dear Network experts,

It took a while to post this update on DCNM 11.1 due to other priorities, and I admit that’s a pity given all the great features that came with DCNM 11.1. As mentioned in the previous post, DCNM 11.1 brings a lot of great improvements.

Below is a summary of the top LAN fabric enhancements for VXLAN Multi-Site deployment that come with DCNM 11.1 for LAN Fabric. Note that many other new features and improvements come with 11.1. Please feel free to look at the release notes for an exhaustive list of New Features and Enhancements in Cisco DCNM, Release 11.1(1).

Instead of describing all those new features in detail, I think that a series of videos can be a more efficient way to understand those functionalities through specific demos. Please note that I’m not covering all the features of DCNM 11.1 such as analytics, telemetry, compute monitoring, backup and restore, etc.; the main focus of this post is VXLAN EVPN Multi-Site. I will provide a demo on the VRF-Lite connection to the outside of the fabric in a future post.

The following series of demos relies on the same VXLAN EVPN Multi-Site infrastructure that I used for the previous post 38. Hence, you can also compare and see the evolution of DCNM 11.1 for those particular functions.

Please feel free to make any comment or ask any questions if one of the following sections is not clear enough, and I will try to provide you with the needed information.

Brownfield Migration

  • Transition an existing VXLAN fabric management into DCNM.

The network manager can now import an existing VXLAN EVPN Fabric directly into DCNM 11.1 while the DC network continues to provide connectivity to production endpoints. Typically, the original VXLAN fabric was previously created and managed through manual CLI configurations or custom automation scripts. With this new software release, the network manager can import an existing VXLAN network and start managing the fabric through DCNM. All configurations, as well as logical Networks and VRFs, are preserved. The migration happens without interruption. After the migration, the initial configurations of the VXLAN Fabric underlay and overlay networks are managed by DCNM, and new devices, Networks and VRFs can be added afterward seamlessly.

The 1st demo shows how to import an existing Brownfield VXLAN EVPN Fabric while maintaining the Networks and VRFs.


Fabric Builder, fabric devices and fabric underlay networks

  • Configuration Compliance displays side-by-side the current and pending configuration before deployment.
  • vPC support for VXLAN Multi-Site Border Gateways (BGWs) and standalone fabrics has been added.

The second demo shows how to create a new Greenfield VXLAN fabric from the ground up. The process is quite similar to what was illustrated in the previous post for DCNM 11.0; however, several improvements have been added, such as configuration compliance, which lets you compare the current setup and the expected configuration side by side. Also, new roles can be assigned to Spine and Border Leaf nodes, extending the support for additional topologies, including:

  • Multi-Site Border Gateways deployed on the Spine nodes (Border Spines).
  • The new vPC Border Gateway mode, which allows endpoints (single- or dual-homed) to be connected locally at Layer 2 to the same vPC pair of switches where the VXLAN Multi-site is initiated (coexistence of Border Gateway and Compute leaf roles).


External Fabric

The external fabric can be created to include network devices (switches and routers) that offer Layer 3 connectivity services to the fabric (for both north-south and east-west communication).

  • Inter-Fabric Connections (IFCs) can be automatically created to/from those external devices and the fabric border switches.
  • The external fabric devices can have two roles: Edge Router (used to establish the VRF-Lite connectivity required for north-south communication) and Core Router (representing the inter-site network devices that allow east-west VXLAN data-plane communication to be established between sites).

The following demo illustrates how to import an External Layer 3 Network interconnecting the VXLAN EVPN Fabrics. Routers, link connectivity as well as the topology are automatically discovered and imported into DCNM.

Notice that the devices belonging to the external fabric can be managed by DCNM (this applies to Cisco Nexus devices only).


VXLAN EVPN Multi-Site Domain

The fourth video shows how DCNM 11.1 makes it possible to quickly deploy a VXLAN EVPN Multi-Site infrastructure by moving the existing VXLAN Fabrics into an active Multi-Site Domain. The underlay and overlay network configurations for the Inter-Fabric Connection (BGWs <=> Core routers) are generated automatically by DCNM, using either a peer-to-peer link establishment for the underlay and a full-mesh establishment for the overlay network, or a Route Server for eBGP (functioning like a traditional Route Reflector normally used for iBGP peering) located in the Layer 3 core network.
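The difference between the two overlay peering options shows up quickly when counting eBGP sessions. The sketch below assumes that "full mesh" means every BGW peers with every BGW of every other site (and not with the BGWs of its own site), and that each BGW peers once with each Route Server; both are simplifying assumptions made for the example.

```python
def full_mesh_sessions(bgws_per_site):
    """Count inter-site BGW-to-BGW sessions for a full-mesh overlay."""
    total = sum(bgws_per_site)
    all_pairs = total * (total - 1) // 2
    intra_site_pairs = sum(n * (n - 1) // 2 for n in bgws_per_site)
    return all_pairs - intra_site_pairs


def route_server_sessions(bgws_per_site, route_servers=1):
    """Count sessions when every BGW peers only with the Route Server(s)."""
    return sum(bgws_per_site) * route_servers


for sites in ([2, 2], [2, 2, 2], [2, 2, 2, 2]):
    print(sites, full_mesh_sessions(sites), route_server_sessions(sites))
```

With two sites the two options need the same number of sessions, but the full mesh grows much faster as sites are added, which is where a Route Server in the Layer 3 core becomes attractive.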


Overlay Network/VRF provisioning & Interfaces

  • Port-channels, vPCs, sub-interfaces, and loopback interfaces can be added and modified on external fabric devices.
  • Cisco DCNM 11.1(1) specific template enhancements are made for interfaces.
  • Networks and VRFs can be deployed automatically at the Multi-Site Domain level, from site to site, in one single action.

Finally, the last video shows how DCNM can simplify the deployment of Network and VRF extension across both sites, in a few clicks.

Just amazing 🙂

Hope you will enjoy these few demos

Thank you, yves


38 – DCNM 11.0 and VXLAN EVPN Multi-site

Hot networks served chilled, DCNM style


*** Important Note ****

Since I ran the following demos using the first version of DCNM 11, a new maintenance release, 11.1, came out in December 2018. As a consequence, several of the videos below are no longer current. However, I am leaving them available for those who are interested, since some demos are still valid, such as the DCNM installation or the deployment of a greenfield fabric (with some slight differences). For the Multi-site demo, just switch to the next post http://yves-louis.com/DCI/?p=1744


When I started this blog for Data Center Interconnection purposes some time ago, I was not planning to talk about network management tools. Nevertheless, I recently tested DCNM 11.0 to deploy an end-to-end VXLAN EVPN Multi-site architecture, hence, I thought about sharing with you my recent experience with this software engine. What pushed me to publish this post is that I’ve been surprisingly impressed with how efficient and time-saving DCNM 11.0 is in deploying a complex VXLAN EVPN fabric-based infrastructure, including the multi-site interconnection, while greatly reducing the risk of human errors caused by several hundred required CLI commands. Hence, I sought to demonstrate the power of this fabric management tool using a little series of tiny videos, even though I’m usually not a fan of GUI tools.

To cut a long story short, if you are not familiar with DCNM (Data Center Network Manager), it is a software management platform that can run from a vCenter VM, a KVM machine, or a bare-metal server. DCNM is a general-purpose Network Management Software (NMS) and Operational Support Software (OSS) product targeted at NX-OS networking equipment. It focuses on Cisco data center infrastructure, supporting a large set of devices, services, and architecture solutions. It covers multiple types of data center fabrics: from storage fabric management, where it originally comes from, with MDS 9000 and Nexus platforms, to IP-based fabrics, traditional Classical STP networks, legacy FabricPath domains, and more recently Media networking. But more importantly, and relevant to this blog, it also supports modern VXLAN EVPN fabrics. It enables the network admin organization to provision, monitor, and troubleshoot the data center infrastructure (FCAPS*). DCNM relies on multiple pre-configured templates used to automate the deployment of many network services and transports such as routing, switching, VXLAN MP-BGP EVPN, multicast deployment, etc. We can also modify or create our own templates, which makes DCNM very flexible, going beyond a limited hard-coded configuration. And last but not least, DCNM offers a Power-On Auto Provisioning (POAP) service using the fabric bootstrap option, allowing network devices to be automatically provisioned with their intended configuration at power-on, without entering any CLI command. Finally, DCNM contributes to the Day0 (racking, cabling, power-on, though with zero bootstrap configuration), Day1 (POAP, VXLAN EVPN Fabric builder, Underlay, Overlay) and DayN (network and host deployment, firmware updates, monitoring, fault management) operations of network fabrics.

* FCAPS is a network management framework created by the International Organization for Standardization (ISO). FCAPS groups network management operational purposes into five levels: (F) fault-management, (C) configuration, (A) Accounting, (P) performance, and (S) security.

For this article, I will just cover a small section of DCNM 11.0, focusing on VXLAN EVPN and Multi-site deployment. And, because of the Graphical User Interface, I thought that a series of videos could be more visual and easier to understand than a long technical post to read.

However, if you need further technical details about DCNM 11.0, there is a very good User Guide available from this public DCNM repository

I have divided the deployment of a VXLAN EVPN Multi-site infrastructure into five main stages. I’m not certain this is the initial thought of the DCNM Dev team, but this is how I saw it when using DCNM 11.0 for the first time.

      • Installation of DCNM 11.0
      • Building multiple VXLAN EVPN Fabrics
      • Extending the VXLAN Fabrics using VXLAN EVPN Multi-site
      • Deploying Networks and VRFs
      • Attaching Hosts to the VXLAN Fabric

DCNM 11.0 installation

The installation of DCNM 11.0 is very fast; however, it requires two different steps: the OVF template deployment and the DCNM Web Installer. Note that instead of the OVA installation, it is also possible to install the ISO virtual appliance if you wish to install DCNM on KVM or a bare-metal server. Let’s focus on the OVA installation, but if needed, all installation and upgrade details are available at cisco.com.

The first action for installing the OVA is to download the dcnm.ova file from CCO. You can try DCNM 11 for free for a certain period of time. Yes, there is a 30 day full Advanced Feature Trial license!

If you are not yet familiar with VXLAN EVPN, with or without VXLAN EVPN Multi-site, or if you are a bit anxious about risks of configuration errors, you will definitely enjoy this software platform.

A few comments on the installation. Firstly, the OVF template comes with 3 networks:

  • One interface network for the DCNM management itself
  • A second interface for the Out-of-Band management of the fabric
  • A third one that provides In-Band connection to the fabric. It will be used for communicating with the control plane of the fabric, for example to locate the end-points within the VXLAN fabric.

If you are running the vCenter web client, at the end of the OVA installation, a pop-up window will propose to start the DCNM Web Installer automatically. Otherwise, you can start the Web Installer using the management IP address, associated with port 2080 (http://a.b.c.d:2080) as shown in this 1st video below.

Secondly, the DCNM Installer will ask you to choose the installation mode corresponding to the type of network architecture and technology you are planning to deploy. Select the “Easy Fabric” mode for VXLAN EVPN fabric deployment and automation. It will then ask for a few traditional network parameters such as NTP servers, DNS server, and Fully Qualified Host Name, so be prepared with these details.

Subsequently, you need to configure the network interfaces.

  • dcnm-mgmt network:

This network provides connectivity (SSH, SCP, HTTP, HTTPS) to the Cisco DCNM Appliance. Associate this network with the port group that corresponds to the subnet that is associated with the DCNM Management network.

This one should be already configured from the previous OVA installation.

  • enhanced-fabric-mgmt

This network provides enhanced fabric management of Nexus switches. You must associate this network with the port group that corresponds to the management network of the leaf and spine switches.

  • enhanced-fabric-inband

This network provides in-band connection to the fabric. You must associate this network with the port group that corresponds to a fabric in-band connection.

This network will be essentially used to communicate with the Control plane of the VXLAN Fabric in order to exchange crucial endpoint reachability and location information. Actually, we don’t need this interface for the deployment of the VXLAN EVPN fabric per se, but it will be useful later, for example, for a feature called Endpoint Locator. I didn’t configure this network interface for the purpose of this demo.

I will try to add more videos to demonstrate further DCNM features such as Endpoint Locator within a fabric or, in the future, across Multi-site. In addition, VXLAN OAM (NGOAM) provides reachability information for the hosts and the VTEPs in a VXLAN network. NGOAM can be enabled from DCNM to show the physical path and devices taken by the overlay network (L2 and L3 VNI). I will configure the In-Band network interface when appropriate.

The 1st video below is optional; it describes how to install the OVA file and shows the configuration of the DCNM Web Installer. If you are already familiar with the OVF template deployment, you can skip this video.

Video 1: DCNM 11 OVA Installation and WEB Installer

Building multiple VXLAN EVPN Fabrics

This second video demonstrates how DCNM 11.0 can be leveraged to deploy VXLAN EVPN fabrics in the context of multiple sites interconnected using a routed Layer 3 WAN.

The first action is to create the network underlay for each Fabric. There are several profiles available to build the DC network of your choice. The one we need to select for deploying a VXLAN EVPN Fabric is called “Easy Fabric”, which is the mode we previously selected during the Web Installer initialization. Most of the network parameters are already provisioned by default. The only values to enter manually are the AS Number and the Site number for each VXLAN domain. The rest can be left at their defaults. Nonetheless, DCNM offers the flexibility to replace any parameter with the value the network manager prefers. The video demonstrates the usage of particular ranges of IP addresses for the underlay components. Additional options are available if needed, for example, to enable Tenant Routed Multicast in order to optimize the routed c-multicast traffic across the VXLAN EVPN fabric.

Another great added value of DCNM is to automate the deployment of multiple fabric devices using POAP bootstrap configured from the centralized DCNM platform. As a result, there is no need to connect switch by switch through each physical console port in order to configure management access via the CLI. Instead, all the switches automatically discovered by DCNM and identified by the network manager as part of the VXLAN Fabric download their configuration automatically during their power cycle.

Even the respective role of each device (Spine, Leaf, Border Leaf) is automatically recognized, although it can be modified on the fly if needed, as shown in the video for the Multi-site Border Gateway function.

Therefore, the only requirement prior to starting the DCNM configuration is to collect the serial numbers of the switches, so they can be identified, selected, and assigned their corresponding network management identifiers. Another small tip that can help prior to creating the fabric is to draw a high-level topology figure with the physical interfaces used for the uplinks as well as the connections toward the WAN edge devices used for inter-fabric network establishment.

For network managers who want to check the outcome prior to deploying the configuration to the switches, it is possible to “preview” the configuration in CLI format, so they know exactly what configuration is pushed to each switch. When a configuration is deployed, it is also possible to check its status and, in case of error, to see where the configuration stopped and why. In addition, if the configuration on a switch differs from the one DCNM expects (for example, if someone changed a parameter directly using the CLI), DCNM highlights the CLI lines that are unexpected from its point of view.
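The idea of comparing an intended configuration with what is actually on the switch can be sketched in a few lines of Python. This is only an illustration of the concept using the standard `difflib` module; it is not how DCNM implements its comparison, and the function name `config_drift` and the sample configuration snippets are assumptions made for the example.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return unified-diff lines between the intended and the running config."""
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            running.splitlines(),
            fromfile="intended",
            tofile="running",
            lineterm="",
        )
    )


# Illustrative snippets only: someone changed the source interface via the CLI.
intended = """interface nve1
  host-reachability protocol bgp
  source-interface loopback1"""

running = """interface nve1
  host-reachability protocol bgp
  source-interface loopback0"""

drift = config_drift(intended, running)
for line in drift:
    print(line)
```

Lines prefixed with `-` are expected but missing on the device, and lines prefixed with `+` are present on the device but not expected. An empty result means the two configurations match.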

This video also shows how to connect each fabric to the WAN. To achieve this inter-site Layer 3 connectivity, we need to create an External Fabric that represents the L3 WAN/MAN. By electing a seed unit from the L3 network and defining the appropriate number of hops from there, the Layer 3 topology with all existing routers of interest is automatically discovered.

Video 2: Building Multiple VXLAN EVPN Fabrics

Video 3: Fabric interconnection with VXLAN EVPN Multi-site

With the previous video 2, we have seen how the external Layer 3 Core connects both VXLAN EVPN fabrics. This following stage illustrates the initialization of the VXLAN EVPN Multi-site feature in order to interconnect these 2 VXLAN EVPN fabrics over the top of the Layer 3 Core via their respective Border Gateways.

VXLAN EVPN Multi-Site uses the same physical border gateways to terminate and initiate the overlay network inside the fabric, as well as to interconnect distant VXLAN EVPN domains using integrated VXLAN EVPN based network overlay.

VXLAN EVPN Multi-Site is a hierarchical solution that relies on networking design principles to offer an efficient way of extending Layer 2 and Layer 3 segments across multiple EVPN-based Fabrics. For additional reading on VXLAN EVPN Multi-site, I recommend the previous post (37) as well as the public white paper that covers VXLAN EVPN Multi-site in depth.

Currently, DCNM 11.0(1) requires you to configure the underlay and overlay network establishments between each device (from each BGW to its distant peers). However, the development engineers are already working on a new version that further simplifies the automation of the Multi-site creation. I will post a new video as soon as I can test it.

Video 3: Fabrics interconnection with VXLAN EVPN Multi-site

Video 4: Deploying Network and VRF

This phase shows the creation of the Layer 2 and Layer 3 network segmentation and how it is pushed to the Border Gateways. The order of this stage might be different, and the creation of Networks and VRFs can be initiated after the attachment of the compute nodes. The concept of building a network overlay is quite simple. We can keep the proposed VXLAN Network ID, which is consumed from the DCNM pool. We can provide a new name for that network or leave it as is. What is required is to map the relevant VLAN ID to that specific Network. Finally, we need to associate the network with its Layer 3 segment (VRF). The VRF can either be selected from a list of previously created VRFs or be built from the same window. Then, the default gateway (SVI) for that particular network can be configured under the Network profile. Preparing the whole configuration for deployment takes a few seconds.
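The relationship between these objects (a VNI drawn from a pool, a VLAN mapped to it, a VRF, and a gateway address) can be sketched as a small data model. This is a conceptual sketch only, not the DCNM API: the class names, the default naming scheme, the pool start value of 30000, and the sample addresses are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Network:
    name: str      # free-form network name
    vni: int       # Layer 2 VNI consumed from the pool
    vlan: int      # VLAN ID mapped to this network
    vrf: str       # Layer 3 segment the network belongs to
    gateway: str   # default gateway (SVI) address


class VniPool:
    """Hands out sequential VNIs; the start value is a hypothetical choice."""

    def __init__(self, start: int) -> None:
        self._next = count(start)

    def allocate(self) -> int:
        return next(self._next)


def create_network(pool: VniPool, vlan: int, vrf: str, gateway: str,
                   name: Optional[str] = None) -> Network:
    # Validate the VLAN before consuming a VNI from the pool.
    if not 1 <= vlan <= 4094:
        raise ValueError(f"VLAN {vlan} is outside the usable range 1-4094")
    vni = pool.allocate()
    # Keep the proposed name unless the operator supplies one.
    return Network(name or f"Network_{vni}", vni, vlan, vrf, gateway)


pool = VniPool(start=30000)
web = create_network(pool, vlan=100, vrf="Tenant-1", gateway="10.1.100.1/24")
app = create_network(pool, vlan=101, vrf="Tenant-1", gateway="10.1.101.1/24",
                     name="App-Tier")
print(web)
print(app)
```

Each call consumes the next identifier from the pool, so two networks never share a VNI even when they belong to the same VRF.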

When all desired Networks have been created, from the topology view under the control network tab, we must select the devices with the same role and deploy the Networks of choice. In this video, I only selected the Border Gateway devices. The deployment of networks for the compute leaf nodes is discussed in the next and last video, after the vPC port-channel has been created.

Notice that it is always possible to preview the configuration before it is pushed to the targeted devices, if you wish.

Video 4: Deploying Networks and VRF’s

Video 5: Host Interface deployment and Endpoint discovery

This last video concludes the demonstration of deploying an End-to-End VXLAN EVPN Multi-site infrastructure, with ultimately the configuration of the vPC port-channels where hosts are locally attached to the fabric and how to allow the concerned networks into those vPC port-channels.

It also demonstrates how DCNM 11.0 can integrate a VMware topology onto its dynamic topology views, discovering automatically the vCenter that controls the host-based networking on the fabric to show how virtual machines, hosts, and virtual switches are interconnected.

In this video, when the configuration of the port-channels is completed, the intra-subnet and inter-subnet communication is verified using a series of tests, inside the fabric and across the Multi-site extension.

Video 5: Host Interface deployment, vCenter discovery and Verification


After completing the physical deployment of the fabric (racking, cabling, powering all the gear), and after collecting the serial numbers and drawing the uplink interfaces with their respective identifiers into a high-level topology diagram, it takes only a couple of hours to build the whole multi-site VXLAN EVPN infrastructure, ready to bridge and to route traffic across multiple VXLAN EVPN Fabrics in a multi-tenancy environment.

I only recorded each video once because all first attempts to use DCNM to deploy the VXLAN EVPN multi-site were a success. The communication from site to site worked immediately. BTW, I only found the DCNM 11.0 User Guide to add the pointer in this post after I ran all the DCNM configurations. It’s a rumor, please don’t repeat, but the Engineers working on DCNM said that several crucial improvements will come with the major release (MR) coming soon. I am very eager to test what they have done :). I will demo these improvements here as soon as I can get a solid beta version to try.

I’ll keep you posted. However, in the meantime, please feel free to bring your comments.

Thank you for reading and watching


37 – DCI is dead, long live to DCI

Dear readers,

Since I wrote this article last spring discussing future integrated DCI features, VXLAN EVPN Multi-site has become real. It has been available since NX-OS 7.0(3)I7(1).

There is also a new white paper that discusses VXLAN EVPN Multi-site design and deployment in more depth.

Good read!

++++++++++++++++++++++  Updated Jan 2018  +++++++++++++++++++++++++

DCI is dead, long live to DCI

Some may find the title a bit strange, but, actually, it’s not 100% wrong. It just depends on what the acronym “DCI” stands for. And, actually, a new definition for DCI came out recently with VXLAN EVPN Multi-site, disrupting the way we used to interconnect multiple Data Centres together.

For many years, the DCI acronym has conventionally stood for Data Centre Interconnect.

The “Data Centre Interconnect” naming convention essentially describes solutions for interconnecting traditional DC networks, which have been used for many years. I am not denigrating any traditional DCI solutions per se, but Data Centre networking is evolving very quickly from the traditional hierarchical network architecture to the emerging VXLAN-based fabric model, which is gaining momentum as enterprises adopt it to optimize modern applications and computing resources, save costs, and gain operational benefits. Consequently, although these independent DCI technologies may continue to be deployed for extending Layer 2 networks between traditional DC networks, they will tend to be replaced by modern solutions allowing a smooth migration to VXLAN EVPN-based fabrics. For the interconnection of modern VXLAN EVPN standalone (1) Fabrics, a new solution called “VXLAN EVPN Multi-site”, which integrates the extension of the Layer 2 and Layer 3 services across multiple sites in a single device, has been created for a long life. As a consequence, DCI also stands for Data Centre Integration when interconnecting multiple fabrics using VXLAN EVPN Multi-site, hence this strange title.

(1) The term “standalone Fabric” describes a Network-based Fabric. The concepts of Multi-Pod, Multi-Fabric and Multi-site discussed in this article do not concern the enhanced SDN-based Fabric, also known as ACI Fabric. Although both rely on VXLAN as the network overlay transport within the Fabric, they use different solutions for interconnecting multiple Fabrics together.

ACI already offers a more sophisticated and solid integration of Layer 2 and Layer 3 services with its Multi-Pod solution since ACI 2.0 as well as ACI Multi-site since ACI 3.0 (Q3CY2017).

A bit of history:

Cisco brought the MP-BGP EVPN based control plane for VXLAN at the beginning of 2015. After a long process of qualification and testing in 2015, this article was written to describe the VXLAN EVPN Multi-Pod approach. At the time, it offered the best design considerations and recommendations for interconnecting multiple VXLAN EVPN Pods. Afterwards, we ran a similar process for validating the architecture for interconnecting multiple independent VXLAN EVPN Fabrics, which resulted in post 36 and its sub-sections. This second set of technical information and design considerations was mainly written to clarify the network services and network design required for interconnecting multiple VXLAN EVPN Fabrics through an independent DCI technology, in conjunction with the Layer 3 Anycast Gateway (1) functions.

(1) The Layer 3 Anycast gateway is distributed among all compute and service leaf nodes and across all VXLAN EVPN Fabrics offering the default gateway for any endpoints at the closest Top of Rack.

Although the sturdiness of this VXLAN EVPN Multi-Fabric solution is fully achieved, this methodical validation demonstrates the critical software and hardware components required when interconnecting VXLAN EVPN Fabrics with the Layer 3 Anycast Gateway, with manual connectivity for independent and dedicated Layer 2 and Layer 3 DCI services, making the whole traditional connectivity a bit more complex than usual.

To summarise, as of August 2017, we had 2 choices when interconnecting multiple VXLAN EVPN Pods or Fabrics:


VXLAN EVPN Multi-Pod

The first option aims to create a single logical VXLAN EVPN Fabric in which multiple Pods are dispersed to different locations using a Layer 3 underlay to interconnect the Pods. Also known as stretched Fabric, this option is more commonly called VXLAN EVPN Multi-Pod Fabric. This model goes without a DCI solution: from a VXLAN overlay point of view, there is no demarcation between one Pod and another. The key benefit of this design is that the same architecture model can be used to interconnect multiple Pods within a building or campus area as well as across metropolitan areas.


Figure 1: VXLAN EVPN Multi-PoD

Although this solution is simple and easy to deploy thanks to the flexibility of VXLAN EVPN, it lacks sturdiness because the same Data-Plane and Control-Plane are shared across all geographically dispersed Pods.

With the VXLAN Multi-Pod methodology, the Underlay and Overlay Control-Planes have been optimised (separating the Control-Planes for the Overlay network (iBGP ASN) by using eBGP for the external site network) to offer a more robust End-to-End architecture than a basic stretched Fabric. Even so, the same VXLAN EVPN Fabric is overextended across different locations. Indeed, the reachability information for all VXLAN Tunnel Endpoints (VTEPs) as well as that of endpoints is extended across all VXLAN EVPN Pods (throughout the whole VXLAN Domain).

Although this VXLAN EVPN Multi-Pod solution is simple to deploy, the failure domain remains extended across all the Pods, and the scalability remains the same as in a single VXLAN fabric.

VXLAN EVPN Multi-Fabric

The second option, which interconnects multiple VXLAN EVPN Fabrics in a more solid fashion, is to insert an autonomous DCI solution, extending Layer 2 and Layer 3 connectivity services between independent VXLAN Fabrics while maintaining segmentation for multi-tenancy purposes. Compared to the VXLAN EVPN Multi-Pod Fabric design, the VXLAN EVPN Multi-Fabric architecture offers a higher level of separation for each of the Data Centre Fabrics, from a Data-Plane and Control-Plane point of view. The reachability information for all VXLAN Tunnel Endpoints (VTEPs) is contained within each single VXLAN EVPN Fabric. As a result, VXLAN tunnels are initiated and terminated inside each Data Centre Fabric (same location).

The standalone VXLAN EVPN Multi-Fabric model also offers higher scalability than the VXLAN Multi-Pod approach, supporting a larger total number of leaf nodes and endpoints across all Fabrics.

Figure 2: VXLAN EVPN Multi-Fabric


However, the sensitive aspect of this solution is that it requires complex and manual configurations for a dedicated Layer 2 and Layer 3 extension between each VXLAN EVPN Fabric. In addition, the original protocols responsible for Layer 2 and Layer 3 segmentation must be handed off at the egress side of the border leaf nodes (respectively, Dot1Q handoff and VRF-Lite handoff). Last but not least, in addition to the dual-box design for dedicated Layer 2 and Layer 3 DCI services, endpoints cannot be locally attached to them.

Although the additional DCI layer has been a key added-value, maintaining the failure domain inside each Data Centre with a physical boundary between DC and DCI components, it has been a choice balanced between sturdiness and CAPEX, sometimes impacting the final choice to deploy a solid DCI solution.

In part 5 of the same article, Inbound and Outbound Path Optimisation solutions that can be leveraged for different Data Centre deployments have also been described. Although the VXLAN Multi-Pod and Multi-Fabric solutions may soon become “outdated” for interconnecting modern VXLAN-based Fabrics (see next section), the technical content on path optimisation is still valid for VXLAN Multi-Site, or simply for better understanding the full learning and distribution processes for endpoint reachability information within a VXLAN EVPN Fabric. Hence, don’t ignore them yet, as they may still be very useful for education purposes.

Although this VXLAN EVPN Multi-Fabric solution improves the strength of the end-to-end solution, there is no integration per se between each VXLAN fabric and the DCI, making its deployment complex to operate and manage.

VXLAN EVPN Multi-site

That concludes the short refresher on the traditional solutions for interconnecting geographically dispersed modern DC Fabrics. While some network vendors (hardware and software/virtual platforms) are trying to offer a suboptimal variant of the options discussed previously, these two traditional solutions might soon become obsolete when interconnecting standalone VXLAN EVPN-based Fabrics. Indeed, a new architecture called VXLAN EVPN Multi-Site has been available since August 2017 with the NX-OS software release 7.0(3)I7(1). A related IETF draft is available for further details (https://tools.ietf.org/html/draft-sharma-Multi-Site-evpn) if needed.

For guidelines, please visit the Cisco Nexus 9000 Series NX-OS VXLAN Configuration Guide, Release 7.x

The concept of VXLAN EVPN Multi-Site integrates the Layer 2 and Layer 3 services so that they can be extended within the same box. This model relies on Layer 2 VNI stitching. A new function known as Border Gateway (BGW) is responsible for that integrated extension. As a result, an Intra-Site VXLAN tunnel terminates at a Border Gateway which, from the same VTEP, re-initiates a new VXLAN tunnel toward the remote sites (Inter-Site). The extension of the L2 and L3 services is transported using the same protocol. This innovation combines the simplicity and flexibility of the VXLAN Multi-Pod with the sturdiness of the VXLAN Multi-Fabric.

The role of Border Gateway is a unique function offered by the hardware at line rate thanks to Cisco ASIC technology (Cloud Scale ASIC). Yet another feature that will never come with a whitebox with Merchant Silicon.

It is not mandatory that the non-Border Gateway network devices within the VXLAN EVPN Fabric support the function of Multi-Site, neither in hardware nor in software.  As a result, any VXLAN EVPN Fabric based on the first-generation of N9k in production today will be able to deploy Multi-Site by simply adding the pair of required devices(1) that support the Border Gateway function. This will both integrate and extend the Layer 2 and Layer 3 services.

(1) VXLAN Multi-Site is supported on –EX and –FX Nexus 9k series platforms (ToR and Line-cards) as well as on the Nexus 9364C device.

In addition, this new approach to the interconnection of multiple sites brings several added values.

VXLAN EVPN Multi-Site offers an “All-In-One” box solution with, if desired, the possibility to locally attach endpoints for other purposes (computes, firewalls, load balancers, WAN edge routers, etc).


For VXLAN EVPN Multi-PoD, in order to attach endpoints to the transit node, it is required that the hardware support the bud-node mode. A bud node is a device that is a VXLAN VTEP device and at the same time an IP transit device for the same multicast group used for VXLAN VNIs.

For VXLAN EVPN Multi-Fabric, a pair of additional boxes must be dedicated for DCI purposes only, with no SVI and no attached endpoint.

This new integrated solution offers a full separation of Data-Plane and Control-Plane per site, maintaining the failure domain to its smallest diameter.

VXLAN EVPN Multi-site overview

VXLAN EVPN Multi-site Border Gateway

The stitching of the Layer 2 and Layer 3 VNIs together (intra-Fabric tunnel ⇔ inter-Fabric tunnel) happens in the same Border node (Leaf or Spine). This Border node role is known as Border Gateway or BGW.

Figure 3: VXLAN EVPN Multi-Site Overview


Multiple designs for VXLAN EVPN Multi-site

The approach of VXLAN EVPN Multi-Site is to offer a level of simplicity and flexibility never-before-reached. Indeed, the Border Gateway function can be initiated from existing vPC-based Border leaf nodes, with a maximum of the two usual vPC peer devices, offering seamless migration from traditional to Multi-Site architectures. This vPC-based option allows for locally attached dual-homed endpoints at Layer 2.

However, it is also possible to deploy up to four independent border leaf nodes per site, relying on BGP to deliver the function of Border Gateway in a cluster fashion, all active with the same VIP address. With this BGP-based option, Layer 3 services (FW, router, SLB, etc.) can be leveraged on the same switches.

Finally, it is also possible to initiate the Multi-Site function with its Border Gateway role directly from the Spine layer (up to 4 Spine nodes), as depicted in Figure 4.


Figure 4: VXLAN EVPN Multi-Site Flexibility

When the Border Gateway is deployed, however the cluster is built – with 2 or 4 nodes – it will share the same VIP (Virtual IP address) across the associated members. The same Border Gateway VIP is used for the virtual networks initiated inside the VXLAN Fabric as well as for the VXLAN tunnels established with the remote sites.

The architecture of VXLAN EVPN Multi-Site is transport agnostic. It can be established between two sites connected with direct fibre links in a back-to-back fashion, or across an external native Layer 3 WAN/MAN or Layer 3 VPN Core (MPLS VPN), interconnecting multiple Data Centres across any distances. This new solution offers higher reliability and resiliency for the end-to-end design with all redundant devices delivering fast convergence in case of link or device failure.

From an operational, admin and network management point of view, VXLAN OAM tools can be leveraged to understand the physical path taken by the VXLAN tunnel, inside and between the Fabrics, to provide information on path reachability, Overlay to Underlay correlation, interface load and error statistics in the path, latency, and so on. This is a data lake of information which is usually “hidden” in a traditional network overlay deployment.

VXLAN EVPN Multi-site – Layer 2 Tunnel Stitching

Because the VNI to VNI stitching happens in the same VTEP, there is no requirement to hand off VRF-Lite or Dot1Q frames. Hence, theoretically, Multi-Site allows going beyond the 4K Layer 2 segment limit imposed by traditional DCI solutions.
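The 4K figure comes from the size of the identifier fields. The arithmetic below is a quick worked check: a Dot1Q handoff is bound by the 12-bit VLAN ID of IEEE 802.1Q, whereas VXLAN carries a 24-bit VNI (RFC 7348). The constant names are chosen for this example, and real platforms have their own scale limits that sit well below the theoretical VNI space.

```python
VLAN_ID_BITS = 12   # IEEE 802.1Q VLAN ID field width
VNI_BITS = 24       # VXLAN Network Identifier field width (RFC 7348)

vlan_values = 2 ** VLAN_ID_BITS   # all possible VLAN ID values
usable_vlans = vlan_values - 2    # VLAN IDs 0 and 4095 are reserved
vni_values = 2 ** VNI_BITS        # all possible VNI values

print(f"Usable VLAN IDs over a Dot1Q handoff: {usable_vlans}")
print(f"Possible VXLAN VNIs:                  {vni_values}")
print(f"Ratio of identifier spaces:           {vni_values // vlan_values}x")
```

Running it shows 4094 usable VLAN IDs against 16,777,216 possible VNIs, an identifier space 4096 times larger.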


Figure 5: Traditional DCI design with VRF-lite and Dot1Q Hand-off


Figure 6: VXLAN EVPN Multi-Site with L2 VNI stitching

As depicted in Figure 6, the border leaf offers multiple functions, including local endpoint attachment and Border Gateway for integrated Layer 2 and Layer 3 service extension.

vPC must be used for any locally dual-homed endpoints at layer 2. This picture depicts a BGP-based deployment of the Border Gateways with Layer 3 connections toward an external Layer 3 network.

VXLAN EVPN Multi-site with Legacy Data Centers

The VXLAN EVPN Multi-Site solution is not limited to interconnecting modern Greenfield Data Centres. It offers two additional flavours for seamless network connectivity with Brownfield Data Centres.

The first use-case is with a legacy DC alone devoid of any DCI service. For this option, if the enterprise wants to build two new Greenfield DCs and keep the original one, with all sites interconnected together, it is possible to deploy a pair of Border Gateways, dual-homed toward the aggregation layer. This function of Border Gateway mapping Dot1Q to VXLAN is also known as Pseudo BGW. The VLANs to be extended are extended using VXLAN EVPN toward the brand-new VXLAN Multi-Site, as shown in Figure 7.


Figure 7: Integration with traditional Data Centre

The second use-case concerns traditional Data Centre networks that need to be interconnected and that are planning to migrate smoothly to VXLAN EVPN fabrics. With that migration in mind, it is possible to leverage the Pseudo BGW function with VXLAN EVPN Multi-Site, as shown in Figure 8.


Figure 8: Integration with traditional DC Network

Figure 8 illustrates two legacy Data Centre infrastructures interconnected using the Pseudo BGW feature. This is an alternative to OTV when there is a migration planned to a VXLAN EVPN-based fabric. With this option, a pair of Pseudo-BGWs is inserted in each legacy site to extend Layer 2 and Layer 3 connectivity between sites, and these devices are reused afterward for full Multi-site functions. It allows the legacy networks to be phased out slowly and replaced with VXLAN EVPN fabrics. It also allows integration with the existing legacy default gateway (HSRP, VRRP, etc.), with a smooth migration to a more efficient and integrated Layer 3 Anycast Gateway offering the default gateway for all endpoints.

Finally, this option offers a more solid and integrated solution than a simple VXLAN EVPN-based solution, with features natively embedded in the Border Gateway functions in order to improve the sturdiness from end to end, such as Split Horizon, Designated Forwarder for BUM traffic, Traffic Control of BUM (disable or rate-based), and Control Plane with Selective Advertisement for L2 and L3 segments.

VXLAN EVPN Multi-site with outside Layer 3 connectivity

Connecting the VXLAN EVPN fabric, in conjunction with Multi-site, to the outside world.

Every DC fabric needs external connectivity to the campus or core network (WAN/MAN).

The VXLAN EVPN fabric can be connected across Layer 3 boundaries using MPLS L3VPN, Virtual Routing and Forwarding (VRF) IP routing (VRF-lite), or Locator/ID Separation Protocol (LISP).

Integration can occur for the Layer 3 segmentation toward a WAN/MAN, native or segmented Layer 3 network. If required, each Tenant from the VXLAN EVPN Fabric can connect to a specific L3 VPN network, as achieved previously with the Multi-Pod and Multi-Fabric models. In this case, in addition to the Multi-Site service, the VRF-Lite instances are handed off at the Border Leaf nodes to be bound to their external segmentations, while the L3 segmentation is also maintained with the remote Fabric, as illustrated in Figure 9.


Figure 9: VXLAN EVPN Multi-site and VRF-Lite

In deployments with a very large number of extended VRF instances, the number of eBGP sessions or IGP neighbors can cause scalability problems and configuration complexity and usually require a two-box solution.

It is now possible to merge the Layer 3 segmentation with the external Layer 3 connectivity (Provider Edge Router) with a new single-device solution.

Figure 10 depicts a full integration of the Layer 3 segmentation provided by the VXLAN EVPN fabric, extended for external connectivity from the BGW toward an MPLS L3VPN network (or any type of Layer 3 network, including LISP). This model requires that the WAN edge device supports the Border Provider Edge (BorderPE) feature. This feature allows the WAN edge device to terminate the VXLAN tunnel (VTEP) in order to stitch the L3 VNI from the VXLAN fabric to a Layer 3 VPN network. This feature is supported on the Nexus 7k/M3, ASR1k, and ASR9k.


Figure 10: VXLAN EVPN Multi-site integrated Border PE

VXLAN EVPN Multi-site and BUM traffic

Finally, the technology used to replicate BUM traffic is independent per site. One site can run PIM ASM Multicast to replicate the BUM traffic inside its VXLAN domain regardless of the replication mode used on other sites. The BUM traffic is carried across the Multi-Site overlay using Ingress Replication, while each site can run whatever the network team prefers, Multicast or Ingress Replication. Although Ingress Replication allows BUM transport across any WAN network, VXLAN Multi-Site will support Multicast replication in a later software release.

To prevent any risk of loops with BUM traffic, a Designated Forwarder (DF) is elected dynamically on a per Layer 2 segment basis. The DF uses its physical IP address as a unique destination within the BGW cluster.
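To make the idea of a per-segment election concrete, here is a minimal sketch loosely modeled on the default modulo-based DF election described in RFC 7432 (order the candidate addresses, then pick the one at index "segment modulo number of candidates"). Whether NX-OS Border Gateways use exactly this rule is not stated in this post, so treat it as an assumption. The function name, the use of the VNI as the segment identifier, and the sample addresses are hypothetical.

```python
import ipaddress


def elect_df(bgw_ips: list[str], segment_id: int) -> str:
    """Pick one Border Gateway per Layer 2 segment (modulo-based sketch)."""
    if not bgw_ips:
        raise ValueError("at least one Border Gateway is required")
    # Sort numerically so that every BGW computes the same ordering.
    ordered = sorted(bgw_ips, key=ipaddress.ip_address)
    return ordered[segment_id % len(ordered)]


# Hypothetical physical IP addresses of three BGWs in one site.
bgws = ["10.0.0.12", "10.0.0.11", "10.0.0.13"]

for vni in (30000, 30001, 30002):
    print(vni, "->", elect_df(bgws, vni))
```

Because every member derives the result from the same inputs, they all agree on the forwarder without extra signalling, and different segments naturally spread across the cluster.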


Figure 11: Independent Replication mode

VXLAN EVPN Multi-site and Rate Limiters

The decapsulation and encapsulation of the respective tunnels happen inside the same VTEP. This makes it possible to enable traffic policers per tenant, controlling the storm rate independently for Broadcast, Unknown Unicast, or Multicast traffic, as well as other policers limiting the bandwidth per tunnel. Therefore, failure containment is improved through the granular control of BUM traffic, protecting remote sites from any broadcast storms.
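As a rough illustration of independent policing per traffic class, the sketch below uses a classic token bucket, one instance per BUM class. It is a generic textbook policer, not the hardware implementation on the Border Gateway. The class name, the rates, and the burst sizes are hypothetical values picked for the example, and time is passed in explicitly to keep the behaviour deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float          # refill rate, in bytes per second
    burst: float         # bucket depth, in bytes
    tokens: float = 0.0  # current credit, set to a full bucket at creation
    last: float = 0.0    # timestamp of the previous decision, in seconds

    def __post_init__(self) -> None:
        self.tokens = self.burst

    def allow(self, size: int, now: float) -> bool:
        """Return True if a frame of `size` bytes conforms at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # out of profile: the frame would be dropped


# One independent policer per BUM class (hypothetical values).
policers = {
    "broadcast": TokenBucket(rate=1000.0, burst=1500.0),
    "unknown-unicast": TokenBucket(rate=2000.0, burst=3000.0),
    "multicast": TokenBucket(rate=5000.0, burst=9000.0),
}

print(policers["broadcast"].allow(1500, now=0.0))  # conforms, bucket is full
print(policers["broadcast"].allow(100, now=0.0))   # dropped, no credit left
print(policers["multicast"].allow(100, now=0.0))   # unaffected by broadcast
```

A storm in one class exhausts only its own bucket, which is the property that lets the other classes keep flowing.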


Figure 12: VXLAN EVPN Multi-Site and Broadcast Storm control (BUM Control)

DC Interconnection versus DC Integration

With VXLAN EVPN Multi-Site coming shortly, we will consider 3 main use cases for interconnecting multiple DC networks:

  • Interconnecting Legacy to Legacy DC

For Brownfield to Brownfield DC Interconnection, if OTV is already in place for DC interconnection requirements, then the same DCI approach will likely continue to be used. There is no technical reason to change to a different DCI solution if the OTV Edge devices are already available. Indeed, OTV offers an easy way to add a new DC into its Layer 2 Overlay network.

If there is a new project to interconnect multiple Brownfield DCs (Classical Ethernet-based DCs), it is worth understanding whether there is a requirement to migrate the existing legacy network to a modern fabric infrastructure in the near future, as depicted in Figure 8.

Nonetheless, we never want to extend all VLANs from site to site. Hence, even if it is not often covered in the DCI documentation, it is important to take the inter-site routed traffic into consideration. The new Border Gateway function integrated with VXLAN EVPN Multi-site offers a great alternative for interconnecting Data Centres for Layer 2 and Layer 3 requirements, maintaining the network transport independence of each DC while reducing the failure domain to a single site with a built-in, very granular storm controller for Broadcast, Unknown Unicast, or Multicast traffic.

Although OTV 2.5 relies on VXLAN as the encapsulation method, VXLAN continues to evolve with the necessary embedded intelligence leveraging the open standards MP-BGP EVPN control plane to provide a viable, efficient and solid DCI solution comparable with OTV. VXLAN EVPN Multi-site architecture represents this evolution. It offers a single integrated solution for interconnecting multiple traditional Data Centres with embedded and distributed Layer 3 Anycast gateway from the same Border Gateway devices, extending the classical Ethernet frame and routed packets via its Overlay network.

  • Interconnecting Legacy DC to modern Fabric

Figure 13: Brownfield to Greenfield Data Centre Interconnection


For the second use case, as depicted in Figure 13, VXLAN EVPN Multi-site can be leveraged to interconnect the Brownfield DC by adding a new pair of N93xx-EX (or FX) series switches running VXLAN Multi-site with the Pseudo Border Gateway function for controlling the VLAN to VNI transport. On the remote site, the Greenfield DC leverages the Multi-site feature that integrates the Layer 2 and Layer 3 services toward the Brownfield DC, with all functions offered in a single-box model.

  • Interconnecting modern Fabrics

For Standalone VXLAN EVPN fabrics, the solution is VXLAN EVPN Multi-Site. There is no technical reason to add a dedicated DCI solution inside a modern fabric when Layer 2 and Layer 3 services are integrated in the same Border Leaf or Border Spine layer, reducing the failure domain to its smallest diameter, that is to say, the link between the host and its Top of Rack switch.

Naming Convention: Pod, Multi-Pod, Multi-Fabric, Multi-Site

The same naming convention for Pod, Multi-Pod, Multi-Fabric and Multi-Site has been elaborated for both VXLAN EVPN Standalone and ACI.

Nonetheless, it is important to specify that the three models discussed in the previous sections differ when comparing VXLAN Standalone with ACI Fabric.

VXLAN Pod - Availability Zone

In a VXLAN context deployment, a Pod is seen as a “Fault Domain” area.

The Pod is usually represented by a Leaf-Spine Network Fabric sharing a common Control-plane (MP-BGP) across its whole physical connectivity plant.

A Fabric represents the VXLAN Domain deployed in the same location (same building or Campus) or Geographically Dispersed (Metro distances) sharing a common Control-plane and Data-plane. A single or multiple Pods form therefore a single VXLAN fabric or VXLAN domain.


VXLAN Multi-Pod – Single Availability Zone

In VXLAN Standalone, the Multi-Pod architecture is organized with multiple Pods sharing a common Control-plane (MP-BGP) and Data-plane across the whole physical connectivity plant of the same VXLAN Domain. The VXLAN Multi-Pod represents a single Availability Zone (VXLAN Fabric) within a single Region. It is strongly recommended to limit the distances for a single Availability Zone to a Metro area, but technically nothing prevents deploying a VXLAN Multi-Pod across intercontinental distances. The same VXLAN domain is stretched across multiple locations. The term Multi-Pod differs from the VXLAN EVPN stretched fabric because the robustness of the whole solution has been improved, as discussed above.

ACI already offers a more sophisticated and solid integration of Layer 2 and Layer 3 services with its Multi-Pod solution since CY2016.

VXLAN Multi-Fabric - Multiple Availability Zones

A VXLAN Multi-Fabric deployment offers a Multiple Availability Zone design (multiple VXLAN Fabrics), with one Availability Zone per VXLAN Domain, each managed by its own isolated Control-plane and interconnected through independent Layer 2 and Layer 3 DCI services. There is no concept of Regions per se, as each VXLAN Fabric runs autonomously.

VXLAN Multi-site - Multiple Availability Zones

VXLAN Multi-Site offers the same concept as Multi-Fabric, with independent VXLAN Fabrics, but with the extended Layer 2 and Layer 3 services integrated and fully controlled. As a result, multiple autonomous VXLAN domains with integrated L2 and L3 services form the VXLAN Region.

36 – New White Paper that describes OTV to interconnect Multiple VXLAN EVPN Fabrics

Good day,

While this long series of sub-posts is being turned into a white paper, a new document written by Lukas Krattiger is available on CCO. It covers the Layer 2 and Layer 3 interconnection of multiple VXLAN fabrics. I feel this document is complementary to this series of Post 36, as it describes from a different angle (using a human language approach) how to achieve a VXLAN EVPN Multi-Fabric design in conjunction with OTV.

Optimizing Layer 2 DCI with OTV between Multiple VXLAN EVPN Fabrics (Multifabric) White Paper

Good reading, yves

36 – VXLAN EVPN Multi-Fabrics – Path Optimisation (part 5)

Ingress/Egress Traffic Path Optimization

In the VXLAN Multi-fabric design discussed in this post, each data center normally represents a separate BGP autonomous system (AS) and is assigned a unique BGP autonomous system number (ASN).

Three types of BGP peering are usually established as part of the VXLAN Multi-fabric solution:

  • MP internal BGP (MP-iBGP) EVPN peering sessions are established in each VXLAN EVPN fabric between all the deployed leaf nodes. As previously discussed, EVPN is the intrafabric control plane used to exchange reachability information for all the endpoints connected to the fabric and for external destinations.
  • Layer 3 peering sessions are established between the border nodes of separate fabrics to exchange IP reachability information (host routes) for the endpoints connected to the different VXLAN fabrics and the IP subnets that are not stretched (east-west communication). Often, a dedicated Layer 3 DCI network connection is used for this purpose. In a multitenant VXLAN fabric deployment, a separate Layer 3 logical connection is required for each VRF instance defined in the fabric (VRF-Lite model). Although either eBGP or IGP routing protocols can be used to establish interfabric Layer 3 connectivity, the eBGP scenario is the most common and is the one discussed in this post.
  • Per-VRF eBGP peering sessions are frequently used for WAN connectivity to exchange IP reachability information with the external Layer 3 network domain (north-south communication). A common best practice and the recommended approach is to deploy eBGP for this purpose.

When extending IP subnets across separate VXLAN fabrics, you need to consider the paths used for ingress and egress communication. In particular, you should avoid establishing an asymmetric path such as the one shown in Figure 1 (virtual machines VM1 and VM3 are part of the same extended IP subnet that is advertised in the MAN and WAN), which would cause communication failure when independent stateful network services are deployed across sites.

Figure 1: Undesired Asymmetric Flow Establishment

The solution presented in this article avoids establishing asymmetric traffic paths by following three main design principles, illustrated in Figure 2:

  • Traffic originating from the external Layer 3 domain and destined for endpoints connected to a specific VXLAN fabric should always come inbound through the site’s local border nodes (ingress traffic-path optimization).
  • Traffic originating from data center endpoints and destined for the external Layer 3 domain should always prefer the outbound path through the local border nodes (outbound traffic-path optimization).
  • All east-west routed communication between endpoints that are part of different data center sites should always prefer the dedicated Layer 3 DCI connection if it is available.

Figure 2: Ingress/Egress Traffic Path Optimization and East-West Communication

Ingress Traffic Path Optimization

This article proposes the use of host-route advertisement from the border nodes to the local WAN edge routers to influence and optimize ingress traffic flows. After the WAN edge routers receive the host routes, two approaches are possible:

  • You can inject specific host-route information from each VXLAN fabric into the MAN or WAN so that incoming traffic can be optimally steered to the destination. This method is usually applicable when a Layer 3 VPN hand-off is deployed to the WAN, so that host routes can be announced in each specific Layer 3 VPN service. Before adopting this approach, be sure to assess the scalability implications of the solution for consumption of local resources on the data center WAN edge and remote routers and within the WAN provider network.
  • In designs in which advertisement of host routes in the MAN or WAN is not desirable because of scalability concerns or is not possible (as, for example, in many cases in which the MAN or WAN is managed by a service provider), you can deploy Cisco Location Identifier Separation Protocol (LISP) as an IP-based hand-off technology. LISP deployment is beyond the scope of this paper. For more information about LISP and LISP mobility, see http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Data_Center/DCI/5-0/LISPmobility/DCI_LISP_Host_Mobility.html.

The rest of this section focuses on the deployment model using a Layer 3 VPN hand-off to the MAN or WAN. The goal is to help ensure that the border nodes in each VXLAN fabric always advertise to the local WAN edge routers only host routes for endpoints connected to the local fabric.

As shown in Figure 3, the border nodes at a given site receive host route information for endpoints connected to remote fabrics through the route peering established over the dedicated Layer 3 DCI connection. You therefore must help ensure that these host routes advertised from the Layer 3 DCI are never sent to the local WAN edge routers. Host routing from the WAN always should steer the traffic to the fabric to which those destinations are connected, and a less specific route should be used only in specific WAN isolation scenarios discussed later in this section.

Figure 3: Border Nodes advertising only Host Routes for Locally Connected Endpoints

Many different approaches can be used to achieve the behavior shown in Figure 3. The example shown here simply proposes configuring the border nodes in a given fabric so that they do not announce to the local WAN edge routers host route information received through the Layer 3 DCI connection.

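ip as-path access-list 1 permit "_65200_"
ip prefix-list MATCH-HOST-ROUTES seq 5 permit 0.0.0.0/0 eq 32
ip access-list ANY
  10 permit ip any any
! Clause 10: drop host routes whose AS path includes the remote fabric (AS 65200)
route-map DENY-HOST-ROUTES-FROM-REMOTE-DCs deny 10
  match as-path 1
  match ip address prefix-list MATCH-HOST-ROUTES
! Clause 20: advertise everything else
route-map DENY-HOST-ROUTES-FROM-REMOTE-DCs permit 20
  match ip address ANY
router bgp 65100
  vrf Tenant-1
    ! <wan-edge-ip> is a placeholder for the WAN edge peer address
    neighbor <wan-edge-ip>
      description Local WAN Edge Device
      remote-as 65300
      address-family ipv4 unicast
        ! Applied outbound, toward the local WAN edge router
        route-map DENY-HOST-ROUTES-FROM-REMOTE-DCs out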

Note: The same configuration must be applied to all the border nodes facing the WAN edge and deployed across VXLAN fabrics.
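
To make the intent of this policy concrete, here is a small Python sketch (not NX-OS code) of the decision a border node takes before advertising a route to its local WAN edge router. It assumes the first route-map clause denies host routes that carry the remote fabric AS in their path. The Route class, the advertise_to_wan_edge function and the IP addresses are hypothetical names and values introduced for illustration only; the AS number comes from the example above.

```python
from dataclasses import dataclass
from ipaddress import ip_network

REMOTE_FABRIC_AS = 65200  # AS of the remote fabric, as in the example above


@dataclass(frozen=True)
class Route:
    prefix: str          # illustrative prefix, e.g. "10.100.0.21/32"
    as_path: tuple = ()  # AS numbers the update has traversed


def advertise_to_wan_edge(route: Route) -> bool:
    """Return True if the route may be sent to the local WAN edge router."""
    is_host_route = ip_network(route.prefix).prefixlen == 32
    from_remote_fabric = REMOTE_FABRIC_AS in route.as_path
    # Only host routes learned from the remote fabric are dropped;
    # everything else, including remote subnet prefixes, is still sent.
    return not (is_host_route and from_remote_fabric)


routes = [
    Route("10.100.0.11/32"),            # endpoint in the local fabric
    Route("10.100.0.21/32", (65200,)),  # endpoint in the remote fabric
    Route("10.100.0.0/24", (65200,)),   # subnet prefix from the remote fabric
]
for r in routes:
    print(r.prefix, advertise_to_wan_edge(r))
```

Only the second route is filtered, which matches the goal of announcing host routes to the WAN solely for locally connected endpoints.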

Note that other prefixes received from remote sites are still accepted, as long as they are not host routes. This behavior is required mainly to handle the potential WAN isolation scenario in which a given fabric loses connectivity to the MAN or WAN (as a result of a dual failure of the WAN edge routers or a WAN or MAN outage). In that case, traffic originating from the external routed domain and destined for an endpoint connected to the WAN-isolated fabric should be steered to a different VXLAN fabric following a less specific route (usually the IP prefix for the subnet to which the destination endpoint is connected). This scenario is shown in Figure 4.
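
The role of the less specific route can be illustrated with a longest-prefix-match sketch in Python. This is a simplified model of what a WAN router does, not an actual router implementation; the exit_fabric function and all IP addresses are assumptions made up for the example.

```python
from ipaddress import ip_address, ip_network


def exit_fabric(destination, advertisements):
    """Longest-prefix match over (prefix, fabric) advertisements."""
    dest = ip_address(destination)
    matches = [(ip_network(p), fabric) for p, fabric in advertisements
               if dest in ip_network(p)]
    if not matches:
        return None
    # The most specific prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


normal = [
    ("10.100.0.0/24", "fabric 1"),
    ("10.100.0.0/24", "fabric 2"),
    ("10.100.0.22/32", "fabric 2"),  # host route for an endpoint in fabric 2
]
# Fabric 2 loses its WAN connection: only fabric 1 advertisements remain.
isolated = [adv for adv in normal if adv[1] == "fabric 1"]

print(exit_fabric("10.100.0.22", normal))    # fabric 2
print(exit_fabric("10.100.0.22", isolated))  # fabric 1
```

In normal operation the host route attracts the traffic to fabric 2. Once fabric 2 is isolated, the host route disappears from the WAN and the subnet prefix advertised by fabric 1 takes over.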

Figure 4: Inbound Traffic in a WAN Isolation Scenario

Note: The same considerations apply if VXLAN fabric 1 experiences a WAN isolation scenario.

When virtual machine VM1 migrates to data center DC2, the host route information is updated across the VXLAN fabrics, as previously described in part 4, “Host Mobility Across Fabrics.” As a consequence, local border nodes BL3 and BL4 in fabric 2 (AS 65200) start advertising VM1’s host route to the fabric 2 WAN edge router, and border nodes BL1 and BL2 stop sending the same host route information to the fabric 1 WAN edge router. Traffic destined for VM1 and originating from the Layer 3 MAN or WAN is then steered directly to fabric 2.

Egress Traffic Path Optimization

After you have optimized the ingress traffic, you usually should do the same for the egress traffic to maintain symmetry for the communications with the external Layer 3 domain. As previously mentioned, this optimization is mandatory for deployment across sites with independent stateful network services such as firewalls.

To force the egress traffic to prefer the local WAN connection, you can modify the local-preference value for the prefixes learned through peering with the local WAN edge routers. The local preference is an attribute that routers exchange in the same autonomous system and that tells the autonomous system which path to prefer to reach destinations that are external to it. A path with a higher local-preference value is preferred. The default local-preference value is 100.

In the configuration sample shown here, one of the border nodes in VXLAN EVPN fabric 1 is configured to assign a higher local preference (200) to all the prefixes received from the WAN edge routers in fabric 1 in data center DC1 using the route map INCREASE-LOCAL-PREF-FOR-WAN-ROUTES.

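ip access-list ANY
  10 permit ip any any
! Raise the local preference of every prefix learned from the local WAN edge
route-map INCREASE-LOCAL-PREF-FOR-WAN-ROUTES permit 10
  match ip address ANY
  set local-preference 200
router bgp 65100
  vrf Tenant-1
    ! <wan-edge-ip> is a placeholder for the WAN edge peer address
    neighbor <wan-edge-ip>
      remote-as 65300
      description eBGP Peering with WAN Edge router in DC1
      address-family ipv4 unicast
        route-map INCREASE-LOCAL-PREF-FOR-WAN-ROUTES in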

The same external prefixes received through the Layer 3 DCI connection would instead have the default local-preference value of 100, so the local path will always be preferred (steered with the local preference 200), as shown in Figure 5.

Figure 5: Local Preferences for Outbound traffic

In the WAN isolation scenario previously considered, the border nodes in isolated VXLAN fabric 2 would start using the external prefixes received on the Layer 3 DCI connection from the border nodes in VXLAN fabric 1. This behavior still helps ensure that inbound and outbound communication between VM2 and the WAN remains symmetrical, which you can verify by comparing Figure 4 with Figure 6.

Figure 6: Outbound Traffic in a WAN Isolation Scenario

Keeping Inter-Fabric Routing via Layer 3 DCI

The last requirement is to help ensure that all communications between endpoints belonging to different VXLAN fabrics preferably are established using the dedicated Layer 3 DCI connection. This route is desirable because this connectivity usually is characterized by lower latency and higher bandwidth than the path through the MAN or WAN.

To meet this requirement, you must help ensure that even if a prefix (host route or IP subnet) belonging to fabric 2 is received in fabric 1 through the WAN edge router, this latter information is considered less preferable than the information received through the Layer 3 DCI connection.

Recall the configuration previously discussed to optimize the egress traffic flows. In this case, all the routes received by the border nodes from the WAN edge routers are characterized by a local-preference value of 200. You then must add a route map (INCREASE-LOCAL-PREF-FOR-REMOTE-HOST-ROUTES) to the routing updates received from the remote border nodes to help ensure that all the IP prefixes belonging to the remote fabric (that is, received by eBGP updates originating from the autonomous system of the remote data center) have a higher local-preference value (300 in the example in Figure 7).

Figure 7: Increase Local-Preference to Keep Inter-Fabric Routing via Layer 3 DCI

The following sample shows the required configuration.

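ip as-path access-list 2 permit "^65200$"
ip access-list ANY
  10 permit ip any any
! Clause 10: prefixes originated by the remote fabric (AS 65200) get local preference 300
route-map INCREASE-LOCAL-PREF-FOR-REMOTE-HOST-ROUTES permit 10
  match as-path 2
  set local-preference 300
! Clause 20: accept all other prefixes with the default local preference
route-map INCREASE-LOCAL-PREF-FOR-REMOTE-HOST-ROUTES permit 20
  match ip address ANY
router bgp 65100
  vrf Tenant-1
    ! <remote-border-node-ip> is a placeholder for the remote border node peer address
    neighbor <remote-border-node-ip>
      remote-as 65200
      description eBGP Peering with BL Node 1 in DC2
      address-family ipv4 unicast
        route-map INCREASE-LOCAL-PREF-FOR-REMOTE-HOST-ROUTES in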

As a result of this configuration, all the intersite communication stays on the Layer 3 DCI connection and will use the MAN or WAN path only if this connection completely fails.
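
The combined effect of the local-preference values (100 by default, 200 for WAN routes, 300 for routes originated by the remote fabric) can be checked with a short Python sketch. It models only the local-preference comparison on a fabric 1 border node and ignores every other BGP best-path criterion. The function names and the neighbor labels "wan-edge" and "l3-dci" are hypothetical; the AS numbers and preference values are those used in this section.

```python
DEFAULT_LOCAL_PREF = 100
REMOTE_FABRIC_AS = 65200


def inbound_local_pref(neighbor, as_path):
    """Local preference assigned to an update received by a fabric 1 border node."""
    if neighbor == "wan-edge":
        return 200  # every prefix learned from the local WAN edge
    if neighbor == "l3-dci" and as_path == (REMOTE_FABRIC_AS,):
        return 300  # prefix originated by the remote fabric
    return DEFAULT_LOCAL_PREF


def preferred_neighbor(updates):
    """updates: list of (neighbor, as_path); the highest local preference wins."""
    return max(updates, key=lambda u: inbound_local_pref(*u))[0]


# A fabric 2 prefix, seen over the DCI and again through the WAN.
fabric2_prefix = [("l3-dci", (65200,)), ("wan-edge", (65300, 65200))]
# An external prefix, seen from the local WAN edge and relayed by fabric 2.
external_prefix = [("wan-edge", (65300,)), ("l3-dci", (65200, 65300))]

print(preferred_neighbor(fabric2_prefix))   # l3-dci
print(preferred_neighbor(external_prefix))  # wan-edge
```

East-west traffic therefore stays on the Layer 3 DCI connection, while north-south traffic keeps exiting through the local WAN edge. If the DCI update is removed from the first list, the WAN path becomes the only candidate and is selected.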


VXLAN EVPN Multi-Fabric is a hierarchical network design comprising individual fabrics that are interconnected. The design described in this article focuses on the individuality of the data center domains, allowing independent scale and, more importantly, independent failure domains. The connectivity between the individual fabric domains is independent of the technology used within each data center, and thus a natural separation is achieved.

Overlay Transport Virtualization (OTV) is the recommended technology to provide Layer 2 extension while maintaining failure containment. With this ability and the additional attributes OTV offers for Data Center Interconnect (DCI), modern data center fabrics can be extended in an optimized fashion.

Different solutions can also be adopted to extend multi-tenant Layer 3 connectivity across Fabrics, mainly depending on the nature of the transport network interconnecting them.

Finally, specific deployment considerations and configuration can be used to symmetrize inbound and outbound traffic flows. This is always desirable to optimize access to Data Center resources that can be spread across different Fabrics, and it becomes mandatory when independent sets of stateful network services (firewalls, for example) are deployed in separate Fabrics.

36 – VXLAN EVPN Multi-Fabrics – Host Mobility (part 4)

Host Mobility across Fabrics

This section discusses support for host mobility when a distributed Layer 3 Anycast gateway is configured across multiple VXLAN EVPN fabrics.

In this scenario, VM1 belonging to VLAN 100 (subnet_100) is hosted by H2 in fabric 1, and VM2 on VLAN 200 (subnet_200) initially is hosted by H3 in the same fabric 1. Destination IP subnet_100 and subnet_200 are locally configured on leaf nodes L12 and L13 as well as on L14 and L15.

This example assumes that the virtual machines (endpoints) have been previously discovered, and that Layer 2 and 3 reachability information has been announced across both sites as discussed in the previous sections.

Figure 1 highlights the content of the forwarding tables on different leaf nodes in both fabrics before virtual machine VM2 is migrated to fabric 2.

Figure 1: Content of Forwarding Tables Before Host Migration

The following steps show the process for maintaining communication between the virtual machines in a host mobility scenario, as depicted in Figure 2.

Figure 2: VXLAN EVPN Multifabric and Host Mobility

  1. For operational purposes, virtual machine VM2 moves to host H4 located in fabric 2 and connected to leaf nodes L21 and L22.
  2. After the migration process is completed, assuming that VMware ESXi is the hypervisor used, the virtual switch generates a RARP frame with VM2’s MAC address information.

Note: With other hypervisors, such as Microsoft Hyper-V or Citrix Xen, a GARP request is sent instead, which includes the source IP address of the sender in the payload. As a consequence, the procedure will be slightly different than the one described here.

  3. Leaf L22 in this example receives the RARP frame and learns the MAC address of VM2 as locally connected. Because the RARP message can’t be used to learn VM2’s IP address, the forwarding table of the devices in fabric 2 still points to border nodes BL3 and BL4 (that is, VM2’s IP address is still known as connected to fabric 1). Leaf L22 also sends an MP-BGP EVPN route-type-2 update in fabric 2 with VM2’s MAC address information. When doing so, it increases the sequence number associated with this specific entry and specifies as the next hop the anycast VTEP address of leaf nodes L21 and L22. The receiving devices update their forwarding tables with this new information.
  4. On the data plane, the RARP broadcast frame is also flooded in fabric 2 and reaches border nodes BL3 and BL4, which forward it to the local OTV devices.
  5. The OTV AED in fabric 2 forwards the RARP frame across the Layer 2 DCI overlay network to reach the remote OTV devices in fabric 1. The OTV AED device in fabric 1 forwards the frame to the local border nodes.
  6. Border nodes BL1 and BL2 learn the MAC address of VM2 from the reception of the RARP frame as locally attached to their Layer 2 interfaces connecting to the OTV AED device. As a consequence, one of the border nodes advertises VM2’s MAC address information in fabric 1 with a route-type-2 BGP update using a new sequence number (higher than the previous number).
  7. The forwarding tables for all relevant local leaf nodes in fabric 1 are updated with the information that VM2’s MAC address is now reachable through the anycast VTEP address of border nodes BL1 and BL2.
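
The role of the sequence number in the steps above can be sketched in a few lines of Python. This is a simplified model in which a MAC entry is replaced only when an update carries a higher sequence number; the handling of equal sequence numbers is not modeled. The MacRoute class, the update_mac_table function, the MAC address and the VTEP labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MacRoute:
    mac: str
    next_hop: str  # VTEP advertising the MAC address
    sequence: int  # mobility sequence number carried in the update


def update_mac_table(table, route):
    """Install the route only if it is newer than the current entry."""
    current = table.get(route.mac)
    if current is None or route.sequence > current.sequence:
        table[route.mac] = route
        return True
    return False


vm2 = "00:00:5e:00:53:02"
fabric1_table = {}
# Before the move: VM2 sits behind leaf nodes L14 and L15.
update_mac_table(fabric1_table, MacRoute(vm2, "anycast-vtep-L14-L15", 0))
# After the move: the border nodes advertise VM2 with a higher sequence number.
update_mac_table(fabric1_table, MacRoute(vm2, "anycast-vtep-BL1-BL2", 1))
# A stale update with the old sequence number is ignored.
update_mac_table(fabric1_table, MacRoute(vm2, "anycast-vtep-L14-L15", 0))

print(fabric1_table[vm2].next_hop)  # anycast-vtep-BL1-BL2
```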

At this point, all the devices in fabrics 1 and 2 have properly updated their forwarding tables with VM2’s new MAC address reachability information. This process implies that intrasubnet communication to VM2 is now fully reestablished. However, VM2’s IP address still is known in both fabrics as connected to the old location (that is, to leaf nodes L14 and L15 in fabric 1), so communications still cannot be routed to VM2. Figure 3 shows the additional steps required to update the forwarding tables of the devices in fabrics 1 and 2 with the new reachability information for VM2’s IP address.

Figure 3: Propagation of VM2’s Reachability Information toward Fabric 1

8.   The reception of the route-type-2 MAC address advertisement on leaf nodes L14 and L15 triggers a verification process to help ensure that VM2 is not locally connected anymore. Note that ARP requests to VM2 are locally sent out the local interface to which VM2 was originally connected as well as to fabric 1 and subsequently to fabric 2 through the Layer 2 DCI connection. The ARP request reaches VM2, which responds, allowing leaf nodes L21 and L22 to update the local ARP table and trigger the consequent control-plane updates discussed previously and shown in Figure 16.

9.   After verification that VM2 has indeed moved away from leaf nodes L14 and L15, one of the leaf nodes withdraws VM2’s IP reachability information from local fabric 1, sending an MP-BGP EVPN update. This procedure helps ensure that this information can be cleared from the forwarding tables of all the devices in fabric 1.

10.  Because border nodes BL1 and BL2 also receive the withdrawal of VM2’s IP address, they update the border nodes in the remote fabric to indicate that this information is not reachable anymore through the Layer 3 DCI connection.

11.  As a consequence, border nodes BL3 and BL4 also withdraw this information from remote VXLAN EVPN fabric 2, allowing all the local devices to clear this information from their tables.

The end result is the proper update of VM2’s IP address information in the forwarding tables of all the nodes in both fabrics, as shown in Figure 4.

Figure 4: End State of the Forwarding Tables for Nodes in Fabrics 1 and 2

At this point, Layer 2 and 3 communication with VM2 can be fully reestablished.


36 – VXLAN EVPN Multi-Fabrics with Anycast L3 gateway (part 3)


I recommend reading part 1 and part 2 if you missed them 🙂

Thank you, yves

VXLAN EVPN Multi-Fabric with Distributed Anycast Layer 3 Gateway

Layer 2 and Layer 3 DCI interconnecting multiple VXLAN EVPN Fabrics

A distributed anycast Layer 3 gateway provides significant added value to VXLAN EVPN deployments for several reasons:

  • It offers the same default gateway to all edge switches. Each endpoint can use its local VTEP as a default gateway to route traffic outside its IP subnet. The endpoints can do so, not only within a fabric but across independent VXLAN EVPN fabrics (even when fabrics are geographically dispersed), removing suboptimal interfabric traffic paths. Additionally, routed flows between endpoints connected to the same leaf node can be directly routed at the local leaf layer.
  • In conjunction with ARP suppression, it reduces the flooding domain to its smallest diameter (the leaf or edge device), and consequently confines the failure domain to that switch.
  • It allows transparent host mobility, with the virtual machines continuing to use their respective default gateways (on the local VTEP), within each VXLAN EVPN fabric and across multiple VXLAN EVPN fabrics.
  • It does not require you to create any interfabric FHRP filtering, because no protocol exchange is required between Layer 3 anycast gateways.
  • It allows better distribution of state (ARP, etc.) across multiple devices.

In the VXLAN EVPN Multi-Fabric with Distributed Anycast Layer 3 Gateway scenario the Border nodes can perform 3 main functions:

  1. VLAN and VRF-Lite hand-off to DCI.
  2. MAN/WAN connectivity to the external Layer 3 network domain.
  3. Connectivity to Network Services.

For IP subnets that are extended between multiple fabrics, instantiation of the distributed IP anycast gateway on the border nodes is not supported. It is not supported because, with the instantiation of the distributed IP anycast gateway on the border nodes that also extend the Layer 2 network, the same MAC and IP addresses become visible on the Layer 2 extension on both fabrics, and unpredictable learning and forwarding can occur. Even if the Layer 2 network is not extended between fabrics, because it may potentially later be extended, the use of a distributed IP anycast gateway on the border node is not recommended (Figure 1).

Figure 1: No Distributed Anycast Gateway Functionality for Endpoints on Border Nodes

The next sections provide more detailed packet walk descriptions. But before proceeding, keep in mind that MP-BGP EVPN can transport Layer 2 information such as MAC addresses as well as Layer 3 information such as host IP addresses (host routes) and IP subnets. For this purpose, it uses two forms of routing advertisement:

  • Route type 2: Used to announce host MAC and IP address information for the endpoint directly connected to the VXLAN fabric and also carrying extended community attributes (such as route-target values, router MAC addresses, and sequence numbers).
  • Route type 5: Advertises IP subnet prefixes or host routes (associated, for example, with locally defined loopback interfaces) and also carrying extended community attributes (such as route-target values and router MAC addresses).
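
As a rough mental model, the two advertisement forms can be written as Python data classes. This is only a sketch of the information listed above, not the EVPN wire format. The class names, the field names and all sample values (MAC addresses, prefixes, route targets) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RouteType2:
    """Host advertisement: a MAC address, optionally with the host IP address."""
    mac: str
    ip: Optional[str]
    route_targets: Tuple[str, ...]
    sequence: int = 0
    router_mac: Optional[str] = None


@dataclass(frozen=True)
class RouteType5:
    """IP prefix advertisement: an IP subnet or a host route."""
    prefix: str
    route_targets: Tuple[str, ...]
    router_mac: str


host = RouteType2("00:00:5e:00:53:01", "10.100.0.1", ("65100:10100",))
subnet = RouteType5("10.100.0.0/24", ("65100:50000",), "00:00:5e:00:53:aa")
print(host)
print(subnet)
```

Note that only the route type 2 model carries a sequence number, since that attribute is tied to the mobility of an individual endpoint.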

Learning Process for Endpoint Reachability Information

Endpoint learning occurs on the VXLAN EVPN switch, usually on the edge devices (leaf nodes) to which the endpoints are directly connected. The information (MAC and IP addresses) for locally connected endpoints is then programmed into the local forwarding tables.

This article assumes that a distributed anycast gateway is deployed on all the leaf nodes of the VXLAN fabric. Therefore, there are two main mechanisms for endpoint discovery:

  • MAC address information is learned on the leaf node when it receives traffic from the locally connected endpoints. The usual data-plane learning function performed by every Layer 2 switch allows this information to be programmed into the local Layer 2 forwarding tables.

You can display the Layer 2 information in the MAC address table by using the show mac address-table command. You can display the content of the EVPN table (Layer 2 routing information base [L2RIB]) populated by BGP updates by using the show l2route evpn mac command.

  • IP addresses of locally connected endpoints are instead learned on the leaf node by intercepting control-plane protocols used for address resolution (ARP, Gratuitous ARP [GARP], and IPv6 neighbor discovery messages). This information is also programmed in the local L2RIB together with the host route information received through route-type-2 EVPN updates and can be viewed by using the show l2route evpn mac-ip command at the command-line interface (CLI).

Note: Reverse ARP (RARP) frames do not contain any local host IP information in the payload, so they can be used only to learn Layer 2 endpoint information and not the IP addresses.
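
The difference between the two discovery mechanisms can be summarized with a toy model in Python. It is not how a switch implements learning; it only shows which frame types can populate which part of an entry. The learn_endpoint function, the dictionary layout and the addresses are assumptions made for this example.

```python
def learn_endpoint(l2rib, frame):
    """Record what a leaf node can learn from one received frame.

    frame is a dict with 'type', 'src_mac' and, for ARP, GARP and ND,
    an optional 'sender_ip'.
    """
    entry = l2rib.setdefault(frame["src_mac"], {"ip": None})
    # ARP, GARP and IPv6 ND carry the sender's IP address; RARP does not.
    if frame["type"] in ("ARP", "GARP", "ND") and frame.get("sender_ip"):
        entry["ip"] = frame["sender_ip"]
    return entry


l2rib = {}
mac = "00:00:5e:00:53:02"
learn_endpoint(l2rib, {"type": "RARP", "src_mac": mac})
print(l2rib[mac])  # {'ip': None}
learn_endpoint(l2rib, {"type": "GARP", "src_mac": mac, "sender_ip": "10.200.0.2"})
print(l2rib[mac])  # {'ip': '10.200.0.2'}
```

After the RARP frame only the MAC address is known. The IP address appears once a GARP (or ARP) frame is intercepted.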

After learning and registering information (MAC and IP addresses) for its locally discovered endpoints, the edge device announces this information to the MP-BGP EVPN control plane using an EVPN route-type-2 advertisement sent to all other edge devices that belong to the same VXLAN EVPN fabric. As a consequence, all the devices learn endpoint information that belongs to their respective VNIs and can then import it into their local forwarding tables.

Intra-Subnet Communication Across Fabrics

Communication between two endpoints located in different fabrics within the same stretched Layer 2 segment (IP subnet) can be established by using a combination of VXLAN bridging (inside each fabric) and OTV Layer 2 extension services.

To better understand the way that endpoint information (MAC and IP addresses) is learned and distributed across fabrics, start by assuming that source and destination endpoints have not been discovered yet inside each fabric.

The control-plane and data-plane steps required to establish cross-fabric Layer 2 connectivity are highlighted in Figure 2 and described in this section.

Figure 2: VXLAN EVPN Multi-Fabric – ARP Request Propagation across L2 DCI

  1. Host H1 connected to leaf L11 in fabric 1 initiates intrasubnet communication with host H6, which belongs to remote fabric 2. As a first action, H1 generates a Layer 2 broadcast ARP request to resolve the MAC and IP address mapping for H6 that belongs to the same Layer 2 segment.
  2. Leaf L11 receives the ARP packet on local VLAN 100 and maps it to the locally configured L2VNI 10100. Because the distributed anycast gateway is enabled for that Layer 2 segment, leaf L11 learns H1’s MAC and IP address information and distributes it inside the fabric through an EVPN route-type-2 update. This process allows all the leaf nodes with the same L2VNI locally defined to receive and import this information in their forwarding tables (if they are properly configured to do so).
    • Note: The reception of the route-type-2 update triggers also a BGP update for H1’s host route from border nodes BL1 and BL2 to remote border nodes BL3 and BL4 through the Layer 3 DCI connection. This process is not shown in Figure 2, because it is not relevant for intrasubnet communication.
  3. On the data plane, leaf L11 floods the ARP broadcast request across the Layer 2 domain identified by L2VNI 10100 (locally mapped to VLAN 100). The ARP request reaches all the local leaf nodes in fabric 1 that host that L2VNI, including border nodes BL1 and BL2. The replication of multidestination traffic can use either the underlay multicast functions of the fabric or the unicast ingress replication capabilities of the leaf nodes. The desired behavior can be independently tuned on a per-L2VNI basis within each fabric.
    • Note: The ARP flooding described in the preceding steps occurs regardless of whether ARP suppression is enabled in the L2VNI, because of the initial assumption that H6 has not been discovered yet.
  4. The border nodes decapsulate the received VXLAN frame and use the L2VNI value in the VXLAN header to determine the bridge domain to which to flood the ARP request. They then send the request out the Ethernet interfaces that carry local VLAN 100 mapped to L2VNI 10100. The broadcast packet is sent to both remote OTV end devices. The OTV authoritative edge device (AED) responsible for extending that particular Layer 2 segment (VLAN 100) receives the broadcast packet, performs a lookup in its ARP cache table, and finds no entry for H6. As a result, it encapsulates the ARP request and floods it across the OTV overlay network to its remote OTV peers. The remote OTV edge device that is authoritative for that Layer 2 segment opens the OTV header and bridges the ARP request to its internal interface that carries VLAN 100.
  5. Border nodes BL3 and BL4 receive the ARP request from the AED through the connected OTV inside interface and learn on that interface through the data plane the MAC address of H1. As a result, the border nodes announce H1’s MAC address reachability information to the local fabric EVPN control plane using a route-type-2 update. Finally, the ARP request is flooded across L2VNI 10100, which is locally mapped to VLAN 100.
  6. All the edge devices on which L2VNI 10100 is defined receive the encapsulated ARP packet, including leaf nodes L23 and L24, which are part of a vPC domain. One of the two leaf nodes is designated to decapsulate the ARP request and bridge it to its local interfaces configured in VLAN 100 and locally mapped to L2VNI 10100, so H6 receives the ARP request.

At this point, the forwarding tables of all the devices are properly populated to allow the unicast ARP reply to be sent from H6 back to H1, as shown in Figure 3.


36 – VXLAN EVPN Multi-Fabrics with External Routing Block (part 2)


I recommend reading part 1 if you missed it 🙂

thank you, yves

VXLAN EVPN Multi-Fabric with External Active/Active Gateways

The first use case is simple. Each VXLAN fabric behaves like a traditional Layer 2 network with a centralized routing block. External devices (such as routers and firewalls) provide default gateway functions, as shown in Figure 1.

Figure 1: External Routing Block IP Gateway for VXLAN/EVPN Extended VLAN

In the Layer 2–based VXLAN EVPN fabric deployment, the external routing block is used to perform routing functions between Layer 2 segments. The same routing block can also be connected to the WAN to advertise the public networks from each data center to the outside and to propagate external routes to each fabric.

The routing block consists of a “router-on-a-stick” design (from the fabric’s point of view) built with a pair of traditional routers, Layer 3 switches, or firewalls that serve as the IP gateway. These IP gateways are attached to a pair of vPC border nodes that initiate and terminate the VXLAN EVPN tunnels.

Connectivity between the IP gateways and the border nodes is achieved through a Layer 2 trunk carrying all the VLANs that require routing services.

To improve performance, default gateways can be active in each data center, which reduces the hairpinning of east-west traffic for server-to-server communication between sites. Depending on the IP gateway platform of choice, the routing block can be duplicated with the same virtual IP and MAC addresses for all relevant SVIs on both sides. To use active-active gateways on both fabrics, you must therefore filter communications between gateways that belong to the same First-Hop Routing Protocol (FHRP) group. With OTV as the DCI solution, the FHRP filter is applied to the OTV control plane.

Figure 2 shows this scenario.

Note: Although VLANs are locally significant per edge device (or even per port), and the Layer 2 virtual network identifier (VNI) is locally significant in each VXLAN EVPN fabric, the following examples assume that the same Layer 2 VNIs (L2VNIs) are reused on both fabrics. The same VLAN ID was also reused on each leaf node and on the border switches. This approach was used to simplify the diagrams and packet walk. In a real production network, the network manager can use different network identifiers for the Layer 2 and 3 VNIs deployed in the individual fabrics.

Figure 2: VXLAN EVPN Layer 2 Fabric with External Routing Block

Figure 2 shows the following:

Note: The basic assumption is that H1 and H2 have already populated their ARP tables with the default gateway information, for example because they previously sent ARP requests targeted to the gateway. As a consequence, the gateway also has H1 and H2 information in its local ARP table.

  1. Host H1 connected to leaf L11 in fabric 1 needs to send a data packet to host H2 connected in vPC mode to a pair of leaf nodes, L12 and L13, in the same fabric 1. Because H1 and H2 are part of different IP subnets, H1 sends the traffic to the MAC address of its default gateway, which is deployed on an external Layer 3 device connected to the same fabric 1. The communication between H1 and the default gateway uses the VXLAN fabric as a pure Layer 2 overlay service.
  2. Traffic from H1 that belongs to VLAN 100 is VXLAN encapsulated by leaf L11 with L2VNI 10100 (locally mapped to VLAN 100) and sent to the egress anycast VTEP address defined on border nodes BL1 and BL2. This address represents the next hop to reach the MAC address of the default gateway. Layer 3 equal-cost multipath (ECMP) is used in the fabric to load-balance the traffic destined for the egress anycast VTEP between the two border nodes. In the example in Figure 2, border node BL1 is selected as the destination.
  3. BL1 decapsulates the VXLAN frame and bridges the original frames destined for the local default gateway onto VLAN 100 (locally mapped to L2VNI 10100).
  4. The default gateway receives the frame destined for H2, performs a Layer 3 lookup, and subsequently forwards the packet to the Layer 2 segment on which H2 resides.
  5. The Layer 2 flow reaches one of the vPC-connected border nodes (BL2 in this example). BL2 uses the received IEEE 802.1Q tag (VLAN 200) to identify the locally mapped L2VNI, 10200, to be used to VXLAN-encapsulate the frame. BL2 then forwards the data packet to the anycast VTEP address defined on the vPC pair of leaf nodes, L12 and L13, on which the destination H2 is connected.
  6. One of the receiving leaf nodes is designated to decapsulate the VXLAN frame and send the original data packet to H2.
  7. Routed communications between endpoints located at the remote site are kept local within fabric 2. This behavior is possible only because FHRP filtering is enabled on the OTV edge devices. In this example, H4 sends traffic destined for H6 using its local default gateway active in data center DC2. East-west routed traffic can consequently be localized within each fabric, eliminating unnecessary interfabric traffic hairpinning.
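
The packet walk above can be summarized as a minimal Python sketch. It only models the VLAN-to-L2VNI mapping and the order of encapsulation and decapsulation steps; the function names and the tuple layout are illustrative assumptions, not part of any Cisco tooling.

```python
# VLAN-to-L2VNI mapping used in the walk (VLAN 100 -> 10100, VLAN 200 -> 10200).
VLAN_TO_L2VNI = {100: 10100, 200: 10200}


def encapsulate(vlan):
    """Return the L2VNI used to VXLAN-encapsulate a frame received on `vlan`."""
    return VLAN_TO_L2VNI[vlan]


def decapsulate(l2vni):
    """Return the local VLAN onto which a frame carried in `l2vni` is bridged."""
    for vlan, vni in VLAN_TO_L2VNI.items():
        if vni == l2vni:
            return vlan
    raise KeyError(l2vni)


def routed_walk(src_vlan, dst_vlan):
    """List the (device, action, identifier) steps for H1 -> gateway -> H2."""
    hops = []
    vni = encapsulate(src_vlan)
    hops.append(("L11", "encap", vni))                    # step 2
    hops.append(("BL1", "decap", decapsulate(vni)))       # step 3
    hops.append(("gateway", "route", dst_vlan))           # step 4
    vni = encapsulate(dst_vlan)
    hops.append(("BL2", "encap", vni))                    # step 5
    hops.append(("L12/L13", "decap", decapsulate(vni)))   # step 6
    return hops


for hop in routed_walk(100, 200):
    print(hop)
```

The sketch makes one point explicit: the fabric never routes in this design. Every hop is either a Layer 2 encapsulation or a decapsulation, and the only Layer 3 decision is taken by the external gateway.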



36 – VXLAN EVPN Multi-Fabrics Design Considerations (part 1)



Since this article was published two years ago, an integrated and hierarchical solution for interconnecting multiple VXLAN EVPN fabrics, called Multi-Site (post 37), has been available for about a year. It extends Layer 2 and Layer 3 networks in a much more efficient and robust way, and it is easier to deploy and maintain.

However, I want to keep this article online because several sections, such as the packet walk, can help to better understand the learning and distribution mechanisms used by EVPN. An additional reason is path optimization: the key solutions described in part 5 are still valid with Multi-Site.


With my friend and respected colleague Max Ardica, we have tested and qualified the current solution to interconnect multiple VXLAN EVPN fabrics. We wrote this technical material to clarify the network design requirements when the Layer 3 anycast gateway function is distributed among all compute leaf nodes and across all VXLAN EVPN fabrics. The whole article is organized in five posts.

  • This first part elaborates the design considerations to interconnect two VXLAN EVPN based fabrics.
  • The second post discusses the Layer 2 DCI requirements for interconnecting Layer 2–based VXLAN EVPN fabrics deployed in conjunction with active/active external routing blocks.
  • The third post covers the Layer 2 and Layer 3 DCI requirements for interconnecting VXLAN EVPN fabrics deployed in conjunction with the distributed Layer 3 anycast gateway.
  • The fourth post examines host mobility across two VXLAN EVPN fabrics with the Layer 3 anycast gateway.
  • Finally, the last post develops inbound and outbound path optimization with geographically dispersed VXLAN EVPN fabrics.


Recently, fabric architecture has become a common and popular design option for building new-generation data center networks. Virtual Extensible LAN (VXLAN) with Multiprotocol Border Gateway Protocol (MP-BGP) Ethernet VPN (EVPN) is essentially becoming the standard technology used for deploying network virtualization overlays in data center fabrics.

Data center networks usually require the interconnection of separate network fabrics, which may also be deployed across geographically dispersed sites. Consequently, organizations need to consider the possible deployment alternatives for extending Layer 2 and 3 connectivity between these fabrics and the differences between them.

The main goal of this article is to present one of these deployment options, using Layer 2 and 3 data center interconnect (DCI) technology between two or more independent VXLAN EVPN fabrics. This design is usually referred to as VXLAN multi-fabrics.

To best understand the design presented in this article, the reader should be familiar with VXLAN EVPN, including the way it works in conjunction with the Layer 3 anycast gateway and its design for operation in a single site. For more information, see the following Cisco® VXLAN EVPN documents:

To extend Layer 2 as well as Layer 3 segmentation across multiple geographically dispersed VXLAN-based fabrics, you can consider two main architectural approaches: VXLAN EVPN Multi-pod fabric and VXLAN EVPN Multi-fabric.

Option 1: VXLAN EVPN Multi-Pod Fabric

You can create a single logical VXLAN EVPN fabric in which multiple pods are dispersed to different locations using a Layer 3 underlay to interconnect the pods. Also known as stretched fabric, this option is now more commonly called VXLAN EVPN Multi-pod fabric. The main benefit of this design is that the same architecture model can be used to interconnect multiple pods within a building or campus area as well as across metropolitan-area distances.

The VXLAN EVPN Multipod design is thoroughly discussed in this post 32 – VXLAN Multipod stretched across geographically dispersed datacenters

Option 2: VXLAN EVPN Multi-Fabric

You can also interconnect multiple VXLAN EVPN fabrics using a DCI solution that provides multitenant Layer 2 and Layer 3 connectivity services across fabrics. This article focuses on this option.

Compared to the VXLAN EVPN Multi-pod fabric design, the VXLAN EVPN Multi-fabric architecture offers greater independence of the data center fabrics. Reachability information for VXLAN tunnel endpoints (VTEPs) is contained in each single VXLAN EVPN fabric. As a result, VXLAN tunnels are initiated and terminated within the same fabric (in the same location). This approach helps ensure that policies, such as storm control, are applied specifically at the Layer 2 interface level between each fabric and the DCI network to limit or eliminate storm and fault propagation across fabrics.

The Multi-fabric deployment model also offers greater scalability than the Multi-pod approach, supporting a greater total number of leaf nodes across fabrics. The total number of leaf devices supported equals the maximum number of leaf nodes supported in a fabric (256 at the time of the writing of this post) multiplied by the number of interconnected fabrics. A Multi-fabric deployment can also support an overall greater number of endpoints than a multipod deployment; commonly, only a subset of Layer 2 segments is extended across separate fabrics, so most MAC address entries can be contained within each fabric. Additionally, host route advertisement can be filtered across separate fabrics for the IP subnets that are defined only locally within each fabric, increasing the overall number of supported IP addresses.
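
As a quick worked example of the leaf-count arithmetic above, here is a minimal Python sketch. The per-fabric limit of 256 is the figure quoted in this post at the time of writing; the function name is an illustrative assumption.

```python
# Per-fabric leaf limit quoted in the post (valid at the time of writing only).
MAX_LEAVES_PER_FABRIC = 256


def max_leaf_nodes(fabrics, per_fabric=MAX_LEAVES_PER_FABRIC):
    """Total leaf nodes across independent fabrics: per-fabric limit x fabric count."""
    if fabrics < 1:
        raise ValueError("at least one fabric is required")
    return fabrics * per_fabric


print(max_leaf_nodes(2))  # two interconnected fabrics
print(max_leaf_nodes(4))  # four interconnected fabrics
```

With two fabrics the total is 512 leaf nodes, and with four it is 1024, whereas a Multi-pod deployment remains bounded by the single-fabric limit.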

Separate VXLAN fabrics can be interconnected using Layer 2 and 3 functions, as shown in Figure 1:

  • VLAN hand-off to DCI for Layer 2 extension
  • Virtual Routing and Forwarding Lite (VRF-Lite) hand-off to DCI for multitenant Layer 3 extension

Figure 1: VLAN and VRF-Lite Hand-Off

The maximum distance between separate VXLAN EVPN fabrics is determined mainly by the application software framework requirements (maximum tolerated latency between two active members) or by the mode of disaster recovery required by the enterprise (hot, warm, or cold migration).

After Layer 2 and 3 traffic is sent out of each fabric through the border nodes, several DCI solutions are available for extending Layer 2 and 3 connectivity across fabrics while maintaining end-to-end logical isolation.

  • Layer 2 DCI:

Layer 2 connectivity can be provided in several ways. It can be provided through a dedicated Layer 2 dual-sided virtual port channel (vPC) for metropolitan-area distances using fiber or dense wavelength-division multiplexing (DWDM) links between two sites. For any distance, it can be provided by any valid Layer 2 overlay technology over a Layer 3 transport mechanism such as Overlay Transport Virtualization (OTV), Multiprotocol Label Switching (MPLS) EVPN, VXLAN EVPN, or Virtual Private LAN Service (VPLS).

  • Layer 3 DCI:

The Layer 3 DCI connection has two main purposes:

  1. It advertises between sites the network prefixes for local hosts and IP subnets.
  2. It propagates to the remote sites host routes and subnet prefixes for stretched IP subnets.

The DCI solutions selected for extending multitenant Layer 2 and 3 connectivity across VXLAN EVPN fabrics usually depend on the type of service available in the transport network connecting the fabrics:

Scenario 1: Enterprise-owned direct links (dark fibers or DWDM circuits)

  • vPC or OTV for Layer 2 extension
  • Back-to-back VRF-Lite subinterfaces for Layer 3 extension

Scenario 2: Enterprise-owned or service provider–managed multitenant Layer 3 WAN service

  • OTV for Layer 2 extension through a dedicated VRF-Lite subinterface
  • VRF-Lite subinterfaces to each WAN Layer 3 VPN service

Scenario 3: Enterprise-owned or service provider–managed single-tenant Layer 3 WAN service

  • OTV for Layer 2 extension across native Layer 3
  • VRF-Lite over OTV for Layer 3 segmentation

Note: In this article, OTV is the DCI solution of choice used to extend Layer 2 connectivity between VXLAN fabrics. OTV is an IP-based DCI technology designed purposely to provide Layer 2 extension capabilities over any transport infrastructure. OTV provides an overlay that enables Layer 2 connectivity between separate Layer 2 domains while keeping these domains independent and preserving the fault-isolation, resiliency, and load-balancing benefits of an IP-based interconnection. For more information about OTV as a DCI technology and associated deployment considerations, refer to the document at  :


Scenario 1: In a metropolitan area network (MAN), you can use a direct DWDM transport for Layer 2 and 3 extension. Therefore, dedicated links for Layer 2 and 3 segmentation are established using different DWDM circuits (or direct fiber when the network is owned by the enterprise). Figure 2 shows two separate physical interfaces. One is used for Layer 3 segmentation, and the other is used for Layer 2 extension (classic Ethernet or overlay). A dedicated Layer 3 connection in conjunction with vPC is a best practice, and it is usually required for platforms that do not support dynamic routing over vPC. However, with border node platforms (such as Cisco Nexus® 7000 or 9000 Series Switches) and software releases that support it, you can also use Layer 3 over a vPC dual-sided connection (not covered in this article).

Note: Dynamic routing over vPC is supported on Cisco Nexus 9000 Series Switches starting from Cisco NX-OS Software Release 7.0(3)I5(1).

Figure 2- Layer 2 and Layer 3 segmented across a DWDM network

Nonetheless, the recommended approach is to deploy an overlay solution such as OTV to provide Layer 2 DCI services, especially if more than two sites need to be interconnected (a scenario not documented here). OTV inherently offers multipoint Layer 2 DCI services while helping ensure robust protection against the creation of end-to-end Layer 2 loops.

Note: At this stage, the OTV encapsulation can be built with a generic routing encapsulation (GRE) header or with a VXLAN and User Datagram Protocol (UDP) header. The latter option is preferred because it allows better load balancing of OTV traffic across the Layer 3 network that interconnects the VXLAN fabrics. However, it requires the use of F3 or M3 line cards and NX-OS Release 7.2 or later.

  • OTV edge devices are connected to the border nodes with a Layer 3 and a Layer 2 connection, as shown in Figure 3, a deployment model referred to as “on a stick”. The OTV overlay is carried within a dedicated Layer 3 subinterface and VRF instance. This subinterface is carved on the same physical interface that is used to provide cross-fabric Layer 3 connectivity for all the tenants (discussed later in this section) and can belong to the default VRF instance or to a dedicated OTV VRF instance.

Note: Dual-sided vPC can be used as a Layer 2 DCI alternative for VXLAN dual-fabric deployments, though it is less preferable due to the lack of various failure isolation functions natively offered by OTV.

  • Dedicated back-to-back subinterfaces carry OTV encapsulated traffic and provide Layer 3 Tenant connectivity.
    • OTV: Default VRF = E1/1.999
    • Tenant 1: VRF T1 = E1/1.101
    • Tenant 2: VRF T2 = E1/1.102
    • Tenant 3: VRF T3 = E1/1.103
    • etc.
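
The subinterface numbering above follows a simple pattern that can be generated rather than typed by hand. The following Python sketch is only an illustration of that pattern: the function name, the starting tag 101, and the OTV tag 999 are taken from or assumed on the basis of the list above, and the output is a plain mapping, not device configuration.

```python
def subinterface_plan(parent, tenant_vrfs, first_tag=101, otv_tag=999):
    """Return (subinterface, vrf, purpose) tuples for one back-to-back link.

    The OTV overlay rides in the default VRF on its own subinterface, and
    each tenant VRF gets the next consecutive subinterface number.
    """
    plan = [(f"{parent}.{otv_tag}", "default", "OTV")]
    for offset, vrf in enumerate(tenant_vrfs):
        plan.append((f"{parent}.{first_tag + offset}", vrf, f"Tenant {offset + 1}"))
    return plan


for subif, vrf, purpose in subinterface_plan("E1/1", ["T1", "T2", "T3"]):
    print(f"{purpose}: VRF {vrf} = {subif}")
```

Keeping the tag derived from the tenant index makes it easier to keep both ends of the back-to-back link consistent, since each subinterface must use the same tag on both border nodes.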

Figure 3 : Physical View with OTV “on the stick” to carry Intra-subnet communication

Scenario 2: The second scenario applies to deployments in which a Layer 3 multitenant transport service is available across separate fabrics.

This deployment model is similar to the one in scenario 1: OTV edge devices “on a stick” are used to extend Layer 2 connectivity across sites. The OTV traffic is sent out of the border nodes on dedicated subinterfaces that are mapped to a specific Layer 3 VPN service on the WAN edge router acting as a provider edge device.

At the same time, independent Layer 3 subinterfaces are deployed for each tenant for end-to-end Layer 3 communication. Each subinterface is then mapped to a dedicated Layer 3 VPN service.

Figures 4 and 5 show this scenario.

Figure 4: OTV Overlay “on a stick” and VRF-Lite sessions run independently


Figure 5 : Physical View with OTV “on a stick” to carry Intra-subnet communication – Layer 3 VPN / MPLS Core

Scenario 3: In this scenario, the WAN and MAN transport network offers only single-tenant Layer 3 service (it does not support MPLS or VRF-Lite).

In this case, the Layer 2 DCI services offered by an overlay technology such as OTV can be used to establish cross-fabric Layer 2 and 3 multitenant connectivity. This connection is achieved in two steps:

  • First, a Layer 2 DCI overlay service is established over the native Layer 3 WAN and MAN core.
  • Next, per-tenant Layer 3 peerings are established across the fabrics over this Layer 2 overlay transport. The dedicated tenant Layer 3 interfaces to connect to the remote sites can be deployed as Layer 3 subinterfaces or as switch virtual interfaces (SVIs). Your choice has an impact on the physical connectivity between the border nodes and the OTV edge devices:
    • Use of subinterfaces for tenant-routed communication: In this case, you must deploy two separate internal interfaces to connect the border nodes to the OTV devices. As shown in Figure 6, the first interface is configured with subinterfaces to carry tenant traffic (a unique VLAN tag is associated with each tenant), and the second interface is a Layer 2 trunk used to extend Layer 2 domains between sites.

Figure 6: Layer 3 VPN peering over the Layer 2 Overlay Transport

Because Layer 3 routing peerings are established for each tenant to the remote border nodes across the OTV overlay service, you must enable bidirectional forwarding detection (BFD) on each Layer 3 subinterface for indirect failure detection, helping ensure faster convergence in those specific cases.

  • Use of SVIs for tenant-routed communication: In this case, shown in Figure 7, the same Layer 2 trunk interface can be used to carry the VLANs associated with each tenant Layer 3 interface and the VLANs extending the Layer 2 domains across sites. Note that these two sets of VLANs must be unique, and that SVIs are not defined for the VLANs that extend the Layer 2 domains across sites.

In this specific scenario, the join interfaces of the OTV edge devices can be connected directly to the Layer 3 core, as shown in Figure 7.

Figure 7 – OTV in line to carry data VLANs + VLANs for Tenants Layer 3 communication

Deployment Considerations

The following sections present two use cases in which independent VXLAN EVPN fabrics are interconnected:

  • The first use case is a more traditional deployment in which each VXLAN EVPN fabric runs in Layer 2 mode only and uses a centralized external routing block for intersubnet communication. This scenario can be a requirement when the default gateway function is deployed on external devices such as firewalls. This use case is discussed in the section VXLAN EVPN Multi-Fabrics with External Routing Block (part 2).
  • The second use case deploys the distributed functions of anycast Layer 3 gateways across both sites, reducing hairpinning workflow across remote network fabrics. This enhanced scenario can address a variety of requirements:
    1. For business-continuance purposes, it is a common practice to allow transparent “hot” mobility of virtual machines from site to site without any service interruption. To reduce application latency, the same default gateways are active at both sites, reducing hairpinning across long distances for local routing purposes as well as for egress path optimization. Nonetheless, to reduce hairpinning for east-west workflows and to offer north-south traffic optimization, the same MAC and IP addresses for the default gateways must be replicated on all active routers at both sites. With the anycast Layer 3 gateway, the default gateway function is performed by all computing leaf nodes. This scenario is the main focus of this article and is discussed in greater detail in the section VXLAN EVPN Multi-fabric with Distributed Anycast Layer 3 Gateway (part 3).
    2. For disaster-recovery purposes and for operational cost containment, enterprises may want to move endpoints using a “cold” migration process only. Consequently, only the IP address of the gateway will be replicated to simplify operation management after machines have been moved (for example, to maintain the same IP address schema after the relocation of the servers). The MAC address of the default gateway can be different on each fabric: with a “cold” migration process, the endpoint sends an Address Resolution Protocol (ARP) request for its default gateway when it restarts and learns the new MAC address accordingly. This use case is not covered in this post.

35 – East-West Endpoint localization with LISP IGP Assist

East-West Communication Intra and Inter-sites

For the following scenario, subnets are stretched across multiple locations using a Layer 2 DCI solution. Several use cases require LAN extension between multiple sites, such as live migration, health-check probing for HA clusters (heartbeat), and operational cost containment such as the migration of mainframes. It is assumed that, due to the long distances between sites, the network services are duplicated and active on each site. This option allows the use of local network services such as default gateways, load balancers, and security engines distributed across each location, and helps reduce server-to-server communication latency (east-west workflows).

Traditionally, an IP address is a unique identifier assigned to a specific network entity such as a physical system, virtual machine, firewall, or default gateway. The routed WAN also uses this identifier to determine the entity’s location through its IP subnet. When a virtual machine migrates from one data center to another, the traditional IP addressing scheme retains the original identifier and location, although the physical location has actually changed. As a result, the extended VLAN must share the same subnet so that the TCP/IP parameters of the VM remain the same from site to site, which is necessary to maintain active sessions for migrated applications.

When deploying a logical data center architecture stretched across multiple locations with duplicated active network services, a couple of critical behaviors have to be taken into consideration. First, it is extremely important to maintain the same level of security, transparently and with no interruption, within and across sites, before, during, and after the live migration of any endpoint. Second, while machines follow a stateful migration process (hot live migration) from site to site, each security and network service that owns currently active sessions should not reject any session due to an asymmetric workflow. The round-trip data packet flow must always reach the stateful device that owns the current session; otherwise, the session is broken by the stateful device. To achieve the required symmetric workflow, it is fundamental that each IP endpoint is localized dynamically and that any movement of virtual machines is notified in real time to update all relevant routing tables.

The scenario discussed in the following lines covers security appliances in routed mode. However, the same behavior applies to any stateful device, such as SLB engines, physical appliances, or virtual services.

In the following use case, two physical data centers are interconnected using a Layer 2 DCI solution, offering a single logical data center from a multi-tiered application framework point of view. With OTV, it is possible to address intercontinental distances between the two sites in a simple and robust fashion. However, this solution is agnostic to the DCI transport (OTV, VPLS, PBB EVPN, MPLS EVPN, VXLAN EVPN) as long as it supports a method to filter the FHRP handshake between data centers.

The figure below depicts a generic deployment of multiple sites interconnected with a Layer 2 overlay network extended over the Layer 3 connection. The two data centers, tightly coupled by the Layer 2 extension, are organized to support multitenancy; in this example, Tenant Red and Tenant Green are detailed later.

Physical Architecture of DC Multi-sites with Stateful Devices

DC Multi-sites – High Level Physical Architecture

The goal is to utilize the required services from local stateful devices, reducing latency for server to server communication, in a multi-tenant environment, whilst traffic flows are kept symmetric from end-to-end.

In the following example, when host R1 from tenant Red communicates with host G1 from tenant Green, both located in DC-1, the traffic is inspected by the local FW in DC-1. On the other side, when host R2 communicates with host G2, the traffic is inspected by the local FW in DC-2.

DC Multi-sites – Localized E-W traffic

Although this behavior sounds obvious for independent data centers, it becomes a real concern with geographically dispersed data centers when the broadcast domains are extended across multiple sites with duplicate active stateful devices.

The following logical view depicts the issue. The traffic from VLAN 10 (subnet Red) destined to VLAN 20 (subnet Green) is routed via each local active firewall. As a consequence, when host R1 (VLAN 10) in DC-1 needs to communicate with host G2 (VLAN 20) in DC-2, its data packet is routed toward its local FW in DC-1, which in turn routes the traffic destined to G2 onto subnet Green, which is extended to the remote site. Layer 2 must therefore be extended for those subnets to reach the endpoints wherever they are located. The routed traffic is then forwarded toward host G2 across the extended subnet Green (Layer 2 DCI connectivity established across the OTV overlay). G2 receives the data packets and replies as expected to R1 using its local default gateway, which in turn routes the response destined to R1 (subnet Red) toward its local FW in DC-2. In this design, by default the local FW on each site receives the preferred path for the Red and Green subnets from its local fabric. Hence, routing between endpoints belonging to those two IP subnets is always kept local to a site. As a result, the workflow becomes asymmetric, and the FW in DC-2 terminates the current session for security reasons.
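
The asymmetry can be reproduced with a very small model. The Python sketch below assumes the default behavior described above (each host is routed by its local gateway toward its local firewall); the firewall labels FW-DC1 and FW-DC2 and the function names are illustrative only.

```python
# Location of each host in the example, and the firewall local to each site.
HOST_SITE = {"R1": "DC-1", "G1": "DC-1", "R2": "DC-2", "G2": "DC-2"}
LOCAL_FW = {"DC-1": "FW-DC1", "DC-2": "FW-DC2"}  # hypothetical labels


def firewall_for(sender):
    """Without host routes, inter-subnet traffic goes through the sender's local firewall."""
    return LOCAL_FW[HOST_SITE[sender]]


def is_symmetric(a, b):
    """A flow is symmetric when both directions cross the same stateful firewall."""
    return firewall_for(a) == firewall_for(b)


for pair in (("R1", "G1"), ("R2", "G2"), ("R1", "G2")):
    forward, ret = firewall_for(pair[0]), firewall_for(pair[1])
    state = "symmetric" if is_symmetric(*pair) else "ASYMMETRIC"
    print(f"{pair[0]} <-> {pair[1]}: forward via {forward}, return via {ret} ({state})")
```

Only the cross-site pair (R1 and G2) is asymmetric in this model: the request is inspected in DC-1 while the reply reaches a firewall in DC-2 that never saw the session being opened.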

Asymmetric flow with Stateful device not allowed

Two solutions based on host routing exist to provide the physical location of the IP endpoint, with and without Layer 2 segments extended between multiple locations. The concept relies on more specific routes (/32) propagated to multiple host databases (DC-1 and DC-2). Having a dynamic database of endpoints associated with their physical location makes it possible to redirect the traffic destined to IP machines of interest over a specific Layer 3 path.

 Host Route Injection

The following section describes a sub-function of LISP that consists of injecting host routes into the IGP in conjunction with subnets extended across distant locations; this is known as LISP IGP Assist ESM (Extended Subnet Mode). LISP IGP Assist is agnostic to the DCI technology deployed. However, because IGP Assist uses a multicast group to transport host route notifications from site to site, it is important that IP multicast traffic can be routed across the DCI network.

Note that the latter does not mean that the WAN must be IP multicast capable. It means that multicast data packets must be carried to all remote locations, hence the choice of OTV in this DCI design: OTV supports and optimizes multicast traffic using a head-end replication technique (aka OTV Adjacency Server) to transport multicast data packets over a non-multicast-capable WAN.

LISP IGP Assist can be leveraged to dynamically trigger an update in its Endpoint Identifier (EID) database each time a new machine that belongs to a selected subnet is detected. As soon as an endpoint is powered on or has moved to a new physical host, it is automatically detected by its default gateway (LISP First Hop Router function). The LISP process running on the relevant switch (the default gateway) registers the new host reachability information (host route) in its EID database with its location and notifies all remote First Hop Routers accordingly using a dedicated multicast group across the L2 DCI. Meanwhile, the FHR redistributes the host route into its IGP routing table, adding the /32 host route for each local EID.
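
The detection and notification sequence described above can be sketched as a toy model in Python. This is not LISP code: the class, its method names, and the address 10.10.0.11 are hypothetical, and the withdrawal of the stale /32 route on the previous site when a notification arrives is an assumption made for the sake of the model.

```python
class FirstHopRouter:
    """Toy model of a LISP First Hop Router with IGP Assist."""

    def __init__(self, site):
        self.site = site
        self.eid_db = {}         # EID (host IP) -> site where it was last detected
        self.igp_routes = set()  # /32 host routes redistributed into the local IGP
        self.peers = []          # remote FHRs reached through the multicast group

    def detect(self, host_ip):
        """A local endpoint was detected (for example through an ARP message)."""
        self.eid_db[host_ip] = self.site
        self.igp_routes.add(f"{host_ip}/32")
        for peer in self.peers:
            peer.notify(host_ip, self.site)

    def notify(self, host_ip, site):
        """A remote FHR reports the EID: record its location, drop any stale /32."""
        self.eid_db[host_ip] = site
        self.igp_routes.discard(f"{host_ip}/32")


fhr_dc1, fhr_dc2 = FirstHopRouter("DC-1"), FirstHopRouter("DC-2")
fhr_dc1.peers, fhr_dc2.peers = [fhr_dc2], [fhr_dc1]

fhr_dc1.detect("10.10.0.11")   # endpoint powered on in DC-1
fhr_dc2.detect("10.10.0.11")   # same endpoint after a live migration to DC-2
print(fhr_dc1.igp_routes, fhr_dc2.igp_routes, fhr_dc1.eid_db)
```

After the simulated move, only the DC-2 router advertises the /32 host route, while both EID databases agree that the endpoint is located in DC-2.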

Based on the above, the key concept is to propagate the host routes dynamically to the remote site using a dedicated Layer 3 DCI network. As a consequence, a more specific route (/32) is announced dynamically to attract the traffic to the EID of interest using a specific path. This L3 DCI connectivity is depicted in the next figure as the Secure L3 DCI connection for inter-site inter-VRF routed traffic.

  • R1 and G1 Host routes are propagated toward DC-2 over the Layer 3 DCI.
  • R2 and G2 Host routes are propagated toward DC-1 over the Layer 3 DCI.
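
To illustrate why a /32 host route attracts traffic away from the subnet route, here is a minimal longest-prefix-match sketch using Python's standard `ipaddress` module. The prefixes, addresses, and next-hop labels are hypothetical values chosen for the example, not taken from the test-bed.

```python
import ipaddress


def best_route(routes, destination):
    """Return (prefix, next_hop) of the most specific route matching `destination`."""
    dst = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dst in network:
            matches.append((network, next_hop))
    if not matches:
        return None
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


# Hypothetical view from DC-1: subnet Green is local, but G2 lives in DC-2.
routes_dc1 = [
    ("10.20.0.0/24", "local fabric"),       # extended subnet Green
    ("10.20.0.2/32", "L3 DCI toward DC-2"),  # host route for G2 learned from DC-2
]

print(best_route(routes_dc1, "10.20.0.1"))  # G1, local: matches only the /24
print(best_route(routes_dc1, "10.20.0.2"))  # G2, remote: the /32 wins
```

Because the /32 is more specific than the /24, traffic for G2 follows the Layer 3 DCI path toward the site where G2 actually resides, while traffic for local hosts keeps using the subnet route.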

The following diagram provides further details that will be used for the following test-bed.

Physical Architecture of DC Multi-sites with dedicated L3 DCI

Site to site Inter-VRF secure communication

To summarize the logical view above:

  • Hosts R1 and R2 belong to the same VLAN 10 within Tenant Red, while G1 and G2 share the same VLAN 20 that belongs to Tenant Green.
  • VLAN 10 and VLAN 20 are extended across the L2 DCI connection established using OTV.
  • R1 and G1 are located in DC-1, R2 and G2 are located in DC-2.
  • Communication within and between IP subnets belonging to a given DMZ or tenant happens freely, while inter-tenant packet flows must be enforced through the Firewall. As a result, Layer 3 segmentation (VRF) is applied between tenants to force the routed traffic through the Firewall. VLAN 10 belongs to VRF Red and VLAN 20 to VRF Green.
  • FHRP filtering is enabled in conjunction with OTV, so that the same SVI can be active on both sites.
  • The LISP First Hop Router (FHR) must be configured as the Default Gateway for all hosts of interest, as it uses ARP messages to trigger the host route notification for the EIDs.
  • An active Firewall on each site is used to secure the routed traffic between VRF Red and VRF Green.

It is assumed (but not discussed in this document) that each network service such as SVI, LISP, FW, OTV, etc. is fully redundant within each data center (usually Active/Standby mode per site).

We can consider three types of data flows:

  • Intra-Subnet communication via L2 DCI: Bridged traffic destined for a remote host within the same broadcast domain uses the Layer 2 DCI connectivity (OTV). Note that additional security services in transparent mode and/or encryption can be leveraged if needed to secure the Layer 2 DCI connectivity without impacting this scenario (not discussed in this post).

Intra-Subnet communication via L2 DCI


  • Local routed traffic inter-Tenant intra-DC: Each Firewall within a data center is used first of all to secure the local traffic between the two VRFs. Firewalls are configured using dynamic routing.

Local Inter-VRF communication


  • Routed traffic inter-Tenant inter-DC: Routed traffic destined for a remote machine that belongs to a different DMZ uses the Firewalls of both sites. The traffic is routed via a dedicated Layer 3 DCI connection (depicted in the following figure as the Secured L3 DCI link), preventing asymmetric traffic. Indeed, when receiving a host route notification update from the remote LISP FHR, the local gateway records the Endpoint concerned with a /32 entry. As a result, the route to reach that host becomes more specific, with the remote Firewall as the next hop. Consequently, the relevant inter-VRF traffic is transported across the dedicated Layer 3 DCI path toward the remote Firewall.

Site to site Inter-VRF communication across both FWs


The full configuration used for this testbed can be downloaded here:


OTV has been tested in Multicast mode and in Adjacency server mode (Head End Replication) carrying the multicast group used to exchange the LISP EID notification across sites. HSRP filtering is initiated in the OTV edge device.

interface Overlay1
 otv join-interface Ethernet1/10
 otv control-group
 otv data-group
 otv extend-vlan 10, 20
 no shutdown
otv-isis default
 vpn Overlay1
  redistribute filter route-map stop-HSRP
otv site-identifier 0001.0001.0001
interface Ethernet1/9
 description OTV Internal interface
 switchport mode trunk
 switchport trunk allowed vlan 10,20,210
interface Ethernet1/10
 description OTV Join Interface
 ip address
 ip ospf network point-to-point
 ip router ospf 1 area
 ip igmp version 3
 no shutdown

DC-1 Firewall

For the tests, ICMP (to check host reachability), and SSH (to validate that routed traffic remains symmetric from site to site) are permitted from any source to any destination.

DC-1 LISP IGP Assist

The minimum configuration required for LISP IGP Assist is quite simple.

The first step is to enable PIM to distribute the host route notification using a multicast group. All interfaces of interest must be configured with PIM sparse-mode. In this setup, a single RP is configured in DC-1 using the same loopback address used for the LISP locator.

interface loopback10
 description LISP Loopback
 ip address
 ip router ospf 1 area
 ip pim sparse-mode
ip pim rp-address group-list
ip pim ssm range

Next, a route-map is required to advertise the host routes toward the distant LISP FHR. The prefix-list shown here is “global” and applies to any subnet, but more specific prefixes can be used. The route-map is then applied to the LISP redistribution in the OSPF process of each VRF.

ip prefix-list HOST-ROUTES seq 5 permit eq 32
route-map ADV-HOST-ROUTES deny 5
 match interface Null0
route-map ADV-HOST-ROUTES permit 10
 match ip address prefix-list HOST-ROUTES
router ospf 100
 vrf GREEN
 redistribute lisp route-map ADV-HOST-ROUTES
 vrf RED
 redistribute lisp route-map ADV-HOST-ROUTES

Because this is a multi-tenant environment, LISP IGP Assist is configured under each relevant VRF. The notification of host routes is achieved using a dedicated multicast group 239.1.1.n per Tenant. A dedicated LISP Instance is also required per Tenant. The subnet used for VRF GREEN is added to the “local” mapping database (the locator is actually the loopback address in DC-1). The subnet used for VRF RED is added to the same “local” mapping database. In this testbed, the notifications of EID (host-route) are performed using the multicast group

vrf context GREEN
 ip pim ssm range
 ip lisp itr-etr
 lisp instance-id 2
 ip lisp locator-vrf default
 lisp dynamic-eid LISP_EXTENDED_SUBNET
  database-mapping priority 1 weight 50
  no route-export away-dyn-eid
vrf context RED
 ip pim ssm range
 ip lisp itr-etr
 lisp instance-id 1
 ip lisp locator-vrf default
 lisp dynamic-eid LISP_EXTENDED_SUBNET
  database-mapping priority 1 weight 50
  no route-export away-dyn-eid

All VLAN interfaces of interest must be configured to use the LISP dynamic EID process.

interface Vlan10
 no shutdown
 vrf member RED
 lisp mobility LISP_EXTENDED_SUBNET
 lisp extended-subnet-mode
 ip address
 ip ospf passive-interface
 ip pim sparse-mode
 hsrp 10
  priority 110
interface Vlan20
 no shutdown
 vrf member GREEN
 lisp mobility LISP_EXTENDED_SUBNET
 lisp extended-subnet-mode
 ip address
 ip ospf passive-interface
 ip pim sparse-mode
 hsrp 20
  priority 110

The above configuration is the minimum required for IGP Assist to detect Endpoints and dynamically inject their host routes into the upstream IGP routing table, offering a more specific route for remote Endpoints.


The same configuration is duplicated on each site with the relevant changes.

interface loopback20
  description LISP Loopback
  ip address
  ip router ospf 1 area
  ip pim sparse-mode
ip pim rp-address group-list
ip pim ssm range
ip prefix-list HOST-ROUTES seq 5 permit eq 32
route-map ADV-HOST-ROUTES deny 5
  match interface Null0
route-map ADV-HOST-ROUTES permit 10
  match ip address prefix-list HOST-ROUTES
vrf context GREEN
  ip pim ssm range
  ip lisp itr-etr
  lisp instance-id 2
  ip lisp locator-vrf default
  lisp dynamic-eid LISP_EXTENDED_SUBNET
    database-mapping priority 1 weight 50
    no route-export away-dyn-eid
vrf context RED
  ip pim ssm range
  ip lisp itr-etr
  lisp instance-id 1
  ip lisp locator-vrf default
  lisp dynamic-eid LISP_EXTENDED_SUBNET
    database-mapping priority 1 weight 50
    no route-export away-dyn-eid
interface Vlan10
  no shutdown
  vrf member RED
  lisp mobility LISP_EXTENDED_SUBNET
  lisp extended-subnet-mode
  ip address
  ip ospf passive-interface
  ip pim sparse-mode
  hsrp 10
    priority 120
interface Vlan20
  no shutdown
  vrf member GREEN
  lisp mobility LISP_EXTENDED_SUBNET
  lisp extended-subnet-mode
  ip address
  ip ospf passive-interface
  ip pim sparse-mode
  hsrp 20
    priority 120
router ospf 100
  vrf GREEN
    redistribute lisp route-map ADV-HOST-ROUTES
  vrf RED
    redistribute lisp route-map ADV-HOST-ROUTES
no system default switchport shutdown


Check the dynamic-eid database on each site

Note that the output is reduced to make it readable

The EID reachability information is given for each VRF:

  • R1 (
  • R2 (
  • G1 (
  • G2 (

Data Center 1 (Left)

LISP-IGP_DC-1# sho lisp dynamic-eid summary vrf RED
LISP Dynamic EID Summary for VRF "RED"
LISP-IGP_DC-1# sho lisp dynamic-eid summary vrf GREEN
LISP Dynamic EID Summary for VRF "GREEN"

Data Center 2 (Right)

LISP-IGP_DC-2# sho lisp dynamic-eid summary vrf RED
LISP Dynamic EID Summary for VRF "RED"
LISP-IGP_DC-2# sho lisp dynamic-eid summary vrf GREEN
LISP Dynamic EID Summary for VRF "GREEN"

Check the routing table (just keeping the relevant EID). From VRF RED, R1 is ( locally attached to VLAN  10. R2 ( , G1 ( & G2 ( are reachable via the next hop router (

Data Center 1 (Left)

LISP-IGP_DC-1# sho ip route vrf RED
IP Route Table for VRF "RED"
..., ubest/mbest: 1/0, attached 
 *via, Vlan10, [240/0], 00:31:42, lisp, dyn-eid, ubest/mbest: 1/0   
 *via, Vlan100, [110/1], 00:32:49, ospf-100, type-2
..., ubest/mbest: 1/0
 *via, Vlan100, [110/1], 00:31:40, ospf-100, type-2, ubest/mbest: 1/0
 *via, Vlan100, [110/1], 00:32:47, ospf-100, type-2

Firewall in DC-1

FW-DC-1# sho route
O E2 [110/1] via, 60:53:29, Inter-DMZ  <== via L3 DCI toward DC-2
O E2 [110/1] via, 60:52:56, RED          <== Local routed traffic

Firewall in DC-2

FW-DC-1# sho route
O E2 [110/1] via, 2d12h, Inter-DMZ  <== via L3 DCI toward DC
O E2 [110/1] via, 2d12h, RED     <== Local routed traffic

Data Center 1 (Left)

From VRF GREEN, G1 ( is locally attached to VLAN 20. R1 (, R2 ( & G2 ( are reachable via the next hop router (

LISP-IGP_DC-1# sho ip route vrf green
IP Route Table for VRF "GREEN"
..., ubest/mbest: 1/0
    *via, Vlan200, [110/1], 01:04:23, ospf-100, type-2, ubest/mbest: 1/0
    *via, Vlan200, [110/1], 01:05:30, ospf-100, type-2
..., ubest/mbest: 1/0, attached
    *via, Vlan20, [240/0], 01:04:22, lisp, dyn-eid, ubest/mbest: 1/0
    *via, Vlan200, [110/1], 01:05:28, ospf-100, type-2

Let’s perform a live migration, with R1 moving to DC-2 and R2 moving to DC-1. As soon as the migration is complete, the dynamic EID mapping database is updated accordingly.

Data Center 1 (Left)

LISP-IGP_DC-1# sho lisp dynamic-eid summary vrf red
LISP Dynamic EID Summary for VRF "RED"

Data Center 2 (Right)

LISP-IGP_DC-2# sho lisp dyn summary vrf red
LISP Dynamic EID Summary for VRF "RED"

Ping and SSH sessions inter-DC between the two VRFs continue to work with a sub-second interruption.

Configuration (continued)

As previously mentioned, the LISP IGP Assist set-up given above is the minimum configuration required to notify the EID dynamically using the multicast protocol across the L2 DCI links and redistribute the host route into the IGP routing table. As is, it already works like a charm, as long as the multicast group can reach the remote LISP mapping-database using the Layer 2 DCI extension.

It is optional, but recommended, to add an alternative mechanism that notifies the EID via a routing protocol established with a LISP Map-Server, in case the primary mechanism fails. If, for any reason, the Multicast transport or the L2 extension stops working, the Map-Server will notify the remote mapping database about the new EID using the routing protocol. This is actually the method used for IGP Assist in ASM (Across Subnet Mode, without any L2 extension), when no extended VLAN exists across data centers to carry the multicast for each VRF.

The Map-Resolver is responsible for receiving Map-Requests from remote ingress Tunnel Routers (iTRs) to retrieve the mapping between an Endpoint Identifier and its current location (Locator). In this specific scenario, there is no inbound path optimization, no eTR or iTR role, and no LISP encapsulation. Hence, only the Map-Server function is relevant for this solution, as a backup mechanism to trigger the Endpoint notifications. In the context of IGP Assist, the Map-Server is responsible for exchanging EID mappings between all other LISP devices. The mapping database (M-DB) can cohabit on the same device as the LISP FHR, or multiple mapping databases can be distributed over dedicated hosts. For the purpose of our test, the MS function runs on the same switch.

To configure the LISP map-server, follow the next global configuration on each LISP IGP router.

Data Center 1 (Left)

ip lisp itr-etr
ip lisp map-server
lisp site DATA_CENTER
  eid-prefix instance-id 1 accept-more-specifics
  eid-prefix instance-id 2 accept-more-specifics
  authentication-key 3 9125d59c18a9b015
ip lisp etr map-server key 3 9125d59c18a9b015
ip lisp etr map-server key 3 9125d59c18a9b015

Data Center 2 (Right)

ip lisp itr-etr
ip lisp map-server
lisp site DATA_CENTER
  eid-prefix instance-id 1 accept-more-specifics
  eid-prefix instance-id 2 accept-more-specifics
  authentication-key 3 9125d59c18a9b015
ip lisp etr map-server key 3 9125d59c18a9b015
ip lisp etr map-server key 3 9125d59c18a9b015


34 – VXLAN EVPN Q-in-VNI and EFP for Hosting Providers

Dear Network and DCI Experts!

While this post is a little outside the DCI focus, and assuming many of you already know Q-in-Q, the question is: are you familiar with Q-in-VNI yet? For those who are not, I think this topic is a good opportunity to present Q-in-VNI deployment with VXLAN EVPN for intra- and inter-VXLAN-based Fabrics, to understand its added value, and to see how one or multiple Client VLANs can be selected from the double encapsulation for further actions.

Although it’s not a hard rule per se, some readers already using Dot1Q tunneling may not necessarily fit into the following use-case. Nonetheless, I think it’s safe to say that most Q-in-Q deployments have been used by Hosting Providers to meet co-location requirements for multiple Clients; hence the choice of the Hosting Provider use-case elaborated in this article.

That said, what is elaborated in the following sections can address other requirements as well. To list just one more: after the acquisition of one or multiple enterprises or other business organisations, multiple DCs may be merged into a single infrastructure in order to reduce OPEX. This is another form of multi-tenancy that requires deploying the same level of Layer 2 and Layer 3 segmentation.

For many years now, Hosting Services have physically sheltered thousands of independent clients’ infrastructures within the Provider’s Data Centers. The Hosting Service Provider is responsible for supporting each and every Client’s data network in its shared network infrastructure. This must be achieved without changing any Tenant’s Layer 2 or Layer 3 parameter, and must also be done as quickly as possible. For many years, the co-location business for independent Tenants has been accomplished by using a dedicated physical Provider network for each Client’s infrastructure, which has been costly in terms of network equipment, operational deployment and maintenance, rack space, power consumption, and this method has also been rigid, with limited scalability for growth.

With the evolution of this business and virtualization, there is a strong need to offer additional services at lower cost with more agility for customers, and the requirements for this solution can be summarized as follows:

  • To seamlessly append any new Client network to the shared infrastructure owned by the Hosting Service Provider with the same requirement mentioned previously in terms of Layer 2 and Layer 3 segmentation.
  • To support multiple Clients sharing the same Top-of-Rack switches.
  • The above implies that the shared Provider’s infrastructure must be able to support duplicate Layer 2 VLAN identifiers across all the Tenants, including reuse of the same Layer 3 address space by all of them.
  • To provide the flexibility of transparently spreading each Tenant’s infrastructure throughout the Hosting Provider’s organization.
  • To allow for the seamless growth and evolution of each Tenant’s network, for the Client business as well as for the Service Provider production.
  • To offer new Service Clouds for each Client.

Figure 1: hosting service provider with service cloud

Figure 1 shows a logical view of different Client infrastructures spread across the Provider’s Network infrastructure with an extension of their respective network segments between local and geographically dispersed Pods or data centers. In addition to traditionally maintaining the Client network segmentation from end-to-end, each Tenant must be able to access both the private and global Service Cloud.

Essentially, besides managing their Tenant’s data network, Hosting Providers should be able to develop their business by proposing connectivity to a new and enhanced model of Service Cloud, to each Tenant. The Service Cloud provides new and exciting service tools for the hosted Clients such as a broad range of XaaS as well as “Network and Security as a Service” (NSaaS) for the Clients’ applications. However, the key technical challenge for the Hosting Provider is to be able to select one or multiple “private” segments from any isolated Client infrastructure in order to provide access to a dedicated private or shared public Service Cloud.

Note: “Network and Security as a Service”: This Cloud Service provides Application Security, Deep Inspection, Optimization, Offloading, and Analytics Services, just to list a few, for each and every Client’s set of multi-tier applications.

Hosting Service Requirements & Solutions

The shared Data Center Network infrastructure from the Hosting Provider must support VLAN overlapping as well as duplicated Layer 3 networks for any Client organization.

To address this requirement, as done for many years, the historical “double VLAN tag” encapsulation method is leveraged in conjunction with VXLAN EVPN as a Layer 3 transport. This Dot1Q tunnel encapsulation for the Layer 2 segmentation is named Q-in-VNI. Besides the Layer 2 overlay network transport, VXLAN EVPN is used for the multi-tenancy Layer 3 segmentation.

The ingress interface connecting the Client network is configured in access mode with a Backbone VLAN (B-VLAN) that encapsulates the original Client VLAN (C-VLAN). B-VLANs belong to the Provider network’s resources and are consumed accordingly. Q-in-VNI maps each Backbone VLAN to an L2 VNI transport which contains all the Client VLANs (C-VLANs) carried with the inner Dot1Q tag. As a consequence, the original C-VLANs can be retrieved at the egress VTEP, which transparently extends all Client Networks across different Top-of-Rack switches.
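The mapping described above can be sketched as a toy model. This is an illustration in Python, not switch behavior reproduced exactly: the class names, port names and VNI numbers are hypothetical, and the model ignores MAC learning and the VXLAN header format. It only shows that the B-VLAN selects the L2 VNI while the inner C-VLAN tag is carried untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedFrame:
    c_vlan: int   # Client VLAN carried in the inner Dot1Q tag
    payload: str


@dataclass(frozen=True)
class VxlanPacket:
    vni: int             # L2 VNI selected from the Backbone VLAN
    inner: TaggedFrame   # the C-VLAN tag travels untouched inside the VNI


class Vtep:
    """Toy VTEP: access ports sit in a B-VLAN, each B-VLAN maps to one L2 VNI."""

    def __init__(self, port_to_bvlan, bvlan_to_vni):
        self.port_to_bvlan = port_to_bvlan
        self.bvlan_to_vni = bvlan_to_vni
        self.vni_to_bvlan = {vni: bvlan for bvlan, vni in bvlan_to_vni.items()}

    def encapsulate(self, port, frame):
        b_vlan = self.port_to_bvlan[port]
        return VxlanPacket(self.bvlan_to_vni[b_vlan], frame)

    def decapsulate(self, packet):
        b_vlan = self.vni_to_bvlan[packet.vni]
        ports = sorted(p for p, b in self.port_to_bvlan.items() if b == b_vlan)
        return ports, packet.inner


# Hypothetical values: two Clients that both use C-VLAN 10.
ingress = Vtep({"Eth1/1": 1001, "Eth1/2": 1002}, {1001: 101001, 1002: 101002})
egress = Vtep({"Eth1/5": 1001, "Eth1/6": 1002}, {1001: 101001, 1002: 101002})

blue = ingress.encapsulate("Eth1/1", TaggedFrame(10, "blue data"))
orange = ingress.encapsulate("Eth1/2", TaggedFrame(10, "orange data"))

blue_ports, blue_frame = egress.decapsulate(blue)
orange_ports, orange_frame = egress.decapsulate(orange)
print(blue.vni, blue_ports, blue_frame.c_vlan)
print(orange.vni, orange_ports, orange_frame.c_vlan)
```

Both Clients keep C-VLAN 10 from end to end, yet their frames never meet because each B-VLAN lands in its own VNI.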

Figure 2: Q-in-VNI to transport client-vlans from end-to-end

As a consequence, the VXLAN Fabric is agnostic regarding the original Client VLANs. There is no need to configure the “private” C-VLANs as such. Only the B-VLANs mapped to a VXLAN tunnel network need to be created as shown in the sample configuration below.

Figure 3: Q-in-VNI sample configuration with vlan 1001 as the dot1q-tunnel vlan


VXLAN EVPN Fabric Multipod

Because all the “private” VLANs for each and every Client must be carried from end-to-end throughout the shared network connectivity in multi-site deployments, one option is to maintain the double tag encapsulation across different and distant locations over the Layer 3 network, without any hand-off between the VLANs.

Consequently, the first choice for network transport relies on VXLAN EVPN Multipod, discussed in post 32.

The VXLAN EVPN Multipod architecture pillars can be expressed as follows:

  • Encapsulation:
    • Maintains the Tenant encapsulation from Pod-to-Pod and from site-to-site.
    • No requirement for VLAN hand-off at any site boundary, which would break the Dot1Q tunneling (Figure 4).
  • Sturdiness
    • Validated for geographically dispersed Pods.
    • Offers a solid Layer 3-based underlay Fabric, including inter-site transport.
    • Contains the failure domain by reducing the amount of flooding frames across the whole Fabric (no Flood & Learn, ARP Suppress).
    • Efficient Endpoint learning with MP-BGP EVPN – Client host’s (MAC) reachability information is discovered and distributed among all Leaf Nodes of interest by the Control Plane.
    • Maintains Layer 2 and Layer 3 segmentation for each Tenant from end-to-end.
  • Flexibility
    • Multiple Client infrastructures share the same Top-of-Rack switch.
    • Mix of Q-in-Q transport, Dot1Q trunk and Access port within the same ToR switch. From each Top-of-Rack switch on to which client infrastructures are locally attached, multiple interfaces per Tenant organization can use different encapsulation types (Q-in-Q, Dot1Q) and modes (Trunk, Access, Routed).
    • Per Port VLAN translation for native Dot1Q client-VLAN.
    • Independent Client Layer 2 and Layer 3 network infrastructures (Physical, Virtual, Hybrid).
  • Scalability
    • Offers very high scalability, which is required by Service Providers.
    • Reuses duplicate VLAN IDs across all Tenants throughout the whole Fabric.
    • Transit inter-site devices are pure Layer 3 routers that do not require any Layer 2 encapsulation capabilities. Thus they do not dictate the amount of L2 encapsulated frames and tags that can be stretched across the remote sites (note, MTU must be increased accordingly).
    • Transit inter-site devices can also be leveraged to extend or support additional Tenants (Bud node support).

Figure 4: Q-in-VNI with VXLAN EVPN Multipod geographically dispersed

Figure 4 illustrates the requirement of Layer 2 segmentation and VLAN ID overlapping between Clients, across different Leaf nodes spread over multiple locations. The transit transport between Pods being pure Layer 3, the double encapsulation is performed at the edge of each Leaf Node.

Service Cloud Integration

As mentioned above, it is critical that the Hosting Provider offers other Service Clouds to its Clients. Although each Client network infrastructure remains fully isolated, access to the Service Cloud for each Tenant can be achieved using two different approaches.

Selective Client Service C-VLANs from Ingress Access Interface

The first option is an elementary method to provide external access so that each individual Client infrastructure can benefit from the Provider Services. As explained previously, in order to maintain the scalability and segmentation of each and every Client network, the key transport relies on Q-in-VNI (a per-Client Q-in-Q encapsulation mapped to an L2 VNI). As a result, due to the double tagging induced by the Dot1Q tunneling, Client VLANs cannot, as of today, be natively selected for any further treatment such as routing, bridging or security purposes. As described in Figure 2 and Figure 4, Client VLANs are fully isolated from the rest of the Provider’s infrastructure, which is, above all, the foremost expectation of each Client. As a consequence, for that particular requirement, each Tenant’s infrastructure consumes at least one physical interface per Top-of-Rack in order to be expanded to other racks or other Pods using Q-in-VNI.

Additionally, to offer access to the Service Cloud, one or multiple C-VLANs for each individual private Client network must be selected from a different source Interface, where the latter is locally connected. Consequently, a second physical access interface is allocated on the Top-of-Rack for one or more selective C-VLANs to be routed outside the Client’s organization.

Figure 5: Separated Interface for Hosting transport and Services

Figure 5 depicts an extreme scenario in which, firstly, each Client infrastructure uses the full range of isolated Dot1Q VLANs and, secondly, Client Blue and Client Orange have elected the same VLAN 10 to access a Service Cloud.

Note: Let me call these Client-VLANs which are used to access the Service Cloud, “Client Service C-VLANs” or “public C-VLANs”.

To keep it simple with this example, for each Client:

  • A first L2 VNI is initiated to transport the double-tagged frames (Q-in-VNI), to carry all the “private” L2 network segments of a particular Client within the VXLAN Fabric.
  • Another L2 VNI is leveraged to transport the “Client Service VLAN” (public network) used to access one of the Provider L3 services.
  • Finally, a L3 VNI is used to maintain the Layer 3 segmentation for that particular Client among the other Tenants.

Note that it is also possible to transport multiple “Client Service VLANs” with their respective L2 VNIs (1:1) for a specific Client infrastructure over the same ingress edge trunk interface of the concerned ToR. It is also critical to maintain the Layer 3 segmentation for each Client extended within the VXLAN Fabric (multiple L2 VNIs with multiple L3 VNIs per Client infrastructure). However, it is important to mention that for Clients to access the Provider’s Service Cloud, each Tenant will consume one Provider VLAN for their L2 segment and another for their L3 segment.

Consequently, this must be used with caution with regard to the consumption of the Provider’s VLAN resources.

VLAN Overlapping for “private” Client VLANs

To address the VLAN overlapping such that the 4k C-VLANs are spread across the Provider infrastructure, the Dot1Q VLAN tunnel used as the Backbone VLAN is unique (per ToR) for each Client (1001, 1002, 1003, etc.). Each B-VLAN is mapped to a dedicated and unique Layer 2 VNI which also carries, within the VXLAN encapsulation, the original Dot1Q tags of each and every Client segment.

Note: VLAN IDs are locally significant per Leaf node. Consequently, it is possible to reuse the same VLAN ID for a different Tenant as long as the Tenants are locally attached to distinct Top-of-Rack switches and these B-VLANs are mapped to different and unique VN IDs.

As discussed previously, for the Q-in-Q transport, the VXLAN Fabric is not aware of the existence of any C-VLAN per se. Hence there is no C-VLAN to configure on the Fabric, as illustrated in Figure 6, but the B-VLANs need to be configured. The original “private” C-VLANs are kept fully “hidden” from the rest of the network infrastructure.

However, the “public” C-VLANs intended to be routed to the Service Cloud must be created like any traditional access VLAN, as they are not part of the Q-in-Q tunneling encapsulation. As illustrated in the configuration sample below (Figure 6), Client VLANs 10, 20 and 30 are created to be used for further forwarding actions controlled by the Fabric itself.

Figure 6: VLAN configuration with mapping to L2 VNI as well as the L3 VNI

“Client Service VLANs” can now be routed like any other native Dot1Q VLAN mapped to an L2 VNI. All VXLAN Layer 2 and Layer 3 services are available for those Client segments.

  • Specific “public” C-VLANs can be routed to Layer 3 Service Clouds.
  • The Layer 3 Anycast Gateway feature can hence be leveraged to each Client for that specific “public” C-VLAN.
  • Layer 3 segmentation between Tenants is addressed using the traditional VRF-lite transported over a dedicated L3 VNI as shown in the configuration sample in Figure 4.

VLAN Overlapping for “public” Client VLANs

To address the VLAN overlapping for the “Client Service VLANs”, the “Per-port VLAN mapping” service is leveraged at the ingress port level to offer a one-to-one VLAN translation with a unique Provider VLAN.

Figure 7: Interface VLAN mapping per port VLAN Translation for native C-VLAN
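The one-to-one translation can be pictured as a per-port lookup table. The sketch below is an illustrative Python model only; the port names and Provider VLAN numbers are hypothetical, and the uniqueness check simply expresses the rule that each translated C-VLAN needs its own Provider VLAN.

```python
def translate(port_maps, port, c_vlan):
    """Return the Provider VLAN for a C-VLAN arriving on a given port."""
    return port_maps[port][c_vlan]


def check_unique(port_maps):
    """Return the number of Provider VLANs used; reject any reuse."""
    seen = {}
    for port, mapping in port_maps.items():
        for c_vlan, p_vlan in mapping.items():
            if p_vlan in seen:
                raise ValueError(
                    f"Provider VLAN {p_vlan} reused by {seen[p_vlan]} and {(port, c_vlan)}"
                )
            seen[p_vlan] = (port, c_vlan)
    return len(seen)


# Hypothetical example: two Clients on the same ToR both present C-VLAN 10.
port_maps = {
    "Eth1/3": {10: 3010},   # Client Blue
    "Eth1/4": {10: 3020},   # Client Orange
}

blue_vlan = translate(port_maps, "Eth1/3", 10)
orange_vlan = translate(port_maps, "Eth1/4", 10)
consumed = check_unique(port_maps)
print(blue_vlan, orange_vlan, consumed)
```

Every translated C-VLAN consumes one Provider VLAN, which is why the count returned by `check_unique` is the figure to watch as the number of Clients grows.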

As a result, each “public” C-VLAN (original or translated) can be mapped to its respective Layer 2 VNI, eliminating the risk of any VLAN ID overlapping with another Client’s infrastructure. Since it’s a one-to-one translation between the original C-VLAN and a unique Provider VLAN, this feature, although very helpful, must be considered only for a limited number of C-VLANs (4k VLANs per shared Bridge Domain). In the context of Service Providers in general, with hundreds or thousands of clients, this per-port VLAN mapping solution alone may not be scalable enough for a large number of originated Client VLANs. Hence the need for the parallel transport using Q-in-VNI, leveraged for all the other isolated “private” C-VLANs that do not require any Services from the Hosting Provider, as discussed previously.

Consequently, from each set of “private” C-VLANs, each Client must provide at least one C-VLAN ID for accessing the Hosting Provider network. The “Client Service VLANs” can then be routed within the Fabric and can, above all, benefit from the dedicated or shared Layer 3 Service Cloud.

Figure 8: Tenant connections to Service-Cloud

Client Access to the Provider Service Cloud

To route the “Client Service VLANs” to a Service Cloud, the corresponding VNIs are extended and terminated at a pair of Border Leaf Nodes that provide direct connectivity with the Service Cloud gateway. On this physical interface connecting the Cloud network, a logical sub-interface is created for each Tenant as depicted in Figure 8.

Each sub-interface is configured with a unique Dot1Q encapsulation established with its peer network. The Dot1Q tag used in this configuration is not directly related to any encapsulated tag from the Access Interfaces (Client VLAN attachment). However, the same Dot1Q tag is used on the opposite side (Provider Service Cloud Gateway) in order to provide the L3 segmentation peering between each Client network (Tenant) and its respective private or shared Service Cloud (Figure 9 & 10).

VXLAN Border Leaf Sub-interfaces

Accordingly, the first solution described above requires a second physical interface per Client to be configured for the selective C-VLANs to be routed. As a result, this solution doubles the number of physical access interfaces consumed from the Provider infrastructure, which may have an impact on operational expenses. In addition, it is important to remember that, from a scalability point of view, additional Provider VLANs are used to route these Client VLANs outside their private infrastructure, and Provider VLANs are not unlimited in number.

Selective Client Service C-VLANs using Ethernet Flow Point (EFP)

To keep the consumption of access interfaces for locally attached Client infrastructures as low as possible, a more sophisticated method is used in conjunction with the Cisco ASR9000 or the Cisco NCS5000, deployed as the selective access gateway for accessing the Service Cloud. Thanks to the Ethernet Virtual Connection (EVC) and Ethernet Flow Point (EFP) features, the selection of the Client Service VLANs concerned is centralized and performed at a single interface (the inbound interface of the cloud service provider router), which makes configuration much easier. Obviously, the same WAN Edge Router used to access the Provider cloud, or for WAN connectivity in general, can be used for this selection purpose. It doesn’t have to be dedicated.

Note: Already quickly described a while back in post 10

10 – Ethernet Virtual Connection (EVC)

An Ethernet Virtual Connection (EVC) is a Cisco carrier Ethernet equipment function dedicated to Service Providers and large enterprises. It provides a very fine granularity to select and treat the inbound workflows known as service instances, under the same or different ports, based on flexible frame matching.

 The EFP is a Layer 2 logical sub-interface used to classify traffic under a physical interface or a bundle of interfaces. It represents a logical demarcation point of an Ethernet Virtual Connection (EVC) on a particular interface. It exists as a Flow Point on each interface, through which the EVC passes.

With the EFP, it is therefore possible to perform a variety of operations on ingress traffic flows, such as routing, bridging or tunneling the traffic, by matching on a mixture of VLAN IDs, single or double (Q-in-Q) encapsulation, and Ether-types.

Figure 11: Ethernet Flow Point (EFP) for Selective B-VLAN, C-VLAN toward L3 Service (VRF)

In our use-case, the EFP serves to identify the double tag, associating the Backbone VLAN (outer VLAN tag) with the Client-VLAN (inner VLAN tag) of choice for a particular Tenant, in order to route the identified frames into a new segmented Layer 3 network.

Figure 11 represents the physical ingress Layer 2 interface of the ASR9000 or NCS5000 receiving the double tag frames from one of the VXLAN Border Leaf Nodes. Each double tag is identified by the Backbone VLAN and one Client VLAN. The C-VLAN is selected and added to a dedicated Bridge Domain for enhanced forwarding actions. In addition to other tasks such as bridging back the original Layer 2 to a new Layer 2 segment, to address our particular need, a routed interface can be added to the relevant Bridge Domain, offering the L3 VPN network to that particular Tenant for accessing their Service Cloud (Tenant 1 & 3). However, if the requirement is a simple Layer 3 service, the routed interface can be directly tied to the sub-interface (Tenant 2).
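To make the double tag more concrete, here is a minimal Python sketch of how the two VLAN IDs can be read from a double-tagged Ethernet frame. It only illustrates the frame layout (two 4-byte tags after the MAC addresses) and is not a description of what the ASR9000 or NCS5000 does internally. The function name `parse_double_tag` and the sample VLAN values are assumptions made for the example, and the sketch accepts either 0x8100 or 0x88A8 as the outer TPID.

```python
import struct

# TPID values accepted for the outer (Backbone) tag in this sketch.
OUTER_TPIDS = {0x8100, 0x88A8}
INNER_TPID = 0x8100


def parse_double_tag(frame: bytes):
    """Return (outer_vlan, inner_vlan) from a double-tagged Ethernet frame."""
    # 6 bytes dst MAC + 6 bytes src MAC + 2 x 4-byte tags + 2 bytes EtherType
    if len(frame) < 22:
        raise ValueError("frame too short to carry two VLAN tags")
    outer_tpid, outer_tci, inner_tpid, inner_tci = struct.unpack("!HHHH", frame[12:20])
    if outer_tpid not in OUTER_TPIDS or inner_tpid != INNER_TPID:
        raise ValueError("frame is not double-tagged")
    # The VLAN ID is the low 12 bits of each Tag Control Information field.
    return outer_tci & 0x0FFF, inner_tci & 0x0FFF


# Hypothetical frame: B-VLAN 100 (outer), C-VLAN 10 (inner), IPv4 payload.
sample = bytes(12) + struct.pack("!HHHH", 0x8100, 100, 0x8100, 10) + b"\x08\x00"
print(parse_double_tag(sample))
```

The pair returned here is the [outer-VLAN, inner-VLAN] key on which the flow point selection is based.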

Figure 12: Client Infrastructure access simplified

From the point of attachment of the Client infrastructure to the Provider network, the relevant Access Interfaces from the Top-of-Rack switch are simplified to a single logical ingress approach as shown in Figure 12. The selection of the Client VLANs to be routed is completed outside the VXLAN domain, at the Service Cloud gateway that connects to the Border Leaf Nodes, which makes it a centralized process (Figure 14). This significantly simplifies the physical configuration on each Client network’s end.

The function of the Ethernet Flow Point offers flexible VLAN matching. When the traffic coming from the Fabric hits the Layer 2 interface transport of the router, the TCAM for that port is used to find out which particular sub-interface the filters match with [outer-VLAN, inner-VLAN], in order to bind, for example, the selected ingress traffic to a dedicated L3 VPN network.
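The matching logic can be pictured with a small Python sketch. This is only a model of the idea, assuming that the most specific [outer-VLAN, inner-VLAN] match wins over a wildcard inner tag; the class `FlowPoint`, the function `classify`, the VLAN numbers and the action strings are all hypothetical names chosen for the example, not a platform API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FlowPoint:
    name: str
    outer_vlan: int
    inner_vlan: Optional[int]  # None means "any" inner VLAN
    action: str                # e.g. a VRF or a Bridge Domain (illustrative strings)


def classify(flow_points: List[FlowPoint], outer_vlan: int, inner_vlan: int) -> Optional[FlowPoint]:
    """Return the flow point matching [outer, inner]; exact matches win over wildcards."""
    for fp in flow_points:
        if fp.outer_vlan == outer_vlan and fp.inner_vlan == inner_vlan:
            return fp
    for fp in flow_points:
        if fp.outer_vlan == outer_vlan and fp.inner_vlan is None:
            return fp
    return None  # no sub-interface matches this double tag


flow_points = [
    FlowPoint("tenant1-web", outer_vlan=100, inner_vlan=10, action="vrf:TENANT-1"),
    FlowPoint("tenant1-any", outer_vlan=100, inner_vlan=None, action="bridge-domain:BD-100"),
]

print(classify(flow_points, 100, 10).action)  # exact [100, 10] match
print(classify(flow_points, 100, 20).action)  # falls back to the wildcard entry
```

On the real platforms this lookup is done in hardware (TCAM) per port; the sketch only shows how a double tag ends up bound to a given L3 or bridging action.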

Two approaches are possible. The first one is quite simple, where the selected C-VLAN is directly routed to the VRF of interest.

Figure 13: Connecting a private C-Client to a private Layer 3 Service VRF or L3 VPN ASR9K and NCS5K

Figure 14: EFP used for association of Double TAG with selection of C-VLAN bound to L3 Services (NCS5k)

Figure 15: EFP Sample Configuration for Routing Service

Figure 14 depicts the logical view of the ingress interface of the gateway (ASR9000 or NCS5000) connecting each selected C-VLAN directly to a Layer 3 network. As a result, each Client can connect their private network infrastructure to one of the Layer 3 Service Clouds offered by the Provider. VRF and L3 VPN are supported by both platforms, ASR9000 and NCS5000.

The second option uses the Bridge Domain to bind the selected network, allowing additional actions for the same data packet.

Figure 16: Connecting a Private C-VLAN to a private L3 Service using a Bridge Domain (BD)

Figure 17: EFP for association of double TAG with selection of C-VLAN bound to a Bridge Domain and/or L3 service (ASR9k)

Figure 17 depicts the logical view of the ingress interface of the gateway (ASR9000) connecting each selected C-VLAN with a Bridge Domain for access to a Layer 3 network. As a result, each Client can connect their private network infrastructure to one of the Layer 3 Service Clouds offered by the Provider.

The key added value with the Bridge Domain attachment is that additional actions can be applied to any Client’s VLAN [B-VLAN, C-VLAN], such as Layer 2 VPN services (VPLS, Dot1Q, Q-in-Q, PBB-EVPN).

Figure 18: Layer 2 to BVI Sample Configuration (ASR9000)

Beyond the Layer 3 Services

As mentioned previously, the added value of using the Bridge Domain is the ability to extend the actions beyond L3 and L3 VPN to an L2 or L2 VPN transport.

For example, the same double-tagged frame can be re-encapsulated using another overlay transport such as PBB-EVPN (Figure 20) or a Hierarchical VPLS deployment. The filter can be applied for “any” inner-VLAN tag as a global match for further action. As a result, the same Client infrastructure can be maintained in isolation across multiple sites using a hierarchical DCI solution, without impacting the Client VLANs. Each set of Client VLANs can be directly bound to another double tag encapsulation method such as PBB-EVPN or Hierarchical VPLS while still supporting very high scalability. This can become a nice alternative to a VXLAN Multipod by keeping each network infrastructure fully independent from a VXLAN data plane and control plane point of view.

Figure 19: Logical View Extending the C-VLAN to L2-VPN and L3-VPN via the Bridge Domain

Figure 20: Logical View of Tenant infrastructure extended across PBB-EVPN per-Tenant Service-Cloud

Network and Security as a Service Cloud

In addition to private access to a particular Service Cloud, it is also possible to offer additional “Network and Security as a Service (NSaaS)” for each Client’s host of multi-tier applications.

This feature offers a very granular and flexible solution. For example, as shown in Figure 21, multiple C-VLANs supporting a multi-tier application (e.g. VLAN 10 = WEB, VLAN 20 = APP, VLAN 30 = DB) are each bound to a specific Bridge Domain. Each VLAN is then sent to a particular Network Service node running in routed mode or transparent mode, to be treated according to the application requirements (Firewalling, Load balancing, IPS, SSL offload, etc.).

Figure 21: Providing Network and Security Services From the Provider Resources


As of today, Q-in-VNI support with VXLAN EVPN Fabrics is one of the best and easiest methods to transport multiple isolated Client infrastructures within a data center across the same physical infrastructure. In association with the Ethernet Virtual Circuit and the Ethernet Flow Point features of the Cisco Service Provider platforms (ASR9000 & NCS5000), each and every Client infrastructure can be selectively extended to the Service Clouds offered by the Hosting Providers.

With the binding of those private VLANs to a Bridge Domain, each Client infrastructure can be extended using Layer 2 VPN and/or Layer 3 VPN, maintaining the same level of segmentation across multiple sites. Last but not least, the EFP feature can also be leveraged by the Hosting Provider to offer additional Network and Security as a Service for any of the Client’s applications.

33 – Cisco ACI Multipod

Since release 2.0, Multipod for ACI enables provisioning a more fault-tolerant fabric composed of multiple pods with isolated control plane protocols. Multipod also provides more flexibility with regard to the full-mesh cabling between leaf and spine switches. When leaf switches are spread across different floors or different buildings, Multipod enables provisioning multiple pods per floor or building and providing connectivity between pods through spine switches.

A new White Paper on ACI Multipod is now available


32 – VXLAN Multipod stretched across geographically dispersed datacenters

VXLAN Multipod geographically dispersed

VXLAN Multipod Overview

This article focuses on the single VXLAN Multipod Fabric stretched across multiple locations, presented as the 1st option in the previous post (31).

Together with my friends Patrice and Max, we have spent a couple of months building an efficient and resilient solution based on a VXLAN Multipod fabric stretched across two sites. The full technical white paper is now available for deeper technical details, including the insertion of firewalls, multi-tenancy VXLAN routing and some additional tuning.

One of the key use cases for that scenario is an enterprise selecting VXLAN EVPN as the technology of choice for building multiple greenfield Fabric PoDs. It therefore becomes logical to extend the VXLAN overlay between distant PoDs that are managed and operated as a single administrative domain. Its deployment is just a continuation of the work performed to roll out the fabric PoDs, simplifying the provisioning of end-to-end Layer 2 and Layer 3 connectivity.

Technically speaking, and thanks to the flexibility of VXLAN, we could deploy the overlay network on top of almost any form of Layer 3 architecture within a datacenter (CLOS, Multi-layered, Flat, Hub&Spoke, Ring, etc.), as long as VTEP-to-VTEP communication is always available. Obviously, a CLOS (Spine & Leaf) architecture is the most efficient.

  • [Q] However, can we afford to stretch the VXLAN fabric across distant locations as a single fabric, without taking into consideration the risk of losing all resources in case of a failure?
  • [A] The answer mainly relies on the stability and quality of the fiber links established between the distant sites. In that context, it is worth remembering that most DWDM managed services offer an SLA equal to or lower than 99.7%. Another important element to take into consideration is how independent the control plane can be from site to site, reducing the domino effect caused by a breakdown in one PoD.
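To put the 99.7% figure in perspective, the following short Python sketch converts an availability SLA into the downtime it tolerates. The helper name `max_downtime_hours` is hypothetical, and the calculation assumes a 365-day year and that the SLA is measured over the whole period; real contracts may define the measurement window differently.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed over the period while still meeting the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("SLA must be a percentage between 0 and 100")
    return (1 - sla_percent / 100) * period_hours


# A 99.7% SLA tolerates roughly 26 hours of outage per year.
print(round(max_downtime_hours(99.7), 2))
```

More than a full day of tolerated outage per year on the inter-site link is a good reason to keep the PoDs as independent as possible.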

Anyhow, we need to understand how to protect the whole end-to-end Multipod fabric to avoid propagating any disastrous situation in case of a major failure.

Different models are available for extending multiple PoDs across a single logical VXLAN fabric. Each option depends on how the Data Plane and Routing Control Plane are stretched or distributed across the different sites to separate the routing functions, as well as on the placement of the transit Layer 3 nodes used to interconnect the different PoDs.

From a physical point of view, the transit function connecting the WAN can be provided by dedicated Layer 3 devices, by the existing leaf devices, or by the spine layer. Diverse approaches exist to disperse the VXLAN EVPN PoDs across long distances.

There is no single approach to geographically dispersing the fabric PoDs among different locations. The choice should be made according to the physical infrastructure (including distances and physical transport technology), the enterprise business requirements and service level agreements, and the application criticality levels, weighing the pros and cons of each design.

VXLAN EVPN Multipod Stretched Fabric

What we should take into consideration is the following (non-exhaustive list, though):

  • For VXLAN Multipod purposes, it is important that the fabric learns the endpoints from both sites using a control plane instead of the basic flood & learn.
  • The underlay network, intra-fabric and inter-fabric, is a pure Layer 3 network used to carry the overlay tunnels transparently.
  • The connectivity between sites is initiated from a pair of transit Layer 3 nodes. The transit device used to interconnect the distant fabrics is a pure Layer 3 routing device.
  • This transit Layer 3 edge function can also be initiated from a pair of leaf nodes used to attach endpoints, or from a pair of border spine nodes used for Layer 3 purposes only.
  • Actually, the number of transit devices used for the Layer 3 interconnection is not limited to a pair of devices.
  • The VXLAN Multipod solution recommends interconnecting the distant fabrics using direct fiber or DWDM (Metropolitan-area distances). Technically, nothing prevents the use of inter-continental distances (Layer 3), although this is not recommended: the VXLAN data plane and the EVPN control plane are not sensitive to latency, but the applications are.
  • For geographically dispersed PoDs, it is recommended that the control planes (underlay and overlay) are deployed in a more structured fashion in order to keep them as independent as possible at each location.
  • Redundant and resilient MAN/WAN access with fast convergence is a critical requirement. Network services such as DWDM in protected mode with remote port shutdown, as well as BFD, should therefore be enabled.
  • Storm Control on the geographical links is strongly recommended to reduce the fate sharing between PoDs.
  • BPDU Guard must be enabled by default on the edge interfaces to protect against Layer 2 backdoor scenarios.

A quick reminder of what the VXLAN Tunnel Endpoint is:

VXLAN uses VTEP (VXLAN tunnel endpoint) services to map a particular dot1Q Layer 2 frame (VLAN ID) to a VXLAN segment (VNI). Each VTEP is connected on one side to the classic Ethernet segment where endpoints are deployed, and on the other side to the underlay Layer 3 network (Fabric) to establish the VXLAN tunnels with other remote VTEPs.

The VTEP performs the following two tasks:

  • Receives traffic from locally connected endpoints and encapsulates it into VXLAN packets destined for remote VTEP nodes
  • Receives VXLAN traffic originating from remote VTEP nodes, decapsulates it, and forwards it to locally connected endpoints

The VTEP dynamically learns the destination information for VXLAN encapsulated traffic for remote endpoints connected to the fabric by using the MP-BGP EVPN control protocol.

With the information received from the MP-BGP EVPN control plane populating the local forwarding tables, the ingress VTEP, where the source endpoint is attached, encapsulates the original Ethernet traffic with a VXLAN header and sends it over a Layer 3 network toward the egress VTEP, where the destination endpoint sits. The latter then decapsulates the Layer 3 packet to present the original Layer 2 frame to its final destination endpoint.
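As an illustration of the two VTEP tasks, here is a minimal Python sketch that builds and strips the 8-byte VXLAN header defined in RFC 7348 (one flags byte with the "I" bit set, a 24-bit VNI, the rest reserved). It deliberately leaves out the outer Ethernet, IP and UDP headers (VXLAN uses UDP destination port 4789) and any control-plane lookup. The VLAN-to-VNI table and the function names are assumptions made for the example.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid

# Hypothetical VLAN-to-VNI mapping configured on the VTEP.
VLAN_TO_VNI = {10: 30010, 20: 30020}


def encapsulate(vlan_id: int, frame: bytes) -> bytes:
    """Prepend the VXLAN header for the VNI mapped to the given VLAN."""
    vni = VLAN_TO_VNI[vlan_id]
    # Word 0: flags (8 bits) + reserved (24 bits)
    # Word 1: VNI (24 bits) + reserved (8 bits)
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + frame


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Return (vni, original_frame) from a VXLAN payload."""
    if len(payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    word0, word1 = struct.unpack("!II", payload[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return word1 >> 8, payload[8:]


packet = encapsulate(10, b"original L2 frame")
print(decapsulate(packet))
```

In a real fabric the egress VTEP would then map the VNI back to its local VLAN before delivering the frame to the endpoint.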

From a data-plane perspective, the VXLAN Multipod system behaves as a single VXLAN fabric network. VXLAN encapsulation is performed end-to-end across PoDs and from leaf nodes. The spines and the transit leaf switches perform only Layer 3 routing of VXLAN encapsulated frames to help ensure proper delivery to the destination VTEP.

VXLAN VTEP-to-VTEP VXLAN Encapsulation

Multiple Designs for Multipod deployments

PoDs are usually connected with point-to-point fiber links when they are deployed within the same physical location. The interconnection of PoDs dispersed across different locations uses direct dark-fiber connections (short distances) or DWDM to extend Layer 2 and Layer 3 connectivity end-to-end across locations (Metro to long distances).

In fact, there is no real constraint on the physical deployment of VXLAN per se, as it is an overlay virtual network running on top of a Layer 3 underlay, as long as the egress VTEPs are all reachable.

As a result, different topologies can be deployed (or even coexist) for the VXLAN Multipod design.

Without providing an exhaustive list of all possible designs, the 1st main option is to attach dedicated transit Layer 3 nodes to the Spine devices. They are usually called Transit leaf nodes as they often sit at the same Leaf layer. They are also often used as computing leaf nodes (where endpoints are locally attached) or as service leaf nodes (where firewalls, security and network services are locally attached), in addition to extending the Layer 3 underlay connectivity toward the remote sites.

VXLAN Multipod Design Interconnecting Leaf Nodes

Alternatively, nothing prevents interconnecting distant PoDs through the spine layer devices, as no additional function or VTEP is required from the transit spine other than the capability to route VXLAN traffic between leaf switches deployed in separate PoDs. A Layer 3 core layer can also be leveraged for interconnecting remote PoDs.

VXLAN Multipod Design Interconnecting Spine Nodes

The VXLAN Multipod design is usually positioned to interconnect data center fabrics that are located at metropolitan-area distances, even though technically, there is no distance limit. Nevertheless, it is crucial that the quality of the DWDM links is optimal. Consequently, it is necessary to lease the optical services from the provider in protected mode and with the remote port shutdown feature, allowing immediate detection of optical link down from end-to-end. Since the network underlay is layer 3 and can be extended across long distances, it may be worth checking the continuity of the VXLAN tunnel established between PoDs. BFD can be leveraged for that specific purpose.

Independent Control Planes

When deploying VXLAN EVPN we need to consider 3 independent layers:

  • The layer 2 overlay data plane
  • The layer 3 underlay control plane
  • The layer 3 overlay control plane

Data Plane: The overlay data plane is the extension of the layer 2 segments established on the top of the layer 3 underlay. It is initiated from VTEP to VTEP, usually enabled at the leaf node layer. In the context of geographically dispersed Multipod, VXLAN tunnels are established between leaf devices belonging to separate sites and within each site as if it was one single large fabric (hence the naming convention of Multipod).

Underlay Control Plane: The underlay control plane is used to exchange reachability information for the VTEP IP addresses. In the most common VXLAN deployments, an IGP (OSPF, IS-IS, or EIGRP) is used for that purpose, but BGP can also be an option. The choice of underlay protocol mainly depends on the enterprise’s experience and requirements.

Overlay Control Plane: MP-BGP EVPN is the control plane used in VXLAN deployments for the host reachability and distribution information across the whole Multipod fabric. It is used by the VTEP devices to exchange tenant-specific routing information for endpoints connected to the VXLAN EVPN fabric and to distribute inside the fabric IP prefixes representing external networks. In the context of Multipod dispersed across different rooms (same building or same campus), host reachability information is exchanged between VTEP devices belonging to separate locations as if it was one single fabric.

Classical VXLAN Multipod deployment across different rooms within the same building

The design above represents the extension of the 3 main components required to build a VXLAN EVPN Multipod fabric between different rooms. This is a basic architecture using direct fibers between two pairs of transit leaf nodes deployed in each physical Fabric (PoD-1 & PoD-2). The whole architecture offers a single stretched VXLAN EVPN fabric with the same data plane and control plane deployed end-to-end. The remote host population is discovered and updated dynamically on each PoD to form a single host reachability information database.

While that design can be considered a safe network architecture when deployed within the same building or even within the same campus, for long distances (Metropolitan-area distances) the crucial priority must be to improve the sturdiness of the whole solution by creating independent control planes.

The approach below shows that the data plane is still stretched across remote VTEPs. The VXLAN data plane encapsulation extends the virtual overlay network end-to-end. This implies the creation of VXLAN ‘tunnels’ between VTEP devices that sit in separate sites.

However, the underlay and overlay control planes are independent.

Resilient VXLAN Multipod deployment for Geographically Dispersed Locations


Underlay Control Plane: Traditionally in enterprise datacenters, OSPF is the IGP of choice for the underlay network. In the context of geographically distant PoDs, each PoD is organised in a separate IGP area, with Area 0 typically deployed for the interpod links. That helps reduce the effects of interpod link bouncing, usually due to a “lower” quality of inter-site connectivity compared to direct fiber links.

However, in order to improve the underlay separation between the different PoDs, BGP can also be considered as an alternative to IGP for the transit routing Inter-sites links.

Overlay Control Plane: In regard to the MP-BGP EVPN overlay control plane, the recommendation is to deploy each PoD in a separate MP-iBGP autonomous system (AS) interconnected through MP-eBGP sessions. When compared to the single Autonomous System model, using MP-eBGP EVPN sessions between distant PoDs simplifies interpod protocol peering.
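The separation of control planes described above can be summarised with a small Python model. This is only a planning aid sketched under assumptions: the `Pod` class, the helper names, the private AS numbers (65001, 65002) and the area numbers are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class Pod:
    name: str
    bgp_asn: int    # autonomous system used for MP-BGP EVPN inside the PoD
    ospf_area: int  # underlay IGP area of the PoD (Area 0 is kept for interpod links)


def evpn_session_type(a: Pod, b: Pod) -> str:
    """Sessions between different AS numbers are eBGP, otherwise iBGP."""
    return "MP-iBGP" if a.bgp_asn == b.bgp_asn else "MP-eBGP"


def check_independent_control_planes(pods: List[Pod]) -> List[str]:
    """Return a list of deviations from the recommended design (empty if none)."""
    problems = []
    for pod in pods:
        if pod.ospf_area == 0:
            problems.append(f"{pod.name}: Area 0 should be kept for the interpod links")
    for a, b in combinations(pods, 2):
        if a.ospf_area == b.ospf_area:
            problems.append(f"{a.name}/{b.name}: PoDs share the same IGP area")
        if evpn_session_type(a, b) != "MP-eBGP":
            problems.append(f"{a.name}/{b.name}: PoDs share the same BGP AS")
    return problems


pods = [Pod("PoD-1", bgp_asn=65001, ospf_area=1), Pod("PoD-2", bgp_asn=65002, ospf_area=2)]
print(evpn_session_type(pods[0], pods[1]))
print(check_independent_control_planes(pods))
```

With one AS and one area per PoD, the only overlay relationship crossing the interpod links is the MP-eBGP EVPN peering.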

Improving simplicity and resiliency

In a typical VXLAN fabric, leaf nodes connect to all the spine nodes, but no leaf-to-leaf or spine-to-spine connections are usually required. The only exception is the vPC peer link connection required between two leaf nodes configured as part of a common vPC domain. When Leaf devices are paired using vPC, this allows the locally attached endpoints to be dual-homed and take advantage of the Anycast VTEP functionality (same VTEP address used by both leaf nodes part of the same vPC domain).

vPC Anycast VTEP

It is not required to dedicate routing devices to the Layer 3 transit function between sites. One of the pairs of leaf nodes (usually named “Border leaf nodes” for their role connecting the fabric to the outside) can be leveraged for the function of “transit leaf nodes”, simplifying the deployment. This coexistence can also be enabled on any pair of computing or service leaf nodes. It doesn’t require any specific function dedicated to the transit role, except that the hardware must support the Bud Node design (well supported on the Nexus series).

vPC Border Leaf Nodes as Transit Leaf Nodes

The vPC configuration should follow the common vPC best practices. However, it is also important to enable the “peer-gateway” functionality. In addition, depending on the deployment scenario, the topology, and the number of devices and subnets, additional IGP tuning might become necessary to improve convergence. This relies on standard IGP configuration. Please feel free to read the white paper for further details.

IP Multicast for VXLAN Multipod geographically dispersed

One task of the underlay network is to transport Layer 2 multidestination traffic between endpoints connected to the same logical Layer 2 broadcast domain in the overlay network. This type of traffic concerns Layer 2 broadcast, unknown unicast, and multicast traffic (aka BUM).

Two approaches can be used to allow transmission of BUM traffic across the VXLAN fabric:

  • Use IP multicast deployment in the underlay network to leverage the replication capabilities of the fabric spines delivering traffic to all the edge VTEP devices.
  • If multicast is not an option, it is possible to use the ingress replication capabilities of the VTEP nodes to create multiple unicast copies of the BUM traffic, one sent to each remote VTEP device.

It is important to note that the choice made for the replication process (IP Multicast or Ingress Replication) applies to the whole Multipod deployment, as it acts as a single fabric. IP Multicast is more likely to be enabled for Metro distances, as the Layer 3 over the DWDM links is usually managed by the enterprise itself. For Layer 3 managed services (WAN), however, the provider may not offer IP Multicast at all, or may offer only a limited number of multicast groups. That would impose the Ingress Replication mechanism end-to-end across the VXLAN Multipod. Note that choosing Ingress Replication may have an impact on scalability, depending on the total number of leaf nodes (VTEPs), so this is worth checking first.
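To get a feeling for the scalability impact, the following Python sketch compares how many copies of a BUM frame the ingress VTEP has to send with each method. It is a simplified model: it assumes every VTEP in the count participates in the Layer 2 segment, and the function names and traffic figures are hypothetical.

```python
def copies_sent_by_ingress_vtep(vteps_in_segment: int, mode: str) -> int:
    """Number of copies of one BUM frame the ingress VTEP puts on the wire."""
    if vteps_in_segment < 2:
        raise ValueError("at least two VTEPs are needed")
    if mode == "multicast":
        # One copy toward the multicast group; the underlay replicates it.
        return 1
    if mode == "ingress-replication":
        # One unicast copy per remote VTEP.
        return vteps_in_segment - 1
    raise ValueError(f"unknown replication mode: {mode}")


def ingress_bum_load_mbps(bum_mbps: float, vteps_in_segment: int, mode: str) -> float:
    """BUM bandwidth generated by the ingress VTEP toward the underlay."""
    return bum_mbps * copies_sent_by_ingress_vtep(vteps_in_segment, mode)


# Hypothetical fabric: 64 VTEPs across both PoDs, 5 Mbps of BUM from one leaf.
for mode in ("multicast", "ingress-replication"):
    print(mode, ingress_bum_load_mbps(5, 64, mode))
```

The load grows linearly with the number of VTEPs when Ingress Replication is used, which is why the total leaf count across all PoDs matters.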

If IP Multicast is the preferred choice, then two approaches are possible:

Anycast Rendezvous-Point PIM and MSDP in a Multipod Design

  1. Either deploy the same PIM domain with anycast RP across the two distant PoDs, if you want to simplify the configuration. That is indeed a viable solution.
  2. Or, better and recommended, keep the PIM-SM domain independently enabled on each site and leverage MSDP for the peering between rendezvous points across the underlying routed network.

Improving the Sturdiness of the VXLAN Multipod Design

Until functions preventing the creation of Layer 2 loops are embedded in the VXLAN standard, alternative short-term solutions are necessary. The suggested approach consists of the following two processes:

  • Prevention of End-to-End Layer 2 Loops: Use edge port protection features in conjunction with VXLAN EVPN to prevent the creation of Layer 2 loops. The best-known feature is BPDU Guard. BPDU Guard is a Layer 2 security tool that should be enabled on all the edge interfaces that connect to endpoints (hosts, firewalls, etc.). It prevents the creation of a Layer 2 loop by disabling the switch port after it receives a spanning-tree BPDU. BPDU Guard can be configured at the global level or the interface level.

VXLAN with BPDU Guard

  • Mitigation of End-to-End Layer 2 Loops: A different mechanism is required to mitigate the effects of any form of broadcast storms. When multicast replication is used to handle the distribution of BUM, it is recommended to enable the storm-control function to rate-limit the received amount of BUM traffic carried via the IP Multicast group of interest. The rate limiter of multicast traffic must be initiated on the ingress interfaces of the transit leaf nodes, hence protecting all networks behind it.

Storm-Control Multicast on Interpod Links

While it is strongly recommended to enable Storm Control for multicast on the ingress transit interfaces, it is equally important to consider the current volume of multicast traffic produced by the enterprise applications across remote PoDs before setting a threshold value. Otherwise, the risk is doing more damage to the user applications than the broadcast storm protection is worth. As a result, the rate-limit value for BUM traffic should be greater than the application data multicast traffic (use the highest value as a reference) plus the percentage of permitted broadcast traffic.
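The rule of thumb for the threshold can be written as a short calculation. The Python sketch below assumes the threshold is expressed as a percentage of the interface bandwidth; the function name and the sample figures (a 10 Gbps interpod link, 300 Mbps of peak application multicast, 1% of permitted broadcast) are hypothetical and should be replaced by values measured in your own environment.

```python
def min_storm_control_percent(link_mbps: float,
                              peak_app_multicast_mbps: float,
                              permitted_broadcast_percent: float) -> float:
    """Lowest safe threshold: peak application multicast plus permitted broadcast."""
    if link_mbps <= 0:
        raise ValueError("link bandwidth must be positive")
    app_percent = 100.0 * peak_app_multicast_mbps / link_mbps
    level = app_percent + permitted_broadcast_percent
    if level > 100.0:
        raise ValueError("required level exceeds the link capacity")
    return level


# 300 Mbps on a 10 Gbps link is 3%; adding 1% of broadcast gives 4%.
print(min_storm_control_percent(10_000, 300, 1.0))
```

Any value configured on the ingress transit interfaces should sit above this result, with some margin for application growth.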

Optionally, another way to mitigate the effects of a Layer 2 loop inside a given PoD is to also enable storm-control multicast on the spine interfaces facing the leaf nodes. Even more so than in the previous case, it is crucial to understand the normal application multicast utilisation rate within each PoD before setting a rate-limiting threshold value.


The deployment of a VXLAN Multipod fabric allows extending VXLAN overlay data plane in conjunction with its MP-BGP EVPN overlay control plane for Layer 2 and Layer 3 connectivity across multiple PoDs. Depending on the specific use case and requirements, those PoDs may represent different rooms in the same physical data center, or separate data center sites (usually deployed at Metropolitan-area distances from each other).

The deployment of the Multipod design is a logical choice to extend connectivity between fabrics that are managed and operated as a single administrative domain. However, it is important to consider that it behaves like a single VXLAN fabric. This characteristic has important implications for the overall scalability and resiliency of the design.

A true multisite solution with a valid DCI solution is still the best option for providing separate availability zones across data center locations.



31 – Multiple approaches interconnecting VXLAN Fabrics

As discussed in previous articles, VXLAN data plane encapsulation in conjunction with its control plane MP-BGP AF EVPN is becoming the foremost technology to support the modern network Fabric.

DCI is a solution architecture that you deploy to interconnect multiple data centers, extending Layer 2 and/or Layer 3 with or without multi-tenancy. A DCI architecture relies on a Data Plane for the transport and a Control Plane for an efficient and solid mechanism for endpoint discovery and distribution between sites (and much more). VXLAN is an encapsulation method, not an architecture. If we want to use VXLAN for DCI purposes, it is crucial that we understand what we need and how to address the requirements with a VXLAN transport.

Some articles from different vendors claiming that VXLAN can be simply leveraged for DCI purposes do frighten me a bit! Either they are a bit light or they don’t understand what interconnecting multiple data centers implies. You could indeed easily build a crap solution to interconnect multiple sites using VXLAN tunnels, if you just wish to extend Layer 2 segments outside the data center. Or you can build a solid DCI architecture based on business requirements using OTV, PBB-EVPN or even VXLAN EVPN, although the latter implies some additional features and sophisticated configurations.

It is therefore interesting to clarify how to interconnect multiple VXLAN/EVPN fabrics geographically dispersed across different locations.

If we look at how to interconnect VXLAN-based fabrics at layer 2 and layer 3, three approaches can be considered:

  • The 1st option is the extension of multiple sites as one large single stretched VXLAN Fabric. There is no network overlay boundary, nor VLAN hand-off at the interconnection, which simplifies operations. This option is also known as geographically dispersed VXLAN Multipod. However we should not consider this solution as a DCI solution per se as there is no demarcation, nor separation between locations. Nonetheless this one is very interesting for its simplicity and flexibility. Consequently we have deeply tested and validated this design (see next article).
  • The second option to consider is multiple VXLAN/EVPN-based Fabrics interconnected using a DCI Layer 2 and Layer 3 extension. Each greenfield Data Center located on a different site is deployed as an independent fabric, increasing the autonomy of each site and reinforcing global resiliency. This is also called Multisite, as opposed to the previous Multipod option. Consequently, an efficient Data Center Interconnect technology (OTV, VPLS, PBB-EVPN, or even VXLAN/EVPN) is used to extend Layer 2 and Layer 3 connectivity across separate sites.

VXLAN EVPN Multisites

For the VXLAN Multisite scenario, there are 2 possible models that need to be taken into consideration:

Model 1: Each VXLAN/EVPN fabric is a Layer 2 fabric only, with external devices (Routers, FW, SLB) offering routing and default gateway functions (with or without FHRP localization). This one is similar to the traditional interconnection of multiple DCs and doesn’t impose any specific requirement, except for the DCI solution itself, which must interconnect the fabrics in a resilient and secure fashion, as usual.

VXLAN EVPN Multisites with external gateway devices

Model 2: Each VXLAN/EVPN fabric is a Layer 2 and Layer 3 fabric. VXLAN EVPN indeed brings great added value with the L3 Anycast Gateway, as discussed in this post. Consequently, most enterprises could be interested in leveraging this function, distributed across all sites with the same virtual IP and virtual MAC addresses. This crucial feature, when geographically dispersed among multiple VXLAN fabrics, improves performance and efficiency, with the default gateway being transparently active for the endpoints located on each site from their first-hop router (the leaf node to which the endpoints are directly attached). It offers hot live mobility with E-W traffic optimization, besides the traditional zero business interruption. However, because the same virtual MAC and the same virtual IP address exist on both sides, some tricky configurations are required.

VXLAN EVPN Multisites with anycast L3 gateway

  • A third scenario, which can be considered somewhat as a subset of the stretched fabric, is to dedicate the VXLAN/EVPN overlay to extending Layer 2 segments from one DC to another. Consequently, VXLAN/EVPN is agnostic of the Layer 2 transport deployed within each location (it could be vPC, FabricPath, VXLAN or ACI, just to list a few). VLANs that need to be extended outside the DC are prolonged up to the local DCI devices, from where Layer 2 frames are encapsulated with a Layer 3 VXLAN header and sent toward the remote site over a Layer 3 network, to be finally de-encapsulated on the remote DCI device and distributed within the DC accordingly.

VXLAN EVPN DCI-only

In this scenario, vPC is also leveraged to offer DCI dual-homing functions for resiliency and traffic load distribution.

This option can be analogous to OTV or PBB-EVPN from an overlay point of view. However as already mentioned, it must be understood that VXLAN has not been built natively to address all the DCI requirements. Nevertheless the implementation of EVPN control-plane could be expanded in the future to include the multi-homing functionality, delivering failure containment, loop protection, site-awareness, or optimized multicast replication like OTV can already offer since its beginning.

BTW it’s always interesting to highlight that since NX-OS 7.2(0)D1(1), OTV relies on VXLAN encapsulation for the data-plane transport, together with the well-known IS-IS as the control plane natively built for DCI purposes. Hum, interesting to notice that OTV can also be a VXLAN-based DCI solution; however, it implements the right control plane for DCI needs. That’s why it is important to understand that not only is a control plane required for intra-fabric and inter-DC network transport, but this control plane must natively offer the features that address the DCI requirements.






30 – VxLAN/EVPN and Integrated Routing Bridging



As I mentioned in the post 28 – Is VxLAN Control Plane a DCI solution for LAN extension, VxLAN/EVPN is taking a big step forward with its Control Plane and could potentially be used for extending Layer 2 segments across multiple sites. However, it is still crucial that we keep in mind some weaknesses and gaps with regard to DCI purposes. Neither VxLAN nor VxLAN/EVPN has been designed to natively offer a DCI solution (posts 26 & 28).

DCI is not just a Layer 2 extension between two or more sites. DCI/LAN extension aims to offer business continuity and elasticity for the cloud (hybrid cloud). It offers disaster recovery and disaster avoidance services for enterprise business applications; consequently, it must be very robust and efficient. As it concerns Layer 2 broadcast domains, it is really important to understand the requirements for a solid DCI/LAN extension, and how we can leverage the right tools and network services to address some of the shortcomings of the current implementation of VxLAN/EVPN in order to offer a solid DCI solution.

In this article we will examine the integrated anycast L3 gateway available with VxLAN/EVPN MP-BGP control plane, which is one of the key DCI requirements for long distances when the first hop default gateway can be duplicated on multiple sites.

Integrated Routing and Bridging

One of the needs for an efficient DCI deployment is a duplicated Layer 3 default gateway solution such as FHRP isolation (24 – Enabling FHRP filter), which reduces hairpinning workflows while machines migrate from one site to another. In short, the same default gateway (virtual IP address + virtual MAC) exists and is active on both sites. When a virtual machine or a whole multi-tier application executes a hot live migration to a remote location, it is very important that the host continues to use its default gateway locally and processes the communication without any interruption. Without FHRP isolation, multi-tier applications (E-W traffic) will suffer in terms of performance when reaching their default gateway in the primary data center.

This operation should happen transparently with the new local L3 gateway, while the round-trip workflow via the original location is eliminated. No configuration is required on the Host. All active sessions can be maintained statefully.

Modern techniques such as Anycast gateway can offer at least the same efficiency as FHRP isolation, and not only for Inter-Fabric (DCI) workflows but also for Intra-Fabric traffic optimization. This task is achieved in an easier way because it doesn’t require specific filtering to be configured on each site. The function of the Anycast Layer 3 gateway is natively embedded with the BGP EVPN control plane. And last but not least, in conjunction with VxLAN/EVPN there are as many Anycast L3 gateways as Top-of-Rack switches (Leafs with VxLAN/EVPN enabled).

The EVPN IETF draft elaborated the concept of Integrated Routing and Bridging based on EVPN to address inter-subnet communication between Hosts or Virtual Machines that belong to different VxLAN segments. This is also known as inter-VxLAN routing. VxLAN/EVPN offers natively the Anycast L3 gateway function with the Integrated Routing Bridging feature (IRB).

VxLAN/EVPN MP-BGP Host learning process

The following section is a brief reminder of the Host reachability discovery and distribution process, to help in understanding the different routing modes later.

The VTEP function happens on a First Hop Router usually enabled on the ToR where the Hosts are directly attached. EVPN provides a learning process to dynamically discover the local end-points attached to their local VTEP, distributing afterward the information (Host’s MAC and IP reachability) toward all other remote VTEPs through the MP-BGP control plane. Subsequently, all VTEPs know all end-point information that belongs to their respective VNI.

Let’s elaborate the description with the following example.

vPC Anycast VTEP


VLAN 100 (blue) maps to VNI 10000

  • Host A:
  • Host C:
  • Host E:
  • Host F:

VLAN 30 (orange) maps to VNI 5030

  • Host B:
  • Host D:

VLAN 3000 (purple) maps to VNI 300010.

  • Host G:

In order to keep the following scenarios simple, each pair of physical Leafs (vPC) will be represented by a single logical switch with the Anycast VTEP stamped to it.


vPC Anycast VTEP – Physical View vs. Logical View


VLAN 30 and VLAN 3000 are not present in Leaf 1, consequently it is not required to create and map the VLAN 30 to VNI 5030, nor VLAN 3000 to VNI 300010 within VTEP 1. VTEP 1 only needs to get the reachability information for Hosts A, C, E and F (IP & MAC addresses) belonging to VLAN 100. Host A is local to VTEP 1 and Hosts C, E and F are remotely extended using the network overlay VNI 10000.


VTEP 2 learned Hosts A, C, E and F, attached to VLAN 100, and Hosts B and D, belonging to VLAN 30. Hosts B and C are local.


VTEP 3 learned Hosts A, C, E and F, attached to VLAN 100, and Hosts B and D, belonging to VLAN 30. Hosts D and E are local.


VTEP 4 learned Hosts A, C, E, and F on VLAN 100 and Host G on VLAN 3000. Hosts F and G are local.

Notice that only VTEP 4 learns Host G, which belongs to VLAN 3000. Consider Host G to be isolated from the other segments, like for example a database accessible only via routing.
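The learning outcome described above can be sketched in a few lines of Python. This is only an illustrative model, not device behavior: the dictionary names are hypothetical, and the host placement and VLAN-to-VNI mappings are taken from the example in this section.

```python
# Illustrative model of the example topology (all names are hypothetical).
VLAN_TO_VNI = {100: 10000, 30: 5030, 3000: 300010}

# VLANs configured (and mapped to an L2 VNI) on each logical VTEP.
VTEP_VLANS = {
    "VTEP1": {100},
    "VTEP2": {100, 30},
    "VTEP3": {100, 30},
    "VTEP4": {100, 3000},
}

# Host -> (VTEP where it is attached, VLAN it belongs to)
HOSTS = {
    "A": ("VTEP1", 100),
    "B": ("VTEP2", 30),
    "C": ("VTEP2", 100),
    "D": ("VTEP3", 30),
    "E": ("VTEP3", 100),
    "F": ("VTEP4", 100),
    "G": ("VTEP4", 3000),
}


def learned_hosts(vtep):
    """Return {host: 'local' | 'remote'} for hosts in a VNI mapped on this VTEP."""
    local_vnis = {VLAN_TO_VNI[vlan] for vlan in VTEP_VLANS[vtep]}
    table = {}
    for host, (home_vtep, vlan) in HOSTS.items():
        if VLAN_TO_VNI[vlan] in local_vnis:
            table[host] = "local" if home_vtep == vtep else "remote"
    return table


for vtep in sorted(VTEP_VLANS):
    print(vtep, learned_hosts(vtep))
```

In this model, a VTEP only holds entries for the L2 VNIs it maps locally, which is why Host G shows up on VTEP 4 alone.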


VLAN to Layer 2 VNI mapping

The learning and distribution of end-point reachability information are achieved by the MP-BGP EVPN control plane with an integrated routing and bridging (IRB) function. Each VTEP learns through the data plane (e.g. the source MAC of a Unicast packet or of an ARP/GARP/RARP) and registers its local end-points, with their respective MAC and IP address information, using the Host Mobility Manager (HMM). Subsequently, it distributes this information through the MP-BGP EVPN control plane. In addition to registering its locally attached end-points (HMM), every VTEP populates its Host table with the devices learned through the control plane (MP-BGP EVPN AF).

From Leaf 1 (VTEP 1):


L2route from Leaf 1

The “show L2route” command[1] above details the local Host A, learned by the Host Mobility Manager (HMM), as well as Hosts C, E and F, learned through BGP, together with the next hop (remote VTEP) to reach them.

[1] Please notice that the output has been modified to show only what’s relevant to this section.

From Leaf 3 (VTEP 2):


L2route from Leaf 3

Host B and Host C are local, whereas Host D and Host E are remotely reachable through VTEP 3, and Host F appears behind VTEP 4.

From Leaf 5 (VTEP 3):


L2route from Leaf 5

Host D and Host E are local. All other Hosts except Host G are reachable via the remote VTEPs listed in the Next Hop column.

From Leaf 7 (VTEP 4):


L2route from Leaf 7

Host F and Host G are local. Across the whole VxLAN domain, VTEP 4 registers only the Hosts that belong to VLAN 100 and VLAN 3000.

With the Host table information, each VTEP knows how to reach all Hosts within its configured VLAN to L2 VNI mapping.

Asymmetric or symmetric Layer 3 workflow

The EVPN draft defines two different operational models to route traffic between VxLAN overlays. The first method describes an asymmetric workflow across different subnets and the second method leverages the symmetric approach.

Nonetheless, vendors offering VxLAN with EVPN control plane support may implement one of these operational models (assuming the Hardware/ASIC supports VxLAN routing). Some may choose the Asymmetrical approach, maybe because it is easier to implement from a software point of view, but it is not as efficient as the symmetrical mode and has some risks that impact scalability. Others will choose the symmetric model for more efficient population of Host information with better scalability.

Assuming the following scenario:

  • Host A (VLAN 100) wants to communicate with Host G in a different subnet (VLAN 3000)
  • VLAN 100 is mapped to VNI 10000
  • VLAN 30 is mapped to VNI 5030
  • VLAN 3000 is mapped to VNI 300010
  • We assume all Hosts and VLANs belong to the same Tenant-1 (VRF)
  • * L3 VNI for this Tenant of interest is VNI 300001 (for symmetric IRB)
  • We assume within the same Tenant-1 that all Hosts from different subnets are allowed to communicate with each other (using inter-VxLAN routing).
  • Host A shares the same L2 segment with Hosts C, E, and F spread over remote Leafs
  • Host G is Layer 2 isolated from all other Hosts. Therefore a Layer 3 routing transport is required to communicate with Host G.

* explained with Symmetric IRB mode

Asymmetric IRB mode:

When an end-point wants to communicate with another device on a different IP network, it sends the packet with the destination MAC address as its default gateway. Its first-hop router (ingress VTEP) performs the routing lookup and routes the packet to the destined L2 segment. When the egress VTEP receives the packet, it strips off the VxLAN header and bridges the original frame to the VLAN of interest. With asymmetrical routing mode, the ingress VTEP (the local VTEP where the source is attached) performs both bridging and routing, whereas the egress VTEP (remote VTEP where the destination sits) performs only bridging. Consequently, the return traffic will take a different VNI, hence a different overlay tunnel.

In the following example, Host A wants to communicate with Host G and sends the packet toward its default gateway. VTEP 1 sees that the destination MAC is its own address and does a routing lookup for Host G. It finds in its Host table the destined IP Host and the Next Hop VTEP 4 to reach it. VTEP 1 encapsulates and routes the packet into the L2 VNI 300010 with VTEP 4 as destination IP address. VTEP 4 receives the packet, strips off the VxLAN header and bridges the frame to its VLAN 3000 toward Host G (with the source MAC as the default gateway address).

When Host G responds, VTEP 4 receives the frame on VLAN 3000 (VNI 300010), performs the routing lookup and routes the packet directly into VNI 10000, where Host A is registered. The egress VTEP 1 therefore bridges the packet received from VNI 10000 toward Host A in VLAN 100.
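To make the asymmetry concrete, here is a minimal Python sketch of the VNI selection described above. It is a simplified model under stated assumptions, not an implementation of any vendor’s forwarding logic: the function and dictionary names are hypothetical, and only Hosts A and G from the example are modeled.

```python
# Simplified model of asymmetric IRB (names are illustrative only).
VLAN_TO_VNI = {100: 10000, 3000: 300010}
HOST_VLAN = {"A": 100, "G": 3000}
HOST_VTEP = {"A": "VTEP1", "G": "VTEP4"}


def asymmetric_walk(src, dst):
    """Ingress VTEP routes and bridges; egress VTEP only bridges.

    The packet is routed by the ingress VTEP straight into the L2 VNI
    of the destination host, so each direction uses a different VNI.
    """
    return {
        "ingress_vtep": HOST_VTEP[src],
        "egress_vtep": HOST_VTEP[dst],
        "vni_on_wire": VLAN_TO_VNI[HOST_VLAN[dst]],
        "egress_action": "bridge to VLAN %d" % HOST_VLAN[dst],
    }


forward = asymmetric_walk("A", "G")
reverse = asymmetric_walk("G", "A")
print("A -> G uses VNI", forward["vni_on_wire"])
print("G -> A uses VNI", reverse["vni_on_wire"])
```

The two directions land on different VNIs (300010 one way, 10000 the other), which is exactly what the term asymmetric refers to here.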


Asymmetric Routing

The drawback of this asymmetric routing implementation is that every VTEP must be consistently configured with all the VLANs and VNIs involved in routing and bridging communication across the fabric. In addition, every VTEP needs to learn the reachability information (MAC and IP addresses) for all Hosts that belong to all VNIs of interest.

In the above illustration, VTEP 1 needs to be configured with VLAN 100 mapped to VNI 10000, as well as with the VLAN IDs mapped to VNIs 5030 and 300010, even though no Host is attached to those VLANs. As the VLAN ID is only locally significant, what is crucial is that the mapping to the VNI of interest exists. Therefore, in this example, the Host tables on all leafs are populated with the reachability information for Hosts A, B, C, D, E, F and G. In a large VxLAN/EVPN deployment, this implementation of asymmetric IRB adds some complexity to the configuration, but above all it may have an important impact in terms of scalability.
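The scalability concern can be illustrated with a small, hedged Python sketch that counts Host table entries per VTEP. It assumes the example topology of this article; the names are hypothetical and the counts only reflect this toy model, not real hardware table sizes.

```python
# Toy model: how many host entries each VTEP must hold (illustrative names).
HOST_VNI = {
    "A": 10000, "C": 10000, "E": 10000, "F": 10000,
    "B": 5030, "D": 5030,
    "G": 300010,
}

# L2 VNIs that actually have locally attached hosts on each VTEP.
VTEP_LOCAL_VNIS = {
    "VTEP1": {10000},
    "VTEP2": {10000, 5030},
    "VTEP3": {10000, 5030},
    "VTEP4": {10000, 300010},
}

ALL_VNIS = set(HOST_VNI.values())


def host_entries(vnis):
    """Hosts a VTEP must know about when it maps the given set of VNIs."""
    return sorted(host for host, vni in HOST_VNI.items() if vni in vnis)


for vtep in sorted(VTEP_LOCAL_VNIS):
    asymmetric = host_entries(ALL_VNIS)                # every VNI mapped everywhere
    local_only = host_entries(VTEP_LOCAL_VNIS[vtep])   # only locally needed VNIs
    print(vtep, "asymmetric:", len(asymmetric), "local VNIs only:", len(local_only))
```

With only seven Hosts the difference is small, but the asymmetric count grows with the total number of Hosts in the tenant rather than with the number of Hosts in the locally used segments.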

Symmetric IRB

Symmetric routing behaves differently in the sense that both the ingress and egress VTEPs provide bridging and routing functions. This introduces a new concept known as the transit L3 VNI. This L3 VNI is dedicated to routing purposes within a tenant VRF. Indeed, the L3 VNI offers L3 segmentation per tenant VRF. Each VRF instance is mapped to a unique L3 VNI in the network. The tenant VLAN on which a packet is received determines the VRF context to which that packet belongs. As a result, inter-VxLAN routing is performed through the L3 VNI within a particular VRF instance.

Notice that each VTEP can support several hundred VRFs (depending on the hardware, though).

In the following example, all Hosts, A, B, C, D, E, F and G, belong to the same tenant VRF. When Host A wants to talk to Host G in a different subnet, the local (ingress) VTEP sees the destination MAC as the MAC of the default gateway (AGM, Anycast Gateway MAC). As Host G is not known via Layer 2, the VTEP routes the packet through the L3 VNI 300001. It rewrites the inner destination MAC address with the router MAC address of the egress VTEP 4, which is unique to each VTEP. Once the remote (egress) VTEP 4 receives the encapsulated VxLAN packet, it strips off the VxLAN header and does a MAC lookup, identifying the destination MAC as its own. Accordingly, it performs an L3 lookup and routes the received packet from the L3 VNI to the destination L2 VNI 300010, which maps to VLAN 3000 where Host G resides. VTEP 4 finally maps VNI 300010 to VLAN 3000 and forwards the frame with Host G as the L2 destination.


Symmetric Routing

For return traffic, VTEP 4 will achieve the same packet walk using the same transit L3 VNI 300001, but in the opposite direction.
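The symmetric packet walk can be sketched in the same spirit. Again, this is a minimal model under assumptions, not the actual forwarding code: the names are hypothetical, and the router MAC is represented by a placeholder string rather than a real address.

```python
# Simplified model of symmetric IRB (names and placeholders are illustrative).
L3_VNI_PER_VRF = {"Tenant-1": 300001}
VLAN_TO_VNI = {100: 10000, 3000: 300010}
HOST_VLAN = {"A": 100, "G": 3000}
HOST_VTEP = {"A": "VTEP1", "G": "VTEP4"}


def symmetric_walk(src, dst, vrf="Tenant-1"):
    """Both VTEPs route: ingress into the transit L3 VNI, egress out of it."""
    egress = HOST_VTEP[dst]
    dst_vlan = HOST_VLAN[dst]
    return {
        "ingress_vtep": HOST_VTEP[src],
        "egress_vtep": egress,
        "vni_on_wire": L3_VNI_PER_VRF[vrf],           # transit L3 VNI of the tenant
        "inner_dst_mac": "router-mac(%s)" % egress,   # placeholder, unique per VTEP
        "egress_action": "route into VNI %d, bridge to VLAN %d"
                         % (VLAN_TO_VNI[dst_vlan], dst_vlan),
    }


forward = symmetric_walk("A", "G")
reverse = symmetric_walk("G", "A")
print(forward)
print(reverse)
```

Both directions carry the same VNI on the wire (300001), and the ingress VTEP never needs the destination L2 VNI to be configured locally.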

The following L2route command displays the Host reachability information that exists in VTEP 1’s Host table. It doesn’t show Host G, as no VLAN on Leaf 1 (VTEP 1) maps locally to VNI 300010, nor does it show any information related to VNI 5030. The population of its Host reachability table is therefore reduced to the minimum required information.


L2route from Leaf 1

Nonetheless, Leaf1 knows how to route the packet to that destination. The Next hop is Leaf_4 via the L3 VxLAN segment ID 300001, as displayed in the following routing information.


Show ip route from Leaf 1

The “show IP route” command above shows that Host G is reachable via the VxLAN segment ID 300001, with VTEP 4 as the next hop.


Show bgp l2vpn evpn from Leaf 1

The “show bgp L2vpn evpn” command[1] from Leaf 1 displays the information related to Host G, reachable via the next hop VTEP 4 using the L3 VNI 300001.

Note also for information only that the label shows the mapping of the L2VNI 300010 where Host G belongs to and the L3VNI 300001 to reach Host G.

[1] Please notice that the output has been modified to display only information relevant to Host G.

Anycast L3 gateway

One of the useful added values of this is the capacity to offer the default gateway role from the directly attached Leaf (First Hop Router on the ToR). All leafs are configured with the same default gateway IP address for a specific subnet, as well as the same vMAC address. All routers use the same virtual Anycast Gateway MAC (AGM) for all default gateways. This is explicitly configured during the VxLAN/EVPN setup.

Leaf(config)# fabric forwarding anycast-gateway-mac 0001.0001.0001

When a virtual machine processes a “hot live migration” to a Host attached to a different Leaf, it will continue to use the same default gateway parameter without interruption.


Anycast Layer 3 Gateway

In the above scenario, the server “WEB” in DC-1 communicates (1) with its database, which is attached to a different Layer 2 network, and therefore a different subnet, in DC-2. The local Leaf 1 performs the default gateway role for server WEB as the First Hop Router, which is simply its own ToR with the required VxLAN/EVPN and IRB services up and running.

While server WEB  is communicating with the end-user (not represented here) and its database, it does a live migration (2) toward the remote VxLAN fabric where its database server resides.

The MP-BGP EVPN AF control plane notices the movement and the new location of the server WEB. Subsequently, it increases the MAC mobility sequence number for that particular end-point and notifies all VTEPs of the new location. As the sequence number is now higher than the original value, all VTEPs update their Host tables accordingly with the new “next hop” (egress VTEP 4).
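The role of the sequence number can be sketched as follows. This is a deliberately minimal Python model: the class and field names are hypothetical, the MAC is a placeholder label, and the handling of equal sequence numbers is intentionally left out.

```python
from dataclasses import dataclass


@dataclass
class HostRoute:
    mac: str        # placeholder label, not a real MAC address
    next_hop: str   # VTEP behind which the host currently sits
    seq: int = 0    # MAC mobility sequence number


class Vtep:
    def __init__(self, name):
        self.name = name
        self.host_table = {}

    def receive_update(self, route):
        """Accept the route only if it is new or carries a higher sequence number."""
        current = self.host_table.get(route.mac)
        if current is None or route.seq > current.seq:
            self.host_table[route.mac] = route
            return True
        return False  # stale (or equal) advertisement is ignored in this sketch


vteps = [Vtep("VTEP%d" % n) for n in range(1, 5)]

# Initial advertisement: server WEB is attached behind VTEP 1.
for v in vteps:
    v.receive_update(HostRoute("web-mac", "VTEP1", seq=0))

# Live migration: WEB is now seen behind VTEP 4, sequence number is increased.
for v in vteps:
    v.receive_update(HostRoute("web-mac", "VTEP4", seq=1))

# A late, stale advertisement from the old location must not win.
stale_accepted = vteps[1].receive_update(HostRoute("web-mac", "VTEP1", seq=0))

print({v.name: v.host_table["web-mac"].next_hop for v in vteps}, stale_accepted)
```

The point of the sketch is simply that the highest sequence number identifies the most recent location, so every VTEP converges on VTEP 4 regardless of the order in which updates arrive.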

Without interruption of the current active sessions, server WEB continues to transparently use its default gateway, locally available from its new physical First Hop Router, now being Leaf 7 and 8 (VTEP 4). The East-West communication between WEB and DB happens locally within DC-2.





29 – Interconnecting two sites for a Logical Stretched Fabric: Full-Mesh or Partial-Mesh

This post discusses design considerations when interconnecting two tightly coupled fabrics using dark fibers or DWDM, without being limited to metro distances. If we think about very long distances, the point-to-point links can also be established using a virtual overlay such as EoMPLS port x-connect; nonetheless, the debate will be the same.

Notice that this discussion is not limited to one type of network fabric transport; any solution that uses multi-pathing is concerned, such as FabricPath, VxLAN or ACI.

Assume the distance between DC-1 and DC-2 is about 100 km. Although it may seem simple to guess which of the following two design options is the most efficient, it is actually not as obvious as one might think, and a bad choice may have a huge impact on some applications. I have met several networkers discussing the best choice between full-mesh and partial-mesh for interconnecting two fabrics. Some folks think that full-mesh is the best solution. Actually, although it depends on the distance between the network fabrics, this is certainly not the most efficient design option for interconnecting them.

Full-Mesh and Partial-Mesh

Partial-Mesh with transit leafs design (left), Full-Mesh (right)

For this discussion, we assume the same distance between the two locations for both design options (let’s take 100 km to keep the math simple). On the left side, we have a partial-mesh design with transit leafs to interconnect the two fabrics. On the right side, we have a full-mesh established between the two fabrics.

With a full-mesh approach, the inter-fabric workflow takes one hop to communicate with a remote host. As a result, the total latency induced with full-mesh is theoretically 1 ms (100 km) between two remote end-points, plus the few microseconds needed to traverse the spine in DC-2 (one hop).




However, if the two locations are interconnected using a transit leaf, the traffic between the two tiers takes three hops. In this case, it theoretically adds a few microseconds for traffic crossing the spine switch in DC-1, plus 1 ms (100 km), plus a few microseconds (remote transit leaf + spine switch in DC-2).


Partial-Mesh using Transit Leaf


With the partial-mesh, each transit leaf interconnects with the remote spine. The Equal Cost Multi-Pathing (ECMP) algorithm distributes the traffic among all possible equal-cost paths. The server-to-server workflows (E-W) established between hosts in the same location (e.g. Web/App tier and DB tier) take a maximum of one hop, whereas a flow sent via the remote location over the DCI connection would consume two additional hops. As a result, ECMP will never elect the link toward the remote spine, and the traffic between the application tiers will be contained inside the same location (DC-1).


In a Partial-mesh design, ECMP contains the local workflow within the same location


However, if the two locations were interconnected using a full-mesh design, nothing would prevent the ECMP hashing from hitting a remote spine, as the number of hops is equal across the two locations.

As a result, a full-mesh will theoretically add 2 ms (2 x 100 km) for a portion of the traffic.


In a Full-mesh design, ECMP will distribute a portion of the local traffic via the remote fabric
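The ECMP behavior in the two designs can be sketched with a few lines of Python. This is a simplified model, assuming hop count is the only path metric and using a CRC32 hash as a stand-in for whatever hashing a real switch applies; path names and flow identifiers are hypothetical.

```python
import zlib


def ecmp_candidates(paths):
    """Keep only the paths whose hop count equals the minimum (equal cost)."""
    best = min(hops for _, hops in paths)
    return [name for name, hops in paths if hops == best]


def pick_path(flow_id, candidates):
    """Hash the flow identifier onto one of the equal-cost candidates."""
    return candidates[zlib.crc32(flow_id.encode()) % len(candidates)]


# Paths between two end-points located in the SAME site (DC-1).
# Hop counts follow the text: one hop locally, two additional hops
# when going through the remote spine behind a transit leaf.
PARTIAL_MESH = [("local spine", 1), ("remote spine via transit leaf", 3)]
FULL_MESH = [("local spine", 1), ("remote spine", 1)]

for design, paths in (("partial-mesh", PARTIAL_MESH), ("full-mesh", FULL_MESH)):
    candidates = ecmp_candidates(paths)
    used = {pick_path("flow-%d" % i, candidates) for i in range(1000)}
    print(design, "-> equal-cost paths:", candidates, "| paths used:", sorted(used))
```

In the partial-mesh case the remote path never enters the equal-cost set, whereas in the full-mesh case it is a legitimate candidate for any local flow.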


If we compare the full-mesh and the partial-mesh designs, the latency induced by the distance between sites is the same (we can ignore the few microseconds due to the additional hops). With partial-mesh, the communication between two local end-points is contained within the same fabric. With full-mesh, although it saves a few microseconds with a maximum of one hop between two end-points, there is no guarantee that the traffic between two local end-points will always be contained within the same location. The risk is that, for the same application, some flows will take a local path (with negligible latency) and some will take the remote path (with n ms due to the hairpinning via the remote spine switch).

Key considerations

  • With long distances between two fabrics, a full-mesh design may impact application performance for E-W communication between two local end-points.
  • Any network transport that uses multi-pathing is concerned, such as FabricPath, VxLAN or ACI.
  • A full-mesh provides a single hop between two remote end-points. However, it also distributes the local E-W flows equally between the two sites. As a result, half of the traffic may be impacted by the latency induced by the distance.
  • A transit leaf design imposes three hops between two remote end-points; however, the latency induced by the local hop in each site is insignificant compared to the DCI distance. A transit leaf design (partial-mesh) contains the E-W flows within the fabric local to each site.
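As a rough back-of-the-envelope check of these figures, the following Python sketch assumes a one-way propagation delay of about 5 microseconds per km of fiber and reads the 1 ms and 2 ms values above as round-trip times. Both points are assumptions of this sketch, as are the function names; switching delays are ignored.

```python
FIBER_US_PER_KM = 5.0  # assumed one-way propagation delay in fiber (~5 us/km)


def rtt_ms(path_km):
    """Round-trip propagation delay in milliseconds for a one-way path length."""
    return 2 * path_km * FIBER_US_PER_KM / 1000.0


def average_local_rtt_ms(distance_km, fraction_via_remote):
    """Average added RTT for LOCAL E-W flows when a share is hashed to the remote spine."""
    hairpin = rtt_ms(2 * distance_km)  # out to the remote spine and back again
    return fraction_via_remote * hairpin


distance_km = 100
print("remote end-points, either design :", rtt_ms(distance_km), "ms")
print("local flow hairpinned (full-mesh):", rtt_ms(2 * distance_km), "ms")
print("full-mesh, half of local flows   :", average_local_rtt_ms(distance_km, 0.5), "ms on average")
print("partial-mesh, local flows        :", average_local_rtt_ms(distance_km, 0.0), "ms on average")
```

Under these assumptions, the penalty of the full-mesh is not the worst case for inter-site traffic, which is identical in both designs, but the extra delay on local traffic that should never have left the site.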

As a global conclusion, for a single logical fabric geographically stretched across two physically distant locations, it may be important to deploy a transit leaf model when distances go beyond campus distances.

All that said, it is unlikely that many enterprises deploy a full-mesh interconnecting fabric for long distances, but nothing is impossible; it depends on who owns the fiber 🙂




A fantastic overview of the Elastic Cloud project from Luca Relandini


This post shows how the porting of the Elastic Cloud project to a different platform is achieved with UCSD


And don’t miss this excellent recent post which explains how to invoke UCS Director workflows via the northbound REST API.



28 – Is VxLAN with EVPN Control Plane a DCI solution for LAN extension

VxLAN Evolution in the Context of DCI Requirements

Since I posted this article “Is VxLAN a DCI solution for LAN extension ?” clarifying why Multicast-based VxLAN Flood & Learn (no Control Plane) was not suitable to offer a viable DCI solution, the DCI market (Data Center Interconnect) has become a buzz of activity around the evolution of VxLAN based on the introduction of a Control Plane (CP).

Before we elaborate on this subject, it is crucial that we keep in mind the shortcomings of diverting VxLAN for a DCI solution. Neither VxLAN nor VxLAN/EVPN has been designed to natively offer a DCI solution. VxLAN is just another encapsulation model that dynamically establishes an overlay tunnel between two end-points (VTEPs). There is nothing new in this protocol to justify the claim that a new DCI transport is born.

In this network overlay context, the Control Plane objective is to leverage Unicast transport while processing VTEP and host discovery and distribution processes. This method significantly reduces flooding for Unknown Unicast traffic within and across the fabrics.

The VxLAN protocol (RFC 7348) is aimed at carrying a virtualized Layer 2 network over a tunnel established across an IP network. Hence, from a network overlay point of view, there is no restriction on transporting a Layer 2 frame over an IP network, because that’s what network overlays offer.

Consequently, the question previously discussed for VxLAN Flood&Learn transport as a new DCI alternative comes back again:

  • Does a Control Plane suffice to claim that a VxLAN can be used as a DCI solution?

This noise requires therefore a clarification on how reliable a DCI solution can be when based on VxLAN Unicast transport using a Control Plane.

Note: This article focuses on Hardware-based VxLAN and VTEP Gateways (L2 & L3). As of today, I will not consider the scalability and flexibility of the Host-based VxLAN sufficient enough for a valid DCI solution.

What Does VxLAN Evolution to Control Plane Mean ?

We can consider four different stages in VxLAN evolution.

These stages differ in the transport and learning processes as well as the components and services required to address the DCI requirements.

  • The first stage of VxLAN relies on a Multicast-based transport also known as Flood&Learn mode with no Control Plane. Learning for host reachability relies on Data plane (DP) only and is treated using Flood&Learn. This is the original transport method and behavior that became available with the first release of VxLAN. Let’s call this first stage, VxLAN phase v1.0*; however, it is out of the scope of this post as it relies on Flood&Learn and has been already broadly elaborated in this post mentioned above.
  • The second stage of VxLAN Flood&Learn relies on Unicast-only transport mode. This mode announces to each VTEP the list of VTEP IP addresses and their associated VNIs. It does NOT discover or populate any host reachability information. It uses a Head-End Replication (HER) mechanism to “flood” BUM traffic (through the data plane) for each VNI. The added value of this mode is that an IP multicast network is no longer mandatory for the learning process across multiple sites. Although this mode still relies on the Flood&Learn discovery process, I think it’s worth elaborating on this Unicast-only transport mode, essentially because some implementations of this basic HER stage claim to offer a DCI solution. Let’s call this stage VxLAN phase v1.5*. This phase is elaborated in the next sections.
  • The third stage of VxLAN provides dynamic host discovery and distribution across all VTEPs using a Control Plane (CP), the MP-BGP EVPN Address Family (AF). The host information consists of the IP and MAC identifiers of the end-points that share Layer 2 adjacency with their peers. This mode of VxLAN is interesting, as it drastically reduces flooding within and between fabrics; thus, in a nutshell, it might be considered “good enough” when diverted as a DCI model. However, it is important to note that not all of the functions required for a solid DCI solution are there. To avoid confusion with other implementations of VxLAN, and for reference throughout this article, let’s call this stage VxLAN phase v2.0*. This mode is elaborated in depth in the next sections.
  • The fourth stage of VxLAN offers similar transport and learning process as the current implementation of VxLAN MP-BGP EVPN AF (Phase 2.0*); however, it will introduce several functions required for a solid DCI solution. Let’s call this stage VxLAN phase v3.0 *.

* Phase “n”: be aware that there is nothing official nor standard in this nomenclature. This is mainly used for referencing in the following writing.
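To illustrate what Head-End Replication means for the ingress VTEP in phase 1.5, here is a minimal Python sketch. It is only a model of the principle: the VTEP addresses are made-up examples, and the function and dictionary names are hypothetical.

```python
# VNI -> list of VTEP IP addresses announced for that VNI (example addresses).
VNI_MEMBERS = {
    10000: ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    5030: ["10.0.0.2", "10.0.0.3"],
}


def head_end_replicate(local_vtep, vni, frame):
    """Return one unicast copy of a BUM frame per remote VTEP in the VNI.

    No host reachability is known in this mode, so the ingress VTEP
    has to replicate the frame itself toward every other VTEP.
    """
    return [
        {"outer_src": local_vtep, "outer_dst": remote, "vni": vni, "payload": frame}
        for remote in VNI_MEMBERS.get(vni, [])
        if remote != local_vtep
    ]


copies = head_end_replicate("10.0.0.1", 10000, "ARP request")
for copy in copies:
    print(copy)
```

The number of copies grows with the number of VTEPs in the VNI, and every copy crosses the inter-site links when the list includes remote-site VTEPs, which is one reason this mode remains a Flood&Learn approach.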

In a nutshell, which DCI functions are supported with VxLAN?

VxLAN Multicast (VxLAN Phase 1.0)

  • Dual-homing

VxLAN Unicast-only mode using Head-End Replication (VxLAN Phase 1.5)

  • Multicast or Unicast transport (Head-end replication)
  • Dual-homing


VxLAN MP-BGP EVPN AF Control Plane (VxLAN Phase 2.0)

  • Control-Plane Domain Isolation using eBGP
  • Independent Autonomous System
  • Multicast or Unicast transport
  • ARP suppression: reducing ARP flooding requests
  • Distributed L3 Anycast Gateway (E-W traffic localization using Anycast Default Gateway)
  • Dual-homing with L2 Anycast VTEP Address

* Wishes for a Next phase of VxLAN BGP EVPN AF (VxLAN phase 3.0)

  • Data-Plane Domain Isolation
  • Unknown Unicast suppression
  • Selective Storm Control (per VNI)
  • Native L2 Loop detection and protection
  • Multi-homing

* There is no commitment to any of the above. This list of functions is given for information of what could be implemented in VxLAN to improve some existing shortcomings in regard to DCI requirements.

A slight clarification on Multi-homing: it refers to a DCI solution with embedded, topology-independent multi-homing, as opposed to MLAG-based dual-homing built with a tightly coupled pair of switches.

The figure above aims to provide a high-level overview of VxLAN evolution toward full support for DCI requirements. OTV 2.5 is used as the DCI reference.

VxLAN evolution rating for a DCI solution


Multicast transport for Flood&Learn

  • Definitely excluded for any DCI solution

If one were to rate how efficient this implementation of VxLAN would be for DCI, I would give it a 2 out of 10 for VxLAN 1.0.

DCI Survey VxLAN 1.0


Unicast-only mode with HER

  • Definitely not efficient for any DCI solution
  • Relies on Flood&Learn
  • Having a Unicast-only mode with HER to handle BUM doesn’t mean that it should be considered a valid DCI solution, be cautious.

IMHO this deserves a maximum of 3 out of 10 for VxLAN 1.5

DCI Survey VxLAN 1.5


Control Plane (MP-BGP AF EVPN)

  • This can be considered as “good enough” for a DCI solution.
  • Some DCI requirements are still missing. It’s not perfect for DCI and we need to fully understand the shortcomings before deciding if we can deploy this mode for DCI requirements

We can give it 7 out of 10 for VxLAN 2.0

DCI Survey VxLAN 2.0



Same Control Plane MP-BGP AF EVPN as above but with additional services to improve and secure inter-site communication

  • Should fill some important gaps for DCI requirements

If this evolution honors the promises we may give it 9 out of 10 for VxLAN 3.0

DCI Survey VxLAN 3


Even though several ASIC models (custom ASIC or merchant silicon) support VxLAN encapsulation, that doesn’t mean all vendors have implemented a Control Plane that improves the host learning process. Some vendors, such as Cisco, support Multicast transport and the BGP-EVPN AF Control Plane (VxLAN Phase 2.0) on their Data Center switches (Nexus series) and WAN edge routers (ASR series). Others have implemented VxLAN 1.5, while certain vendors support only Multicast-based transport (VxLAN 1.0).

Consequently, it is important to examine the different modes of VxLAN in order to capture which implementation could be suitable for a DCI solution.


This post is written in two main parts: the short summary given above, followed by a detailed packet walk covering all the DCI requirements.

Some readers may want to understand the technical details of the workflows and learn about the DCI requirements for VxLAN, which together comprise the second part. For those brave readers, take a deep breath and go to the next section :o)

Let’s Go Deeper with the Different Phases

Global Architecture for Dual-Sites

Let’s assume the following DCI scenario as a base reference for the whole document. Two data center network fabrics are interconnected using a Layer 3 core network.

As described throughout this article, and as often mentioned for hybrid cloud environments, keeping the Control Plane of each location independent is recommended as a strong DCI requirement.

Control Plane (CP) independence is elaborated in the next pages; however, a single Control Plane stretched across two locations is also discussed for VxLAN phase 1.5.

Global architecture for Dual-Sites using independent Control Plane

DCI and Fiber-based Links

This document focuses on a Layer 3 Core Network used for interconnecting sites. The Layer 3 Core network is usually accessed through a pair of Core routers enabled at the WAN Edge layer of each Data Center. This solution typically offers long to unlimited distances to interconnect two or more Data Centers.

However, before we go further in this document, it is important to briefly clarify another design generally deployed for Metro distances, which relies on fiber connectivity between sites.


DCI and fiber consideration in a Metro Distances

From a control plane point of view, this design can still support an independent BGP Control Plane per fabric if desired. Host information is dynamically learnt and populated according to the CP mode supported by the VxLAN implementation, as described in the following pages.

However, the main difference between the two designs lies in the services that can be enabled on top of this Metro architecture. Conventionally, IP Multicast is rarely an option with Layer 3 managed services when the Enterprise deploys a Layer 3 backbone (WAN/MAN). Over Metro distances, sites are most often interconnected with optical links. With dark fiber owned by the Enterprise itself, or with DWDM managed services, the network manager can easily enable IP Multicast on their own.

That gives the flexibility to also leverage IP Multicast for inter-site BUM traffic.

Note that when deploying DWDM, as with any traditional multi-hop routed network, it is recommended to enable BFD. The objective is to improve the detection of remote link failures and to accelerate convergence. This is particularly true when the optical switch doesn’t offer this service (aka remote port-shutdown). However, BFD is usually not necessary for direct fibers (e.g. large campus deployments).

VxLAN and Reference Design Taxonomy

Terminology for the following scenarios

  • VxLAN Segment: This is the Layer 2 overlay network over which endpoint devices, physical or virtual machines, communicate through a direct Layer 2 adjacency.
  • “VTEP” stands for VxLAN Tunnel EndPoint (it initiates and terminates the VxLAN tunnel)
  • “NVE” is the VxLAN edge device that hosts the VTEP (also sometimes referenced as “Leaf” in this document, as we assume the VTEP is enabled on all Leafs in the following scenarios).
  • “V” stands for VTEP
  • “VNI” stands for Virtual Network Identifier (or VxLAN Segment ID.)
  • “H” stands for Host
  • “Host” may be a bare-metal server or a virtualized workload (VM). In both cases the assumption is that the “host” is always connected to a VLAN.
  • “L” stands for Leaf (This is also the Top Of Rack or ToR)
  • “RR” stands for (BGP) Route Reflector
  • “Vn” where “n” is the VTEP IP address
  • “CR” stands for Core Router (redundancy of Core Routers is not shown)
  • “vPC” stands for Virtual Port-Channel. This is a Multi-Chassis Ether-Channel (MEC) feature. Other forms of MEC exist, such as Multi-chassis Link Aggregation (MLAG). I’m using vPC as the reference because it brings a key feature known as Anycast VTEP gateway, which allows sharing the same virtual VTEP address across the two tightly coupled vPC peer-switches.

With the Dual-homing function, the same VTEP identifier is duplicated on the 2 vPC peer-switches, representing the same virtual secondary IP address (to simplify the figures in this post, some identifiers have been shortened, e.g., V2 where “2” is the secondary IP address). Note that the secondary IP address is used for both dual-homed vPC members and Orphan devices with a single attachment.

Dual-homing in the context of VxLAN/DCI is discussed further in this post.

This article does not aim to detail the dual-homing technology per se. Additional technical details on redundant vPC VTEP can be found in the following pointers:

In addition, for references throughout this article, it is assumed that:

  • VLAN 100 maps to VNI 10000
  • VLAN 200 maps to VNI 20000
    • Note that VLANs are locally significant to each Leaf (or to each port of a Leaf), thus the VLAN ID mapped to a VNI can be different on a remote Leaf.
  • All VTEPs know their direct-attached hosts
    • Silent hosts are discussed in a later section on VxLAN/EVPN MP-BGP.
  • BGP Route Reflectors are enabled to redistribute the information to all of the VTEPs that they know. They are usually deployed for scaling purposes.
  • One of the main DCI requirements is to maintain Control Plane independence on each site.
    • In the first design that covers VxLAN unicast-only mode with HER, a unique Control Plane with the same Autonomous System (AS) spread over east to west (E-W) locations is discussed.
    • The DCI design will evolve with 2 independent Control Planes in regard to VxLAN/EVPN MP-BGP, as recommended for DCI architectures.
  • Finally, to reduce the complexity of the scenario, let’s assume that the Border-Leafs used to interconnect to the outside world have no host attached to them. Obviously in a real production network the Border-Leafs may be also used as a ToR. A further section discusses this scenario.

VxLAN Phase 1.0: Multicast Transport

As a reminder, this first stage has been deeply elaborated in this post: http://yves-louis.com/DCI/?p=648.

To summarize, this first implementation of VxLAN imposes Multicast transport. It therefore assumes that IP Multicast transport exists inside and outside the DC fabric. IP Multicast transport in the Layer 3 backbone is not necessarily an option for all enterprises. In addition, a large amount of IP Multicast traffic between DCs could be a challenge to configure and maintain from an operational point of view (without mentioning performance and scalability).

Above all, it relies on Flood&Learn to compute the host reachability with no or limited policy tools for rate-limiting the concerned IP Multicast groups.

As a result, with excessive flooded traffic and bandwidth exhaustion, the risks of disrupting the second Data Center are very high.

On top of that, traffic is not optimized because this mode has no distributed Anycast gateway function. As a result, the “hair-pinning” or “ping-pong” effect may seriously impact application performance.

Consequently, Flood & Learn VxLAN may be used only for very limited and rudimentary LAN extension across remote sites. Although it technically carries Layer 2 frames over a Layer 3 multicast network, be aware that this is risky: addressing the DCI requirements would necessitate additional intelligent services and tools, such as a Control Plane for host learning, flood suppression, an independent Control Plane per site and a Layer 3 anycast gateway, to list just a few, none of which are available with this implementation of VxLAN.

Therefore the recommendation is to NOT deploy VxLAN Multicast transport as a DCI solution.

VxLAN Phase 1.5: Unicast-only Transport Using Head-End Replication (Flood&Learn mode)

This implementation is also known as Unicast-only mode. What it means is that it relies on Head-end replication to dynamically discover and distribute the VxLAN VTEP information necessary to build the overlay network topology. However, hosts are learnt using the traditional Flood & Learn data plane.

Let’s analyze a packet walk step by step for Head-end replication with dynamic VTEP discovery.

In the following design, the Control Plane is spread across the two sites. An IGP of choice (OSPF, EIGRP, IS-IS, etc.) is used to communicate between all the Leafs, establishing the network overlay tunnels used for the data plane. Optionally, a redundant Control Plane is initiated in DC-1 for scalability purposes and to simplify the learning process.

VxLAN 1.5 Head End Replication 1

VTEPs advertise their VNI membership

1 – All VTEPs advertise their VNI membership to the Control Plane. The target is the Route Reflector located in DC-1.

BGP propagates the VTEP information

2 – The Control Plane consolidates the VTEP information and redistributes the VTEP list, with their respective VNIs, to all the VTEPs it knows.

3 – Each VTEP obtains a list of its VTEP neighbors for each VNI.
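
To make steps 1 to 3 more concrete, here is a minimal Python sketch of the consolidation logic. It is only an illustration, not how any platform implements it: the function names and the VNI membership of each VTEP (in particular V2 binding only VNI 20000) are assumptions chosen to match the scenario used in this post.

```python
from collections import defaultdict

def build_membership(advertisements):
    """Step 2: consolidate the {vtep: set of VNIs} advertisements per VNI."""
    members = defaultdict(set)
    for vtep, vnis in advertisements.items():
        for vni in vnis:
            members[vni].add(vtep)
    return members

def neighbors_for(vtep, members):
    """Step 3: for each VNI bound on `vtep`, list its remote VTEP neighbors."""
    return {
        vni: sorted(vteps - {vtep})
        for vni, vteps in members.items()
        if vtep in vteps
    }

# Step 1: hypothetical advertisements (V1, V2 in DC-1; V3, V4 in DC-2).
advertisements = {
    "V1": {10000},
    "V2": {20000},
    "V3": {10000, 20000},
    "V4": {10000},
}
members = build_membership(advertisements)
print(neighbors_for("V1", members))  # {10000: ['V3', 'V4']}
```

The per-VNI neighbor list returned for V1 is what Head-end replication uses later as its replication list for VNI 10000.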

Now that each VTEP has exhaustive knowledge of all existing VTEP neighbors and their relevant VNI, H1 establishes communication with H4. The hosts are not aware of the existence of the overlay network. They behave as usual to communicate with their peers. H1 believes that H4 is attached to the same L2 segment; in reality, they are L2-adjacent through the VxLAN tunnel.

ARP request flooded across the DCI

1 – H1 ARP’s for H4. The source MAC is therefore H1, and the destination MAC is FF:FF:FF:FF:FF:FF.

2 – H1 is attached to VLAN 100, which is mapped to VNI 10000. The local VTEP V1 does a lookup in its VTEP table and locates all the VTEP neighbors responsible for VNI 10000. In this example, VTEP 3 and VTEP 4 both bind VNI 10000.

3 – As a result, VTEP V1 encapsulates the original broadcast ARP request from H1 with the VxLAN header using VNI 10000 and replicates this broadcast packet toward VTEP V3 and V4. VTEP 1 is the source IP address used to construct the overlay packets, with VTEP 3 and VTEP 4 as the destination IP addresses.

4 – VTEP 3 and VTEP 4 both receive the VxLAN packet identified with VNI 10000, remove the VxLAN header and identify the matching Dot1Q tag (VLAN 100). As a result, NVE Leafs 23 and 25 each forward the ARP request to all their interfaces bound to VLAN 100. H4 and H5 receive the ARP request. H5 ignores the request, while H4 learns and caches the MAC address of H1.

ARP reply Unicast

5 – H4 can reply with a Unicast frame to H1 toward NVE Leaf 24. An LACP hashing algorithm is applied across the vPC attachment, so the reply will hit one of the two vPC peer-NVEs. It doesn’t matter which Leaf receives the frame, as both VTEP vPC peer-switches share the same VTEP table entries; we use L24 for this example.

6 – VTEP 3 (NVE Leaf 24) sees that the response frame from H4 is bound to VLAN 100 and consequently encapsulates the ARP reply with the identifier VNI 10000. It then sends it as Unicast toward the VTEP 1 IP address, as it knows that the destination MAC of the ARP reply is locally connected to VTEP 1.

7 – VTEP 1 receives the VxLAN packet with VNI 10000 from VTEP 3, strips off the VxLAN header and forwards the original ARP reply to H1.
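
The packet walk above can be summarised with a small Python model of a Flood & Learn VTEP using Head-end replication. This is a simplified sketch under my own assumptions: the class name, the method names and the MAC labels ("mac-h1", "mac-h4") are hypothetical, and encapsulation, VLAN mapping and vPC are left out on purpose.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

class FloodLearnVtep:
    def __init__(self, name, flood_list):
        self.name = name
        self.flood_list = flood_list  # {vni: [remote VTEP names]}
        self.mac_table = {}           # {(vni, mac): "local" or a remote VTEP name}

    def from_host(self, vni, src_mac, dst_mac):
        """Frame received on a local port: learn the source MAC and
        return the list of remote VTEPs the frame is tunneled to."""
        self.mac_table[(vni, src_mac)] = "local"
        known = self.mac_table.get((vni, dst_mac))
        if dst_mac == BROADCAST or known is None:
            # Broadcast or unknown unicast: head-end replication.
            return list(self.flood_list.get(vni, []))
        return [] if known == "local" else [known]

    def from_tunnel(self, vni, src_mac, src_vtep):
        """Decapsulated frame: data-plane learning of the remote MAC."""
        self.mac_table[(vni, src_mac)] = src_vtep

v1 = FloodLearnVtep("V1", {10000: ["V3", "V4"]})
v3 = FloodLearnVtep("V3", {10000: ["V1", "V4"]})

print(v1.from_host(10000, "mac-h1", BROADCAST))  # ['V3', 'V4']  (steps 1-3)
v3.from_tunnel(10000, "mac-h1", "V1")            # step 4: V3 learns H1 behind V1
print(v3.from_host(10000, "mac-h4", "mac-h1"))   # ['V1']        (steps 5-6)
```

Note that nothing in this model knows a host before its first frame has been flooded, which is exactly the weakness discussed in the key takeaways.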

Key takeaways for VxLAN Unicast mode with HER for VTEP only and DCI purposes

As a result, we can note that even though the transport is Unicast-only (Head-end replication), the learning process for host information is not optimised. The source VTEP knows the remote VTEPs for each VNI of interest, but it still needs to learn the destination MAC, which relies on Data Plane learning using Flood & Learn. As a result, Broadcast, Unknown Unicast and Multicast (BUM) traffic, including ARP requests, is flooded using Head-end replication to the remote VTEPs. ARP suppression cannot be enabled as the host reachability process relies on Flood&Learn. The risk of disrupting both data centers due to a Broadcast Storm is high.

In addition, it is important to note that this implementation in general doesn’t use a standard Control Plane approach such as MP-BGP. Consequently it may not allow for an independent Control Plane per site, hence the reason I used a single CP stretched across the two sites.

From a standard point of view, VxLAN does not offer any native L2 loop protection. Consequently, BPDU Guard must be configured manually on all network interfaces.

Last but not least, Anycast L3 Gateway for the network overlay is usually not supported with this mode. As a result, the “hair-pinning” or “ping-pong” effect may seriously impact application performance.

In short, be aware that having a Unicast-only mode with Head End Replication doesn’t mean that it is a valid DCI solution.

VxLAN Phase 2.0: VxLAN Transport using BGP-EVPN – Independent CP

This is an important enhancement of VxLAN, in which the MP-BGP EVPN control plane is leveraged for the end-point learning process. In addition to the VTEP information discussed above, all hosts are now dynamically discovered and distributed among all VTEPs by a BGP EVPN Control Plane that pushes the end-point information toward all VTEPs, massively reducing the amount of flooding needed for learning.

This is achieved using MP-BGP based on EVPN NLRI (Network Layer Reachability Information), which carries both Layer 2 and Layer 3 reachability information. The VxLAN Control Plane therefore relies on IETF RFC 7432 “BGP MPLS-Based Ethernet VPN” (https://tools.ietf.org/html/rfc7432) as well as the draft document (https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-04) that describes how RFC 7432 can be used for a Network Virtualization Overlay (NVO) such as VxLAN.

VxLAN Bridging and routing services in the VxLAN overlay may exist on each Leaf node. Bridging is a native function that comes with the hardware-based VTEP platforms. It is required to map a VLAN ID to a VxLAN ID (VNI). Distributed Anycast routing service is a key element needed to reduce the hair-pinning effect in a distributed virtual fabric design. This is treated in a separate section.

In the previous scenarios, a single Control Plane has been responsible for both fabrics. This implies two consequences:

  • First of all, with a single autonomous system (AS) spread over the two sites, each internal BGP speaker must establish and maintain a session with all other BGP peers in a full mesh fashion (where everyone speaks to everyone directly). In a DCI infrastructure, when establishing a full mesh between all VTEPs spread across multiple sites, this number of sessions may degrade the performance of the routers, due to either a lack of memory or excessive CPU requirements.
  • Second, for resiliency purposes, it is preferable to offer an independent Control Plane per site (Figure 2); thus, if one CP fails in one location, the other site is not impacted.
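
To give an order of magnitude for the first point, the short Python sketch below compares the number of iBGP sessions in a full mesh, n(n-1)/2, with a design using Route Reflectors. The VTEP counts and the choice of two Route Reflectors that also peer with each other are purely illustrative assumptions.

```python
def full_mesh_sessions(speakers):
    """iBGP full mesh: every speaker peers with every other one."""
    return speakers * (speakers - 1) // 2

def route_reflector_sessions(clients, reflectors=2):
    """Each client peers with each Route Reflector; the reflectors are
    assumed to be fully meshed among themselves."""
    return clients * reflectors + full_mesh_sessions(reflectors)

for vteps in (8, 32, 128):
    print(vteps, full_mesh_sessions(vteps), route_reflector_sessions(vteps))
# 8 28 17
# 32 496 65
# 128 8128 257
```

The full mesh grows quadratically with the number of VTEPs, while the Route Reflector design grows linearly.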

Therefore, in the following part, the VxLAN Data Plane stretched across the two locations is built with two independent Control Planes aligned with the DCI requirements. Let’s re-use the same scenario elaborated previously, this time with an independent Control Plane inside each fabric and eBGP EVPN peering establishment on the WAN side. The next section provides an overview of how both Control Planes will be updated by each other. That is followed by communication between two end-nodes.

This architecture is now referred to as VxLAN Multipod, as the same model is mainly deployed within the same building or campus with multiple dispersed PoDs. The only difference is that within a campus, where the fiber is owned by the enterprise within the same location (HA is 99.999), it is not necessarily required to separate the control plane between PoDs. However, if the DWDM is an L2 managed service, for which the HA SLA usually drops to 99.7, designing the solution with independent CPs can help to reduce the impact of a loss of signal (LOS). Consequently we will call this design VxLAN EVPN Stretched Fabric or VxLAN EVPN Multipod Geographically Dispersed.

Nonetheless, this is not a DCI architecture per se: there is no delineation at the WAN edge and no VLAN hand-off at the DCI edge; it is a pure Layer 3 underlay from PoD to PoD.

Host discovery and distribution process with independent control planes

VxLAN eBGP Control Plane

iBGP EVPN AF is used for the control plane to populate the host information inside each fabric. MP-BGP (eBGP) peering is established between each pair of Border-Leafs, allowing for different AS per location.

The multi-hop function allows eBGP peering to be achieved over intermediate routers (e.g., ISP) without requiring them to process EVPN. The same VxLAN overlay network, established from end to end across the two Data Centers, represents the Data Plane.

If we replay the same scenario as used before with Phase 1.5, we can see the efficiency gained with the BGP EVPN AF Control Plane.

In the following design, an independent Control Plane is initiated for each site using iBGP. Subsequently, independent BGP Route Reflectors are initiated on both fabrics. All VTEPs advertise Host Routes (IP and MAC identifiers) for the hosts attached to their network interfaces to their respective BGP Control Plane. Each VTEP establishes an iBGP peering with each BGP Route Reflector.

Each NVE updates the CP with its hosts

Subsequently, each BGP Route Reflector knows all local VTEP and host information that exists in its fabric. The Route Reflector therefore can distribute to all its local VTEPs the required host information that it is aware of; however, it needs first to update its BGP peer in the remote location, and it expects as well to be updated with the remote VTEP/host information.

In order for the BGP Route Reflector to propagate its local host information table, with the respective VTEP, VNI, IP and MAC addresses, to the remote fabric, two main options are possible:

  • Route Reflector peering
  • Border-Leaf peering

BGP Route Reflector Peering

The BGP neighbor is specified for the spine and eBGP will be initiated, as two different AS exist.

Route Reflector Peering

Each BGP Route Reflector can now propagate its local host table toward the remote Control Plane.

BGP Border-Leaf peering:

The BGP next hop for each spine (RR) is the local pair of Border-Leafs connecting the Core layer. eBGP peering is established between each Border-Leaf pair toward the core routers. All transit routers in the path (Layer 3-managed services via SP) do not require any knowledge of EVPN. To achieve this transparency, eBGP peering is configured to support multi-hop.

eBGP peering from the Border-Leaf

The Route Reflector will relay the host information toward the Border-Leaf, which will forward it to its eBGP peer using EVPN AF.

Global Host information distribution

Each Control Plane has now collected all host information from its local fabric as well as from the remote fabric. Each CP needs straightaway to populate its local VTEPs with all host identifiers (IP & MAC addresses) that exist across the two fabrics.

Global host information population

Packet walk with an ARP request

Each VTEP has an exhaustive knowledge of all existing local and remote hosts associated to locally active VNIs, including all other VTEPs, with their relevant host information, VNI, IP and MAC addresses. The VTEP neighbour information is distributed accordingly, as described in the previous phase.

H1 wants to establish communication with H4.

ARP suppression

1 – H1 ARP’s for H4. The source MAC is therefore H1, and the destination MAC is FF:FF:FF:FF:FF:FF.

2 – Leaf 11 gets the ARP request from H1 and notes that this frame must be mapped to the VNI 10000. VTEP 1 checks against its ARP suppression cache table and notices that the information for H4 is already cached.

3 – VTEP 1 responds directly to H1 with a Unicast ARP reply carrying the MAC address of H4.

BGP EVPN AF and Unicast traffic

As a result, the ARP request is suppressed over and outside the fabric.

4 – Host H1 can now send a Unicast frame to host H4.

5 – NVE Leaf 11 receives the Unicast Layer 2 frame destined for H4 over VLAN 100 and observes this flow requires it to be encapsulated with VNI 10000 with the IP destination VTEP 3, which it knows.

  • The Unicast packet is load balanced across the fabric, and ECMP determines the path to hit the Anycast VTEP address (V3) available on both vPC peer-switches.

6 – VTEP 3 on Leaf 24 receives the frame, strips off the VxLAN header and performs a MAC lookup for H4 in order to retrieve the interface attaching H4 (e.g., E1/2). It then forwards the original frame to H4.

Thus, after each VTEP has been notified and updated with the information about all the existing hosts (VTEP, VNI, Host IP & MAC), when a Leaf sees an ARP request on one of its network interfaces (VLAN), it performs ARP snooping on behalf of the destination: it responds directly to the local host originating the request with the destination MAC address, which it already knows. Accordingly, the amount of flooding is greatly reduced, as ARP requests are not flooded when their destination is known.

This assumes that all hosts have been dynamically discovered and distributed to all VTEPs. However, for silent hosts and non-IP protocols, Unknown Unicast (UU) frames are still flooded.
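
The ARP suppression decision can be sketched in a few lines of Python. This is a simplified illustration, not a vendor implementation: the class, its method names and the IP addresses (10.0.100.4 for H4, 10.0.100.6 for an unknown host) are hypothetical.

```python
class ArpSuppressingVtep:
    def __init__(self, flood_list):
        self.flood_list = flood_list  # {vni: [remote VTEP names]}
        self.host_table = {}          # {(vni, ip): (mac, vtep)}, fed by the control plane

    def install_host_route(self, vni, ip, mac, vtep):
        """Host route received from the BGP EVPN control plane."""
        self.host_table[(vni, ip)] = (mac, vtep)

    def handle_arp_request(self, vni, target_ip):
        """Answer locally on a cache hit, otherwise fall back to flooding."""
        entry = self.host_table.get((vni, target_ip))
        if entry is not None:
            mac, _vtep = entry
            return ("reply", mac)
        return ("flood", list(self.flood_list.get(vni, [])))

v1 = ArpSuppressingVtep({10000: ["V3", "V4"]})
v1.install_host_route(10000, "10.0.100.4", "mac-h4", "V3")

print(v1.handle_arp_request(10000, "10.0.100.4"))  # ('reply', 'mac-h4')
print(v1.handle_arp_request(10000, "10.0.100.6"))  # ('flood', ['V3', 'V4'])
```

The second call shows the fallback path: when the target is not in the host table, the request is still replicated to the remote VTEPs.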

Silent host

Now, let’s pretend we have a silent host (H6) as a target destination for new communication. Consequently, its directly attached ToR (L25) is unaware of its existence and therefore the NVE function of the Leaf 25 cannot notify the BGP Control Plane about the existence of host H6.

BGP EVPN and silent host (H6)

1 – From the above scenario, Host H1 wants to communicate with H6 but has no ARP entry for this host. As a result, it “ARP requests” for H6.

2 – The NVE Leaf 11 does a lookup on its VTEP/Host table but gets a “miss” on information for host H6.

3 – Subsequently, VTEP 1 encapsulates the ARP request with VNI 10000 and replicates the VxLAN packet to all VTEPs that bind the VNI 10000 (HER).

Note that if IP multicast is enabled, it can also forward the ARP request using a multicast group (as part of the BUM traffic).

4 – VTEP 3 and VTEP 4 receive this ARP request packet flow, each strip off the VxLAN header and flood their interfaces associated with VLAN 100. H4 receives but ignores the L2 frame; however, H6 takes it for further treatment. Depending on its OS, H6 caches the host H1 information in its ARP table.

BGP EVPN AF Silent host update to BGP

5 – Host H6 replies with its ARP response toward Host H1.

Two actions are completed in parallel:

Action 1: Data Plane transport

6 – VTEP 4 encapsulates the ARP reply with the VNI 10000 and sees on its VTEP host table that H1 belongs to VTEP 1. VTEP 4 forwards Unicast the VxLAN response to VTEP 1.

7 – VTEP 1 receives the packet from VTEP 4, strips off the VxLAN header, does a MAC lookup for Host 1 and forwards the response Unicast to H1 through the interface of interest (e.g., E1/12). H1 populates its local ARP cache.

Action 2: Control Plane host learning and population

8 – In the meantime, VTEP 4 populates its BGP Control Plane with the new host route for Host 6.

As soon as the local Route Reflector receives a new host entry, it updates its BGP neighbor. Both BGP Control Planes are now up-to-date with the new entry for Host H6.

BGP EVPN AF Silent Host Population

9 – The BGP route reflector updates all VTEPs with the new host route H6.

In a nutshell, if there is a Unicast RIB “miss” on the VTEP, the ARP request will be forwarded to all ports except the original sending port (ARP snooping).

The ARP response from the silent host will be punted to the Supervisor of its VTEP with the new host information, and subsequently populates the BGP Control Plane with its Unicast RIB (learning). The ARP response is therefore forwarded in the Data Plane.
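
The control plane side of this workflow (steps 8 and 9) can be pictured with the toy Python model below. It is only a sketch under strong simplifications: the class and attribute names are hypothetical, VTEP host tables are plain dictionaries, and, unlike a real Route Reflector, this model also writes the route back into the table of the VTEP that originated it.

```python
class RouteReflector:
    def __init__(self, name):
        self.name = name
        self.clients = []   # host tables (dicts) of the local VTEPs
        self.peer = None    # Route Reflector of the remote site (eBGP)
        self.routes = {}

    def advertise(self, key, route, from_peer=False):
        """Store a host route, push it to local VTEPs, relay it once to the peer."""
        self.routes[key] = route
        for table in self.clients:
            table[key] = route
        if self.peer is not None and not from_peer:
            self.peer.advertise(key, route, from_peer=True)

rr_dc1, rr_dc2 = RouteReflector("RR-DC1"), RouteReflector("RR-DC2")
rr_dc1.peer, rr_dc2.peer = rr_dc2, rr_dc1

v1_table, v4_table = {}, {}
rr_dc1.clients.append(v1_table)
rr_dc2.clients.append(v4_table)

# VTEP 4 has just learnt the silent host H6 from its ARP reply.
rr_dc2.advertise((10000, "mac-h6"), {"vtep": "V4"})
print(v1_table)  # {(10000, 'mac-h6'): {'vtep': 'V4'}}
```

The `from_peer` flag simply prevents the two Route Reflectors from bouncing the same update back and forth.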

Note: Inside the same fabric, with the MP-BGP EVPN control plane and host discovery and distribution, it is certainly more appropriate to enable Multicast to handle the BUM traffic, thus reducing Ingress Replication (IR) for better scalability. In the above DCI scenario, Ingress Replication is used to carry the BUM traffic inside each site and across the two locations. The operational reason is that IP Multicast for Layer 3 managed services (L3 WAN) is not always an option for the Enterprise, and we should consider the scenario that fits all circumstances. However, if IP Multicast is available, the ARP request or any BUM traffic can also be transported inside a Multicast group within the core network.

Host Mobility

With the virtual environment, host mobility should respond to a couple of requirements:

1 – Detect immediately the new location of the machine after its move and notify the concerned network platforms about the latest position.

2 – Maintain the session stateful while the machine moves from one physical host to another.

Traditionally, when a station is powered up, it automatically generates a gratuitous ARP (GARP), sending its MAC address to the entire network. The CAM table of every switch belonging to the same Layer 2 broadcast domain is updated accordingly with the outbound interface to be used to reach the end-node. This allows any switch to dynamically learn the unique location of the physical devices.

For virtual machine mobility purposes, there are different ways to trigger a notification after a live migration, informing the network about the presence of a new VM.

When running Microsoft Hyper-V, Windows will send out an ARP after the migration of the guest virtual machine has been successfully completed. With VMware vCenter (at least up to ESX 5.5), a Reverse ARP (RARP) is initiated by the hypervisor and sent toward the upstream switch (Leaf). Note that RARP carries no Layer 3 information (actually, it asks for it, like DHCP does).

With the original VxLAN Flood & Learn (Multicast-only transport and Unicast-only Transport with HER), the GARP or RARP are flooded in the entire overlay network and the learning process is achieved like within a traditional Layer 2 broadcast domain. Hence there is no need to detail this workflow.

However, with the VxLAN/EVPN MP-BGP Control Plane, the new parent switch detects the machine after the move, and this information is then propagated to all VTEPs of interest.

Host Advertisement: Sequence number 0

In the example above, VM1 is enabled and “virtually” attached to its Layer 2 segment. As a result, VTEP 1 detects its presence and advertises its location to its BGP Route Reflector.

  1. VTEP 1 notifies its local BGP Route Reflector, with a sequence number equal to “0”, that it owns VM1.
  2. The BGP RR updates its BGP RR peer about this new host. As a result, both BGP Control Planes are aware of the host VM1 and its parent VTEP.
  3. The RRs in both sites update their local VTEPs about VM1.

VxLAN 2.0 Host mobility 2

Host move, new advertisement using a Sequence number 1

VM1 migrates from its physical Host 1 (H1) in DC1, across the site interconnection up to the physical Host 4 (H4) located in DC2. We assume the port group in the virtual switch is configured accordingly.

  1. After the move of VM1, the hypervisor sends a RARP or GARP.
  2. The Leaf 23 or 24 (depending on the LACP hashing result) detects the new presence of VM1.
  3. VTEP 3, which knew from a previous notification that VM1 was attached behind VTEP 1, sends an update for host VM1 using a sequence number equal to “1” in order to overwrite the previous host information for VM1.

If it’s a RARP (vMotion), the host IP address is unknown at this point in time; consequently, the IP <=> MAC association is deduced from the local VTEP host table.

  4. The BGP RR in DC2 updates its RR peer in DC1.
  5. Subsequently, both BGP RRs update their respective local VTEPs about VM1 with its new location behind VTEP 3.

VTEP 1 sees a more recent route for VM1 and will update its routing table accordingly.
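
The role of the sequence number can be illustrated with the following Python sketch. It only models the rule "the highest sequence number wins" described above; the function names and the table layout are my own assumptions, and the tie-breaking needed when two VTEPs advertise the same sequence number is deliberately ignored.

```python
def apply_update(host_table, mac, vtep, seq):
    """Accept a host route only if it is new or carries a higher sequence number."""
    current = host_table.get(mac)
    if current is None or seq > current["seq"]:
        host_table[mac] = {"vtep": vtep, "seq": seq}
        return True
    return False

def advertise_move(host_table, mac, new_vtep):
    """A VTEP detecting a host it knew elsewhere bumps the sequence number."""
    current = host_table.get(mac)
    seq = 0 if current is None else current["seq"] + 1
    apply_update(host_table, mac, new_vtep, seq)
    return seq

table = {}
print(advertise_move(table, "mac-vm1", "V1"))   # 0  (first advertisement)
print(advertise_move(table, "mac-vm1", "V3"))   # 1  (after the move to DC2)
print(apply_update(table, "mac-vm1", "V1", 0))  # False (stale route ignored)
print(table)  # {'mac-vm1': {'vtep': 'V3', 'seq': 1}}
```

A late or replayed advertisement from VTEP 1 with sequence number 0 is simply discarded, so every VTEP converges on VTEP 3 as the location of VM1.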

VxLAN 2.0 Host mobility 3

All VTEPs are updated with the new location VTEP 3 for VM1

Distributed Default gateway

In a DCI environment, it is important to take into consideration the bandwidth consumption as well as the increased latency for E-W traffic across the multiple sites, which may have a critical impact on application performance. To facilitate optimal east-west routing and reduce the hair-pinning effect, while supporting transparent virtual machine mobility without interruption, the same default gateway must exist on each site, ensuring that redundant first-hop routers exist within each data center.

There are two options to achieve this Layer 3 Anycast function of the default gateway on both sites.

  • The first option is to enable a pair of FHRP A/S gateways on each site and manually filter the FHRP multicast and virtual MAC addresses so that they are not propagated outside the site. As a result, each pair of FHRP routers believes it is alone and becomes active in its location with the same virtual IP and MAC addresses. When a VM migrates to the remote location, it continues to use the same default gateway identifiers (L2 & L3) without interrupting the current active sessions. The drawbacks of this option are that it requires manual configuration at the DCI layer (e.g. FHRP filtering) and that it limits the number of active default gateways per site to one single device (per VLAN). This option should be applicable to any VxLAN mode.
  • The second option is to assign the same gateway IP and MAC addresses for each locally defined subnet in each Leaf. As a result, the first-hop router for the servers exists across all Leaf switches for each VLAN of interest.

Note that this feature is compatible with traditional DCI solutions such as OTV or VPLS as it occurs on each ToR.

Anycast Layer 3 Gateway

In the above example, in DC1, H1 from VLAN 100 wants to communicate with H2 on VLAN 200.  H10 from VLAN 100 wants also to speak with H2. In the meantime in DC2, H5 on VLAN 100 wants to connect with H3 on VLAN 200. They all use the same Default gateways for their respective VLAN 100. In a traditional deployment, all the routed traffic would hit a single centralised layer 3 gateway. With distributed L3 Anycast gateway, the traffic is directly routed by the first hop router initiated in each ToR.

H1 sends the packet destined to its default gateway (VLAN 100); the packet is routed and encapsulated with the relevant VNI by L11 toward VTEP 2. L12 and L13 are the first-hop routers for H2 with regard to VLAN 200.

H10 sends the packet destined to the same default gateway (VLAN 100), with the same identifiers used by H1, toward H2 that sits behind VTEP 2.

H5 sends the packet destined to the same default gateway, with the same identifiers used by H1, toward H3 on VTEP 3. As a result, although in this example all sources use the same default gateway, the routing function is distributed among all ToRs. This increases VxLAN-to-VxLAN L3 routing performance and reduces hair-pinning, and therefore latency, which improves the performance of multi-tier applications. Having the same gateway IP and MAC address ensures a default gateway presence across all leaf switches, removing the suboptimal routing associated with separate centralised gateways.

Key takeaways for VxLAN MP-BGP EVPN AF and DCI purposes

Although the VxLAN MP-BGP EVPN Control Plane was not originally designed for DCI purposes, it can be diverted to provide a “good enough” DCI solution, specifically with VTEP and host information dynamically learnt and populated, in conjunction with ARP suppression reducing UU across the two sites. In the resilient form of the architecture, redundant tunnel end-points are provided by a pair of Cisco vPC peer switches. However, we must be cautious about some important shortcomings and risks for the DCI environment that need to be clarified and highlighted to Network Managers.

Some tools can be added manually, such as L2 loop detection and protection, while some others are not available yet, such as selective BUM or Multi-Homing discussed afterward.

Generic DCI purposes with VxLAN transport

Dual-homing for resilient DCI edge devices

Different solutions exist that offer dual-homing from a Layer 2 switching point of view, such as vPC, MLAG, VSS or nV. The tunnel created for the network overlay must be originated from a VTEP. The VTEP resides inside the Leaf, mapping VLAN IDs to VxLAN IDs. It is therefore crucial that the VTEP is always up and running for business continuity. The host-based VTEP gateway supports an Active/Standby HA model: when the active fails, it may take some precious time before the standby VTEP takes over. To improve convergence, transparency and efficiency, Cisco Nexus platforms support Active/Active VTEP redundancy by allowing a pair of virtual Port-Channel (vPC) switches to function as one logical VTEP device sharing an Anycast VTEP address. The vPC peer switch is leveraged for redundant host connectivity, while each device individually runs Layer 3 protocols with the upstream routers in the underlay networks: Fabric layer and Core layer.

The intra-fabric traffic is load-balanced using ECMP (L3) among the 2 VTEPs configured with the Anycast VTEP address, while the vPC members, on the network side, distribute the traffic over a Layer 2 EtherChannel (for example negotiated with LACP).


vPC peer-switch & Anycast VTEP Address

For the VxLAN overlay network, both switches use the same Anycast VTEP address as the source to send the encapsulated packets. For the devices dual-attached in the underlay network, the two vPC VTEP switches appear to be one logical VTEP entity.

For VxLAN Data Plane as well as BGP Control Plane traffic, the Layer 3 is not directly concerned with vPC peering, as it is pure Layer 3 communication (Layer 3 intra-fabric to Layer 3 Core communication). Consequently, the Layer 3 Data Plane (VxLAN) is routed across L3 Border-Leaf devices. As a traditional standard routing protocol, it natively supports a L3 load-balancing algorithm such as ECMP or PBR via the two Layer 3 Border Leafs. If one DCI link or DCI device fails, traffic will be re-routed automatically to the remaining device.

Having said that, and contrary to the scenarios described previously, the Border-Leaf is not necessarily dedicated only to routing the network fabric to the outside world; it can also be used as a ToR offering dual-homed connectivity to devices (hosts, switches, network services, etc.). This gives the flexibility to use a pair of Border-Leafs either for DCI connectivity only (Data Plane and Control Plane), or for both DCI connectivity and vPC VTEP for dual-homed devices with regard to VLAN to VNI mapping. The latter use-case is certainly the most deployed.

The following figure shows each VxLAN overlay network (VNI) routed toward the core layer. A pair of Border-Leafs is used as redundant Layer 3 gateways for DCI purposes only. The VxLAN tunnel is initiated and terminated in the remote NVE of interest.

Border-Leaf L3 DCI only

In the following figure, however, the Border-Leaf vPC peer-switches are used for two functions. The first is to offer redundant Layer 3 gateways for DCI purposes. The second is to initiate the VxLAN tunnel through the local VTEP, independently of the DCI functions.

The Anycast VTEP gateway is leveraged for VTEP redundancy, improving the resiliency of the whole VxLAN-based fabric, as the redundant VTEP gateway function is distributed. If one VTEP fails for any reason, it impacts only the ToR concerned by the failure, and traffic is immediately sent via the remaining VTEP at the data-plane level. In addition, it improves performance (packets for the same VNI can be distributed between the two VTEPs).

Border-Leaf vPC VTEP and L3 DCI

VxLAN deployment for DCI only

Another case study is when the VxLAN protocol is diverted to run only a DCI function, while traditional Layer 2 protocols such as RSTP, vPC, FabricPath or others are deployed within each Data Center.

VxLAN for DCI only

The design above looks very much like a traditional DCI solution, in the sense that tunnels for the overlay networks are initiated and terminated at the DCI layer for DCI purposes only.

Each fabric uses a Layer 2 transport. There is no fabric dependency: one fabric can run vPC/RSTP while the other runs FabricPath. All VLANs that need to be stretched across multiple sites are extended from each ToR concerned by those VLANs toward the pair of Border-Leafs used for the DCI function. These VLANs are mapped to their respective VNIs at the Border-Leaf (DCI layer) supporting the VTEP: the VxLAN configuration, including the secondary VTEP address, is identical on both VTEP vPC peer-switches. The VTEPs on each site initiate the overlay network toward the remote VTEPs in a redundant configuration. If one VTEP device fails, the other takes over immediately (vPC Layer 2 and Anycast VTEP address convergence).

VLAN and VNI selection for DCI purposes

It is usually recommended to extend outside a fabric only the set of VLANs that is required, not all of them. It is not rare to deploy 1,000 or more VLANs within a Data Center while only 100 of them are needed in the remote sites.

In a conventional DCI solution, as was noted previously, the VLANs are selectively encapsulated into an overlay network (e.g. OTV, VPLS) at the DCI edge layer. Hence, it’s easy to select and configure the set of VLANs that are required to be extended.

On the other hand, with VxLAN it is possible to control the number of VLANs stretched across the fabrics by using a mismatch in the VLAN ID <=> VNI mapping. Only VLANs mapped to the same VNI will be extended through the overlay network.

For example, in DC-1, VLAN 100 is mapped to VNI 10000, and in DC-2, VLAN 100 is mapped to VNI 10001. As a result, VLAN 100 will never be stretched across the two fabrics.

By contrast, with a ToR-based VxLAN VTEP gateway, the VLAN is locally significant; this natively allows VLAN translation, by mapping different VLAN IDs to the same VNI.
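As a rough illustration of the two behaviours above (selective extension through a VNI mismatch, and VLAN translation through a shared VNI), here is a minimal Python sketch. The per-site tables and the VLAN 200/250 and VNI 20000 values are hypothetical; only the VLAN 100 mapping comes from the example above.

```python
# Hypothetical per-site VLAN -> VNI tables.
# VLAN 100 mirrors the example in the text (deliberate VNI mismatch);
# VLAN 200 / VLAN 250 -> VNI 20000 is an invented translation case.
dc1 = {100: 10000, 200: 20000}
dc2 = {100: 10001, 250: 20000}


def stretched_segments(site_a, site_b):
    """Return {vni: (vlans_in_site_a, vlans_in_site_b)} for VNIs defined on both sites.

    A Layer 2 segment is extended only when both sites map a VLAN to the same VNI.
    """
    def by_vni(table):
        grouped = {}
        for vlan, vni in table.items():
            grouped.setdefault(vni, []).append(vlan)
        return grouped

    a, b = by_vni(site_a), by_vni(site_b)
    return {vni: (sorted(a[vni]), sorted(b[vni])) for vni in sorted(a.keys() & b.keys())}


print(stretched_segments(dc1, dc2))
```

With these tables, VLAN 100 is not extended (10000 versus 10001), while VLAN 200 in DC-1 and VLAN 250 in DC-2 share the same segment through VNI 20000.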

BFD and fast convergence compared with OTV fast convergence

To offer fast convergence, one function required is fast failure detection (e.g., BFD, route tracking). However, another crucial function is a host/MAC reachability update (flush, discover, learning, distribution, etc.).

Host reachability with BGP EVPN relies on a centralized BGP Route Reflector (RR) function that contains all the information required to forward the VxLAN packet to the desired destination. Both RRs are synchronized and contain the same list of VTEP/Host entries. If one BGP RR fails, the other takes over the entire job. One main difference with traditional DCI overlays such as OTV, VPLS or PBB-EVPN is that the overlay tunnel is initiated at the Leaf device (ToR) and terminates on the remote Leaf. The DCI layer is just a Layer 3 gateway and doesn’t handle any VTEP/Host information (for the purpose of a DCI edge role). With the VxLAN overlay network, the encapsulation of the VLAN frame is distributed and performed in each Leaf/VTEP. To improve convergence, each Leaf supports its local redundancy design based on vPC, for dual-homed Ethernet devices. It is important to note that BFD detects a remote failure within hundreds of milliseconds, allowing ECMP to select the remaining VTEP very quickly in case of an issue (e.g. CPU) with one Leaf.

With OTV, only one DCI edge device is responsible for encapsulating and forwarding traffic for a particular Layer 2 segment toward a remote site. While the redundancy models of BGP EVPN AF and OTV IS-IS are not similar (centralized tunnel for OTV versus a distributed tunnel per Leaf for VxLAN), it is important to remember how OTV offers fast convergence.

BFD and route tracking are supported with OTV for fast failure detection (VLAN, route, AED, layer 2 and routed links). The centralized AED function assigns all VLANs between the two local AEDs and selects the backup AED. This information (AED + Backup) is distributed to all remote OTV edge devices. Consequently, all remote OTV edge devices know the active AED for each set of VLANs with its respective backup ED, with all MAC entries also pre-recorded for the backup peer device. When BFD detects a failure, the backup OTV peer device immediately notifies all remote OTV edge devices of interest about the failure, and they instantly use the backup AED for the OTV routes without the need to flush and re-learn all the MAC entries. The convergence no longer depends on the size of the MAC routing table. Likewise, with a dual-fabric based on VxLAN MP-BGP EVPN, it is important that all host entries are maintained and up-to-date in the VTEP tables in any instance of failure.

What is Needed for a Solid DCI Solution 

An earlier section described the weaknesses of VxLAN Unicast-only mode with Head-end replication for VTEP. The second section described VxLAN with MP-BGP EVPN AF and demonstrated a more efficient way to discover and distribute host information among all the VTEPs. However, I mentioned that not all of the requirements for a solid DCI deployment have been met. Hence, let’s review the shortcomings that should be addressed in the next release of VxLAN for it to be considered a solid DCI solution.


Sometimes mistaken for Dual-homing, Multi-homing goes beyond the two tightly coupled DCI edge devices. For the purpose of DCI, it aims to address any type of backdoor Layer 2 extension. In the following design, the DC interconnection is established from the spine layer (this is mainly to keep the following scenario clear to understand; however, it doesn’t matter where the DCI is initiated).

A backdoor Layer 2 link is created between the two leafs, L15 and L21, meaning all the VLANs are trunked. This may happen during a migration stage from an existing legacy interconnection to a new DCI solution.

Backdoor Layer 2 link across sites

As a result, on the Leaf directly concerned by the backdoor connection, the VLANs mapped to a VNI (e.g., VLAN 200) will face a large loop, and a broadcast storm will be flooded through the VNIs of interest (Head-end replication) from site to site. Consequently, that VNI will immediately saturate the original DCI service, taking all the available DCI bandwidth at the expense of the other VNIs.

With the hardware-based VxLAN VTEP gateway, VLANs are locally significant to each Leaf. Theoretically, only Leafs that bind the VLAN affected by the BUM storm will be impacted. In the previous design, assuming only VLAN tag 100 exists in L11 and L25, they should not be directly impacted by the storm created by the backdoor link, which here concerns only VLAN 200. The reason is that each VLAN maps to a different VNI, and BUM will be flooded through the VNI of interest. However, L23 and L24, which support both VLAN 100 and VLAN 200, will be disrupted.

One behavior to consider: if BUM is transported over a common Multicast group, the risk is that all VLANs may suffer from the storm.

From a standards point of view, VxLAN has neither an embedded multi-homing service nor L2 loop detection and protection tools. To avoid this situation, it is critical to protect against such L2 loops. This is also true inside the network fabric itself, for example with a cable looped across two ToRs within the same DC.
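The reasoning about which Leafs are exposed to the storm can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `leaf_vlans` table is hypothetical (the text only states that L11 and L25 carry VLAN 100 and that L23 and L24 carry both VLANs; the bindings for L15 and L21 are assumed), and a shared multicast group is modelled crudely as exposing every Leaf.

```python
# Hypothetical Leaf -> locally bound VLANs, loosely based on the scenario above.
leaf_vlans = {
    "L11": {100},
    "L15": {200},        # assumed: backdoor Leaf carrying VLAN 200
    "L21": {200},        # assumed: backdoor Leaf carrying VLAN 200
    "L23": {100, 200},
    "L24": {100, 200},
    "L25": {100},
}


def impacted_leafs(bindings, storm_vlan, shared_mcast_group=False):
    """Return the Leafs exposed to a BUM storm on storm_vlan.

    With one flood domain per VNI, only Leafs binding the VLAN are exposed.
    With a common multicast group for all VNIs, every Leaf may be exposed.
    """
    if shared_mcast_group:
        return sorted(bindings)
    return sorted(leaf for leaf, vlans in bindings.items() if storm_vlan in vlans)


print(impacted_leafs(leaf_vlans, 200))
print(impacted_leafs(leaf_vlans, 200, shared_mcast_group=True))
```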

A workaround for Layer 2 loop protection

Consequently, the first necessity is to protect against potential Layer 2 loops. The best-known feature for that purpose is certainly BPDU Guard. BPDU Guard is a Layer 2 security tool that prevents a port from receiving BPDUs. It can be configured at the global or interface level. When you configure BPDU Guard globally on the Leaf, it is effective only on operational Edge ports. In a valid configuration (host attachment), Layer 2 Edge interfaces do not receive BPDUs. If a BPDU is seen on one of the Edge Port interfaces, this signals an invalid configuration and the interface is shut down, preventing an L2 loop from forming.

BPDU Guard

The drawback of BPDU Guard is that it requires special attention when cascading classical Ethernet switches running STP. BPDU Guard is disabled for trunk interfaces, which is fine when connecting the traditional network, but it may add some complexity for operational deployments, as it implies some manual configuration. The recommendation is to force all interfaces of the full ToR-based VxLAN fabric to be configured as Edge Ports and to enable BPDU Guard globally. BPDU Guard can then be disabled per interface where a traditional Layer 2 network is connected.

Note that if the interface is trunked for attaching a hypervisor, then BPDU Guard is disabled; hence risks exist for a back-door link to be established.

Multi-homing and layer 2 loop detection improvement is required.

What is missing with the current implementation of VxLAN is an embedded feature that supports multi-homing by detecting and protecting against end-to-end Layer 2 loops. This doesn’t necessarily supplant the BPDU Guard discussed previously, but brings an additional native service to protect against such a design mistake. This would also be used in site merging scenarios, where a back-door L2 link exists between two fabrics, not due to a human mistake, but for example because of the migration from a traditional Data Center to a Fabric with VxLAN stretched end-to-end.

In Overlay network technologies such as VxLAN, managing BUM traffic is a key requirement. In the case where interconnection across multi-sites is multi-homed as described above, it is necessary that one, and only one edge device forwards the BUM traffic into the core or towards the remote Leaf when appropriate.

If two L2 segments are welded together by mistake (locally or across the two fabrics), an embedded function should automatically detect the presence of another active device for the same Ethernet segment (e.g. by sending and listening for Hellos), and immediately trigger the election of a unique forwarder responsible for BUM traffic.
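To make the idea of a unique BUM forwarder concrete, here is a minimal Python sketch of a deterministic election. It is not a description of any VxLAN implementation discussed in this post: it borrows the modulo rule used by the default Designated Forwarder election of RFC 7432 (candidates ordered by IP address, forwarder chosen by VLAN ID modulo the number of candidates). The function name and the addresses are hypothetical.

```python
import ipaddress


def elect_bum_forwarder(candidates, vlan_id):
    """Pick exactly one BUM forwarder for a VLAN among devices on the same segment.

    Every device that has discovered the same candidate list (e.g. via Hellos)
    computes the same result, so no further negotiation is needed.
    """
    # Order candidates numerically by IP address, then index by VLAN modulo N.
    ordered = sorted(candidates, key=ipaddress.ip_address)
    return ordered[vlan_id % len(ordered)]


# Hypothetical addresses for the two Leafs welded together by a backdoor link.
segment_members = ["10.0.0.21", "10.0.0.15"]
print(elect_bum_forwarder(segment_members, 200))
print(elect_bum_forwarder(segment_members, 101))
```

Because the rule is a pure function of the member list and the VLAN ID, the VLANs are also spread across the available devices.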

OTV uses, from the ground up, a concept called Dual-Site Adjacency, with a “Site VLAN” used to detect and establish adjacencies with other OTV edge devices and to determine the Authoritative Edge Devices for the VLANs being extended from the site. This is achieved inside each site using multiple Layer 2 paths (site adjacency) and outside the data center using the overlay network (overlay adjacency), natively offering a solid and resilient DCI solution. As a result, only one device is responsible for forwarding broadcast traffic, preventing the creation of an end-to-end Layer 2 loop.

Rate Limiters

If a Layer 2 issue such as a broadcast storm is propagated across the remote locations, there is a risk that the DCI links as well as the remote site are disrupted. In such a situation, none of the resources in the different locations will be available, because the total bandwidth capacity is immediately overloaded. Consequently, the remote site must be protected by identifying the broadcast storm and controlling the rate allocated to it.

In a traditional DCI deployment, where tunnels are initiated at the DCI layer, it is possible to configure the Storm-control to rate-limit the exposure to BUM storms across different DCI blocks.

This required configuration is simplified in OTV because of the native suppression of UU frames and the broadcast containment capability of the protocol (ARP caching). Hence, only the broadcast traffic needs to be rate-limited on the interface facing the WAN, with a few command lines. However, this implies treating the original VLAN frame. In the case of a VxLAN fabric interconnected using OTV, this forces all VxLAN overlays to be de-encapsulated at the DCI edge layer.

On the other hand, with VxLAN, it is challenging to rate-limit a VNI Unicast packet transporting BUM traffic with Ingress Replication (IR). With Unicast mode IR, nothing differentiates a Unicast frame from a BUM frame. In addition, BUM will also be replicated as many times as there are VTEPs concerned by this VNI.

If the fabric and the L3 core are Multicast enabled, then BUM can be transported within a Multicast group. It is therefore possible to differentiate the flooded traffic from the Unicast traffic and rate-limit that BUM traffic over the Multicast groups at the Core layer. However, it is important to understand that not all switches support a Multicast rate limiter.
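The replication effect is simple arithmetic, sketched below in Python. The function name and the figures are illustrative only, and the model assumes one unicast copy per remote VTEP that is a member of the VNI.

```python
def ingress_replication_load(bum_rate_mbps, vteps_in_vni):
    """Unicast load sent by one source VTEP for a given BUM rate.

    With Ingress Replication, the source VTEP sends one unicast copy of each
    BUM frame to every other VTEP that is a member of the VNI.
    """
    remote_vteps = max(vteps_in_vni - 1, 0)
    return bum_rate_mbps * remote_vteps


# Illustrative figures: a 50 Mbps storm in a VNI shared by 9 VTEPs.
print(ingress_replication_load(50, 9))
```

In this example a 50 Mbps storm becomes 400 Mbps of unicast traffic leaving the source VTEP, and that traffic looks like any other unicast flow to a rate limiter in the core.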


Network Managers are used to configuring OTV in a few command lines without the need to care about the Control Plane.

VxLAN deployment imposes long and complex configurations (BGP, EVPN, VLAN to VNI mapping) with risks of human error. A smart tool to simplify the configuration is a programmatic approach using Python or Perl or any other modern language to ease the deployment of VxLAN. Even though this helps to accelerate the deployment while reducing the risk of mistakes, it requires some time to compose the final scripts to be used in production.
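As a hedged illustration of such a programmatic approach, the Python sketch below renders the VLAN to VNI mapping stanzas from a single table and rejects obvious mistakes before anything reaches a switch. The function name and the mapping values are hypothetical, and the output assumes the NX-OS style `vlan` / `vn-segment` syntax; adapt it to your platform and validate it in a lab first.

```python
def render_vlan_vni(mappings):
    """Render VLAN -> VNI mapping lines from a {vlan: vni} table.

    Raises ValueError on out-of-range IDs or when two VLANs on the same
    switch are mapped to the same VNI.
    """
    seen_vnis = set()
    lines = []
    for vlan, vni in sorted(mappings.items()):
        if not 1 <= vlan <= 4094:
            raise ValueError(f"VLAN {vlan} is out of range")
        if not 0 < vni < 2 ** 24:  # the VNI is a 24-bit identifier
            raise ValueError(f"VNI {vni} is out of range")
        if vni in seen_vnis:
            raise ValueError(f"VNI {vni} is mapped more than once")
        seen_vnis.add(vni)
        lines += [f"vlan {vlan}", f"  vn-segment {vni}"]
    return "\n".join(lines)


# Hypothetical mapping for one Leaf.
print(render_vlan_vni({200: 20000, 100: 10000}))
```

Generating every Leaf from the same source of truth is also a simple way to keep the VLAN to VNI mapping consistent (or deliberately mismatched) between sites.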


DCI is not just a Layer 2 extension between two or more sites. DCI/LAN extension aims to offer business continuity and elasticity for the cloud (hybrid cloud). It offers disaster recovery and disaster avoidance services. As it concerns the Layer 2 broadcast domain, it is really important to understand the requirements for a solid DCI/LAN extension and the weaknesses of the current implementation of VxLAN.

On the other hand, VxLAN with MP-BGP EVPN AF has some great functionalities that can be leveraged for a DCI solution, provided we understand its shortcomings. Flood and Learn VxLAN may be used for rudimentary DCI. VxLAN MP-BGP EVPN takes a big step forward with its Control Plane and can be used for extending Layer 2 across multiple sites. However, it is crucial that we keep in mind some weaknesses related to DCI purposes. A current solution such as OTV provides a mature and proven DCI solution with all DCI features embedded from the ground up. Fortunately, VxLAN is an open standard, and the necessary intelligence and services are being implemented into the protocol to provide, in the future, a viable and solid DCI solution.

DCI LAN Extension Requirements

DCI requirements reminders

  • Failure domain must be contained within a single physical DC
    • Leverage protocol Control Plane learning to suppress Unknown Unicast flooding.
    • Flooding of ARP requests must be reduced and controlled using rate-limiting across the extended LAN.
    • Generally speaking, rate-limiters for the Control Plane and Data Plane must be available to control the broadcast frame rate sent outside the physical DC.
  • Dual-Homing
    • Redundant paths distributed across multiple edge devices are required between sites, with all paths active without creating any Layer 2 loop.
    • Any form of Multi-EtherChannel split between two distinct devices needs to be activated (e.g., EEM, MLAG, ICCP, vPC, vPC+, VSS, nV Clustering, etc.).
    • VLAN-based Load balancing is good; however, Flow-based Load-balancing is the preferred solution.
  • Multi-homing
    • Multi-homing going beyond the two tightly coupled DCI edge devices.
    • Any backdoor Layer 2 extension must be detected.
  • Layer 2 Loop detection and protection
    • Built-in loop prevention is a must-have.
  • Independent Control Plane on each physical site
    • A unique active Control Plane managing the whole VxLAN domain may lose access to the other switches located on the remote site.
    • Reduced STP domain confined inside each physical DC.
    • STP topology change notifications should not impact the state of any remote link.
    • Primary and secondary root bridges or Multicast destination tree (e.g., FabricPath) must be contained within the same physical DC.
  • Layer 2 protocol independence inside the network fabric
    • The DCI LAN extension solution should not have any impact on the choice of Layer 2 protocol deployed on each site.
    • Any Layer 2 protocols must be supported inside the DC (STP, MST, RSTP, FabricPath, TRILL, VxLAN, NVGRE, etc.) regardless of the protocols used for the DCI and within the remote sites.
    • Hybrid Fabric should be supported on both sides.
  • Remove or reduce the hair-pinning workflow for long distances
    • Layer 3 default gateway isolation such as FHRP isolation, Anycast gateway or proxy gateway.
    • Intelligent ingress path redirection such as LISP, IGP Assist, Route Health Injection, or GSLB.
  • Fast convergence
    • Sub-second convergence for any common failure (link, interface, line card, supervisor, core).
  • Multi-sites
    • The DCI solution should allow connecting more than two sites.
  • Any Transport
    • The DCI protocol must be transport-agnostic, meaning that the DC interconnection can be initiated on any type of link (dark fiber, xWDM, Sonet/SDH, Layer 3, MPLS, etc.).
    • Usage of IP Multicast may be an option for multi-sites but should not be mandatory. The DCI solution must be able to support a non-IP Multicast Layer 3 backbone because an enterprise would not necessarily obtain the IP Multicast service from their provider (especially for a small number of sites to be Layer 2 interconnected).
    • Enterprises and service providers are not required to change their existing backbone.
  • Path Diversity
    • Path diversity and traffic engineering to offer multiple routes or paths based on selected criteria (e.g., Control Plane versus Data Plane).
    • This is the ability to select specific L2 segments to use an alternate path (e.g., Data VLAN via path#1 and Control VLAN via Path#2).

A clarification on the taxonomy:

Entropy/Depolarization: Variability in the header fields to ensure that different flows being tunneled between the same pair of TEPs, do not all follow the same path as they traverse a multi-pathed (usually ECMP) underlay network.

Load Balancing: The ability of an overlay to send different flows to different TEPs/EDs when a destination site is multi-homed. Ideally this would also apply to dual-homing, although dual-homing the way we’ve implemented it doesn’t lend itself to this: we are reducing the destination to a single anycast address, which trumps our ability to prescriptively load balance.

Path Diversity: The ability of an encapsulating TEP to steer different encapsulated flows over different uplinks. Different uplinks should either belong to different underlay networks or be mapped to underlay Traffic Engineering tunnels (MPLS-TE, Multi-topology routing, Segment Routing or other) that guarantee the end-to-end divergence of the paths taken by traffic that enters the TE tunnels.




27 – Bis-Bis – Stateful Firewall devices and DCI challenges – Part 1 (cont)

Back to the recent comments on what is “officially” supported or not.

First of all, let’s review the different firewall forwarding modes officially supported.

ASA cluster deployed inside a single data center:

Fig.1 Firewall forwarding mode within a single DC. Please note the firewall routed mode supported with the Layer 2 load balancing (LACP) Spanned Interface mode.

When configured in Routed mode (e.g. default gateway for the machines), the same ASA identifiers (IP/MAC) are distributed among all ASA members of the cluster. When the ASA cluster is stretched across different locations, the Layer 2 distribution mechanism facing the ASA devices is achieved locally using a pair of switches (usually leveraging a Multi-chassis EtherChannel technique such as VSS or vPC).

Subsequently, the same virtual MAC address (ASA vMAC) of the ASA cluster is duplicated on both sites and, as a result, it hits the upstream switch from different interfaces.











Fig.2 ASA and duplicate vMAC address

Therefore when the ASA cluster runs the firewall routed mode with Spanned interface method, in a geographically stretched fashion, it breaks the Ethernet rules due to the duplicate MAC address, with risks of affecting the whole network operation. Consequently by default it is not supported as expressed in the release notes for the Inter-site section.

Fig.3 Firewall forwarding mode with dual DC. The routed mode is not supported with Spanned Interface mode for an ASA cluster stretched across multiple locations.

The above requires a clarification. In this context, “NOT supported” means that no other design alternative allowing the firewall routed mode to be enabled in conjunction with Spanned interface mode has been tested by the Quality Assurance (QA) process.

However it doesn’t mean that there is no design workaround to address this requirement, and some additional network services may help to keep the ASA in Routed mode.

  • The 1st option could be to filter the Virtual MAC address in the interface facing the DCI link.

This option may be desired if you need to use the ASA as the default gateway for the servers, for example. However, it must be enabled with caution, and it is definitely not recommended unless you know exactly what you are doing.

The first drawback to keep in mind is that if the vMAC address is blocked, the original ASA member cannot keep the session stateful for a virtual machine moving to the remote site. Consequently, this option is limited to “cold migration” using subnet extension (e.g. for cost containment or migration purposes).

The second concern is that, to address any black hole scenario, this option also requires the ingress traffic to be dynamically redirected to the new location as soon as the application of interest has migrated there (e.g. using LISP mobility).

Fig.4 Stateful session is not supported with vMAC filtering

  • A 2nd alternative, more robust IMO, is to isolate the Layer 2 domain with a router.

With this option, the L2 domain used to distribute the traffic among the 2 local ASA devices is isolated from the data VLAN, hence the pair of switches facing the ASA members sees the vMAC only from the direct attached Firewalls. Only the data VLAN is stretched across the 2 locations (plus the CCL which is not represented in this diagram)

Fig.5 Router insertion to isolate the vMAC. The ASAs are configured in firewall routed mode with static routing.

However, there are a few disadvantages to using this design, with the default gateway residing between the application servers and the ASA cluster:

  1. The ASA cannot be the default gateway of the servers due to the inserted first hop router.
  2. The routing service on the ASA must be configured with static entries. Indeed, the dynamic routing is centralized to the master, hence routing adjacency on non-master site without a full stretched subnet will not succeed. On the other hand, if the subnet is stretched, we lose the ability to localize transit traffic to the local site.
  3. It becomes challenging to secure the E-W traffic (e.g. Web-tier <=> App-tier <=> DB) with the same ASA cluster.

Note for the last bullet that an IPS is usually more efficient (and mandatory) between application tiers than a classical firewall function. However, the ASA offers embedded IPS too, so I think this comment may be important.

Although the second alternative theoretically works as I described it, it is not yet officially supported, and you may want to run several tests with different scenarios to validate this option for your environment.

Having said that, a new East-West insertion mode has been qualified and validated with the recent release 9.3(2). Technically it is nothing new; it’s just that this one has been deeply tested. In this scenario, the E-W traffic can now be secured via the same ASA cluster.

Fig.7 ASA in transparent mode. East-West traffic is secured using the ASA cluster.

When the ASA is configured in Firewall Transparent mode, the virtual MAC is no longer an issue, as the ASA members use a BVI address. The BVI address doesn’t appear in the data plane. Consequently, there is no risk of a duplicate MAC address. As a result, the outside router can act as the default gateway for all application servers of interest, allowing the E-W traffic (server to server communication) to be secured via the same ASA cluster. Keep in mind that it requires FHRP isolation to support hot live migration.

Additional details on the E-W insertion design can be found in the release notes for 9.3(2).

I hope this post clarifies some statements within the official ASA document and the alternative designs.





27 – Bis – Path Optimisation with ASA cluster stretched across long distances – Part 2

How can we talk about security service extension across multiple locations without elaborating on path optimisation ?  🙂

Path Optimization with ASA Cluster stretched across long Distances

In the previous post, 27 – Active/Active Firewall spanned across multiple sites – Part 1, we demonstrated the integration of ASA clustering in a DCI environment.

We discussed the need to maintain the active sessions stateful while the machines migrate to a new location. However, we see that, after the move, the original DC still receives new requests from outside, prior to sending them throughout the broadcast domain (via the extended layer 2), reaching the final destination endpoint in a distant location. This is the expected behavior and is due to the fact that the same IP broadcast domain is extended across all sites of concern. Hence the IP network (WAN) is natively not aware of the physical location of the end-node. The routing is the best path at the lowest cost via the most specific route. However, that behavior requires the requested workflow to “ping-pong” from site to site, adding pointless latency that may have some performance impact on applications distributed across long distances.

With the increasing demand for dynamic workload mobility across multiple data centers, it becomes important to localize the DC where the application resides in order to dynamically redirect the data workload to the right location, all in a transparent manner for the end-users as well as the applications and without any interruption to workflow.

The goal is to dynamically inform the upward layer 3 network of the physical location of the target (virtual machines, applications, HA clusters, etc.), while maintaining the active sessions stateful and allowing new sessions to be established directly to the site where the application resides. Maintaining the sessions stateful imposes one-way symmetry, meaning that the return traffic must hit the original firewall that owns the session. However, all new sessions must take advantage of the path optimization using dynamic redirection through only local firewalls.

There are two ways to efficiently achieve these requirements:

  1. The first is to use LISP mobility, as discussed in post 23 – LISP Mobility in a virtualized environment (update). LISP establishes a network overlay between Tunnel Routers (xTR) in the path between distant sites. When a machine is moved between two data centers, a notification is sent to a LISP database that maps the end-nodes to their respective last known location, which in turn notifies the LISP Egress Tunnel Routers (eTR). The session is therefore LISP-encapsulated to the data center where the application of interest currently resides. The underlying layer 3 network is neither concerned nor affected by any dynamic configuration change, which is a huge added value.
  2. The second is a new network service called LISP IGP Assist. Actually, it’s a sub-function of a LISP First Hop Router (FHR), which is leveraged to detect the machine’s movement and redistribute the LISP route of the specific end-node (/32) into the IGP routing tables on each site. With a more specific route, the routed traffic is natively directed to the DC where the application of interest exists.

1. ASA Cluster and LISP Mobility

In the following scenario, an ASA cluster built with four ASA units, two on each site, is stretched across DC-1 and DC-2. The ASA cluster is configured in Spanned Mode (cLACP). Note that it could be configured in Individual Mode; the configuration and behavior for the OTV and LISP services would be exactly the same. OTV is used here to extend the ASA Cluster Control Link (CCL) as well as the data VLANs of the compute layer affected by the movement of the VMs. We assume that the ASA cluster is configured in Transparent Mode: at the time of writing, with ASA cluster release 9.2(4), Routed Mode is not yet qualified in Spanned Mode in a DCI scenario.

Let’s go step by step to better dig through the end-to-end workflow.

An end-user from a branch office requests access to an application that resides in DC-1 (1).

The request is intercepted by the Ingress Tunnel Router (iTR).

The iTR sends a request (2) to the mapping database (through the LISP Map Resolver) for the location of the application.

The LISP Map Server (MS) relays the request to the eTR in DC-1 where the application resides. This redirection via the eTR aims to validate the path between the relevant eTR and the iTR.

Consequently, the eTR in DC-1 informs the iTR that Subnet A is reachable via the eTR of DC-1 (3).


Now the iTR records the location of the specific end-node (application). It can now encapsulate the user request with a LISP header and route it toward the eTR in DC-1 (4).

The eTR in DC-1 strips off the LISP header and routes the packet to the next hop to reach the subnet where the application exists.

The first level of the switch computes its LACP hashing algorithm, selecting the link to ASA-2 and forwarding the packet accordingly.

ASA-2 encodes cookies with its identifier and forwards the request to the server (5).


The application “hot” migrates to DC-2 (6) and replies to the user request (8) while maintaining a stateful session.

LISP Control Plane: In the meantime, the LISP FHR detects the presence of the new virtual machine (the application) and notifies its local eTR (7) in DC-2 about this new End-Node Identifier (EID).

Subsequently, the eTR in DC-2 informs (10) the Map Server (M-DB) about the new location of the EID. The Map Server then signals the original eTR in DC-1 about the change of location. The eTR in DC-1 updates its LISP table accordingly with a “Null0” entry for the end-node of interest.

Data Plane: Consequently, the ASA-3 receives the response from the server (8), decodes the TCP cookies, learns the owner of the TCP session and redirects the packet toward ASA-2 over the CCL (9).

The eTR on DC-1 encapsulates the IP packet for the iTR (11), which de-encapsulates the LISP header and routes it to the end-user.

One-way symmetry has been established via the CCL. The session has not been interrupted; the movement of the application has been transparent for the end-user and the application.

The end-user continues using the application and sends the next request to it.


The iTR has cached the original location of the application into its local database. Hence, it still encapsulates the IP packet for the eTR in DC-1 (12).

The eTR in DC-1 has previously been notified that it is no longer the owner of this end-node identifier, so it replies to the iTR with a “Solicit Map Request (SMR)” (13), prompting the iTR to confirm the new location of the application (EID) with the Map Resolver (14). The iTR then receives the new location (15) for the application: the egress tunnel router (eTR) located in DC-2.


As a result, the iTR encapsulates the end-user’s request with a LISP header and sends it toward the eTR in DC-2 (15).
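To make the cache refresh triggered by the SMR more concrete, here is a minimal Python sketch. It is only an illustration, not LISP code: the class names (`MapServer`, `Etr`, `Itr`), the EID 10.1.1.10 and the use of site names as locators are assumptions of mine, and the real protocol messages are collapsed into plain method calls.

```python
from dataclasses import dataclass, field


@dataclass
class MapServer:
    """Hypothetical mapping database: EID (host IP) -> location (site eTR)."""
    mappings: dict = field(default_factory=dict)

    def register(self, eid, rloc):
        self.mappings[eid] = rloc

    def resolve(self, eid):
        return self.mappings[eid]


@dataclass
class Etr:
    """Egress tunnel router; 'away' models the Null0 entries for moved EIDs."""
    rloc: str
    away: set = field(default_factory=set)

    def receive(self, eid):
        # A packet for an EID that has moved away triggers an SMR to the sender.
        return "SMR" if eid in self.away else "DELIVERED"


@dataclass
class Itr:
    """Ingress tunnel router with a local map-cache."""
    map_server: MapServer
    cache: dict = field(default_factory=dict)

    def send(self, eid, etrs):
        """Send one packet toward eid and return the site that finally gets it."""
        if eid not in self.cache:
            self.cache[eid] = self.map_server.resolve(eid)
        rloc = self.cache[eid]
        if etrs[rloc].receive(eid) == "SMR":
            # Stale entry: refresh it from the mapping database and retry.
            self.cache[eid] = self.map_server.resolve(eid)
            rloc = self.cache[eid]
            etrs[rloc].receive(eid)
        return rloc


ms = MapServer()
etrs = {"DC-1": Etr("DC-1"), "DC-2": Etr("DC-2")}
itr = Itr(ms)

ms.register("10.1.1.10", "DC-1")
print(itr.send("10.1.1.10", etrs))   # application still in DC-1

# The VM moves: DC-2 registers the EID, DC-1 marks it as moved away.
ms.register("10.1.1.10", "DC-2")
etrs["DC-1"].away.add("10.1.1.10")
print(itr.send("10.1.1.10", etrs))   # the SMR refreshes the cache
```

The point of the sketch is that the iTR only learns about the move when it actually sends traffic to the old location; nothing is pushed to the branch office in advance.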

ASA-3 receives the IP packets and determines that the owner is ASA-2, to which it redistributes the workflow via the extended CCL. ASA-2 treats the packet accordingly and routes it toward the next L3 hop, which forwards it to the machine supporting the application (16).

The application responds to its local default gateway (17), thanks to FHRP filtering, which distributes the return packet to ASA-3. The LACP hashing algorithm could have also elected the ASA-4; however, the behavior and final redirection via the CCL would have been exactly the same.

ASA-3 notices again that the owner is ASA-2, to which it redistributes the packet via the CCL.

The eTR encapsulates the IP packet and forwards it to the iTR (18).

The session is maintained stateful with zero interruption.

The drawback of maintaining active sessions stateful is hairpinning workflow for the duration of the session. Thus, a question that comes to mind is:

  • is the application response time still performing as before?

If the distance is very long, the active sessions may be impacted the moment the data leaves. Nonetheless, that’s only limited to the current active and stateful sessions.



For any new sessions, the traffic will be directed to the DC where the application resides. In addition, the traffic will be validated and inspected using one of the local ASA members (20).

Hence, while the current sessions are maintained stateful through the firewall that owns the active data flow, any new sessions will take the shortest path to reach the application with their respective local firewalls treating the flow accordingly. The application response time will be improved due to the distance between the end-user and the application being shortened and hairpinning reduced, if not eliminated.

2. ASA Cluster and LISP IGP Assist

Some enterprise network managers may think that LISP IP mobility might be a bit complex or challenging to deploy. The reality is that there is nothing very complex per se, as detailed here, but there are indeed several different LISP functions running on different platforms that require different configurations, including LISP iTR configuration at the branch offices. Another component to take into consideration is that the complexity depends on whether the enterprise owns the WAN edge layer or if it is layer 3 managed services.

Whatever the original reason is to not deploy LISP VM mobility, IGP Assist may be a nifty trick to think about as a slight alternative to LISP Mobility. Indeed, we just need to leverage the basic function of a LISP FHR, detecting any movements from the EID side and relaying this change to the IGP routing tables. Currently, IGP Assist is only supported on the Nexus 7k.

If you remember the great function of Route Health Injection (rest in peace) available on some SLB devices (e.g. CSM, ACE), it will be easy for you to deploy IGP Assist.

The main concept is to inject a /32 host route upward toward the Internet access (Intra- or Extranet usually) in order to direct traffic, using a more specific route, to the site where the mobile application currently resides.

Let’s have a look, step by step, how this is achieved, and let’s apply the same scenario that we used with LISP mobility, still with OTV as the DCI solution deployed for the LAN extension.

From a routing point of view, this is traditional IGP access (Enterprise Intra-DC L3 Routing) connected to one or multiple ISPs built with an E-BGP network (L3 Internet Core).

By default, the traffic to the application (subnet A) is directed to DC-1 with the most specific route /25.

Usually, for an HA cluster failover, the network manager performs a manual redirection by withdrawing the /25 of the subnet of interest from DC-1. Consequently, all traffic to that subnet is re-routed to DC-2 within a few seconds (directed to the next most specific route, the /24).

And that’s what IGP Assist is aiming to achieve automatically with host-based granular redirection (/32).

The virtual machine (end-node) supporting the application migrates (“hot” live migration) from DC-1 to DC-2 across the LAN extension (1). The application server continues replying to the original request maintaining the whole session stateful without any interruption.

The new EID is detected on DC-2. IGP Assist notifies all LISP FHRs (2), or more precisely, the redundant LISP FHR in DC-2 as well as the two LISP FHRs in DC-1.

Meanwhile, the LISP FHR redistributes the LISP routes into the IGP routing table (3).

As a result, once the IGP control plane has handled this notification, the data traffic hits the local ASA-3 in DC-2, which notices that the owner of the stateful session is ASA-2. ASA-3 immediately redirects the packet flow to ASA-2 via the CCL, which in turn forwards it toward the requestor through the WAN edge layer on DC-1.

Following the LISP route redistribution into the IGP routing protocol, on DC-2, where the EID has been detected, IGP Assist installs a /32 LISP Interface Route for the host (the EID or virtual machine supporting the application of concern). On DC-1, the original location of the application, it removes the /32 LISP Interface Route of the same host, which has moved away.

Consequently, the host route (/32) is propagated to the core via the DC-2 WAN edge with a more specific route, directing the traffic to its destination directly through DC-2.
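The effect of the /25, /24 and /32 routes can be sketched with a simple longest-prefix-match lookup. This is a hedged illustration using Python's standard `ipaddress` module; the prefixes (10.1.1.0/24 for subnet A, 10.1.1.10 for the VM) and the `best_route` helper are hypothetical and do not come from any real configuration.

```python
import ipaddress


def best_route(routes, destination):
    """Return the next hop of the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (net.prefixlen, next_hop)
        for net, next_hop in routes.items()
        if dest in net
    ]
    return max(matches)[1] if matches else None


# Hypothetical addressing: DC-1 advertises the /25, DC-2 the /24.
routes = {
    ipaddress.ip_network("10.1.1.0/24"): "DC-2",
    ipaddress.ip_network("10.1.1.0/25"): "DC-1",
}
before_move = best_route(routes, "10.1.1.10")   # "DC-1"

# After the move, a /32 host route for the VM is advertised from DC-2.
routes[ipaddress.ip_network("10.1.1.10/32")] = "DC-2"
after_move = best_route(routes, "10.1.1.10")    # "DC-2"
neighbor = best_route(routes, "10.1.1.20")      # "DC-1", host did not move
```

Only the host that moved is redirected; its neighbors in the same subnet keep following the /25 toward DC-1, which is the host-based granularity mentioned above.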

Thanks to the ASA clustering this automatically maintains the current session stateful (7). Similar to the LISP IP Mobility previously described, when deploying the ASA cluster in conjunction with LISP IGP Assist, the path to the active DC where the application resides is optimized for all new sessions (6). The current sessions are maintained stateful through the original owner.

The drawback of this solution is that it injects a /32 host route into the L3 core. If the enterprise owns a block of IP address space, or if the L3 core is an intranet using a private IP network, this is straightforward. However, getting multiple ISPs to accept a /32 might be challenging, if not impossible. Thus, in addition to the negotiation process, it may require a mechanism such as AS path prepending or BGP conditional advertisement to identify the site where the host route is active and to redirect the traffic accordingly.


With the rapid growth of virtual machines and the easy mobility that comes from abstracting the OS from the physical host, it has now become usual for the resources supporting an application to be spread over multiple locations that may be separated by long distances.

Therefore it turns out the IT manager is able to dynamically redirect in real-time the data user workflow in a very granular fashion to the site where the application resides.

In the meantime, the security stateful firewalls impose one-way symmetrical establishment with the return traffic, meaning that a “ping-pong” effect extending the workflow grows as the multi-tier application moves from one site to another.

In order to reduce the pointless latency from the hairpinning workflow as much as possible while achieving elasticity with the virtual DC through stateful live migration, we can leverage multiple network and security services working in conjunction. Nevertheless, these solutions are not directly related. Hence, any one of them can be enabled on its own for a specific use, for example if you think that reducing the latency has no impact on your application’s performance (e.g. metro distances).

What follows is a high-level summary of the different options with their respective added values:

Option A:

1 – ASA clustering spanned across multiple locations

  • All ASA units are active.
  • Sessions are maintained stateful with redirection via the CCL and without any artificial cheats such as source NAT.
  • New sessions are treated locally.
  • The maximum distance time-wise for the Cluster Control Link (CCL) is 20ms roundtrip (theoretically 2000km maximum between two sites[1]).
  • All configurations are synchronized between all ASA units.
  • Not limited to two sites.
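The 2000km figure quoted for the CCL follows directly from the latency budget and the propagation assumption of footnote [1]. A small sketch of that arithmetic (the function and constant names are mine, and real fiber paths are rarely straight lines, so treat the result as a theoretical upper bound):

```python
KM_PER_MS_ONE_WAY = 200  # assumption from footnote [1]: 200 km per ms, one way


def max_site_distance_km(rtt_budget_ms):
    """Theoretical maximum site separation for a round-trip latency budget."""
    one_way_ms = rtt_budget_ms / 2
    return one_way_ms * KM_PER_MS_ONE_WAY


ccl_limit_km = max_site_distance_km(20)  # 20 ms roundtrip -> 2000.0 km
```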

2 – OTV for DCI LAN extension
(A long list. The following items constitute the most important ones only.)

  • Failure domain containment
    • Eliminate flooding of unknown unicast
    • STP domain confined inside the physical data center
    • Control plane independence
    • Site independence
  • Dual-homing with independent paths
  • Reduced hairpinning
    • Internal site behaves like any traditional switch with local lookup and forwarding
    • FHRP isolation support
  • Fast convergence
  • Transport-agnostic
    • IP, MPLS, Dark Fiber, etc.
    • IP multicast transport or unicast-only
  • ARP caching
  • VLAN translation
  • Control plane learning
  • Native multi-homing (no STP)
  • Load balancing
    • Today: VLAN-based
    • Future: flow-based

3 – LISP IP mobility for ingress path optimization

  • For our purpose of a “hot” live migration across two sites, we leveraged LISP mobility in Extended Subnet Mode or ESM (LAN extension).
  • Provides dynamic redirections for ingress traffic toward the site where the active application resides.
    • Requires an additional egress redirection such as FHRP localization.
  • Offers host-based granular and dynamic signal notification.
  • Doesn’t impose any configuration change on either the IP core or in the DNS service.
  • Allows end-point migration to a new location without changing the assigned IP address.
  • Detects dynamically any IP movements based on data plane events.
  • Selective prefix allowed to roam.
  • Works in conjunction with any type of LAN extension.
  • Endpoint-agnostic
    • No special agent running on the host required.

4 – FHRP isolation for egress path optimization

  • Available for any network redundancy protocols (VRRP, HSRP).
  • Offers same active default gateway in different locations.
  • Optimizes the outbound traffic (server to client).
  • Optimizes the intra-tier application traffic (server to server).

Option B:

1 – ASA clustering with all firewalls active

Same as above.

2 – OTV for DCI LAN extension

Same as above.

3 – LISP IGP Assist for ingress path optimization

  • Simple and easy configuration
  • Host route-based notification
  • Endpoint-agnostic
    • No special agent running on the host required.
  • Works in conjunction with any type of LAN extension.
  • Can be deployed without LAN extension for “cold” migration.
    • Requires a LISP Mapping database to signal the remote LISP FHR engines.
  • Detects dynamically any IP movements based on data plane events.

4 – FHRP isolation

Same as above.





[1] Assuming that light takes 1ms to travel 200km one way.

27 – Stateful Firewall devices and DCI challenges – Part 1

Note: Since I wrote the following articles on ASA clustering stretched across multiple locations, additional improvements have been made to address some of the concerns listed in post 27.x. Please have a look at the ASA release-notes (especially 9.5(1) and 9.5(2)).

  • 9.1(4) Geographically dispersed ASA cluster up to 10ms of Latency
  • 9.2(1) Validated Spanned Interface mode (L2) North-South Insertion
  • 9.3(2) Spanned Interface mode (L2) – East-West Insertion
  • 9.5(1) Site Specific Identifier and MAC address
  • 9.5(2) LISP Inspection for Inter-site Flow Mobility

Refer to the latest configuration guide for the updated features, which are not discussed in this post.




Stateful Firewall devices and DCI challenges

Running dual or multiple sites in Active/Active mode aims to offer elasticity of resources across different locations, just as with a single logical data center. This solution also brings business continuity with disaster avoidance. It is achieved by manually or dynamically moving the applications and software framework to where resources are available. When “hot”-moving virtual machines from one DC to another, there are some important requirements to take into consideration:

  • Maintain the active sessions stateful without any interruption for hot live migration purposes.
  • Maintain the same level of security regardless of the placement of the application.
  • Migrate the whole application tier (not just one single VM) and enable FHRP isolation on each side to provide a local default gateway (which works in conjunction with the next bullet point).
  • While maintaining the live migration, it can be crucial to optimise the workflow and reduce the hairpinning effect as much as we can, since it adds latency. The distances between the sites, as well as the network services used to optimize and secure the multi-tier application workflows, amplify the performance impact.

As with several other network and security services, the firewall is a stateful device that imposes a one-way symmetrical establishment. That means return traffic must hit the owner of the session; otherwise, the packet is dropped. Traditionally, firewalls are deployed as a pair of devices in an Active/Standby manner, with dedicated layer 2 adjacency links to synchronize the states of all sessions and to probe the health of the peer. When the active firewall is stopped, the standby takes over, maintaining all active sessions stateful in a transparent manner for the application and for the end-user. As of today, most enterprises deploy their perimeter firewalling in Active/Standby mode, mainly for tightly coupled data center designs (metro distances using fiber links).

Figure 1: Typical tightly coupled DC deployment with firewalling. The primary DC-1 attracts all the traffic for the application of interest (best metrics). By default it is expected to maintain the session workflow within the same DC.

As discussed in this previous high level post 13 – Network Service Localization and Path Optimization, as a result of state failover of network services as well as application mobility, it is not rare to see 10 to 20 roundtrips between the two sites for the same active session. This is forced by all the stateful devices in the path imposing a one-way symmetrical establishment with the return traffic. This includes the security WAN edge, IPS, SSL offloader, SLB devices as well as the default gateways between application tiers, just to list the most common stateful devices.

Figure 2: The same application has moved to the secondary DC and a failover happened on the first firewall. NB: this is a basic design to keep the logic simple; usually additional stateful devices (SLB, SSL, IPS, WAAS, etc.) exist along the path, hence you can infer a longer final ping-pong effect.

If we consider that the signal propagation delay is 1ms roundtrip for a distance of 100km between the data centers, 10 roundtrips add almost 10ms between request and response for the same session, which might have a performance impact on the application. In the context of metro distances, network managers have usually accepted working in “degraded” mode during maintenance windows, as these were fully controlled by the network and security organizations.
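The arithmetic behind that estimate is easy to replay for other distances. The helper below is only a back-of-the-envelope sketch: it assumes roughly 200km per millisecond one way, counts propagation delay only (no serialization or device processing time), and the function name is my own.

```python
def added_latency_ms(distance_km, roundtrips, km_per_ms_one_way=200):
    """Propagation delay added by ping-ponging between two sites."""
    rtt_ms = 2 * distance_km / km_per_ms_one_way
    return roundtrips * rtt_ms


metro = added_latency_ms(100, 10)       # 10.0 ms for 10 roundtrips over 100 km
long_haul = added_latency_ms(1000, 10)  # 100.0 ms over 1000 km
```

The same ten roundtrips that are barely noticeable at metro distances become a tenth of a second at 1000km, which is why the hairpinning matters more as the sites move apart.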

With the increased demand for virtual machines and dynamic workload mobility, it becomes challenging to control the state and placement of all the components impacting the application workflow. Hence the desire to dynamically control the optimum path to reach the application.

ASA Firewall clustering

Last year at Cisco Live in London, I presented a new concept based on firewall clustering to improve the DCI architecture; however, this enhanced solution was not yet supported due to some limitations with the ASA code (9.0) as well as the lack of testing.

Since v9.1(4) and recently version 9.2(1), the ASA clustering software has been improved to support long distances between members of a cluster (up to 10ms one-way latency) and several designs have been tested and qualified in DCI deployment scenarios. Thus, the excitement to post this article now :).

There are several detailed documents available on ASA clustering itself. Hence, for the purposes of this post, let’s focus only on the mechanisms that we can leverage in a DCI scenario. For further details on the ASA cluster, I recommend reading the configuration guide, which gives all the details and explains the concept and nomenclature of ASA clustering. You will also find many great posts from others on the web.

Originally, ASA clustering aims to provide high-scale firewalling by stacking several physical ASA devices to form a single logical high-end firewall. All ASA devices are active and work in concert to pass connections as a single firewall.

To achieve this function, a “new” component called Cluster Control Link (CCL) was created to bind all physical members of the ASA cluster together into a single logical firewall. The CCL is used for the control plane, health checks, state sync and config sync, as well as to redirect the data plane traffic to the original owner of the session when needed. Remember that in classical A/S or clustered A/A mode, it is mandatory that the return traffic hits the owner of the session; otherwise the packet is dropped. With ASA clustering, this rule still applies, but instead of dropping an asymmetric flow, every ASA device knows which firewall owns the session, so the packet is automatically redirected to the original owner via the CCL. The traffic is load-distributed in a clever fashion between the members of the ASA cluster.

There are two possible modes to load balance the data traffic from the upstream device across the ASA units:

  • Individual Interface Mode (layer 3) using ECMP or PBR
  • Spanned Ether-Channel Mode (layer 2) using LACP

Figure 3: Individual Interface Mode (left) versus Spanned Ether-Channel Mode (right) in a fully redundant deployment using Multi-chassis Ether-Channel (MEC).

In both modes, each ASA device is dual-homed using a virtual port-channel toward a Multi-chassis Ether-Channel engine (e.g. vPC) and the traffic from the upstream device is layer 3 load distributed (ECMP or PBR) for the Individual mode (left) or is layer 2 distributed (LACP) for the Spanned mode. It is not possible to mix different modes for the same cluster.

From a protocol point of view, only a single device can exist on each side of an LACP establishment between two entities.

Figure 4: A single logical device back to back for the LACP establishment.

Thus, in Spanned Mode, in order to form a logical LACP peer device, on the network side, the upstream pair of Nexus switches uses a virtual port-channel (vPC) and on the firewall side, the ASA cluster uses an enhanced LACP mode called cluster LACP (cLACP), which forms a virtual port-channel extended across all ASA units in the cluster, using the same IP address and the same virtual MAC address.

The ASA clustering deployment design is flexible. It can be deployed using a single port-channel to the same insertion point at the aggregation layer. As a result, both external (non-secured) and internal (secured) traffic traverse the same physical port-channel, and the separation is achieved at layer 2 using VLAN tagging. The other method is to deploy the ASA cluster in sandwich mode between two Virtual Device Contexts (VDC) in order to physically separate external and internal traffic, each dedicated to an inside and outside port-channel as discussed in this post. This article relies on the sandwich mode method, which provides a solid hierarchical architecture.

Figure 5: Don’t be confused between the forwarding mode of the firewall and the interface distribution mode among the ASA cluster. The ASA cluster can be running in Routed Mode or Transparent Mode like most firewalls, and with or without multiple contexts. It is important to clarify which mode is supported. For example, if an enterprise wants the firewall to run in Transparent Mode, the only option to distribute the load among the ASA units is Spanned Ether-Channel Interface Mode.

Whichever load balancing protocol is used, there is no symmetrical algorithm that ensures return traffic re-uses the same path in reverse. Hence, it is not guaranteed to hit automatically the original owner of the session. The ASA cluster gets around this one-way symmetrical establishment by redirecting the session to its owner via the CCL.

Figure 6: TCP handshake establishment with redirection to the session owner over the CCL

  • In the example above, a new session (TCP SYN) is established, hitting the ASA 2, which encodes SYN cookies with its own information (1) and then forwards the SYN packet to the next destination (2).
  • When the server responds with the TCP SYN-ACK, the traffic is load balanced based on the layer 2 or layer 3 source and destination identifiers and sent toward a different ASA; in our example the TCP SYN-ACK arrives at ASA 3.
  • ASA 3 decodes the owner information from the SYN cookie and notices that ASA 2 is the owner (4) for that session.
  • ASA 3 immediately forwards the packet to the owner unit ASA 2 over the CCL, which in turn forwards the SYN-ACK via its inbound interface.
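The owner/redirect logic of these steps can be modeled in a few lines. This is a toy sketch under loud assumptions: the `AsaCluster` class is hypothetical, and session ownership is kept in a shared dictionary purely for simplicity, whereas the steps above describe the owner being carried in the SYN cookie.

```python
from dataclasses import dataclass, field


@dataclass
class AsaCluster:
    """Toy model of session ownership in a firewall cluster."""
    units: list
    owners: dict = field(default_factory=dict)  # flow key -> owner unit

    @staticmethod
    def flow_key(src, dst):
        # Direction-independent key so both halves of a flow match.
        return frozenset((src, dst))

    def receive(self, unit, src, dst):
        """Return the list of cluster units the packet traverses."""
        if unit not in self.units:
            raise ValueError(f"unknown unit {unit}")
        key = self.flow_key(src, dst)
        # The first unit that sees a flow becomes its owner.
        owner = self.owners.setdefault(key, unit)
        if owner == unit:
            return [unit]
        return [unit, owner]  # redirected to the owner over the CCL


cluster = AsaCluster(["ASA-1", "ASA-2", "ASA-3", "ASA-4"])
client = ("192.0.2.10", 40000)   # hypothetical end-user socket
server = ("10.1.1.10", 443)      # hypothetical application socket

syn_path = cluster.receive("ASA-2", client, server)      # ["ASA-2"]
syn_ack_path = cluster.receive("ASA-3", server, client)  # ["ASA-3", "ASA-2"]
```

Every extra entry in the returned path is a hop across the CCL, which is why the CCL bandwidth has to be sized with the data interfaces in mind.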

Consequently, the CCL must be dimensioned according to the speed of the inbound and outbound interfaces (e.g. 2 x 10GE for resiliency and performance).

Figure 7: To increase resiliency, it is recommended to dual-home each data interface using a port-channel split between the two upstream switches. There is one port-channel per ASA device.

ASA Firewall clustering spanned across multiple sites

Now that we understand how ASA clustering is created, let’s see how to leverage the clustering mode for a DCI solution.

For our purposes and examples, an ASA cluster formed with four physical ASA units is stretched across two DCs, with two ASA members on each site. We will keep this scenario for the whole article, but nothing prevents us from adding more ASA units (up to 16 units per ASA cluster are currently supported) stretched across three or more DCs interconnected using LAN extension (we are using OTV for multi-site LAN extension).

ASA cluster Configuration

Part of the added value of deploying the ASA cluster is that the configuration is synchronized between all units. There is no need to manually replicate the policies on each unit, which avoids the risk of misconfiguration.

Extending the Cluster Control Link across two DCs

The first component to extend between the DCs is the Cluster Control Link. This CCL connection must be fully resilient, as discussed previously, and extended across sites using a solid DCI LAN extension. Each ASA uses a dedicated port-channel split between the two vPC peers. From the upstream logical switch, the port-channels are not related except that they carry the same VLAN from ASA to ASA units across the two sites. From a security point of view, it is preferable to deploy it inside the secure perimeter.

Figure 8: CCL deployment, each ASA uses a dedicated port-channel split between the two vPC peers.

As discussed above, there are two modes to load distribute the data traffic among the ASA members: a layer 3 mode called Individual Interface mode and a layer 2 mode called Spanned Ether-Channel mode. Both modes are valid for DCI, each with slightly different added values that will be discussed as we move forward.

Extending the Data plane using Individual Interface Mode

The ASA units are deployed in sandwich mode between the inside and outside routers and ECMP is used to load distribute the traffic across the local ASA members. The CCL as well as the data plane VLANs are extended between sites.

Figure 9: For the purposes of this topic, the CCL is represented using a logical straightforward link (orange) extended between the two DCs; however, in the final deployment it should be distributed in a sturdy and redundant fashion as described in Figure 8.

The application data traffic (layer 2) is isolated by a layer 3 hop. The CCL is also isolated from the data workflow. However, nothing prevents network managers from collapsing the CCL VLAN with the application data VLANs within the same overlay network as the segmentation is therefore maintained at layer 2 with dot1Q tagging.

From a logical layer 3 point of view, IGP adjacency is established between the inside and outside routers through the local ASA units, as well as between sites (higher cost across sites used for disaster recovery purposes).

Figure 10: The default gateway is unique and still active in DC-1. After the migration of the application, the traffic hits the original router, returns to DC-2 in regard to the communication with the application upper tier, to finally return to DC-1. The ping-pong effect starts after the move.

  • In the scenario on the left, the primary DC-1 attracts the request for the application that exists in DC-1.
  • The default gateway for the application is active in DC-1 and standby in DC-2.
  • The ASA-1 (far left) becomes the owner of the session and routes the packet to the application.
  • In the scenario on the right, the application has moved (hot stateful live migration) and continues to respond with no interruption.
  • The return traffic hits the active default gateway in DC-1 which routes the packet toward the frontend server.
  • The frontend server responds to the end-user via its default gateway (DG) active on DC-1.
  • The DG on DC-1 distributes the packet to the ASA-2.
  • ASA-2 checks the owner of the session and redirects the packet to ASA-1 over the CCL.
  • The workflow exits DC-1 toward the end-user.
  • The session is maintained stateful with zero interruption.

Beyond the ASA clustering, the concern is that, even if the tiers of the application are all moved to the distant DC-2, the routed communication between the tiers is established via the active default gateway still located in DC-1. Hence the traffic from the back-end to the front-end is hairpinned via DC-1.

Consequently, most network managers have a great interest in enabling HSRP isolation to improve the server-to-server communication, as shown below.

Figure 11: HSRP isolation reduces the hairpinning; however, the return traffic must still hit the owner of the session to maintain the establishment stateful.

With FHRP isolation techniques, server-to-server communication is routed locally, eliminating the pointless latency (far right bottom). The outbound traffic from the frontend server toward the end-user is routed to the upstream local firewall (shortest path). However, as we want to maintain the session as stateful, the local ASA-3 redirects the traffic workflow to its original owner (ASA-1) via the CCL extension.
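One way to see the benefit is to count how many times a packet flow crosses the DCI link with and without FHRP isolation. The two hop lists below are my own reading of the scenario after the whole application has moved to DC-2, and the `site_crossings` helper is hypothetical; it is a sketch for reasoning, not a measurement.

```python
def site_crossings(path):
    """Count how many times consecutive hops sit in different sites."""
    sites = [site for _hop, site in path]
    return sum(1 for a, b in zip(sites, sites[1:]) if a != b)


# Back-end replies to front-end, then front-end replies to the end-user.
without_isolation = [
    ("back-end", "DC-2"),
    ("default gateway", "DC-1"),   # only active gateway is in DC-1
    ("front-end", "DC-2"),
    ("default gateway", "DC-1"),
    ("ASA-2", "DC-1"),
    ("ASA-1 (owner)", "DC-1"),
]
with_isolation = [
    ("back-end", "DC-2"),
    ("local gateway", "DC-2"),     # FHRP isolation: gateway active locally
    ("front-end", "DC-2"),
    ("local gateway", "DC-2"),
    ("ASA-3", "DC-2"),
    ("ASA-1 (owner)", "DC-1"),     # single crossing, over the CCL
]

crossings_before = site_crossings(without_isolation)  # 3
crossings_after = site_crossings(with_isolation)      # 1
```

The one remaining crossing is the redirection to the session owner, which is the price of keeping the existing session stateful.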

While machines migrate from one location to another, the session is maintained stateful with zero interruption. However, although the application has moved to DC-2, the next request will still hit DC-1 until we inform the layer 3 network about the move, either manually or dynamically. This is discussed in Part 2, next.

The added value with Individual Interface Mode is that it maintains the layer 2 failure domain in isolation from the ASA control and data planes. Traffic is routed up and down the ASA cluster. The application data VLAN can be extended in a transparent fashion using any validated DCI solution. However, the other side of the coin is that the ASA unit cannot be the first hop default gateway for the application, as another layer 3 router separates it. Thus, Individual Interface mode might be challenging for enterprises that would like the firewall to be the default gateway of the application servers. Another point to mention about Individual Interface Mode is that it doesn’t support firewalls configured in Transparent Mode.

Extending the Data plane using Spanned Ether-channel Mode

In our cluster, all ASA members share the same IP address and can be the first hop default gateway for the application (not yet qualified though). However, for the last option we need to be very cautious with the layer 2 extension and the vMAC address.

Figure 12: The ASA Cluster LACP (cLACP) is spanned across the 2 data centers. The LACP established from the vPC peers is local. A layer 3 device isolates the ASA Spanned Ether-channel interface from the application data VLAN.

On each site, a local port-channel is established from the vPC peers toward the local ASA units. From the ASA cluster’s point of view, a single LACP port-channel is spanned across the two distant DCs (cLACP). Because cLACP requires the same port-channel to be spanned across the whole ASA cluster, the vPC domain identifier must be identical on the vPC peers of both sites.

In regard to workflow, the behavior is the same as with the Individual Interface Mode. Note a slight difference in case of failure; convergence with LACP should happen faster than with ECMP or PBR.

Figure 13: The ASA cluster is running in transparent mode; HSRP isolation is enabled, improving the workflow while the session is maintained stateful.

When the ASA cluster is running in routed mode, the members share the same IP address and the same vMAC address. Consequently, to prevent a duplicate MAC address from appearing on different switch ports, a router can be added on each site (preferred method) to separate the layer 2 data traffic between the spanned ether-channel and the LAN extension. Indeed, if the VLAN attaching the front-end servers is L2-adjacent with the firewall, then the vPC peer will detect the same vMAC address bouncing between different interfaces (toward the ASA members and from the data LAN extension), which is definitely not an expected situation in Ethernet.

Figure 14: The same vMAC is learnt from both sides of each vPC peer, a situation that Ethernet definitely does not support.

To prevent the same MAC address from being learnt on different sides, a layer 3 gateway is inserted to separate the L2 data traffic between the spanned ether-channel and the extended data VLAN. It also prevents a layer 2 loop in case of a human design mistake.

Figure 15: a router inserted between the data VLAN and the spanned port-channel isolates the duplicate MAC address.

In the figure above, the design on the left shows the layer 3 separation between the spanned ether-channel VLAN 10 and the application data VLAN 20. Only VLAN 20 is extended. On the right side, the application servers are L2-adjacent with the inside interfaces of the ASA units via VLAN 10. As a result, the vPC peers on both sites learn the duplicate vMAC address from their respective ASA units and from the extended VLAN 10, which is not acceptable.
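To make the problem concrete, here is a minimal Python sketch of a switch MAC table that records every time a MAC address moves from one port to another. It is only a toy model: the `MacTable` class and the port names `po-asa` and `po-dci` are made up for illustration, and the vMAC value is an arbitrary example.

```python
from typing import Dict, List, Tuple


class MacTable:
    """Toy MAC table of one vPC peer (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}             # MAC -> port where last seen
        self.moves: List[Tuple[str, str, str]] = []   # (MAC, old port, new port)

    def learn(self, mac: str, port: str) -> None:
        previous = self.entries.get(mac)
        if previous is not None and previous != port:
            # Same MAC seen on another port: the entry "bounces".
            self.moves.append((mac, previous, port))
        self.entries[mac] = port


VMAC = "1111.2222.3333"  # example vMAC shared by all ASA units
vpc_peer = MacTable()

# Right-hand design: the vMAC arrives alternately from the local ASA
# port-channel and from the extended VLAN (DCI side).
for port in ["po-asa", "po-dci", "po-asa", "po-dci"]:
    vpc_peer.learn(VMAC, port)

print(len(vpc_peer.moves), "MAC moves recorded for", VMAC)
```

With the layer 3 separation of the left-hand design, the vMAC would only ever be learnt on the local ASA port-channel, so no move would be recorded.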

If you are willing to offer the default gateway from the firewall (not yet qualified though), it definitely requires filtering the vMAC address between sites as well as the ARP requests.

Figure 16: Careful: Please don’t get me wrong – currently this is neither recommended nor supported. Don’t do the following until you understand the exact ramifications 🙂 .

To filter the vMAC address with OTV, you will need to perform the following on the internal interface of each OTV edge device.

The following access list will block the vMAC 1111.2222.3333 shared between all ASA units:

In addition we need to apply a route-map on the OTV control plane to avoid communicating vMAC information to the remote OTV edge devices.

Indeed OTV uses the control plane to populate each remote OTV edge device with its local layer 2 MAC table. As this MAC table is built from its regular MAC learning, it is important that OTV doesn’t inform any remote OTV Edge device about the existence of this vMAC address as it exists on each site.
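The intent of these two filters can be pictured with a small Python model. This is not OTV configuration syntax, only a sketch under simplifying assumptions: the function names and the host MAC addresses are hypothetical, and only the vMAC value comes from the text above.

```python
from typing import Set

ASA_VMAC = "1111.2222.3333"      # vMAC shared by all ASA units
BLOCKED: Set[str] = {ASA_VMAC}


def forward_across_dci(src_mac: str) -> bool:
    """Data plane filter: frames sourced from the shared vMAC stay local."""
    return src_mac not in BLOCKED


def macs_to_advertise(local_mac_table: Set[str]) -> Set[str]:
    """Control plane filter: the shared vMAC is never advertised to remote edges."""
    return local_mac_table - BLOCKED


# Hypothetical local MAC table of one edge device: the vMAC plus two hosts.
local_macs = {ASA_VMAC, "0050.56aa.0001", "0050.56aa.0002"}
advertised = macs_to_advertise(local_macs)
print(sorted(advertised))
```

Both filters are needed: the first keeps the vMAC out of the data path between sites, the second keeps it out of the MAC reachability information exchanged between edge devices.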

However, a possible drawback of the Spanned mode is that all ASA units are active and share the same IP and MAC addresses; hence an ARP request will hit all units that form the ASA cluster, and all members will reply with the same source (IP and vMAC). For the local ASA, the reply passes along the unique local port-channel, which is fine; however, it could be tricky if the reply comes from the remote site via the DCI link. Fortunately the vMAC of the remote ASA will be blocked by the access list described previously. Therefore the ARP reply will come only from the local ASA, which is the desired behaviour. However, to reduce the broadcast traffic, you may also want to filter the ARP requests destined to the default gateway across the DCI connection.

Some additional recommendations

It is recommended to enable jumbo frame reservation and to set the cluster MTU to at least 1600 for use with the cluster control link. When a packet is forwarded over the cluster control link, an additional trailer is added, which could cause fragmentation. Set it to 9216 to match the system jumbo frame size configured on the N7k. Hence, the MTU of the inter-site IP network must be sized accordingly.

For a deep understanding of what is supported and what is not, please follow the inter-site Clustering guidelines recommendation in the ASA 9.2 Configuration Guide.


26 – Bis – VxLAN VTEP GW: Software versus Hardware-based

Just a short note to clarify VxLAN deployment in a hybrid network (intra-DC).

As discussed in the previous post, with the software-based VxLAN, only one single VTEP L2 Gateway can be active for the same VxLAN instance.

This means that all end-systems connected to the VLAN mapped to a particular VNID must be confined to the same leaf switch where the VTEP GW is attached. Other end-systems connected to the same VLAN but on different leaf switches, isolated by the layer 3 fabric, cannot communicate with the VTEP L2 GW. This may be a concern with a hybrid network where servers supporting the same application are spread over multiple racks.

To allow bridging between VNID and VLAN, the L2 network domain must be spanned between the active VTEP L2 gateway and all servers of interest that share the same VLAN ID. Yet, among other improvements, VxLAN also aims to contain the layer 2 failure domain to its smallest diameter by leveraging layer 3 for the transport. Although it is certainly a bit antithetical to the purpose of VxLAN, if all leafs are concerned by the same VNID to VLAN ID mapping, it is nonetheless feasible to extend the layer 2 across the fabric using a layer 2 multi-pathing protocol such as FabricPath.

In the following example, the server 4 attached to leaf 4 cannot communicate with the VTEP L2 GW located on leaf 1. As a result, VM-1 cannot communicate with server 4.

Fortunately the hardware solves this. The great added value of enabling the VTEP L2 gateway on the hardware switch (ToR) is that it is distributed and active on each leaf. Thus communication between VTEP on each switch is handled using the VxLAN tunnel. Hence, VNID 5000 can be bridged with VLAN 100 on leaf 4 and therefore VM-1 can communicate with server 4.

The other interesting added value of the hardware-based anycast L2 gateway is VLAN translation using VLAN stitching, which can be useful for some migration purposes. Each leaf can map the same VNID to a different VLAN on its own side. In the following example VNID 5000 can be bridged with VLAN 100 on leaf 1 and VLAN 200 on leaf 6. Consequently, VLAN 100 and VLAN 200 now share the same broadcast domain.
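The per-leaf mapping can be pictured with a small Python model. This is only an illustrative sketch: the `Leaf` class and the `same_broadcast_domain` helper are hypothetical names, not a switch API; the VNID and VLAN numbers are those of the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Leaf:
    """Toy model of a hardware VTEP L2 gateway on one leaf (hypothetical)."""
    name: str
    # Local VLAN ID -> VNID. The VLAN ID is only significant on this leaf.
    vlan_to_vnid: Dict[int, int] = field(default_factory=dict)

    def map_vlan(self, vlan: int, vnid: int) -> None:
        self.vlan_to_vnid[vlan] = vnid


def same_broadcast_domain(leaf_a: Leaf, vlan_a: int, leaf_b: Leaf, vlan_b: int) -> bool:
    """Two (leaf, VLAN) pairs are stitched together if they map to the same VNID."""
    vnid_a = leaf_a.vlan_to_vnid.get(vlan_a)
    vnid_b = leaf_b.vlan_to_vnid.get(vlan_b)
    return vnid_a is not None and vnid_a == vnid_b


leaf1 = Leaf("leaf-1")
leaf6 = Leaf("leaf-6")
leaf1.map_vlan(100, 5000)   # VNID 5000 <=> VLAN 100 on leaf 1
leaf6.map_vlan(200, 5000)   # VNID 5000 <=> VLAN 200 on leaf 6

print(same_broadcast_domain(leaf1, 100, leaf6, 200))
```

The key point of the model is that the VLAN ID is a local attribute of each leaf, while the VNID is the identifier shared across the fabric.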

While the software-based VxLAN solution is flexible in a fully virtualised environment, it is not always well adapted to a hybrid network built with a mix of virtual and physical devices spread over unorganised racks.

Hope that clarifies the choice of VxLAN mode that you wish to deploy.



26 – Is VxLAN (Flood&Learn) a DCI solution for LAN extension ?

One of the questions that many network managers are asking is “Can I use VxLAN stretched across different locations to interconnect two or more physical DCs and form a single logical DC fabric?”

The answer is that the current standard implementation of VxLAN has grown up for an intra-DC fabric infrastructure and would necessitate additional tools as well as a control plane learning process to fully address the DCI requirements. Consequently, as of today it is not considered as a DCI solution.

To understand this statement, we first need to review the main requirements to deploy a solid and efficient DC interconnect solution and dissect the workflow of VxLAN to see how it behaves against these needs. All of the following requirements for a valid DCI LAN extension have already been discussed throughout previous posts, so the following serves as a brief reminder.

DCI LAN Extension requirements

Strongly recommended:

  • Failure domain must be contained within a single physical DC
    • Leverage protocol control plane learning to suppress the unknown unicast flooding.
    • Flooding of ARP requests must be reduced and controlled using rate limiting across the extended LAN.
    • Generally speaking, rate limiters for the control plane and data plane must be available to control the broadcast frame rate sent outside the physical DC.
    • The threshold must be set carefully with regard to the existing broadcast, unknown unicast and multicast traffic (BUM).
  • Redundant paths distributed across multiple edge devices are required between sites, with all paths being active without creating any layer 2 loop
    • A built-in loop prevention is a must have (e.g. OTV).
    • Otherwise the tools and services available on the DCI platforms to address any form of Multi-EtherChannel split between 2 distinct devices (e.g. EEM, MC-LAG, ICCP, vPC, vPC+, VSS, nV Clustering, etc.) need to be activated.
  • Independent control plane on each physical site
    • Reduced STP domain confined inside each physical DC.
    • STP topology change notifications should not impact the state of any remote link.
    • Primary and secondary root bridges or multicast destination tree (e.g. FabricPath) must be contained within the same physical DC.
    • A unique active controller managing the virtual switches may lose access to the other switches located on the remote site.
  • No impact on the layer 2 protocol deployed on each site
    • Any layer 2 protocols must be supported inside the DC (STP, MST, RSTP, FabricPath, TRILL, VxLAN, NVGRE, etc.) regardless of the protocols used for the DCI and within the remote sites.
  • Remove or reduce the hairpinning workflow for long distances
    • Layer 3 default gateway isolation such as FHRP isolation, anycast gateway or proxy gateway.
    • Intelligent ingress path redirection such as LISP, IGP Assist, Route Health Injection, GSLB.
  • Fast convergence
    • Sub-second convergence for any common failure (link, interface, line card, supervisor, core).
  • Any transport
    • The DCI protocol must be transport-agnostic, meaning that it can be initiated on any type of link (dark fiber, xWDM, Sonet/SDH, Layer 3, MPLS, etc.).

Must Have

In addition to the previous fundamentals, we can enable additional services and features to improve the DCI solution:

  • ARP caching to reduce ARP broadcasting outside the physical DC.
  • VLAN translation allowing the mapping of one VLAN ID to one or multiple VLAN IDs.
  • Transport-agnostic core usage – thus enterprises and service providers are not required to change their existing backbone.
  • Usage of IP multicast grouping is a plus, but should not be mandatory. However, the DCI solution must be able to support non-IP multicast layer 3 backbone as an enterprise would not necessarily obtain the IP multicast service from their provider (especially for a small number of sites to be layer 2 interconnected).
  • Path diversity and traffic engineering to offer multiple routes or paths based on selected criteria (e.g. control plane versus data plane) such as IP based, VLAN-based, Flow-based.
  • Removal or reduction of hairpinning workflows for metro distances (balancing between the complexity of the tool versus efficiency).
  • Control plane learning versus data plane MAC learning.
  • Multi-homing going beyond the two tightly coupled DCI edge devices.
  • Site independence – the status of a DCI link should not rely on the remote site state.
  • Load balancing: Active/Standby, VLAN-based or Flow-based

A Brief Overview of VxLAN

VxLAN is just another L2 over L3 encapsulation model. It has been designed for a new purpose: offering logical layer 2 segments for virtual machines while allowing layer 2 communication within the same VxLAN segment identifier (VNID) over an IP network. VxLAN offers several great benefits. Firstly, as just expressed, it can extend layer 2 segments pervasively over an IP network, allowing the same subnet to be stretched across different physical PoDs (Points of Delivery) and therefore preserving the IP schema in use. A PoD usually delimits the layer 2 broadcast domain. Secondly, it can create a huge number of small logical segments, well beyond the traditional dot1Q standard limited to 4k. This improves the agility needed for the cloud, supporting stateful mobility together with the native segmentation required by multi-tenancy.

The other very important aspect of VxLAN is that it is getting a lot of momentum across many vendors, software and hardware platforms, with enterprises closely monitoring its evolution as the potential new de-facto transport standard inside the new generation of data center network fabric, addressing the host mobility.

However, VxLAN doesn’t natively support systems or network services connected to a traditional network (bare metal servers and virtual machines attached to a VLAN, NAS, physical firewalls, IPS, physical SLB, DNS, etc.); hence additional components such as the VxLAN VTEP layer 2 and layer 3 gateways are required to communicate beyond the VxLAN fabric.

Many papers and articles have been posted on the web regarding the concept and details of VxLAN, therefore I’m not going to dig deeper into it here. Nevertheless, let’s still review some important VxLAN workflows and behaviors pertinent to this topic, step by step to help address the DCI requirements listed above more efficiently.

Current Implementation of VxLAN.

A virtual machine (or more generally speaking an end-system) within the VxLAN network sits in a logical L2 network and communicates with other distant machines using so-called VxLAN Tunnel End-Points (VTEP) enabled on the virtual or physical switch. The VTEP offers two interfaces: one connecting the layer 2 segment to which the virtual machine is attached, and an IP interface to communicate over the routed layer 3 network with the other VTEPs that bridge the same segment ID. The VTEP encapsulates the original Ethernet traffic with a VxLAN header and sends it over the layer 3 network toward the VTEP of interest, which then removes the VxLAN header in order to present the original layer 2 frame to its final destination end-point.
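As an illustration of the encapsulation step, here is a minimal Python sketch that builds and parses the 8-byte VxLAN header as laid out in RFC 7348 (one flags byte with the "I" bit set, 24 reserved bits, the 24-bit VNID, 8 reserved bits). It is a sketch under assumptions: the function names are hypothetical, and the outer Ethernet, IP and UDP headers that a real VTEP adds are deliberately left out.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNID field carries a valid value


def vxlan_encap(inner_frame: bytes, vnid: int) -> bytes:
    """Prepend the 8-byte VxLAN header to the original Ethernet frame."""
    if not 0 <= vnid < 2 ** 24:
        raise ValueError("the VNID must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNID (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vnid << 8)
    return header + inner_frame


def vxlan_decap(packet: bytes) -> Tuple[int, bytes]:
    """Return (VNID, original frame) from a VxLAN payload."""
    if len(packet) < 8:
        raise ValueError("packet shorter than a VxLAN header")
    flags_word, vnid_word = struct.unpack("!II", packet[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNID flag not set")
    return vnid_word >> 8, packet[8:]


frame = bytes(14)                      # placeholder for an Ethernet frame
packet = vxlan_encap(frame, 5000)      # VNID 5000 is an arbitrary example
vnid, original = vxlan_decap(packet)
print(vnid, original == frame)
```

The 24-bit VNID field is what allows 2^24 (about 16 million) segments, compared with the 12-bit VLAN ID of dot1Q.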

In short, VxLAN aims to extend the logical layer 2 segment over a routed IP network. Hence, although it is not a good idea, it is not fundamentally surprising that many network managers are thinking of going a bit further and leveraging the VxLAN protocol to stretch its segments across geographically distributed data centers.

VxLAN Implementation Models

Two types of VxLAN overlay network implementations exist: host-based and network-based.

Confining the Failure Domain to its Physical Location

As of today, both software and hardware-based VxLAN implementations perform the MAC learning process based on traditional layer 2 operations, flooding and dynamic MAC address learning. They both offer the same transport and follow the same encapsulation model, however, dedicated ASIC-based encapsulation offers better performance as the number of VxLAN segments and VTEPs grows.

Each VTEP joins an IP multicast group according to its VxLAN segment ID (VNID), thus reducing the flooding to only the VTEP group concerned with the learning process. This IP multicast group is used to carry the VxLAN broadcast, the unknown unicast as well as the multicast traffic (BUM).

We need to walk through the learning process (at a high level) with our original DCI requirements in mind. This data flow walkthrough assumes that the learning process starts from scratch.


  • An end-system A sends an ARP request for another end-system B that belongs to the same VxLAN segment ID behind the VTEP 2.
  • Its local VTEP 1 doesn’t know about the destination IP address, thus it encapsulates the original ARP request to its IP multicast group with its identifiers (IP address and segment ID).
  • All VTEPs that belong to the same IP multicast group receive the packet and forward it to their respective VxLAN segments (only if this is the same VNID). They learn the MAC address of the original end-system A as well as its association with its VTEP 1 (IP address). They update their corresponding mapping table.
  • The remote end-system B learns the IP <=> MAC address of the original requestor system A, ARP replies to it with its own MAC address.
  • The VTEP 2 receives the reply to be sent to the end-system A. It knows the mapping of end-system A’s MAC address and the IP address of VTEP 1, then it encapsulates the ARP reply and unicasts the packet to VTEP 1.
  • VTEP 1 receives the layer 3 packet from VTEP 2 as a unicast packet, and learns the IP address of VTEP 2 as well as the mapping between the MAC address of end-system B and the IP address of VTEP 2.
  • VTEP 1 strips off the VxLAN encapsulation and sends the original ARP reply.
  • All subsequent traffic between both end-systems A and B is forwarded using unicast.
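The steps above can be sketched with a toy Python simulation of flood-and-learn. It is a simplified model under stated assumptions, not an implementation of any product: the `Vtep` class, the `send` function, the addresses and the VNID are all hypothetical, and the IP multicast group is reduced to a plain list of VTEPs.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


class Vtep:
    """Toy VTEP doing data plane (flood and learn) MAC learning."""

    def __init__(self, ip: str, vnid: int) -> None:
        self.ip = ip
        self.vnid = vnid
        self.mac_to_vtep: Dict[str, str] = {}  # remote MAC -> remote VTEP IP


def send(src: Vtep, group: List[Vtep], src_mac: str, dst_mac: str) -> List[str]:
    """Deliver one frame and return the IPs of the VTEPs that received it."""
    known_vtep = src.mac_to_vtep.get(dst_mac)
    if dst_mac == BROADCAST or known_vtep is None:
        # BUM traffic: flooded to every other member of the IP multicast group.
        receivers = [v for v in group if v is not src and v.vnid == src.vnid]
    else:
        # Known unicast: a single encapsulated packet to the owning VTEP.
        receivers = [v for v in group if v.ip == known_vtep]
    for vtep in receivers:
        vtep.mac_to_vtep[src_mac] = src.ip  # learn source MAC -> source VTEP
    return [v.ip for v in receivers]


vtep1, vtep2, vtep3 = (Vtep(f"10.0.0.{i}", 5000) for i in (1, 2, 3))
group = [vtep1, vtep2, vtep3]
MAC_A, MAC_B = "00:00:00:00:00:0a", "00:00:00:00:00:0b"

arp_request = send(vtep1, group, MAC_A, BROADCAST)  # A asks for B: flooded
arp_reply = send(vtep2, group, MAC_B, MAC_A)        # B answers A: unicast
data = send(vtep1, group, MAC_A, MAC_B)             # A to B: unicast

print(arp_request, arp_reply, data)
```

Note that VTEP 3 receives the flooded ARP request and learns end-system A even though it hosts neither A nor B. This is exactly the behavior that becomes costly once the multicast group spans a DCI link.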

VxLAN uses traditional data plane learning, as a bridge does, to populate the layer 2 mapping tables. While this learning process is suitable within a Clos fabric architecture, it is certainly not efficient for DCI. The risk is that flooding toward unknown destinations burdens the IP multicast group and therefore pollutes the whole DCI solution (e.g. overloads the inter-site link) instead of containing the failure domain within its physical data center. The issue is that, contrary to other overlay protocols such as EoMPLS or VPLS, it is very challenging to rate limit the specific IP multicast group responsible for handling VxLAN-based BUM flooding.


Flooding unknown unicast can saturate the IP multicast group over the DCI link, hence it is not appropriate for DCI solutions.

Unicast-only Mode

An enhanced implementation of VxLAN allows the use of a unicast-only mode. In this case the BUM is replicated to all VTEPs. The best current example is the enhanced mode of VxLAN supported on the Nexus 1000v. However, to improve the head-end replication, the Nexus 1000v controller (VSM) redistributes all VTEPs to its VEMs maintaining dynamically each VM entry for a given VxLAN. Consequently each VEM knows all the VTEP IP addresses spread over the other VEMs for a particular VxLAN. In addition to the VTEP information, the VSM learns and distributes all MAC addresses in a VxLAN to its VEMs. At any given time, all VEM MAC address tables contain all the known MAC addresses in the VxLAN domain. As a result, any packet sent to an unknown destination is dropped.
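The difference with flood-and-learn can be sketched in a few lines of Python. This is a toy model and not the Nexus 1000v API: the `Controller` class stands in for the VSM, the function and addresses are hypothetical, and for simplicity a VTEP is only known to the controller once it hosts at least one registered MAC address.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


class Controller:
    """Stands in for the controller distributing VTEP and MAC information."""

    def __init__(self) -> None:
        self.mac_to_vtep: Dict[str, str] = {}

    def register(self, mac: str, vtep_ip: str) -> None:
        self.mac_to_vtep[mac] = vtep_ip

    def vteps(self) -> List[str]:
        return sorted(set(self.mac_to_vtep.values()))


def destinations(ctrl: Controller, src_vtep: str, dst_mac: str) -> List[str]:
    """Return the VTEP IPs that receive a unicast copy of the frame."""
    if dst_mac == BROADCAST:
        # Head-end replication: one unicast copy per remote VTEP.
        return [ip for ip in ctrl.vteps() if ip != src_vtep]
    target = ctrl.mac_to_vtep.get(dst_mac)
    # Every MAC is known to the controller, so an unknown destination is dropped.
    return [] if target is None else [target]


vsm = Controller()
vsm.register("00:00:00:00:00:0a", "10.0.0.1")
vsm.register("00:00:00:00:00:0b", "10.0.0.2")
vsm.register("00:00:00:00:00:0c", "10.0.0.3")

print(destinations(vsm, "10.0.0.1", BROADCAST))
print(destinations(vsm, "10.0.0.1", "00:00:00:00:00:0b"))
print(destinations(vsm, "10.0.0.1", "00:00:00:00:00:ff"))
```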

The unicast-only mode currently only concerns the set of Virtual Ethernet Modules (VEM) managed by the controller (VSM) with a maximum limit of 128 supported hosts even when the Nexus 1000v instance is distributed across two data centers. Currently, one unique Nexus 1000v instance controls the whole VxLAN overlay network spanned across the two data centers. Although other VxLAN domains could be created using additional VxLAN controllers, they are currently not able to communicate with each other. Therefore, despite improving the control over flooding behavior, this feature could potentially be stretched only between two small data centers and only for extended VxLAN logical segments. However, it is important to take into consideration that, in addition to the current scalability, other crucial concerns still exist in regard to the DCI requirement, with only one controller active for the whole extended VxLAN domain and remote devices connected to a traditional VLAN that won’t be extended across sites (discussed below with the VTEP gateway).

Note: some implementations of VxLAN support the function of ARP caching, reducing the amount of flooding over the multicast for previous similar ARP requests. However, unknown unicast is flooded over the IP multicast group. Thus ARP caching must be considered as an additional improvement to DCI, not a DCI solution on its own, hence even if adding great value, it is definitely not sufficient to allow stretching the VxLAN segment across long distances.

Active/Active Redundant DCI Edge Device

As most of the DC networks are running in a “hybrid” model with a mix of physical devices and virtual machines, we need to evaluate the different methods of communication that exist between VxLANs and with a VLAN.

  • VxLAN A to VxLAN A. This is the default communication just described above between two virtual machines belonging to the same logical network segment. It uses the VTEP to extend the same segment ID over the layer 3 network. Although the underlying IP network can offer multiple active paths, only one VTEP for a particular segment can exist on each host. This mode is limited for communicating between strictly identical VxLAN IDs, meaning also limited to a fully virtualized DC.
  • VxLAN Gateway to VLAN. To communicate with the outside of the virtual network overlay, the VxLAN logical segment ID is bridged to the traditional VLAN ID using the VxLAN VTEP layer 2 gateway. It is a one-to-one translation that can be achieved using different tools. On the software side, this service can be initiated using a virtual service blade gateway installed on the Nexus 1010/1110 or from the security edge gateway that comes with the vCloud network, to list only two, acting in transparent bridge mode: there is one active VTEP gateway per VNID to VLAN ID translation, although a single VTEP gateway supports many VNID <=> VLAN ID translations. On the hardware side, the VTEP gateway is embedded in each switch supporting this function using a dedicated ASIC. Whichever software component is used for the VxLAN VTEP gateway function, currently only one L2 gateway can be active at a time for each particular VNID <=> VLAN ID mapping, which prevents layer 2 loops. On the other hand, with the hardware implementation of VxLAN, for hosts directly attached to a physical leaf, it is possible to run the same L2 gateway mapping locally on each leaf of interest. Nonetheless, this is possible only if there is no cascaded layer 2 network attached to the leaf of interest.

Although it is not a viable solution for a production network, the following needs to be stated in order to prevent it from being understood as an alternative solution: indeed a bandage option could be used to connect a virtual machine with a virtual interface attached to each segment side. This would bridge a VNID to a VLAN ID with no VTEP function per se, like a virtual firewall for example configured in transparent bridging mode. Even though this may work technically for a specific secured bridging between a VxLAN and a VLAN segment, it is important to understand the limitation and the risks.


There are some important weaknesses to take into consideration:

  • Risk of layer 2 loops by cloning and enabling a virtual machine or just a human mistake connecting the same VxLAN segment with its peer VLAN twice.
  • Scalability bottlenecks due to the limited number of supported virtual interfaces (VNIC) per virtual machine: in theory a maximum of five VNID to VLAN ID mappings is feasible, but if we remove one VNIC for management purposes, only four possible translations remain. While it is possible to enable dot1q to increase the number of VNID <=> VLAN ID mappings, some limitations do remain (100-200 VLAN), which also impact performance and bandwidth.
  • Live migration: if the VM running the bridge gateway role live migrates to a new host, then the entire MAC to IP bindings will be purged and would need to be re-learned.

Therefore, based on this non-exhaustive list of caveats, it is preferable to not use this method to bridge a VxLAN with a VLAN segment. To avoid any risk of disruption, it is more desirable to enable the virtual firewall in routed mode.

There is another method of connecting to the outside of the VxLAN fabric, which consists of routing a logical segment to a VLAN or to another VxLAN segment using the VxLAN layer 3 gateway. This function can be embedded into the virtual or physical switch, or possibly provided by an external VM acting as a router, although with the same limitations as mentioned above. This layer 3 gateway is out of scope for the purpose of this article, so let’s focus on the VxLAN gateway to VLAN for the hybrid model.

As mentioned previously, although the VTEP layer 2 gateway can be deployed in high availability mode, only one is active at a time for a particular logical L2 segment. An additional active VxLAN gateway may create a layer 2 loop, since there is currently no protocol available to block an alternate forwarding path.

That means that all translations from VNID to VLAN ID and vice versa must hit the same VTEP gateway. While this behavior works well inside a network fabric, when projected into a DCI scenario it means that only end-systems in the same data center as the VxLAN gateway can communicate with each other.

In the above structure VxLAN 12345 is mapped to VLAN 100 in DC 1 on the right. Only Srv 1 can communicate with VM2 or VM1. Srv 2 on DC 2 is isolated because it has no path to communicate with the VxLAN gateway on the remote site DC 1.

Theoretically we could create a VxLAN perimeter in each data center, each managed by its own VxLAN controller. This would allow duplicating the VTEP gateway and mapping the same series of VNID <=> VLAN ID on each site.


Each VTEP on each side can share the same IP multicast group used by the VNIDs of interest. Yet currently, and due to the data plane learning, we are still facing ARP and BUM flooding. Until a control plane can distribute the VTEP and MAC reachability on all virtual or physical edges and between VxLAN domains, the data center interconnection will suffer from high flooding of BUM.

Thus, the paradox is that, in order to allow communication between remote machines connected to a particular VLAN and a virtual machine attached to the same VxLAN segment with a single active VTEP gateway, a validated DCI solution is required extending the VLAN in question.

On the minus side using a DCI VLAN extension and one single active VxLAN gateway for both sites, all traffic exchanged between a VLAN and a VxLAN segment will have to hit the VxLAN gateway on the remote site, creating a hairpinning workflow.
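The three situations discussed in this section (local server, isolated remote server, remote server reaching the gateway through a VLAN extension) can be summarised with a small Python sketch. The `path_to_vxlan` function and the hop labels are purely illustrative assumptions, not the output of any tool.

```python
from typing import List, Optional


def path_to_vxlan(server_dc: str, gateway_dc: str, vlan_extended: bool) -> Optional[List[str]]:
    """Hops from a VLAN-attached server to the single active VTEP L2 gateway.

    Returns None when the server has no layer 2 path to the gateway.
    """
    if server_dc == gateway_dc:
        return [f"VLAN in {server_dc}", f"VTEP gateway in {gateway_dc}"]
    if vlan_extended:
        # Reachable, but every frame crosses the DCI link (hairpinning).
        return [f"VLAN in {server_dc}", "DCI link", f"VTEP gateway in {gateway_dc}"]
    return None


srv1 = path_to_vxlan("DC 1", "DC 1", vlan_extended=False)
srv2_isolated = path_to_vxlan("DC 2", "DC 1", vlan_extended=False)
srv2_hairpin = path_to_vxlan("DC 2", "DC 1", vlan_extended=True)

print(srv1)
print(srv2_isolated)
print(srv2_hairpin)
```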

Independent Control Plane on each Site

Today, with the current standard implementation of VxLAN, there is no control plane for the learning process. By definition this should be a showstopper for DCI. Although the enhanced mode of VxLAN uses a controller (VSM) to continuously distribute all VTEPs for a given VxLAN to all virtual switches, only one controller is active at a time. This means that the same control plane is shared between the two locations, with an inherent risk of partitioning. This should be improved in the near future with a new control plane.


Scalability depends on the vendor and on the way VxLAN is implemented in software or hardware. What is really needed in terms of the number of segments and MAC addresses supported therefore depends on a number of conditions.

  • How many VxLANs can a software engine really support? Usually this is limited to a few thousand; don’t expect hundreds of thousands.
  • Can I distribute the consumed VxLAN resources such as VxLAN gateways between multiple engines?
  • How many IP multicast groups can the network support?

Key Takeaways

  • VTEP discovery and data plane MAC learning lead to flooding over the interconnection
    • Excessive flood traffic and BW exhaustion.
    • Exposure to security threats.
  • No isolation of L2 failure-domain
    • No rate limiters.
  • Current implementation of standard VxLAN assumes an IP multicast transport exists in the layer 3 core (inside and outside the DC)
    • IP multicast in the layer 3 backbone may not necessarily be an option.
    • Large amounts of IP multicast between DCs could be a challenge from an operational point of view.
  • Only one gateway per VLAN segment
    • More than one VTEP layer 2 gateway will lead to loops.
    • Traffic is hairpinned to the gateway.
  • VxLAN transport cannot be relied upon to extend VLAN to VLAN
    • Valid DCI solution required to close black holes.
  • No network resiliency of the L2 network overlay
  • No multi-homing preventing layer 2 back-door
    • Most validated DCI solutions offer flow-based or VLAN-based load balancing between two or more DCI edge devices.
  • No path diversity
    • It is desirable to handle private VLAN and public VLAN using different paths (HA clusters).


  • Can we extend a VxLAN segment over long distances?

VxLAN is designed to run over an IP network; hence, from a transport protocol point of view, there is no restriction in terms of distance.

  • Does that mean that it can be used natively as a DCI solution interconnecting two or multiple DCs extending the layer 2 segment?

VxLAN would necessitate additional tools as well as a control plane to address the DCI requirements listed at the beginning of this post, which the current implementation of VxLAN does not meet.

Note: as these lines are written, there is work being done at the IETF nvo3 working group for VXLAN control-plane for learning and notification processes (BGP with E-VPN) reducing the overall amount of flooding (BUM) currently created by the data plane learning implementation. In addition, an anycast gateway feature will be available soon in the physical switches allowing the distribution of active VTEP gateways on each leaf. Stay tuned!


Until VxLAN evolves with a control plane and implements the necessary tools needed for DCI, the best recommendation to interconnect two VxLAN network fabrics is to use a traditional DCI solution (OTV, VPLS, E-VPN).


25 – Why the DC network architecture is evolving to fabric?

The Datacenter network architecture is evolving from the traditional multi-tier layer architecture, where the placement of security and network service is usually at the aggregation layer, into a wider spine and flat network also known as fabric network ( ‘Clos’ type), where the network services are distributed to the border leafs.

This evolution is intended to improve the following:

  • Flexibility: allows workload mobility everywhere in the DC.
  • Robustness: while dynamic mobility is allowed to any authorised location in the DC, the failure domain is confined to its smallest zone.
  • Performance: full cross-sectional bandwidth (any-to-any), with all possible equal-cost paths between two endpoints active.
  • Deterministic latency: fixed and predictable latency, with the same hop count between any two endpoints, independently of scale.
  • Scalability: add as many spines as needed to increase the number of servers while maintaining the same oversubscription ratio everywhere inside the fabric.

Although most of the qualities above have already been addressed successfully with the traditional multi-tier architecture, today’s data centres are experiencing an increase in East-West data traffic that is the result of:

  • Adoption of new software paradigm of highly distributed resources
  • Server virtualization and workload mobility
  • Migration to IP based storage (NAS)

The above is responsible for the recent interest in deploying multi-pathing and segment-ID-based protocols (FabricPath, VxLAN, NVGRE). The evolution of network and compute virtualization is also creating a new demand for segmentation inside the same fabric, to support multiple subsidiaries or tenants on top of a shared physical infrastructure. This helps IT managers save money by reducing their CAPEX while reducing the time it takes to deploy a new application.

This raises a couple of important questions about DC Interconnect, such as:

  • Is my DCI still a valid solution to interconnect the new fabric-based DC?
  • Can I extend the network overlay deployed inside my DC fabric to interconnect multiple fabric networks together and form a single logical fabric?

Before elaborating on those critical questions in a separate post, it is important to understand, through this article, why and how the intra-DC network is changing into a fabric network design.

The adoption of the Cloud consumption model is taking over from the legacy monolithic architecture. The evolution of the data centre network architecture is essentially driven by the evolution of software, which in turn is evolving because of the increasing performance of the hardware (compute).
The following non-exhaustive list highlights the important changes happening inside the DC:
  • On the application side, the new software paradigm of highly distributed resources is emerging very quickly. Consequently, modern software frameworks such as BigData (e.g. Hadoop), used to collect and process huge amounts of data, are now increasingly deployed in most large enterprise DCs.
  • This massive dataset processing directly impacts the amount of data to be replicated for business continuity.
  • The growth of virtualisation allows enterprises to accelerate the use of application mobility for dynamic resource allocation or for maintenance purposes, with zero interruption for the business.
  • We also need to pay attention to SDN applications and how services might be invoked across different elements (compute, network, storage, services) via their respective APIs.
  • On the physical platform side, with the significant evolution of CPU performance and the increased memory capacity at lower cost, the (*)P2V ratio has exceeded 1:100, resulting in a utilisation of the access interfaces connecting physical hosts that is close to 100% (at peaks) of their full 10Gbps capacity.

(*) The P2V ratio is the theoretical capacity of converting a physical server into a certain number of virtual machines, the ratio being the number of virtual entities per physical device. In a production environment, the P2V ratio for applications is very conservative, usually under 1:40. This is mainly driven by the number of vCPUs that an application has been qualified for. However, the ratio for virtual desktops is usually higher.

Consequently, the P2V ratio directly impacts the oversubscription ratio of the bandwidth between the access layer and the distribution layer, which has evolved from the historical 20:1 to 10:1 a few years back, and more recently down to 1:1. This drives the total bandwidth of the uplinks between the access layer and the aggregation layer to be close or equal to the total bandwidth of all active access ports that serve the physical hosts.

  • With the implication of massive dataset processing, the IO capacity of the servers must be boosted. There is therefore an increased integration of DAS inside the server to accommodate spindle-bound performance, including SSD for fast-access storage and NAS for long-term, large-capacity storage.

UCS c240 M3


All these changes are increasing the server-to-server network communication inside the DC, commonly called East-West (E-W) traffic. The outcome is that while we have been focusing on North-South (N-S) workflows for years, with the quick adoption of virtualisation and BigData the E-W flow has already exceeded the N-S traffic, and it is expected to represent more than 80% of the traffic flows inside the DC by the end of this year.

Let’s take the following example with a large DC. We will use 480 physical servers to keep the math simple. There are different options that we can think of, but for the purpose of this example we consider 24 physical servers per rack and we assume each ToR switch is populated at 50%, implying 24 active 10GE interfaces per device. To carry that 240 Gbps of access bandwidth without oversubscription, each ToR switch requires 6 x 40GE uplinks, load-distributed to the upstream aggregation devices. Considering that the servers are usually dual-homed to a pair of ToR switches for high availability, it is then required to provision twice as many links. On the server side, HA is configured in Active/Standby fashion (NIC Teaming in Network Fault Tolerant mode), hence we don’t change the oversubscription ratio. That gives us a total of 20 racks with 2 ToR switches per rack. Theoretically we would need a total of 12 x 20 = 240 40GE uplinks distributed to the aggregation layer. With a pair of aggregation switches, each one needs to support 120 x 40GE at line rate.
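The arithmetic of this example can be checked with a short sketch. The function and parameter names are illustrative only, and the defaults assume a 1:1 oversubscription target and a pair of aggregation switches, as the example implies.

```python
import math

def size_uplinks(servers=480, servers_per_rack=24, tors_per_rack=2,
                 access_gbps=10, uplink_gbps=40, aggregation_switches=2,
                 oversubscription=1.0):
    """Return rack, uplink and aggregation port counts for the example above."""
    racks = math.ceil(servers / servers_per_rack)
    # Each ToR switch has one active access port per server in its rack.
    access_bw_per_tor = servers_per_rack * access_gbps
    # Uplinks needed per ToR to reach the target oversubscription ratio.
    uplinks_per_tor = math.ceil(access_bw_per_tor / (uplink_gbps * oversubscription))
    total_uplinks = uplinks_per_tor * tors_per_rack * racks
    return {
        "racks": racks,
        "uplinks_per_tor": uplinks_per_tor,
        "total_uplinks": total_uplinks,
        "ports_per_aggregation_switch": math.ceil(total_uplinks / aggregation_switches),
    }

result = size_uplinks()
print(result)
```

Changing `oversubscription` to 2.0 halves the number of uplinks per ToR switch, which shows how directly the ratio drives the port count at the aggregation layer.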

There have been several discussions around 40GE-SR deployment and its high impact on CAPEX, due to the specific requirement for a new 12-fibre ribbon with MPO connectors for each 40GE-SR link. Actually there is a great alternative with the 40GE SR-BiDi QSFP, which allows the IT manager to migrate from 10GE over the existing fibre (both the 10GE MMF cable infrastructure and the patch cables with the same LC connectors), removing this barrier.


Historically, the aggregation layer has had the following roles:

  • It aggregates all southbound access switches, usually maintaining the same layer 2 broadcast domains stretched between all access PoDs. This has been used to simplify the provisioning of physical and virtual servers.
  • It is the first hop router (aka default gateway) for all inside VLANs. This function can also be provided by the firewall or the server load balancer, which are usually placed at the aggregation layer.
  • It provides attachment to all security and network services.
  • It provides the L2/L3 boundary of the data centre and the gateway to the outside world through an upstream Core layer.

The placement of those functions has been very efficient for many years because most of the traffic used to flow North-South, essentially driven by not-so-mobile bare metal servers and traditional software frameworks.

With the evolution of the DC, it is not necessarily optimal to centralise all these network services at the aggregation layer. The distribution spine (formerly known as the aggregation layer) can stay focused on switching packets between any leafs (f.k.a. access switches).

Note: This role could be compared to the Provider router “P” in an MPLS core, which is only responsible for switching the labels used to route packets between “PE” routers.

With the ability to chain network services (physical and virtual service nodes) using policy-based forwarding decisions, we can move the network services down to the leaf. Another, optional, change emerging with the new generation of DC fabric networks is that the Layer 2 and Layer 3 boundary moves down to the leaf layer. In addition, the layer 2 failure domain tends to be confined to its smallest zone, reducing the risk of polluting the whole network fabric. This results in an anycast gateway function that distributes the same default gateway service to multiple leafs, while maintaining the same level of flexibility and transparency for workload mobility. Finally, we can move the DC boundary behind a border leaf (the boundary of the fabric network), through which only the traffic to and from outside the network fabric passes.

Once these functions are removed from the aggregation layer, it is no longer limited to two tightly coupled devices, as they no longer need to work as a pair. On the contrary, additional spine switches can be added for scalability and robustness.


Back to the previous example: the new generation of multi-layer switches and line cards offers a very large fabric capacity of 40GE interfaces at line rate, which keeps the traditional multi-tier architecture valid. However, it is important to keep the failure domain as small as possible, making sure that no failure or maintenance work impacts the performance of the whole fabric network. If one spine switch is removed from the fabric for any reason, an N+2 design is certainly preferable to a single remaining switch handling all the business traffic. Finally, at very large scale, the wider-spine approach makes it easier to maintain the same oversubscription ratio while adding new racks of servers.
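To put a number on the failure-domain argument, here is a minimal sketch. It assumes an idealised fabric in which every leaf has one equal-speed uplink to every spine; the function name is hypothetical.

```python
from fractions import Fraction

def remaining_capacity(spines, failed=1):
    """Fraction of leaf-to-spine bandwidth left after `failed` spines are removed.

    Idealised model: each leaf has one equal-speed uplink to each spine.
    """
    if not 0 <= failed <= spines:
        raise ValueError("failed must be between 0 and the number of spines")
    return Fraction(spines - failed, spines)

for spines in (2, 4, 8):
    left = remaining_capacity(spines)
    print(f"{spines} spines, 1 removed: {left} of the fabric bandwidth remains")
```

With a traditional pair, losing one device leaves half of the bandwidth on a single switch; with a wider spine the same event removes a much smaller share.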

It may also be worth looking at a possibly lower CAPEX: small spine devices are expected to cost less than a large modular switch.

So what are the risks of keeping the traditional hierarchical multi-tier design?

Obviously it depends on the enterprise application types and business requirements, and the traditional multi-tier design is still valid for many enterprises. But if, as an IT administrator, you are concerned by the exponential growth of E-W workflows and a large number of physical servers, you may encounter some limitations:

  • Limited number of active layer 3 network services
    • Although some services, like the default gateway, may run Active/Active using vPC
  • Unique connection point for all N-S and E-W traffic, with the risk of being oversubscribed.
  • Unpredictable latency due to multiple hops between two endpoints
  • Shared services initiated at the same aggregation layer affect all resources: e.g. with all SVIs/default gateways configured at the aggregation layer, a switch control plane resource issue (e.g. a DDoS attack) can impact all server-to-server traffic. Thousands of centralised VLANs may also strain network resources such as logical ports.

It’s possible for a multi-tier DC architecture to evolve seamlessly into a ‘Clos’ fabric network model. The fabric can be enabled gradually on a per-PoD basis without impacting the Core layer and the other PoDs. One option is to build a ‘fat-tree Clos’ fabric network with 2 spines and a few leafs to start with, and add more leafs and spines as the compute stack grows.

This seamless migration to the fabric network can be achieved either:

  • by enabling a Layer 2 multi-pathing protocol such as FabricPath on an existing infrastructure.
  • by deploying new switches, or upgrading existing modular switches with new line cards, that support VxLAN Layer 2 and Layer 3 gateways in hardware.

Both FabricPath and VxLAN offer a network overlay using a multi-pathing underlay network. vPC can coexist with either option if needed, although the trend is to move it to the leaf for all dual-homed services (FEX, servers, etc.). Consequently, whatever protocol is used to build the Clos fabric network, STP is removed between the Spine and Leaf nodes. All paths become forwarding, breaking the barrier of the 2 traditional uplinks offered by Multi-chassis EtherChannel (MEC).


Scalability and segmentation for multi-tenancy are improved using the hardware-based network overlay supporting 24-bit identifier (e.g. FabricPath or VxLAN).
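To illustrate what the 24-bit identifier buys, the sketch below compares it with the 12-bit 802.1Q VLAN ID and packs a VNI into the 8-byte VXLAN header layout from RFC 7348. The helper name and the VNI value 5000 are illustrative, not taken from the setup described here.

```python
import struct

VLAN_ID_BITS = 12      # 802.1Q VLAN ID field
SEGMENT_ID_BITS = 24   # VXLAN Network Identifier (VNI)

def vxlan_header(vni):
    """Build the 8-byte VXLAN header (RFC 7348 layout) for a given VNI."""
    if not 0 <= vni < 2 ** SEGMENT_ID_BITS:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24   # 'I' flag set, followed by 24 reserved bits
    vni_word = vni << 8       # VNI in the upper 24 bits, low byte reserved
    return struct.pack("!II", flags_word, vni_word)

print(2 ** VLAN_ID_BITS, "VLAN IDs versus", 2 ** SEGMENT_ID_BITS, "segment IDs")
print(vxlan_header(5000).hex())
```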

Inter-VLAN routing can be performed at each fabric leaf node. The same gateway for a given subnet is distributed dynamically on each leaf, so a local gateway can exist on each leaf for each given subnet. This is not mandatory, but it can be enabled with some enhanced distributed fabric architectures using FabricPath (DFA) and VxLAN (near future, stay tuned). It’s a valid option for confining the layer 2 broadcast traffic to the local leaf, thus preventing flooding across the fabric network. To save switch resources such as logical ports, VLANs, subnets and SVIs can be created dynamically and on demand, just as needed.

In summary, as of today, the Clos network fabric becomes very interesting for IT managers concerned by the exponential increase of E-W traffic, the rapid growth of servers and virtualisation, and the massive amount of data to process. The network fabric gives a better utilisation of the network resources with all links forwarding, it distributes the network service functions closest to the application, and it maintains the same predictable latency for any traffic. It also allows the fabric capacity to be increased as needed by simply adding spines and leafs transparently.

That gives a short high level overview of the DC network fabric and why we need to think of this emerging architecture.

I recommend that you read this excellent paper from Mohammad Alizadeh and Tom Edsall, “On the Data Path Performance of Leaf-Spine Datacenter Fabrics”.



24 – Enabling FHRP filter

 Isolating Active HSRP on both sites.

Several of you have asked for details on the HSRP filtering configuration discussed in the LISP IP Mobility article (23 – LISP Mobility in a virtualized environment), so here is a short description of the configuration. Thanks to Patrice and Max for their help in validating this configuration.

FHRP isolation is required only with LAN extension. It comes with two flavours:

  1. It works in conjunction with any ingress path optimisation technique such as LISP IP Mobility, LISP IGP Assist, intelligent DNS (GSLB), SLB Route Health Injection tools, or even a manual setting (e.g. changing the metric for a specific route). Whatever solution is used to redirect the traffic from the end-user to the location supporting the active application, to be really efficient and to eliminate “tromboning” workflows, the egress traffic must be controlled so that it uses the same WAN edge gateway as the ingress traffic, as described in the note 15 – Server to Client traffic.
  2. Although it depends on the distance between remote DCs, it is usually recommended to move the whole software framework when migrating an application from one site to another, again to reduce as much as possible the “tromboning” workflow that impacts performance. In short, migrate the Web tier together with its Application and Database tiers. As server-to-server (tier-to-tier) traffic requires the same default gateway wherever the server is located, the way to prevent the traffic from returning to the original default gateway is to duplicate the same gateway on each site. This is also described in 14 – Server to Server Traffic. However, because the VLAN is extended between sites, it is mandatory to filter the handshake protocol (HSRP multicast and vMAC) between gateways belonging to the same HSRP group.

HSRP Setup

In our previous example with LISP IP mobility we used CSR 1000v as default gateway inside each tenant container.

The CSR 1000v routers deployed on each site offer the same default gateway to the server resources; hence, for both to be active, one on each site, it is mandatory to filter out HSRP multicasts and the vMAC between the two locations.
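As an illustration of what has to be matched by those filters, the sketch below classifies traffic using the well-known HSRP parameters: hellos sent to 224.0.0.2 (v1) or 224.0.0.102 (v2) on UDP port 1985, and virtual MAC addresses of the form 0000.0c07.acXX (v1) or 0000.0c9f.fXXX (v2). This is a Python model for explanation only, not the OTV VDC configuration itself, and the function names are hypothetical.

```python
import re

HSRP_UDP_PORT = 1985
HSRP_MULTICAST_GROUPS = {"224.0.0.2", "224.0.0.102"}   # HSRPv1 and HSRPv2 hellos
HSRP_V1_VMAC = re.compile(r"^0000\.0c07\.ac[0-9a-f]{2}$")
HSRP_V2_VMAC = re.compile(r"^0000\.0c9f\.f[0-9a-f]{3}$")

def is_hsrp_hello(dst_ip, udp_dst_port):
    """True when a packet looks like an HSRP hello that must stay local to the site."""
    return dst_ip in HSRP_MULTICAST_GROUPS and udp_dst_port == HSRP_UDP_PORT

def is_hsrp_vmac(mac):
    """True when a MAC address (dotted Cisco notation) is an HSRP v1 or v2 vMAC."""
    mac = mac.lower()
    return bool(HSRP_V1_VMAC.match(mac) or HSRP_V2_VMAC.match(mac))

print(is_hsrp_hello("224.0.0.102", 1985), is_hsrp_vmac("0000.0c07.ac0a"))
```

Anything matching either test is what should be dropped, or not advertised, across the LAN extension so that each site keeps its own active gateway.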

This filtering setup is initiated on the OTV VDC:

  1. Create and apply the policies to filter out HSRP messages (both v1 and v2). This configuration concerns VLAN 1300 and VLAN 1301 for the following test with LISP; additional VLANs (or a range) can be added for other purposes.
  2. Configure ARP filtering to ensure that ARP replies (or gratuitous ARPs) are not received from the remote site.
  3. Apply a route-map on the OTV control plane to avoid communicating vMAC information to remote OTV edge devices. OTV uses a control plane to populate each remote OTV edge device with its local Layer 2 MAC table. Because this MAC table is built from regular MAC learning, it is important that OTV doesn’t inform any remote OTV edge device about the HSRP vMAC addresses, since they exist on each site.

HSRP and Nexus 1000v

By default the Nexus 1000v implements a loop detection mechanism based on the source and destination MAC addresses. Any frame with a MAC address duplicated between a vEthernet interface and the uplink ports is dropped.

Thus, when a first-hop redundancy protocol such as HSRP is enabled on the upstream virtual routers (e.g. CSR 1000v), it is critical to disable loop detection on the vEthernet interfaces supporting that protocol.

For more details please visit :


HSRP filtering – Quick check

App1 ( is currently located in DC-1 sitting on top of VLAN 1300 where CSR-106 is its local default gateway

App1 continually pings a loopback address located somewhere in the outside world. The show command for the outbound interface on the local CSR-106 shows a rate of 1 packet per second as expected, while the CSR-103 on the remote site doesn’t process any packets.

Live Migration happens through OTV, moving App1 to DC-2 using vCenter.

Counters show the new local CSR-103 on DC-2 taking over the default gateway for the App1 on VLAN 1300





23 – LISP Mobility in a virtualized environment (update)

Note: When I talked about this solution almost a year ago, we were using alpha versions of the software releases. Some features have since improved and some command lines have changed with the latest released code, so I’m updating the original article to include the final CLI syntax. Hope that helps!

Special thanks to Patrice Bellagamba and Marco Pessi for their great support 

Enjoy the revolutionary evolution of DCI 🙂 


The following article describes the mechanism that detects and notifies the movement of virtual machines across multiple locations, and how to redirect the traffic from the end-user accordingly to the new location supporting the active application. When deployed in conjunction with subnet extension across sites (e.g. OTV), this solution is also known as Ingress Path Optimisation using IP localisation, as it automatically and dynamically redirects the original request directly to the DC that hosts the application (VM), reducing hairpinning workflows via the primary site.

You could say that this is exactly what LISP IP Mobility aims to achieve. That’s true; however, the concept here is to go a bit further with a hybrid network compared to a traditional physical deployment, and to see how this solution fits with the rapid evolution of virtualisation (network and hypervisor). It does so by detecting the movement of the virtual machine from inside a virtual container (a.k.a. a tenant container in the hybrid cloud service), which is usually isolated from the LISP Site gateway by multiple layer 3 hops (e.g. virtual FW, virtual SLB or virtual Layer 3 routers).

A tenant container consists of one or multiple application zones belonging to a co-tenant, with virtual network and security services (virtual SLB, virtual FW…) enabled inside the container according to the service class that the client has subscribed to. Hundreds or thousands of tenant containers can co-exist in a hybrid cloud, all containers being isolated from each other, usually through L3 segmentation.

Workload Mobility across multiple locations


LISP IP Mobility: a brief description

Check this section for more details 20 – Locator/ID Separation Protocol (LISP)

As a very short, high-level description: LISP IP Mobility, in conjunction with a DCI LAN extension technique such as OTV, dynamically redirects the traffic from the end-user to the location where the virtual machine has moved, while keeping the session stateful.

LISP IP Mobility supports two main functions:

The first main function of LISP Mobility is the LISP Site gateway, which initiates the overlay tunnel between the two end-points, encapsulating/decapsulating the original IP packet with an additional IP header (LISP component identifiers). Usually, but not exclusively, the tunnel is built between a remote location where the end-user resides and the DC where the application exists. The IP-in-IP encapsulation is performed by the Ingress Tunnel Router (ITR) or by a Proxy Tunnel Router (PxTR). The overlay workflow is decapsulated by the Egress Tunnel Router (ETR) and forwarded to the server of interest. In general the ETR function sits at the edge of the DC. When a VM migrates from one data centre to another, the movement of the machine, known as the EndPoint Identifier (EID), is detected in real time by the LISP device, and the association between the EID and the Locator is immediately updated in the RLOC Mapping database. All traffic destined to the VM is dynamically, automatically, and transparently redirected to the new location.

The other important component in LISP is the LISP Multi-hop function. Indeed, it is not usual to attach the application/server directly to the LISP gateway device (xTR). For security and optimisation reasons, stateful devices such as firewalls, IPS and load balancers are often deployed between the applications and the LISP encap/decap function. Because of the multiple routed hops that may exist between the application (EID) and the border LISP gateway (xTR), it becomes challenging for the LISP xTR device to detect the movement of the EID (i.e. the VM). Thus a LISP agent located at the first hop router (the default gateway of the server of interest) must be present to detect the movement of the EID and to notify its local xTR accordingly, which in turn applies the required action (e.g. updating the LISP mapping database). This service is called LISP Multi-hop; the function itself, initiated on the default gateway, is called LISP First Hop Router (LISP FHR).

For the purpose of this document, the workload mobility is addressed with a combination of LISP mobility with a LAN extensions solution (OTV) to offer a fully transparent stateful live migration (zero interruption). This mode is known as LISP Extended Subnet Mode (ESM).

The other great method made possible by LISP, which is out of scope here, is to allow IP mobility without IP address renumbering and, above all, without any LAN extension technique (this is also known as Across Subnet Mode, ASM). This mode is often deployed to address cold migration requirements (usually requiring the applications to restart) in a fully routed environment, providing dynamic redirection to the active DC. Avoiding IP address renumbering not only eases the migration of the enterprise’s applications, but also reduces the risks and accelerates the deployment of those services.

LISP Evolution requirements

Many tests on LISP IP mobility have already been performed in ESM and ASM modes, but always running both the xTR function (encap/decap) and the Multi-hop / First Hop Router (FHR) function on physical LISP devices (Nexus 7000 or ASR 1000).

Although not many DCs are fully virtualized today, there is an increasing need for data centres to support a mix of physical and virtual network and security services, in conjunction with virtual machines and bare metal platforms, in a hybrid network environment. This is especially important in the SP cloud supporting virtual DCs from enterprises (virtual tenant containers) deployed on a physical multi-tenant infrastructure. As of today, you still need ASIC-based network devices to maintain the same level of performance when handling multiple network overlays.

Each tenant DC is represented by an “autonomous” virtual container (Virtual DC) inside the SP public cloud or the enterprise private cloud, supporting respectively the tenant’s or the subsidiary’s applications with the level of SLA that the client subscribed to (e.g. virtual switching, virtual firewall, IPS, virtual secure gateway, multiple security zones, multiple tiers). The goal here is to test the LISP First Hop Router function inside the tenant container: an active session established from a branch office must be dynamically redirected to a distant DC supporting the stateful hot migration of the application of interest, with zero interruption between the end-user and the apps as well as between the tiers of the software framework (front-end and back-end). For the first test, we will keep the network services as simple as possible.

Test description :

For the tests, LISP is deployed in a hybrid model, with the xTR function performed by the physical edge router (Nexus 7000) while the LISP First Hop Router (FHR) function is initiated inside the virtual tenant container using the Cisco virtual router CSR 1000v (Cloud Service Router), directly facing the virtual servers on VLAN 1300 (Front-end) and VLAN 1301 (Back-end); the CSR LISP FHR is the default gateway of the virtual machines. In addition, the same Nexus 7000 platform offers the OTV services from a dedicated Virtual Device Context (VDC). A VDC is a form of virtual chassis available on the Nexus 7000, fully abstracted from the physical device.

Traffic destined to the application is encapsulated between the ITR (located on the end-user side) and the ETR initiated in the DC hosting currently the active application (VM). The traffic is then decapsulated from the LISP ETR  toward the application servers.

The first series of tests consists of a tenant container built with 2 security zones (Front and Back-end) from which a two tier application is currently accessible with no access control list of any form.

The main goal of this test is to detect the movement of the EID (the IP identifier of a VM) automatically and dynamically using the remote LISP FHR (CSR 1000v), which relays the information first to its local LISP ETR and then to the LISP map servers, so that the ITR at the branch office can initiate the LISP tunnel toward the new DC.

Two data centres, DC-1 and DC-2, both supporting the LISP xTR and LISP Multi-hop functions, serve the same virtual container environment for a specific tenant. The tenant container is provisioned using a dedicated virtual router on each site. Although a CSR 1000v is a virtual machine and could easily migrate from one site to another, it has been decided that only the virtual machines supporting the tenant’s applications will migrate from one DC to the other. The LISP FHR routers are duplicated on both sides, offering the same default gateway parameters on the server side (here VLAN 1300 and VLAN 1301) to maintain stateful live migration without any interruption. However, they use different outbound network connectivity to communicate with the outside world (VLAN 10 on DC-1 and VLAN 11 on DC-2).

Note that for the purpose of this test I am using a single CSR 1000v in each DC. In a production private or public cloud, all network and security services should be redundant.

To offer symmetrical egress traffic and to reduce hairpinning, FHRP isolation is enabled on the OTV virtual device context (Nexus 7000) so that the default gateway can be replicated on both sides without posing any particular issue due to duplicate identifiers. After its migration the VM uses the same default gateway (same VIP and vMAC). See 24 – Enabling FHRP filter for additional details on HSRP filtering.

The LISP ETR in turn dynamically notifies a LISP Mapping database, which notifies the original LISP ETR; thus all xTRs are informed, automatically or on request, about the new location of the endpoint identifier. The whole notification and update process completes very quickly, and full redirection is performed in a couple of seconds without any interruption.
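The notification chain described above (first hop router, then local ETR, then mapping database, then original ETR) can be modelled with a few lines of Python. This is a toy sketch for illustration: the class and method names are hypothetical, the EID is a documentation address, and real LISP control messages are not represented.

```python
class MappingSystem:
    """Toy mapping database: tracks which ETR currently owns each EID."""

    def __init__(self):
        self.owner = {}                      # EID -> Etr instance

    def register(self, eid, etr):
        previous = self.owner.get(eid)
        self.owner[eid] = etr
        # Tell the original ETR that the EID now lives elsewhere.
        if previous is not None and previous is not etr:
            previous.eid_moved_away(eid)


class Etr:
    """Site gateway: registers EIDs reported by its downstream first hop router."""

    def __init__(self, name, mapping_system):
        self.name = name
        self.mapping_system = mapping_system
        self.local_eids = set()

    def eid_detected(self, eid):
        self.local_eids.add(eid)
        self.mapping_system.register(eid, self)

    def eid_moved_away(self, eid):
        self.local_eids.discard(eid)


class FirstHopRouter:
    """Default gateway of the servers: detects a new EID and notifies its ETR."""

    def __init__(self, name, upstream_etr):
        self.name = name
        self.upstream_etr = upstream_etr

    def host_seen(self, eid):
        self.upstream_etr.eid_detected(eid)


mapping = MappingSystem()
etr_dc1, etr_dc2 = Etr("DC-1", mapping), Etr("DC-2", mapping)
fhr_dc1 = FirstHopRouter("CSR-106", etr_dc1)
fhr_dc2 = FirstHopRouter("CSR-103", etr_dc2)

fhr_dc1.host_seen("192.0.2.10")   # the VM starts in DC-1
fhr_dc2.host_seen("192.0.2.10")   # live migration: the DC-2 FHR now sees the VM
print(mapping.owner["192.0.2.10"].name, etr_dc1.local_eids)
```

After the second call the mapping points at DC-2 and the DC-1 ETR no longer lists the EID as local, which is the state an ITR needs in order to build its tunnel toward the new site.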

In order to test the function of First Hop Router, let’s focus on single device attachment (no redundant HA config are described here).

Four main functions are deployed for LISP Mobility:

  1. The LISP Site gateway, or egress Tunnel Router (eTR), terminates the tunnel (network overlay) established with a remote Tunnel Router (Ingress Tunnel Router, iTR). It removes the LISP header and sends the original IP packet toward its final Endpoint Identifier (EID) destination. For a comprehensive LISP workflow it is assumed that the eTR resides inside the DC. This function is usually performed at the Core or distribution layer of the DC architecture. The eTR is also used to register the subnet of the dynamic host with the mapping server. For our current test purposes this function is initiated at the aggregation layer (Nexus 7000). On the Nexus 7000, encapsulation and de-encapsulation are performed in hardware with no impact on performance.
  2. The iTR initiates the tunnel toward the remote site. For a comprehensive LISP workflow it is assumed that the iTR resides in the remote site. Upon a ‘no route’ event or a ‘default route’ event, the iTR asks, via the LISP map server, for the association between the unknown EID and its current location, called the RLOC.
  3. The LISP First Hop Router (FHR) provides the Multi-hop functionality when the endpoint (EID) is not directly routed by the LISP gateway but by an additional router (default gateway) facing the server. For the current test purposes this function is initiated inside the tenant container using the Cloud Service Router (CSR 1000v).
  4. The LISP Mapping Database (M-DB) is responsible for maintaining the location of EIDs in real time. It comprises two sub-functions: Map-Resolver (MR) and Map-Server (MS).
    • The Map-Resolver receives the map requests directly from the iTRs requesting a mapping between an Endpoint Identifier and its current location (Locator), and pushes the map requests to the Map-Server system where all eTRs are registered, exchanging EID mappings between all LISP devices.
    • The Map-Server system collects the map requests and forwards them to the registered eTR that currently owns the EID. That eTR then responds directly to the original iTR looking for that EID.

These two functions can cohabit on the same physical device, but this is not mandatory and they can be distributed over dedicated hosts. For the purpose of our test, both the MS and MR functions run on the same device.
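The lookup side of the mapping database can be sketched as follows. This is a simplified toy model: the class names and the addresses (documentation ranges) are hypothetical, and the Map-Server answers the lookup itself instead of forwarding the request to the owning eTR as described above.

```python
import ipaddress

class MapServer:
    """Toy Map-Server: remembers which RLOC registered each EID prefix."""

    def __init__(self):
        self.registrations = {}          # ip_network -> RLOC string

    def register(self, eid_prefix, rloc):
        self.registrations[ipaddress.ip_network(eid_prefix)] = rloc

    def lookup(self, eid):
        """Longest-prefix match of an EID against the registered prefixes."""
        address = ipaddress.ip_address(eid)
        matches = [p for p in self.registrations if address in p]
        if not matches:
            return None
        return self.registrations[max(matches, key=lambda p: p.prefixlen)]


class MapResolver:
    """Toy Map-Resolver: receives iTR map requests and hands them to the Map-Server."""

    def __init__(self, map_server):
        self.map_server = map_server

    def map_request(self, eid):
        return self.map_server.lookup(eid)


ms = MapServer()
mr = MapResolver(ms)
ms.register("192.0.2.0/24", "198.51.100.1")    # server subnet registered by the DC-1 eTR
ms.register("192.0.2.10/32", "203.0.113.1")    # one host that has moved behind the DC-2 eTR

print(mr.map_request("192.0.2.10"), mr.map_request("192.0.2.20"))
```

The more specific host entry wins over the subnet entry, so only the migrated EID is redirected to the second locator while the rest of the subnet stays mapped to the first one.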

The eTR (egress Tunnel Router) and iTR (Ingress Tunnel Router) are often called xTR, as each xTR can act as both iTR and eTR depending on the direction of the workflow. When traffic is encapsulated with a LISP header and sent toward the remote site (Locator), the LISP gateway acts as iTR; when the encapsulated traffic hits the Locator in the DC of interest, the LISP gateway acts as eTR, removing the LISP header and sending the original IP packet toward its final destination (EID).

The LISP mapping database initially used a BGP-based mapping system called LISP Alternative Logical Topology (LISP+ALT); however, this has now been replaced by a DNS-like indexing system called DDT, inspired by LISP-TREE.





The Mapping Database:

The Mapping Database is configured with the EID prefixes concerned by the tunnel routers. As each site can act as ETR (for requests and for return traffic), the Branch_1 office is also configured with an eid-prefix. This is optional; it depends on whether the return traffic must be encapsulated or not. For this test let’s assume that both inbound and outbound traffic flows are LISP encapsulated. Thus the three EID prefixes represent the different source addresses (end-users and server VLANs) concerned by the encapsulation, each with its respective network mask.

Branch Office (ITR):

For the purpose of this test, the interface loopback address 1 ( is used to simulate the remote end-user. This address will be used to send pings to the EID.

Data Center 1 : LISP ETR – Nexus 7000:

The eTR is responsible for de-encapsulating traffic from the LISP tunnel initiated by the iTR (branch office).

Loopback 10 provides the well-known IP address used as the RLOC in DC-1.

The database-mapping describes VLAN 1300 and VLAN 1301, which are subject to EID mobility.

Interface VLAN 10 is the notification target for the downstream LISP First Hop Router (CSR-106). Any EID move detected within VLAN 1300 or VLAN 1301 is notified to Interface VLAN 10.

Data centre 1 : LISP First Hop Router on the tenant container CSR:

Prerequisite prior to configuring LISP: LISP requires the Premium license to be activated, with XE-3.11 as the minimum release.

The database mapping covers VLANs 1300 and 1301, from which applications (EIDs) will migrate. As soon as a new EID movement is detected, the CSR 1000V First Hop Router notifies its upstream LISP gateway (xTR) using the destination IP address of the outbound VLAN 10 ( configured in the DC-1 aggregation 1 (aka eTR).


Let’s also look at the configuration details of the LISP devices belonging to DC-2.

Data Center 2 : ETR – Aggregation Nexus 7000


Data centre 2 : LISP First Hop Router on the tenant container CSR1000v:

The database mapping likewise covers VLAN 1300 and VLAN 1301, from which the applications of interest (EIDs) will migrate. The CSR1000v LISP First Hop Router notifies any EID movement to the IP address of the upstream router acting as LISP gateway on the outbound VLAN 11 ( for DC-2 aggregation 1).

Test results:  a brief workflow

  1. End-user ( sends a request to the application (
  2. The iTR (branch office) intercepts the user request and checks the location (RLOC) of the application (EID) with the Map Server (
  3. The Map Server (MS) replies with the location (mapping EID <=> RLOC) of the application, which is the eTR in the primary DC-1
  4. The iTR encapsulates the packet and sends it to RLOC ETR-DC-L (
  5. The application migrates to the remote site, into a similar tenant container to which VLAN 1300 (Apps) is extended.
  6. The LISP First Hop Router (CSR-103) in the remote location DC-2 detects the movement of the application and informs its local eTR ( about the new EID
  7. Meanwhile, the eTR in DC-2 informs the MS about the new location of the application in DC-2
  8. The mapping system (MS and MR, co-located in this test) updates eTR DC-1 accordingly
  9. The eTR in DC-1 updates its table (
  10. The iTR continues sending traffic to eTR DC-1
  11. The original eTR in DC-1 replies to the remote iTR in the branch office with a LISP Solicit Map Request (SMR), informing it that it is no longer the owner of this EID and that the iTR needs to get the new location from the Map-Resolver (MR)
  12. The iTR sends a Map Request to the Map-Resolver and gets the response directly from the eTR in DC-2, which is the new RLOC for the EID. Traffic from the iTR is consequently redirected to the RLOC in DC-2 (eTR DC2)
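The workflow above can be condensed into a small simulation. It is only a sketch under simplifying assumptions: the class names, the `"lost"`/`"delivered"` return values and the RLOC labels are hypothetical, the First Hop Router notification is folded into `move_eid`, and the map request is answered by the mapping system rather than by the eTR.

```python
class MapSystem:
    """MS and MR collapsed into one object, as in the test setup."""

    def __init__(self):
        self.eid_to_rloc = {}

    def register(self, eid, rloc):
        previous = self.eid_to_rloc.get(eid)
        self.eid_to_rloc[eid] = rloc
        return previous  # lets the caller notify the previous owner

    def resolve(self, eid):
        return self.eid_to_rloc.get(eid)


class ETR:
    def __init__(self, rloc):
        self.rloc = rloc
        self.local_eids = set()

    def receive(self, eid):
        # An eTR that no longer owns the EID answers with an SMR.
        return "delivered" if eid in self.local_eids else "smr"


class ITR:
    def __init__(self, map_system, etrs):
        self.map_system = map_system
        self.etrs = etrs          # RLOC -> ETR
        self.map_cache = {}       # EID -> RLOC

    def send(self, eid):
        if eid not in self.map_cache:
            self.map_cache[eid] = self.map_system.resolve(eid)
        rloc = self.map_cache[eid]
        if rloc is None:
            return "no-mapping"
        if self.etrs[rloc].receive(eid) == "smr":
            # Steps 11-12: refresh the map cache; this packet is lost.
            self.map_cache[eid] = self.map_system.resolve(eid)
            return "lost"
        return "delivered"


def move_eid(eid, new_etr, etrs, map_system):
    # Steps 5-9: the new site learns the EID and registers it; the
    # previous owner is told to drop it from its local table.
    new_etr.local_eids.add(eid)
    previous = map_system.register(eid, new_etr.rloc)
    if previous is not None and previous != new_etr.rloc:
        etrs[previous].local_eids.discard(eid)


ms = MapSystem()
dc1, dc2 = ETR("rloc-dc1"), ETR("rloc-dc2")
etrs = {dc1.rloc: dc1, dc2.rloc: dc2}
itr = ITR(ms, etrs)

move_eid("app", dc1, etrs, ms)       # application starts in DC-1
before = itr.send("app")
move_eid("app", dc2, etrs, ms)       # application migrates to DC-2
first_after = itr.send("app")        # stale cache, answered with an SMR
second_after = itr.send("app")       # cache refreshed, reaches DC-2
```

In this model exactly one packet is lost after the move, because the iTR only learns about the migration when it next sends traffic to the old RLOC.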

Pings are generated from the remote site end-user (loopback 10 @

Before the move

Initially, the application of interest ( is located in DC-1. On the eTR in DC-1, the LISP dynamic-EID table shows the EID as notified by the LISP FHR CSR-106 interface G2

1- The LISP eTR (DC-1-Agg1) located in DC-1 shows several EIDs, including silent hosts, that belong to different VLANs (1300 and 1301). The local EIDs, including our EID of interest (.14), are learned from the CSR 1000V interface

2- On the LISP FHR (CSR-106) in DC-1, the EID of interest is learned from its local interface G1 (connecting VLAN 1300).

3- On the remote site DC-2, the dynamic EID summary from the LISP eTR (DC2-Agg1) shows several EIDs, including silent hosts, located in DC-2 on different VLANs. The local EIDs are learned from the CSR 1000V interface

4- The LISP FHR in DC-2 (CSR-103) shows additional local stations that belong to the same extended VLAN 1300 as well as VLAN 1301.

5- The Mapping Database (Map server) shows the last registered RLOC for being in DC-1 (eTR, hence traffic is redirected to this RLOC until a migration is performed.

6- The native routing from the branch office knows nothing about the network hence the traffic is encapsulated and routed by LISP.

Note that if the remote site knew how to route the traffic explicitly to the final destination, no LISP encapsulation would take place. When only a default route exists (no explicit route), LISP is invoked.

However, the LISP map cache on the remote iTR knows that the EID was last seen from the eTR located in DC-1 ( associated with the locator in DC-1.
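The rule in the note above can be expressed as a small check. This is a simplified model of the behaviour described here, not the forwarding logic of any platform; the function name and the prefixes are assumptions, and a real iTR also takes its configured EID prefixes and map cache into account.

```python
from ipaddress import ip_address, ip_network


def should_lisp_encapsulate(destination, routing_table):
    """Return True when only the default route covers the destination."""
    addr = ip_address(destination)
    for prefix in map(ip_network, routing_table):
        # A prefix length of 0 is the default route; anything longer that
        # contains the destination counts as an explicit route.
        if prefix.prefixlen > 0 and addr in prefix:
            return False
    return True


branch_routes = ["0.0.0.0/0", "198.51.100.0/24"]
print(should_lisp_encapsulate("192.0.2.14", branch_routes))    # True
print(should_lisp_encapsulate("198.51.100.7", branch_routes))  # False
```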

7- Although there is no native routing toward the final destination, pings from the end-user (loopback 10 to the EID of interest succeed as the traffic is routed via LISP. The iTR knows from its map-cache that to reach the final destination, it needs to encapsulate the traffic of interest and send it toward the RLOC located in DC-1. This explains why there is no packet drop for this 1st series of pings.


Migrate the Virtual Machine from DC-1 to DC-2

8- Just after the EID has migrated to DC-2, the CSR1000v running LISP FHR in DC-2 immediately detects the movement; it then notifies its local eTR about the EID of interest, which in turn notifies the Map Server. This detection/notification takes a couple of seconds.

The original LISP FHR in DC-1 (CSR-106) is updated accordingly, and the EID immediately disappears from it.

However, it now exists on the FHR located in DC-2 (CSR-103).

9- The eTR DC2-Agg1 has been notified of the EID by its local FHR accordingly.

10- DC1-Agg1, however, has been notified as well and now holds a “Null0” entry for the host route.

11- The MS shows who last registered the EID ( being the eTR in Data Center 2), 31 seconds earlier.

12- The iTR from the branch office is not aware of any movement of the EID until it makes a request to the original RLOC, hence until next request its LISP map cache still shows the original RLOC 12.1.19.

13- A ping sent toward the original RLOC causes the original eTR to solicit a new map request (SMR) from the iTR. The iTR then queries the mapping system and updates its LISP map cache with the new location. That explains the 1st ping timeout, while all the following pings work as expected.

14- The ITR LISP map cache is therefore updated with the map source being eTR DC-2 (



This series of tests described the process, components and configuration required to reach applications running in a virtual environment within a tenant container (also known as a virtual DC), wherever they are geographically located.

An application in a virtual tenant container is able to migrate from one DC to another without any business interruption, while the user traffic is redirected automatically within a few seconds to the new location.

It demonstrated that the movement of virtual machines between fully virtual routed containers can be detected dynamically in real time, with the upstream physical LISP gateway being notified and in turn informing the LISP Map-Server of the new location of the EID. The physical Nexus 7000 can centralize the encapsulation/de-encapsulation of the user traffic into a LISP tunnel while the detection of the movement is performed directly inside the tenant container using the CSR1000v.

This approach offers two main benefits:

  1. It offers a very granular deployment of tenant containers, maintaining the autonomous approach for each virtual DC and selectively allowing the detection of new virtual machines to trigger a dynamic redirection of the traffic flow from a branch office to the DC hosting the active virtual machine.
  2. It gives great flexibility to add any additional Layer 3 device as a bump in the wire between a fully virtual container (and/or bare-metal server) and the core layer of the data center, while maintaining the detection of any movement of EIDs between virtual containers.



22 – Which DR or DA solution do you need ?


Having described in the previous posts the different components required to interconnect multiple DCs for business continuity and disaster recovery, I think it may be useful to provide a series of questions that you can ask yourself to better understand which solution best fits your business.


Consider 2 main models for Business Continuance:

Disaster Recovery (DR) Plan :

  • Disaster Recovery usually means that the service is “restarted” in a new location.
  • Time to recover the application is mainly dictated by the mission and business criticality levels.
  • Recovery happens manually or automatically after an uncontrolled situation (power outage, natural disaster, etc.)
    • Automatic recovery may require additional components to be efficient, such as Intelligent DNS, Route Host Injection, LISP.
  • Interconnection and load distribution are achieved using intelligent L3 services
  • Traditionally, a DR scenario implies that the hosts/applications/services are renumbered (that is, moved to a new IP address space) according to the new location. Note, however, that LISP now makes it possible to migrate and locate IP components without any IP parameter changes (and without LAN extension either).

Disaster Prevention/Avoidance (DP/DA) Plan:

  • Interconnection is usually achieved using L2 extension for business continuity (without any interruption of the current active sessions; stateful sessions are maintained transparently), but it can also be achieved using L3 extension techniques (sessions are stopped and immediately reinitiated in the new location)
  • The recovery happens automatically after a controlled situation (maintenance, machine migration)

For both scenarios listed above, network services can be leveraged to provide automatic redirection to the active DC in case of DC failover. These services can improve the path for accessing the active application in a clever fashion and/or to optimize the server to server communication. These services are not mandatory to address the DR or DP but are complementary and strongly recommended to accelerate the recovery process.



1) What is the impact on your business when the service is unavailable or in degraded mode?

For most enterprises, mission and business services fit into one of these criticality levels.

C1 (Mission Imperative): Any outage results in an immediate impact on revenue generation. No downtime is acceptable under any circumstances.
C2 (Mission Critical): Any outage results in a critical impact on revenue generation. Downtime is scheduled within short, specific time windows.
C3 (Business Critical): Any outage results in a moderate impact on revenue generation. Interrupting the service for a short time is acceptable.
C4 (Business Operational): A sustained outage results in a minor impact on revenue generation. Interrupting the service for a longer period of time is acceptable.
C5 (Business Administrative): A sustained outage has little to no impact on the service.

Once a business service has been assigned its criticality level, it can be classified into a more generic matrix. The matrix is split into the two main models discussed previously: Disaster Recovery and Disaster Prevention/Avoidance.

DP/DA Availability | DP/DA RTO (hours) | DP/DA RPO | DR RTO (hours) | DR RPO (hours) | Criticality Level
Up to 99.999% | ~0 | ~0 | n/a** | n/a | C1
Up to 99.995% | 1 | 0 | 4 | 1 | C2
Up to | 4 | 0 | 24 | 1 | C3
Up to | 24 | 1 | 48 | 24 | C4
Up to | Best Effort | 24 | Best Effort | 1 week | C5
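As a quick way to use the matrix above, the RTO/RPO targets can be held in a lookup table and filtered against a requirement. This is only an illustrative sketch: the `TARGETS` structure and `levels_meeting` function are hypothetical, "~0" is treated as 0, "n/a" and "Best Effort" are represented as `None`, and "1 week" is written as 168 hours.

```python
# (RTO hours, RPO hours) per criticality level, transcribed from the matrix.
# None stands for "n/a" or "Best Effort"; "~0" is approximated as 0.
TARGETS = {
    "DP/DA": {
        "C1": (0, 0),
        "C2": (1, 0),
        "C3": (4, 0),
        "C4": (24, 1),
        "C5": (None, 24),
    },
    "DR": {
        "C1": (None, None),
        "C2": (4, 1),
        "C3": (24, 1),
        "C4": (48, 24),
        "C5": (None, 168),
    },
}


def levels_meeting(model, max_rto_hours, max_rpo_hours):
    """List the criticality levels whose targets fit within the given limits."""
    result = []
    for level, (rto, rpo) in TARGETS[model].items():
        if rto is None or rpo is None:
            continue  # no committed target to compare against
        if rto <= max_rto_hours and rpo <= max_rpo_hours:
            result.append(level)
    return result


print(levels_meeting("DR", max_rto_hours=24, max_rpo_hours=1))
print(levels_meeting("DP/DA", max_rto_hours=1, max_rpo_hours=0))
```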


1)    What is the maximum time to recover the critical applications (aka RTO) ?

Business continuance may have different requirements depending on the business the enterprise runs from its respective DCs, or on the time of day the recovery happens. It can require “zero” downtime for strict business continuance, or a couple of hours of downtime may be acceptable.

2)    What is the maximum data loss accepted during the recovery process (aka RPO) ?

The shorter the better, but a shorter RPO implies higher complexity, higher cost and shorter distances between sites, all of which will weigh on the final decision. It is sometimes acceptable to lose a couple of hours of data between an active DC and its recovery sites, while some businesses will impose zero data loss. Synchronous data replication leads to RPO=0 but has distance limitations and bandwidth implications.

3)    Recovery mode ?

Cold: All machines (network/compute/storage) need to be powered up for the recovery process, and applications are started from backup.

  • RTO can take several days or weeks
  • RPO may be several hours

This mode is not often deployed as a main backup site, although it is the least expensive to maintain. It is idle and of no use during normal operation, and the RTO is usually very long (several days or weeks). That said, some large enterprises may deploy this DR solution for a 2nd small backup site limited to some applications.

Warm: All applications are already installed and ready to start from backup data. Network access needs to be rerouted (manually) to the backup site after the applications have started.

  • RTO can take several hours
  • RPO may be several minutes to a few hours

As with the previous mode, it is expensive because it sits idle during normal operation. As of today, this mode is not the best choice for a main backup site.

Hot: backup site is always accessible and some applications are up and running.

  • RTO can take several minutes to hours
  • RPO may be zero to a few seconds

This mode is very interesting for DR functions as the infrastructure already deployed for the backup function can be leveraged to support active applications. In addition LISP can also offer stateless services for business continuity using the existing routed network.

4)    How many Data centers to interconnect?

Different scenarios can be considered

1 Primary DC (active) + 1 Backup DC (standby) – usually Region/Geo Distributed DC

  • All applications run on the Active DC (primary DC)
  • Cold – The backup DC is inactive at all layers (network/compute/storage)
  • Warm/Hot – The backup DC is inactive for application and network, but storage may be active for data replication in synchronous or asynchronous mode to be ready for the recovery process. In some cases, data warehousing activity takes place on data in the secondary data center.

1 Primary DC (Active) + 1 Backup DC (Active)  – usually Region/Geo Distributed DC

  • Active Applications on both DC but not necessarily related together
  • Each DC is the backup of the other (meaning two-way data replication)

2 Active DCs (Twin DC or Metro Distributed Virtual DC) + 1 Backup DC (Region/Geo Distributed DC)

  • Usually 2 dispersed DCs (metro distance using L2) look like a single logical DC from the application and management point of view. The 3rd one is only used for recovery purposes (Cold/Warm restart using L3).
  • This is the preferred solution, as it offers service continuity and disaster prevention without any interruption inside the Metro DVDC, plus disaster recovery for uncontrolled situations.
  • The DVDC requires VLAN extension for a fully transparent live migration (keeping the sessions active) between the 2 sites.
  • LISP can improve the original function of disaster recovery by offering disaster prevention on the backup site interconnected using a routed network.

5)    If Active/Active DC what are the drivers ?

Some applications are up and running in the secondary DC in an autonomous fashion (meaning not directly linked to applications running in the primary DC). Applications are often distributed to balance the load (network or compute) between the 2 DCs.

  • e.g., Application A supporting service ABC running in DC 1 and Application B supporting service XYZ running in DC 2
  • Each DC can be the backup of the remote site (two-way).

Duplicate applications can also be deployed in both DCs (primary and backup sites) to offer the closest access to the service XYZ for remote users (a slow response time with geo-dispersed virtual DCs may have a huge impact on the business).

  • e.g., Application A supporting service XYZ is deployed in DC 1 and the duplicated application A-bis supporting the same service XYZ is deployed in DC 2.
  • Note that in such a scenario the DB tier is usually shared between the 2 duplicated applications, which may constrain the distance between the two DCs

Applications (virtual) can migrate to the backup site in case of burst:

  • Manually
  • Automatically
  • Migration can be stateful (zero interruption) or stateless (the session must be re-established)

Applications (virtual) can migrate to the backup site for maintenance purposes (manual)

Migration purposes for physical devices (for undetermined temporary period)

  • The IP parameters of some mainframes are hardcoded, making any change difficult and costly; they therefore usually require the same subnet to be extended to both sites
  • This can be achieved using LAN Extension or L3/LISP

HA clusters are spread over the 2 sites, with 1 member of the cluster active on the primary site and standby member(s) on the secondary site(s) (and vice versa for other HA frameworks)

 6)    What is the distance between DC?

The distance can be dictated by:

Software Framework (HA clusters, GRID, Live migration). Distances are usually imposed by the maximum latency supported between members (consider 500 ms as unlimited distance)

  • Some HA clusters support L3 interconnection with unlimited distances
  • Some GRID systems are latency sensitive hence only short distances are supported.

Storage mode and replication methods (Shared Storage, Act/Act Storage..)

  • Some Act/Act storage methods support around a hundred km using synchronous replication, while some solutions support several thousand km using asynchronous replication.
  • Replication modes range from synchronous (+/-100 km max) to asynchronous (unlimited distances)
  • In the case of an HA cluster, the Quorum Disk may impose synchronous replication
    • It is also possible to replace this Quorum Disk with another method adapted to geo-clusters, such as Majority Node Set, which supports asynchronous mode.

Keep in mind that latency (and hence the distance between DCs) is one of the most important criteria for a DCI deployment. It matters first on the storage side (TPS between the server and the disk array) for operational continuity (A/A storage mode is recommended), and secondly for multi-tier applications spread over multiple DCs (server-to-server communication). Indeed, latency may have a strong impact on response time due to the ping-pong effect (often more than 10 handshakes for a single ACK). Hence, when migrating or restarting a framework from site to site, it is important to understand how the workflows relate between the different components (e.g., front-end, middle-tier, back-end, storage). Avoid a partial move of a framework and, if possible, optimize the ingress and egress traffic symmetrically through the stateful security and network services.
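The ping-pong effect can be put into numbers with a rough calculation. The sketch below assumes a propagation delay of about 5 microseconds per kilometre of fibre (one way) and ignores equipment, queuing and serialization delays; the function names and the sample distances are illustrative only.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: ~5 microseconds per km, one way


def round_trip_ms(distance_km):
    """Propagation-only round-trip time between two sites, in milliseconds."""
    return 2 * distance_km * FIBER_DELAY_US_PER_KM / 1000.0


def added_response_time_ms(distance_km, round_trips):
    """Extra response time when a transaction needs several round trips."""
    return round_trips * round_trip_ms(distance_km)


for km in (100, 1000):
    print(km, "km:", added_response_time_ms(km, round_trips=10), "ms")
```

Under these assumptions, ten round trips cost about 10 ms over 100 km but about 100 ms over 1000 km, which is why a tier left behind on the other site can dominate the response time of a multi-tier application.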





21 – Data Center Interconnect – summary

Achieving the high level of flexibility, resource availability, and transparency necessary for distributed cloud services requires four DCI components:

  • Routing Network: The routing network offers the traditional interconnection between remote sites and gives end-users access to the services supported by the cloud. This component is improved using GSLB-based services such as DNS, HTTP redirection, dynamic host routes, and LISP.
  • LAN Extension: The technical solution for extending the LAN between two or more sites using dedicated fibers or Layer 2 over Layer 3 transport mechanisms for long distances.
  • Storage Services: The storage services used to extend access between SAN resources. SANs are highly sensitive to latency and therefore impose the maximum distances supported for the service cloud. It is preferable to use an Active/Active replication model to reduce the latency to its minimum value.
  • Path Optimization Solution: The path optimization solution improves the server-to-server traffic as well as the ingress and the egress workflows.

Unlike the classical data center interconnection solutions required for geo-clusters that can be stretched over unlimited distances, DA and live migration for the service cloud require that active sessions remain stateful. As a result, maintaining full transparency and service continuity with negligible delay requires that the extension of the LAN and the SAN be contained within metro distances.

Enterprises and service providers may still have strong requirements to extend the LAN and SAN over very long distances, such as the need for operation cost containment or DP in stateless mode. These needs can be addressed if interrupting (even for a short period of time) and restarting sessions after workloads are migrated is acceptable to the system managers. Those requirements can be achieved using tools such as Site Recovery Manager from VMware© or an active-active storage solution such as EMC VPLEX Geo© for more generic system migration.


20 – Locator/ID Separation Protocol (LISP)

LISP VM-Mobility:  

Traditionally, an IP address uses a unique identifier assigned to a specific network entity such as physical system, virtual machine or firewall, etc. The routed WAN uses the identifier to also determine the network entity’s location in the IP subnet. When VMs migrate from one data center to another, the traditional IP address schema retains its original unique identifier and location, although the location has actually changed.

This is done because the Layer 2 VLAN between the physical and virtual machines supports the same IP subnet. The extended VLAN must share the same subnet so that the TCP/IP parameters of the VM remain the same from end to end, which is necessary to maintain active sessions for migrated applications.

To identify the location of a network device, LISP separates the identifier of the network device, server or application (known as the EndPoint Identifier) from its location (known as the Locator). Once the separation is done, the LISP Mapping System maintains an association between these two distinct address spaces (EndPoint Identifiers and Locators). The IP address of the identifier is preserved by encapsulating it in a traditional IP packet whose destination IP is the location (Locator) of the site to which the server or application (EndPoint Identifier) has moved.

A traditional routed network provides reachability to the Locator while the IP address of the EndPoint Identifier can dynamically migrate to a new location without modification of the routing information pertinent to the locator space. Only the information in the LISP Mapping System is updated so the end-point identifier is now mapped to its new locator.

When a VM migrates from one data center to another, the movement is detected in real time and the association between the EndPoint Identifier and the Locator is immediately updated in the RLOC Mapping database. All traffic destined for the VM is dynamically, automatically, and transparently redirected to the new location.

For hybrid clouds, a service provider can move and house the data center of the enterprise without necessarily changing the full address space of network devices and servers.

LISP IP mobility:

With the introduction of LISP, IP-based solutions allowing a subnet to be dispersed across any location become a reality. As we move forward, many of the workload mobility requirements may be addressed with a combination of LISP mobility and LAN extensions as outlined in this paper, but there may also be a number of deployments in which the LISP mobility functionality alone is sufficient to address the workload mobility requirements, eliminating the need for Layer 2 extensions. This works well today for specific scenarios such as disaster recovery and live moves. Here are a few use cases where LISP can be very efficient and help remove the need to extend Layer 2 between sites:

  • During the process of migrating physical servers to a new data center, some applications may be difficult to re-address at Layer 3, such as a mainframe. Avoiding IP address renumbering may ease physical migration projects and reduce cost substantially.
  • With hybrid cloud, Enterprise Private/internal cloud resources are moved to a Public/external cloud location. Avoiding IP address renumbering not only eases the migration of the applications from the Enterprise premise equipment, but also reduces the risks and accelerates the deployment of those services. In addition, LISP guarantees optimal routing to the active application, regardless of its location, removing the hairpin effect and therefore improving the response time to access the service.

As the technology evolves more and more scenarios will be addressed. In the future, the network architect will have the choice between an L2 and an L3 solution in order to satisfy the DCI requirements that traditionally were focused exclusively on L2 solutions.

Cisco LISP VM-Mobility provides an automated solution to IP mobility with the following characteristics:

  • Guaranteed optimal shortest path routing
  • Support for any combination of IPv4 and IPv6 addressing
  • Transparent to the EndPoints and to the IP core
  • Fine granularity per EndPoint
  • Autonomous system agnostic

19 – vCenter, ACE and OTV – Dynamic Workload Scaling (DWS)

vCenter has the ability to manually or dynamically control system resource use and allocate workload based on the physical resources available throughout the cloud.

The Cisco ACE has the ability to distribute traffic load to multiple physical or virtual servers within a server farm, using different weights based on the performances of the systems or other criteria understood by the system managers.

OTV has the native ability to detect MAC address movement from one site to another via the extension of Layer 2 that it provides.

Combining vCenter, ACE, and OTV can provide a complete solution that can detect the movement of VMs from one site to another in real time (OTV) and modify the weight associated with each real server accordingly, via the Application Network Manager (ANM) of the ACE. By monitoring and defining alarm thresholds using vCenter©, this solution can alleviate the local server farm by sending requests with variable ratio to the servers located on the remote data center.

This provides a very fine granularity and dynamic distribution of the resources that can be preferable for a specific duration, and therefore optimizes the bandwidth between remote sites used by these flows.
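The weight adjustment described above might be modelled roughly as follows. This is a hypothetical sketch and not the ACE or ANM interface: the function name, the weight values and the load threshold are assumptions, and remote servers are simply given a weight of zero while the local farm is below its alarm threshold.

```python
def real_server_weights(server_sites, local_site, local_load, burst_threshold,
                        local_weight=10, remote_weight=3):
    """Compute a weight per real server from its current site and the local load."""
    bursting = local_load >= burst_threshold
    weights = {}
    for server, site in server_sites.items():
        if site == local_site:
            weights[server] = local_weight
        else:
            # Remote servers only receive traffic once the local farm
            # crosses its alarm threshold.
            weights[server] = remote_weight if bursting else 0
    return weights


servers = {"vm-a": "DC-1", "vm-b": "DC-1", "vm-c": "DC-2"}
normal = real_server_weights(servers, "DC-1", local_load=0.5, burst_threshold=0.8)
burst = real_server_weights(servers, "DC-1", local_load=0.9, burst_threshold=0.8)

servers["vm-b"] = "DC-2"  # a VM move between sites, as OTV would detect it
after_move = real_server_weights(servers, "DC-1", local_load=0.9, burst_threshold=0.8)
```

The site of each server is an input here; in the solution described above that information would come from the detection of MAC address movement, and the load figure from the monitoring thresholds.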


18 – Dynamic Routing Based on the Application State.

The Cisco load balancing service module, ACE, provides real-time information about the preferred route to access a specific application or service after it has moved to a new location. The ACE continually probes the state of the applications for which it is responsible.

When one of these services appears in its data center, it immediately sends a static route to reach the application to the adjacent router, which in turn notifies the routed wide area network (WAN) with a preferred metric. At the same time, the site of origin withdraws the IP route of the host (or application) that no longer exists from its local routers.
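The effect of injecting and withdrawing a host route can be sketched with a toy routing table. The example assumes the application is advertised as a /32 host route and relies on longest-prefix match to prefer it over the subnet route; route metrics are not modelled, and the `RouteTable` class, site names and prefixes are illustrative.

```python
from ipaddress import ip_address, ip_network


class RouteTable:
    """Toy WAN routing table keyed by prefix."""

    def __init__(self):
        self.routes = {}

    def advertise(self, prefix, next_hop):
        self.routes[ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ip_network(prefix), None)

    def next_hop(self, destination):
        # Longest-prefix match: the most specific covering route wins.
        addr = ip_address(destination)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda p: p.prefixlen)]


wan = RouteTable()
wan.advertise("192.0.2.0/24", "DC-X")    # subnet advertised by the site of origin
wan.advertise("192.0.2.14/32", "DC-Y")   # host route injected after the move
moved = wan.next_hop("192.0.2.14")
others = wan.next_hop("192.0.2.20")

wan.withdraw("192.0.2.14/32")            # application leaves DC-Y again
back = wan.next_hop("192.0.2.14")
```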

Unlike the GSS, information on Layer 3 routes concerns all existing sessions (current and new) and all sessions will be redirected almost in real time – according to the route table updates – to the new data center hosting the concerned application. However, the current active sessions will be kept safe during the migration because the IP address is unchanged for both the private and the public side. Nevertheless, local stateful devices such as firewalls and load balancers must initiate and validate a new session. Except for some specific protocols primarily related to maintenance purposes such as Telnet or FTP, usually for applications related to the cloud services such as IaaS and/or traditional http based software, this is not detrimental to the services supported.

In fact, an HTTP session that was established through stateful devices such as a firewall in the primary data center can be redirected to a secondary data center offering the same access and application security level and policy rules. After the migration of the concerned application on the new data center, the local stateful devices will accept and initiate a new session for that workflow according to the security policies.

Once granted, the session will be established transparently to the end user. Note that mechanisms based on cookies or SSL IDs or other identifiers used to maintain session persistence between the server supporting the application and the end-user, must be maintained.


17 – Intelligent Domain Name Server

The Global Site Selector (GSS) is an Intelligent Domain Name Server that distributes the user’s requests to the remote sites where the applications are active. The GSS has already been described at the beginning of this article (Global Site Load Balancing Services) in a traditional routed environment. The same equipment can be used for LAN extension in conjunction with other network services such as the SLB devices, as well as with centralized management tools used for migrating VMs.

GSS with KAL-AP: In conjunction with load balancing equipment such as the Cisco Application Control Engine (ACE), the GSS periodically sends probes (Keep Alive Appliance Protocol, KAL-AP) to the load balancing device in order to determine the status of VMs and services distributed throughout the cloud.

When a VM migration is complete, the GSS locates the active application based on regular keep-alive probing, and immediately associates it with the public address of the hosting data center.

The existing sessions are maintained via the original data center to avoid interruption, while all the DNS requests from new clients for this application are updated with the public IP address used for the new site. This mechanism supports load distribution across multiple WANs (different active applications distributed between multiple locations).
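The behaviour described above, where new DNS answers follow the application while established sessions keep the address they already resolved, can be illustrated with a minimal sketch. The `SiteSelector` class is hypothetical and does not reflect the GSS or KAL-AP interfaces; site names and public addresses are placeholders.

```python
class SiteSelector:
    """Toy DNS-based site selector driven by keep-alive results."""

    def __init__(self, site_public_ip):
        self.site_public_ip = dict(site_public_ip)  # site -> public address
        self.active_site = {}                       # application -> site

    def keepalive_report(self, app, site, alive):
        # Result of a periodic probe for one application at one site.
        if alive:
            self.active_site[app] = site
        elif self.active_site.get(app) == site:
            del self.active_site[app]

    def resolve(self, app):
        # Answer for a new DNS request: the site where the app is active.
        site = self.active_site.get(app)
        return self.site_public_ip.get(site) if site is not None else None


gss = SiteSelector({"DC-X": "198.51.100.10", "DC-Y": "203.0.113.10"})
gss.keepalive_report("app-a", "DC-X", alive=True)
existing_session_ip = gss.resolve("app-a")   # client connected before the move

gss.keepalive_report("app-a", "DC-X", alive=False)
gss.keepalive_report("app-a", "DC-Y", alive=True)
new_client_ip = gss.resolve("app-a")         # client arriving after the move
```

The existing client keeps using the address it resolved earlier, so its session still enters through the original site, while only new requests receive the address of the new site.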

In the meantime, this mechanism optimizes traffic by sending all new sessions directly to the location hosting the active service of the cloud without using the Layer 2 extension path.

By correlating the keep-alive function (KAL) from the GSS with other egress path optimization functions such as FHRP localization as described above, new incoming sessions established at the new data center will be optimized for direct return flow using their local default gateway.

To keep the existing sessions secure, the traffic for those current sessions must return to the original site via a mechanism of source-NAT activated at the upstream layer. This allows both scenarios to be used while ensuring symmetrical flows for all sessions.

vCenter and GSS: A module script is added to the central management of VMs from VMware (vCenter) so that when the VM management migrates a VM on a remote host, manually or dynamically, it informs the upstream GSS about the new location of this particular VM in real time.

The GSS then immediately assigns a public address associated with the new site to the VM. Similarly with the KAL-AP probes described previously, the established sessions remain directed to the same site of origin to maintain the workflows.


16 – Client to Server Traffic

When a user accesses an application running in a distant resource, the client must be able to use the optimal path and be dynamically redirected to the data center supporting the active application or VM. However, as explained previously, the routed Layer 3 network cannot determine the physical location of an IP device within the same subnet when it is stretched between different locations.

Without any ingress optimization, for long distances between remote sites, more bandwidth is consumed and some delays may be noticeable for the service offered by the cloud.

For example, assume that an application “A”, available by default in the primary data center “X”, migrates to data center “Y”. Requests from the remote user are still directed to the primary data center “X” and then cross the extended Layer 2 path to reach the active application that has moved to data center “Y”, and vice versa for the return traffic.

It is therefore important to optimize the path for a remote user as soon as the application has migrated to the next location.

Cisco provides a number of IP localization services that, combined with other IP functions, support path optimization:

  • Intelligent Domain Name Server
  • Host Route Injection
  • Locator/ID Separation Protocol (LISP)

Posted in DCI, Path Optimization | Leave a comment

15 – Server to Client traffic

The same function of IP localization can be applied to outbound traffic so that the responses from a server sent to an end-user can exit through its local WAN access without returning the session to the default gateway of origin.

However, it is imperative that when stateful services are deployed, the return traffic remains symmetrical with the incoming flows. This ensures the security of all current sessions without disrupting established sessions.

It is therefore important to combine the IP localization service used for outgoing traffic with the other optimization mechanisms available for ingress client-to-server traffic.

Posted in DCI, Path Optimization | Leave a comment

14 – Server to Server Traffic

When a server migrates from one site to another, it must return the traffic to its default gateway because its IP address schema remains the same regardless of its physical location. Since there is one IP address (or virtual IP addresses (VIP)) for a given default gateway per subnet, this implies that after the migration of a logical server, the traffic must be returned to the original site where the active default gateway stands. In a complex multi-tier architecture, routers and firewalls are usually enabled to improve the communication and security between the tiers.

If, for example, a solution built with a 3-tier application (e.g. Web Server, Application and Database tiers) is moved from one data center to another, the traffic between each tier will have to return to the site where the gateways or firewalls are active. If we add to that the different network services required for optimization and data security (load balancer, SSL termination, IPS) enabled at different tiers, then up to ten round trips for a simple query may occur. Consequently, depending on the distance between the data centers, the latency for a request may be significantly affected (e.g. an additional 10 to 20 ms for 100 km over dedicated fiber with 10 round trips).
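The order of magnitude quoted above can be checked with a quick calculation. This is a rough sketch that assumes a propagation delay of about 5 µs per km in optical fiber (a common rule of thumb, not a figure from this post) and ignores equipment and queuing delays, so it gives the lower bound of the range.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: ~5 microseconds per km in fiber


def added_latency_ms(distance_km, round_trips, us_per_km=FIBER_DELAY_US_PER_KM):
    """Propagation delay added by `round_trips` crossings of the DCI link."""
    one_round_trip_us = 2 * distance_km * us_per_km  # there and back
    return round_trips * one_round_trip_us / 1000.0


# 100 km between sites, 10 round trips for a single query
print(added_latency_ms(100, 10))  # 10.0
```

Propagation alone accounts for 10 ms; any per-hop processing in the stateful devices comes on top of that.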

It is therefore crucial that the inter-application-tier or server-to-server traffic is better controlled to minimize the “ping-pong” effect.

Emerging solutions such as EMC VPLEX Geo, which support active/active data over thousands of km with no service interruption, separate performance and distance considerations.

Cisco supports deployment options for enabling the same default gateway functionalities in different data center sites (FHRP localization). This functionality is completely transparent to the application layer as well as the network layer. By activating this IP localization service, after the migration of VMs it is possible to use a local default gateway configured with the same IP identification (same virtual MAC addresses and virtual IP) that were defined on the original site.


Posted in DCI, Path Optimization | Leave a comment

13 – Network Service Localization and Path Optimization

The ability to avoid disasters is improved by distributing physical compute and network resources between data centers that are geographically distributed over long distances. Geographic distribution provides higher elasticity and almost unlimited flexibility of the resources required to dynamically deploy VM loads.

The network side can transparently support distributed applications by extending Layer 2 between multiple sites. Yet by definition, the Layer 3 traffic carried between users and active applications through the cloud does not have native knowledge of the physical IP device locations, other than the network prefix given through the most significant bit-group of the IP address. The IP subnet is a logically visible subdivision of the network that is usually limited to the local network. It therefore delimits its broadcast domain, as defined by the subnet mask. The IP subnet is usually established by the enterprise or service provider network team. In general, an IP subnet addresses a set of IP equipment that belongs to the same VLAN.
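A short illustration of this point, using Python's standard `ipaddress` module. The host names, addresses, and site labels are invented for the example. Two VMs in the same stretched subnet resolve to the same prefix, so a router that only sees the prefix has no way to tell which data center hosts which VM.

```python
import ipaddress

# Hypothetical inventory: VM name -> (IP address, physical site)
vms = {
    "vm-a": ("10.1.1.10", "DC-X"),
    "vm-b": ("10.1.1.20", "DC-Y"),
}


def prefix_for(ip, prefix_len=24):
    """Return the network prefix a router would use to reach this address."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


for name, (ip, site) in vms.items():
    print(name, site, prefix_for(ip))
```

Both lines show the prefix `10.1.1.0/24` even though the VMs sit in different sites, which is exactly the location information that is lost when the subnet is stretched.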

Traditionally, an IP subnet or a VLAN is associated with a physical location inside a data center. With the concept of interconnecting cloud resources, however, the broadcast domain is stretched over the distances that separate the data centers (in theory, DCI can be established over unlimited distances with L2 over L3 transport). Therefore, the concept of location induced natively by the IP subnet subdivision loses one of its original functions of localization.

Thus, depending on the distance between the remote sites, the native routing mechanism can have an impact on performance for three major types of communication.

  1. Traffic from the user to the server
  2. Traffic from the server to the user
  3. Traffic from server to server (such as in a multi-tier application)


Posted in DCI | Leave a comment

12 – Network and Security Service Placement

Modern firewalls, load balancers, and most stateful devices support the concept of virtual context, which is the ability to support multiple virtual firewalls or virtual load balancers. Up to 250 virtual contexts, fully autonomous and isolated from each other, can be enabled on a single physical appliance or service module.

To offer the high availability service required for business continuity, firewalls and load balancers work by pairing physical devices. Both devices remain active while the virtual contexts run in an active/standby fashion. In addition to providing redundancy, this mechanism distributes the active context between the two devices, improving the total throughput for active workflows.

When interconnecting multiple data centers and deploying firewalls and other stateful devices such as load balancers, the distance between remote sites is an important consideration.

When data center sites are deployed in close proximity (such as within the few kilometers that are typical for large campus deployments), they can be considered a single, physically-stretched data center location. Under these premises, it would probably be acceptable to deploy the stateful devices in a stretched fashion, with one physical member of the HA pair in each data center site. For deployments where the distance between locations is greater, a pair of HA devices is typically deployed in each physical data center.

There are some important reasons to maintain each active/standby pair within the same data center. Redundancy is controlled by two devices[1], which means that dispersing the active/standby contexts in two different locations would limit the maximum number of data centers to two. On the other hand, keeping the active/standby pair of network services inside the same physical data center allows the same security policies to be replicated in more than two data centers.

In addition, the link between the physical devices used for health check and process synchronization (replication of the active flows for stateful failover) must be extended in a very solid fashion. Due to its function of fault tolerance, it is also very sensitive to latency.

Last but not least, security and optimization functions usually require maintaining a stateful session. Therefore, for the same session, the traffic should be returned to the original virtual context that acknowledged the first flow, otherwise the flow will be dropped.

This behavior of symmetrical paths should be controlled and maintained, especially with the migration of VMs over a LAN extension as explained in the next topics.

Posted in DCI | 5 Comments

11 – Storage Extension

The distance between the physical resources and the effects of VM migration must be addressed to provide business continuity and DA when managing storage extension. The maximum distance is driven by the latency supported by the framework without impacting the performance.

VMs can be migrated manually for DA, or dynamically (e.g. VMware Dynamic Resource Scheduler) in a cloud environment. VM migration should occur transparently, without disrupting existing sessions.

VMs use dedicated storage volumes (LUN ID) provisioned from a SAN or a NAS disk array. These storage volumes cannot be replaced dynamically without affecting the active application and stateful sessions.

In order for the VMs to move from one physical host to another in a stateful mode, they need to keep the same storage. This behavior is not a concern when the VM moves from one host to another within the same physical PoD or between PoDs inside the same physical data center, as the distances between hosts and storage disks are very short. The storage is therefore provisioned in shared mode with the same physical volumes accessible by any host.

Shared storage means that during and after the movement of a VM, the operating system remains attached to the same physical LUN ID when the migration occurs between two hosts.

However, this shared mode of operating storage may have an adverse effect in a DCI environment because of the long distance between hardware components. Depending on the rate of transactions per second (TPS) and on how much I/O the application itself consumes (e.g. a database), beyond 50 km (1 ms latency in synchronous mode), there is a risk of impacting the performance of the application. Assuming a VM has moved to a remote site, by default it continues to write and read data stored on its original physical volume.
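The "50 km for 1 ms" figure can be reproduced with a back-of-the-envelope model. The sketch below rests on two assumptions that are not stated in this post: roughly 5 µs of propagation delay per km of fiber, and two round trips per synchronous write (a typical SCSI write exchange). Array and switch processing times are ignored.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: ~5 microseconds per km in fiber


def sync_write_latency_ms(distance_km, round_trips=2):
    """Propagation latency of one synchronous write to a remote volume."""
    return round_trips * 2 * distance_km * FIBER_DELAY_US_PER_KM / 1000.0


def max_distance_km(budget_ms, round_trips=2):
    """Largest distance that keeps a synchronous write within the budget."""
    return budget_ms * 1000.0 / (round_trips * 2 * FIBER_DELAY_US_PER_KM)


print(sync_write_latency_ms(50))            # 1.0 ms at 50 km
print(max_distance_km(1.0, round_trips=2))  # 50.0 km
print(max_distance_km(1.0, round_trips=1))  # 100.0 km
```

Under these assumptions, removing one of the two round trips doubles the distance reachable within the same latency budget.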

Several storage services can be enabled to compensate for this behavior:

Cisco IOA: Cisco provides an I/O Acceleration (IOA) function on Cisco MDS 9000 Series Fabric Switches. IOA can halve the latency for synchronous write replication, thus doubling the distance between two sites for the same latency.

In partnership with ecosystem partners NetApp and EMC, Cisco and VMware have tested and qualified two types of storage services that mitigate the effect of latency-sensitive remote I/O caused by VM mobility between sites.

FlexCache from NetApp: This feature supports a local storage cache (i.e. at the secondary DC) of data that has previously been read from the original disk. Any read command for data already stored locally does not have to cross the long distance between the two sites, so read commands see negligible latency, although the original volume still physically resides in the primary data center (shared storage). Therefore the current stateful sessions can retain their active state during and after VM migration without being disturbed. FlexCache operates in a NAS environment. The actual data is still written to the single location at the original site. Therefore this mode of storage remains shared.

VPLEX Metro from EMC: This feature allows users to create a virtual volume that is distributed between two remote sites. Both volumes can synchronously present the same information on two different sites. The volume is created and shared between two VPLEX clusters, connected via an extended Fibre Channel link running in synchronous mode. The data is replicated and synchronized between the VPLEX devices using a dedicated FC link.

The initiator (host) writes data on the same but virtual LUN ID available on both sites at the same time. This technology replicates the same settings of the SCSI parameters on both storage targets (VPLEX), making the change of physical volume transparent to the hypervisor.

The maximum distance between the two cluster members of VPLEX Metro should not exceed 100 km because the replication runs in synchronous mode. Synchronous mode is required to maintain the transparency of this service.

This function works in a SAN environment and the storage mode is therefore known as Active/Active.

Posted in DCI, SAN-FC | Leave a comment

9 – Overlay Transport Virtualization (OTV)

Cisco has recently introduced a new feature called OTV that extends Layer 2 traffic between multiple sites over a Layer 3 network. The edge devices that interconnect data centers are known as OTV edge devices.

OTV dynamically encapsulates Layer 2 packets into an IP header for the traffic sent to the remote data centers. Routing Layer 2 traffic on top of a Layer 3 network is known as “MAC routing” transport. MAC routing uses a control protocol to propagate MAC address reachability information. This is in contrast with the traditional data plane learning done in technologies like VPLS.

In the next example, MAC 1 sends a Layer 2 frame to destination MAC 2. On the MAC table of the OTV edge device (DC1), MAC 1 is a local address (Eth1), while the destination MAC 2 belongs to a remote location reachable via the IP address B (remote OTV Edge device).

The local OTV-ED encapsulates the Layer 2 frame using an IP header with “IP B” as the IP destination. The remote OTV-ED removes the IP header and forwards the frame to its internal interface (Eth 5). Local Layer 2 traffic is treated as on any classical Ethernet switch (i.e. MAC 2 <=> MAC 3 on DC2).


A control plane protocol is used to exchange MAC reachability information between network devices, extending the VLANs between the remote sites while the learning process inside the data center is performed as in any traditional Layer 2 switch. This mechanism of advertisement destined to the remote OTV edge device differs fundamentally from classical Layer 2 switches, which traditionally leverage the data plane learning mechanism based on L2 source MAC address discovery: if the destination address is unknown after a MAC lookup on the MAC table, the traffic is flooded everywhere.

With OTV, the process for learning MAC addresses is performed by advertising the local MAC tables to all remote OTV edge devices. Consequently, if a destination MAC address is not known, the packet destined to the remote data center is dropped.
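The forwarding decision described above can be sketched as follows. This is a simplified model for illustration, not OTV's actual implementation: the class `OtvEdgeDevice`, its method names, and the interface "Eth6" for MAC 3 are hypothetical, and the control protocol is reduced to a direct copy of the local MAC table to a peer. The "drop" result only means the frame is not sent across the overlay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str


class OtvEdgeDevice:
    """Toy model of an edge device doing MAC routing (hypothetical names)."""

    def __init__(self, name, overlay_ip):
        self.name = name
        self.overlay_ip = overlay_ip
        self.local = {}   # MAC -> internal interface (learned in the data plane)
        self.remote = {}  # MAC -> remote edge IP (learned from advertisements)

    def learn_local(self, mac, interface):
        self.local[mac] = interface

    def advertise_to(self, peer):
        # Stand-in for the control protocol: publish local MACs to a peer.
        for mac in self.local:
            peer.remote[mac] = self.overlay_ip

    def forward(self, frame):
        if frame.dst_mac in self.local:
            return ("switch", self.local[frame.dst_mac])
        if frame.dst_mac in self.remote:
            return ("encapsulate", self.remote[frame.dst_mac])
        return ("drop", None)  # unknown unicast is not flooded to remote sites


dc1 = OtvEdgeDevice("DC1", "IP A")
dc2 = OtvEdgeDevice("DC2", "IP B")
dc1.learn_local("MAC 1", "Eth1")
dc2.learn_local("MAC 2", "Eth5")
dc2.learn_local("MAC 3", "Eth6")
dc2.advertise_to(dc1)

print(dc1.forward(Frame("MAC 1", "MAC 2")))  # ('encapsulate', 'IP B')
print(dc2.forward(Frame("MAC 2", "MAC 3")))  # ('switch', 'Eth6')
print(dc1.forward(Frame("MAC 1", "MAC 9")))  # ('drop', None)
```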

This technical innovation has the advantage of removing the risk of flooding unknown unicast traffic from one site to another. This technique is based on a routing protocol, and provides a very stable and efficient mechanism of MAC address learning and Layer 2 extension while maintaining the failure domain inside each data center.

While OTV natively maintains the STP and the failure domain within each local data center, it provides the ability to deploy multiple OTV edge switches in the same data center in active mode. This function is known as Multi-Homing.

OTV works across any type of transport (Fiber, TCP/IP, MPLS) extended between the remote sites with the reliability and effectiveness of the Layer 3 protocol.

In addition to these primary functions, which are essential for the cloud networking, OTV offers several very important innovations:

OTV connects two or more sites to form a single virtual data center (Distributed Virtual Data Center). No circuit states are required between the remote sites to establish the remote connection. Each site and each link are independent and maintain an active state. This is known as “Point to Cloud” service, which allows a data center to be securely attached or removed at any time without configuring the remote sites and without disturbing cloud services.


OTV offers a native multicast traffic optimization function between all remote sites. OTV is currently available on Cisco Nexus 7000 Series Switches and the Cisco ASR 1000 Series Aggregation Services Routers.

Posted in DCI | 3 Comments

10 – Ethernet Virtual Connection (EVC)

Ethernet Virtual Connection (EVC) is a Cisco carrier Ethernet equipment function dedicated to service providers and large enterprises. It provides a fine granularity to select and treat the inbound workflows known as service instances, under the same or different ports, based on flexible frame matching.

Ethernet Virtual Connection

One of the benefits of EVC is the ability to address independent Ethernet encapsulations on the same physical interface with a mix of different services such as dot1q trunk, dot1q tunneling, EoMPLS Xconnect, VPLS attachment circuit, routing, and Integrated Routing and Bridging (IRB). Each service instance can match a unique identifier or a range of identifiers (i.e. a VLAN tag deployed mainly in the context of DCI).

Another important feature is the ability to aggregate multiple service instances into the same transport virtual instance. For example, EVC can multiplex multiple VLANs into the same Bridge Domain (BD) connected to a Virtual Forwarding Instance (VFI) of a VPLS.

Multiplexing Multiple VLANS

Another important element for a multi-tenancy environment is the ability for each service instance to change the VLAN tag using a new identifier, which allows dynamic VLAN ID translation.
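The matching, multiplexing, and translation behavior described above can be modeled in a few lines. This is a conceptual sketch only, not Cisco configuration or an actual API: `ServiceInstance`, `classify`, and the VLAN and bridge-domain numbers are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceInstance:
    """Hypothetical model of one service instance on a port."""
    instance_id: int
    match_vlans: range                  # VLAN tags matched on ingress
    bridge_domain: int                  # transport instance the frames join
    translate_to: Optional[int] = None  # new VLAN tag, if the tag is rewritten


def classify(port_instances, vlan):
    """Return (bridge_domain, outgoing_vlan) for a frame, or None if unmatched."""
    for si in port_instances:
        if vlan in si.match_vlans:
            out_vlan = si.translate_to if si.translate_to is not None else vlan
            return si.bridge_domain, out_vlan
    return None


port = [
    ServiceInstance(1, range(10, 21), bridge_domain=100),                    # VLANs 10-20
    ServiceInstance(2, range(30, 31), bridge_domain=100, translate_to=300),  # VLAN 30 -> 300
]

print(classify(port, 15))  # (100, 15)
print(classify(port, 30))  # (100, 300)
print(classify(port, 99))  # None
```

Both service instances land in the same bridge domain, which is the multiplexing case, while the second one also rewrites the VLAN tag.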

In addition, the flexibility of the EVC service mapping improves scalability in terms of service instances and VLANs per Bridge Domain, per EoMPLS Xconnect, or per VFI.

Under the same or different physical interface, multiple service instances can be mapped into the same bridge-domain to provide L2 local bridging between physical interfaces and leverage the usage of VPLS to bridge L2 frames across the MPLS core.

If we leverage the concept of EVC in the public cloud context, a customer service is an EVC. This EVC is identified by the encapsulation of the customer VLANs within an Ethernet island or Bridge Domain, and is identified by a globally unique service ID. A customer service can be point-to-point or multipoint-to-multipoint.

The next drawing shows two customer services: Service Green is point to point; Service Red is multipoint to multipoint.


Posted in DCI | Leave a comment

8 – Extended Layer 2 over Layer 3 (L2 over L3) – MPLS


For point-to-point networks across very long distances, Ethernet over Multiprotocol Label Switching (EoMPLS) Pseudowire can be useful. The EoMPLS service is supported natively on Cisco Catalyst 6500 Series Switches with the Sup720 and Sup2T cards. In conjunction with the VSS function (clustered switches), the resiliency of the L2 VPN service can be easily improved for a DCI LAN extension. VSS provides a fully-redundant physical solution that enables a logical L2 over L3 link (Pseudowire) flawlessly and without the need to activate the Spanning Tree protocol between the remote sites. EoMPLS is also supported on Cisco ASR1000 Series Routers. L2 over L3 extends the Layer 2 Pseudowire over unlimited distances.

With an additional SIP or ES+ card on the Catalyst 6500, the EoMPLS function can be encapsulated directly into a GRE tunnel. This gives the option to extend the Layer 2 VPN over a pure IP network. In this case, the technical knowledge and experience required for an MPLS environment is no longer needed. In addition, the GRE tunnel may be encrypted using the standard point-to-point encapsulation method of IPsec.

For multiple data center interconnections, Cisco offers two technologies to address the requirements of cloud computing:

  • VPLS
  • OTV

Virtual Private LAN Service (VPLS) is available via two approaches:

  • A-VPLS: Advanced VPLS (A-VPLS) is designed for enterprise environments. A-VPLS is available on Cisco Catalyst 6500 Series Switches using a SIP-400 or ES+ WAN card. This option takes advantage of the system virtualization capabilities of the Catalyst 6500 VSS so that all physical links and edge switches are redundant and active without extending the Spanning Tree protocol between sites. A-VPLS has been specifically designed to offer simplicity of implementation with all the features and performance of the MPLS transport protocol. This feature can also be implemented on a pure IP core network via a GRE tunnel.
  • H-VPLS: H-VPLS is designed for service provider environments, in which very high capacity interconnections and segmentation are required. It can process information in very large private, public and hybrid cloud environments with large numbers of tenants. H-VPLS is available with the Cisco ASR9000 Series Routers.

    Multi-Chassis Link Aggregation Group (MC-LAG): MC-LAG enables downstream devices to dual-home one or more bundles of links using the Link Aggregation Control Protocol (LACP) 802.3ad in an active/standby redundancy mode, so the standby takes over immediately if the active link(s) fail. The dual-homed access device operates as if it is connected to a single virtual device. MC-LAG is usually enabled on the provider edge (PE) device.

    Cisco 7600 Series Routers and the Cisco ASR9000 Series Aggregation Services Routers support this feature. With MC-LAG, the two routers function as Point of Attachment (POA) nodes and run an Inter-Chassis Communication Protocol (ICCP) to synchronize state and to form a Redundancy Group (RG). Each device controls the state of its MC-LAG peer for a particular link-bundle; one POA is active for a bundle of links while the other POA is a standby. Multiple active link bundles per chassis are supported.

    MC-LAG can work in conjunction with L2VPN such as VPLS, but other network transport services such as EoMPLS, L3VPN or QoS can be leveraged as well.

    A use case in the context of DCI LAN extension is that at the edge of a provider’s network, each customer edge (CE) device supporting LACP is dual-homed to two provider edge (PE) devices and distributes the load onto multiple link bundles using a VLAN-based hashing mechanism. Then the MC-LAG device bridges and extends the concerned VLANs over an MPLS core using a VPLS Pseudowire. The MEC function can be enabled on the aggregation layer to improve Layer 2 multipathing within the data center, so that all Layer 2 uplinks from the access layer to the aggregation layer are forwarding.

    MC-LAG offers rapid recovery times in case of a link or node failure, while VPLS addresses the traditional fast convergence, fast reroute and path diversity features supported by MPLS.

    MEC and MC-LAG Function


Posted in DCI | 2 Comments

7 – Native Extended Layer 2

The diversity of services required in a cloud computing environment and the constraints related to the type of applications moving over the extended network require a set of diversified DCI solutions. Cisco offers three groups of technical solutions that meet these criteria:

Point-to-Point Interconnections: For point-to-point interconnections between two sites using a dedicated fiber or a protected dense wavelength-division multiplexing (DWDM) mode, Cisco offers Multi-Chassis EtherChannel (MEC) solutions that allow multiple physical links of a Port Channel to be distributed over two different chassis. MEC is available through two approaches:

  • A single control plane managing the two chassis: This method is available on the Catalyst 6500 series with the function of Virtual Switching System (VSS).
  • An independent control plane: This option is available on Cisco Nexus 5000 and Cisco Nexus 7000 Series switches with the function of a virtual Port-Channel (vPC).

These options can provide active physical link and edge device redundancy to ensure the continuity of traffic between the remote sites, for type 1 and type 2 faults. Both approaches eliminate the use of the Spanning Tree protocol to control the loops. In addition, the MEC solution improves bandwidth utilization.

MEC Solution


Multiple Site Interconnections: For multi-site interconnections using optical links or using a DWDM service running in protected mode, FabricPath (TRILL) can quickly and seamlessly connect multiple remote sites in a fabric fashion, remove the extension of the Spanning Tree Protocol between remote data centers, and offer huge scalability compared to classical Ethernet. FabricPath is available on the Cisco Nexus 7000 Series Switches, with upcoming availability on Cisco Nexus 5500 Series Switches. FabricPath can also be used in a point-to-point model, which supports tying additional data centers into the cloud without impacting the production network or affecting existing connections.

Security: Traffic sent through the DCI Layer 2 extension can also be encrypted between Cisco Nexus 7000 Series Switches deployed at the network edge using the Cisco feature called TrustSec (CTS). With CTS, encryption is performed by the hardware at line rate without impacting the performance or the latency of the traffic crossing the inter-site network. CTS offers a rich set of security services including the confidentiality of data transmitted over the WAN via a standard encryption mechanism (802.1AE).

Posted in DCI | 4 Comments

6 – Layer 2 Extension

Layer 2 switching over the WAN or the metro network, whether it is a native Ethernet frame format or a Layer 2 over TCP/IP over any type of transport, should not add any latency to that imposed by the physical distance between sites (mostly dictated by the speed of light). The switching technologies used to extend Layer 2 over Layer 3 (L2oL3), and obviously the native Layer 2 protocol, must be processed in hardware (ASIC) to achieve line-rate transport. The type of LAN extension and the choice of distance are imposed by the maximum latency supported by the HA cluster framework and by virtual mobility.

It is crucial to consider whether the number of sites to be connected is two or more than two. Technologies used to interconnect two data centers in a back-to-back or point-to-point fashion are often simpler to deploy and to maintain. These technologies usually differ from more complex multipoint solutions that provide interconnections for multiple sites.

The drawback to simple point-to-point technology is reduced scalability and flexibility. This is a serious disadvantage for cloud computing applications, in which supporting an increasing number of resources in geographically distributed remote sites is critical.

Therefore, enterprises and service providers should consider possible future expansion of cloud networking and the need to operate without disruption when considering Layer 2 Extension technologies. Solutions for interconnecting multiple sites should offer the same simplicity as interconnecting two sites, with transparent impact for the entire site. Dynamically adding or removing one or several resource sites in an autonomous fashion is often referred to as “Point to Cloud DCI”. The network manager should be able to seamlessly insert or remove a new data center or a segment of a new data center in the cloud to provide additional compute resources without modifying the existing interconnections and regardless of the status of the other remote sites.

Whatever DCI technology solution is chosen to extend the Layer 2 VLANs between remote sites, the network transport over the WAN must also provide secure ingress and egress access into those data centers.

When extending Layer 2, a number of rules must be applied to improve the reliability and effectiveness of distributed cloud networking:

  • The spanning tree domain should not be extended beyond a local data center, although all the links and switches that provide Layer 2 extension must be fully redundant.
  • The broadcast traffic must be controlled and limited by the edge devices to avoid the risk of polluting remote sites and should not have any performance impact.
  • All existing paths between the data centers must be forwarding, and intra-cloud networking traffic should be optimized to better control bandwidth and latency.
  • Some workflows are more sensitive than others. Therefore, when possible, diverse paths should be enabled for some services such as the heartbeat used by clusters, or for specific applications such as management or monitoring.
  • The Layer 3 services may not be able to natively locate the final physical position of a VM that has migrated from one host to another. This normal behavior of routed traffic may not be efficient when the Layer 2 network (Broadcast Domain) is extended over a long distance and hosts are spread over different locations. Therefore, the traffic to and from the default gateway must be controlled and restricted to each local data center when appropriate. Similarly, the incoming traffic should be redirected dynamically to the physical site where virtualized applications have been activated.
  • For long-distance inter-site communication, mechanisms to protect the links must be enabled and rapid convergence algorithms must be provided to keep the transport as transparent as possible.
  • The VLANs to be extended must have been previously identified by the server and network team. Extending all possible VLANs may consume excessive hardware resources and increase the risk of failures. However, the DCI solution for LAN extension must provide the flexibility to dynamically remove or add any elected VLAN on demand without disrupting production traffic.
  • Multicast traffic should be optimized, especially in a cloud architecture made up of geographically dispersed resources.

Posted in DCI | Leave a comment

5 – High Availability Cluster Requirement Versus Virtual Machine Mobility

When a failover occurs in an HA cluster, the software components have to be restarted on the standby node. Assuming the storage has been replicated to the remote location using synchronous or asynchronous mode, the standby node can continue to handle the application safely.

For cloud computing, it is usually necessary to keep the session stateful during and after the migration of a VM. When a stateful session is necessary during the movement of the system, the distance between the two physical hosts is driven by the maximum latency supported, the synchronous replications for data mirroring, and the services associated with the storage replication. This means a maximum distance of 100 km between two hosts should not be exceeded when using an Active/Active storage mode requiring synchronous data replication engines. It is important to note that this distance drops to 40-50 km when using the default shared storage mode.

The elasticity of cloud computing is also driven by the requirement that active sessions be maintained with no interruption of service. Live migration services in real time are therefore limited to metro distances because of the synchronous mirroring (zero RPO). Beyond metro distances, and using current storage tools, the whole cloud computing solution becomes a DR service. Service functions such as Site Recovery Manager (SRM®) are very efficient and built for that purpose, but in a stateless session mode.

 Four Components of DCI

Cisco addresses DCI requirements with a complete set of network and storage architecture solutions. The Cisco solution is based on the four components that are critical to providing transport elasticity, flexibility, transparency, and resiliency between two or more sites:

  • Layer 2 extension
  • SAN extension
  • Routing interconnection between sites
  • IP Localization for Ingress, Egress and Server to Server workflows

Posted in DCI | 1 Comment

4 – Cloud Computing and Disaster Avoidance

The Need for Distributed Cloud Networking

Although Hot Standby disaster recovery solutions running in a traditional routed DCI network are still valid and are often part of the specifications of the enterprise, some applications and emerging services enabled with cloud computing require an RTO and RPO of zero with fully transparent connectivity between the data centers.

These requirements have already been addressed, especially with the deployment of high-availability clusters stretched between multiple sites. However, the rapid evolution of business continuity for cloud computing requires us to offer more reliable and efficient DCI solutions to meet the flexibility and elasticity needs of the service cloud: dynamic automation and almost unlimited resources for the virtual server, network, and storage environment in a fully transparent fashion.

Disaster recovery is critical to cloud computing for two important reasons. First, physical data centers have “layer zero” physical limits and constraints such as rack space, power, and cooling capacity. Second, few applications are currently written for the distributed cloud in a way that accounts for the type of network transport and the distances involved in exchanging data between different locations.

DCI Solution with Resource Elasticity

To address these concerns, the type of network interconnection between the remote data centers that handles the cloud infrastructure needs to be as resilient as possible and must be able to support any new connections where resources may be used by different services. It should also be optimized to provide direct, immediate, and seamless access to the active resources dispersed at different remote sites.

The need for applications and services to communicate effectively and transparently across the WAN or metro network is critical for businesses and service providers using private, hybrid, or public clouds.

To support cloud services such as Infrastructure as a Service (IaaS), VDI/VXI, and UCaaS in a sturdy, resilient, and scalable network, it is also crucial to provide highly-available bandwidth with dynamic connectivity regardless of VM movement. This is true for the network layer as well as for the storage layer.

The latency related to the distance between data centers is another essential element that must be taken into account when deploying the DCI. Each service and function has its own criteria and constraints, so the application requiring the lowest latency serves as the reference for determining the maximum distance between physical resources.
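The rule that the most latency-sensitive application sets the maximum distance can be sketched as follows. The latency budgets below are hypothetical placeholders, not vendor figures, and the one-way fiber delay of about 5 microseconds per kilometer is an assumption that ignores equipment delay and indirect fiber paths.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: one-way propagation delay in fiber

# Hypothetical round-trip latency budgets in milliseconds. Real values come
# from each application's or vendor's own specification.
rtt_budget_ms = {
    "synchronous storage mirroring": 1.0,
    "cluster heartbeat": 5.0,
    "live VM migration": 10.0,
}


def max_distance_km(budgets):
    """Return (most latency-sensitive service, distance in km it allows)."""
    service = min(budgets, key=budgets.get)       # tightest budget wins
    one_way_us = budgets[service] * 1000.0 / 2    # half the RTT, in microseconds
    return service, one_way_us / FIBER_DELAY_US_PER_KM


if __name__ == "__main__":
    service, km = max_distance_km(rtt_budget_ms)
    print(f"Reference service: {service}, maximum distance: {km:.0f} km")
```

With these example budgets, the storage mirroring service is the reference, and its 1 ms round-trip budget corresponds to roughly 100 km of fiber.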

To better understand the evolution of DCI solutions, it is important to highlight one major difference when comparing the active/standby behavior between members of a high availability (HA) cluster and the live migration of VMs spread over multiple locations. Both software frameworks require LAN and SAN extension between locations.


3 – Network Services for Active/Active access


Global Site Load Balancing Services: To accelerate the disaster recovery service and the dynamic distribution of the workload between the primary and secondary data centers, Cisco provides different network services that optimize access and distribute user traffic to the remote sites using a Global Site Load Balancing (GSLB) solution. For a traditional Layer 3 interconnection between sites, this GSLB solution relies on three major technologies:

Intelligent Domain Name System: An intelligent Domain Name System (DNS) known as the Global Site Selector (GSS) redirects requests from end-users to the physical location where the application is active and where fewer network resources are consumed. In addition, the GSS can distribute traffic across multiple active sites in several ways: in collaboration with the local services of a server load balancing (SLB) application (for example, a Cisco Application Control Engine (ACE) deployed in each data center informs the GSS of the health of the service it offers), based on the load of the network, or in collaboration with the existing WAN edge routers in the data center (e.g. redirection based on the physical distance between the user and the application), to name only the most common functions. User traffic is then distributed accordingly across the routed WAN.

Data Replication Based on Network Load


Data Replication in Collaboration with Routers
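The site selection performed by an intelligent DNS such as the GSS can be illustrated with a short sketch. This is not the GSS algorithm: the `Site` fields, the `max_load` threshold, and the fallback order are assumptions chosen only to show how health, load, and proximity can be combined. The addresses come from documentation ranges.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    vip: str            # virtual IP returned in the DNS answer
    healthy: bool       # health state reported by the local SLB
    load: float         # 0.0 (idle) to 1.0 (saturated), reported by the SLB
    distance_km: float  # hypothetical proximity metric to the client


def select_site(sites, max_load=0.8):
    """Pick the closest healthy site that is not overloaded.

    Falls back to the least-loaded healthy site when every healthy site
    exceeds max_load, and returns None when no site is healthy.
    """
    healthy = [s for s in sites if s.healthy]
    if not healthy:
        return None
    candidates = [s for s in healthy if s.load <= max_load]
    if candidates:
        return min(candidates, key=lambda s: s.distance_km)
    return min(healthy, key=lambda s: s.load)


if __name__ == "__main__":
    sites = [
        Site("Paris", "192.0.2.10", healthy=True, load=0.95, distance_km=20),
        Site("London", "198.51.100.10", healthy=True, load=0.40, distance_km=350),
    ]
    chosen = select_site(sites)
    print(chosen.name, chosen.vip)
```

In this example the closer site is above the load threshold, so the answer points the client to the more distant but less loaded site.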


HTTP Traffic Redirection Between Sites: If local resources are insufficient, the local SLB device returns an HTTP redirection message (HTTP status code 3xx) to the end-user, so that the client's web browser is automatically and transparently redirected to the elected backup data center where resources and information are available.
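The redirection decision can be sketched as a small function. This is an illustration only: the connection-count threshold, the function name, and the backup URL are hypothetical, and 302 is used here as one example of a 3xx status code; the code a given SLB actually returns is product-specific.

```python
def handle_request(path, active_connections, max_connections, backup_base_url):
    """Return (status_code, headers) for a request arriving at the local SLB.

    Serves locally while capacity remains, otherwise answers with a
    temporary redirect that points at the same path in the backup site.
    """
    if active_connections < max_connections:
        return 200, {}
    return 302, {"Location": backup_base_url.rstrip("/") + path}


if __name__ == "__main__":
    # Hypothetical backup site URL, for illustration only.
    print(handle_request("/cart", 100, 100, "https://dc2.example.com/"))
```

The browser follows the `Location` header on its own, which is why the move to the backup data center is transparent to the user.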

Route Health Injection: Route Health Injection (RHI) provides a real-time, very granular distribution of user traffic across multiple sites based on application availability. This method is initiated by an SLB device, which informs the upstream router about the presence or absence of selected applications. The information is very accurate because it usually reflects the status of the services that the SLB device itself supports. The redirection of user requests to a remote site therefore occurs in real time.
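The principle behind RHI, advertising a route to an application only while that application is healthy, can be sketched as a reconciliation step. This does not reproduce any SLB or router implementation: the function name, the data shapes, and the use of /32 host routes from a documentation range are assumptions for illustration.

```python
def reconcile_routes(advertised, vip_health):
    """Return (to_inject, to_withdraw) host routes from health probe results.

    advertised: set of VIP prefixes currently injected into the upstream router
    vip_health: dict mapping VIP prefix -> True when its application is healthy
    """
    healthy = {vip for vip, ok in vip_health.items() if ok}
    to_inject = sorted(healthy - advertised)      # newly healthy applications
    to_withdraw = sorted(advertised - healthy)    # failed or removed applications
    return to_inject, to_withdraw


if __name__ == "__main__":
    advertised = {"192.0.2.10/32"}
    probes = {"192.0.2.10/32": False, "192.0.2.20/32": True}
    print(reconcile_routes(advertised, probes))
```

Running this step after every probe cycle keeps the advertised routes aligned with application availability, so traffic for a failed application stops being attracted to the local site.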


2 – Active/Active DC

Active-Standby versus Active-Active

A Hot Standby data center can be used for application recovery or to relieve the primary data center from a heavy workload. Relieving data center resources from a heavy workload is usually referred to as Active-Active DR mode.

Active-Active DR Mode


One example of Active/Active DR mode involves an application that is active on a single physical server or VM while the network and compute stack are active in two locations. Exceptions to this definition include specific software frameworks such as GRID computing, distributed databases (e.g. Oracle RAC®), and some cases of server load balancing (SLB). When the resources are spread over multiple locations running in Active-Active mode, some software functions are active in one location and on standby in the other. Active applications can be located in either site. This approach distributes the workload across several data centers.

It is important to clarify these Active/Active modes by considering the different levels of components, which are all related for the final recovery process:

• The network is running on each site, interconnecting all compute, network, and security services and advertising the local subnets outside each data center. Applications active in the remote data centers are therefore accessible from the traditional routed network without changing or restarting any IP processes.

• All physical compute components are up and running with their respective bare metal operating system or hypervisor software stack.

• Storage is replicated to different locations and can be seen as Active/Active for different software frameworks. However, a write command for a specific storage volume is usually sent to one location at a time while the same data is mirrored to the remote location.

Let’s take a deeper look at the service itself as offered by multiple data centers in Active/Active mode. Assume that application A in data center 1 (e.g. Paris) offers an e-commerce web portal for a specific set of items. The same e-commerce portal offering the same items can also be available and active in a different location (e.g. London), but with a different IP identifier. For the end-user, the service is unique and its location transparent, but the network services can distribute requests based on proximity criteria established between the end-user and the data center that hosts the application. The same application in this case looks Active/Active, but the software on each compute system runs autonomously in the front-end tier. The two instances are not related except from a database point of view. Finally, the whole session is maintained on the same servers and in the same location until the session is closed.
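The behavior described above, where proximity decides the site for a new session and the session then stays in that location until it is closed, can be sketched as follows. The class name, the region-to-site mapping, and the session identifiers are hypothetical; in a real deployment this stickiness typically results from DNS resolution and SLB persistence rather than from a single directory.

```python
class StickyDirector:
    """Assigns each new session to a site and keeps it there until closed."""

    def __init__(self, site_by_region):
        self.site_by_region = site_by_region  # hypothetical proximity mapping
        self.sessions = {}                    # session id -> pinned site

    def route(self, session_id, client_region):
        # Proximity is evaluated only when the session is first seen.
        if session_id not in self.sessions:
            self.sessions[session_id] = self.site_by_region[client_region]
        return self.sessions[session_id]

    def close(self, session_id):
        self.sessions.pop(session_id, None)


if __name__ == "__main__":
    director = StickyDirector({"FR": "Paris", "UK": "London"})
    print(director.route("s1", "FR"))  # new session, pinned by proximity
    print(director.route("s1", "UK"))  # same session, stays where it started
```

Once the session is closed, a later request with the same identifier is treated as a new session and is placed by proximity again.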
