The data center network architecture is evolving from the traditional multi-tier architecture, where security and network services are usually placed at the aggregation layer, into a wider-spine, flat network also known as a fabric network (‘Clos’ type), where the network services are distributed to the border leafs.
This evolution is intended to improve the following:
- Flexibility: allows workload mobility everywhere in the DC
- Robustness: while dynamic mobility is allowed to any authorised location of the DC, the failure domain is contained to its smallest zone
- Performance: full cross-sectional bandwidth (any-to-any), with all possible equal-cost paths between two endpoints active
- Deterministic latency: fixed and predictable latency, with the same hop count between any two endpoints, independently of scale
- Scalability: add as many spines as needed to increase the number of servers while maintaining the same oversubscription ratio everywhere inside the fabric
Although most of the qualities above have already been successfully addressed by the traditional multi-tier architecture, today’s data centres are experiencing an increase in East-West data traffic that is the result of:
- Adoption of a new software paradigm of highly distributed resources
- Server virtualization and workload mobility
- Migration to IP-based storage (NAS)
The above is responsible for the recent interest in deploying multi-pathing and segment-ID-based protocols (FabricPath, VxLAN, NVGRE). The evolution of network and compute virtualization is also creating a new demand for segmentation inside the same fabric, to support multiple subsidiaries or tenants on top of a shared physical infrastructure. This helps IT managers save money by reducing their CAPEX while shortening the time it takes to deploy a new application.
This raises a couple of important questions regarding DC Interconnect (DCI), such as:
- Is my DCI still a valid solution to interconnect the new fabric-based DCs?
- Can I extend the network overlay deployed inside my DC fabric to interconnect multiple fabric networks and form a single logical fabric?
Before elaborating on those critical questions in a separate post, it is important to understand, through this article, why and how the intra-DC network is changing into a fabric network design.
- On the application side, the new software paradigm of highly distributed resources is emerging very quickly. Consequently, modern software frameworks such as Big Data (e.g. Hadoop), used to collect and process huge amounts of data, are more and more widely deployed in large enterprise DCs.
- This massive dataset processing directly impacts the amount of data to be replicated for business continuity.
- The growth of virtualisation allows enterprises to accelerate the use of application mobility for dynamic resource allocation or for maintenance purposes, with zero interruption for the business.
- We also need to pay attention to SDN applications and how services might be invoked across different elements (compute, network, storage, services) via their respective APIs.
- On the physical platform side, with the significant evolution of CPU performance as well as increased memory capacity at lower cost, the (*)P2V ratio has exceeded 1:100, resulting in a utilisation of the access interfaces connecting physical hosts that is close to 100% (at peaks) of their full 10 Gbps capacity.
(*) P2V is the theoretical capacity of converting a physical server into a certain number of virtual machines, the ratio being the number of virtual entities per physical device. In a production environment, the P2V ratio for applications is very conservative, usually under 1:40. This is mainly driven by the number of vCPUs that an application has been qualified for. However, the ratio for virtual desktops is usually higher.
Consequently, the P2V ratio directly impacts the oversubscription ratio of the bandwidth between the access layer and the aggregation (distribution) layer, which has evolved from the historical 20:1 to 10:1 a few years back, and more recently down to 1:1. This drives the total bandwidth of the uplinks between the access layer and the aggregation layer to be close or equal to the total bandwidth of all active access ports that serve the physical hosts.
- With the implication of massive dataset processing, the I/O capacity of the servers must be boosted. There is therefore an increasing integration of DAS inside the server to address spindle-bound performance, including SSD for fast-access storage, and NAS for long-term, large-capacity storage.
All these changes are increasing server-to-server network communication inside the DC, commonly called East-West (E-W) traffic. The outcome is that, while we have been focusing on North-South (N-S) flows for years, with the quick adoption of virtualisation and Big Data the E-W flow has already exceeded the N-S traffic, and it is expected to represent more than 80% of the traffic flows inside the DC by the end of this year.
Let’s take the following example of a large DC. We will use 480 physical servers to keep the math simple. There are different options to consider, but for the purpose of this example we assume 24 physical servers per rack and each ToR switch populated at 50%, implying 24 active 10GE interfaces per device. This deployment requires 6 x 40GE uplinks per ToR switch (24 x 10GE = 240 Gbps = 6 x 40GE, i.e. a 1:1 ratio), load-distributed to the upstream aggregation devices. Considering that servers are usually dual-homed to a pair of ToR switches for high availability, twice as many links must be provisioned. On the server side, HA is configured in Active/Standby fashion (NIC teaming in Network Fault Tolerant mode), hence the oversubscription ratio does not change. That gives us a total of 20 racks with 2 ToR switches per rack. Theoretically, we would need a total of 12 x 20 = 240 40GE uplinks distributed to the aggregation layer. With a pair of aggregation switches, each one needs to support 120 x 40GE at line rate.
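The arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the function and variable names are illustrative, and the pair of aggregation switches is an assumption based on the classic two-device aggregation design.

```python
import math

def oversubscription_ratio(active_ports, port_gbps, uplinks, uplink_gbps):
    """Downlink bandwidth divided by uplink bandwidth (1.0 means 1:1)."""
    return (active_ports * port_gbps) / (uplinks * uplink_gbps)

def uplinks_needed(active_ports, port_gbps, uplink_gbps, target_ratio=1.0):
    """Smallest number of uplinks that keeps the ratio at or below target_ratio."""
    downlink_gbps = active_ports * port_gbps
    return math.ceil(downlink_gbps / (uplink_gbps * target_ratio))

# Figures taken from the example in the text.
servers = 480
servers_per_rack = 24
tors_per_rack = 2          # servers are dual-homed to a pair of ToR switches
aggregation_switches = 2   # assumption: classic pair of aggregation devices

racks = servers // servers_per_rack
uplinks_per_tor = uplinks_needed(active_ports=24, port_gbps=10, uplink_gbps=40)
uplinks_per_rack = tors_per_rack * uplinks_per_tor
total_uplinks = racks * uplinks_per_rack
uplinks_per_aggregation = total_uplinks // aggregation_switches

print(f"racks: {racks}")
print(f"40GE uplinks per ToR: {uplinks_per_tor}")
print(f"40GE uplinks per rack: {uplinks_per_rack}")
print(f"total 40GE uplinks: {total_uplinks}")
print(f"40GE ports per aggregation switch: {uplinks_per_aggregation}")
print(f"ratio per ToR: {oversubscription_ratio(24, 10, uplinks_per_tor, 40)}:1")
```

Relaxing `target_ratio` to 10 or 20 shows how few uplinks the historical oversubscription ratios required for the same 24 access ports.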
There have been several discussions around 40GE-SR deployment and its high impact on CAPEX, due to the specific requirement for a new 12-fibre ribbon with MPO connectors for each 40GE-SR fibre link. There is actually a great alternative with the 40GE SR-BiDi QSFP, which allows the IT manager to migrate from 10GE to 40GE over the existing fibre (both the 10GE MMF cable infrastructure and the patch cables with the same LC connectors), removing this barrier.
Historically, the role of the aggregation layer has been the following:
- It aggregates all southbound access switches, usually maintaining the same Layer 2 broadcast domains stretched across all access PoDs. This has been used to simplify the provisioning of physical and virtual servers.
- It is the first-hop router (a.k.a. default gateway) for all inside VLANs. This function can also be provided by the firewall or the server load balancer, which are usually placed at the aggregation layer.
- It provides attachment to all security and network services.
- It provides the L2 and L3 data centre boundary and the gateway to the outside world through an upstream core layer.
The placement of those functions has been very efficient for many years, because most of the traffic flows used to be North-South, essentially driven by not-so-mobile bare-metal servers as well as traditional software frameworks.
With the evolution of the DC, it is not necessarily optimal to centralise all these network services in the aggregation layer. The distribution spine (formerly known as the aggregation layer) can stay focused on switching packets between leafs (f.k.a. access switches).
Note: This role could be compared to that of the provider router “P” in an MPLS core, which is only responsible for switching the labels used to route packets between “PE” routers.
With the ability to chain network services (physical and virtual service nodes) using policy-based forwarding decisions, we can move the network services down to the leaf. An optional change emerging with the new generation of DC fabric networks is that the Layer 2 / Layer 3 boundary moves down to the leaf layer. In addition, the Layer 2 failure domain tends to be confined to its smallest zone, reducing the risk of polluting the whole network fabric. This results in an anycast gateway function that distributes the same default gateway service to multiple leafs, while maintaining the same level of flexibility and transparency for workload mobility. Finally, we can move the DC boundary behind a border leaf (the boundary of the fabric network), so that only the traffic to and from outside the network fabric goes through it.
With these functions removed from the aggregation layer, it is no longer limited to two tightly coupled devices, as they no longer need to work as a pair. On the contrary, additional spine switches can be added for scalability and robustness.
Back to the previous example: the new generation of multi-layer switches and line cards offers very large fabric capacity with 40GE interfaces at line rate, which keeps the traditional multi-tier architecture valid. However, it is important to keep the failure domain as small as possible, making sure that no failure or maintenance work impacts the performance of the whole fabric network. If one spine switch is removed from the fabric for any reason, an N+2 design is certainly preferable to a single remaining switch handling all the business traffic. Finally, when looking at very large scale, the wider-spine approach makes it easier to maintain the same oversubscription ratio while adding new racks of servers.
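The effect of losing one spine can be illustrated with a very small calculation. This is a simplified sketch that assumes every leaf spreads its uplinks evenly across all spines and that all spines are identical; the function name is illustrative.

```python
def remaining_fabric_capacity(spines: int, failed: int = 1) -> float:
    """Fraction of leaf-to-spine bandwidth left after `failed` spines are removed.

    Assumes identical spines and uplinks spread evenly across them.
    """
    if spines <= 0 or not 0 <= failed <= spines:
        raise ValueError("need spines > 0 and 0 <= failed <= spines")
    return (spines - failed) / spines

for spines in (2, 4, 8):
    left = remaining_fabric_capacity(spines)
    print(f"{spines} spines, 1 removed: {left:.1%} of the capacity remains")
```

With a traditional pair of devices, one failure halves the available bandwidth, whereas in a wider spine the same event removes only a small share of it.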
It may also be worth looking at a possibly lower CAPEX: small spine devices are expected to cost less than a large modular switch.
Hence, what are the risks of keeping the traditional hierarchical multi-tier design?
Obviously it depends on the enterprise's application types and business requirements; the traditional multi-tier design is still valid for many enterprises. But if, as an IT administrator, you are concerned about the exponential growth of E-W flows and a large number of physical servers, you may encounter some limitations:
- Limited number of active Layer 3 network services (although some services, like the default gateway, may run active/active using vPC)
- Single connection point for all N-S and E-W traffic, with the risk of being oversubscribed
- Unpredictable latency due to multiple hops between two endpoints
- Shared services hosted at the same aggregation layer affect all resources: e.g. with all SVIs/default gateways configured at the aggregation layer, a switch control-plane resource issue (e.g. a DDoS attack) can impact all server-to-server traffic. Thousands of centralised VLANs may also strain network resources such as logical ports.
It’s possible for a multi-tier DC architecture to evolve seamlessly into a ‘Clos’ fabric network model. It can be gradually enabled on a per-PoD basis without impacting the core layer and the other PoDs. One option is to build a ‘fat-tree Clos’ fabric network with 2 spines and a few leafs to start with, and add more leafs and spines as the compute stack grows.
This seamless migration to the fabric network can be achieved either:
- by enabling a Layer 2 multi-pathing protocol such as FabricPath on an existing infrastructure.
- by deploying new switches or upgrading existing modular switches with new line cards that support VxLAN Layer 2 and Layer 3 gateways in hardware.
Both FabricPath and VxLAN offer a network overlay on top of a multi-pathing underlay network. vPC can coexist with either option if needed, although the trend is to move it to the leaf for all dual-homed devices (FEX, servers, etc.). Consequently, whichever protocol is used to build the Clos fabric network, STP between the spine and leaf nodes is removed. All paths become forwarding, breaking the barrier of the 2 traditional uplinks offered by Multi-chassis EtherChannel (MEC).
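To illustrate what "all paths forwarding" means in practice, here is a minimal sketch of per-flow load sharing over equal-cost paths. It is not how any particular switch implements it: real platforms use their own hardware hash functions and fields, and the SHA-256 hash, the 5-tuple layout and the spine names below are assumptions chosen for illustration.

```python
import hashlib

def pick_spine(flow, spines):
    """Map a flow (5-tuple) to one of the equal-cost spines.

    Hashing the flow keeps all packets of one flow on the same path
    (no reordering) while different flows spread across every spine.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return spines[int.from_bytes(digest[:4], "big") % len(spines)]

spines = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical names

# 1000 hypothetical TCP flows between two servers, differing by source port.
flows = [("10.0.1.10", "10.0.2.20", 6, 10000 + i, 443) for i in range(1000)]

usage = {spine: 0 for spine in spines}
for flow in flows:
    usage[pick_spine(flow, spines)] += 1

for spine, count in usage.items():
    print(f"{spine}: {count} flows")
```

Unlike a two-uplink MEC, the list of spines can grow to any number of equal-cost next hops, and the same selection logic keeps spreading flows across all of them.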
Scalability and segmentation for multi-tenancy are improved by the hardware-based network overlay, which supports a 24-bit segment identifier (e.g. with FabricPath or VxLAN).
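The gain from a 24-bit identifier is easy to quantify against the 12-bit VLAN ID of 802.1Q. The sketch below also shows where the 24-bit VNI sits in the 8-byte VXLAN header as laid out in RFC 7348. The function names are illustrative, and only the VXLAN header is modelled, not the outer Ethernet/IP/UDP encapsulation.

```python
import struct

VLAN_ID_BITS = 12      # 802.1Q VLAN ID
SEGMENT_ID_BITS = 24   # VXLAN Network Identifier (VNI)

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag in the first byte of the header

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags, 24 reserved bits, VNI, 8 reserved bits."""
    if not 0 <= vni < 2 ** SEGMENT_ID_BITS:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def vxlan_vni(header: bytes) -> int:
    """Extract the VNI from an 8-byte VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header)
    return vni_word >> 8

print(f"12-bit VLAN IDs:    {2 ** VLAN_ID_BITS:>10,}")
print(f"24-bit segment IDs: {2 ** SEGMENT_ID_BITS:>10,}")
print(vxlan_header(5000).hex())
```

The jump from roughly four thousand to roughly sixteen million segments is what makes per-tenant segmentation practical on a shared fabric.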
Inter-VLAN routing can be performed at each fabric leaf node. The same gateway for a given subnet is distributed dynamically to every leaf; hence, on each leaf, a local gateway can exist for each given subnet. This is not mandatory, but it can be enabled with an enhanced distributed fabric architecture using FabricPath (DFA) and VxLAN (near future, stay tuned). It’s a valid option for keeping Layer 2 broadcast traffic confined to the local leaf, thus preventing flooding across the fabric network. To conserve switch resources such as logical ports, VLANs, subnets and SVIs can be created dynamically and on demand, just as needed.
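The idea of a distributed gateway with on-demand SVIs can be modelled in a few lines. This is a toy model only, not the behaviour of any specific product: the class, the addresses and the VLAN numbers are all hypothetical, and a real fabric would drive this from its control plane.

```python
from dataclasses import dataclass, field

# Hypothetical fabric-wide definition: VLAN -> (gateway IP, gateway MAC).
# Every leaf uses the same values, which is what makes the gateway "anycast".
ANYCAST_GATEWAYS = {
    100: ("10.1.100.1", "00:00:5e:00:53:01"),
    200: ("10.1.200.1", "00:00:5e:00:53:01"),
}

@dataclass
class Leaf:
    name: str
    svis: dict = field(default_factory=dict)   # VLAN -> gateway instantiated locally
    hosts: dict = field(default_factory=dict)  # host name -> VLAN

    def attach(self, host: str, vlan: int):
        """Attach a host; create the local SVI only if it does not exist yet."""
        if vlan not in self.svis:
            self.svis[vlan] = ANYCAST_GATEWAYS[vlan]
        self.hosts[host] = vlan
        return self.svis[vlan]

    def detach(self, host: str) -> int:
        """Detach a host; release the SVI when no local host uses the VLAN."""
        vlan = self.hosts.pop(host)
        if vlan not in self.hosts.values():
            del self.svis[vlan]
        return vlan

def move(host: str, src: Leaf, dst: Leaf):
    """Move a workload between leafs; its default gateway does not change."""
    return dst.attach(host, src.detach(host))

leaf1, leaf2 = Leaf("leaf-1"), Leaf("leaf-2")
gateway_before = leaf1.attach("vm-a", 100)
gateway_after = move("vm-a", leaf1, leaf2)
print(gateway_before == gateway_after, leaf1.svis, leaf2.svis)
```

In this model a leaf only consumes an SVI for the subnets that actually have local hosts, and a moved workload keeps the same gateway IP and MAC wherever it lands.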
In summary, as of today, the Clos network fabric is becoming very interesting for IT managers concerned about the exponential increase of E-W traffic, the rapid growth of servers and virtualisation, and the massive amount of data to process. The network fabric gives better utilisation of the network resources with all links forwarding, it distributes the network service functions closest to the applications, and it maintains the same predictable latency for any traffic. It allows the fabric capacity to be increased as needed by simply and transparently adding spines and leafs.
That gives a short, high-level overview of the DC network fabric and why we need to think about this emerging architecture.
I recommend reading this excellent paper from Mohammad Alizadeh and Tom Edsall, “On the Data Path Performance of Leaf-Spine Datacenter Fabrics”.