29 – Interconnecting two sites for a Logical Stretched Fabric: Full-Mesh or Partial-Mesh

This post discusses design considerations when interconnecting two tightly coupled fabrics using dark fiber or DWDM, although it is not limited to metro distances. For very long distances, the point-to-point links can also be established using a virtual overlay such as EoMPLS port cross-connect; the debate remains the same either way.

Note that this discussion is not limited to one type of fabric transport; any solution that relies on multi-pathing is concerned, such as FabricPath, VxLAN or ACI.

Assuming the distance between DC-1 and DC-2 is about 100 km, the following two design options may look simple to compare, but guessing which one is the most efficient is not as obvious as we might think, and a bad choice can have a significant impact on some applications. I have met several networkers debating the best choice between full-mesh and partial-mesh for interconnecting two fabrics. Some consider full-mesh to be the best solution. Actually, although it depends on the distance between the fabrics, it is certainly not the most efficient design option for interconnecting them.

Full-Mesh and Partial-Mesh

Partial-mesh with transit leafs design (left) and full-mesh (right)

For this thread, we assume the same distance between the two locations for both design options (let's take 100 km to keep the math simple). On the left side, we have a partial-mesh design with transit leafs interconnecting the two fabrics. On the right side, we have a full mesh established between the two fabrics.

With a full-mesh approach, an inter-fabric flow takes a single hop to reach a remote host. As a result, the total latency induced with a full mesh is theoretically 1 ms (100 km) between the two remote end-points, plus the few microseconds needed to traverse the spine in DC-2 (one hop).


Full-Mesh

 

However, if the two locations are interconnected using transit leafs, the traffic between the two tiers takes three hops. In this case, the theoretical latency adds a few microseconds for the traffic crossing the spine switch in DC-1, plus 1 ms (100 km), plus a few more microseconds through the remote transit leaf and the spine switch in DC-2.


Partial-Mesh using Transit Leaf
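
To put rough numbers on the two designs, here is a minimal back-of-the-envelope sketch in Python. It assumes ~5 μs of propagation per km of fiber (see the latency discussion in the comments below) and an illustrative 2 μs of forwarding delay per switch hop; the per-hop figure and the hop counts are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope latency comparison: full-mesh vs. transit leaf (partial mesh).
# Assumptions (illustrative only): ~5 us of propagation per km of fiber,
# ~2 us of forwarding delay per switch hop.

FIBER_US_PER_KM = 5   # light in glass travels at roughly 200,000 km/s
HOP_US = 2            # per-switch forwarding delay (assumed)

def one_way_us(distance_km, hops):
    """One-way latency between two remote end-points."""
    return distance_km * FIBER_US_PER_KM + hops * HOP_US

DISTANCE_KM = 100

# Full-mesh: local leaf -> remote spine -> remote leaf (1 intermediate hop)
full_mesh = one_way_us(DISTANCE_KM, hops=1)

# Partial mesh: local leaf -> local spine -> remote transit leaf -> remote spine -> remote leaf (3 hops)
transit_leaf = one_way_us(DISTANCE_KM, hops=3)

print(f"Full-mesh    : {full_mesh} us one-way, {2 * full_mesh / 1000:.3f} ms round trip")
print(f"Transit leaf : {transit_leaf} us one-way, {2 * transit_leaf / 1000:.3f} ms round trip")
# The extra hops cost a few microseconds, negligible next to the ~500 us of fiber each way.
```

In both cases the fiber dominates; the real difference between the two designs shows up in where ECMP is allowed to send the purely local flows, as discussed next.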

 

With the partial mesh, each transit leaf interconnects with the remote spines, and the Equal Cost Multi-Pathing (ECMP) algorithm distributes the traffic among all possible equal-cost paths. Server-to-server (E-W) workflows between hosts in the same location (e.g. the Web/App tier and the DB tier) are therefore confined locally, with a maximum of one hop. A flow sent to the remote location via the DCI connection would consume two additional hops. As a result, ECMP never elects the links toward the remote spines for these local flows, and the traffic between the application tiers stays contained inside the same location (DC-1).


In a Partial-mesh design, ECMP contains the local workflow within the same location
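
As a small illustration of that election process, the sketch below models ECMP candidate selection by hop count for a flow between two leafs inside DC-1; the paths and costs are hypothetical and simply mirror the hop counts described above.

```python
# Sketch: ECMP only load-balances across the paths that share the LOWEST cost.
# Hypothetical candidate paths (hop counts) for a leaf-to-leaf flow inside DC-1.

candidate_paths = {
    "via DC1-spine-1": 1,                  # local spine
    "via DC1-spine-2": 1,                  # local spine
    "via transit leafs and DC2 spine": 3,  # detour through the remote fabric
}

best_cost = min(candidate_paths.values())
ecmp_set = [path for path, cost in candidate_paths.items() if cost == best_cost]

print(ecmp_set)
# ['via DC1-spine-1', 'via DC1-spine-2'] -- the remote detour is never elected,
# so local East-West traffic stays inside DC-1.
```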

 

However, if the two locations were interconnected using a full-mesh design, nothing would prevent the ECMP hashing from hitting a remote spine, since the number of hops is the same toward the local and the remote spines.

As a result, a full mesh will theoretically add 2 ms (2 x 100 km) for a portion of the traffic.


In a Full-mesh design, ECMP will distribute a portion of the local traffic via the remote fabric
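
The sketch below illustrates the effect with a toy per-flow hash over four equal-cost spines (two local, two remote); the addresses, ports and hash function are arbitrary choices for the illustration, not any vendor's actual ECMP implementation.

```python
# Sketch: in a full mesh, local and remote spines are all equal-cost next hops,
# so per-flow hashing sends a share of purely local DC-1 traffic via DC-2.
import random
import zlib

NEXT_HOPS = ["DC1-spine-1", "DC1-spine-2", "DC2-spine-1", "DC2-spine-2"]

def ecmp_next_hop(flow_tuple):
    # Toy hash of the 5-tuple onto one of the four equal-cost next hops
    return NEXT_HOPS[zlib.crc32(repr(flow_tuple).encode()) % len(NEXT_HOPS)]

random.seed(1)
flows = [("10.1.1.10", "10.1.2.20", 6, random.randint(1024, 65535), 443)
         for _ in range(10_000)]

hair_pinned = sum(1 for f in flows if ecmp_next_hop(f).startswith("DC2"))
print(f"{hair_pinned / len(flows):.0%} of local DC-1 flows hair-pinned through DC-2")
# Expect roughly 50%: each of those flows pays ~2 ms extra (2 x 100 km of fiber).
```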

 

If we compare the full-mesh and the partial-mesh designs, the latency induced by the distance between the sites is the same (we can ignore the few microseconds due to the extra hops). With the partial mesh, the communication between two local end-points is contained within the same fabric. With the full mesh, although it saves a few microseconds with a maximum of one hop between two end-points, there is no guarantee that the traffic between two local end-points will always stay within the same location. The risk is that, for the same application, some flows take a local path (with negligible latency) while others take the remote path (with n ms added by the hair-pinning via the remote spine switch).

Key considerations

  • With long distances between two fabrics, E-W communication between two local end-points may impact application performance with a full-mesh design.
  • Any network transport that uses multi-pathing is concerned, such as FabricPath, VxLAN or ACI.
  • A full mesh provides a single hop between two remote end-points. However, it also lets ECMP distribute the E-W flows equally across the two sites; as a result, half of the local traffic may be impacted by the latency induced by the distance.
  • A transit leaf design imposes three hops between two remote end-points; however, the latency induced by the local hops in each site is insignificant compared to the DCI distance. A transit leaf design (partial mesh) contains the E-W flows within the fabric local to each site.

As a global conclusion, for a single logical fabric geographically stretched across two distant physical locations, when distances go beyond campus distances, it may be important to deploy a transit leaf model.

All that said, it is certainly unlikely that many enterprises will deploy a full-mesh interconnection between fabrics over long distances, but nothing is impossible; it depends on who owns the fiber :)

 

 


8 Responses to 29 – Interconnecting two sites for a Logical Stretched Fabric: Full-Mesh or Partial-Mesh

  1. smarunich says:

    In theory, we always assume we require an edge/border/transit leaf, but in reality it is an extra device with multiple capabilities: extra buffers to manage the DC1-to-DC2 over-subscription, bigger routing and MAC address tables to manage external fabric connections in multiple scenarios, or advanced services like LISP. It's hard to generalize, but the edge/border/transit leaf is usually something we assume should handle many more services and functions than a regular leaf.

    From another standpoint, we assume that spines are simple forwarders with limited capabilities, but eventually spines will take over the space of legacy equipment in data centers. By space I mean the cabinets where the current distribution layer sits, with access both to internal facilities like ToR switches and to external facilities like DWDM, dark fiber and other external-facing connections.

    That said, I like your general idea of a transit node, but from a hands-on view we will never have that much dark fiber or DWDM resources between data centers, especially if we manage at least 3:1 OSM 10G (where you need at least 4 uplinks from the leafs); also, the risk of having only 2 spines (a 50% bandwidth drop-off) is probably not something any shop is going to take. It could be a long discussion, but let me share a few thoughts and questions.

    Spines currently running over 500 ports (10GE, 3:1 OSM) will run in chassis format with redundant fabrics, Sups if required, line cards, etc., so the spine layer by itself will have an extra level of redundancy. A Border Spine then appears to be a possible option for DCI, as we don't need extra hardware to support this capability: the current spines already have a significant level of redundancy and the chassis functionality is within reach as well. True, the requirements on the spine are raised, such as VTEP termination support, LISP, OTV or other DCI services, but that's only two or three extra features compared to running a separate pair of devices.

    From a hop perspective, it's DC1-LEAF1 > DC1-SPINE1 > DC2-SPINE2 > DC2-LEAF1, which is even better than the worst-case scenario: DC1-LEAF1 > DC1-SPINE > DC2-EDGE-LEAF1 > DC2-SPINE1 > DC2-LEAF1.

    It would be very interesting to hear your test stories and opinion about the Border Spine concept, as it looks like something a lot of customers are looking for.

    • Yves says:

      Thank you for your comments. I updated the title to better reflect this scenario. I was focusing on the stretched fabric, which is the reason I was talking about Transit Leafs and not Border Leafs nor Border Spines in this thread. We may want to enable services to connect the fabric to the outside through the same Transit Leafs if we wish, or through other Border Leafs or Border Spines. Those Border Leafs or Border Spines will connect to network services (Core Router, DCI edge, OTV, LISP, FW, IPS, SLB, etc.). Then we need to pay attention to the scalability (L2 and L3 tables) at the border edge, as well as to how network services are treated (Anycast Gateway, ARP suppression, etc.).

      smarunich >> … we assume that spines are simple forwarders with limited capabilities
      You are correct; we assume here, for the stretched fabric, that spine switches are dedicated to distributing traffic among the leafs of interest, with limited capabilities. Depending on the fabric transport, we will bring additional control-plane functions such as MP-BGP EVPN, Route Reflector (VxLAN/EVPN) or COOP (ACI), which are not discussed in this post. For this particular stretched Clos purpose, I am not terminating any overlay tunnel for the data plane at the spine layer.
      However, due to the original physical cable plant (e.g. renewing a brownfield DC), I agree it may be required to enable Border functions at the spine switches.

      smarunich >> … we will never have that much dark fiber between data centers
      That's true for most deployments, I agree; nonetheless, I know several large universities, town halls and airports, with buildings distributed across wide cities and many fibers available between locations.

      smarunich >> … also, the risk of having only 2 spines (a 50% bandwidth drop-off) is probably not something any shop is going to take.
      Can you please elaborate a bit more? I'm not sure I got your point in this stretched fabric scenario. I tend to agree with the concept of losing 50% of the bandwidth, but it depends on the oversubscription ratio and on how the enterprise deploys/moves applications across multiple locations (E-W traffic).

      smarunich >> … a Border Spine then appears to be a possible option for DCI, as we don't need extra hardware to support this capability.
      It depends which functions/services need to be initiated from there. For a double-sided vPC-based DCI (back-to-back), or for Layer 3 DCI routing functions, that's certainly true: we can enable these functions within the same platform (platform dependent, though). But some advanced functions (especially when re-encapsulation is required, such as OTV, LISP or PBB-EVPN) may require an additional device or Virtual Device Context to be activated. And that depends on the transport protocol (FabricPath, VxLAN).

      smarunich >> … it's DC1-LEAF1 > DC1-SPINE1 > DC2-SPINE2 > DC2-LEAF1.

      It’s also an option, but certainly not for all kind of stretch fabrics. For two independent fabrics (eg. 2 VxLAN domains) interconnected via the Border Spine, it may works, but we may face some limitation with L3 routing.

      smarunich >> … it would be very interesting to hear your test stories and opinion about the Border Spine concept.
      Fully agree, and that's a long debate, difficult to elaborate in a few lines without any drawings, examples, etc. It's not so obvious; it mainly depends on which network services and service optimizations need to be supported. I will try to make a post on this subject ASAP (I'm going to test this scenario (Spine-to-Spine) in July).

      Thank you, yves

      • smarunich says:

        Thank you Yves!
        As for losing 50% of the bandwidth, I meant the fabric itself (the fewer spines you have, whatever OSM you choose, the more dramatic the effect of losing a spine on the network state). It does not touch the transit node discussion.

        Now, if we combine a 4-spine scenario with a transit node and assume 3:1 OSM, the number of uplinks for the transit nodes goes up to 4 in a partial mesh, and much higher in a full mesh. I agree with you that for some customers it could be an option, especially customers using DWDM systems, but the cost will probably be a show stopper.

        Thank you so much for your comments, and I'm looking forward to seeing new posts on your blog. Meanwhile I will be testing the Spine-to-Spine topology as well; with that in mind, could you provide details on which hardware and software you will primarily and secondarily target for your tests?

        Thanks,
        Sergey

        • Yves says:

          Thank you Sergey,

          I got your point. Indeed, technically nothing prevents the enterprise from increasing the number of spine switches for the intra-fabric as well as for the inter-fabric. Unfortunately, that will have a high impact on CAPEX.
          For the intra-fabric, in addition to the point you made about the bandwidth impact, adding spine switches also improves the “Flow Completion Time” (one of the newer measurements used to capture DC efficiency). In short, when spine utilization goes up, the FCT goes up (consuming resources), and as a result TCP flows take longer to go from one leaf to another. Consequently, a high FCT may have an impact on application efficiency. It is certainly worth reviewing the breakout session from Loy Evans on Designing DC for Midsize Enterprises (BRKDCT-2218).
          As for the stretched fabric across two locations with Border Spine switches as a pure L3 platform, we are not physically limited to two pairs of devices (as a dual-sided vPC (back-to-back) would dictate).

          For the test, I’m using Nexus 9500 series (Spine) and Nexus 9300 (Leafs). For the software, as of now, I’m using 7.0(3)I1(1a), but I should move to the next rel soon.

          Kind regards, yves

  2. Thanks Yves for your post and your excellent point about ECMP and full-meshed stretched fabrics.
    Two questions remain though:
    – how do you compute the 1 ms delay for 100 km, considering very different interconnection approaches (dark fiber/DWDM vs. EoMPLS)?
    – could you briefly elaborate on the ACI transport mechanism? You mention “COOP” in a comment, but it seems that this stands for “cooperative key server protocol” in ACI terminology.

    • Yves says:

      Hello Jean-Christophe,

      1 ms delay for 100 km:
      I should have clarified this statement from the beginning of this post :(
      Actually, it is commonly accepted that it takes 1 ms to cross 100 km of fiber, but how was this number computed?
      Theoretically, in a vacuum, the speed of light is about 300,000 km/s.
      Practically, when the light crosses a glass material (fiber), it is generally admitted that the speed is reduced to about 200,000 km/s. It is sometimes quoted as 5 μs per km (or 8 μs per mile).
      Consequently, that gives an average of 1 ms for the light to traverse 200 km of fiber. However, this is the time it takes for the light to travel from point A to point B, in one single direction!
      In our networking world, we need to consider the round trip established by the communication protocol between the two end-points carrying the data, hence the 100 km traversed in 1 ms. Does that make sense?

      Just a slight additional note: when we talk about synchronous data replication, the distance covered in 1 ms becomes 50 km. The reason is that in synchronous mode, there are two round trips:
      1) The first round-trip is required to check if the remote end-point is ready to receive data (e.g. remote storage disk)
      – receiver ready?
      – sends response…
      2) The second round-trip is used to send the data
      – send data…
      – wait for Ack <== this ack validates that the data replicated in the remote volume is exactly the same as the data in the local disk (required for RPO=0)
      And that explains why, on the storage side, it is sometimes said that it takes 2 ms to cover 100 km with synchronous replication.
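
      For completeness, here is the same arithmetic as a tiny sketch in Python, assuming 5 μs per km of fiber and ignoring switch and storage-array processing time:

      ```python
      # Latency from distance and number of protocol round trips,
      # assuming 5 us per km of fiber (~200,000 km/s in glass); processing time ignored.
      US_PER_KM = 5

      def latency_ms(distance_km, round_trips=1):
          one_way_us = distance_km * US_PER_KM
          return 2 * one_way_us * round_trips / 1000

      print(latency_ms(100))                  # 1.0 -> one round trip over 100 km
      print(latency_ms(100, round_trips=2))   # 2.0 -> synchronous replication over 100 km
      print(latency_ms(50, round_trips=2))    # 1.0 -> why 1 ms maps to ~50 km in sync mode
      ```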

      Except for very long distances over IP, most interconnection transports rely on fiber. Consequently, regardless of the approach used to interconnect the remote data centers (dark fiber, DWDM, VPLS, EoMPLS, OTV, PBB-EVPN), you should consider the same latency computed from the distance, as most hardware platforms perform the encapsulation at line rate (dedicated ASICs). Well, for that last point, I am assuming that the enterprise will make the right platform choice for a DCI requirement.

      ACI and COOP
      COOP stands for “Council Of Oracles Protocol”
      COOP is used by the leaf nodes to communicate information about their locally connected end-points toward a centralized database (inside the spines), which in turn synchronizes the information amongst the switches of the ACI fabric. COOP is responsible for holding the truth about all existing end-point information and locations at any time within the fabric (hence “Council of Oracles”).

  3. Thanks Yves.
    1 ms for 100 km:
    This is the best-case scenario (dark fiber/DWDM), and you won't get that delay if your DCI is over EoMPLS.

    ACI and COOP:
    Thanks for the clarification; if I understand correctly, this is the proprietary version of the BGP EVPN VXLAN control plane, right?

    • Yves says:

      Hi Jean-Christophe.

      I would not say that COOP is a proprietary version of BGP EVPN VxLAN.

      I agree, though, that MP-BGP/EVPN is one option to dynamically discover the end-points and to populate the host information amongst all switches that belong to the same fabric. That's the preferred choice for a traditional network overlay transport such as VxLAN. But it's not plug-and-play, and you need a strong networking background to set up, monitor and troubleshoot the MP-BGP/EVPN control plane.

      COOP is more efficient in the sense that it has been designed from the ground up to focus on end-point information and its synchronization across the ACI fabric. In addition to its efficiency, it's plug-and-play: the network manager doesn't have to configure it or care about it. That's the reason you won't see much documentation on COOP; it is an internal mechanism within the fabric.
      However, ACI uses MP-BGP to establish the network peering between the spines and the leafs, in order to distribute external routes inside the fabric and to propagate the public networks of the different tenants outside the ACI fabric. Nonetheless, like COOP, BGP is fully transparent from a configuration point of view: there is no BGP peering, no RD, no import/export RT, no VPNv4 AF, no route-maps for route redistribution, etc., nothing to configure within the fabric. It is just required to enable BGP when L3 outside connections are needed, by giving an AS number and specifying where the Route Reflectors are enabled. What needs to be configured afterward is the ACI border leaf, using OSPF or BGP to connect an L3_out/VRF to an external router.

      Hope that clarifies the function of COOP.

      kind regards, yves
