In this instalment, we’ll look at what you need to consider when planning your Hardware VTEP deployment. While I’ll be using Brocade VCS as the HW VTEP for this post, some of this info should be applicable to other vendors’ solutions.
Note that the scope of this discussion is quite extensive. This blog post is not meant to cover all intricacies of your design process. Its purpose is to give you a high level understanding of what’s involved, and to help you formulate better questions. With that out of the way, let’s carry on.
Brocade’s HW VTEP (a.k.a. “overlay gateway”) functionality is available on two model lines of VDX switches: the 6740 and the 6940. These switch families use different chipsets, with the 6940 having the more capable one. Because of this, a 6940-based HW VTEP has more flexible network placement options.
Let’s consider how HW VTEP options change for the most common DC network configurations.
1: Brocade VCS fabric as your DC network
You have a setup with spine and leaf switches in a Clos topology. All switches are part of a single VCS fabric. This fabric can contain any mix of 6740, 6940, and 8770 VDX switches.
If your fabric has one or more 6940 switches, you can enable HW VTEP functionality without having to add any new devices. It does not matter where in the fabric your 6940s are – leaf, spine, or border leaf.
You create an overlay-gateway in your VCS, and add rbridge IDs of one to four of your 6940s to it. These switches don’t have to be directly connected to your VLAN-based equipment that needs to talk to VXLANs, and they don’t need to have direct physical links between themselves.
VLAN traffic to be bridged to VXLAN will travel through the VCS fabric from a VLAN-tagged port over TRILL to one of the 6940 nodes, where it will be re-encapsulated into VXLAN and sent toward the NSX infrastructure over IP.
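To make the re-encapsulation step concrete, here is a minimal Python sketch of the VXLAN header format from RFC 7348 – the 8 bytes that get prepended to the original Ethernet frame before it is wrapped in UDP/IP toward the remote VTEP. This is purely illustrative; it is not Brocade’s implementation.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned UDP port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" bit: VNI field is valid

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348) for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Byte 0: flags; bytes 1-3: reserved.
    # Bytes 4-6: VNI; byte 7: reserved (VNI is stored shifted left 8 bits).
    return struct.pack("!BBBB", VXLAN_FLAG_VNI_VALID, 0, 0, 0) + \
           struct.pack("!I", vni << 8)

def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the VXLAN header to the original L2 frame; the result
    would then be carried in UDP/IP to the remote VTEP."""
    return vxlan_header(vni) + inner_frame

def decode_vni(vxlan_payload: bytes) -> int:
    """Recover the VNI from a VXLAN-encapsulated payload."""
    (word,) = struct.unpack("!I", vxlan_payload[4:8])
    return word >> 8
```

The same header travels in the other direction too: the 6940 strips it off VXLAN traffic arriving from NSX before bridging the inner frame onto the VLAN.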
1.1: As above, but you don’t have any 6940s
If you have some 6740s but no 6940s in your VCS fabric, you have two options:
- Get one to four additional VDX switches – 6940 or 6740; or
- Select one to four of the existing 6740s, and split them out into their own separate VCS fabric.
If you decide to get 6940s, you can simply join them to your existing VCS fabric, and you’re back to scenario #1 above.
If you go with 6740s, you will need to set them up as a separate VCS fabric with fabric links between themselves, and then connect that new fabric to your existing fabric via 802.1Q trunk(s).
The reason for this is that the chipset used in the 6740 cannot handle VXLAN-encapsulated traffic arriving over TRILL. This means VXLAN traffic must reach a 6740 over a VLAN interface.
2: You have a DC-wide L2 fabric
The solution here is similar to #1.1 above: get one to four 6740 or 6940 switches, interconnect them into their own VCS fabric, and then connect that fabric via 802.1Q trunk(s) to your existing L2 fabric.
3: You are running an L3 (IP) fabric for your DC
It can be a single- or multi-vendor IP fabric.
Since an L3 fabric does not provide end-to-end L2 connectivity, you will be more limited in the placement of your HW VTEP. Any VLAN-connected devices will need to be, well, VLAN-connected to the VCS fabric that provides the HW VTEP functionality.
Most commonly, this would be a pair of Top of Rack (ToR) switches in your physical equipment rack. These two ToRs will then be connected via L3 links to the rest of your L3 fabric.
So the option here is again to get between one and four 6940 or 6740 switches, and connect them into a VCS that’s placed as ToR for your VLAN-based equipment rack.
Once we’ve sorted out what we need equipment-wise, it’s time to work through the network connectivity required for the solution to work.
The diagram below provides an overview:
As you can see, there are two distinct networks: Management and Data Transport. These networks can be totally isolated, if necessary. Endpoints connected to Management network only need to reach each other. Similarly, endpoints on the Data Transport network only talk between themselves, and never to anything on the Management network.
You are probably wondering about the “Router” and “IP Network” in the bottom right corner of the diagram. They are there to show that with NSX for vSphere, the VTEP IP addresses of HW VTEPs must be in a subnet that is separate from any and all subnets where you have your ESXi hypervisor VTEPs.
In other words, VXLAN traffic from ESXi hypervisors must travel through an IP gateway before reaching a HW VTEP. This requirement applies to HW VTEP solutions from all vendors.
The reason is that the VXLAN module on ESXi hosts doesn’t ARP for VTEPs; instead, it relies on the “Outer MAC” addresses you can see in the per-VNI MAC table. The contents of this table come from the info ESXi hosts share with the Controllers when joining a VNI – they report their VTEP IP and MAC address. Please see the recording of my VMworld presentation (or the slides) for more detail on that process.
HW VTEPs have no means of sharing such info, so ESXi hosts can’t know what MAC address to use in the outer header of VXLAN packets destined for a HW VTEP on the same subnet. The workaround is to place your HW VTEP in a separate subnet, which obviates the need for ESXi hosts to know its MAC address – they simply send to their IP gateway’s MAC instead.
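This subnet-separation requirement is easy to sanity-check in a design review. Here is a small Python sketch that does exactly that, using the standard library’s `ipaddress` module; the IP addresses in the usage notes are hypothetical examples, not values from any real deployment.

```python
import ipaddress

def hw_vtep_subnet_ok(hw_vtep_ip: str, esxi_vtep_subnets: list[str]) -> bool:
    """Return True if the HW VTEP address does NOT fall inside any
    subnet used by ESXi hypervisor VTEPs, as NSX-v requires.

    A False result means VXLAN traffic between the two would not
    cross an IP gateway, violating the requirement described above.
    """
    vtep = ipaddress.ip_address(hw_vtep_ip)
    return all(vtep not in ipaddress.ip_network(subnet)
               for subnet in esxi_vtep_subnets)
```

For example (hypothetical addressing), `hw_vtep_ok = hw_vtep_subnet_ok("10.20.30.1", ["10.0.0.0/24", "10.0.1.0/24"])` passes, while a HW VTEP at `10.0.0.50` against the same subnet list fails the check.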
Note: Brocade’s solution can use Loopback interfaces with a /32 address for VTEP, which solves this problem without requiring additional configuration.
Management and Control plane connectivity
So what conversations will be happening over these connections? Here is another diagram, showing TCP sessions (arrows representing client -> server direction):
Dashed lines from Hardware Switch Controller (HSC) to ToR Agents represent “backup” connections, since only one of the ToR Agents is a master for a given HW VTEP / HSC.
A keen observer may notice that arrows from the OVSDB server on HSC are pointing to the OVSDB clients on ToR agents. This is not a mistake, and no, I wasn’t in the room when this was decided. 🙂 It is what it is, I guess.
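To give a flavour of what travels over those OVSDB sessions: the protocol is JSON-RPC as defined in RFC 7047, run against the published hardware_vtep schema. Below is a sketch, assuming typical table and column names from that schema, of a “monitor” request such as a ToR Agent (the OVSDB client) might send to the HSC (the OVSDB server) to subscribe to table changes. It illustrates the message shape only, not the actual NSX implementation.

```python
import json

def ovsdb_monitor_request(request_id: int) -> str:
    """Build an RFC 7047 'monitor' JSON-RPC request against the
    hardware_vtep schema. The server replies with the current table
    contents and then streams 'update' notifications as rows change."""
    msg = {
        "method": "monitor",
        "params": [
            "hardware_vtep",   # database name
            None,              # monitor ID, echoed back in updates
            {
                # Remote MAC-to-VTEP entries pushed down by the controller
                "Ucast_Macs_Remote": {"columns": ["MAC", "ipaddr"]},
                # Physical ports available for VLAN-to-logical-switch binding
                "Physical_Port": {"columns": ["name", "vlan_bindings"]},
            },
        ],
        "id": request_id,
    }
    return json.dumps(msg)
```

The inverted client/server roles mentioned above mean this request flows from the ToR Agent towards the VCS fabric, even though the NSX Controller cluster is the “brain” of the system.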
A bit more about what’s in the picture
Again, we’re not going to cover what’s going on in great detail here – that’s for later posts. For a more rounded view, I highly recommend reading the white paper Hardware Layer 2 Gateways Integration with NSX from the ever-great Francois Tallet at VMware.
So, for a quick overview:
- Control plane communications between Controllers and ESXi hosts use a proprietary protocol. HW VTEPs, on the other hand, use OVSDB. To bridge the two, VMware added a component called the “ToR Agent”, which runs inside each NSX Controller and serves as an “adapter” that understands both protocols.
- Since only one ToR Agent is the master for an individual HW VTEP, it needs to know about all VXLANs that this HW VTEP may be bridging for. Since VXLANs are sliced across different Controllers, the ToR Agent needs to connect to them all.
- You can have more than one HW VTEP, which means different ToR Agents may be masters for different HW VTEPs. That is why all ToR Agents connect to all Controllers.
- HW VTEPs don’t support head-end or optimised head-end replication for BUM traffic. To help with this, NSX uses Replication Nodes, or “PTEPs” (where the first “P” stands for “Physical”). HW VTEPs send all BUM traffic to these Replication Nodes.
- For performance and scalability, HW VTEPs load-balance BUM traffic between Replication Nodes.
- To quickly detect the failure of a Replication Node, HW VTEPs run BFD sessions between themselves and the Replication Nodes.
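The last three points can be sketched in a few lines of Python. How the load-balancing hash is actually computed is implementation-specific, so the VNI-based selection below is illustrative only; the BFD arithmetic follows RFC 5880’s detection-time rule. The node IP addresses in the test are hypothetical documentation-range examples.

```python
import zlib

def pick_replication_node(vni: int, replication_nodes: list[str]) -> str:
    """Deterministically hash the VNI to pick a Replication Node for
    BUM traffic. The real per-VNI/per-flow distribution is vendor- and
    implementation-specific; this only illustrates the idea of
    load-balancing BUM traffic across the available nodes."""
    idx = zlib.crc32(vni.to_bytes(4, "big")) % len(replication_nodes)
    return replication_nodes[idx]

def bfd_detection_time_ms(tx_interval_ms: int, detect_mult: int) -> int:
    """BFD (RFC 5880) declares a peer down after detect_mult
    consecutive intervals pass with no packets received, so a failed
    Replication Node is detected within roughly this many ms."""
    return tx_interval_ms * detect_mult
```

So with, say, a 300 ms interval and a detect multiplier of 3, a dead Replication Node would be taken out of the BUM rotation in under a second – far faster than waiting for higher-layer timeouts.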
Brocade specific bits:
- The Hardware Switch Controller, or HSC, runs in a distributed fashion inside the VCS fabric. If not all of your rbridge nodes are configured as part of the overlay-gateway (see case #1 above), the HSC function may be performed on an rbridge that isn’t doing VLAN-to-VXLAN bridging. This is taken care of automatically.
- A VCS fabric configured with an overlay-gateway always uses a single IP address for its VTEP, irrespective of how many rbridges participate – one, two, three, or four.
- The point above also means there is only a single BFD session from a VCS fabric to each Replication Node, which saves resources on both sides while providing the same level of protection. If you’re wondering why I’m talking about BFD in relation to VTEPs, it’s because BFD packets are forwarded inside a VXLAN tunnel with VNI=0. This is what the arrow pointing to the ESXi Kernel Module (BFD) shows.
Where to from here?
If you haven’t yet, please do read the white paper I mentioned above.
The next posts in this series will be actual deep dives, and to get the best value out of them you’ll need to be very comfortable with how VXLAN Logical Switching works in NSX-v.
There are many great resources out there on the topic. Of the material I’ve produced in the past, I think my presentation from VMworld 2015 is probably the most appropriate refresher / primer. Give it a shot. If you missed it, the recordings of both the US and EU sessions are now generally available, too.
Thanks to Ramguhan Sundararajan and Steve Day for review and feedback comments.
Next post in the series is here.