NSX for vSphere supports three VXLAN Control Plane modes:
- Multicast (described in Section 4 of RFC 7348);
- Hybrid; and
- Unicast.
None of these is “simply better” than the others; each has its positive and negative sides. In this post, I’m covering how each mode works, along with some of those negatives and positives, to help you make a better-informed choice for your circumstances.
Since this topic is a bit large, here’s a brief outline of what we’ll be talking about.
- Fundamental VXLAN Control Plane functions, and how they’re implemented in three modes
- How VXLAN Control Plane information gets to the hosts
- Traffic replication in the three modes, and how to calculate corresponding overheads
- Additional Control Plane function enabled by NSX Controllers (ARP suppression)
- Operational considerations for each of the modes
Quick recap of Control Plane functions
In general, each host running VXLAN needs to be able to do the following two things:
- When a VM on that host connected to an LS sends a frame addressed to a particular MAC address, figure out VTEP IP address of a host where this destination MAC address resides; and
- If that MAC address was a Broadcast, Unknown Unicast, or Multicast (a.k.a. “BUM”), send it to all VTEPs for that given VXLAN.
Note: NSX doesn’t include provisions for Multicast switching or routing, and follows standard Ethernet behaviour, which is to flood frames with Multicast destination MAC addresses within their L2 domain, i.e., individual VXLAN.
A frame is considered Unknown Unicast by a host if it doesn’t have a corresponding MAC to VTEP IP cached entry.
Mapping MACs to VTEPs
To do #1, each host maintains a per-VNI cache of MAC to VTEP mappings:
~ # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Inner MAC          Outer MAC          Outer IP        Flags
-----------------  -----------------  --------------  --------
00:50:56:a6:a1:e3  00:50:56:61:23:00  192.168.250.52  00000111
00:50:56:8e:45:33  ff:ff:ff:ff:ff:ff  192.168.150.51  00001111
In the sample output above we see that the host where this command was run knows about a couple MAC addresses. One of them, 00:50:56:a6:a1:e3, is located behind VTEP 192.168.250.52, listed under Outer IP. The Outer MAC is the MAC address of that VTEP, included here because it is on the same subnet as the VTEP of the host we’re looking at.
For VTEPs with IPs in other subnets, Outer MAC is set to ff:ff:ff:ff:ff:ff, since communication with them is via the default gateway of VXLAN IP stack, so the MAC is not needed. The second entry in the table above shows this scenario.
First difference: how MAC to VTEP table is populated
In all three modes, this table is populated by looking at packets arriving from remote VTEPs. Incoming VXLAN packet contains source VM’s MAC address (Inner MAC) in the header of encapsulated Ethernet frame that was generated by that VM, and VTEP’s MAC and IP of that VM’s host (Outer MAC and Outer IP) in the outer L2/L3 headers.
If remote VTEP is in a different IP subnet, that MAC will be of L3 gateway, not of VTEP, and will be replaced with ff:ff:ff:ff:ff:ff.
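The learning step described above can be sketched roughly as follows. This is a minimal Python illustration, not NSX code: the `learn` function, the dictionary layout, the local VTEP address 192.168.250.51/24 and the gateway MAC 00:50:56:01:02:03 are all assumptions made for the example.

```python
import ipaddress

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

def learn(cache, local_vtep, inner_mac, outer_mac, outer_ip):
    """Record which VTEP a VM MAC sits behind, based on a received VXLAN packet.

    cache      -- dict for one VNI: inner MAC -> (outer MAC, outer IP)
    local_vtep -- this host's VTEP as "ip/prefix" (assumed /24 here)
    """
    local_net = ipaddress.ip_interface(local_vtep).network
    if ipaddress.ip_address(outer_ip) not in local_net:
        # The outer source MAC belongs to the L3 gateway, not to the remote VTEP,
        # so it is replaced with ff:ff:ff:ff:ff:ff.
        outer_mac = BROADCAST_MAC
    cache[inner_mac] = (outer_mac, outer_ip)
    return cache[inner_mac]

cache = {}
# Packet from a VTEP in the same subnet: the VTEP's own MAC is kept.
learn(cache, "192.168.250.51/24",
      "00:50:56:a6:a1:e3", "00:50:56:61:23:00", "192.168.250.52")
# Packet from a VTEP in another subnet: the (hypothetical) gateway MAC is discarded.
learn(cache, "192.168.250.51/24",
      "00:50:56:8e:45:33", "00:50:56:01:02:03", "192.168.150.51")
```

After these two calls the cache holds the same two rows as the sample esxcli output earlier in this section.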
In Multicast mode, host has no other choice but to flood a frame for which it doesn’t have a MAC:VTEP entry, in hope that one of the copies of flooded frame will reach the destination VM, causing it to respond back, so that the above process can complete.
Flooding will also create a MAC:VTEP cache entry for the source VM’s MAC and its host’s VTEP on every host participating in the VXLAN this VM is connected to.
In Hybrid and Unicast mode, hosts will first query Controller for this mapping, and use information in Controller’s response to populate their local cache, avoiding initial flooding.
Here’s an example of request and response; note what information is sent back by the Controller:
2014-12-31T02:59:16.249Z [FFB44B70 verbose 'Default'] Vxlan: receive a message from kernel
2014-12-31T02:59:16.249Z [FFB44B70 verbose 'Default'] VXLAN Message
--> VM MAC Query: len = 18
--> SwitchID:0, VNI:5000
--> Num of entries: 1
--> #0 VM MAC:00:50:56:a6:a1:e3

2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] Vxlan: receive UPDATE from controller 192.168.110.203:0
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] Vxlan: send a message to the dataplane
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] VXLAN Message
--> VM MAC Update: len = 30
--> SwitchID:0, VNI:5000
--> Num of removed entries: 0
--> Num of added entries: 1
--> #0 VM MAC:00:50:56:a6:a1:e3 VTEP IP:192.168.250.52 VTEP MAC:00:50:56:61:23:00
As mentioned above, this table is a cache. Entries expire around 200 seconds after the host stops seeing matching VXLAN ingress or egress packets.
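To illustrate the idle-timeout behaviour, here is a small sketch of such a cache. It is not how the VXLAN kernel module is implemented; the class name, method names and the injectable clock are assumptions for the example, and the 200-second value is the approximate figure quoted above.

```python
import time

IDLE_TIMEOUT = 200.0  # seconds; approximate figure from the text

class MacVtepCache:
    """Per-VNI MAC -> VTEP cache with idle expiry (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries = {}       # MAC -> (vtep_ip, last_seen)

    def touch(self, mac, vtep_ip):
        """Add or refresh an entry whenever matching ingress/egress traffic is seen."""
        self._entries[mac] = (vtep_ip, self._clock())

    def lookup(self, mac):
        """Return the VTEP IP, or None if the entry is unknown or has gone idle."""
        entry = self._entries.get(mac)
        if entry is None:
            return None
        vtep_ip, last_seen = entry
        if self._clock() - last_seen > IDLE_TIMEOUT:
            del self._entries[mac]   # expired: drop it
            return None
        return vtep_ip

# Usage with a fake clock so no real waiting is needed.
now = [0.0]
c = MacVtepCache(clock=lambda: now[0])
c.touch("00:50:56:a6:a1:e3", "192.168.250.52")
now[0] = 150.0
still_there = c.lookup("00:50:56:a6:a1:e3")
now[0] = 201.0
gone = c.lookup("00:50:56:a6:a1:e3")
```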
What if the Controller didn’t have the answer? In that case, there will be no Response from it, and host will have an Invalid entry similar to the below:
~ # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Inner MAC          Outer MAC          Outer IP        Flags
-----------------  -----------------  --------------  --------
00:00:00:00:00:01  ff:ff:ff:ff:ff:ff  192.168.250.52  00000110
Things to note in regards to the above:
- The last bit in the Flags field is set to 0, which indicates that this entry is Invalid and will not be used;
- Outer IP populated in Invalid entry is a cosmetic bug (recently fixed). Remember that entries marked as Invalid are not used, so while it’s confusing, it’s also harmless.
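If you are scripting against this output, the validity check can be expressed in one line. The sketch below only interprets the last bit, as described above; the meaning of the other bits is not covered here, and the function name is my own.

```python
def is_valid_entry(flags: str) -> bool:
    """Return True if the last (rightmost) bit of the Flags column is 1."""
    return (int(flags, 2) & 1) == 1

# Flags values taken from the sample outputs in this post.
samples = ["00000111", "00001111", "00000110"]
validity = [is_valid_entry(f) for f in samples]
```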
To summarise:
- Host relies on contents of MAC:VTEP table to find destination VTEP
- This table is populated by learning from incoming traffic
- If there’s no entry, LS in Unicast or Hybrid mode will query Controller for the mapping. LS in Multicast mode will not, as Controller isn’t used
- When the destination MAC has no matching entry (and Controller hasn’t replied in Unicast or Hybrid mode), or is a Broadcast or Multicast destination (a.k.a. “BUM”), then host will flood that frame to all other hosts that are members of that VXLAN
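The points above boil down to a small forwarding decision. The following is a simplified sketch, assuming a synchronous `query_controller` callback (in reality the exchange with the Controller is message-based); all names are hypothetical.

```python
from enum import Enum

class Mode(Enum):
    MULTICAST = "multicast"
    HYBRID = "hybrid"
    UNICAST = "unicast"

def is_group_mac(dst_mac: str) -> bool:
    """Broadcast and Multicast MACs have the lowest bit of the first octet set."""
    return (int(dst_mac.split(":")[0], 16) & 1) == 1

def decide(dst_mac, cache, mode, query_controller=None):
    """Return ("unicast", vtep_ip) or ("flood", None) for an outgoing frame."""
    if is_group_mac(dst_mac):
        return ("flood", None)                 # Broadcast or Multicast
    vtep_ip = cache.get(dst_mac)
    if vtep_ip is None and mode is not Mode.MULTICAST and query_controller:
        vtep_ip = query_controller(dst_mac)    # may return None (no answer)
        if vtep_ip is not None:
            cache[dst_mac] = vtep_ip           # populate the local cache
    if vtep_ip is None:
        return ("flood", None)                 # Unknown Unicast
    return ("unicast", vtep_ip)
```

Note that only the Multicast branch skips the Controller query; Hybrid and Unicast differ in how the flood is carried out, not in when it happens.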
How does VXLAN-related dvPg info get to the hosts?
You are probably aware that VXLAN-backed dvPortgroups are just that – DVS portgroups, but with additional VXLAN-specific information “attached” to them. When these dvPortgroups are created, NSX Manager provides that information to the VC, which stores it as part of DVS configuration.
When a VM is connected to a dvPort that is member of a VXLAN-backed dvPg, VC will create a dvPort on a host, and set a number of opaque attributes that include:
- VNI (VXLAN ID)
- Control Plane (0 = Disabled (for Multicast), 1 = Enabled (for Hybrid or Unicast))
- Multicast IP address (or 0.0.0.1 for Unicast)
Once dvPort is created, VXLAN kernel module will read and cache this information for future use.
The information above can be seen in the output of "net-dvs -l", while the cached information can be viewed with esxcli:
~ # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
5002      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)   2           2                0
5000      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)   1           0                0
5003      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)   1           0                0
5004      22.214.171.124             Disabled                             0.0.0.0 (down)         2           0                0
5001      126.96.36.199              Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)   1           2                0
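The Control Plane and Multicast IP attributes listed earlier are enough to tell the three modes apart. Here is a minimal sketch of that inference; the function name is hypothetical and 239.0.0.1 is just a placeholder Multicast address.

```python
def control_plane_mode(control_plane: int, multicast_ip: str) -> str:
    """Infer the Logical Switch mode from two dvPort attributes.

    control_plane -- 0 = Disabled, 1 = Enabled
    multicast_ip  -- Multicast group, or "0.0.0.1" for Unicast
    """
    if control_plane == 0:
        return "multicast"        # no Controller involvement
    if multicast_ip == "0.0.0.1":
        return "unicast"          # Controller, no Multicast group
    return "hybrid"               # Controller plus a Multicast group

modes = [
    control_plane_mode(0, "239.0.0.1"),
    control_plane_mode(1, "0.0.0.1"),
    control_plane_mode(1, "239.0.0.1"),
]
```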
BUM frame forwarding
For a host to correctly handle BUM frames within a given VXLAN, it needs some means of finding out which other hosts have something connected to that VXLAN, and somehow making sure a copy of each BUM frame reaches all of them.
Multicast Control Plane mode
For Logical Switches in Multicast mode, NSX follows the process described in Section 4.2, “Broadcast Communication and Mapping to Multicast” of RFC 7348.
In short, each VXLAN VNI is associated with a Multicast IP address, allocated by NSX Manager from a user-configured range. If that range runs out of IP addresses, NSX Manager will start re-using them, causing more than one VNI to be associated with a given Multicast IP address.
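To see why a small range leads to shared groups, consider the sketch below. The exact re-use policy NSX Manager applies is not described here; simple wrap-around is purely an assumption for illustration, as are the function name and the 239.0.0.x range.

```python
import ipaddress

def assign_groups(vnis, first_ip, last_ip):
    """Map each VNI to a Multicast IP, wrapping around when the range runs out.

    Wrap-around is an assumed policy, used only to illustrate address re-use.
    """
    first = int(ipaddress.IPv4Address(first_ip))
    last = int(ipaddress.IPv4Address(last_ip))
    size = last - first + 1
    return {
        vni: str(ipaddress.IPv4Address(first + (i % size)))
        for i, vni in enumerate(vnis)
    }

# Three VNIs, but only two addresses in the range.
groups = assign_groups([5000, 5001, 5002], "239.0.0.1", "239.0.0.2")
```

Here VNI 5000 and VNI 5002 end up on the same group, so a host that joined it for one VNI can also receive BUM traffic for the other.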
When a VM is connected to a VXLAN-backed dvPortgroup, host’s VXLAN kernel module reads Multicast address associated with that dvPortgroup from dvPort configuration, and then joins the Multicast group accordingly.
Once this is done, any host that has joined Multicast group corresponding to a particular VXLAN can reach all other hosts that have done the same by sending a VXLAN packet with the destination IP address set to that VXLAN’s Multicast IP address. Physical network will then take care of the necessary replication and forwarding.
Multicast Control Plane mode does not rely on or require the presence of NSX Controllers.
Each BUM frame leaves source ESXi host once. There is no head-end replication.
Hybrid Control Plane mode
Unlike Multicast mode, the Hybrid Control Plane mode relies on both NSX Controllers and physical network to reach other hosts with VMs connected to a given VXLAN.
Controllers maintain a table of VTEPs that have “joined” each VXLAN. Here’s an example of such table for VNI 5001:
nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI   IP              Segment        MAC                Connection-ID
5001  192.168.250.53  192.168.250.0  00:50:56:67:d9:91  1845
5001  192.168.250.52  192.168.250.0  00:50:56:64:f4:25  1843
5001  192.168.250.51  192.168.250.0  00:50:56:66:e2:ef  7
5001  192.168.150.51  192.168.150.0  00:50:56:60:bc:e9  3
When the first VM on a host connects to a VXLAN in Hybrid or Unicast mode, the following things happen:
- Host realises that this VXLAN requires cooperation with the Controller when it sees “1” in the “Control Plane” attribute on a dvPort that just got connected
- Query is sent from host to the Controller Cluster to find out which of the 3 Controllers is looking after this VXLAN’s VNI
- If there was no existing connection to that Controller, host will connect to it and ask for the list of all other VTEPs Controller has for that VNI; otherwise it will use an existing connection to that Controller to do it
- Host will then send a “VTEP Membership Update” message to the Controller telling it that this host’s VTEP has also joined that VNI
- Controller will generate an update to all other hosts on that VNI, telling them about this new VTEP
- In Hybrid mode only, host looks up Multicast IP address associated with the VXLAN VNI in question, and sends an IGMP “Join” message for that Multicast group toward physical network from its VTEP vmkernel interface
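The membership bookkeeping in the steps above can be modelled with a very small class. This is a toy model of the idea rather than the Controller's actual implementation; the `Controller` class and its `join` method are invented for the example.

```python
from collections import defaultdict

class Controller:
    """Toy model of the per-VNI VTEP table kept by a Controller."""

    def __init__(self):
        self.vtep_table = defaultdict(list)   # VNI -> list of VTEP IPs

    def join(self, vni, vtep_ip):
        """Register a VTEP on a VNI.

        Returns the other VTEPs already on that VNI. In this model the same
        list serves two purposes: it is handed to the joining host, and it is
        the set of hosts that get an update about the new VTEP.
        """
        others = [ip for ip in self.vtep_table[vni] if ip != vtep_ip]
        if vtep_ip not in self.vtep_table[vni]:
            self.vtep_table[vni].append(vtep_ip)
        return others

c = Controller()
first = c.join(5001, "192.168.150.51")    # nobody else there yet
second = c.join(5001, "192.168.250.51")   # learns about the first host
```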
Once the above is complete, all hosts that have VMs connected to this VXLAN will have a table of other hosts’ VTEP IP addresses. Notice that 192.168.250.51 is missing compared to the previous Controller command output – it’s the VTEP IP of the host where this command is run:
~ # esxcli network vswitch dvs vmware vxlan network vtep list --vds-name Compute_VDS --vxlan-id=5001
IP              Segment ID     Is MTEP
--------------  -------------  -------
192.168.150.51  192.168.150.0  true
192.168.250.53  192.168.250.0  false
192.168.250.52  192.168.250.0  false
As you can see, the table also includes a Segment ID for each VTEP, which is the result of a bitwise AND of a host’s VTEP IP address and netmask (essentially, the IP subnet). VTEPs with the same Segment ID are presumed to be in the same L2 broadcast domain.
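As a quick illustration of how a Segment ID falls out of the IP address and netmask, here is a short sketch using Python's standard `ipaddress` module. The 255.255.255.0 netmask is an assumption that happens to match the sample outputs; the function name is my own.

```python
import ipaddress

def segment_id(vtep_ip: str, netmask: str) -> str:
    """Return the network address obtained by masking the VTEP IP with its netmask."""
    interface = ipaddress.ip_interface(f"{vtep_ip}/{netmask}")
    return str(interface.network.network_address)

seg_a = segment_id("192.168.250.53", "255.255.255.0")
seg_b = segment_id("192.168.150.51", "255.255.255.0")
```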
Last, but not least, there is the Is MTEP field. Each host will randomly nominate one VTEP in every Segment ID other than its own as an MTEP. This is done per VNI, so a different host may be selected as MTEP for a different VNI.
MTEP is a host which will perform replication of BUM traffic to other hosts with the same Segment ID as its own.
Note: don’t forget that the VTEP table is per-VNI, and includes only hosts where there are VMs connected, powered on, and with vNIC in link-up state. This means that list of hosts in VTEP table for different VNIs may be drastically different depending on what’s connected to these VNIs.
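MTEP nomination as described above might look something like the sketch below. It only mirrors the stated behaviour (one random VTEP per remote Segment ID, per VNI); the function name and data layout are assumptions.

```python
import random

def pick_mteps(local_segment, vteps, rng=random):
    """Pick one MTEP per remote segment for a single VNI.

    vteps -- list of (vtep_ip, segment_id) tuples for the other hosts
    Returns {segment_id: mtep_ip} covering every segment except our own.
    """
    by_segment = {}
    for ip, seg in vteps:
        if seg != local_segment:
            by_segment.setdefault(seg, []).append(ip)
    return {seg: rng.choice(ips) for seg, ips in by_segment.items()}

# VTEP table for VNI 5001 as seen from host 192.168.250.51.
vteps = [
    ("192.168.150.51", "192.168.150.0"),
    ("192.168.250.53", "192.168.250.0"),
    ("192.168.250.52", "192.168.250.0"),
]
mteps = pick_mteps("192.168.250.0", vteps)
```

With only one VTEP in the remote segment there is nothing random about the outcome: 192.168.150.51 is the only candidate, matching the Is MTEP column in the sample output.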
At this point:
- All hosts with VMs on a given VXLAN know each other’s VTEP IP addresses and IP subnets
- Each of these hosts has sent out an IGMP join for the Multicast IP address associated with that VXLAN
- Controller has a complete list of all these hosts’ VTEP IPs
Now, let’s see what happens when our host has a BUM frame to send.
Following command outputs above, VNI 5001 has 4 VTEPs in it; three in the subnet 192.168.250.0 and one in 192.168.150.0. Let’s say host 192.168.250.51 has a BUM frame to send. Based on the contents of its VTEP table above, the following will happen:
- Host will find all MTEPs, and send each one of them a copy of the BUM frame, in a VXLAN packet addressed to the Unicast IP address of each MTEP’s VTEP. These VXLAN packets will have the bit #5 in their VXLAN header Flags field set to “1”, indicating to the receiving host that it must replicate them locally.
- Look up Multicast IP address associated with the VNI, and send one copy of the BUM frame directed to that Multicast address, setting IP TTL to 1 to prevent it from being forwarded outside of the local broadcast domain.
- Each MTEP, after receiving a copy of the VXLAN-encapsulated BUM frame, will reset the “replicate locally” flag, send a copy of the BUM frame to all VM(s) running on them connected to VXLAN in question, and then also execute the step described immediately above.
Hybrid mode relies on physical Ethernet network to forward the VXLAN frames with Multicast destination IP address to the hosts that have joined corresponding multicast group.
If physical network has correctly configured functional IGMP snooping and an IGMP querier, a copy of such frames will be only delivered to hosts that have sent IGMP join messages. Otherwise, physical network will deliver these frames to all hosts within the same broadcast domain (eg, VLAN).
For each BUM frame, source host will:
- Send one copy to Multicast IP associated with the LS
- Send one copy to Unicast IP of each MTEP on this LS
Based on the command output samples in this section, our host will send out two copies of each BUM frame; one Unicast to 192.168.150.51, and one Multicast to 188.8.131.52.
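The Hybrid replication rule for the source host can be written down as a short function. This is a sketch of the counting logic only; the function name is hypothetical, and 239.0.0.1 stands in as a placeholder for the Multicast IP of the Logical Switch.

```python
def hybrid_destinations(local_segment, vteps, multicast_ip):
    """List where the source host sends copies of one BUM frame in Hybrid mode.

    vteps -- list of (vtep_ip, segment_id, is_mtep) as seen by the source host
    """
    # One Multicast copy (TTL 1) takes care of the local segment.
    dests = [("multicast", multicast_ip)]
    # One Unicast copy goes to the MTEP of each remote segment.
    dests += [("unicast", ip) for ip, seg, is_mtep in vteps
              if is_mtep and seg != local_segment]
    return dests

vteps = [
    ("192.168.150.51", "192.168.150.0", True),
    ("192.168.250.53", "192.168.250.0", False),
    ("192.168.250.52", "192.168.250.0", False),
]
dests = hybrid_destinations("192.168.250.0", vteps, "239.0.0.1")
```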
Unicast Control Plane mode
Unicast Control Plane mode does not have any dependencies on physical network replication, instead relying purely on Controllers to do the job.
The difference between Hybrid and Unicast is that the latter doesn’t use Multicast IP addresses to reach VTEPs with the same Segment ID. Instead, both source host and MTEPs compile a list of VTEPs with the same Segment ID as themselves, and send a separate copy of BUM frame to each one of them.
Assuming that VNI 5001 was reconfigured for Unicast Control Plane mode and using the same command outputs as above: VNI 5001 still has 4 VTEPs in it; three in the subnet 192.168.250.0 and one in 192.168.150.0. If our host 192.168.250.51 has a BUM frame to send, the following will happen:
- Just as before, host will find all MTEPs, and send each one of them a copy of the BUM frame, in a VXLAN packet addressed to the Unicast IP address of each MTEP’s VTEP. These VXLAN packets will have the bit #5 in their VXLAN header Flags field set to “1”, indicating to the receiving host that it must replicate them locally.
- Unlike Hybrid, host will find all VTEPs that have the same Segment ID as its own VTEP, and send a “personal” copy of BUM frame to each VTEP’s IP.
- Each MTEP, after receiving a copy of the VXLAN-encapsulated BUM frame, will reset the “replicate locally” flag, send a copy of the BUM frame to all VM(s) running on them connected to VXLAN in question, and then also execute the step described immediately above, creating multiple copies if necessary.
For each BUM frame, source host will:
- Send one copy to each of the VTEPs with the same Segment ID as host’s VTEP
- Send one copy to Unicast IP of each MTEP on this LS
Based on the command output samples in this section, our host will send out three copies of each BUM frame:
- Two to hosts in the same Segment: 192.168.250.52 and 192.168.250.53
- One to the MTEP 192.168.150.51, to take care of Segment 192.168.150.0
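The same counting for Unicast mode, again as a sketch with a hypothetical function name and the VTEP table from the sample output:

```python
def unicast_destinations(local_segment, vteps):
    """List the VTEP IPs the source host sends one BUM frame to in Unicast mode.

    vteps -- list of (vtep_ip, segment_id, is_mtep) as seen by the source host
    """
    # A "personal" copy for every VTEP in our own segment...
    same_segment = [ip for ip, seg, _ in vteps if seg == local_segment]
    # ...plus one copy for the MTEP of each remote segment.
    mteps = [ip for ip, seg, is_mtep in vteps
             if is_mtep and seg != local_segment]
    return same_segment + mteps

vteps = [
    ("192.168.150.51", "192.168.150.0", True),
    ("192.168.250.53", "192.168.250.0", False),
    ("192.168.250.52", "192.168.250.0", False),
]
dests = unicast_destinations("192.168.250.0", vteps)
```

The number of copies grows with the number of VTEPs in the local segment, which is the main thing to watch when sizing VTEP subnets for this mode.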
Additional benefit of Controllers: VXLAN ARP suppression
In Hybrid and Unicast modes NSX can also provide reduction of flooding due to VMs sending ARP requests for other VMs on the same VXLAN. For more details, please see my other blog post – NSX-v under the hood: VXLAN ARP suppression
Operational considerations for the three modes
As stated at the beginning of this post, there is no simply “best” mode, as each has up and down sides.
On the upside, Multicast is very “easy” on ESXi hosts – all they need to do is join the right Multicast groups, and physical network will do the rest. Also, Controllers are not required if you’re not using any Logical Switches in Hybrid or Unicast mode and don’t need NSX distributed routing.
Some of the considerations for going this way are as follows:
- Your physical network must support L2 and L3 Multicast (see the RFC linked above for more details), meaning:
  - Team looking after the physical network must be able to configure and support necessary Multicast features
  - Troubleshooting NSX connectivity will more likely require involvement of the networking team (and potentially the vendor)
- NSX has strong dependency on Multicast in physical network for any BUM traffic handling, which includes for example ARP resolution and Multicast communications, such as OSPF, within logical space, eg, when running OSPF between DLR and ESG
- Because of the point above, if you’re operating highly available environment, every upgrade to ESXi or NSX code or your physical switches’ firmware (and potentially some configuration changes) will need to go through regression functional, load, and soak testing before rolling out into production
- It is fairly common for lower-end networking switches to handle Multicast well below line rate due to slow-path replication
- You will need to create and maintain additional global addressing plan for Multicast IPs
On the upside, Hybrid provides some of the offload capabilities of Multicast, while offering additional benefits of VXLAN ARP suppression. Multicast configuration on the physical network needed to support Hybrid is much simpler – all that’s needed is IGMP snooping and an IGMP querier per broadcast domain where VTEPs live.
When going Hybrid, consider that:
- All Multicast considerations listed above, except the need for Multicast Routing, are still applicable for Hybrid
- There is a slightly larger replication overhead associated with BUM handling on ESXi hosts
- Unlike in Multicast mode, you’ll need Controllers
Unicast mode completely removes dependency on physical network, when it comes to BUM handling – all replication is done on the host, and resulting communications are purely unicast between VTEPs. ARP suppression is also available, since Unicast mode utilises Controllers.
On the other hand, the decoupling above comes at a cost of further replication overheads that need to be taken into consideration, using the math provided above.
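The per-mode copy counts from the earlier sections can be turned into a rough estimate of egress load on the source host. The sketch below is a back-of-the-envelope calculator, not a sizing tool: function names are hypothetical, and the 50-byte encapsulation overhead assumes IPv4 with no outer VLAN tag (outer Ethernet 14 + IP 20 + UDP 8 + VXLAN 8).

```python
VXLAN_OVERHEAD = 50  # bytes; assumes IPv4 outer header and no outer VLAN tag

def copies_per_bum_frame(mode, local_peers, remote_segments):
    """Copies of each BUM frame leaving the source host.

    local_peers     -- other VTEPs on this VNI in the source host's segment
    remote_segments -- other segments with VTEPs on this VNI (one MTEP each)
    """
    if mode == "multicast":
        return 1
    if mode == "hybrid":
        return 1 + remote_segments
    if mode == "unicast":
        return local_peers + remote_segments
    raise ValueError(f"unknown mode: {mode}")

def egress_bits_per_second(mode, local_peers, remote_segments,
                           bum_pps, avg_frame_bytes):
    """Rough egress bandwidth used by BUM replication on the source host."""
    copies = copies_per_bum_frame(mode, local_peers, remote_segments)
    return copies * bum_pps * (avg_frame_bytes + VXLAN_OVERHEAD) * 8

# The VNI 5001 example: 2 local peers, 1 remote segment.
per_mode = {m: copies_per_bum_frame(m, 2, 1)
            for m in ("multicast", "hybrid", "unicast")}
```

For the VNI 5001 example this gives 1, 2 and 3 copies respectively. The gap widens quickly as VTEP subnets get larger.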
I hope this blog provided you with enough information to help you make a call on which VXLAN Control Plane mode is better suited for your environment. A brief summary of things to think about:
- How much BUM traffic are you expecting to see on your Logical Switches? Think packets/sec and average packet size.
- How many VMs are on your Logical Switches, and how many hosts do they translate to? This will tell you how many VTEPs you will have to replicate to.
- How big are your VTEP subnets, and how many of them are there? Remember how MTEP replication works.
- How robust is your physical switching infrastructure, and how familiar are you with IP multicast? What will you do if/when things break?
- How close is your relationship with the team that’s looking after your physical network? Will they be willing to come and help?
- Do you have a model environment (production replica) to test upgrades and changes to ESXi, NSX, and physical network firmware and configuration? Do you have time and resources to do proper regression testing on each upgrade?
Above all – do the math and look at the real numbers – gut feel can be deceiving 🙂
Also, do not forget that Control Plane mode can be selected for each Logical Switch independently, and changed later with very small impact.