NSX for vSphere supports three VXLAN Control Plane modes:
- Multicast (described in Section 4 of RFC 7348);
- Hybrid; and
- Unicast.
None of these is “simply better” than the others; each has its positive and negative sides. In this post, I’m covering how each mode works, along with some of those positives and negatives, to hopefully help you make a better-informed choice for your circumstances.
Overview
Since this topic is a bit large, here’s a brief outline of what we’ll be talking about.
- Fundamental VXLAN Control Plane functions, and how they’re implemented in three modes
- How VXLAN Control Plane information gets to the hosts
- Traffic replication in the three modes, and how to calculate corresponding overheads
- Additional Control Plane function enabled by NSX Controllers (ARP suppression)
- Operational considerations for each of the modes
Quick recap of Control Plane functions
In general, each host running VXLAN needs to be able to do the following two things:
- When a VM on that host connected to an LS sends a frame addressed to a particular MAC address, figure out the VTEP IP address of the host where that destination MAC address resides; and
- If that MAC address is a Broadcast, Unknown Unicast[1], or Multicast address (a.k.a. “BUM”), send the frame to all VTEPs for that VXLAN.
Note: NSX doesn’t include provisions for Multicast switching or routing, and follows standard Ethernet behaviour, which is to flood frames with Multicast destination MAC addresses within their L2 domain, i.e., individual VXLAN.
[1]A frame is considered Unknown Unicast by a host if the host doesn’t have a corresponding MAC to VTEP IP cache entry.
Mapping MACs to VTEPs
To do #1, each host maintains a per-VNI cache of MAC to VTEP mappings:
~ # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Inner MAC          Outer MAC          Outer IP        Flags
-----------------  -----------------  --------------  --------
00:50:56:a6:a1:e3  00:50:56:61:23:00  192.168.250.52  00000111
00:50:56:8e:45:33  ff:ff:ff:ff:ff:ff  192.168.150.51  00001111
In the sample output above we see that the host where this command was run knows about a couple of MAC addresses. One of them, 00:50:56:a6:a1:e3, is located behind VTEP 192.168.250.52, listed under Outer IP. The Outer MAC is the MAC address of that VTEP, included here because it is on the same subnet as the VTEP of the host we’re looking at.
For VTEPs with IPs in other subnets, Outer MAC is set to ff:ff:ff:ff:ff:ff, since communication with them is via the default gateway of the VXLAN IP stack, so the MAC is not needed. The second entry in the table above shows this scenario.
First difference: how MAC to VTEP table is populated
In all three modes, this table is populated by looking at packets arriving from remote VTEPs. An incoming VXLAN packet contains the source VM’s MAC address (Inner MAC) in the header of the encapsulated Ethernet frame generated by that VM, and the VTEP MAC[2] and IP of that VM’s host (Outer MAC and Outer IP) in the outer L2/L3 headers.
[2]If the remote VTEP is in a different IP subnet, that MAC will be the L3 gateway’s, not the VTEP’s, and will be replaced with ff:ff:ff:ff:ff:ff.
In Multicast mode, a host has no choice but to flood a frame for which it doesn’t have a MAC:VTEP entry, in the hope that one of the copies of the flooded frame will reach the destination VM and cause it to respond, so that the learning process above can complete.
Flooding also creates a MAC:VTEP cache entry for the source VM’s MAC (and the VTEP of the host it’s running on) on all hosts participating in the VXLAN this VM is connected to.
In Hybrid and Unicast modes, hosts will first query the Controller for this mapping, and use the information in the Controller’s response to populate their local cache, avoiding the initial flooding.
Here’s an example of request and response; note what information is sent back by the Controller:
Request:
2014-12-31T02:59:16.249Z [FFB44B70 verbose 'Default'] Vxlan: receive a message from kernel
2014-12-31T02:59:16.249Z [FFB44B70 verbose 'Default'] VXLAN Message
--> VM MAC Query: len = 18
--> SwitchID:0, VNI:5000
--> Num of entries: 1
--> #0 VM MAC:00:50:56:a6:a1:e3
Response:
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] Vxlan: receive UPDATE from controller 192.168.110.203:0
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] Vxlan: send a message to the dataplane
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] VXLAN Message
--> VM MAC Update: len = 30
--> SwitchID:0, VNI:5000
--> Num of removed entries: 0
--> Num of added entries: 1
--> #0 VM MAC:00:50:56:a6:a1:e3 VTEP IP:192.168.250.52 VTEP MAC:00:50:56:61:23:00
As mentioned above, this table is a cache. Entries in that cache expire roughly 200 seconds after the host stops seeing matching VXLAN ingress or egress packets.
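To make the aging behaviour concrete, here’s a minimal idle-timeout cache sketch in Python. The class and method names are mine, and 200 seconds is the approximate figure mentioned above rather than an exact NSX constant; this illustrates the mechanism, not the ESXi implementation.

import time

class MacVtepCache:
    """Minimal sketch of a cache whose entries age out when idle."""
    IDLE_TIMEOUT = 200.0   # approximate value from the text above

    def __init__(self):
        self._entries = {}   # (vni, inner_mac) -> (vtep_ip, last_seen)

    def refresh(self, vni, mac, vtep_ip):
        # Called whenever matching VXLAN traffic is sent or received for this MAC.
        self._entries[(vni, mac)] = (vtep_ip, time.monotonic())

    def lookup(self, vni, mac):
        entry = self._entries.get((vni, mac))
        if entry is None:
            return None
        vtep_ip, last_seen = entry
        if time.monotonic() - last_seen > self.IDLE_TIMEOUT:
            del self._entries[(vni, mac)]   # expired: the MAC is again Unknown Unicast
            return None
        return vtep_ip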
What if the Controller didn’t have the answer? In that case, there will be no Response from it, and the host will end up with an Invalid entry similar to the one below:
~ # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Inner MAC          Outer MAC          Outer IP        Flags
-----------------  -----------------  --------------  --------
00:00:00:00:00:01  ff:ff:ff:ff:ff:ff  192.168.250.52  00000110
Things to note regarding the above:
- The last bit in the Flags field is set to 0, which indicates that this entry is Invalid and will not be used;
- The Outer IP populated in the Invalid entry is a cosmetic bug (recently fixed). Remember that entries marked as Invalid are not used, so while it’s confusing, it’s also harmless.
To recap (a short sketch follows this list):
- The host relies on the contents of the MAC:VTEP table to find the destination VTEP
- This table is populated by learning from incoming traffic
- If there’s no entry, an LS in Unicast or Hybrid mode will query the Controller for the mapping. An LS in Multicast mode will not, as the Controller isn’t used
- When the destination MAC has no matching entry (and the Controller hasn’t replied in Unicast or Hybrid mode), or is a Broadcast or Multicast destination (a.k.a. “BUM”), the host will flood that frame to all other hosts that are members of that VXLAN
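Putting the recap together, here’s a rough Python sketch of the forwarding decision. Every name here (the cache shape, the query_controller callable) is an assumption for illustration; this is not the actual ESXi VXLAN kernel module logic.

def is_group_mac(mac):
    # Lowest bit of the first octet set means Broadcast/Multicast (a group address).
    return int(mac.split(":")[0], 16) & 1 == 1

def find_destination_vteps(vni, dst_mac, mac_cache, mode, query_controller, all_vteps):
    if is_group_mac(dst_mac):
        return all_vteps                        # BUM: replicate to every VTEP on this VNI
    entry = mac_cache.get((vni, dst_mac))
    if entry and entry["valid"]:
        return [entry["vtep_ip"]]               # cache hit: single unicast VXLAN packet
    if mode in ("hybrid", "unicast"):
        reply = query_controller(vni, dst_mac)  # hypothetical stand-in for the netcpa query
        if reply is not None:
            mac_cache[(vni, dst_mac)] = {"vtep_ip": reply, "valid": True}
            return [reply]
        mac_cache[(vni, dst_mac)] = {"vtep_ip": None, "valid": False}  # the Invalid entry
    return all_vteps                            # Unknown Unicast: flood

# Example: a cache miss in Multicast mode (no Controller) falls back to flooding.
print(find_destination_vteps(5000, "00:50:56:8e:45:33", {}, "multicast",
                             lambda vni, mac: None,
                             ["192.168.250.52", "192.168.150.51"]))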
How does VXLAN-related dvPg info get to the hosts?
You are probably aware that VXLAN-backed dvPortgroups are just that – DVS portgroups, but with additional VXLAN-specific information “attached” to them. When these dvPortgroups are created, NSX Manager provides that information to the VC, which stores it as part of the DVS configuration.
When a VM is connected to a dvPort that is a member of a VXLAN-backed dvPg, VC will create a dvPort on the host, and set a number of opaque attributes that include:
- VNI (VXLAN ID)
- Control Plane (0 = Disabled (for Multicast), 1 = Enabled (for Hybrid or Unicast))
- Multicast IP address (or 0.0.0.1 for Unicast)
Once the dvPort is created, the VXLAN kernel module will read and cache this information for future use.
The information above can be seen in the output of "net-dvs -l", while cached information can be viewed with esxcli:
~ # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
5002      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)   2           2                0
5000      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)   1           0                0
5003      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)   1           0                0
5004      239.0.0.1                  Disabled                             0.0.0.0 (down)         2           0                0
5001      239.0.0.2                  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)   1           2                0
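As a toy illustration of how those three opaque attributes encode the mode, here’s a small Python sketch. The field names and the mode() helper are mine (the real attributes live in net-dvs opaque data, not in this form); it simply mirrors the 0/1 and 0.0.0.1 conventions described above.

from dataclasses import dataclass

@dataclass
class VxlanPortAttrs:
    vni: int
    control_plane_enabled: bool   # 0 = Disabled (Multicast), 1 = Enabled (Hybrid/Unicast)
    multicast_ip: str             # a real group for Multicast/Hybrid, "0.0.0.1" for Unicast

    def mode(self):
        if not self.control_plane_enabled:
            return "multicast"
        return "unicast" if self.multicast_ip == "0.0.0.1" else "hybrid"

print(VxlanPortAttrs(5001, True, "239.0.0.2").mode())   # hybrid
print(VxlanPortAttrs(5004, False, "239.0.0.1").mode())  # multicast
print(VxlanPortAttrs(5000, True, "0.0.0.1").mode())     # unicast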
BUM frame forwarding
For a host to correctly handle BUM frames within a given VXLAN, it needs some means of finding out which other hosts have something connected to that VXLAN, and somehow making sure a copy of each BUM frame reaches all of them.
Multicast Control Plane mode
For Logical Switches in Multicast mode, NSX follows the process described in Section 4.2, “Broadcast Communication and Mapping to Multicast” of RFC 7348.
In short, each VXLAN VNI is associated with a Multicast IP address, allocated by NSX Manager from a user-configured range. If that range runs out of IP addresses, NSX Manager will start re-using them, causing more than one VNI to be associated with a given Multicast IP address.
When a VM is connected to a VXLAN-backed dvPortgroup, the host’s VXLAN kernel module reads the Multicast address associated with that dvPortgroup from the dvPort configuration, and then joins the Multicast group accordingly.
Once this is done, any host that has joined the Multicast group corresponding to a particular VXLAN can reach all other hosts that have done the same by sending a VXLAN packet with the destination IP address set to that VXLAN’s Multicast IP address. The physical network will then take care of the necessary replication and forwarding.
Multicast Control Plane mode does not rely on, or require the presence of, NSX Controllers.
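For readers less familiar with host-side multicast, here’s a small generic Python sketch of what “joining the group” and “sending to the group” look like at the IP level. This is plain socket code for illustration only, not what the ESXi VXLAN stack actually does; the group address is an example, and 4789 is the IANA VXLAN port from RFC 7348 (an NSX deployment may be configured with a different UDP port).

import socket
import struct

GROUP = "239.0.0.1"   # multicast group allocated to a VNI (example value)
PORT = 4789           # IANA VXLAN UDP port (RFC 7348); NSX may use a different configured port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Joining the group puts an IGMP Membership Report on the wire; this is what
# lets the physical network know it should deliver group traffic to this host.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# Sending a single packet to the group address; the network replicates it to
# every group member, so the sender itself emits only one copy.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
sock.sendto(b"stand-in for a VXLAN-encapsulated BUM frame", (GROUP, PORT))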
Replication Math:
Each BUM frame leaves the source ESXi host once. There is no head-end replication.
Hybrid Control Plane mode
Unlike the Multicast one, Hybrid Control Plane mode relies on both NSX Controllers and the physical network to reach other hosts with VMs connected to a given VXLAN.
Controllers maintain a table of VTEPs that have “joined” each VXLAN. Here’s an example of such a table for VNI 5001:
nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI      IP              Segment        MAC                Connection-ID
5001     192.168.250.53  192.168.250.0  00:50:56:67:d9:91  1845
5001     192.168.250.52  192.168.250.0  00:50:56:64:f4:25  1843
5001     192.168.250.51  192.168.250.0  00:50:56:66:e2:ef  7
5001     192.168.150.51  192.168.150.0  00:50:56:60:bc:e9  3
When the first VM on a host connects to a VXLAN in Hybrid or Unicast mode, the following things happen:
- The host realises that this VXLAN requires cooperation with the Controller when it sees “1” in the “Control Plane” attribute on the dvPort that just got connected
- A query is sent from the host to the Controller Cluster to find out which of the 3 Controllers is looking after this VXLAN’s VNI
- If there is no existing connection to that Controller, the host will connect to it and ask for the list of all other VTEPs the Controller has for that VNI; otherwise it will use the existing connection to do so
- The host will then send a “VTEP Membership Update” message to the Controller, telling it that this host’s VTEP has also joined that VNI
- The Controller will generate an update to all other hosts on that VNI, telling them about this new VTEP
- In Hybrid mode, the host also looks up the Multicast IP address associated with the VNI in question, and sends an IGMP “Join” message for that Multicast group toward the physical network from its VTEP vmkernel interface
Once the above is complete, all hosts that have VMs connected to this VXLAN will have a table of other hosts’ VTEP IP addresses. Notice that 192.168.250.51 is missing compared to the previous Controller command output – it’s the VTEP IP of the host where this command is run:
~ # esxcli network vswitch dvs vmware vxlan network vtep list --vds-name Compute_VDS --vxlan-id=5001
IP              Segment ID     Is MTEP
--------------  -------------  -------
192.168.150.51  192.168.150.0  true
192.168.250.53  192.168.250.0  false
192.168.250.52  192.168.250.0  false
As you can see, the table also includes a Segment ID for each VTEP, which is derived from a host’s VTEP IP address and netmask (essentially, the IP subnet). VTEPs with the same Segment ID are presumed to be in the same L2 broadcast domain.
Last, but not least, there is the Is MTEP field. Each host will randomly nominate one VTEP in every Segment ID other than its own as an MTEP. This is done per VNI, so a different host may be selected as the MTEP for a different VNI.
An MTEP is a host that will perform replication of BUM traffic to other hosts with the same Segment ID as its own.
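Here’s a short Python sketch of how a Segment ID can be derived, and how per-segment MTEP nomination might look. The function names and the (IP, netmask) table shape are assumptions for illustration; beyond “random, per VNI”, the actual selection logic inside NSX isn’t documented here.

import ipaddress
import random

def segment_id(vtep_ip, netmask):
    # Segment ID is effectively the VTEP's subnet: the IP ANDed with the netmask.
    return str(ipaddress.ip_network(f"{vtep_ip}/{netmask}", strict=False).network_address)

def nominate_mteps(my_vtep, my_mask, vtep_table):
    # Pick one VTEP at random in every segment other than our own, per VNI.
    my_segment = segment_id(my_vtep, my_mask)
    remote = {}
    for ip, mask in vtep_table:
        seg = segment_id(ip, mask)
        if seg != my_segment:
            remote.setdefault(seg, []).append(ip)
    return {seg: random.choice(ips) for seg, ips in remote.items()}

# VTEP table for VNI 5001 as seen from 192.168.250.51 (own VTEP excluded):
table = [("192.168.250.53", "255.255.255.0"),
         ("192.168.250.52", "255.255.255.0"),
         ("192.168.150.51", "255.255.255.0")]
print(nominate_mteps("192.168.250.51", "255.255.255.0", table))
# {'192.168.150.0': '192.168.150.51'}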
Note: don’t forget that the VTEP table is per-VNI, and includes only hosts that have VMs connected, powered on, and with their vNIC in link-up state. This means that the list of hosts in the VTEP table may be drastically different from VNI to VNI, depending on what’s connected to them.
At this point:
- All hosts with VMs on a given VXLAN know each other’s VTEP IP addresses and IP subnets
- Each of these hosts has sent out an IGMP join for the Multicast IP address associated with that VXLAN
- Controller has a complete list of all these hosts’ VTEP IPs
Now, let’s see what happens when our host has a BUM frame to send.
Following the command outputs above, VNI 5001 has 4 VTEPs in it: three in the subnet 192.168.250.0 and one in 192.168.150.0. Let’s say host 192.168.250.51 has a BUM frame to send. Based on the contents of its VTEP table above, the following will happen:
- The host will find all MTEPs, and send each one of them a copy of the BUM frame, in a VXLAN packet addressed to the Unicast IP address of each MTEP’s VTEP. These VXLAN packets will have bit #5 in their VXLAN header Flags field set to “1”, indicating to the receiving host that it must replicate them locally.
- The host will also look up the Multicast IP address associated with the VNI, and send one copy of the BUM frame to that Multicast address, setting the IP TTL to 1 to prevent it from being forwarded outside of the local broadcast domain.
- Each MTEP, after receiving a copy of the VXLAN-encapsulated BUM frame, will reset the “replicate locally” flag, send a copy of the BUM frame to any VMs running on it that are connected to the VXLAN in question, and then also execute the step described immediately above.
Hybrid mode relies on the physical Ethernet network to forward the VXLAN frames with a Multicast destination IP address to the hosts that have joined the corresponding multicast group.
If the physical network has correctly configured, functional IGMP snooping and an IGMP querier, a copy of such frames will only be delivered to hosts that have sent IGMP join messages. Otherwise, the physical network will deliver these frames to all hosts within the same broadcast domain (e.g., VLAN).
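A conceptual sketch of the source host’s part of this process is below. The send() helper is an invented stand-in for VXLAN encapsulation and transmission (not a real ESXi API), and the MTEP dictionary matches the VNI 5001 outputs above.

def hybrid_bum_send(frame, mteps, vni_mcast_ip, send):
    # One unicast copy per remote segment, addressed to that segment's MTEP,
    # with the proprietary "replicate locally" bit set in the VXLAN flags.
    for mtep_ip in mteps.values():
        send(dst_ip=mtep_ip, payload=frame, replicate_locally=True, ttl=64)
    # One multicast copy for the local segment; TTL=1 keeps it from being
    # routed outside the local broadcast domain.
    send(dst_ip=vni_mcast_ip, payload=frame, replicate_locally=False, ttl=1)

def fake_send(dst_ip, payload, replicate_locally, ttl):
    print(f"copy -> {dst_ip} (replicate_locally={replicate_locally}, ttl={ttl})")

# VNI 5001 from 192.168.250.51: one remote-segment MTEP, multicast group 239.0.0.2
hybrid_bum_send(b"BUM frame", {"192.168.150.0": "192.168.150.51"}, "239.0.0.2", fake_send)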
Replication Math:
For each BUM frame, the source host will:
- Send one copy to the Multicast IP associated with the LS
- Send one copy to the Unicast IP of each MTEP on this LS
Based on the command output samples in this section, our host will send out two copies of each BUM frame; one Unicast to 192.168.150.51, and one Multicast to 239.0.0.2.
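The same arithmetic as a few lines of Python, assuming a simple (VTEP IP, Segment ID) table like the one shown earlier:

def hybrid_bum_copies(vtep_table, my_segment):
    # One multicast copy for our own segment, plus one unicast per remote segment's MTEP.
    remote_segments = {seg for _, seg in vtep_table if seg != my_segment}
    return 1 + len(remote_segments)

table = [("192.168.150.51", "192.168.150.0"),
         ("192.168.250.53", "192.168.250.0"),
         ("192.168.250.52", "192.168.250.0")]
print(hybrid_bum_copies(table, "192.168.250.0"))   # 2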
Unicast Control Plane mode
Unicast Control Plane mode does not have any dependency on physical network replication, instead relying purely on Controllers to do the job.
The difference between Hybrid and Unicast is that the latter doesn’t use Multicast IP addresses to reach VTEPs with the same Segment ID. Instead, both the source host and the MTEPs compile a list of VTEPs with the same Segment ID as their own, and send a separate copy of the BUM frame to each one of them.
Assuming that VNI 5001 was reconfigured for Unicast Control Plane mode and using the same command outputs as above: VNI 5001 still has 4 VTEPs in it; three in the subnet 192.168.250.0 and one in 192.168.150.0. If our host 192.168.250.51 has a BUM frame to send, the following will happen:
- Just as before, the host will find all MTEPs, and send each one of them a copy of the BUM frame, in a VXLAN packet addressed to the Unicast IP address of each MTEP’s VTEP. These VXLAN packets will have bit #5 in their VXLAN header Flags field set to “1”, indicating to the receiving host that it must replicate them locally.
- Unlike in Hybrid mode, the host will find all VTEPs that have the same Segment ID as its own VTEP, and send a separate copy of the BUM frame to each VTEP’s IP.
- Each MTEP, after receiving a copy of the VXLAN-encapsulated BUM frame, will reset the “replicate locally” flag, send a copy of the BUM frame to any VMs running on it that are connected to the VXLAN in question, and then also execute the step described immediately above, creating multiple copies if necessary.
Replication Math:
For each BUM frame, the source host will:
- Send one copy to each of the VTEPs with the same Segment ID as the host’s own VTEP
- Send one copy to the Unicast IP of each MTEP on this LS
Based on the command output samples in this section, our host will send out three copies of each BUM frame (see the worked count after the list):
- Two to hosts in the same Segment: 192.168.250.52 and 192.168.250.53
- One to the MTEP 192.168.150.51, to take care of Segment 192.168.150.0
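And the corresponding worked count for Unicast mode, under the same assumed (VTEP IP, Segment ID) table shape:

def unicast_bum_copies(vtep_table, my_segment):
    # One unicast to every VTEP in our own segment, plus one unicast per remote segment's MTEP.
    local_peers = [ip for ip, seg in vtep_table if seg == my_segment]
    remote_segments = {seg for _, seg in vtep_table if seg != my_segment}
    return len(local_peers) + len(remote_segments)

table = [("192.168.150.51", "192.168.150.0"),
         ("192.168.250.53", "192.168.250.0"),
         ("192.168.250.52", "192.168.250.0")]
print(unicast_bum_copies(table, "192.168.250.0"))   # 3: two local unicasts + one MTEP copy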
Additional benefit of Controllers: VXLAN ARP suppression
In Hybrid and Unicast modes, NSX can also reduce flooding caused by VMs sending ARP requests for other VMs on the same VXLAN. For more details, please see my other blog post – NSX-v under the hood: VXLAN ARP suppression
Operational considerations for the three modes
As stated at the beginning of this post, there is no single “best” mode, as each has upsides and downsides.
Multicast
On the upside, Multicast is very “easy” on ESXi hosts – all they need to do is join the right Multicast groups, and the physical network will do the rest. Also, Controllers are not required, provided you’re not using any Logical Switches in Hybrid or Unicast mode and don’t need NSX distributed routing.
Some of the considerations for going this way are as follows:
- Your physical network must support L2 and L3 Multicast (See the RFC linked above for more details), meaning:
- Team looking after the physical network must be able to configure and support necessary Multicast features
- Troubleshooting NSX connectivity will more likely require involvement of the networking team (and potentially the vendor)
- NSX has a strong dependency on Multicast in the physical network for any BUM traffic handling, which includes, for example, ARP resolution and Multicast communications within the logical space, such as OSPF running between the DLR and an ESG
- Because of the point above, if you’re operating a highly available environment, every upgrade to ESXi or NSX code or your physical switches’ firmware (and potentially some configuration changes) will need to go through functional, load, and soak regression testing before rolling out into production
- It is fairly common for lower-end networking switches to handle Multicast well below line rate due to slow-path replication
- You will need to create and maintain an additional global addressing plan for Multicast IPs
Hybrid
On the upside, Hybrid provides some of the offload capabilities of Multicast mode, while offering the additional benefit of VXLAN ARP suppression. The Multicast configuration on the physical network needed to support Hybrid is much simpler – all that’s needed is IGMP snooping and an IGMP querier per broadcast domain where VTEPs live.
When going Hybrid, consider that:
- All Multicast considerations listed above, except the need for Multicast Routing, are still applicable for Hybrid
- There is a slightly larger replication overhead associated with BUM handling on ESXi hosts
- Unlike in Multicast mode, you’ll need Controllers
Unicast
Unicast mode completely removes the dependency on the physical network when it comes to BUM handling – all replication is done on the hosts, and the resulting communications are purely unicast between VTEPs. ARP suppression is also available, since Unicast mode utilises Controllers.
On the other hand, the decoupling above comes at the cost of additional replication overhead, which needs to be taken into consideration using the math provided above.
Conclusion
I hope this blog provided you with enough information to help you make a call on which VXLAN Control Plane mode is better suited for your environment. A brief summary of things to think about:
- How much BUM traffic are you expecting to see on your Logical Switches? Think packets/sec and average packet size.
- How many VMs are on your Logical Switches, and how many hosts do they translate to? This will tell you how many VTEPs you will have to replicate to.
- How big are your VTEP subnets, and how many of them are there? Remember how MTEP replication works.
- How robust is your physical switching infrastructure, and how familiar are you with IP multicast? What will you do if/when things break?
- How close is your relationship with the team that’s looking after your physical network? Will they be willing to come and help?
- Do you have a model environment (production replica) to test upgrades and changes to ESXi, NSX, and physical network firmware and configuration? Do you have time and resources to do proper regression testing on each upgrade?
Above all – do the math and look at the real numbers – gut feel can be deceiving 🙂
Also, do not forget that Control Plane mode can be selected for each Logical Switch independently, and changed later with very small impact.
January 14th, 2015 at 8:36 pm
Another amazing article! Please keep them coming. Possible typo here: in the section “Unicast Control Plane mode” it should be UTEP as the proxy, not MTEP.
January 14th, 2015 at 8:43 pm
Thanks 🙂
The reason I didn’t use the term “UTEP” is because internally there’s only MTEP, in both the Hybrid and Unicast cases. “UTEP” is a documentation term referring to the same bit of functionality as MTEP.
May 14th, 2015 at 11:31 am
Hi Dmitri,
Is it possible for a host or a NSX controller to maintain multiple MAC to VTEP mappings for the same MAC?
I am assuming that there are multiple VMs running the same service and are distributed across multiple VTEPs. All of these VMs have the same IP and MAC address.
May 14th, 2015 at 6:34 pm
Hi James,
I didn’t test this, but I’m reasonably sure having multiple VMs with the same MAC/IP connected to the same Logical Switch on different hosts will not work.
When a VM with the same MAC/IP address pair as another one comes up and ARPs for something, NSX (host and Controller) will consider it a “move” event, and will think that this MAC/IP pair is now behind a different VTEP. This will “unlearn” the association of that MAC/IP with any other VTEP, causing all traffic for this MAC/IP pair to be directed to that new VM, and not to the first one.
May 15th, 2015 at 3:24 am
Hi Dmitri,
Thanks for your quick reply to my server farm question.
So “multiple MAC to VTEP mappings for the same MAC” will not work for NSX.
In that case, how do we implement a server farm distributed across multiple VTEPs in NSX?
I thought the client VMs will see the servers in the server farm as one single server. To the client VMs, the server has only one IP and one MAC. Load balancing/distribution is performed at the VTEPs.
Any thoughts?
— James Huang
May 16th, 2015 at 9:28 am
Hi James,
Apologies for stating the obvious, but server farms would typically sit behind a load balancer, which solves the problem of individual servers having to have different IPs/MACs. Load balancers also have the additional benefit of being able to quickly detect a dead or busy server, and stop or limit new connections to it.
Hope this makes sense.
NSX has a reasonable load balancer included as one of the functions of Edge Service Gateways, or you can use any other 3rd party’s one.
May 15th, 2015 at 4:10 am
Hi Dmitri,
I think one possible answer for my question is to use NAT at the VTEPs.
But NAT is not straightforward and may not be able to handle every possible protocol’s packets.
Just wondering if there is another solution and how it is done in the data centers?
— James Huang
April 26th, 2016 at 2:40 pm
Hi Dmitri,
Great article here.. This is the only article on-line that explained the process in detail and to my knowledge, explained it accurately. Kudos to you!
April 26th, 2016 at 2:43 pm
Hi Kyle, thanks for the comment. Glad you found it useful! 🙂
July 7th, 2016 at 3:28 am
Wow, amazing post. Thanks
November 7th, 2016 at 2:01 pm
Hi,
Thanks for posting. Question
What happens if all 3 controllers go down in Unicast mode? I guess as long as nothing changes we are OK, but if for any reason I move a VM on VNI 5001 (for example) to a host that didn’t have VMs on that VNI, then no one in the NSX world would know where that VM is now. Am I right? That means that the risk of using Unicast mode is that in case of a failure of all 3 controllers, we may see communication impacted, something that we don’t see in Hybrid mode.
Would appreciate your feedback.
November 7th, 2016 at 2:32 pm
Hi Luis,
You are correct in assessing the impact; however, Hybrid mode will break too for any “new” hosts on a VNI that fall outside of the sending host’s Segment ID. Remember that in Hybrid mode you still need to know your VTEPs to figure out the MTEPs.
HTH,
— Dmitri
November 8th, 2016 at 1:46 pm
Understood, thank you!
January 9th, 2017 at 7:12 am
Hello Dmitri.
Today I decided to check this sentence:
“In Hybrid and Unicast mode, hosts will first query Controller for this mapping, and use information in Controller’s response to populate their local cache, avoiding initial flooding.”
I enabled Unicast mode, checked that the “MAC learning” box is enabled on the LS, and enabled the “verbose” logging level for netcpa. In reality, ESXi sends the query to the Controller and then receives an update from it with the correct VTEP IP. But it doesn’t use the received VTEP IP and does the initial flooding anyway. All local VTEPs and the remote MTEP receive the first packet.
This can easily be checked by observing the MAC cache on all of the local VTEPs and remote MTEPs. The sender’s VM MAC address appears in the caches of all ESXi hosts after the initial packet.
Thank you in advance!
January 9th, 2017 at 8:03 am
Could you please confirm that your Logical Switch has the “Enable IP Discovery” option turned on? That’s the function that does ARP suppression. “Enable MAC Learning” used to do nothing on an LS last time I checked (a year+ ago), and was only relevant to dvPortGroups.
April 25th, 2017 at 4:02 pm
Hi,
I was looking to find this “replicate locally” flag in the VXLAN header, but I could not see it. I don’t see it defined in the VXLAN RFC 7348 either. So I’m wondering where exactly this bit is positioned in the VXLAN header.
Thanks,
Arun
April 25th, 2017 at 4:33 pm
The “Replicate locally” bit (“RLB”) is a VMware proprietary extension, and I have not seen it described in any open standard document.
It uses one of the “Reserved” bits in the VXLAN header’s “Flags” field. I can’t remember which one off the top of my head (I think it’s bit #3). You should be able to see it easily in BUM VXLAN packets sent to an MTEP.
Keep in mind that these are only used between ESXi VXLAN endpoints (that said, my memory is a bit hazy on the NSX-T OVS/KVM ones). A HW VTEP, however, will never be sent a packet with the RLB set.
April 25th, 2017 at 7:48 pm
Thanks for your quick response!
If a HW VTEP uses service node replication, then the HW VTEP would be sending unknown-unicast traffic (say, with overlay destination MAC mac1) to a service node for replication. Let’s say the service node also hosts some VMs. Now, how would the service node know whether it has to replicate to other VTEPs or consume the packet (as there are some VMs running on it)?
April 26th, 2017 at 7:59 am
Hmm, from a quick Google search it looks like nobody cared enough to cover how HW VTEP BUM replication through Service Nodes works. 😦 It’s a bit complex, and it was on my list to cover in this series, but there we are.
In short, Replication Nodes (RNs) open a special control session on VNI 2147483647, through which they receive information about which VNIs have a HW VTEP on them and require replication assistance. The VXLAN stack on RNs knows that it is special (referred to as “PTEP”), and has to do these extra things.
RNs “silently” join all VNIs that have an LS with a physical port (ignoring Transport Zone settings, BTW), and thus receive info about all other VTEPs on these VNIs. That is how they know where to replicate a BUM frame to. They don’t need a “Replicate Locally” flag set on these frames (HW VTEPs don’t set it).
If an RN also has a VM on an LS with a physical port, the RN would send a copy of the BUM frame to that VM too, following the normal BUM replication process.
HTH.
April 26th, 2017 at 3:34 pm
This is really a great info! I think I’m able to understand now.
BTW, I guess that for the RN to replicate BUM traffic from the HW VTEP, it should have an active BFD session with the HW VTEP. Am I right? (I presume BFD is enabled when the HW gateway service is provisioned in NSX.)
April 26th, 2017 at 3:47 pm
The RN doesn’t care whether its BFD session with the HW VTEP is up or not. The HW VTEP itself does, because this is how it decides whether an RN is up, and then decides whether to use a particular RN for replication.
From what I’ve seen when I worked with this around August last year, BFD config is sent to RNs only when there’s an LS with a hardware port on it. When there are no LSes with hardware ports, BFD would be down.
April 26th, 2017 at 5:11 pm
Thanks a lot Dmitri! This really helps!
I was trying, for a prototype, to integrate OVSDB (with the vtep schema) on a HW switch.
This HW switch’s capabilities are:
– VXLAN gateway with Head-End Replication (HER).
– VNI and remote VTEPs on the VNI are manually configured. Physical port attachment to the VNI is also manually configured. But by having OVSDB, I see we don’t need this to be manually configured, as we get this config from NSX.
– Does not support BFD.
In this case, if NSX is the controller, the replication nodes’ state can’t be detected by the HW switch, because BFD is not supported on the switch.
I was thinking the HW switch would do HER (for BUM) based on the VTEPs learned from the ucast_macs_remote table. But based on what you described, I think we could have duplicate packets in this scenario, because the PTEPs would also replicate the packet when they receive a copy.
Am I right?
April 26th, 2017 at 5:48 pm
AFAIK NSX supports HW VTEPs with no BFD. If you’re talking to NSX, your OVSDB server will receive RN info in the mcast_macs_remote table (IIRC), and you should be able to use it to get RN replication. Yes, you won’t be able to see if an RN is up or down.
Back to your original question: I have not investigated this area in detail, and don’t know what logic an RN uses to decide whether it should replicate a given BUM frame. It might be checking the source IP of the VXLAN packet against a list of known HW VTEPs, but as I said, I don’t know for sure.
If you’re doing a HW VTEP development, your best bet is to get in touch with the VMware partner team, who should be able to furnish you with the proper support and documentation.
April 26th, 2017 at 6:01 pm
Sure. Yet again thanks a lot for your valuable inputs and time taken to look into my queries!!