NSX for vSphere: VXLAN Control Plane modes explained

NSX for vSphere supports three VXLAN Control Plane modes:

  1. Multicast (described in Section 4 of RFC 7348);
  2. Hybrid; and
  3. Unicast.

None of these is “simply better” than the others; each has its positive and negative sides. In this post, I’m covering how each mode works, along with some of those negatives and positives, to hopefully help you make a better-informed choice for your circumstances.

Overview

Since this topic is a bit large, here’s a brief outline of what we’ll be talking about.

  1. Fundamental VXLAN Control Plane functions, and how they’re implemented in three modes
  2. How VXLAN Control Plane information gets to the hosts
  3. Traffic replication in the three modes, and how to calculate corresponding overheads
  4. Additional Control Plane function enabled by NSX Controllers (ARP suppression)
  5. Operational considerations for each of the modes

Quick recap of Control Plane functions

In general, each host running VXLAN needs to be able to do the following two things:

  1. When a VM on that host connected to an LS sends a frame addressed to a particular MAC address, figure out the VTEP IP address of the host where this destination MAC address resides; and
  2. If that destination is a Broadcast, Unknown Unicast[1], or Multicast (a.k.a. “BUM”) address, send the frame to all VTEPs for that given VXLAN.

Note: NSX doesn’t include provisions for Multicast switching or routing, and follows standard Ethernet behaviour, which is to flood frames with Multicast destination MAC addresses within their L2 domain, i.e., individual VXLAN.

[1]A frame is considered Unknown Unicast by a host if the host doesn’t have a cached MAC to VTEP IP entry for the frame’s destination MAC.

Mapping MACs to VTEPs

To do #1, each host maintains a per-VNI cache of MAC to VTEP mappings:

~ # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Inner MAC          Outer MAC          Outer IP        Flags
-----------------  -----------------  --------------  --------
00:50:56:a6:a1:e3  00:50:56:61:23:00  192.168.250.52  00000111
00:50:56:8e:45:33  ff:ff:ff:ff:ff:ff  192.168.150.51  00001111

In the sample output above, we see that the host where this command was run knows about a couple of MAC addresses. One of them, 00:50:56:a6:a1:e3, is located behind VTEP 192.168.250.52, listed under Outer IP. The Outer MAC is the MAC address of that VTEP, included here because it is on the same subnet as the VTEP of the host we’re looking at.

For VTEPs with IPs in other subnets, Outer MAC is set to ff:ff:ff:ff:ff:ff, since communication with them goes via the default gateway of the VXLAN IP stack, so their MAC is not needed. The second entry in the table above shows this scenario.

First difference: how MAC to VTEP table is populated

In all three modes, this table is populated by looking at packets arriving from remote VTEPs. An incoming VXLAN packet carries the source VM’s MAC address (Inner MAC) in the header of the encapsulated Ethernet frame generated by that VM, and the VTEP MAC[2] and IP of that VM’s host (Outer MAC and Outer IP) in the outer L2/L3 headers.

[2]If the remote VTEP is in a different IP subnet, that MAC will be of the L3 gateway, not of the VTEP, and will be replaced with ff:ff:ff:ff:ff:ff.
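
To make this learning step more concrete, here’s a minimal Python sketch of the logic just described. It is not NSX code: the names and the /24 prefix length are made up for illustration.

from dataclasses import dataclass
from ipaddress import ip_address, ip_network

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

@dataclass
class MacVtepEntry:
    inner_mac: str   # VM MAC learned from the encapsulated frame
    outer_mac: str   # remote VTEP MAC, or ff:ff:ff:ff:ff:ff if behind a gateway
    outer_ip: str    # remote VTEP IP

def learn_from_packet(local_vtep_ip, prefixlen,
                      outer_src_mac, outer_src_ip, inner_src_mac):
    """Build a per-VNI MAC:VTEP cache entry from one incoming VXLAN packet."""
    local_subnet = ip_network(f"{local_vtep_ip}/{prefixlen}", strict=False)
    same_subnet = ip_address(outer_src_ip) in local_subnet
    # If the remote VTEP is in another subnet, the outer source MAC we saw
    # belongs to our L3 gateway, so it gets recorded as ff:ff:ff:ff:ff:ff.
    outer_mac = outer_src_mac if same_subnet else BROADCAST_MAC
    return MacVtepEntry(inner_src_mac, outer_mac, outer_src_ip)

# Mirrors the first entry in the table above (local VTEP 192.168.250.51/24):
print(learn_from_packet("192.168.250.51", 24,
                        "00:50:56:61:23:00", "192.168.250.52",
                        "00:50:56:a6:a1:e3"))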

In Multicast mode, a host has no choice but to flood a frame for which it doesn’t have a MAC:VTEP entry, in the hope that one of the copies of the flooded frame will reach the destination VM and cause it to respond, so that the above process can complete.

Flooding also creates a MAC:VTEP cache entry for the source VM’s MAC and the VTEP of its host on all hosts participating in the VXLAN this VM is connected to.

In Hybrid and Unicast modes, hosts will first query the Controller for this mapping, and use the information in the Controller’s response to populate their local cache, avoiding the initial flooding.

Here’s an example of request and response; note what information is sent back by the Controller:

Request:

2014-12-31T02:59:16.249Z [FFB44B70 verbose 'Default'] Vxlan: receive a message from kernel
2014-12-31T02:59:16.249Z [FFB44B70 verbose 'Default'] VXLAN Message
-->     VM MAC Query: len = 18
-->     SwitchID:0, VNI:5000
-->     Num of entries: 1
-->             #0      VM MAC:00:50:56:a6:a1:e3

Response:

2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] Vxlan: receive UPDATE from controller 192.168.110.203:0
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] Vxlan: send a message to the dataplane
2014-12-31T02:59:16.357Z [FF9BD100 verbose 'Default'] VXLAN Message
-->     VM MAC Update: len = 30
-->     SwitchID:0, VNI:5000
-->     Num of removed entries: 0
-->     Num of added entries: 1
-->             #0      VM MAC:00:50:56:a6:a1:e3        VTEP IP:192.168.250.52  VTEP MAC:00:50:56:61:23:00

As mentioned above, this table is a cache. Entries in it expire roughly 200 seconds after the host stops seeing matching VXLAN ingress or egress packets.

What if the Controller didn’t have the answer? In that case, there will be no Response from it, and the host will have an Invalid entry similar to the one below:

~ # esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Inner MAC          Outer MAC          Outer IP        Flags
-----------------  -----------------  --------------  --------
00:00:00:00:00:01  ff:ff:ff:ff:ff:ff  192.168.250.52  00000110

Things to note in regards to the above:

  • The last bit in the Flags field is set to 0, which indicates that this entry is Invalid and will not be used;
  • Outer IP populated in Invalid entry is a cosmetic bug (recently fixed). Remember that entries marked as Invalid are not used, so while it’s confusing, it’s also harmless.

To recap:

  • The host relies on the contents of the MAC:VTEP table to find the destination VTEP
  • This table is populated by learning from incoming traffic
  • If there’s no entry, an LS in Unicast or Hybrid mode will query the Controller for the mapping. An LS in Multicast mode will not, as the Controller isn’t used
  • When the destination MAC has no matching entry (and the Controller hasn’t replied, in Unicast or Hybrid mode), or is a Broadcast or Multicast destination (a.k.a. “BUM”), the host will flood that frame to all other hosts that are members of that VXLAN (see the sketch below)
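
Putting the recap together, here’s a minimal Python sketch of this forwarding decision. It is purely illustrative; the helper names and the shapes of mac_vtep_cache and query_controller are assumptions, not NSX internals.

def is_broadcast_or_multicast(mac):
    # The group (I/G) bit is the lowest bit of the first octet.
    return int(mac.split(":")[0], 16) & 0x01 == 0x01

def resolve_destination(dst_mac, mode, mac_vtep_cache, query_controller):
    """Return the destination VTEP IP for a frame, or None if the frame
    must be treated as BUM and flooded to all VTEPs on the VXLAN."""
    if is_broadcast_or_multicast(dst_mac):
        return None                              # Broadcast/Multicast: flood
    entry = mac_vtep_cache.get(dst_mac)
    if entry is not None:
        return entry                             # known unicast: one VTEP
    if mode in ("hybrid", "unicast"):
        answer = query_controller(dst_mac)       # ask the Controller
        if answer is not None:
            mac_vtep_cache[dst_mac] = answer
            return answer
    return None                                  # unknown unicast: flood

# Example: a cache with one known MAC, and a Controller that knows nothing extra.
cache = {"00:50:56:a6:a1:e3": "192.168.250.52"}
print(resolve_destination("00:50:56:a6:a1:e3", "unicast", cache, lambda m: None))
print(resolve_destination("ff:ff:ff:ff:ff:ff", "unicast", cache, lambda m: None))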

How does VXLAN-related dvPg info get to the hosts?

You are probably aware that VXLAN-backed dvPortgroups are just that – DVS portgroups, but with additional VXLAN-specific information “attached” to them. When these dvPortgroups are created, NSX Manager provides that information to the VC, which stores it as part of the DVS configuration.

When a VM is connected to a dvPort that is a member of a VXLAN-backed dvPg, VC will create a dvPort on the host, and set a number of opaque attributes that include:

  • VNI (VXLAN ID)
  • Control Plane (0 = Disabled (for Multicast), 1 = Enabled (for Hybrid or Unicast))
  • Multicast IP address (or 0.0.0.1 for Unicast)

Once the dvPort is created, the VXLAN kernel module will read and cache this information for future use.

The information above can be seen in the output of "net-dvs -l", while the cached information can be viewed with esxcli:

~ # esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
    5002  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            2                2                0
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                0                0
    5003  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  192.168.110.203 (up)            1                0                0
    5004  239.0.0.1                  Disabled                             0.0.0.0 (down)                  2                0                0
    5001  239.0.0.2                  Enabled (multicast proxy,ARP proxy)  192.168.110.201 (up)            1                2                0
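
As a quick way to map the “Control Plane” and “Multicast IP” values (both the dvPort opaque attributes listed earlier and the columns in the esxcli output above) to a mode, here is a tiny, purely illustrative Python sketch; the function name is made up:

def vxlan_mode(control_plane_enabled, multicast_ip):
    """Infer the Control Plane mode from the two dvPort values described above."""
    if not control_plane_enabled:
        return "Multicast"        # no Controller; physical network replicates
    if multicast_ip == "0.0.0.1":
        return "Unicast"          # headend replication only, no multicast group
    return "Hybrid"               # Controller plus local multicast replication

print(vxlan_mode(False, "239.0.0.1"))   # VNI 5004 above -> Multicast
print(vxlan_mode(True,  "0.0.0.1"))     # VNI 5000 above -> Unicast
print(vxlan_mode(True,  "239.0.0.2"))   # VNI 5001 above -> Hybrid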

BUM frame forwarding

For a host to correctly handle BUM frames within a given VXLAN, it needs some means of finding out which other hosts have something connected to that VXLAN, and somehow making sure a copy of each BUM frame reaches all of them.

Multicast Control Plane mode

For Logical Switches in Multicast mode, NSX follows the process described in Section 4.2, “Broadcast Communication and Mapping to Multicast” of RFC 7348.

In short, each VXLAN VNI is associated with a Multicast IP address, allocated by NSX Manager from a user-configured range. If that range runs out of IP addresses, NSX Manager will start re-using them, causing more than one VNI to be associated with a given Multicast IP address.
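
Here’s a small, purely illustrative Python sketch of that wrap-around re-use; the actual allocation order NSX Manager uses may differ, and the function name is made up:

from ipaddress import ip_address

def allocate_multicast_ips(vnis, pool_start, pool_end):
    """Assign each VNI a multicast group from a fixed range, re-using
    addresses from the start once the range is exhausted."""
    start, end = int(ip_address(pool_start)), int(ip_address(pool_end))
    pool_size = end - start + 1
    return {vni: str(ip_address(start + (i % pool_size)))
            for i, vni in enumerate(vnis)}

# Three VNIs but only a two-address pool, so 239.0.0.1 gets re-used:
print(allocate_multicast_ips([5000, 5001, 5002], "239.0.0.1", "239.0.0.2"))
# {5000: '239.0.0.1', 5001: '239.0.0.2', 5002: '239.0.0.1'}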

When a VM is connected to a VXLAN-backed dvPortgroup, the host’s VXLAN kernel module reads the Multicast address associated with that dvPortgroup from the dvPort configuration, and then joins the corresponding Multicast group.

Once this is done, any host that has joined the Multicast group corresponding to a particular VXLAN can reach all other hosts that have done the same by sending a VXLAN packet with the destination IP address set to that VXLAN’s Multicast IP address. The physical network will then take care of the necessary replication and forwarding.

Multicast Control Plane mode does not rely on, or require the presence of, NSX Controllers.

Replication Math:

Each BUM frame leaves the source ESXi host once. There is no head-end replication.

Hybrid Control Plane mode

Unlike the Multicast mode, Hybrid Control Plane mode relies on both NSX Controllers and the physical network to reach other hosts with VMs connected to a given VXLAN.

Controllers maintain a table of VTEPs that have “joined” each VXLAN. Here’s an example of such a table for VNI 5001:

nsx-controller # show control-cluster logical-switches vtep-table 5001
VNI      IP              Segment         MAC               Connection-ID
5001     192.168.250.53  192.168.250.0   00:50:56:67:d9:91 1845
5001     192.168.250.52  192.168.250.0   00:50:56:64:f4:25 1843
5001     192.168.250.51  192.168.250.0   00:50:56:66:e2:ef 7
5001     192.168.150.51  192.168.150.0   00:50:56:60:bc:e9 3

When the first VM on a host connects to a VXLAN in Hybrid or Unicast mode, the following things happen:

  • The host realises that this VXLAN requires cooperation with the Controller when it sees “1” in the “Control Plane” attribute on the dvPort that has just been connected
  • A query is sent from the host to the Controller Cluster to find out which of the three Controllers is looking after this VXLAN’s VNI
  • If there is no existing connection to that Controller, the host will connect to it and ask for the list of all other VTEPs the Controller has for that VNI; otherwise, it will use the existing connection to do so
  • The host will then send a “VTEP Membership Update” message to the Controller, telling it that this host’s VTEP has also joined that VNI
  • The Controller will generate an update to all other hosts on that VNI, telling them about this new VTEP
  • The host looks up the Multicast IP address associated with the VXLAN VNI in question, and sends an IGMP “Join” message for that Multicast group toward the physical network from its VTEP vmkernel interface

Once the above is complete, all hosts that have VMs connected to this VXLAN will have a table of other hosts’ VTEP IP addresses. Notice that 192.168.250.51 is missing compared to the previous Controller command output – it’s the VTEP IP of the host where this command is run:

~ # esxcli network vswitch dvs vmware vxlan network vtep list --vds-name Compute_VDS --vxlan-id=5001
IP              Segment ID     Is MTEP
--------------  -------------  -------
192.168.150.51  192.168.150.0     true
192.168.250.53  192.168.250.0    false
192.168.250.52  192.168.250.0    false

As you can see, the table also includes a Segment ID for each VTEP, which is derived from the host’s VTEP IP address and netmask (essentially, the IP subnet). VTEPs with the same Segment ID are presumed to be in the same L2 broadcast domain.

Last, but not least, there is the Is MTEP field. Each host will randomly nominate one VTEP in every Segment ID other than its own as an MTEP. This is done per VNI, so a different host may be selected as the MTEP for a different VNI.

An MTEP is a host that will perform replication of BUM traffic to other hosts with the same Segment ID as its own.
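
NSX’s actual selection algorithm isn’t exposed, but here is a minimal Python sketch of the idea described above: group the VTEP table by Segment ID and randomly nominate one VTEP per segment other than your own. The /24 prefix length and the function name are assumptions for illustration.

import random
from collections import defaultdict
from ipaddress import ip_network

def nominate_mteps(local_vtep_ip, vtep_ips, prefixlen=24):
    """Per VNI: pick one random MTEP for every Segment ID other than our own."""
    segment = lambda ip: str(ip_network(f"{ip}/{prefixlen}",
                                        strict=False).network_address)
    by_segment = defaultdict(list)
    for vtep in vtep_ips:
        by_segment[segment(vtep)].append(vtep)
    local_segment = segment(local_vtep_ip)
    return {seg: random.choice(vteps)
            for seg, vteps in by_segment.items() if seg != local_segment}

# VTEP table from the esxcli output above, as seen by 192.168.250.51:
print(nominate_mteps("192.168.250.51",
                     ["192.168.150.51", "192.168.250.53", "192.168.250.52"]))
# -> {'192.168.150.0': '192.168.150.51'}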

Note: don’t forget that the VTEP table is per-VNI, and includes only hosts that have VMs connected, powered on, and with a vNIC in link-up state. This means that the list of hosts in the VTEP table may be drastically different for different VNIs, depending on what’s connected to them.

At this point:

  • All hosts with VMs on a given VXLAN know each other’s VTEP IP addresses and IP subnets
  • Each of these hosts has sent out an IGMP join for the Multicast IP address associated with that VXLAN
  • Controller has a complete list of all these hosts’ VTEP IPs

Now, let’s see what happens when our host has a BUM frame to send.

Following the command outputs above, VNI 5001 has four VTEPs in it: three in the subnet 192.168.250.0 and one in 192.168.150.0. Let’s say host 192.168.250.51 has a BUM frame to send. Based on the contents of its VTEP table above, the following will happen:

  • The host will find all MTEPs and send each of them a copy of the BUM frame, in a VXLAN packet addressed to the Unicast IP address of each MTEP’s VTEP. These VXLAN packets will have bit #5 in their VXLAN header Flags field set to “1”, indicating to the receiving host that it must replicate them locally.
  • The host will look up the Multicast IP address associated with the VNI, and send one copy of the BUM frame to that Multicast address, setting the IP TTL to 1 to prevent it from being forwarded outside of the local broadcast domain.
  • Each MTEP, after receiving a copy of the VXLAN-encapsulated BUM frame, will reset the “replicate locally” flag, send a copy of the BUM frame to all VMs running on it that are connected to the VXLAN in question, and then also execute the step described immediately above.

Hybrid mode relies on the physical Ethernet network to forward VXLAN frames with a Multicast destination IP address to the hosts that have joined the corresponding Multicast group.

If the physical network has correctly configured, functional IGMP snooping and an IGMP querier, a copy of such frames will only be delivered to hosts that have sent IGMP join messages. Otherwise, the physical network will deliver these frames to all hosts within the same broadcast domain (e.g., VLAN).

Replication Math:

For each BUM frame, the source host will:

  1. Send one copy to the Multicast IP associated with the LS
  2. Send one copy to the Unicast IP of each MTEP on this LS

Based on the command output samples in this section, our host will send out two copies of each BUM frame: one Unicast to 192.168.150.51, and one Multicast to 239.0.0.2.
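
To make this countable, here is a tiny, purely illustrative Python helper (not an NSX tool; the function name and the vtep_segments dictionary shape are made up) that applies the Hybrid math above to the VTEP table from the esxcli output:

def hybrid_copies_from_source(local_segment, vtep_segments):
    """Copies of one BUM frame leaving the source host in Hybrid mode:
    one multicast copy for the local segment, plus one unicast copy
    per remote segment (sent to that segment's MTEP)."""
    remote_segments = {seg for seg in vtep_segments.values()
                       if seg != local_segment}
    return 1 + len(remote_segments)

# VTEP table above, as seen from 192.168.250.51 (segment 192.168.250.0):
print(hybrid_copies_from_source("192.168.250.0",
                                {"192.168.250.52": "192.168.250.0",
                                 "192.168.250.53": "192.168.250.0",
                                 "192.168.150.51": "192.168.150.0"}))  # -> 2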

Unicast Control Plane mode

Unicast Control Plane mode does not have any dependencies on physical network replication, instead relying purely on Controllers to do the job.

The difference between Hybrid and Unicast is that the latter doesn’t use Multicast IP addresses to reach VTEPs with the same Segment ID. Instead, both the source host and the MTEPs compile a list of VTEPs with the same Segment ID as their own, and send a separate copy of the BUM frame to each one of them.

Assuming that VNI 5001 was reconfigured for Unicast Control Plane mode and using the same command outputs as above: VNI 5001 still has four VTEPs in it, three in the subnet 192.168.250.0 and one in 192.168.150.0. If our host 192.168.250.51 has a BUM frame to send, the following will happen:

  • Just as before, the host will find all MTEPs and send each of them a copy of the BUM frame, in a VXLAN packet addressed to the Unicast IP address of each MTEP’s VTEP. These VXLAN packets will have bit #5 in their VXLAN header Flags field set to “1”, indicating to the receiving host that it must replicate them locally.
  • Unlike in Hybrid, the host will find all VTEPs that have the same Segment ID as its own VTEP, and send an individual copy of the BUM frame to each VTEP’s IP.
  • Each MTEP, after receiving a copy of the VXLAN-encapsulated BUM frame, will reset the “replicate locally” flag, send a copy of the BUM frame to all VMs running on it that are connected to the VXLAN in question, and then also execute the step described immediately above, creating multiple copies if necessary.

Replication Math:

For each BUM frame, the source host will:

  1. Send one copy to each of the VTEPs with the same Segment ID as the host’s own VTEP
  2. Send one copy to the Unicast IP of each MTEP on this LS

Based on the command output samples in this section, our host will send out three copies of each BUM frame:

  • Two to hosts in the same Segment: 192.168.250.52 and 192.168.250.53
  • One to the MTEP 192.168.150.51, to take care of Segment 192.168.150.0
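
And here is the equivalent illustrative helper for Unicast mode, under the same assumptions as the Hybrid sketch above:

def unicast_copies_from_source(local_segment, vtep_segments):
    """Copies of one BUM frame leaving the source host in Unicast mode:
    one unicast copy per VTEP in the local segment, plus one unicast
    copy per remote segment (sent to that segment's MTEP)."""
    local_peers = sum(1 for seg in vtep_segments.values()
                      if seg == local_segment)
    remote_segments = {seg for seg in vtep_segments.values()
                       if seg != local_segment}
    return local_peers + len(remote_segments)

# Same VTEP table, seen from 192.168.250.51:
print(unicast_copies_from_source("192.168.250.0",
                                 {"192.168.250.52": "192.168.250.0",
                                  "192.168.250.53": "192.168.250.0",
                                  "192.168.150.51": "192.168.150.0"}))  # -> 3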

Additional benefit of Controllers: VXLAN ARP suppression

In Hybrid and Unicast modes, NSX can also reduce the flooding caused by VMs sending ARP requests for other VMs on the same VXLAN. For more details, please see my other blog post – NSX-v under the hood: VXLAN ARP suppression.

Operational considerations for the three modes

As stated at the beginning of this post, there is no single “best” mode, as each has its upsides and downsides.

Multicast

On the upside, Multicast is very “easy” on ESXi hosts – all they need to do is join the right Multicast groups, and the physical network will do the rest. Also, Controllers are not required, as long as you don’t use any Logical Switches in Hybrid or Unicast mode and don’t need NSX distributed routing.

Some of the considerations for going this way are as follows:

  • Your physical network must support L2 and L3 Multicast (see the RFC linked above for more details), meaning:
    • The team looking after the physical network must be able to configure and support the necessary Multicast features
    • Troubleshooting NSX connectivity will more likely require involvement of the networking team (and potentially the vendor)
    • NSX has a strong dependency on Multicast in the physical network for any BUM traffic handling, which includes, for example, ARP resolution and Multicast communications within the logical space, such as OSPF running between a DLR and an ESG
    • Because of the point above, if you’re operating a highly available environment, every upgrade to ESXi or NSX code or your physical switches’ firmware (and potentially some configuration changes) will need to go through functional, load, and soak regression testing before rolling out into production
    • It is fairly common for lower-end networking switches to handle Multicast well below line rate due to slow-path replication
  • You will need to create and maintain an additional global addressing plan for Multicast IPs

Hybrid

On the upside, Hybrid provides some of the offload capabilities of Multicast, while offering the additional benefit of VXLAN ARP suppression. The Multicast configuration on the physical network needed to support Hybrid is much simpler – all that’s needed is IGMP snooping and an IGMP querier per broadcast domain where the VTEPs live.

When going Hybrid, consider that:

  • All Multicast considerations listed above, except the need for Multicast Routing, are still applicable for Hybrid
  • There is a slightly larger replication overhead associated with BUM handling on ESXi hosts
  • Unlike in Multicast mode, you’ll need Controllers

Unicast

Unicast mode completely removes the dependency on the physical network when it comes to BUM handling – all replication is done on the hosts, and the resulting communication is purely unicast between VTEPs. ARP suppression is also available, since Unicast mode utilises Controllers.

On the other hand, this decoupling comes at the cost of further replication overhead, which needs to be taken into consideration using the math provided above.

Conclusion

I hope this blog provided you with enough information to help you make a call on which VXLAN Control Plane mode is better suited for your environment. A brief summary of things to think about:

  • How much BUM traffic are you expecting to see on your Logical Switches? Think packets/sec and average packet size.
  • How many VMs are on your Logical Switches, and how many hosts do they translate to? This will tell you how many VTEPs you will have to replicate to.
  • How big are your VTEP subnets, and how many of them are there? Remember how MTEP replication works.
  • How robust is your physical switching infrastructure, and how familiar are you with IP Multicast? What will you do if/when things break?
  • How close is your relationship with the team that looks after your physical network? Will they be willing to come and help?
  • Do you have a model environment (production replica) to test upgrades and changes to ESXi, NSX, and physical network firmware and configuration? Do you have the time and resources to do proper regression testing on each upgrade?

Above all – do the math and look at the real numbers – gut feel can be deceiving 🙂
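
To give a purely hypothetical example of that math: with 40 hosts on a Logical Switch spread evenly across 4 VTEP subnets of 10 hosts each, one BUM frame costs the source host 1 copy in Multicast mode, 1 + 3 = 4 copies in Hybrid mode (one Multicast copy plus one per remote-segment MTEP), and 9 + 3 = 12 copies in Unicast mode (nine same-segment VTEPs plus three MTEPs). Multiply that by your expected BUM packet rate and size to see the real overhead.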

Also, do not forget that the Control Plane mode can be selected for each Logical Switch independently, and changed later with very little impact.


28 responses to “NSX for vSphere: VXLAN Control Plane modes explained”

  • trainingrevolution

    Another amazing article! Please keep them coming. Possible typo here: in the section “Unicast Control Plane mode”, it should be UTEP as the proxy, not MTEP.

    • Dmitri Kalintsev

      Thanks 🙂

      The reason why I didn’t use term “UTEP” is because internally there’s only MTEP, both in case of Hybrid and Unicast. “UTEP” is a documentation term referring to the same bit of functionality as MTEP.

  • James

    Hi Dmitri,
    Is it possible for a host or a NSX controller to maintain multiple MAC to VTEP mappings for the same MAC?
    I am assuming that there are multiple VMs running the same service and are distributed across multiple VTEPs. All of these VMs have the same IP and MAC address.

    • Dmitri Kalintsev

      Hi James,

      I didn’t test this, but I’m reasonably sure having multiple VMs with the same MAC/IP connected to the same Logical Switch on different hosts will not work.

      When a VM with the same MAC/IP address pair as another one comes up and ARPs for something, NSX (host and Controller) will consider it a “move” event, and will think that this MAC/IP pair is now behind a different VTEP. This will “unlearn” the association of that MAC/IP with any other VTEP, causing all traffic for this MAC/IP pair to be directed to that new VM, and not to the first one.

      • James Huang

        Hi Dmitri,
        Thanks for your quick reply to my server farm question.
        So “multiple MAC to VTEP mappings for the same MAC” will not work for NSX.
        In that case, how do we implement a server farm distributed across multiple VTEPs in NSX?
        I thought the client VMs will see the servers in the server farm as one single server. To the client VMs, the server has only one IP and one MAC. Load balancing/distribution is performed at the VTEPs.
        Any thoughts?

        — James Huang

      • Dmitri Kalintsev

        Hi James,

        Apologies for being kind of obvious, but server farms would typically sit behind a load balancer, which lets individual servers keep their own distinct IPs/MACs while appearing to clients as a single service. Load balancers also have the additional benefit of being able to quickly detect a dead or busy server, and stop / limit new connections to it.

        Hope this makes sense.

        NSX has a reasonable load balancer included as one of the functions of Edge Service Gateways, or you can use any other 3rd party’s one.

      • James Huang

        Hi Dmitri,

        I think one possible answer for my question is to use NAT at the VTEPs.
        But NAT is not straight-forward and may not be able to handle every possible protocol packets.
        Just wondering if there is another solution and how it is done in the data centers?

        — James Huang

  • Kyle McKay

    Hi Dmitri,

    Great article here.. This is the only article on-line that explained the process in detail and to my knowledge, explained it accurately. Kudos to you!

  • Paul

    Wow, amazing post. Thanks


  • Luis M

    Hi,

    Thanks for posting. Question

    What happens if the 3 controllers go down in Unicast mode? I guess as long as nothing changes we are ok, but if for any reason I move a VM on VNI 5001 (for example) to a host that didn’t have VMs on that VNI, then no one in the NSX world would know where that VM is now. Am I right? That means that the risk of using Unicast mode is that in case of failure of the 3 controllers, we may see communication impacted, something that we don’t see in Hybrid mode.

    Would appreciate your feedback.

  • vladimir

    Hello Dmitri.
    Today I decided to check this sentence:
    “In Hybrid and Unicast modes, hosts will first query the Controller for this mapping, and use the information in the Controller’s response to populate their local cache, avoiding the initial flooding.”

    I enabled Unicast mode, checked that the “MAC Learning” box is enabled on the LS, and enabled “verbose” logging level for netcpa. In reality, ESXi sends the query to the Controller and then receives an update from it with the correct VTEP IP. But it doesn’t use the received VTEP IP and does the initial flooding anyway. All local VTEPs and the remote MTEP receive the first packet.
    This can be easily checked by observing the MAC cache on all of the local VTEPs and remote MTEPs. The sender’s VM MAC address appears in the caches of all ESXi hosts after the initial packet.
    Thank you in advance!

    • Dmitri Kalintsev

      Could you please confirm that your Logical Switch has the “Enable IP Discovery” option turned on? That’s the function that does ARP suppression. “Enable MAC Learning” used to do nothing on LS last time I checked (year+ ago), and was only relevant to dvPortGroups.


  • mac-entry

    Hi,

    I was looking to find this “replicate locally” flag in the VxLAN header, but I could not see it. I don’t see it defined in the VxLAN RFC 7348 as well. So wondering where exactly this bit is positioned in the vxlan header.

    Thanks,
    Arun

    • Dmitri Kalintsev

      “Replicate locally” bit (“RLB”) is a VMware proprietary extension, and I have not seen it described in any open standard document.

      It uses one of the “Reserved” bits in the VXLAN header’s “Flags” field. Can’t remember which one off the top of my head (I think it’s the bit #3). You should be able to see it easily in BUM VXLAN packets sent to an MTEP.

      Keep in mind that these are only used between ESXi VXLAN endpoints (but that said my memory is a bit hazy on NSX-T OVS/KVM ones). HW VTEP however will never be sent a packet with the RLB set.

      • mac-entry

        Thanks for your quick response!

        If HW VTEP uses service node replication, then the HW VTEP would be sending the unknown-unicast (say, overlay dest MAC mac1) traffic to the service node for replication. Let’s say the service node also hosts some VMs. Now how would the service node know whether it has to replicate to other VTEPs or consume the packet (as there are some VMs running on it)?

      • Dmitri Kalintsev

        Hmm, from a quick Google search it looks like nobody cared enough to cover how HW VTEP BUM replication through Service Nodes works. 😦 It’s a bit complex, and was in my plans to cover in this series, but there we are.

        In short, Replication Nodes (RNs) open a special control session on VNI 2147483647 through which they receive information about which VNIs have a HW VTEP on them and require replication assistance. VXLAN stack on RNs knows that it is special (referred to as “PTEP”), and has to do these extra things.

        RNs “silently” join all VNIs that have LS with a physical port (ignoring Transport Zone settings, BTW), and thus receive info about all other VTEPs on these VNIs. That is how they know where to replicate a BUM frame to. They don’t need a “Replicate Locally” flag set on these frames (HW VTEPs don’t do it).

        If an RN also has a VM on an LS with a physical port, RN would send a copy of the BUM frame to that VM too following the normal BUM replication process.

        HTH.

  • mac-entry

    This is really a great info! I think I’m able to understand now.

    BTW, I guess that for the RN to replicate BUM traffic from the HW-VTEP, it should have an active BFD session with the HW-VTEP. Am I right? (I presume BFD is enabled when the HW gateway service is provisioned in NSX.)

    • Dmitri Kalintsev

      RN doesn’t care if its BFD session with HW VTEP is up or not. HW VTEP itself does, because this is how it decides whether RN is up or not, and then decides whether to use a particular RN for replication.

      From what I’ve seen when I worked with this around August last year BFD config is sent to RNs only when there’s an LS with a hardware port on it. When there are no LS with hardware ports, BFD would be down.

  • mac-entry

    Thanks a lot Dmitri! This really helps!

    I was trying for a protoype to integrate OVSDB(with vtep schema) on a HW switch.
    This HW switch capability is:
    – VxLAN gateway with Head-End-Replication(HER).
    – VNI and remote VTEPs on the VNI are manually configured. Physical port attachment to the VNI is also manually configured. But by having OVSDB, I see we don’t need this to be manually configured as we get these config from NSX.
    – Does not support BFD.

    In this case, if NSX is the controller, the replication nodes’ state can’t be detected on the HW switch because BFD is not supported on the switch.
    I was thinking the HW switch will do HER (for BUM) based on the VTEPs learned from the ucast_macs_remote table. But based on what you described, I think we can have duplicate packets in this scenario, because the PTEPs would also replicate the packet when they receive a copy.

    Am I right??

    • Dmitri Kalintsev

      AFAIK NSX supports HW VTEPs with no BFD. If you’re talking to the NSX your OVSDB server will receive RN info in mcast_macs_remote table (IIRC), and you should be able to use it to get RN replication. Yes, you won’t be able to see if an RN is up or down.

      Back to your original question: I have not investigated this area in detail, and don’t know what is the logic that RN uses to decide whether it should replicate a given BUM frame. It might be checking the source IP of the VXLAN packet against a list of known HW VTEPs, but as I said I don’t know for sure.

      If you’re doing a HW VTEP development, your best bet is to get in touch with the VMware partner team, who should be able to furnish you with the proper support and documentation.

  • mac-entry

    Sure. Yet again thanks a lot for your valuable inputs and time taken to look into my queries!!

