NSX-v under the hood: VXLAN ARP suppression

When troubleshooting, it’s important to know how the system is supposed to work, so that one can tell whether the currently observed behaviour is normal or not.

This post covers VXLAN ARP suppression in NSX-v.

NSX-v Logical Switches, implemented as VXLAN segments, include ARP suppression (which can be turned off per Logical Switch as of NSX 6.1). Its purpose is to minimise ARP flooding within an individual VXLAN segment, i.e., between VMs connected to the same Logical Switch.

Disclaimer: this post uses a few unsupported/undocumented commands, which means “if you use them and break something, it’s 100% your responsibility”. The primary purpose of this post is reader education, not a call to action.

VXLAN ARP suppression is a function of the Switch Security (SwSec) module, which is a dvFilter attached to VMs’ vNICs.

We can see that dvFilter with the summarize-dvfilter command:

[skip]
...
world 1442427 vmm0:web-sv-02a vcUuid:'50 26 b7 4d c5 6c 1e d9-47 c0 09 25 95 80 2f ad'
 port 50331657 web-sv-02a.eth0
  vNic slot 2
   name: nic-1442427-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Dynamic Filter Creation
  vNic slot 1
   name: nic-1442427-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec <= Here it is
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel
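
When there are many VMs on a host, picking the SwSec filter out of this listing by eye gets tedious. Below is a minimal Python sketch that scans text shaped like the excerpt above and reports which ports have a SwSec filter attached. It assumes the layout matches the excerpt; the function name find_swsec_filters and the trimmed SAMPLE text are illustrative, not part of any VMware tooling.

```python
import re

# Trimmed copy of the excerpt above (the "<= Here it is" annotation removed).
SAMPLE = """\
world 1442427 vmm0:web-sv-02a vcUuid:'50 26 b7 4d c5 6c 1e d9-47 c0 09 25 95 80 2f ad'
 port 50331657 web-sv-02a.eth0
  vNic slot 2
   name: nic-1442427-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
  vNic slot 1
   name: nic-1442427-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
"""


def find_swsec_filters(text):
    """Return (port_name, filter_name) pairs for filters whose agent is SwSec."""
    results = []
    port = None
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        port_match = re.match(r"port \d+ (\S+)", line)
        if port_match:
            port = port_match.group(1)
        elif line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("agentName:"):
            agent = line.split(":", 1)[1].strip()
            # Assumption: the SwSec agent name ends in "swsec", as in the excerpt.
            if agent.endswith("swsec"):
                results.append((port, name))
    return results


print(find_swsec_filters(SAMPLE))
```

A port that is missing from the result would be one where, like the DLR's vdrPort, ARP suppression is not in play.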

Note: the DLR is connected to a special VDS port called “vdrPort”, which doesn’t have the SwSec dvFilter; therefore the DLR does not benefit from VXLAN ARP suppression. I will cover the details of the DLR ARP resolution process in a future post.

NSX control plane logs are written to /var/log/netcpa.log on the ESXi host. The default logging level is “info”; it needs to be changed to “verbose” to observe the VXLAN control plane operations described here. To do that (at your own risk!):

1) chmod +wt /etc/vmware/netcpa/netcpa.xml
2) edit /etc/vmware/netcpa/netcpa.xml and change “info” to “verbose” in the <level></level> element of the <log> section
3) /etc/init.d/netcpad restart
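
If you prefer to script the edit in step 2 rather than do it by hand, something along these lines could work. It is only a sketch: the one thing taken from this post is that a <level> element sits inside <log>. The surrounding <config> element in SAMPLE_XML and the function name set_log_level are made up for illustration, and the real netcpa.xml may well look different. Steps 1 and 3 are not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the <log>/<level> nesting comes from the post.
SAMPLE_XML = "<config><log><level>info</level></log></config>"


def set_log_level(xml_text, new_level):
    """Return xml_text with every <log>/<level> value replaced by new_level."""
    root = ET.fromstring(xml_text)
    # Cover both a <log> root element and <log> elements nested deeper.
    logs = [root] if root.tag == "log" else root.findall(".//log")
    changed = 0
    for log in logs:
        level = log.find("level")
        if level is not None:
            level.text = new_level
            changed += 1
    if changed == 0:
        raise ValueError("no <log><level> element found")
    return ET.tostring(root, encoding="unicode")


print(set_log_level(SAMPLE_XML, "verbose"))
```

Working on a copy of the file and diffing it against the original before restarting netcpad keeps the change easy to revert.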

Let’s take a look at the following example:

Host H1 with VM1 on Logical switch associated with the VNI 5000
Host H2 with VM2 on the same VNI 5000

All tables/caches are empty: the ARP and MAC caches on H1 and H2, and the ARP tables on VM1 and VM2:

H1:

~ # net-vdl2 -M arp -s Compute_VDS -n 5000
ARP entry count: 0
~ # net-vdl2 -M mac -s Compute_VDS -n 5000
MAC entry count: 0

H2:

~ # net-vdl2 -M arp -s Compute_VDS -n 5000
ARP entry count: 0
~ # net-vdl2 -M mac -s Compute_VDS -n 5000
MAC entry count: 0

Now, VM1 issues two ping packets to VM2.

In the netcpa log file we observe:

On H1:

The ARP request is captured by SwSec and forwarded as a Query to the Controller, asking whether it knows the MAC corresponding to the IP in the request

-->     ARP (v4) Query: len = 16
-->     SwitchID:0, VNI:5000
-->     Num of entries: 1
-->             #0      VM IP:172.16.10.12

Since this is an ARP request, SwSec also generates a VM IP Update for VM1, which is sent to the Controller, telling it about VM1’s IP and MAC

-->     VM IP (v4) Update: len = 24
-->     SwitchID:0, VNI:5000
-->     Num of removed entries: 0
-->     Num of added entries: 1
-->             #0      VM IP:172.16.10.11      VM MAC:00:50:56:a6:7a:a2

In our case, the Controller didn’t have any cached ARP info for VM2, so it responds to SwSec’s ARP Query with “I don’t know”

-->     ARP (v4) Update: len = 32
-->     SwitchID:0, VNI:5000
-->     Num of entries: 1
-->             #0      VM IP:172.16.10.12      VM MAC:ff:ff:ff:ff:ff:ff        VTEP IP:255.255.255.255 VTEP MAC:ff:ff:ff:ff:ff:ff
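
When grepping through a busy netcpa.log, it helps to tell these “I don’t know” answers apart from real ones. The sketch below parses an entry line of the shape shown in the excerpts in this post and flags the all-ones MAC. It assumes the field labels ("VM IP:", "VM MAC:", "VTEP IP:", "VTEP MAC:") appear exactly as above; parse_entry and the "known" key are names introduced here for illustration.

```python
import re

UNKNOWN_MAC = "ff:ff:ff:ff:ff:ff"

# The VTEP fields are optional because VM IP Update entries don't carry them.
ENTRY_RE = re.compile(
    r"VM IP:(?P<ip>\S+)\s+VM MAC:(?P<mac>\S+)"
    r"(?:\s+VTEP IP:(?P<vtep_ip>\S+)\s+VTEP MAC:(?P<vtep_mac>\S+))?"
)


def parse_entry(line):
    """Parse one '#N VM IP:... VM MAC:...' log line into a dict, or None."""
    match = ENTRY_RE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    # "known" is False when the MAC is the all-ones "I don't know" value.
    entry["known"] = entry["mac"].lower() != UNKNOWN_MAC
    return entry


line = ("-->             #0      VM IP:172.16.10.12      VM MAC:ff:ff:ff:ff:ff:ff"
        "        VTEP IP:255.255.255.255 VTEP MAC:ff:ff:ff:ff:ff:ff")
print(parse_entry(line))
```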

Meanwhile on H2:

Since VM2 (running Linux) sends an ARP request to VM1 to get an authoritative confirmation of VM1’s MAC address, SwSec on H2 catches it and generates a VM IP Update for VM2

-->     VM IP (v4) Update: len = 24
-->     SwitchID:0, VNI:5000
-->     Num of removed entries: 0
-->     Num of added entries: 1
-->             #0      VM IP:172.16.10.12      VM MAC:00:50:56:a6:a1:e3

As a result, the tables on H1 and H2 are updated as follows:

H1:

~ # net-vdl2 -M mac -s Compute_VDS -n 5000
MAC entry count:        1
        Inner MAC:      00:50:56:a6:a1:e3 <== VM2
        Outer MAC:      00:50:56:62:b2:26
        Outer IP:       192.168.250.53
        Flags:          7

~ # net-vdl2 -M arp -s Compute_VDS -n 5000
ARP entry count:        1
        IP:             172.16.10.12
        MAC:            ff:ff:ff:ff:ff:ff <== “I don’t know” that came from Controller
        Flags:          F

H2:

~ # net-vdl2 -M mac -s Compute_VDS -n 5000
MAC entry count:        1
        Inner MAC:      00:50:56:a6:7a:a2 <== VM1
        Outer MAC:      00:50:56:65:15:14
        Outer IP:       192.168.250.52
        Flags:          7

~ # net-vdl2 -M arp -s Compute_VDS -n 5000
ARP entry count:        0 <== Nothing learned, since H2 didn't query the Controller, as per logs above
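
To compare these tables across several hosts, the net-vdl2 output can be turned into plain dictionaries. This is a rough sketch that assumes the output looks like the excerpts above, in particular that each MAC entry starts with "Inner MAC" and each ARP entry starts with "IP". The name parse_vdl2 is illustrative, and the "<==" handling is only there to strip the annotations added in this post.

```python
MAC_SAMPLE = """\
MAC entry count:        1
        Inner MAC:      00:50:56:a6:a1:e3 <== VM2
        Outer MAC:      00:50:56:62:b2:26
        Outer IP:       192.168.250.53
        Flags:          7
"""


def parse_vdl2(text):
    """Parse 'net-vdl2 -M mac' or '-M arp' style output into a list of dicts."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.split("<==")[0].strip()  # drop this post's annotations
        if ":" not in line or line.startswith("~ #"):
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.endswith("entry count"):
            continue
        # Assumption: these two keys mark the first field of a new entry.
        if key in ("Inner MAC", "IP"):
            current = {}
            entries.append(current)
        if current is not None:
            current[key] = value
    return entries


print(parse_vdl2(MAC_SAMPLE))
```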

And on the Controller we see the ARP cache populated by the two VM IP Updates (these entries will time out in 180 seconds):

nvp-controller # show control-cluster logical-switches arp-table 5000
VNI      IP              MAC               Connection-ID
5000     172.16.10.12    00:50:56:a6:a1:e3 6
5000     172.16.10.11    00:50:56:a6:7a:a2 7
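
To make the timeout behaviour concrete, here is a toy model of such a cache in Python. It is not how the Controller is implemented. It only mirrors what is observable in this post: VM IP Updates add entries, entries age out after 180 seconds, and a query for an unknown IP yields ff:ff:ff:ff:ff:ff. The class and method names are invented, and whether a repeated update restarts the timer on the real Controller is an assumption made here.

```python
import time

UNKNOWN_MAC = "ff:ff:ff:ff:ff:ff"


class ToyArpCache:
    """Toy (vni, ip) -> mac cache whose entries expire after ttl seconds."""

    def __init__(self, ttl=180.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}

    def vm_ip_update(self, vni, ip, mac):
        # Assumption: a fresh update restarts the expiry timer.
        self._entries[(vni, ip)] = (mac, self.clock() + self.ttl)

    def arp_query(self, vni, ip):
        hit = self._entries.get((vni, ip))
        if hit is None or hit[1] <= self.clock():
            self._entries.pop((vni, ip), None)  # forget expired entries
            return UNKNOWN_MAC
        return hit[0]


now = [0.0]
cache = ToyArpCache(clock=lambda: now[0])
cache.vm_ip_update(5000, "172.16.10.12", "00:50:56:a6:a1:e3")
print(cache.arp_query(5000, "172.16.10.12"))  # answered from the cache
now[0] = 181.0
print(cache.arp_query(5000, "172.16.10.12"))  # entry has aged out
```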

Now, the above prompts a few questions.

The VXLAN SwSec module generates a VM IP Update in two cases:

  1. When it receives an ARP request from a VM; and
  2. When it is about to send a DHCP ACK from a DHCP server to a VM

According to the logs above, H2 has sent a VM IP Update to the Controller, but there are no corresponding ARP Query and ARP Update messages. What’s going on?

Let’s have a look at what is coming out of the VM2 during our ping operation:

16:00:08.484523 00:50:56:a6:a1:e3 > 00:50:56:a6:7a:a2, ethertype ARP (0x0806), length 60: Reply 172.16.10.12 is-at 00:50:56:a6:a1:e3, length 46 <== replied to ARP from VM1
16:00:08.485879 00:50:56:a6:a1:e3 > 00:50:56:a6:7a:a2, ethertype IPv4 (0x0800), length 98: 172.16.10.12 > 172.16.10.11: ICMP echo reply, id 36462, seq 1, length 64 <== Replied to ping #1 from VM1
16:00:09.465441 00:50:56:a6:a1:e3 > 00:50:56:a6:7a:a2, ethertype IPv4 (0x0800), length 98: 172.16.10.12 > 172.16.10.11: ICMP echo reply, id 36462, seq 2, length 64 <== Replied to ping #2 from VM1
16:00:13.493710 00:50:56:a6:a1:e3 > 00:50:56:a6:7a:a2, ethertype ARP (0x0806), length 60: Request who-has 172.16.10.11 tell 172.16.10.12, length 46 <== Sent a unicast ARP for VM1's MAC

So the difference from what happens on H1 is that the ARP request coming from VM2 has a unicast destination MAC.

Due to this subtle difference, the SwSec module on H2 does the following:

  1. It generates a VM IP Update (since it has seen an ARP coming from a VM); but
  2. It does not generate an ARP Query to the Controller (and thus doesn’t see a corresponding ARP Update).
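
The two cases can be condensed into a few lines of Python. This is a toy model of the behaviour observed in the logs and capture above, not SwSec's actual code; the function name and the tuple-based message format are invented for illustration.

```python
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def swsec_messages_for_arp_request(dst_mac, sender_ip, sender_mac, target_ip):
    """Return the control plane messages a VM's ARP request would trigger."""
    # Any ARP request from a VM results in a VM IP Update for the sender.
    messages = [("VM IP Update", sender_ip, sender_mac)]
    # Only a broadcast request also results in an ARP Query to the Controller.
    if dst_mac.lower() == BROADCAST_MAC:
        messages.append(("ARP Query", target_ip))
    return messages


# VM1's initial broadcast ARP request on H1
print(swsec_messages_for_arp_request(
    BROADCAST_MAC, "172.16.10.11", "00:50:56:a6:7a:a2", "172.16.10.12"))

# VM2's unicast ARP request on H2
print(swsec_messages_for_arp_request(
    "00:50:56:a6:7a:a2", "172.16.10.12", "00:50:56:a6:a1:e3", "172.16.10.11"))
```

The first call yields both a VM IP Update and an ARP Query, the second only a VM IP Update, which matches what the netcpa logs show for H1 and H2.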

Update Jan 2015: In NSX-v 6.1, there is an option to disable ARP suppression per Logical Switch. At the time of writing, this is implemented by setting the Controller to ignore the VM IP Update messages sent to it by hosts for that LS.

What you will see is that hosts continue sending VM IP Update and ARP Query messages to the Controller, but the Controller’s ARP table stays empty, causing it to respond to any ARP Query with ff:ff:ff:ff:ff:ff.
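
In other words, the hosts behave the same way and only the Controller's handling of updates changes. A toy Python model of that, with invented names (ToyController, suppression_disabled) and no claim about the real implementation:

```python
UNKNOWN_MAC = "ff:ff:ff:ff:ff:ff"


class ToyController:
    """Toy model: VM IP Updates are dropped for VNIs with suppression disabled."""

    def __init__(self, suppression_disabled=()):
        self.suppression_disabled = set(suppression_disabled)  # set of VNIs
        self.arp_table = {}

    def vm_ip_update(self, vni, ip, mac):
        if vni in self.suppression_disabled:
            return  # update ignored, so the table stays empty for this VNI
        self.arp_table[(vni, ip)] = mac

    def arp_query(self, vni, ip):
        return self.arp_table.get((vni, ip), UNKNOWN_MAC)


controller = ToyController(suppression_disabled={5000})
controller.vm_ip_update(5000, "172.16.10.12", "00:50:56:a6:a1:e3")
print(controller.arp_query(5000, "172.16.10.12"))
```

With VNI 5000 in the disabled set, every query for that VNI comes back as ff:ff:ff:ff:ff:ff, so the requesting host falls back to flooding the ARP request.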

Somewhat confusingly, the output of the esxcli network vswitch dvs vmware vxlan network list --vds-name [DVS_Name] command will still show “ARP proxy” against the Control Plane of the Logical Switch’s VNI, even if its ARP suppression is disabled.

About Dmitri Kalintsev

Some dude with a blog and opinions ;)