With a Hardware VTEP being implemented in, well, hardware, how things work depends on the capabilities of the underlying chipset. This means that when we design solutions using these products, we need to keep those capabilities in mind and configure things accordingly.
In this short post I'll cover a situation we encountered at one of our customers where things "should" have worked but didn't, and what the reason turned out to be.
I was contacted to help solve a mysterious issue that popped up during a customer PoC, where a 6740 HW VTEP connected to an NSX environment was behaving in unexpected ways.
Everything looked great on the control plane – VTEP and MAC tables, both on the Controllers and the hosts – and on the underlay data plane – everything that should be able to ping each other, could. A traffic capture on an ESXi host showed that a VM on a logical switch bridged to a VLAN through the HW VTEP was sending correctly encapsulated traffic, and receiving traffic sent from a physical server.
But, two things weren’t happening: no traffic from VM arrived at the physical server, and BFD sessions between the HW VTEP and replication nodes were refusing to come up.
The first reaction was to check whether the ESXi hosts' VTEPs and the VTEP on the 6740 were in different IP subnets. Yep, they were.
Here are the relevant bits of a mock 6740’s config:
```
rbridge-id 10
!
interface Port-channel 20
 description L2 to ESXi hosts VTEP segment
 switchport
 switchport mode trunk
 switchport trunk allowed vlan all
!
interface Ve 110
 description Default gw for ESXi VTEPs
 ip address 192.168.0.1/24
!
interface Ve 112
 description HW VTEP
 ip address 192.168.1.10/24
 vrrp-extended-group 112
  virtual-mac 02e0.5200.00xx
  virtual-ip 192.168.1.1
  short-path-forwarding
!
overlay-gateway NSXv
 type hardware-vtep
 ip interface Ve 112 vrrp-extended-group 112
 attach rbridge-id add 10
 attach vlan 3000
 activate
!
nsx-controller nsxv
 ip address 172.16.0.100 port 6640
 activate
```
ESXi hosts were connected to the 6740 on L2 through Port-Channel 20, with their VTEPs sitting in VLAN 110. The default gateway (pay attention now, this is important!) for the ESXi VXLAN IP stack was set to 192.168.0.1, which, as you can see above, is on the interface Ve 110.

We could ping perfectly fine between either Ve 110 or Ve 112 and the ESXi VTEPs, so it wasn't a problem with connectivity. VXLAN packets leaving the ESXi hosts had the correct outer DMAC = MAC of Ve 110, since that's the default gateway used to reach the HW VTEP's IP.
On the ESXi hosts acting as replication nodes, we could see incoming and outgoing VXLAN packets with BFD PDUs inside, to and from the HW VTEP. Yet, no worky.
The next thing we tried was to switch over from the VRRP-E VTEP to a Loopback, by adding interface Loopback 1 and switching from Ve 112 to Loopback 1 in the overlay-gateway configuration:
```
rbridge-id 10
!
interface Port-channel 20
 switchport
 switchport mode trunk
 switchport trunk allowed vlan all
!
interface Loopback 1
 ip address 10.10.10.10/32
!
interface Ve 110
 ip address 192.168.0.1/24
!
interface Ve 112
 ip address 192.168.1.10/24
 vrrp-extended-group 112
  virtual-mac 02e0.5200.00xx
  virtual-ip 192.168.1.1
  short-path-forwarding
!
overlay-gateway NSXv
 type hardware-vtep
 ip interface Loopback 1
 attach rbridge-id add 10
 attach vlan 3000
 activate
!
nsx-controller nsxv
 ip address 172.16.0.100 port 6640
 activate
```
Well, still no worky. Puzzled, we brought in some engineering help (hi Ram!). After some further collective head-scratching, we noticed that the "In" packet counters on the BFD sessions on the 6740 weren't increasing, meaning that BFD packets from the ESXi hosts weren't reaching the BFD function. (Remember – BFD packets between the HW VTEP and ESXi travel inside the VXLAN tunnel with VNI=0.)
After seeing this, we decided to try deleting the VRRP-E configuration from Ve 112 (the only VRRP-E interface in this particular configuration). Lo and behold – we were cooking with gas!
With the problem out of the way, it was time to dig in to understand what caused it, and how to prevent it in the future.
The first clue was (surprise!) 🙂 in the documentation:
If the VXLAN packet entering a VDX VTEP-enabled device on a Layer 3 interface (such as a routing next hop) is different from the VRRP-E based-VE interface configured for the VTEP, but at an ingress interface where the VTEP VRRP-E VLAN is also configured, and the final destination is an NSX-configured VXLAN tunnel (as identified by the VXLAN tunnel parameters of source IP and destination IP), then the VXLAN traffic is routed to the VTEP interface in the VDX and is a candidate for VXLAN to VLAN bridging, but only if the destination mac of the ingressing VXLAN traffic is the same as that of the virtual mac of the VTEP VRRP-E session.
(A bit hard to parse for me, but once you know what you’re looking for, it kind of makes sense).
So let’s see how this applies to our situation.
Remember, since the 6740's VTEP is in a different IP subnet from the ESXi hosts' VTEPs, those hosts will use their default gateway as the next hop – which in our case happens to be on the same 6740 as the HW VTEP. So the ESXi hosts will set the outer DMAC in their VXLAN packets to the MAC of Ve 110, which is not the same as the VRRP-E virtual MAC of Ve 112. And, in line with the quote above, the 6740 will drop these VXLAN packets.
If we look carefully, we note that Port-Channel 20 has switchport trunk allowed vlan all on it, which of course includes VLAN 112 – the VLAN matching Ve 112, which has VRRP-E enabled. I know it's a different VLAN ID from the ingress VLAN 110! Looks like it doesn't matter, though.
In further lab testing we found that if we kept the original configuration (with Ve 112 used as the VTEP), but deleted VLAN 112 from Port-Channel 20, we were back to happy days.
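In config terms, that workaround could look something like the sketch below. This assumes NOS-style "switchport trunk allowed vlan remove" syntax on your software release – check your command reference:

```
interface Port-channel 20
 switchport
 switchport mode trunk
 ! Allow everything except VLAN 112 (the VRRP-E VTEP VLAN), so
 ! ingressing VXLAN packets never hit the virtual-MAC check
 ! described in the documentation quote above.
 switchport trunk allowed vlan all
 switchport trunk allowed vlan remove 112
```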
Another way would be to follow the recommendation in the document linked above, which is:
…creating VE interfaces for each of the ingressing transport VLANs on all the RBridges in the VCS, and then configuring them with the same VRRP-E VRID and virtual-mac address as the VRRP-E VRID and virtual-mac address that was configured for the VTEP.
Translating this to our situation: our "ingressing transport VLAN" is VLAN 110, so all we need to do is change the Ve 110 configuration to look something like:
```
!
interface Ve 110
 ip address 192.168.0.110/24
 vrrp-extended-group 112
  virtual-mac 02e0.5200.00xx
  virtual-ip 192.168.0.1
```
The changes are:
- Switch the main interface IP to a different one in the same IP range;
- Add vrrp-extended-group 112, with the corresponding virtual-ip set to Ve 110's original IP address.

Make sure that you use the same group number (112) as the one used on the Ve used for the VTEP.
This solution probably makes a lot of sense, since you'd typically have a pair of 6740s for your HW VTEP anyway, and configure default gateway redundancy for the ESXi VTEPs with VRRP-E on Ve 110. Just make sure to use the same VRRP-E group ID.
Q: What happens if I have multiple VRRP-E groups, each with a different group ID?
A: One of the virtual MACs will be chosen at random and programmed into the ports. You don't get to choose, AFAIK.
Q: So what do I do?
A: You should be able to safely use the same group ID for all relevant VRRP-E groups. It will use the same MAC on the associated VLANs, which should be fine since VLANs are separate MAC domains.
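For example, two Ve interfaces on different VLANs can share group ID 112 (a sketch – VLAN 120 and its addressing are invented for illustration):

```
! Both VLANs use VRRP-E group 112 and thus the same virtual MAC;
! since each VLAN is its own MAC domain, this reuse is fine.
interface Ve 110
 ip address 192.168.0.110/24
 vrrp-extended-group 112
  virtual-mac 02e0.5200.00xx
  virtual-ip 192.168.0.1
!
interface Ve 120
 ip address 192.168.2.110/24
 vrrp-extended-group 112
  virtual-mac 02e0.5200.00xx
  virtual-ip 192.168.2.1
```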
Q: Does it matter if I use a Loopback or a Ve for my HW VTEP?
A: Not for this problem – as described above, we saw the same behaviour with both.
Q: Is there an alternative solution?
A: You can add an external L3 hop between your ESXi VTEPs and the VTEP on the 6740. For an example, please see the "Infrastructure" diagram in my previous post on the topic. The L3 hop in question is the one labelled "Router (Data)".
Q: What about removing the VLAN IDs of Ve interfaces with VRRP-E from the interface where the VXLAN packets come in?
A: Yep, you can do that, too – as long as you remove all VLANs with Ve interfaces that have VRRP-E. The likely scenario for this is where you have only a single 6740 node, as in our pictured case.