NSX 6.1.2 Bug – DLR interface communication issues & How to troubleshoot using net-vdr command

I have NSX 6.1.2 deployed on vSphere 5.5 and wanted to share the details of a specific bug I came across that is applicable to this version of NSX, especially since no relevant KB article has been published for it.

After setting up the NSX Manager and completing all the host preparation, including VXLAN prep, I had deployed a number of DLR instances, each with multiple internal interfaces and an uplink interface which, in turn, was connected to an Edge Gateway instance for external connectivity. I kept getting communication issues between certain interfaces of the DLR whereby, for example, the internal interfaces connected to vNIC 1 & 2 would communicate with one another (I could ping from a VM on VXLAN 1 attached to the internal interface on vNIC 1 to a VM on VXLAN 2 attached to the internal interface on vNIC 2), but neither of them would talk to internal interface 3 attached to DLR vNIC 3, or even the uplink interface on vNIC 0 (cannot ping the Edge Gateway). Which interfaces couldn't communicate was completely random, however, and the behaviour persisted across multiple deployed DLR instances. All of them had one thing in common: no internal interface would talk to the uplink interface IP (of the Edge Gateway attached to the other end of the uplink interface).

One symptom of the issue was described in this thread I posted on the VMware Communities page: https://communities.vmware.com/thread/505542

Finally I had to log a call with NSX support at VMware GSS, and according to their diagnosis it turned out to be an inconsistency issue with the netcpa daemon running on the ESXi hosts and its communication with the NSX Controllers. (FYI – netcpa gets deployed during the NSX host preparation stage as part of the User World Agents, and is responsible for the communication between the DLR and the NSX Controllers, as well as between VXLAN and the NSX Controllers – see the diagram here.)
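Before restarting anything, it's worth checking whether netcpa actually holds connections from the host to the NSX Controllers. This is a general troubleshooting step rather than something specific to this bug, and it assumes the controllers are listening on their default TCP port 1234:

# List established connections from the ESXi host to the NSX Controllers (default port tcp/1234)
esxcli network ip connection list | grep 1234

If one or more controller connections are missing from that output on a host, that host is a good candidate for the netcpad restart described below.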

During the troubleshooting, it transpired that some details (such as the VXLAN details) were out of sync between the hosts and the controllers (different from the VXLAN & VNI configuration shown in the GUI), and the temporary fix was to stop and start the netcpa daemon on each of the ESXi nodes in the compute & edge clusters (the commands "/etc/init.d/netcpad stop" followed by "/etc/init.d/netcpad start" in the ESXi shell as root).
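For reference, this is the sequence per host; the stop / start commands are straight from GSS, while the status check is my own addition (the netcpad init script supports it on the builds I've used, but verify on yours):

# Restart the netcpa User World Agent on the ESXi host (run as root)
/etc/init.d/netcpad stop
/etc/init.d/netcpad start
# Optionally confirm the daemon has come back up
/etc/init.d/netcpad status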

Having analysed the logs thereafter, VMware GSS confirmed that this was indeed an internally known issue with NSX 6.1.2. Their message was "This happens due to a problem in the tcp connection handshake between netcpa and the manager once the last one is rebooted (one of the ACKs is not received by netcpa, and it does not retry the connection). This fix added re-connect for every 5 seconds until the ACK is received". Unfortunately, there's no published KB article (as of now) for this issue, which doesn't help much, especially if you're deploying this in a lab…etc.

This issue has (allegedly) been resolved in NSX version 6.1.3, even though it's not explicitly stated within the release notes as of yet (the support engineer I dealt with mentioned that he has requested this to be added).

So, if you have similar issues with NSX 6.1.2 (or possibly lower), this may well be the way to go about fixing them.

One thing I did learn during the troubleshooting process (probably the most important thing to come out of this whole issue for me, personally) was the importance of the net-vdr command, which I should emphasize here. Currently, there is no way to check whether the agents running on the ESXi hosts have the correct NSX configuration settings other than looking at them on the command line. You can force a resync of the appliance configuration or redeploy the appliances themselves using the NSX GUI, but that doesn't necessarily update the ESXi agents and was of no help in my case mentioned above.

The net-vdr command lets you perform a number of useful operations relating to the DLR, from basic operations such as adding / deleting a VDR (Distributed Router) instance, dumping the instance info, configuring DLR settings (including changing the controller details), adding / deleting DLR routes and listing all DLR instances, to ARP operations such as showing, adding & deleting ARP entries in the DLR, and DLR Bridge operations. It has turned out to be really handy for various troubleshooting & verification tasks around DLR settings. Unfortunately, there doesn't appear to be much documentation on this command and its use, neither in the VMware NSX documentation nor within the vSphere documentation, at least as of yet, hence why I thought I'd mention it here.

Given below is the command's usage output:

net-vdr

So, a couple of examples…

If you want to list all the DLR instances deployed and their internal names, net-vdr --instance -l will list out all the DLR instances as ESXi sees them.

net-vdr --instance -l

If your NSX GUI says that you have 4 VNIs (5000, 5001, 5002 & 5003) defined and you need to check whether all 4 VNI configurations are also present on the ESXi hosts, net-vdr -L -l default+edge-XX (where XX is the unique number assigned to each of your DLRs) will show you all the DLR interface configuration as the ESXi host sees it.

net-vdr -L -l default+edge-XX

If you want to see the ARP information within each DLR, net-vdr --nbr -l default+edge-XX (where XX is the unique number assigned to each DLR) will show you the ARP info.

net-vdr --nbr -l default+edge-XX
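Along the same lines, you can also dump the routing table a DLR instance holds on the host. I didn't need this for the issue above, but it uses the same documented net-vdr syntax as the examples here:

# List the routes known to the DLR instance, as seen by this ESXi host
net-vdr --route -l default+edge-XX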


Hope this is of some use to the early adopters of NSX out there…

Cheers

Chan


1. Brief Introduction to NSX

Next: How to gain access to NSX media ->

NSX is the next evolution of what used to be known as the vCloud Networking and Security suite within VMware's vCloud suite – A.K.A. vCNS (now discontinued) – which, in turn, was an evolution of the Nicira business VMware acquired a while back. NSX is how VMware provides the SDN (Software Defined Networking) capability in the Software Defined Data Center (SDDC). However, some may argue that NSX primarily provides an NFV (Network Function Virtualisation) function, which is slightly different from SDN.

The current version of NSX comes in 2 forms:

  1. NSX-V : NSX for vSphere – This is the most popular version of NSX and what appears to be its future. NSX-V is intended to be used by all existing and future vSphere users alongside their vSphere (vCenter and ESXi) environment. The rest of this post and all my future posts within this blog refer to this version of NSX and NOT the multi-hypervisor version.
  2. NSX-MH : NSX for multi hypervisors is a special version of NSX that is compatible with hypervisors beyond just vSphere. Though the name suggests multi-hypervisor support, actual support (as of the time of writing) is limited and is primarily aimed at offering networking and security to OpenStack (Linux KVM) rather than all other hypervisors (the currently supported hypervisors are Xen, KVM & ESXi). Also, the rumour is that VMware are phasing NSX-MH out anyway, which means all if not most future development and integration efforts will likely be focused on NSX-V. However, if you are interested in NSX-MH, refer to the NSX-MH design guide (based on version 4.2 at the time of writing), which seems pretty good.

Given below is a high level overview of the architectural differences between the 2 offerings.

1. Differences between V & MH

NSX-V

NSX-V, commonly referred to simply as NSX, provides a number of features to a typical vSphere based datacentre.

2. NSX features

NSX doesn't do any physical packet forwarding and, as such, doesn't add anything to the physical switching environment. It only exists in the ESXi environment and is (theoretically speaking) independent of the underlying network hardware. (Note that NSX is however reliant on a properly designed network, ideally in a spine and leaf architecture, and requires support for MTU > 1600 within the underlying physical network.)
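A quick way to verify that MTU requirement from an ESXi host is a VTEP-to-VTEP vmkping over the VXLAN network stack with fragmentation disabled; this is the commonly documented check, and the remote VTEP IP below is a placeholder:

# Ping a remote VTEP over the VXLAN netstack with don't-fragment set;
# a 1572 byte payload will only succeed if the physical path supports MTU > 1600
vmkping ++netstack=vxlan -d -s 1572 <remote-VTEP-IP>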

  • NSX virtualises Logical Switching:- This is a key feature that enables the creation of a VXLAN overlay network with layer 2 adjacency over an existing, legacy layer 3 IP network. As shown in the diagram below, layer 2 connectivity between 2 VMs on the same host never leaves the hypervisor, and the end to end communication all takes place in software. Communication between VMs on different hosts still has to traverse the underlying network fabric; however, compared to before (without NSX), the packet switching is now done within the NSX switch (known as the Logical Switch). This logical switch is a dvPort group type of construct added to an existing VMware distributed vSwitch during the installation of NSX (a quick way of inspecting this from the host's command line is shown after the diagram below).

3. Logical Switching
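On an NSX-prepared host, you can inspect the VXLAN configuration backing these logical switches with esxcli; the namespace below gets added during host preparation, though the exact sub-commands can vary between builds, so treat this as a sketch (the VDS name is a placeholder):

# Show the VXLAN-enabled distributed switch configuration on this host
esxcli network vswitch dvs vmware vxlan list
# List the VXLAN networks (VNIs) this host knows about
esxcli network vswitch dvs vmware vxlan network list --vds-name <VDS-name>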

  • NSX virtualises Logical Routing:- NSX provides the capability to deploy a logical router which can route traffic between different layer 3 subnets without it having to be routed by a physical router. The diagram below shows how NSX virtualises the layer 3 connectivity between different IP subnets and logical switches without leaving the hypervisor to use a physical router. Thanks to this, routing between 2 VMs in 2 different layer 3 subnets on the same host no longer requires the traffic to be routed by an external, physical router; instead, it is routed within the same host using the NSX software router, allowing the entire transaction to occur in software. In the past, VM1 on a port group tagged with VLAN 101 on host A, talking to VM2 on a port group tagged with VLAN 102 on the same host, would have required the packet to be routed using an external router (or a switch with a layer 3 license) that both uplinks / VLANs connect to. With NSX, this is no longer required and all routing, whether VM to VM communication in the same host or between different hosts, will be done using the software router.

4. Logical Routing


  • NSX REST API:- The built-in REST API provides programmatic access to NSX for external orchestration systems such as VMware vRealize Automation (vCAC). This programmatic access makes it possible to automate the deployment of networking configurations, which can now be tied to application configurations and deployed automatically onto the datacentre (an example call is shown after the diagram below).

5. Programmatic access
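As a flavour of what that programmatic access looks like, the call below asks NSX Manager for the logical switches (virtual wires) it knows about. The endpoint is the documented NSX-V one, but the hostname and credentials are placeholders, and -k is only appropriate for lab environments with self-signed certificates:

# List logical switches via the NSX Manager REST API (hostname & credentials are placeholders)
curl -k -u 'admin:<password>' https://<nsx-manager>/api/2.0/vdn/virtualwires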

  • NSX Logical Firewall:- The NSX Logical Firewall introduces a brand new concept of micro segmentation where, put simply, through the use of an ESXi kernel module, un-permitted traffic is blocked at the VM's vNIC level so that the packets are never released into the virtual network. No other SDN / NFV solution in the market as of now is able to provide this level of micro segmentation (though Cisco ACI is rumoured to bring this capability to the ACI platform through the use of the Application Virtual Switch). The NSX Logical Firewall provides East-West traffic filtering through the Distributed Firewall, while North-South filtering is provided through the NSX Edge Services Gateway. The Distributed Firewall also has the capability to integrate with advanced 3rd party layer 4-7 firewalls such as Palo Alto Networks firewalls (a way to inspect the rules applied to a given VM from the host is shown after the diagram below).

6. Firewalls
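To see the Distributed Firewall in action on a host, you can look up the dvfilter attached to a VM's vNIC and dump the rules applied there; both commands exist on NSX-prepared ESXi hosts, with the VM and filter names below being placeholders (the filter name comes from the summarize-dvfilter output):

# Find the dvfilter attached to the VM's vNIC
summarize-dvfilter | grep -A 3 <vm-name>
# Dump the DFW rules applied at that vNIC
vsipioctl getrules -f <filter-name>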

There are many other benefits of NSX, not all of which can be discussed within the scope of this article. However, the above should provide you with a reasonable insight into some of the most notable and most discussed benefits of NSX.

Next: How to gain access to NSX media ->

Cheers

Chan