...
The issue was seen in a 2-node stretched vSAN cluster running ESXi 6.7 U2. The cluster is partitioned, and the Witness can ping only one data node. The two data nodes can communicate with each other and appear in the output of the "esxcli vsan cluster get" command. The issue was related to a physical NIC.
How to isolate the issue with the NIC
The NIC was in a "hung state" (with the latest driver).
Packets are dropped when pinging the vSAN vmkernel interface.

NODE2# vmkping -I vmk2 192.168.1.111 -c 1000
PING 192.168.1.111 (192.168.1.111): 56 data bytes
64 bytes from 192.168.1.111: icmp_seq=2 ttl=64 time=0.133 ms
64 bytes from 192.168.1.111: icmp_seq=3 ttl=64 time=0.111 ms
64 bytes from 192.168.1.111: icmp_seq=4 ttl=64 time=0.129 ms
64 bytes from 192.168.1.111: icmp_seq=5 ttl=64 time=0.133 ms
64 bytes from 192.168.1.111: icmp_seq=6 ttl=64 time=0.137 ms
64 bytes from 192.168.1.111: icmp_seq=7 ttl=64 time=0.140 ms
64 bytes from 192.168.1.111: icmp_seq=8 ttl=64 time=0.141 ms
64 bytes from 192.168.1.111: icmp_seq=9 ttl=64 time=0.127 ms
64 bytes from 192.168.1.111: icmp_seq=10 ttl=64 time=0.139 ms
64 bytes from 192.168.1.111: icmp_seq=11 ttl=64 time=0.087 ms   <======= Sequence missed
64 bytes from 192.168.1.111: icmp_seq=37 ttl=64 time=0.137 ms   <======= Sequence missed
64 bytes from 192.168.1.111: icmp_seq=38 ttl=64 time=0.151 ms

A packet capture shows that UDP traffic is flowing, but in the ping output above sequence 11 is followed by sequence 37.

Run the following command on one of the data nodes where vmnic4 is the uplink used by the vSAN vmkernel port:

# pktcap-uw --uplink vmnic4 --dir 0 --stage 1 --proto 0x11 -o -| tcpdump-uw -r - -nne

Output of the above command:

The Stage is Post.
The session filter IP protocol is 0x11.
pktcap: The output file is -.
pktcap: No server port specifed, select 21248 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 21248.
reading from file -, link-type EN10MB (Ethernet)
pktcap: Accept...
pktcap: Vsock connection from port 1029 cid 2.
07:39:15.068063 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 178: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 136
07:39:16.068090 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 178: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 136
07:39:17.068136 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 258: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 216
07:39:17.068162 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 186: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 144
07:39:18.068157 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 258: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 216
07:39:18.068186 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 186: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 144
07:39:19.068208 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:20.068203 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:21.068238 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:22.068288 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:23.068326 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:24.068347 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:25.068365 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:26.068417 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:27.068432 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 466: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 424
07:39:28.068511 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 466: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 424

The same packet capture with an ICMP filter shows more drops:

# pktcap-uw --uplink vmnic5 --dir 0 --stage 0 --proto 0x01 -o -|tcpdump-uw -r - -nne
The name of the uplink is vmnic5.
The Stage is Pre.
The session filter IP protocol is 0x01.
pktcap: The output file is -.
pktcap: No server port specifed, select 42606 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 42606.
reading from file -, link-type EN10MB (Ethernet)
pktcap: Accept...
pktcap: Vsock connection from port 1026 cid 2.
07:45:06.559790 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 98, length 64
07:45:07.561992 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 99, length 64
07:45:08.562521 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 100, length 64
07:45:09.564725 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 101, length 64
07:45:10.566928 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 102, length 64
07:45:11.569107 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 103, length 64
07:45:27.598571 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 119, length 64   <======== sequence missed again
07:45:28.600526 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 120, length 64
07:45:29.602738 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 121, length 64
07:45:30.604959 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 122, length 64
07:45:31.607195 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 123, length 64

NODE2# esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2019-09-03T07:02:40Z
   Local Node UUID: 5cc8c87f-1ad4-5768-e1fe-20040ff07c0e
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5cc8c87f-1ad4-5768-e1fe-20040ff07c0e
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52694080-22bb-7ced-95fb-01b8e5a92f67
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5cc8c87f-1ad4-5768-e1fe-20040ff07c0e
   Sub-Cluster Member HostNames: NODE2
   Sub-Cluster Membership UUID: 01106e5d-52e2-c75d-907e-20040ff07c0e
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: de0bb183-3c7e-418b-b652-258770e89e01 12 2019-08-19T09:12:12.1

NODE2# esxcli network ip interface ipv4 get
Name  IPv4 Address   IPv4 Netmask     IPv4 Broadcast  Address Type  Gateway        DHCP DNS
----  -------------  ---------------  --------------  ------------  -------------  --------
vmk0  10.12.132.247  255.255.255.224  10.12.132.255   STATIC        10.12.132.225  false
vmk2  192.168.1.222  255.255.255.0    192.168.1.255   STATIC        0.0.0.0        false
vmk3  192.168.2.20   255.255.255.0    192.168.2.255   STATIC        0.0.0.0        false

NODE1# esxcli network ip interface ipv4 get
Name  IPv4 Address   IPv4 Netmask     IPv4 Broadcast  Address Type  Gateway        DHCP DNS
----  -------------  ---------------  --------------  ------------  -------------  --------
vmk0  10.12.132.246  255.255.255.224  10.12.132.255   STATIC        10.12.132.225  false
vmk2  192.168.1.111  255.255.255.0    192.168.1.255   STATIC        0.0.0.0        false
vmk3  192.168.2.10   255.255.255.0    192.168.2.255   STATIC        0.0.0.0        false

Isolating one NIC shows 100% packet loss:

NODE2# esxcli network nic list
Name    PCI Device    Driver   Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  -------  ------------  -----------  -----  ------  -----------------  ----  -----------------------------------------------------------------
vmnic0  0000:18:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0c  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1  0000:18:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0d  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2  0000:19:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0e  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3  0000:19:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0f  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic4  0000:87:00.0  qedentv  Down          Down             0  Half    34:80:0d:0f:0a:2c  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter
vmnic5  0000:87:00.1  qedentv  Up            Up           10000  Full    34:80:0d:0f:0a:2d  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter

NODE2#vmkping -I vmk2 192.168.1.111 -c 100 -i 0.005
PING 192.168.1.111 (192.168.1.111): 56 data bytes
--- 192.168.1.111 ping statistics ---
100 packets transmitted, 0 packets received, 100% packet loss

Bringing the other NIC (vmnic4) up and taking the faulty NIC (vmnic5) down shows that packets are no longer lost. To verify which NIC the vmkernel port is using, run esxtop and press "n" to see the association between NIC and vmkernel port.

NODE2# esxcli network nic up -n vmnic4
NODE2#esxcli network nic list
Name    PCI Device    Driver   Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  -------  ------------  -----------  -----  ------  -----------------  ----  -----------------------------------------------------------------
vmnic0  0000:18:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0c  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1  0000:18:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0d  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2  0000:19:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0e  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3  0000:19:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0f  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic4  0000:87:00.0  qedentv  Up            Up           10000  Full    34:80:0d:0f:0a:2c  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter
vmnic5  0000:87:00.1  qedentv  Up            Up           10000  Full    34:80:0d:0f:0a:2d  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter

NODE2#esxcli network nic down -n vmnic5
NODE02# vmkping -I vmk2 192.168.1.111 -c 100 -i 0.005
PING 192.168.1.111 (192.168.1.111): 56 data bytes
64 bytes from 192.168.1.111: icmp_seq=0 ttl=64 time=0.148 ms
64 bytes from 192.168.1.111: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.1.111: icmp_seq=2 ttl=64 time=0.066 ms
64 bytes from 192.168.1.111: icmp_seq=3 ttl=64 time=0.072 ms
64 bytes from 192.168.1.111: icmp_seq=4 ttl=64 time=0.068 ms
64 bytes from 192.168.1.111: icmp_seq=5 ttl=64 time=0.061 ms

The NIC was using the latest driver.

NODE2#vmkload_mod -s qedentv
vmkload_mod module information
 input file: /usr/lib/vmware/vmkmod/qedentv
 Version: 3.9.31.2-1OEM.670.0.0.8169922
 Build Type: release
 License: QLogic_Proprietary
 Required name-spaces:
  com.vmware.vmkapi#v2_5_0_0
 Parameters:
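Spotting a jump such as sequence 11 followed by sequence 37 by eye is tedious in a 1000-packet ping. The following is a minimal sketch, not part of the original procedure, that reads vmkping or tcpdump-uw ICMP output on standard input and prints every jump in the sequence numbers. The function name find_seq_gaps is hypothetical, and the sketch assumes an awk that supports match() with RSTART/RLENGTH (POSIX awk).

```shell
#!/bin/sh
# find_seq_gaps (hypothetical helper): report jumps in ICMP sequence numbers.
# Reads "icmp_seq=N" (vmkping) or "seq N" (tcpdump ICMP lines) from stdin.
find_seq_gaps() {
    awk '
        {
            num = ""
            if (match($0, /icmp_seq=[0-9]+/)) {
                # skip the 9 characters of "icmp_seq="
                num = substr($0, RSTART + 9, RLENGTH - 9)
            } else if (match($0, /seq [0-9]+/)) {
                # skip the 4 characters of "seq "
                num = substr($0, RSTART + 4, RLENGTH - 4)
            }
            if (num == "") next          # line carries no sequence number
            num += 0                     # force numeric comparison
            if (seen && num > prev + 1)
                printf "gap after seq %d: next seq %d (%d missing)\n", prev, num, num - prev - 1
            prev = num
            seen = 1
        }
    '
}

# Demo on three lines taken from the ping output earlier in this article.
find_seq_gaps <<EOF
64 bytes from 192.168.1.111: icmp_seq=10 ttl=64 time=0.139 ms
64 bytes from 192.168.1.111: icmp_seq=11 ttl=64 time=0.087 ms
64 bytes from 192.168.1.111: icmp_seq=37 ttl=64 time=0.137 ms
EOF
```

On a host, the same function can be fed directly, for example with vmkping -I vmk2 192.168.1.111 -c 1000 | find_seq_gaps. Only gaps between received replies are reported, so replies missing before the first or after the last received packet do not show up.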
The faulty NIC was isolated on the standard switch that also carries the working NIC: set the "Load balancing" policy to "Route based on originating port ID" and move the faulty NIC to standby.
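The teaming change can also be made from the ESXi shell. The sketch below rests on several assumptions: the vSAN vmkernel port sits on a standard switch named vSwitch1 (a hypothetical name; list the real ones with esxcli network vswitch standard list), vmnic4 is the working uplink and vmnic5 the faulty one as in the output above, and no port group overrides the switch-level teaming policy. The function name set_vsan_teaming is hypothetical. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# set_vsan_teaming (hypothetical helper): working uplink active, faulty
# uplink standby, load balancing by originating port ID.
set_vsan_teaming() {
    # Assumed names; override via environment variables for your host.
    vswitch="${VSWITCH:-vSwitch1}"   # hypothetical switch name
    active="${ACTIVE:-vmnic4}"       # working uplink
    standby="${STANDBY:-vmnic5}"     # faulty uplink, kept only as standby

    # RUN defaults to "echo", so commands are printed instead of executed.
    # Call with RUN= (empty) on the ESXi host to actually apply them.
    run="${RUN-echo}"

    # "portid" corresponds to "Route based on originating port ID".
    $run esxcli network vswitch standard policy failover set \
        -v "$vswitch" -l portid -a "$active" -s "$standby"

    # Print the resulting teaming policy so it can be checked.
    $run esxcli network vswitch standard policy failover get -v "$vswitch"
}

# Dry run: shows the two esxcli commands that would be issued.
set_vsan_teaming
```

Keeping the dry run as the default is deliberate, because a wrong uplink order on the vSAN switch can cut the host off from its partner node. Review the printed commands first, then rerun with RUN= to apply them.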