...
Scenario
Due to an incompatibility in the ixgbe Intel NIC driver, the VxFlex system had multiple SDS disconnections, causing data unavailability (DU) and All Paths Down (APD) conditions on the ESXi nodes. The APD caused the ESXi hostd service to hang, which made the nodes inaccessible.

Symptoms
Vmkernel logs:
2019-04-10T04:47:02.092Z cpu43:946022)NetLB: 2233: Driver claims supporting 15 TX queues, and 15 queues are accepted.
2019-04-10T04:47:02.092Z cpu43:946022)NetLB: 2237: Driver claims supporting 15 RX queues, and 15 queues are accepted.
...
2019-04-10T04:47:02.100Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2019-04-10T04:47:02.100Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2019-04-10T04:47:02.100Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
2019-04-10T04:47:02.104Z cpu55:946022)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-04-10T04:47:02.104Z cpu55:946022)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-04-10T04:47:02.104Z cpu55:946022)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
....
2019-04-10T04:47:02.132Z cpu55:946022)WARNING: Tcpip_Vmk: 781: vmk_get_gateway failed with error = 0x2d, status = 0xbad0105
...
2019-04-10T04:47:10.498Z cpu13:948008)WARNING: UserObj: 5436: vmkvsitools: Unimplemented operation on 0x439e817fc850/SOCKET_VMCI
2019-04-10T04:47:10.498Z cpu13:948008)WARNING: UserObj: 5436: vmkvsitools: Unimplemented operation on 0x439e817f3c40/SOCKET_VMCI
2019-04-10T04:47:11.587Z cpu40:66684)nsxt-switch-security: SwSecDelVmi:1121: [nsx@6876 comp="nsx-esx" subcomp="swsec"]Filter 67112517Deleting vmi: 2 vlanId = 0 mac = 02:50:56:00:70:e8 ip = 10.255.15.33
2019-04-10T04:47:11.587Z cpu40:66684)nsxt-switch-security: SwSecDelVmi:1165: [nsx@6876 comp="nsx-esx" subcomp="swsec"]Filter 67112517After deleting: [0 0 0 0]
2019-04-10T04:47:12.407Z cpu34:948237)DLX: 4310: vol 'F2_DS1', lock at 174866432: [Req mode 1] Checking liveness:
2019-04-10T04:47:12.407Z cpu34:948237)[type 10c00001 offset 174866432 v 139, hb offset 3932160 gen 17, mode 1, owner 5c8e202a-f1c1830f-af9b-246e96c9cad0 mtime 446607

Vmkernel log entries showing TX hangs:
2019-04-10T12:01:55.265Z cpu33:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4571: vmnic5 : scheduler(0x430acbf450e0)/device(0x4306fee843c0) 0/1 lock up [stopped=0]:
2019-04-10T12:01:55.265Z cpu33:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4602: vmnic5: packets completion seems stuck, issuing reset
2019-04-10T12:01:59.626Z cpu48:65693)ixgbe 0000:05:00.1: vmnic5: Fake Tx hang detected with timeout of 5 seconds

CPU lockups reported before the driver hang was logged:
2019-04-10T12:01:54.547Z cpu10:73512)WARNING: Heartbeat: 794: PCPU 32 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu23:73050)WARNING: Heartbeat: 794: PCPU 45 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu10:73512)WARNING: Heartbeat: 794: PCPU 33 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu13:73515)WARNING: Heartbeat: 794: PCPU 35 didn't have a heartbeat for 7 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu10:73512)WARNING: Heartbeat: 794: PCPU 34 didn't have a heartbeat for 8 seconds; may be locked up.
2019-04-10T12:01:54.547Z cpu13:73515)WARNING: Heartbeat: 794: PCPU 36 didn't have a heartbeat for 7 seconds; may be locked up.

The logs also show messages to hostd failing:
2019-04-10T21:05:31.753Z cpu1:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-10T21:05:35.011Z cpu2:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-10T21:05:35.046Z cpu0:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-10T21:05:37.795Z cpu26:65707)VmkEvent: 93: Msg to hostd failed with timeout, dropping function 2092 len 20
2019-04-14T22:32:35.007Z cpu5:95941)WARNING: Heartbeat: 794: PCPU 28 didn't have a heartbeat for 8 seconds; *may* be locked up.
...
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4571: vmnic5 : scheduler(0x430accad10e0)/device(0x4306fee843c0) 0/1 lock up [stopped=0]:
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4578: detected at 407655639 while last xmit at 407650438 and 39742 bytes in flight [window 86460 bytes]
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4583: and last enqueued/dequeued at 407652355/407655639 [stress 0]
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4586: with 394 pkts inflight
2019-04-14T22:33:36.644Z cpu30:67014)WARNING: netschedHClk: NetSchedHClkWatchdogSysWorld:4602: vmnic5: packets completion seems stuck, issuing reset
...
2019-04-14T22:55:33.715Z cpu50:66861)WARNING: Lock: 1675: (held by 2: Spin count exceeded 1 time(s) - possible deadlock.
...
2019-04-15T01:00:01.810Z cpu29:608949)ALERT: hostd detected to be non-responsive
...
2019-04-15T00:55:16.800Z cpu0:546988)WARNING: Heartbeat: 498: One or more PCPUs didn't perform a heartbeat check for 7 seconds.

Impact
Causes network latency that can affect the HCI VxFlex SVMs installed on the ESXi nodes, which in turn leads to APD and a hostd hang on those nodes.
Intel cards on ESXi installed on Ready Nodes require the native mode driver to be used.

Workaround
Change the Intel NIC driver from ixgbe to the ixgben native driver through a reinstall.

Impacted Versions
N/A

Fixed In Version
N/A - Driver issue
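
The symptom signatures in the vmkernel excerpts above can be counted with a short script when triaging a host. The following is a minimal sketch in Python, not a vendor tool. The regular expressions are derived only from the log lines quoted in this article, so other ESXi or driver builds may word these messages differently. The names `SIGNATURES` and `scan_vmkernel` are hypothetical, and the log file path is supplied by the caller.

```python
import re
import sys
from collections import Counter

# Patterns derived from the vmkernel excerpts quoted in this article.
# Assumption: other ESXi/driver builds may phrase these messages differently.
SIGNATURES = {
    "netsched_reset": re.compile(
        r"NetSchedHClkWatchdogSysWorld:\d+: vmnic\d+: "
        r"packets completion seems stuck, issuing reset"
    ),
    "ixgbe_tx_hang": re.compile(r"ixgbe \S+ vmnic\d+: Fake Tx hang detected"),
    "pcpu_heartbeat": re.compile(
        r"Heartbeat: \d+: PCPU \d+ didn't have a heartbeat"
    ),
    "hostd_msg_timeout": re.compile(
        r"VmkEvent: \d+: Msg to hostd failed with timeout"
    ),
    "hostd_unresponsive": re.compile(r"hostd detected to be non-responsive"),
}


def scan_vmkernel(lines):
    """Count how many lines match each known symptom signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_vmkernel.py <path-to-vmkernel-log>
    with open(sys.argv[1], errors="replace") as fh:
        for name, count in sorted(scan_vmkernel(fh).items()):
            print(f"{name}: {count}")
```

A non-zero count for `ixgbe_tx_hang` or `netsched_reset` alongside `hostd_msg_timeout` would be consistent with the scenario described here, but it does not confirm the root cause on its own.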