...
Traffic traversing the redirect rules and passing through the service VM periodically stops working. This condition cannot be diagnosed from the logs. The steps below can be used to confirm it.

1. Run net-stats -A -t WwQqihVv > /<path>/<filename.txt>
   a. Search the output for the service VM name and the vNIC (typically eth1, but not always) that is connected to the service overlay segment.
   b. The section will look similar to the following:

      {"name": "PaloAltoNetworks_PA-VM-NST_DepSpec (180).eth1", "switch": "DvsPortset-1", "id": 67108901, "mac": "00:50:56:b6:18:4a", "rxmode": 0, "tunemode": 0, "uplink": "false", "ens": "false", "promisc": "false", "sink": "false" , "txpps": 131644, "txmbps": 1303.2, "txsize": 1237, "txeps": 0.00, "rxpps": 131681, "rxmbps": 1303.5, "rxsize": 1237, "rxeps": 0.00,
      "vnic": { "type": "vmxnet3", "ring1sz": 1024, "ring2sz": 1024, "tsopct": 0.0, "tsotputpct": 0.0, "txucastpct": 100.0, "txeps": 0.0, "lropct": 0.0, "lrotputpct": 0.0, "rxucastpct": 100.0, "rxeps": 0.0, "maxqueuelen": 0, "requeuecnt": 0.0, "agingdrpcnt": 0.0, "deliveredByBurstQ": 0.0, "dropsByBurstQ": 0.0, "droppedbyQueuing": 0.0 , "txdisc": 0.0, "qstop": 0.0, "txallocerr": 0.0, "txtsosplit": 0.0, "r1full": 0.0, "r2full": 0.0, "sgerr": 0.0},
      "rxqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 7, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131674, "mbps": 1303.5, "errs": 0.0} ]},
      "txqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 0, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131646, "mbps": 1303.2, "errs": 0.0} ]},

   c. In the txqueue section above there are two Tx queues: one is carrying 131646 pps of traffic while the other shows 0 pps. A queue that persistently shows 0 pps while its peer carries traffic is the symptom of a stuck Tx queue.
2. Run the vsish command below against the affected switchport. A difference of 1 between next2Tx and next2Comp shows the issue. (A scripted version of this per-queue check is sketched after this section.)
   a. vsish -e get /net/portsets/DvsPortset-<X>/ports/<switchport number>/vmxnet3/txqueues/<queue number>/status
      i. Example:

      vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/1/status
      status of a vmxnet3 vNIC tx queue {
         intr index:0
         stopped:0
         error code:0
         next2Tx:787
         next2Comp:788
         genCount:348131
         next2Write:788
         next2Tx from timeout:980
         next2Comp from timeout:788
         timestamp in milliseconds in check:384765941
      }

      [root@seesxp13c2-las:~] vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/0/status
      status of a vmxnet3 vNIC tx queue {
         intr index:0
         stopped:0
         error code:0
         next2Tx:663
         next2Comp:663
         genCount:780117
         next2Write:663
         next2Tx from timeout:598
         next2Comp from timeout:597
         timestamp in milliseconds in check:0
      }

      In this example, queue 1 (next2Tx:787, next2Comp:788) is the stuck queue, while queue 0 (next2Tx:663, next2Comp:663) is healthy.
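The per-queue check in step 2 can be scripted so that every Tx queue on the suspect switchport is examined at once. The following is a minimal shell sketch for the ESXi host shell; the PORTSET and PORT values are placeholders taken from the example output above and must be replaced with the values from your own net-stats capture. The sketch does not account for ring-index wrap, so treat a flagged queue as a candidate and confirm it against the raw vsish output.

#!/bin/sh
# Sketch: flag vmxnet3 Tx queues where next2Comp is exactly one ahead of next2Tx,
# the stuck-queue condition described in step 2 above.
PORTSET="DvsPortset-1"      # placeholder: portset from the net-stats "switch" field
PORT="100663335"            # placeholder: switchport number of the service VM vNIC

for Q in $(vsish -e ls /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/ | tr -d '/'); do
    STATUS=$(vsish -e get /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/${Q}/status)
    # "next2Tx from timeout" / "next2Comp from timeout" lines are excluded by the !/timeout/ filter.
    N2TX=$(echo "${STATUS}" | awk -F: '/next2Tx:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    N2COMP=$(echo "${STATUS}" | awk -F: '/next2Comp:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    if [ "$((N2COMP - N2TX))" -eq 1 ]; then
        echo "txqueue ${Q}: next2Tx=${N2TX} next2Comp=${N2COMP}  <-- possible stuck queue"
    else
        echo "txqueue ${Q}: next2Tx=${N2TX} next2Comp=${N2COMP}"
    fi
done

If a queue on the port carrying the redirected traffic is flagged, proceed with the workarounds listed further below.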
In a working scenario, the SPF port code calls an ESXi function to forward packets from the Guest VM as long as the Guest VM port is active. In this case, because the Guest VM port underwent a reset caused by a snapshot of the VM, the ESXi hypervisor was unable to process the packets being sent from the Guest VM vNIC port. As a result, the I/O completion for the packet is missed and the hypervisor frees the packet anyway, which in turn causes the Tx queue to hang and stop processing traffic. This points our Engineering team toward a code fix at the SPF port level.
Once a Tx queue is hung, the datapath flowing through that queue remains broken until the vNIC on which the hung queue sits is reset (disconnected and reconnected in the VM settings).
The code fix is included in ESXi 7.0 U3o.
The code fix for ESXi 8.0 U2 is still TBD.
1. The stuck Tx queue can be reset by disconnecting the impacted vNIC via the vSphere GUI and then reconnecting it (a sketch for confirming the queue has recovered follows this list).
2. A longer-term workaround is to create a "no redirect" rule for the impacted traffic above the "redirect" rule so that it bypasses the service insertion data-path.
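After applying workaround 1, recovery can be confirmed by repeating the vsish check from the diagnosis section. The sketch below reuses that check; because the vNIC's switchport number may change when it is reconnected, the port is re-discovered first with net-stats -l. The CLIENT value is a placeholder taken from the example output, and the assumption that the first column of net-stats -l is the switchport number should be verified on your ESXi build.

#!/bin/sh
# Sketch: after disconnecting/reconnecting the impacted vNIC, re-locate its
# switchport and confirm no Tx queue still shows next2Comp = next2Tx + 1.
PORTSET="DvsPortset-1"                                   # placeholder from the example above
CLIENT="PaloAltoNetworks_PA-VM-NST_DepSpec (180).eth1"   # placeholder: vNIC name as seen in net-stats

# Assumption: net-stats -l prints the switchport number in the first column
# and the client name at the end of the line.
PORT=$(net-stats -l | grep -F "${CLIENT}" | awk '{print $1}')
echo "vNIC '${CLIENT}' is currently on switchport ${PORT} of ${PORTSET}"

for Q in $(vsish -e ls /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/ | tr -d '/'); do
    STATUS=$(vsish -e get /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/${Q}/status)
    N2TX=$(echo "${STATUS}" | awk -F: '/next2Tx:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    N2COMP=$(echo "${STATUS}" | awk -F: '/next2Comp:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    if [ "$((N2COMP - N2TX))" -eq 1 ]; then
        echo "txqueue ${Q}: still stuck (next2Tx=${N2TX}, next2Comp=${N2COMP})"
    else
        echo "txqueue ${Q}: ok (next2Tx=${N2TX}, next2Comp=${N2COMP})"
    fi
done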