...
Traffic traversing the redirect rules and passing through the service VM periodically stops working. This condition cannot be diagnosed from the logs. The steps below can be used to confirm it.

1. Run net-stats -A -t WwQqihVv > /<path>/<filename.txt>
   a. Search the output for the service VM name and the vNIC (typically eth1, but not always) that is connected to the service overlay segment.
   b. The section will look similar to the following:

      {"name": "PaloAltoNetworks_PA-VM-NST_DepSpec (180).eth1", "switch": "DvsPortset-1", "id": 67108901, "mac": "00:50:56:b6:18:4a", "rxmode": 0, "tunemode": 0, "uplink": "false", "ens": "false", "promisc": "false", "sink": "false" , "txpps": 131644, "txmbps": 1303.2, "txsize": 1237, "txeps": 0.00, "rxpps": 131681, "rxmbps": 1303.5, "rxsize": 1237, "rxeps": 0.00,
      "vnic": { "type": "vmxnet3", "ring1sz": 1024, "ring2sz": 1024, "tsopct": 0.0, "tsotputpct": 0.0, "txucastpct": 100.0, "txeps": 0.0, "lropct": 0.0, "lrotputpct": 0.0, "rxucastpct": 100.0, "rxeps": 0.0, "maxqueuelen": 0, "requeuecnt": 0.0, "agingdrpcnt": 0.0, "deliveredByBurstQ": 0.0, "dropsByBurstQ": 0.0, "droppedbyQueuing": 0.0 , "txdisc": 0.0, "qstop": 0.0, "txallocerr": 0.0, "txtsosplit": 0.0, "r1full": 0.0, "r2full": 0.0, "sgerr": 0.0},
      "rxqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 7, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131674, "mbps": 1303.5, "errs": 0.0} ]},
      "txqueue": { "count": 2, "details": [ {"intridx": 0, "pps": 0, "mbps": 0.0, "errs": 0.0}, {"intridx": 0, "pps": 131646, "mbps": 1303.2, "errs": 0.0} ]},

   c. In the txqueue section above there are two Tx queues: one is carrying 131646 pps of traffic while the other shows 0 pps. A queue that persistently shows 0 pps while its peer carries traffic is the symptom of a stuck Tx queue.
2. Run the vsish command below against the affected switchport. A difference of 1 between next2Tx and next2Comp shows the issue. (A scripted version of this per-queue check is sketched after this section.)
   a. vsish -e get /net/portsets/DvsPortset-<X>/ports/<switchport number>/vmxnet3/txqueues/<queue number>/status
      i. Example:

      vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/1/status
      status of a vmxnet3 vNIC tx queue {
         intr index:0
         stopped:0
         error code:0
         next2Tx:787
         next2Comp:788
         genCount:348131
         next2Write:788
         next2Tx from timeout:980
         next2Comp from timeout:788
         timestamp in milliseconds in check:384765941
      }

      [root@seesxp13c2-las:~] vsish -e get /net/portsets/DvsPortset-1/ports/100663335/vmxnet3/txqueues/0/status
      status of a vmxnet3 vNIC tx queue {
         intr index:0
         stopped:0
         error code:0
         next2Tx:663
         next2Comp:663
         genCount:780117
         next2Write:663
         next2Tx from timeout:598
         next2Comp from timeout:597
         timestamp in milliseconds in check:0
      }

      In this example, queue 1 (next2Tx:787, next2Comp:788) is the stuck queue, while queue 0 (next2Tx:663, next2Comp:663) is healthy.
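The per-queue check in step 2 can be scripted so that every Tx queue on the suspect switchport is examined at once. The following is a minimal shell sketch for the ESXi host shell; the PORTSET and PORT values are placeholders taken from the example output above and must be replaced with the values from your own net-stats capture. The sketch does not account for ring-index wrap, so treat a flagged queue as a candidate and confirm it against the raw vsish output.

#!/bin/sh
# Sketch: flag vmxnet3 Tx queues where next2Comp is exactly one ahead of next2Tx,
# the stuck-queue condition described in step 2 above.
PORTSET="DvsPortset-1"      # placeholder: portset from the net-stats "switch" field
PORT="100663335"            # placeholder: switchport number of the service VM vNIC

for Q in $(vsish -e ls /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/ | tr -d '/'); do
    STATUS=$(vsish -e get /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/${Q}/status)
    # "next2Tx from timeout" / "next2Comp from timeout" lines are excluded by the !/timeout/ filter.
    N2TX=$(echo "${STATUS}" | awk -F: '/next2Tx:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    N2COMP=$(echo "${STATUS}" | awk -F: '/next2Comp:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    if [ "$((N2COMP - N2TX))" -eq 1 ]; then
        echo "txqueue ${Q}: next2Tx=${N2TX} next2Comp=${N2COMP}  <-- possible stuck queue"
    else
        echo "txqueue ${Q}: next2Tx=${N2TX} next2Comp=${N2COMP}"
    fi
done

If a queue on the port carrying the redirected traffic is flagged, proceed with the workarounds listed further below.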
In a working scenario, the SPF port code calls an ESXi function to forward packets from the Guest VM as long as the Guest VM port is active. In this case, because the Guest VM port underwent a reset caused by a snapshot of the VM, the ESXi hypervisor was unable to process the packets being sent from the Guest VM vNIC port. As a result, the I/O completion for the packet is missed and the hypervisor frees the packet anyway, which in turn causes the Tx queue to hang and stop processing traffic. This points our Engineering team toward a code fix at the SPF port level.
Once a Tx queue is hung, the datapath flowing through that queue remains broken until the vNIC on which the hung queue sits is reset (disconnected and reconnected in the VM settings).
The code fix is included in ESXi 7.0 U3o.
The code fix for ESXi 8.0 U2 is still TBD.
1. The stuck Tx queue can be reset by disconnecting the impacted vNIC via the vSphere GUI and then reconnecting it (a sketch for confirming the queue has recovered follows this list).
2. A longer-term workaround is to create a "no redirect" rule for the impacted traffic above the "redirect" rule so that it bypasses the service insertion data-path.
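After applying workaround 1, recovery can be confirmed by repeating the vsish check from the diagnosis section. The sketch below reuses that check; because the vNIC's switchport number may change when it is reconnected, the port is re-discovered first with net-stats -l. The CLIENT value is a placeholder taken from the example output, and the assumption that the first column of net-stats -l is the switchport number should be verified on your ESXi build.

#!/bin/sh
# Sketch: after disconnecting/reconnecting the impacted vNIC, re-locate its
# switchport and confirm no Tx queue still shows next2Comp = next2Tx + 1.
PORTSET="DvsPortset-1"                                   # placeholder from the example above
CLIENT="PaloAltoNetworks_PA-VM-NST_DepSpec (180).eth1"   # placeholder: vNIC name as seen in net-stats

# Assumption: net-stats -l prints the switchport number in the first column
# and the client name at the end of the line.
PORT=$(net-stats -l | grep -F "${CLIENT}" | awk '{print $1}')
echo "vNIC '${CLIENT}' is currently on switchport ${PORT} of ${PORTSET}"

for Q in $(vsish -e ls /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/ | tr -d '/'); do
    STATUS=$(vsish -e get /net/portsets/${PORTSET}/ports/${PORT}/vmxnet3/txqueues/${Q}/status)
    N2TX=$(echo "${STATUS}" | awk -F: '/next2Tx:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    N2COMP=$(echo "${STATUS}" | awk -F: '/next2Comp:/ && !/timeout/ {gsub(/ /,"",$2); print $2}')
    if [ "$((N2COMP - N2TX))" -eq 1 ]; then
        echo "txqueue ${Q}: still stuck (next2Tx=${N2TX}, next2Comp=${N2COMP})"
    else
        echo "txqueue ${Q}: ok (next2Tx=${N2TX}, next2Comp=${N2COMP})"
    fi
done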