...
- NSX-T Data Center version earlier than 3.2.1.2
- NSX-T Load Balancer configured with a source IP persistence profile
- The UI reports high CPU usage on the Edge
- nginx core dumps are generated, followed by high or 100% CPU on the nginx worker process
- Clients cannot connect to the LB backend servers
- The LB operation process spikes to 100%

Relevant log locations

Log indicating that an nginx coredump was generated: /var/log/syslog

2022-08-13T11:04:24.986Z edge-02.corp.local NSX 22492 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.nginx.1660388664.16937.134.11.gz

Log indicating that the load balancer CPU usage is very high just after the generation of the coredump: /var/log/syslog

2022-08-13T11:09:12.986Z edge-02.corp.local NSX 2781 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="a8xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx71" tid="3145" level="WARNING" eventState="On" eventFeatureName="load_balancer" eventSev="warning" eventType="lb_cpu_very_high"] The CPU usage of load balancer a8xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx71 is very high. The threshold is 95%.

Output of "get processes" on the Edge node showing the nginx worker processes at 100% CPU:

edge-02> get processes
top - 09:56:10 up 34 days, 17:09, 0 users, load average: 2.86, 2.16, 1.73
Tasks: 268 total, 4 running, 166 sleeping, 0 stopped, 14 zombie
%Cpu(s): 2.6 us, 2.6 sy, 0.0 ni, 94.6 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32734844 total, 6939616 free, 18403544 used, 7391684 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13488368 avail Mem
/opt/vmware/nsx-netopa/bin/agent.py
14111 lb 20 0 598552 84140 3672 R 100.0 0.3 3:56.79 14111 nginx: worker process
15422 lb 20 0 624816 83956 3404 R 100.0 0.3 0:45.09 15422 nginx: worker process
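To confirm these symptoms on a suspect Edge node, the log locations above can be checked from the root shell. The following is a minimal sketch only: the grep patterns are taken from the example log lines above, and the exact alarm wording may differ between NSX-T versions.

# Run as root on the affected Edge node.
# Look for nginx core dumps recorded in syslog and present on disk.
grep "Core file generated" /var/log/syslog | grep nginx
ls -lh /var/log/core/ | grep nginx

# Look for the load balancer high-CPU alarm (eventType lb_cpu_very_high).
grep "lb_cpu_very_high" /var/log/syslog

# From the NSX CLI prompt (e.g. edge-02>), confirm the nginx worker CPU usage:
#   get processes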
The shared memory segment has been removed from the L4LB control-plane (CP) nginx, but the queue node pointer of the persistence session still points to an address inside that removed shared memory. This means the persistence session is still linked into the queue that lived in the removed shared memory. When the persistence session is later freed, nginx tries to unlink it from that list, touches memory that no longer exists, and the load balancer crashes.
- The Load Balancer crashes.
- Because of the crash, the persistence table is never unlocked, which makes CPU usage very high in the new L4LB process.
- Some persistence session nodes may be lost: they are no longer in any queue and are out of management, so the total number of persistence entries can never reach the configured capacity for this load balancer size.
- Some new connections may be directed to different backend servers.
Upgrade to NSX-T 3.2.1.2, or to 4.1.0 and greater.
Temporary Workarounds
To resolve the issue temporarily, the nginx process must be restarted completely on the affected Edge nodes. There are multiple options for doing this; any one of them can be used (a sketch of option 3, with a follow-up check, is shown after this section).
1. Place the active Edge node into maintenance mode and then exit maintenance mode.
2. Restart the Edge node.
3. Restart the load balancer container:
   docker ps | grep edge-lb | awk '{print $1}' | xargs docker restart

Long-term Workaround
Change the pool selection algorithm from round robin to IP hash and disable persistence on all virtual servers.
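The container-restart workaround (option 3) can be combined with a quick before/after check. This is a minimal sketch, assuming root shell access on the affected Edge node; the docker commands and the edge-lb filter are the ones shown above, while the ps checks are an illustrative alternative to running get processes from the NSX CLI.

# Run as root on the affected Edge node.
# Optional: note the CPU usage of the nginx workers before restarting.
ps aux | grep "nginx: worker" | grep -v grep

# Restart the load balancer container (this restarts the nginx processes inside it).
docker ps | grep edge-lb | awk '{print $1}' | xargs docker restart

# Verify the container is running again and the nginx workers are no longer at 100% CPU.
docker ps | grep edge-lb
ps aux | grep "nginx: worker" | grep -v grep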