...
You have NSX-T 3.2.x deployed. You are using IDFW and have log scraping configured for AD and Log Insight.

The NSX-T Manager UI becomes unavailable, displaying the following error:
Some appliance components are not functioning properly. Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP. Error code: 101

In the NSX-T Manager CLI as the admin user, the following command fails:
get cluster status

On the NSX-T Managers as the root user, you see one or more recent core dumps:
ls -l /image/core/
total 1191400
-rw------- 1 nsx-cbm nsx-cbm  45579417 Mar 8 15:06 cbm_oom.hprof.gz
-rw------- 1 root    root    230343252 Mar 8 20:17 compactor_oom.hprof.gz
-rw------- 1 corfu   corfu   944060040 Mar 9 14:14 corfu_oom.hprof.gz

In the NSX-T Manager root '/' partition, you see a large number of files starting with hs_err_pidXXXX.log, where XXXX represents the PID of the process and will be different on your setup.

The Manager layout may not be complete for all Managers, as seen in the file /config/corfu/LAYOUT_CURRENT.ds:
"sequencers": [
  "192.168.1.131:9000",
  "192.168.1.133:9000",
  "192.168.1.132:9000"
],
"segments": [
  {
    "replicationMode": "CHAIN_REPLICATION",
    "start": 0,
    "end": 40397089,
    "stripes": [
      {
        "logServers": [
          "192.168.1.131:9000"
        ]
      }
    ]
  },
  {
    "replicationMode": "CHAIN_REPLICATION",
    "start": 40397089,
    "end": 40397196,
    "stripes": [
      {
        "logServers": [
          "192.168.1.131:9000",
          "192.168.1.133:9000"
        ]
      }
    ]
  },
  {
    "replicationMode": "CHAIN_REPLICATION",
    "start": 40397196,
    "end": 40397804,
    "stripes": [
      {
        "logServers": [
          "192.168.1.131:9000",
          "192.168.1.133:9000"
        ]
      }
    ]
  },
  {
    "replicationMode": "CHAIN_REPLICATION",
    "start": 40397804,
    "end": -1,
    "stripes": [
      {
        "logServers": [
          "192.168.1.131:9000",
          "192.168.1.133:9000",
          "192.168.1.132:9000"
        ]
      }
    ]
  }
]

From the above file, we see that the following Managers do not have the complete database synced to them:
Manager 192.168.1.133 is missing from the replication segment "start": 0, "end": 40397089.
Manager 192.168.1.132 is missing from the replication segments "start": 40397089, "end": 40397196 and "start": 0, "end": 40397089.
Manager 192.168.1.131 is the only one with a complete copy of the database.
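If you want to quickly confirm these indicators from the root shell of each NSX-T Manager, the checks below are a minimal sketch based on the paths shown above; the exact file names and sizes will differ on your setup:

# List recent core dumps left by the crashing services
ls -lh /image/core/

# Count JVM fatal error logs written to the root partition
ls /hs_err_pid*.log 2>/dev/null | wc -l

# Review the current Corfu layout to see which Managers hold each segment
cat /config/corfu/LAYOUT_CURRENT.ds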
In /var/log/corfu-compactor-audit.log, we see:
corfu-compactor-audit.9.log:2022-03-02T18:54:23.170Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24113825 (exclusive).
corfu-compactor-audit.9.log:2022-03-02T19:09:22.956Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).
corfu-compactor-audit.log:2022-03-07T15:53:56.666Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
...
corfu-compactor-audit.log:2022-03-09T12:57:25.040Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T13:57:11.401Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T14:42:42.964Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T14:59:07.533Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T15:15:43.943Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).

The above log entries indicate that the compactor is running but not trimming, as the log address is not increasing.

In the same log, we see that the last completed checkpoint for table 3c54c60e-5a89-3f7c-9f1f-f03724af9649 contains a very large number of entries, is large in size, and took a long time to complete:
2022-03-03T15:29:29.765Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(621841), cpSize(213995471) bytes at snapshot Token(epoch=13, sequence=27948473) in 934074 ms
2022-03-03T16:57:20.892Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(619376), cpSize(213145572) bytes at snapshot Token(epoch=13, sequence=28017532) in 4606838 ms

We can see the corfu compactor service crashing:
2022-03-03T17:33:09.152Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Starting checkpoint namespace: nsx, tableName: LoginLogoutEvent
2022-03-03T17:33:09.152Z INFO main MultiCheckpointWriter - appendCheckpoints: appending checkpoints for 1 maps
2022-03-03T17:33:09.164Z INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649 at snapshot Token(epoch=13, sequence=28250089)
......
Aborting due to java.lang.OutOfMemoryError: Java heap space
......
Aborted (core dumped)
2022-03-03T17:47:22.761Z INFO Runner - Failed to run compactor tool: Command 'MALLOC_TRIM_THRESHOLD_=1310720 nice -n -10 java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core/compactor_oom.hprof -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof" -XX:+CrashOnOutOfMemoryError -Xms963m -Xmx963m -Djava.io.tmpdir=/image/corfu-tools/temp -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -DlogFilePrefix=/var/log/corfu/corfu-compactor-audit -Dlog4j.configurationFile=/opt/vmware/ufo-tools/corfu-compactor-log4j2.xml -Dcorfu-property-file-path=/opt/vmware/cbm/etc/ufo-factory.properties -cp "/opt/vmware/ufo-tools/*" com.vmware.nsx.platform.ufo.UfoCompactorMain -hostname 10.1.1.132 -hostname 10.1.1.133 -hostname 10.1.1.131 -port 9000 -trim -useDistributedLock -lockCorfuHostname 10.1.1.131 -lockCorfuPort 9000 -bulkReadSize 50' returned non-zero exit status 134.

And other services are also running out of memory:
grep "| java.lang.OutOfMemoryError: Java heap space" tanuki.log | head
INFO | jvm 1 | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 2 | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 3 | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 4 | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 5 | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 6 | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 7 | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 8 | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 9 | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 10 | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space
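To quantify these symptoms on your own setup, you can pull the checkpoint statistics for the affected table and count the heap exhaustion events. The commands below are a minimal sketch using the table UUID and file names shown in this example; adjust the paths if your audit logs are kept under /var/log/corfu/ instead, and run the tanuki.log check from the directory containing that file, as in the grep above:

# Show the entry count, size and duration of completed checkpoints for the affected table
grep "completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649" /var/log/corfu-compactor-audit*

# Confirm the trim address over time; if the "log address up to" value stops increasing, trimming has stalled
grep "UFO Trim completed" /var/log/corfu-compactor-audit* | tail -20

# Count the Java heap space errors recorded in tanuki.log
grep -c "java.lang.OutOfMemoryError: Java heap space" tanuki.log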
This issue can occur when IDFW is configured with AD log scraping and there are too many login and logout events. The mechanism used to clean up this table cannot keep up with the rate of events, and the table eventually grows too big for the corfu compactor service to complete its compaction run, which causes the corfu compactor service to crash. This issue can also cause other services to crash, as seen in the tanuki log above, due to the amount of memory consumed by the corfu compactor service while trying to complete the compaction.
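As a rough, back-of-the-envelope illustration using the figures from the logs above: the last completed checkpoint for the LoginLogoutEvent table held 621841 entries totalling 213995471 bytes, while the compactor JVM in the failed command line is capped at a 963 MB heap (-Xms963m -Xmx963m). Once the table and the checkpoint data being built from it no longer fit comfortably in that heap, the java.lang.OutOfMemoryError aborts seen above become expected:

# Approximate on-log size per LoginLogoutEvent entry in the last completed checkpoint
echo $(( 213995471 / 621841 ))   # roughly 344 bytes per entry

# Approximate checkpoint size relative to the compactor heap
echo $(( 213995471 / 1024 / 1024 ))" MB of checkpoint data vs a 963 MB heap"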
This is a known issue impacting NSX-T Data Center.
It is possible to increase the frequency of the IDFW cleaner so that it runs more often and cleans up these entries, thus reducing the retention time of the events in the corfu table. If you believe you have encountered this issue, please open a support request and reference this KB article.