...
ESXi host may crash with a PSOD - Spin count exceeded - possible deadlock with PCPU
vSphere Replication is used to replicate the VMs.
The backtrace will be similar to the following:

Panic Details: Crash at YYYY-MM-DDTHH:MM:SS.825Z on CPU 0 running world 2102090. VMK Uptime:72:07:15:03.736
Panic Message: @BlueScreen: NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0xc42ce(0x418002400000):0x43190b62a500:0xfc8] (Src 0x4, CPU0)
0x450a00002d10:[0x41800250ac15]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x418002889fe0, 0x418002889f28, 0x450a00002db8, 0x43026acd2028, 0x450a00000001
0x450a00002db0:[0x41800250aea1]Panic_WithBacktrace@vmkernel#nover+0x56 stack: 0x450a00002e20, 0x450a00002dd0, 0x0, 0x0, 0xc42ce
0x450a00002e20:[0x418002507c91]NMI_Interrupt@vmkernel#nover+0x3c2 stack: 0x0, 0xfc8, 0x5320302075706370, 0x206b636f4c6e6970, 0x74756f206e697073
0x450a00002ea0:[0x418002543ffc]IDTNMIWork@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
0x450a00002f20:[0x4180025454f0]Int2_NMI@vmkernel#nover+0x19 stack: 0x0, 0x418002560067, 0xfd0, 0xfd0, 0x0
0x450a00002f40:[0x418002560066]gate_entry@vmkernel#nover+0x67 stack: 0x0, 0x0, 0xf, 0x1c0477, 0x0
0x451ac901be88:[0x4180024c42ce]BitVector_NextBit@vmkernel#nover+0x46 stack: 0x41800385a75f, 0x16b7487, 0x432099c4dfd0, 0x43206200a730, 0x43209a0c8c30
0x451ac901be98:[0x4180024c451b]BitVector_NextExtent@vmkernel#nover+0x4c stack: 0x432099c4dfd0, 0x43206200a730, 0x43209a0c8c30, 0x432062092050, 0x41800386f366
0x451ac901bed0:[0x41800386f365]TransferDispatchExtent@(hbr_filter)#<None>+0xb2 stack: 0x418003870c3b, 0x418002514853, 0x8818c0, 0x4180025502c7, 0x451ac90232c0
0x451ac901bf80:[0x418003870a99]ResourceWorld@(hbr_filter)#<None>+0xa2 stack: 0x43209a0c8c6c, 0x417fd92021c0, 0x0, 0x451ac9023000, 0x451ac6123100
0x451ac901bfe0:[0x418002709112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0

Saved backtrace from: pcpu 0 SpinLock spin out NMI
0x451ac901be88:[0x4180024c42cd]BitVector_NextBit@vmkernel#nover+0x46 stack: 0x41800385a75f
0x451ac901be98:[0x4180024c451b]BitVector_NextExtent@vmkernel#nover+0x4c stack: 0x432099c4dfd0
0x451ac901bed0:[0x41800386f365]TransferDispatchExtent@(hbr_filter)#<None>+0xb2 stack: 0x418003870c3b
0x451ac901bf80:[0x418003870a99]ResourceWorld@(hbr_filter)#<None>+0xa2 stack: 0x43209a0c8c6c
0x451ac901bfe0:[0x418002709112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0
This article explains how to avoid crashing the ESXi host with a PSOD in this scenario.
hbr_filter searches for a whole contiguous region in the transfer bitmap. This usually works well when the regions are small. When the regions are large enough (for example, during a full sync of a large disk with checksumming disabled), iterating over them can result in a PSOD: the disk lock is held for so long that other CPUs contending for it exceed their spin count.
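As a rough illustration of why large regions are costly, the sketch below models the transfer bitmap as a string of 0 and 1 characters and walks it one bit at a time to find the next contiguous run, loosely mirroring the BitVector_NextExtent and BitVector_NextBit frames in the backtrace. The function name next_extent and the string representation are assumptions made for this example; it is a toy model, not the hbr_filter implementation.

```shell
# Toy model (assumption): the bitmap is a string such as "0011100",
# one character per block, where 1 marks a block still to be transferred.
# next_extent BITMAP prints "OFFSET LENGTH" for the first run of 1-bits.
next_extent() {
    rest=$1        # remaining bitmap, consumed one character at a time
    offset=0
    length=0
    # Skip clear bits until the first set bit (or the end of the bitmap).
    while [ -n "$rest" ] && [ "${rest%"${rest#?}"}" = "0" ]; do
        rest=${rest#?}
        offset=$((offset + 1))
    done
    # Walk the whole contiguous run of set bits. The loop runs once per
    # bit, so the work grows linearly with the size of the region.
    while [ -n "$rest" ] && [ "${rest%"${rest#?}"}" = "1" ]; do
        rest=${rest#?}
        length=$((length + 1))
    done
    echo "$offset $length"
}
```

In this model, a bitmap with a short dirty run returns quickly, while a bitmap that is almost entirely set (as in a full sync) forces the second loop to visit every bit before it returns. If such a scan runs while a lock is held, the hold time scales with the region size.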
The workaround involves stopping the replication of the VM that caused the crash.
This issue is resolved in VMware vSphere ESXi 6.0 Patch ESXi600-201909001, ESXi 6.5 U3, and ESXi 6.7 U3.
To work around this issue, follow the steps below:

1. Identify the VM that is part of the replication and note the RDID. For example:
   This is the replication ID of the disk: RDID-13d1285d-e660-4da9-8ffd-9e921a84ea2c
   The corresponding replication group ID: GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9
2. Find the VM by checking the replication configuration of the VMs on the host:
   $ vim-cmd hbrsvc/vmreplica.getConfig <vmID>
   where <vmID> can be obtained from the list of the registered VMs:
   $ vim-cmd vmsvc/getallvms
   The replication ID should match GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9.
3. Stop the replication for this VM.
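On a host with many registered VMs, the lookup can be scripted. The following is a minimal sketch that runs the two vim-cmd commands from the workaround for every registered VM and prints the IDs of those whose replication configuration contains the given group ID. The function name find_vms_by_gid and the VIM_CMD override variable are hypothetical helpers introduced for this example. The sketch assumes that data rows in the getallvms output start with a numeric VM ID and that the group ID appears verbatim in the getConfig output; verify both on your host before relying on it.

```shell
# Hypothetical helper: print the IDs of registered VMs whose replication
# configuration mentions the given replication group ID.
# VIM_CMD (assumption) lets the vim-cmd binary be overridden.
find_vms_by_gid() {
    gid=$1
    vim_cmd=${VIM_CMD:-vim-cmd}
    # Keep only rows whose first column is a numeric VM ID (skips the header).
    "$vim_cmd" vmsvc/getallvms | awk '$1 ~ /^[0-9]+$/ { print $1 }' |
    while read -r vmid; do
        # Search the replication configuration for the literal group ID.
        # stdin is redirected so the command cannot consume the VM ID list.
        if "$vim_cmd" hbrsvc/vmreplica.getConfig "$vmid" </dev/null 2>/dev/null |
            grep -qF -- "$gid"; then
            echo "$vmid"
        fi
    done
}
```

Example usage, with the group ID from the steps above:

$ find_vms_by_gid GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9

Each printed number is a <vmID> whose replication should then be stopped.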