Symptoms
HPE Gen10 servers running ESXi 6.0.x randomly fail with a PSOD reporting Machine Check Exception: Fatal (unrecoverable) MCE. The PSOD backtrace looks similar to:
2017-10-01T02:40:01.688Z cpu50:55173)Backtrace for current CPU #50, worldID=55173, rbp=0x0
2017-10-01T02:40:01.688Z cpu50:55173)0x43947c29bcf8:[0x418039b0515a]Power_HaltPCPU@vmkernel#nover+0x1ee stack: 0x417ff9a83f20, 0x41804c9
2017-10-01T02:40:01.688Z cpu50:55173)0x43947c29bd48:[0x418039a12078]CpuSchedIdleLoopInt@vmkernel#nover+0x2f8 stack: 0x21aaa22dde9e0, 0x1
2017-10-01T02:40:01.688Z cpu50:55173)0x43947c29bdc8:[0x418039a157d3]CpuSchedDispatch@vmkernel#nover+0x16b3 stack: 0x43927eaa7100, 0x0, 0
2017-10-01T02:40:01.688Z cpu50:55173)0x43947c29bee8:[0x418039a16398]CpuSchedWait@vmkernel#nover+0x240 stack: 0x41003b4acde0, 0x0, 0xa000
2017-10-01T02:40:01.688Z cpu50:55173)0x43947c29bf68:[0x418039a164ea]CpuSched_VcpuHalt@vmkernel#nover+0x11e stack: 0xffffffff00002001, 0x
2017-10-01T02:40:01.688Z cpu50:55173)0x43947c29bfb8:[0x4180398ac529]VMMVMKCall_Call@vmkernel#nover+0x139 stack: 0x4180398ac074, 0x0, 0x4
2017-10-01T02:40:01.712Z cpu50:55173)VMware ESXi 6.0.0 [Releasebuild-5572656 x86_64]
Machine Check Exception: Fatal (unrecoverable) MCE on PCPU50 in world 55173:vmm0:fvst-ca
System has encountered a Hardware Error - Please contact the hardware vendor
Installation of ESXi 6.0.x may also fail with an error.
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
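When triaging a log bundle, it can help to pick out the fatal MCE line and the affected PCPU automatically. The following is a minimal Python sketch, not a VMware tool: the function name find_fatal_mce and the regular expression are illustrative assumptions based only on the wording of the example PSOD message above, and other builds may word the message differently.

```python
import re

# Assumption: the pattern mirrors the example PSOD line shown in this article.
# Other ESXi builds may format the message differently.
MCE_PATTERN = re.compile(
    r"Machine Check Exception: Fatal \(unrecoverable\) MCE "
    r"on PCPU(\d+) in world (\d+)"
)


def find_fatal_mce(log_text):
    """Return a list of (pcpu, world_id) tuples, one per fatal MCE line found."""
    return [
        (int(match.group(1)), int(match.group(2)))
        for match in MCE_PATTERN.finditer(log_text)
    ]


# Sample text copied from the example excerpt in this article.
sample = (
    "Machine Check Exception: Fatal (unrecoverable) MCE "
    "on PCPU50 in world 55173:vmm0:fvst-ca"
)
print(find_fatal_mce(sample))  # prints [(50, 55173)]
```

An empty list means no line matched this particular wording; it does not by itself rule out a hardware error.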
Cause
ESXi was mapping 8 GB above the top of memory, which allowed the CPU to touch addresses above the top of memory and caused the failure.
Resolution
To resolve this issue, upgrade to VMware ESXi patch release ESXi-6.0.0-20171104001 or later. For more information, refer to the HPE customer advisory.
Disclaimer: VMware is not responsible for the reliability of any data, opinions, advice, or statements made on third-party websites. Inclusion of such links does not imply that VMware endorses, recommends, or accepts any responsibility for the content of such sites.
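To check a list of hosts against the patch level named above, the release token in the image profile name can be compared numerically. This is a rough Python sketch under stated assumptions: the helper has_fix is hypothetical, it assumes the profile name starts with ESXi-6.0.0- followed by a numeric token, and it assumes that a larger token means a later release within the 6.0.0 line. Confirm the actual patch level against the release notes before relying on it.

```python
import re

# Release token taken from the patch release name ESXi-6.0.0-20171104001.
FIXED_RELEASE = 20171104001

# Assumption: profile names begin with "ESXi-6.0.0-" followed by digits.
PROFILE_PATTERN = re.compile(r"^ESXi-6\.0\.0-(\d+)")


def has_fix(profile_name):
    """Return True if the profile's release token is at or above the fixed one.

    Assumes a larger numeric token means a later 6.0.0 release.
    """
    match = PROFILE_PATTERN.match(profile_name)
    if match is None:
        raise ValueError("not an ESXi 6.0.0 profile name: " + profile_name)
    return int(match.group(1)) >= FIXED_RELEASE


print(has_fix("ESXi-6.0.0-20171104001"))  # prints True
```

Names that do not match the assumed format raise ValueError instead of being silently treated as patched or unpatched.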
Workaround
To work around this issue, go to the HPE RBSU and set the memory to 1 TB. Contact HPE Support for more information.