...
What is DDR4 "self-healing" on AMD Rome / Milan processor based Dell PowerEdge Servers (R65xx, R75xx, and C65xx)?
Do the previous generation AMD based PowerEdge servers with AMD EPYC processors (R64xx and R74xx) support these same "self-healing" capabilities?
How do these DDR4 "self-healing" capabilities (BIOS enhancements) change recommended customer and Technical Support actions when encountering memory errors on a server?
Dell EMC continues to enhance the PowerEdge BIOS to improve memory error event messaging, error handling, and "self-healing" upon a server reboot. These enhancements can prevent the need for a scheduled maintenance window and/or onsite presence to replace a DDR4 memory DIMM that was logging error events.
There are two main memory-related "self-healing" BIOS enhancements included at product launch with AMD processor based PowerEdge servers (65xx/75xx) with DDR4 memory. These enhancements change the recommended actions to take if memory errors occur and are logged to the LifeCycle log.

Note: The "self-healing" enhancements discussed in this article do not apply to the previous generation of AMD based PowerEdge servers with AMD EPYC processors. The 64xx/74xx AMD PowerEdge servers do not contain any of the "self-healing" enhancements described in this article. On those servers, memory retraining only occurs when changes in server memory configuration are detected. Version 1.0 of the Engineering white paper does describe some of the RAS features available for AMD EPYC processors - PowerEdge YX4X Server Memory RAS Whitepaper v1.0 (dell.com)

Note: Current memory troubleshooting steps incorporate moving failing DIMMs to a different slot to confirm whether the errors follow the DIMM or remain with the DIMM slot. With AMD Rome / Milan based PowerEdge servers, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements. We always encourage customers to update to the latest available BIOS release (and iDRAC firmware) so that they can take advantage of the latest self-healing enhancements.

1. Memory retraining enhancements - Memory retraining, which happens during boot, optimizes the signal timing/margining for each DIMM/slot for best access. Timing characteristics of a DIMM may change for several different reasons:
- Changes in server memory configuration
- BIOS changes
- Different operating temperatures of the server or DIMM
- The general age of the DIMM
Current AMD Rome / Milan based PowerEdge servers (65xx/75xx) perform memory retraining upon every boot.
This differs from the current Intel based PowerEdge server implementation. If any of the following errors are logged in the SEL/LifeCycle logs, the Dell EMC Engineering recommendation is to reboot the server to allow memory retraining to occur.
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."
Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location(s) DIMM_XX."
With any of these correctable or uncorrectable (multi-bit) memory errors, the resulting memory retraining on reboot/restart may "self-heal" the failing DIMM by optimizing the signal timing/margining for each DIMM/slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.

2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location/address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size. Previously, this functionality was limited to the manufacturing process. Just as with the memory retraining enhancements described earlier, there are certain correctable and uncorrectable memory errors that result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). The BIOS automatically forces a cold reboot regardless of which type is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has been run.
Examples of the errors are:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."
Critical - MEM9072 - "The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location arg1."
Any of these errors being logged in the SEL/LifeCycle log results in PPR being scheduled for the next reboot (warm or cold).

Note: A Message ID MEM8000 (Correctable memory error logging disabled for a memory device at location DIMM_XX.) without a corresponding MEM0005/MEM0701/MEM0702 on the same DIMM location currently does not result in PPR being scheduled for the next reboot.

After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation is similar to:
Message ID MEM9060 - "The PostPackage Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."
A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails after the reboot. An example of a failing PPR message is:
Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."

Updated April 24, 2020
Dell EMC is continuing to enhance and expand our "self-healing" capabilities. The following section documents the updates/enhancements and the BIOS version in which each change was implemented.

BIOS 1.0.x - Initial article publication of the "self-healing" capabilities available starting with BIOS 1.0.x and higher, including example error messages as well as recommended actions.

BIOS 1.1.x and newer changes (December 2019)
MEM0702 (Correctable error rate exceeded ...) - Message updated from critical to warning.
The recommended action was also updated: reboot the server to allow "self-healing" (that is, Post Package Repair) to occur. Requires December 2019 or newer iDRAC to also be installed to get the updated message.
Recommended Action: Reboot the server to allow PPR to run.
MEM9060 - Message description updated to indicate "self-healing" was successfully completed.

BIOS 1.2.x and newer changes (February 2020)
A "Correctable Error Logging" BIOS option was added to allow customers to disable all LifeCycle/SEL logging related to correctable errors. All the "self-healing" features still function, that is, PPR and memory retraining are still scheduled and run during the next reboot.
MEM08xx errors were added for RDIMMs and LRDIMMs, replacing existing error messages and actions. Existing error messages are still used for platforms that do not support the "self-healing" capabilities. Requires February 2020 or newer iDRAC for messages to get logged.
Note: Without updated iDRAC, new BIOS messages are "unknown" in the SEL/LC logs.
MEM0802 - Replaced MEM0702 - correctable error rate exceeded.
Recommended Action: Reboot the server to allow PPR to run.
MEM0804 - Replaced MEM9060, indicating PPR was successful. Now includes the DIMM slot location(s) that ran PPR.
Recommended Action: None. Indicates "self-healing" occurred; no DIMM replacement is needed.
MEM0805 - Replaced UEFI0278, indicating PPR failed.
Recommended Action: Replace the failing DIMM.

Updated January 25, 2021
BIOS 1.7.x and newer changes (December 2020)
MEM8000 (Correctable error logging disabled) - In an early BIOS release, Dell EMC Engineering made a BIOS change to enhance the rate of correctable error detection that may impact performance. This change resulted in an uptick in MEM8000 events that was not substantiated by results from memory component failure analysis. Starting with BIOS 1.7.x there are two changes related to MEM8000. First, signaling of the MEM8000 event has been modified. Second, the BIOS schedules self-healing (PPR) for the next reboot.
iDRAC messages are not yet updated to reflect the new actions.
Recommended Action: Reboot the server to allow self-healing/PPR to run. Confirm that PPR was successful (MEM0804).

There are additional RAS feature enhancements being evaluated for inclusion in future BIOS updates. A white paper is planned that describes the memory-related Reliability, Availability, and Serviceability (RAS) features of Dell EMC PowerEdge servers (AMD Rome / Milan based processors). This article will be updated as new information becomes available.
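
The message IDs and recommended actions described in this article can be summarized as a small lookup. The following is a minimal sketch in Python, not a Dell EMC tool or API: the function names, the action labels, and the bios_version parameter are hypothetical and exist only for illustration. The message IDs and their actions are taken from the text above. The sketch assumes log entries are available as plain text lines that contain the message ID.

```python
import re

# Hypothetical action labels, used only in this sketch.
REBOOT = "reboot"          # reboot so memory retraining / PPR can run; do not move DIMMs first
NONE = "none"              # PPR succeeded; no DIMM replacement needed
REPLACE = "replace_dimm"   # retraining or PPR failed; replace the DIMM
CHECK = "check_related"    # event alone does not schedule PPR; look for related events
UNKNOWN = "unknown"        # message ID not covered by this article

# Message IDs and actions as described in the article.
_ACTIONS = {
    "MEM0001": REBOOT,
    "MEM0005": REBOOT,
    "MEM0701": REBOOT,
    "MEM0702": REBOOT,
    "MEM0802": REBOOT,    # replaces MEM0702 (BIOS 1.2.x and newer)
    "MEM9072": REBOOT,
    "MEM9060": NONE,      # PPR completed successfully
    "MEM0804": NONE,      # replaces MEM9060
    "UEFI0106": REPLACE,  # memory retraining failed during boot
    "UEFI0278": REPLACE,  # PPR failed
    "MEM0805": REPLACE,   # replaces UEFI0278
}

# Matches IDs such as MEM0701 or UEFI0278 inside a log line.
_ID_PATTERN = re.compile(r"\b(?:MEM|UEFI)\d{4}\b")


def recommended_action(message_id, bios_version=(1, 7)):
    """Return the action label for one message ID.

    bios_version is a (major, minor) tuple. It only matters for MEM8000,
    which schedules PPR starting with BIOS 1.7.x according to the article.
    """
    if message_id == "MEM8000":
        return REBOOT if tuple(bios_version) >= (1, 7) else CHECK
    return _ACTIONS.get(message_id, UNKNOWN)


def triage(log_lines, bios_version=(1, 7)):
    """Return (message_id, action) pairs for every known-format ID found."""
    results = []
    for line in log_lines:
        for message_id in _ID_PATTERN.findall(line):
            results.append((message_id, recommended_action(message_id, bios_version)))
    return results


if __name__ == "__main__":
    # Sample lines built from the message texts quoted in this article.
    sample = [
        'Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."',
        'Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation"',
    ]
    for message_id, action in triage(sample):
        print(message_id, action)
```

The lookup is deliberately a flat table so that it can be compared line by line against the message lists in this article. It does not check whether a reboot has already happened or whether the same errors continue to occur afterward, which the article also treats as a reason to replace the DIMM.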