...
Through the evolution of RAS (Reliability, Availability, and Serviceability) features across enterprise-class memory, Dell has taken a conservative approach to error reporting to provide transparency to our customers. As this evolution continues, so does Dell’s approach to error reporting, so that customers can focus on notices that require a more urgent response rather than notices that are primarily informational.
As DRAM-based memory geometries continue to shrink, providing customers with the increased performance they demand, an increasing number of correctable errors is expected as a natural part of uniform scaling.
Within the global server industry, there is an increasingly accepted understanding, shared by Dell, that some correctable errors per DIMM are unavoidable and do not inherently warrant a memory module replacement or even an immediate reboot to initiate self-healing.
Continuing to operate a system that reports correctable errors, without rebooting to self-heal, does not significantly increase the risk of uncorrectable errors that may lead to unplanned downtime. In fact, others in the industry have publicly communicated that their memory handling does not report correctable errors.
In 14G Intel PowerEdge BIOS version 2.5.4 and newer, a BIOS setting called "Correctable Error Logging" was added to give customers the option to disable correctable error reporting, and many have done so. BIOS continues to schedule self-healing for correctable threshold events even without the logging. This scheduled self-healing occurs automatically during the next system reboot.
To align more closely with the industry and with continuing customer feedback, beginning in March 2022, Dell PowerEdge BIOS updates change the "Correctable Error Logging" BIOS setting to Disabled by default. Customers who want to continue seeing correctable memory threshold events can re-enable this BIOS option. The BIOS versions that include this change are:
- 14G Intel platforms: BIOS versions 2.13.3 or newer
- 15G AMD platforms: BIOS versions 2.6.5 or newer
- 15G Intel platforms: BIOS versions 1.5.5 or newer
The benefits of DDR4 DIMM self-heal via a system reboot:
- Enables repair of a DDR4 DIMM without removal from the system; all Dell-sourced DDR4 DIMMs support memory self-heal. Note: 14G AMD PowerEdge servers do not have this self-healing capability.
- Utilizes available spare rows architected into the DRAM, where a bad row is permanently replaced with a known good row by electrical fusing.
- The subsequent memory retrain optimizes the “data eyes” by recalibrating the center points to ensure that the memory bus operates at the highest level of signaling integrity.
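The BIOS version thresholds above can be checked programmatically when auditing a fleet. The following is a minimal Python sketch, not a Dell-provided tool: the platform labels used as dictionary keys and the function names are hypothetical, and the BIOS version string is assumed to be already available from your own inventory data.

```python
# Minimum BIOS versions (taken from this article) in which
# "Correctable Error Logging" is Disabled by default.
# The dictionary keys are hypothetical labels chosen for this sketch.
MIN_BIOS_VERSION = {
    "14G Intel": (2, 13, 3),
    "15G AMD": (2, 6, 5),
    "15G Intel": (1, 5, 5),
}


def parse_version(text):
    """Turn a dotted version string such as '2.13.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def logging_disabled_by_default(platform, bios_version):
    """Return True if this BIOS version already defaults the setting to Disabled.

    Raises KeyError for platforms this article does not list.
    """
    return parse_version(bios_version) >= MIN_BIOS_VERSION[platform]


if __name__ == "__main__":
    # 2.5.4 introduced the setting but predates the default change.
    print(logging_disabled_by_default("14G Intel", "2.5.4"))
    print(logging_disabled_by_default("15G AMD", "2.6.5"))
```

Versions are compared as integer tuples rather than as strings because a plain string comparison would incorrectly rank "2.5.4" above "2.13.3".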
For correctable threshold events:
- With the "Correctable Error Logging" BIOS setting Enabled: if memory threshold events occur, Dell recommends rebooting at the customer’s regular maintenance schedule to allow the scheduled memory self-healing or self-correction to occur. After the reboot, successful or unsuccessful self-healing events are logged for the associated DIMMs.
- With the "Correctable Error Logging" BIOS setting Disabled: Dell recommends rebooting at the customer’s regular maintenance schedule. Upon reboot, any scheduled self-healing operations run automatically. If the self-healing or self-correction operation was unsuccessful, the system logs an event (MEM0805 or MEM7114 type events) that recommends physically replacing the affected DIMM.
Recommendation:
Dell EMC Memory Engineering recommends that PowerEdge server customers on older BIOS versions (BIOS releases prior to the March 2022 block) change the "Correctable Error Logging" BIOS setting to Disabled. This eliminates the sporadic correctable memory threshold events (such as MEM0802 or MEM5104 type events) across their server infrastructure that recommend server reboots to allow self-healing or self-correction to occur. As mentioned previously, any scheduled self-healing or self-correction operations run automatically when the server is rebooted, and any failures are reported.
The "Correctable Error Logging" BIOS setting can be changed either by rebooting the server to F2 Settings or via the iDRAC GUI.
To change the BIOS setting using F2 Settings:
1. Reboot the server and stop at F2 Settings.
2. Under BIOS Settings -> Memory Settings, change "Correctable Error Logging" to Disabled.
3. Save the BIOS settings and exit F2 Settings.
To change the BIOS setting using the iDRAC GUI:
1. Log in to the iDRAC GUI.
2. Under Configuration -> BIOS Settings, expand the Memory Settings section.
3. Change the "Correctable Error Logging" setting to Disabled.
4. Click the Apply button to save the Memory Settings.
5. Select either the Apply and Reboot button (to reboot immediately) or the At Next Reboot button to apply the BIOS changes.
Existing memory-related KB articles and whitepapers will be updated to reflect this recommended change.
NOTE: The approved customer-facing messaging is attached as a file to this article: "Managing Correctable Error Notices Dec 2021 v1.pdf".
This article will be updated as new information becomes available.
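For teams that triage memory events from exported logs, the guidance in this article can be summarized as a small lookup. The following is a minimal Python sketch under stated assumptions: only the four event IDs named in this article are covered, the function name and the returned action strings are hypothetical, and real logs may contain other memory event IDs that this sketch deliberately leaves unclassified.

```python
# Event IDs named in this article. Other memory event IDs exist;
# this sketch does not attempt to classify them.
CORRECTABLE_THRESHOLD_EVENTS = {"MEM0802", "MEM5104"}
SELF_HEAL_FAILED_EVENTS = {"MEM0805", "MEM7114"}


def recommended_action(event_id):
    """Map a memory event ID to the action this article recommends."""
    event_id = event_id.strip().upper()
    if event_id in SELF_HEAL_FAILED_EVENTS:
        # Self-healing ran at reboot and was unsuccessful.
        return "replace the affected DIMM"
    if event_id in CORRECTABLE_THRESHOLD_EVENTS:
        # Self-healing is scheduled and runs at the next reboot.
        return "reboot at the regular maintenance schedule"
    return "unclassified"


if __name__ == "__main__":
    for event in ("MEM0802", "MEM7114", "MEM9999"):
        print(event, "->", recommended_action(event))
```

The failure events are checked first so that a DIMM replacement recommendation always takes priority over a routine reboot recommendation.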