...
BugZero found this defect 91 days ago.
An anomaly in the firmware for certain SSDs causes the drive to reboot. The reboots are periodic, recurring every 30 to 60 minutes. The issue occurs only after the drive has been in use for more than 56,000 hours (i.e., 6.5 years), as shown in the Power On Hours (POH) field.

When the drive is used for OS storage, the issue may crash the kernel and cause a BSOD. In very rare cases, key Operating System (OS) files may be corrupted when the drive stalls, and the OS will need to be reimaged in order to recover.

When the drive is used for data storage, a drive drop may be detected and reported. The RAID volume will attempt a rebuild, which can reduce performance. In addition, if the drive reboots before the rebuild completes, the rebuild fails and restarts, and the cycle repeats without the RAID volume ever recovering. If multiple disks in the RAID fail at the same time, data may be recoverable only by restoring from backup media.

The HPE Smart Storage Administrator (SSA) utility displays the POH details in years. For more information, refer to HPE Smart Storage Administrator (SSA) – Quick Guide to Determine SSD Power on Hours.

Note: This drive information can also be found from the HPE Smart Storage Administrator (SSA) CLI, the HPE MegaRAID Storage Administrator (MRSA) User Guide, or the Integrated Lights-Out (iLO) GUI. If necessary, access the HPE Support Center for information on how to check the POH data. Each tool may display the POH information differently.
For example, to determine whether a drive has been in use for more than 6.5 years (3,418,669.8 minutes), use the following commands and check the output (shown here with smartctl):

alf-rmc-sdf-p0:~ # smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device

alf-rmc-sdf-p0:~ # smartctl -d megaraid,6 -a /dev/sda
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.111-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              EG001200JWJNQ
Revision:             HPD0
Compliance:           SPC-4
User Capacity:        1,200,243,695,616 bytes [1.20 TB]
Logical block size:   512 bytes
Rotation Rate:        10500 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500a046dae3
Serial number:        WFK019WN
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Aug 14 06:35:01 2024 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     39 C
Drive Trip Temperature:        60 C
Accumulated power on time, hours:minutes 31326:50
Manufactured in week 34 of year 2017
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  506
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1830
Elements in grown defect list: 0

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   064   044    Pre-fail  Always       -       50079291
  3 Spin_Up_Time            0x0003   095   095   070    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   064   030    Pre-fail  Always       -       1358777742
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       43804
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
180 Unknown_HDD_Attribute   0x003b   100   100   030    Pre-fail  Always       -       48312991
194 Temperature_Celsius     0x0022   039   047   000    Old_age   Always       -       39 (0 18 0 0 0)
196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
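The POH check against the 56,000-hour threshold can be scripted. The following is a minimal Python sketch, not an HPE-provided tool: it assumes the smartctl report contains the "Accumulated power on time, hours:minutes" line shown in the output above, and the function names parse_power_on_hours and exceeds_threshold are hypothetical names chosen for illustration.

```python
import re

# Threshold stated in this advisory (hours of use before the issue can occur).
POH_THRESHOLD_HOURS = 56_000


def parse_power_on_hours(smartctl_output):
    """Return power-on hours as a float, or None if the line is absent.

    Assumption: the report contains a line such as
    'Accumulated power on time, hours:minutes 31326:50'.
    """
    match = re.search(
        r"Accumulated power on time, hours:minutes\s+(\d+):(\d+)",
        smartctl_output,
    )
    if match is None:
        return None
    hours, minutes = int(match.group(1)), int(match.group(2))
    return hours + minutes / 60


def exceeds_threshold(hours, threshold=POH_THRESHOLD_HOURS):
    """True when the drive has been in use for more than the threshold."""
    return hours > threshold


# Sample line copied from the smartctl output in this advisory.
sample = "Accumulated power on time, hours:minutes 31326:50\n"
hours = parse_power_on_hours(sample)
print(f"{hours:.1f} hours, over threshold: {exceeds_threshold(hours)}")
```

In practice the text passed to parse_power_on_hours would be the captured output of a command such as smartctl -d megaraid,6 -a /dev/sda. Other tools (SSA, iLO) report POH in different formats, so this parser would not apply to them unchanged.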
Any HPE server that supports the following SSDs is affected when the Power On Hours (POH) field entry exceeds 56,000:

Drive Model     Description                                                                      P/N
EO000400JWDKP   HPE 400GB SAS 12G Write Intensive 3yr Wty EO000400JWDKP Solid State Drive (SSD)  873351-B21
EO000800JWDKQ   HPE 800GB SAS 12G Write Intensive 3yr Wty EO000800JWDKQ SSD                      873355-B21
EO001600JWDKR   HPE 1.6TB SAS 12G Write Intensive 3yr Wty EO001600JWDKR SSD                      873357-B21
MO000400JWDKU   HPE 400GB SAS 12G Mixed Use MO000400JWDKU SSD                                    873359-B21
MO000800JWDKV   HPE 800GB SAS 12G Mixed Use MO000800JWDKV SSD                                    873363-B21
MO001600JWDLA   HPE 1.6TB SAS 12G Mixed Use MO001600JWDLA SSD                                    873365-B21
MO003200JWDLB   HPE 3.2TB SAS 12G Mixed Use MO003200JWDLB SSD                                    873367-B21

Note: HPE Gen10 systems do not support MegaRAID (MR) controllers or VROC; they support the HPE MCHP series of Smart Array controllers instead. (VROC support starts with HPE Gen10 Plus systems.)
To correct and prevent this issue, update the SSD firmware components to version HPD3 for each of the drive models:

  Online HDD/SSD Flash Component for Windows (x64) - EO000400JWDKP, EO000800JWDKQ, EO001600JWDKR, MO000400JWDKU, MO000800JWDKV, MO001600JWDLA and MO003200JWDLB Drives
  Online HDD/SSD Flash Component for Linux (x64) - EO000400JWDKP, EO000800JWDKQ, EO001600JWDKR, MO000400JWDKU, MO000800JWDKV, MO001600JWDLA and MO003200JWDLB Drives
  Online HDD/SSD Flash Component for VMware ESXi - EO000400JWDKP, EO000800JWDKQ, EO001600JWDKR, MO000400JWDKU, MO000800JWDKV, MO001600JWDLA and MO003200JWDLB Drives
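The scope and resolution above can be combined into a simple per-drive triage. The Python sketch below is illustrative only and rests on stated assumptions: that firmware revisions follow the pattern HPD<n>, and that any revision numbered 3 or higher contains the fix (the advisory itself names only HPD3). The function names and the returned status strings are hypothetical.

```python
import re

# Drive models listed in this advisory.
AFFECTED_MODELS = {
    "EO000400JWDKP", "EO000800JWDKQ", "EO001600JWDKR",
    "MO000400JWDKU", "MO000800JWDKV", "MO001600JWDLA", "MO003200JWDLB",
}
FIXED_REVISION = "HPD3"        # firmware version named in the resolution
POH_THRESHOLD_HOURS = 56_000   # threshold stated in this advisory


def revision_number(revision):
    """Return the numeric suffix of a revision such as 'HPD3'.

    Assumption: revisions look like 'HPD<n>' and a larger n is newer.
    """
    match = re.fullmatch(r"HPD(\d+)", revision.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware revision: {revision!r}")
    return int(match.group(1))


def triage(model, revision, power_on_hours):
    """Classify one drive; the status labels are illustrative only."""
    if model.strip() not in AFFECTED_MODELS:
        return "not affected"
    if revision_number(revision) >= revision_number(FIXED_REVISION):
        return "fixed"
    if power_on_hours > POH_THRESHOLD_HOURS:
        return "update urgently"
    return "update recommended"


# Hypothetical inventory rows: (model, firmware revision, power-on hours).
for drive in [("MO000800JWDKV", "HPD0", 57_000),
              ("EO000400JWDKP", "HPD3", 57_000),
              ("EG001200JWJNQ", "HPD0", 31_326)]:
    print(drive[0], "->", triage(*drive))
```

The model and revision values correspond to the Product and Revision fields in the smartctl information section. A drive below the threshold is still flagged for update, since the advisory recommends the firmware both to correct and to prevent the issue.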