...
After upgrading a cluster to OneFS 9.1.0.20, 'isi status' shows all the PowerScale nodes in Read-Only (RO) mode:

Node Pool Name: f600_60tb-ssd_384gb
Protection: +2d:1n

Pool Storage:   HDD         SSD Storage
Size:           0 (0 Raw)   0 (0 Raw)
VHS Size:       0.0
Used:           0 (n/a)     0 (n/a)
Avail:          0 (n/a)     0 (n/a)

                         Throughput (bps)      HDD Storage       SSD Storage
Name               Health|  In |  Out|Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
            123|n/a |-A-R |938.7| 9.9M| 9.9M|(No Storage HDDs)|(No Storage SSDs)
            124|n/a |-A-R |    0| 9.9M| 9.9M|(No Storage HDDs)|(No Storage SSDs)
            125|n/a |-A-R |    0|10.8M|10.8M|(No Storage HDDs)|(No Storage SSDs)
            126|n/a |-A-R |    0| 9.9M| 9.9M|(No Storage HDDs)|(No Storage SSDs)
            127|n/a |-A-R | 1.4k| 9.9M| 9.9M|(No Storage HDDs)|(No Storage SSDs)
            128|n/a |-A-R |    0| 7.9M| 7.9M|(No Storage HDDs)|(No Storage SSDs)
            129|n/a |-A-R |    0| 7.9M| 7.9M|(No Storage HDDs)|(No Storage SSDs)
            130|n/a |-A-R |    0| 7.3M| 7.3M|(No Storage HDDs)|(No Storage SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
f600_60tb-ssd_384gb| OK  |293.3| 9.2M| 9.2M|(No Storage HDDs)|(No Storage SSDs)

You will see entries similar to the following in the /var/log/messages file for the affected nodes:

2022-07-26T01:40:46+02:00 (id92) isi_testjournal: NVDIMM is persistent
2022-07-26T01:40:46+02:00 (id92) isi_testjournal: NVDIMM armed for persistent writes
2022-07-26T01:40:47+02:00 (id92) ifconfig: Configure: /sbin/ifconfig ue0 netmask 255.255.255.0 169.254.0.40
2022-07-26T01:40:47+02:00 (id92) dsm_ism_srvmgrd[2056]: ISM0000 [iSM@674.10892.2 EventID="8716" EventCategory="Audit" EventSeverity="info" IsPastEvent="false" language="en-US"] The iDRAC Service Module is started on the operating system (OS) of server.
2022-07-26T01:40:47+02:00 (id92) dsm_ism_srvmgrd[2056]: ISM0003 [iSM@674.10892.2 EventID="8196" EventCategory="Audit" EventSeverity="error" IsPastEvent="false" language="en-US"] The iDRAC Service Module is unable to discover iDRAC from the operating system of the server.
2022-07-26T01:44:15+02:00 (id92) isi_testjournal: PowerTools Agent Query Exception: Timeout (20 sec) exceeded for request http://127.0.0.1:8086/api/PT/v1/host/sensordata?sensorSelector=iDRAC.Embedded.1%23SystemBoardNVDIMMBattery&sensorType=DellSensor data: HTTPConnectionPool(host='127.0.0.1', port=8086): Read timed out. (read timeout=20)
2022-07-26T01:44:20+02:00 (id92) isi_testjournal: Query to PowerTools Agent for NVDIMM Battery failed
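To identify which nodes have hit this condition, you can search each node's /var/log/messages for the failure signature above. This is a minimal sketch using the standard isi_for_array utility; the exact log text can vary between releases, so adjust the search string to match your logs:

# isi_for_array -s 'grep "Query to PowerTools Agent for NVDIMM Battery failed" /var/log/messages | tail -1'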
The issue appears to be related to changes in the NVDIMM status monitoring code introduced in OneFS 9.1.0.19, which can cause a timeout during the initial NVDIMM status query at startup, putting the node into read-only mode. Even though subsequent status queries succeed, the node does not automatically return to read-write (RW) mode. OneFS 9.2.x and later releases are not affected by this issue.
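To confirm that the PowerTools Agent now answers the sensor query within the monitor's 20-second budget, you can replay the request from the timeout message above. This is a minimal sketch that assumes curl is available on the node; the URL and the 20-second limit are taken from the log entry, and the supported way to issue this query is the pta_call command shown in the verification steps below:

# curl -sS -X POST --max-time 20 'http://127.0.0.1:8086/api/PT/v1/host/sensordata?sensorSelector=iDRAC.Embedded.1%23SystemBoardNVDIMMBattery&sensorType=DellSensor'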
To verify that the NVDIMM is healthy and that you are running into the issue described in this KB, run the following four commands:

# isi_hwmon -b NVDIMMHealthMonitoring
# isi_hwmon -b NVDIMMPersistence
# /opt/dell/DellPTAgent/tools/pta_call get agent/info
# /opt/dell/DellPTAgent/tools/pta_call post "host/sensordata?sensorSelector=iDRAC.Embedded.1%23SystemBoardNVDIMMBattery&sensorType=DellSensor"

These are query-only commands and should be considered nondisruptive. For a node in this state, the output should be similar to the following:

# isi_hwmon -b NVDIMMHealthMonitoring
DIMM SLOT A7: OK

# isi_hwmon -b NVDIMMPersistence
NVDIMM Index 0
State: PERSISTENT
Vendor Serial ID: xxxxxxxxx
Correctable ECC Count: 0
Uncorrectable ECC Count: 0
Current Temp: 255
Health: 0
NVM Lifetime: 90
Warning Threshold Status: 0
Error Threshold Status: 0
Health Info Status: 0
Critical Health Info: 0
Critical Info Status: 0
Last Save Status: 0
Last Restore Status: 0
Last Flush Status: 0
Armed: 1
SMART/Health Events Observed: 0
FW Health Monitoring: 1
NVDIMM Mapped: 1

# /opt/dell/DellPTAgent/tools/pta_call post "host/sensordata?sensorSelector=iDRAC.Embedded.1%23SystemBoardNVDIMMBattery&sensorType=DellSensor"
Request sent to DellPTAgent @ http://127.0.0.1:8086 [127.0.0.1]
{
    "HealthState": "OK",
    "EnabledState": "Enabled",
    "ElementName": "System Board NVDIMM Battery",
    "SensorType": "Other",
    "Id": "iDRAC.Embedded.1_0x23_SystemBoardNVDIMMBattery",
    "CurrentState": "Good"
}
Response: status: 200 [OK], size: 223 bytes, latency: 0.034 seconds.

# /opt/dell/DellPTAgent/tools/pta_call get agent/info
Request sent to DellPTAgent @ http://127.0.0.1:8086 [127.0.0.1]
{
    "idrac_ethernet_ip": "0.0.0.0",
    "servicetag": "xxxxx",
    "uptime": "2511 seconds ( 41 minutes 51 seconds )",
    "status": {
        "agent": "OK",
        "idracConnection": "OK",
        "idraccache": "OK",
        "iSM": "N/A"
    },
    "name": "ClusterName-123",
    "MarvellLibraryVersion": "Not loaded",
    "system_uuid": "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "default_server_cert": "true",
    "rest_endpoints": "http://127.0.0.1:8086 [127.0.0.1]",
    "ptagentversion": "2.5.6-4",
    "domain": "",
    "host_epoch_time": "xxxxxxxxxx.354221 (secs.usecs)",
    "os_version": "9.1.0.0",
    "mfr": "Dell Inc.",
    "process_id": "2071",
    "api_blocking_enabled": "false",
    "host_pass_thru_ip": "xxx.xxx.xxx.xxx",
    "model": "PowerScale F600",
    "idrac_pass_thru_ip": "xxx.xxx.xxx.xxx",
    "os": "Isilon OneFS",
    "ism_version": "dell-dcism-3.4.6.13_7"
}
Response: status: 200 [OK], size: 871 bytes, latency: 0.009 seconds.

The output given here is an example, and your output may vary. The important part is that the output looks similar and that you do not get an error message in its place.

- If there is any indication of communication issues or errors, continue troubleshooting the issue, engaging a hardware L2/SME and/or the PowerEdge support team as needed.
- If the output indicates that the NVDIMM is in a good state and there are no issues, you can manually clear the RO state using the following command:

# /usr/bin/isi_hwtools/isi_read_only --unset=system-nvdimm-failed

Once you have applied the corrective step, monitor the node for approximately 10 minutes to make sure that it does not go back into RO mode (a sample watch loop is sketched below). If the node is power cycled or rebooted, this issue may occur again, and the workaround may need to be reapplied.

PowerScale Engineering is aware of this issue and is investigating mitigation steps to be implemented in an upcoming OneFS 9.1 release.
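For the roughly 10-minute watch, a simple loop like the one below can be left running from any node. This is a minimal sketch, not a supported tool: it assumes that a read-only node shows the R flag in its health column (the '-A-R' seen in the isi status output above), and the pattern may need adjusting for your release:

# for i in 1 2 3 4 5 6 7 8 9 10; do date; isi status | grep -- '-A-R' || echo "No nodes reporting read-only"; sleep 60; done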
In the meantime, to permanently resolve this issue, you can upgrade the cluster to OneFS 9.2 or newer.