Performance issues reported by clients range from slow response up to and including data unavailability. Some troubleshooting methods impede quick time to resolution by focusing on collecting symptom data. Instead, during a live engagement look for a "common cause," characterized by all or most of the following patterns:

- Slow responses and timeouts in the admin WebUI and CLI commands, especially commands requiring node statistics, with many errors per minute in the messages, job, and celog (OneFS 7.x) or tardis (OneFS 8.x) logs, found by running:
    isi_for_array -s "tail /var/log/messages"
  which displays messages such as:
    "Failed to send worker count stats"
    "No response from isi_stats_d after 5 secs"
    "Error while getting response from isi_celog_coalescer" (OneFS 7.x), or
    "Unable to get response from…"
- Hang dumps, especially those spanning multiple hours, identified in live messages on any node with:
    grep -A1 "LOCK TIMEOUT AT" /var/log/messages
    isi_hangdump: Initiating hangdump….
- Intermittent high CPU load, displayed as 1-, 5-, and 15-minute CPU load averages with:
    isi_for_array -s uptime
- High memory utilization by one or more services on one or more nodes, displayed by:
    isi_for_array -s "ps -auwx"
- Service timeouts and stack traces indicating memory exhaustion and service or "Swatchdog" timeouts, displayed by:
    isi_for_array -s "grep -B1 -A3 Stack /var/log/messages"
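The log signatures above can be checked in a single pass rather than eyeballing the tail output. A minimal sketch, assuming a local copy of /var/log/messages (on a live cluster you would wrap the equivalent grep in isi_for_array); the function name and output format are illustrative, not part of OneFS:

```shell
# scan_signatures FILE
# Prints a per-signature match count for the common-cause log messages.
scan_signatures() {
    log="$1"
    for sig in \
        "Failed to send worker count stats" \
        "No response from isi_stats_d" \
        "Error while getting response from isi_celog_coalescer" \
        "Unable to get response from" \
        "LOCK TIMEOUT AT" \
        "Initiating hangdump"
    do
        # grep -c prints 0 (and exits nonzero) when there are no matches
        count=$(grep -c "$sig" "$log")
        printf '%s %s\n' "$count" "$sig"
    done
}
```

A high count on any of these signatures, growing minute over minute, is the pattern this article treats as a "common cause" indicator.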
These complex symptoms indicate node resource exhaustion, possibly caused by long wait times, locking, and an unbalanced workflow exceeding one or more nodes' capabilities, including one of several known causes:

- Uptime issues
- IB hardware or switch failures
- Unresponsive BMC or CMC Gen 5 controller
- SyncIQ policy configured when-source-modified on an active path
- SyncIQ job degrading performance after an upgrade to 8.x with default SyncIQ Performance Rules and a limited replication IP pool
- isi commands blocking on drive_purposing.lock

This article recommends quickly identifying or eliminating these known performance disruptors before proceeding with more detailed symptom troubleshooting.
Depending on the workflow, the timeouts, memory exhaustion, and stack traces may occur on one service more than another, such as lwio for SMB. When a pattern similar to the above is present, before following the troubleshooting guides for a particular service (collecting lwio cores, and so on), run the following commands and record the outcome in a case comment to indicate or eliminate a "common cause."

uname -a
Interpret:
- Susceptible to the 248- or 497-day uptime issue: OneFS v7.1.0.0-7.1.0.6, 7.1.1.0-7.1.1.5, 7.2.0.0-7.2.0.3, and 7.2.1.0
- Susceptible to the drive_purposing.lock condition: OneFS v7.1.1.0-7.1.1.9, 7.2.1.0-7.2.1.2 and below, and 8.0.0.0

isi_for_array -s uptime
Interpret: Uptime at or about 248 or 497 days on a version that uname -a flagged as susceptible indicates the uptime issue.

isi status
Interpret: If the command runs slowly, or statistics time out or display "N/A N/A N/A", isi_stats_d is not communicating on one or more nodes. If uname indicates susceptibility to drive_purposing.lock, close multiple WebUI instances and run:
    isi_for_array "killall isi_stats_d"

isi_for_array -s "tail /var/log/ethmixer.log"
Interpret: Many state changes, with ports registering as down (a change that does not report "is alive"), indicate an IB hardware problem: cable, card, or switch. A lack of IB errors in the ethmixer log combined with intermittent statistics and service failures suggests refocusing on SyncIQ or the job engine.

/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware
Interpret: If the nodes are Gen 5 (X- or NL-series 210 and 410, HD400) and OneFS is below v8.0.0.4, the cluster is susceptible. No version output means the controller is unresponsive. A responding controller does not fully eliminate this as a cause, because isi_stats_d and dependent services can fail while the controller is unresponsive, and the controller can then restart without those services recovering.
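The 248/497-day check against the uptime output can be automated. A hedged sketch, assuming the standard "up N days" wording in each node's uptime line; the 7-day window, function names, and sample node line are illustrative assumptions, not from OneFS:

```shell
# extract_days "UPTIME_LINE"
# Pulls the day count from a line such as:
#   node-1:  10:02AM  up 247 days,  3:12, 2 users, load averages: ...
# Prints nothing if the line has no "up N days" portion (uptime < 1 day
# or the singular "1 day" form).
extract_days() {
    printf '%s\n' "$1" | sed -n 's/.* up \([0-9][0-9]*\) days.*/\1/p'
}

# near_uptime_bug DAYS
# Succeeds (exit 0) if DAYS is within 7 days of either counter threshold
# (248 or 497), flagging a node as suspect on a susceptible OneFS version.
near_uptime_bug() {
    days="$1"
    for limit in 248 497; do
        diff=$((limit - days))
        [ "$diff" -le 7 ] && [ "$diff" -ge -7 ] && return 0
    done
    return 1
}
```

Feeding each line of `isi_for_array -s uptime` through these helpers turns the "at or about 248 or 497 days" interpretation into a yes/no per node.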
isi sync policies list -v | grep Schedule
Interpret: If any schedule shows "when-source-modified", note the policy and disable it:
    isi sync policies modify <policy-name> --enabled false

isi sync rule list (for uname indicating 8.x only)
Interpret: If the display is blank, no Performance Rules exist and SyncIQ runs with the 8.x defaults.

isi sync policies list -v | grep -A1 "Source Subnet"
Interpret: If blank, there are no IP pool restrictions. If subnet and pool restrictions are listed and the default SyncIQ Performance Rules are in effect, nodes participating in the limited replication pool may be overtaxed during SyncIQ jobs.
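The when-source-modified check can also be scripted against the verbose policy listing so the policy names are captured along with the finding. A minimal sketch; the "Name:" / "Schedule:" field layout is an approximation of the `isi sync policies list -v` output and should be verified against a real cluster:

```shell
# flag_wsm
# Reads a verbose policy listing on stdin and prints the name of each
# policy whose schedule is when-source-modified. Assumes a "Name:" field
# precedes the corresponding "Schedule:" field for each policy.
flag_wsm() {
    awk '
        /^ *Name:/                            { name = $2 }
        /^ *Schedule:.*when-source-modified/  { print name }
    '
}
```

Usage on a cluster would be along the lines of `isi sync policies list -v | flag_wsm`; each printed name is a candidate for the disable step above.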