...
Client computers perform slowly. Specific jobs, particularly those running on the cluster, either fail or take longer than expected.
Performance issues are typically due to network traffic, network configuration issues, client or cluster processing load, or a combination thereof. This article describes several effective ways to troubleshoot performance issues.
Table of Contents:
  Using Isilon InsightIQ
  Troubleshooting without InsightIQ
  Network throughput
  Distribution of client connections
  SmartConnect
  Cluster throughput
  Cluster processing
  Queued operations
  CPU

Using Isilon InsightIQ

Using Isilon InsightIQ is the best way to monitor performance and to troubleshoot performance issues. The Isilon InsightIQ virtual appliance enables you to monitor and analyze Isilon cluster activity through flexible, customizable chart views in the InsightIQ web-based application. These charts provide detailed information about cluster hardware, software, and file system and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, enabling you to quickly diagnose bottlenecks and optimize workflows. For details on using InsightIQ, see the InsightIQ User Guide.

Troubleshooting without InsightIQ

If you are not using InsightIQ, you can run various commands to investigate performance issues. Troubleshoot performance issues first by examining network and cluster throughput, then by examining cluster processing, and finally by examining individual node CPU rates.

Network throughput

Use a network testing tool such as Iperf to determine the throughput capabilities of the cluster and client computers on your network. Using Iperf, run the following commands on the cluster and on the client. These commands define a window size that is large enough to reveal whether the network link is a potential cause of latency issues.

On the cluster:

  iperf -s -w 262144

On the client, where <cluster_IP> is the IP address of the node running the Iperf server:

  iperf -c <cluster_IP> -w 262144
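For illustration, a test run might look like the following. This is a sketch that assumes Iperf 2.x and a node reachable at 10.1.1.10 (a placeholder address); adjust the address, duration, and window size for your environment.

  # On one cluster node, start an Iperf server with a 256 KB TCP window:
  iperf -s -w 262144

  # On the client, run a 10-second test against that node; adding -r
  # repeats the test in the reverse direction, which can expose
  # asymmetric problems such as a duplex mismatch:
  iperf -c 10.1.1.10 -w 262144 -t 10 -r

On a healthy 1 Gb Ethernet link, the reported bandwidth should approach wire speed (roughly 900 Mbits/sec or more); results far below the link speed suggest that the network, rather than the cluster, is the bottleneck. Repeating the test against each node can also expose a single misconfigured interface.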
Distribution of client connections

Check how many NFS and SMB clients are connected to the cluster to ensure that they are not favoring one node.

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Run the following command to check NFS clients:

   isi statistics query --nodes=all --stats=node.clientstats.connected.nfs,node.clientstats.active.nfs

   The output displays the number of clients connected per node and how many of those clients are active on each node.
3. Run the following command to check SMB clients:

   isi statistics query --nodes=all --stats=node.clientstats.connected.smb,node.clientstats.active.smb1,node.clientstats.active.smb2

   The output displays the same information for SMB clients.

SmartConnect

Check to ensure that the node that SmartConnect is running on is not burdened with network traffic.

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Run the following command:

   isi_for_array -sq 'ifconfig | grep em -A3'

   The output displays a list of all the IP addresses that are bound to the external interface.
3. Check for any nodes that have one more IP address than the rest.
4. Check the status of the nodes that you noticed in step 3 by running the following command:

   isi status

5. Check the throughput column of the output to determine the load on the nodes noticed in step 3.

Cluster throughput

Assess cluster throughput by conducting write and read tests that measure the amount of time it takes to read from and write to a file. Conduct at least one write test and one read test, as follows.

Write test

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Change to the /ifs directory:

   cd /ifs

3. From the command-line interface (CLI) on the cluster or from a UNIX or Linux client computer, use the dd command to write a new file to the cluster:

   dd if=/dev/zero of=1GBfile bs=1024k count=1024

   This command creates a sample 1 GB file and reports the amount of time it took to write it to disk. From the output of this command, extrapolate how many MB per second can be written to disk in single-stream workflows. For example, if dd reports that 1073741824 bytes were transferred in 8.5 seconds, single-stream write throughput is roughly 1024 MB / 8.5 s, or about 120 MB/s.
4. If you have a Mac client and want to conduct further analysis, start Activity Monitor, and then run the following command, where pathToFile is the file path of the targeted file:

   cat /dev/zero > /pathToFile

   This command helps measure the throughput of write operations on the Isilon cluster. (Although it is possible to run the dd command from a Mac client, results can be inconsistent.) Monitor the results of the command in the Activity Monitor's Network tab.

Read test

When measuring the throughput of read operations, do not conduct read tests on the file that you created during the write test. Because that file has been cached, the results of your read tests would be inaccurate. Instead, test a read operation on a file that has not been cached: find a file on the cluster that is larger than 1 GB, and reference that file in the read test.

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. From the CLI on the cluster or from a UNIX or Linux client computer, use the dd command to read a file on the cluster. Run the following command, where pathToLargeFile is the file path of the targeted file:

   dd if=/pathToLargeFile of=/dev/null bs=1024k

   This command reads the targeted file and reports the amount of time it took to read it.
3. If you have a Mac client and want to conduct further analysis, start Activity Monitor, and then run the following command, where pathToLargeFile is the file path of the targeted file:

   time cp /pathToLargeFile /dev/null

   This command helps measure the throughput of read operations on the Isilon cluster. (Although it is possible to run the dd command from a Mac client, results can be inconsistent.) Monitor the results of the command in the Activity Monitor's Network tab.

Cluster processing

Restripe jobs

Before examining input/output (I/O) operations per second (IOPS) on the cluster:

1. Determine which jobs are running on the cluster. If restripe jobs such as AutoBalance, Collect, or MultiScan are running, consider why those jobs are running and whether they should continue to run.
2. Consider the type of data being consumed. If client computers are working with large video files or virtual machines (VMs), a restripe job requires a higher amount of disk IOPS than normal.
3. Consider temporarily pausing a restripe job. Doing so can significantly improve performance and might be a viable short-term solution to a performance issue.

Disk I/O

Examining disk I/O can help determine whether certain disks are being overused.

By cluster

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Run the following command to ascertain disk I/O:

   isi statistics pstat

3. From the output of this command, divide the disk IOPS by the total number of disks in the cluster. For example, for an 8-node cluster using Isilon IQ 12000x nodes, which host 12 drives per node, divide the disk IOPS by 96.

For X-Series and NL-Series nodes, you should expect to see disk IOPS of 70 or less for 100% random workflows, or disk IOPS of 140 or less for 100% sequential workflows. Because NL-Series nodes have less RAM and lower CPU speeds than X-Series nodes, X-Series nodes can handle higher disk IOPS.
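To make the arithmetic concrete, the following sketch computes per-disk IOPS. The node count, drives-per-node figure, and total IOPS are placeholder values; replace them with numbers from your own cluster and from the isi statistics pstat output.

  # Per-disk IOPS calculation (placeholder values):
  NODES=8             # nodes in the cluster
  DRIVES_PER_NODE=12  # drives per node (12 for Isilon IQ 12000x)
  CLUSTER_IOPS=7200   # total disk IOPS reported by "isi statistics pstat"

  echo $((CLUSTER_IOPS / (NODES * DRIVES_PER_NODE)))
  # 7200 / 96 = 75 IOPS per disk, which is above the ~70 guideline
  # for a 100% random workflow, so the disks may be a bottleneck.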
By node and by disk

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Run the following command to ascertain disk IOPS by node, which can help discover disks that are overused:

   isi statistics query --nodes=all --stats=node.disk.xfers.rate.sum --top

3. Run the following command to determine how to query for statistics on a per-disk basis:

   isi statistics describe --stats=all | grep disk

Queued operations

Another way to determine whether disks are being overused is to check how many operations are queued for each disk in the cluster. For a single-stream SMB-based workflow, a queue depth of 4 can indicate an issue, while high-concurrency NFS namespace workflows normally sustain a deeper queue before it indicates a problem.

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Run the following command to determine how many operations are queued for each disk in the cluster:

   isi_for_array -s sysctl hw.iosched | grep total_inqueue

3. Determine the latency caused by the queued operations:

   sysctl -aN hw.iosched | grep bios_inqueue | xargs sysctl -D

CPU

CPU issues are frequently traced to the operations that clients perform on the cluster. Using the isi statistics command, you can determine the operations performed on the cluster, cataloged by either network protocol or client computer.

1. Open an SSH connection on any node in the cluster and log in using the "root" account.
2. Run the following command to determine which operations are being performed across the network and to assess which of those operations are taking the most time:

   isi statistics protocol --orderby=TimeAvg --top

   This command outputs detailed statistics for all network protocols, ordered by how long the cluster takes to respond to clients. Although the results may not identify the slowest operation outright, they can point you in the right direction.
3. Run the following command to obtain more information about CPU processing, such as which nodes' CPUs are the most heavily used:

   isi statistics system --top

4. Run the following command to obtain the four processes on each node that are consuming the most CPU resources:

   isi_for_array -sq 'top -d1 | grep PID -A4'
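When a slowdown is intermittent, it can help to capture these statistics at intervals so that a busy period can be compared against a quiet baseline. The following is a minimal sketch that reuses the commands above; it assumes the isi statistics subcommands print a single report when --top is omitted (their default behavior), and the output path and 60-second interval are arbitrary choices.

  # Capture a timestamped performance snapshot every 60 seconds (Ctrl-C to stop).
  # The --top option refreshes continuously, so it is omitted here in favor of
  # single-shot output that can be redirected to a file.
  mkdir -p /ifs/data/perfsnaps    # arbitrary location; adjust as needed
  while true; do
      TS=$(date +%Y%m%d-%H%M%S)
      {
          isi statistics protocol --orderby=TimeAvg
          isi statistics system
      } > /ifs/data/perfsnaps/stats-$TS.txt 2>&1
      sleep 60
  done

Comparing a snapshot taken during a slowdown against one from a known-good period often makes the offending protocol operation or node stand out.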