How to handle or troubleshoot OS Capacity issues on Avamar.

This Resolution Path article is designed to address and troubleshoot OS Capacity issues on Avamar. For initial concepts and an understanding of OS Capacity, see the training article Avamar: Capacity Management Concepts and Training.

As summarized in that training article, a reasonable understanding of the following topics is required to proceed with the rest of this article:

- A basic understanding of checkpoints (cp), checkpoint validation (hfscheck), and Garbage Collection (GC), and the importance of each
- The difference between GSAN (also called "User Capacity") and OS Capacity
- Checkpoint overhead data
- If any of the data partitions are more than 89% of the total physical OS Capacity space, garbage collection is unable to run
- The closer an Avamar grid is to 100% User Capacity, the less OS Capacity there is available for checkpoint overhead
- Factors which contribute to checkpoint overhead, including asynchronous crunching, the number of checkpoints stored, and the importance of HFSCheck and checkpoint validation
- How to find the OS Capacity levels
- Basic actions to alleviate OS Capacity

It is often easiest to consider OS Capacity as the size of the GSAN data (more specifically, the space allocated for this data) plus the overhead generated by Avamar checkpoints. The greater the number of checkpoints and the higher the change rate, the higher the checkpoint overhead.

Impacts of high OS Capacity can include:

- Garbage Collection failure: GC fails with MSG_ERR_DISKFULL if OS Capacity rises above 89%.
- Backup or replication failure: Backups or incoming replication may fail with MSG_ERR_STRIPECREATE if OS Capacity rises above 90%. (This occurs only if a new data stripe must be created; if a new stripe is not needed, backups and replication may still run successfully.)
- Checkpoint failure: A checkpoint fails with MSG_ERR_DISKFULL if OS Capacity rises above 96%.

As the above indicates, OS Capacity is often the first type of Avamar capacity to address when other Avamar capacities are also high. At the least, Garbage Collection cannot run once OS Capacity reaches the levels above, even when the GSAN or User Capacity is high as well. Generally, OS Capacity is considered high once it exceeds 89%, the point at which GC fails with MSG_ERR_DISKFULL. Below 89%, no maintenance jobs are impacted by OS Capacity.

Note: It is expected that OS Capacity fluctuates throughout the day. Verifying that daily maintenance jobs run smoothly is important and is generally the best way to avoid OS Capacity issues.

Note: While the above describes Avamar OS Capacity, there can also be OS Capacity issues not directly related to the backup data partitions or checkpoints. These involve the disks and partitions where the Linux operating system is installed. While such issues are less common, they can have other impacts, which are discussed below.
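The thresholds above can be checked quickly from the utility node. The following is a minimal sketch, not a supported Dell tool: it assumes that avmaint nodelist (the same command used in step 2 later in this article) reports each data partition's usage as an attribute of the form fs-percent-full="NN.N", and it simply compares the highest value against the 89/90/96 percent levels described above.

#!/bin/bash
# Hedged sketch: find the highest fs-percent-full value reported by 'avmaint nodelist'
# and compare it against the OS Capacity thresholds described above.
# Assumption: fs-percent-full appears as an attribute of the form fs-percent-full="NN.N".
max=$(avmaint nodelist \
  | grep -o 'fs-percent-full="[0-9.]*"' \
  | grep -oE '[0-9]+\.?[0-9]*' \
  | sort -n | tail -1)

echo "Highest data partition usage: ${max}%"
awk -v m="$max" 'BEGIN {
  if      (m > 96) print "Above 96%: checkpoints fail with MSG_ERR_DISKFULL"
  else if (m > 90) print "Above 90%: backups/replication needing a new stripe may fail (MSG_ERR_STRIPECREATE)"
  else if (m > 89) print "Above 89%: garbage collection fails (MSG_ERR_DISKFULL)"
  else             print "Below 89%: no maintenance jobs are impacted by OS Capacity"
}'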
Avamar OS Capacity can increase due to any combination of the following:

- A high change rate of backup data, adding "too much too fast"
- High GSAN or "User Capacity", which leaves less room for checkpoint overhead and can sometimes even result in higher change rates
- A checkpoint failing to complete successfully, shown in the output with the status MSG_ERR_DISKFULL
- A checkpoint validation (hfscheck) that has failed or has not run recently, so that the oldest checkpoints cannot roll off or be removed
- Checkpoints not rolling off for other reasons, including checkpoint retention settings that are too high

High OS Capacity on other disk partitions can arise from various causes, including incorrect data placement, log files growing too large, and so on.

The phrase "too much too fast" as a reason for high OS Capacity can be explained as follows (a generic illustration of the link-based snapshot concept follows this list):

- As background, an Avamar checkpoint is a read-only snapshot that links to the live data. Because it is created with links, a checkpoint uses zero extra disk space immediately after it is created. If there are no changes to the live data, the checkpoint does not use additional space.
- This changes as the live data is modified while the checkpoint remains the same. At that point, there is the original copy of the data in the checkpoint and the updated live copy of the modified data.
- This behavior is entirely by design and intentional, and it is why OS Capacity space is reserved.
- However, if the amount or rate of changed data increases drastically and suddenly, it can cause an uncommon spike in OS Capacity, considered "too much too fast".
- The capacity.sh tool shows this as the cause when its output is compared across several days.
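The zero-cost-until-change behavior can be illustrated with ordinary Linux hard links. This is a generic illustration only and is not how Avamar implements checkpoints internally; the /tmp paths and file sizes are arbitrary.

#!/bin/bash
# Generic illustration of a link-based snapshot: it costs ~no space at creation,
# but once the live copy is replaced, both the old and new blocks are held on disk.
set -e
mkdir -p /tmp/live
dd if=/dev/zero of=/tmp/live/data.bin bs=1M count=100 status=none   # 100 MB of "live" data

cp -al /tmp/live /tmp/checkpoint1      # hard-link "snapshot": consumes almost no extra space
du -shc /tmp/live /tmp/checkpoint1     # total is still ~100M because the blocks are shared

# Simulate changed backup data by replacing the live file with a new inode
dd if=/dev/urandom of=/tmp/live/data.new bs=1M count=100 status=none
mv /tmp/live/data.new /tmp/live/data.bin

du -shc /tmp/live /tmp/checkpoint1     # now ~200M: the snapshot still pins the original blocks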
If maintenance jobs, including Garbage Collection, are failing due to high Avamar OS Capacity, follow these steps:

1. Collect all Avamar capacity information to paint a picture of the situation: Avamar: How to gather the information required to troubleshoot capacity issues.

2. Review how high the OS Capacity is and what actions may be required. From the data collection article, this can be found using the following command:

avmaint nodelist | egrep 'nodetag|fs-percent-full'

The HIGHEST fs-percent-full value shown is the limiting factor for the current OS Capacity. As seen from the Linux operating system, the data partitions storing backup and checkpoint data appear as disks or partitions such as "/data0*", where "*" is a single digit. The number of data partitions depends on the node type, hardware generation, and size, and cannot be changed.

3. Review the number of checkpoints and how recently they have been validated with the command:

cplist

cp.20250310080041 Mon Mar 10 08:00:41 2025 valid rol --- nodes 4/4 stripes 5980
cp.20250310080649 Mon Mar 10 08:06:49 2025 valid --- --- nodes 4/4 stripes 5980

Note: Some checkpoints must ALWAYS be retained.

4. Verify whether checkpoint operations are failing with "MSG_ERR_DISKFULL" by running the following command:

dumpmaintlogs --types=cp --days=4 | grep "\

If checkpoints have completed successfully, output similar to the following is seen:

2020/03/07-08:00:39.51323 {0.0} starting scheduled checkpoint maintenance
2020/03/07-08:01:31.49490 {0.0} completed checkpoint maintenance
2020/03/07-08:07:47.36128 {0.0} starting scheduled checkpoint maintenance
2020/03/07-08:08:29.40139 {0.0} completed checkpoint maintenance
2020/03/08-08:00:39.93332 {0.0} starting scheduled checkpoint maintenance
2020/03/08-08:01:29.50546 {0.0} completed checkpoint maintenance
2020/03/08-08:06:45.37918 {0.0} starting scheduled checkpoint maintenance
2020/03/08-08:07:27.36749 {0.0} completed checkpoint maintenance
2020/03/09-08:00:36.57433 {0.0} starting scheduled checkpoint maintenance
2020/03/09-08:01:24.22214 {0.0} completed checkpoint maintenance
2020/03/09-08:06:40.52884 {0.0} starting scheduled checkpoint maintenance
2020/03/09-08:07:22.18463 {0.0} completed checkpoint maintenance
2020/03/10-08:00:39.83562 {0.0} starting scheduled checkpoint maintenance
2020/03/10-08:01:31.87814 {0.0} completed checkpoint maintenance
2020/03/10-08:06:48.27867 {0.0} starting scheduled checkpoint maintenance
2020/03/10-08:07:29.95640 {0.0} completed checkpoint maintenance

If a checkpoint failed due to MSG_ERR_DISKFULL, output similar to the following is seen:

2020/03/07-08:00:39.51323 {0.0} starting scheduled checkpoint maintenance
2020/03/07-08:01:31.49490 {0.0} failed checkpoint maintenance with error MSG_ERR_DISKFULL
2020/03/07-08:07:47.36128 {0.0} starting scheduled checkpoint maintenance
2020/03/07-08:08:29.40139 {0.0} completed checkpoint maintenance
2020/03/08-08:00:39.93332 {0.0} failed checkpoint maintenance with error MSG_ERR_DISKFULL
2020/03/08-08:01:29.50546 {0.0} completed checkpoint maintenance
2020/03/08-08:06:45.37918 {0.0} starting scheduled checkpoint maintenance
2020/03/08-08:07:27.36749 {0.0} completed checkpoint maintenance
2020/03/09-08:00:36.57433 {0.0} starting scheduled checkpoint maintenance
2020/03/09-08:01:24.22214 {0.0} completed checkpoint maintenance
2020/03/09-08:06:40.52884 {0.0} starting scheduled checkpoint maintenance
2020/03/09-08:07:22.18463 {0.0} completed checkpoint maintenance
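As a convenience, the result of step 4 can be summarized at a glance. The following is an illustrative sketch only: it reuses the same dumpmaintlogs command shown in step 4 and assumes the log lines match the sample output above ("completed checkpoint maintenance" and MSG_ERR_DISKFULL).

#!/bin/bash
# Hedged sketch: count completed vs MSG_ERR_DISKFULL checkpoint maintenance runs
# over the last 4 days, using the same command shown in step 4.
log=$(dumpmaintlogs --types=cp --days=4)
completed=$(printf '%s\n' "$log" | grep -c "completed checkpoint maintenance")
diskfull=$(printf '%s\n' "$log" | grep -c "MSG_ERR_DISKFULL")
echo "Completed checkpoint maintenance runs : $completed"
echo "Failures with MSG_ERR_DISKFULL        : $diskfull"
if [ "$diskfull" -gt 0 ]; then
  echo "MSG_ERR_DISKFULL seen - open a Service Request with Dell Technologies Avamar Support."
fi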
2020/03/10-08:00:39.83562 {0.0} starting scheduled checkpoint maintenance
2020/03/10-08:01:31.87814 {0.0} completed checkpoint maintenance
2020/03/10-08:06:48.27867 {0.0} starting scheduled checkpoint maintenance
2020/03/10-08:07:29.95640 {0.0} completed checkpoint maintenance

If checkpoint operations are failing with MSG_ERR_DISKFULL errors, open a Service Request with the Dell Technologies Avamar Support team; otherwise continue with step 5.

5. Check for other checkpoint issues. The cplist command shows how many checkpoints exist and how recently a checkpoint was validated. As also shown in the data collection article, use Avamar - How to understand the output generated by the cplist command to interpret the cplist output.

There should be two or three checkpoints, and at least one checkpoint from the last 24 hours should show as validated with hfscheck. This is the normal behavior and output when all jobs run successfully with normal checkpoint retention settings.

If there are more than three checkpoints, or no validated checkpoint within the last 24 hours, this must be addressed first, as it may be the only way to reduce the OS Capacity. If this scenario is encountered, open a Service Request with the Dell Technologies Avamar Support team; otherwise continue with step 6.

6. Determine the change rate:

capacity.sh

Example output:

DATE        AVAMAR NEW  #BU  SCANNED    REMOVED   MINS  PASS  AVAMAR NET  CHG RATE
==========  ==========  ===  =========  ========  ====  ====  ==========  ========
2020-02-25     1066 mb    8  302746 mb   -641 mb     0    23      425 mb     0.35%
2020-02-26     1708 mb    8  303063 mb   -518 mb     0    23     1189 mb     0.56%
2020-02-27     3592 mb    8  304360 mb   -413 mb     0    23     3178 mb     1.18%
2020-02-28     1086 mb    8  304892 mb   -372 mb     0    23      713 mb     0.36%
2020-03-01     1002 mb    8  305007 mb  -7469 mb     0    25    -6467 mb     0.33%
2020-03-02      585 mb    7  197874 mb      0 mb     0     9      585 mb     0.30%
2020-03-03      348 mb    7  199305 mb      0 mb     0    10      348 mb     0.17%
2020-03-04      775 mb    7  198834 mb     -2 mb     0    10      773 mb     0.39%
2020-03-05      380 mb    4  196394 mb     -5 mb     0    10      375 mb     0.19%
2020-03-06     1068 mb    4  159960 mb      0 mb     0     9     1067 mb     0.67%
2020-03-07      443 mb    4  197132 mb    -18 mb     0    17      424 mb     0.23%
2020-03-08      348 mb    4  197231 mb    -48 mb     0    20      300 mb     0.18%
2020-03-09      370 mb    4  196506 mb      0 mb     0     9      370 mb     0.19%
2020-03-10      349 mb    4  197292 mb    -17 mb     0    20      332 mb     0.18%
2020-03-11      974 mb    2   77159 mb      0 mb     0     0      974 mb     1.26%
===================================================================================
14 DAY AVG      940 mb    5  222517 mb   -634 mb     0    15      306 mb     0.42%
30 DAY AVG     1121 mb    5  195658 mb   -771 mb     0    14      349 mb     0.59%
60 DAY AVG      994 mb    4  128657 mb  -1165 mb     0    17     -170 mb     0.98%

Top Change Rate Clients. Total Data Added 14103mb

NEW DATA  % OF TOTAL  CHGRATE  TYPE  CLIENT
========  ==========  =======  ====  ======
 6803 mb       48.24    0.91%  AVA   /Windows/testing/Hyper-V/hyperv1
 3218 mb       22.82    0.61%  AVA   /clients/exchange1
 2932 mb       20.80    0.44%  AVA   /BMR/server1
  983 mb        6.97    0.10%  AVA   /Windows/testing/SQL/sql1
   97 mb        0.69    1.13%  AVA   /REPLICATE/grid2.company.com/MC_BACKUPS

If the high change rate or "too much too fast" situation recurs, it can sometimes be alleviated by lowering the overall GSAN or User Capacity. With a lower GSAN capacity there is more room for checkpoint overhead, and fewer data storage container changes occur. For assistance with this scenario, open a Service Request with the Dell Technologies Avamar Support team; otherwise continue with step 7.

7. Issues with high OS Capacity on other disk partitions (the Linux OS disks rather than /data0*) have various causes, but the solutions require technical support. A quick check of those partitions is sketched below.
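For step 7, a quick look at the non-data operating system partitions can confirm whether one of them, rather than a /data0* partition, is filling up. This sketch uses only standard Linux df and awk; the 90% warning level is illustrative, not an Avamar-documented threshold.

#!/bin/bash
# Hedged sketch: flag non-/data partitions (root, /var, and so on) above 90% usage.
# The 90% figure is only an illustrative warning level for this check.
df -hP | awk 'NR > 1 && $6 !~ /^\/data/ {
  use = $5; gsub("%", "", use)
  if (use + 0 >= 90)
    printf "WARNING: %s on %s is %s%% full\n", $1, $6, use
}'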
Open a Service Request with the Dell Technologies Avamar Support team.

Once OS Capacity is addressed, GSAN capacity or other Avamar capacities can be reviewed. See Avamar Capacity Troubleshooting, Issues, and Questions - All Capacity (Resolution Path).