The main trouble signs of a RAID array

25.06.2019 angrypixel Blog

No discussion of arrays is complete without RAID controllers. There are a great many on the market today, at prices ranging from twenty to several thousand dollars. Comparing their reliability is difficult, but the price gap is not arbitrary. Budget controllers use simplified algorithms for normal operation and for recovery after failures, which makes data loss more likely. Expensive models are considerably more reliable and handle errors better, but they are not perfect either.

RAID is not a panacea for data loss. In practice, controllers fail, hard drives fail, and sometimes one failure triggers the other. If you rely entirely on the array and neglect timely backups, you risk being left without your “securely stored” information. Regular monitoring of the array state and routine maintenance significantly reduce the probability of data loss, but they cannot reduce it to zero.
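
As an illustration of what regular monitoring can look like, here is a minimal sketch in Python. It assumes Linux software RAID (md), whose status is exposed as text in /proc/mdstat; a hardware controller would need its vendor's own tool instead. The function name degraded_arrays and the sample text are made up for this example.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing disk.

    In /proc/mdstat each array has a map such as [UU_], where U is a
    working member and _ is a missing or failed one.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None  # the member map for this array has been seen
    return degraded

# Sample text in the /proc/mdstat layout (values invented for the example).
SAMPLE = """Personalities : [raid1] [raid5]
md0 : active raid5 sdb1[0] sdc1[1] sdd1[2](F)
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

md1 : active raid1 sde1[0] sdf1[1]
      976630336 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # ['md0']
```

On a real Linux host the same function could be fed the contents of /proc/mdstat from a scheduled job, so that a degraded array is noticed the day it happens rather than weeks later.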

The reasons for the failure of RAID arrays

The most common reason disk arrays fail is the negligence of system administrators who assume that “lightning never strikes the same place twice”. Suppose one disk fails in a running array, for example a RAID 5. The array keeps working, but noticeably slower. The administrator notices the failed drive but is in no hurry to act, expecting the array to keep running in this degraded state for some time. Sometimes that expectation proves wrong.
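
To see why a RAID 5 survives one failed disk but not two, it helps to look at the parity arithmetic. The sketch below is a simplified model in Python, not any controller's actual code: it treats one stripe as three data blocks plus one XOR parity block and ignores parity rotation across disks. The helper name xor_blocks is made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks and their parity block.
data = [b"disk", b"fail", b"demo"]
parity = xor_blocks(data)

# The disk holding data[1] fails. XOR of everything that is left
# gives back the missing block, so the array keeps working.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)  # b'fail'

# A second disk (data[2]) fails before the first is replaced.
# What remains only yields the XOR of the two lost blocks,
# which cannot be split back into the individual blocks.
leftover = xor_blocks([data[0], parity])
print(leftover == xor_blocks([data[1], data[2]]))  # True
```

Every read of a missing block requires reading all the surviving disks, which is also why a degraded array is noticeably slower.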

If one of the disks fails, it is best to back up critical data immediately, then replace the failed drive and rebuild the array.

Why back up first? Because a rebuild sometimes “hangs”. This usually happens when a bad block turns up on one of the disks during reading or writing and the controller cannot read the data from that sector. After a long and fruitless wait, the server has to be rebooted, and the array then turns out to have completely “collapsed”. The hang is most likely caused by incorrect handling of the error condition. The problem is more typical of cheap controllers, but it also occurs with expensive hardware.
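
The following Python sketch shows, under simplifying assumptions, what handling that error condition could look like: a rebuild loop that records an unreadable stripe and moves on instead of stalling. It is a toy model, not how any particular controller works. ToyDisk, ReadError, xor_blocks and rebuild are hypothetical names, and the "array" is two in-memory lists plus XOR parity.

```python
from functools import reduce

class ReadError(Exception):
    """Raised by ToyDisk when a block is unreadable (a bad block)."""

class ToyDisk:
    """In-memory stand-in for a disk; indices listed in `bad` cannot be read."""
    def __init__(self, blocks, bad=()):
        self.blocks = list(blocks)
        self.bad = set(bad)

    def read(self, index):
        if index in self.bad:
            raise ReadError(f"bad block at stripe {index}")
        return self.blocks[index]

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(survivors, stripe_count):
    """Rebuild the missing disk stripe by stripe.

    Returns (rebuilt_blocks, lost_stripes). A stripe that cannot be read
    from every survivor is stored as None and reported, not retried forever.
    """
    rebuilt, lost = [], []
    for stripe in range(stripe_count):
        try:
            blocks = [disk.read(stripe) for disk in survivors]
        except ReadError:
            rebuilt.append(None)
            lost.append(stripe)
            continue
        rebuilt.append(xor_blocks(blocks))
    return rebuilt, lost

disk_a = [b"aa", b"bb", b"cc"]
disk_b = [b"11", b"22", b"33"]  # this is the disk that has failed
parity = [xor_blocks(pair) for pair in zip(disk_a, disk_b)]

# disk_a survives but has a bad block at stripe 1.
survivors = [ToyDisk(disk_a, bad={1}), ToyDisk(parity)]
rebuilt, lost = rebuild(survivors, stripe_count=3)
print(rebuilt)  # [b'11', None, b'33']
print(lost)     # [1]
```

Even with graceful handling, the data in stripe 1 is gone for good, which is the practical argument for taking the backup before starting the rebuild.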

Another common cause of array failure is several disks going offline at the same time. In practice this is most often caused by problems reported by SMART or by an accumulation of bad blocks. As long as the number of bad blocks stays below a certain value, the disk works correctly, but one fine day the array stops starting. Everything seems fine: the disk spins up normally, judging by the sound, and the controller detects it correctly, yet for no obvious reason the disk status is offline, the array does not start and the data cannot be read. This happens because the controller cannot read the data it needs from the disk, or because it has marked the disk as “dead” on the basis of the SMART diagnostics.
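
Because a disk works normally right up to the point where its bad-block count crosses the limit, it makes sense to watch the counts before that happens. The Python sketch below is only an illustration: the limit of 100 and the 50% warning level are invented values, since real thresholds depend on the drive and controller, and the counts would come from a SMART attribute such as the reallocated sectors count. The function name disks_at_risk is hypothetical.

```python
def disks_at_risk(bad_block_counts, limit, warn_fraction=0.5):
    """Return the disks whose bad-block count has reached warn_fraction of limit.

    bad_block_counts maps a disk name to its current bad-block count;
    limit is the (assumed) count at which the disk would be taken offline.
    """
    return sorted(
        name
        for name, count in bad_block_counts.items()
        if count >= limit * warn_fraction
    )

# Invented readings for a four-disk array.
counts = {"disk0": 3, "disk1": 55, "disk2": 61, "disk3": 0}
print(disks_at_risk(counts, limit=100))  # ['disk1', 'disk2']
```

Two disks on the warning list at once is exactly the situation described above: each still works today, but both can go offline within a short time of each other.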

There are many more ways an array can fail, but what do you do once it has happened? If information has been lost, it must be restored, and a RAID recovery service is one option to consider.

Signs that may indicate a breakdown:

Hard drive clicks
Windows blue screen (BSOD)
The disk is reported as not formatted
The computer keeps rebooting
The system hangs constantly
Drive or device not found
Operating system not found.