A while ago we noticed log messages on one of our servers indicating that a hard drive was going bad. The logs showed errors reading certain sectors of one of the drives in a software RAID-1 volume. Here's how we went about diagnosing the faulty drive and replacing it.
Background
The server has several Linux software RAID (mdraid) volumes, and the drives are connected to the server's onboard Intel AHCI SATA controller. The Intel chipset and the Linux AHCI driver both support hot-swapping, which covers two of the required components. The drives themselves are mounted in hot-swap carriers that connect to a backplane in the server, completing the setup and allowing us to remove and add drives when needed. Naturally, one has to be careful not to remove drives that are in use. With the chipset, driver and physical parts in place, you also need software to manage the hard drives: taking them offline before removing them, and setting up new drives to replace the old ones in the RAID array(s). In Linux, this is handled by the mdraid management tool "mdadm".
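As a quick orientation, you can get an overview of the existing arrays and their member drives from the kernel's status file and from mdadm itself (the md4 device name here is from our setup; adjust for yours):

cat /proc/mdstat
mdadm --detail /dev/md4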
Diagnosis
Judging from the logged errors, the faulty drive was /dev/sde, so I examined it using the smartctl utility from the smartmontools package and initiated an offline test of the drive:
smartctl --test=offline /dev/sde
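This starts the test in the background on the drive itself. As an aside, if you want to check on the test later, the self-test execution status is included in smartctl's capabilities printout:

smartctl -c /dev/sde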
When you launch the test, smartctl also prints an estimate of when it will be done. The offline test runs a shorter scan of the drive and updates the SMART attribute values, which can then be read out using
smartctl -A /dev/sde
Here’s what was printed out:
# smartctl -A /dev/sde
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6688
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   115   103   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   192   192   000    Old_age   Always       -       1458
198 Offline_Uncorrectable   0x0030   200   192   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   182   000    Old_age   Offline      -       0
Looking at this, attributes 197 and 198 caught my eye. There are a lot of current pending sectors (1458), and a non-zero value (5) for attribute 198, Offline_Uncorrectable. That attribute denotes the number of defective sectors found during the offline scan. The pending-sector count doesn't mean there really are 1458 broken sectors; it indicates "suspect" sectors that for some reason have not been readable. Still, the pending sectors, together with the five uncorrectable sectors and the kernel's read errors, made it clear that this drive needed to be replaced as soon as possible.
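For more detail, the drive also keeps its own error log and a log of previous self-tests, both readable with smartctl (what they contain varies from drive to drive):

smartctl -l error /dev/sde
smartctl -l selftest /dev/sde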
The drive was part of a two-disk RAID-1 volume, so the next step was to remove it from the RAID volume and the server itself, RMA it with the vendor, and put a new disk to use in its place. Here is how we did it. Some details first:
- The RAID-1 volume was /dev/md4
- It consists of two drives, /dev/sde and /dev/sdf
- Each drive has one large partition, marked as type fd aka "Linux RAID autodetect"
How we did it
- Make sure you identify exactly which drive is broken and which RAID volume it belongs to. Our server has six drives, and the broken one was the fifth, hence /dev/sde. Check the kernel logs, examine the mdraid status with "cat /proc/mdstat" and use the mdadm RAID management utility until you are certain you have the right drive. In our case, it was the md4 RAID volume and the /dev/sde drive.
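One way to be really sure, in case device names and physical slots don't line up, is to match serial numbers: smartctl can print the serial of a given device, which you can compare against the label on the physical drive once it's out of the server:

smartctl -i /dev/sde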
- Buy a replacement drive of at least the same size and arrange with your vendor to get the faulty drive returned and exchanged.
- Save the partition table from the faulty drive, assuming it’s still readable. This can be done with the sfdisk partition tool:
sfdisk -d /dev/sde > sde-partitions.dat
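The resulting file is a plain-text description of the partition layout, so it's easy to verify that the dump succeeded:

cat sde-partitions.dat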
- Use mdadm to mark the faulty drive (actually, its partition) as failed in the affected RAID volume:
mdadm /dev/md4 --fail /dev/sde1
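The array now runs degraded on the remaining drive, and the failed member shows up with an "(F)" marker in the status output:

cat /proc/mdstat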
- Next, remove the faulty partition from the RAID volume:
mdadm /dev/md4 --remove /dev/sde1
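If you want to double-check before touching the hardware, mdadm's detail view should now show an empty slot for the removed drive (the exact wording may differ between mdadm versions):

mdadm --detail /dev/md4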
- Bring the new disk to the server and swap out the faulty drive. Make sure you remove the correct one; in our case it was the fifth drive, so we removed drive carrier #5. If you have a monitor connected to the server while you do this, you should see a kernel message pop up on the console showing that the OS has detected a drive being removed. When you plug in the new drive, another message indicates that a new drive has been added to the machine. On our server, the new drive was detected as "sde", just like the old one.
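If you don't have a monitor hooked up, the same kernel messages can be read from the log afterwards, which also confirms which device name the new drive was given:

dmesg | tail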
- Log on to the server again and partition the new drive. Make use of the old partition table dump created above:
sfdisk /dev/sde < sde-partitions.dat
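Afterwards, it's worth listing the new drive's partition table to confirm it matches the old layout:

sfdisk -l /dev/sde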
- Add the new disk to the RAID volume, in our case md4:
mdadm /dev/md4 --add /dev/sde1
- Assuming everything went well, you should be able to see the RAID volume being rebuilt:
# cat /proc/mdstat
Personalities : [raid1]
md4 : active raid1 sde1[2] sdf1[1]
      976759936 blocks [2/1] [_U]
      [==>..................]  recovery = 12.7% (124386752/976759936) finish=133.3min speed=106564K/sec
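The rebuild takes a while with drives this size; if you want to follow along, you can have the status re-read every couple of seconds:

watch cat /proc/mdstat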
- Done!