One of the two Western Digital Raptor drives in colobus failed tonight. I was sitting at my desk when my phone indicated an incoming SMS. It turned out to be from the mdadm monitoring utility telling me that one of the drives had failed and the RAID was running in degraded mode. The main filesystem is on a RAID 1 array, which means two identical drives in a mirrored configuration. Since one has now failed, it is running in a non-redundant configuration until the drive can be replaced.
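For anyone curious what "degraded" looks like from the system's side, here is a minimal sketch of how a script could spot it by reading the text the Linux md driver exposes in /proc/mdstat. This is not how mdadm's monitor does it, just an illustration. The function name `degraded_arrays` and the sample text (device names, block count) are made up for the example, and it assumes arrays are named in the usual `mdN` form.

```python
import re

# Hypothetical sample in the format of /proc/mdstat for a two-disk RAID 1
# where one member has failed. Device names and sizes are invented.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      71681920 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_status} for arrays with a missing member.

    The member status is the bracketed string such as "UU" (all members up)
    or "U_" (one member missing); an underscore marks a missing member.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line like "md0 : active raid1 ..."
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The following line carries the status string, e.g. "[U_]"
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded
```

On a real Linux machine the input would come from reading /proc/mdstat instead of the sample string; a healthy mirror shows `[UU]` and produces an empty result.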
Fortunately I was able to find a replacement drive at the local Best Buy for an entirely reasonable price, and I’ll be heading down to Boston to replace the drive tomorrow. Amazingly, the failed drive was less than a year old. It is still under its 5-year warranty, so it’ll be replaced for free, but to get a replacement in right away I need to buy a new drive now. The failed drive will be RMA’d later and then I’ll have a spare on hand.
We’ll need to reboot to perform the drive replacement. The latest version of the Linux kernel fully supports hotplugging of SATA drives, but unfortunately that version was not out 160 days ago when colobus was last rebooted. It’ll receive a shiny new kernel tomorrow as well.
UPDATE: The drive replacement was completed in about 20 minutes, with the array reconstruction taking an additional hour (while the system was up). I am very pleased with how things went and how well my Tyan rackmount’s hotswap bays worked. I wasn’t able to actually hotswap the drives this time because the old Linux kernel version didn’t support it, but the kernel was upgraded when the machine was rebooted, so any future drive replacements can be done with the machine online.