2013/01/29

Booting from a ghost partition

The most insane things can happen. You install a new kernel and you see it in /boot/grub/grub.conf, but you don't see it in the GRUB menu at boot time. This has happened to me twice recently. The first time, I lost 2 hours and thought I was going mad before I figured it out.

The cause is that I have /boot on a software RAID (/dev/md0, raid1 of sda1, sdb1, sdc1). While GRUB supports Linux software RAID1, it only does so partially; it assumes that the first drive of the array is the same as the others and reads it as a plain ext3 (or other) partition. What had happened a few weeks back was that sda1 had become the spare drive. A spare doesn't receive the array's writes, so GRUB was reading a stale copy of /boot that didn't have the new kernel. This was caused by a mislabeled ICY DOCK; what I thought was sda was really sdc, and vice versa. That first time I gave up before figuring it out, copied /boot into a temp directory, reformatted sda1 as a straight ext3 and copied /boot back into it.

This second time, I had already seen the problem and knew where to look: /proc/mdstat, to make sure that sda1 was an active part of the array. Sure enough, sda1 wasn't even in the array! So

mdadm /dev/md0 --add /dev/sda1
Then I waited a minute for the RAID to rebuild, rebooted, and everything was fine.
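
Since the whole diagnosis comes down to reading one line of /proc/mdstat, here is a rough sketch of that check as a shell function. The name md_member_state is made up for illustration. It assumes the usual mdstat layout, where members appear on the array's line as sda1[0], with (S) appended for a spare and (F) for a faulty device.

```shell
#!/bin/sh
# Report how a partition participates in an md array.
# Usage: md_member_state ARRAY PARTITION [MDSTAT_FILE]
# Prints one of: active, spare, faulty, missing
# (md_member_state is a hypothetical helper, not part of mdadm.)
md_member_state() {
    array=$1
    part=$2
    file=${3:-/proc/mdstat}
    awk -v array="$array" -v part="$part" '
        # The array line looks like: md0 : active raid1 sdb1[1] sda1[3](S)
        $1 == array && $2 == ":" {
            state = "missing"
            for (i = 3; i <= NF; i++) {
                # Match "sda1[" exactly so sda1 is not confused with sda10
                if (index($i, part "[") == 1) {
                    if ($i ~ /\(S\)/)      state = "spare"
                    else if ($i ~ /\(F\)/) state = "faulty"
                    else                   state = "active"
                }
            }
            found = 1
            print state
        }
        # Array not listed at all: report the partition as missing
        END { if (!found) print "missing" }
    ' "$file"
}
```

Running md_member_state md0 sda1 before rebooting into a new kernel would have told me "missing" this time and "spare" the first time. Anything other than "active" means GRUB may be reading the wrong copy of /boot.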

This particular array had 3 active drives, no spares. I had replaced sda at some point and, while I had added the new drive to the main array (md1, raid5), I hadn't done so for md0. Why does md0 have no spares? It seemed like a good idea at the time. Repairing a computer that won't boot is annoying. Having all drives active and no spares means you never (in theory) end up with a disk that won't boot, should the first one give up the ghost. And given that /boot is only written to once in a blue moon, it's not like I'm stressing the extra drive at all.

In this second case all 3 partitions of md0 are active; there are no spare drives. Had it been a case like the first one, where sda1 was the spare, I would have forced sda1 to become an active partition in the array. Something like

# mark sdc1 as failed so the spare (sda1) takes over; the array rebuilds to sda1+sdb1
mdadm /dev/md0 --fail /dev/sdc1
# now remove sdc1 and add it back, which makes it the spare
mdadm /dev/md0 --remove /dev/sdc1
mdadm /dev/md0 --add /dev/sdc1
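
Either way the array has to rebuild, and I'd rather confirm that it's back to full strength than guess at "a minute". A minimal sketch, assuming the usual mdstat status line (for example "104320 blocks [3/3] [UUU]", where the first number is the member count the array wants and the second is how many are in sync). The function name md_array_health is hypothetical.

```shell
#!/bin/sh
# Report whether an md array has all of its members in sync.
# Usage: md_array_health ARRAY [MDSTAT_FILE]
# Prints one of: complete, degraded, unknown
# (md_array_health is a hypothetical helper, not part of mdadm.)
md_array_health() {
    array=$1
    file=${2:-/proc/mdstat}
    awk -v array="$array" '
        # Remember that the next line belongs to the array we want
        $1 == array && $2 == ":" { in_array = 1; next }
        in_array {
            for (i = 1; i <= NF; i++) {
                # Look for the [wanted/in-sync] field, e.g. [3/2]
                if ($i ~ /^\[[0-9]+\/[0-9]+\]$/) {
                    split(substr($i, 2, length($i) - 2), n, "/")
                    state = (n[1] + 0 == n[2] + 0) ? "complete" : "degraded"
                    print state
                    done = 1
                    exit
                }
            }
            in_array = 0
        }
        END { if (!done) print "unknown" }
    ' "$file"
}
```

Calling md_array_health md0 should print "degraded" while the rebuild is running and "complete" once it's done. mdadm also has a --wait option (mdadm --wait /dev/md0) that blocks until any resync or recovery finishes, which is handy in a script.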
