2010/10/16

How not to spend a Friday night.

A cascade of stupidity caused me to drive to Montreal and back on a Friday night. In the dark. In heavy rain. If you know me, this isn't my idea of fun. While I am slightly to blame, most of the blame is elsewhere.

First, power outage at a client's in St-Constant at 13h00 (roughly). DAMN YOU HYDRO. Though really, blackouts are par for the course. Client has a UPS though. BUT the USB cable went missing 6 years ago, so there was no way for the computer to turn itself off cleanly. DAMN YOU APC. Why not just put a USB-B port on the back of your damn UPSes instead of using a secret-sacred RJ45 with 10 pins that costs way too much? Oh, yeah, that would be why. And DAMN YOU JEAN-PHILLIPP, a UPS without apcupsd (or equiv) is less than useful.

So anyway, battery eventually drains, BAM! Hard shutdown. Power comes back at some point, but I suspect not very cleanly. The on-off cycling causes the BIOS to lose its settings. BLACK EYES TO YOU ASUS. SERIOUSLY WHAT THE EF?! Also: DAMN YOU APC AGAIN! The UPS should be smart enough to wait for a few seconds of clean power before passing power through.

But, now that the BIOS has been reset to show the SATA drives as IDE, GRUB can no longer load stage 2. Which might be damn stupid, or unavoidable. The bug report to me is "GRUB " on the screen, nothing further.

And this is where my blame comes in; I know that ASUS motherboards can reset the BIOS. But I'd just had a problem with /boot on a RAID1, so I'm thinking that was the problem. I eat supper, drive 2 hours, boot the computer with Knoppix, reinstall grub so it can find stage 2. Reboot, yay grub! But then initrd can't find md0, which has VolGroup00 on it! WHA! It's there! KNOPPIX can find it, why can't you?

Messing around for a while until the light goes on: BIOS RESET BECAUSE ASUS HATES LIFE! OK, set the SATA back to AHCI. Boot again. Still no go. FAH!

And then the other shoe drops. One of those things that if you've never pulled an initrd apart and poked at the init script inside one, you wouldn't notice: init was looking for md0, but KNOPPIX was calling it md127. And it turns out that md devices have a preferred minor device number. So when KNOPPIX was calling it md127, it was writing that to the array. Which means when init was trying to activate md0, it goes BUH CAN'T FIND IT.

BLACK EYES TO YOU, KNOPPIX FOR CHANGING THAT! Seriously, changing the preferred name of an array is really bad form.
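
To see which minor is actually recorded on disk, you can pull the Preferred Minor line out of mdadm's output. A minimal sketch, assuming old-style 0.90 metadata (the kind that prints a Preferred Minor line); preferred_minor is a made-up helper name, not part of mdadm:

```shell
# preferred_minor: hypothetical helper, not an mdadm command.
# Reads `mdadm --examine` or `mdadm --detail` output on stdin and prints
# the number from the "Preferred Minor" line (0.90 metadata only).
preferred_minor() {
    awk -F: '/Preferred Minor/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}
```

Used as "mdadm --examine /dev/sda2 | preferred_minor", it should print 127 after a KNOPPIX boot like the one above, and 0 once the array is fixed.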

So how to change it back. First you deactivate the arrays :
mdadm --stop /dev/md126

mdadm --stop /dev/md127

BUH! That last doesn't work; LVM is still holding a lock on the array. I strongly suspect that vgremove would be enough to drop the lock, but there's no way I'm going to test that on live data.
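
A gentler way to make LVM let go is probably to deactivate the volume group instead of removing it: vgchange -an marks its logical volumes inactive and leaves the data alone. This is untested on the setup above, so the sketch below only prints the commands rather than running them. stop_plan is a made-up name; VolGroup00 and /dev/md127 are the names from this post.

```shell
# stop_plan: hypothetical helper that prints, but does not run, the
# commands to release an md array that LVM is holding open.
#   $1 = volume group name, $2 = md device
stop_plan() {
    printf 'vgchange -an %s\n' "$1"    # deactivate the VG so LVM closes the array
    printf 'mdadm --stop %s\n' "$2"    # after that the array can be stopped
}

stop_plan VolGroup00 /dev/md127
```

Read what it prints, and only then run the two commands by hand.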

So reboot, don't activate mdadm-raid. Do the following
mdadm --assemble --update=name --name=0 /dev/md0 /dev/sda2 /dev/sdb2 /dev/sdc2

mdadm --detail /dev/md0
mdadm --assemble --update=name --name=1 /dev/md1 /dev/sda1 /dev/sdb1 /dev/sdc1
mdadm --detail /dev/md1
The --detail lines should show Preferred Minor : 0 for md0 and Preferred Minor : 1 for md1.

Note: Those are hairy commands. They could potentially kill your arrays if you get the partitioning wrong. DO NOT JUST CUT AND PASTE THEM IF YOU RUN INTO THE SAME PROBLEM AS ME! Read the docs, understand them, then adapt the commands to your setup.
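
One cheap sanity check before assembling is to confirm that every partition you are about to name really belongs to the same array, by comparing the array UUID that mdadm --examine reports for each. A rough sketch: array_uuid and same_array are made-up helper names, and the parsing assumes the UUID line looks like "UUID : <value>" (or "Array UUID : <value>" with newer metadata).

```shell
# array_uuid: hypothetical helper. Reads `mdadm --examine` output on stdin
# and prints the value from the first line that mentions a UUID.
array_uuid() {
    awk -F' : ' '/UUID/ { split($2, f, " "); print f[1]; exit }'
}

# same_array: hypothetical helper. Reads one UUID per line on stdin and
# succeeds only if it got exactly $1 lines and they are all identical.
same_array() {
    uuids=$(cat)
    [ "$(printf '%s\n' "$uuids" | grep -c .)" -eq "$1" ] &&
    [ "$(printf '%s\n' "$uuids" | sort -u | grep -c .)" -eq 1 ]
}

# Intended use (as root, with YOUR partitions, not mine):
#   for p in /dev/sda2 /dev/sdb2 /dev/sdc2; do
#       mdadm --examine "$p" | array_uuid
#   done | same_array 3 && echo "all three agree"
```

A partition with no md superblock prints nothing, so the line count comes up short and the check fails, which is what you want.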

PS : you pull initrd apart with
cd /boot

mkdir t ; cd t
gzip -dc ../initrd-$(uname -r).img | cpio -i
ls
Now go poke at init
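
A quick way to find what init expects is to grep it for md device names. A small sketch, assuming init is a plain text script (it usually is in initrds of this vintage); md_refs is a made-up name:

```shell
# md_refs: hypothetical helper. Lists, with line numbers, the lines of a
# script that mention an md device (md0, md127, ...) or mdadm itself.
#   $1 = path to the script
md_refs() {
    grep -n -E 'md[0-9]+|mdadm' "$1"
}
```

Run "md_refs init" from inside the t directory and it should point at the md0 that init was insisting on.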
