
Recovering a RAID Array after Lightning

The EVMS RAID 5 array in my linux fileserver crashed recently due to a lightning storm, and I thought I’d lost everything.  But with some luck and intuition I was able to recover all my files.  I’ll tell you how I did it, so hopefully others who run into similar problems can recover their data too.  But first, a little background.

Last week Seattle had some crazy electrical storms.  In recent years' storms, my block has done better than most with respect to power failures, making me think we're either lucky or in a particularly robust section of the grid.  So I was a little surprised to find my whole house offline on Wednesday morning.  After a bit of debugging I figured out that the small UPS that runs all my networking gear got toasted, and for some reason the file server was down.

I left it alone for several days, and when I got around to turning it back on, I was happy that the whole stack through the samba server came up by itself.  (It doesn’t always!)  But when I started looking around I quickly realized things were amiss.  The media/video directory normally has 4 subdirectories: movies, episodic TV, imake and other.  But today it listed:

leo@elephant:/raid/shares/media/video$ ls
dpisndic TV  hmakd  movies  nther

WTF!?  A few bits had been scrambled in the directory names.  This sounded really bad.  Moreover, even though the first couple of levels of the directory hierarchy were there, no files were to be found.  Definitely a problem.

Step 1: As soon as you suspect your RAID array has a problem, stop writing to it until you know what’s going on.  Writing changes can make things worse.  Stop the bleeding.   

I unmounted the drive from my Mac, not trusting Finder or Spotlight not to sprinkle damaging meta-files over the array.  Once I remembered how to ssh into the box, I stopped the samba daemon,

leo@elephant:/$ sudo /etc/init.d/samba stop

unmounted the filesystem

leo@elephant:/$ sudo umount /raid

and changed fstab so the filesystem would be read-only when it came back, and so that it wouldn't come back at all without my asking.

leo@elephant:/$ sudo vi /etc/fstab

changing

/dev/evms/teraraid500 /raid ext3 defaults  0 0

to

/dev/evms/teraraid500 /raid ext3 ro,noauto  0 0

I tried poking around in EVMS by running

leo@elephant:/$ evmsn

But it hung during initialization with a blue dialog saying "Discovering segments…", which made me think EVMS couldn't help me here.  After a bit of googling I thought I should try e2fsck or some such.  First, I tried to mount the volume again read-only to see what was there.
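I don't have the exact invocation written down, so treat the options here as a reconstruction rather than gospel, but it was something along the lines of

leo@elephant:/$ sudo mount -o ro /dev/evms/teraraid500 /raid

which failed with: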

mount: wrong fs type, bad option, bad superblock on /dev/evms/teraraid500,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Bad superblock.  Uh oh.  Well, this guy managed to recover a drive with a bad superblock, and lots of things were pushing me in that direction: fix the filesystem.  But I realized that would have been a mistake.

Step 2: Do not make changes at the filesystem level until you're confident that the RAID array is working properly.  You set up RAID for a reason.  You've still got a chance to recover everything, but if you start making changes to it in a broken state, you're almost certainly going to make things worse.
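Before touching the filesystem, a harmless first move is to ask the kernel what it thinks of the array.  I'm assuming here that the EVMS volume sits on an ordinary md device (the boot messages further down mention md1) and that mdadm is installed; both of these only read status and change nothing:

leo@elephant:/$ cat /proc/mdstat

leo@elephant:/$ sudo mdadm --detail /dev/md1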

Me to self: Think about it.  EVMS is confused.  Linux is confused.  Ext2 and ext3 are messed up, complaining about bad superblocks.  The problem was caused by lightning.  When the drive was last mounted there were weird bit-level corruptions in the data, and they were still there.  Maybe one of the drives in the array got its data scrambled, but wasn't fragged badly enough to go offline.  RAID 5 is designed to survive the total loss of a single drive.  But if a drive silently returns corrupted data, who knows what will happen.  So I came up with this plan:

Step 3: Try physically disconnecting the drives in your array, one at a time.  If only one of them is scrambled, disconnecting it should restore all the data in the array.

Having followed my own advice, I found it easy to tell the drives in my array apart, since each drive in the RAID array is from a different manufacturer (which also makes array failure due to a common manufacturing defect far less likely).
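If your drives are all the same make, you can still figure out which physical drive is which before you start pulling cables.  On a udev-based system the /dev/disk/by-id/ names encode the model and serial number, and hdparm (if it's installed) will print the same details; /dev/sda below is just an example device, and both queries are read-only:

leo@elephant:/$ ls -l /dev/disk/by-id/

leo@elephant:/$ sudo hdparm -I /dev/sda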

This plan actually worked perfectly!  Removing a drive caused a bit of a hassle in getting the machine back up, though: when it booted, it couldn't find the /boot partition, complaining

 * Starting Enterprise Volume Management System...
[42949392.340000] raid5: raid level 5 set md1 active with 2 out of 3 devices, algorithm 0

* Checking all filesystems...
fsck.ext3: No such file or directory while trying to open /dev/sdd5
/dev/sdd5:
The superblock could not be read or does not describe a correct ext2 filesystem. 
If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else),
then the superblock is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Notice the complaint about the superblock again: don't trust it, and don't do what it says!  What really happened was that the boot drive's letter had changed from /dev/sdd to /dev/sdc, so I had to change /etc/fstab to mount /boot from /dev/sdc5 instead of /dev/sdd5.  In my system, I boot off a non-RAID disk attached to the mobo, which for some annoying reason gets the last drive letter, after all the drives on the SATA card.
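That fix is a one-line edit in /etc/fstab.  My /boot entry looked roughly like the following; the ext3 type matches the fsck.ext3 complaint above, but the mount options and the dump/pass numbers are from memory, so treat them as approximate.  The change was from

/dev/sdd5 /boot ext3 defaults 0 2

to

/dev/sdc5 /boot ext3 defaults 0 2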

But once I got past this, it quickly turned out that the Samsung drive was the culprit.  With it removed, the software RAID kicked in and plugged the hole.  Everything in the array looked completely normal again.  All the directories.  All the files.  Hooray!

