A while back, lightning scrambled one of the disks in my home RAID 5 array. I figured out how to recover it. And I got the critical data off. Here I describe the steps I took to add a new drive and get it working with the old RAID array. I share this with the net in hopes it will make it easier for somebody else who has to go through this process themselves, and selfishly as notes for me to refer to. It’s a testament to the power of EVMS and a warning to anybody who thinks it might be fun to run their own open-source software RAID server at home.
My advice for people seeking reliable storage: go with a hosted solution. Understanding the arcane nuances of these software systems is an extremely specific skill that doesn’t translate well to many real-life necessities. If you’re smart, you can figure it out, but it doesn’t teach you much of anything except how to do exactly that. Each person who understands this stuff should be keeping petabytes of data happy, rather than one couple’s pictures and music collections. I hear Microsoft’s "home server" actually makes this pretty easy, but I can’t recommend anybody willingly lock themselves into Microsoft’s business model.
Background
So I bought a new drive, following my own advice about picking drives from different manufacturers when building a RAID array, and plugged it into the mobo and booted the machine. After futzing with /etc/fstab to get it to find the boot disk and load up (a note on that below), I logged into evms and got these messages:
MDRaid5RegMgr: RAID5 array md/md1 is missing the member with RAID index 0. The array is running in degrade mode.
and
MDRaid5RegMgr: Region md/md1 is currently in degraded mode. To bring it back to normal state, add 1 new spare device to replace the faulty or missing device.
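(The fstab futzing, for what it's worth: adding the new drive shuffled the /dev/sdX names around, so the boot disk was no longer where fstab said it was. Entries keyed to filesystem labels or UUIDs sidestep that whole class of problem. A rough sketch, with made-up labels, mount points, and filesystem types rather than my actual file, and assuming the array shows up at /dev/evms/teraraid, which is where EVMS normally puts activated volumes:)

    # /etc/fstab -- refer to filesystems by LABEL (or UUID) instead of /dev/sdX,
    # so adding or removing a drive can't shuffle the boot disk out from under you.
    # Labels, mount points, and filesystem types here are illustrative only.
    LABEL=bootdisk        /          ext3    defaults           0 1
    /dev/evms/teraraid    /teraraid  ext3    defaults,noatime   0 2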
Conceptually easy. I've got a new 500 GB drive in the system. Linux sees it. It didn't take me too long to figure out it's called /dev/sda, while the previous two disks in the array are sdb and sdc, with a small boot drive at sdd. Now the fun part is figuring out enough EVMS terminology to tell it to use the new disk.
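(Figuring out which disk got which name is easy enough outside of EVMS, for the record. A couple of the usual suspects, not necessarily the exact commands I ran:)

    # Which physical disk ended up with which /dev name, and how big each one is.
    cat /proc/partitions                            # kernel's list of disks/partitions, with sizes
    fdisk -l /dev/sda /dev/sdb /dev/sdc /dev/sdd    # per-disk partition tables and capacities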
The hierarchy of the array in EVMS land seems to be as follows:
- Logical Volume teraraid (which contains)
  - Region md/md1 (which contains)
    - Segments sdb1 and sdc1 (which are built on)
      - Logical disks sdb and sdc.
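The md/md1 region is an ordinary Linux md device underneath, so the raw md tools can describe the same structure. Something like this (a sketch, assuming the region surfaces as /dev/md1; output varies by mdadm version, and right now one member is missing):

    # The region from the md layer's point of view: RAID5, degraded, two of three members.
    mdadm --detail /dev/md1
    # And a member segment's own md superblock:
    mdadm --examine /dev/sdb1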
What I tried, and what seems to have worked
I see that logical disk sda has no segments. So I try Action -> Create -> Segment. It only gives me one choice for "Segment Manager", which is "GPT Segment Manager." But when I choose it, it doesn't let me make a segment on sda, only the tiny free space on sdb and sdc. So sda needs something else done to it before we can use it. What?
sda also shows up in the list of Logical Volumes, next to Teraraid and the formatted boot partition. Hmmm.
Well, I tried converting it to an EVMS Volume. It complained that sda does not have a File System Interface Module (FSIM) associated with it, but it made the new logical volume anyway. This wasn't getting me anywhere, so I erased it.
Next I tried "Add" -> "Segment Manager to Storage Object". I noticed that all of the Disk Segments associated with the array were listed as using "Plug-in" "GptSegM" and this gave me the choice of adding Gpt Segment Manager to sda. W00t. I said "No" to make this a system disk. This seems to be working. Now I see a bunch of Disk Segments starting with sda, including a big one (465 GB) labelled sda_freespace1.
Now when I tried to Create -> Segment, it let me use GPT Segment Manager on sda_freespace1 and allocate a 450 GB disk segment to match the others. (I left 15 GB off each disk with the idea I could put a boot segment in that space, but I’ve never gotten around to it.)
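(For comparison, doing the same partitioning without EVMS would just be ordinary GPT tooling. Roughly, with parted; a sketch from memory, not something I actually ran, and mklabel will happily wipe an existing partition table, so check everything before running anything like this:)

    # Non-EVMS equivalent: GPT label on the new disk, then a ~450 GB partition,
    # leaving the rest free. Sizes and syntax are illustrative.
    parted -s /dev/sda mklabel gpt
    parted -s /dev/sda mkpart primary 0% 450GB
    parted -s /dev/sda set 1 raid on      # optional: flag the partition as Linux RAID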
Now in "Available Objects" there is sda1 with 450.0 ready for me. Alrighty we’re getting there.
Now I look at "Storage Regions" and in the context menu for md/md1 I see an option that says "Add spare to fix degraded array…" I didn’t see it there before — it might have not shown up when there weren’t any spares, or maybe I was just being thick. In any case, selecting it now gives me a menu with one choice — sda1.
Now in details of md/md1 it shows:
Na┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
──│ │──
lv│ Name Value │
lv│ ────────────────────────────────────────────────────────────────────── │
lv│ Major Number 9 │
md│ Minor Number 1 │
│ Name md/md1 │
│ State Discovered, Degraded, Active │
│ Personality RAID5 │
│ + Working SuperBlock │
│ Number of disks 3 │
│ + Disk 1 sdb1 │
│ + Disk 2 sdc1 │
│ Number of stale disks 1 │
│ + Stale disk 0 sda1 │
│ │
│ │
│ │
│ │
│ │
│ Use spacebar on fields marked with "+" to view more information │
│ │
│ [Help] [OK] │
│ │
└────────────────────────────────────────────────────────────────────────┘
That last line about the Stale disk is new.
Actions -> Save commits these changes to disk. Now looking at Detailed Information for md/md1 shows:
Na┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
──│ │──
lv│ Name Value │
lv│ ────────────────────────────────────────────────────────────────────── │
lv│ Major Number 9 │
md│ Minor Number 1 │
│ Name md/md1 │
│ State Discovered, Degraded, Active, Syncing = 0 │
│ Personality RAID5 │
│ + Working SuperBlock │
│ Number of disks 3 │
│ + Disk 1 sdb1 │
│ + Disk 2 sdc1 │
│ Number of stale disks 1 │
│ + Stale disk 0 sda1 │
│ │
│ │
│ │
│ │
│ │
│ Use spacebar on fields marked with "+" to view more information │
│ │
│ [Help] [OK] │
│ │
└────────────────────────────────────────────────────────────────────────┘
Emotionally I feel like I should be done now. But I don't hear the thrashing noise of a half-terabyte of checksums being unwound and copied onto a fresh disk. And it says "Syncing = 0". Hmmm.
I quit evmsn and reload it to see two new messages. One familiar:
MDRaid5RegMgr: Region md/md1 is currently in degraded mode. To bring it back to normal state, add 1 new spare device to replace the faulty or missing device.
And one novel:
MDRaid5RegMgr: RAID5 array md/md1 is missing the member with RAID index 0. The array is running in degrade mode. The MD recovery process is running, please wait…
But this novel message saying it's recovering is numbered 0, implying that it came before the other message (number 1), which tells me I need to take action before it will fix itself. And the drives are not thrashing. Again I look at the details for md/md1 and now I see:
Na┌──────────────────── Detailed Information - md/md1 ─────────────────────┐
──│ │──
lv│ Name Value │
lv│ ────────────────────────────────────────────────────────────────────── │
lv│ Major Number 9 │
md│ Minor Number 1 │
│ Name md/md1 │
│ State Discovered, Degraded, Active, Syncing = 0.3% │
│ Personality RAID5 │
│ + Working SuperBlock │
│ Number of disks 3 │
│ + Disk 1 sdb1 │
│ + Disk 2 sdc1 │
│ + Disk 3 sda1 │
│ │
│ │
│ │
│ │
│ │
│ │
│ Use spacebar on fields marked with "+" to view more information │
│ │
│ [Help] [OK] │
│ │
└────────────────────────────────────────────────────────────────────────┘
Which really seems to say it's doing its thing. Maybe I don't hear the disks because it's formatting the disk first, which is a linear process. Or maybe the whole copy process is very linear and I won't hear it thrashing. Its progress implies it's going to take a couple/few days to finish, which is what I'd expect. So maybe it's working. I'll let it run for a while and see what happens to the array if I try to unplug one of the previously working drives.
Pretty cool that I didn’t even need to unmount the array to do this.
Now if I could just figure out why my laser printer periodically decides it needs to print its internal test page, I'd be even happier.