Replacing a Failed drive in MD RAID 10

DustinB3403

So tomorrow's project (as I'm building backups and heading home for the night) will be how to determine which drive has failed with MDADM, how to physically identify it, and then how to eject the disk from the software array so it can be replaced.

DustinB3403

So to start let's check the array.

Obviously sdc is in a Failed state.
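The status output that showed sdc as failed is not reproduced here. As a rough sketch of the kind of check involved, assuming the kernel's usual /proc/mdstat layout where a faulty member is tagged (F), something like the following could list failed members. The function name failed_members is made up for illustration; mdadm --detail /dev/md0 gives the fuller view.

```shell
# Hypothetical helper: print "<array> <device>" for every member that
# /proc/mdstat marks as faulty, e.g. "sdc[2](F)" on the md0 line.
failed_members() {
    # $1: optional path to an mdstat-format file (defaults to /proc/mdstat)
    awk '/^md/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the "[2](F)" suffix
                print $1, dev
            }
    }' "${1:-/proc/mdstat}"
}

# Usage:
#   failed_members            # reads /proc/mdstat
#   mdadm --detail /dev/md0   # per-member state, needs root
```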

So let's see what smartctl has to say...

  smartctl -i /dev/sdc

Hrm... something is off....

So it would appear I have to update the smartctl database...

DustinB3403

For now I'm leaving SmartCTL as is (I'll have to come back to it). An updated version would be nice for the additional information it could provide about my disks, and it is something I want to update. But the critical point is to get this drive swapped out as quickly as possible so that I can get this server back to good running condition.

Since I don't have hot-swap capabilities, I'm going to have to shut down the server in order to actually perform the disk exchange. Not overly complex, but adds to the risk of having to restore from backup should something go horribly wrong.

DustinB3403

Now there are a few guides that keep popping up in Google Search that give instructions on how to do this for RAID 1 MDADM Arrays.

And even @scottalanmiller has recommended the same above guide for RAID10 and this one on SW. But again RAID1.

So we'll have to work through it and ensure that they are still accurate.

travisdh1

@DustinB3403 Should be, mdadm still works the same way.

DustinB3403

@travisdh1 said:

@DustinB3403 Should be, mdadm still works the same way.

Thanks, just being extra cautious to ensure this works smoothly.

To remove the disk from the array I should have to simply type

mdadm --manage /dev/md0 --fail /dev/sdc

and then

mdadm --manage /dev/md0 --remove /dev/sdc

At this point I should be able to shut down the server with the command below, then remove the disk and add its replacement.

 shutdown -h now

DustinB3403

Obviously at this point there is some manual labor involved since I have no hot-swap capabilities. If your server has hot-swap you can just pull the drive at this point and add the replacement disk.

DustinB3403

I'm at a stand-still as I wait for my replacement disk to arrive, so this project will have to get picked up in a day or so.

travisdh1

@DustinB3403 said:

@travisdh1 said:

@DustinB3403 Should be, mdadm still works the same way.

Thanks, just being extra cautious to ensure this works smoothly.

To remove the disk from the array I should have to simply type
mdadm --manage /dev/md0 --fail /dev/sdc
and then
mdadm --manage /dev/md0 --remove /dev/sdc
At this point I should be able to shut down the server with the command below, then remove the disk and add its replacement.
 shutdown -h now

Yep. After putting a replacement drive in, just add it back.

mdadm --manage /dev/md0 --add /dev/sd?

I like to keep an eye on the rebuild process with:

watch cat /proc/mdstat

Once the rebuild finishes, the array should be back to normal.
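For an unattended check rather than a live watch, a small sketch like this could pull the rebuild percentage out of /proc/mdstat. It assumes the usual "recovery = NN.N%" progress line that the kernel prints while a member is rebuilding; rebuild_progress is a hypothetical helper name, not part of mdadm.

```shell
# Hypothetical helper: print the completion percentage of any running
# recovery or resync, one line per array that is rebuilding.
rebuild_progress() {
    # $1: optional path to an mdstat-format file (defaults to /proc/mdstat)
    sed -En 's/.*(recovery|resync) *= *([0-9.]+%).*/\2/p' "${1:-/proc/mdstat}"
}

# Usage:
#   rebuild_progress          # prints nothing when no rebuild is running
```

No output means no array is currently rebuilding, so a script can treat an empty result as "done or not started".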

coliver

How did you figure out what drive it was in the array? Or did you pull them until you saw the one with that serial number?

DustinB3403

@coliver said:

How did you figure out what drive it was in the array? Or did you pull them until you saw the one with that serial number?

How do I know which disk it is?

Well, the other day I noticed that the array had a failed disk. Since I was rebuilding the system anyway, I pulled each disk and ran a check disk from Windows, checking for bad sectors.

Only 1 disk was found with bad sectors.

Knowing which disk this was, and with Windows saying it had fixed the problem, I re-added the disk and simply "remembered" which one had the bad sectors.

So this disk is the disk that has to be removed.

coliver

@DustinB3403 said:

@coliver said:

How did you figure out what drive it was in the array? Or did you pull them until you saw the one with that serial number?

How do I know which disk it is?

Well, the other day I noticed that the array had a failed disk. Since I was rebuilding the system anyway, I pulled each disk and ran a check disk from Windows, checking for bad sectors.

Only 1 disk was found with bad sectors.

Knowing which disk this was, and with Windows saying it had fixed the problem, I re-added the disk and simply "remembered" which one had the bad sectors.

So this disk is the disk that has to be removed.

OK, so you wouldn't be able to figure this out from the Linux CLI; you would have to have a record of which serial number is in each bay.

DustinB3403

@coliver Pretty much.

Since there is no hot-swap function on my server (no indicator lights either) it's simply a matter of my knowing which disk is connected to which SATA port.
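For what it's worth, the serial number itself can usually be read from the CLI, so the record only needs to map serials to bays or SATA ports. A minimal sketch, assuming smartctl -i prints a "Serial Number:" line for these drives (it normally does for ATA disks); drive_serial is a hypothetical helper name.

```shell
# Hypothetical helper: read `smartctl -i` style text on stdin and print
# the value of the "Serial Number:" line.
drive_serial() {
    awk -F': *' 'tolower($0) ~ /^serial number:/ { print $2 }'
}

# Usage (needs root and smartmontools):
#   smartctl -i /dev/sdc | drive_serial
# The printed serial can then be matched against the label on the drive.
```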

DustinB3403

So at this point I have the disk marked as failed, and removed from the array as shown below.

As you can see, sdc is not a part of the array at the moment, which means nothing will be written to the disk. Obviously I'm at a dangerous point in time.

If I can't get my replacement disk soon, I risk losing the entire array.

Now, because I've already had issues with this array (specifically this disk), I have nothing running on this system that I don't have several backups of. So the drive has been ordered and will be here in a day or so.

At which point I'll shut down the server, remove the bad disk, and put the new one in.

DustinB3403

While I wait for that drive to arrive, I'm going to figure out how to configure email alerts for the mdadm array. Seeing as this would be incredibly useful to have.

Since I can't sit here watching the cat /proc/mdstat....
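The alert setup itself is not shown in this thread. One common approach, assumed here, is a MAILADDR line in mdadm.conf, which mdadm's monitor mode uses when it sees events such as Fail or DegradedArray. The sketch below only checks that an address is configured; the helper name mailaddr_of and the default path /etc/mdadm.conf are assumptions (Debian-based systems use /etc/mdadm/mdadm.conf).

```shell
# Hypothetical helper: print the address from the MAILADDR line of an
# mdadm.conf-style file, or nothing if no address is configured.
mailaddr_of() {
    # $1: optional config path (assumed default: /etc/mdadm.conf)
    awk 'toupper($1) == "MAILADDR" { print $2 }' "${1:-/etc/mdadm.conf}"
}

# Usage:
#   mailaddr_of                               # show the configured address
#   mdadm --monitor --scan --oneshot --test   # send a test alert per array (root)
```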

travisdh1

@DustinB3403 said:

While I wait for that drive to arrive, I'm going to figure out how to configure email alerts for the mdadm array. Seeing as this would be incredibly useful to have.

Since I can't sit here watching the cat /proc/mdstat....

No remote ssh access?

DustinB3403

@travisdh1 I do have access, but I'm still not going to sit here and watch it.

DustinB3403

So now that I have the email alerts configured for my Xen Servers, I really want to work on updating SmartCTL so it supports the drives that I have in this server.

Which are pretty common drives.

Western Digital Red 1TB.

I'm really surprised how old of a database is built into XenServer 6.5.

So time to figure this part out.

JaredBusch

@DustinB3403 said:

So now that I have the email alerts configured for my Xen Servers, I really want to work on updating SmartCTL so it supports the drives that I have in this server.

Which are pretty common drives.

Western Digital Red 1TD.

I'm really surprised how old of a database is built into XenServer 6.5.

So time to figure this part out.

WTF is a TD?

DustinB3403

@JaredBusch said:

@DustinB3403 said:

So now that I have the email alerts configured for my Xen Servers, I really want to work on updating SmartCTL so it supports the drives that I have in this server.

Which are pretty common drives.

Western Digital Red 1TD.

I'm really surprised how old of a database is built into XenServer 6.5.

So time to figure this part out.

WTF is a TD?

That would be a typo, whoops.

1TB.