Invalid Drive Movement from HP SmartArray P411 RAID Controller with StorageWorks MSA60
-
Due to Hurricane Matthew, our company shut down all of its servers for two days. One of the servers was an ESXi host with an attached HP StorageWorks MSA60.
When we logged into the vSphere client, we noticed that none of our guest VMs were available (they're all listed as "inaccessible"). When I look at the hardware status in vSphere, the array controller and all attached drives appear as "Normal", but the drives all show up as "unconfigured disk".
We rebooted the server and tried going into the RAID config utility to see what things look like from there, but we received the following message:
"An invalid drive movement was reported during POST. Modifications to the array configuration following an invalid drive movement will result in loss of old configuration information and contents of the original logical drives".
Needless to say, we're very confused by this because nothing was "moved"; nothing changed. We simply powered up the MSA and the server, and have been having this issue ever since.
I have two main questions/concerns:
-
Since we did nothing more than power the devices off and back on, what could've caused this to happen? I of course have the option to rebuild the array and start over, but I'm leery about the possibility of this happening again (especially since I have no idea what caused it).
-
Is there a snowball's chance in hell that I can recover our array and guest VMs, instead of having to rebuild everything and restore our VM backups?
-
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I have two main questions/concerns:
- Since we did nothing more than power the devices off and back on, what could've caused this to happen? I of course have the option to rebuild the array and start over, but I'm leery about the possibility of this happening again (especially since I have no idea what caused it).
Any number of things. Do you schedule reboots on all your equipment? If not, you really should, for just this reason. On the one server we have, XS decided the array wasn't ready in time and didn't mount the main storage volume on boot. Always nice to know these things ahead of time, right?
- Is there a snowball's chance in hell that I can recover our array and guest VMs, instead of having to rebuild everything and restore our VM backups?
Possibly, but I've never seen that particular error. We're talking very limited experience here. Depending on which RAID controller the MSA is connected to, you might be able to read the array information from the drives on Linux using the md utilities, but at that point it's quicker just to restore from backups.
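If you do go down that road, here is a minimal sketch of the kind of read-only inspection I mean, assuming the drives are attached individually to a Linux box. Big caveat: the P411 writes its own proprietary metadata, so mdadm may well report nothing usable.
lsblk -o NAME,SIZE,MODEL              # list the disks the Linux box sees; sizes help confirm you grabbed the right ones
mdadm --examine /dev/sdX              # dump any RAID metadata mdadm recognizes on a member disk (/dev/sdX is a placeholder)
mdadm --assemble --scan --readonly    # only if the members are recognized; assembles read-only so nothing gets written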
-
@travisdh1 said in "Invalid Drive Movement" (HP Smart Array P411):
Any number of things. Do you schedule reboots on all your equipment? If not, you really should, for just this reason. On the one server we have, XS decided the array wasn't ready in time and didn't mount the main storage volume on boot. Always nice to know these things ahead of time, right?
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@travisdh1 said in "Invalid Drive Movement" (HP Smart Array P411):
Any number of things. Do you schedule reboots on all your equipment? If not, you really should, for just this reason. On the one server we have, XS decided the array wasn't ready in time and didn't mount the main storage volume on boot. Always nice to know these things ahead of time, right?
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
Does your normal reboot schedule for your server include a reboot of the MSA? Could it be that they were powered back on in the incorrect order? MSAs are notoriously flaky; that is likely where the issue is.
I'd call HPE support. The MSA is a flaky unit but HPE support is quite good.
-
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
Does your normal reboot schedule for your server include a reboot of the MSA? Could it be that they were powered back on in the incorrect order? MSAs are notoriously flaky; that is likely where the issue is.
I'd call HPE support. The MSA is a flaky unit but HPE support is quite good.
We unfortunately don't have a "normal reboot schedule" for ANY of our servers :-/...
I'm not even sure what the correct order is :-S... I would assume that the MSA would get powered on first, then the ESXi host. If this is correct, we have already tried doing that since we first discovered this issue today, and the issue remains :(.
We don't have a support contract on this server or the attached MSA, and they're likely way out of warranty (ProLiant DL360 G8 and a StorageWorks MSA60), so I'm not sure how much we'd have to spend in order to get HP to "help" us :-S...
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
Does your normal reboot schedule for your server include a reboot of the MSA? Could it be that they were powered back on in the incorrect order? MSAs are notoriously flaky; that is likely where the issue is.
I'd call HPE support. The MSA is a flaky unit but HPE support is quite good.
We unfortunately don't have a "normal reboot schedule" for ANY of our servers :-/...
I should not have said "schedule." I should have said your "normal reboot process." Regardless of the regularity of the reboots, is the process a standard one?
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I'm not even sure what the correct order is :-S... I would assume that the MSA would get powered on first, then the ESXi host. If this is correct, we have already tried doing that since we first discovered this issue today, and the issue remains :(.
You are correct, the MSA needs to be up first.
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
We don't have a support contract on this server or the attached MSA, and they're likely way out of warranty (ProLiant DL360 G8 and a StorageWorks MSA60), so I'm not sure how much we'd have to spend in order to get HP to "help" us :-S...
A bit. Why is there an MSA out of contract? The only benefit to an MSA is the support contract. Not that that makes it worth it, but proprietary storage requires a warranty contract to be viable. The rule is that any storage of that nature needs to be decommissioned the day before the support contract runs out, because there isn't necessarily any path to recovery in the event of an "incident" without one. It's not a standard server that you can just fix yourself with third-party parts. Sometimes you can, but as it is a closed, proprietary system, you are generally totally dependent on your support contract from the vendor to keep it working.
There is a good chance that this is a "replace the MSA and restore from backup" situation in that case.
-
Because the scenario you are in is not one that should arise, I am going to guess that tracking down info on this will be difficult. But here is something that I found; it's worth trying while we look for something more helpful.
-
I see that you have this posted here as well: http://serverfault.com/questions/807892/how-to-recover-from-invalid-drive-movement-hp-smartarray-p411
-
Hopefully the controller offers an option to continue even with the invalid drive movement, but it might not. Updating the firmware might enable that, or might not.
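If you want to see exactly what the controller and its firmware report before changing anything, something like the following from the ESXi shell should show it (assuming the HP utilities bundle that provides hpssacli is installed on the host; the tool name and install path vary a bit between bundle versions):
/opt/hp/hpssacli/bin/hpssacli ctrl all show detail          # controller model, firmware revision, and status
/opt/hp/hpssacli/bin/hpssacli ctrl all show config detail   # how the controller currently sees the physical and logical drives
If the firmware turns out to be old, the matching Service Pack for ProLiant is the usual route for updating it.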
-
If you have no support options and get desperate, you could try something really drastic, like wiping the array controller's configuration and forcing it to pick the array up as if it had never seen the drives before. That might work, but it is risky, and I would not do it unless you have exhausted other options.
-
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
Does your normal reboot schedule for your server include a reboot of the MSA? Could it be that they were powered back on in the incorrect order? MSAs are notoriously flaky; that is likely where the issue is.
I'd call HPE support. The MSA is a flaky unit but HPE support is quite good.
We unfortunately don't have a "normal reboot schedule" for ANY of our servers :-/...
I should not have said "schedule." I should have said your "normal reboot process." Regardless of the regularity of the reboots, is the process a standard one?
I'm not sure we have a "standard"... we only reboot this particular ESXi host when absolutely necessary, and this weekend is possibly the first time we've rebooted the MSA in a year or more :-S...
-
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
We don't have a support contract on this server or the attached MSA, and they're likely way out of warranty (ProLiant DL360 G8 and a StorageWorks MSA60), so I'm not sure how much we'd have to spend in order to get HP to "help" us :-S...
A bit. Why is there an MSA out of contract? The only benefit to an MSA is the support contract. Not that that makes it worth it, but proprietary storage requires a warranty contract to be viable. The rule is that any storage of that nature needs to be decommissioned the day before the support contract runs out, because there isn't necessarily any path to recovery in the event of an "incident" without one. It's not a standard server that you can just fix yourself with third-party parts. Sometimes you can, but as it is a closed, proprietary system, you are generally totally dependent on your support contract from the vendor to keep it working.
There is a good chance that this is a "replace the MSA and restore from backup" situation in that case.
Unfortunately, my company's philosophy on "investing in IT infrastructure" goes like this: "We'll spend hundreds to thousands of dollars every time our PACS vendor tells us they need it. Then, when they say that they need to upgrade their equipment, we'll re-purpose their old stuff for the rest of our production environment (because we don't understand the importance of spending money on the rest of our infrastructure, and we don't trust the knowledgeable people we hired in our IT department)"
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
Does your normal reboot schedule for your server include a reboot of the MSA? Could it be that they were powered back on in the incorrect order? MSAs are notoriously flaky; that is likely where the issue is.
I'd call HPE support. The MSA is a flaky unit but HPE support is quite good.
We unfortunately don't have a "normal reboot schedule" for ANY of our servers :-/...
I should not have said "schedule." I should have said your "normal reboot process." Regardless of the regularity of the reboots, is the process a standard one?
I'm not sure we have a "standard"... we only reboot this particular ESXi host when absolutely necessary, and this weekend is possibly the first time we've rebooted the MSA in a year or more :-S...
For the future, sadly it is too late now, but consider these things...
- A monthly reboot of everything, at the very least, not just some components, lets you test that things are really working, and at a time when you can best fix them.
- Avoid devices like the MSA in general, they add a lot of risk fundamentally.
- Avoid any proprietary "black box" system that is out of support. While these systems can be good when under support, the moment that they are out of support their value hits a literal zero. They are effectively bricks. Would you consider running the business on a junk consumer QNAP device? This device, when out of support, is far worse.
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
We don't have a support contract on this server or the attached MSA, and they're likely way out of warranty (ProLiant DL360 G8 and a StorageWorks MSA60), so I'm not sure how much we'd have to spend in order to get HP to "help" us :-S...
A bit. Why is there an MSA out of contract? The only benefit to an MSA is the support contract. Not that that makes it worth it, but proprietary storage requires a warranty contract to be viable. The rule is that any storage of that nature needs to be decommissioned the day before the support contract runs out, because there isn't necessarily any path to recovery in the event of an "incident" without one. It's not a standard server that you can just fix yourself with third-party parts. Sometimes you can, but as it is a closed, proprietary system, you are generally totally dependent on your support contract from the vendor to keep it working.
There is a good chance that this is a "replace the MSA and restore from backup" situation in that case.
Unfortunately, my company's philosophy on "investing in IT infrastructure" goes like this: "We'll spend hundreds to thousands of dollars every time our PACS vendor tells us they need it. Then, when they say that they need to upgrade their equipment, we'll re-purpose their old stuff for the rest of our production environment (because we don't understand the importance of spending money on the rest of our infrastructure, and we don't trust the knowledgeable people we hired in our IT department)"
Simply explain that an unsupported MSA is a dead device, totally useless. When asked to use it, explain that it's not even something you'd play around with at home.
Even a brand new, supported MSA falls below my home line. But once out of support, it's below any home line.
-
What have you been trying thus far? What's your current triage strategy assuming that we can't fix this?
-
Edited to add tags and upgrade the title for SEO and rapid visual determination.
-
You guys are not going to believe this...
First I attempted a fresh cold boot of the existing MSA, waited a couple of minutes, then powered up the ESXi host, but the issue remained. I then shut down the host and the MSA, moved the drives into our spare MSA, powered it up, waited a couple of minutes, then powered up the ESXi host; the issue still remained.
At that point, I figured I was pretty much screwed; during the initialization of the RAID controller, there was no option to re-enable a failed logical drive. So I booted into the RAID config utility, verified again that there were no logical drives present, and created a new logical drive (RAID 1+0 with two spare drives, the same as we did about two years ago when we first set up this host and storage).
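(For the record, I did all of that through the boot-time config utility. I believe the hpssacli equivalent would look roughly like the lines below, where the slot number and drive bay IDs are placeholders rather than my actual layout.)
hpssacli ctrl slot=1 create type=ld drives=allunassigned raid=1+0   # new RAID 1+0 logical drive from the unassigned disks
hpssacli ctrl slot=1 array A add spares=1E:1:7,1E:1:8               # attach two hot spares to the new array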
Then I let the server boot back into vSphere and accessed it via vCenter. The first thing I did was remove the host from inventory and then re-add it (I was hoping to clear all of the inaccessible guest VMs this way, but it didn't clear them from the inventory). Once the host was back in my inventory, I removed each of the guest VMs one at a time.
Once the inventory was cleared, I verified that no datastore existed and that the disks were basically sitting there, ready and waiting as "data disks". So I went ahead and created a new datastore (again, the same as we did a couple of years ago, using VMFS). I was eventually prompted to specify a mount option, and one of the choices was "keep the existing signature". At that point, I figured it was worth a shot to keep the signature; if things didn't work out, I could always blow it away and re-create the datastore.
After I finished building the datastore with the "keep the existing signature" option, I tried navigating to the datastore to see if anything was in it, and it appeared empty. Just out of curiosity, I SSH'd to the host and checked from there, and to my surprise, I could see all my old data and all my old guest VMs! I went back into vCenter, re-scanned storage, refreshed the console, and all of our old guest VMs were there! I re-registered each VM and was able to recover everything. All of our guest VMs are back up and successfully communicating on the network.
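Side note for anyone who finds this thread later: from what I've read, the same "keep the existing signature" decision can be driven from the ESXi shell, roughly like this (just a sketch; the UUID, datastore name, and VM path are placeholders, not from my environment):
esxcfg-volume -l                                                 # list VMFS volumes ESXi treats as snapshots/unresolved
esxcfg-volume -M 52e4xxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx             # mount one persistently while keeping its existing signature
vim-cmd solo/registervm /vmfs/volumes/datastore1/MyVM/MyVM.vmx   # re-register a guest VM from the recovered datastore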
I think most people in the IT community would agree that the odds of a recovery like this working out are extremely low, bordering on impossible.
As far as I'm concerned, this was a miracle of God...
-
That is seriously amazing!