Replacing the Dead IPOD, SAN Bit the Dust

JaredBusch

Not knowing the workload of the two hosts, it is really a no-brainer to assume that neither system was pegged.

I would bulk up storage for the better host and get things there until it cries.

Then see how much more I got and add things to the second host if required.

DustinB3403

@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:

Why not RLS a la StarWind ?

Because the hosts aren't uniform.

DustinB3403

@JaredBusch said in Replacing the Dead IPOD, SAN Bit the Dust:

Not knowing the workload of the two hosts, it is really a no-brainer to assume that neither system was pegged.

I would bulk up storage for the better host and get things there until it cries.

Then see how much more I got and add things to the second host if required.

Pretty much what I was going to say: add storage to the newer, more powerful unit (assuming the hosts weren't pegged) and import the data to local storage.

Get a new backup device for new backups and go from there.

StrongBad

@DustinB3403 said in Replacing the Dead IPOD, SAN Bit the Dust:

@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:

Why not RLS a la StarWind ?

Because the hosts aren't uniform.

Could make them uniform, of course.

JaredBusch

The better choice for this would be to dump the existing infrastructure and just migrate it all to a @scale solution.

Yeah it is more expensive than building up the existing hosts.

But it is obvious the company has no idea what it is doing with this gear. So take that out of the equation by getting on a managed solution.

DustinB3403

@StrongBad said in Replacing the Dead IPOD, SAN Bit the Dust:

@DustinB3403 said in Replacing the Dead IPOD, SAN Bit the Dust:

@dafyre said in Replacing the Dead IPOD, SAN Bit the Dust:

Why not RLS a la StarWind ?

Because the hosts aren't uniform.

Could make them uniform, of course.

You could, but will the existing system withstand the time needed to get the hardware platform uniform and functional?

Will both hosts support the same amount of storage, the same RAM, CPU, etc.?

Is it cost effective to go down that route, versus just getting a single stable server?

scottalanmiller

First question is: is failover needed? "Reading back" what was there in the past, there was an EQL SAN as a single point of failure without a failover device (dual controllers are not failover in any sense). So historically they've been running without high availability. So the big question is... do they need it now? If high availability is needed now, why wasn't it needed in the past?

Two videos worth watching on this:

https://mangolassi.it/topic/11324/scott-alan-miller-smb-system-architectural-patterns


https://mangolassi.it/topic/25/mainframe-architectural-pattern-for-smb-it-scott-alan-miller-speaking-at-spicecorps-dfw-2012


scottalanmiller

@JaredBusch said in Replacing the Dead IPOD, SAN Bit the Dust:

The better choice for this would be to dump the existing infrastructure and just migrate it all to a @scale solution.

Yeah it is more expensive than building up the existing hosts.

But possibly NOT as expensive as replacing the SAN itself. So while it's not cheap compared to what they could do, it might be cheap compared to what they expected to do.

JaredBusch

@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:

@JaredBusch said in Replacing the Dead IPOD, SAN Bit the Dust:

The better choice for this would be to dump the existing infrastructure and just migrate it all to a @scale solution.

Yeah it is more expensive than building up the existing hosts.

But possibly NOT as expensive as replacing the SAN itself. So while it's not cheap compared to what they could do, it might be cheap compared to what they expected to do.

Correct, but I am assuming that someone here is telling them to shit on the SAN anyway...

scottalanmiller

@JaredBusch said in Replacing the Dead IPOD, SAN Bit the Dust:

@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:

@JaredBusch said in Replacing the Dead IPOD, SAN Bit the Dust:

The better choice for this would be to dump the existing infrastructure and just migrate it all to a @scale solution.

Yeah it is more expensive than building up the existing hosts.

But possibly NOT as expensive as replacing the SAN itself. So while it's not cheap compared to what they could do, it might be cheap compared to what they expected to do.

Correct, but I am assuming that someone here is telling them to shit on the SAN anyway...

Probably. But maybe they don't realize how expensive and how bad of an idea that is. The cost analysis should be crazy.

dafyre

Are they back in an operational state at the moment, or still waiting on used parts delivery?

scottalanmiller

Waiting on parts, but they will have those very soon.

Aconboy

@JaredBusch Not that much more expensive and far more reliable for the job at hand

wrx7m

@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:

Waiting on parts, but they will have those very soon.

The parts mentioned are the SAN controllers?

scottalanmiller

@wrx7m said in Replacing the Dead IPOD, SAN Bit the Dust:

@scottalanmiller said in Replacing the Dead IPOD, SAN Bit the Dust:

Waiting on parts, but they will have those very soon.

The parts mentioned are the SAN controllers?

Yes, they need two new SAN controllers and one new backplane.

NerdyDad

Thanks @scottalanmiller for helping me out with this predicament.

Current status of the SAN: firmware is as updated as it can go right now. I have 2 drives that are rebuilding in a RAID 6 array. I have one more drive that is warning me about potential failure, but I am not going to replace it until the other 2 are done rebuilding. The SAN is a Dell EqualLogic PS5000X. The controller firmware is one version behind the latest.

Host is a Dell PowerEdge R610 with 86 GB of RAM and 16 vCPUs running VMware ESXi 6.0. This host currently supports 3 VMs, totaling about 350 GB of production data. 2 of these VMs are on the local datastore of the host, but 1 VM, the one we need, is on that SAN. It totals 220 GB of data. There are no backups (my mistake).
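To put the 220 GB figure in perspective, here is a rough sketch of how long the copy would take at a few sustained transfer rates. The rates are hypothetical placeholders, not measurements from this SAN, and `transfer_hours` is just an illustrative helper name; a degraded, rebuilding array may well deliver less than any of them.

```python
def transfer_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy size_gb at a sustained rate in MB/s (1 GB = 1000 MB)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000 / throughput_mb_s / 3600


SAN_VM_GB = 220  # from the post: the VM stranded on the SAN

# Assumed sustained rates (hypothetical); real numbers depend on the array's state
for rate in (10, 50, 100):
    print(f"{rate:>3} MB/s -> {transfer_hours(SAN_VM_GB, rate):.1f} h")
```

The point is only that the window in which the SAN stays up has to be longer than the copy time at whatever rate it can actually sustain.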

We've tried flip-flopping failovers between the controllers, and it only lasts so long: long enough to boot the VM back up, but not long enough to actually back up the data. The backplane has been replaced. We tried replacing the controllers, and all of the disks turned orange instead of green. We went back to the original controller and the array began to operate normally again.

Dell support has advised us to let the array continue rebuilding; it was at 17%. Once done, I'm going to attempt to connect to it again and try to pull off the data. Support guy thought that we were overtaxing the SAN and basically freezing it up.

Besides retiring the thing, are there any pointers that I should consider in order to ensure that the backup or migration is a success?

scottalanmiller

I'd say that there are probably three key options for this as broad stroke approaches, each is valuable for its own reasons:

Mainframe: Just put disks in the local machines and do away with the clustering. The clustering added cost and risk without any actual benefits in the past. So why carry any of that forward? Just put disks into the local machines for the lowest cost, simplest solution. Points of failure are reduced, overall risk is reduced, bottlenecks are removed, and flexibility is increased, all for the lowest cost of investment. Costs nearly nothing, very effective, no downsides compared to the old solution. All positive movement.
Self Made Cluster: Replicated Local Disks and a hypervisor with high availability like is in place today. This is more costly and likely means some hardware upgrades to get the two hosts closer together, but with only two hosts it is very low cost and will provide dramatically more protection than the old approach.
Hyperconvergence: Do a full update moving to a totally hyperconverged product that provides complete support top to bottom. This is the most costly but replaces all hardware, gets inclusive support and requires the least internal IT effort.

scottalanmiller

@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:

I have 2 drives that are rebuilding in a RAID 6 array. I have one more drive that is warning me about potential failure, but I am not going to replace it until the other 2 are done rebuilding.

Oh no, that isn't good. Two lost controllers and two lost drives on RAID 6? What's the projected rebuild time? A week at least, I would guess. It's almost better to not bother replacing the drives and just take a backup.
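Why this is worrying comes down to simple arithmetic: RAID 6 carries two drives' worth of parity, so it survives any two member failures and no more. A minimal sketch, with `remaining_tolerance` as a purely illustrative helper:

```python
RAID6_PARITY_DRIVES = 2  # RAID 6 survives the loss of any two member drives


def remaining_tolerance(degraded_drives: int, parity: int = RAID6_PARITY_DRIVES) -> int:
    """Further drive failures the array can absorb; a negative value means data loss."""
    return parity - degraded_drives


print(remaining_tolerance(2))  # two drives rebuilding: nothing left in reserve
print(remaining_tolerance(3))  # if the drive that is warning also fails
```

With two drives rebuilding, the array has no redundancy left until at least one rebuild completes, which is why the third drive's warning matters so much.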

scottalanmiller

@NerdyDad said in Replacing the Dead IPOD, SAN Bit the Dust:

Support guy thought that we were overtaxing the SAN and basically freezing it up.

He is likely correct. That is generally expected with a RAID 6 rebuild, especially with two drives rebuilding at once.

NerdyDad

@scottalanmiller I likely won't put in my last spare drive unless I absolutely have to. My main end goal is to somehow migrate the data and retire the SAN. It went from 0-17% in about 3 hours. I'm going to let it continue and hopefully it will be done in the morning. I will check on it once I get back to the office.
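As a back-of-the-envelope check on the "done in the morning" hope, the 0-17% in about 3 hours figure can be extrapolated. This sketch assumes the rebuild progresses linearly, which is an assumption and not something the array guarantees; rebuilds often slow down when the SAN is also serving I/O. The function name is illustrative only.

```python
from typing import Tuple


def rebuild_eta_hours(percent_done: float, hours_elapsed: float) -> Tuple[float, float]:
    """Return (total_hours, remaining_hours), assuming linear rebuild progress."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total = hours_elapsed * 100 / percent_done
    return total, total - hours_elapsed


# Figures from the post: 17% complete after about 3 hours
total, remaining = rebuild_eta_hours(17, 3)
print(f"estimated total: {total:.1f} h, remaining: {remaining:.1f} h")
```

Under that linear assumption the rebuild would need roughly 15 more hours, so an overnight finish is plausible only if the pace holds.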