Burned by Eschewing Best Practices

Carnival Boy

Yeah, but cost isn't an issue as money is no object.

Dashrender

@Carnival-Boy said:

Yeah, but cost isn't an issue as money is no object.

Any business that says that is just trying to fail! The whole point of a for-profit company is to make money, and they should be doing so with smart spending.

Carnival Boy

I'm not disagreeing, but if the OP says money is no object then you should treat that as fact. Maybe he has a magic money tree. Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point. The point I'm trying to understand is why dual SANs and dual switches equals this pyramid of doom thing.

DustinB3403

It simply creates a larger pyramid, with more parts, which makes the entire system way more complex to troubleshoot and fix should something happen.

It doesn't force the system to be less reliable when compared to the standard 3-2-1 model, as you are in fact creating a level of redundancy by implementing a 2nd SAN to back up the first.

But it's just wasteful in most cases.

coliver

@DustinB3403 said:

It simply creates a larger pyramid, with more parts, which makes the entire system way more complex to troubleshoot and fix should something happen.

It doesn't force the system to be less reliable when compared to the standard 3-2-1 model, as you are in fact creating a level of redundancy by implementing a 2nd SAN to back up the first.

But it's just wasteful in most cases.

Basically this. You aren't any more reliable than the dual host scenario and you've introduced several more layers of potential failure to your system.

There is a point where this makes sense... but not at 6 servers and two physical hosts. I'm not sure where the tipping point is but probably at the hundreds of virtual servers mark.

Dashrender

@coliver said:

You aren't any more reliable than the dual host scenario and you've introduced several more layers of potential failure to your system.

You are actually mathematically substantially less reliable with that setup, at least from a hardware failure perspective.
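To put rough numbers on that, here is a minimal sketch of the usual series/parallel availability arithmetic. It assumes hardware failures are independent and failover is perfect, and it assumes the "dual host" case means two hosts with local, replicated storage. The availability figures and the function names are made-up placeholders for illustration, not measurements.

```python
def parallel(*availabilities):
    """Redundant units: the group is up if at least one unit is up."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def series(*availabilities):
    """Dependent layers: the stack is up only if every layer is up."""
    p_up = 1.0
    for a in availabilities:
        p_up *= a
    return p_up


# Hypothetical per-device availabilities (placeholders, not vendor figures).
HOST = 0.999
SWITCH = 0.9995
SAN = 0.999

# Two hosts on their own (assumes local storage replicated between them).
two_hosts = parallel(HOST, HOST)

# 2-2-2: two hosts depend on two switches, which depend on two SANs.
two_two_two = series(
    parallel(HOST, HOST),
    parallel(SWITCH, SWITCH),
    parallel(SAN, SAN),
)

print(f"two hosts alone: {two_hosts:.8f}")
print(f"2-2-2 stack:     {two_two_two:.8f}")
```

In this model the 2-2-2 stack can never beat the two hosts alone, because every extra layer multiplies in a factor below 1. How big the gap is depends entirely on the numbers you plug in, and this only covers hardware failure.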

Carnival Boy

You got any facts to back that up? I find it extremely difficult to evaluate reliability. Anyway, you can't just judge it from a hardware failure perspective, since we're comparing hardware redundancy versus software redundancy (eg DAGs, file syncing). Both are complicated. Both require expertise to administer and both are risky.

scottalanmiller

@Carnival-Boy said:

I'm not disagreeing, but if the OP says money is no object then you should treat that as fact.

I don't agree. Knowing someone is wrong, confused or doesn't understand something is exactly when they need help most, not the least. Tons and tons of what we do in IT is recognizing when people don't know what they need to know and helping them. In a case like this where we know they have to be wrong and don't understand what they are doing, should we really help them hurt themselves?

I totally get that this goes against my "always give people the benefit of the doubt" theory about never hurting the innocent to protect the guilty, but this is a case where money is never no object. It's simply not true, and it means someone desperately needs help and doesn't understand that they don't know.

scottalanmiller

@Carnival-Boy said:

Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point.

That would actually make money the ONLY object. Budget would be the whole concern, not just part of it.

Carnival Boy

Well, ok, but the OP isn't actually on ML so it's a moot point. What I'm really interested in is what the problem is with his solution (ignoring the financial cost) and why it is one of your inverted pyramid thingies. I'm not arguing, I just don't understand and want to learn.

DustinB3403

An inverted pyramid of doom (IPOD) looks stable and reliable when looking at it from the top: you have a bunch of equipment that supposedly will fail over between the devices.

What isn't obvious is that, if you look at it from the side, you have individual layers of equipment, each of which is dependent on the layers above or below itself.

So in the simplest example, 3-2-1, you have 1 NAS (or SAN), 2 switches and 3 servers.

The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)

What this means is that there are many potential points of failure, and that in the most basic 3-2-1 approach the "reliability" isn't reliable at all. It is only as reliable as your weakest link, which is often the NAS (or SAN).
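The "weakest link" point can be sketched in a few lines. This is only an illustration, assuming independent failures and perfect failover within each layer; the availability numbers and function names are hypothetical.

```python
from math import prod


def layer_availability(unit_availability, count):
    """A layer is up if at least one of `count` identical units is up."""
    return 1.0 - (1.0 - unit_availability) ** count


def stack_availability(layers):
    """Every layer must be up. `layers` is a list of (unit_availability, count)."""
    return prod(layer_availability(a, n) for a, n in layers)


# Hypothetical per-device availabilities (placeholders).
HOST, SWITCH, SAN = 0.999, 0.9995, 0.999

three_two_one = stack_availability([(HOST, 3), (SWITCH, 2), (SAN, 1)])
ten_two_one = stack_availability([(HOST, 10), (SWITCH, 2), (SAN, 1)])

print(f"single SAN alone: {SAN:.6f}")
print(f"3-2-1 stack:      {three_two_one:.6f}")
print(f"10-2-1 stack:     {ten_two_one:.6f}")
```

However many hosts you stack on top, the result stays below the availability of the single storage device at the bottom.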

Because if any part of that chain breaks the whole system can and likely will come crashing down. Here's a really good explanation from the one and only SAM

DustinB3403

In addition, and outside of what was brought up in the SW topic, there were likely many other Best Practices that were not followed by the OP on SW, which led to him getting burned, regardless of what hypervisor his employer uses.

The reason I say this is that the SWOP has stated he was already burned by Citrix Support, which seems very odd, as support didn't design the system to fail; they are trying to recover a failed system.

In summation, the SWOP has a system that was improperly set up (likely by Eschewing Best Practices) for the benefit of quick deployment, and he doesn't understand how and why he got burned.

It has nothing to do with Citrix, unless Citrix saw the state of their system and how things were configured and said "Nope Nope Nope, we can't help you as everything you've set up is completely ignoring Best Practice Recommendations in its configuration, it has to be rebuilt." And "We won't support the system in this configuration."

Which is probably how the conversation went.

Carnival Boy

But isn't this 2-2-2 and not 3-2-1? I'm still not getting it.....

"The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device"

There is no single storage device here. Isn't it a "Tower of Redundancy" rather than a "Pyramid of Doom"? An expensive tower, but a tower. Or maybe a folly.

Carnival Boy

And what's the difference between "Inverted Pyramid of Doom" and the traditional term "Single Point of Failure (SPOF)", as in "a single SAN is a SPOF and therefore a bad solution. You need at least two for redundancy"?

DustinB3403

This is still an inverted pyramid of doom (IPOD), because the servers are dependent on the NAS(es). It's an improved IPOD (if such a thing could exist), but there are still many points that can fail.

That makes it an overly complicated solution that, by design, reduces the reliability of the system as a whole, including its recoverability and stability.

DustinB3403

A Single Point of Failure by itself won't bring the entire organization down.

Only that point, and what it hosts, is unavailable until it's fixed.

DustinB3403

The best way to think of a SPOF is to take any single server and unplug it, without any other backup servers for its functions to migrate to.

That is a SPOF. A system or server, that runs alone, hosting whatever it might be. And when it's down, it and only it are down until the problem is repaired.

Carnival Boy

@DustinB3403 said:

That is a SPOF. A system or server, that runs alone, hosting whatever it might be. And when it's down, it and only it are down until the problem is repaired.

That's not my understanding of SPOF. In the context of the OP, the "system" contains various pieces of hardware (hosts, switches & SANs). If he lacks redundancy in one area of this system (for example, by only having one switch), then that piece of non-redundant hardware is a SPOF. In the pyramid analogy, it is the '1' in 3-2-1 that represents a non-redundant component and the '1' is the SPOF.

DustinB3403

It still represents the same single point of failure. Any device (including a network switch, NAS, server, or network cable) that doesn't have a redundant "fail-safe" is a SPOF.

dafyre

The trick when building a "system" of anything... is to always be searching for things that have become an SPOF. So let's start with 3 servers and 2 x SANs (Network RAID-1, redundant, automatic failover, etc, etc), and 1 x Switch all in the same building connected to the same power grid and circuits...

The first SPOF is the Network switch. How do we fix it? Add another Network switch (this is assuming that every part of this system is in the same data center / rack).

The next is the fact that they are all on the same circuit. Have the electrician separate them out.

What happens if the power blips? Need UPSes for each circuit.
What happens if there's an extended power outage? Need a good generator capable of running for hours or days as necessary.

What about cooling? That goes on its own circuit and hopefully is also connected to the generator...
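That hunt can be sketched as a simple inventory check. This is just an illustration of the walk-through above, assuming each "layer" of the system is reduced to a count of interchangeable units; the layer names and the `find_spofs` helper are hypothetical.

```python
def find_spofs(layers):
    """Return the names of layers that have exactly one unit (no redundancy)."""
    return [name for name, count in layers.items() if count == 1]


# Starting point from the example: 3 servers, 2 SANs, 1 switch, 1 circuit.
system = {
    "servers": 3,
    "SANs": 2,
    "network switches": 1,
    "power circuits": 1,
}
print(find_spofs(system))

# Fix those two, and the next layer down shows up as the new SPOF.
system["network switches"] = 2
system["power circuits"] = 2
system["power sources"] = 1  # hypothetical: utility power only, no generator yet
print(find_spofs(system))
```

Every time you fix one layer, you have to add the layer underneath it to the inventory and check again.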

The list could go on and on forever. The reason so many folks warn about the complexity is that once you've built this giant system... it is extremely complex... and the more redundant you try to make it, the more complicated (and costly) it gets... The more moving parts you have, the more risk you run of missing something that is obviously another SPOF.

The idea is to find the balance of increased redundancy / automatic failover / reduced down time, cost, and complexity for your organization. It might not get you to the 5 nines. But it might get you say... 3 nines (99.9%, right?) of uptime....
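For reference, here is a quick sketch of what the "nines" mean in allowed downtime per year. It assumes a 365-day year and treats N nines as an availability of 1 - 10^-N; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def downtime_hours_per_year(nines):
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10.0 ** (-nines)
    return HOURS_PER_YEAR * unavailability


for n in (3, 4, 5):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Three nines works out to about 8.76 hours of downtime a year, while five nines leaves a little over 5 minutes.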

This will also involve playing nicely with the bean counters. They will suffer from sticker shock when you show them the price tag for what you want (regardless of whether your organization can afford it or not). Work with them and explain how you came up with the system design and how it can save money in the long run. It would also be worth bringing them in at the start to find out exactly what the cost of down time is. So you're not spending half a million dollars to prevent $20 worth of down time.
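That comparison is simple enough to sketch. The following assumes you can estimate the hours of down time a design would avoid per year and what an hour of down time costs the business; all the figures and function names here are hypothetical.

```python
def expected_downtime_cost(hours_down_per_year, cost_per_hour):
    """Yearly cost of down time: hours lost times what an hour costs."""
    return hours_down_per_year * cost_per_hour


def worth_buying(annual_solution_cost, hours_avoided_per_year, cost_per_hour):
    """True if the down time avoided is worth more than the solution costs per year."""
    avoided = expected_downtime_cost(hours_avoided_per_year, cost_per_hour)
    return avoided > annual_solution_cost


# Hypothetical: a 100,000/year design that avoids 8 hours of down time.
print(worth_buying(100_000, 8, 500))     # an hour of down time costs 500
print(worth_buying(100_000, 8, 50_000))  # an hour of down time costs 50,000
```

The same design is a waste in the first case and a bargain in the second, which is why the cost of down time needs to be known before the design is chosen.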