April 12th, 2012:

Designed to Fail

Randy Bias has written an interesting piece here on the impact of complexity on reliability and availability; as you build more complex systems, it becomes harder and harder to engineer in multiple 9's of availability. I read the piece with a smile on my face, especially the references to storage; I'm sitting with an array flat on its arse and already thinking about the DAS vs SAN argument for availability.

How many people design highly available systems with no single point of failure until it hits the storage array? Multiple servers with fail-over capability, multiple network paths and multiple SAN connections; that's pretty much standard. But multiple arrays to support availability? It rarely happens. And to be honest, arrays don't fall over that often, so people don't tend to even consider it until it happens to them.

An array outage is a massive headache though; when an array goes bad, it is normally something fairly catastrophic and you are looking at a prolonged outage, but often not so prolonged that anyone invokes DR. There are reasons for not invoking DR, most of them around the fact that few people have true confidence in their ability to run in DR, and even fewer have confidence that they can get back out of DR; but that's a subject for another blog.

I have sat in a number of discussions over the years where the concept of building a redundant array of storage arrays has been discussed, i.e. striping at the array level as opposed to the disk level. Of course, rebuild times become interesting, but it does remove the array as a single point of failure.
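To make the idea concrete, here is a toy sketch of what "stripe at the array level" might look like: the same round-robin arithmetic that RAID uses to map a logical block to a disk is applied one level up, mapping stripe units to whole arrays first. Everything here (function names, stripe size, layout) is an illustrative assumption, not any vendor's implementation.

```python
STRIPE_SIZE = 4  # blocks per stripe unit (an assumed, illustrative value)

def locate_block(logical_block, num_arrays, disks_per_array):
    """Map a logical block to (array, disk, offset) by striping
    across arrays first, then across disks within each array."""
    stripe_unit = logical_block // STRIPE_SIZE
    # Consecutive stripe units land on different arrays, so no single
    # array holds a contiguous run of data.
    array = stripe_unit % num_arrays
    disk = (stripe_unit // num_arrays) % disks_per_array
    offset = logical_block % STRIPE_SIZE
    return array, disk, offset

# With 4 arrays, blocks 0-3 sit on array 0, blocks 4-7 on array 1, and so on;
# add parity across arrays and losing one whole array becomes survivable,
# at the cost of the interesting rebuild times mentioned above.
```

The rebuild problem falls straight out of this layout: losing one array means reconstructing 1/N of every stripe from the surviving arrays, which is a lot of cross-array traffic.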

But then there are the XIVs, Isilons and other clustered storage products, which are arguably extremely similar to this concept; data is striped across multiple nodes. I won't get into the argument about implementations, but it does feel to me that this is really the way that storage arrays need to go. Scale-out ticks many boxes, but it does bring challenges with regards to metadata and the like.

Of course, you could just go down the route of running a clustered file-system on the servers and DAS, but this does mean that they are going to have to cope with striping, parity and the like. Still, with what I have seen in various roadmaps, I'm not betting against this as an approach either.

The monolithic storage array will continue for some time, but ultimately a more loosely coupled, more failure-tolerant storage infrastructure is probably in all our futures.

And I suppose I better find out if that engineer has resuscitated our array yet.