Death of Backup?

Can Snaps and Replication ever replace traditional back-up applications? It's an interesting thought and certainly one that we've considered in the past. The answer you get tends to vary with the respondent's favourite technology: the NetApp fans generally say yes and the EMC fans say no.

Now, as a storage agnostic, my answer is maybe; it depends on what you use your back-ups for and on your internal processes. I can certainly think of a use case where the answer is NO.

One thing we do, and I suspect many other people do too, is use our back-ups as source data for development copies. That means getting the data into another environment, which may sit on different disk technology and is certainly a separate environment. With traditional tape or even VTL-based back-up this is relatively easy to do, but with a snap-based environment it becomes a lot harder; not impossible, but it adds complexity.

And if you go down the snaps/replication route, you've made your migration path away from your disk supplier infinitely harder, because you are no longer just migrating your primary disk environment; you have tightly coupled your back-up environment to it as well.

So I would still plump for keeping my back-up environment fairly loosely coupled as opposed to tightly integrated. 

Bug or Incompatible?

Just scratching an itch because of something which has happened this week. One of those truisms trotted out by NAS and IP storage fanatics is that FC is really hard to manage because of all the hoops you have to leap through to make various bits of kit talk to each other, and the endless certification matrices. Just look at EMC's Support Matrix: a document so long and so footnote-heavy that it was made for an e-reader.

Whereas NAS just works; I mean, CIFS is CIFS and NFS is NFS; it just works…doesn't it?

Well no, not always; we've just come across a bizarre bug in OnTap 7.3.2 with CIFS when the client is a Mac. If you rename a file and, in doing so, change its case, the file disappears.

Well, at least it's flagged as a bug and yes, there is a fix. But is it really just an incompatibility, with Apple's interpretation of the CIFS 'standard' being slightly different from Microsoft's? The same rename works happily under Windows and under Linux.
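If you want to check whether your own filer shows the same behaviour, a quick test along these lines will do it. This is just a sketch; the mount point is an assumption you will need to change for your own environment:

```python
import os

# Path to a directory on the CIFS share, mounted locally (assumed mount point).
SHARE = "/Volumes/cifs_share/case_test"

os.makedirs(SHARE, exist_ok=True)

# Create a file with a lower-case name.
src = os.path.join(SHARE, "testfile.txt")
with open(src, "w") as f:
    f.write("case-rename test\n")

# Rename it, changing only the case of the name.
dst = os.path.join(SHARE, "TestFile.txt")
os.rename(src, dst)

# On a healthy share the file should still be visible under its new name;
# in the buggy case it simply vanishes from the listing.
listing = os.listdir(SHARE)
print("Directory listing:", listing)
print("Renamed file present:", "TestFile.txt" in listing)
```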

You see, I wonder if we have all got a little complacent with regard to NAS? We all just assume that it'll work and mostly it does…but then again, this is pretty much the case in the FC world too. And as we make ever more use of NAS in the Enterprise world, perhaps we should be paying more attention to NAS certification/compatibility matrices? I for one haven't worried about this in the past, but I will give it more consideration in the future.

Controlling Behaviour

Two very different press conferences/product launches happened today; you can't have missed them.

i) the iPad launch by Apple

ii) the completion of the Sun takeover by Oracle

But actually they had a common theme: control.

Let's take Apple and the iPad, and indeed all their products; Apple exert complete control over the hardware their products run on, and on their mobile devices they even control the applications that run on that hardware. Some people hate this; they really do not like this controlling element and go out of their way to break free of it.

But for some reason, we stick with Apple's products; we may hate the company but we love the product; we accept their control grudgingly. We like the fact that we don't have to waste our precious time making things work together. And at the end of the day, we can get out of the relationship with Apple pretty easily if we really decide we don't like them.

Now let's take Oracle and Sun; Larry has looked back at history, to the IBM of the 60s, and I suspect at his friend Steve, and decided he wants some of that control. In fact, Oracle have even found people who say they are looking forward to Oracle controlling the whole stack: the one throat to choke. But I'm willing to bet that in big Enterprise computing, no-one really wants this; they don't want to be locked in to a single vendor. We've been there and done that; we have choice, we have competition.

Yes, at one level, life would be a lot easier with a single throat to choke, but we know where that leads and we know that if we get too far into bed with Oracle, it's going to be a major struggle to get out of the relationship. There's too much at stake to allow Oracle the same level of control we grudgingly accept from Apple.

Disastrous Thinking

As a follow-up to my blog here, I'd like to share yet more thoughts on availability and the potential negative impacts of some of the new technologies out there.

How many of you run clusters of servers? HACMP? Veritas Cluster? Microsoft Cluster? VMware clustering? I suspect lots of you do. How many of you cluster NAS heads? Again, I suspect lots of you do. How many of you cluster arrays? Not so many, I guess; certainly in my experience it is uncommon to cluster an array. And when I talk about clustering an array, I don't mean the implementation of replication.

So, if you don't cluster your arrays, how do you protect against the failure of a RAID rank? It's statistically unlikely, but is it more or less likely than the loss of a data centre? I'm not sure, and for many people the failure of a RAID rank could well mean the invocation of the disaster recovery plan. Why?

The loss of a RAID rank might well lead to the loss of an application or service, and if it is an absolutely business-critical service, can you bring it up at the remote replication site in isolation? As a discrete component? If you can, can you cope with increased transaction times due to latency? Many applications now have complex interactions with partner applications, and these might not be well understood. So the failure of a RAID rank could lead to the invocation of the Disaster Recovery Plan. Actually, in my experience this is very nearly always the case unless the service has been designed with recovery in mind; that requires infrastructure and application teams to work together, something which we are not exactly good at.

But now you take up the challenge and make sure that every application can be failed over as a discrete component. Excellent, a winner is you! You know the impact of losing a RAID rank, you know what applications it impacts, you've done your service mappings etc., etc. And you have been very careful to lay things out to minimise a single RAID failure's impact.

And then you implement automated storage tiering. Suddenly you have no idea in advance what impact a RAID rank failure may have; you have no idea what applications may be impacted. And actually, the failure of a single RAID rank may well have a huge impact: we could be looking at restoring many terabytes of data, and many applications failing, to cope with the failure of a couple of terabytes.
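To make that concrete, here is a minimal sketch of the sort of impact-assessment tooling I'd like to see, assuming (and it is only an assumption) that you can export a placement map from the array showing which application owns extents on which RAID rank; the map below is entirely made up:

```python
from collections import defaultdict

# Hypothetical placement map: (application, raid_rank, gigabytes of extents).
# In reality this would be exported from the array or the tiering engine.
placement = [
    ("accounts",  "rank_03", 250),
    ("accounts",  "rank_07",  40),
    ("webfront",  "rank_03",  10),
    ("warehouse", "rank_03", 900),
    ("warehouse", "rank_11", 600),
]

def blast_radius(placement, failed_rank):
    """Return the applications touched by the failure of one RAID rank,
    and how much of each application's data would need restoring (GB)."""
    impact = defaultdict(int)
    for app, rank, gigabytes in placement:
        if rank == failed_rank:
            impact[app] += gigabytes
    return dict(impact)

print(blast_radius(placement, "rank_03"))
# {'accounts': 250, 'webfront': 10, 'warehouse': 900}
```

Even this toy example shows the point: one rank failing drags three applications with it, and the restore is far larger than the rank itself.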

It will depend on the implementation of the automated storage tiering, and I am concerned that at present we do not know enough about the various implementations which will be hitting our arrays over the next eighteen months. So despite automation making day-to-day life a lot easier, we cannot treat it as Automagic Storage Tiering; we need to know how it works and how we plan to manage it.

And perhaps for key applications, we will need to cluster storage arrays locally; that in itself will bring challenges.

I'm still a big fan of automated storage tiering but over the next few months, I would like to see the various vendors start talking about how they mitigate some of this risk. Barry Burke has made a big thing about the impact of a double disk failure on an XIV array in the past; in a FAST v2 environment, I would like to see how EMC mitigate against very similar problems.

I would also like to know what the impact of a PAM card failure is on a NetApp array; does the array degrade to the extent where it is not usable? What kind of tools can NetApp give me to assess the potential impact? As Preston points out here, the failure of individual components within an array can have significant impacts.

We are heading towards a situation where technology gets ever more complex and arguably ever more reliable. But we rely on it to an ever greater extent, so we must understand the risks and mitigations to a much greater degree than we have in the past.


How do you measure availability?

Recently on Twitter, there was a conversation about vendors certifying the availability of their arrays: which vendors certify five nines and so on. I am going to argue that these figures lull people into a false sense of security, as actually no-one knows what they mean!

If a vendor says that their array is 99.999% available, what does that really mean to you? Probably not a lot in practical terms. Does it mean that individual components are 99.999% available? Or does it mean that the array itself, in some shape or form, is available?
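For context, the arithmetic behind the marketing number is straightforward; whatever the vendor's definition, five nines only buys you a handful of minutes of outage a year:

```python
# Downtime budget implied by various availability figures, over one year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for name, availability in [("three nines", 0.999),
                           ("four nines", 0.9999),
                           ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{name} ({availability:.3%}): ~{downtime:.1f} minutes of downtime per year")
```

Roughly 526 minutes a year for three nines, 53 for four and a little over 5 for five; but minutes of what, exactly?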

If the array is still powered on and not in flames, is that available?

If 75% of disks are working, is the array still available?

If the array can service any I/O is that available?

What do vendor figures actually mean, and do they matter to you? More importantly, do they matter to your customers? Your customer doesn't care whether the array is still working; all they care about is whether they have access to their data and whether their service is available. So ultimately, vendor availability figures are pretty much meaningless in the larger scheme of things.

So those vendors who read my blog, what do your availability figures actually mean?

A Bold Prediction

Did VMware save EMC? Looking back at the last decade, the acquisition that stands out is that of VMware by EMC; an acquisition so important that I suggest it saved EMC. Did EMC need saving? At the point it bought VMware, probably not; but if EMC had not bought VMware, I suspect it would have been in dire straits, or at least not the company it is today. I'd go as far as to propose that it would have been bought itself by now.

Acquiring VMware changed EMC from a storage company into a company which wants to be a lot more; it brought ambition and ideas into EMC. VMware opened EMC's eyes to a new way of thinking; it has driven more innovation into EMC than any other acquisition. Without the experience that VMware brought, I doubt that EMC would be trying as hard to innovate in Cloud Computing; they'd just be another 'me too'.

So what will the defining acquisition of the next decade be? Will it be made by a storage company? Or will it be someone acquiring a storage company? Actually, I'm going to make a bold prediction: it'll involve Cisco, but I'm not sure whether Cisco will be the purchaser or the purchased. There's something to think about, and no-one will remember if I was wrong anyway.

Google for the Infrastructure

I've been thinking about FAST, and especially FAST v2, but not entirely from a storage point of view. FAST v2, and indeed any automated storage tiering product, has some interesting uses beyond storage and could be the basis for a whole new way of managing IT as a service. In fact, it finally enables storage and beyond to be managed as a service. BTW, I'm going to use FAST as shorthand for any automated storage tiering product, so please don't take this as only being about EMC.

In order for FAST to work, it needs to gather and react to a lot of information from the array itself. In fact, for FAST to be truly useful, it needs to gather, react to and store a lot of information about what is going on in the array.

Take a typical corporate accounting application; most of the time it can be pretty quiet and not performance-intensive, but at certain times of the year it will be a very intensive workload. During these times you might want it all to be on the fastest, most performant tier. Now, FAST will react to a sudden increase in workload and move the application when it sees the demand increase, but will FAST be able to move it quickly enough? So perhaps we need to give the array some hints as to when to prime the load?

These sorts of peaks are very predictable and we know when they will happen, but not all peaks are quite as predictable; or at least we don't think they are. FAST will be gathering stats all the time and, by analysing this data, it might be able to do the predictive analysis a lot quicker than we can and spot things that we can't, or at least don't have the time for. It may pick up on relationships between applications: application X runs hot at a certain time, which causes application Y to become busy some period later; for example, certain types of activity may cause a reporting job to be run at a later date.
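As a rough illustration of the sort of analysis I mean, and emphatically not any vendor's actual implementation, given per-application I/O samples you could look for the lag at which one application's workload best predicts another's; the data below is invented:

```python
import numpy as np

def best_lag(series_x, series_y, max_lag):
    """Find the lag (in samples) at which series_y correlates most strongly
    with series_x, i.e. how long after X runs hot Y tends to get busy."""
    best = (0, -1.0)
    for lag in range(1, max_lag + 1):
        corr = np.corrcoef(series_x[:-lag], series_y[lag:])[0, 1]
        if corr > best[1]:
            best = (lag, corr)
    return best

# Made-up hourly IOPS samples for one week: app_y's peak trails app_x's by ~3 hours.
rng = np.random.default_rng(0)
app_x = rng.poisson(200, 168).astype(float)
app_x[10:14] += 5000   # application X runs hot
app_y = rng.poisson(150, 168).astype(float)
app_y[13:17] += 4000   # application Y becomes busy a few hours later

lag, corr = best_lag(app_x, app_y, max_lag=12)
print(f"app_y correlates most strongly with app_x at a lag of {lag} hours (r={corr:.2f})")
```

Nothing clever, but run across a whole estate's worth of samples it is exactly the kind of relationship a human would never have the time to dig out.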

You see, from our storage infrastructure we can start to gather a lot of information about our whole estate. But EMC could go further: they have things like nLayers and Smarts to leverage; they could start to pull information from VMware and do a whole lot of analysis on this. NetApp have SANscreen; HP have a zillion tools, as do IBM.

Once you've got that information, you need to start turning it into something the business understands, so that you can sit down with the business and do what-if modelling, showing the conflicts and clashes where multiple services are demanding the same high-performance infrastructure at the same time. Perhaps the business owner needs to prioritise or purchase more infrastructure. Perhaps they need less; perhaps they can shift some stuff into the Public Cloud and just pull it back when they need to.

So FAST could be rather more than just a way of optimising your storage infrastructure; if you data-mine this in the same way Google data-mine statistics, you can find out a lot of things you didn't realise and probably completely change the way you look at your infrastructure.

So when EMC talk about FAST being a foundational technology, they aren't wrong…actually, like Virtual Provisioning, it is so important that it should be Free! They could fund this by getting rid of half their account managers; FAST could literally sell itself.

Too many or too few?

Chad Sakac recently tweeted about some issues EMC were having with VMware, and there was a predictable and rather pathetic dig from NetApp suggesting this might be a result of EMC having too many product lines and too much to QC. Leaving aside the fact that all software has bugs, even NetApp's, this led me to look back at recent RFxs that I've been involved in and who had been invited to respond.

EMC have been invited to respond to all storage RFPs, whereas NetApp have only been invited to respond to about half of them. Why? Well, EMC have much better coverage of the whole storage domain with their many products, whereas NetApp have but a single answer to every question that I ask.

Who wins the greater number of RFPs? Well, honours are pretty much equal, but I would argue that in being invited to respond to all my RFPs, EMC are developing a much greater understanding of my business and my challenges; long term, this has to have value to both myself and EMC.

Now, do EMC have too many products and NetApp too few? Actually, I reckon honours are even and the answer to both questions is yes! But arguably it is easier to consolidate product lines than it is to develop new ones.

Of course, IBM and HP have an even greater understanding of my business as they can cover pretty much the whole stack; if EMC are sensible, they will use Acadia as a vehicle to develop a deeper understanding of the businesses that they deal with.

And yes, I could have used several other vendors as examples of companies with but a single answer.

Terms of Service

One of the things you get used to as a consumer of services is that at times changes to the terms and conditions of that service irritate you and you consider taking your custom elsewhere. Most of the time you don't, and eventually you learn to live with the changes. Actually, most of the time the changes don't make a jot of difference and you are just irritated for the sake of being irritated.

As consumers of 'cloud applications', this happens to us a lot: Twitter changes something, we all howl, it generally stays changed and we learn to live with it. User interfaces change underneath us all the time; we have no choice and we learn the new interface. We cannot opt out.

Now, look around your data centre; how many applications have you got running on legacy hardware and operating systems which are long out of support, where in some cases the company which built them no longer exists? As you own the infrastructure, you can simply take the decision to opt out and continue to run the application. A Business Unit might have very good reasons for continuing to run it, or it could simply be a case of 'It Ain't Broke, So Don't Fix It'.

If Cloud Infrastructures become the norm, this no longer remains quite so tenable. If your Cloud Provider upgrades its underlying infrastructure and you find your instance no longer works, your only opt-out might well be to find a way of moving that instance onto an infrastructure which will support it. However, if the application is core and lots of applications partner with it, this might not be easy.

For support teams, this might finally provide the stick they need to encourage the ongoing maintenance of applications so that they can be upgraded; but if you are running a private cloud infrastructure, you could find yourself in the position where you have legacy clouds…and that will just make things worse.

Not a Cloud Storage Problem

Before we all get carried away and pick on Cloud Storage as a specific target, perhaps we should sit back and think. It is not Cloud Storage but the Public Cloud which is the problem; the most visible failures have been storage-related but, let's be honest, without storage you don't have a Cloud Environment at all.

Cloud providers of Storage, Compute etc. need to be held to the highest standards of availability. You would not outsource your computing environment to Accenture, Cap Gemini, IBM etc. without doing your due diligence; or perhaps you would?

Actually, I can think of many cases where people have outsourced key parts of their business without due diligence; web-hosting, for example: lots of SMBs have hosted their websites with random web-hosting companies with very little in the way of investigation. We have simply got into the habit of trusting people, and we have accepted the enthusiastic amateur who starts a business.

But this business has got too big and too important for that, and it ain't a Cloud Storage problem! Stop throwing bricks at Cloud Storage; start holding the whole hosted computing business to account. Demand SLAs, verify SLAs, check insurances, ask for references, ask for evidence of best-practice operating procedures. Be an informed consumer!

However, also accept that if you pay peanuts, you'll get monkeys. So don't just look at the cost; consider the value!