True Names

Names have great power: by naming something, you define it, and you often bound it in ways that you never meant to. This is never truer than in certain IT functions and teams, where focussing on one part of the function means the primary purpose of the team is lost.

I was talking to one of our Infrastructure Designers, and he wanted to know what could be done to stop a new 'Backup and Recovery' infrastructure that we are going to implement from falling into the pitfalls that so many previous infrastructures had. He at least managed to ask the right question: he asked about 'Backup and Recovery', as opposed to the commonly used shorthand 'Backup', which makes no mention of the primary function of the infrastructure.

By losing the Recovery part of the phrase, we unconsciously start to focus on the wrong thing; the purpose of the infrastructure is lost and it simply becomes about storing some files without really thinking about why we are storing them.

So often I hear that a 99% success rate is an acceptable metric for back-ups; sometimes it's higher, maybe three nines, maybe four nines, and often it can be lower. However, what if you were to ask the average IT manager whether it was okay for you to wander around a data centre and turn off 1 in 100 machines? What do you think the reaction would be? Probably not positive. And in a highly virtualised data centre with thousands of virtual machines, we could be talking about significant numbers. And that assumes that the same back-up is consistently failing; what if the failures are randomly distributed? Then, over time, 1% of backups failing could easily affect 20-30% of your server estate, at a rough guess.
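To put a rough number on that guess, here is a minimal sketch in Python. It assumes that every backup run fails independently with the same probability and that each server is backed up once per night; both are simplifying assumptions of mine, not measurements from any real estate, and the function name is purely illustrative.

```python
def fraction_with_failed_backup(failure_rate: float, nights: int) -> float:
    """Chance that a given server has at least one failed backup
    across `nights` runs, assuming independent failures (an assumption)."""
    # P(at least one failure) = 1 - P(every run succeeds)
    return 1.0 - (1.0 - failure_rate) ** nights


if __name__ == "__main__":
    # A 99% success rate, viewed over a day, a week and a month of nightly runs.
    for nights in (1, 7, 30):
        share = fraction_with_failed_backup(0.01, nights)
        print(f"{nights:2d} nights: {share:.1%} of servers with a missed backup")
```

Under those assumptions, a month of nightly runs at a 99% success rate leaves roughly a quarter of servers with at least one missed backup, which sits inside the 20-30% range guessed above.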

Perhaps it's about time we stopped talking about Backup and started talking about Recovery. If you called the 'Backup Team' the 'Recovery Team' and talked about the Recoverability of your estate, as opposed to simply the amount of data and the Backup success rate, people might take it more seriously. And with modern systems, it should generally take minutes to recover a system, so what's the excuse for not testing on a regular basis? Even if you were not bringing the recovered system up and into production, a simple smoke-test would probably be an improvement.

It's time for the Backup teams to step out of the shadows and into the light; the first thing is to rename that team to reflect their true purpose.

6 Comments

Roland Bavington says:

July 22, 2010 at 10:08 pm

Hear, hear, Martin. I spent years working for a tape library vendor, trying to get this message into the organisations I was selling to.
On the same lines, let's stop calling storage 'storage', because having the capacity to store 1s and 0s is only part of the story!
Roland

Owen says:

July 22, 2010 at 11:29 pm

Who cares if backups fail?
A provocative first sentence to get your attention.
The backup success or failure isn’t the critical metric; what is important is the customer exposure. The terms RPO (recovery point objective) and RTO (recovery time objective) are usually used when talking about DR and replication. They should also apply to backup and recovery.
People think of RPO in terms of a single future event (e.g. if a disaster were to occur sometime in the future, what is the greatest amount of data, measured in time, that could be lost?).
An organisation needs to trade off cost vs. recoverability. The outcome of this will be a statement, or set of statements, along the lines of “for the past x weeks we must be able to recover to within y hours of any point in time”. Unlike a DR situation, even if the critical event is not detected for up to x weeks, the RPO is still y hours.
The failure of a backup is not a problem, so long as a backup _completes_ within y hours of the _start_ of the previous backup. If y=24 hours, you have to back up more than once a day!
I’ve skipped over the definition of a backup, as this varies between organisations. In some cases, a snapshot is considered a backup (although I would argue it isn’t), other organisations won’t consider a backup complete until it is off-sited.
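The rule above (a backup must complete within y hours of the start of the previous one) can be checked mechanically. This is a minimal sketch in Python, assuming you can list the start and end times of the successful backups for one system; the function name and data layout are illustrative, not taken from any backup product.

```python
from datetime import datetime, timedelta


def rpo_gaps(backups, rpo):
    """Return (previous_start, next_end) pairs that breach the RPO.

    `backups` holds (start, end) datetimes for successful backups only;
    failed runs are simply left out of the list.
    """
    ordered = sorted(backups)
    breaches = []
    # Compare each backup's start with the completion of the one after it.
    for (prev_start, _), (_, next_end) in zip(ordered, ordered[1:]):
        if next_end - prev_start > rpo:
            breaches.append((prev_start, next_end))
    return breaches


if __name__ == "__main__":
    first = datetime(2010, 7, 22, 22, 0)
    # Nightly backups that each take two hours to complete.
    nightly = [(first + timedelta(days=i), first + timedelta(days=i, hours=2))
               for i in range(3)]
    print(rpo_gaps(nightly, timedelta(hours=24)))
```

With nightly two-hour backups, each one completes 26 hours after the previous one started, so every consecutive pair is reported as a breach of a 24-hour objective. That is the point about y=24 hours needing more than one backup a day.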

Martin G says:

July 23, 2010 at 10:03 am

Owen, I completely agree. If people focused on the requirement, which should be "how much data can I afford to lose, and for how long can that data be unavailable?", and took those requirements seriously, we would have better recovery-focused systems.
But whilst people continue to treat backup as the primary objective…

Adriaan says:

August 5, 2010 at 12:26 pm

Hi Martin,
The two key long-term value activities are Recovery and Archive – and both get badly affected when backup takes central mindshare in the design.
So how about pushing those two centre stage? A good acronym would help, but my brain is not offering one.
Thanks for raising this key issue.

Owen says:

August 5, 2010 at 11:32 pm

The term “BURA” is thrown around a lot. BackUp comes first, but it is an industry standard term that at least references Restore and Archive.

Adriaan says:

August 6, 2010 at 11:40 am

So thinking about it a bit more – current data protection practices are heading more towards incremental backups, ideally also via a form of offsite replication.
How about Replicate and Archive to Recover, or RA2R, or even
Snap, Replicate and Archive to Recover – SRA2R?
The approach also has the benefit of moving the mindset away from the current problem of using copy-out-based backups, which are hugely resource- and time-consuming, i.e. all the bits that make Backup a dominant operational issue.
