NetApp Reveal Cisco’s Storage Incompetence

Peter Perrault, a NetApp marketer, tweeted a link to some NetApp marketing fluff about Cisco's use of NetApp storage. There is a very scary statement in the PDF which, if true, makes me wonder about Cisco as a credible partner in the world of storage.

From the article:

"A few years ago, faulty disk drives on the development team’s underlying enterprise storage systems were a nuisance for Cisco IT, as Storage Domain Design Architect Rich Harper recalled. But, it wasn’t insurmountable until double-disk drive failures in the same parity stripe occurred twice in six months, resulting in the loss of hundreds of days of development work. While Harper readily acknowledged the drive issues were an industry problem that all vendors were experiencing, he admitted, “The experience still left a bad taste in our mouths.” "

Can anyone else see how scary this is? A double-disk failure caused the loss of hundreds of days of development work? Have these people not heard of back-ups? Are Cisco still relying on parity to protect against data loss? Or is this just pure marketing hyperbole?

Look guys, RAID is not a substitute for back-ups! Snaps, clones and whatever else are not substitutes for back-ups. You need to get the data off the primary storage to another physical location or device. Backing up is not exciting, but you need to do it.

11 Comments

Sjon says:

September 16, 2009 at 10:22 am

Isn’t this reaction a bit out of context?
Who says they didn’t make backups? Who says they couldn’t recover? However, that probably has created sufficient delay and pain in development that they decided that double disk failures should not be the reason to go back to restoring from backup any longer.
Apparently double parity RAID-DP is a very cost-effective and painless (and cheap) way to protect against DD failure, without the need for lengthy restore procedures.

Martin G says:

September 16, 2009 at 10:43 am

“resulting in the loss of hundreds of days of development work”
Sorry, that’s a fairly unequivocal statement IMO. If they said that it potentially resulted in the loss of hundreds of days of development work it would have been a fair statement. Or perhaps it was the loss of time where developers sat there and twiddled their thumbs!
But if the loss of a single RAID rank caused that sort of impact, I would query the storage design and certainly the continuity planning.
Do I think Cisco are incompetent? Of course not. I do think the marketing fluff that we see at times is madness tho’.

Storagezilla says:

September 16, 2009 at 3:14 pm

Correct. Array based Snaps can be considered backups, but if the data they contain isn’t moved off the array they were taken on you’re not “fine” and it won’t be “okay”.
Take your snaps, but roll the last snapshot of the day over to different media located off the device.

Rob says:

September 16, 2009 at 4:01 pm

Martin,
You’ve picked up on the obvious. Backups are necessary (even today) to:
– guard against silent corruption.
– offsite for many
The “kid” in charge of backups was probably let go.
Second point… I’ve personally experienced RAID5 blow-outs twice (so far). As mentioned before, UBE is the issue – not wall clock time!!!
Finally, to experience it TWICE is an opening for a mythical exchange:
Ralph: “What a pain”
Joe: “Hey, what if this happens again?”
Ralph: “I doubt it…”
Do you think after the second occurrence, the exchange was the same? Doubt it.
I know someone whose RAID5s all became RAID6s after the loss of a RAID5.

Val Bercovici says:

September 16, 2009 at 6:50 pm

Hi Martin,
If I give you driving directions to my new favorite pub, and those directions only include left turns – am I implying your car doesn’t need to be capable of turning right? Probably not, just that left turns are the most efficient way to get to this particular pub from your location.
Likewise, I wouldn’t read quotes like the one you highlight above assuming everything not stated wasn’t actually done. Cisco are not in the habit of publicizing incompetence, but in the interests of brevity and style, the editing process for customer success stories sometimes leaves out details techies crave.
In this case I have no doubt there was an elaborate backup & recovery system in place. What this scenario is highlighting is that sometimes the best restore is one you never have to make. Certain batch jobs can run into days, and if the previous primary storage system supported hundreds of those jobs, then the unfortunate (but all too common) RAID5 failure resulted in having to resubmit those jobs. As a result, hundreds of cumulative days (or “man-days”) of work would have been lost. Simple as that.
In these budget-constrained times, we see more and more storage admins forced to cut corners by configuring RAID5 when their alternate storage vendors recommend RAID10 for availability and performance. No such compromises are necessary when using NetApp RAID6(DP) as well as our other recommended storage efficiency best-practices for availability and performance. We guarantee it! 🙂
http://www.netapp.com/in/company/leadership/storage-efficiency/
-Val.

Martin G says:

September 16, 2009 at 7:00 pm

Val, I have never had a vendor recommend RAID10 in the last four years! Never! Nada! I keep hearing this from NetApp but in all my discussions with vendors (and I’ve spoken to quite a few), never have they recommended RAID-10. I’m beginning to feel left out!
And as for batch jobs taking days, I cut my teeth configuring boxes for seismic processing.

Steve O'Donnell says:

September 16, 2009 at 7:07 pm

Double drive failure in a single RAID group? I sense that this probably means even worse operational incompetence than not doing backups. What this almost certainly means is that they didn’t notice the first disk had failed and get it replaced allowing recovery to happen.
This is not unusual in a development environment. See the same thing happen with dual power supplies and network links. If it’s not being actively maintained it WILL fail.
That is the single biggest weakness in RAID: the storagebod (nothing personal). If we rely on humans, they sometimes screw up.
Steve

Martin G says:

September 16, 2009 at 9:16 pm

The only time I have seen a double drive failure, the drives went within 30 minutes of each other. The engineer was on the way and actually, all that needed doing was reseating one of the failed drives and the rank came back.
Obviously we still had both the ‘failed’ drives swapped out but in a controlled manner.

Val Bercovici says:

September 17, 2009 at 1:47 am

Martin – you are most certainly being left out on the RAID10 recommendations 🙂
Most customers use vendor best-practice papers and publicly available independently audited benchmarks to maximize performance and availability for their arrays.
For obvious reasons, the *vast* majority (> 95%) of those benchmarks & reports (SPC, SpecSFS, MS ESRP) show systems configured with RAID10. This is an overwhelming and undeniable pattern – regardless of one’s biases against any individual benchmark or recommendation.
Because we solved the latency problem, only NetApp submits configurations using RAID6(DP) for primary storage performance and availability. It’s a true testament to our technology that we attain performance which remains very competitive with peer results, despite using a fraction of the disks, while providing 100% double-disk failure protection, which RAID10 does not!
-Val.

Jon Harris says:

September 18, 2009 at 1:14 pm

I think you yourself have hit the nail on the head when you refer to the piece as ‘marketing fluff’.
Look at it from the marketeer’s pov – they’re trying to promote their own solution to the issue of double disk failures in a single parity group, whilst surreptitiously kicking the opposition. Accurate historical reporting, it is not.
I’m sure the piece wouldn’t have had quite the same impact if it said “…until double-disk drive failures in the same parity stripe occurred twice in six months, resulting in the loss of hundreds of days of development work. However, all the data was recovered from backup, resulting in minor inconvenience for developers as they waited a few hours to get their data restored.”
You’re basing your argument on pure speculation, which leads you to question whether ‘Cisco are a credible partner in the world of storage’! That’s a big, big accusation to make, especially when it’s based on nothing more than your own conjecture.
Oh, and we’ve been offered plenty of RAID10 stuff in the past few months, let alone the last few years (but maybe that’s based on our application requirements).

Martin G says:

September 18, 2009 at 1:44 pm

Well Cisco certainly would have approved the marketing fluff and if they really intend to move into the world of data centre services, they need to be careful as to the message that they are allowing to be sent out. I wouldn’t buy services from a company whose idea of a sensible back-up policy is on-array snaps/clones alone.
