Creative Communities of the World Forums

The peer to peer support community for media production professionals.

  • RAID level reliability

    Posted by Alex Gerulaitis on May 21, 2013 at 1:49 am

    What’s the most reliable RAID level, and why? (Let’s forget performance considerations for the purpose of this thread.)

    The factors affecting reliability are drive FR (failure rate) and UREs (unrecoverable read errors). Reliability means minimizing the chances of data loss, which doesn’t seem to be a single factor either: (1) minimized chances of catastrophic data loss, (2) minimized chances of smaller data loss incidents (corrupted files). Then there’s also maximized uptime – not sure where that fits in.

    I’ve seen papers on RAID5 vs. RAID6, triple-mirroring RAID1, but not RAID10 vs. RAID6. The latter two seem to be discussed the most in terms of reliability.

    (As there isn’t a “storage” forum on Cow, I figured my best chance of getting a good discussion started is in this forum.)

    Thanks in advance for any ideas.

    Alex Gerulaitis
    Systems Engineer
    DV411 – Los Angeles, CA

  • Alex Gardiner

    May 21, 2013 at 8:55 am

    +1 This is a good discussion Alex and I’m pleased you raised it, as it is of interest to me too.

    While you say forget performance, (as you know) it is always a requirement to consider the application alongside utilization and reliability? After all, of the three, it’s usually only possible to pick 2 😉

    Turning it on its head, we may avoid failure, but end up with a system that is unreliable from a usability perspective.

    In turn, if we are forgetting performance altogether, would we consider technologies such as RAIDZ? Removing the controller itself might not be the worst idea?

    NB: Since the cow is rather flame-prone, I’ll temper this statement by saying that IMO Z is not particularly practical for post storage in 2013.

    Like you I’ve read the papers (and then some) and my view has become one of avoiding raid5 for larger units (shoot me – I’m not selling anything).

    Ahem.. so, without really contributing anything at all, I’ll simply say that I am also looking forward to some real input and figures from the big players. Will they share… I hope so 🙂

    Alex


    +447961751453
    indiestor.com
    github.com/indiestor/

  • David Gagne

    May 21, 2013 at 2:51 pm

    A multi mirror raidz would be most resilient to drive failure. Add in HA, UPS, and it’s never going down…

  • Alex Gerulaitis

    May 21, 2013 at 6:37 pm

    [alex gardiner] “While you say forget performance, (as you know) it is always a requirement to consider the application alongside utilization and reliability?”

    Thanks Alex. You’re right, performance is always a factor, as is efficiency (ratio of performance and usable capacity to TCO).

    My personal quest in this is to figure out which one is more reliable – 6 or 10, and why, using good math models supported by stats. RAIDZ – for me personally – falls outside of it as it’s not supported by popular RAID controllers I use. That said, I am not opposed to discussing it.

    [alex gardiner] “In turn, if we are forgetting performance altogether, would we consider technologies such as RAIDZ? Removing the controller itself might not be the worst idea?”

    Just for the purpose of this thread, I’d like to center on a perhaps quite theoretical question of what industry standard RAID level is the most reliable – along with why and by how much.

  • Alex Gerulaitis

    May 21, 2013 at 6:50 pm

    [David Gagne] “A multi mirror raidz would be most resilient to drive failure. Add in HA, UPS, and it’s never going down…”

    Thanks David. Not familiar with Z although I heard good things about it. Are there any models that quantify that resiliency, and measure it against, say, 10, 6, and a 3-way RAID1, similar to the way Adam Leventhal does it with 5 and 6?

    (I realize copying the same data in at least three places is the most resilient method – and RAID6 might be the most efficient implementation of it in terms of space utilization – even if not the most reliable.)

    Wikipedia has an AFR formula supposedly measuring resiliency, but I am not too happy with it: it doesn’t account for the risk of UREs during rebuilds, and it factors in annual FR rather than the risk of simultaneous failures within the drive replacement and rebuild time frame.

  • David Gagne

    May 21, 2013 at 7:51 pm

    In terms of resiliency, it’s not a hard math problem.

    Raid6, you have two drives that can fail without loss. Raid10 is a RAID0 (stripe) across RAID1 (mirror) sets. In this you can have 1 drive failure from any of the sets without loss, but if both drives fail in a set simultaneously, you have loss. But to quote someone I know, “I’ve never had drives fail faster than I can replace them one at a time.” If you have RAID5, and a drive fails, and you replace it same day, you will not lose data.

    But let’s say you are the most paranoid person on earth. You could take 16 disks and do 3 sets of RAIDZ3 and mirror them. This would give you the ability to lose 3 disks per set, or up to 2 whole sets and 2 disks on your existing set, and still rebuild. Of course you’d only get 2 disks worth of storage with this scheme, but YOU ARE LIVING SCARED. Basically as long as you had 2 good disks in the same set, your data would be secure.

    Oh, but of course, you have a backup on tape, right?

  • Alex Gerulaitis

    May 21, 2013 at 8:22 pm

    [David Gagne] “If you have RAID5, and a drive fails, and you replace it same day, you will not lose data.”

    You might if you get a drive failure or a URE during rebuild – which is technically after replacement. Chances of a URE (non-catastrophic data loss) during rebuild are higher than 50% on a 10TB array with desktop drives – if we believe published URE numbers – which some people think are too optimistic.

    Single bit URE on video files – may not be a big deal. In Pentagon – it probably is – they wouldn’t want that drone to send a Hellfire to the wrong address…
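    For anyone who wants to check the "higher than 50%" figure, here is a rough sketch. It assumes the commonly published desktop-drive spec of one URE per 1e14 bits read, treats bit errors as independent, and assumes the rebuild has to read about 10 TB (decimal). The function name and the 1e15 comparison figure are my own illustrative choices, not numbers from a particular datasheet.

```python
import math

def p_ure_during_read(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one URE while reading `bytes_read` bytes.

    Assumes every bit read fails independently with probability `ure_per_bit`
    (a simplification; real drives fail in sectors and errors cluster).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Assumed inputs: 10 TB read during rebuild, two published-style URE rates
p_desktop = p_ure_during_read(10e12, ure_per_bit=1e-14)
p_enterprise = p_ure_during_read(10e12, ure_per_bit=1e-15)

print(f"1 per 1e14 bits: {p_desktop:.1%} chance of at least one URE")
print(f"1 per 1e15 bits: {p_enterprise:.1%} chance of at least one URE")
```

    Under those assumptions the desktop-class figure comes out a little above one half, which is consistent with the claim; an order of magnitude better URE spec drops it to single digits.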

    So RAID5 is fine in many circumstances but it can no longer be considered truly resilient with today’s high capacity drives. On top of it, we don’t really know what happens on that near-certain URE-on-rebuild: will the RAID brain offline the drive and fail the whole array? That would mean a near-certain catastrophic data loss on a whole array with a single drive failure. Or will it just leave that URE in place meaning a single bit error in your file or file system? Either way – not pretty.

    RAID6 is quite a bit more resilient, both with UREs and drive failures as chances of two or three happening within the same replacement / rebuild cycles are much, much smaller. RAID10 is also resilient to multiple drive failures and, to a lesser degree, to UREs.

    I just wanted to see if anyone had a math model or stats pitting 6 against 10 but it may indeed be an exercise in vain.

  • Bob Zelin

    May 21, 2013 at 10:16 pm

    no matter what is determined from this thread, everything fails, there will always be data corruption at the worst possible time. If you do not have a secondary method of backup (with drives) or archive (with LTO) you are playing Russian roulette with your media. I try to use RAID 6 as often as possible these days, but data corruption is a fact of life, and programs like Disk Warrior do not always solve these issues. You MUST have a backup plan, no matter what the most reliable RAID level is.

    just my 2 cents.

    Bob Zelin

  • Vadim Carter

    May 23, 2013 at 4:08 pm

    It is true that RAID 6 can sustain a loss of any two drives in a set. RAID 10, however, can lose more than two drives and still remain operational. This is because RAID 10 stripes on top of multiple mirrored pairs of disks and for as long as there is at least one good disk in each mirrored pair, RAID 10 will stay operational.
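    To put rough numbers on "can lose more than two drives", here is a small sketch. It assumes two-way mirrors and that the failed drives are picked uniformly at random all at once, which is a simplification (real failures are spread over time and may be correlated). The function names are made up for illustration.

```python
from math import comb

def raid10_survival(pairs: int, failures: int) -> float:
    """Chance that `failures` random simultaneous drive failures leave
    every two-way mirror pair with at least one good disk."""
    if failures > pairs:
        return 0.0  # pigeonhole: some pair must have lost both disks
    # Survivable outcomes: choose which pairs are hit, then which side of each
    good = comb(pairs, failures) * 2 ** failures
    total = comb(2 * pairs, failures)
    return good / total

def raid6_survival(failures: int) -> float:
    """RAID 6 survives any two whole-drive failures and no more."""
    return 1.0 if failures <= 2 else 0.0

# Assumed example: 8 drives, either as 4 mirror pairs or as one RAID 6 set
for k in range(1, 5):
    print(k, f"RAID10 {raid10_survival(4, k):.3f}", f"RAID6 {raid6_survival(k):.0f}")
```

    In this model an 8-drive RAID 10 survives two simultaneous failures 6 times out of 7 rather than always, but still has a real chance of surviving three or four, where RAID 6 has none.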

    I do have all the nitty-gritty details and formulas laying around somewhere which explain mathematically which RAID level is more reliable. If my memory is correct, RAID 10 is deemed slightly more reliable than RAID 6.

    Earlier threads had a statement about minimizing chances of data corruption. To anyone reading this, make no mistake about it, RAID does not protect against data corruption. Furthermore, as disk drives get larger in capacity, the probability of silent data corruption increases no matter what RAID Level is used. This is why RAIDZ is an excellent choice where data integrity is paramount, each block of data is checksummed and the checksum is then written to a separate area of the disk with a pointer to the original data block. The pointer itself is also checksummed. When a data block is accessed, its checksum is checked and verified thus guaranteeing immediate corruption detection.
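    A toy sketch of the idea, for illustration only: the checksum lives with the pointer, not with the data, so a block that silently changes no longer matches what the pointer expects. This is not ZFS's actual on-disk format or API; the class names are hypothetical and SHA-256 is simply a convenient stand-in for whatever checksum the filesystem uses.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class BlockPointer:
    """Hypothetical pointer record: where the block is, and what it should hash to."""
    address: int
    checksum: bytes

class ChecksummedStore:
    """Toy block store that keeps checksums in the pointers, apart from the data."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}

    def write(self, address: int, data: bytes) -> BlockPointer:
        self.blocks[address] = data
        return BlockPointer(address, hashlib.sha256(data).digest())

    def read(self, ptr: BlockPointer) -> bytes:
        data = self.blocks[ptr.address]
        # Verify on every read so corruption is caught before the data is used
        if hashlib.sha256(data).digest() != ptr.checksum:
            raise IOError(f"checksum mismatch at block {ptr.address}")
        return data

store = ChecksummedStore()
ptr = store.write(0, b"frame 0001")
assert store.read(ptr) == b"frame 0001"

store.blocks[0] = b"frame 0OO1"  # simulate silent corruption on disk
try:
    store.read(ptr)
except IOError as err:
    print("detected:", err)
```

    Detection is only half of it: with a redundant copy available, a failed check can trigger a read from the other copy instead of returning bad data.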

    One last thing in regard to performance, all other things being equal, RAID 10 will always outperform RAID 6. The reason is simple – parity calculation is an “expensive” operation. Calculating it twice is even more “expensive”.
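    As a minimal illustration of what "parity calculation" means for the first parity block, here is a sketch of plain XOR parity over equal-sized blocks, and of rebuilding one missing block from the survivors. RAID 6's second parity is typically a more involved Galois-field calculation, which this sketch does not attempt. The function names are illustrative.

```python
from functools import reduce

def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing block: XOR of the survivors and the parity."""
    return xor_parity(surviving + [parity])

# Assumed example stripe: three data blocks plus one parity block
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)

# Lose the middle block and rebuild it from the other two plus parity
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered == data[1])
```

    Every write to a stripe has to keep this parity consistent, and that work is what a mirror avoids.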

    Lucid Technology, Inc. / 801 West Bay Dr. Suite 465 / Largo, FL 33770
    “Enterprise Data Storage for Everyone!”
    Ph.: 727-487-2430
    https://www.lucidti.com

  • Alex Gerulaitis

    May 23, 2013 at 9:12 pm

    Thanks for the input Vadim.

    [Vadim Carter] “I do have all the nitty-gritty details and formulas laying around somewhere which explain mathematically which RAID level is more reliable. If my memory is correct, RAID 10 is deemed slightly more reliable than RAID 6.”

    Would be awesome if you could find those formulas and see if they account for all the reliability variables and factors:
    – chances of enough simultaneous drive failures within a rebuild cycle to bring down the whole array
    – UREs and how they affect reliability, especially during rebuilds
    – protection against other causes of data loss

    From my research so far, reliability advantages of either RAID level depend on a balance of drives’ FR and URE rates. E.g. if there is a guaranteed URE during a mirror rebuild, then there is a guaranteed data loss incident in RAID10. 6 OTOH will be resistant to it, as it still has another place to look for healthy data during rebuilds.

    Based on that alone, resistance of RAID6 to data loss due to UREs is much, much higher vs. RAID10 on a single drive failure – thousands to millions of times so. Does it make RAID6 much more reliable overall? I don’t know, and would love help with that.

    At the same time, resistance to whole drive failure is really down to FR within drive replacement / rebuild window, not necessarily drive’s lifetime – which may make Wikipedia and many other formulas not too trustworthy.

    If we assume that there’s an increased chance of a drive failure during rebuild (in the same mirror pair, or in RAID6 array), then RAID10 may have less of an advantage, depending on those chances.
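    Here is a rough sketch of that comparison, looking only at whole-drive failures inside the rebuild window after a first drive has already died. Everything in it is an assumption for illustration: independent failures at a constant rate derived from an annual failure rate, a `stress` multiplier for the idea that rebuilds make failures more likely, and the 5% AFR, 24-hour window and 8-drive array in the example. It ignores UREs entirely.

```python
import math

HOURS_PER_YEAR = 8760

def p_fail_in_window(afr: float, window_hours: float, stress: float = 1.0) -> float:
    """Chance one drive fails within the window, given an annual failure rate.

    `stress` scales the hourly failure rate during the rebuild (1.0 = no change).
    """
    hourly_rate = -math.log1p(-afr) / HOURS_PER_YEAR
    return -math.expm1(-hourly_rate * stress * window_hours)

def raid10_loss_after_one_failure(p: float) -> float:
    """Array is lost only if the failed drive's mirror partner also fails."""
    return p

def raid6_loss_after_one_failure(p: float, drives: int) -> float:
    """Array is lost if at least 2 of the remaining drives also fail."""
    m = drives - 1
    return sum(math.comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(2, m + 1))

# Assumed example: 5% AFR, 24-hour replace-and-rebuild window, 8 drives
p = p_fail_in_window(afr=0.05, window_hours=24)
loss10 = raid10_loss_after_one_failure(p)
loss6 = raid6_loss_after_one_failure(p, drives=8)
print(f"RAID10: {loss10:.2e}   RAID6: {loss6:.2e}")

# Same window, but assume rebuild stress makes failures 10x more likely
p_stressed = p_fail_in_window(afr=0.05, window_hours=24, stress=10.0)
print(f"RAID10: {raid10_loss_after_one_failure(p_stressed):.2e}   "
      f"RAID6: {raid6_loss_after_one_failure(p_stressed, 8):.2e}")
```

    Playing with `window_hours` and `stress` shows how sensitive the comparison is to the rebuild window rather than to the drive's lifetime failure rate, which is the concern raised above.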

    This is why I think formulas will help play with failure and URE rates, and perhaps also match them to available stats.

    [Vadim Carter] “To anyone reading this, make no mistake about it, RAID does not protect against data corruption.”

    Perhaps you’re talking about “silent rot”, not all incidents of data corruption? A URE is data corruption, and redundant RAIDs do protect against them, to a degree?

    [Vadim Carter] “This is why RAIDZ is an excellent choice where data integrity is paramount, each block of data is checksummed and the checksum is then written to a separate area of the disk with a pointer to the original data block.”

    Are checksums a function of RAIDZ or ZFS?

    [Vadim Carter] “One last thing in regard to performance, all other things being equal, RAID 10 will always outperform RAID 6. The reason is simple – parity calculation is an “expensive” operation. Calculating it twice is even more “expensive”.”

    Agreed, parity calculations are expensive – yet computing power grows much, much faster than disk speeds. Perhaps at some point parity calculations on beefy systems with software RAID will get cheap enough to have a negligible effect on performance if they haven’t already? I’d be more concerned with I/O than parity overhead.

  • Alex Gerulaitis

    May 24, 2013 at 2:27 am

    [Alex Gerulaitis] “This is why I think formulas will help play with failure and URE rates, and perhaps also match them to available stats.”

    Found a spreadsheet that seems to be doing exactly that, on zetta.net/_wp/?m=200906.

    It puts RAID10 reliability at about 25% that of RAID6, with certain assumptions about drive counts, URE and failure rates. Not sure how trustworthy it is until someone more knowledgeable than me analyzes it.

    (The page I linked to above has errors and may not be easily navigable depending on your browser. I put that spreadsheet up on my GDrive if you’d like to take a look at it.)
