Creative Communities of the World Forums

The peer-to-peer support community for media production professionals.

RAID 6, With or Without Hot Spare

  • Ericbowen

    January 8, 2014 at 3:54 pm

    Many of the factors described as causes of RAID 5 failure are not controllable by the client, even with the RAID controller management. RAID 6 is able to repair corrupt parity data, which I have seen done, whereas RAID 5 has not been able to for medium to severe occurrences. The overall result has been near-foolproof reliability for years on every RAID 6 I have dealt with. I can’t say how many RAID 5s I have seen unravel on either OS X or Windows, since I lost count years ago. The results speak for themselves. An Intel/LSI RAID controller takes 5 to 6 hours to rebuild a 16TB 8-drive array, so even 32TB won’t take more than 10 to 12 hours on one of those arrays. That makes it perfectly acceptable to rely on one level of parity until the rebuild completes.

    Eric-ADK
    Tech Manager
    support@adkvideoediting.com

  • Chris Murphy

    January 8, 2014 at 10:43 pm

    Haha, you know, I find the proper math is often surprisingly helpful!

    4000GB @ 130MB/s does indeed translate into ~8.5 hours to fully write as a single block device if its sequential write performance is maximized. And raid1/10 (and 0+1) can do that. Parity raid will be slower, but how much slower greatly depends on the controller, and some even have settings that affect the trade-off between array responsiveness while degraded and rebuild performance. It’s probably within +20% for a 1x raid6 failure. Uncertain how much slower 2x failures will be.
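
    The arithmetic above can be checked in a couple of lines. This is a back-of-envelope sketch only; the 4TB capacity, 130MB/s write speed, and +20% degraded penalty are the assumptions from this thread, not measurements:

```python
# Rough rebuild-time estimate: time to sequentially rewrite one drive.
capacity_gb = 4000        # per-drive capacity to rewrite (4 TB)
write_mb_s = 130          # assumed sustained sequential write speed
best_case_h = capacity_gb * 1000 / write_mb_s / 3600
print(f"best case: {best_case_h:.1f} h")                      # ~8.5 h

degraded_penalty = 1.20   # the rough +20% guess for a 1x-failed raid6
print(f"with penalty: {best_case_h * degraded_penalty:.1f} h")
```

    The point stands regardless of the exact penalty: even a pessimistic multiplier keeps the single-parity exposure window in the half-day range.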

    I think Kevin is fine choosing raid6, but still needs to look at the 1x fail raid6 performance, and if he can tolerate that performance level for ~10-12 hours. It’s also worth looking at the 2x fail performance as well just to be aware of how long it’s going to take. I still think he can skip the hot spare.

    EricBowen: Many of the factors described as causes of RAID 5 failure are not controllable by the client, even with the RAID controller management.

    Sure, there are products that neither set things up correctly, nor expose the settings so that the user might have a chance of doing so themselves. But papering over these mistakes with raid6 isn’t correct either. While it’s reasonable to say, only the bottom line matters, it doesn’t change the fact that wrong raid5 and raid6 configurations can fail for identical reasons – reasons that neither raid5 nor raid6 were designed to mitigate.

    EricBowen: RAID 6 is able to repair corrupt parity data, which I have seen done, whereas RAID 5 has not been able to for medium to severe occurrences.

    How is the parity corrupted? What enables raid6 to either detect or correct it? And is this raid detectable corruption also detected by the drive ECC?

    The raid6 I’m referring to is the commonly available, non-checksumming, P+Q parity based on Galois field algebra. There are no checksums for data or parity chunks. There’s no way for this implementation to directly detect or correct corruption in data or parity chunks, nor is it designed to. It defers to drive and controller ECC, which do use checksumming, and it’s the drive ECC that detects and corrects corruption. This kind of raid6 cannot detect or correct silent data corruption.

    During normal operation, only data chunks are read, and so long as the drive doesn’t report a read error, the raid doesn’t question the veracity of the data. If the drive reports a read error due to ECC detection of an error but inability to correct it, then the affected data chunk is reconstructed from parity, and the data propagates up to the application layer and is also written back to the device that previously reported the read error. The overwrite of the affected sector(s) fixes the problem, either by successful overwrite of the physical sector, or by the drive firmware remapping to a reserve sector if there’s a persistent write failure for that sector.
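
    To make the P+Q scheme concrete, here is a minimal toy sketch (GF(2^8) with the usual 0x11d polynomial and generator g = 2). This is illustrative only, not any controller’s actual firmware. Note that neither parity carries a checksum: P and Q let you rebuild chunks you already know are missing, but say nothing about which chunk is silently wrong.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1D
    return p

def pq_parity(chunks):
    """P = XOR of data chunks; Q = sum over GF(2^8) of g^i * D_i, g = 2."""
    size = len(chunks[0])
    P, Q = bytearray(size), bytearray(size)
    coeff = 1                      # g^0
    for chunk in chunks:
        for j, byte in enumerate(chunk):
            P[j] ^= byte
            Q[j] ^= gf_mul(coeff, byte)
        coeff = gf_mul(coeff, 2)   # advance to g^(i+1)
    return bytes(P), bytes(Q)

# One *known-missing* data chunk rebuilds from P alone, by XOR:
data = [b'\x01\x02', b'\x03\x04', b'\x05\x06']
P, Q = pq_parity(data)
rebuilt = bytes(p ^ a ^ c for p, a, c in zip(P, data[0], data[2]))
print(rebuilt == data[1])   # True -- chunk 1 recovered without touching Q
```

    Recovering two missing chunks requires solving the pair of P and Q equations over the field; detecting a silently corrupted chunk would require per-chunk checksums this scheme simply doesn’t have.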

    There are obviously more data chunks than there are parity chunks. If either Q or P parity chunks are being corrupted somehow, then absolutely data chunks are being corrupted. And again in normal operation, parity chunks aren’t consulted. If the drive doesn’t report a read error, corrupted data chunks propagate to the application layer undetected or corrected.

    So why is it that raid5 instances are having so many unravelings? Because they’re configured wrong. If they’re configured correctly, the incidence of bogus ejection of drives as faulty for being unresponsive goes to essentially zero. Bad sectors are corrected on the fly, and also during normally scheduled scrubs. Yes, of course, there still could be two legitimate drive failures at the same time, and mitigating that possibility is why we have raid6. But dual drive failures are still rare in the raid sizes discussed, compared to single drive failure with a subsequent bad sector causing a 2nd disk ejection and hence array collapse.
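
    The “bad sector during rebuild” risk is easy to put rough numbers on. These figures are illustrative assumptions (spec-sheet unrecoverable-read-error rates and an example rebuild read size), not measurements from this thread:

```python
# Expected unrecoverable read errors while reading all surviving drives
# during a rebuild, at the spec-sheet URE rate (one error per N bits read).
bits_to_read = 28e12 * 8                    # e.g. ~28 TB of surviving data
for label, ure_rate in [("consumer (1 in 1e14)", 1e14),
                        ("enterprise (1 in 1e15)", 1e15)]:
    print(f"{label}: ~{bits_to_read / ure_rate:.2f} expected UREs")
```

    At consumer rates, a rebuild of this size is more likely than not to hit a bad sector, which is exactly the second-disk-ejection failure mode described above; regular scrubs and correct timeouts are what keep such a sector from ever being discovered mid-rebuild.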

  • Ericbowen

    January 9, 2014 at 12:15 am

    Likely the parity is corrupted by bad blocks in the drive in those areas; that would be the most likely cause to me. As to how the parity information is corrected, you would have to ask Intel or LSI. I just watch the logs reported by the web management consoles or error logs. If it says fixed, then I assume it was fixed. Either way, the RAID integrity was maintained with the RAID 6 in those cases, whereas the RAID 5s would fall apart in some cases.

    I am not misconfiguring RAID 5s when I create them, nor am I failing to schedule parity checks. There is no magic formula or settings when creating a RAID 5 volume; there are just the console settings and initialization. None of the console settings other than full initialization are going to affect the reliability of the RAID 5 volume. I am talking about RAID 5s with enterprise drives only, which include the timeout recovery option in the firmware. They are still unraveling, and until I see RAID 5 rebuild with corruption or handle corruption more reliably than RAID 6, I won’t suggest it over 6.

    Eric-ADK
    Tech Manager
    support@adkvideoediting.com

  • Kevin Patrick

    January 9, 2014 at 4:40 pm

    [Chris Murphy] “So why is it that raid5 instances are having so many unravelings? Because they’re configured wrong.”

    The Pegasus unit does have controller settings, but they don’t offer much explanation about the settings, nor any recommendations. The options are:

    SMART Log: On or off
    SMART Polling Interval: 1 to 1440 minutes
    Enable Coercion: On or off (I believe this is for mixed drive capacities)
    Coercion Method: GBTruncate, 10GBTruncate, GrpRounding, TableRounding
    Write Back Cache Flush Interval: 1 to 12 seconds
    Enclosure Polling: 15 to 255 seconds.

    Are these the kind of settings you’re referring to? I believe these are the only controller settings.

    I contacted Promise, but only talked briefly with sales. I will be trying to contact their tech support tomorrow to get some more information.

    I guess I should be glad my Mac Pro won’t get here until February, so I can spend the time fully understanding the Pegasus array and its options, as opposed to plugging it in and happily running away with a complete lack of RAID knowledge.

    In case I forget, thanks Alex, Chris and Eric for all your comments and advice.

  • Kevin Patrick

    January 9, 2014 at 4:51 pm

    [Chris Murphy] “scrubs weren’t scheduled”

    I’m not sure what a scrub is. Both the definition and how that translates to the features of a Pegasus array. The background activities that are available for the array are:

    Media Patrol: Default on. This appears to check media, not data. If a media error is encountered, it triggers PDM.
    Redundancy Check: Ensures all data matches exactly. Settings for Low, Medium and High. Balancing system resources vs. r/w operations. Options to Auto Fix or Pause on Error.
    Synchronization: Recalculates redundancy data on physical drives to ensure sync. Settings for Low, Medium and High.

    [Chris Murphy] “drive and controller timeouts weren’t set correctly”

    I’m not sure this is even a setting for the Pegasus unit. (unless I missed something)

  • Alex Gerulaitis

    January 9, 2014 at 11:22 pm

    [Chris Murphy] “If either Q or P parity chunks are being corrupted somehow, then absolutely data chunks are being corrupted. “

    Not sure I get it. If we’re not looking at a possibility of simultaneous and independent corruption of two data blocks within one stripe… If one parity chunk is corrupted, you could always re-create it from the other parity chunk, and the data?

    Isn’t 6 equivalent to a 3-way RAID1 for the purpose of data recovery? One copy is corrupted, you’re not sure which one is good out of three – just check which ones match, assume those ones are good, discard the mismatching one?

    Applying that to 6: calculate P and Q again from the data chunks, compare them to existing P and Q; whichever one mismatches – re-write it, and Bob’s your uncle?

    If the data chunks were corrupted, then P and Q would still be healthy and you could re-create data from them, supposedly?

  • Chris Murphy

    January 11, 2014 at 6:38 pm

    EricBowen: Likely the parity is corrupted by bad blocks in the drive in those areas; that would be the most likely cause to me.

    Why only parity? Drives know nothing about RAID; they won’t discriminate between data and parity chunks. Bad blocks have a much greater chance of corrupting data chunks, simply because there are more of them. If you’re really seeing parity chunks corrupted more often than the parity-to-data chunk ratio would predict, that sounds like raid firmware bugs to me.

    Bad blocks that are not detected or corrected by drive ECC are quite rare, and when they happen, it’s not something parity raid knows about in normal operation. The usual case is that the drive’s ECC detects the error and corrects it without informing the controller; another possibility is detection without correction, in which case the controller is informed via a read error. That read error includes the LBA of the bad sector, so the controller knows what data needs to be reconstructed from parity; it then sends a copy up to the application layer as if nothing has happened, and causes a copy to be written to that same (bad) LBA. Then it’s up to the drive firmware to determine whether merely overwriting the sector fixes the problem; if it’s a persistent write failure, the drive removes that physical sector from use by dereferencing it, and the LBA and data get assigned to a reserve sector. Once that happens, the old sector isn’t accessible with general-purpose commands (it has no LBA).
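
    The repair path just described can be sketched as a toy simulation (all names hypothetical; real controllers do this in firmware):

```python
from functools import reduce

class ToyDrive:
    """One chunk per stripe; sectors in `bad` raise a read error, and a
    successful overwrite makes the sector readable again (or remapped)."""
    def __init__(self, chunks):
        self.chunks, self.bad = list(chunks), set()
    def read(self, stripe):
        if stripe in self.bad:
            raise IOError(f"URE at stripe {stripe}")  # ECC detected, uncorrectable
        return self.chunks[stripe]
    def write(self, stripe, data):
        self.chunks[stripe] = data
        self.bad.discard(stripe)                      # overwrite/remap fixes it

def xor(blocks):
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

def raid_read(drives, stripe, idx):
    """On a drive-reported read error: rebuild the chunk from every other
    chunk in the stripe (data + parity), pass it up, and write it back."""
    try:
        return drives[idx].read(stripe)
    except IOError:
        rebuilt = xor([d.read(stripe) for i, d in enumerate(drives) if i != idx])
        drives[idx].write(stripe, rebuilt)            # self-healing write-back
        return rebuilt

# Three data drives plus XOR parity, one stripe:
data = [b'\x0a', b'\x0b', b'\x0c']
drives = [ToyDrive([c]) for c in data] + [ToyDrive([xor(data)])]
drives[1].bad.add(0)                    # a sector goes bad on drive 1
print(raid_read(drives, 0, 1))          # reconstructed, and the sector repaired
```

    Note the key asymmetry: the drive had to *report* the error for any of this to happen. If it silently returned wrong bytes, this path never triggers.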

    Anyway, nothing else in the storage stack discriminates between data and parity. So if you mean to indicate a high incidence of parity corruption (either single or dual parities) compared to data corruption, that sounds like firmware problems. And firmware-induced corruption is not unheard of.

    An Analysis of Data Corruption in the Storage Stack

    EricBowen: As to how the parity information is corrected, you would have to ask Intel or LSI. I just watch the logs reported by the web management consoles or error logs. If it says fixed, then I assume it was fixed.

    It may very well be that there are proprietary implementations that the manufacturers won’t talk about.

    EricBowen: I am not misconfiguring RAID 5s when I create them, nor am I failing to schedule parity checks.

    I take your word for it. But then, well before a raid5 unravels, scrubbing would reveal mismatches. Mismatches aren’t normal or OK. In small amounts, they can represent silent data corruption, which again raid6 doesn’t mitigate. Anything more than that indicates a problem.

    EricBowen: I am talking about RAID 5s with enterprise drives only, which include the timeout recovery option in the firmware. They are still unraveling, and until I see RAID 5 rebuild with corruption or handle corruption more reliably than RAID 6, I won’t suggest it over 6.

    OK, but again, raid6 isn’t meant to handle corruption; it’s meant to handle two drive failures. Corruption ≠ failure. There is an expected redundancy improvement in raid6 over raid5 in the use case it’s designed for: protection from an additional drive failure while still degraded from a one-disk failure. Here’s a presentation from NetApp.

    Parity Lost and Parity Regained

    Corruption occurs outside of drives a significant percent of the time, before it even gets to the raid controller, which will promptly write corrupt data and correct parities for that corrupt data to disk (short of additional corruptions).

    Are Disks the Dominant Contributor for Storage Failures?

    You’ve got enterprise drives: the mid-range enterprise SATA drives spec an order of magnitude fewer unrecoverable errors than consumer drives, and enterprise SAS yet another order of magnitude fewer. And on top of that regular scrubs, which should spot mismatches before they become problems. Yet you’re reporting significantly higher raid5 implosions compared to raid6? This is unexpected but without logs, maybe even debug logs, it may remain obscure what the cause is.

  • Chris Murphy

    January 11, 2014 at 7:04 pm

    One drawback of a sane UI is that by not showing esoteric settings, users have no way of knowing whether they’re being handled on their behalf or not. I don’t see anything in here that could be set flat-out wrong enough to explain corruption. The write back cache flush interval of up to 12 seconds could mean a rather spectacular amount of corruption if there were also a power loss during heavy writes. However, the user manual for the R6 says the write back cache is battery backed, so that ought to mean that once power is reapplied, the contents of the cache are written to the drives on power-up.

    For there to be no data loss or corruption requires that the drives’ own write caches are disabled. That way anything sent to the drives is committed (in theory), and anything in the controller write cache is preserved until power is restored. This doesn’t mean there will be zero corruption, but it’s significantly reduced.

  • Chris Murphy

    January 11, 2014 at 9:13 pm

    Alex Gerulaitis: If we’re not looking at a possibility of simultaneous and independent corruption of two data blocks within one stripe…

    Right, this would be a problem because two corruptions in the same stripe can’t be treated the same as two missing chunks (read errors, or failed drives). For failures, the exact affected chunks are known. For corruptions it has to be deduced, and with two corruptions that’s difficult at best, and in the category of specialized data recovery.

    Alex Gerulaitis: If one parity chunk is corrupted, you could always re-create it from the other parity chunk, and the data?

    It could just recompute P+Q and overwrite – no need for a reconstruct. But before that, how was the corruption determined and isolated? Standard in parity raid implementations is a parity verification (scrub check or read-only scrub) but all this does is report mismatches. That is, the new parities don’t match existing. So that just says there’s a problem.
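
    A toy example of why a plain scrub can only report a mismatch, not place it (hypothetical two-data-chunk stripe with XOR parity for brevity; the same ambiguity holds for P+Q):

```python
def parity(d1, d2):
    """XOR parity over a two-data-chunk stripe."""
    return bytes(a ^ b for a, b in zip(d1, d2))

d1, d2 = b'\x10\x20', b'\x0f\x0e'
p = parity(d1, d2)

flip = lambda b: bytes([b[0] ^ 1]) + b[1:]        # silently corrupt one byte
cases = {"data chunk corrupt":   (flip(d1), d2, p),
         "parity chunk corrupt": (d1, d2, flip(p))}
for name, (a, b, stored_p) in cases.items():
    # A scrub recomputes parity and compares -- that is all it can do:
    print(name, "-> mismatch:", parity(a, b) != stored_p)   # True both times
```

    Both cases produce the identical symptom, so without per-chunk checksums the scrub can flag the stripe but cannot pick the bad chunk.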

    There might be proprietary implementations that go to the effort of deducing whether it’s D or P or Q that is corrupt, but for a one-off corruption rather than a whole disk? It seems doubtful, given that raid6 corruptions still occur in sufficient quantity that the industry has developed alternative solutions to avoid or mitigate them: T10 DIF/DIX (now called PI), ZFS, Btrfs, and ReFS.

    Alex Gerulaitis: Isn’t 6 equivalent to a 3-way RAID1 for the purpose of data recovery? One copy is corrupted, you’re not sure which one is good out of three – just check which ones match, assume those ones are good, discard the mismatching one?

    It’s maybe worse with raid1 because the vote is an arbitrary decision. It seems like it makes sense to go with the majority, 2 vs 1. But in reality you’ll get sufficiently arbitrary results that it doesn’t really fix the problem.

    Alex Gerulaitis: Applying that to 6: calculate P and Q again from the data chunks, compare them to existing P and Q; whichever one mismatches – re-write it, and Bob’s your uncle? If the data chunks were corrupted, then P and Q would still be healthy and you could re-create data from them, supposedly?

    Sure, it’s possible. I don’t know of any implementations that do this, but that proves nothing. It seems like a lot of additional code, plus testing and maintenance of that code, incurring greater risk of bugs and additional corruptions, for a small use case that should only rarely be a problem with the kind of hardware we’re talking about. Again, raid6 is about enabling recovery from specific known-missing chunks, not deducing what’s still present but maybe wrong. For the use cases where some corruption is a big problem, there are better solutions than raid6.

  • Chris Murphy

    January 11, 2014 at 10:16 pm

    Media Patrol requires translation; it seems like a marketing term to me. I’ll guess it’s either a selective or extended self-test via SMART. Default On means it’s happening, but how often, and at what time of day? If it’s a SMART test, it slightly reduces drive performance, so it might be something you’d rather run on demand if the schedule can’t be specified.

    Redundancy check sounds like a scrub/verify. I’m not sure what the options do; I’d rather just have a mismatch count than have it either stop or fix. I also don’t know what autofix means exactly – whether it ignores disk read errors or fixes them, or whether it overwrites parities in the event of a mismatch, or what. There are multiple potential problems, each with different possible fixes.

    Synchronization sounds like it’ll read data chunks, recompute parities, and overwrite existing parities. You wouldn’t normally do this without a reason. For example if a redundancy check shows mismatches have spiked since the last time you ran it, you’d want to power down, check all connections, maybe even reseat the drives if they’re in a backplane. Power back up. And do a file system check/repair – of the full variety[1]. Now you can rebuild parities, which will also reset any mismatch counts. If you were to immediately do another redundancy check there should be no errors/mismatches at all. If there are, then there’s almost certainly a hardware problem and it needs to be found.

    [1]
    Disk Warrior if you want GUI only, or check the btree rebuild options under the -R option in “man fsck_hfs”. Disk Utility does not rebuild any of the btrees. I’d use the 2nd form, -fy -f first, then separately rebuild each btree with -Rc, -Re, -Ra. This could take a while. And it only fixes file system metadata. Actual data files are untouched, and the file system itself knows nothing of the underlying raid so that’s not fixed either. However, if the raid is in bad shape, file system checks will be in bad shape and likely not repairable. If it doesn’t need repairing or is readily fixable, then at least the data chunks for the file system are likely consistent.

