External hard drive(s) causing Kernel Panic since RAID setup.


Luke Ogden replied 12 years, 7 months ago 7 Members · 26 Replies

Chris Murphy
August 20, 2013 at 9:52 pm

Re: PDF “Using the HFS+ journal for deleted file recovery” describes a way to use stale journal buffer entries to locate and recover successfully written files that were subsequently deleted, before the relevant journal data itself is overwritten.

That is totally different from the notion that the journal can be “used to restore data that is incomplete or corrupted.” Section 1.3 of the cited document describes rather clearly how JHFS+ journals file system metadata only. Step 1 only occurs once the file data is successfully committed to disk. The file is neither incomplete nor corrupted; it’s a successful write. If the write isn’t completely successful, no entry is made in the metadata journal, since there is no point in doing so.

A metadata only journal does not enable restoration of incomplete or corrupted writes to disk. That’s what data journaling can do, and what COW file systems can do (and there are limitations there too depending on when the crash occurs – none of these things enable magical appearance of unwritten data.)
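
A toy model may make the distinction concrete. This is only a sketch of the ordering described above (data first, then a journal entry for the metadata), not the actual JHFS+ implementation; the class and method names are made up for illustration.

```python
class ToyVolume:
    """Hypothetical metadata-only journaled volume (illustration only)."""

    def __init__(self):
        self.data_blocks = {}  # block number -> bytes actually on disk
        self.catalog = {}      # file name -> list of block numbers (metadata)
        self.journal = []      # pending metadata transactions, never file data

    def write_file(self, name, chunks, first_block, crash_after=None):
        """Write chunks to consecutive blocks; optionally 'crash' part way."""
        blocks = []
        for i, chunk in enumerate(chunks):
            if crash_after is not None and i >= crash_after:
                return False   # interrupted: no journal entry is ever made
            self.data_blocks[first_block + i] = chunk
            blocks.append(first_block + i)
        # Only a fully written file gets its metadata change journaled.
        self.journal.append((name, blocks))
        return True

    def replay_journal(self):
        """Bring the catalog up to date; this never touches file data."""
        for name, blocks in self.journal:
            self.catalog[name] = blocks
        self.journal.clear()


volume = ToyVolume()
volume.write_file("complete.mov", [b"aa", b"bb"], first_block=10)
volume.write_file("partial.mov", [b"xx", b"yy", b"zz"], first_block=20, crash_after=1)
volume.replay_journal()
print(sorted(volume.catalog))  # only the completed file is known to the catalog
```

Replay makes the catalog consistent again, but the journal holds nothing that could supply the blocks of the interrupted write.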

You probably won’t find a document that describes how a controller does not behave but rather how it does. There are a lot of things a controller does, but file system management and mounting isn’t one of them. The controller manages the physical, link and transport layers to the drive which includes 8b/10b encoding, NCQ/TCQ, FIS, things that the OS does not need to deal with. The payload in FIS packets is the data stored on the disk sectors, and that raw data can be bootloader code, partition maps, file system structures, file metadata or file data itself. Neither the drive nor controller care what the payload is.

For hardware RAID cards, there’s an additional layer because they’re necessarily a bit more intelligent. Depending on the implementation, they might understand what a partition map is, and they’ll certainly understand their own raid metadata found on each drive, which tells them what chunks are on what physical drives and what the LBA mappings are. Many hardware controllers work directly on unpartitioned disks, writing only their particular flavor of raid metadata to manufacturer-defined locations on each disk. Assembling the raid array (a logical block device) is the job of the controller, which then presents that logical block device to the kernel via a kernel driver; the kernel sees this block device as it does any physical block device (like a bare drive). Obviously there is no such thing as a 12TB physical block device (yet), but the kernel doesn’t know that and doesn’t need to. As with any drive, it asks the controller to read at least LBAs 0, 1, and 2, which are the standard locations for the pMBR, primary GPT header and GPT table. The raid controller maps those requests to the actual physical devices and LBAs – meaning that this partition data, contained within the logical block device, is subject to being stored in both a data chunk and parity chunk just like any other kind of data.
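
To illustrate the kind of mapping involved, here is a rough sketch of how a logical LBA could be translated to a member disk and physical LBA. It assumes a RAID 5 set with a rotating-parity ("left-asymmetric") layout and a fixed chunk size; real controllers use their own layouts and metadata offsets, so the function name, parameters and layout are all assumptions.

```python
from typing import NamedTuple


class Location(NamedTuple):
    disk: int  # index of the member disk
    lba: int   # sector number on that member disk


def map_raid5_lba(logical_lba, n_disks, chunk_sectors, data_offset=0):
    """Map a logical LBA to (disk, physical LBA) for an assumed RAID 5 layout.

    data_offset stands in for whatever space a controller reserves for its
    own raid metadata at the start of each member disk (hypothetical).
    """
    if n_disks < 3:
        raise ValueError("RAID 5 needs at least three member disks")
    chunk, within = divmod(logical_lba, chunk_sectors)
    stripe, slot = divmod(chunk, n_disks - 1)  # n_disks - 1 data chunks per stripe
    parity_disk = (n_disks - 1) - (stripe % n_disks)  # parity rotates per stripe
    disk = slot if slot < parity_disk else slot + 1   # skip over the parity disk
    return Location(disk, data_offset + stripe * chunk_sectors + within)


# The kernel asks for LBAs 0, 1 and 2 of the logical device (pMBR, GPT header,
# GPT table); the controller turns each into a request to a member disk.
for lba in (0, 1, 2):
    print(lba, map_raid5_lba(lba, n_disks=4, chunk_sectors=128))
```

The kernel only ever sees the logical LBA on the left; the right-hand side stays inside the controller.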

The kernel then parses the GPT, finds a partition type GUID it recognizes, finds the start LBA for that partition, and asks the controller to read e.g. LBAstart+2, which is the location of the JHFS+ volume header. Naturally this is different for each file system. The controller fulfills that request by sending a read LBA request to the proper physical drive. The kernel, from the volume header data, can then find the various other JHFS+ structures and request those LBAs from the controller, which in turn maps those requests to actual physical drives. Once the kernel has sufficient information about the volume, the volume is mounted and presented in a sequence of events that also involves diskarbitrationd and the Finder.
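
The lookup sequence can be sketched roughly as follows. This is an illustration, not kernel code: it assumes 512-byte sectors, and `read_lba` is a hypothetical callable standing in for "ask the controller for this LBA". The GPT field offsets and the HFS+ partition type GUID follow the published GPT and HFS+ formats.

```python
import struct
import uuid

SECTOR = 512  # assumed logical sector size
HFS_PLUS_TYPE = uuid.UUID("48465300-0000-11AA-AA11-00306543ECAC")


def find_hfs_volume_header(read_lba):
    """Return (volume header LBA, 2-byte signature) or None.

    read_lba(n) must return sector n of the block device, whether that
    device is a bare drive or a logical raid device.
    """
    header = read_lba(1)  # LBA 1: primary GPT header
    if header[:8] != b"EFI PART":
        raise ValueError("no GPT header at LBA 1")
    (entries_lba,) = struct.unpack_from("<Q", header, 72)  # start of GPT table
    num_entries, entry_size = struct.unpack_from("<II", header, 80)
    per_sector = SECTOR // entry_size
    for i in range(num_entries):
        sector = read_lba(entries_lba + i // per_sector)
        start = (i % per_sector) * entry_size
        entry = bytes(sector[start:start + entry_size])
        # GPT stores GUIDs in mixed-endian form, hence bytes_le.
        if uuid.UUID(bytes_le=entry[:16]) == HFS_PLUS_TYPE:
            (first_lba,) = struct.unpack_from("<Q", entry, 32)
            header_lba = first_lba + 2  # 1024 bytes into the partition
            return header_lba, bytes(read_lba(header_lba)[:2])
    return None
```

On an HFS+ volume the returned signature would be `b"H+"`. Nothing in this sequence requires the device underneath `read_lba` to know what a partition or file system is.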

So there are lots of layers, and you’re not going to find things parsing data in layers where they don’t need to. None of the SAS, SATA, or RAID controllers need to understand file system structures at all, and they have nothing to do with mounting a file system. SATA/SAS controllers present a physical block device to the kernel, and raid controllers assemble physical member disks into a logical block device and present that to the kernel. RAID array assembly and volume (file system) mounting are different things.
Chris Murphy
August 20, 2013 at 10:53 pm

Eric says journaling adds overhead and can be used to restore (fix) corrupted files (i.e. fix file system inconsistencies). Isn’t he right about that?

No, it doesn’t actually work that way. If there were an incomplete file (data) write, or it was corrupted when written, the journal wouldn’t have any information to fix either of those problems. Again, the purpose of a journal is to (hopefully) obviate the need for a lengthy fsck on larger file systems. Also, there has been no GUI means of creating a non-journaled HFS+ volume for some time, which leaves the CLI. That is quite an obscure workaround for the claim that the existence of a journal causes certain (undefined) issues with applications and controllers, especially when neither apps nor controllers are journal aware.

Chris says SAS/SATA controllers aren’t file system aware, they deal with block data. Eric says RAID controllers deal with RAID sets and are responsible for data integrity on them, which could be related to OP’s problems.

The OP’s problem occurs only once two disks in a USB docking station are Apple software RAID 1’d together. This mostly implicates the kernel and by extension the softRAID.kext and one or more of the IOxxx.kext files. So the kernel panic report is needed. Maybe there’s a 3rd party USB driver being used (or not being used) that’s implicated in the kp. But I think it’s very unlikely the kp is journal related at all, and somewhat unlikely it’s drive related. I think it’s more likely related to some hardware incompatibility between XNU and this docking station, for some reason only when software RAID is used.

I did previously say that a kernel panic is always a bug. That’s not entirely true: there are worse things than a kp, such as massive data corruption on disk. So there are good reasons why an unhandled exception is better off ending with the kernel taking an abrupt nose dive rather than staying alive and possibly doing some serious damage.
Luke Ogden
August 30, 2013 at 12:02 pm

[Chris Murphy] “So the kernel panic only happened after creating a software raid1 array? And the drive erase operation that took three days was successful for both drives?”

Correct.

[Chris Murphy] “Can you post the kernel panic to pastebin and post the URL in the forum?”

https://pastebin.com/zJph9DqA

[Chris Murphy] “Do you get a kp when only one of the drives is inserted into the docking station?”

Yes I tried mounting them one at a time and they cause a kp individually as well

UPDATE
I’ve been away for a few weeks, but this morning I successfully mounted each drive long enough to format it as Mac OS Extended (Journaled) in Disk Utility.

They are currently both mounted in the dock and I’ve been performing a few basic read/write functions to each.

I’m about to try setting them up as a RAID 1 pair again using the ‘best practice’, i.e. don’t zero out the data and format each as Mac OS Extended (Journaled), correct?

If I configure them as RAID 1 again and get the kp, we know it’s the dock, don’t we?
Chris Murphy
August 30, 2013 at 6:13 pm

[Luke Ogden] “don’t zero out the data and format each as Mac OS Extended (Journaled), correct?”

Yes. Zeroing isn’t needed, diskutil should wipe the metadata areas first, and then create new ones, then format the resulting logical block device as HFSJ. It shouldn’t matter what the file system is, the kp appears to be occurring before the file system is mounted, at the time the raid is being assembled.

[Luke Ogden] “If I configure them as RAID 1 again and get the kp we know it’s the dock don’t we?”

Your kp report points to two things that were in use immediately prior to the kernel panic: apple.iokit.IOStorageFamily and apple.driver.AppleRAID. And in the list of loaded kexts, I see only Apple kexts. I think it’s a kernel bug. It may also be that the dock isn’t conforming to some aspect of the USB spec, and the kernel isn’t prepared to handle it. Or maybe something is being corrupted (again, this could be dock firmware related) and the kp is occurring as the result of reading some corrupt metadata.
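
For anyone who wants to run the same check on their own report, a small sketch follows. It assumes kext lines in the report start with a reverse-DNS bundle identifier followed by a tab, a space, or "(version)"; that line format is an assumption, and the sample lines (including the third-party identifier) are invented for illustration and are not taken from the report in this thread.

```python
import re

# Assumed format: a bundle identifier with at least three dot-separated
# parts at the start of a line, followed by a tab, a space, or "(".
KEXT_ID = re.compile(r"^\s*((?:[A-Za-z0-9_-]+\.){2,}[A-Za-z0-9_-]+)[\t (]", re.M)


def third_party_kexts(panic_text):
    """Return bundle identifiers in the report that are not com.apple.*"""
    found = []
    for bundle_id in KEXT_ID.findall(panic_text):
        if not bundle_id.startswith("com.apple.") and bundle_id not in found:
            found.append(bundle_id)
    return found


# Invented sample lines, for illustration only.
sample = (
    "com.apple.driver.AppleRAID\t4.0.6\n"
    "com.example.driver.HypotheticalDock\t1.0\n"
    "   com.apple.iokit.IOStorageFamily(1.7.2)[...]@0x1000\n"
)
print(third_party_kexts(sample))
```

An empty result means only Apple kexts were listed, which is what I see in this report.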

But in any case, the protocol is to treat it as a product defect with the manufacturer. They need to confirm/deny whether they support Apple software RAID 1 being used with their product (I haven’t looked at whether they claim this in marketing material or spec sheets). And if they want to support it, they need to take it up with Apple to work out the details of who fixes what.

I suspect you will get another kp, because a non-deterministic bug where you don’t is actually worse. Without a lot of usage you won’t have confidence if at any future time you will get another kp. And that could cause raid metadata corruption, or worse, file system corruption. So I think that’s not really worth messing around with, I’d go straight to the manufacturer, see what their expectations are for its use. If it’s supposed to work with software RAID, have them swap out the hardware.

Another thing you could try is a much newer kernel. Current is XNU 12.4.0, which is the OS X 10.8.4 kernel. It’s possible the manufacturer will recommend this too. I haven’t checked the specs to see what the minimum supported OS is.
Chris Murphy
August 30, 2013 at 6:26 pm

Attempt 2 at being more concise and coherent on next steps. What *I* would do is not even mess with the dock anymore. I would call the manufacturer support, tell them you used Apple software raid1 with Disk Utility, and now you’re getting kernel panics on both a rMBP and iMac. And you’re happy to give them the kp report if they want. Questions for them: Is the use of Apple software raid1 supported with this dock, i.e. is it expected to work? Is it expected to work with OS X 10.7.5? Do I have the current dock firmware applied (if applicable)?

If the answers are yes, yes and yes. Then set up an RMA and send it back. I wouldn’t try to troubleshoot this specific unit anymore. All of the conditions are the same as before, so just trying again shouldn’t matter and if it does matter, it’s still unreliable. So I’d swap out the hardware.
Luke Ogden
September 12, 2013 at 12:35 pm

So I spoke with Startech, the manufacturer of the dock, today, and their technical advisor informed me that this particular dock doesn’t support an Apple software RAID 1 configuration. So, as you rightly suggested, Chris, I’m going to swap out the hardware and start over. Thanks for the help in troubleshooting this problem.

With regard to the new hardware, does anyone have any recommendations? The technical advisor pointed me in the direction of this enclosure:

https://uk.startech.com/HDD/Enclosures/Dual-Bay-SATA-External-Hard-Drive-Enclosure~SAT3520U3SR

But as it’s for weekly backups, I kind of liked the tool-less design of the previous dock. Not a deal breaker though. As this will be the third solution I will have purchased, it’s important now to just get something that works.

Thanks
