-
External hard drive(s) causing Kernel Panic since RAID setup.
Luke Ogden replied 12 years, 7 months ago 7 Members · 26 Replies
-
Ericbowen
August 14, 2013 at 10:01 pm
Don’t format the drives on the Windows system. Just get into the Diskpart utility and run the clean command. That will wipe the disk table info. Then you should be able to initialize and format them again on the Mac. Just don’t use journaling. Journaling adds a lot of metadata that has caused, and can cause, problems with applications, and I would not be surprised if it is causing some issues now with the controller. Journaling is used for HFS+ partition data recovery; other than that it has no real function. I am a proponent of writing zeros to drives to clean them out, since it seems the drive manufacturers are no longer marking out the bad blocks before they ship. The link above from Apple’s forums has a post from the drive manufacturers as to why they suggest it also. I don’t believe you were wrong in that, nor do I think that caused this issue, because it shouldn’t have.
BTW you will probably have to install the MacDrive trial long enough for Windows to recognize the disks so you can clean them. Another option is to download Acronis’ or Paragon Software’s boot disk and boot to the utility, then delete the current partition. Then boot to Windows and clean the drives. I figure one of the two boot discs should boot on an iMac; if not, boot the Windows system with them.
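For reference, the Diskpart sequence described above looks roughly like this as a session transcript (the disk number is a placeholder; confirm it against the output of list disk, since clean destroys the partition table of whatever disk is selected):

```
C:\> diskpart
DISKPART> list disk        (identify the external drive by its size)
DISKPART> select disk 1    (placeholder number; use the one from "list disk")
DISKPART> clean            (wipes the partition table only, not every sector)
DISKPART> exit
```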
Eric-ADK
Tech Manager
-
Eric Hansen
August 14, 2013 at 10:05 pm
once i have properly set up a drive, if it gives me issues even once and it’s still under warranty, i send it off for replacement. i don’t have time to mess around and my data’s too important. i use Hitachi because they have a no-questions-asked replacement policy, but i bet other companies do too.
e
Eric Hansen
Production Workflow Designer / Consultant / Colorist / DIT
https://www.erichansen.tv
-
Chris Murphy
August 18, 2013 at 7:25 pm
So the kernel panic only happened after creating a software raid1 array? And the drive erase operation that took three days was successful for both drives?
Certainly any kp is a kernel bug, but it can be triggered by bad hardware behavior. Yet ~750 million write operations succeeded during the erase. Can you post the kernel panic to pastebin and post the URL in the forum?
It’s a good idea to let the system report the kernel panic to Apple, and in the comment section include the make/model and firmware revision of the docking station, and the fact that the kp’s only started after the two drives were software raid1’d together. That could be coincidence; the only way to test is to work with drives that don’t have Apple raid metadata on them and see if they can be used independently for a while, then recreate the software raid1 and see if you get kp’s again.
Do you get a kp when only one of the drives is inserted into the docking station? If not, the Apple raid metadata can be removed from OS X using dd. If it still kp’s, then either you’ll need to put the drive into an alternate enclosure that hopefully doesn’t trigger the kp, or boot the iMac from a Linux live CD/DVD and use dd there to wipe the raid metadata off the drives. So decide which is easier: downloading and booting a Linux live CD/DVD, or swapping enclosures. Once you confirm you’re not getting kp’s when connecting a single drive, post the result of this command:
diskutil list
And I’ll get you a suitable dd command to blow away the Apple software raid metadata.
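In the meantime, a generic sketch of the kind of dd invocation involved (disk2 is a placeholder; double-check it against diskutil list first, because dd will happily destroy the wrong disk):

```shell
# DESTRUCTIVE: verify the identifier with "diskutil list" before running
diskutil unmountDisk /dev/disk2
# zero the first 1 MiB of the raw device (partition map plus any
# headers at the front of the disk)
sudo dd if=/dev/zero of=/dev/rdisk2 bs=1m count=1
```

The exact location of the Apple software raid metadata depends on the partition layout, which is why the diskutil list output is needed before a precise command can be given.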
-
Chris Murphy
August 18, 2013 at 7:32 pm
Zeroing the drives can’t possibly be related to the problem. It may be pointless, but it’s also benign. He had ~750 million successful write operations doing that zeroing, and only once the raid1 was created did he start getting kp’s. So I think the problem is elsewhere.
If he can get one drive into either the existing docking station or an alternate enclosure without getting a kp, or boot a totally different kernel that doesn’t panic with this hardware, it’s possible to remove the Apple raid1 metadata from both drives and then see if the kp is consistently reproducible only when the drives are software raid1’d together. That’s much faster than returning the drives, and to save time he needs to know whether this is induced by software raid1. I think it’s some obscure incompatibility between software raid1 and this docking station, but there isn’t enough information to know that yet.
-
Chris Murphy
August 18, 2013 at 8:12 pm
Even if true, the read/write head is designed such that writes are immediately followed by reads, and the firmware confirms or denies the write. If a sector persistently fails, its LBA is reassigned to a reserve sector and the bad sector is removed from use; it can’t even be accessed by software anymore, as it no longer has an LBA. Writing zeros will also trigger this, but it’s not really necessary in advance.
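This remapping is visible from the host side in the drive’s SMART counters; with smartmontools installed, for example (disk2 is a placeholder):

```shell
# Reallocated_Sector_Ct: sectors the firmware has already retired
# Current_Pending_Sector: sectors flagged for remapping on the next write
sudo smartctl -A /dev/disk2 | grep -E "Reallocated|Pending"
```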
The bigger issue for large raid arrays is the lack of scheduled scrubbing to make sure rarely read or written sectors aren’t going bad. The typical raid5 failure case is lack of scrubbing: a handful of bad sectors develop on one or more drives, then a single drive fails. During the rebuild, one of those bad sectors is encountered as a persistent read failure, the rebuild fails, and thus the array fails. It’s exacerbated by using consumer drives with very long error-correction timeouts.
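As an illustration, on a Linux md raid array (assuming it is /dev/md0), a scrub can be triggered by hand like this; many distributions schedule exactly this via cron:

```shell
# read every sector of every member and check parity/copies for mismatches
echo check | sudo tee /sys/block/md0/md/sync_action
# watch progress
cat /proc/mdstat
```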
Before disposing of a drive, it’s probably a good idea to “clear” or “purge” any non-raid or raid1 drives, just like it’s a good idea for photographers to tear up their reject prints before disposing of them. “Clear” is the minimum kind of erase, which is what Apple Disk Utility does. To constitute a “purge” one must use the firmware’s ATA Secure Erase, Sanitize, or Crypto Erase commands. Even though ATA Secure Erase has been in ATA drives since 2000, Apple sadly doesn’t support it in Disk Utility, and it’s the only way to sanitize data that lingers in bad blocks that were removed from use but still contain data. The added thoroughness is perhaps a small benefit, but having the firmware perform the operation via ATA commands is much faster than writing zeros sector by sector in software, and it entirely frees the bus from write operations. Further annoying is that most USB bridge chipsets don’t support the passthrough these commands need, so typically they’re only available over a SATA or eSATA connection.
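On a Linux machine with a direct SATA connection, this firmware-level “purge” is exposed through hdparm; a sketch with placeholder device node and password (the drive must not be in the “frozen” security state):

```shell
# confirm the drive supports Security Erase and note the estimated time
sudo hdparm -I /dev/sdX | grep -A8 Security
# set a temporary user password, which the erase command requires
sudo hdparm --user-master u --security-set-pass p /dev/sdX
# issue the erase; the firmware overwrites the drive internally,
# including reallocated sectors that software can no longer address
sudo hdparm --user-master u --security-erase p /dev/sdX
```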
-
Chris Murphy
August 19, 2013 at 3:35 am
The journal makes it possible to mount an unclean file system faster than doing a full fsck; it is not data recovery. On large volumes, or with many files, an fsck on a non-journaled file system can take a long time, even hours, and gobs of memory. The journal has no possible relationship to application problems; applications aren’t even aware of whether the journal is enabled. There is perhaps some advantage to disabling journaling on SSDs, which can repair much faster after an abrupt unclean unmount, and the journal’s frequent metadata writes cause additional SSD wear. But if this is for video there isn’t much metadata anyway: the files are huge and few, therefore little metadata.
-
Ericbowen
August 20, 2013 at 3:19 pm
https://support.apple.com/kb/HT2355
“When you enable journaling on a disk, a continuous record of changes to files on the disk is maintained in the journal. If your computer stops because of a power failure or some other issue, the journal is used to restore the disk to a known-good state when the server restarts.”
Notice the “restore” in this part? File system integrity data is not just used for verification; it’s also used to restore data that is incomplete or corrupted on the drive. Hence what I stated. And yes, there have been metadata issues with journaling, especially on the Windows platform with MacDrive, because the application is not communicating with the drive directly; it goes through the hardware controller. That translation of volume to block data has to go through that process, and that is where the problems happen. Verified here by turning journaling off, and it has often been recommended for Mac-formatted drives used for processing work rather than archive.
BTW, if the controller cannot recognize the volume table or data, it won’t initialize and mount the disks, which is what this issue points to.
Eric-ADK
Tech Manager
-
Chris Murphy
August 20, 2013 at 5:36 pm
“journal is used to restore”
JHFS+ uses metadata-only journaling; data journaling does not happen. The kb statement above means the journal is used to update the state of the file system rather than having to run fsck. If at mount time the journal is determined to be inconsistent, then fsck is needed.
“It’s also used to restore data that is incomplete or corrupted on the drive.”
This is incorrect.
As for the “meta data issues” with journaling, what issues? You’ve established only correlation with journaling, not causation. NTFS is also a journaled file system; do these same problems happen with Windows applications on NTFS, or only with 3rd-party software on JHFS+? And since at least 10.8, possibly 10.7, Disk Utility doesn’t have an option for plain HFS+ anymore; it only creates JHFS+ and HFSX, with or without encryption. There also isn’t a GUI option to disable the journal on an existing JHFS+ volume, though this can still be done from the CLI with diskutil disableJournal. It’s much more difficult to disable journaling on NTFS. Journaling has been around for a really long time; for apps to have a problem with it sounds like an application bug.
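For reference, that CLI toggle looks like this (the volume path is a placeholder):

```shell
# disable the journal on a mounted JHFS+ volume
diskutil disableJournal /Volumes/Media
# re-enable it later if desired
diskutil enableJournal /Volumes/Media
```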
Also, SATA/SAS controllers don’t understand file systems, journals, or even partition schemes. They don’t understand any of the data on a hard drive; their job is strictly to communicate with the drive and manage the stream between kernel and drive. Controllers don’t mount file systems. Mounting file systems and understanding their structures is the kernel’s job, and only the kernel and certain file system utilities should care about the existence (or not) of a file system journal.
-
Ericbowen
August 20, 2013 at 7:39 pm
https://www.dfrws.org/2008/proceedings/p76-burghardt.pdf
I have yet to find any whitepaper or other information stating that the controller does not translate instructions from the processor or OS/application. If you have references otherwise, please link them. BTW, a RAID controller won’t mount the RAID volume without an OS kernel?
Eric-ADK
Tech Manager
-
Alex Gerulaitis
August 20, 2013 at 8:09 pm
Eric says journaling adds overhead and can be used to restore (fix) corrupted files (i.e., fix file system inconsistencies). Isn’t he right about that?
Chris says journaling in itself is not a backup/restore mechanism as it doesn’t save actual file contents, only metadata. Does this conflict anything Eric said?
Chris says SAS/SATA controllers aren’t file system aware, they deal with block data. Eric says RAID controllers deal with RAID sets and are responsible for data integrity on them, which could be related to OP’s problems.
Are there really any conflicts between what you two are saying?