Creative Communities of the World Forums

The peer-to-peer support community for media production professionals.


  • Periodic network drop and recovery

    Posted by John Heagy on October 9, 2013 at 5:46 pm

    We are testing a NAS offering and are seeing periodic network drops and recovery. The storage is supporting 16 ProRes 140 ingests pretty well, except for periodic drops to zero for half a second followed by a 40 Mb spike on recovery. This of course causes dropped frames. It happens somewhat randomly but is spread across all 16 ingests. The NAS server is an Oracle box, and the ingest clients are all 1 Gig-connected Xserves running 10.8.3.

    The Oracle box and storage system report no issues. We suspect a UNIX-OS X network tuning mismatch. The NAS vendor has limited OS X experience.

    Has anybody seen anything like this periodic drop and recovery?

    To preempt the inevitable Zelin response: This is not a video targeted product and their lack of OS X knowledge is a knock against them, but we’d like to understand the issue regardless.

    John

  • 8 Replies
  • Eric Newbauer

    October 9, 2013 at 11:41 pm

    [John Heagy] “Has anybody seen anything like this periodic drop and recovery?”

    Probably everyone who’s ever tried doing what you’ve explained. 😉

    There can be many reasons why you’re seeing such drops, but you can narrow the problem set considerably by focusing on (or temporarily eliminating, if possible) any switches or 10GbE.

    Eric Newbauer
    Studio Network Solutions (SNS)
    https://www.studionetworksolutions.com

  • Bob Zelin

    October 11, 2013 at 10:21 pm

    Have them run a disk array test with a graph (like AJA System Test) on the Oracle server. I have no idea how to do this in Unix, but I am sure there is some utility. See if the drive array is working properly; with a flaky drive, I have seen these kinds of spikes. It happens even with an all-Mac system. You need to know the total bandwidth of the drive array to make sure that it can handle all the streams you expect to play back, and you need to make sure that you have a pretty flat response, with minimal spiking, on the drive array. You need a utility to see this.
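    On the Unix side, a crude dd run can stand in for a graphing utility (a sketch only; /export/test is a placeholder path, and the 1 GB size is arbitrary):

```shell
# Rough sequential-throughput check on the server (the mount point
# /export/test is a placeholder). Writes 1 GB in 1 MB blocks, then reads
# it back; dd reports a transfer rate at the end of each run. Compare
# that against the aggregate ingest rate the clients need.
dd if=/dev/zero of=/export/test/ddtest bs=1048576 count=1024
dd if=/export/test/ddtest of=/dev/null bs=1048576
rm /export/test/ddtest
```

    This only shows average throughput, not the per-second flatness Bob describes, but a number far below the required aggregate rate rules the array out quickly.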

    Bob Zelin
    Rescue 1, Inc.
    maxavid@cfl.rr.com

  • Vadim Carter

    October 17, 2013 at 6:44 pm

    I am going to put my 2c in.

    This issue could be related to caching. When the cache on your storage device gets full, the OS will flush its contents to disk. This can cause your storage to momentarily stop responding to network I/O while it is performing heavy disk write operations. Try disabling caching on your NAS and see if it helps at all.
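    If the Oracle box runs ZFS (a later reply in this thread indicates it does), one way to test the flush theory is to watch pool I/O at one-second resolution (the pool name "tank" is a placeholder):

```shell
# Print the pool's write bandwidth (last column of `zpool iostat`) once a
# second; the first three lines are headers, so skip them. A large write
# burst every few seconds that lines up with the network stall supports
# the cache-flush theory.
zpool iostat tank 1 | awk 'NR > 3 { print $NF; fflush() }'
```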

    Another thing to take a look at is the MTU (Maximum Transmission Unit) setting. It must be the same for all devices on the network. The typical MTU value is 1500; jumbo frames use an MTU of 9000. Any mismatch will spell trouble due to packet fragmentation. You can check the MTU settings by going to System Preferences > Network (I think you have to click the “Advanced…” button). You’ll probably have to run the ‘ifconfig’ command on your Oracle box to see the MTU setting. Lastly, do not forget to check the Ethernet switch for the same.
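    A quick way to audit the MTU from the command line (the interface name en0, and the Solaris dladm command on the Oracle box, are assumptions about this setup):

```shell
# On each OS X client, read the MTU of the primary interface (en0 assumed):
networksetup -getMTU en0

# Or pull the value out of ifconfig, which works on most Unixes:
ifconfig en0 | awk '{ for (i = 1; i <= NF; i++) if ($i == "mtu") print $(i + 1) }'

# On the Oracle box, if it is Solaris 11, dladm manages link properties:
dladm show-linkprop -p mtu
```

    Every client, the server, and the switch ports should all report the same value.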

    Eric made a good suggestion in eliminating the switch, but it does not look like that is going to be possible in your case, given the number of clients.

    Lucid Technology, Inc. / 801 West Bay Dr. Suite 465 / Largo, FL 33770
    “Enterprise Data Storage for Everyone!”
    Ph.: 727-487-2430
    https://www.lucidti.com

  • John Heagy

    October 17, 2013 at 8:30 pm

    Thanks for everybody’s input.

    Regarding jumbo frames… We initially had the drop with a standard 1500 MTU, then switched everything to jumbo frames, including server, clients, and switch, and the drop went away. Ingest was then solid, but playback was still hit or miss. They made some changes to the server and solidified playback, but now the ingest drops are back.

    John

  • Chris Murphy

    October 22, 2013 at 7:24 am

    What network protocol is this, over what speed physical link? It seems really straightforward to narrow down whether it’s the sending computers that are hanging up for some reason, or the NAS that’s getting busy. In particular, it’s expected that any spare RAM the NAS has is used for caching, and once that fills up it’s going to behave this way, in which case the array’s write performance is inadequate for the streams you’re sending. So anyway, more information is needed to have any idea what’s going on.
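    The arithmetic behind that write-performance question can be sketched in the shell, using the stream count and bitrate from the original post:

```shell
# Back-of-envelope: the aggregate write rate the NAS must sustain for
# 16 ProRes streams at roughly 140 Mb/s each.
streams=16
mbit_per_stream=140
total_mbit=$((streams * mbit_per_stream))   # total in megabits/s
total_mbyte=$((total_mbit / 8))             # total in megabytes/s
echo "${total_mbit} Mb/s ~ ${total_mbyte} MB/s sustained writes"
# prints: 2240 Mb/s ~ 280 MB/s sustained writes
```

    The array (and its cache-flush behavior) has to absorb roughly that rate continuously, not just on average.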

    bonnie++ and iozone are useful utilities for narrowing down such things, but you’ve got to have an idea of what your workload needs to be in order to set up any benchmark tool to simulate it correctly. There’s no point getting data indicating problems (or success) for a workload you don’t actually run.

  • John Heagy

    November 11, 2013 at 4:59 pm

    This issue persisted in 10.8.5 but was resolved in 10.9.

  • Bruno Forcier

    November 12, 2013 at 4:38 pm

    Another question you should ask yourself is the following:

    Is this an IOPS-centric NAS, or a sequential-centric one?

    Ideally, in a post environment, your NAS should be sequential-centric…

  • John Heagy

    November 20, 2013 at 10:23 pm

    With our B4M Fork recordings ZFS reports 50-50 sequential/random. This is due to the elemental nature of the recordings.
