Creative Communities of the World Forums

The peer to peer support community for media production professionals.


  • Very strange problem with an Infortrend A16E iSCSI storage array

    Posted by Edwards Willeam on December 4, 2009 at 1:23 am

    Hi,

    > We have a very strange problem with an Infortrend A16E iSCSI storage
    > array [1]. I think it’s not an Open-iSCSI related problem, but someone
    > here may shed some light 🙂
    > This array has 4 iSCSI interfaces to distribute/balance ethernet
    > traffic. There are 16 hosts connected to this array via iSCSI, with 4
    > hosts per channel/interface.
    > *Randomly*, one of these channels resets, making the 4 servers connected
    > to the channel timeout. The other 3 channels are not affected at all.
    > Open-iSCSI logs this:
    > ping timeout of 5 secs expired, last rx 502453156, last ping 502446907,
    > now 502463156

    The initiator sends an iSCSI ping every X seconds. If we do not get a
    response within Y seconds we drop the session (drop the connection and
    relogin).
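
A minimal sketch of that timeout check, applied to the numbers in the quoted log line. This is not the initiator's actual code. It assumes the three timestamps are kernel ticks at 1000 per second and that the ping interval X is also 5 seconds; neither is stated in the thread. The function name and parameters are made up for illustration.

```python
HZ = 1000  # assumption: kernel ticks per second on the affected host


def ping_timed_out(last_rx, now, interval_secs, timeout_secs, hz=HZ):
    """Return True once interval + timeout has passed since the last receive.

    last_rx and now are tick counters, as printed in the log message.
    Counter wraparound is ignored to keep the sketch short.
    """
    deadline = last_rx + (interval_secs + timeout_secs) * hz
    return now >= deadline


# Values copied from the log line in the original post.
LAST_RX = 502453156
NOW = 502463156

# Ticks between the last received data and the moment the error was logged.
ELAPSED_TICKS = NOW - LAST_RX

if __name__ == "__main__":
    print("ticks since last rx:", ELAPSED_TICKS)
    print("timed out:", ping_timed_out(LAST_RX, NOW, 5, 5))
```

Under those assumptions the gap between "last rx" and "now" is exactly X + Y, which would mean nothing at all was received from the target during that window.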

    There was a bug in the initiator where we would spit out this timeout
    error by accident. What kernel are you using? Are you using the iSCSI
    modules in the kernel or modules from an open-iscsi.org release, and
    which release of open-iscsi.org?

    > connection4:0: iscsi: detected conn error (1011)
    > session4: iscsi: session recovery timed out after 120 secs

    I do not think it is the bug, because you would normally log right
    back in.

    The recovery timed out error means that the initiator tried to log
    back in for 120 seconds and during that time we could not
    reconnect/relogin.

    I think this makes sense when looking at the switch messages below.
    If something causes the link to go down, the iSCSI ping would
    fail/timeout.

    I am not sure if the iscsi layer dropping the session would cause the
    link to go down/up.
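
To see how often the link really drops, and on which port, the switch messages quoted further down in this post could be counted with something like the sketch below. The regular expression only covers the two message types that appear in this thread, and the function name is hypothetical. It expects raw switch output without the leading ">" quote markers.

```python
import re
from collections import Counter

# Matches the %LINK and %LINEPROTO up/down messages quoted in this thread.
LINK_EVENT = re.compile(
    r"%(?P<facility>LINK|LINEPROTO)-\d-UPDOWN: "
    r"(?:Line protocol on )?Interface (?P<iface>\S+?),\s+"
    r"changed state to (?P<state>up|down)"
)


def count_link_flaps(log_text):
    """Count physical link 'down' events (%LINK messages) per interface."""
    # Collapse whitespace so a message wrapped across two lines still matches.
    flat = " ".join(log_text.split())
    flaps = Counter()
    for match in LINK_EVENT.finditer(flat):
        if match.group("facility") == "LINK" and match.group("state") == "down":
            flaps[match.group("iface")] += 1
    return flaps


# The four messages from the original post, wrapped as they were posted.
SAMPLE = """
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5,
changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5,
changed state to up
"""

if __name__ == "__main__":
    for iface, count in count_link_flaps(SAMPLE).items():
        print(iface, count)
```

Comparing these counts and their timestamps against the initiator's "detected conn error" messages would show whether the link drop comes before or after the session drop.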

    > The switch port where it is connected shows:
    > %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5,
    > changed state to down
    > %LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down
    > %LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up
    > %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5,
    > changed state to up
    > It appears that the iSCSI channel *resets* and starts a port down+up
    > cycle. We have changed the cable and the switch, and still get the same
    > error.
    > The Infortrend array is logging nothing and the official support people
    > have no idea about this issue :-/
    > We believe that the source of the problem is a single server. When we
    > move this server to a different iSCSI channel we get the same error
    > there, and the channel where it previously was starts working as
    > expected, with no interface resets.
    > Anyone could say that something in that faulty server is making the
    > interface reset; but we’ve checked it several times and we really
    > believe that the server is configured as the other 16 we have attached
    > to the array.
    > The switch connecting the servers and the array is a Cisco Catalyst 2960G.
    > Anyone ever experienced anything similar?
    > Regards,

    IrrNI

  • Matt Geier

    December 5, 2009 at 4:15 am

    Whose iSCSI initiator are you using? I do not see that called out anywhere.

    Did the support people you talked with simply tell you they would do nothing? (I would think that, if anything, they would want to establish that THEY do not have a problem...) Or did they confirm that the hardware, networking, drivers, etc. were all working correctly?

  • Steve Modica

    December 5, 2009 at 4:27 am

    Since you’ve isolated this to a single server, have you considered running that server on a port of its own for a short while? You should also get the logs out of this server and the TCP stats. When the port goes away, if there’s enough time, you should try and get a tcpdump output. (In fact, it would be good to do that even while it’s working. It might shed some light on what’s on the wire.)
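
For the tcpdump suggestion, here is a rough sketch of a capture command that can be left running until the port drops. It only builds the command and does not run it. The interface name, target address and file name are placeholders, not values from this thread. Port 3260 is the default iSCSI target port, so adjust it if the array listens elsewhere.

```python
import shlex

ISCSI_PORT = 3260  # default iSCSI target port


def build_capture_command(interface, target_ip, out_file,
                          file_mb=100, file_count=10):
    """Build a tcpdump command that captures iSCSI traffic into a ring of files.

    -s 0 captures full packets, -C rotates the output file after roughly
    file_mb million bytes, and -W caps the number of files so the capture
    can run for a long time without filling the disk.
    """
    return [
        "tcpdump",
        "-i", interface,
        "-s", "0",
        "-C", str(file_mb),
        "-W", str(file_count),
        "-w", out_file,
        "host", target_ip, "and", "tcp", "port", str(ISCSI_PORT),
    ]


# Placeholder values: replace with the real interface and array address.
cmd = build_capture_command("eth0", "192.0.2.10", "iscsi.pcap")

if __name__ == "__main__":
    # Print a copy-and-paste friendly command line (run it as root).
    print(shlex.join(cmd))
```

Running the same capture on a healthy server gives a baseline to compare against the suspect one.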

    Older Linux kernel TCP stacks had some serious TCP ordering issues. I think those were corrected around 2.6.20.

    Another test: when this thing goes away, can you still ping?

    Steve
