FW 1.3.2 static LAG weird behavior

Tue Sep 15, 2015 7:47 am

Today, we changed a few device cables on two 24-port Netonix switches coupled with a static LAG (ports 21+22, key 1 on both sides). During these changes, access to switch 2 and beyond (via switch 1 and the LAG) was suddenly lost. Trying to fix that, we found that the LAG sometimes works as expected and sometimes does not. I guess you like bug reports like that. So I tried to find out more while my guys were still at that location:

1) When it is not working, the LAG seems to behave as if it were two separate links and, because STP is disabled in our setup, we got a nice loop/storm with about 8 Mbps of multicast traffic (probably injected by OSPF multicasts). In that state, management access to switch 2 and all devices connected to it is lost if traffic has to pass the LAG. Unplugging one of the links breaks the loop/storm, and re-plugging it started the loop again - at first, but read on ...

2) Then I disabled one of the ports of the LAG on switch 2 and (with both cables plugged in) the LAG started to behave as it should - i.e. no loop/storm. However, on switch 2, I lost access to all devices with an even last digit in their IP address, including switch 2, which happens to have an even IP address too. So it seems switch 1 did not notice it had lost the link on one of the LAG ports and went on balancing traffic.

3) We then tried unplugging one and both of the LAG's cables, disabling and re-enabling LAG ports, and finally got it to work with both LAG cables connected, one LAG port disabled on each of switch 1 and switch 2, and the other leg passing all traffic. While trying, we weren't able to reproduce the loop/storm we had seen initially, but I kept one of the LAG's ports disabled in case it comes up again when something else is changed.

All I can say is that the resulting (static) LAG behavior seems to depend on the sequence of changes in configuration and cabling. I don't think a screenshot would be helpful but you can get the configuration files of course.

Tue Sep 15, 2015 9:31 am

"Static" LAGs are just that: static. There is no fault protection with "Static" LAGs like there is with LACP LAGs, which is why the LACP protocol was developed: to deal with a port or link fault within the LAG group. All LACP does is sit on top of "Static" LAGs and change their grouping when a port/link within the group drops out. LACP requires a program/daemon running to constantly monitor the LAG group and reconfigure it when a port/link drops out.

LACP LAGs can deal with cables being unplugged and plugged in, whereas a "Static" LAG cannot, as there is no protocol monitoring it and changing its grouping as needed. So when a port/link drops out, all the traffic on that LAG port/link stops and weird stuff starts to happen as that traffic is "lost".
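The difference described above can be modelled in a few lines of Python. This is only a sketch of the behavior as stated in this post, not of the real LACP protocol (which exchanges LACPDUs between both ends) and not of the switch firmware; the `Lag` class and its method names are hypothetical.

```python
class Lag:
    """Toy model of a LAG group as described in this post (hypothetical)."""

    def __init__(self, ports, dynamic):
        self.configured = list(ports)            # ports configured into the group
        self.dynamic = dynamic                   # True models LACP, False a static LAG
        self.link_up = {p: True for p in ports}  # current link state per port

    def set_link(self, port, up):
        """Record a link up/down event on a member port."""
        self.link_up[port] = up

    def tx_members(self):
        """Ports the sender will still spread traffic across."""
        if self.dynamic:
            # LACP-like: the group is reconfigured when a member drops out.
            return [p for p in self.configured if self.link_up[p]]
        # Static, per the description above: the grouping never changes.
        return list(self.configured)


static_lag = Lag([21, 22], dynamic=False)
lacp_lag = Lag([21, 22], dynamic=True)
for lag in (static_lag, lacp_lag):
    lag.set_link(22, False)

print("static:", static_lag.tx_members())
print("lacp:  ", lacp_lag.tx_members())
```

In this model the static group keeps sending on port 22 after it has gone down, which is the "lost" traffic mentioned above.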

All of my "Static" LAGs at my towers require all ports to be stable or they mess up. Our switches act the same way with "Static" LAGs as other equipment I tested, such as Cisco.

I run "Static" LAGs at all my towers because I have Cisco 2951 routers, which will not do LACP, but I do run LACP LAGs between (2) WS-24-400A in my office and they deal with what you're describing.

The reason you lose communications with some devices is that the switch and whatever it is connected to divide the streams up between the ports.

Remember a LAG does not create a larger single pipe but rather divides the traffic up between the ports that are in the LAG group based on the criteria defined on the 4 check box options at the top of the LAG Tab.

Aggregation Mode
[ ] Destination MAC
[ ] Source MAC
[ ] Source/Destination IP
[ ] Source/Destination Port
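To make the "divides the traffic up" point concrete, here is a minimal Python sketch. The hash used (XOR of source and destination IP, modulo the number of member ports) is purely an assumption for illustration; the actual hashing done by the switch is not stated in this thread. The management address `10.0.0.1` and the helper names are hypothetical as well.

```python
import ipaddress
from typing import Iterable, Sequence


def pick_member(src_ip: str, dst_ip: str, members: Sequence[int]) -> int:
    """Pick the LAG member port for a flow using a hypothetical IP hash."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return members[(src ^ dst) % len(members)]


def reachable(src_ip: str, dst_ip: str, members: Sequence[int],
              dead: Iterable[int]) -> bool:
    """A flow is lost if the sender still hashes it onto a dead port."""
    return pick_member(src_ip, dst_ip, members) not in set(dead)


LAG_PORTS = [21, 22]
MGMT = "10.0.0.1"  # hypothetical management host behind switch 1

for last_octet in range(10, 16):
    dst = f"10.0.0.{last_octet}"
    state = "ok" if reachable(MGMT, dst, LAG_PORTS, dead={22}) else "lost"
    print(dst, state)
```

With two member ports and this assumed hash, every destination whose address has the opposite parity of the source lands on port 22. If that port is dead but still used for balancing, those destinations become unreachable while the others keep working, which resembles the "even IP addresses lost" symptom reported earlier in the thread.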

Can you switch to LACP? A "Static" LAG assumes there will be no change to the LAG group and does not know how to deal with a port dropping from the LAG group.

To learn more about how Static and LACP LAGs act you can read the Wikipedia pages which will describe exactly what I am laying out here.

Tue Sep 15, 2015 10:43 am

Okay, that explains the state we have seen in (2) when one port was disabled and only half of the traffic was passing.

It does not explain the loop/storm in state (1) when both cables were connected and all ports were up, and it does not explain state (3) either, the one we are in right now, unless we assume that state (3) is really state (1) i.e. the LAG has somehow forgotten (again) that it is a LAG and either starts sending packets in a loop when both legs are active or passes all traffic on if one of the legs is unplugged or disabled.

I should add that we use LAGs very often. Actually, on most routed stations we have two "core" switches and two "PoE" switches, connected in a loop topology (with RSTP of course), and the connections between the switches are LAGs of 2 to 6 cables. Now in this case, we have only two Netonixes and RSTP is turned off. Turning RSTP on would likely cause the storm to stop if the LAG again behaves as if it were not a LAG, but that is just hiding the problem.

I would really like not to switch to LACP because (a) I know I'm not doing anything wrong, (b) because a dynamic factor would add complexity to what is already badly reproducible, and (c) because we mostly don't use LACP for one simple reason: Until all switches learn how to send emails in case of failures, LACP would hide a problem, whereas static LAG failures tend to be more visible.

I will have to do a test setup for our standard 4 switch ring topology with LAGs, though, and do much more intensive testing before we can deploy more Netonix switches (except in simple single-switch scenarios). So if you want, wait until I provide more evidence.

Tue Sep 15, 2015 11:16 am

To avoid a loop storm you must also have RSTP enabled on the switch and on those ports and even then it "can" act weird as it was simply not meant to deal with ports being changed thus the name "Static".

But I want to stress that you should not be unplugging and plugging in ports that are in a Static LAG, it is not meant to deal with that behavior.

tma wrote: I should add that we use LAGs very often. Actually, on most routed stations we have two "core" switches and two "PoE" switches, connected in a loop topology (with RSTP of course), and the connections between the switches are LAGs of 2 to 6 cables. Now in this case, we have only two Netonixes and RSTP is turned off. Turning RSTP on would likely cause the storm to stop if the LAG again behaves as if it were not a LAG, but that is just hiding the problem.

You NEED to have RSTP enabled on the switch and on those ports. That is NOT hiding the problem; that is what RSTP is designed to do.

DO NOT COUNT ON LOOP PROTECTION FOR THIS, WHICH IS NOT WHAT LOOP PROTECTION WAS DESIGNED FOR. I HAVE TO CHECK, BUT I THINK WE AUTOMATICALLY DISABLE LOOP PROTECTION ON PORTS THAT ARE IN A LAG.

Make sure you use RSTP and NOT STP as STP takes too long to go to Forwarding state.
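For a rough sense of "too long": with the default timers of classic STP (IEEE 802.1D: forward delay 15 s, max age 20 s), a port spends one forward delay in Listening and one in Learning before it forwards. The short Python sketch below just does that arithmetic; the function name is hypothetical, and the timers actually configured on a given switch may differ from these defaults.

```python
FORWARD_DELAY_S = 15  # classic STP default forward delay
MAX_AGE_S = 20        # classic STP default max age


def stp_time_to_forwarding(forward_delay=FORWARD_DELAY_S,
                           max_age=MAX_AGE_S,
                           indirect_failure=False):
    """Seconds until a classic STP port forwards (Listening + Learning).

    For an indirect failure, the stale root information must first
    age out, which adds up to max_age on top.
    """
    seconds = 2 * forward_delay
    if indirect_failure:
        seconds += max_age
    return seconds


print(stp_time_to_forwarding())                       # 30
print(stp_time_to_forwarding(indirect_failure=True))  # 50
```

RSTP replaces these timer-driven transitions on point-to-point links with a proposal/agreement handshake, so it typically converges much faster than either figure.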

I was one of the guys LABing and testing LAGs. In fact I was impressed that LACP LAGs between our switches responded much quicker than any other switches I tested which also included EdgeMAX, HP, and Cisco.

Try a LAB with 2 of our switches with a Static LAG and it does not do that if you have RSTP enabled on those ports. And LAB LACP between 2 of our switches and between 2 other manufacturers!!!

I STRONGLY disagree with your reason for not using LACP. LACP is more resilient and provides fault tolerance, whereas a Static LAG will just fail if a port drops out but LACP can recover.

But even using LACP you NEED to have RSTP enabled on those ports to deal with the individual ports before the switch establishes the LAG and prevents loops. Up until the LAG is established they are still ports and loops can occur.

I promise you we are handling Static and LACP correctly.

However I will check with Eric about adding an SMTP alert on LACP port failures which is a GREAT idea!!! :thumbsup:

Tue Sep 15, 2015 1:23 pm

sirhc wrote:You NEED to have RSTP enabled on the switch and on those ports, that is NOT hiding the problem that is what RSTP is designed to do. [...]

But even using LACP you NEED to have RSTP enabled on those ports to deal with the individual ports before the switch establishes the LAG and prevents loops. Up until the LAG is established they are still ports and loops can occur.

This is the first time I have heard someone say that (R)STP is needed to prevent a loop in a LAG. Also, with a static LAG, there should be no "startup time" needed to detect two ports being in a LAG. But even if what you say is true, the storm should have stopped once the LAG had been established - if that was required for a static LAG. But not so.

Please read my original posting again. I said the guys were disconnecting/re-connecting OTHER ports when suddenly the LAG (that was okay before) started looping multicast packets at 8 Mbps. There was no configuration change causing that because I was alarmed when it happened and we lost management to the second switch. I then told them to open one of the LAG legs to stop the storm. Reconnecting it brought the storm back. But when I disabled one of the LAG ports (keeping the LAG configured as it was), the storm vanished and did not come back even with all legs of the LAG enabled again. Let's focus on that.

Maybe the key is that it was multicasts. I saw that on the port details pop-up. Unicast and broadcast packet counters seemed to increase at a slow/normal rate only, but multicasts were counting up rapidly. (Btw, flow-control Rx and Tx counters were also counting up rapidly.) So I don't think it was one of those really bad storms that just tears down everything, or I wouldn't have been able to work with switch 1. But it was bad enough to make me lose access to switch 2 - or loop protection prevented that somehow.

Unfortunately, VLANs may also be part of the picture. We got 3 tagged VLANs on that LAG, no untagged traffic (and no "native" VLAN). Are you sure your LAG testing included multicast packets and VLANs, so you can promise there is no bug? 1.3.2 had a problem with tagged management traffic that was not in earlier versions ...

Anyway, I will do a lab setup with two Netonix switches, a static LAG with tagged VLANs between them, no (R)STP, and make sure to inject multicast packets, because that causes no problems on all other switches I've been working with so far. I will then report back. From that, I can add RSTP to see if the problem goes away.

Tue Sep 15, 2015 3:32 pm

Looking forward to hearing back from you, but yes, please LAB with RSTP on as well, which is how I always configure my switches with LAGs.

Wed Sep 30, 2015 9:57 am

I just started my lab setup to make this reproducible ... like this:

Unpack two WS-12-250A switches (so both were at their factory defaults).
Connect myself to port 5 of switch 1.
Switch 1 - upgrade to firmware 1.3.2
Switch 1 - set new static IP

connect myself to port 5 of switch 2 now (disconnecting from switch 1)
Switch 2 - set new static IP
connect port 9 of switch 1 to port 9 of switch 2 (no LAG, just one cable!)
connect myself to port 5 of switch 1 now (disconnecting from switch 2)
(I reach switch 2 via switch 1 now)
Switch 2 - upgrade to firmware 1.3.2

The reason to set the IP first and then do the upgrade on switch 2 was simply because I wanted to see how that would work out in the field.

Result: The firmware upgrade was okay on switch 2 too. However, when switch 2 came up, it started ping-ponging packets on the connection between port 9 of switch 1 and 2 and I was not able to ping switch 2 anymore. My management connection to switch 1 (port 5) was still okay though and I could see pseudo-traffic on port 9 of switch 1 and 2 with 8 Mbps, 15 Kpps, also spilling out onto my management connection on port 5 of switch 1. This is exactly what I had seen on the LAG between two switches, but this time without any LAG configured.
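The two reported figures can be combined into an average frame size, which hints at what kind of traffic was looping. This is a back-of-the-envelope sketch in Python; it assumes "Mbps" means 10^6 bits per second and "Kpps" means 10^3 packets per second, and it is unknown whether the switch counters include preamble or inter-frame gap. The function name is hypothetical.

```python
def avg_frame_bytes(bits_per_second, packets_per_second):
    """Average bytes per packet implied by a bit rate and a packet rate."""
    return bits_per_second / 8 / packets_per_second


size = avg_frame_bytes(8_000_000, 15_000)
print(round(size, 1))  # 66.7
```

Roughly 67 bytes per packet is close to the 64-byte Ethernet minimum frame size, so under these assumptions the pseudo-traffic consisted of small frames rather than bulk data.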

To be sure that this pseudo-traffic wasn't somehow introduced via my management connection on port 5, I disconnected port 5. The ping-ponging on the interconnection continued (as seen by rapid flickering of the LED).

Next, I wanted to see the effect via management on switch 2, and so I connected myself to port 5 on switch 2. But when I did that, the ping-ponging stopped immediately and I'm now able to ping both switches again.

Obviously, I was wrong when I suspected that this effect was caused by a (static) LAG. OTOH, it doesn't take a firmware upgrade process to start that behavior, because the switches where the problem was discovered in the first place were already upgraded to 1.3.2 before roll-out. I will now try to find out what is needed. Stay tuned.

Wed Sep 30, 2015 10:25 am

Hey Thomas,

Go watch this video but the part related to LAGs starts at 1 hour and 4 minutes into the video
https://www.youtube.com/watch?v=8JvBEAD4MFM
And 1 hour 12.5 minutes into the video - JUST FOR YOU THOMAS!

With our switches if you are using LAGs either Static or LACP you need to enable RSTP on those ports. If you do not want to use RSTP on your other ports then un-check all ports except the ports that have LAGs on them.

If you do not enable RSTP on the LAG ports odd things will happen such as loops.

Also, the current firmware is v1.3.3; please use that version when testing/LABing. Even if there were no reported fixes pertaining to what you are doing, Eric will insist you are on the current version before he investigates anything.

When writing firmware, you often learn that a change in some obscure area over here, which should not affect things over there, does so anyway. So Eric will always ask you to be up to date on your firmware if you are asking him to look into something.

Another Video I did is here
https://www.youtube.com/watch?v=cMv7JfG9cjI

Wed Sep 30, 2015 11:34 am

Dear Chris, I really appreciate you do parts of your videos just for me, but I would appreciate even more if you read what I'm writing. I was saying that this time I was able to reproduce the problem in the lab WITHOUT ANY LAG involved. That, connecting two switches with one cable, should definitely not require STP.

I'm trying to make this reproducible on 1.3.2 because that is the version that is on the switches that show the problem. Once I've got a recipe to trigger the ping-ponging at will, I'll upgrade to 1.3.3 to see whether it has been fixed in the meantime (w/o the release notes mentioning anything that sounds similar).

Currently though, I'm nowhere near that recipe. It appears as if this somehow depends on the sequence of events when one of the switches comes up from a reboot. Maybe it also takes an ARP broadcast or OSPF multicast appearing at the right moment to get it going. If that's the case, the recipe will have to include something like "you need to try often enough". But I have seen this happen 3 times now.

Wed Sep 30, 2015 11:52 am

tma wrote:Dear Chris, I really appreciate you do parts of your videos just for me but I would appreciate even more if you read what I'm writing. I was saying that this time I was able to reproduce the problem in the lab WITHOUT ANY LAG involved. That, connecting two switches with one cable, should definitely not require STP.

I'm trying to make this reproducible on 1.3.2 because that is the version that is on the switches that show the problem. Once I've got a recipe to trigger the ping-ponging at will, I'll upgrade to 1.3.3 to see whether it has been fixed in the meantime (w/o the release notes mentioning anything that sounds similar).

Currently though, I'm nowhere near that recipe. It appears as if this somehow depends on the sequence of events when one of the switches comes up from a reboot. Maybe it also takes an ARP broadcast or OSPF multicast appearing at the right moment to get it going. If that's the case, the recipe will have to include something like "you need to try often enough". But I have seen this happen 3 times now.

Sorry, I have degraded to skimming things with so many emails and posts. My apologies.

Let me know if you need anything Thomas, you know my cell!
