LACP goes down briefly when radio loses power?

sirhc
Employee
Posts: 7347
Joined: Tue Apr 08, 2014 3:48 pm
Location: Lancaster, PA
Has thanked: 1597 times
Been thanked: 1318 times

Re: LACP goes down briefly when radio loses power?

Wed Jan 18, 2017 12:09 am

Seth.png
Support is handled on the Forums not in Emails and PMs.
Before you ask a question use the Search function to see it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.

sbyrd
Experienced Member
Posts: 236
Joined: Fri Apr 10, 2015 6:16 pm
Has thanked: 16 times
Been thanked: 26 times

Re: LACP goes down briefly when radio loses power?

Wed Jan 18, 2017 5:34 am

After further troubleshooting I believe the issue is caused by a Pause Frame storm that is killing traffic on the switch. Here is why.

I upgraded my Netonix switch to a FW that has their Pause Frame Storm protection mechanism and also upgraded all the Airfibers on this switch to 4.0b2. I then caused one of the wireless Airfiber slaves to break the link by changing the DNS server entry on the Network Page. Either after the link went down or right after it came back up (sorry, I did not catch the exact timing), I momentarily lost connection to all the other Airfibers on the switch and the data flow from the tower router stalled. It also caused the LACP LAG to reset.

Shortly after, I got the following messages from the Netonix switch. Port 4 was the link that I caused to break, but it recorded excessive pause frames on port 3? The switch stats for Port 3 did not show an abnormal number of pause frames.

Code:
Jan 18 04:02:02 switch[1345]: got excessive pause frames on port 3 (18055), count = 1
Jan 18 04:02:03 switch[1345]: got excessive pause frames on port 3 (62674), count = 2
Jan 18 04:02:04 switch[1345]: got excessive pause frames on port 3 (62655), count = 3
Jan 18 04:02:05 switch[1345]: got excessive pause frames on port 3 (62564), count = 4
Jan 18 04:02:06 switch[1345]: got excessive pause frames on port 3 (62693), count = 5
Jan 18 04:02:07 switch[1345]: got excessive pause frames on port 3 (63205), count = 6
Jan 18 04:02:08 switch[1345]: got excessive pause frames on port 3 (63116), count = 7
Jan 18 04:02:09 switch[1345]: got excessive pause frames on port 3 (62703), count = 8
Jan 18 04:02:10 switch[1345]: got excessive pause frames on port 3 (62633), count = 9
Jan 18 04:02:11 switch[1345]: got excessive pause frames on port 3 (62645), count = 10
Jan 18 04:02:12 switch[1345]: Excessive flow control pause frames received on port 4 (Slipher Tower) (Slipher Tower), disabling flow control
Jan 18 04:02:13 Port: link state changed to 'down' on port 4
Jan 18 04:02:13 LACP: starting negotiation with partner 4C-5E-0C-6B-BA-14
Jan 18 04:02:13 LACP: LACP changed state to Active on port 7 (key 1)
Jan 18 04:02:13 LACP: LACP changed state to Active on port 8 (key 1)
Jan 18 04:02:13 switch[824]: LACP changed state to Active on port 7 (Uplink) (key 1)
Jan 18 04:02:13 switch[825]: LACP changed state to Active on port 8 (Uplink) (key 1)
Jan 18 04:02:15 sSMTP[831]: Sent mail for


Flow control was then disabled on Port 4 by the switch's protection mechanism. I then reverted the DNS change on the Airfiber slave, and the slave disconnected and then reconnected without any data interruptions to the LAG or the other AFs on the switch.
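To compare what the switch logged per port, the log excerpt above can be tallied with a few lines of Python. This is only a sketch: the regular expression is written against the message format shown in this thread, and it assumes the number in parentheses is the pause frame count seen in that interval, which the log itself does not state.

```python
import re
from collections import defaultdict

# Pattern matches the message format seen in this thread's log excerpt.
PAUSE_RE = re.compile(
    r"got excessive pause frames on port (?P<port>\d+) "
    r"\((?P<frames>\d+)\), count = (?P<count>\d+)"
)

def summarize_pause_log(lines):
    """Return {port: (entries, total_frames, peak_frames)} for matching lines."""
    per_port = defaultdict(list)
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            per_port[int(match["port"])].append(int(match["frames"]))
    return {
        port: (len(frames), sum(frames), max(frames))
        for port, frames in per_port.items()
    }

# A few lines copied from the log above.
SAMPLE = [
    "Jan 18 04:02:02 switch[1345]: got excessive pause frames on port 3 (18055), count = 1",
    "Jan 18 04:02:03 switch[1345]: got excessive pause frames on port 3 (62674), count = 2",
    "Jan 18 04:02:04 switch[1345]: got excessive pause frames on port 3 (62655), count = 3",
    "Jan 18 04:02:13 Port: link state changed to 'down' on port 4",
]

if __name__ == "__main__":
    for port, (entries, total, peak) in summarize_pause_log(SAMPLE).items():
        print(f"port {port}: {entries} entries, {total} frames, peak {peak}")
```

Running it over a full log export makes it easy to see whether the reported port matches the port whose link was actually broken.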

I then re-enabled Flow control on port 4 and reset the port counters on the switch. It then showed ZERO pause frames. I then changed the DNS entry on the AF slave of the Master in Port 4, and either after the Slave disconnected or right after it reconnected, the Pause frames shot up to 472000 in about 2 seconds and the port also recorded an immediate 129 TX drops.
Pause Storm.PNG

This was accompanied by these entries in the Netonix switch log. Odd that it was ports 2 and 3 and not 4. Even odder, I have nothing connected to port 2, and port 2 shows all zeros in its Port Details.
Code:
Jan 18 04:07:13 switch[1345]: got excessive pause frames on port 2 (749331), count = 1
Jan 18 04:07:13 switch[1345]: got excessive pause frames on port 3 (109600), count = 1
Jan 18 04:08:40 switch[1345]: got excessive pause frames on port 3 (61840), count = 1
Jan 18 04:08:41 switch[1345]: got excessive pause frames on port 3 (62277), count = 2
Jan 18 04:08:42 switch[1345]: got excessive pause frames on port 3 (62311), count = 3
Jan 18 04:08:43 switch[1345]: got excessive pause frames on port 3 (62271), count = 4
Jan 18 04:08:44 switch[1345]: got excessive pause frames on port 3 (62245), count = 5
Jan 18 04:08:45 switch[1345]: got excessive pause frames on port 3 (62256), count = 6
Jan 18 04:08:46 switch[1345]: got excessive pause frames on port 3 (62257), count = 7
Jan 18 04:08:47 LACP: LACP changed state to Active on port 7 (key 1)
Jan 18 04:08:47 LACP: LACP changed state to Active on port 8 (key 1)
Jan 18 04:08:47 switch[1826]: LACP changed state to Active on port 7 (Uplink) (key 1)
Jan 18 04:08:47 switch[1345]: got excessive pause frames on port 3 (36884), count = 8
Jan 18 04:08:47 switch[1827]: LACP changed state to Active on port 8 (Uplink) (key 1)
Jan 18 04:08:48 sSMTP[1833]: Sent mail for


Was this issue not fixed by UBNT, or am I misinterpreting this and it is something else?

sbyrd

Re: LACP goes down briefly when radio loses power?

Wed Jan 18, 2017 6:07 am

I ran another test on the link in port 4, and this time I watched the throughput and PPS graphs for that port on the Netonix. Again, after breaking the wireless link by changing the DNS entry in the slave, I got a data interruption for all the devices on the switch and saw a number of Pause frames sent by the AF5X in port 4 that was lower than before, but still much too high, in the span of a few seconds.
Pause Storm test2.PNG

Port 4 also recorded (while the AF wireless link to the slave was down) over 30Mbps of TX traffic and over 60Kpps all of which I expect were pause frames.
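Those two figures are consistent with each other if the traffic was made up of minimum-size frames. A pause frame is a MAC control frame padded to the 64-byte Ethernet minimum, so a quick check in Python (a sketch; whether the switch graph counts the 20 bytes of preamble and inter-frame gap is an assumption, so both cases are shown):

```python
MIN_FRAME_BYTES = 64       # minimum Ethernet frame including FCS
WIRE_OVERHEAD_BYTES = 20   # preamble + SFD (8 bytes) and inter-frame gap (12 bytes)

def pause_traffic_mbps(frames_per_second, include_wire_overhead=False):
    """Bandwidth in Mbps used by a stream of minimum-size frames."""
    size = MIN_FRAME_BYTES
    if include_wire_overhead:
        size += WIRE_OVERHEAD_BYTES
    return frames_per_second * size * 8 / 1e6

if __name__ == "__main__":
    # 60 Kpps is the figure reported for port 4 in this post.
    print(pause_traffic_mbps(60_000))                              # frame bytes only
    print(pause_traffic_mbps(60_000, include_wire_overhead=True))  # on-the-wire bytes
```

At 60 Kpps this gives about 30.7 Mbps counting frame bytes only, which lines up with the "over 30Mbps" reading and supports the guess that nearly all of it was pause frames.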
Big blue hump is while Slave was disconnected from Master.
Pause frames 2.PNG

So while I do get fewer pause frames overall on FW 4.x under normal operation, I now get excessive Pause Frames if the Slave goes offline because of a reboot, a power loss, or a setting change that breaks the wireless connection.

sirhc

Re: LACP goes down briefly when radio loses power?

Wed Jan 18, 2017 11:02 am

Yes, this was my original guess at the cause of your pain.

viewtopic.php?f=17&t=2375&start=10#p17016
sirhc wrote: Seth, please upgrade switches to v1.4.7rc4, your firmware is REALLY OLD.

I think you are possibly seeing Pause Frame storms from your AFX radios which is a known UBNT bug.

v1.4.7rc4 has protection against this.


As for the port number being off, that may be a port number reporting error on our side. I will ask Eric (our programmer), but I remember him saying something about the reported port number being off for some actions in the log.

I have resorted to either turning Flow Control OFF on ports with AF radios or making sure Pause Frame Storm Protection is ON, but then I have to turn FC back on after events that cause the Storm Protection to turn it OFF.

I use a lot of AF24 links as my primary links, so heavy rain often causes the AF24 links to modulate below the needed capacity or, in the case of severe rain, to drop. Either way, the Storm Protection often turns FC OFF during one of those events.
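The behaviour visible in the log earlier in this thread (a counter that climbs once per second and a port that has flow control disabled after the count reaches 10) can be sketched as a simple watchdog. This is not Netonix code: the class name, the per-second threshold and the reset-on-quiet-interval rule are assumptions inferred from the log lines.

```python
class PauseStormWatchdog:
    """Disable flow control after N consecutive intervals of excessive pause frames."""

    def __init__(self, threshold_per_interval=10_000, trip_after=10):
        # Both values are hypothetical; the real firmware thresholds are not published here.
        self.threshold = threshold_per_interval
        self.trip_after = trip_after
        self.count = 0
        self.flow_control_enabled = True

    def observe(self, pause_frames):
        """Feed one interval's pause frame count. Returns True when protection trips."""
        if not self.flow_control_enabled:
            return False
        if pause_frames > self.threshold:
            self.count += 1
        else:
            self.count = 0  # a quiet interval resets the streak
        if self.count >= self.trip_after:
            self.flow_control_enabled = False
            return True
        return False

if __name__ == "__main__":
    watchdog = PauseStormWatchdog()
    for second in range(1, 12):
        if watchdog.observe(62_000):
            print(f"flow control disabled after {second} s")
```

Note that nothing in this sketch turns flow control back on, which matches the manual step described above.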

What happens is this: the AF sends a flood of Pause Frames to, say, port 3. Packets keep coming into the switch on ports 7 and 8, and a lot of them are destined for port 3, but port 3 is locked down by the Pause Frames, so the switch starts buffering those packets. Soon the switch buffers are full, so the switch starts sending Tx Pause Frames out ports 7 and 8, telling your router to stop sending more packets because the buffers are full and port 3 is still being told to pause. But ports 7 and 8 also feed ports 4 and 5, and they are how you access the switch from your office or house, so it appears as if the switch is locking up, but it is not.
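The chain of events can be illustrated with a toy model. The sketch below is deliberately simplified and is not how any particular switch accounts for buffers: it assumes one packet per tick arrives from the uplink for each egress port, each egress port has its own queue limit, and with uplink flow control enabled the switch pauses the uplink as soon as any queue is full.

```python
def simulate(ticks, queue_limit=50, paused_port=3, egress_ports=(3, 4, 5),
             uplink_flow_control=True):
    """Toy model of a paused egress port stalling traffic to the other ports."""
    queue = {p: 0 for p in egress_ports}
    delivered = {p: 0 for p in egress_ports}
    dropped = {p: 0 for p in egress_ports}
    for _ in range(ticks):
        congested = any(queue[p] >= queue_limit for p in egress_ports)
        if congested and uplink_flow_control:
            # Switch sends pause frames to the router: nothing arrives for ANY port.
            pass
        else:
            for p in egress_ports:
                if queue[p] < queue_limit:
                    queue[p] += 1
                else:
                    dropped[p] += 1  # tail drop for the full queue only
        for p in egress_ports:
            # The paused port never drains; the others forward one packet per tick.
            if p != paused_port and queue[p] > 0:
                queue[p] -= 1
                delivered[p] += 1
    return delivered, dropped

if __name__ == "__main__":
    print(simulate(1000, uplink_flow_control=True))
    print(simulate(1000, uplink_flow_control=False))
```

With flow control on the uplink, ports 4 and 5 stop receiving traffic once the port 3 queue fills, even though they are healthy. With it off, only port 3's packets are dropped and the other ports keep running.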

You can either turn FC OFF on ports 3,4, and 5 or on ports 7 and 8 or OFF on all or just make sure Storm Protection is ON but you will constantly have to turn FC back ON after events like this.


UBNT needs to fix the AF firmware so that when the link drops, or modulates down to a capacity below the current load, it simply drops the packets and does not send unsolicited Pause Frame storms, as they cause havoc elsewhere.

Flow control is great for handling SMALL momentary bursts that exceed the wireless link capacity, preventing packet drops that can affect traffic adversely. But the radios need to recognize that when they send too many Pause Frames for too long, they cause the switch that receives them to fill up its buffers and send Tx Pause Frames itself toward the source of the packets, which in your case is your router. And since ports 7 and 8 belong to an LACP LAG, pausing one port for too long would cause the LACP to break as well, because the LACP control packets (LACPDUs) are also prevented from getting back and forth between the switch and the router.
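How long is "too long" for the LAG depends on the LACP rate. The standard timers are 1 s (fast) or 30 s (slow) between LACPDUs, and a partner is considered expired after three missed intervals. A rough sketch, which ignores the finer points of the LACP state machine; the thread does not say which rate this LAG uses:

```python
# Timer values defined by the LACP standard (IEEE 802.1AX, formerly 802.3ad).
FAST_PERIODIC_S = 1
SLOW_PERIODIC_S = 30
SHORT_TIMEOUT_S = 3 * FAST_PERIODIC_S   # 3 s
LONG_TIMEOUT_S = 3 * SLOW_PERIODIC_S    # 90 s

def lacp_partner_times_out(blocked_seconds, fast_rate=True):
    """True if LACPDUs blocked for this long are certain to expire the partner."""
    timeout = SHORT_TIMEOUT_S if fast_rate else LONG_TIMEOUT_S
    return blocked_seconds > timeout

if __name__ == "__main__":
    for seconds in (2, 7, 11):
        print(seconds, lacp_partner_times_out(seconds, fast_rate=True),
              lacp_partner_times_out(seconds, fast_rate=False))
```

With the fast rate, a pause lasting only a few seconds is enough to reset the LAG, while the slow rate would ride through a much longer stall.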

sbyrd

Re: LACP goes down briefly when radio loses power?

Wed Jan 18, 2017 11:19 am

- X radios - Flow control pause frame flood issue fixed

Well, supposedly UBNT fixed the Pause Frame issue for AFX radios in 4.x, and I think they might have with regard to modulation drops, as I did see a decrease in pause frames when using the 4.x FW. However, I wonder if UBNT did not account for a sudden complete link loss in whatever they did to "fix" the pause frame issue. The reason is that on FW 3.4 and 3.2 I never had pause frame issues, as my mod rates were always stable, and I did not get one on a reboot or loss of the slave link either. I wonder if their fix did fix the mod rate pause frame issue, but introduced a new one for when the link is suddenly broken?

I have pinged Chuck at UBNT in a forum post that contains most of the data I have posted here and hope they have something to say on this as it is a very big problem.

sbyrd

Re: LACP goes down briefly when radio loses power?

Sat Jan 21, 2017 7:38 pm

Here is what UBNT has said about this issue so far. See the posts by UBNT-Chuck below.

"Hi,

Let me start by saying that the use of Ethernet pause frames is problematic, period. I say this regardless of whose equipment you use. It is much better to allow overflow packets to be dropped and allow the layer 3 protocols to do their job of regulating capacity. The best advice I can offer is never to use it (whether you are using airFiber or not).

Having said that...

The current implementation of flow control in airFiber causes a full airFiber buffer to send Pause frames until the buffer condition clears. This design choice allows you to design a network where data overflows do not occur at the airFiber end of the link.

Ultimately, exceeding the link capacity must eventually lead to packet drops somewhere. You need to decide where this is. If you enable flow control across your entire layer 2 network, you are counting on this happening at the router. IF you choose this path (and I do not recommend this), you need to know that Ethernet flow control can and will disrupt your entire Layer 2 network. This is a fact of Ethernet pause frame life.

Having an option to disable pause frames after some period of time (modulating whether the feature sends pause frames or not based on how long we have been in an overflow situation) has merit and I will discuss this internally. Changes in this area, however, have a long lead time.

Chuck"


And

Hi,
As soon as one end of the link realizes the other side is gone, we do stop pause frames. Right now, that is determined by the absence of keep-alive messages that come every 30 seconds or so (this is basically true, but is a simplification of what is really going on behind the scenes)...I will look into utilizing the capacity numbers for a quicker shutdown.

Because the pause frame logic depends upon a very low latency response, it is primarily handled in hardware. The hardware's behaviour can be modified via software loads, but there is quite a bit of coordination required to do this and the implementation is not as easy as you might think. We are looking at freeing resources to address this.

In the short term, I would recommend turning flow control off if our current behaviour is causing issues.

Chuck


I and others still think this is a major issue and I don't know why UBNT is having such issues figuring it out. For example my SAF links (I know they are much more expensive than AF) do not have this issue at all.

I have a SAF link and a backup AF5X link to the same tower. SAF link is running at a Capacity of 366Mbps and AF5X link at 212Mbps TX Capacity. Actual traffic over the SAF link is around 100-150Mbps.

I reset the router interface counters and here is how many Pause Frames the SAF generated in 2 minutes.

RX Pause Frames: 380

I then forced the router to use the AF5X link instead and I reset the router interface counters and here is how many Pause Frames the Airfiber generated in 2 minutes with the same amount of average traffic.

RX Pause Frames: 21893
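Put as rates, using only the two counter readings and the 2-minute window quoted in this post (a small sketch; the helper name is made up for illustration):

```python
WINDOW_SECONDS = 120  # both counters were read after 2 minutes

def pause_rate_per_second(frames, window_seconds=WINDOW_SECONDS):
    """Average pause frames per second over the measurement window."""
    return frames / window_seconds

SAF_FRAMES = 380
AF5X_FRAMES = 21893

if __name__ == "__main__":
    saf_rate = pause_rate_per_second(SAF_FRAMES)
    af_rate = pause_rate_per_second(AF5X_FRAMES)
    print(f"SAF:  {saf_rate:.1f} pause frames/s")
    print(f"AF5X: {af_rate:.1f} pause frames/s")
    print(f"ratio: {AF5X_FRAMES / SAF_FRAMES:.1f}x")
```

That works out to roughly 3 pause frames per second on the SAF against roughly 182 per second on the AF5X, a factor of about 58.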


So even under normal operation, where the mod rate is stable and the link is not maxed out, the AF sends many more pause frames than the SAF. The comparison between my SAF link and the AF may not be an entirely fair one, as I have no idea how big the port buffers are on each device, but I do know that the SAF does not flood the Ethernet with pause frames if the other side of the link goes offline.

sbyrd

Re: LACP goes down briefly when radio loses power?

Wed Jan 25, 2017 7:48 pm

As I am left with the choice of either running without flow control or suffering Pause frame storms that knock out communications to the switch/router, I need some advice.

Which of these, if any, would benefit my setup the most?

1. Just leave flow control off on the Airfiber facing interfaces?
2. #1 plus turning on Strict QOS on the switch configured for the speed of the Airfiber Capacity on each Airfiber port?
3. #1 plus setup a rate limit queue on the tower router for each Airfiber vlan interface set to Capacity of each airfiber link?
4. Just leave flow control on every Airfiber interface?
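Options 2 and 3 both amount to limiting the traffic offered to each radio to its capacity, so that any excess is dropped at the switch or router rather than triggering pause frames at the radio. The usual mechanism behind such a rate limit is a token bucket. The sketch below is a generic illustration in Python, not Netonix or router configuration; the burst size is an arbitrary assumption and the 212 Mbps figure is the AF5X TX capacity mentioned earlier in the thread.

```python
class TokenBucket:
    """Generic token-bucket rate limiter (policer): drops what exceeds the rate."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last_time = 0.0

    def allow(self, packet_bytes, now):
        """Return True if the packet may be sent at time `now` (seconds)."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill for the elapsed time, never beyond the configured burst.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False

if __name__ == "__main__":
    limiter = TokenBucket(rate_bps=212_000_000, burst_bytes=3000)
    print(limiter.allow(1500, now=0.0))    # fits in the initial burst
    print(limiter.allow(1500, now=0.0))    # still fits
    print(limiter.allow(1500, now=0.0))    # bucket empty, packet dropped
    print(limiter.allow(1500, now=0.001))  # refilled after 1 ms
```

A fixed limit like this only helps while the radio's real capacity stays at or above the configured rate; if the link modulates down, the radio can still be overrun.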

sirhc

Re: LACP goes down briefly when radio loses power?

Wed Jan 25, 2017 8:02 pm

Let me think on this, Seth, but in reality UBNT needs to adapt/fix Flow Control on their radios to deal with the changing environment and capacity of wireless, not just send as many unsolicited Pause Frames as they can any time their port buffer is full. This is simply not acceptable.

I have played with Flow Control in my topology and found when and when not to use it.

sbyrd

Re: LACP goes down briefly when radio loses power?

Wed Jan 25, 2017 8:05 pm

sirhc wrote:Let me think on this Seth, but in reality UBNT needs to adapt/fix Flow Control on their radios to deal with the changing environments/capacity of wireless not just send as many unsolicited Pause Frames as they can any time their port buffer is full. This is simply not acceptable.

I have played with Flow Control in my topology and found when and when not to use it.


Agreed. Unfortunately, as I posted, Chuck at UBNT seems to advise against Flow Control in all instances, so I do not know how much of a priority they will put on fixing it. In the meantime, I need to know if there is anything I can do besides leaving it on and suffering some consequences or leaving it off and possibly suffering different ones.

Which is the lesser of the two evils? Maybe there is no clear answer and it is not that simple?

sirhc

Re: LACP goes down briefly when radio loses power?

Wed Jan 25, 2017 8:10 pm

If you leave it on and have Pause Frame Storm Protection Enabled it will protect you for the most part.

Funny thing is, the UBNT AF team used to tell people to use Flow Control?????

https://community.ubnt.com/t5/airFiber/ ... d-p/559729

Re: Ethernet Flow Control - Recommended Setting
‎09-20-2013 - 03:07 PM

Hi,

For most configurations, enabling flow control is the right thing to do. It avoids packet overload, minimizes dropped packets, and airFiber's implementation shouldn't really interfere with TCP as our buffers are not very deep (flow control allows us to avoid buffer bloat). Flow control really helps to even out traffic - avoiding overload conditions that happen merely as a result of very short high intensity bursts with little interpacket spacing.

Where flow control can hurt is where you are using <typically many> QoS levels and intend for lower priority traffic to be dumped. In this case, lower priority traffic can pause a link thus holding up higher priority traffic. In these specific scenarios, you are typically carrying more traffic than your end points can handle (i.e. this is not a short burst, but, a more sustained load). For those scenarios, you are better off without flow control so that the lower priority traffic can be dumped.

This is a really high level overview of pause frame issues in Ethernet. I suspect that most airFiber users will be served well by the use of Ethernet flow control.

Chuck

chuck.png