v1.5.22 Bug Reports and Comments

Wed Dec 11, 2024 6:11 pm

@RTGLW How often are you pulling data from the switch via SNMP?

1 minute is the shortest polling interval we recommend; a slightly longer interval would be better.

The little CPU that handles things like SNMP can get overloaded, or it can slow down more important services it runs for the switch core, like LACP, RSTP, and all the other services.

Wed Dec 11, 2024 7:39 pm

We have updated several of our WS units to 1.5.22 and no issues to report with how they are operating.

One issue we have seen with Netonix Manager: once a switch is updated to 1.5.22, we are unable to access its web UI from Netonix Manager. We select the switch and click the globe icon, and we get a "403 Forbidden" message instead of the UI screen.
We were at version 1.5.16 prior to the update to 1.5.22, and the switches we still have at that version can still be accessed from Netonix Manager.
The status of switches at 1.5.22 still updates in Netonix Manager; only accessing the UI appears broken.
This issue is minor, as we can go directly to the switch's IP address and log in, but if it is an easy fix it would be nice to have the convenient button again.

Wed Dec 11, 2024 9:48 pm

sirhc wrote: @RTGLW How often are you pulling data from the switch via SNMP?

1 minute is the shortest polling interval we recommend; a slightly longer interval would be better.

The little CPU that handles things like SNMP can get overloaded, or it can slow down more important services it runs for the switch core, like LACP, RSTP, and all the other services.

Similar to Dawizman, we poll switches twice every 60s on average, as we poll from primary and redundant Prometheus nodes every 60s individually. Granted, this is twice as frequent as you've recommended, but I can't say we've run into any issues with this setup previously. That said, we run very light in terms of services enabled on our switches, e.g. no LACP/LAG, QoS, or discovery tab, and otherwise only HTTPS, SSH, Syslog, and NTP enabled. So other configurations' mileage may vary...

Additionally, when I performed a WALK on the switch from my MIB browser, it allowed our monitoring node to collect SNMP data from the point in time I ran the WALK, as if I could manually prompt SNMP collection. An example using board temps hopefully visualizes the above better. This is a lab switch, so no traffic is going across it, but it's configured the same as our production switches.

As mentioned, disabling and re-enabling the SNMP server fixed our issue, as did a reboot. So far I've been unable to replicate the issue once resolved by either of those methods. I can PM syslogs from the above example switch's upgrade if needed.

Wed Dec 11, 2024 10:58 pm

RTGLW wrote:
sirhc wrote: @RTGLW How often are you pulling data from the switch via SNMP?

1 minute is the shortest polling interval we recommend; a slightly longer interval would be better.

The little CPU that handles things like SNMP can get overloaded, or it can slow down more important services it runs for the switch core, like LACP, RSTP, and all the other services.

Similar to Dawizman, we poll switches twice every 60s on average, as we poll from primary and redundant Prometheus nodes every 60s individually. Granted, this is twice as frequent as you've recommended, but I can't say we've run into any issues with this setup previously. That said, we run very light in terms of services enabled on our switches, e.g. no LACP/LAG, QoS, or discovery tab, and otherwise only HTTPS, SSH, Syslog, and NTP enabled. So other configurations' mileage may vary...

Additionally, when I performed a WALK on the switch from my MIB browser, it allowed our monitoring node to collect SNMP data from the point in time I ran the WALK, as if I could manually prompt SNMP collection. An example using board temps hopefully visualizes the above better. This is a lab switch, so no traffic is going across it, but it's configured the same as our production switches.

As mentioned, disabling and re-enabling the SNMP server fixed our issue, as did a reboot. So far I've been unable to replicate the issue once resolved by either of those methods. I can PM syslogs from the above example switch's upgrade if needed.

You know, we never tested or even thought about the switches being polled by two SNMP-like servers. I wonder what happens if they both hit at the same time?

And yes, this is hitting it pretty often, more than we recommend, obviously. I'm worried about the small CPU in there, though it could be fine and the cause could be something else.

As a test, could you query it from only one node, starting at 2 minutes and then decreasing to 1 minute, just for shits and giggles?

Not saying we won't investigate and/or come up with a solution, but this would help.
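For anyone running the test above, here is a rough Python sketch of the arithmetic involved: how many polls per minute a switch sees for a given interval and poller count, and how two pollers could be spread evenly across one interval so they do not hit at the same moment. The function names and the poller labels are made up for illustration, and how an offset is actually applied depends on the polling software; this only computes the numbers.

```python
def polls_per_minute(interval_s: float, poller_count: int) -> float:
    """Total SNMP polls per minute that a single switch receives."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return poller_count * 60.0 / interval_s


def staggered_offsets(interval_s: float, pollers: list[str]) -> dict[str, float]:
    """Spread pollers evenly across one interval (start offsets in seconds)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if not pollers:
        return {}
    step = interval_s / len(pollers)
    return {name: index * step for index, name in enumerate(pollers)}


if __name__ == "__main__":
    # Setup described in the thread: two nodes, each polling every 60s.
    print(polls_per_minute(60, 2))    # 2.0
    # Suggested test: one node, starting at a 2 minute interval.
    print(polls_per_minute(120, 1))   # 0.5
    # Hypothetical labels for the primary and redundant pollers.
    print(staggered_offsets(60, ["primary", "redundant"]))
```

With two pollers on a 60s interval, the second one would start 30s after the first, so the switch sees one poll every 30s rather than two back to back.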

Wed Dec 11, 2024 11:35 pm

sakita wrote:I modified the time on my lab NTP server and then tried it...

/etc/init.d/ntp restart
stopped process in pidfile '/var/run/ntp' (pid 826)

...time was immediately set to match the NTP server. Cool.

After that, I changed the NTP server time and tried the disable NTP / save / enable NTP / save sequence, and it did not change the Netonix time.

Again, it does set the time correctly on startup or reboot as long as the NTP server is up and accessible over the network (and it seems to try it enough times to work fine in general). However, when it needs to be done manually, resyncing is currently a command prompt activity (or wait 24 hours).

I checked 5 field switches that have been running 1.5.21 since last week and they all had accurate time and their logs showed "admin: sync time via ntp" log events once each day so that is working as intended as well.

This is what I've confirmed in my testing as well.

Working on a solution so that a disable/enable cycle immediately triggers an NTP sync without needing to invoke the script.
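Until then, the daily sync can be checked from an exported log. Below is a minimal Python sketch that counts the "admin: sync time via ntp" events per day. It assumes each log line starts with a syslog-style "Mon DD" stamp; that format is an assumption and is not confirmed anywhere in this thread, and the sample lines are made up for illustration, not real switch output.

```python
from collections import Counter

# Message text as quoted in the post above.
SYNC_MARKER = "admin: sync time via ntp"


def sync_events_per_day(log_lines, day_chars=6):
    """Count NTP sync events per day.

    Assumption: the first `day_chars` characters of each line identify
    the day (for example 'Dec 11'). Adjust for the real log format.
    """
    counts = Counter()
    for line in log_lines:
        if SYNC_MARKER in line:
            counts[line[:day_chars]] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample lines, only to show the expected input shape.
    sample = [
        "Dec 10 03:00:01 admin: sync time via ntp",
        "Dec 10 09:15:42 some other event",
        "Dec 11 03:00:02 admin: sync time via ntp",
    ]
    print(sync_events_per_day(sample))
```

A day with no sync at all will simply be missing from the result, so compare the keys against the days the switch was actually up.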

Thu Dec 12, 2024 12:22 am

We just found our first issue with traffic being eaten...

Model: WS-12-250-AC
Port: Port13
Port speed: 1G
SFP: FIBERSTORE SFP-10G-DAC (don't ask)

Normal IP traffic working fine, PPPoE not. Learning MACs in the affected VLAN from PoE/radio ports, not from the SFP port. Switch on the other side of the SFP not learning MACs from the affected VLAN on that port (other VLANs OK).

Tried the following on the Netonix without success:
-Remove and add ports from the VLAN
-Remove and add the entire VLAN
-Change the position of the VLAN in the list

Then I configured an IP address on the VLAN to communicate with my test IP upstream and this suddenly started the MAC learning, and the PPPoE sessions came up.
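To spot the symptom described above (MACs learned in a VLAN on some ports but not on the SFP port), a small Python sketch like the following could be run against an exported MAC table. The input shape, a list of (vlan, port, mac) tuples, is hypothetical, as are the VLAN IDs and MAC addresses in the example; the switch's real export format would need to be parsed into that shape first.

```python
from collections import defaultdict


def vlans_missing_on_port(mac_entries, port):
    """Return VLANs that have learned MACs on other ports but none on `port`.

    `mac_entries` is an iterable of (vlan, port, mac) tuples
    (a hypothetical shape, not the switch's native format).
    """
    ports_by_vlan = defaultdict(set)
    for vlan, entry_port, _mac in mac_entries:
        ports_by_vlan[vlan].add(entry_port)
    return sorted(vlan for vlan, ports in ports_by_vlan.items() if port not in ports)


if __name__ == "__main__":
    # Made-up entries: VLAN 100 is learned on port 13, VLAN 200 is not.
    entries = [
        (100, 1, "00:00:5e:00:53:01"),
        (100, 13, "00:00:5e:00:53:02"),
        (200, 2, "00:00:5e:00:53:03"),
    ]
    print(vlans_missing_on_port(entries, 13))
```

A VLAN showing up in the result is only a hint, since a port that is not a member of that VLAN would legitimately have no entries.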

Thu Dec 12, 2024 12:37 am

oeyre wrote:We just found our first issue with traffic being eaten...

Model: WS-12-250-AC
Port: Port13
Port speed: 1G
SFP: FIBERSTORE SFP-10G-DAC (don't ask)

Normal IP traffic working fine, PPPoE not. Learning MACs in the affected VLAN from PoE/radio ports, not from the SFP port. Switch on the other side of the SFP not learning MACs from the affected VLAN on that port (other VLANs OK).

Tried the following on the Netonix without success:
-Remove and add ports from the VLAN
-Remove and add the entire VLAN
-Change the position of the VLAN in the list

Then I configured an IP address on the VLAN to communicate with my test IP upstream and this suddenly started the MAC learning, and the PPPoE sessions came up.

We've been trying to hunt this one down for a while; you may have just figured out the differentiating factor here.

Can anyone else, whether or not you are experiencing this issue, chime in and let us know if you have an IP address assigned to the affected VLAN that carries the PPPoE traffic?

Also, did any of your other VLANs that weren't exhibiting the problem have IPs assigned to them?

One more question to help with one of my own working theories: is DHCP Snooping enabled on any of your ports?

Thu Dec 12, 2024 12:48 am

Yes, we have the same problem. For us it kills OSPF multicasts (probably all multicasts, but that's what's easiest for us to notice).
Only on SFP ports with tagged VLANs - trunks (but not in all cases, or not always, can't figure out when).
(I suspect it might also have something to do with LAG - in most cases these are LAG members).
We do have IPs for VLANs (AKA "watchdog IPs").

Also - seen the NTP issue.
/etc/init.d/ntp restart solves the problem (but not via the GUI).

Thu Dec 12, 2024 1:02 am

Hi yahel,

By any chance, are the trunked VLANs that are on the SFP ports using an assigned IP?

If so, does removing the IP from that VLAN and adding it back make a difference?

Thu Dec 12, 2024 1:22 am

Yes - the trunked VLANs are using an assigned IP (watchdog IP).

We currently have the interface (P14) that is giving us a hard time disabled (it's a LAG member with two other members: one is P13-SFP and the other P12-RJ45).
With it disabled, everything works fine; when we enable it, OSPF dies.

I'll ask Vivek from our team in India to temporarily disable the IPs on the VLANs tonight after 2am, and he'll see if that helps (I'll be asleep).
If it does make things work with P14 enabled, he'll try to re-enable the IPs, one by one, to see if that can teach us anything.

Thanks!
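The one-by-one re-enable test described above can be sketched as a small loop. This is only an outline in Python: `enable`, `disable`, and `is_healthy` are hypothetical callbacks that the operator would have to supply (for example, applying a config change and then checking whether OSPF adjacencies are up); nothing here talks to a switch.

```python
def find_breaking_items(items, enable, disable, is_healthy):
    """Re-enable items one at a time and report which ones break the health check.

    Each item that causes `is_healthy()` to fail is disabled again, so the
    remaining items are tested from a working state.
    """
    culprits = []
    for item in items:
        enable(item)
        if not is_healthy():
            culprits.append(item)
            disable(item)
    return culprits


if __name__ == "__main__":
    # Simulated state, for illustration only: "vlan-b" is the made-up culprit.
    enabled = set()
    result = find_breaking_items(
        ["vlan-a", "vlan-b", "vlan-c"],
        enable=enabled.add,
        disable=enabled.discard,
        is_healthy=lambda: "vlan-b" not in enabled,
    )
    print(result)
```

In practice each step would need a wait long enough for the protocol to settle before the health check runs.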
