Fri Aug 29, 2025 12:45 pm
We were hit with another mass cold reboot thought to be associated with RSTP. Five WS switches comprising a common RSTP network at different tower sites underwent a spontaneous, simultaneous cold reboot, taking down all the POE radios in the process. An interesting twist this time is that a sixth switch co-located at one of the sites also cold rebooted, even though it's not part of the RSTP chain (RSTP disabled). It does have loop protection enabled, as do all the others. None of the switches are on an external watchdog. The remaining switches, which are neither co-located nor part of RSTP, did not reboot, and there are no other co-located switches. Non-POE equipment at the sites (e.g., routers, UPS, environmental monitors) was unaffected.
Following this event, once the network came back up, the switches all failed to sync time via NTP. For the next 24 hours the logs showed multiple instances of
admin: stopped ntp daemon
admin: started ntp daemon
admin: sync time via ntp
Exactly 24 hours and two minutes later, the sync time via ntp succeeded on all the switches that rebooted.
All switches are running v1.5.25, which has not fixed the random mass reboots or entirely fixed the NTP issue from previous versions going back to at least v1.5.8, when we experienced our first occurrence. I've reported previous events. Fortunately the events are infrequent, but they're highly disruptive (and nerve-racking) when they occur.
In addition, following this event, the Netonix Manager v1.0.24 shows incorrect run times for the affected switches, displaying run times as though the event never occurred, even though the run time is correct on each switch GUI. A workaround is to wait the 24 hours, then remove the switch from monitoring and add it back in. The correct run time is then displayed. It appears the manager gets its run time from the switches only at initialization of monitoring, and after that relies on its own clock from the device it's running on.
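To illustrate what I suspect is happening with the run time, here is a minimal Python sketch. It is only a guess at the behavior, not the Netonix Manager's actual code; the class and function names (CachedUptimeMonitor, switch_uptime) and the day counts are made up for the example.

```python
# Hypothetical sketch: NOT the Netonix Manager's actual code.
# It models a monitor that reads a switch's uptime once, when monitoring
# starts, and afterwards extrapolates from its own local clock.

DAY = 86400  # seconds


class CachedUptimeMonitor:
    """Reads device uptime once, then extrapolates from the local clock."""

    def __init__(self, device_uptime_s, now_s):
        self.base_uptime_s = device_uptime_s  # read from the switch once
        self.base_time_s = now_s              # local clock at that moment

    def displayed_uptime(self, now_s):
        # The switch is never queried again, so a reboot goes unnoticed.
        return self.base_uptime_s + (now_s - self.base_time_s)


def switch_uptime(boot_time_s, now_s):
    """What the switch GUI would show: time since its last boot."""
    return now_s - boot_time_s


# The switch boots at t = 0; monitoring starts at t = 10 days.
monitor = CachedUptimeMonitor(device_uptime_s=10 * DAY, now_s=10 * DAY)

# The switch cold reboots at t = 12 days; we check at t = 13 days.
reboot_at, now = 12 * DAY, 13 * DAY
manager_shows = monitor.displayed_uptime(now)  # as if no reboot happened
switch_shows = switch_uptime(reboot_at, now)   # time since the reboot

# Removing and re-adding the switch forces a fresh read of the uptime.
readded = CachedUptimeMonitor(switch_uptime(reboot_at, now), now)

print(manager_shows // DAY, switch_shows // DAY,
      readded.displayed_uptime(now) // DAY)
```

If the manager works anything like this, it would explain both the stale run time after a reboot and why removing and re-adding the switch corrects it.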