asw2-d2-eqiad <-> asw2-d4-eqiad vcp link flapping
Closed, ResolvedPublic

Description

2024-09-06 14:32:09	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: down -> up
2024-09-06 14:32:09	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: down -> up
2024-09-06 13:52:25	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: up -> down
2024-09-06 13:52:25	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: up -> down
2024-09-06 13:42:25	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: down -> up
2024-09-06 13:42:25	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: down -> up
2024-09-06 13:37:16	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: up -> down
2024-09-06 13:37:16	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: up -> down
2024-09-06 12:37:12	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: down -> up
2024-09-06 12:37:12	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: down -> up
2024-09-06 10:22:26	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: up -> down
2024-09-06 10:22:26	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: up -> down
2024-08-09 12:52:39	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: down -> up
2024-08-09 12:52:39	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: down -> up
2024-08-09 12:26:58	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: up -> down
2024-08-09 12:17:04	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: down -> up
2024-08-09 12:12:24	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: up -> down
2024-08-09 12:12:24	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: up -> down
2024-08-09 11:47:19	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: down -> up
2024-08-09 11:47:18	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: down -> up
2024-08-09 11:42:20	fpc4.vcp-255/0/52	asw2-d-eqiad	ifOperStatus: up -> down
2024-08-09 11:42:20	fpc2.vcp-255/0/51	asw2-d-eqiad	ifOperStatus: up -> down

Event Timeline

Restricted Application added a subscriber: Aklapper.
CDanis triaged this task as High priority. Sep 6 2024, 6:39 PM

As partially noted on IRC as well: the flapping has been going on for a while, and there didn't seem to be any critical hosts in D4 (assuming the line card numbering matches the physical racks properly, in all VCs), hence it was not Klaxon-worthy to me. Nevertheless, these are still production hosts running on a switch with interface issues, sometimes for up to two hours. And unless the eqiad VC cabling differs from a perfect spine-leaf topology, this means the D4 asw had only one remaining uplink, which is an issue.

Over to netops for troubleshooting and coordinating further steps :)

cmooney renamed this task from asw2-d-eqiad vcp links flapping to asw2-d4-eqiad vcp links flapping. Sep 7 2024, 10:31 AM
cmooney subscribed.
cmooney renamed this task from asw2-d4-eqiad vcp links flapping to asw2-d2-eqiad <-> asw2-d4-eqiad vcp link flapping. Sep 7 2024, 10:35 AM
cmooney added subscribers: VRiley-WMF, Jclark-ctr.

Thanks @CDanis and @Southparkfan for the task!

The logs relate to this link, which is a 40G DAC connection between the two switches. I'm somewhat confused about the timestamps in the task description, however. They seem to show two fairly lengthy outages, from 10:22 to 12:37 and again from 13:52 to 14:32, but the link graphs (here and here) don't reflect this and instead show normal usage levels throughout. The throughput from the row D interfaces on the CRs was also at normal levels, and the on-device logs don't exactly tally either, which is confusing.
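For reference, the apparent outage windows can be derived from the description by pairing each "down" event with the next "up". The sketch below does this for the fpc4 entries on 2024-09-06 only; the event list is copied from the task description, while the function name `apparent_outages` is just an illustrative choice. It reflects only what LibreNMS recorded, not necessarily the real link state.

```python
from datetime import datetime

# Events copied from the task description (fpc4 side only, 2024-09-06),
# as (timestamp, new ifOperStatus) pairs.
EVENTS = [
    ("2024-09-06 10:22:26", "down"),
    ("2024-09-06 12:37:12", "up"),
    ("2024-09-06 13:37:16", "down"),
    ("2024-09-06 13:42:25", "up"),
    ("2024-09-06 13:52:25", "down"),
    ("2024-09-06 14:32:09", "up"),
]


def apparent_outages(events):
    """Pair each 'down' with the next 'up'; return (start, end, duration) tuples."""
    outages = []
    down_at = None
    for stamp, state in sorted(events):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if state == "down" and down_at is None:
            down_at = when  # start of an apparent outage
        elif state == "up" and down_at is not None:
            outages.append((down_at, when, when - down_at))
            down_at = None  # outage closed
    return outages


if __name__ == "__main__":
    for start, end, length in apparent_outages(EVENTS):
        print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}  ({length})")
```

Applied to these entries it yields three windows: the two long ones mentioned above plus a roughly five-minute one starting at 13:37.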

The Juniper logs contain nothing about this link until 14:22, when we start seeing brief flaps like this reported:

Sep  6 14:22:06  asw2-d-eqiad fpc4 Local fault detected on port 65 (vcp-255/0/52)
Sep  6 14:22:07  asw2-d-eqiad fpc4 Local fault cleared on port 65 (vcp-255/0/52)

These continue regularly throughout the day, and then stop at 22:39. There have been none for the past ~12 hours.

Sep  6 22:39:23  asw2-d-eqiad fpc4 Local fault cleared on port 65 (vcp-255/0/52)
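To quantify how brief these flaps are, the "Local fault detected/cleared" lines can be paired up per port. This is a minimal sketch, assuming the line format is exactly as in the excerpts above; the year is not present in the syslog lines, so it is passed in as an assumption (2024, from the task dates), and the helper names are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Sep  6 14:22:06  asw2-d-eqiad fpc4 Local fault detected on port 65 (vcp-255/0/52)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+(?P<fpc>fpc\d+)\s+"
    r"Local fault (?P<event>detected|cleared) on port \d+ \((?P<port>[^)]+)\)"
)

SAMPLE = [
    "Sep  6 14:22:06  asw2-d-eqiad fpc4 Local fault detected on port 65 (vcp-255/0/52)",
    "Sep  6 14:22:07  asw2-d-eqiad fpc4 Local fault cleared on port 65 (vcp-255/0/52)",
    "Sep  6 22:39:23  asw2-d-eqiad fpc4 Local fault cleared on port 65 (vcp-255/0/52)",
]


def parse_faults(lines, year=2024):
    """Return (timestamp, fpc, port, event) tuples for matching lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a local-fault line
        # Collapse the double space syslog uses for single-digit days.
        ts_text = " ".join(m["ts"].split())
        ts = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["fpc"], m["port"], m["event"]))
    return events


def flap_durations(events):
    """Pair each 'detected' with the next 'cleared' on the same fpc/port."""
    durations = []
    open_fault = {}
    for ts, fpc, port, event in sorted(events):
        key = (fpc, port)
        if event == "detected":
            open_fault[key] = ts
        elif key in open_fault:
            durations.append((key, ts - open_fault.pop(key)))
        # a 'cleared' with no preceding 'detected' in the input is ignored
    return durations


if __name__ == "__main__":
    for key, length in flap_durations(parse_faults(SAMPLE)):
        print(key, length)
```

On the three sample lines this reports a single one-second flap; the final "cleared" line has no matching "detected" in the excerpt, so it is skipped.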

There are also frequent messages like this in the logs about this link from asw2-d7-eqiad to asw2-d1-eqiad, so we may need to replace that cable as well:

Sep  6 10:22:02  asw2-d-eqiad fpc7 qsfp-7/0/51 Chan# 0: Rx power high alarm set
Sep  6 10:22:11  asw2-d-eqiad fpc7 qsfp-7/0/51 Chan# 0: Rx power high alarm cleared

Right now things appear stable enough, and indeed on the face of it the impact of this appears to be minimal based on throughput graphs (as illogical as that seems based on the logs), so I think it can wait until Monday, but I will keep an eye on it over the weekend.

@Jclark-ctr @VRiley-WMF can one of you attend site on Monday as soon as you can and we can have a look at replacing the cables on these links? Hopefully we have some of the DACs but I believe we can use a regular fiber QSFP+ connection also.

The timestamps in the description come from LibreNMS's logs viewer for asw2-d-eqiad: https://librenms.wikimedia.org/device/149/logs

LibreNMS might have converted to my local time?

The timestamps in the description come from LibreNMS's logs viewer for asw2-d-eqiad: https://librenms.wikimedia.org/device/149/logs

Ah ok, gotcha. I think these are therefore based on SNMP traps, which probably explains the discrepancy. We get up/down flaps within the same second, so for some of those the "recovery" likely wasn't properly logged, coming as it did right on top of the "outage".

LibreNMS might have converted to my local time?

Yeah, it seems to be doing that alright. The good thing is that, despite the system logging these messages or sending traps, the actual link appears to continue to operate. Working now to try and get it replaced.
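As a quick illustration of the timezone effect: the thread does not state the viewer's timezone, so the sketch below assumes a hypothetical offset of UTC-4 and that the on-device logs are in UTC. Both are assumptions, and `local_to_utc` is an illustrative helper, not part of LibreNMS.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: hypothetical viewer offset of UTC-4; not stated in the task.
VIEWER_TZ = timezone(timedelta(hours=-4))


def local_to_utc(stamp, tz=VIEWER_TZ):
    """Interpret a LibreNMS-style timestamp as local time and convert to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    # First "up -> down" event from the task description.
    print(local_to_utc("2024-09-06 10:22:26").strftime("%Y-%m-%d %H:%M:%S UTC"))
```

Under that assumed offset, the first "down" at 10:22:26 in the description would correspond to 14:22:26 UTC, which is close to when the device logs first report local faults on this port.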

Icinga downtime and Alertmanager silence (ID=81e99a80-f593-4494-a565-ea730a19fbc7) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: repalce vcp link from d2 port 51 to d4 port 52

asw2-d-eqiad

Ok link was replaced:

Sep  9 15:36:56  asw2-d-eqiad vccpd[2257]: VCCPD_PROTOCOL_INTF_STATE_CHANGED: Member 4, interface vcp-255/0/52.32768 came up
Sep  9 15:36:56  asw2-d-eqiad vccpd[2257]: VCCPD_PROTOCOL_INTF_STATE_CHANGED: Member 4, interface vcp-255/0/52 came up
Sep  9 15:36:56  asw2-d-eqiad vccpd[2257]: VCCPD_PROTOCOL_INTF_STATE_CHANGED: Member 2, interface vcp-255/0/51 came up

Will keep an eye on how things progress, thanks @VRiley-WMF for the quick action!

Thank you! I appreciate it. Will be relabeling the new cable as 0325. Feel free to reach out if anything else happens.

So far things seem stable with this. I will leave the task open to review as the week goes on, and will also consider whether we need to do anything about the link that periodically reports the high power alarm.

cmooney claimed this task.

Still all looking good; there have been no logs or cases of the interface being reported down since we changed it.

We do, however, still get warnings for this link from asw2-d1-eqiad to asw2-d7-eqiad:

Sep 10 07:55:31  asw2-d-eqiad fpc7 qsfp-7/0/51 Chan# 0: Rx power high alarm set
Sep 10 07:55:35  asw2-d-eqiad fpc7 qsfp-7/0/51 Chan# 0: Rx power high alarm cleared

Unlike the one we did the other day, this is a link on multi-core multi-mode (pink) fiber from rack 1 to rack 7. Either side has QSFP-SR4-40G optics. Tbh I think for now we can leave this one, but I am recording it here. Unfortunately the system gives us no diagnostics for that port or the RX level, but usually if the receive power is too high it ultimately damages the optic on the receiving side over time. Given things are otherwise healthy, I think we can weather this and see what happens; if we need to, we can replace the optics or potentially insert attenuators (if such a thing exists for MPO).
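If we ever do get RX readings for this port, sizing an attenuator is simple dB arithmetic: received power in dBm drops by the attenuator's value in dB. The sketch below is purely illustrative. The lane powers, alarm threshold and margin are hypothetical numbers, not measurements from this switch or values from an optic datasheet.

```python
def required_attenuation_db(lane_rx_dbm, high_alarm_dbm, margin_db=1.0):
    """Attenuation (dB) needed so the hottest lane sits margin_db below the alarm.

    lane_rx_dbm: per-lane receive power readings in dBm (SR4 has four lanes).
    Returns 0.0 if every lane is already below the target.
    """
    target_dbm = high_alarm_dbm - margin_db
    hottest = max(lane_rx_dbm)
    return max(0.0, hottest - target_dbm)


if __name__ == "__main__":
    # HYPOTHETICAL values for illustration only.
    lanes = [3.0, 2.0, 1.5, 2.5]
    print(required_attenuation_db(lanes, high_alarm_dbm=2.5))
```

With these made-up numbers the hottest lane is 3.0 dBm and the target is 1.5 dBm, so at least 1.5 dB of attenuation would be needed.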