Hey @andrea.denisse We can continue to try to troubleshoot this error. Currently, it isn't showing any hardware fault through the iDRAC. However, I know we also spoke about possibly powering the unit down. Is there any preference on how we should proceed?

Tue, Oct 8, 7:31 PM · SRE Observability (FY2024/2025-Q1), SRE, DC-Ops, ops-eqiad

VRiley-WMF closed T372607: Decommission the alert1001 and alert2001 hosts, a subtask of T372418: Put the alert1002 and alert2002 hosts in production, as Resolved.

Tue, Oct 8, 7:26 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q1), Observability-Alerting

VRiley-WMF closed T372607: Decommission the alert1001 and alert2001 hosts as Resolved.

Tue, Oct 8, 7:26 PM · SRE, ops-eqiad, DC-Ops, ops-codfw, decommission-hardware, SRE Observability (FY2024/2025-Q1), Observability-Alerting

VRiley-WMF updated the task description for T372607: Decommission the alert1001 and alert2001 hosts.

Tue, Oct 8, 7:25 PM · SRE, ops-eqiad, DC-Ops, ops-codfw, decommission-hardware, SRE Observability (FY2024/2025-Q1), Observability-Alerting

VRiley-WMF merged T376537: ManagementSSHDown into T376094: ManagementSSHDown - ms-be1077 / logging-hd1005.

Tue, Oct 8, 7:18 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF merged task T376537: ManagementSSHDown into T376094: ManagementSSHDown - ms-be1077 / logging-hd1005.

Tue, Oct 8, 7:17 PM · SRE, ops-eqiad, DC-Ops

VRiley-WMF claimed T376094: ManagementSSHDown - ms-be1077 / logging-hd1005.

Tue, Oct 8, 5:49 PM · SRE, DC-Ops, ops-eqiad

Thu, Oct 3

VRiley-WMF added a comment to T372514: Q1:rack/setup/install aqs1022.eqiad.wmnet.

@Jclark-ctr You're right. I had a typo. It is in Rack D6 as per the request

Thu, Oct 3, 3:41 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

Wed, Oct 2

VRiley-WMF added a comment to T372514: Q1:rack/setup/install aqs1022.eqiad.wmnet.

Location:
D5
U31
CableID 2576
Port 30

Wed, Oct 2, 10:53 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

VRiley-WMF updated the task description for T372514: Q1:rack/setup/install aqs1022.eqiad.wmnet.

Wed, Oct 2, 10:52 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

Tue, Oct 1

VRiley-WMF added a comment to T376058: puppetserver ram upgrades - decom memory option.

@MoritzMuehlenhoff Once the server can be powered off, we will insert the RAM, and when it powers back up, it should instantly recognize it. I'm ready anytime today to install the memory. Just let us know when puppetserver1001 can be powered down. Thanks!

Tue, Oct 1, 6:07 PM · ops-eqiad, SRE, DC-Ops

VRiley-WMF updated the task description for T376058: puppetserver ram upgrades - decom memory option.

Tue, Oct 1, 6:05 PM · ops-eqiad, SRE, DC-Ops

Mon, Sep 30

VRiley-WMF added a comment to T376058: puppetserver ram upgrades - decom memory option.

Surprisingly, I have been able to locate six (6) 32 GB sticks of 3200 MHz RAM. Please let us know when we can initiate this process.

Mon, Sep 30, 7:10 PM · ops-eqiad, SRE, DC-Ops

VRiley-WMF added a comment to T375000: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5.

Hi @MoritzMuehlenhoff It looks like we could use snapshot1008 and snapshot1009 as stand-ins for the servers. Let us know if there is any preference on cage or location.

Mon, Sep 30, 4:17 PM · SRE, ops-eqiad, DC-Ops

Thu, Sep 26

VRiley-WMF closed T374897: ManagementSSHDown - elastic1089 as Resolved.

After troubleshooting this, we had to reboot the E1 management switch. This issue should be cleared up.

Thu, Sep 26, 6:15 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF claimed T374897: ManagementSSHDown - elastic1089.

Thu, Sep 26, 6:14 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF closed T375758: ManagementSSHDown as Resolved.

Thu, Sep 26, 6:13 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF added a comment to T375758: ManagementSSHDown.

After troubleshooting the cables and seeing multiple issues with other servers, it was recommended to reboot the switch. I logged it and then proceeded to reboot. It looks like this has cleared up the issue. Closing this now.

Thu, Sep 26, 6:13 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF claimed T375758: ManagementSSHDown.

Thu, Sep 26, 6:12 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF added a comment to T375257: Degraded RAID on es1022.

This drive has been replaced. Please let us know if there are any further issues.

Thu, Sep 26, 3:43 PM · DBA, SRE, ops-eqiad, DC-Ops

VRiley-WMF closed T375459: ManagementSSHDown as Resolved.

Reseated cable and it seems to be communicating now. Will close this and monitor.

Thu, Sep 26, 1:58 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF claimed T375459: ManagementSSHDown.

Thu, Sep 26, 1:56 PM · SRE, DC-Ops, ops-eqiad

Mon, Sep 23

VRiley-WMF added a comment to T375382: Post pc1013 crash.

Hi! We do have a spare DIMM that we can swap at any time for this unit. Please let us know when the best time to proceed is. Thanks!

Mon, Sep 23, 6:12 PM · Wikimedia-production-error, Sustainability (Incident Followup), SRE, DBA

VRiley-WMF closed T375314: ManagementSSHDown as Resolved.

Swapped out cable. Closing for now.

Mon, Sep 23, 6:05 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF added a comment to T375257: Degraded RAID on es1022.

Hey @ABran-WMF as it turns out, we don't happen to have any 2TB drives to use as a replacement. However, we do have plenty of 4TB drives that should work. Is it okay to move forward with swapping it out with a 4TB drive?

Mon, Sep 23, 5:14 PM · DBA, SRE, ops-eqiad, DC-Ops

VRiley-WMF added a comment to T374215: db1246 crashed, doesn't reboot cleanly.

After working with Dell and explaining the issue, they confirmed that there are no hardware issues in the TSR report. I also gave them the image that @Jclark-ctr provided. Case#: 198075128. They continue to believe the problem lies with the OS.

Mon, Sep 23, 12:50 PM · Data-Persistence-SRE, DBA

VRiley-WMF added a comment to T374215: db1246 crashed, doesn't reboot cleanly.

With this information, I'm going to reach back out to Dell.

Mon, Sep 23, 11:47 AM · Data-Persistence-SRE, DBA

Fri, Sep 20

VRiley-WMF merged task T375130: ManagementSSHDown into T374897: ManagementSSHDown - elastic1089.

Fri, Sep 20, 4:08 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF merged T375130: ManagementSSHDown into T374897: ManagementSSHDown - elastic1089.

Fri, Sep 20, 4:08 PM · SRE, DC-Ops, ops-eqiad

Thu, Sep 19

VRiley-WMF closed T373740: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error as Resolved.

This DIMM (B2) has been swapped out. Please let us know if any other issue crops up.

Thu, Sep 19, 5:41 PM · SRE, DC-Ops, ops-eqiad, cloud-services-team

VRiley-WMF added a comment to T373740: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error.

Is there an acceptable time to swap out the DIMM? We can proceed at any time.

Thu, Sep 19, 12:53 PM · SRE, DC-Ops, ops-eqiad, cloud-services-team

VRiley-WMF closed T375037: PDU sensor over limit as Resolved.

Attempted to rebalance power.

Thu, Sep 19, 12:53 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF claimed T375037: PDU sensor over limit.

Thu, Sep 19, 12:31 PM · SRE, DC-Ops, ops-eqiad

Wed, Sep 18

VRiley-WMF claimed T373740: 2024-08-31 cloudvirt1048 NodeDown because memory hardware error.

Wed, Sep 18, 3:07 PM · SRE, DC-Ops, ops-eqiad, cloud-services-team

Sep 12 2024

VRiley-WMF added a comment to T362841: Degraded RAID on aqs1014.

@Eevans the drives that were not listed in the group have been replaced. Please let us know if anything else is needed.

Sep 12 2024, 5:42 PM · DC-Ops, Cassandra, SRE, ops-eqiad

andrea.denisse awarded T374540: Degraded RAID on prometheus1008 a 100 token.

Sep 12 2024, 5:42 PM · SRE Observability (FY2024/2025-Q1), SRE, DC-Ops, ops-eqiad

VRiley-WMF closed T374540: Degraded RAID on prometheus1008 as Resolved.

@andrea.denisse This drive has been replaced. Please let us know if there are any other issues with this unit.

Sep 12 2024, 4:24 PM · SRE Observability (FY2024/2025-Q1), SRE, DC-Ops, ops-eqiad

Sep 11 2024

VRiley-WMF added a comment to T374215: db1246 crashed, doesn't reboot cleanly.

Dell worked with us on this issue for a while and reviewed the logs; they don't see any issues with the hardware. Would it be possible to reinstall the OS so we can monitor whether anything else comes up? Also, logging in through iDRAC doesn't show any errors at the moment.

Sep 11 2024, 5:50 PM · Data-Persistence-SRE, DBA

VRiley-WMF moved T374540: Degraded RAID on prometheus1008 from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Sep 11 2024, 4:36 PM · SRE Observability (FY2024/2025-Q1), SRE, DC-Ops, ops-eqiad

VRiley-WMF claimed T374540: Degraded RAID on prometheus1008.

Sep 11 2024, 4:36 PM · SRE Observability (FY2024/2025-Q1), SRE, DC-Ops, ops-eqiad

VRiley-WMF moved T374215: db1246 crashed, doesn't reboot cleanly from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Sep 11 2024, 4:26 PM · Data-Persistence-SRE, DBA

Sep 10 2024

VRiley-WMF added a comment to T374215: db1246 crashed, doesn't reboot cleanly.

I have attempted a few troubleshooting steps. I have uploaded logs to Dell under SR 197398410. Awaiting results.

Sep 10 2024, 5:49 PM · Data-Persistence-SRE, DBA

VRiley-WMF claimed T374215: db1246 crashed, doesn't reboot cleanly.

Sep 10 2024, 3:54 PM · Data-Persistence-SRE, DBA

VRiley-WMF added a comment to T374215: db1246 crashed, doesn't reboot cleanly.

@ABran-WMF I'm taking a look at this. I will update with results.

Sep 10 2024, 3:49 PM · Data-Persistence-SRE, DBA

VRiley-WMF updated the task description for T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

Sep 10 2024, 1:27 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops

VRiley-WMF added a comment to T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

ganeti1039
B2
U4
CableID 4893
Port 3

Sep 10 2024, 1:27 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops

Sep 9 2024

VRiley-WMF added a comment to T374272: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping.

Thank you! I appreciate it. Will be relabeling the new cable as 0325. Feel free to reach out if anything else happens.

Sep 9 2024, 3:41 PM · ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops

Sep 6 2024

VRiley-WMF closed T373800: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet, a subtask of T371077: SmartNotHealthy on an-worker1085, as Resolved.

Sep 6 2024, 7:11 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27), sre-alert-triage

VRiley-WMF closed T373800: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet as Resolved.

Sep 6 2024, 7:11 PM · SRE, ops-eqiad, DC-Ops

VRiley-WMF added a comment to T373800: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet.

Ah, the blinking light did activate. I have swapped the HDD, and it should be good to go. Let us know if there is anything else we can help with. Thank you!

Sep 6 2024, 7:10 PM · SRE, ops-eqiad, DC-Ops

Sep 5 2024

VRiley-WMF added a comment to T373800: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet.

Hi @BTullis we can replace this drive at any time. Although the LED on the drive isn't on, as long as we know the slot, that works for us.

Sep 5 2024, 2:06 PM · SRE, ops-eqiad, DC-Ops

VRiley-WMF moved T373800: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Sep 5 2024, 2:06 PM · SRE, ops-eqiad, DC-Ops

VRiley-WMF claimed T373800: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet.

Sep 5 2024, 2:05 PM · SRE, ops-eqiad, DC-Ops

VRiley-WMF reassigned T373888: puppetmaster1003: broken disk from VRiley-WMF to MoritzMuehlenhoff.

This drive has been replaced! Thanks!

Sep 5 2024, 11:03 AM · SRE, DC-Ops, ops-eqiad

Sep 4 2024

VRiley-WMF updated the task description for T367801: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet.

Sep 4 2024, 8:11 PM · SRE, fundraising-tech-ops, ops-eqiad, DC-Ops

Sep 3 2024

VRiley-WMF added a comment to T373888: puppetmaster1003: broken disk.

Sure, that will work for us. We will plan for it then. Thank you!

Sep 3 2024, 5:33 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF closed T373696: Relabel eqiad kubernetes nodes as Resolved.

This is completed. Thank you!

Sep 3 2024, 5:32 PM · SRE, ops-eqiad, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

VRiley-WMF updated the task description for T373696: Relabel eqiad kubernetes nodes.

Sep 3 2024, 5:31 PM · SRE, ops-eqiad, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

VRiley-WMF closed T373696: Relabel eqiad kubernetes nodes, a subtask of T351074: Move servers from the appserver/api cluster to kubernetes, as Resolved.

Sep 3 2024, 5:30 PM · serviceops, MW-on-K8s

VRiley-WMF moved T373888: puppetmaster1003: broken disk from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Sep 3 2024, 3:50 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF claimed T373888: puppetmaster1003: broken disk.

Sep 3 2024, 3:49 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF added a comment to T373888: puppetmaster1003: broken disk.

Hey @MoritzMuehlenhoff, thanks for reaching out on this ticket. Thankfully, I have been able to locate a replacement disk for this unit. We can swap this disk at any time.

Sep 3 2024, 3:49 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF merged T373235: Degraded RAID on puppetmaster1003 into T373888: puppetmaster1003: broken disk.

Sep 3 2024, 3:49 PM · SRE, DC-Ops, ops-eqiad

VRiley-WMF merged task T373235: Degraded RAID on puppetmaster1003 into T373888: puppetmaster1003: broken disk.

Sep 3 2024, 3:47 PM · DC-Ops, SRE, ops-eqiad

VRiley-WMF closed T373755: PDU sensor over limit as Resolved.

Rebalanced power

Sep 3 2024, 1:51 PM · SRE, ops-eqiad, DC-Ops

VRiley-WMF claimed T373755: PDU sensor over limit.

Sep 3 2024, 1:51 PM · SRE, ops-eqiad, DC-Ops

Aug 29 2024

VRiley-WMF assigned T373376: (2) new singlemode fiber patches from dmarc to routers for IX ports to cmooney.

Aug 29 2024, 11:04 PM · procurement, Infrastructure-Foundations, DC-Ops, ops-eqiad, netops, SRE

VRiley-WMF updated the task description for T373376: (2) new singlemode fiber patches from dmarc to routers for IX ports.

Aug 29 2024, 11:04 PM · procurement, Infrastructure-Foundations, DC-Ops, ops-eqiad, netops, SRE

VRiley-WMF added a comment to T373376: (2) new singlemode fiber patches from dmarc to routers for IX ports.

Ports: 27/28 patch to cr2-eqiad:xe-3/0/3 - Cable ID 1-8292024

Aug 29 2024, 11:04 PM · procurement, Infrastructure-Foundations, DC-Ops, ops-eqiad, netops, SRE

VRiley-WMF claimed T373235: Degraded RAID on puppetmaster1003.

Aug 29 2024, 5:01 PM · DC-Ops, SRE, ops-eqiad

VRiley-WMF closed T372939: PDU sensor over limit as Resolved.

Rebalanced power

Aug 29 2024, 4:50 PM · DC-Ops, ops-eqiad

Aug 27 2024

VRiley-WMF updated the task description for T370546: Q1:rack/setup/install logging-sd100[1-4].

Aug 27 2024, 5:10 PM · SRE, observability, ops-eqiad, DC-Ops

VRiley-WMF added a comment to T370546: Q1:rack/setup/install logging-sd100[1-4].

logging-sd1001
Rack E 5
U 32
CableID 20220092
Port 18

Aug 27 2024, 5:10 PM · SRE, observability, ops-eqiad, DC-Ops

VRiley-WMF closed T372560: Disk failed on ms-be1079 as Resolved.

Drive has been replaced. Please let us know if there are any other issues with this drive. Thanks!

Aug 27 2024, 1:54 PM · ops-eqiad, DC-Ops, SRE-swift-storage, SRE

Aug 26 2024

VRiley-WMF updated the task description for T372432: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x).

Aug 26 2024, 5:07 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

VRiley-WMF added a comment to T372432: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x).

ml-serve1009
Rack A2
U19
CableID 4897
Port 7

Aug 26 2024, 4:46 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

Aug 24 2024

VRiley-WMF closed T373121: decommission an-coord1001.eqiad.wmnet and an-coord1002.eqiad.wmnet as Resolved.

Aug 24 2024, 7:54 PM · DC-Ops, ops-eqiad, decommission-hardware

VRiley-WMF closed T373133: decommission kafka-jumbo100[1-6] as Resolved.

Aug 24 2024, 7:50 PM · ops-eqiad, DC-Ops, decommission-hardware

VRiley-WMF closed T373177: decommission an-tool1010.eqiad.wmnet as Resolved.

Aug 24 2024, 7:32 PM · DC-Ops, ops-eqiad, decommission-hardware

VRiley-WMF closed T373178: decommission dbproxy101[8-9].eqiad.wmnet as Resolved.

Aug 24 2024, 7:28 PM · ops-eqiad, DC-Ops, decommission-hardware

VRiley-WMF closed T373179: decommission an-coord100[1-2] as Resolved.

Aug 24 2024, 7:20 PM · ops-eqiad, decommission-hardware, DC-Ops

Aug 23 2024

VRiley-WMF added a comment to T372560: Disk failed on ms-be1079.

Was able to find the drive. We can replace it at any time @MatthewVernon

Aug 23 2024, 4:40 PM · ops-eqiad, DC-Ops, SRE-swift-storage, SRE

Aug 21 2024

VRiley-WMF added a comment to T372560: Disk failed on ms-be1079.

Calling back into Dell for this ticket. It was supposed to have 1-day shipping; however, it has not yet arrived.

Aug 21 2024, 11:54 PM · ops-eqiad, DC-Ops, SRE-swift-storage, SRE

Aug 20 2024

VRiley-WMF added a comment to T372781: cr1-eqiad: disk failure.

Sounds like a plan. Thank you! I will be at the ready.

Aug 20 2024, 2:44 PM · SRE, ops-eqiad, Infrastructure-Foundations, netops, DC-Ops

VRiley-WMF added a comment to T372781: cr1-eqiad: disk failure.

@ayounsi I've checked the device and there don't seem to be any failure notifications (physically, anyway). Would it be possible to open an RMA or support ticket with Juniper?

Aug 20 2024, 2:31 PM · SRE, ops-eqiad, Infrastructure-Foundations, netops, DC-Ops

VRiley-WMF closed T372207: Disk (sdc) failed on ms-be1058 as Resolved.

Since there are no replacements available right now, I will be closing this ticket. If a replacement is needed, feel free to open this back up or make a new ticket and we can look into what options we may have.

Aug 20 2024, 2:17 PM · SRE-swift-storage, SRE, DC-Ops, ops-eqiad