
Lots of hosts with hyperthreading disabled
Open, Low, Public

Description

We have a lot of machines with HT disabled. My (possibly suboptimal) way of detecting disabled HT on non-VMs is:

grep hypervisor /proc/cpuinfo > /dev/null || lscpu | grep 'Thread(s) per core:    1'
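A possibly more explicit variant (just a sketch, and not what produced the lists below; it assumes the usual "siblings" and "cpu cores" fields in /proc/cpuinfo) compares hardware threads per package against physical cores directly, instead of relying on lscpu's exact column spacing:

#!/bin/bash
# Report whether HT/SMT appears disabled on a physical host.
grep -q hypervisor /proc/cpuinfo && { echo "VM, skipping"; exit 0; }
# "siblings" = hardware threads per package, "cpu cores" = physical cores per package
siblings=$(awk -F': ' '/^siblings/ {print $2; exit}' /proc/cpuinfo)
cores=$(awk -F': ' '/^cpu cores/ {print $2; exit}' /proc/cpuinfo)
if [ "$siblings" = "$cores" ]; then
    echo "HT disabled ($cores cores, $siblings threads per package)"
else
    echo "HT enabled ($cores cores, $siblings threads per package)"
fi

The per-DC lists below are from the original one-liner.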

eqiad

analytics1001.eqiad.wmnet
analytics1002.eqiad.wmnet
analytics1026.eqiad.wmnet
analytics1027.eqiad.wmnet
bast1001.wikimedia.org
carbon.wikimedia.org
conf1001.eqiad.wmnet
conf1002.eqiad.wmnet
conf1003.eqiad.wmnet
db1021.eqiad.wmnet
db1049.eqiad.wmnet
db1050.eqiad.wmnet
db1051.eqiad.wmnet
db1052.eqiad.wmnet
db1053.eqiad.wmnet
db1054.eqiad.wmnet
db1055.eqiad.wmnet
db1056.eqiad.wmnet
db1057.eqiad.wmnet
db1059.eqiad.wmnet
db1060.eqiad.wmnet
db1061.eqiad.wmnet
db1062.eqiad.wmnet
db1063.eqiad.wmnet
db1064.eqiad.wmnet
db1065.eqiad.wmnet
db1066.eqiad.wmnet
db1067.eqiad.wmnet
db1068.eqiad.wmnet
db1069.eqiad.wmnet
db1070.eqiad.wmnet
db1071.eqiad.wmnet
db1072.eqiad.wmnet
db1073.eqiad.wmnet
dbproxy1006.eqiad.wmnet
dbproxy1008.eqiad.wmnet
dbstore1001.eqiad.wmnet
dbstore1002.eqiad.wmnet
es1013.eqiad.wmnet
ganeti1001.eqiad.wmnet
ganeti1002.eqiad.wmnet
ganeti1003.eqiad.wmnet
ganeti1004.eqiad.wmnet
iridium.eqiad.wmnet
labcontrol1001.wikimedia.org
labnet1001.eqiad.wmnet
labsdb1004.eqiad.wmnet
labsdb1005.eqiad.wmnet
labsdb1006.eqiad.wmnet
labsdb1007.eqiad.wmnet
labservices1001.wikimedia.org
labstore1001.eqiad.wmnet
labstore1002.eqiad.wmnet
lithium.eqiad.wmnet
logstash1001.eqiad.wmnet
logstash1002.eqiad.wmnet
logstash1003.eqiad.wmnet
lvs1001.wikimedia.org
lvs1002.wikimedia.org
lvs1003.wikimedia.org
lvs1004.wikimedia.org
lvs1005.wikimedia.org
lvs1006.wikimedia.org
mc1009.eqiad.wmnet
mc1010.eqiad.wmnet
mc1011.eqiad.wmnet
mc1012.eqiad.wmnet
mc1013.eqiad.wmnet
mc1014.eqiad.wmnet
mc1015.eqiad.wmnet
mc1016.eqiad.wmnet
ms-be1001.eqiad.wmnet
ms-be1003.eqiad.wmnet
ms-be1005.eqiad.wmnet
ms-be1008.eqiad.wmnet
ms-be1013.eqiad.wmnet
ms-be1014.eqiad.wmnet
ms-be1015.eqiad.wmnet
ms-be1021.eqiad.wmnet
ms-fe1001.eqiad.wmnet
ms-fe1002.eqiad.wmnet
ms-fe1003.eqiad.wmnet
ms-fe1004.eqiad.wmnet
netmon1001.wikimedia.org
notebook1002.eqiad.wmnet
oresrdb1002.eqiad.wmnet
osmium.eqiad.wmnet
radon.wikimedia.org
rdb1001.eqiad.wmnet
rdb1002.eqiad.wmnet
rdb1003.eqiad.wmnet
rdb1004.eqiad.wmnet
rdb1008.eqiad.wmnet
restbase1015.eqiad.wmnet
scandium.eqiad.wmnet
snapshot1001.eqiad.wmnet
stat1003.eqiad.wmnet
terbium.eqiad.wmnet
tin.eqiad.wmnet
wtp1001.eqiad.wmnet
wtp1002.eqiad.wmnet
wtp1003.eqiad.wmnet
wtp1004.eqiad.wmnet
wtp1005.eqiad.wmnet
wtp1006.eqiad.wmnet
wtp1007.eqiad.wmnet
wtp1008.eqiad.wmnet
wtp1009.eqiad.wmnet
wtp1010.eqiad.wmnet
wtp1011.eqiad.wmnet
wtp1012.eqiad.wmnet
wtp1013.eqiad.wmnet
wtp1014.eqiad.wmnet
wtp1015.eqiad.wmnet
wtp1016.eqiad.wmnet
wtp1017.eqiad.wmnet
wtp1018.eqiad.wmnet
wtp1019.eqiad.wmnet
wtp1020.eqiad.wmnet
wtp1021.eqiad.wmnet
wtp1022.eqiad.wmnet
wtp1023.eqiad.wmnet
wtp1024.eqiad.wmnet

codfw

acamar.wikimedia.org
achernar.wikimedia.org
auth2001.codfw.wmnet
baham.wikimedia.org
db2010.codfw.wmnet
db2011.codfw.wmnet
db2012.codfw.wmnet
db2016.codfw.wmnet
db2017.codfw.wmnet
db2018.codfw.wmnet
db2019.codfw.wmnet
db2023.codfw.wmnet
db2028.codfw.wmnet
db2029.codfw.wmnet
db2030.codfw.wmnet
heze.codfw.wmnet
labstore2001.codfw.wmnet
labtestservices2001.wikimedia.org
lvs2001.codfw.wmnet
lvs2002.codfw.wmnet
lvs2003.codfw.wmnet
lvs2004.codfw.wmnet
lvs2005.codfw.wmnet
lvs2006.codfw.wmnet
maps-test2001.codfw.wmnet
maps-test2002.codfw.wmnet
maps-test2003.codfw.wmnet
maps-test2004.codfw.wmnet
ms-be2011.codfw.wmnet
ms-be2013.codfw.wmnet
ms-be2014.codfw.wmnet
ms-be2015.codfw.wmnet
mw2126.codfw.wmnet
restbase2002.codfw.wmnet

esams

lvs3001.esams.wmnet
lvs3002.esams.wmnet
lvs3003.esams.wmnet
lvs3004.esams.wmnet
ms-be3001.esams.wmnet

ulsfo

bast4001.wikimedia.org
lvs4001.ulsfo.wmnet
lvs4002.ulsfo.wmnet
lvs4003.ulsfo.wmnet
lvs4004.ulsfo.wmnet
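For the record, fanning the check out over a host list can be as simple as a plain SSH loop (a sketch; "hosts.txt" is a hypothetical one-hostname-per-line file, and this is not necessarily how the lists above were generated):

while read -r host; do
    echo -n "$host: "
    ssh -o BatchMode=yes "$host" \
        "grep -q hypervisor /proc/cpuinfo && echo 'VM' || lscpu | grep 'Thread(s) per core'"
done < hosts.txt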

Event Timeline

Unfortunately, I don't know of a way to enable hyperthreading without rebooting the server into the BIOS. This means downtime for each of these hosts.

I found a command that is said to disable HT, "racadm set BIOS.ProcSettings.LogicalProc Disabled", but have not tried it. [http://www.gooksu.com/2015/04/27/racadm-quick-dirty-cheatsheet/]
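Per that cheatsheet, the enable sequence should presumably be the inverse set followed by a config job plus reboot (a sketch of the documented flow; these exact commands are what gets tested below):

racadm set BIOS.ProcSettings.LogicalProc Enabled
racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA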

Tested on spare server "gadolinium".

status before changes, disabled:

/admin1-> racadm get BIOS.ProcSettings.LogicalProc
[Key=BIOS.Setup.1-1#ProcSettings]
LogicalProc=Disabled

enable it:

/admin1-> racadm set BIOS.ProcSettings.LogicalProc Enabled
[Key=BIOS.Setup.1-1#ProcSettings]
RAC1017: Successfully modified the object value and the change is in 
       pending state.
       To apply modified value, create a configuration job and reboot 
       the system. To create the commit and reboot jobs, use "jobqueue" 
       command. For more information about the "jobqueue" command, see RACADM 
       help.

currently powered down?

/admin1-> racadm serveraction powerstatus
Server power status: OFF

power it up

/admin1-> racadm serveraction powerup
Server power operation successful

The setting now shows a "pending value", but it just stays pending:

/admin1-> racadm get BIOS.ProcSettings.LogicalProc
[Key=BIOS.Setup.1-1#ProcSettings]
LogicalProc=Disabled (Pending Value=Enabled)

Try scheduling a job, as the message above suggests, first without a reboot:

/admin1-> racadm jobqueue create BIOS.Setup.1-1
RAC1024: Successfully scheduled a job.
Verify the job status using "racadm jobqueue view -i JID_xxxxx" command.
Commit JID = JID_871255342135

shows up in jobqueue

/admin1-> racadm jobqueue view
-------------------------JOB QUEUE------------------------
[Job ID=JID_871255342135]
Job Name=Configure: BIOS.Setup.1-1
Status=Scheduled
Start Time=[Now]
Expiration Time=[Not Applicable]
Message=[JCP001: Task successfully scheduled.]
----------------------------------------------------------

but still "pending value"

/admin1-> racadm get BIOS.ProcSettings.LogicalProc
[Key=BIOS.Setup.1-1#ProcSettings]
LogicalProc=Disabled (Pending Value=Enabled)

Try scheduling a job _with_ a reboot; this fails:

/admin1-> racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA
ERROR: SUP002: Job creation failure. Retry the action. If this fails, reboot the iDRAC.

"racadm powercycle it", and now see during boot:

->  Task 1 of 4: BIOS Configuration (JID_871255342135)

    Progress: 3 %
    Elapsed Time: 00:00:26
    Worst Case Time: 10:00
    Task Status: Executing
    Last Status Message: Task in Progress

    Total Elapsed Time: 00:00:22
    Failed Task Count: 0
    Warning Task Count: 0
    Success Task Count: 0

    Legend: (+) Success   (!) Warning   (X) Failed   (.) Pending   -> In Progress

    Console Log:
      Collecting the list of tasks to be executed
      Task in Progress

Now at "Task 1 of 44" and running over 4 minutes.

...

Conclusion: it needs a reboot anyway and takes longer than just doing it the manual way in the BIOS. Not worth it?

And then, after 10 minutes or so: "Last Status Message: Task Failed .. Task Status: Failed". Yea.. well.. good that we tried :p

It gets better... now we have these pending and failing jobs that pop up at reboot, so I need to clean up again and delete them. Why multiple jobs? Attempts from the past?

/admin1-> racadm jobqueue view
-------------------------JOB QUEUE------------------------
[Job ID=JID_871255342135]
Job Name=Configure: BIOS.Setup.1-1
Status=Running
Start Time=[Now]
Expiration Time=[Not Applicable]
Message=[PR20: Job in progress]
----------------------------------------------------------
[Job ID=JID_871257159703]
Job Name=Configure: BIOS.Setup.1-1
Status=New
Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
----------------------------------------------------------
[Job ID=RID_871257159966]
Job Name=Reboot: Power cycle
Status=Pending Reboot
Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
----------------------------------------------------------
[Job ID=JID_871257636874]
Job Name=Configure: BIOS.Setup.1-1
Status=New
Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
----------------------------------------------------------
[Job ID=RID_871257637293]
Job Name=Reboot: Power cycle
Status=Pending Reboot
Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
----------------------------------------------------------

Trying to delete jobs from the job queue.

first attempt:

/admin1-> racadm jobqueue delete -i JID_871255342135
RAC1007: The job JID_871255342135 cannot be deleted.
       Wait until the specified job is completed, or reboot the system 
       to move the job to Completed state. To view the job status, use
       "racadm jobqueue view -i <Job ID>".

Second job:

/admin1-> racadm jobqueue delete -i JID_871257159703
RAC992: Invalid job: JID_871257159703.
       To monitor the current set of the jobs, use the "racadm jobqueue 
       view" command.

haha, what a mess

"racadm jobqueue delete -i JID_CLEARALL_FORCE" was supposedly for deleting all jobs, but also doesn't work in this version. RAC992: Invalid job: JID_CLEARALL_FORCE.

"racadm racreset" says it will reboot but doesn't disconnect me.

" racadm jobqueue delete -i JID_871257159703" simply hangs ... and does not return ...

So the test host may need an onsite tech to manually reset all its BIOS/DRAC settings and then set the BIOS/DRAC back up. I'd think that would clear all pending jobs on the iLOM interface. (If so, please create a task in the proper ops-onsite project with a link back to this task explaining what needs to happen!)
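For what it's worth, the iDRAC config can supposedly also be factory-reset remotely; untested here, and it wipes the iDRAC network settings too, so onsite access would likely still be needed afterwards:

racadm racresetcfg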

@RobH Ok, yep. Thanks. Created T158131 and linked that in the "spares" sheet.

Ottomata subscribed.

analytics102[67] will be decommissioned soon. Added T159742 for analytics100[12].