GE GOC's Troubleshooting
1 - Introduction
2 - How to check the basic functions of the IG’s?
3 - What are the basic ‘connections’ to an IG?
4 - Serial Over LAN [SOL]: useful commands / resets
5 - ipmitool: a ‘shell’ for the SOL connection
6 - How to identify which DARC/IG/VIG you have (Westville or Jarrell)?
7 - How to reflash the VRAC of VIG boards independently?
8 - How to check if the DIP/RAC or VDIP/VRAC are ‘responding’ [lspci]?
9 - How to check the DARC / IG’s hardware config?
10 - How to test the VDIP board (VCT’s)?
11 - How to test the VRAC board (VCT’s)?
12 - How to exercise the VIG’s in depth (VCT’s)?
13 - How to test the DIP board (non-VCT’s)
14 - How to test the RAC / IG boards (non-VCT’s)
15 - Test / verify the RAC flash code (non VCT IG’s)
Appendix 1
Appendix 2
1 - Introduction {TOC}
The purpose of this document is to gather the tools and diags that can be used on site or remotely to evaluate possible issues with the DARC / VDARC / IG’s / VIG’s components of a GOC3-4-5 console. It applies to LightSpeed, LS Pro, BrightSpeed and all VCT types. However, the main focus is VCT, so all examples provided come from VCT configurations (except a few specific to non-VCT consoles).
The host HP 8000/8200, SCSI tower, SDDA, Intercom and a few other components are NOT covered here.
This document covers the basic functions of the motherboards, the connections with the on-board PCI devices, the available diags, the Serial Over LAN [SOL] connections and utilities, and some ways to reflash / reconfigure some components.
Console Diagnostics can of course also be used to launch many of these diags; I do not describe Console Diagnostics in detail here (see Appendix 2). The purpose of the present doc is to list and describe the tools one by one, in relation to troubleshooting needs.
Color codes :
Bold : highlights, keywords.
Blue : whatever has to be typed
Green : good responses, or items of interest.
Red : errors or important warning.
{TOC} : link brings you back to the Table Of Contents
2 - How to check the basic functions of the IG’s? {TOC}
One of the common problems seen is that one of the IG’s ‘drops’ the Recon process that it should normally run. Several root causes can produce this: hardware problems, network problems, motherboard problems or misconfiguration of the IG firmware, most of which are covered in later sections.
Under normal conditions, each of the 3 IG’s runs an image_generation process. The check consists of listing those 3 processes on the DARC:
[ctuser@darc log]$ ps -lef | grep image_g
4 S ctuser 18919 18913 0 75 0 - 446 - May14 ? 00:00:00 rsh ig1 image_generation -bp 0 -host darc -node 1 -th 2
4 S ctuser 18922 18913 0 75 0 - 446 - May14 ? 00:00:00 rsh ig2 image_generation -bp 0 -host darc -node 2 -th 2
4 S ctuser 18925 18913 0 75 0 - 446 - May14 ? 00:00:00 rsh ig3 image_generation -bp 0 -host darc -node 3 -th 2
0 S ctuser 24126 22312 0 77 0 - 940 pipe_w 11:40 pts/0 00:00:00 grep image_g
[ctuser@darc log]$
GE Proprietary – May 2007 – Ph.Giudici - rev.0 p1
GOC3-4-5 T/S guide.
Here, the 3 processes are running, which is good. However, this does not prove that the hardware on each IG is fully functional. See the following sections for more hardware checks and diags.
If one of the 3 processes is not running, then you have a problem: either one IG did not start, or it encountered a severe error that crashed the process.
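The same check can be scripted on the DARC. This is a minimal sketch, assuming the ps -lef output format shown above, where each IG’s process line contains "rsh igN image_generation"; missing_image_gen is a hypothetical helper name, not a GE tool.

```shell
# Hypothetical helper: read `ps -lef` output on stdin and print the name of
# each IG that has no image_generation process.
# Assumes process lines look like "... rsh igN image_generation ...".
missing_image_gen() {
    ps_out=$(cat)
    for ig in ig1 ig2 ig3; do
        # The space before "image_generation" keeps "ig1" from matching "ig10".
        if ! printf '%s\n' "$ps_out" | grep -q "rsh $ig image_generation"; then
            echo "$ig"
        fi
    done
}
```

Typical use on the DARC: ps -lef | missing_image_gen. It prints nothing when all 3 processes are running, otherwise the name of each IG whose process is missing.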
3 - What are the basic ‘connections’ to an IG? {TOC}
The ONLY physical link between the DARC and the IG’s is the RJ45 cable, which carries both:
• the Ethernet connection (Linux processes, scan data, images…), once the IG has successfully booted
-AND-
• the Serial Over LAN connection, which should be active and alive at all times! It is used to remotely ‘start’ the IG boot. The RJ45 cable also goes to a control chip called the BMC (Baseboard Management Controller), which essentially deals with the hardware (CPU, fans, temperatures, power…) and remains reachable even when Linux is not running on the IG.
We ‘talk’ to this chip through the SOL connection {telnet localhost 623}, which gives the DPCCLI> prompt.
To check the connections:
1. Ping ig1, ig2 and ig3 from the DARC. A failing IG may still reply to the ping without properly executing Recon. If all 3 reply, try rebooting the failing IG using SOL (see below).
2. If one IG does not reply to the ping, it is likely NOT running Linux. However, the SOL should still be available.
3. Open an SOL connection to the failing IG with ‘telnet localhost 623’. If it replies with DPCCLI>, at least the hardware is still active. Run basic hardware checks on the IG, then reboot the board (see below).
4. If the SOL connection looks ‘dead’ (not connected), check whether the software service driving the SOL (dpcproxy) is active:
a. First try to open an SOL connection to another IG. If it fails, go to [b]; if it works, go to [5].
b. On the DARC, type:
[ctuser@darc ~]$ ps -elf | grep dpcproxy
1 S root 2735 1 0 75 0 - 916 - May14 ? 00:00:00 /usr/local/cli/dpcproxy
0 S ctuser 8757 8736 0 76 0 - 941 pipe_w 10:23 pts/0 00:00:00 grep dpcproxy
Here the /usr/local/cli/dpcproxy process (owned by root) is running; the second line is only the grep command itself.
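Steps 1 and 4b can be wrapped in two small shell functions. This is a sketch only: ping_igs and dpcproxy_running are hypothetical names, the ping options (-c count, -W timeout in seconds) are those of the Linux iputils ping, and the dpcproxy path is taken from the output above.

```shell
# Step 1: ping each IG once, with a 2-second timeout (Linux iputils ping).
ping_igs() {
    for ig in ig1 ig2 ig3; do
        if ping -c 1 -W 2 "$ig" >/dev/null 2>&1; then
            echo "$ig: replies to ping"
        else
            echo "$ig: NO reply, probably not running Linux (try the SOL)"
        fi
    done
}

# Step 4b: read `ps -elf` output on stdin and succeed only if the dpcproxy
# service is listed (the grep command line itself is filtered out).
dpcproxy_running() {
    grep -v 'grep' | grep -q '/usr/local/cli/dpcproxy'
}
```

On the DARC: ping_igs, then ps -elf | dpcproxy_running && echo "dpcproxy is running".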
4 - Serial Over LAN [SOL]: useful commands / resets {TOC}
IMPORTANT NOTICE: all details given in this chapter apply equally to the SOL connections
from the OC to the DARC
-AND-
from the DARC to the IG’s
The SOL prompt DPCCLI> offers a help command; the most used commands are in bold.
dpccli> help
alarm -q { -g <id> [ -o <id> ] [-a <id> ] } | { -l <severity> } | -p
alarm -s -a <id> -l <severity>
alarm -c { -g <id> [ -o <id> ] [-a <id> ] } | { -l <severity> } | all
boot -s normal | service [-f] [-console]
clear
clearlog
commit
console
displaylog [-F csv|hex|dsv] [-O filename] [-n <num>]
exit
quit
get -T <target> <options>
help [-C <command>]
id
identify [-on [<seconds>]] [-off]
identify -s
interrupt -i nonmask [-console]
power { -on [-console] } | -off
power -state
reset [-console]
sensors [-v] [-F csv|dsv] [-f ok|nc|cr|nr|us] [-T volt|temp|power|fan]
service { console | exit | ftp { start | stop } }
set -T <target> <options>
shutdown [-f] [-r]
version
dpccli>
The first thing to do is to run the sensors -v command, which reports all power supplies, fans, voltages, CPUs… on the motherboard. All sensors MUST be ok, as shown below:
dpccli> sensors -v
05/16/2007 | 04:04:42 | BB +1.2V Vtt | ok | 1.21 | Volts
05/16/2007 | 04:04:43 | +1.5V NIC Core | ok | 1.54 | Volts
05/16/2007 | 04:04:44 | BB +1.5v | ok | 1.51 | Volts
05/16/2007 | 04:04:45 | Memory Voltage | ok | 1.81 | Volts
05/16/2007 | 04:04:45 | BB +3.3V | ok | 3.37 | Volts
05/16/2007 | 04:04:46 | STBY +3.3V | ok | 3.32 | Volts
05/16/2007 | 04:04:47 | AUX +3.3V | ok | 3.25 | Volts
05/16/2007 | 04:04:48 | BB +5V | ok | 4.97 | Volts
05/16/2007 | 04:04:48 | STBY +5V | ok | 4.88 | Volts
05/16/2007 | 04:04:49 | BB +12V | ok | 11.97 | Volts
05/16/2007 | 04:04:50 | BB -12V | ok | -11.73 | Volts
05/16/2007 | 04:04:52 | Baseboard Temp | ok | 30.00 | Celsius
05/16/2007 | 04:04:52 | CPU 1 Blower | ok | 2485.00 | RPM
05/16/2007 | 04:04:53 | CPU 2 Blower | ok | 2414.00 | RPM
05/16/2007 | 04:04:54 | Add-in Blower | ok | 1917.00 | RPM
05/16/2007 | 04:04:55 | Exhst Fan | ok | 7171.00 | RPM
05/16/2007 | 04:05:00 | CPU 1 Temp | ok | 36.00 | Celsius
05/16/2007 | 04:05:01 | CPU 2 Temp | ok | 36.00 | Celsius
05/16/2007 | 04:05:02 | CPU 1 VRM | ok | 12.03 | Volts
05/16/2007 | 04:05:02 | CPU 2 VRM | ok | 12.03 | Volts
05/16/2007 | 04:05:10 | Proc 1 Vccp | ok | 1.25 | Volts
05/16/2007 | 04:05:11 | Proc 2 Vccp | ok | 1.25 | Volts
dpccli>
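On a long listing it is easy to miss a single bad line. The sketch below, a hypothetical helper named sensors_not_ok, assumes the sensors -v output has been saved from the SOL session into a text file and keeps the "date | time | name | status | value | unit" layout shown above; any status other than ok (the help lists nc, cr, nr and us for the -f filter) is printed.

```shell
# Hypothetical helper: read saved `sensors -v` output on stdin, print every
# sensor line whose status column (4th "|"-separated field) is not "ok",
# and return non-zero if at least one such line was found.
# Lines with fewer than 6 fields (e.g. the dpccli> prompt) are skipped.
sensors_not_ok() {
    awk -F'|' '
        NF >= 6 {
            status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status != "ok") { print; bad++ }
        }
        END { exit (bad > 0) }
    '
}
```

Example: sensors_not_ok < sensors.txt prints nothing and returns 0 when every reading is ok.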
You may use the displaylog command to display the on-board log of motherboard activity. The full log is huge and will likely scroll slowly for 3~5 minutes. It rarely contains relevant information, but a few occurrences of IERR (processor internal error) or POST (Power-On Self-Test) errors can be of some use. Use e.g. displaylog -n 50 to display the last 50 lines of the log.
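If the log has been captured to a text file (how it is captured, for instance the -O option listed in the help or a logged terminal session, is left open here), the IERR and POST entries can be counted instead of read line by line. count_log_events and the file name are hypothetical.

```shell
# Hypothetical helper: count whole-word IERR and POST occurrences in a
# saved copy of the displaylog output (file name given as $1).
count_log_events() {
    for kw in IERR POST; do
        # grep -c prints 0 and exits 1 when nothing matches; "|| true"
        # keeps that from aborting a script run with "set -e".
        n=$(grep -cw "$kw" "$1" || true)
        echo "$kw: $n"
    done
}
```

Example: count_log_events displaylog.txt prints one "IERR: n" and one "POST: n" line.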
The commands power -state, power -off and power -on let you check whether the board is ON, and turn it OFF or ON, from the DPCCLI> prompt. This is equivalent to pressing the On/Off switch on the front panel, but NOT the back panel switch: the back switch also removes power from the BMC chip, which is not the case for the front switch or the SOL command.
One of the common actions via the SOL connection is to reboot the unit.
This is done by typing:
dpccli> reset [enter]
dpccli> console [enter]
The former single command reset -c no longer works properly due to the DPCCLI version change that came with the MW29.7 release. You now need to type the 2 distinct commands described above. Expect ~30 sec. before you see the start of the reboot script. Characters may look scrambled on screen, depending on the type of connection used for the reboot (Xterm, VOLC…).
WARNING: these reset actions can be done via a VOLC or ISD remote connection, but our Telnet application does not properly handle the character flow during the boot process, and it WILL HANG. Don’t be surprised: the boot actually happens in the background. Disconnect/close the Telnet application, then reconnect to the site.
5 - ipmitool: a ‘shell’ for the SOL connection {TOC}
A utility called ipmitool can be used as a ‘shell’ over the SOL connection to control the IG’s and perform all the same operations as the DPCCLI> prompt (and even more). However, this tool is far from user-friendly, so we limit ourselves here to the basic checks and readings that seem of interest, with a reasonable amount of typing…
The help shows that the range of possibilities is large, including options that can get you into trouble. Use only what is described here!
-h This help
-V Show version information
-v Verbose (can use multiple times)
-c Display output in comma separated format
-I intf Interface to use
-H hostname Remote host name for LAN interface
-p port Remote RMCP port [default=623]
-U username Remote session username
-f file Read remote session password from file
-S sdr Use local file for remote SDR cache
-a Prompt for remote password
-e char Set SOL escape character
-C ciphersuite Cipher suite to be used by lanplus interface
-k key Use Kg key for IPMIv2 authentication
-L level Remote session privilege level [default=ADMINISTRATOR]
-A authtype Force use of auth type NONE, PASSWORD, MD2, MD5 or OEM
-P password Remote session password
-E Read password from IPMI_PASSWORD environment variable
-m address Set local IPMB address
-b channel Set destination channel for bridged request
-l lun Set destination lun for raw commands
-t address Bridge request to remote target address
-o oemtype Setup for OEM (use 'list' to see available OEM types)
-O seloem Use file for OEM SEL event descriptions
Interfaces:
open Linux OpenIPMI Interface [default]
imb Intel IMB Interface
lan IPMI v1.5 LAN Interface
lanplus IPMI v2.0 RMCP+ LAN Interface
Commands:
raw Send a RAW IPMI request and print response
i2c Send an I2C Master Write-Read command and print response
lan Configure LAN Channels
chassis Get chassis status and set power state
power Shortcut to chassis power commands
event Send pre-defined events to MC
mc Management Controller status and global enables
ipmitool -I lan -H ig1 -P '' sel : basic communication over SOL, and log status
ipmitool -H ig1 -P '' chassis power status : check the chassis power state (e.g. during boot)
ipmitool -H ig1 -P '' fru : lists board components
ipmitool -H ig1 -P '' sdr type temp : component temperature checks
ipmitool -H ig1 -P '' sel list last 15 : lists the last 15 lines of the on-board log
ipmitool -H ig1 -P '' sensor : lists all board sensors... many details. Look for ok’s
ipmitool -H ig1 -P '' sdr : simplified board sensors. Again, look for ok’s
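The last command can be run on all 3 IG’s at once, keeping only the entries that are not ok. This is a sketch assuming the usual ipmitool sdr layout "name | reading | status"; sdr_not_ok and check_all_igs_sdr are hypothetical names, and only the sdr command listed above is used.

```shell
# Hypothetical helper: read `ipmitool ... sdr` output on stdin and print the
# entries whose status (3rd "|"-separated column) is not "ok", e.g. "ns".
sdr_not_ok() {
    awk -F'|' 'NF >= 3 {
        s = $3
        gsub(/^[ \t]+|[ \t]+$/, "", s)
        if (s != "ok") print
    }'
}

# Run the simplified sensor reading on each IG and keep only the problems.
check_all_igs_sdr() {
    for ig in ig1 ig2 ig3; do
        echo "== $ig =="
        ipmitool -H "$ig" -P '' sdr | sdr_not_ok
    done
}
```

From the DARC, check_all_igs_sdr prints an "== igN ==" header per IG followed by any entry that needs attention; "ns" (no reading) entries are shown too and should be checked by hand.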
6 - How to identify which DARC/IG/VIG you have (Westville or Jarrell)? {TOC}
The dmidecode command (run as root) reads the board type. It returns a Version string that contains WV for Westville or JR for Jarrell. This works for all IG, VIG and DARC types in use today.
Note that the same tool on the HP host also gives you the BIOS version of the motherboard.
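To avoid scanning the full dmidecode listing by eye, the WV / JR markers can be picked out automatically. A minimal sketch, assuming the marker appears on a "Version:" line (which DMI record carries it is not stated here); board_family is a hypothetical helper.

```shell
# Hypothetical helper: read dmidecode output on stdin and print Westville,
# Jarrell or unknown, based on the first "Version:" line containing WV or JR.
board_family() {
    awk '
        /Version:/ && /WV/ { print "Westville"; found = 1; exit }
        /Version:/ && /JR/ { print "Jarrell";   found = 1; exit }
        END { if (!found) print "unknown" }
    '
}
```

As root: dmidecode | board_family. An "unknown" result means neither marker was found, so read the Version lines manually.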
7 - How to reflash the VRAC of VIG boards independently? {TOC}
The vrac_flash_update command can be run either from the OC or from the DARC. Applications must of course be down.
From the OC: it reflashes all IG’s without further confirmation.
From the DARC: you need to specify which IG you wish to reflash, as in the example below.
Note: in this case, the reflash failed due to a defective VRAC module. In such a case, the IG must be swapped.
This tool will check the FLASH versions of the VRAC on all
IGs in the system. If any version is not up to date the
VRAC will be updated at this time. APPLICATIONS MUST NOT BE
RUNNING WHEN THIS STARTS. YOU MUST NOT ABORT THIS PROCEDURE
ONCE STARTED. Are you sure you want to continue?
enter 'y' to continue
y
vrac_flash_update[5965]: started check of VRAC FLASH; argv=['ig1']
args= ['/usr/bin/vrac_popup', '-t', 'Updating VRAC FLASH
DO NOT INTERRUPT!']
Xlib: connection to "oc:0.0" refused by server
Xlib: No protocol specified
8 - How to check if the DIP/RAC or VDIP/VRAC are ‘responding’ [lspci]? {TOC}
On the DARC and the IG’s, lspci lets you verify that the PCI devices [DIP / VDIP / RAC / VRAC] are responding and correctly detected from the motherboard side.
Note: the 2 commands suggested here do the same thing; the 2nd one is easier to use.
Run it either from the DARC prompt to detect the DIP/VDIP, or from each IG’s prompt to detect the RAC/VRAC. The GE components are those that appear as ‘unknown’!!
Note that you also see here all the other components attached to the motherboard (Ethernet controllers, VGA cards…).
Good responses on these tests mean the PCI device is detected, not necessarily that it works properly. However, if a device is not listed, the system will certainly not work. In such a case, it can be productive to open the box (DARC, IG) and reseat all PCI connections, cables, etc… ESD equipment needed!
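Since the GE boards show up as ‘unknown’ devices, a quick filter makes a missing board obvious. This is a sketch assuming each GE board appears as one lspci line containing the word "unknown" (in any case); ge_pci_devices is a hypothetical helper, and it does not replace reading the full listing.

```shell
# Hypothetical helper: read lspci output on stdin, print the lines that
# mention "unknown" (the GE boards), and fail if there are none.
ge_pci_devices() {
    if ! grep -i 'unknown'; then
        echo "No 'unknown' (GE) PCI device found: check / reseat the board" >&2
        return 1
    fi
}
```

Run lspci | ge_pci_devices on the DARC for the DIP/VDIP, or on each IG for the RAC/VRAC.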
9 - How to check the DARC / IG’s hardware config? {TOC}
The ‘hinv’ OC command is not implemented in the DARC or IG software; however, a generic Linux command, lhinv, does exactly the same job. See the examples below on a VIG. This is also a convenient way to check the amount of memory detected on the motherboard, which is critical for proper operation.
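The detected memory can also be compared against an expected minimum from /proc/meminfo. This is a sketch: mem_at_least is a hypothetical helper, and the minimum is a parameter because the required amount per DARC/IG is not given in this guide. MemTotal is a little lower than the installed RAM (the kernel reserves some), so set the minimum slightly below the nominal size.

```shell
# Hypothetical helper: mem_at_least MIN_MB [FILE]
# Reads MemTotal (in kB) from FILE (default /proc/meminfo), prints it in MB
# and succeeds only if it is at least MIN_MB.
mem_at_least() {
    min_mb=$1
    file=${2:-/proc/meminfo}
    total_kb=$(awk '/^MemTotal:/ { print $2 }' "$file")
    total_kb=${total_kb:-0}
    total_mb=$((total_kb / 1024))
    echo "Detected memory: ${total_mb} MB (minimum ${min_mb} MB)"
    [ "$total_mb" -ge "$min_mb" ]
}
```

Example on an IG whose nominal memory is known: mem_at_least <expected MB minus a margin> || echo "memory missing: reseat DIMMs".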
10 - How to test the VDIP board (VCT’s)? {TOC}
A diag is available from the DARC prompt: vdip_menu. Launched without options, it displays a menu from which you can choose from a list of tests. Use -h to see the options. The -A option is probably the most convenient, as it chains all tests that do not need the loopback optical fiber. It takes ~30 sec.
Execute the test as root, otherwise the bandwidth test fails.
where:
-h displays this text
-l lists the tests
-v verbose mode
-r sets the number of times to run each test
-a executes all of the menu selections (Require loopback cables)
-A executes all of the menu selections except serial loopback tests
-t runs the test or tests that follow
Loop 1:
PASSED - 1 - Read Configuration Space Header
PASSED - 2 - Read contents of Board ID and PCI Rev.
PASSED - 3 - Read contents of Registers
PASSED - 4 - Data Pattern Test of SRAM A
PASSED - 5 - Data Pattern Test of SRAM B
PASSED - 6 - Walking Bit Test of SRAM A
PASSED - 7 - Walking Bit Test of SRAM B
PASSED - 8 - Address Pattern Test of SRAM A
PASSED - 9 - Address Pattern Test of SRAM B
PASSED - 10 - Read Test of FLASH
PASSED - 11 - Relay Loopback Test
Loop 1:
PASSED - 14 - DMA Test
PASSED - 15 - DMA Band Width Test
PASSED - 16 - Memory Test
PASSED - 17 - Interrupt Test
***********************************************************
RESULTS
0 of 2 Loops Contain Failures
ALL TESTS PASSED
***********************************************************
[root@darc ~]#
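When the run is logged (for example with tee), the verdict can be checked automatically. This is a sketch: diag_report_ok is a hypothetical helper, "ALL TESTS PASSED" comes from the report above, and the "FAILED" line prefix is an assumption, since only a successful run is shown here.

```shell
# Hypothetical helper: read a vdip_menu report on stdin; print any line
# starting with FAILED and succeed only if there is none and the report
# contains "ALL TESTS PASSED".
diag_report_ok() {
    awk '
        /^[ \t]*FAILED/    { fail++; print }
        /ALL TESTS PASSED/ { all = 1 }
        END { exit !(all && fail == 0) }
    '
}
```

As root on the DARC: vdip_menu -A 2>&1 | tee vdip.log | diag_report_ok && echo "VDIP OK".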
11 - How to test the VRAC board (VCT’s)? {TOC}
A diag is available from the IG’s prompt: vrac_menu. Launched without options, it displays a menu from which you can choose from a list of tests. Use -h to see the options. The -a option is probably the most convenient, as it chains all tests. It takes ~2 min.
This test exercises the VRAC with various Recon modes. A successful test means the VRAC performs back-projection tasks properly; however, the IG itself can still cause problems such as ‘dropping’ the Recon process, which is described in another chapter.
12 - How to exercise the VIG’s in depth (VCT’s)? {TOC}
A diagnostic on the DARC called test_igs goes to each IG in sequence and runs many Recon tasks in various Recon modes (thin slice, cone beam, cardiac…). This is a complete test of the Recon hardware. It takes ~20 minutes for 3 IG’s.
Note that the test also gives you details about the type of VRAC, firmware rev, etc.
[ctuser@darc ~]$ test_igs
###########################################
IG-1
rsh ig1 OK
ig1 has a VRAC
2395084 G [FPGA LOAD 1] PCI Rev 128 PBC Rev 452 BPP Rev 355
1. VRAC Board Level Diagnostics
..............
ALL TESTS PASSED
2. VRAC Backprojection Tests
....
ALL TESTS PASSED
3. Helical Backprojection Stess Test
Successfully sent LUTs to the VRAC
In cone beam mode
Aligned 1710 channels to 1712 channels
Image 0 : 8bcee099
Image 1 : 80514510
Image 2 : 743daa46
Image 3 : 67e7f940
Image 4 : 5b7ce5da
.........................
PASSED - Run Helical Simulation Cone Beam
4. Cardiac Backprojection Stess Test
Successfully sent LUTs to the VRAC
In cone beam mode
Aligned 1710 channels to 1712 channels
Image 0 : 8bcee099 88effbca 86214b64 83317960
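Over ~20 minutes the output is long, so a per-IG summary of a saved log helps. This is a sketch: summarize_test_igs is a hypothetical helper relying on the "IG-n" header lines shown above; lines containing FAIL are counted as failures, which is an assumption since only passing runs are shown here.

```shell
# Hypothetical helper: read a saved test_igs log on stdin and print, for each
# "IG-n" section, how many lines contain PASSED and how many contain FAIL.
summarize_test_igs() {
    awk '
        /^IG-[0-9]+/         { ig = $1; order[++n] = ig; next }
        ig != "" && /PASSED/ { pass[ig]++ }
        ig != "" && /FAIL/   { fail[ig]++ }
        END {
            for (i = 1; i <= n; i++) {
                g = order[i]
                printf "%s: %d passed, %d failed\n", g, pass[g], fail[g]
            }
        }
    '
}
```

Example: test_igs 2>&1 | tee test_igs.log, then summarize_test_igs < test_igs.log.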
13 - How to test the DIP board (non-VCT’s) {TOC}
The DIP board on LS / LS Pro / BrightSpeed… systems is attached to the DARC as a PCI device. First see the lspci command description. Then, to run diags on the DIP, use the DipDiag utility:
[ctuser@darc ~]$ DipDiag -h
usage: DipDiag [options]
options:
-a, --all,
-x, --execute, 1,2,3,...
-n, --do-not-execute, 1,2,3,...
-i, --iterations, N
-q, --quiet,
-p, --stop-on-error,
-h, --help,
Test numbers are as follows
1 - X-ray Enable Relay
2 - Test FEC Bit
3 - Test EDDR Bit
4 - Test Magic Number Register
5 - Memory Test
6 - Interrupt Test
7 - External LoopBack Test
8 - DMA Test
Without options, the test runs in interactive mode.
Use the DipDiag -a option, which chains all tests automatically.
WARNING: as you probably won’t have the loopback optical fiber installed, tests 7 & 8 WILL FAIL.
This is normal. These tests MUST PASS when the loopback fiber is installed.
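The warning above can be built into a non-interactive call, so that tests 7 and 8 only run when the loopback fiber is present. This is a sketch: dipdiag_tests is a hypothetical helper, and the comma-separated list passed to -x is inferred from the "-x, --execute, 1,2,3,..." help line, not confirmed.

```shell
# Hypothetical helper: print the DipDiag test numbers to pass to -x.
# Tests 7 (External LoopBack) and 8 (DMA) need the loopback optical fiber,
# so they are included only when called with the argument "loopback".
dipdiag_tests() {
    if [ "$1" = "loopback" ]; then
        echo "1,2,3,4,5,6,7,8"
    else
        echo "1,2,3,4,5,6"
    fi
}
```

Without the fiber: DipDiag -x "$(dipdiag_tests)"; with the fiber installed: DipDiag -x "$(dipdiag_tests loopback)".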
14 - How to test the RAC / IG boards (non-VCT’s) {TOC}
The RAC is the PCI device on top of the IG motherboard in charge of Recon backprojection, etc. First see the lspci command description. Previous software releases had a ‘racdiag’ utility on the IG’s, which was not user-friendly. On current releases (06MW03.x, 06MW29.7 and beyond), a macro is now available on the OC host: racdiags. This diag requires the Applications to be shut down. Note that the macro tries to run diags on all IG’s, and therefore reports failures for IG2 & IG3 if they are not present (as below). You can also run several loops of this diag, e.g. racdiags -i3 for 3 loops.
15 - Test / verify the RAC flash code (non VCT IG’s) {TOC}
The ‘flash_rac’ utility is used on the IG's to check the integrity of the RAC flash memory.
This is valid with releases 06MW03.5, 06MW29.7 & beyond.
Appendix 2 {TOC}
Console Diags runs many diags and tests simultaneously on the HP, SCSI Tower, DARC, IG… and is configurable as to which tests to run, the number of loops, etc. It uses a graphical X Window interface, and therefore is not primarily oriented to remote operations, even if that is possible.
To run the diags:
• From the CSD page, with the Proprietary key: click on the Console Diags link.
• From a Linux shell with Apps down, as root: type diagsSetup.sh
• With releases 07MW11.x, from a Linux shell with Apps down, as root: type ConsoleDiags
The tool opens as many Xterm windows as tests run, and you see the execution and results in each window.
WARNING: if SCSI tower diags are selected, you need MOD & DVD media inserted, and they will be overwritten!