xFusion FusionServer

Typical Fault Handling and Troubleshooting


Contents

1 Locating common server problems
2 Troubleshooting
   2.1 Information Collection
   2.2 Fault location
3 High-Risk Operation Scenarios


Locating common server problems

Locate common server problems step by step:

Procedure                          Note
Preparations                       ✓ Manuals and user guide
                                   ✓ Tools
Collecting Information             Collect complete information that is helpful for fault diagnosis and location.
Diagnosing and Rectifying Faults   Use the fault locating method to find the root cause of the fault, then formulate and implement troubleshooting plans.
Obtain Technical Support           If the documents cannot help you locate or rectify the fault, contact technical support.
Information to Collect for Troubleshooting

When a server is faulty, collect the following information for fault diagnosis:

Basic information
• Server model
• Server configuration
• Fault occurrence time
• Fault symptom
• Other related details

Server hardware logs
• Collect the server iBMC logs.
• Use them to identify server hardware or firmware faults.

OS or business logs
• Include operating system and service software logs.
• Use them to analyze driver, operating system, or software problems.
Basic Information: Fault Information Collection

Collect the following information for fault diagnosis:

Server Fault Records

Server model: 2288H V5
Server SN: 2102310XXXXX
Server hardware configuration: CPU, memory, RAID, and network adapter models, etc.
OS and service software versions: e.g. SUSE 11 SP1 64-bit, Oracle 10u2, etc.
Fault occurrence time: xxxx Year xx Month xx Day xx Hours xx Minutes xx Seconds
Fault symptom: e.g. the system automatically restarts during installation
Actions before fault occurrence: e.g. modify the BIOS WatchDog
Actions taken after the fault and results: e.g. removed and reinstalled the power cable, the fault persists; replaced the CD-ROM, the fault persists, etc.
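The fault record above can also be kept in a structured form so that gaps are spotted before contacting support. The following is a minimal Python sketch; the class name FaultRecord and all field names are illustrative assumptions and are not part of any xFusion tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FaultRecord:
    """Hypothetical container mirroring the rows of the fault record table."""
    server_model: str = ""
    server_sn: str = ""
    hardware_configuration: str = ""
    os_and_software_versions: str = ""
    occurrence_time: str = ""
    symptom: str = ""
    actions_before_fault: str = ""
    actions_after_fault: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of the fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


record = FaultRecord(
    server_model="2288H V5",
    symptom="The system automatically restarts during installation",
)
# Lists the rows that still need to be filled in before the record is complete.
print(record.missing_fields())
```

Keeping every row as a plain string avoids guessing at formats (for example, the time stamp) that the table does not prescribe.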
Device Indicators and Error Codes

Indicator/Button: Fault diagnosis LED
Description:
  ---: The device is operating normally.
  Error code: A component is faulty. For more information, see FusionServer Rack Server iBMC Alarm Handling 04. Link:
  https://support.xfusion.com/support/#/en/docOnline/EDOC1000054724?path=en-us_concept_0000001133310193

Indicator/Button: Power button/indicator
Description:
  Power indicator:
    Off: The device is not powered on.
    Steady green: The device is powered on.
    Blinking yellow: The iBMC is starting. The power button is locked and cannot be pressed. The iBMC is started in about 1 minute, and then the power indicator is steady yellow.
    Steady yellow: The device is standby.
  Power button:
    When the device is powered on, you can press this button to gracefully shut down the OS.
    NOTE: For different OSs, you may need to shut down the OS as prompted.
    When the device is powered on, holding down this button for 6 seconds will forcibly power off the server.
    When the device is standby, you can press this button to power on the device.

Indicator/Button: UID button/indicator
Description:
  The UID button/indicator helps identify and locate a device.
  UID indicator:
    Off: The device is not being located.
    Blinking or steady blue: The device is being located.
  UID button:
    You can control the UID indicator status by pressing the UID button or using the iBMC.
    You can press this button to turn on or off the UID indicator.
    You can press and hold down this button for 4 to 6 seconds to reboot the iBMC.

Indicator/Button: Health status indicator
Description:
  Off: The device is powered off or is faulty.
  Blinking red at 1 Hz: A major alarm has been generated on the system.
  Blinking red at 5 Hz: A critical alarm has been generated on the system.
  Steady green: The server is operating properly.

Indicator/Button: LOM port connection status indicator
Description:
  Each indicator shows the connection status of an Ethernet LOM port.
  Off: The network port is not in use or has failed.
  Steady green: The network port is properly connected.
  NOTE:
  • The indicators correspond to two 10GE and two GE network ports on the mainboard.
  • The LOM has a standby power supply and will not be powered off even if the service system is powered off. As long as the LOM ports are properly connected to other working network devices, the network ports will remain connected and the indicators are on.
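As an illustration of how the health status indicator states listed above map to alarm severities, the following Python sketch encodes them as a lookup table. The function name describe_health and the (color, blink rate) key layout are assumptions made for this example; only the four states and their descriptions come from the table.

```python
# Health status indicator states, keyed by (color, blink rate in Hz).
# A blink rate of None means the indicator is steady (or off).
HEALTH_INDICATOR = {
    ("off", None): "The device is powered off or is faulty.",
    ("red", 1): "A major alarm has been generated on the system.",
    ("red", 5): "A critical alarm has been generated on the system.",
    ("green", None): "The server is operating properly.",
}


def describe_health(color, blink_hz=None):
    """Return the documented meaning of a health indicator state.

    States that the table does not list are reported as unknown
    rather than guessed.
    """
    return HEALTH_INDICATOR.get((color.lower(), blink_hz), "Unknown indicator state.")


print(describe_health("red", 5))
```

A state that is absent from the table returns the fallback text, which signals that the iBMC alarm list should be consulted instead.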
iBMC: Collecting Alarm and Event Information

Path: Maintenance > Alarm & SEL. Click System Event to obtain the log information.

iBMC: Collecting Server Hardware Logs

Common iBMC Log Directories

File content: System current health event
Description: Current system health
Path: dump_info\AppDump\sensor_alarm\current_event.txt

File content: SEL log
Description: System event record, with the sensor events and alarms recorded by the BMC
Path: dump_info\AppDump\sensor_alarm\sel\eo_sel.csv

File content: BMC version
Description: Information obtained by the command "ipmcget -d version"
Path: dump_info\RTOSDump\versioninfo\app_revision.txt

File content: FRU information
Description: Information obtained by the command "ipmcget -d fruinfo", including the Product Serial Number
Path: dump_info\RTOSDump\versioninfo\fruinfo.txt

File content: BMC network information
Description: BMC network information obtained by common commands such as "ifconfig", "netstat", and "route"
Path:
  dump_info\RTOSDump\networkinfo\ifconfig_info
  dump_info\RTOSDump\networkinfo\ipinfo_info
  dump_info\RTOSDump\networkinfo\netstat_info
  dump_info\RTOSDump\networkinfo\resolv.conf
  dump_info\RTOSDump\networkinfo\route_info

File content: FDM log
Description: Troubleshooting function logs, which require parsing to view
Path:
  dump_info\LogDump\FDMDecoder\fdm_log
  dump_info\LogDump\FDMDecoder\fdm_output

File content: Operate log
Description: Record of configuration operations, disk plug-in/removal, etc.
Path: dump_info\LogDump\operate_log

File content: Maintenance log
Description: Mainboard voltage fault, system power drop, RAID controller card out-of-band communication, disk fault, etc.
Path: dump_info\LogDump\maintenance_log

File content: Card information
Description: PCIe card, riser card, RAID card, disk backplane, and mainboard information
Path: dump_info\AppDump\card_manage\card_info

File content: CPU information
Description: View the CPU model
Path: dump_info\AppDump\CpuMem\cpu_info

File content: Memory information
Description: View the memory model
Path: dump_info\AppDump\CpuMem\mem_info

File content: RAID information
Description: View RAID card, logical drive, and disk information
Path: dump_info\AppDump\StorageMgnt\RAID_Controller_Info.txt

File content: Sensor information
Description: View temperature, fan speed, disk location, etc.
Path: dump_info\AppDump\sensor_alarm\sensor_info.txt

File content: BIOS configuration information
Description: View BIOS configuration parameters
Path:
  dump_info\AppDump\BIOS\currentvalue.json
  dump_info\AppDump\BIOS\registry.json

File content: X86 OS related information
Description: Serial port log, last screen capture, etc.
Path: dump_info\OSDump\ (directory)
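After a one-click iBMC log package has been extracted, it can be useful to confirm that the files listed above are actually present before starting the analysis. The sketch below is a minimal Python example under stated assumptions: the package has already been unpacked into a local directory, the function name missing_key_files is hypothetical, and the selection of "key" files is an arbitrary subset of the paths in the table.

```python
from pathlib import Path, PureWindowsPath

# A subset of the paths from the table, written exactly as documented
# (backslash-separated, relative to the extraction directory).
KEY_FILES = [
    r"dump_info\AppDump\sensor_alarm\current_event.txt",
    r"dump_info\AppDump\sensor_alarm\sel\eo_sel.csv",
    r"dump_info\RTOSDump\versioninfo\app_revision.txt",
    r"dump_info\RTOSDump\versioninfo\fruinfo.txt",
    r"dump_info\LogDump\operate_log",
    r"dump_info\LogDump\maintenance_log",
]


def missing_key_files(extract_root):
    """Return the documented paths that do not exist under extract_root."""
    root = Path(extract_root)
    missing = []
    for rel in KEY_FILES:
        # Split on backslashes so the check also works on Linux and macOS.
        candidate = root.joinpath(*PureWindowsPath(rel).parts)
        if not candidate.exists():
            missing.append(rel)
    return missing


# Example: check the current directory; prints the paths that were not found.
print(missing_key_files("."))
```

The paths are parsed with PureWindowsPath so that the backslash notation used in the table can be kept verbatim regardless of the operating system on which the check runs.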
Collecting Operating System Logs

OS: Windows, Linux
Collection method:
Use FusionServer Tools to collect Windows and Linux (RHEL, SLES, CentOS, and Ubuntu) logs. For details, see FusionServer Tools 2.3.3 User Guide 01.

OS: VMware
Collection method:
• If the purple screen of death (PSOD) does not occur, perform the following steps:
  1. Log in to the ESX server console as the root user.
  2. Run the vm-support command to collect all VMware logs.
  3. After logs are collected, check that a log file in the esxsupport-YYYY-MM-DD@HH-MM-SS.tgz format is generated in the /var/tmp directory.
• If the PSOD occurs and the customer retains the site environment, perform the following steps:
  1. Capture a screenshot of the PSOD or take a photo to save the displayed information.
  2. Press Alt+F12 to switch to forcible memory information output mode, press Alt+PageUp/Alt+PageDown to move to the last several log screens after the PSOD occurs, and take screenshots or photos of these screens.
  3. Hot-restart the system, and run the vm-support command to collect all VMware logs.
  4. After logs are collected, check that a log file in the esxsupport-YYYY-MM-DD@HH-MM-SS.tgz format is generated in the /var/tmp directory.
• If the PSOD occurs and the customer hot-restarts the system, run vm-support to collect all of the VMware logs and check that a log file in the esxsupport-YYYY-MM-DD@HH-MM-SS.tgz format is generated in the /var/tmp directory.

Notes:
• Obtain the customer's written authorization before collecting information.
• Logs collected by FusionServer Tools may contain sensitive customer information. If sensitive customer information is involved, obtain the customer's written authorization before performing any maintenance operation.
• Use Compatibility Checker to query the OSs supported by servers.
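The check that a VMware log file was generated can be automated once the contents of /var/tmp are accessible. The following Python sketch assumes the file name format stated above (esxsupport-YYYY-MM-DD@HH-MM-SS.tgz); the function name find_support_bundles is hypothetical, and the actual file name may differ between VMware versions.

```python
import re
from pathlib import Path

# File name format taken from the collection steps above.
BUNDLE_RE = re.compile(r"^esxsupport-\d{4}-\d{2}-\d{2}@\d{2}-\d{2}-\d{2}\.tgz$")


def find_support_bundles(directory="/var/tmp"):
    """Return matching bundle file names, oldest first.

    The time stamp fields are zero-padded, so sorting the names
    alphabetically also sorts them chronologically.
    """
    d = Path(directory)
    if not d.is_dir():
        return []
    return sorted(p.name for p in d.iterdir() if BUNDLE_RE.match(p.name))


bundles = find_support_bundles()
print(bundles[-1] if bundles else "No esxsupport bundle found")
```

An empty result means either that vm-support has not finished or that the bundle was written under a different name, so the directory should then be inspected manually.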
Common Fault Locating Methods

⚫ Use the troubleshooting tool
⚫ Rectify the fault based on the alarm
⚫ Locate the fault according to the fault code
⚫ Locate the fault based on the indicator
⚫ Rectify the fault based on the symptom
   1. Minimum test
   2. Cross test with same part
   3. Add or reduce parts
⚫ Contact xFusion Technical Support for log analysis
Common Risky Operation Scenarios

Product group: Rack Server
Risk level: High Risk (4), Medium Risk (2), Low Risk (1)

High Risk
✓ Persistent problems unresolved after a part is replaced twice
✓ Rectifying at least two faulty drives on the same device and the RAID exception caused by the faulty drives
✓ Restoring data as required by customers in the case of RAID configuration loss

Medium Risk
✓ Performing operations related to customers' data, such as online RAID capacity expansion
✓ Repeated drive faults on the same device
✓ Repeated faults on the same device, which cannot be resolved
✓ Batch problems on at least two devices, which affect customers' services
✓ Customer service interruption
✓ Power failure of multiple devices due to circuit breaker tripping
✓ Other insecure factors, such as burning and smoke
✓ Upgrading the firmware of a RAID controller card or NIC
✓ Burning the MAC address of a NIC
✓ Migrating an equipment room

Low Risk

Parts replacement:
✓ Power Module
✓ Power Backplane
✓ Fan
✓ Mainboard
✓ CPU
✓ Memory
✓ Hard Disk
✓ Hard Disk Backplane
✓ RAID Controller Card and BBU
✓ PCIe Expansion Card and an HBA
✓ Heat Sink
✓ CD-ROM Drive
✓ IO Board and QPI Card
✓ All cables
✓ All replaceable FRU units in the spare parts list

Basic configuration and maintenance operations:
✓ Obtaining product documentation
✓ Obtaining a compatibility list and spare part BOM list
✓ Configuring and upgrading a BIOS
✓ Operating and upgrading a BMC
✓ Installing an OS and drivers and collecting system logs
✓ NIC, FC, and HBA configuration, driver installation, and status query
✓ Configuring a RAID controller card and collecting RAID controller card logs
✓ Querying BBU status and locating faults
✓ Locating drive faults
✓ Restoring a single faulty drive and the RAID degradation caused by the faulty drive
✓ Locating and rectifying CPU and DIMM faults
✓ Locating and rectifying PSU faults
✓ Locating and rectifying fan module faults
✓ Locating faults based on alarms and events in BMC logs
✓ Collecting onsite fault information based on Server Troubleshooting
✓ Burning a product SN, device SN, and customized information
✓ Querying server OS/part compatibility
✓ DIMM installation rules
✓ Device routine maintenance
Common Risky Operations (1)

Action: Replacing a SATA hard drive
  Scenario: The hard disk to be replaced does not contain data.
    Power off the device: no
    Data backup: no
  Scenario: The hard disk to be replaced has data, and it is not in a RAID group or is in a non-redundant RAID group (for example, RAID 0).
    Power off the device: yes
    Data backup: yes
  Scenario: The hard disk to be replaced has data, it is in a redundant RAID group, and replacing the hard disk does not cause a RAID group fault.
    Power off the device: no
    Data backup: no
  Other precautions:
    1. Before replacing a hard disk, ensure the slot of the hard disk is correct. Otherwise, services may be interrupted or data may be lost. You can use the RAID controller card tool to turn on the indicator of the hard disk on the OS to check the position.
    2. If the hard disk in the non-redundant RAID group is faulty, data cannot be backed up. If the customer needs to retrieve data, contact the data recovery company.

Action: Replacing a power module
  Scenario: The server is fully configured with PSUs.
    Power off the device: no
    Data backup: no
  Other precautions: Ensure that the remaining power modules are properly powered and the rated power is greater than or equal to the rated power of the server.

Action: Replacing a fan
    Power off the device: no
    Data backup: no

Action: Replacing a RAID controller card
    Power off the device: yes
    Data backup: no
  Other precautions:
    1. If the component name of the RAID controller card to be replaced is SRxxx or SPxxx and the component name of the RAID controller card to be installed is XRxxx or XPxxx, upgrade the iBMC firmware and RAID controller card driver to the latest version before the replacement.
    2. If the customer's software depends on the RAID controller card model and PCI ID, the customer's software needs to be adapted again.
Common Risky Operations (2)

Action: Replacing a CPU
    Power off the device: yes
    Data backup: no
  Other precautions:
    1. Do not wear ESD gloves. Otherwise, the ESD gloves may hang onto the pins of the processor base and damage the processor base.
    2. The BOM numbers of CPUs on a server must be the same.

Action: Replacing memory
    Power off the device: yes
    Data backup: no
  Other precautions: The BOM numbers of the DIMMs on a server must be the same or must be replaceable.

Action: Replacing a flexible I/O card
    Power off the device: yes
    Data backup: no
  Other precautions: After the flexible I/O card is replaced, the MAC address of the original flexible I/O card changes. If the flexible I/O card needs to be used on the service side, you need to reconfigure the flexible I/O card.

Action: Replacing a hard disk backplane
    Power off the device: yes
    Data backup: no

Action: Replacing a PCIe card
    Power off the device: yes
    Data backup: no
  Other precautions: If the model of the new PCIe card is different from that of the original PCIe card, reinstall the corresponding PCIe card driver.

Action: Replacing a mainboard
    Power off the device: yes
    Data backup: yes
  Other precautions: If the current iBMC/BIOS parameters have been adjusted based on service scenarios, replacing the mainboard will cause the original iBMC/BIOS parameter configuration to be lost. In this case, you need to export the configuration backup in advance. If the configuration cannot be exported, you need to configure the iBMC/BIOS parameters again after the replacement.
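The power-off and data backup requirements from the risky operation tables can be captured in a small lookup so that a checklist is produced before a part is replaced. This is a minimal Python sketch: the names Requirement, REPLACEMENT_REQUIREMENTS, and checklist are assumptions made for the example. The SATA hard drive is left out because its requirements depend on the scenario, and the power module entry assumes a server fully configured with PSUs.

```python
from typing import NamedTuple


class Requirement(NamedTuple):
    power_off: bool    # "Power off the device" column
    data_backup: bool  # "Data backup" column


# Values transcribed from the risky operation tables.
REPLACEMENT_REQUIREMENTS = {
    "power module": Requirement(False, False),  # fully configured with PSUs
    "fan": Requirement(False, False),
    "raid controller card": Requirement(True, False),
    "cpu": Requirement(True, False),
    "memory": Requirement(True, False),
    "flexible i/o card": Requirement(True, False),
    "hard disk backplane": Requirement(True, False),
    "pcie card": Requirement(True, False),
    "mainboard": Requirement(True, True),
}


def checklist(part):
    """Return the ordered steps for replacing the given part."""
    req = REPLACEMENT_REQUIREMENTS[part.lower()]
    steps = []
    if req.data_backup:
        steps.append("Back up data")
    if req.power_off:
        steps.append("Power off the device")
    steps.append(f"Replace the {part}")
    return steps


print(checklist("Mainboard"))
```

A part that is missing from the lookup raises KeyError, which is intentional: an unlisted part should be checked against the documentation rather than assumed to be safe.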
xFusion Support Website Resources

No.: 1
Material: Server Maintenance and Service Guide
Description: Server's appearance, physical structure, product specifications, and methods for removing and installing components.
Link:
  1. Visit the Technical Support Website: https://support.xfusion.com/support/#/en/home
  2. Go to the corresponding page, for example, Rack Servers > 1288H V5.
  3. On the Documentation tab page, choose Operation & Maintenance > Maintenance and Service Guide to download the required user guide.

No.: 2
Material: Server Hardware Installation and Parts Replacement videos
Description: How to replace each part of a server in detail.
Link: https://support.xfusion.com/support/#/en/multimedia

No.: 3
Material: Server Troubleshooting
Description: How to collect logs, diagnose faults, upgrade software, perform preventive maintenance and common operations, and collect the information required to troubleshoot high-density, rack, E9000, E6000, and heterogeneous servers.
Link: https://support.xfusion.com/support/#/en/docOnline/EDOC1000087836?path=en-us_topic_0000001135985263

No.: 4
Material: FusionServer Rack Server iBMC Alarm Handling
Description: This document describes iBMC alarms in terms of the meaning, impact on the system, possible causes, and handling suggestions.
Link: https://support.xfusion.com/support/#/en/docOnline/EDOC1000054724?path=en-us_concept_0000001133310193

No.: 5
Material: Server High-Risk Operations
Description: This document describes the high-risk operations and solutions for intelligent computing products.
Link: https://support.xfusion.com/support/#/en/docOnline/EDOC1100080946?path=en-us_topic_0000001140408219

Thank you. Fusion X, Digital Infinity

Copyright © 2022 xFusion Digital Technologies Co., Ltd. All Rights Reserved.

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. xFusion may change the information at any time without notice.