CN109117299B - Error detecting device and method for server - Google Patents
- Publication number
- CN109117299B (application CN201710487094.3A)
- Authority
- CN
- China
- Prior art keywords
- system management
- processing unit
- address space
- memory
- identification code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1008—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
- G06F11/1064—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices in cache or content addressable memories
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a debugging device and a debugging method for a server. In the debugging method, a processing unit operates in a system management mode according to an interrupt signal. In the system management mode, the processing unit executes a basic input/output system (BIOS) code, executes the debugging program in the BIOS code that corresponds to an identification code stored in a second address space of a memory unit of a memory module, and generates debug data. Still in the system management mode, the processing unit stores the debug data in a third address space of the memory unit. A first address space of the memory unit stores a plurality of serial presence detect (SPD) data.
Description
Technical Field
The present invention relates to an error detecting device and an error detecting method for a server, and more particularly, to an error detecting device and an error detecting method for a server that includes a serial presence detect (SPD) memory.
Background
In a conventional server, a designer may build debug code into the Basic Input/Output System (BIOS) code and execute that debug code to perform debugging when the server encounters a boot error. In practice, however, a boot error has many possible causes. When debugging requires different debug code, the debugger must redesign the BIOS code and update the server with the redesigned BIOS code, which is time-consuming.
Another debugging method connects a pin of the processing unit to a display, so that the processing unit can show a numeric code on the display at designated stages of the boot process and the debugger can debug according to that code. However, a numeric code can carry only a limited amount of information, so the debugger often cannot accurately infer the cause of the boot error from it. Moreover, the code cannot be retrieved again after it has been displayed, which makes debugging even more difficult.
A further debugging method adds an extended debug port (XDP) to the motherboard of the server. The XDP can communicate with other units in the server, and the debugger can read the states of those units through the XDP. However, adding the XDP increases the production cost of the motherboard, and pins for communicating with the XDP must be added to the other units in the server, which increases the production cost of the server as a whole.
Disclosure of Invention
In view of the above, the present invention provides an error detecting device for a server and an error detecting method thereof.
The invention therefore provides an error detecting device for a server, which comprises a memory module, a basic input/output system (BIOS) memory, and a processing unit. The memory module includes a memory unit, which includes a first address space for storing a plurality of serial presence detect (SPD) data, a second address space for storing an identification code corresponding to a debugging program, and a third address space. The BIOS memory stores a BIOS code containing the debugging program. The processing unit is coupled to the BIOS memory and operates in a system management mode according to an interrupt signal. In the system management mode, the processing unit executes the debugging program in the BIOS code according to the identification code to generate debug data, and stores the debug data in the third address space of the memory unit.
In one embodiment, the memory unit includes a system management bus (SMBus) interface or an inter-integrated circuit (I2C) bus interface, and the memory unit outputs the debug data through the SMBus interface or the I2C bus interface.
In one embodiment, the error detecting device further comprises a system management bus connected to the memory module and the system management bus interface. In the system management mode, the processing unit turns off a temperature data output function of the memory module, so that the memory module cannot transmit temperature data through the system management bus while the processing unit is in the system management mode.
In one embodiment, when executing an operating system, the processing unit stores in the second address space the identification code that corresponds to the error type of an erroneous parameter.
In one embodiment, the processing unit updates the identification code after the debug data is stored in the third address space.
In one embodiment, an error detecting method for a server includes the following. A processing unit operates in a system management mode according to an interrupt signal. In the system management mode, the processing unit executes a BIOS code and, according to an identification code stored in a second address space of a memory unit of a memory module, executes the debugging program in the BIOS code that corresponds to the identification code, thereby generating debug data. The processing unit then stores the debug data in a third address space of the memory unit while still in the system management mode. A first address space of the memory unit stores a plurality of serial presence detect (SPD) data.
In an embodiment, the error detecting method further includes: the memory unit outputs the debug data through a system management bus (SMBus) interface or an inter-integrated circuit (I2C) bus interface.
In an embodiment, the error detecting method further includes: the processing unit turns off a temperature data output function of the memory module in the system management mode, so that the memory module cannot transmit temperature data through a system management bus connected to the memory module and the system management bus interface while the processing unit operates in the system management mode.
In one embodiment, in the step in which the processing unit executes the debugging program according to the identification code, the processing unit, when executing an operating system, stores in the second address space the identification code that corresponds to the error type of an erroneous parameter, so that the debugging program is executed according to that identification code.
In an embodiment, the error detecting method further includes: the processing unit updates the identification code after the debug data is stored in the third address space.
Compared with the prior art, in the error detecting device and error detecting method of the invention, the memory unit that stores the serial presence detect data also stores the identification code and the debug data. A debugger can store different identification codes in the second address space according to the actual error condition of the server and can obtain the debug data from the memory unit, which makes debugging more convenient and error detection more accurate. Moreover, because the memory unit that already stores the serial presence detect data is reused for the identification code and the debug data, no additional hardware is needed to execute a debugging program, and the stored identification code and debug data are not lost on shutdown or power removal, which reduces the overall production cost of the server.
Description of the Drawings
FIG. 1 is a block diagram illustrating an embodiment of a server according to the present invention.
FIG. 2 is a schematic diagram of one embodiment of an address space arrangement of the memory unit 101 of FIG. 1.
Detailed Description
FIG. 1 is a block diagram illustrating an embodiment of a server according to the present invention. Referring to fig. 1, the server at least includes a memory module 10, a processing unit 11 and a BIOS memory 12. The processing unit 11 is coupled to the memory module 10 and the BIOS memory 12. In one embodiment, the processing unit 11 may be a Central Processing Unit (CPU).
The memory module 10 includes a plurality of memory units. FIG. 1 shows the memory module 10 with three memory units 101, 102, and 103, but the invention is not limited thereto; the number of memory units may be more or fewer than three. The memory unit 101 stores a plurality of Serial Presence Detect (SPD) data, such as the timing and voltage specification parameters of the memory module 10. The memory unit 101 further stores at least one identification code corresponding to a debugging program. Debugging programs can implement different functions: for example, reading a register value of a specific unit in the server, such as a chipset; monitoring the temperature of a specific unit in the server; recording data generated by a Serial ATA (SATA) test; recording the value shown on the debug indicator display during the Power-On Self-Test (POST) stage; or performing Machine Check Architecture (MCA) detection and error reporting. The debugger of the server can define such program code and store it in the BIOS memory 12 as a debugging program that forms part of the BIOS code. In one embodiment, the identification code may be a Globally Unique Identifier (GUID).
FIG. 2 is a schematic diagram of one embodiment of the address space arrangement of the memory unit 101 of FIG. 1. Referring to FIG. 2, the SPD data are stored in a first address space 101A of the memory unit 101, and the identification code is stored in a second address space 101B that is different from the first address space 101A and may or may not be contiguous with it. The memory unit 101 further includes a third address space 101C for storing data, which may or may not be contiguous with the first address space 101A and the second address space 101B. In one embodiment, the first address space 101A covers 384 bytes (byte 0 through byte 383), and the second address space 101B and the third address space 101C together cover 168 bytes (byte 384 through byte 551). The invention is not limited thereto, and the size of each address space can be configured according to actual requirements.
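The byte ranges in this embodiment can be written down as a small layout table. The following Python sketch is illustrative only: the 384-byte and 168-byte figures come from the text, while the `Region` type, the `region_of` helper, and the 16-byte split between the second and third address spaces (room for one 128-bit GUID) are assumptions made for the example, since the text does not say where that boundary lies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One contiguous address space inside the memory unit (hypothetical type)."""
    name: str
    start: int  # first byte offset, inclusive
    size: int   # length in bytes

    @property
    def end(self) -> int:
        """Last byte offset, inclusive."""
        return self.start + self.size - 1

    def contains(self, offset: int) -> bool:
        return self.start <= offset <= self.end


# Assumption: 16 bytes are reserved for the identification code (one GUID).
ID_SIZE = 16

SPD_REGION = Region("first address space 101A (SPD data)", 0, 384)
ID_REGION = Region("second address space 101B (identification code)", 384, ID_SIZE)
DEBUG_REGION = Region("third address space 101C (debug data)", 384 + ID_SIZE, 168 - ID_SIZE)


def region_of(offset: int) -> Region:
    """Return the region that holds the given byte offset."""
    for region in (SPD_REGION, ID_REGION, DEBUG_REGION):
        if region.contains(offset):
            return region
    raise ValueError(f"offset {offset} is outside the 552-byte layout")
```

With this split, `region_of(383)` falls in the SPD region and `region_of(551)` in the debug-data region; a different split only requires changing `ID_SIZE`.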
The processing unit 11 includes a System Management Interrupt (SMI) pin 111 for receiving an interrupt signal. When the SMI pin 111 receives the interrupt signal, the logic level of the SMI pin 111 is high and the processing unit 11 enters a System Management Mode (SMM). In the system management mode, the processing unit 11 executes the BIOS code in the BIOS memory 12. Through the BIOS code, the processing unit 11 reads the memory unit 101 to obtain the identification code in the second address space 101B, and executes the corresponding debugging program in the BIOS code according to that identification code. In one embodiment, identification codes and debugging programs correspond one to one. For example, if the identification code "2" corresponds to the third debugging program and the identification code "9" corresponds to the sixth debugging program, the processing unit 11 executes the third debugging program when the identification code stored in the second address space 101B is "2", and executes the sixth debugging program when the stored identification code is "9". The processing unit 11 generates debug data while executing the debugging program and stores the debug data in the third address space 101C of the memory unit 101. In one embodiment, the memory unit 101 may be an Electrically Erasable Programmable Read-Only Memory (EEPROM). When the processing unit 11 operates in the system management mode, it turns on the write function of the memory unit 101; after the debug data is written into the third address space 101C, the processing unit 11 turns off the write function of the memory unit 101 and leaves the system management mode.
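The sequence in this paragraph (read the identification code, look up the matching debugging program, enable writes, store the debug data, disable writes) can be simulated in a few lines. This is a minimal sketch in Python rather than firmware code: `SpdEeprom`, `smi_handler`, the two sample debugging programs, their output strings, the single-byte identification code, and the offsets 384 and 400 are all hypothetical. Only the codes "2" and "9" and the order of operations are taken from the text.

```python
from typing import Callable, Dict

ID_OFFSET = 384     # assumed start of the second address space
DEBUG_OFFSET = 400  # assumed start of the third address space


class SpdEeprom:
    """In-memory stand-in for memory unit 101 with a write-enable flag."""

    def __init__(self, size: int = 552) -> None:
        self.cells = bytearray(size)
        self.write_enabled = False

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.cells[offset:offset + length])

    def write(self, offset: int, data: bytes) -> None:
        if not self.write_enabled:
            raise PermissionError("write function of the memory unit is off")
        if offset + len(data) > len(self.cells):
            raise ValueError("write would run past the end of the memory unit")
        self.cells[offset:offset + len(data)] = data


def third_debug_program() -> bytes:
    return b"chipset-reg=0x1f"  # made-up debug data


def sixth_debug_program() -> bytes:
    return b"temp=41C"  # made-up debug data


# One-to-one mapping from identification code to debugging program.
DEBUG_PROGRAMS: Dict[int, Callable[[], bytes]] = {
    2: third_debug_program,
    9: sixth_debug_program,
}


def smi_handler(eeprom: SpdEeprom) -> bytes:
    """Simulate what the processing unit does after entering SMM."""
    code = eeprom.read(ID_OFFSET, 1)[0]
    program = DEBUG_PROGRAMS.get(code)
    if program is None:
        return b""  # no debugging program registered for this code
    data = program()
    eeprom.write_enabled = True       # turn on the write function
    try:
        eeprom.write(DEBUG_OFFSET, data)
    finally:
        eeprom.write_enabled = False  # turn it off before leaving SMM
    return data
```

Storing `9` at offset 384 and calling `smi_handler` runs the sixth debugging program and leaves its output at offset 400, with writes disabled again afterwards.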
Further, the memory module 10 includes a System Management Bus (SMBus) interface, the processing unit 11 and other units in the server also have the SMBus interface, and the SMBus interfaces of the processing unit 11 and other units in the server are connected to the SMBus interface of the memory module 10 through a System Management Bus. After the debug data is stored in the third address space 101C of the memory unit 101, the processing unit 11 and other units can read the memory unit 101, so that the memory module 10 can output the debug data in the third address space 101C of the memory unit 101 and transmit the debug data through the system management bus.
For example, the other units of the server may be a Baseboard Management Controller (BMC) and/or a chipset. As shown in FIG. 1, the server further includes a chipset 13 and a BMC 14. The processing unit 11, the chipset 13, and the BMC 14 are connected to the memory module 10 through system management buses 17, 16, and 15, respectively, and can each obtain the debug data stored in the memory unit 101 from the memory module 10 through the corresponding bus. The debugger can therefore obtain the debug data through the processing unit 11 or the other units of the server and then carry out further debugging.
In addition, external debugging equipment, such as an analyzer or an oscilloscope, may be connected to the memory unit 101. A debugger can connect the debugging equipment to the system management bus interface of the memory unit 101 and, after the processing unit 11 stores the debug data in the third address space 101C, use the debugging equipment to receive the debug data output by the system management bus interface and perform subsequent debugging according to the debug data.
In one embodiment, the memory module 10 may include an inter-integrated circuit (I2C) bus interface, and the memory unit 101 may output the debug data in the third address space 101C through the I2C bus interface to other units in the server, or to debugging equipment, that have an I2C bus interface; the details are not repeated here.
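From the point of view of a reader on the bus, such as the BMC, retrieving the debug data amounts to reading a byte range from the memory unit. The Python sketch below models this with a caller-supplied `read_byte` function in place of a real bus transaction. The function name, the offset 400, the 152-byte length, and the zero padding of unused bytes are all assumptions for illustration and are not specified in the text.

```python
from typing import Callable

DEBUG_OFFSET = 400  # assumed start of the third address space
DEBUG_SIZE = 152    # assumed length of the third address space


def read_debug_data(read_byte: Callable[[int], int],
                    offset: int = DEBUG_OFFSET,
                    size: int = DEBUG_SIZE) -> bytes:
    """Read the debug-data region one byte at a time.

    `read_byte(offset)` stands in for a single byte-read transaction on the
    system management bus. Trailing zero bytes are treated as unused padding,
    which is an assumption of this sketch.
    """
    raw = bytes(read_byte(offset + i) for i in range(size))
    return raw.rstrip(b"\x00")


# Usage with a simulated 552-byte memory image instead of real hardware.
image = bytearray(552)
image[400:408] = b"temp=41C"
debug_data = read_debug_data(lambda off: image[off])
```

Here `debug_data` holds the eight bytes written at offset 400. On real hardware the lambda would be replaced by whatever byte-read primitive the reading unit provides.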
In one embodiment, the processing unit 11 may receive the aforementioned interrupt signal from the BIOS code during the boot phase. Taking the boot flow of the Unified Extensible Firmware Interface (UEFI) as an example, during the Driver Execution Environment (DXE) phase the SMI pin 111 of the processing unit 11 is initialized and can be triggered by the BIOS code, so that the processing unit 11 operates in the system management mode, executes a debugging program, and stores the debug data in the third address space 101C of the memory unit 101. When the server encounters a boot abnormality and cannot enter the operating system, a debugger can debug according to the debug data shown on a display unit, or connect a test instrument directly to the output port of the memory unit 101 through an external circuit and read the debug data in the third address space 101C, in order to determine the possible cause of the boot abnormality and perform the corresponding debugging.
Furthermore, when the server boots without error and enters the operating system, the processing unit 11 may periodically scan a plurality of system parameters in the operating system, such as parameters of the different hardware components in the server. When the processing unit 11 finds an erroneous parameter, the operating system triggers the processing unit 11 to enter the system management mode. In the system management mode, the processing unit 11 determines the error type of the erroneous parameter and fills the second address space 101B with the identification code that corresponds to that error type; when the erroneous parameter corresponds to several different error types, the processing unit 11 can fill the second address space 101B with several identification codes. In this way, the processing unit 11 can execute the debugging program associated with the error type after finding the error and store the debug data in the third address space 101C, and the debugger can obtain from the memory unit 101 the debug data generated after the error was found.
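The scan-and-classify step can be illustrated with a short Python sketch. The parameter names, limits, error types, and identification codes below are hypothetical, because the text names none of them. The sketch only shows the general idea of mapping an out-of-range parameter to an error type and then to the identification code that would be filled into the second address space.

```python
from typing import Dict, List, Tuple

# Hypothetical system parameters with their allowed (low, high) ranges.
LIMITS: Dict[str, Tuple[float, float]] = {
    "cpu_temp_c": (0.0, 95.0),
    "sata_crc_errors": (0.0, 0.0),
}

# Hypothetical mapping: parameter -> error type -> identification code.
ERROR_TYPE_OF: Dict[str, str] = {
    "cpu_temp_c": "thermal",
    "sata_crc_errors": "storage",
}
ID_FOR_ERROR_TYPE: Dict[str, int] = {"thermal": 9, "storage": 2}


def scan_parameters(sample: Dict[str, float]) -> List[int]:
    """Return the identification codes to fill into the second address space."""
    codes: List[int] = []
    for name, value in sample.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            code = ID_FOR_ERROR_TYPE[ERROR_TYPE_OF[name]]
            if code not in codes:  # one code per error type
                codes.append(code)
    return codes
```

A sample in which both parameters are out of range yields two identification codes, which matches the case where several codes are filled in at once.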
In one embodiment, after the processing unit 11 executes a debugging program, it can clear the corresponding identification code from the second address space 101B so that an identification code corresponding to another error type can be filled in; the next time the processing unit 11 operates in the system management mode, it uses the identification code then stored in the second address space 101B to execute a debugging program that has not yet been executed. For example, suppose the second address space 101B first stores a first identification code corresponding to a first error type. After the processing unit 11 executes the corresponding first debugging program according to the first identification code, it clears the first identification code from the second address space 101B and fills in a second identification code corresponding to a second error type, so that it can execute the corresponding second debugging program the next time it operates in the system management mode.
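The clear-then-fill behaviour described here resembles a one-slot queue. The Python sketch below models it under stated assumptions: the `IdSlot` class, the `EMPTY` sentinel value of 0, and the idea that not-yet-stored codes wait in a pending list are all hypothetical, since the text does not say where the second identification code is held before it is filled in.

```python
from collections import deque
from typing import Deque


class IdSlot:
    """Models a second address space that holds one identification code at a time."""

    EMPTY = 0  # assumed sentinel meaning "no identification code stored"

    def __init__(self) -> None:
        self.current: int = self.EMPTY
        self.pending: Deque[int] = deque()

    def request(self, code: int) -> None:
        """Store the code now if the slot is free, otherwise keep it waiting."""
        if self.current == self.EMPTY:
            self.current = code
        else:
            self.pending.append(code)

    def complete(self) -> int:
        """Call after the debug data for `current` has been stored.

        Clears the executed code and fills in the next waiting one, so the
        next entry into system management mode runs a different program.
        """
        finished = self.current
        self.current = self.pending.popleft() if self.pending else self.EMPTY
        return finished
```

After `request(2)` and `request(9)`, the slot holds 2; calling `complete()` returns 2 and leaves 9 in the slot for the next pass.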
In one embodiment, the memory module 10 can transmit the temperature data of any memory unit through its system management bus; that is, the memory unit 101 and the other memory units 102 and 103 in the memory module 10 are connected to, and share, the same system management bus. To avoid conflicts caused by different memory units transferring data over the system management bus at the same time, the processing unit 11 first turns off the temperature data output function of the memory module 10 in the system management mode. While the processing unit 11 executes the debugging program and writes the debug data into the memory unit 101, the memory module 10 therefore cannot transmit temperature data through the system management bus. After the processing unit 11 has executed the debugging program and stored the debug data in the third address space 101C, it turns the temperature data output function of the memory module 10 back on and leaves the system management mode.
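The off-then-on bracketing of the temperature output maps naturally onto a context manager. The following Python sketch simulates it; `MemoryModule`, `temperature_output_off`, and the `bus_log` list are hypothetical names used to make the ordering visible and do not correspond to any real firmware interface.

```python
from contextlib import contextmanager
from typing import Iterator, List


class MemoryModule:
    """Stand-in for memory module 10 sharing one system management bus."""

    def __init__(self) -> None:
        self.temperature_output_enabled = True
        self.bus_log: List[str] = []  # records what actually went over the bus

    def report_temperature(self, celsius: int) -> bool:
        """Try to send temperature data; refused while the output is off."""
        if not self.temperature_output_enabled:
            return False
        self.bus_log.append(f"temp {celsius}")
        return True

    def write_debug_data(self, data: bytes) -> None:
        self.bus_log.append(f"debug {data.hex()}")


@contextmanager
def temperature_output_off(module: MemoryModule) -> Iterator[MemoryModule]:
    """Turn temperature output off for the duration of the debug write."""
    module.temperature_output_enabled = False
    try:
        yield module
    finally:
        # Restore the output even if the debug write fails.
        module.temperature_output_enabled = True
```

Inside a `with temperature_output_off(module):` block, temperature reports are refused and only the debug write reaches the bus; the output is restored on exit.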
In summary, according to an embodiment of the error detecting device and the error detecting method of the invention, the memory unit that stores the serial presence detect data also stores the identification code and the debug data. The debugger can store different identification codes in the second address space according to the actual error condition of the server and can obtain the debug data from the memory unit, which makes debugging more convenient and error detection more accurate. Moreover, because the memory unit that already stores the serial presence detect data is reused for the identification code and the debug data, no additional hardware is needed to execute a debugging program, and the stored identification code and debug data are not lost on shutdown or power removal, which reduces the overall production cost of the server.
The embodiments and examples of the present invention are described in detail with reference to the accompanying drawings, but the scope of the invention is not limited thereto, and all equivalent modifications and changes within the scope of the claims of the present invention should be considered as falling within the scope of the present invention.
Claims (6)
1. A fault detection device for a server, comprising:
a memory module comprising a memory cell, the memory cell comprising:
a first address space for storing a plurality of serial presence detect data;
a second address space for storing an identification code corresponding to an error detection procedure; and
a third address space;
a BIOS memory for storing a BIOS code, the BIOS code including the debug program; and
a processing unit coupled to the BIOS memory for operating in a system management mode according to an interrupt signal, wherein in the system management mode, the processing unit executes the debugging procedure in the BIOS code according to the identification code to generate debugging data and stores the debugging data in the third address space, the processing unit stores the identification code corresponding to an error type of an error parameter in the second address space according to an error type of the error parameter when executing an operating system, and the processing unit updates the identification code according to another identification code corresponding to another error type of the error parameter of the operating system after the debugging data is stored in the third address space.
2. The apparatus of claim 1, wherein the memory unit comprises a system management bus interface or an integrated circuit bus interface, and the memory unit outputs the debug data via the system management bus interface or the integrated circuit bus interface.
3. The apparatus of claim 2, further comprising a system management bus coupled to the memory module and the system management bus interface, wherein in the system management mode, the processing unit disables a temperature data output function of the memory module, such that the memory module cannot transmit a temperature data via the system management bus in the system management mode.
4. A method for debugging a server, comprising:
a processing unit operating in a system management mode according to an interrupt signal;
the processing unit executes a BIOS code in the system management mode, executes an error detection program corresponding to an identification code in the BIOS code according to the identification code stored in a second address space of a memory unit of a memory module, and generates an error detection data, wherein a first address space of the memory unit stores a plurality of serial presence detect data, and the processing unit stores the identification code corresponding to an error type in the second address space according to an error type of an error parameter when executing an operating system, so as to execute the error detection program according to the identification code; and
the processing unit stores the error detection data in a third address space of the memory unit in the system management mode, and updates the identification code according to another identification code corresponding to another error type of the error parameter of the operating system after the error detection data is stored in the third address space.
5. The method of claim 4, further comprising: the memory unit outputs the debug data through a system management bus interface or an integrated circuit bus interface.
6. The method of claim 5, further comprising: the processing unit turns off a temperature data output function of the memory module in the system management mode, so that the memory module cannot transmit temperature data through a system management bus connected to the memory module and the system management bus interface when the processing unit operates in the system management mode.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710487094.3A CN109117299B (en) | 2017-06-23 | 2017-06-23 | Error detecting device and method for server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710487094.3A CN109117299B (en) | 2017-06-23 | 2017-06-23 | Error detecting device and method for server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109117299A (en) | 2019-01-01 |
CN109117299B (en) | 2022-04-05 |
Family
ID=64732310
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710487094.3A Expired - Fee Related CN109117299B (en) | 2017-06-23 | 2017-06-23 | Error detecting device and method for server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109117299B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI714958B (en) * | 2019-01-30 | 2021-01-01 | 神雲科技股份有限公司 | A method of modifying setup of basic input/output system |
CN113687967B (en) * | 2020-05-18 | 2024-10-01 | 佛山市顺德区顺达电脑厂有限公司 | Method for recording startup error information |
CN113760612B (en) * | 2020-06-05 | 2024-07-26 | 佛山市顺德区顺达电脑厂有限公司 | Server debugging method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424084A (en) * | 2013-08-27 | 2015-03-18 | 鸿富锦精密电子(天津)有限公司 | System error information detection system and method for server |
CN106598790A (en) * | 2015-10-16 | 2017-04-26 | 中兴通讯股份有限公司 | Server hardware failure detection method, apparatus of server, and server |
CN106815088A (en) * | 2015-11-27 | 2017-06-09 | 佛山市顺德区顺达电脑厂有限公司 | server and its debugging method |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI220471B (en) * | 2003-02-20 | 2004-08-21 | Akom Technology Corp | Method, controller and apparatus for displaying BIOS debug message |
CN100524245C (en) * | 2006-12-21 | 2009-08-05 | 英业达股份有限公司 | Method for monitoring input/output port data |
US7613952B2 (en) * | 2006-12-29 | 2009-11-03 | Inventec Corporation | Method for facilitating BIOS testing |
CN101408860A (en) * | 2007-10-12 | 2009-04-15 | 华硕电脑股份有限公司 | Monitoring device and monitoring method thereof |
CN104035844A (en) * | 2013-03-04 | 2014-09-10 | 联想(北京)有限公司 | Fault testing method and electronic device |
CN106547653B (en) * | 2015-09-21 | 2020-03-13 | 龙芯中科技术有限公司 | Computer system fault state detection method, device and system |
CN106502846A (en) * | 2016-10-14 | 2017-03-15 | 合肥联宝信息技术有限公司 | A kind of computer glitch detection method and device |
Also Published As
Publication number | Publication date |
---|---|
CN109117299A (en) | 2019-01-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI620061B (en) | Error detecting apparatus of server and error detecting method thereof | |
US20130268708A1 (en) | Motherboard test device and connection module thereof | |
US8707103B2 (en) | Debugging apparatus for computer system and method thereof | |
US20080046706A1 (en) | Remote Monitor Module for Computer Initialization | |
CN106547653B (en) | Computer system fault state detection method, device and system | |
US9542304B1 (en) | Automated operating system installation | |
US20210216388A1 (en) | Method and System to Detect Failure in PCIe Endpoint Devices | |
CN109117299B (en) | Error detecting device and method for server | |
CN102479148A (en) | Monitoring system and method for input and output port states of peripheral components | |
CN103257922B (en) | A kind of method of quick test BIOS and OS interface code reliability | |
CN113377586A (en) | Automatic server detection method and device and storage medium | |
CN104679626A (en) | System and method for debugging and detecting BIOS (Basic Input / Output System) | |
CN115878533A (en) | Adaptive configuration method, device, equipment and storage medium of AI server | |
CN106681877B (en) | Chip debugging system and method and system chip | |
US20060136794A1 (en) | Computer peripheral connecting interface system configuration debugging method and system | |
US11494289B2 (en) | Automatic framework to create QA test pass | |
CN110570897B (en) | Memory detection system, memory detection method and error mapping table establishment method | |
CN102053888A (en) | Self-testing method and system for computing device | |
CN104678292A (en) | Test method and device for CPLD (Complex Programmable Logic Device) | |
CN113791825A (en) | Component identification method, system, equipment and storage medium | |
CN101206606A (en) | Method of monitoring input/output port data | |
US20210173994A1 (en) | Method and system for viewing simulation signals of a digital product | |
US20210173989A1 (en) | Simulation signal viewing method and system for digital product | |
CN110321171B (en) | Startup detection device, system and method | |
CN112269705A (en) | Detection board for fault location of X86 architecture system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220405 |