
CN115616377A - Fault chip detection method and device, computing equipment and storage medium - Google Patents


Info

Publication number
CN115616377A
Authority
CN
China
Prior art keywords
node chip
preset
chip
node
communication link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211243265.5A
Other languages
Chinese (zh)
Inventor
马甲坤
张楠赓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canaan Creative Co Ltd
Original Assignee
Canaan Creative Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canaan Creative Co Ltd
Priority to CN202211243265.5A
Publication of CN115616377A

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01R MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/28 Testing of electronic circuits, e.g. by signal tracer
    • G01R31/2851 Testing of integrated circuits [IC]
    • G01R31/2853 Electrical testing of internal connections or -isolation, e.g. latch-up or chip-to-lead connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The embodiments of the application provide a faulty chip detection method, a faulty chip detection device, a computing device and a storage medium. The faulty chip detection method comprises the following steps: acquiring preset mark information and physical position information of a node chip in a communication link; and determining the faulty node chip according to the preset mark information, the physical position information and a preset association relationship. The technical solution of the embodiments allows a faulty node chip in the communication link to be quickly detected and located.

Description

Fault chip detection method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of chip detection technologies, and in particular, to a method and an apparatus for detecting a faulty chip, a computing device, and a storage medium.
Background
With the application and development of machine learning, especially deep learning, in various fields, higher requirements are placed on the data processing capability of computing devices. To complete complex computing tasks, multiple chips are now operated in parallel to improve data processing and computing capacity. However, if any one chip fails, it is difficult to identify and locate the failed chip.
Disclosure of Invention
Embodiments of the present application provide a method and an apparatus for detecting a faulty chip, a computing device, and a storage medium, so as to solve or alleviate one or more technical problems in the prior art.
In a first aspect, an embodiment of the present application provides a method for detecting a faulty chip, including:
acquiring preset mark information and physical position information of a node chip in a communication link;
and determining the faulty node chip according to the preset mark information, the physical position information and a preset association relationship.
In a second aspect, an embodiment of the present application provides a fault detection apparatus, including at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of the embodiments.
In a third aspect, an embodiment of the present application provides a computing device, including the fault detection apparatus in any implementation manner of this embodiment.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the embodiments.
By adopting the technical scheme, the embodiment of the application can detect and position the chip with the fault in the communication link.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 shows a schematic diagram of a communication link;
FIG. 2 shows a schematic structural diagram of a chip according to an embodiment of the present application;
FIG. 3 shows a flow diagram of a faulty chip detection method according to an embodiment of the present application;
FIGS. 4, 5, 6, and 7 show schematic diagrams of communication links according to embodiments of the present application;
FIG. 8 shows a block diagram of a faulty chip detection apparatus according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
In some high performance computing applications, multiple node chips may be used to compute together, with the node chips connected by signal lines to form a communication link. During startup, all node chips are enumerated by broadcasting a command. In the enumeration process, after a node chip receives the enumeration instruction, it adds 1 to the enumeration result data it received, stores the result in its register, and transmits it onward; each subsequent node chip repeats the process and adds 1 in turn. In this way every node chip on the communication link can determine its own physical position information and can then respond to the calculation commands addressed to it based on that position information.
However, in some cases a node chip in the communication link fails and does not respond to the enumeration instruction, so it cannot count and has no location information. That is, the enumeration result of the preceding node chip is passed directly to the following node chip without being counted, and according to the final enumeration result the communication link appears to have one node chip fewer. During maintenance, the failed node chip cannot be located.
For example, as shown in FIG. 1, the communication link 10 may include seven node chips M1, M2, M3 … M7, whose physical positions are [1, 2, 3, 4, 5, 6, 7], respectively. The enumeration process of the seven node chips is as follows: after receiving a detection instruction (for example, an enumeration instruction), the node chips sequentially add 1 to the value of their corresponding registers, finally yielding seven pieces of physical position information, A1, A2, A3, A4, A5, A6 and A7. That is, for each chip, physical position n corresponds to physical position information An.
When a certain node chip is damaged (taking M3 as an example), it does not count the enumeration result, that is, it does not add 1, but the detection instruction continues to be transmitted backward. The following node chips then execute the detection instruction in sequence and count on the enumeration result they receive, so only 6 pieces of physical location information are finally obtained, namely A1, A2, A3, A4, A5 and A6. Thus only 6 of the 7 actual node chips are found, and the enumeration result alone cannot show which node chip has failed.
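The enumeration behaviour described above can be illustrated with a minimal Python sketch. The function name `enumerate_link` and the boolean list describing which chips respond are assumptions made for illustration only; they do not appear in the application.

```python
from typing import List


def enumerate_link(responds: List[bool]) -> List[int]:
    """Simulate one enumeration pass along a serial communication link.

    responds[i] is True when the chip at actual position i + 1 handles the
    enumeration instruction. A failed chip does not add 1 and reports
    nothing, so the incoming count reaches the next chip unchanged.
    """
    count = 0
    reported = []
    for ok in responds:
        if ok:
            count += 1              # healthy chip adds 1 and stores the result
            reported.append(count)
        # failed chip: the count is passed on untouched
    return reported


healthy = enumerate_link([True] * 7)
m3_failed = enumerate_link([True, True, False, True, True, True, True])
print(healthy)    # [1, 2, 3, 4, 5, 6, 7]
print(m3_failed)  # [1, 2, 3, 4, 5, 6]
```

With M3 failed, the result is indistinguishable from that of a healthy six-chip link, which is the problem the embodiments set out to solve.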
The embodiment of the application aims to provide a chip and a fault chip detection method, wherein the chip can be used as a node chip in a communication link. The fault chip detection method based on the chip can locate the node chip with the fault in the communication link.
Wherein, the communication link can comprise at least two node chips and realizes the communication between the at least two node chips. Illustratively, the communication link may be a one-way communication or a two-way communication.
In one example, the communication link connects only two node chips, and point-to-point connection communication may be implemented. In another example, a communication link may connect more than two node chips with one link, implementing a multi-drop communication. The number of node chips in a communication link is not limited in this embodiment.
In this embodiment, each node chip is connected through a circuit to form a communication link, and each node chip may be connected in series, may also be connected in parallel, and may also be combined in series and parallel.
Fig. 2 shows a schematic diagram of a node chip 100 according to an embodiment of the application. Node chip 100 may be a chip in a communication link. Illustratively, the node chip 100 of the present embodiment may be an Application Specific Integrated Circuit (ASIC) chip.
As shown in FIG. 2, the node chip 100 includes at least one preset flag pin IO_pos for receiving preset flag information V_pos. For example, an existing input/output pin of the node chip 100 may be redefined as the preset flag pin, or an additional input/output pin may be added as the preset flag pin.
Each preset mark pin receives one piece of preset mark information, and the pieces received by different pins may be the same or different. The preset mark information of the at least one preset mark pin and the physical position information of the node chip 100 in the communication link are consistent with the preset association relationship. For example, the relationship between the level value of the preset flag information V_pos of the at least one preset flag pin and the detected physical location information of the node chip 100 in the communication link is consistent with the preset association relationship.
The preset association relationship is a correspondence relationship between actual physical location information corresponding to the node chip and preset mark information.
In one embodiment, the preset association relationship includes: when the actual physical position information of the node chip in the communication link is an odd number, the level value determined from the preset mark information of the node chip is a first level value; and when the actual physical position information of the node chip in the communication link is an even number, the level value determined from the preset mark information of the node chip is a second level value. The preset mark pin of the node chip is used for receiving the preset mark information. Illustratively, the first level value is a low level and the second level value is a high level.
In this embodiment, the low level may represent "0" and the high level may represent "1". It should be noted that high and low levels are defined relatively: a voltage above a certain high threshold is defined as high level, and a voltage below a certain low threshold is defined as low level. Therefore, the definitions of low level and high level may differ between chips, and the voltage ranges corresponding to "0" and "1" may also differ, which is not limited in this embodiment.
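As a rough illustration of the relative level definition, the following Python sketch maps a measured voltage to "0" or "1". The threshold values `V_LOW_MAX` and `V_HIGH_MIN` and the function name `to_logic` are hypothetical and chosen only for the example; the application does not specify any thresholds.

```python
from typing import Optional

# Hypothetical thresholds in volts, chosen only for illustration.
V_LOW_MAX = 0.8
V_HIGH_MIN = 2.0


def to_logic(voltage: float,
             v_low_max: float = V_LOW_MAX,
             v_high_min: float = V_HIGH_MIN) -> Optional[int]:
    """Return 1 for a high level, 0 for a low level, None if undefined."""
    if voltage > v_high_min:
        return 1
    if voltage < v_low_max:
        return 0
    return None  # between the two thresholds: neither high nor low


print(to_logic(3.3))  # 1
print(to_logic(0.1))  # 0
print(to_logic(1.5))  # None
```

A different chip would simply pass different thresholds, which matches the statement that the voltage ranges for "0" and "1" may differ.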
Several examples of preset associations are given below.
Example one
In example one, there may be one preset flag pin IO_pos. The preset association relationship comprises: when the actual physical location information of the node chip 100 in the communication link is odd, the preset flag information is a first level value, for example "0", and when it is even, the preset flag information is a second level value, for example "1". Alternatively, when the actual physical location information of the node chip 100 in the communication link is odd, the preset flag information is the second level value, for example "1", and when it is even, the preset flag information is the first level value, for example "0". The specific association relationship may be preset according to actual needs and is not limited in this embodiment.
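The parity rule of example one can be sketched in Python as follows. The names `expected_level`, `is_consistent` and the `odd_level` parameter (which selects between the two alternative mappings) are assumptions introduced for illustration.

```python
def expected_level(position: int, odd_level: int = 0) -> int:
    """Level the preset association prescribes for a given position.

    odd_level is the level assigned to odd positions (0 or 1); even
    positions receive the opposite level.
    """
    return odd_level if position % 2 == 1 else 1 - odd_level


def is_consistent(position: int, level: int, odd_level: int = 0) -> bool:
    """Check an enumerated position against the level read from IO_pos."""
    return level == expected_level(position, odd_level)


print(is_consistent(3, 0))               # True: odd position, level "0"
print(is_consistent(3, 1))               # False: violates the association
print(is_consistent(3, 1, odd_level=1))  # True under the alternative mapping
```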
Example two
In example two, there may be a plurality of preset flag pins IO_pos, and the result of a logical operation on the plurality of pieces of preset flag information V_pos and the physical position information of the node chip 100 in the communication link are consistent with the preset association relationship. Logical operations include, but are not limited to, OR, AND, NOT, XOR, etc. The preset association relationship comprises: when the actual physical position information of the node chip 100 in the communication link is odd, the result of the logical operation on the plurality of pieces of preset flag information is a first level value, for example "0", and when it is even, the result is a second level value, for example "1". Alternatively, when the actual physical location information of the node chip 100 in the communication link is odd, the result of the logical operation is the second level value, for example "1", and when it is even, the result is the first level value, for example "0". The specific association relationship may be preset according to actual needs and is not limited in this embodiment.
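A minimal Python sketch of example two is given below. It assumes XOR as the logical operation, which is only one of the operations the text permits; the function names and the `odd_level` parameter are likewise illustrative assumptions.

```python
from functools import reduce
from operator import xor
from typing import Sequence


def combined_level(pin_levels: Sequence[int]) -> int:
    """Reduce the levels of several IO_pos pins to one level using XOR."""
    return reduce(xor, pin_levels)


def is_consistent(position: int, pin_levels: Sequence[int],
                  odd_level: int = 0) -> bool:
    """Compare the combined level with the level expected for the position."""
    expected = odd_level if position % 2 == 1 else 1 - odd_level
    return combined_level(pin_levels) == expected


print(is_consistent(1, [1, 1]))  # True: 1 XOR 1 = 0, odd position expects 0
print(is_consistent(2, [1, 0]))  # True: 1 XOR 0 = 1, even position expects 1
print(is_consistent(3, [1, 0]))  # False: odd position but combined level 1
```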
Example three
The preset flag information V_pos may be used to determine a configuration number for the chip, and the preset association relationship may include the correspondence between the actual physical position information of the chip and the configuration number.
For example, the number of preset flag pins IO_pos may be N, where N is greater than or equal to 2. The N pieces of preset flag information V_pos corresponding to the N preset flag pins each take a first level value or a second level value and are combined to generate 2^N configuration results, whose configuration numbers are 1, 2 … 2^N - 1, 0 respectively. Each chip in the communication link is assigned in turn a configuration number 1, 2 … 2^N - 1, 0. If the number of chips in the communication link is greater than 2^N, then starting from the (2^N + 1)th chip, the chips are again assigned the configuration numbers 1, 2 … 2^N - 1, 0 in sequence (with N preset flag pins, this embodiment can thus perform fault detection on more than 2^N chips, so that more chips are covered with fewer preset flag pins; N may be an integer greater than 1). The preset association relationship comprises: when the remainder of dividing the value corresponding to the actual physical location information of the node chip 100 in the communication link by 2^N is M, the configuration number corresponding to the node chip 100 is M, where M is less than or equal to 2^N.
For example, when N is equal to 2, that is, there are two IO_pos pins, the preset flag information V_pos of the two IO_pos pins combines to generate four configuration results, 00, 01, 10 and 11, whose configuration numbers are 1, 2, 3 and 0 in sequence and which correspond respectively to four node chips 100_1, 100_2, 100_3 and 100_4 with actual physical positions 1, 2, 3 and 4. When node chip 100_3 fails, the physical location information obtained for node chip 100_4 corresponds to the value 3, and the remainder of 3 divided by 4 is 3. This is not equal to the number "0" corresponding to the configuration result 11 of the preset flag information V_pos of the two IO_pos pins of node chip 100_4, so the preset association relationship is not satisfied and it can be determined that a faulty chip exists.
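The configuration-number rule of example three can be sketched in Python as follows. The mapping for N = 2 (00, 01, 10, 11 to 1, 2, 3, 0) comes from the text; extending it to larger N as "binary code plus one, modulo 2^N" is an assumption, as are the function names.

```python
from typing import Sequence


def config_number(pin_levels: Sequence[int]) -> int:
    """Configuration number encoded by N pin levels, first pin most significant.

    For two pins: 00 -> 1, 01 -> 2, 10 -> 3, 11 -> 0.
    """
    n = len(pin_levels)
    code = int("".join(str(level) for level in pin_levels), 2)
    return (code + 1) % (2 ** n)


def expected_number(position: int, n_pins: int) -> int:
    """Configuration number the association prescribes for a position."""
    return position % (2 ** n_pins)


# Node chip 100_4 is strapped to 11 but, with 100_3 failed, enumerates as 3.
strapped = config_number([1, 1])
enumerated = expected_number(3, n_pins=2)
print(strapped, enumerated)    # 0 3
print(strapped == enumerated)  # False: a faulty chip exists
```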
Example four
In example four, there may be one IO_pos pin, the level value of the corresponding preset flag information is V_pos, and the preset association relationship includes: the level value of the preset mark information of a chip and the level values of the preset mark information of the chips immediately before and after it in the communication link form an arithmetic progression. For example, V_pos_n = V_pos_1 + (n - 1) × d, where V_pos_1 is the level value of the preset flag information of the first chip, V_pos_n is the level value of the preset flag information of the nth chip, and d is the common difference. When the difference between the level value of the preset flag information of an enumerated chip and that of the chip enumerated immediately before it is not equal to d, a faulty chip exists, and it may be determined that the faulty chip lies between the currently enumerated chip and the previously enumerated chip. Taking the enumerated ith chip as an example, if the difference between the level value of the preset marking information of the ith chip and that of the previously enumerated (i-1)th chip is not equal to d, it may be determined that the ith chip in the actual communication link has a fault, and the enumerated ith chip is the (i+1)th chip in the actual communication link.
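The arithmetic-progression check of example four can be sketched in Python as follows. The function name `first_gap` and the use of integer level values (for instance millivolts, to avoid floating-point comparison) are assumptions made for illustration; the sample values are invented.

```python
from typing import List, Optional


def first_gap(levels: List[int], d: int) -> Optional[int]:
    """Return the 1-based enumerated index i of the first chip whose level
    does not exceed that of the previously enumerated chip by exactly d,
    or None when all adjacent differences equal d.
    """
    for idx in range(1, len(levels)):
        if levels[idx] - levels[idx - 1] != d:
            return idx + 1
    return None


# Hypothetical levels in millivolts: V_pos_1 = 100, d = 100, seven chips.
all_ok = [100, 200, 300, 400, 500, 600, 700]
third_failed = [100, 200, 400, 500, 600, 700]  # actual chip 3 reports nothing

print(first_gap(all_ok, 100))        # None
print(first_gap(third_failed, 100))  # 3
```

In the second case the enumerated 3rd chip is the actual 4th chip, so the actual 3rd chip is the faulty one, as the text describes.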
In one embodiment, as shown in FIG. 2, node chip 100 may also include a register 101. In response to a detection instruction in the communication link, the register 101 generates physical location information.
The detection instruction may be an enumeration instruction, and the physical location information may be an enumeration result. For example: the register 101 counts based on the received enumeration result of the previous chip to obtain the enumeration result of the current chip, and then generates physical location information.
For example, the communication link includes a plurality of node chips 100, namely 100_1, 100_2 … 100_N. Node chip 100_1 is configured as the first node chip of the communication link. Node chip 100_1 may receive a detection instruction from an external control device, count, e.g., add 1, generate and store the enumeration result [1] of node chip 100_1, and transmit the detection instruction and the enumeration result [1] to the next node chip 100_2. Node chip 100_2, in response to the detection instruction, counts on the enumeration result [1] it receives, e.g., adds 1, generates and stores the enumeration result [2] of node chip 100_2, and transmits the detection instruction and the enumeration result [2] to the next node chip 100_3, and so on, so that a plurality of pieces of physical position information are obtained.
It should be noted that this embodiment does not specifically limit the value added at each count or the numeral base in which the enumeration result is expressed.
The present embodiment further provides a communication link, which includes a plurality of node chips, where a node chip is the node chip 100 in any implementation manner of the embodiments of the present application.
The embodiment also provides a fault chip detection method, and an execution main body of the fault chip detection method can be a fault chip detection device. Illustratively, the faulty chip detection device may be an external control device, such as a control board, outside the communication link.
As shown in fig. 3, the method for detecting a faulty chip includes:
step S301: acquiring preset mark information and physical position information of a node chip in a communication link;
step S302: determining the faulty node chip according to the preset mark information, the physical position information and the preset association relationship.
In step S301, the obtaining of the preset mark information of the node chip in the communication link may include: a node chip in a communication link is configured with a preset mark pin; and acquiring preset marking information of the node chip through a preset marking pin.
Acquiring the preset mark information of the node chip through the preset mark pin includes: sending preset mark information to the preset mark pin of the node chip; and acquiring the preset mark information received by the node chip in the communication link.
Further, in step S301, acquiring the physical location information of the node chip in the communication link may include: sending a detection instruction to a node chip in a communication link; and acquiring the physical position information of the node chip in the communication link.
The method for sending a detection instruction to a node chip in a communication link to acquire physical location information of the node chip in the communication link includes: sending a detection instruction to the communication link, so that each node chip in the communication link generates physical position information in response to the detection instruction; and acquiring each piece of physical position information.
That is, the physical location information in step S301 is information acquired based on the detection instruction, and may be different from the actual physical location information of the node chip.
Illustratively, as shown in FIG. 4, the communication link 20 includes a plurality of node chips 100_1 to 100_N, whose actual physical location information corresponds to the values 1 to N. Each node chip is provided with a preset mark pin IO_pos. The preset mark information input to IO_pos is configured according to the actual physical position information of the node chip in the communication link and the preset association relationship.
A detection instruction is sent to each node chip in the communication link 20; each node chip generates physical position information in response to the detection instruction; and a plurality of pieces of physical location information can then be acquired. The input signal of the preset mark pin IO_pos of each node chip is detected, so that a plurality of pieces of preset mark information can be acquired. Whether a node chip has failed, and the position of the failed node chip, can be quickly judged from the preset association relationship, the preset mark information and the physical position information.
It should be noted that the detection instruction may be transmitted from node chip 100_1 to node chip 100_N, or from node chip 100_N to node chip 100_1. There may be a plurality of IO_pos pins.
In a specific example, as shown in FIG. 5, the detection instruction of the communication link 20 is transmitted from node chip 100_1 to node chip 100_N, there is one IO_pos pin, and the preset association is such that the preset flag information is at a high level ("1") when the actual physical position information is odd and at a low level ("0") when the actual physical position information is even. The node chip may be a node chip 100 comprising a register 101. The register 101 generates physical location information in response to a detection instruction in the communication link. The detection instruction may be an enumeration instruction, and the physical location information may be an enumeration result. For example, after counting on the received enumeration result of the previous node chip, the register 101 generates the enumeration result of the current node chip as the physical location information of the current node chip.
In one embodiment, the preset association relationship is a correspondence relationship between actual physical location information corresponding to the node chip and preset mark information. The setting of the preset association relationship may refer to the first example, the second example, the third example, and the fourth example, which are not described herein again.
In one embodiment, step S302 may include: judging in sequence, following the order in which the detection instruction is sent along the communication link, whether each piece of physical position information and each piece of preset mark information are consistent with the preset association relationship; and if not, determining a faulty node chip according to the first node chip inconsistent with the preset association relationship. Specifically, the corresponding node chip may be determined based on the first piece of physical location information inconsistent with the preset association relationship, and that node chip is determined to be the faulty node chip.
The following describes a specific implementation of the faulty chip detection method of the present embodiment in detail, taking example one as an example.
In this example, the faulty chip detection method includes:
(1) Sending a detection instruction to a communication link, so that each node chip can respond to the detection instruction to generate physical position information;
(2) Acquiring physical position information;
(3) Sequentially judging whether each piece of physical position information and each piece of preset mark information are consistent with a preset association relation or not according to the sending sequence of the detection instruction in the communication link;
(4) If not, determining the corresponding node chip based on the first piece of physical position information inconsistent with the preset association relationship, and determining that node chip to be the faulty node chip.
For ease of understanding, the faulty chip detection method is described below taking the communication link 30 as an example. As shown in FIG. 6, communication link 30 includes seven node chips 100_1, 100_2 … 100_7. Each node chip is provided with one preset mark pin IO_pos, and the preset association is that the preset mark information is at a high level ("1") when the actual physical position information of the node chip in the communication link is odd and at a low level ("0") when it is even.
After the communication link 30 is started, a detection instruction is sent to the communication link 30. Illustratively, the detection instruction may be transmitted to the communication link 30 by an external control device through an external interface of the circuit board on which the communication link 30 is located. Node chip 100_1 may receive the detection instruction from the external control device, count, e.g., add 1, generate and store the enumeration result [1] of node chip 100_1, and transmit the detection instruction and the enumeration result [1] to the next node chip 100_2. Node chip 100_2, in response to the detection instruction, counts on the enumeration result [1] it receives, e.g., adds 1, generates and stores the enumeration result [2] of node chip 100_2, and transmits the detection instruction and the enumeration result [2] to the next node chip 100_3, and so on.
In case one, it is assumed that none of the node chips has failed. Based on the faulty chip detection method of this embodiment, the enumeration result obtained for the communication link 30 is [1, 2, 3, 4, 5, 6, 7], giving a plurality of pieces of physical position information A1, A2, A3, A4, A5, A6 and A7. Further, the input signal of the preset mark pin IO_pos of each node chip is detected, giving the preset mark information [1, 0, 1, 0, 1, 0, 1]. It can be seen that the physical location information and the preset mark information of every node chip conform to the preset association relationship. Therefore, no node chip has failed.
In case two, assume that the third node chip 100_3 is a failed node chip. Based on the faulty chip detection method of the present embodiment, a detection instruction is sent to each node chip in the communication link 30. Since the failed node chip 100_3 cannot respond to the detection instruction and cannot generate physical position information, the pieces of physical position information obtained are, in sequence, A1, A2, A3, A4, A5 and A6. Further, the input signal of the preset flag pin IO_pos of each node chip is detected. Since the input signal of the preset mark pin IO_pos of the failed node chip 100_3 is not detected, the preset mark information obtained is [1, 0, 0, 1, 0, 1]. It can be seen that, starting from A3, the physical location information and the preset mark information are no longer consistent with the preset association relationship; that is, the node chip corresponding to the first piece of physical location information inconsistent with the preset association relationship is 100_3. Thus the third node chip 100_3 can be determined to be the failed node chip.
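Cases one and two can be reproduced with a short Python sketch. The helpers `scan_link` and `first_inconsistent` are hypothetical names; `scan_link` models a failed chip as reporting neither a position nor a flag, and the odd-is-high mapping follows the association stated for communication link 30.

```python
from typing import List, Optional, Set, Tuple


def scan_link(n_chips: int, failed: Set[int]) -> Tuple[List[int], List[int]]:
    """Return (enumerated positions, IO_pos flags) for the chips that respond."""
    positions, flags = [], []
    count = 0
    for actual in range(1, n_chips + 1):
        if actual in failed:
            continue                 # failed chip: no count, no flag
        count += 1
        positions.append(count)
        flags.append(1 if actual % 2 == 1 else 0)  # odd -> "1", even -> "0"
    return positions, flags


def first_inconsistent(positions: List[int], flags: List[int]) -> Optional[int]:
    """First enumerated position whose flag violates the parity association."""
    for pos, flag in zip(positions, flags):
        if flag != (1 if pos % 2 == 1 else 0):
            return pos
    return None


print(scan_link(7, set()))                       # ([1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 0, 1, 0, 1])
print(first_inconsistent(*scan_link(7, set())))  # None
print(scan_link(7, {3}))                         # ([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1])
print(first_inconsistent(*scan_link(7, {3})))    # 3
```

The first inconsistent position, 3, identifies node chip 100_3, matching case two.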
In one embodiment, step S302 may include: sequentially judging whether each piece of physical position information and each piece of preset mark information are consistent with a preset association relation or not according to the sending sequence of the detection instruction in the communication link; and if not, determining a fault node chip according to the first node chip inconsistent with the preset association relation. Specifically, the first node chip inconsistent with the preset association relationship and at least one node chip behind the first node chip may be determined as a faulty node chip.
Taking the third example as an example, a specific implementation manner of the faulty chip detection method of the present embodiment is described in detail below.
In this example, the method for detecting a faulty chip includes:
(1) Sending a detection instruction to the communication link, so that each node chip can respond to the detection instruction to generate physical position information;
(2) Acquiring physical position information;
(3) Sequentially judging whether each piece of physical position information and each piece of preset mark information are consistent with a preset association relation or not according to the sending sequence of the detection instruction in the communication link;
(4) If not, determining the first node chip inconsistent with the preset association relation and at least one node chip behind the first node chip as a fault node chip.
For ease of understanding, the following description takes the communication link 40 as an example. As shown in FIG. 7, the communication link 40 includes seven node chips 200₁, 200₂, ..., 200₇. Each node chip has two preset mark pins IO_pos, and the preset mark information V_pos of the two IO_pos pins combines to generate four configuration results, namely 00, 01, 10 and 11, which are numbered 1, 2, 3 and 0 in sequence. The preset association relationship comprises: when the remainder of dividing the numerical value corresponding to the actual physical position information of the node chip in the communication link by 4 (i.e. 2 to the power of 2) is M, the number of the configuration result of the preset mark information is M.
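The numbering of configuration results and the remainder rule described above can be illustrated as follows. This is a hedged sketch: `config_number`, `expected_number` and `expected_levels` are hypothetical helper names, and the closed-form mapping "(binary value + 1) mod 2^N" is an assumption that happens to reproduce the numbering 1, 2, 3, 0 given for the four two-pin combinations.

```python
def config_number(levels: str) -> int:
    """Map pin levels such as "01" to a configuration number.

    With N pins the 2**N combinations are numbered 1, 2, ..., 2**N - 1, 0,
    which equals (binary value + 1) mod 2**N under the assumed ordering.
    """
    n_pins = len(levels)
    return (int(levels, 2) + 1) % (2 ** n_pins)


def expected_number(position: int, n_pins: int = 2) -> int:
    """Configuration number required at a given actual physical position."""
    return position % (2 ** n_pins)


def expected_levels(position: int, n_pins: int = 2) -> str:
    """Pin levels that would have to be wired at a given actual position."""
    value = (position - 1) % (2 ** n_pins)
    return format(value, f"0{n_pins}b")


# Wiring table for the seven-chip link of the example.
for pos in range(1, 8):
    levels = expected_levels(pos)
    print(pos, levels, config_number(levels), expected_number(pos))
```

Each printed row lists the position, the pin levels, the configuration number derived from those levels, and the remainder of the position divided by 4; the last two columns agree when a chip is wired according to the assumed relation.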
Suppose there are two consecutive node chips 200 3 And 200 4 A failure occurs.
Based on the faulty chip detection method of this embodiment, after the communication link 40 is started, a detection instruction is sent to the communication link 40, and the plurality of pieces of physical location information obtained are, in sequence: A₁, A₂, A₃, A₄, A₅. The configuration results of the plurality of pieces of preset mark information obtained are, in sequence: 00, 01, 00, 01, 10, and the corresponding configuration result numbers are: 1, 2, 1, 2 and 3. It can be seen that the remainder of dividing the numerical value of the physical location information A₃ by 4 is 3 (3 % 4 = 3), but the number of the configuration result of the corresponding preset mark information is 1. It can therefore be determined that the first piece of physical location information inconsistent with the preset association relationship is A₃, and the corresponding node chip is 200₃. Moreover, because the total number of node chips is 7 but only 5 pieces of physical position information are received, there are two faulty node chips in total, namely 200₃ and 200₄.
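The reasoning of this example, first inconsistent position plus the number of missing responses, can be sketched as below. The function name `locate_faulty_chips` is hypothetical, and the sketch assumes that the failed chips are consecutive, as in the example; it is not presented as the implementation used in the embodiment.

```python
from typing import List


def locate_faulty_chips(total_chips: int,
                        config_numbers: List[int],
                        n_pins: int = 2) -> List[int]:
    """Return the actual positions presumed faulty.

    config_numbers[i] is the configuration number reported with enumerated
    position i + 1. Assumes the failed chips sit next to each other.
    """
    modulus = 2 ** n_pins
    missing = total_chips - len(config_numbers)
    if missing <= 0:
        return []
    first_bad = None
    for position, number in enumerate(config_numbers, start=1):
        if position % modulus != number:
            first_bad = position
            break
    if first_bad is None:
        # Every responder is consistent, so the silent chips are assumed
        # to be at the tail of the link.
        first_bad = len(config_numbers) + 1
    return list(range(first_bad, first_bad + missing))


# Values from the example: seven chips, five responses numbered 1, 2, 1, 2, 3.
print(locate_faulty_chips(7, [1, 2, 1, 2, 3]))
```

For the values of the example the sketch reports positions 3 and 4, matching the two faulty node chips identified in the text.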
Therefore, based on the chip, the communication link and the fault chip detection method in the embodiment of the application, the node chip with the fault in the communication link can be determined quickly, accurately and conveniently.
In one embodiment, step S302 may include: sequentially judging whether each piece of physical position information and each piece of preset mark information are consistent with a preset association relation or not according to the sending sequence of the detection instruction in the communication link; if not, determining a fault node chip according to the first node chip inconsistent with the preset association relation; and repairing the fault node chip, and after repairing, continuing to perform fault detection according to the methods of the step S301 to the step S303.
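The repair-and-retest variant described above can be simulated with a small loop. This is a sketch under assumptions: the link is modelled as a list of booleans, "repair" is stood in for by flipping a flag, two preset mark pins are assumed, and fewer than four adjacent chips are assumed to fail at once. All function names are hypothetical.

```python
from typing import List, Optional

MODULUS = 4  # two preset mark pins give 2**2 configuration numbers


def enumerate_link(working: List[bool]) -> List[int]:
    """Simulated detection pass: configuration numbers of responding chips.

    A chip at actual position p is assumed to be wired to report p % MODULUS;
    silent chips are skipped, so later chips are enumerated earlier.
    """
    return [pos % MODULUS for pos, ok in enumerate(working, start=1) if ok]


def first_faulty(working: List[bool]) -> Optional[int]:
    """Position of the first inconsistency, or None if the link is healthy."""
    numbers = enumerate_link(working)
    for position, number in enumerate(numbers, start=1):
        if position % MODULUS != number:
            return position
    if len(numbers) < len(working):
        return len(numbers) + 1  # silent chip(s) at the tail of the link
    return None


def detect_and_repair(working: List[bool]) -> List[int]:
    """Repeat detection and repair; return the positions repaired, in order."""
    working = list(working)
    repaired = []
    for _ in range(len(working)):  # bounded: at most one repair per chip
        pos = first_faulty(working)
        if pos is None:
            break
        working[pos - 1] = True  # stand-in for physically repairing the chip
        repaired.append(pos)
    return repaired


print(detect_and_repair([True, True, False, False, True, True, True]))
```

In this simulated run the third chip is found and repaired first, after which the next pass exposes the fourth chip, so only the first inconsistent chip needs to be identified in each pass.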
Fig. 8 shows a block diagram of a faulty chip detection apparatus according to an embodiment of the present application. As shown in fig. 8, the faulty chip detection apparatus includes: a memory 801 and a processor 802, the memory 801 having stored therein instructions executable on the processor 802. The processor 802, when executing the instructions, implements the methods in the embodiments described above. There may be one or more memories 801 and one or more processors 802. The faulty chip detection apparatus is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The faulty chip detection apparatus may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
The faulty chip detection apparatus may further include a communication interface 803, which is used for communicating with an external device to exchange data. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor 802 may process instructions executed within the faulty chip detection apparatus, including instructions stored in or on a memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple faulty chip detection apparatuses may be connected, with each apparatus providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 801, the processor 802, and the communication interface 803 are integrated on a chip, the memory 801, the processor 802, and the communication interface 803 may complete communication with each other through an internal interface.
It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
Embodiments of the present application provide a computer-readable storage medium (such as the above-mentioned memory 801) storing computer instructions, which when executed by a processor, implement the method provided in embodiments of the present application.
Optionally, the memory 801 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to use of the faulty chip detection device, and the like. Further, the memory 801 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 801 optionally includes memory located remotely from processor 802, which may be connected to the faulty chip detection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Illustratively, the communication link is disposed on a circuit board, and the faulty chip detection device may be a control device of the circuit board.
The embodiment of the present application further provides a computing device, which may include the fault chip detection apparatus according to any embodiment of the present application. The computing device may also include a communication link of any embodiment of the present application.
According to the chip, the communication link and the fault chip detection method, device and computing equipment in the embodiment, one or more chips with faults in the communication link can be rapidly and accurately identified, so that the test efficiency is improved, and the stability of the equipment is improved.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more (two or more) executable instructions for implementing specific logical functions or steps in the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by hardware that is configured to be instructed to perform the relevant steps by a program, which may be stored in a computer-readable storage medium, and which, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (21)

1. A method for detecting a faulty chip is characterized by comprising the following steps:
acquiring preset mark information and physical position information of a node chip in a communication link;
and determining a fault node chip according to the preset mark information, the physical position information and a preset association relation.
2. The method according to claim 1, wherein the predetermined association relationship is a correspondence relationship between actual physical location information corresponding to the node chip and predetermined flag information.
3. The method according to claim 2, wherein the node chip is configured with at least one preset mark pin for receiving the preset mark information.
4. The method according to claim 3, wherein there is one preset mark pin of the node chip, and the preset association relationship includes:
when the actual physical position information of the node chip in the communication link is an odd number, the preset mark information is a first level value, and when the actual physical position information of the node chip in the communication link is an even number, the preset mark information is a second level value.
5. The method according to claim 3, wherein the node chip has a plurality of preset mark pins, and the preset association relationship includes:
when the actual physical position information of the node chip in the communication link is an odd number, the logical operation result of each piece of preset mark information is a first level value, and when the actual physical position information of the node chip in the communication link is an even number, the logical operation result of each piece of preset mark information is a second level value.
6. The faulty chip detection method according to claim 4 or 5, wherein the first level value is a low level, and the second level value is a high level; or, the first level value is a high level, and the second level value is a low level.
7. The method according to any one of claims 2 to 5, wherein the communication link includes at least two node chips connected in sequence, and the determining a faulty node chip according to the preset mark information, the physical location information, and a preset association relationship includes:
sequentially judging whether the preset mark information and the physical position information of each node chip are consistent with a preset association relation or not according to the sending sequence of the detection instructions in the communication link;
and if not, determining a fault node chip according to the first node chip inconsistent with the preset association relation.
8. The method according to claim 7, wherein determining the failed node chip according to the first node chip inconsistent with the preset association relationship comprises:
determining a corresponding node chip based on the first physical position information inconsistent with the preset association relation;
and determining the corresponding node chip as a fault node chip.
9. The method according to claim 1, wherein the node chip is configured with at least two preset tag pins for receiving preset tag information, the preset tag information being used for determining a configuration number for the node chip;
the determining a fault node chip according to the preset mark information, the physical position information and a preset association relation includes:
determining a configuration number for the node chip according to the preset mark information;
and determining a fault node chip according to the physical position information, the configuration number and a preset association relation.
10. The method according to claim 9, wherein the predetermined association relationship is a correspondence relationship between actual physical location information corresponding to the node chip and the configuration number.
11. The method according to claim 10, wherein the node chip is configured with N preset mark pins, wherein a value of N is equal to or greater than 2, and each of the N pieces of preset mark information corresponding to the N preset mark pins may be a first level value or a second level value, so as to generate 2^N configuration results in combination, the configuration numbers of the 2^N configuration results being 1, 2, ..., 2^N - 1, 0, respectively.
12. The faulty chip detection method according to claim 11,
the preset association relationship comprises: when the remainder of dividing the value corresponding to the actual physical location information by 2^N is M, the configuration number corresponding to the node chip is M.
13. The method according to claim 12, wherein the communication link includes at least two node chips connected in sequence, and determining a faulty node chip according to the physical location information, the configuration number, and a preset association relationship includes:
sequentially judging whether the physical position information of each node chip and the configuration number of each node chip are consistent with a preset association relation or not according to the sending sequence of the detection instructions in the communication link;
and if not, determining a fault node chip according to the first node chip inconsistent with the preset association relation.
14. The method according to claim 13, wherein determining a failed node chip according to a first node chip inconsistent with the preset association relationship comprises:
determining a corresponding node chip based on the first physical position information inconsistent with the preset association relation;
and determining the corresponding node chip as a fault node chip.
15. The method according to claim 1, wherein the obtaining of the preset mark information of the node chip in the communication link includes:
a node chip in the communication link is configured with a preset mark pin;
and acquiring preset marking information of the node chip through the preset marking pin.
16. The method for detecting the faulty chip according to claim 15, wherein the obtaining the preset mark information of the node chip through the preset mark pin includes:
sending preset mark information to a preset mark pin of the node chip;
and acquiring preset marking information received by a node chip in a communication link.
17. The method of claim 15, wherein the obtaining the physical location information of the node chip in the communication link comprises:
sending a detection instruction to a node chip in a communication link;
and acquiring the physical position information of the node chip in the communication link.
18. The method of claim 17, wherein sending a detection instruction to a node chip in a communication link to obtain physical location information of the node chip in the communication link comprises:
sending the detection instruction to the communication link;
each node chip in the communication link generates physical location information in response to the detection instruction;
and acquiring the physical position information.
19. A fault detection device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-18.
20. A computing device comprising the fault detection apparatus of claim 19.
21. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions for causing the computer to perform the method of any one of claims 1-18.
CN202211243265.5A 2022-10-11 2022-10-11 Fault chip detection method and device, computing equipment and storage medium Pending CN115616377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211243265.5A CN115616377A (en) 2022-10-11 2022-10-11 Fault chip detection method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211243265.5A CN115616377A (en) 2022-10-11 2022-10-11 Fault chip detection method and device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115616377A true CN115616377A (en) 2023-01-17

Family

ID=84861728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211243265.5A Pending CN115616377A (en) 2022-10-11 2022-10-11 Fault chip detection method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115616377A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952026A (en) * 2023-03-15 2023-04-11 燧原智能科技(成都)有限公司 Method, device, equipment and storage medium for positioning abnormity of virtual chip
CN116340072A (en) * 2023-05-25 2023-06-27 中诚华隆计算机技术有限公司 Fault detection method and device for multi-core chip
CN118245311A (en) * 2023-11-24 2024-06-25 浙江正泰仪器仪表有限责任公司 Chip detection method and chip detection device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination