CN119473738A - Device exception processing system, method and fault recovery device based on PCIe - Google Patents
- Publication number
- CN119473738A CN202411494843.1A
- Authority
- CN
- China
- Prior art keywords
- request
- circuit
- abnormality
- exception
- response
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
- G06F11/221—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test buses, lines or interfaces, e.g. stuck-at or open line faults
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
Embodiments of the invention disclose a PCIe-based device exception handling system, a PCIe-based device exception handling method, and a fault restorer. The system comprises a Host processor, a PCIe switch, a plurality of endpoint devices, and a fault restorer. A fault restorer is provided in each endpoint device, and each functional component in the endpoint device is monitored for exceptions and has its exceptions handled through the fault restorer. The fault restorer comprises an exception monitoring circuit, an exception reporting circuit, and an automatic response circuit. The exception monitoring circuit monitors read-write requests of the endpoint device in which it is located and, when an exception exists, sends an exception identifier to the exception reporting circuit and the automatic response circuit. The exception reporting circuit generates an exception reporting request according to the exception identifier and transmits it to the Host processor through the endpoint device in which it is located. The automatic response circuit generates, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located. Exception recovery overhead can thereby be reduced and system robustness improved.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a PCIe-based device exception handling system, method, and fault restorer.
Background
With the development of computer technology, system architectures based on Peripheral Component Interconnect Express (PCIe), a high-speed serial computer expansion bus standard, are widely used in personal computers, servers, and embedded systems. In PCIe-based systems, an endpoint device may raise a device exception because of a hardware fault or a software usage error in one of its functional components.
In the prior art, an endpoint device that stops responding because of an exception may cause the Host processor in the PCIe system to hang along with it. Conventionally, a device completion timeout can be detected by the Root Complex (RC) of the PCIe architecture, and the whole device or the functional component is then reset for recovery, but the impact and overhead on the Host processor are considerable. Moreover, in some cases the Host processor cannot complete the reset of a device or functional component because of certain constraints (for example, the reset procedure must wait until all responses have been returned). In addition, an exception that occurs when another endpoint device accesses the faulty endpoint device may also cause the Host processor in the PCIe system to hang.
On this basis, embodiments of the invention provide a PCIe-based device exception handling system that enables the Host processor to accurately locate the faulty functional component in an endpoint device, reduces exception recovery overhead, and improves system robustness.
Disclosure of Invention
The invention provides a PCIe-based device exception handling system, a PCIe-based device exception handling method, and a fault restorer, so as to reduce exception recovery overhead and improve system robustness.
According to an aspect of the present invention, there is provided a PCIe-based device exception handling system comprising a Host processor, a PCIe switch, a plurality of endpoint devices, and a fault restorer, wherein:
each endpoint device is connected to the Host processor through the PCIe switch;
a fault restorer is provided in each endpoint device, and each functional component in the endpoint device is monitored for exceptions and has its exceptions handled through the fault restorer;
the fault restorer comprises an exception monitoring circuit, an exception reporting circuit, and an automatic response circuit;
the exception monitoring circuit is configured to monitor read-write requests of the endpoint device in which it is located and, when an exception exists, to send an exception identifier to the exception reporting circuit and the automatic response circuit;
the exception reporting circuit is configured to generate an exception reporting request according to the exception identifier and to transmit the exception reporting request to the Host processor through the endpoint device in which it is located;
and the automatic response circuit is configured to generate, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
According to another aspect of the present invention, there is provided a PCIe-based device exception handling method, applied to the PCIe-based device exception handling system provided in any one of the embodiments of the present invention, the method including:
monitoring, by the exception monitoring circuit, read-write requests of the endpoint device in which it is located, and sending an exception identifier to the exception reporting circuit and the automatic response circuit when an exception exists;
generating, by the exception reporting circuit, an exception reporting request according to the exception identifier, and transmitting the exception reporting request to the Host processor through the endpoint device in which the exception reporting circuit is located;
and generating, by the automatic response circuit and according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
According to another aspect of the present invention, there is provided a fault restorer applied to the PCIe-based device exception handling system according to any one of the embodiments of the present invention, the fault restorer including an exception monitoring circuit, an exception reporting circuit, and an automatic response circuit, wherein:
the exception monitoring circuit is configured to monitor read-write requests of the endpoint device in which it is located and, when an exception exists, to send an exception identifier to the exception reporting circuit and the automatic response circuit;
the exception reporting circuit is configured to generate an exception reporting request according to the exception identifier and to transmit the exception reporting request to the Host processor through the endpoint device in which it is located;
and the automatic response circuit is configured to generate, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
According to another aspect of the invention, there is provided a computer comprising a PCIe-based device exception handling system as provided by any one of the embodiments of the invention.
The technical solution of the embodiments of the invention comprises a Host processor, a PCIe switch, a plurality of endpoint devices, and a fault restorer. Each endpoint device is connected to the Host processor through the PCIe switch. A fault restorer is provided in each endpoint device, and each functional component in the endpoint device is monitored for exceptions and has its exceptions handled through the fault restorer. The fault restorer comprises an exception monitoring circuit, an exception reporting circuit, and an automatic response circuit. The exception monitoring circuit monitors read-write requests of the endpoint device in which it is located and sends an exception identifier to the exception reporting circuit and the automatic response circuit when an exception exists. The exception reporting circuit generates an exception reporting request according to the exception identifier and transmits it to the Host processor through the endpoint device in which it is located. The automatic response circuit generates, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located. This addresses the problem that a functional exception in an endpoint device can easily hang the PCIe system: the faulty functional component in the endpoint device can be located accurately, exception recovery overhead is reduced, and system robustness is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a PCIe-based device exception handling system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a fault restorer provided according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an exception monitoring circuit according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an outstanding request table provided in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of an exception reporting circuit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an automatic response circuit according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an exception isolation circuit according to an embodiment of the present invention;
FIG. 8 is a flowchart of a PCIe-based device exception handling method provided in accordance with an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic structural diagram of a PCIe-based device exception handling system according to an embodiment of the present invention. The embodiment is applicable to fault recovery when a device exception occurs in a PCIe system. As shown in FIG. 1, the system includes a Host processor, a PCIe switch, a plurality of endpoint devices, and a fault restorer. Wherein:
Each endpoint device (PCIe EndPoint, PCIe EP) is connected to the Host processor through the PCIe switch. An endpoint device may receive read and write requests from the Host processor or from other endpoint devices and transmit data accordingly. The Host processor may be a central processing unit (CPU), a single processor chip, a processor group of multi-core processors, or the like. The Host processor is the core component responsible for performing computing tasks and controlling system operations.
In the embodiment of the invention, a fault restorer is provided in each endpoint device, and each functional component in the endpoint device is monitored for exceptions and has its exceptions handled through the fault restorer. As shown in fig. 1, the fault restorer may be connected between the endpoint device port (PCIe EP port) and a network on chip (NOC). The NOC may be implemented in a system on chip (SOC). The NOC is an on-chip centralized routing mechanism and may provide functions such as flow control management, request routing, and on-chip system management. As shown in fig. 1, each functional component may be connected to the fault restorer through the NOC, so that the multiple functions of the endpoint device can communicate. With the connection shown in fig. 1, exceptions can be monitored whenever a functional component of the endpoint device receives or responds to a read-write request, and fault recovery can be performed when an exception occurs.
Fig. 2 is a schematic structural diagram of a fault restorer provided according to an embodiment of the present invention. As shown in FIG. 2, in order to monitor and handle exceptions for each functional component in the endpoint devices of a PCIe system, the fault restorer in an embodiment of the present invention includes an exception monitoring circuit, an exception reporting circuit, and an automatic response circuit. The exception monitoring circuit monitors read-write requests of the endpoint device in which it is located and sends an exception identifier to the exception reporting circuit and the automatic response circuit when an exception exists. The exception reporting circuit generates an exception reporting request according to the exception identifier and transmits it to the Host processor through the endpoint device in which it is located. The automatic response circuit generates, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
The exception monitoring circuit may monitor the read-write requests of the endpoint device for several kinds of exception. For example, it may check one or more of the request handshake time, the response time, and the response data amount of a read-write request of the endpoint device in which it is located. Specifically, the exception monitoring circuit can determine whether the read-write request handshake has timed out, whether the response has timed out, whether the number of responses matches the request, whether a response carries an exception flag, and so on.
Fig. 3 is a schematic diagram of an anomaly monitoring circuit according to an embodiment of the present invention. As shown in FIG. 3, the anomaly monitoring circuit optionally includes a request handshake timer, an outstanding request table, and a completion time timer.
The request handshake timer measures the handshake time of a read-write request of the endpoint device in which it is located and determines that a request handshake timeout exception exists when the handshake time exceeds a preset handshake threshold. In that case, the exception monitoring circuit may send the exception identifier of the request handshake timeout exception to the exception reporting circuit. The exception reporting circuit may record the exception information corresponding to the request handshake timeout exception, generate an exception reporting request, and transmit it to the Host processor. This exception information includes, but is not limited to, the requesting function identifier, the request address, the identifier of the accessed functional component, the request length, and the exception type of the corresponding request. The automatic response circuit may then generate automatic response information matching the data amount of the read-write request, in place of the response of the endpoint device in which it is located.
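The handshake check described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `valid`/`ready` signal names, the cycle-based threshold, and the `HandshakeTimer` class are hypothetical and are not taken from the embodiment, which describes a hardware circuit.

```python
from dataclasses import dataclass

HANDSHAKE_THRESHOLD = 8  # hypothetical preset handshake threshold, in cycles


@dataclass
class HandshakeTimer:
    """Counts cycles while a request is presented but not yet accepted."""
    threshold: int = HANDSHAKE_THRESHOLD
    elapsed: int = 0

    def tick(self, valid: bool, ready: bool) -> str:
        """Advance one cycle and return 'idle', 'waiting', 'done' or 'timeout'."""
        if not valid:
            self.elapsed = 0
            return "idle"
        if ready:
            self.elapsed = 0
            return "done"      # handshake completed normally
        self.elapsed += 1
        if self.elapsed > self.threshold:
            self.elapsed = 0
            return "timeout"   # handshake time exceeded the threshold
        return "waiting"
```

In this model a `"done"` result corresponds to recording the request as outstanding, while `"timeout"` corresponds to raising the request handshake timeout exception identifier.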
As shown in fig. 3, when the anomaly monitoring circuit determines that the read-write request of the located endpoint device normally completes handshake through the request handshake timer, the read-write request is recorded in the outstanding request table.
Fig. 4 is a schematic diagram of an outstanding request table according to an embodiment of the present invention. As shown in FIG. 4, the outstanding request table may be organized as a linked list. Specifically, read-write requests can be chained into a linked list in ID order. Each row in the outstanding request table may be a separate linked-list entry. A row may include linked-list management information such as the read-write request ID, a valid flag, a list-head flag, a list-tail flag, and the identifier of the next entry in the list. A row may further include the number of responses still outstanding for the read-write request, the time remaining of the preset lifetime of the read-write request, and other management information, which includes, but is not limited to, the ID and address of the read-write request. When the outstanding request table records a read-write request, it may request a free entry from a first-in first-out (FIFO) memory to obtain a free entry number. When the response to the read-write request completes and the entry is cleared from the outstanding request table, the entry number is returned to the FIFO and the FIFO's free entries are updated.
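As an illustration of this table layout, the following sketch models the rows, the free-entry FIFO, and the linked-list flags in software. It assumes one chain per request ID, which is one possible reading of "chained in ID order"; the class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    valid: bool = False
    req_id: int = 0
    address: int = 0
    remaining_responses: int = 0
    remaining_lifetime: int = 0
    is_head: bool = False
    is_tail: bool = False
    next_index: Optional[int] = None


class OutstandingRequestTable:
    def __init__(self, depth: int):
        self.rows = [Entry() for _ in range(depth)]
        self.free_fifo = deque(range(depth))  # free row numbers
        self.head = {}  # req_id -> row of the oldest outstanding request
        self.tail = {}  # req_id -> row of the newest outstanding request

    def record(self, req_id: int, address: int, responses: int, lifetime: int) -> int:
        """Take a free row from the FIFO and append it to the chain for req_id."""
        if not self.free_fifo:
            raise RuntimeError("outstanding request table is full")
        idx = self.free_fifo.popleft()
        first = req_id not in self.tail
        self.rows[idx] = Entry(True, req_id, address, responses, lifetime,
                               is_head=first, is_tail=True)
        if first:
            self.head[req_id] = idx
        else:
            prev = self.rows[self.tail[req_id]]
            prev.is_tail = False
            prev.next_index = idx
        self.tail[req_id] = idx
        return idx

    def clear_oldest(self, req_id: int) -> int:
        """Clear the head row of the chain and return its number to the FIFO."""
        idx = self.head[req_id]
        nxt = self.rows[idx].next_index
        self.rows[idx] = Entry()
        self.free_fifo.append(idx)
        if nxt is None:
            del self.head[req_id]
            del self.tail[req_id]
        else:
            self.head[req_id] = nxt
            self.rows[nxt].is_head = True
        return idx
```

Keeping the free row numbers in a FIFO means allocation and release are both constant-time, and the head and tail flags let each chain be walked without scanning the whole table.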
As shown in fig. 3, the completion time timer times the read-write requests in the outstanding request table and determines that a request response timeout exception exists when no response has been received by the time the preset lifetime is reached. The completion time timer may count up or count down. For example, it may count up from 0; if no response has been received when the preset lifetime is reached, a request response timeout exception is determined to exist. Alternatively, it may count down from the preset lifetime; if no response has been received when the count reaches 0, a request response timeout exception is determined to exist.
In that case, the exception monitoring circuit may send the exception identifier of the request response timeout exception to the exception reporting circuit. The exception reporting circuit may record the exception information corresponding to the request response timeout exception, generate an exception reporting request, and transmit it to the Host processor. This exception information includes, but is not limited to, the requesting function identifier, the request address, the identifier of the accessed functional component, the request length, and the exception type of the corresponding request. The automatic response circuit may then generate automatic response information matching the data amount of the read-write request, in place of the response of the endpoint device in which it is located.
In the embodiment of the invention, when the exception monitoring circuit determines through the completion time timer that a read-write request of the endpoint device in which it is located was answered before the preset lifetime was reached, the corresponding read-write request is cleared from the outstanding request table.
On the basis of the above embodiment, the exception monitoring circuit is optionally further configured to query the outstanding request table before clearing the corresponding read-write request and to determine that a response data amount exception exists when the response data amount does not match the data amount of the corresponding read-write request. A mismatch here means that the completed response data amount is greater or less than the read-write request data amount. The exception monitoring circuit may send the exception identifier of the response data amount exception to the exception reporting circuit. The exception reporting circuit may record the exception information corresponding to the response data amount exception, generate an exception reporting request, and transmit it to the Host processor. This exception information includes, but is not limited to, the requesting function identifier, the request address, the identifier of the accessed functional component, the request length, and the exception type of the corresponding request. The automatic response circuit may then truncate and discard the responses that exceed the read-write request data amount, or generate responses for the remaining amount of the read-write request, to obtain the automatic response information that stands in for the response of the endpoint device in which it is located.
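The two timing directions and the clear-on-response behaviour can be sketched as follows. This is a minimal model, assuming time is measured in discrete ticks and that requests are tracked in a plain dictionary; the function names are hypothetical.

```python
def tick_count_down(remaining: dict) -> list:
    """remaining maps a request key to its remaining lifetime in ticks.

    Decrements every entry and returns the keys that reached 0 without a response.
    """
    expired = []
    for key in list(remaining):
        remaining[key] -= 1
        if remaining[key] <= 0:
            expired.append(key)
            del remaining[key]
    return expired


def tick_count_up(elapsed: dict, lifetime: int) -> list:
    """elapsed maps a request key to ticks since it was recorded (starting at 0)."""
    expired = []
    for key in list(elapsed):
        elapsed[key] += 1
        if elapsed[key] >= lifetime:
            expired.append(key)
            del elapsed[key]
    return expired


def complete(tracked: dict, key) -> bool:
    """A response arrived in time: clear the request so it can no longer time out."""
    return tracked.pop(key, None) is not None
```

Both directions flag a request after the same number of ticks; the choice mainly affects whether the table stores elapsed or remaining time.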
When the exception monitoring circuit determines that an exception exists, the exception reporting circuit can generate an exception reporting request and transmit it to the Host processor. In an optional implementation of the embodiment of the invention, the exception reporting circuit is specifically configured to: obtain, through the endpoint device in which it is located and according to the exception identifier, the requesting function identifier, request address, accessed functional component identifier, request length, and exception type of the corresponding read-write request, and generate exception information from them; place the exception information in an exception queue; and take the exception information out of the exception queue in order, generate an exception reporting request, and transmit it to the Host processor. The exception reporting request includes at least one of an interrupt request, an advanced error reporting request, and a vendor-specific capability request.
Fig. 5 is a schematic structural diagram of an exception reporting circuit according to an embodiment of the present invention. As shown in fig. 5, the exception reporting circuit may push exception information into the exception queue and then pop exception information from the queue in order to generate exception reporting requests. A push adds exception information at the tail of the exception queue; a pop removes exception information from the head of the exception queue. The exception reporting request may be an interrupt request (INT), an Advanced Error Reporting (AER) request, or a Vendor-Specific Capability (VSC) request.
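The push and pop behaviour of the exception queue might be modelled as below. The record fields follow the exception information listed above, but the class names, the field names, and the dictionary used to represent a report are illustrative assumptions; the real reporting path would use PCIe interrupt, AER, or VSC mechanisms rather than Python objects.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class ReportKind(Enum):
    INT = "interrupt"
    AER = "advanced_error_reporting"
    VSC = "vendor_specific_capability"


@dataclass(frozen=True)
class ExceptionInfo:
    requester_function_id: int
    request_address: int
    target_component_id: int
    request_length: int
    exception_type: str


class ExceptionReporter:
    def __init__(self, kind: ReportKind = ReportKind.INT):
        self.kind = kind
        self.queue = deque()

    def push(self, info: ExceptionInfo) -> None:
        """Add exception information at the tail of the queue."""
        self.queue.append(info)

    def pop_report(self):
        """Remove the head of the queue and wrap it as a report, or return None."""
        if not self.queue:
            return None
        info = self.queue.popleft()
        return {"kind": self.kind.value, "info": info}
```

Queuing the records decouples detection from reporting, so several exceptions raised in quick succession are still delivered to the Host processor in the order they were detected.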
While the exception reporting circuit reports the exception to the Host processor, the automatic response circuit can generate, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
In an optional implementation of the embodiment of the invention, the automatic response circuit comprises a response switching control sub-circuit and an automatic response generation sub-circuit. The response switching control sub-circuit switches the original response link over to the automatic response generation sub-circuit according to the exception identifier. The automatic response generation sub-circuit generates, according to the exception identifier, automatic response information that stands in for the response of the endpoint device in which it is located. The response switching control sub-circuit also switches from the automatic response generation sub-circuit back to the original response link after the automatic response information has been sent.
Fig. 6 is a schematic structural diagram of an automatic response circuit according to an embodiment of the present invention. As shown in fig. 6, the automatic response circuit includes a response switching control sub-circuit and an automatic response generation sub-circuit. The response switching control sub-circuit switches the original response link over to the automatic response generation sub-circuit according to the exception identifier. Specifically, the switch takes place when the read-write request handshake times out, when the response returned by the functional component through the endpoint device times out, when the response data amount falls short of the read-write request data amount, or when the response data amount exceeds the read-write request data amount.
As shown in fig. 6, the automatic response generation sub-circuit generates automatic response information according to the exception identifier, in place of the response of the endpoint device in which it is located. When the exception identifier indicates a request handshake timeout exception, it generates automatic response information matching the read-write request data amount. When the exception identifier indicates a request response timeout exception, it likewise generates automatic response information matching the read-write request data amount. When the exception identifier indicates a response data amount exception, it truncates and discards the surplus responses for the read-write request, or generates responses for the remaining amount of the read-write request.
After the automatic response information has been sent, the response switching control sub-circuit switches from the automatic response generation sub-circuit back to the original response link. Switching back allows responses from normally operating functional components to be routed as usual after a functional component has failed and before the Host processor has completed exception recovery.
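The switch-over, generation, and switch-back sequence can be sketched in software as follows. The embodiment does not specify the payload of an automatic response, so the fill value here is an assumption, as are the exception type strings, the "beats" unit for data amount, and the `AutoResponder` class itself.

```python
TIMEOUTS = ("handshake_timeout", "response_timeout")
KNOWN = TIMEOUTS + ("response_amount",)


class AutoResponder:
    def __init__(self, fill: int = 0):
        self.fill = fill          # hypothetical payload for generated beats
        self.link = "original"    # which path currently drives the response link
        self.trace = []           # history of link switches, for inspection

    def _switch(self, link: str) -> None:
        self.link = link
        self.trace.append(link)

    def handle(self, exception_type, request_beats: int, received=()):
        """Return the beats sent upstream for one request."""
        received = list(received)
        if exception_type is None:
            return received                      # normal path, no switch
        if exception_type not in KNOWN:
            raise ValueError(f"unknown exception type: {exception_type}")
        self._switch("auto")                     # hand the link to the generator
        if exception_type in TIMEOUTS:
            response = [self.fill] * request_beats
        else:
            kept = received[:request_beats]      # truncate and discard surplus
            response = kept + [self.fill] * (request_beats - len(kept))
        self._switch("original")                 # restore the original link
        return response
```

In every exceptional case the requester receives exactly the number of beats it asked for, which is what keeps it from waiting indefinitely on the faulty component.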
In the embodiment of the invention, the exception before the host processor performs exception recovery can still be subjected to exception monitoring, exception reporting and automatic response. To reduce the repeated monitoring of anomalies, the response of an anomaly functional component that has been monitored before the host processor performs the anomaly recovery can be handled by an anomaly isolation circuit.
Through the fault restorer formed by the abnormality monitoring circuit, the abnormality reporting circuit and the automatic response circuit, when each functional component in the endpoint equipment is abnormal, the automatic response can be carried out, so that the normal use of the endpoint equipment is ensured, and the crash of a Host processor is avoided. The exception reporting request is transmitted to the Host processor, so that the Host processor can locate the exception and perform operations such as recovery processing on specific functional components of specific endpoint devices.
Specifically, the Host processor may query the exception information upon receiving an interrupt request, an advanced error reporting request, or a vendor-specific capability request. Based on the exception information, the Host processor locates the abnormal functional component of the endpoint device and flushes and isolates it, for example by configuring the NOC. The Host processor can then perform exception recovery operations such as resetting and initializing the abnormal functional component. This avoids the risk of the whole PCIe system hanging and improves system robustness. Because the reset and initialization are confined to a limited scope, the exception recovery overhead is also reduced.
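The host-side sequence described above (locate, isolate, reset, initialize, then release the isolation) can be sketched as follows. This is an illustrative model only: `NocConfig`, `recover_component` and the `component_id` key are hypothetical names, and the reset and initialization steps are represented by log entries rather than real register accesses.

```python
from __future__ import annotations


class NocConfig:
    """Hypothetical stand-in for the host's view of NOC isolation settings."""

    def __init__(self) -> None:
        self.isolated: set[int] = set()

    def isolate(self, component_id: int) -> None:
        self.isolated.add(component_id)

    def release(self, component_id: int) -> None:
        self.isolated.discard(component_id)


def recover_component(noc: NocConfig, exception_info: dict, log: list[str]) -> int:
    """Run one recovery pass for the component named in exception_info."""
    component_id = exception_info["component_id"]  # locate the abnormal component
    noc.isolate(component_id)                      # fence it off via the NOC
    log.append(f"isolate {component_id}")
    log.append(f"reset {component_id}")            # placeholder for the real reset
    log.append(f"initialize {component_id}")       # placeholder for re-initialization
    noc.release(component_id)                      # lift isolation after recovery
    log.append(f"release {component_id}")
    return component_id
```

Only the named component is touched, which mirrors the point that recovery is confined to a limited scope rather than resetting the whole device.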
On the basis of the above embodiment, optionally, to further ensure the robustness of the system before and during exception recovery, the fault restorer further comprises an exception isolation circuit, as shown in fig. 2. The exception isolation circuit is used for acquiring the exception reporting request, determining the abnormal functional component according to the exception reporting request, and discarding the information when a read-write request is directed to the abnormal functional component or a response comes from the abnormal functional component. An exception isolation circuit may also be provided in the NOC to isolate requests to and responses from functional components, as shown in fig. 1. The control information of the exception isolation circuit in the NOC may come from processor configuration. Once the exception isolation circuit in the NOC is configured to isolate a functional component, all requests directed to that functional component are automatically responded to by the exception isolation circuit. Requests to other functional components are not affected.
Fig. 7 is a schematic structural diagram of an exception isolation circuit according to an embodiment of the present invention. As shown in fig. 7, the exception isolation circuit mainly determines whether a read-write request is directed to an abnormal functional component and whether a response comes from an abnormal functional component. Specifically, the exception isolation circuit may determine the abnormal functional component according to the identification information of the abnormal functional component in the exception reporting request. It may determine whether a read-write request is directed to the abnormal functional component based on the request address of the read-write request, and may determine whether a response comes from the abnormal functional component using the functional component identification information carried in the response packet. The exception isolation circuit may discard read-write requests directed to the abnormal functional component and responses coming from it. Normal communications, such as requests to or responses from functional components that are not abnormal, are forwarded normally. After the abnormal functional component has been restored, the Host processor may release its isolation, for example by configuring the NOC.
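The filtering decisions of the exception isolation circuit can be modelled in a few lines. The sketch below assumes each functional component owns one contiguous address window; the `IsolationFilter` class, its method names and the address layout are hypothetical and serve only to illustrate the address-based and identifier-based checks.

```python
from __future__ import annotations


class IsolationFilter:
    """Software model of the request/response filtering decisions."""

    def __init__(self, address_map: dict[int, range]) -> None:
        # component id -> address window (assumed layout, for illustration)
        self.address_map = address_map
        self.isolated: set[int] = set()

    def on_exception_report(self, component_id: int) -> None:
        # The reporting request names the abnormal component.
        self.isolated.add(component_id)

    def release(self, component_id: int) -> None:
        # Called once the host has restored the component.
        self.isolated.discard(component_id)

    def forward_request(self, address: int) -> bool:
        # Requests are matched by address; isolated targets are dropped.
        return not any(address in self.address_map[c] for c in self.isolated)

    def forward_response(self, source_component_id: int) -> bool:
        # Responses are matched by the component id carried in the packet.
        return source_component_id not in self.isolated
```

Traffic for components that are not isolated passes both checks unchanged, so isolation of one component does not disturb the others.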
According to the technical scheme of this embodiment, a PCIe-based device exception handling system is constructed. The system comprises a Host processor, a PCIe switch, a plurality of endpoint devices and a fault restorer. Each endpoint device is connected with the Host processor through the PCIe switch, a fault restorer is arranged in each endpoint device, and each functional component in the endpoint device performs exception monitoring and exception handling through the fault restorer. The fault restorer comprises an abnormality monitoring circuit, an abnormality reporting circuit and an automatic response circuit. The abnormality monitoring circuit is used for monitoring read-write requests of the endpoint device and sending an abnormality identifier to the abnormality reporting circuit and the automatic response circuit when it determines that an abnormality exists. The abnormality reporting circuit is used for generating, via the endpoint device in which it is located, an exception reporting request according to the abnormality identifier and transmitting the exception reporting request to the Host processor. The automatic response circuit is used for generating, according to the abnormality identifier, automatic response information that stands in for the response of the endpoint device in which it is located. This solves the problem that a functional abnormality of an endpoint device easily causes the PCIe system to hang, allows the faulty functional component in the endpoint device to be located accurately, reduces exception recovery overhead, and improves system robustness.
FIG. 8 is a flowchart of a PCIe-based device exception handling method according to an embodiment of the present invention, which may be applied to a PCIe-based device exception handling system according to any of the embodiments of the present invention. As shown in fig. 8, the method includes:
Step 810: monitoring, by the abnormality monitoring circuit, read-write requests of the endpoint device, and sending an abnormality identifier to the abnormality reporting circuit and the automatic response circuit when an abnormality exists.
In this step, the abnormality monitoring circuit times the handshake of each read-write request of the endpoint device with the request handshake timer, and determines that a request handshake timeout abnormality exists when the handshake time exceeds a preset handshake threshold. When the abnormality monitoring circuit determines through the request handshake timer that a read-write request of the endpoint device has completed its handshake normally, the read-write request is recorded in the incomplete request table. The completion time timer times the read-write requests in the incomplete request table; when the preset survival time is reached and no response has been received, a request response timeout abnormality is determined to exist. When the abnormality monitoring circuit determines through the completion time timer that the response to a read-write request of the endpoint device has been received before the preset survival time is reached, the corresponding read-write request is cleared from the incomplete request table.
Based on the above embodiment, optionally, the method further comprises: before clearing the corresponding read-write request from the incomplete request table, querying the incomplete request table by the abnormality monitoring circuit, and determining that a response data volume abnormality exists when the response data volume does not match the data volume of the corresponding read-write request.
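The monitoring logic of step 810 can be approximated by a small software model with explicit cycle counts. This is a sketch under assumptions: `AnomalyMonitor` and its method names are hypothetical, time is an integer cycle counter supplied by the caller, and removing a timed-out entry from the table is an assumption, since the source does not state when such entries are retired.

```python
from __future__ import annotations


class AnomalyMonitor:
    """Model of the handshake timer, incomplete request table and completion timer."""

    def __init__(self, handshake_limit: int, time_to_live: int) -> None:
        self.handshake_limit = handshake_limit  # preset handshake threshold
        self.time_to_live = time_to_live        # preset survival time
        # tag -> (expected length, cycle at which the handshake completed)
        self.outstanding: dict[int, tuple[int, int]] = {}

    def on_handshake(self, tag: int, length: int, handshake_cycles: int,
                     now: int) -> str | None:
        if handshake_cycles > self.handshake_limit:
            return "handshake_timeout"
        # Handshake completed normally: record the request as incomplete.
        self.outstanding[tag] = (length, now)
        return None

    def on_response(self, tag: int, length: int) -> str | None:
        # Query the table before clearing the entry, then compare data volumes.
        entry = self.outstanding.pop(tag, None)
        if entry is None:
            return None  # unknown tag; ignored in this sketch
        expected, _ = entry
        return None if length == expected else "response_size_mismatch"

    def expire(self, now: int) -> list[int]:
        # Tags whose survival time elapsed without a response.
        late = [tag for tag, (_, start) in self.outstanding.items()
                if now - start >= self.time_to_live]
        for tag in late:
            del self.outstanding[tag]  # assumed: timed-out entries are retired
        return late
```

Each non-`None` return value, and each tag returned by `expire`, corresponds to an abnormality identifier that would be sent to the abnormality reporting circuit and the automatic response circuit.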
Step 820: generating, by the abnormality reporting circuit via the endpoint device in which it is located, an exception reporting request according to the abnormality identifier, and transmitting the exception reporting request to the Host processor.
In this step, the abnormality reporting circuit obtains, via the endpoint device in which it is located and according to the abnormality identifier, the request function identifier, request address, accessed functional component identifier, request length and abnormality type of the corresponding read-write request, and generates exception information. The abnormality reporting circuit then builds an exception queue from the exception information, takes the exception information out of the queue in order, generates an exception reporting request, and transmits it to the Host processor. The exception reporting request includes at least one of an interrupt request, an advanced error reporting request and a vendor-specific capability request.
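The record-then-drain behaviour of the abnormality reporting circuit can be sketched with a first-in, first-out queue. The field names of `ExceptionInfo`, the `ExceptionReporter` class and the channel strings are hypothetical labels chosen for this illustration; only the five recorded items and the three report types come from the description.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass

# The three report types named in the description (labels are assumptions).
CHANNELS = ("interrupt", "advanced_error_report", "vendor_specific_capability")


@dataclass(frozen=True)
class ExceptionInfo:
    requester_id: int         # request function identifier
    address: int              # request address
    target_component_id: int  # accessed functional component identifier
    length: int               # request length
    kind: str                 # abnormality type


class ExceptionReporter:
    def __init__(self, channel: str = "interrupt") -> None:
        if channel not in CHANNELS:
            raise ValueError(f"unsupported channel: {channel}")
        self.channel = channel
        self.queue: deque[ExceptionInfo] = deque()

    def record(self, info: ExceptionInfo) -> None:
        self.queue.append(info)

    def drain(self) -> list[tuple[str, ExceptionInfo]]:
        # Take entries out in arrival order and wrap each as a report.
        reports = []
        while self.queue:
            reports.append((self.channel, self.queue.popleft()))
        return reports
```

Queuing decouples detection from reporting, so several abnormalities detected close together are reported one after another in the order they occurred.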
Step 830: generating, by the automatic response circuit and according to the abnormality identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
The automatic response circuit comprises a response switching control sub-circuit and an automatic response generation sub-circuit. Generating the response through the automatic response circuit according to the abnormality identifier comprises: switching, by the response switching control sub-circuit, the original response link to the automatic response generation sub-circuit according to the abnormality identifier; generating, by the automatic response generation sub-circuit, the automatic response information according to the abnormality identifier; and switching, by the response switching control sub-circuit, from the automatic response generation sub-circuit back to the original response link after the automatic response has been delivered.
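The role of the response switching control sub-circuit can be pictured as a two-input selector with one bit of state. The sketch below is a simplified model; `ResponseSwitch` and its method names are hypothetical.

```python
class ResponseSwitch:
    """Selects between the original response link and the generated response."""

    def __init__(self) -> None:
        self.use_generated = False  # start on the original response link

    def on_anomaly(self) -> None:
        # An abnormality identifier arrived: route to the generation sub-circuit.
        self.use_generated = True

    def on_generated_delivered(self) -> None:
        # The automatic response was delivered: return to the original link.
        self.use_generated = False

    def select(self, original, generated):
        return generated if self.use_generated else original
```

Switching back as soon as the generated response has been delivered keeps responses from healthy functional components flowing over the original link.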
Generating the automatic response information by the automatic response generation sub-circuit according to the abnormality identifier comprises: when the abnormality identifier indicates a request handshake timeout abnormality, generating automatic response information matching the data volume of the read-write request; when the abnormality identifier indicates a request response timeout abnormality, generating automatic response information matching the data volume of the read-write request; and when the abnormality identifier indicates a response data volume abnormality, truncating and discarding the surplus response data for the read-write request or generating a response for the remaining data volume of the read-write request. In each case the result is used as the response of the endpoint device.
On the basis of the above embodiment, optionally, the fault restorer further comprises an exception isolation circuit, and the method further comprises: acquiring the exception reporting request through the exception isolation circuit, determining the abnormal functional component according to the exception reporting request, and discarding the information when a read-write request is directed to the abnormal functional component or a response comes from the abnormal functional component.
According to the technical scheme of this embodiment, read-write requests of the endpoint device are monitored by the abnormality monitoring circuit, and an abnormality identifier is sent to the abnormality reporting circuit and the automatic response circuit when an abnormality exists. The abnormality reporting circuit generates, via the endpoint device in which it is located, an exception reporting request according to the abnormality identifier and transmits it to the Host processor. The automatic response circuit generates, according to the abnormality identifier, automatic response information that stands in for the response of the endpoint device. This solves the problem that a functional abnormality of an endpoint device easily causes the PCIe system to hang, allows the faulty functional component in the endpoint device to be located accurately, reduces exception recovery overhead, and improves system robustness.
The embodiment of the invention also provides a fault restorer. The fault restorer is applied to the PCIe-based device exception handling system provided by any embodiment of the invention. The fault restorer comprises an abnormality monitoring circuit, an abnormality reporting circuit and an automatic response circuit, wherein the abnormality monitoring circuit is used for monitoring read-write requests of the endpoint equipment, sending an abnormality identification to the abnormality reporting circuit and the automatic response circuit when determining that abnormality exists, the abnormality reporting circuit is used for generating an abnormality reporting request according to the abnormality identification through the endpoint equipment and transmitting the abnormality reporting request to a Host processor, and the automatic response circuit is used for generating a response of the endpoint equipment where the automatic response information agent is located according to the abnormality identification.
The abnormality monitoring circuit comprises a request handshake timer, an incomplete request table and a completion time timer. The request handshake timer is used for timing the handshake of read-write requests of the endpoint device, and a request handshake timeout abnormality is determined to exist when the handshake time exceeds a preset handshake threshold. When the abnormality monitoring circuit determines through the request handshake timer that a read-write request of the endpoint device has completed its handshake normally, the read-write request is recorded in the incomplete request table. The completion time timer is used for timing the read-write requests in the incomplete request table, and a request response timeout abnormality is determined to exist when the preset survival time is reached and no response has been received. When the abnormality monitoring circuit determines through the completion time timer that the response to a read-write request of the endpoint device has been received before the preset survival time is reached, the corresponding read-write request is cleared from the incomplete request table.
Optionally, the abnormality monitoring circuit is further configured to query the incomplete request table before clearing the corresponding read-write request in the incomplete request table, and determine that there is an abnormality in the response data amount when the response data amount does not match the corresponding read-write request data amount.
The exception reporting circuit is specifically configured to obtain, by the located endpoint device, a request function identifier, a request address, an accessed function component identifier, a request length, and an exception type of a corresponding read-write request according to the exception identifier, generate exception information, generate an exception queue according to the exception information, sequentially take out the exception information from the exception queue, generate an exception reporting request, and transmit the exception reporting request to the Host processor, where the exception reporting request includes at least one of an interrupt request, an advanced error reporting request, and a vendor specification capability request.
The automatic response circuit comprises a response switching control sub-circuit and an automatic response generation sub-circuit, wherein the response switching control sub-circuit is used for switching an original response link into the automatic response generation sub-circuit according to an abnormal identifier, the automatic response generation sub-circuit is used for generating a response of the endpoint equipment where the automatic response information agent is located according to the abnormal identifier, and the response switching control sub-circuit is also used for switching the automatic response generation sub-circuit into the original response link after the automatic response information is responded.
The automatic response generation sub-circuit is specifically used for generating a response of the endpoint device where the automatic response information agent corresponding to the read-write request data volume is located when the abnormality identification is a request handshake timeout abnormality, generating a response of the endpoint device where the automatic response information agent corresponding to the read-write request data volume is located when the abnormality identification is a request response timeout abnormality, and intercepting and discarding redundant response corresponding to the read-write request or generating a residual volume response corresponding to the read-write request when the abnormality identification is a response data volume abnormality to obtain the response of the endpoint device where the automatic response information agent is located.
Optionally, the fault restorer further comprises an exception isolation circuit, wherein the exception isolation circuit is used for acquiring the exception reporting request, determining the abnormal functional component according to the exception reporting request, and discarding the information when a read-write request is directed to the abnormal functional component or a response comes from the abnormal functional component.
According to the technical scheme provided by the embodiment of the invention, the fault restorer solves the problem that a functional abnormality of an endpoint device easily causes the PCIe system to hang, allows the faulty functional component in the endpoint device to be located accurately, reduces exception recovery overhead, and improves system robustness.
Fig. 9 is a schematic structural diagram of a computer according to an embodiment of the present invention. As shown in FIG. 9, the computer includes a PCIe-based device exception handling system as provided by any of the embodiments of the invention. Through the PCIe-based device exception handling system, the problem that a functional abnormality of an endpoint device easily causes the PCIe system to hang is solved, the faulty functional component in the endpoint device can be located accurately, exception recovery overhead is reduced, and system robustness is improved.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. A PCIe-based device exception handling system, characterized in that the system comprises a Host processor, a PCIe switch, a plurality of endpoint devices and a fault restorer, wherein:
each endpoint device is connected with the Host processor through the PCIe switch;
each endpoint device is provided with a fault restorer, and each functional component in the endpoint device performs abnormality monitoring and abnormality processing through the fault restorer;
the fault restorer comprises an abnormality monitoring circuit, an abnormality reporting circuit and an automatic response circuit;
the abnormality monitoring circuit is used for monitoring read-write requests of the endpoint device and sending an abnormality identifier to the abnormality reporting circuit and the automatic response circuit when an abnormality exists;
the abnormality reporting circuit is used for generating, via the endpoint device in which it is located, an exception reporting request according to the abnormality identifier, and transmitting the exception reporting request to the Host processor;
and the automatic response circuit is used for generating, according to the abnormality identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
2. The system of claim 1, wherein the abnormality monitoring circuit comprises a request handshake timer, an incomplete request table, and a completion time timer;
the request handshake timer is used for timing the handshake of read-write requests of the endpoint device, and a request handshake timeout abnormality is determined to exist when the handshake time exceeds a preset handshake threshold;
when the abnormality monitoring circuit determines through the request handshake timer that a read-write request of the endpoint device has completed its handshake normally, the read-write request is recorded in the incomplete request table;
the completion time timer is used for timing the read-write requests in the incomplete request table, and a request response timeout abnormality is determined to exist when the preset survival time is reached and no response has been received;
and when the abnormality monitoring circuit determines through the completion time timer that the response to a read-write request of the endpoint device has been received before the preset survival time is reached, the corresponding read-write request is cleared from the incomplete request table.
3. The system of claim 2, wherein the anomaly monitoring circuit is further configured to:
Before the corresponding read-write request in the incomplete request table is cleared, the incomplete request table is queried, and when the response data volume is not matched with the corresponding read-write request data volume, the response data volume is determined to be abnormal.
4. The system of claim 1, wherein the exception reporting circuit is specifically configured to:
acquiring a request function identifier, a request address, an accessed function component identifier, a request length and an abnormality type corresponding to a read-write request by the endpoint equipment according to the abnormality identifier, and generating abnormality information;
Generating an exception queue according to the exception information, sequentially taking out the exception information from the exception queue, generating an exception reporting request, and transmitting the exception reporting request to a Host processor;
Wherein the exception reporting request includes at least one of an interrupt request, an advanced error reporting request, and a vendor specific capability request.
5. The system of claim 1, wherein the automatic response circuit comprises a response switching control sub-circuit and an automatic response generation sub-circuit;
The response switching control sub-circuit is used for switching the original response link into an automatic response generation sub-circuit according to the abnormal identifier;
An automatic response generation sub-circuit for generating a response of the endpoint device where the automatic response information agent is located according to the abnormal identifier;
the response switching control sub-circuit is also used for switching the automatic response generation sub-circuit into an original response link after the automatic response information is answered.
6. The system of claim 5, wherein the automatic response generation subcircuit is specifically configured to:
when the abnormality identifier indicates a request handshake timeout abnormality, generating automatic response information that matches the data volume of the read-write request and stands in for the response of the endpoint device;
when the abnormality identifier indicates a request response timeout abnormality, generating automatic response information that matches the data volume of the read-write request and stands in for the response of the endpoint device;
when the abnormality identifier indicates a response data volume abnormality, truncating and discarding the surplus response data for the read-write request or generating a response for the remaining data volume of the read-write request, to obtain the response that stands in for the response of the endpoint device.
7. The system of claim 1, wherein the fault restorer further comprises an anomaly isolation circuit;
the exception isolation circuit is used for acquiring the exception reporting request, determining the abnormal functional component according to the exception reporting request, and discarding the information when a read-write request is directed to the abnormal functional component or a response comes from the abnormal functional component.
8. A PCIe-based device exception handling method, wherein the method is applied to a PCIe-based device exception handling system according to any one of claims 1-7, the method comprising:
monitoring read-write requests of the endpoint device through an abnormality monitoring circuit, and sending an abnormality identifier to an abnormality reporting circuit and an automatic response circuit when an abnormality exists;
generating, by the abnormality reporting circuit via the endpoint device in which it is located, an exception reporting request according to the abnormality identifier, and transmitting the exception reporting request to a Host processor through the abnormality reporting circuit;
and generating, through the automatic response circuit and according to the abnormality identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
9. A fault restorer, characterized in that the fault restorer is applied to the PCIe-based device exception handling system according to any one of claims 1-7, and comprises an abnormality monitoring circuit, an abnormality reporting circuit and an automatic response circuit, wherein:
the abnormality monitoring circuit is used for monitoring read-write requests of the endpoint device and sending an abnormality identifier to the abnormality reporting circuit and the automatic response circuit when an abnormality exists;
the abnormality reporting circuit is used for generating, via the endpoint device in which it is located, an exception reporting request according to the abnormality identifier, and transmitting the exception reporting request to a Host processor;
and the automatic response circuit is used for generating, according to the abnormality identifier, automatic response information that stands in for the response of the endpoint device in which it is located.
10. A computer comprising the PCIe-based device exception handling system of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411494843.1A CN119473738A (en) | 2024-10-24 | 2024-10-24 | Device exception processing system, method and fault recovery device based on PCIe |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119473738A (en) | 2025-02-18 |
Family
ID=94590691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411494843.1A Pending CN119473738A (en) | 2024-10-24 | 2024-10-24 | Device exception processing system, method and fault recovery device based on PCIe |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119473738A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||