CN120523692A - Power supply status monitoring system and server

Info

Publication number: CN120523692A
Application number: CN202511025012.4A
Authority: CN
Inventors: 罗嗣恒; 孔财
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2025-07-24
Filing date: 2025-07-24
Publication date: 2025-08-22
Anticipated expiration: 2045-07-24
Also published as: CN120523692B

Abstract

The present application discloses a power supply status monitoring system and server, relating to the field of server technology. The system includes at least one power supply unit, a power distribution unit, a controller, and a first overcurrent protection unit. The power distribution unit is configured to output a first indication signal and a second indication signal, wherein the first indication signal is configured to indicate whether the input voltage of the power supply unit is within a preset input voltage range, and the second indication signal is configured to indicate whether the output voltage of the power supply unit is within a preset output voltage range. The controller is configured to update a control signal of the first overcurrent protection unit based on the first indication signal and the second indication signal, and determine whether to output a status abnormality signal based on the updated control signal, wherein the status abnormality signal is configured to indicate that the first overcurrent protection unit is in an abnormal state. The present application solves the technical problem in the related art that when a power supply unit fails, the status of the overcurrent protection unit is prone to false alarms.

Description

Power supply state monitoring system and server

Technical Field

The present application relates to the field of server technologies, and in particular, to a power supply state monitoring system and a server.

Background

With the development of cloud computing and artificial intelligence (AI) applications, users place ever higher demands on server computing performance, for example requiring that more graphics processing units (GPUs) be deployed in a server to enhance its computing performance.

To make GPU power supply safer, some solutions add an overcurrent protection unit at the GPU power input so that a GPU overcurrent or short circuit is interrupted in time. However, when a power supply unit (PSU) fails, such solutions are prone to falsely reporting the state of the overcurrent protection unit, which reduces operation and maintenance efficiency.

Disclosure of Invention

The application provides a power supply state monitoring system and a server, which at least solve the technical problem in the related art that the state of an overcurrent protection unit is prone to being falsely reported when a PSU fails.

The application provides a power supply state monitoring system which comprises at least one power supply unit, a power distribution unit, a controller and a first overcurrent protection unit, wherein the power distribution unit is connected in series between the power supply unit and the first overcurrent protection unit.

The power distribution unit is used for outputting a first indication signal and a second indication signal, wherein the first indication signal is used for indicating whether the input voltage of the power supply unit is in a preset input voltage range or not, and the second indication signal is used for indicating whether the output voltage of the power supply unit is in a preset output voltage range or not.

The controller is used for updating the control signal of the first overcurrent protection unit based on the first indication signal and the second indication signal, and determining whether to output a state abnormality signal according to the updated control signal, wherein the state abnormality signal is used for indicating that the first overcurrent protection unit is in an abnormal state.

The application also provides a server comprising the power supply state monitoring system.

According to the power supply state monitoring system and the server provided by the embodiments of the application, two signals of the power supply unit are introduced as the basis on which the controller judges whether the first overcurrent protection unit is in an abnormal state. This effectively prevents the state of the first overcurrent protection unit from being falsely reported because of a PSU fault, and thereby improves operation and maintenance efficiency.

Drawings

To describe the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a GPU power supply state monitoring system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a power supply state monitoring system according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of another power supply state monitoring system according to an embodiment of the present application;

FIG. 4 is a schematic diagram of the interconnection topology among a signal switching unit, a controller, and a baseboard management controller (BMC) according to an embodiment of the present application;

FIG. 5 is a schematic diagram of another GPU power supply state monitoring system according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the internal structure of a signal processing unit according to an embodiment of the present application;

FIG. 7 is a logic truth table relating each input signal to each output signal of the signal processing unit;

FIG. 8 is a first power-down timing diagram of the system when the AC mains loses power, according to an embodiment of the present application;

FIG. 9 is a second power-down timing diagram of the system when the AC mains loses power, according to an embodiment of the present application;

FIG. 10 is a third power-down timing diagram of the system when the AC mains loses power, according to an embodiment of the present application.

Detailed Description

The embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without inventive effort fall within the scope of protection of the present application.

For purposes of clarity in describing the embodiments of the present application, "exemplary," "for example," etc. are intended to be exemplary, illustrative, or explanatory. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and covers three cases: for example, "A and/or B" may mean A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates that the objects before and after it are in an "or" relationship.

Some abbreviations and key terms used in the embodiments of the present application are explained below.

A power supply unit (PSU) converts alternating current into the direct current required by electronic equipment, providing stable power to the components of the equipment.

A power distribution board (PDB) distributes the power provided by the PSUs to the individual power-consuming units on the circuit board.

Overcurrent protection (OCP) means that, when excessive current appears in a circuit, a protection device acts to cut off the circuit or limit the current, preventing overcurrent damage to the equipment.

An electronic fuse (EFUSE) is an overcurrent protection element made of semiconductor material; compared with a traditional fuse, it is reusable and responds quickly.

A voltage regulator module (VRM) converts an input voltage into the stable low voltage required by a processor, a graphics card, and the like, and provides functions such as dynamic response, efficient heat dissipation, and intelligent protection.

A baseboard management controller (BMC) is a core controller in the server used to remotely monitor, manage, and maintain the server; it can implement functions such as powering the server on and off and checking hardware states.

A complex programmable logic device (CPLD) is a high-density programmable logic device with an integration density greater than 1000 gates and with more input/output signals, product terms, and macrocells.

In today's digital age, cloud computing technology is iterating at an unprecedented pace, and its service models keep expanding from basic storage and computing services to more complex fields such as analysis and decision support. Meanwhile, AI applications are growing explosively, covering frontier fields such as natural language processing, computer vision, and reinforcement learning. Their application scenarios are increasingly broad, from smart homes and intelligent transportation to medical diagnosis, and they are profoundly changing how people live and work.

Against this background, user demand for server computing performance has grown sharply and become increasingly stringent. In practice, this demand translates into several specific requirements, chief among them deploying a greater number of graphics processing units (GPUs) in a server system.

Specifically, in the deep learning training process, complex matrix operation and neural network calculation are required to be performed on massive data, so that the parallel computing capability of the GPU can obviously shorten the training time and improve the training efficiency. In a cloud computing environment, a plurality of users initiate computing requests at the same time, the server is required to have the capabilities of quick response and high-efficiency processing, the parallel processing capability of the server can be enhanced by increasing the number of the GPUs, and the requests of all the users can be processed in time.

Therefore, in order to meet the demands of cloud computing and AI application on continuous improvement of server computing performance, deployment of more GPUs in a server system becomes a necessary trend, and by fully exerting the parallel computing advantages of the GPUs, the overall computing performance of the server can be effectively improved, so that better and efficient services are provided for users.

When GPUs are introduced in a server system on a large scale, the accumulated power consumption of multiple GPUs can have a significant impact on the current server power architecture. Existing server power supply systems are typically designed based on traditional load requirements, and their power capacity and power stability may not meet the power consumption requirements of a large number of GPUs running at the same time. This may cause problems such as a drop in power supply voltage, an increase in current ripple, etc., thereby affecting the performance and stability of the GPU and possibly even causing a malfunction of the server system.

To address the GPU power supply safety problem and improve the reliability and stability of the server system, some solutions add an electronic fuse (EFUSE) protection circuit at the GPU power input. An EFUSE is an overcurrent protection element based on semiconductor technology, with the advantages of fast response, high precision, and reusability. Compared with a conventional fuse, an EFUSE detects an overcurrent or short-circuit fault in a shorter time and quickly cuts off the circuit, thereby protecting the GPU and other server components from damage.

Specifically, when a GPU has an overcurrent or short-circuit fault, the EFUSE protection circuit immediately senses the abnormal change in current and triggers a protection action on a nanosecond time scale, disconnecting the faulty circuit from the power supply. This not only prevents the circuit board from burning out because of the overcurrent or short circuit, but also electrically isolates the faulty part from the parts that are operating normally, so that the fault does not spread.

In addition, the EFUSE protection circuit also has a self-recovery function, and can automatically recover the normal working state after the fault is removed, and the fuse does not need to be manually replaced, so that the maintainability and the usability of the server system are improved.

Illustratively, in an AI server application, a single high-performance GPU can draw 600W to 800W when running at full power. When a server deploys 8 such GPUs for a large-scale AI computing task, the total power consumption rises to 4800W to 6400W.

From the PSU point of view, if a 12V DC voltage supplies the GPUs, the basic formula for electric power, P = UI (where P is power, U is voltage, and I is current), gives a required PSU supply current of about 400A to 533A. Such a large load current makes the selection of connectors and cables a challenge.
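The current range quoted above follows directly from I = P / U. A minimal sketch of that arithmetic is given below; the function and constant names are illustrative and do not appear in the application.

```python
def supply_current(power_w: float, voltage_v: float) -> float:
    """Return the supply current I = P / U in amperes."""
    return power_w / voltage_v


GPU_COUNT = 8          # number of GPUs in the example above
BUS_VOLTAGE_V = 12.0   # 12V DC supply assumed in the example above

for per_gpu_w in (600.0, 800.0):
    total_w = GPU_COUNT * per_gpu_w
    current_a = supply_current(total_w, BUS_VOLTAGE_V)
    print(f"{total_w:.0f}W at {BUS_VOLTAGE_V:.0f}V -> {current_a:.1f}A")
# prints:
# 4800W at 12V -> 400.0A
# 6400W at 12V -> 533.3A
```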

In terms of connector selection, the connector must have sufficient current-carrying capacity and heat dissipation capacity to remain reliable and stable under high current. Parameters such as the conductive cross-sectional area and contact resistance of the connector need to be carefully designed and optimized according to the relevant current-carrying standards. The connector size therefore inevitably increases, in order to enlarge the cross-sectional area of the conductive material, lower the resistance, and reduce heat generation. However, introducing large connectors into the compact interior of an AI server raises a series of space layout problems. The inside of a server is usually highly integrated, with little space between hardware components, so a large connector may interfere with other components and affect the overall assembly and maintenance of the server. For example, the connector may collide with components such as a cooling fan or a memory slot, so that it cannot be installed properly or the heat dissipation effect is impaired.

For cable selection, high currents require cables with larger wire diameters. According to the law of resistance, increasing the wire diameter reduces the resistance of the cable and thus the heat generated under high current. However, thicker cables have several drawbacks. For routing inside the chassis, a thick cable occupies more space, making the limited routing channels more crowded. This not only makes routing harder, but may also lead to cables crossing and tangling, which affects the stability and reliability of signal transmission. During cable management, a thick cable is relatively inflexible and difficult to dress and fix neatly, so it may vibrate and loosen while the server is running and cause faults such as poor contact. In addition, thick cables may impair heat dissipation inside the chassis: they obstruct air circulation and create local dead zones for heat dissipation, making the temperature distribution inside the server uneven and affecting the performance and service life of hardware components.

Meanwhile, the conduction loss caused by high current cannot be ignored. According to Joule's law, under high current a large amount of heat is generated even if the resistance of the cables and connectors is small, which increases conduction loss. This not only wastes electric energy and lowers the energy efficiency of the server, but also raises the electricity cost of the user's machine room. From the perspective of energy management, higher conduction loss means the server consumes more electric energy to maintain normal operation, which runs counter to the current trend toward energy-saving, low-emission, green data centers. Therefore, reducing conduction loss under high current and improving the energy efficiency of the server is one of the important issues in AI server design.

In summary, in the related art, the PSU supply bus in an AI server system generally uses a 54V high-voltage direct current, and a power brick then converts the 54V supply into a 12V supply, which powers the various components in the server system, such as the motherboard, hard disks, GPUs, and fans.
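To illustrate why a 54V bus reduces conduction loss, the sketch below applies Joule's law (loss = I²R) to the same delivered power at 12V and at 54V. The path resistance of 1 milliohm is a purely hypothetical value chosen for illustration; the ratio of the two losses does not depend on it.

```python
def conduction_loss_w(power_w: float, bus_voltage_v: float,
                      path_resistance_ohm: float) -> float:
    """Joule heating I^2 * R in the distribution path for a given delivered power."""
    current_a = power_w / bus_voltage_v
    return current_a ** 2 * path_resistance_ohm


LOAD_POWER_W = 6400.0          # upper end of the 8-GPU example
PATH_RESISTANCE_OHM = 0.001    # hypothetical cable + connector resistance

loss_12v = conduction_loss_w(LOAD_POWER_W, 12.0, PATH_RESISTANCE_OHM)
loss_54v = conduction_loss_w(LOAD_POWER_W, 54.0, PATH_RESISTANCE_OHM)

# For the same power and resistance, the loss scales with 1 / U^2,
# so the 54V bus dissipates (12 / 54)^2, roughly 4.9%, of the 12V loss.
print(f"loss ratio 54V / 12V = {loss_54v / loss_12v:.4f}")
```

Under these assumptions the distribution current at 54V is 12/54 of the current at 12V, which is also why thinner cables and smaller connectors become feasible on the bus side.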

Because a plurality of GPU components are deployed in the AI server, the reliability of GPU power supply is crucial for the continuous and stable operation of AI server services, so the system needs to monitor and manage the GPU power supply state in real time. In the related art, a CPLD or a BMC generally obtains, for the EFUSE of each GPU, the control signal EN and the monitoring signal PG that indicates the supply has been established, and then determines whether the EFUSE is in an abnormal state.

Illustratively, in some AI servers, N+N redundant PSUs typically provide a 54V voltage, which a power brick converts to a 12V voltage. A 12V standby voltage is then derived through a 12V EFUSE and finally converted to the P3V3_STBY standby voltage to power a CPLD on the motherboard or GPU board.

Referring to fig. 1, fig. 1 is a schematic diagram of a GPU power state monitoring system according to an embodiment of the present application.

Illustratively, the following description takes a configuration with 8 GPUs and 2+2 redundant PSUs as an example.

The GPU power supply state monitoring system comprises an alternating current (AC) mains supply access unit, 2+2 redundant PSUs, a PDB, a P54V EFUSE, GPU EFUSEs (comprising EFUSE: P12V_GPU0, EFUSE: P12V_GPU1, EFUSE: P12V_GPU2, EFUSE: P12V_GPU3, EFUSE: P12V_GPU4, EFUSE: P12V_GPU5, EFUSE: P12V_GPU6, and EFUSE: P12V_GPU7), GPU components (comprising GPU0, GPU1, GPU2, GPU3, GPU4, GPU5, GPU6, and GPU7), a P12V_STBY EFUSE, a VR P3V3_STBY, a CPLD, and a BMC.

The AC mains supply access unit is the initial energy inlet of the whole system and is responsible for bringing external AC mains power into the server. This unit needs interfaces and electrical characteristics that meet the relevant standards, must adapt to the mains supply standards of different regions, and provides a stable basis for subsequent power conversion. The unit is often also equipped with protection mechanisms, such as lightning protection and overvoltage protection, to prevent the internal circuitry of the server from being damaged by abnormal conditions on the external power grid. Optionally, the AC mains supply access unit may be used to provide a voltage of 220V.

The 2+2 redundant PSUs may be divided into two groups of 2 PSUs that operate in parallel. In the normal working state, the two groups of PSUs jointly power the system and share the load, which increases the output capacity of the power supply. When any PSU fails, the remaining PSUs can immediately take over the full load, so that the server keeps running and service interruption caused by a power failure is avoided. A power conversion circuit can be integrated in each PSU to convert the input alternating current into direct current suitable for use inside the server, and each PSU has comprehensive protection functions such as overcurrent protection, overvoltage protection, and short-circuit protection.

The PDB can receive the output power from the 4 PSUs and distribute it reasonably to the various parts of the system. The PDB has high-precision power distribution capability and good electrical isolation performance, and can ensure that power supply among different components is not interfered with each other. Meanwhile, the PDB may integrate some power monitoring functions, such as real-time monitoring of voltage and current, so as to discover abnormal situations in the power supply process in time.

The P54V EFUSE provides overcurrent protection in the power distribution process. It is positioned between the PDB and the subsequent circuit; when an overcurrent occurs in the circuit, the P54V EFUSE quickly trips to cut off the circuit, preventing excessive current from damaging downstream electronic components. Compared with a traditional fuse, the P54V EFUSE has advantages such as fast response and reusability, and better meets the power protection requirements of modern electronic equipment.

A corresponding GPU EFUSE is provided for each of the 8 GPUs, namely EFUSE: P12V_GPU0 through EFUSE: P12V_GPU7. These EFUSEs are also used for overcurrent protection and give each GPU an independent power protection mechanism. When a fault such as a short circuit or overcurrent occurs in a GPU, the corresponding GPU EFUSE trips and cuts off the power supply of that GPU, which prevents the fault from spreading to other components and protects the safe operation of the whole system.

The P12V_STBY EFUSE may be used to protect the standby voltage path. The standby voltage P12V_STBY supplies power to critical management components in the server, which require a stable power supply even when the server is in a standby state. The P12V_STBY EFUSE can cut off the power supply in time when the standby voltage path is abnormal, protecting the related components from damage.

The VR P3V3_STBY can be used to convert the standby voltage P12V_STBY into a stable 3.3V voltage, P3V3_STBY, which supplies power to management components such as the CPLD and the BMC.

The CPLD can be used for the functions of initialization, logic control, signal processing and the like of the server. In the power supply state monitoring system, the CPLD can acquire state information of each power supply node, such as voltage, current, EFUSE state and the like, and judge and process the state information according to a preset logic rule. When detecting the power supply abnormality, the CPLD can send out an alarm signal in time and take corresponding protection measures, such as cutting off the power supply, restarting the component and the like.

The BMC can be responsible for comprehensively monitoring and managing the hardware state of the server, including monitoring and alarming parameters such as temperature, voltage, fan rotation speed and the like. In the power supply state monitoring system, the BMC acquires detailed power supply state information through communication with the CPLD and other power supply monitoring components, and uploads the information to a remote management terminal, so that a manager can conveniently know the power supply operation condition of a server in real time. Meanwhile, the BMC can also remotely control and configure the server according to the power state information, so that the manageability and maintainability of the server are improved.

In some embodiments, the working process of the GPU power state monitoring system is as follows:

The 220V AC mains supply is led from the power distribution cabinet and serves as the PSU power input. Four PSUs are plugged into the PDB to realize 2+2 redundant power supply and output the P54V_PSU supply, which then enters the GPU switch board and is converted into the P54V supply through EFUSE: P54V_PSU. The P54V supply is converted by the VRM into P12V_IN. P12V_IN is divided into multiple paths: one path is converted into the standby voltage P12V_STBY through EFUSE: P12V_STBY and then into P3V3_STBY through VR P3V3_STBY to supply power to the CPLD and the BMC; the other paths supply power to GPU0 through GPU7 through EFUSE: P12V_GPU0 through EFUSE: P12V_GPU7, respectively.
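The power path described above can be captured as a small tree in which each rail records its upstream rail and the element that produces it. The sketch below is only an illustrative model: the rail label `AC_220V`, the per-GPU rail names `P12V_GPUn`, and the function `trace` are assumptions introduced here, not names from the application.

```python
# Each rail maps to (upstream rail, element that produces this rail).
POWER_TREE = {
    "P54V_PSU": ("AC_220V", "2+2 PSUs on PDB"),      # AC_220V is a hypothetical label
    "P54V": ("P54V_PSU", "EFUSE: P54V_PSU"),
    "P12V_IN": ("P54V", "VRM"),
    "P12V_STBY": ("P12V_IN", "EFUSE: P12V_STBY"),
    "P3V3_STBY": ("P12V_STBY", "VR P3V3_STBY"),
}
# One independently protected 12V branch per GPU (rail names assumed).
for n in range(8):
    POWER_TREE[f"P12V_GPU{n}"] = ("P12V_IN", f"EFUSE: P12V_GPU{n}")


def trace(rail: str) -> list[str]:
    """List the elements between the AC input and the given rail, in order."""
    path = []
    while rail in POWER_TREE:
        upstream, element = POWER_TREE[rail]
        path.append(element)
        rail = upstream
    return list(reversed(path))


print(trace("P12V_GPU3"))
print(trace("P3V3_STBY"))
```

Tracing both a GPU rail and the standby rail shows that they share the same upstream elements up to P12V_IN and diverge only at their respective EFUSEs.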

After P3V3_STBY is established and the CPLD has completed initialization, the CPLD outputs the control signals EN_P12V_GPUn (n = 0, 1, ..., 7) and monitors the PG signals PG_P12V_GPUn (n = 0, 1, ..., 7) of the corresponding GPU EFUSEs.

If the CPLD detects that EN_P12V_GPUn is at a high level, that the PG signal of the GPUn EFUSE transitions from high to low, and that the low level lasts longer than a preset duration (e.g., 25 us), the CPLD determines that the GPUn EFUSE is in an abnormal state.
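The detection rule above (EN high, PG falling from high to low, and PG staying low longer than the preset duration) can be sketched in software as follows. This is a behavioral model only, assuming time-ordered samples of the two signals; a real CPLD would implement the equivalent logic in hardware, and the function and variable names are hypothetical.

```python
from typing import Iterable, Tuple

PG_LOW_THRESHOLD_US = 25.0  # example preset duration from the text


def efuse_abnormal(samples: Iterable[Tuple[float, bool, bool]],
                   threshold_us: float = PG_LOW_THRESHOLD_US) -> bool:
    """Return True if, while EN is high, PG falls from high to low and
    stays low for longer than threshold_us.

    samples: (time_us, en, pg) tuples in increasing time order.
    """
    low_since = None   # time of the PG falling edge being tracked
    prev_pg = None     # PG level of the previous sample
    for time_us, en, pg in samples:
        if not en or pg:
            low_since = None              # EFUSE disabled or PG good: reset
        else:
            if low_since is None and prev_pg is True:
                low_since = time_us       # high-to-low transition with EN high
            if low_since is not None and time_us - low_since > threshold_us:
                return True
        prev_pg = pg
    return False


# A 20 us glitch on PG is ignored; a low level lasting 30 us is flagged.
glitch = [(0, True, True), (10, True, False), (20, True, False), (30, True, True)]
fault = [(0, True, True), (10, True, False), (30, True, False), (40, True, False)]
print(efuse_abnormal(glitch), efuse_abnormal(fault))  # False True
```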

Once the GPUn EFUSE is determined to be in the abnormal state, the CPLD immediately outputs a state abnormality signal FAULT#_P12V_GPUn (n = 0, 1, ..., 7). These state abnormality signals are then converted and forwarded by the signal switching unit and finally delivered to the BMC over the I2C bus.

After receiving the status exception signal from the CPLD, the BMC accurately records detailed fault information, including the number of the failed GPU, the fault type, the occurrence time, and the like, into a log file. The log files provide comprehensive and accurate fault analysis basis for operation and maintenance personnel, are helpful for the operation and maintenance personnel to quickly locate problems, troubleshoot fault reasons and take corresponding maintenance measures, so that stable operation of the server system is ensured.

In some embodiments, if the PSU fails and powers down, P3V3_STBY on the GPU board remains powered, so the CPLD is still operating normally and continues to record the working state of the GPU EFUSE. At the moment the PSU powers down, the EN signal of the GPU EFUSE is high, and the output voltage of the EFUSE is quickly pulled down by the GPU load, so the PG signal of the GPU EFUSE also goes low. The CPLD, seeing the EN signal high and the PG signal switching from high to low, may then misjudge that the GPU EFUSE is in an error state and report the error to the BMC, producing an abnormal log record that hampers the fault analysis and localization performed by machine room operation and maintenance personnel.

Therefore, how to prevent false reporting of the EFUSE state caused by a PSU power failure is a technical problem to be solved.

In view of the above technical problem, the embodiments of the application provide a power supply state monitoring system that uses two signals of the PSU as the basis for judging whether a GPU EFUSE is in an abnormal state, so that false reporting of the EFUSE state caused by a PSU power failure can be effectively prevented.
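One way to picture this idea is to gate the EFUSE control signal with the two PSU indication signals before evaluating PG. The sketch below assumes a simple AND gating, which is an illustrative assumption and not necessarily the logic of the signal processing unit; the exact behavior is defined by the truth table of FIG. 7. All function and parameter names are hypothetical.

```python
def updated_enable(en: bool, input_ok: bool, output_ok: bool) -> bool:
    """Updated control signal: EN counts as asserted only while the PSU
    input voltage and output voltage are both reported in range.
    (AND gating is an assumption made for this sketch.)"""
    return en and input_ok and output_ok


def state_abnormal(en: bool, pg: bool, input_ok: bool, output_ok: bool) -> bool:
    """Flag the EFUSE as abnormal only if the updated EN is high and PG is low."""
    return updated_enable(en, input_ok, output_ok) and not pg


# EFUSE fault while the PSU is healthy: reported.
print(state_abnormal(en=True, pg=False, input_ok=True, output_ok=True))   # True
# PG drops because the PSU lost its input: suppressed, no false report.
print(state_abnormal(en=True, pg=False, input_ok=False, output_ok=True))  # False
```

With this kind of gating, a PG drop that coincides with a PSU input or output fault is attributed to the PSU rather than to the EFUSE, which is the false report the application aims to avoid.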

The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.

Referring to fig. 2, fig. 2 is a schematic structural diagram of a power supply state monitoring system according to an embodiment of the present application.

In some embodiments, the power supply state monitoring system includes a power supply unit, a power distribution unit, a controller, and a plurality of first over-current protection units (e.g., EFUSE0, EFUSE1, EFUSE2, EFUSE3, EFUSE4, EFUSE5, EFUSE6, EFUSE7 in fig. 2), where the power distribution unit is connected in series between the power supply unit and the plurality of first over-current protection units.

Optionally, the power supply unit may adopt a redundant design, such as an N+N redundant PSU design.

By way of example, the power supply unit may employ 2+2 redundant PSUs.

The 2+2 redundant PSUs can be divided into two groups of 2 PSUs working in parallel. When any PSU fails, the remaining PSUs can immediately take over the full load, so that the server keeps running and service interruption caused by a power failure is avoided.

The 2+2 redundant PSU configuration can reasonably control the cost while ensuring the stable operation of the equipment. For example, a small and medium-sized data center with tens of servers can provide reliable power guarantee for the servers by adopting 2+2 redundant PSU configuration, and the server is prevented from being stopped due to the failure of a single PSU, so that the normal operation of the service is affected.

The power distribution unit is a power distribution hub that receives output power from multiple PSUs and distributes it appropriately to various parts of the system. The power distribution unit has high-precision power distribution capability and good electrical isolation performance, and can ensure that power supply among different components is not interfered with each other. Meanwhile, the power distribution unit can integrate some power monitoring functions, such as real-time monitoring of voltage and current, so as to discover abnormal conditions in the power supply process in time.

In some embodiments, the controller may be responsible for functions such as initialization, logic control, and signal processing of the server. In the power supply state monitoring system, the controller can collect state information of each power supply node, such as voltage, current, EFUSE state and the like, in real time, and judge and process the state information according to a preset logic rule. When the power supply abnormality is detected, the controller can send out an alarm signal in time and take corresponding protection measures, such as cutting off the power supply, restarting the component and the like.

Optionally, the controller may be a complex programmable logic device (CPLD).

Optionally, the power supply state monitoring system further includes at least one processor (such as processor 0, processor 1, processor 2, processor 3, processor 4, processor 5, processor 6, processor 7 in fig. 2).

Optionally, the processor may be a GPU, or may be another type of processor or load.

In some embodiments, the system may be provided with a first overcurrent protection unit for each processor. The first overcurrent protection units are used for overcurrent protection and provide an independent power protection mechanism for each processor. When a processor has a fault such as a short circuit or an overcurrent, the corresponding first overcurrent protection unit trips and cuts off the power supply of that processor, which prevents the fault from spreading to other components and protects the safe operation of the whole system.

In some embodiments, each processor is connected to its corresponding first overcurrent protection unit. For example, processor 0 is connected to EFUSE0, processor 1 is connected to EFUSE1, and processor 7 is connected to EFUSE7.

In some embodiments, the power distribution unit is configured to output a first indication signal and a second indication signal, where the first indication signal is configured to indicate whether the input voltage of the power supply unit is in a preset input voltage range, and the second indication signal is configured to indicate whether the output voltage of the power supply unit is in a preset output voltage range.

The first indication signal reflects whether the input voltage of the power supply unit is within the normal fluctuation range. When the input voltage is within the normal fluctuation range, the first indication signal indicates that the input voltage of the power supply unit is normal; when the input voltage is outside the normal fluctuation range, the first indication signal indicates that the input voltage of the power supply unit is abnormal.

The setting of the input voltage range is determined based on the design requirement of the power supply unit and the actual application scene. The reasonable variation range of the input voltage under various working conditions is considered, so that the power supply unit can work normally under different input conditions. When the measured value of the input voltage is in the normal fluctuation range, the first indication signal outputs a specific level signal to indicate that the input voltage of the power supply unit is normal. In contrast, when the measured value of the input voltage exceeds the normal fluctuation range, the first indication signal changes its output level state rapidly, indicating that the input voltage of the power supply unit is abnormal.

Similarly, the second indication signal reflects whether the output voltage of the power supply unit is within the normal fluctuation range. When the output voltage is within the normal fluctuation range, the second indication signal indicates that the output voltage of the power supply unit is normal; when the output voltage is outside the normal fluctuation range, the second indication signal indicates that the output voltage of the power supply unit is abnormal.
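The way the two indication signals follow the measured voltages can be sketched as a simple range check. This is only an illustration: the function name, the numeric ranges, and the convention that a high level (1) means "normal" are assumptions made for the example and are not fixed by the application.

```python
def indication_level(measured_v: float, v_min: float, v_max: float) -> int:
    """Return 1 (high level, normal) when v_min <= measured_v <= v_max, else 0 (low level, abnormal)."""
    return 1 if v_min <= measured_v <= v_max else 0


# Hypothetical preset ranges, chosen only for illustration.
INPUT_RANGE_V = (180.0, 264.0)   # assumed AC input window
OUTPUT_RANGE_V = (51.0, 57.0)    # assumed window around a 54 V output

first_indication = indication_level(220.0, *INPUT_RANGE_V)    # input within range
second_indication = indication_level(48.0, *OUTPUT_RANGE_V)   # output below range
print(first_indication, second_indication)  # prints "1 0"
```

In this example the input voltage is judged normal while the output voltage is judged abnormal, which corresponds to the case of a PSU whose mains input is healthy but whose output has failed.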

In some embodiments, the controller is configured to update the control signal EN (e.g., en_0, en_1, en_2, en_3, en_4, en_5, en_6, en_7) of the first overcurrent protection unit based on the first indication signal and the second indication signal, and determine whether to output a state exception signal according to the updated control signal and the acquired monitoring signal PG (e.g., pg_0_new, pg_1_new, pg_2_new, pg_3_new, pg_4_new, pg_5_new, pg_6_new, pg_7_new) corresponding to the first overcurrent protection unit in fig. 2, where the state exception signal is used to indicate that the first overcurrent protection unit is in an abnormal state.

In some embodiments, the first indication signal and the second indication signal may be processed by a logic operation to generate a new control signal, where the new control signal may be used as a latest control signal of the first overcurrent protection unit to control on or off of the first overcurrent protection unit.

By generating the control signal of the first overcurrent protection unit through a logic operation on the first indication signal and the second indication signal, the system can respond quickly and accurately to real-time changes in the power state, which effectively ensures reliable operation of the system.

In some embodiments, it may be determined whether the first overcurrent protection unit is in an abnormal state according to the updated control signal.

In some embodiments, the controller is further configured to obtain a monitoring signal corresponding to the first overcurrent protection unit, where the monitoring signal is used to indicate whether the supply voltage of the first overcurrent protection unit is in a preset supply voltage range.

In some embodiments, the controller may determine whether to output the abnormal state signal according to the updated control signal and the monitoring signal.

It can be understood that when the PSU fails, the first indication signal and/or the second indication signal may output a low level, so that the control signal of the first overcurrent protection unit may be set to a low level in advance, thereby ensuring that the control signal of the first overcurrent protection unit always becomes a low level before the monitoring signal when the system is powered down, so as to avoid a false alarm of the controller.
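The effect of pulling the control signal low in advance can be illustrated with a minimal sketch for a single PSU and a single first overcurrent protection unit. The function names are hypothetical, signal levels are modelled as the integers 0 and 1, and the decision rule (control signal still high while the monitoring signal is low) is an assumption drawn from the timing description of the embodiments rather than a definitive implementation.

```python
def updated_enable(en: int, first_indication: int, second_indication: int) -> int:
    """Force EN low as soon as either PSU indication signal is low."""
    return en & first_indication & second_indication


def state_abnormal(en: int, pg: int) -> bool:
    """Report an abnormal state only when EN is still high while PG is low."""
    return en == 1 and pg == 0


# PSU failure: the first indication signal has dropped and PG has followed.
en, pg = 1, 0
print(state_abnormal(en, pg))                        # True: false alarm
print(state_abnormal(updated_enable(en, 0, 1), pg))  # False: EN was pulled low first
```

Without the update, the controller sees an enabled protection unit with no valid supply and raises an alarm; with the update, the same power-down is recognised as expected.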

According to the power supply state monitoring system provided by the embodiment of the application, two paths of signals (the first indicating signal and the second indicating signal) of the PSU are introduced as the basis for the controller to monitor and judge whether the first overcurrent protection unit is in an abnormal state, so that the condition that the state of the first overcurrent protection unit is misreported due to the occurrence of faults of the PSU can be effectively prevented, and the operation and maintenance efficiency can be effectively improved.

Referring to fig. 3, fig. 3 is another schematic structural diagram of a power supply state monitoring system according to an embodiment of the present application.

In some embodiments, the controller includes a signal processing unit, where the signal processing unit is configured to perform a logic operation on the first indication signal, the second indication signal, and a control signal, and update the control signal based on a result of the operation, where the control signal is configured to control the first overcurrent protection unit to be turned on or off.

In some embodiments, the signal processing unit includes at least one first logic gate, a second logic gate and a third logic gate, where the first logic gate is configured to perform a first operation on a first indication signal and a second indication signal and output a first signal, the second logic gate is configured to perform a second operation on the basis of the first signal and output a second signal, and the third logic gate is configured to perform a first operation on the second signal and a control signal and update the control signal according to an operation result.

Optionally, the first operation includes a logical AND operation, and the second operation includes a logical OR operation.

The logical AND operation combines the logic states of a plurality of input signals. The output is true (high level "1") only when all input signals are true (usually represented by a high level "1" in a digital circuit), and the output is false (low level "0") when one or more input signals are false (low level "0"). This characteristic enables the power management system to evaluate several conditions jointly. For example, to judge whether the PSU is in a normal working state, the first indication signal and the second indication signal can be ANDed, and the PSU is judged to be in a normal working state only when both signals are normal.

Optionally, the first logic gate and the third logic gate may be implemented as complementary metal-oxide-semiconductor (CMOS) AND gate circuits. The CMOS AND gate circuit may be formed by cross-coupling P-channel metal-oxide-semiconductor (PMOS) and N-channel metal-oxide-semiconductor (NMOS) transistors, and the input signals implement the logic operation by controlling the on state of the transistors.

The logical OR operation also combines the logic states of a plurality of input signals. The output is true (high level "1") whenever one or more of the input signals are true (usually represented by a high level "1" in a digital circuit), and is false (low level "0") only when all of the input signals are false (low level "0").

Taking fault detection in a power management system as an example, the system may need to monitor the status of a plurality of PSUs. Each PSU may generate a first signal indicating a normal state or a fault. By taking these first signals as inputs of a logical OR operation, the output signal becomes "1" whenever one PSU fails, so that it can be determined that at least one of the plurality of PSUs has failed.

Optionally, the second logic gate may be a CMOS OR gate structure, which may be formed of PMOS transistors in parallel and NMOS transistors in series, and the input signals implement the logic operation by controlling the on state of the transistors.

In some embodiments, when the fluctuation range of the input voltage of the power supply unit is smaller than or equal to a first amplitude threshold, the input voltage fluctuates within an allowable stable range, and the first indication signal is a first level signal indicating that the input voltage of the power supply unit is normal, and when the fluctuation range of the input voltage of the power supply unit is larger than the first amplitude threshold, the input voltage fluctuates beyond the normal range, and the first indication signal is a second level signal indicating that the input voltage of the power supply unit is abnormal.

Similarly, when the fluctuation range of the output voltage of the power supply unit is smaller than or equal to a second amplitude threshold, the output voltage fluctuates within an allowable stable range, and the second indication signal is a first level signal which indicates that the output voltage of the power supply unit is normal, and when the fluctuation range of the output voltage of the power supply unit is larger than the second amplitude threshold, the output voltage fluctuates beyond the normal range, and the second indication signal is a second level signal which indicates that the output voltage of the power supply unit is abnormal.

Optionally, the first level signal may be a high level, and the second level signal may be a low level, which is not limited in the embodiment of the present application.

In some embodiments, the power state monitoring system further includes at least one logic gate buffer unit connected in series between the power distribution unit and the controller.

The logic gate buffer unit may be configured to perform isolation enhancement processing on the first indication signal and the second indication signal.

The logic gate buffer unit is used as a core component for signal isolation and enhancement, and can ensure reliable transmission of the first indication signal and the second indication signal in a complex electromagnetic environment through level conversion, driving capability improvement and noise isolation.

In some embodiments, the power supply state monitoring system further includes a voltage adjustment unit connected in series between the power distribution unit and the first overcurrent protection unit.

In some embodiments, the voltage adjustment unit may be configured to convert the voltage output by the power distribution unit into the power supply voltage of the processor. For example, the 54V voltage output from the power distribution unit is converted into a 12V power supply voltage for the above-described processor.

The voltage adjusting unit may step down the input high voltage according to a predetermined ratio by an internal power conversion circuit, such as a switching power supply circuit or a linear voltage stabilizing circuit.

In some embodiments, the power supply state monitoring system further includes a second overcurrent protection unit connected in series between the power distribution unit and the voltage adjustment unit.

The second overcurrent protection unit serves as a key protection layer between the power distribution unit and the voltage adjustment unit, so that overcurrent protection is provided along the whole path from the power inlet to the load. When an overcurrent condition occurs in the circuit, the second overcurrent protection unit can trip quickly to cut off the circuit, which prevents downstream electronic components from being damaged by excessive current.

Optionally, the second overcurrent protection unit may include an EFUSE. Compared with a traditional fuse, the EFUSE has advantages such as fast response and reusability, and is better suited to the power protection requirements of a server.

In some embodiments, the power supply state monitoring system further includes a voltage conversion unit connected in series between the voltage adjustment unit and the controller.

In some embodiments, the voltage conversion unit may be configured to convert the voltage output by the voltage adjustment unit into a supply voltage of the controller. For example, the P12V_STBY voltage output from the voltage adjustment unit is converted into the power supply voltage P3V3_STBY of the above-described controller.

In some embodiments, the power supply state monitoring system further includes a third overcurrent protection unit connected in series between the voltage adjustment unit and the voltage conversion unit.

The third overcurrent protection unit serves as a key protection layer between the voltage adjustment unit and the voltage conversion unit, so that overcurrent protection is provided along the whole path from the voltage adjustment unit to the voltage conversion unit. When an overcurrent condition occurs in the circuit, the third overcurrent protection unit can trip quickly to cut off the circuit, which prevents downstream electronic components (such as the controller) from being damaged by excessive current.

Optionally, the third overcurrent protection unit may include an EFUSE.

In some embodiments, the power state monitoring system further includes a BMC, and the controller is connected to the BMC.

In some embodiments, the controller is configured to send a status exception signal to the BMC, and the BMC is configured to output a log file based on the status exception signal.

For example, when the controller detects that a system has a state exception, a state exception signal may be sent to the BMC. After receiving the state exception signal sent by the controller, the BMC can generate a detailed log file according to a log record rule predefined by the system.

Optionally, the log file may include the time at which the exception occurred, its cause, the associated parameter values, the processor location, and the like.
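A log record with the fields listed above could be represented as follows. This is a hypothetical sketch: the class name, field names, JSON encoding, and sample values are illustrative assumptions, since the application leaves the log format to the predefined log record rule.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FaultLogEntry:
    """One record of a state exception (all field names are hypothetical)."""
    timestamp: str            # time at which the exception occurred
    cause: str                # cause of the exception
    processor_location: str   # which processor's protection unit is affected
    parameters: dict          # associated parameter values


def format_entry(entry: FaultLogEntry) -> str:
    """Serialize one entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


sample = FaultLogEntry(
    timestamp="2025-01-01T00:00:00Z",   # made-up sample value
    cause="EN high while PG low",
    processor_location="processor 3",
    parameters={"EN": 1, "PG": 0},
)
print(format_entry(sample))
```

Writing one self-contained line per event keeps the log easy to append to and to parse, whether it is stored locally or on a remote server.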

Optionally, the storage location of the log file may be a local storage device (such as flash memory or hard disk) or a remote server, so as to ensure the security and reliability of the log data, and the access authority is managed by a user authentication and authorization mechanism, so that only the authorized user can access and view the log file.

In some embodiments, the power supply state monitoring system further includes a signal switching unit connected in series between the BMC and the controller.

In some embodiments, the controller is configured to send a status exception signal to the signal switching unit, where the signal switching unit is configured to store the status exception signal in a corresponding port register.

The signal switching unit may immediately start the signal storage mechanism after receiving the state abnormality signal transmitted from the controller. A plurality of port registers are arranged in the device, and each port register corresponds to a specific signal channel. The signal switching unit accurately stores the signal in the corresponding port register according to the characteristics of the state abnormal signal and a preset storage rule. The port register has the characteristics of high-speed reading and writing, stable data retention and the like, can ensure that state abnormal signals are not lost or deformed in the storage process, and provides a reliable data source for the subsequent reading operation of the BMC.

In some embodiments, the BMC is configured to address the serial bus address of the signal switching unit through the serial bus to obtain a value in the port register, where the value may be used to indicate whether the first overcurrent protection unit is in an abnormal state.

The BMC can address the signal switching unit through the serial bus by means of its serial bus address. The serial bus address is a unique identifier of the signal switching unit in the system, and the BMC can accurately locate the target signal switching unit by sending the specific address.

Once the BMC has successfully addressed the signal switching unit, it may send a read command to the signal switching unit according to the serial bus communication protocol. After receiving the read command, the signal switching unit reads the value of the state exception signal from the corresponding port register and returns it to the BMC through the serial bus. After obtaining the value in the port register, the BMC parses and analyzes it to determine whether each first overcurrent protection unit is in an abnormal state.

Optionally, the serial bus may be an I2C (Inter-Integrated Circuit) bus.

Referring to fig. 4, fig. 4 is a schematic diagram illustrating interconnection topology between a signal switching unit and a controller and between a signal switching unit and a BMC according to an embodiment of the present application.

In some embodiments, the controller outputs the FAULT signals of the EFUSE corresponding to each processor (e.g., FAULT# processor 0, FAULT# processor 1, FAULT# processor 2, FAULT# processor 3, FAULT# processor 4, FAULT# processor 5, FAULT# processor 6, FAULT# processor 7) to the signal switching unit, and the states of the FAULT# processor n (n=0, 1, ..., 7) signals are stored in the corresponding single port registers of the signal switching unit. Afterwards, the BMC can address the I2C address of the signal switching unit (defined by the high/low level settings of the address pins A2/A1/A0) through the I2C bus and read the corresponding port register value of the signal switching unit, so as to obtain the FAULT signal of the corresponding EFUSE.
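The addressing and decoding steps described above can be sketched without any bus access. The sketch rests on several assumptions that the application does not state: a 7-bit address whose three low bits are set by A2/A1/A0, a fixed base value of 0x20 for the upper bits, an 8-bit port register in which bit n holds FAULT# processor n, and an active-low FAULT# (a 0 bit marks an abnormal EFUSE). The actual I2C read is deliberately left out.

```python
from typing import List

BASE_ADDRESS = 0x20  # assumed fixed upper address bits; the real value is device-specific


def i2c_address(a2: int, a1: int, a0: int, base: int = BASE_ADDRESS) -> int:
    """Combine the address-pin levels A2/A1/A0 with the fixed base bits."""
    return base | (a2 << 2) | (a1 << 1) | a0


def abnormal_processors(port_value: int) -> List[int]:
    """Return the indices n whose FAULT# bit is 0 (active-low, assumed bit order)."""
    return [n for n in range(8) if not (port_value >> n) & 1]


print(hex(i2c_address(0, 1, 0)))        # 0x22
print(abnormal_processors(0b11110111))  # [3]
print(abnormal_processors(0xFF))        # []
```

With this layout a single register read lets the BMC check all eight first overcurrent protection units at once.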

According to the power supply state monitoring system provided by the embodiment of the application, two paths of signals of the PSU are introduced as the basis for the controller to monitor and judge whether the first overcurrent protection unit is in an abnormal state, so that the condition that the state of the first overcurrent protection unit is misreported due to the failure of the PSU can be effectively prevented, and the operation and maintenance efficiency can be effectively improved.

In order to better understand the power supply state monitoring system provided by the application, the following embodiments are exemplified by using practical application scenes.

Referring to fig. 5, fig. 5 is a schematic diagram of another GPU power state monitoring system according to an embodiment of the present application.

The GPU power supply state monitoring system comprises an AC mains access unit, 2+2 redundant PSUs, a PDB, a second overcurrent protection unit (EFUSE: P54V), first overcurrent protection units GPU EFUSE (comprising EFUSE: P12V_GPU0, EFUSE: P12V_GPU1, EFUSE: P12V_GPU2, EFUSE: P12V_GPU3, EFUSE: P12V_GPU4, EFUSE: P12V_GPU5, EFUSE: P12V_GPU6 and EFUSE: P12V_GPU7), processors (comprising GPU0, GPU1, GPU2, GPU3, GPU4, GPU5, GPU6 and GPU7), a voltage adjustment unit (VRM: P54V/P12V), a third overcurrent protection unit (EFUSE: P12V_STBY), a voltage conversion unit (VR: P3V3_STBY), a CPLD and a BMC.

After P3V3_STBY is established and the CPLD initialization is completed, the CPLD outputs the control signals EN_P12V_GPUn (n=0, 1, 2, 3, 4, 5, 6, 7) and monitors the PG signal of the corresponding GPU EFUSE.

In some embodiments, two signals of the PSU, namely a first indication signal PSU_AC_OK and a second indication signal PSU_PWROK, are introduced as the basis for the CPLD to monitor and determine whether the GPU EFUSE is in an abnormal state.

PSU0_AC_OK of PSU0 is the first indication signal, which indicates that the AC mains input corresponding to PSU0 is normal (that is, the PSU input mains is within the normal fluctuation range), and PSU0_PWROK is the second indication signal, which indicates that the output voltage of PSU0 has been established and is normal.

Similarly, PSU1_AC_OK and PSU1_PWROK, PSU2_AC_OK and PSU2_PWROK, and PSU3_AC_OK and PSU3_PWROK are the first indication signals and second indication signals corresponding to PSU1, PSU2 and PSU3, respectively.

In some embodiments, the two indication signals of each PSU are isolated and enhanced by a logic gate buffer unit to generate the PSUi_AC_GOOD and PSUi_PG signals (i=0, 1, 2, 3), which serve as inputs of the signal processing unit.

Further, after performing logic operation processing on the PSUi_AC_GOOD and PSUi_PG signals, the signal processing unit generates the GPU EFUSE enable control signals EN_P12V_GPUn (n=0, 1, ..., 7) to control the corresponding GPU EFUSE to be turned on or off.

In some embodiments, the signal processing unit is specifically configured to:

Incorporate the two indication signals PSUi_AC_GOOD and PSUi_PG into the generation logic of the GPU EFUSE enable signal, so that when a PSU failure occurs (that is, PSUi_AC_GOOD or PSUi_PG indicates that the AC mains has powered down or that the PSU output is abnormal), the EN signal of the GPU EFUSE is set low in advance. In this way, the EN signal (EN_P12V_GPUn) of the GPU EFUSE always goes low before the PG signal of the GPU EFUSE when the system powers down, which avoids a false alarm by the CPLD.

According to the power supply state monitoring system provided by the embodiment of the application, the two indication signals PSUi_AC_GOOD and PSUi_PG of the PSU are introduced as the signals with which the CPLD detects and judges whether the PSU is in a normal working state, so that false log records for the GPU EFUSE can be avoided and the efficiency of fault location and operation and maintenance work is improved.

Referring to fig. 6, fig. 6 is a schematic diagram illustrating an internal structure of a signal processing unit according to an embodiment of the present application.

In some embodiments, the signal processing unit may include 12 sets of AND logic gates and 1 set of OR logic gates.

Four sets of AND logic gates perform the AND operation on PSU_i_AC_GOOD and PSU_i_PG of the 4 PSUs and output the corresponding first signals, namely PSU_0_SHUT, PSU_1_SHUT, PSU_2_SHUT and PSU_3_SHUT.

The OR logic gate performs an OR operation on the 4 first signals and outputs the second signal PSU_ALL_SHUT.

When the second signal PSU_ALL_SHUT is at a low level, the PSUs have powered down abnormally. This covers two cases: the AC mains input voltage is abnormal, or the AC mains input voltage is normal but the PSU output voltage is abnormal.

Further, the other 8 sets of AND logic gates AND the second signal PSU_ALL_SHUT with the control signal EN (EN_P12V_GPUi) of each of the 8 GPU EFUSEs and output the corresponding control signal EN_P12V_GPUi_NEW, which may be used to control the corresponding GPU EFUSE to be turned on or off.
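The gate structure described above (4 AND gates, 1 OR gate, 8 AND gates) can be modelled in software as a minimal sketch. The function name and the use of Python integers for logic levels are assumptions made for illustration; in the embodiment this logic is realised inside the CPLD.

```python
from typing import List, Sequence, Tuple


def signal_processing_unit(
    psu_signals: Sequence[Tuple[int, int]],  # (PSU_i_AC_GOOD, PSU_i_PG) for each PSU
    en: Sequence[int],                       # EN_P12V_GPUn for each GPU EFUSE
) -> Tuple[List[int], int, List[int]]:
    """Return (PSU_i_SHUT list, PSU_ALL_SHUT, EN_P12V_GPUn_NEW list)."""
    # First logic gates: AND of the two indication signals of each PSU.
    psu_shut = [ac_good & pg for ac_good, pg in psu_signals]
    # Second logic gate: OR across all first signals.
    psu_all_shut = int(any(psu_shut))
    # Third logic gates: AND of the second signal with each EN signal.
    en_new = [psu_all_shut & e for e in en]
    return psu_shut, psu_all_shut, en_new


all_enabled = [1] * 8

# One PSU still healthy: PSU_ALL_SHUT stays high and EN passes through unchanged.
print(signal_processing_unit([(1, 1), (0, 0), (1, 0), (0, 1)], all_enabled))
# No PSU healthy: PSU_ALL_SHUT goes low and every EN is forced low.
print(signal_processing_unit([(0, 1), (1, 0), (0, 0), (0, 0)], all_enabled))
```

Note that, despite their names, the PSU_i_SHUT and PSU_ALL_SHUT signals are high while power is good, and it is their low level that marks an abnormal power-down.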

For a better understanding of the embodiments of the present application, referring to fig. 7, fig. 7 is a logic truth table corresponding to each input signal and each output signal in the signal processing unit.

Referring to fig. 8, fig. 8 is a first power-down timing chart of an AC mains power down system according to an embodiment of the present application.

The timing chart shown in fig. 8 is a power-down timing chart of the AC mains power-down system before the signal processing unit is introduced.

After the AC mains is powered down, the PSU_AC_OK signal goes low after a time interval t1, PSU_PWROK goes low a time interval t2 after PSU_AC_OK goes low, P24V_PSU powers down a time interval t3 after PSU_PWROK goes low, and P12V_STBY and P3V3_STBY then power down in sequence.

At this time, the CPLD is still in a normal operation state. When it detects that EN_P12V_GPUn is high and PG_P12V_GPUn is low (the position identified by the dashed box in the timing diagram of fig. 8), the CPLD generates a state anomaly signal FAULT#_P12V_GPUn (the state anomaly signal is low, which indicates that the power supply of EFUSE: P12V_GPUn is abnormal) and transmits the GPU EFUSE anomaly signal to the BMC. Wherein:

t1 represents the time interval from the power-down of the AC mains to the PSU_AC_OK signal going low; t2 represents the time interval from the PSU_AC_OK signal going from high to low to the PSU_PWROK signal going from high to low; t3 represents the time interval from the PSU_PWROK signal going from high to low to the power-down of P24V_PSU; t4 represents the time interval from the power-down of P24V_PSU to the power-down of P12V_STBY; t5 represents the time interval from the power-down of P12V_STBY to the power-down of P3V3_STBY; and t6 represents the time interval from the EN_P12V_GPUn signal going from high to low to the PG_P12V_GPUn signal going from high to low.

Referring to fig. 9, fig. 9 is a second power-down timing chart of an AC mains power down system according to an embodiment of the present application.

The timing chart shown in fig. 9 is a power-down timing chart of the AC mains power-down system after the signal processing unit is introduced. In this case, the AC input is abnormal, and PSU_AC_OK goes low before PSU_PWROK.

The difference from the timing before the signal processing unit was introduced is that, when the AC mains powers down (the PSU input voltage is abnormal), the enable signal EN_P12V_GPUn of the GPU EFUSE is pulled low synchronously, while the timing of the other signals is unchanged. At this time, the CPLD detects that EN_P12V_GPUn has gone low before PG_P12V_GPUn (the position identified by the dashed box in the timing diagram of fig. 9), so the CPLD determines that EFUSE: P12V_GPUn is undergoing a normal power-down operation and does not trigger the state exception signal indicating that the GPU EFUSE is in an abnormal state.

Referring to fig. 10, fig. 10 is a third power-down timing chart of an AC mains power down system according to an embodiment of the present application.

The timing diagram shown in fig. 10 is a power-down timing diagram of the AC mains power-down system after the signal processing unit is introduced, for the case in which the PSU output voltage is abnormal and PSU_PWROK goes low before PSU_AC_OK.

The difference from the timing before the signal processing unit was introduced is that, when the PSU output is abnormal, the enable signal EN_P12V_GPUn of the GPU EFUSE is pulled low synchronously, while the timing of the other signals is unchanged. At this time, the CPLD detects that EN_P12V_GPUn goes low before PG_P12V_GPUn (the position identified by the dashed box in the timing diagram of fig. 10), so the CPLD determines that EFUSE: P12V_GPUn is undergoing a normal power-down operation and does not trigger the state exception signal indicating that the GPU EFUSE is in an abnormal state.
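The three power-down cases of fig. 8 to fig. 10 can be compared with a small timing sketch. It is a simplification under stated assumptions: the PSUs are lumped into one pair of indication signals, falling edges are represented by hypothetical timestamps in milliseconds, EN is assumed to stay high throughout the power-down when it is not gated by the PSU signals, and the function names are invented for the example.

```python
from typing import Optional


def en_fall_time(ac_ok_fall: Optional[float], pwrok_fall: Optional[float],
                 gated: bool) -> Optional[float]:
    """Time at which EN goes low; None means EN stays high during the power-down."""
    if not gated:
        return None  # before the signal processing unit: EN is not tied to the PSU signals
    falls = [t for t in (ac_ok_fall, pwrok_fall) if t is not None]
    return min(falls) if falls else None


def false_alarm(en_fall: Optional[float], pg_fall: float) -> bool:
    """True when PG goes low while EN is still high."""
    return en_fall is None or en_fall > pg_fall


PG_FALL_MS = 30.0  # hypothetical time at which PG_P12V_GPUn goes low

print(false_alarm(en_fall_time(10.0, 20.0, gated=False), PG_FALL_MS))  # True  (fig. 8 case)
print(false_alarm(en_fall_time(10.0, 20.0, gated=True), PG_FALL_MS))   # False (fig. 9 case)
print(false_alarm(en_fall_time(None, 5.0, gated=True), PG_FALL_MS))    # False (fig. 10 case)
```

In both gated cases EN follows whichever PSU indication signal falls first, so it is already low when PG drops and no state exception signal is raised.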

Based on the description of the foregoing embodiments, the scheme for preventing false alarms on power failure is illustrated below by taking the power supply of an AI server system configured with 8 450 W GPUs as an example.

In some embodiments, the method of the application comprises the steps of:

1) Arrange and install 8 GPUs on the system substrate, and split the GPUs into two groups of 4 GPUs each.

2) Current requirement referred to 54 V = 8 x 450/54/0.97 + 800/54/0.97 = 84 A. The OCP point of the front-end 54 V EFUSE is set to 1.2 x 84 = 100.8 A. Accordingly, the sampling resistor (Rsense) selected for a single group of EFUSEs is 1 mΩ, 3 W, 1%, 2512 package, quantity 2.

Here, 1 mΩ is a low-resistance design suitable for high-current scenarios (such as server power supplies and motor drives).

The maximum current supportable by the rated power of 3W is 173A, and a safety margin (which is usually used by 50-70% of the rated power) is reserved in actual work, namely, a single resistor can stably support about 86A-121A current.

The accuracy of 1% can meet most current monitoring requirements (such as overcurrent protection and power calculation).

2512 Package specifications help to spread heat and reduce junction temperature.

3) The system substrate is powered through a high-performance power supply cable, and the power P24V_PSU is taken from the PDB that is interconnected with the 2+2 redundant PSU terminals.

4) A 54 V scheme may be selected for the P54V EFUSE.

5) On the system substrate, a group of 12 V EFUSEs is added at the input of each of GPU0, GPU1, GPU2, GPU3, GPU4, GPU5, GPU6 and GPU7. For a single 450 W GPU, the supply current referred to 12 V is 37.5 A. The EN_P12V_GPUn signal of each EFUSE group is interconnected with the CPLD, and the PG_P12V_GPUn signal is fed back to the CPLD (n=0, 1, ..., 7).

6) Based on the truth table shown in fig. 7, the truth-table logic is implemented in the CPLD: the resulting PSU_ALL_SHUT signal is ANDed with the EN signal of each group of GPU EFUSE, and the output EN_P12V_GPUn_NEW is fed back to the CPLD judging logic unit. If the CPLD detects that the PG_P12V_GPUn signal goes low before EN_P12V_GPUn_NEW, it judges that the GPU EFUSE is in an abnormal state; if the CPLD detects that the PG_P12V_GPUn signal goes low after EN_P12V_GPUn_NEW, it judges that the GPU EFUSE has powered down normally. The judging signal FAULT#_P12V_GPUn is then transmitted to the input port of the signal switching unit.

7) A signal switching unit is added between the BMC and the CPLD to convert the GPIO signals to the I2C bus. The PG_P12V_GPUn signal (n=0, 1, ..., 7) is written into a port register; the BMC then accesses the port register through the I2C bus to obtain the register value and parses out the corresponding FAULT signal.

8) Finally, the BMC records the analyzed FAULT signal into a FAULT log file.
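The current figures quoted in the steps above can be reproduced with a short calculation. The helper name is hypothetical, and reading the factor 0.97 as a conversion efficiency is an assumption; the additional 800 W term is taken directly from the formula in step 2, which does not say what load it represents.

```python
def input_current(power_w: float, voltage_v: float, efficiency: float = 1.0) -> float:
    """Current drawn at voltage_v to deliver power_w through a stage of the given efficiency."""
    return power_w / voltage_v / efficiency


EFFICIENCY = 0.97  # factor from step 2, assumed here to be a conversion efficiency

gpu_current_54v = input_current(8 * 450, 54, EFFICIENCY)  # eight 450 W GPUs
other_current_54v = input_current(800, 54, EFFICIENCY)    # additional 800 W term from step 2
total_54v = gpu_current_54v + other_current_54v
ocp_point = 1.2 * round(total_54v)                        # 20 % margin on the rounded value
gpu_current_12v = input_current(450, 12)                  # single GPU at 12 V

print(f"54 V current requirement: {total_54v:.1f} A")    # 84.0 A
print(f"54 V EFUSE OCP point:     {ocp_point:.1f} A")    # 100.8 A
print(f"12 V current per GPU:     {gpu_current_12v:.1f} A")  # 37.5 A
```

The results match the 84 A, 100.8 A and 37.5 A values used in the steps.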

The power supply state monitoring system provided by the embodiment of the application can realize the following beneficial effects:

1) The PSU_AC_OK signal of the 54 V PSU is introduced as an input signal with which the CPLD detects and judges whether the GPU EFUSE is in an abnormal state when the AC mains powers down.

2) The PSU_PWROK signal of the 54 V PSU is introduced as an input signal with which the CPLD detects and judges whether the GPU EFUSE is in an abnormal state when the PSU output powers down abnormally.

3) A PSU signal processing unit is introduced into the CPLD, and whether the GPU EFUSE is in an abnormal state is judged through the internal logic operation processing of the CPLD, so that false log records for the GPU EFUSE are avoided and the efficiency of fault location and operation and maintenance work is improved.

In some embodiments of the present application, a server is provided, where the server includes a power supply status monitoring system, where the power supply status monitoring system includes at least one power supply unit, a power distribution unit, a controller, and a first overcurrent protection unit, where the power distribution unit is connected in series between the power supply unit and the first overcurrent protection unit.

The power distribution unit is used for outputting a first indication signal and a second indication signal, wherein the first indication signal is used for indicating whether the input voltage of the power supply unit is within a preset input voltage range, and the second indication signal is used for indicating whether the output voltage of the power supply unit is within a preset output voltage range.

In some embodiments, the controller is specifically configured to determine whether to output a status exception signal according to the updated control signal and the monitoring signal.

In some embodiments, the signal processing unit includes at least one first logic gate, a second logic gate and a third logic gate, where the first logic gate is configured to perform a first operation on the first indication signal and the second indication signal and output a first signal, the second logic gate is configured to perform a second operation on the basis of the first signal and output a second signal, and the third logic gate is configured to perform a first operation on the second signal and the control signal and update the control signal according to an operation result.

In some embodiments, the first operation includes a logical AND operation, and the second operation includes a logical OR operation.
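As a non-limiting illustration, the three-gate structure described above can be modelled as below. The sketch assumes one first logic gate per power supply unit and a second logic gate that ORs the outputs of all first logic gates; this reading of "at least one first logic gate" is an assumption, and the function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def update_control(psu_signals: Sequence[Tuple[bool, bool]], control: bool) -> bool:
    """Return the updated control signal of the overcurrent protection unit.

    psu_signals holds one (first_indication, second_indication) pair per
    power supply unit; control is the original control (enable) signal.
    """
    # First logic gates: AND of the two indication signals of each PSU.
    first_signals = [first and second for first, second in psu_signals]
    # Second logic gate: OR over the first signals.
    second_signal = any(first_signals)
    # Third logic gate: AND of the second signal and the control signal.
    return second_signal and control


# One healthy PSU is enough to keep the control signal asserted.
one_good = update_control([(True, True), (False, False)], True)

# No PSU has both indications high: the control signal is de-asserted.
none_good = update_control([(False, True), (True, False)], True)
```

Under these assumptions the updated control signal stays high only while the original control signal is high and at least one power supply unit reports both its input and its output as normal.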

In some embodiments, the first indication signal is a first level signal when the fluctuation amplitude of the input voltage of the power supply unit is less than or equal to a first amplitude threshold value, and the first indication signal is a second level signal when the fluctuation amplitude of the input voltage of the power supply unit is greater than the first amplitude threshold value.

In some embodiments, the second indication signal is a first level signal when the fluctuation amplitude of the output voltage of the power supply unit is less than or equal to a second amplitude threshold value, and is a second level signal when the fluctuation amplitude of the output voltage of the power supply unit is greater than the second amplitude threshold value.

In some embodiments, the first level signal is a high level signal, and the second level signal is a low level signal.
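As a non-limiting illustration, the mapping from fluctuation amplitude to signal level described above can be written as a single comparison. The numeric values used in the example are placeholders chosen for illustration only; the application does not specify the thresholds.

```python
HIGH = 1  # first level signal
LOW = 0   # second level signal


def indication_level(fluctuation_amplitude: float, amplitude_threshold: float) -> int:
    """Level of an indication signal for a given voltage fluctuation amplitude."""
    # At or below the threshold the voltage is treated as within range.
    return HIGH if fluctuation_amplitude <= amplitude_threshold else LOW


# Placeholder values: a 1.0 V threshold with 0.5 V and 1.5 V fluctuations.
within_range = indication_level(0.5, 1.0)
out_of_range = indication_level(1.5, 1.0)
```

The same function applies to both the first indication signal (input voltage, first amplitude threshold) and the second indication signal (output voltage, second amplitude threshold).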

In some embodiments, the system further comprises at least one logic gate buffer unit connected in series between the power distribution unit and the controller, wherein the logic gate buffer unit is used for performing isolation enhancement processing on the first indication signal and the second indication signal.

In some embodiments, the system further comprises at least one processor coupled to the first overcurrent protection unit.

In some embodiments, the system further comprises a voltage adjusting unit connected in series between the power distribution unit and the first overcurrent protection unit, wherein the voltage adjusting unit is used for converting the voltage output by the power distribution unit into the power supply voltage of the processor.

In some embodiments, the system further comprises a second overcurrent protection unit connected in series between the power distribution unit and the voltage adjustment unit.

In some embodiments, the system further comprises a voltage conversion unit connected in series between the voltage adjustment unit and the controller, wherein the voltage conversion unit is used for converting the voltage output by the voltage adjustment unit into the power supply voltage of the controller.

In some embodiments, the system further includes a third overcurrent protection unit connected in series between the voltage adjustment unit and the voltage conversion unit.

In some embodiments, the system further comprises a baseboard management controller, the controller is connected to the baseboard management controller, the controller is configured to send a status exception signal to the baseboard management controller, and the baseboard management controller is configured to output a log file based on the status exception signal.

In some embodiments, the system further comprises a signal switching unit, wherein the signal switching unit is connected in series between the baseboard management controller and the controller, the controller is used for sending a state abnormal signal to the signal switching unit, and the signal switching unit is used for storing the state abnormal signal in a corresponding port register.

In some embodiments, the baseboard management controller is configured to address a serial bus address of the signal switching unit through the serial bus to obtain a value in the port register, where the value is used to indicate whether the first overcurrent protection unit is in an abnormal state.
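As a non-limiting illustration, once the baseboard management controller has read the port register value over the serial bus, the per-unit state can be decoded as sketched below. The bit layout (bit n corresponds to overcurrent protection unit n) and the active-low polarity are assumptions made for the example; the bus read itself is omitted and the register value is passed in as an integer.

```python
from typing import List


def decode_fault_register(value: int, num_units: int = 8, active_low: bool = True) -> List[int]:
    """Return the indices of units whose fault bit is asserted.

    value      - raw port register value already read over the serial bus
    num_units  - number of monitored overcurrent protection units
    active_low - if True, a 0 bit means "abnormal" (assumed polarity)
    """
    if not 0 <= value < (1 << num_units):
        raise ValueError("register value out of range")
    faulted = []
    for n in range(num_units):
        bit = (value >> n) & 1
        asserted = (bit == 0) if active_low else (bit == 1)
        if asserted:
            faulted.append(n)
    return faulted


# All bits high with active-low polarity: no unit is abnormal.
no_faults = decode_fault_register(0xFF)

# Bit 2 low: unit 2 is reported abnormal.
unit_2 = decode_fault_register(0b11111011)
```

The list returned by such a decoder could then be used to compose the entries written to the log file.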

In some embodiments, the first overcurrent protection unit includes an electronic fuse.

It should be noted that, for the specific structure and implementation principle of the power supply state monitoring system, reference may be made to the embodiments shown in fig. 2 to 7, which are not repeated here.

The server provided by the embodiment of the application comprises a power supply state monitoring system. By introducing two signals from the power supply unit as the basis on which the controller monitors and judges whether the first overcurrent protection unit is working abnormally, the system effectively prevents the state of the first overcurrent protection unit from being misreported because of a PSU abnormality, and thereby improves the operation and maintenance efficiency of the server.

Optionally, the power supply state monitoring system may also be applied to a switch with 54V power supply, a storage product, and the like, which is not described in detail in the embodiment of the present application.

In describing embodiments of the present application, it should be noted that, unless explicitly stated or limited otherwise, the terms "coupled" and "connected" should be interpreted broadly, for example, as a fixed connection, as an indirect connection via an intermediary, as a communication between two elements or as an interaction relationship between two elements. The specific meaning of the above terms in the embodiments of the present application will be understood by those of ordinary skill in the art according to specific circumstances.

It should be understood that the division of the modules in the above computing device is merely a division of a logic function, and each function may correspond to one functional module, or two or more functions may be integrated into one functional module. In actual implementation, all or part of the modules may be integrated into one physical entity, or may be distributed in different physical entities.

The foregoing detailed description of the application has been presented for purposes of illustration and description only, and is not intended to limit the scope of the application.

Finally, it should be noted that the foregoing embodiments merely illustrate, and do not limit, the technical solutions of the embodiments of the present application. Although the embodiments of the present application have been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the embodiments of the present application.

Claims

1. A power supply state monitoring system, characterized by comprising at least one power supply unit, a power distribution unit, a controller and a first overcurrent protection unit, wherein the power distribution unit is connected in series between the power supply unit and the first overcurrent protection unit;

the power distribution unit is used for outputting a first indication signal and a second indication signal, wherein the first indication signal is used for indicating whether the input voltage of the power supply unit is within a preset input voltage range, and the second indication signal is used for indicating whether the output voltage of the power supply unit is within a preset output voltage range;

The controller comprises a signal processing unit, wherein the signal processing unit is used for carrying out logic operation on the first indication signal, the second indication signal and the control signal of the first overcurrent protection unit, and updating the control signal of the first overcurrent protection unit based on an operation result;

the controller is used for determining whether to output a state abnormal signal according to the updated control signal, and the state abnormal signal is used for indicating that the first overcurrent protection unit is in an abnormal state.

2. The system of claim 1, wherein the controller is further configured to:

and acquiring a monitoring signal corresponding to the first overcurrent protection unit, wherein the monitoring signal is used for indicating whether the power supply voltage of the first overcurrent protection unit is in a preset power supply voltage range.

3. The system according to claim 2, wherein the controller is specifically configured to:

And determining whether to output the state abnormality signal according to the updated control signal and the monitoring signal.

4. The system of claim 1, wherein the control signal is used to control the first over-current protection unit to turn on or off.

5. The system of claim 4, wherein the signal processing unit comprises at least one first logic gate, a second logic gate, and a third logic gate;

the first logic gate is used for performing a first operation on the first indication signal and the second indication signal and outputting a first signal;

the second logic gate is used for performing a second operation based on the first signal and outputting a second signal;

the third logic gate is used for performing a first operation on the second signal and the control signal, and updating the control signal according to an operation result.

6. The system of claim 5, wherein the first operation comprises a logical AND operation and the second operation comprises a logical OR operation.

7. The system of claim 6, wherein the first indication signal is a first level signal when a fluctuation amplitude of an input voltage of the power supply unit is less than or equal to a first amplitude threshold;

when the fluctuation amplitude of the input voltage of the power supply unit is larger than the first amplitude threshold value, the first indication signal is a second level signal.

8. The system of claim 6, wherein the second indication signal is a first level signal when a fluctuation amplitude of the output voltage of the power supply unit is less than or equal to a second amplitude threshold;

and when the fluctuation amplitude of the output voltage of the power supply unit is larger than the second amplitude threshold value, the second indication signal is a second level signal.

9. The system of claim 7 or 8, wherein the first level signal is a high level signal and the second level signal is a low level signal.

10. The system of claim 1, further comprising at least one logic gate buffer unit connected in series between the power distribution unit and the controller;

the logic gate buffer unit is used for performing isolation enhancement processing on the first indication signal and the second indication signal.

11. The system of claim 1, further comprising at least one processor, wherein the processor is coupled to the first overcurrent protection unit.

12. The system of claim 11, further comprising a voltage adjustment unit connected in series between the power distribution unit and the first overcurrent protection unit;

the voltage adjustment unit is used for converting the voltage output by the power distribution unit into the power supply voltage of the processor.

13. The system of claim 12, further comprising a second overcurrent protection unit connected in series between the power distribution unit and the voltage adjustment unit.

14. The system of claim 12, further comprising a voltage conversion unit connected in series between the voltage adjustment unit and the controller;

The voltage conversion unit is used for converting the voltage output by the voltage adjustment unit into the power supply voltage of the controller.

15. The system of claim 14, further comprising a third overcurrent protection unit connected in series between the voltage adjustment unit and the voltage conversion unit.

16. The system of claim 1, further comprising a baseboard management controller, the controller being coupled to the baseboard management controller;

The controller is used for sending the state abnormal signal to the baseboard management controller;

the baseboard management controller is used for outputting a log file based on the state abnormal signal.

17. The system of claim 16, further comprising a signal switching unit connected in series between the baseboard management controller and the controller;

The controller is used for sending the state abnormal signal to the signal switching unit;

The signal switching unit is used for storing the state abnormal signal in a corresponding port register.

18. The system of claim 17, wherein the baseboard management controller is configured to address a serial bus address of the signal switching unit via a serial bus to obtain a value in the port register, the value being used to indicate whether the first overcurrent protection unit is in an abnormal state.

19. The system of claim 1, wherein the first overcurrent protection unit comprises an electronic fuse.

20. A server, characterized in that it comprises a power supply status monitoring system according to any one of claims 1 to 19.