CN112905377B - A method, system and medium for avoiding BMC crash - Google Patents
- Publication number: CN112905377B (application CN202110181358.9A)
- Authority: CN (China)
- Prior art keywords
- interrupt
- flag
- strategy
- instruction
- parameter
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3024—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of server detection, and in particular to a method, system and medium for avoiding a BMC crash.
Background
When a user presses the power instruction generator to start the server, the BMC records an entry in the system event log (SEL) noting that the power instruction generator was pressed. Likewise, when the user presses the restart instruction generator or the interrupt instruction generator, the BMC records a SEL entry noting that the corresponding instruction generator was pressed. In each case the BMC registers a GPIO interrupt handler and detects the high/low level change on the GPIO pin to learn that an instruction generator on the server system has been pressed by the user. The GPIO interrupt function then quickly processes the event corresponding to the pressed instruction generator.
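As a rough illustration of the mechanism described above, the following Python sketch simulates a GPIO pin whose level change fires a registered handler that writes a SEL entry. It does not use any real BMC or GPIO library: `GpioPin`, `register_handler`, `set_level`, the log text, and the assumption that a press pulls the line low are all hypothetical and introduced only for this example.

```python
sel_log = []  # stand-in for the BMC system event log (SEL)


class GpioPin:
    """Simulated GPIO pin; real firmware would use its platform's GPIO driver."""

    def __init__(self, name, level=1):
        self.name = name
        self.level = level  # assumption: the button line idles high
        self.handler = None

    def register_handler(self, handler):
        """Register the function to call when the pin level changes."""
        self.handler = handler

    def set_level(self, level):
        """Change the pin level; only an actual change fires the handler."""
        changed = level != self.level
        self.level = level
        if changed and self.handler is not None:
            self.handler(self)


def log_button_event(pin):
    # Assumption: a press pulls the line low, so log only on the falling edge.
    if pin.level == 0:
        sel_log.append(f"{pin.name} instruction generator pressed")


power = GpioPin("power")
power.register_handler(log_button_event)
power.set_level(0)  # press
power.set_level(1)  # release
```

A faulty button that toggles the line rapidly would call `log_button_event` on every edge, which is the interrupt flood the rest of the document is concerned with.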
However, when an instruction generator malfunctions, it can generate a large number of interrupts in a short time. The BMC then spends its time servicing these interrupt functions, which leaves too little of the BMC's CPU capacity available, so upper-layer services can no longer be handled promptly and effectively.
Summary of the Invention
The invention mainly addresses the problem that, when a server instruction generator behaves abnormally, the BMC repeatedly processes the spurious interrupts, which exhausts the BMC's CPU capacity or crashes the BMC.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for avoiding a BMC crash, comprising the following steps:
executing a monitoring strategy that monitors CPU utilization and the interrupt instructions of the instruction generators;
setting a flag area, a non-flag area, an interrupt parameter threshold and a flag threshold;
when the CPU utilization reaches the flag area, if an instruction generator generates an interrupt instruction, applying an increment strategy to the interrupt parameter and the flag corresponding to that instruction generator;
when the CPU utilization reaches the non-flag area, applying a decrement strategy to the flag corresponding to the instruction generator.
Further, the step of setting the flag area, the non-flag area, the interrupt parameter threshold and the flag threshold further includes: the non-flag area includes a first threshold interval, a second threshold interval and a buffer interval, and the flag area includes a first parameter interval and a second parameter interval.
Further, the increment strategy includes a first increment strategy and a second increment strategy;
when the CPU utilization reaches the first parameter interval, the monitoring strategy monitors whether an instruction generator generates an interrupt instruction; if so, the first increment strategy is applied to the interrupt parameter according to the interrupt instruction; when the interrupt parameter reaches the interrupt parameter threshold, the first increment strategy is applied to the flag; and when the flag is greater than the flag threshold, an emergency strategy is executed.
Further, when the CPU utilization reaches the second parameter interval, the monitoring strategy monitors whether an instruction generator generates an interrupt instruction; if so, the second increment strategy is applied to the interrupt parameter according to the interrupt instruction; when the interrupt parameter reaches the interrupt parameter threshold, the first increment strategy is applied to the flag; and when the flag is greater than the flag threshold, the emergency strategy is executed.
Further, the step of executing the emergency strategy further includes: disabling the interrupt handler corresponding to the instruction generator, stopping the processing of that instruction generator's interrupt instructions, storing the information about the disabled interrupt handler in the system event log, and raising an alarm.
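The emergency strategy can be pictured with a minimal Python sketch, given here under stated assumptions rather than as the patented implementation. The names `InstructionGenerator`, `run_emergency_strategy` and `should_process_interrupt`, and the use of plain lists for the system event log and the alarm channel, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstructionGenerator:
    name: str
    handler_enabled: bool = True  # whether its interrupt handler is active


def run_emergency_strategy(gen, sel_log, alarms):
    """Disable the handler, record the fact in the event log, raise an alarm."""
    gen.handler_enabled = False
    sel_log.append(f"interrupt handler disabled: {gen.name}")
    alarms.append(f"instruction generator abnormal: {gen.name}")


def should_process_interrupt(gen):
    """Interrupts from a generator with a disabled handler are ignored."""
    return gen.handler_enabled


sel_log, alarms = [], []
restart = InstructionGenerator("restart")
run_emergency_strategy(restart, sel_log, alarms)
```

After the call, `should_process_interrupt(restart)` is false, so further interrupts from that generator no longer consume CPU time, while the log entry tells maintenance staff which generator was cut off.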
Further, the step of applying the decrement strategy to the flag corresponding to the instruction generator when the CPU utilization reaches the non-flag area further includes: if the CPU utilization reaches the first threshold interval, initializing the interrupt parameter and the flag corresponding to the instruction generator.
Further, the step of applying the decrement strategy to the flag corresponding to the instruction generator when the CPU utilization reaches the non-flag area also includes:
when the CPU utilization reaches the second threshold interval, applying the first decrement strategy to the flag corresponding to the instruction generator and determining whether the interrupt handler corresponding to the instruction generator is disabled; if the interrupt handler is disabled, determining whether the flag of the instruction generator equals the flag threshold; if it is equal, keeping the interrupt handler disabled; if it is less, re-enabling the interrupt handler and storing the information about the re-enabled interrupt handler in the system event log.
Further, the first increment strategy increments the interrupt parameter or the flag in steps of a first parameter; when the interrupt parameter or the flag reaches the interrupt parameter threshold or the flag threshold, respectively, counting continues from the initial value;
the second increment strategy increments the interrupt parameter or the flag in steps of a second parameter; when the interrupt parameter or the flag reaches the interrupt parameter threshold or the flag threshold, respectively, counting continues from the initial value.
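One possible reading of "counting continues from the initial value" is a counter that wraps back to its initial value once it reaches the threshold. The sketch below applies that reading to the interrupt parameter only. The function name `increment_interrupt_parameter`, the choice of `>=` for "reaches", and the default initial value of 0 are assumptions made for illustration.

```python
def increment_interrupt_parameter(value, step, threshold, initial=0):
    """Add `step` to the interrupt parameter.

    Returns (new_value, reached). When the threshold is reached, the
    parameter restarts from `initial` and `reached` is True, which is the
    point at which the caller would increment the flag.
    """
    value += step
    if value >= threshold:
        return initial, True
    return value, False


# First increment strategy (step 1) and second increment strategy (step 2).
after_first = increment_interrupt_parameter(9, step=1, threshold=10)
after_second = increment_interrupt_parameter(8, step=2, threshold=10)
```

With a threshold of 10, both calls above reach the threshold, so both return the initial value together with `True`.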
The present invention also provides a system for avoiding a BMC crash, applied to a server, the system comprising: a monitoring module, a weighting module, a strategy module and a system event module;
the monitoring module is used to monitor the CPU utilization of the server and the interrupt instructions of the server's instruction generators, and to set the flag area and the non-flag area;
the strategy module stores the increment strategy, the decrement strategy and the emergency strategy;
when the CPU utilization reaches the flag area, the weighting module executes the increment strategy, weights the interrupt parameter and the flag according to the interrupt instruction, and applies the emergency strategy to the interrupt handler corresponding to the instruction generator;
when the CPU utilization reaches the non-flag area, the weighting module applies the decrement strategy to the flag, and applies the emergency strategy to the interrupt handler corresponding to the instruction generator;
the system event module is used to store information about the interrupt handlers.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the methods for avoiding a BMC crash described above.
The beneficial effects of the present invention are:
1. With the method for avoiding a BMC crash according to the present invention, when a server instruction generator behaves abnormally during factory manufacturing, the BMC does not crash and become unresponsive, so server testers are not misled into thinking that the BMC firmware design is at fault. The abnormal instruction generator can be located through the system event log, so server R&D personnel do not need to travel to the factory to confirm the cause of the abnormality.
2. The system for avoiding a BMC crash according to the present invention weights the interrupt instructions of the instruction generators according to the strategies in the strategy module and, once weighting is complete, disables the interrupt handler according to the interrupt threshold. When the conditions for re-enabling are met, the interrupt handler is enabled again. Information about disabled or re-enabled interrupt handlers is saved as system events, which makes troubleshooting easier for maintenance personnel.
3. The medium for avoiding a BMC crash according to the present invention applies increment or decrement strategies to the interrupt parameter corresponding to an instruction generator and generates a flag when the interrupt parameter reaches its threshold. The flag is used to determine whether the instruction generator is abnormal. Once an abnormality is detected, the interrupt handler of the abnormal instruction generator is disabled, which reduces the load on the CPU and keeps the CPU running normally.
Description of the Drawings
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the method for avoiding a BMC crash according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of the method for avoiding a BMC crash according to Embodiment 1 of the present invention;
FIG. 3 is an interval division diagram of the method for avoiding a BMC crash according to Embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of the system for avoiding a BMC crash according to Embodiment 2 of the present invention.
Detailed Description
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the present invention, unless otherwise expressly specified and limited, the terms "installed", "connected" and "connection" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or an internal connection between two components; and it may be wireless or wired. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific situation.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or the number of the technical features referred to. A feature qualified by "first" or "second" may therefore expressly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, for example two or three, unless otherwise expressly and specifically limited.
Note that, in the description of the present invention, BMC (Baseboard Management Controller) refers to the baseboard management controller and CPU (Central Processing Unit) refers to the central processing unit.
Embodiment 1
An embodiment of the present invention provides a method for avoiding a BMC crash. Referring to FIG. 1 to FIG. 3, the method includes the following steps:
When a server instruction generator becomes abnormal, it floods the BMC with interrupts, the BMC's CPU load spikes, and the BMC crashes. To avoid this, the invention first creates a monitoring strategy that continuously monitors the server's CPU utilization, and sets CPU utilization thresholds and an interrupt parameter. CPU utilization is divided into a flag area and a non-flag area: the flag area is CPU utilization greater than or equal to 70%, and the non-flag area is CPU utilization below 70%. When CPU utilization is at least 70% and below 90%, the monitoring strategy checks whether an instruction generator interrupt has occurred on the server. Each instruction generator has its own interrupt handler, and an interrupt counter counts the interrupts produced by presses of each instruction generator separately. The interrupt parameter is initialized to 0 at the start. Each time pressing an instruction generator raises an interrupt, the interrupt parameter is incremented by 1. If no interrupt occurs, the monitoring strategy simply continues to monitor CPU utilization.
When the monitoring strategy observes CPU utilization of 90% or more, the CPU can no longer cope with a sudden dense burst of interrupts, so the interrupt parameter is weighted more heavily: the strategy checks whether an instruction generator interrupt has occurred on the server and, if so, increments the interrupt parameter by 2.
Adding 1 or 2 to the interrupt parameter is only a simple weighting scheme. The specific numbers are neither the point of this discussion nor the scope to be protected; they are an idealized case. More suitable weights can be chosen according to the server's running state, its application scenario, and the performance of its CPU.
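The weighting described above can be sketched as a small function that maps CPU utilization to the amount added to the interrupt parameter per interrupt. The 70% and 90% boundaries and the weights 1 and 2 come from this embodiment. The function name `interrupt_weight` and its keyword parameters are hypothetical and exist only to show that the weights are tunable.

```python
def interrupt_weight(cpu_util, first_step=1, second_step=2):
    """Return how much one interrupt adds to the interrupt parameter.

    cpu_util is a percentage. Below 70% (non-flag area) interrupts are
    not weighted at all; from 70% up to 90% the first step applies; at
    90% and above the second, heavier step applies.
    """
    if cpu_util >= 90:
        return second_step
    if cpu_util >= 70:
        return first_step
    return 0


weights = [interrupt_weight(u) for u in (65, 70, 89.9, 90, 100)]
```

A platform with a slower BMC core could, for example, call `interrupt_weight(cpu_util, first_step=2, second_step=5)` to react faster, in line with the remark that the values 1 and 2 are not fixed.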
When the interrupt parameter exceeds its upper limit of 10, one flag is generated. When the number of flags exceeds 10, the interrupt handler of the instruction generator that produced the interrupts (for example, the shutdown instruction generator or the restart instruction generator) is disabled. That is, the instruction generator is judged to be abnormal and the CPU refuses its interrupt instructions, which protects the normal operation of the server's upper-layer services. At the same time, information about the instruction generator is recorded in the system event log (SEL). This alerts staff that the abnormal instruction generator has affected the normal operation of the CPU so that they can respond quickly.
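To get a feel for these numbers, the following sketch counts how many interrupts a stuck instruction generator must raise before its handler is disabled. It is a worked example under assumptions, not the patented implementation: the document describes the flag as generated both when the interrupt parameter "exceeds" 10 and when it "reaches" 10, and this sketch assumes "reaches" (`>=`). It also assumes the interrupt parameter restarts from 0 after each flag and that CPU utilization stays in one flag interval throughout. The function name is hypothetical.

```python
def interrupts_until_disabled(step, param_threshold=10, flag_threshold=10):
    """Count interrupts needed before the flag count exceeds flag_threshold."""
    param = 0   # interrupt parameter, initialized to 0
    flags = 0   # number of flags generated so far
    count = 0   # interrupts seen
    while flags <= flag_threshold:
        count += 1
        param += step
        if param >= param_threshold:  # assumption: "reaches" means >=
            param = 0                 # counting restarts from the initial value
            flags += 1
    return count


at_weight_1 = interrupts_until_disabled(step=1)  # 70% <= utilization < 90%
at_weight_2 = interrupts_until_disabled(step=2)  # utilization >= 90%
```

Under these assumptions the handler is disabled after 110 interrupts at weight 1 and after 55 interrupts at weight 2, so the heavier weighting halves the reaction time when the CPU is closest to saturation.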
When the monitoring strategy observes CPU utilization below 70%, i.e. in the non-flag area, the CPU has enough spare capacity to handle some interrupt instructions, so, depending on the situation, the disabled interrupt handler of an instruction generator may be re-enabled.
When CPU utilization is below 70% and greater than or equal to 60%, it is in the buffer interval. The monitoring strategy does not touch the interrupt handlers; it simply waits to see whether CPU utilization rises or falls.
When the monitoring strategy observes CPU utilization below 60% and above 50%, it reduces the number of flags by one, down to a minimum of 0. Because the CPU's spare capacity is sufficient to cope with various interrupt instructions, interrupt handling can be attempted again. The monitoring strategy checks the current state of the instruction generator's interrupt handler. If, after the reduction, the instruction generator's flag count equals 10 (that is, it was originally 11), the instruction generator may have a serious problem that maintenance personnel have not yet dealt with, so its interrupt handler stays disabled. If the flag count has been reduced twice, CPU utilization has remained stable without leaving the non-flag area, having moved back and forth at least once between the buffer interval and the 50% to 60% interval. When the monitoring strategy finds that the flag count is less than 10, it re-enables the disabled interrupt handler and records this in the system event log, notifying maintenance personnel that the instruction generator has been re-enabled and that the CPU can process its interrupt instructions again.
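The decrement and re-enable rule in the paragraph above can be sketched as follows. This is a minimal illustration assuming a flag threshold of 10. The names `GeneratorState` and `apply_decrement` and the log text are hypothetical.

```python
from dataclasses import dataclass

FLAG_THRESHOLD = 10  # value used in this embodiment


@dataclass
class GeneratorState:
    name: str
    flags: int = 0
    handler_enabled: bool = True


def apply_decrement(state, sel_log):
    """Apply one flag decrement and re-enable the handler if allowed."""
    state.flags = max(0, state.flags - 1)  # the flag count never drops below 0
    if not state.handler_enabled and state.flags < FLAG_THRESHOLD:
        # Flag count is below the threshold: re-enable and record the event.
        state.handler_enabled = True
        sel_log.append(f"interrupt handler re-enabled: {state.name}")
    # If flags == FLAG_THRESHOLD the handler simply stays disabled.


sel_log = []
restart = GeneratorState("restart", flags=11, handler_enabled=False)
apply_decrement(restart, sel_log)   # 11 -> 10: stays disabled
first_pass = restart.handler_enabled
apply_decrement(restart, sel_log)   # 10 -> 9: re-enabled and logged
```

Requiring two decrements before re-enabling acts as hysteresis: a single brief dip in CPU utilization is not enough to bring a suspect handler back.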
When the monitoring strategy observes CPU utilization of 50% or less, the initialization strategy is triggered and clears the interrupt parameter and the flag count to zero, because at this point the CPU can handle a large amount of work and execute a large number of interrupt instructions.
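Taken together, Embodiment 1 divides CPU utilization into five intervals. The sketch below classifies a utilization sample into one of them. The `Zone` names and the `classify` function are hypothetical labels, and treating exactly 60% as part of the buffer interval is an assumption made to keep the intervals disjoint.

```python
from enum import Enum


class Zone(Enum):
    INITIALIZE = "clear interrupt parameter and flags"    # <= 50%
    DECREMENT = "decrement flags, maybe re-enable"        # 50% < u < 60%
    BUFFER = "wait, do nothing"                           # 60% <= u < 70%
    FLAG_FIRST = "count interrupts with weight 1"         # 70% <= u < 90%
    FLAG_SECOND = "count interrupts with weight 2"        # >= 90%


def classify(cpu_util):
    """Map a CPU utilization percentage to its interval."""
    if cpu_util <= 50:
        return Zone.INITIALIZE
    if cpu_util < 60:
        return Zone.DECREMENT
    if cpu_util < 70:
        return Zone.BUFFER
    if cpu_util < 90:
        return Zone.FLAG_FIRST
    return Zone.FLAG_SECOND


zones = [classify(u) for u in (50, 55, 60, 70, 90)]
```

The first three zones make up the non-flag area and the last two make up the flag area.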
Embodiment 2
An embodiment of the present invention provides a system for avoiding a BMC crash, applied to a server. Referring to FIG. 4, the system includes:
a monitoring module, a weighting module, an instruction generator, a strategy module and a system event module.
The instruction generator issues control instructions to the server, such as shutdown, restart and standby instructions. Each instruction it generates raises one interrupt on the server's BMC. When the instruction generator becomes abnormal and sends instructions to the BMC repeatedly, the BMC's CPU receives a large number of instructions in an instant and hangs, so the BMC can no longer work normally.
The monitoring module monitors CPU utilization and divides it into a flag interval and a non-flag interval. In this embodiment, based on the CPU's performance characteristics, CPU utilization greater than or equal to 70% is the flag interval and utilization below 70% is the non-flag interval. Once CPU utilization reaches 70%, the monitoring module subdivides the interval further.
When CPU utilization is greater than or equal to 70% and less than 90%, the monitoring module monitors whether an instruction generator generates an interrupt instruction. If so, the weighting module and the strategy module together weight the interrupt parameter and the flag. Each instruction generator (for example, the shutdown or restart instruction generator) is weighted separately. Each time an interrupt instruction is observed, the weighting module executes the strategy module's first increment strategy, adding 1 to the interrupt parameter of that instruction generator. When the interrupt parameter reaches 10, the first increment strategy is applied to the flag, increasing it step by step. When the flag count exceeds 10, the interrupt handler for that instruction generator is stopped and the BMC no longer accepts its interrupt instructions, since such frequent interrupt instructions indicate that the instruction generator is abnormal. The stopped interrupt handler is recorded in the system event module, which preserves the state of the failure so that maintenance personnel can repair the abnormal instruction generator.
Similarly, when CPU utilization is greater than or equal to 90%, the CPU has very little headroom left and a faster weighting strategy is needed. Each time the monitoring module observes an interrupt instruction from the shutdown or restart instruction generator, the second increment strategy is applied to that instruction generator's interrupt parameter, adding 2 each time. When the interrupt parameter reaches 10, the first increment strategy is applied to the flag. When the flag count exceeds 10, the interrupt handler for that instruction generator is stopped, the BMC no longer accepts its interrupt instructions, and the stopped interrupt handler is recorded in the system event module so that maintenance personnel can repair the abnormal instruction generator.
When CPU utilization drops into the non-flag area, it is subdivided further. Utilization below 70% and greater than or equal to 60% is the buffer interval. At this point it is not yet known whether CPU utilization is stable, so while it is in the buffer interval, interrupt instructions are not counted and no interrupt handler is enabled or disabled.
When CPU utilization drops to below 60% and above 50%, each entry into this interval triggers one flag decrement for all interrupt handlers, i.e. the flag of each instruction generator is reduced by 1. An interrupt handler is disabled only when its flag exceeds 10, i.e. reaches 11. A disabled interrupt handler is therefore re-enabled only when CPU utilization is in this interval and the corresponding flag is less than 10, which means CPU utilization has dropped into the interval twice and is stable. When the flag equals 10, the disabled interrupt handler is not re-enabled. When a disabled interrupt handler is re-enabled, the event is sent to the system event module to be saved, notifying maintenance personnel that this instruction generator has been through a disable and re-enable cycle and should be checked carefully.
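The phrase "each entry into this interval" suggests that the decrement is triggered by the transition into the interval rather than by every sample taken inside it. A minimal sketch of that edge-triggered reading follows. The function name `count_decrement_entries` and the idea of a list of periodic utilization samples are assumptions for illustration.

```python
def count_decrement_entries(samples):
    """Count how many times utilization enters the 50% to 60% interval.

    Consecutive samples inside the interval count as a single entry;
    utilization has to leave the interval and come back to count again.
    """
    entries = 0
    inside = False
    for util in samples:
        now_inside = 50 < util < 60
        if now_inside and not inside:
            entries += 1  # transition into the interval: one decrement
        inside = now_inside
    return entries


# Buffer -> interval -> buffer -> interval: two entries, so two decrements.
two_entries = count_decrement_entries([65, 55, 57, 65, 55])
# Staying inside the interval counts only once.
one_entry = count_decrement_entries([55, 55, 55])
```

With this reading, a handler disabled at a flag count of 11 is re-enabled only after utilization has dipped into the interval, returned to the buffer interval, and dipped again, which matches the back-and-forth movement described in Embodiment 1.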
When CPU utilization is less than or equal to 50%, the interrupt parameter and the flag are initialized. Initialization usually means resetting them to zero, but an initial value may also be set according to the performance of the CPU.
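The initialization step might look like the following sketch. The function name, the dictionary layout and the `initial_param` / `initial_flag` parameters are hypothetical; the description does not say whether a stopped handler is also reopened at this point, so the sketch leaves handler state untouched.

```python
def maybe_reset(cpu_util, states, initial_param=0, initial_flag=0):
    """Reinitialize counters when CPU utilization is <= 50%.

    `states` maps a generator name to a dict with the keys
    "interrupt_param" and "flag" (an assumed layout).
    Returns True if a reset was performed.
    """
    if cpu_util > 50:
        return False
    for state in states.values():
        # Defaults clear the counters; other values can be chosen
        # to suit the performance of the CPU.
        state["interrupt_param"] = initial_param
        state["flag"] = initial_flag
    return True
```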
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are only embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110181358.9A CN112905377B (en) | 2021-02-10 | 2021-02-10 | A method, system and medium for avoiding BMC crash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112905377A (en) | 2021-06-04 |
CN112905377B (en) | 2022-09-20 |
Family
ID=76123416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110181358.9A Active CN112905377B (en) | 2021-02-10 | 2021-02-10 | A method, system and medium for avoiding BMC crash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112905377B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106844136A (en) * | 2016-12-06 | 2017-06-13 | 北京智能管家科技有限公司 | The collection method and system of a kind of program crashing information |
CN110187996A (en) * | 2019-05-30 | 2019-08-30 | 苏州浪潮智能科技有限公司 | BMC host process method for diagnosing faults, device, equipment and readable storage medium storing program for executing |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6629252B1 (en) * | 1999-10-28 | 2003-09-30 | International Business Machines Corporation | Method for determining if a delay required before proceeding with the detected interrupt and exiting the interrupt without clearing the interrupt |
FR2830952B1 (en) * | 2001-10-11 | 2004-03-12 | Inst Nat Rech Inf Automat | IMPROVED ALEA GENERATOR |
US9298524B2 (en) * | 2013-03-22 | 2016-03-29 | Insyde Software Corp. | Virtual baseboard management controller |
US9529750B2 (en) * | 2014-07-14 | 2016-12-27 | American Megatrends, Inc. | Service processor (SP) initiated data transaction with bios utilizing interrupt |
US10705901B2 (en) * | 2018-02-23 | 2020-07-07 | Dell Products, L.P. | System and method to control memory failure handling on double-data rate dual in-line memory modules via suspension of the collection of correctable read errors |
CN110362326A (en) * | 2019-07-12 | 2019-10-22 | 浪潮电子信息产业股份有限公司 | A kind of upgrade method and relevant device of BIOS |
CN111124761B (en) * | 2019-12-29 | 2022-04-22 | 浪潮电子信息产业股份有限公司 | Equipment restarting method, device, equipment and medium |
CN111737064A (en) * | 2020-06-29 | 2020-10-02 | 深圳市同泰怡信息技术有限公司 | BMC system control method and device, storage medium and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN112905377A (en) | 2021-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021248836A1 (en) | Smart device startup method and apparatus, smart device, and readable storage medium | |
CN115237644B (en) | System fault processing method, central operation unit and vehicle | |
CN112732674B (en) | Cloud platform service management method, device, equipment and readable storage medium | |
CN113672306B (en) | Method, device, system and medium for recovery from abnormal self-checking of server components | |
WO2024082873A1 (en) | Data backup method, and computer device and storage medium | |
CN111026573A (en) | Watchdog system of multi-core processing system and control method | |
US8230446B2 (en) | Providing a computing system with real-time capabilities | |
JP2016224883A (en) | Fault detection method, information processing apparatus, and fault detection program | |
CN112905377B (en) | A method, system and medium for avoiding BMC crash | |
WO2022267812A1 (en) | Software recovery method, electronic device, and storage medium | |
US7840853B2 (en) | CPU suppression system and CPU suppression method using service processor | |
CN109062718B (en) | Server and data processing method | |
CN111045844A (en) | A kind of fault degradation method and device | |
CN113742165B (en) | Dual master control device and master-slave control method | |
CA2365427A1 (en) | Internal product fault monitoring apparatus and method | |
JPH10307635A (en) | Computer system and temperature monitoring method applied to the same system | |
CN114124644B (en) | Ethernet OAM alarm method and device based on Linux kernel mode | |
CN115904793A (en) | Memory unloading method, system and chip based on multi-core heterogeneous system | |
JP7414313B1 (en) | Remote power control device, remote power control method, and remote power control program | |
CN113868001A (en) | A method, system and computer storage medium for checking a memory repair result | |
JP4313858B2 (en) | calculator | |
CN111023441A (en) | An air conditioner control method, system, storage medium and air conditioner | |
JPH10269110A (en) | Method for avoiding hang-up of computer system, and computer system using the same method | |
JP7040784B2 (en) | Monitoring control device, monitoring control method, computer and program | |
JP2018147510A (en) | Server device and server system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||