CN117130817A - Linux system condition monitoring method, electronic equipment and medium

Info

Publication number: CN117130817A
Application number: CN202311122382.0A
Authority: CN
Inventors: 臧克敏
Original assignee: Chongqing Changan Automobile Co Ltd
Current assignee: Chongqing Changan Automobile Co Ltd
Priority date: 2023-08-31
Filing date: 2023-08-31
Publication date: 2023-11-28

Abstract

The invention discloses a method for monitoring the condition of a Linux system, electronic equipment and a medium. The method comprises the following steps: defining a global variable in the system, defining on each CPU a Boolean variable initialized to A, creating a dog feeding thread with configurable priority for each CPU, registering a timer for each CPU, and defining a timer callback function; judging whether all n CPUs have interrupt response capability; if all n CPUs have interrupt response capability, inverting the values of the n bits in the global variable in the timer callback function; judging whether the ith CPU can schedule normally; and, if the ith CPU can schedule normally, first setting the Boolean variable on the ith CPU to B, then waking up the dog feeding thread of the ith CPU, and setting the Boolean variable on the ith CPU back to A in that dog feeding thread. The invention can detect the condition of the Linux system accurately and in real time.

Description

Linux system condition monitoring method, electronic equipment and medium

Technical Field

The invention belongs to the field of monitoring of underlying software of a Linux system, and particularly relates to a method for monitoring a condition of the Linux system, electronic equipment and a medium.

Background

Linux is widely used in work and production, and a large number of Linux systems are deployed on machines ranging from large desktop servers to small embedded devices. This wide use places higher demands on the safety and robustness of system operation. In actual production work the system sometimes hangs: it cannot respond to any operation and all application programs stop reacting. In that situation the only remedy is to power the machine off and restart it.

CN101739305A discloses a device and a method for kernel-level watchdog monitoring in an operating system. It provides a watchdog module, implemented with a timer, that monitors an application program: if the monitored application-layer task is not executed within a specified time, a fault is assumed and a series of operations such as a reset follows. The method of that patent has some problems. For example, if the Linux kernel carries the RT patch, a low-priority application-layer task may not be scheduled on time, and the method cannot monitor kernel threads.

Disclosure of Invention

The invention aims to provide a method, electronic equipment and a medium for monitoring the condition of a Linux system, so that the condition of the Linux system can be detected accurately and in real time.

The method for monitoring the Linux system condition comprises the following steps:

defining a global variable bitmap in the system, defining on each CPU a Boolean variable initialized to A, creating a dog feeding thread with configurable priority for each CPU, registering a timer for each CPU, and defining a timer callback function; the number of bits in the global variable bitmap equals the number n of CPU cores in the system, one bit represents the state of one CPU, the n bits all have the same value, and the global variable bitmap is visible to all CPUs in the system.

Based on the n bit values in the global variable bitmap, whether n CPUs all have interrupt response capability is judged.

When all n CPUs have interrupt response capability, the values of the n bits in the global variable bitmap are inverted in the timer callback function (for example, a bit that is initially 0 becomes 1 after inversion, and a bit that is initially 1 becomes 0).

Judging whether the ith CPU can normally schedule or not based on the Boolean variable on the ith CPU; wherein i takes all integers from 1 to n in turn.

Under the condition that the ith CPU can normally schedule, firstly setting the Boolean variable on the ith CPU as B, then waking up the dog feeding thread of the ith CPU, setting the Boolean variable on the ith CPU as A in the awakened dog feeding thread of the ith CPU, and then returning to judge whether n CPUs all have the interrupt response capability.
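The bitmap setup and the inversion described above can be illustrated with a small user-space model. This is only a sketch in Python, not kernel code: the helper names `make_bitmap` and `invert_bitmap` are hypothetical, and a plain integer stands in for the global variable bitmap.

```python
def make_bitmap(n, bit=1):
    """Return an n-bit bitmap whose bits all hold the same value (0 or 1)."""
    return (1 << n) - 1 if bit else 0


def invert_bitmap(bitmap, n):
    """Invert the n low bits: every 0 becomes 1 and every 1 becomes 0."""
    mask = (1 << n) - 1          # one bit per CPU
    return bitmap ^ mask


n = 4                            # assumed number of CPU cores
bitmap = make_bitmap(n)          # all four bits start as 1
print(format(bitmap, "04b"))     # 1111
bitmap = invert_bitmap(bitmap, n)
print(format(bitmap, "04b"))     # 0000
```

XOR with an all-ones mask is one simple way to flip every bit at once; the source does not say how the inversion is carried out.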

Preferably, whether all n CPUs have interrupt response capability is judged as follows: when the timer of each CPU expires, the timer callback function detects whether the values of the n bits in the global variable bitmap are consistent (for example, all 0 or all 1). If so, all n CPUs have interrupt response capability; otherwise, a CPU in the system has lost interrupt response capability.
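The consistency test amounts to asking whether the n bits are all 0 or all 1. A minimal sketch, assuming the bitmap is held in an integer and using the hypothetical function name `bits_consistent`:

```python
def bits_consistent(bitmap, n):
    """Return True when the n low bits of bitmap are all 0 or all 1."""
    mask = (1 << n) - 1
    value = bitmap & mask        # ignore anything above the n CPU bits
    return value == 0 or value == mask


print(bits_consistent(0b1111, 4))   # True: every CPU bit is 1
print(bits_consistent(0b0000, 4))   # True: every CPU bit is 0
print(bits_consistent(0b1011, 4))   # False: one bit differs from the rest
```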

Preferably, the system is restarted when a CPU in the system has lost interrupt response capability and this condition has lasted for the preset timing period. Waiting for the full period avoids restarting the system because of a misjudgment.
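The "restart only after the fault has lasted for the preset timing period" rule can be sketched as a small gate object. The class name `TimeoutGate` and the explicit `now` argument are assumptions made so that the example is deterministic; a real implementation would read a clock.

```python
class TimeoutGate:
    """Report a timeout only after a fault has persisted for `period` seconds."""

    def __init__(self, period):
        self.period = period
        self.fault_since = None          # time at which the current fault began

    def update(self, faulty, now):
        """Feed one observation; return True when a restart would be due."""
        if not faulty:
            self.fault_since = None      # fault cleared, so forget it
            return False
        if self.fault_since is None:
            self.fault_since = now       # first observation of this fault
        return now - self.fault_since >= self.period


gate = TimeoutGate(period=3.0)
print(gate.update(True, 0.0))    # False: fault has only just started
print(gate.update(True, 2.0))    # False: 2 s is shorter than the period
print(gate.update(False, 2.5))   # False: transient fault cleared
print(gate.update(True, 3.0))    # False: a new fault starts counting from 3.0
print(gate.update(True, 6.0))    # True: the fault has lasted the full period
```

A transient fault that clears before the period elapses never triggers a restart, which is the misjudgment the text aims to avoid.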

Preferably, whether the ith CPU can schedule normally is judged as follows: the timer callback function detects whether the Boolean variable on the ith CPU is set to A. If so, the ith CPU can schedule normally; otherwise, the ith CPU has lost scheduling capability. Because the Boolean variable is set back to A in the dog feeding thread, a Boolean variable that is not A means that the CPU has not executed the dog feeding operation, that is, the CPU has lost scheduling capability. This makes the monitoring accurate.
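The handshake between the timer callback and the dog feeding thread can be modeled with ordinary Python threads. This is a sketch only: `CpuSlot`, `feeder`, `timer_round` and `can_schedule` are hypothetical names, a Python thread stands in for the per-CPU kernel thread, and A=1, B=0 is one of the two encodings the text allows.

```python
import threading

A, B = 1, 0  # assumed encoding; the text also allows A=0, B=1


class CpuSlot:
    """Hypothetical per-CPU state: the Boolean variable and a wake-up signal."""

    def __init__(self):
        self.flag = A
        self.wake = threading.Event()


def feeder(slot):
    """Stand-in for the dog feeding thread: wait to be woken, then set A."""
    slot.wake.wait()
    slot.flag = A


def timer_round(slot):
    """Timer callback side for a CPU that passed the check: set B, then wake."""
    slot.flag = B
    slot.wake.set()


def can_schedule(slot):
    """The check made in the timer callback: is the variable set to A?"""
    return slot.flag == A


# Healthy CPU: the feeder thread gets to run and restores A.
healthy = CpuSlot()
worker = threading.Thread(target=feeder, args=(healthy,))
worker.start()
timer_round(healthy)
worker.join()
print(can_schedule(healthy))   # True

# Stalled CPU: modeled by never running a feeder thread, so B is left behind.
stalled = CpuSlot()
timer_round(stalled)
print(can_schedule(stalled))   # False
```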

Preferably, when the ith CPU has lost scheduling capability, the call stack log of the ith CPU is saved first (for later analysis of why scheduling capability was lost). Then, once the time for which the Boolean variable on the ith CPU has not been set to A reaches the preset timing period, an IPI interrupt is triggered to inform the remaining n-1 CPUs that the system needs to be restarted, and the system is restarted. Waiting for the full period avoids restarting the system because of a misjudgment.
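The ordering of these actions (save the call stack first, then notify the other CPUs and restart only once the period has elapsed) can be sketched as follows. The function `handle_lost_scheduling` and the action names are hypothetical; the sketch records what would be done instead of doing it, and CPUs are numbered from 0 here.

```python
def handle_lost_scheduling(cpu, n, stalled_for, period, actions):
    """Record the actions for a CPU whose Boolean variable is not A.

    Returns True when the system would be restarted.
    """
    actions.append(("save_call_stack", cpu))        # always done first
    if stalled_for < period:
        return False                                # keep monitoring
    others = [c for c in range(n) if c != cpu]      # the remaining n-1 CPUs
    actions.append(("ipi_notify", others))
    actions.append(("restart", None))
    return True


log = []
print(handle_lost_scheduling(1, 4, stalled_for=1.0, period=5.0, actions=log))  # False
print(handle_lost_scheduling(1, 4, stalled_for=5.0, period=5.0, actions=log))  # True
print(log[-2])   # ('ipi_notify', [0, 2, 3])
```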

Preferably, during normal operation of the system, the running logs of all CPUs are stored in memory area I in real time through a ram control mechanism. When a CPU loses scheduling capability, the running logs of all CPUs and the call stack log of that CPU (the one that lost scheduling capability) are saved to memory area I through the ram control mechanism. After the system restarts, the logs in memory area I (both the running logs and the call stack log) are copied to memory area II, and the logs in memory area II are then written to the independent storage partition. The advantage is that a call stack log is saved only for a CPU that has lost scheduling capability, for later analysis; this avoids the excessive memory use of saving the call stack logs of all CPUs in real time and so saves system memory.
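The path a log entry takes (memory area I, then memory area II after the restart, then the independent storage partition) can be modeled with three lists. The class `LogPipeline` and its method names are hypothetical, and the sketch assumes that memory area I keeps its contents across the restart, which the copy-after-restart step implies.

```python
class LogPipeline:
    """Hypothetical model of the two reserved memory areas and the partition."""

    def __init__(self):
        self.area_1 = []       # memory area I: written while the system runs
        self.area_2 = []       # memory area II: filled after the restart
        self.partition = []    # independent storage partition

    def log_runtime(self, cpu, message):
        """Running logs of every CPU go to memory area I in real time."""
        self.area_1.append(("run", cpu, message))

    def log_call_stack(self, cpu, frames):
        """Saved only for a CPU that has lost scheduling capability."""
        self.area_1.append(("stack", cpu, tuple(frames)))

    def after_restart(self):
        """Copy area I to area II, then write area II to the partition."""
        self.area_2 = list(self.area_1)
        self.partition.extend(self.area_2)


pipe = LogPipeline()
pipe.log_runtime(0, "tick")
pipe.log_call_stack(1, ["frame_a", "frame_b"])   # made-up frame names
pipe.after_restart()
print(len(pipe.partition))   # 2
```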

Preferably, A=1 and B=0; or A=0 and B=1.

The electronic equipment comprises a processor and a memory which is in communication connection with the processor, wherein a computer readable program is stored in the memory, and the monitoring method of the Linux system condition can be executed when the computer readable program is called by the processor.

The medium of the invention stores a computer readable program, and when the computer readable program is called, the method for monitoring the condition of the Linux system can be executed.

The invention needs neither a hardware watchdog nor any extra external circuit. Using only the system's (high-precision) timers, it implements a pure software watchdog through the timer callback function and the dog feeding thread on each CPU, monitors the running condition of every CPU in the system accurately and in real time, and applies to a more general range of scenarios.

Drawings

Fig. 1 is a flowchart of a method for monitoring a Linux system in this embodiment.

Fig. 2 is a schematic diagram of an electronic device in the present embodiment.

Detailed Description

As shown in fig. 1, the method for monitoring the Linux system condition in the embodiment includes the following steps:

First step: define a global variable bitmap in the system, define on each CPU a Boolean variable initialized to A (A=1 in this embodiment), create a dog feeding thread with configurable priority for each CPU, register a (high-precision) timer for each CPU, and define a timer callback function; then execute the second step. The number of bits in the global variable bitmap equals the number n of CPU cores in the system, one bit represents the state of one CPU, the n bits all have the same value (1 in this embodiment), and the global variable bitmap is visible to all CPUs in the system. A Boolean variable equal to 1 indicates that the CPU is normal, and a Boolean variable equal to 0 indicates that the CPU has lost scheduling capability. The state-machine operations of the following steps are performed in the timer callback function.

Second step: when the timer of each CPU expires, detect in the timer callback function whether the values of the n bits in the global variable bitmap are consistent. If so (all n CPUs have interrupt response capability), execute the fourth step; otherwise (a CPU in the system has lost interrupt response capability and cannot respond to interrupts), execute the third step.

Third step: judge whether the time for which the values of the n bits have been inconsistent (that is, for which a CPU in the system has lost interrupt response capability) has reached the preset timing period. If so, execute the twelfth step; otherwise, return to the second step.

Fourth step: invert the values of the n bits in the global variable bitmap in the timer callback function (that is, change them from 1 to 0 or from 0 to 1), and then execute the fifth step.

Fifth step: detect in the timer callback function whether the Boolean variable on the ith CPU is set to A. If so (the ith CPU can schedule normally), execute the ninth step; otherwise (the ith CPU has lost scheduling capability and has not performed the dog feeding operation), execute the sixth step. Here i takes each integer from 1 to n in turn.

Sixth step: save the call stack log of the ith CPU, and then execute the seventh step.

Seventh step: judge whether the time for which the Boolean variable on the ith CPU has not been set to A (that is, for which the ith CPU has lost scheduling capability) has reached the preset timing period. If so, execute the eighth step; otherwise, return to the second step.

Eighth step: trigger an IPI interrupt to inform the remaining n-1 CPUs that the system needs to be restarted, and then execute the twelfth step.

Ninth step: set the Boolean variable on the ith CPU to B (B=0 in this embodiment) in the timer callback function, and then execute the tenth step.

Tenth step: wake up the dog feeding thread of the ith CPU in the timer callback function, and then execute the eleventh step.

Eleventh step: in the awakened dog feeding thread of the ith CPU, set the Boolean variable on the ith CPU to A, and then return to the second step.

Twelfth step: restart the system.
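The twelve steps can be walked through with a tick-driven simulation. This is a user-space sketch under several assumptions that the source does not spell out: each simulated CPU writes only its own bit, one call to `tick` stands for one timer period, the preset timing period is counted in ticks, CPUs are numbered from 0, and the lists `irq_ok` and `sched_ok` are invented inputs that say whether each simulated CPU can take interrupts and run its dog feeding thread. The class `SoftWatchdog` and its event names are hypothetical.

```python
A, B = 1, 0  # values used in the embodiment


class SoftWatchdog:
    """Hypothetical tick-driven model of the twelve steps (not kernel code)."""

    def __init__(self, n, period_ticks):
        self.n = n
        self.mask = (1 << n) - 1
        self.phase = 1                  # bit value an up-to-date CPU holds
        self.bitmap = self.mask         # step 1: all n bits equal (1 here)
        self.flags = [A] * n            # step 1: per-CPU Boolean, initially A
        self.period = period_ticks      # preset timing period, in ticks
        self.irq_fault = 0              # ticks the bitmap has been inconsistent
        self.sched_fault = [0] * n      # ticks each flag has not been A
        self.events = []                # actions a real system would perform

    def _write_bit(self, cpu):
        """Simulated CPU `cpu` brings its own bit up to the current phase."""
        if self.phase:
            self.bitmap |= 1 << cpu
        else:
            self.bitmap &= ~(1 << cpu)

    def tick(self, irq_ok, sched_ok):
        """Run one timer period; return True if step 12 (restart) is reached."""
        responsive = [i for i in range(self.n) if irq_ok[i]]
        if self.bitmap not in (0, self.mask):        # step 2: bits differ
            self.irq_fault += 1                      # step 3
            if self.irq_fault >= self.period:
                self.events.append("restart")        # step 12
                return True
            for i in responsive:                     # a recovered CPU catches up
                self._write_bit(i)
            return False
        self.irq_fault = 0
        self.phase ^= 1                              # step 4: invert
        for i in responsive:                         # a missed interrupt leaves
            self._write_bit(i)                       # that CPU's bit stale
        for i in responsive:
            if self.flags[i] == A:                   # step 5: CPU i can schedule
                self.sched_fault[i] = 0
                self.flags[i] = B                    # step 9
                self.events.append(("wake_feeder", i))       # step 10
            else:
                self.events.append(("save_call_stack", i))   # step 6
                self.sched_fault[i] += 1                     # step 7
                if self.sched_fault[i] >= self.period:
                    self.events.append(("ipi_from", i))      # step 8
                    self.events.append("restart")            # step 12
                    return True
            if sched_ok[i]:
                self.flags[i] = A                    # step 11: feeder thread ran
        return False


wd = SoftWatchdog(n=2, period_ticks=2)
results = [wd.tick([True, True], [True, True])]
results += [wd.tick([True, True], [True, False]) for _ in range(3)]
print(results)   # [False, False, False, True]
```

In this run CPU 1 stops scheduling after the first tick; its flag stays at B, its call stack is saved on each of the next two checks, and the restart is reached once the fault has lasted two ticks.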

During normal operation of the system, the running logs of all CPUs are stored in memory area I in real time through a ram control mechanism. When a CPU loses scheduling capability, the running logs of all CPUs and the call stack log of that CPU are saved to memory area I through the ram control mechanism. After the system restarts, the logs in memory area I (both the running logs and the call stack log) are copied to memory area II, and the logs in memory area II are then written to the independent storage partition.

Memory area I and memory area II are two regions of memory reserved for the ram control mechanism and used for caching logs. The independent storage partition is a partition formatted on a storage medium such as eMMC or UFS; it is an independent read-write partition dedicated to storing logs.

As shown in fig. 2, the electronic device in this embodiment includes a processor 1 and a memory 2 communicatively connected to the processor 1, where the memory 2 stores a computer readable program, and when the computer readable program is called by the processor 1, the above-mentioned method for monitoring the Linux system condition can be executed.

The embodiment also provides a medium, in which a computer readable program is stored, and when the computer readable program is called, the method for monitoring the Linux system condition can be executed.

The above examples are provided for illustrating the technical concept and features of the present invention and are intended to enable those skilled in the art to understand the contents of the present invention and to implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be included in the scope of the present invention.

Claims

1. The method for monitoring the Linux system condition is characterized by comprising the following steps:

defining a global variable bitmap in a system, defining a Boolean variable set as A on each CPU, creating a priority configurable dog feeding thread for each CPU, registering a timer for each CPU, and defining a timer callback function; the number of bits of the global variable bitmap is equal to the number n of CPU cores in the system, one bit represents the state of one CPU, the values of the n bits are the same, and the global variable bitmap is visible to all CPUs in the system;

judging whether n CPUs have interrupt response capability or not based on n bit numerical values in the global variable bitmap;

under the condition that n CPUs have interrupt response capability, inverting n bit numerical values in the global variable bitmap in a timer callback function;

judging whether the ith CPU can normally schedule or not based on the Boolean variable on the ith CPU; wherein i sequentially takes all integers from 1 to n;

2. The method for monitoring a Linux system according to claim 1, wherein the method for determining whether all of the n CPUs have the capability of responding to an interrupt is as follows: when the timer of each CPU expires, detecting whether the n bit numerical values in the global variable bitmap are consistent or not in a timer callback function, if so, indicating that all n CPUs have the capability of responding to interrupt, otherwise, indicating that the CPUs in the system lose the capability of responding to interrupt.

3. The method for monitoring the condition of the Linux system according to claim 2, wherein: when the CPU loses the capability of responding to the interrupt and the duration reaches the preset timing period, restarting the system.

4. The method for monitoring the Linux system according to claim 1, wherein the method for judging whether the ith CPU can normally schedule is as follows: detecting whether a Boolean variable on the ith CPU is set to A in a timer callback function, if so, indicating that the ith CPU can normally schedule, otherwise, indicating that the ith CPU loses scheduling capability.

5. The method for monitoring the Linux system according to claim 4, wherein: under the condition that the ith CPU loses the scheduling capability, firstly, a call stack log on the ith CPU is stored, and then when the duration of the time when the Boolean variable on the ith CPU is not set as A reaches a preset timing period, an IPI interrupt is triggered to inform the rest n-1 CPUs that the system needs to be restarted, and then the system is restarted.

6. The method for monitoring the Linux system according to claim 5, wherein: during normal operation of the system, the running logs of all CPUs are stored into a memory area I in real time through a ram control mechanism; when a certain CPU loses the scheduling capability, the running logs of all the CPUs and the call stack log of that CPU are saved to the memory area I through the ram control mechanism; after restarting the system, the logs in the memory area I are copied to the memory area II, and then the logs in the memory area II are written into the independent storage partition.

7. The method for monitoring the condition of the Linux system according to any one of claims 1 to 6, wherein: A=1 and B=0.

8. The method for monitoring the condition of the Linux system according to any one of claims 1 to 6, wherein: A=0 and B=1.

9. An electronic device, characterized in that: comprising a processor (1) and a memory (2) in communicative connection with the processor (1), the memory (2) having stored therein a computer readable program which, when called by the processor, is capable of executing the method for monitoring the condition of a Linux system according to any of claims 1 to 8.

10. A medium, characterized by: a computer readable program stored therein, which when called, is capable of executing the method for monitoring the condition of the Linux system according to any one of claims 1 to 8.