
CN117130817A - Linux system condition monitoring method, electronic equipment and medium - Google Patents

Linux system condition monitoring method, electronic equipment and medium

Info

Publication number
CN117130817A
Authority
CN
China
Prior art keywords
cpu
ith
cpus
condition
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311122382.0A
Other languages
Chinese (zh)
Inventor
臧克敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Changan Automobile Co Ltd
Original Assignee
Chongqing Changan Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Changan Automobile Co Ltd filed Critical Chongqing Changan Automobile Co Ltd
Priority to CN202311122382.0A priority Critical patent/CN117130817A/en
Publication of CN117130817A publication Critical patent/CN117130817A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482 Procedural
    • G06F9/4484 Executing subprograms
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method for monitoring the condition of a Linux system, electronic equipment and a medium. The method comprises the following steps: defining a global variable bitmap in the system, defining a Boolean variable set to A on each CPU, creating a priority-configurable dog feeding thread for each CPU, registering a timer for each CPU, and defining a timer callback function; judging whether all n CPUs have interrupt response capability; when all n CPUs have interrupt response capability, inverting the n bit values in the global variable bitmap in the timer callback function; judging whether the ith CPU can schedule normally; and, when the ith CPU can schedule normally, first setting the Boolean variable on the ith CPU to B, then waking up the dog feeding thread of the ith CPU, and setting the Boolean variable on the ith CPU to A in the dog feeding thread. The invention can detect the condition of the Linux system accurately and in real time.

Description

Linux system condition monitoring method, electronic equipment and medium
Technical Field
The invention belongs to the field of monitoring of underlying software of a Linux system, and particularly relates to a method for monitoring a condition of the Linux system, electronic equipment and a medium.
Background
The Linux system is widely used in work and production, and a large number of Linux systems are deployed on machines ranging from large desktop servers to small embedded devices. This wide use places higher demands on the safety and robustness of system operation. In actual production work the system sometimes hangs: it cannot respond to any operation, no application program reacts, and the only remedy is to power the machine off and restart it.
CN101739305A discloses a device and a method for kernel-level watchdog monitoring of an operating system. It provides a watchdog module, implemented with a timer, that monitors application programs: if a monitored application-layer task is not executed within a specified time, a problem is assumed and a series of operations such as a reset are then executed. The method in that patent has some problems. For example, if the RT (real-time) patch is applied to Linux, a low-priority application-layer task may not be scheduled on time, and the method cannot monitor kernel threads.
Disclosure of Invention
The invention aims to provide a method for monitoring the condition of a Linux system, electronic equipment and a medium, so as to detect the condition of the Linux system accurately and in real time.
The method for monitoring the Linux system condition comprises the following steps:
defining a global variable bitmap in a system, defining a Boolean variable set to A on each CPU, creating a priority-configurable dog feeding thread for each CPU, registering a timer for each CPU, and defining a timer callback function; the number of bits of the global variable bitmap is equal to the number n of CPU cores in the system, one bit represents the state of one CPU, the values of the n bits are the same, and the global variable bitmap is visible to all CPUs in the system.
Based on the n bit values in the global variable bitmap, whether n CPUs all have interrupt response capability is judged.
Under the condition that the n CPUs all have the interrupt response capability, the n bit numerical values in the global variable bitmap are inverted in a timer callback function (for example, the bit numerical value is initially 0, the bit numerical value becomes 1 after inversion, or the bit numerical value is initially 1, and the bit numerical value becomes 0 after inversion).
Judging whether the ith CPU can normally schedule or not based on the Boolean variable on the ith CPU; wherein i takes all integers from 1 to n in turn.
Under the condition that the ith CPU can normally schedule, firstly setting the Boolean variable on the ith CPU as B, then waking up the dog feeding thread of the ith CPU, setting the Boolean variable on the ith CPU as A in the awakened dog feeding thread of the ith CPU, and then returning to judge whether n CPUs all have the interrupt response capability.
Preferably, the method for judging whether all n CPUs have the interrupt response capability is as follows: when the timer of each CPU expires, the timer callback function detects whether the n bit values in the global variable bitmap are consistent (for example, all 0 or all 1). If they are, all n CPUs have the interrupt response capability; otherwise, a CPU in the system has lost the interrupt response capability.
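The consistency check and the bit inversion described above can be illustrated with a minimal Python sketch. It rests on an assumption the text does not state explicitly: each CPU inverts only its own bit in its own timer callback, so a CPU that misses its interrupt leaves its bit behind. The function names `all_bits_consistent` and `flip_cpu_bit` are hypothetical, and CPUs are numbered from 0 here for convenience.

```python
def all_bits_consistent(bitmap: int, n: int) -> bool:
    """Return True when the low n bits are all 0 or all 1."""
    mask = (1 << n) - 1          # n ones, one bit per CPU
    low = bitmap & mask
    return low == 0 or low == mask


def flip_cpu_bit(bitmap: int, cpu: int) -> int:
    """Invert the bit owned by one CPU (assumed: done in that CPU's callback)."""
    return bitmap ^ (1 << cpu)


n = 4
bitmap = (1 << n) - 1            # all bits start with the same value (1 here)

# Every CPU services its timer interrupt and flips its own bit.
for cpu in range(n):
    bitmap = flip_cpu_bit(bitmap, cpu)
healthy = all_bits_consistent(bitmap, n)

# CPU 2 misses its interrupt, so its bit is not flipped this round.
for cpu in (0, 1, 3):
    bitmap = flip_cpu_bit(bitmap, cpu)
stalled = all_bits_consistent(bitmap, n)
```

In this sketch `healthy` reports a consistent bitmap after a full round of inversions, while `stalled` reports an inconsistent one because the bit of CPU 2 differs from the others.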
Preferably, the system is restarted only when a CPU in the system has lost the interrupt response capability and this loss has lasted for the preset timing period, which avoids restarting the system because of a misjudgment.
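The role of the preset timing period can be sketched as a small duration tracker: a fault leads to a restart only if it persists for the whole period, and a recovery in between resets the measurement. This is a minimal sketch under stated assumptions; the class name `FaultTimer`, the use of seconds as the time unit, and the reset-on-recovery behaviour are illustrative choices and are not taken from the text.

```python
class FaultTimer:
    """Track how long a fault has lasted (hypothetical helper)."""

    def __init__(self, period: float):
        self.period = period     # preset timing period, assumed in seconds
        self.since = None        # time at which the current fault was first seen

    def update(self, faulty: bool, now: float) -> bool:
        """Return True when the fault has lasted for the whole period."""
        if not faulty:
            self.since = None    # fault cleared: start over
            return False
        if self.since is None:
            self.since = now     # first observation of this fault
        return now - self.since >= self.period


timer = FaultTimer(period=3.0)
decisions = [
    timer.update(True, 0.0),     # fault appears
    timer.update(True, 2.0),     # still faulty, period not reached
    timer.update(False, 2.5),    # recovered, measurement resets
    timer.update(True, 3.0),     # new fault
    timer.update(True, 6.0),     # new fault has lasted 3.0 s
]
```

Only the last call returns `True`, so a transient fault that clears before the period elapses does not trigger a restart.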
Preferably, the method for judging whether the ith CPU can schedule normally is as follows: the timer callback function detects whether the Boolean variable on the ith CPU is set to A. If it is, the ith CPU can schedule normally; otherwise, the ith CPU has lost scheduling capability. Because the Boolean variable is set to A in the dog feeding thread, a variable that is not set to A means that the CPU has not executed the dog feeding operation and has lost scheduling capability, so the monitoring is accurate.
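The handshake between the timer callback and the dog feeding thread can be simulated without real threads. The sketch below is illustrative only: the class `Cpu`, the `can_schedule` switch and the `run_scheduler` method are hypothetical stand-ins for the kernel scheduler, and A = 1, B = 0 is assumed.

```python
A, B = 1, 0                      # assumed values: A = 1, B = 0


class Cpu:
    """One simulated CPU with its Boolean variable and dog feeding thread."""

    def __init__(self):
        self.flag = A            # Boolean variable, initially A
        self.feed_pending = False
        self.can_schedule = True # hypothetical switch to simulate a hung scheduler

    def timer_callback(self) -> bool:
        """Return True if the CPU is found able to schedule."""
        if self.flag != A:
            return False         # the dog feeding thread never ran
        self.flag = B            # set B first
        self.feed_pending = True # then wake the dog feeding thread
        return True

    def run_scheduler(self):
        """Run the dog feeding thread if the CPU can still schedule."""
        if self.can_schedule and self.feed_pending:
            self.flag = A        # body of the dog feeding thread
            self.feed_pending = False


cpu = Cpu()
first = cpu.timer_callback()     # flag is A, thread is woken
cpu.run_scheduler()              # thread runs and restores A
second = cpu.timer_callback()    # flag is A again
cpu.can_schedule = False         # scheduler hangs from here on
cpu.run_scheduler()              # thread cannot run, flag stays B
third = cpu.timer_callback()     # loss of scheduling capability detected
```

The third callback finds the variable still at B, which is how the loss of scheduling capability becomes visible from interrupt context.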
Preferably, when the ith CPU has lost scheduling capability, a call stack log on the ith CPU is saved first (for later analysis of why scheduling capability was lost). Then, when the Boolean variable on the ith CPU has not been set to A for the preset timing period, an IPI (inter-processor interrupt) is triggered to inform the remaining n-1 CPUs that the system needs to be restarted, and the system is restarted. This avoids restarting the system because of a misjudgment.
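The order of actions on loss of scheduling capability can be sketched as a function that records what would be done. This is a simulation under stated assumptions: the function name `handle_scheduling_loss`, the action labels, and the 0-based CPU numbering are hypothetical, and no real IPI or reboot is issued.

```python
def handle_scheduling_loss(cpu_id, n, stalled_for, period, actions):
    """Record the actions taken for a CPU that has lost scheduling capability.

    Returns True when the system would be restarted.
    """
    # The call stack log is saved first, for later analysis.
    actions.append(("save_call_stack", cpu_id))
    if stalled_for < period:
        return False             # not yet long enough: keep monitoring
    # Notify the remaining n-1 CPUs, then restart.
    for other in range(n):
        if other != cpu_id:
            actions.append(("ipi_restart_notice", other))
    actions.append(("restart_system", None))
    return True


early = []
restart_early = handle_scheduling_loss(1, 4, stalled_for=1.0, period=5.0, actions=early)

late = []
restart_late = handle_scheduling_loss(1, 4, stalled_for=5.0, period=5.0, actions=late)
```

In the first call only the call stack is saved. In the second call the three other CPUs are notified before the restart is recorded.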
Preferably, during normal operation of the system, the running logs of all CPUs are stored into the memory area I in real time through a ram control mechanism; when a certain CPU loses the scheduling capability, the running logs of all the CPUs and the call stack log of that CPU (namely the CPU that has lost the scheduling capability) are saved to the memory area I through the ram control mechanism; after the system is restarted, the logs (including the running logs and the call stack log) in the memory area I are copied to the memory area II, and then the logs in the memory area II are written into the independent storage partition. The advantage is that a call stack log is stored only for a CPU that has lost scheduling capability, for later analysis, which avoids the excessive memory use of storing the call stack logs of all CPUs in real time and saves system memory.
Preferably, A = 1 and B = 0; or A = 0 and B = 1.
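The path the logs take (memory area I, then memory area II after the restart, then the independent storage partition) can be sketched with in-memory containers. This is a rough illustration under several assumptions that the text does not confirm: memory area I is modelled as a bounded buffer that drops the oldest entries, it is cleared after being copied, and the names `LogStore`, `log_run`, `log_call_stack` and `after_restart` are hypothetical.

```python
from collections import deque


class LogStore:
    """Simulated log path: area I -> area II -> independent partition."""

    def __init__(self, capacity: int):
        self.area_one = deque(maxlen=capacity)  # memory area I (assumed bounded)
        self.area_two = []                      # memory area II
        self.partition = []                     # independent storage partition

    def log_run(self, cpu_id, message):
        """Store a running log entry in real time."""
        self.area_one.append(("run", cpu_id, message))

    def log_call_stack(self, cpu_id, frames):
        """Store the call stack of a CPU that lost scheduling capability."""
        self.area_one.append(("stack", cpu_id, tuple(frames)))

    def after_restart(self):
        """Copy area I to area II, then write area II to the partition."""
        self.area_two = list(self.area_one)
        self.area_one.clear()                   # assumed: area I is reused
        self.partition.extend(self.area_two)


store = LogStore(capacity=8)
store.log_run(0, "tick")
store.log_run(1, "tick")
store.log_call_stack(1, ["frame_a", "frame_b"])  # only the faulty CPU's stack
store.after_restart()
```

Only the call stack of CPU 1 is recorded here, which mirrors the memory-saving argument in the paragraph above: call stacks are captured on demand, not continuously for every CPU.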
The electronic equipment comprises a processor and a memory which is in communication connection with the processor, wherein a computer readable program is stored in the memory, and the monitoring method of the Linux system condition can be executed when the computer readable program is called by the processor.
The medium of the invention stores a computer readable program, and when the computer readable program is called, the method for monitoring the condition of the Linux system can be executed.
The invention needs neither a hardware watchdog nor an extra external circuit. It uses only the (high-precision) timer of the system and implements a pure software watchdog through the timer callback function and the dog feeding thread on each CPU. It monitors the running condition of all CPUs in the system accurately and in real time, and suits a wider range of application scenarios.
Drawings
Fig. 1 is a flowchart of a method for monitoring a Linux system in this embodiment.
Fig. 2 is a schematic diagram of an electronic device in the present embodiment.
Detailed Description
As shown in fig. 1, the method for monitoring the Linux system condition in the embodiment includes the following steps:
First step: define a global variable bitmap in the system, define a Boolean variable set to A (A = 1 in this embodiment) on each CPU, create a priority-configurable dog feeding thread for each CPU, register a (high-precision) timer for each CPU, define a timer callback function, and then execute the second step. The number of bits of the global variable bitmap is equal to the number n of CPU cores in the system, one bit represents the state of one CPU, the values of the n bits are the same (the value is 1 in this embodiment), and the global variable bitmap is visible to all CPUs in the system. A Boolean variable set to 1 indicates that the CPU is normal, and a Boolean variable set to 0 indicates that the CPU has lost scheduling capability. The timer callback function performs the state machine operations described in the steps below.
Second step: when the timer of each CPU expires, the timer callback function detects whether the values of the n bits in the global variable bitmap are consistent. If so (all n CPUs have interrupt response capability), execute the fourth step; otherwise (a CPU in the system has lost interrupt response capability and cannot respond to interrupts), execute the third step.
Third step: judge whether the duration for which the values of the n bits have been inconsistent (that is, for which a CPU in the system has lost the interrupt response capability) has reached the preset timing period. If so, execute the twelfth step; otherwise, return to the second step.
Fourth step: in the timer callback function, invert the n bit values in the global variable bitmap (that is, change them from 1 to 0 or from 0 to 1), and then execute the fifth step.
Fifth step: in the timer callback function, detect whether the Boolean variable on the ith CPU is set to A. If so (the ith CPU can schedule normally), execute the ninth step; otherwise (the ith CPU has lost scheduling capability and has not executed the dog feeding operation), execute the sixth step. Here i takes all integers from 1 to n in turn.
Sixth step: save a call stack log on the ith CPU, and then execute the seventh step.
Seventh step: judge whether the duration for which the Boolean variable on the ith CPU has not been set to A (that is, for which the ith CPU has lost scheduling capability) has reached the preset timing period. If so, execute the eighth step; otherwise, return to the second step.
Eighth step: trigger an IPI interrupt to inform the remaining n-1 CPUs that the system needs to be restarted, and then execute the twelfth step.
Ninth step: in the timer callback function, set the Boolean variable on the ith CPU to B (B = 0 in this embodiment), and then execute the tenth step.
Tenth step: in the timer callback function, wake up the dog feeding thread of the ith CPU, and then execute the eleventh step.
Eleventh step: in the awakened dog feeding thread of the ith CPU, set the Boolean variable on the ith CPU to A, and then return to the second step.
Twelfth step: restart the system.
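The twelve steps can be condensed into a small single-threaded simulation. This is a sketch under stated assumptions and not the embodiment's implementation: one call to `tick` stands for one timer period, the preset timing period is counted in ticks, each CPU is assumed to flip only its own bit, CPUs are numbered from 0, and the class name `SoftWatchdog` and the arguments `irq_ok` and `sched_ok` (which CPUs service their interrupt and which can run their dog feeding thread) are hypothetical.

```python
A, B = 1, 0                                    # this embodiment: A = 1, B = 0


class SoftWatchdog:
    def __init__(self, n: int, period: int):
        self.n, self.period = n, period        # period counted in ticks (assumed)
        self.mask = (1 << n) - 1
        self.bitmap = self.mask                # step 1: all n bits equal (1 here)
        self.flags = [A] * n                   # step 1: per-CPU Boolean variable
        self.irq_bad = 0                       # ticks with an inconsistent bitmap
        self.sched_bad = [0] * n               # ticks with the flag not at A
        self.events = []

    def tick(self, irq_ok, sched_ok) -> bool:
        """Run one timer period; return True when the system must restart."""
        if self.bitmap not in (0, self.mask):  # step 2: bits inconsistent
            self.irq_bad += 1                  # step 3: measure the duration
            if self.irq_bad >= self.period:
                self.events.append("restart")  # step 12
                return True
            return False
        self.irq_bad = 0
        for i in range(self.n):                # step 4: invert the bits
            if irq_ok[i]:
                self.bitmap ^= 1 << i
        for i in range(self.n):                # steps 5 to 11, per CPU
            if not irq_ok[i]:
                continue                       # callback did not run on this CPU
            if self.flags[i] == A:             # step 5: can schedule
                self.sched_bad[i] = 0
                self.flags[i] = B              # step 9
                if sched_ok[i]:                # steps 10 and 11: thread runs
                    self.flags[i] = A
            else:
                self.events.append(f"save_stack cpu{i}")     # step 6
                self.sched_bad[i] += 1                        # step 7
                if self.sched_bad[i] >= self.period:
                    self.events.append(f"ipi from cpu{i}")   # step 8
                    self.events.append("restart")            # step 12
                    return True
        return False


# Scenario 1: CPU 1 stops scheduling after two healthy periods.
wd = SoftWatchdog(n=2, period=2)
results = [wd.tick([True, True], [True, True]) for _ in range(2)]
results += [wd.tick([True, True], [True, False]) for _ in range(3)]

# Scenario 2: CPU 1 misses one timer interrupt.
wd2 = SoftWatchdog(n=2, period=2)
results2 = [wd2.tick([True, False], [True, True])]
results2 += [wd2.tick([True, True], [True, True]) for _ in range(2)]
```

In the first scenario the restart follows two saved call stacks and an IPI. In the second scenario the inconsistent bitmap alone leads to the restart once it has persisted for the preset period.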
During normal operation of the system, the running logs of all CPUs are stored in the memory area I in real time through a ram control mechanism. When a certain CPU loses the scheduling capability, the running logs of all the CPUs and the call stack log of that CPU are saved to the memory area I through the ram control mechanism. After the system is restarted, the logs (including the running logs and the call stack log) in the memory area I are copied to the memory area II, and then the logs in the memory area II are written into the independent storage partition.
The memory area I and the memory area II are two memory areas reserved in memory for ram control and used for caching logs. The independent storage partition is a partition formatted on a storage medium such as eMMC or UFS; it is an independent read-write partition dedicated to storing logs.
As shown in fig. 2, the electronic device in this embodiment includes a processor 1 and a memory 2 communicatively connected to the processor 1, where the memory 2 stores a computer readable program, and when the computer readable program is called by the processor 1, the above-mentioned method for monitoring the Linux system condition can be executed.
The embodiment also provides a medium, in which a computer readable program is stored, and when the computer readable program is called, the method for monitoring the Linux system condition can be executed.
The above examples are provided for illustrating the technical concept and features of the present invention and are intended to enable those skilled in the art to understand the contents of the present invention and to implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be included in the scope of the present invention.

Claims (10)

1. The method for monitoring the Linux system condition is characterized by comprising the following steps:
defining a global variable bitmap in a system, defining a Boolean variable set as A on each CPU, creating a priority configurable dog feeding thread for each CPU, registering a timer for each CPU, and defining a timer callback function; the number of bits of the global variable bitmap is equal to the number n of CPU cores in the system, one bit represents the state of one CPU, the values of the n bits are the same, and the global variable bitmap is visible to all CPUs in the system;
judging whether n CPUs have interrupt response capability or not based on n bit numerical values in the global variable bitmap;
under the condition that n CPUs have interrupt response capability, inverting n bit numerical values in the global variable bitmap in a timer callback function;
judging whether the ith CPU can normally schedule or not based on the Boolean variable on the ith CPU; wherein i sequentially takes all integers from 1 to n;
under the condition that the ith CPU can normally schedule, firstly setting the Boolean variable on the ith CPU as B, then waking up the dog feeding thread of the ith CPU, setting the Boolean variable on the ith CPU as A in the awakened dog feeding thread of the ith CPU, and then returning to judge whether n CPUs all have the interrupt response capability.
2. The method for monitoring a Linux system according to claim 1, wherein the method for determining whether all of the n CPUs have the capability of responding to an interrupt is as follows: when the timer of each CPU expires, detecting whether the n bit numerical values in the global variable bitmap are consistent or not in a timer callback function, if so, indicating that all n CPUs have the capability of responding to interrupt, otherwise, indicating that the CPUs in the system lose the capability of responding to interrupt.
3. The method for monitoring the condition of the Linux system according to claim 2, wherein: when the CPU loses the capability of responding to the interrupt and the duration reaches the preset timing period, restarting the system.
4. The method for monitoring the Linux system according to claim 1, wherein the method for judging whether the ith CPU can normally schedule is as follows: detecting whether a Boolean variable on the ith CPU is set to A in a timer callback function, if so, indicating that the ith CPU can normally schedule, otherwise, indicating that the ith CPU loses scheduling capability.
5. The method for monitoring the Linux system according to claim 4, wherein: under the condition that the ith CPU loses the scheduling capability, firstly, a call stack log on the ith CPU is stored, and then when the duration of the time when the Boolean variable on the ith CPU is not set as A reaches a preset timing period, IPI interruption is triggered to inform the rest n-1 CPUs that the system needs to be restarted, and then the system is restarted.
6. The method for monitoring the Linux system according to claim 5, wherein: during normal operation of the system, the operation logs of all CPUs are stored into a memory area I in real time through a ram control mechanism; when a certain CPU loses the scheduling capability, the running logs of all the CPUs and the call stack logs of the CPUs are saved to a memory area I through a ram control mechanism; after restarting the system, the log in the memory area I is copied to the memory area II, and then the log in the memory area II is written into the independent storage partition.
7. The method for monitoring the condition of the Linux system according to any one of claims 1 to 6, wherein: A = 1 and B = 0.
8. The method for monitoring the condition of the Linux system according to any one of claims 1 to 6, wherein: A = 0 and B = 1.
9. An electronic device, characterized in that: comprising a processor (1) and a memory (2) in communicative connection with the processor (1), the memory (2) having stored therein a computer readable program which, when called by the processor, is capable of executing the method for monitoring the condition of a Linux system according to any of claims 1 to 8.
10. A medium, characterized by: a computer readable program stored therein, which when called, is capable of executing the method for monitoring the condition of the Linux system according to any one of claims 1 to 8.
CN202311122382.0A 2023-08-31 2023-08-31 Linux system condition monitoring method, electronic equipment and medium Pending CN117130817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311122382.0A CN117130817A (en) 2023-08-31 2023-08-31 Linux system condition monitoring method, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311122382.0A CN117130817A (en) 2023-08-31 2023-08-31 Linux system condition monitoring method, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN117130817A true CN117130817A (en) 2023-11-28

Family

ID=88850632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311122382.0A Pending CN117130817A (en) 2023-08-31 2023-08-31 Linux system condition monitoring method, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117130817A (en)

Similar Documents

Publication Publication Date Title
US6662204B2 (en) Thread control system and method in a computer system
US20080005539A1 (en) Method and apparatus to manage processor cores
US11068360B2 (en) Error recovery method and apparatus based on a lockup mechanism
CN100405307C (en) Watchdog Control Method
US9612893B2 (en) Peripheral watchdog timer
KR20150111936A (en) Runtime backup of data in a memory module
CN105550057B (en) Embedded software system fault detection recovery method and system
US7219264B2 (en) Methods and systems for preserving dynamic random access memory contents responsive to hung processor condition
EP2686770A1 (en) Detection on resource leakage
CN111026573A (en) Watchdog system of multi-core processing system and control method
CN103530197A (en) Method for detecting and solving Linux system deadlock
US20180329721A1 (en) Stand-by mode of an electronic circuit
CN109062718B (en) Server and data processing method
CN117130817A (en) Linux system condition monitoring method, electronic equipment and medium
JP2001318807A (en) Method and device for controlling task switching
CN109151144B (en) Hardware management method, device, system, computer equipment and storage medium
US20120047403A1 (en) Data processing system
CN111400087A (en) Control method of operating system, terminal and storage medium
CN107957950B (en) Method and device for detecting system resource leakage
CN113827905B (en) Method, device, equipment and storage medium for judging monitoring treatment of fire water system
US10514970B2 (en) Method of ensuring operation of calculator
EP1971920A2 (en) Method and apparatus for dumping a process memory space
CN117331421B (en) Micro control chip, resetting method thereof and storage medium
CN116431377B (en) Watchdog circuit
JP4647276B2 (en) Semiconductor circuit device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination