CN114217925B

CN114217925B - Business program operation monitoring method and system for realizing abnormal automatic restarting

Info

Publication number: CN114217925B
Application number: CN202111483149.6A
Authority: CN
Inventors: 陈华展
Original assignee: China Citic Bank Corp Ltd
Current assignee: China Citic Bank Corp Ltd
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2024-10-01
Anticipated expiration: 2041-12-07
Also published as: CN114217925A

Abstract

The invention relates to a business program operation monitoring method and system for realizing abnormal automatic restarting. The method uses signal-based communication between a parent process and a child process and judges a program crash from the signal generated when the protected program (the child process) exits. This solves the problem that a software watchdog occupies CPU resources through inter-process polling when judging a program crash. It also solves the problems that existing static detection means for memory management have a high false detection rate and are not real-time, and that dynamic detection means yield useful information only when operation and maintenance personnel query logs, so that faults cannot be remedied in time. In particular, the business program can automatically judge whether it is being started for the first time, which avoids the impact on normally running processes that would result from executing complete initialization when a crashed process is restarted. In addition, by checking the integrity of the child process program, a tampered child process is prevented from running; the parent process records the abnormal exit information of the child process, the returned signal and the child process restart log, providing data support for system maintenance personnel to analyze code defects in the child process.

Description

Business program operation monitoring method and system for realizing abnormal automatic restarting

Technical Field

The invention relates to the technical field of computer program operation assurance, and in particular to a business program operation monitoring method and system for realizing abnormal automatic restarting.

Background

In embedded devices and computer systems running an operating system, application software crashes are mostly related to memory management. Once a program is determined to be abnormal, the methods for finding memory errors fall mainly into two types: 1) static detection, in which existing static analysis tools such as Cppcheck, Logiscope RuleChecker and PC-Lint examine the code, and which can find some errors where allocated memory is never released; 2) dynamic detection, in which memory anomalies are detected while the program runs by means of dedicated debugging tools, core file inspection tools and the like, for example a garbage collector (GC), Purify or Valgrind, and suggestions are fed back to the programmer, who modifies the code manually.

In the embedded field, the most common way to determine whether a program is abnormal or has run away is a watchdog timer. The watchdog counts while the program runs; if the program runs normally, the counter is reset periodically, and if the counter reaches the set value, the program is considered abnormal. The watchdog timer is an important component of microcontroller hardware and a common means of detecting program abnormality. In an embedded application scenario with an operating system, however, the watchdog function must be implemented in software, which occupies more CPU resources. Among the memory error detection methods for application software running on an operating system, static detection suffers from relatively high miss and false detection rates. Dynamic monitoring requires engineering maintenance staff and programmers to check log files frequently, and manual analysis takes time to resolve a problem; if a program crashes while the equipment is in use and is not handled promptly, the equipment may be seriously affected or the whole system may even shut down.

Because program crashes are sporadic, probabilistic and difficult to reproduce, and because embedded equipment is generally sensitive to the CPU utilization of running programs, the above methods for detecting program abnormality and memory errors cannot meet the requirement of quickly finding and removing faults after a program in in-service equipment becomes abnormal.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a business program operation monitoring method and system for realizing abnormal automatic restarting. The method uses signal-based communication between a parent process and a child process and judges a program crash from the signal generated when the protected program (the child process) exits. This solves the problem that a software watchdog occupies CPU resources through inter-process polling when judging a program crash. It also solves the problems that existing static detection means for memory management have a high false detection rate and are not real-time, and that dynamic detection means yield useful information only when operation and maintenance personnel query logs, so that faults cannot be remedied in time. In particular, the business program can automatically judge whether it is being started for the first time, which avoids the impact on normally running processes that would result from executing complete initialization on a restart. In addition, by checking the integrity of the child process program, a tampered child process is prevented from running; the parent process records the abnormal exit information of the child process, the returned signal and the child process restart log, providing data support for system maintenance personnel to analyze code defects in the child process.

In order to achieve the above object, the present invention adopts the following technical scheme:

A business program operation monitoring method for realizing abnormal automatic restarting, characterized by comprising the following steps:

S10, starting a first process;

S20, creating a second process by copying the first process;

S30, identifying the first process as a parent process and the second process as a child process according to the return values of the creation function in the first process and the second process;

S40, verifying, by the parent process, the program integrity of the business program corresponding to the child process;

S50, when the business program passes the program integrity verification, replacing the child process with the business program, setting a count start flag in the child process, recording the parameter value of the count start flag as 1, recording a start log, and performing initialization according to the parameter value of the count start flag and executing the business logic, while the parent process enters a dormant state to wait for the exit signal of the child process;

S60, the parent process wakes from the dormant state after receiving the exit signal of the child process and judges whether the exit signal is a normal signal or an abnormal signal;

S71, when the exit signal of the child process is judged to be a normal signal, recording an exit log and ending the parent process;

S72, when the exit signal of the child process is judged to be an abnormal signal, the parent process creates a third process and increases the parameter value of the count start flag by 1;

S80, after the parent process creates the third process, taking the third process as a new second process and the parent process as a new first process, and repeating steps S30, S40, S50, S60 and S72 until the parent process judges that the exit signal is a normal signal and executes step S71.
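The steps above can be sketched in C as a small supervision loop. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function name supervise, the environment variable MONITOR_START_COUNT used to carry the count start flag, and the max_starts bound are all hypothetical; execv is used instead of execl so that a variable argument list can be passed; the integrity check of S40 and the logging are omitted; and every signal-caused termination is treated as abnormal for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical environment variable that carries the count start flag. */
#define START_COUNT_ENV "MONITOR_START_COUNT"

/*
 * Supervise the program named by argv[0], restarting it whenever it is
 * terminated by a signal. Returns the number of starts performed, or -1 if
 * fork() or waitpid() fails. max_starts is an assumption added to bound the
 * loop; the method in the text loops until a normal exit.
 */
int supervise(char *const argv[], int max_starts)
{
    assert(argv != NULL && argv[0] != NULL && max_starts > 0);
    int count = 0;

    while (count < max_starts) {
        count++;
        pid_t pid = fork();                    /* S20: copy the first process */
        if (pid < 0)
            return -1;                         /* S30: creation failed */
        if (pid == 0) {                        /* S30: this is the child */
            char value[16];
            snprintf(value, sizeof value, "%d", count);
            setenv(START_COUNT_ENV, value, 1); /* S50: count start flag */
            execv(argv[0], argv);              /* S50: replace child with program */
            _exit(127);                        /* reached only if execv() fails */
        }
        int status = 0;                        /* S30: this is the parent */
        if (waitpid(pid, &status, 0) < 0)      /* S50/S60: sleep until child exits */
            return -1;
        if (!WIFSIGNALED(status))
            break;                             /* S71: normal exit, stop monitoring */
        /* S72/S80: terminated by a signal, loop to create a new child */
    }
    return count;
}
```

Blocking in waitpid() is what lets the parent wait without polling: the kernel wakes it only when the child's state changes.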

Further, identifying the first process as a parent process and the second process as a child process according to the return values of the creation function in the first process and the second process includes:

when the return value of the creation function in the first process is greater than 0, identifying the first process as the parent process;

when the return value of the creation function in the second process is 0, identifying the second process as the child process;

when the return value of the creation function is less than 0, judging that creation of the second process has failed.
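Assuming the creation function is POSIX fork(), as the detailed description prefers, the three cases can be written as a small classifier. The type and function names here (fork_role, classify_fork_result, fork_once) are hypothetical and chosen only for illustration. Note that a negative return value is observed in the calling process, since no second process exists in that case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { ROLE_PARENT, ROLE_CHILD, ROLE_FORK_FAILED } fork_role;

/* Map the value returned by fork() to the role of the calling process. */
fork_role classify_fork_result(pid_t ret)
{
    if (ret > 0)
        return ROLE_PARENT;      /* ret is the child's process ID */
    if (ret == 0)
        return ROLE_CHILD;       /* running inside the new process */
    return ROLE_FORK_FAILED;     /* no child was created */
}

/*
 * Fork once. The child exits immediately; the parent reaps it and returns
 * its own role, so a caller in the parent sees ROLE_PARENT or, if fork()
 * failed, ROLE_FORK_FAILED.
 */
fork_role fork_once(void)
{
    pid_t ret = fork();
    fork_role role = classify_fork_result(ret);

    if (role == ROLE_CHILD)
        _exit(0);
    if (role == ROLE_PARENT) {
        int status = 0;
        pid_t reaped = waitpid(ret, &status, 0);
        assert(reaped == ret);
    }
    return role;
}
```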

Further, performing initialization according to the parameter value of the count start flag includes:

when the parameter value of the count start flag is 1, the business program is identified as being started for the first time, that is, the business program judges itself to be a child process started for the first time, and executes the complete first-start initialization;

when the parameter value of the count start flag is greater than 1, the business program is identified as being restarted, that is, the business program judges itself to be a restarted child process, and does not execute the complete first-start initialization.
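On the business program side, the decision can be sketched as follows. The source does not say how the count start flag reaches the business program, so this sketch assumes it arrives as decimal text (for example in an argument or environment variable); the names parse_start_count and choose_init_mode are hypothetical, and treating a missing or malformed value as a first start is likewise an assumption.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { INIT_FULL, INIT_SKIP_FULL } init_mode;

/*
 * Parse the count start flag from text. A missing, malformed or
 * non-positive value is treated as 1 (first start); this fallback is an
 * assumption, not something the source specifies.
 */
long parse_start_count(const char *text)
{
    if (text == NULL || *text == '\0')
        return 1;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (*end != '\0' || value < 1)
        return 1;
    return value;
}

/* Flag value 1 means first start: run the complete initialization.
 * Any larger value means a restart: skip the complete initialization. */
init_mode choose_init_mode(long start_count)
{
    assert(start_count >= 1);
    return start_count == 1 ? INIT_FULL : INIT_SKIP_FULL;
}
```

A business program could call choose_init_mode(parse_start_count(getenv("MONITOR_START_COUNT"))) at the top of main, where the variable name is again only a placeholder.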

Further, the first process is started by executing a power-on self-starting script.

Further, the step S30 further includes:

configuring the parent process not to respond to the interrupt signal, which prevents the parent process from exiting early in response to an interrupt signal while the child process has not yet exited, a situation in which the parent process could no longer protect the child process.
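One way to make the parent process ignore the interrupt signal on a POSIX system is sigaction() with SIG_IGN, sketched here with hypothetical helper names. One detail worth noting as a design consideration: a signal set to be ignored stays ignored in a child created by fork() and across exec, so if the business program should still react to SIGINT, the child would need to restore the default disposition before the exec call. Whether the patented method does this is not stated in the source.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set the disposition of SIGINT to SIG_IGN or SIG_DFL. Returns 0 or -1. */
static int set_sigint_disposition(void (*disposition)(int))
{
    assert(disposition == SIG_IGN || disposition == SIG_DFL);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = disposition;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}

/* Parent side: do not respond to the interrupt signal. */
int ignore_sigint(void)
{
    return set_sigint_disposition(SIG_IGN);
}

/* Child side, before exec: respond to the interrupt signal again. */
int restore_default_sigint(void)
{
    return set_sigint_disposition(SIG_DFL);
}

/* Returns 1 if SIGINT is currently ignored, 0 if not, -1 on error. */
int sigint_is_ignored(void)
{
    struct sigaction current;
    if (sigaction(SIGINT, NULL, &current) != 0)
        return -1;
    return current.sa_handler == SIG_IGN;
}
```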

The invention also relates to a business program operation monitoring system for realizing abnormal automatic restarting, characterized by comprising:

a child process starting module, used for creating a child process;

a child process signal monitoring module, used for monitoring and receiving the exit signal of the child process;

a program integrity verification module, used for verifying the program integrity of the business program corresponding to the child process;

a log recording module, used for recording a start log and an exit log;

a time service module, used for connecting to a hardware clock and acquiring clock information as the time stamps of the start log and the exit log.
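The source does not say how the program integrity verification module checks the business program. As a minimal sketch of one possible approach, the file's digest can be compared with a reference value recorded when the program was installed. The example below uses the FNV-1a 64-bit hash purely as a short, dependency-free stand-in; it is not a cryptographic hash and would not resist deliberate tampering, for which a cryptographic digest such as SHA-256 would be the usual choice. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define FNV1A64_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV1A64_PRIME        0x100000001b3ULL

/* Continue an FNV-1a 64-bit hash over len bytes of data. */
uint64_t fnv1a64_update(uint64_t hash, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= FNV1A64_PRIME;
    }
    return hash;
}

/* Hash the whole file. Returns 0 and stores the digest, or -1 on I/O error. */
int file_digest(const char *path, uint64_t *digest)
{
    assert(path != NULL && digest != NULL);
    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return -1;

    uint64_t hash = FNV1A64_OFFSET_BASIS;
    unsigned char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, file)) > 0)
        hash = fnv1a64_update(hash, buffer, n);

    int failed = ferror(file);
    fclose(file);
    if (failed)
        return -1;
    *digest = hash;
    return 0;
}

/* Returns 1 if the file's digest equals the expected reference value,
 * 0 if it differs or the file cannot be read. */
int program_is_intact(const char *path, uint64_t expected)
{
    uint64_t actual = 0;
    if (file_digest(path, &actual) != 0)
        return 0;
    return actual == expected;
}
```

In the flow described above, the parent would call such a check before replacing the child with the business program and refuse to start it when the check fails.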

The invention also relates to a computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements the method described above.

The invention also relates to an electronic device, which is characterized by comprising a processor and a memory;

The memory is used for storing service programs;

the processor is used for executing the method by calling the service program.

The beneficial effects of the invention are as follows:

With the business program operation monitoring method and system for realizing abnormal automatic restarting, a program crash is judged from the signal generated when the protected program (the child process) exits. This solves the problem that the polling communication of a software watchdog process occupies considerable CPU resources. It also solves the problems that existing static detection means for memory management have a high false detection rate and are not real-time, and that dynamic detection means yield useful information only when operation and maintenance personnel query logs, so that faults cannot be remedied in time. The invention provides a way to automatically detect a program crash caused by a memory error and then restart the program automatically, so that service is restored when the program becomes abnormal. In certain scenarios the normally running processes can be isolated from the crash, which improves the user experience and gives developers both a basis and time for troubleshooting.

Drawings

Fig. 1 is a schematic flow chart of a method for monitoring operation of a service program for implementing abnormal automatic restarting.

Fig. 2 is a schematic structural diagram of a service program operation monitoring system for implementing abnormal automatic restarting according to the present invention.

Fig. 3 is a schematic structural diagram of an embodiment of hardware of a device involved in the application of the method of the present invention.

Detailed Description

For a clearer understanding of the present invention, reference will be made to the following detailed description taken in conjunction with the accompanying drawings and examples.

The first aspect of the present invention relates to a method for monitoring operation of a service program for implementing abnormal automatic restart, wherein the flow is shown in fig. 1, and the method comprises the following steps:

S10, starting a first process by executing a power-on self-starting script, wherein the startup parameters of the first process are fixed in the script; the parameters are the directory and program file name of the business program to be started, and if starting the business program requires additional parameters, they can be appended after the program file name;

S20, creating a second process by copying the first process; preferably, the first process calls the fork function to start the second process;

S30, identifying the first process as a parent process and the second process as a child process according to the return values of the creation function in the first process and the second process: when the return value of the creation function in the first process is greater than 0, the first process is identified as the parent process; when the return value of the creation function in the second process is 0, the second process is identified as the child process; when the return value of the creation function is less than 0, creation of the second process is judged to have failed;

S40, verifying, by the parent process, the program integrity of the business program corresponding to the child process;

S50, when the business program passes the program integrity verification, adding a count start flag to the business program corresponding to the child process, recording the parameter value of the count start flag as 1, recording a start log, replacing the child process with the business program by means of execl() and executing the business logic, while passing into the replacing business program the additional startup parameters received by the parent process; a count start flag value of 1 indicates that the business program is being started for the first time and needs to be initialized, and a value other than 1 indicates that the business program is not being started for the first time and does not need to be initialized; meanwhile, the parent process enters a dormant state to wait for the exit signal of the child process, and while dormant the parent process occupies essentially no CPU resources;

S60, the parent process wakes from the dormant state after receiving the exit signal of the child process and judges whether the exit signal is a normal signal or an abnormal signal; preferably, the parent process uses the WTERMSIG macro to obtain the signal that terminated the child process. For embedded devices running Linux, abnormal signals have conventional definitions: for example, the signal for illegal memory access (SIGSEGV) is classified as an abnormal signal, while the signal generated by typing the INTR character Ctrl+C (SIGINT) is classified as a normal signal, and the parent process can judge according to these definitions;

S71, when the exit signal is judged to be a normal signal, recording an exit log and ending the parent process;

S72, when the exit signal is judged to be an abnormal signal, creating a third process and increasing the parameter value of the count start flag by 1; this amounts to restarting the business program by restarting the child process, and because the count start flag is now greater than 1, the business program knows that it is not being started for the first time and need not execute full initialization, so other modules that are running normally and the resources they use are not forcibly released, which reduces the impact on users;

S80, after the parent process creates the third process, taking the third process as a new second process and the parent process as a new first process, and repeating steps S30, S40, S50, S60 and S72 until the parent process judges that the exit signal is a normal signal and executes step S71; that is, the monitoring process is repeated until the child process corresponding to the business program is judged to have exited normally.
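The judgment in S60 can be sketched with the standard wait-status macros. The text names only two examples, SIGSEGV as abnormal and SIGINT as normal; treating an ordinary exit as normal and every other terminating signal as abnormal is an assumption made for this sketch, as are the names classify_wait_status and status_of_child_ending_with. The second function exists only to produce real wait statuses for demonstration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { CHILD_EXIT_NORMAL, CHILD_EXIT_ABNORMAL } child_exit_class;

/* Classify a status value filled in by wait() or waitpid(). */
child_exit_class classify_wait_status(int status)
{
    if (WIFEXITED(status))
        return CHILD_EXIT_NORMAL;          /* exit() or return from main */
    if (WIFSIGNALED(status)) {
        switch (WTERMSIG(status)) {
        case SIGINT:                       /* Ctrl+C: normal per the text */
            return CHILD_EXIT_NORMAL;
        default:                           /* SIGSEGV and, by assumption, the rest */
            return CHILD_EXIT_ABNORMAL;
        }
    }
    return CHILD_EXIT_ABNORMAL;            /* unexpected status: be conservative */
}

/*
 * Demonstration helper: fork a child that terminates itself with signal sig
 * (or exits with code 0 when sig is 0) and return its wait status.
 */
int status_of_child_ending_with(int sig)
{
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        if (sig != 0) {
            signal(sig, SIG_DFL);          /* make sure the signal is fatal */
            raise(sig);
        }
        _exit(0);
    }
    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    assert(reaped == pid);
    return status;
}
```

For example, classify_wait_status(status_of_child_ending_with(SIGSEGV)) yields CHILD_EXIT_ABNORMAL, which in the method above would trigger S72.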

A second aspect of the present invention relates to a service program operation monitoring system for implementing abnormal automatic restart, which is structured as shown in fig. 2, and includes:

a child process starting module, used for creating a child process;

a log recording module, used for recording a start log and an exit log, in particular the start log and exit log of the corresponding child process;

a time service module, used for connecting to a hardware clock and acquiring clock information as the time stamps of the start log and the exit log.

The above method can be executed by using this system.
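How the log recording and time service modules might cooperate can be sketched as a function that stamps each start or exit record with the current time. The record layout, the field names and the function names are hypothetical; the sketch also assumes that the operating system clock has already been set from the hardware clock, so that time() returns meaningful values.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Format one log record as
 *   "<UTC timestamp> event=<event> pid=<pid> start_count=<n>".
 * Returns the record length, or -1 if formatting fails or out is too small.
 */
int format_log_line(char *out, size_t cap, time_t when,
                    const char *event, long pid, int start_count)
{
    assert(out != NULL && cap > 0 && event != NULL && strlen(event) > 0);

    struct tm utc;
    if (gmtime_r(&when, &utc) == NULL)
        return -1;

    char stamp[32];
    if (strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ", &utc) == 0)
        return -1;

    int n = snprintf(out, cap, "%s event=%s pid=%ld start_count=%d",
                     stamp, event, pid, start_count);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}

/* Write one record stamped with the current system time to stream. */
int log_event(FILE *stream, const char *event, long pid, int start_count)
{
    char line[160];
    if (format_log_line(line, sizeof line, time(NULL),
                        event, pid, start_count) < 0)
        return -1;
    return fprintf(stream, "%s\n", line) < 0 ? -1 : 0;
}
```

Recording the start count alongside the event lets a maintainer see from the log alone how many times the business program has been restarted.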

Fig. 3 is a schematic diagram of the hardware structure of a device whose business programs are monitored by the method of the present invention. The device is a network communication device with a red-black isolation architecture, in which the red-zone optical interface is connected to the internal network and the black-zone optical interface is connected to the external network. The key hardware components are a black-zone CPU, a red-zone CPU, an algorithm CPU and an FPGA; the reset key is connected to the reset pins of the three CPUs.

The black-zone CPU, the red-zone CPU and the algorithm CPU are of the same model and all run the same embedded Linux operating system; the application software of each of the three CPUs is custom-developed according to its functional requirements. Device initialization proceeds as follows: the clock chip provides the time to the Linux systems of the three CPUs, the algorithm CPU downloads the algorithm program to the FPGA, and the black-zone CPU and the red-zone CPU initialize the black-zone and red-zone optical interfaces respectively. Once the communication link is established, the service data path is as shown by the arrows in fig. 3. Most of the time the three CPUs do not participate in service operation and only generate control signaling for the ongoing service when needed: the algorithm CPU periodically issues algorithm update instructions to the FPGA; when a local management instruction is issued, the red-zone CPU responds first; and the black-zone CPU is responsible for receiving remote management instructions and controlling the device to execute them.

The business applications of the three CPUs all run as child processes protected by a parent process. When a child process on the red-zone or black-zone CPU suffers a memory error and crashes, the restart performed by the parent process affects only the local or remote management operation in progress at that moment, and the management function recovers once the child process has restarted. When the child process on the algorithm CPU crashes, one algorithm update and some main control flows may be affected; because the context is saved when the program crashes, the algorithm update and the main control flows can resume after the child process restarts. Because a restart is distinguished from a first start, the device initialization operation is not performed again, and ongoing communication is protected. Unless an operator is managing the device remotely or locally at that moment, ordinary users hardly notice any effect of a CPU program crash on network traffic.

If the reset key or the power key is pressed, or the device is restarted through the remote or local management menu, the parent process is restarted as well. In this case the device initialization operation must be executed after the restart, the FPGA and the optical interfaces are also reset, and ongoing communication is interrupted. Because these are manually issued device control and management operations, the system administrator expects the communication interruption, so a temporary interruption is acceptable.

According to the invention, by monitoring the exit signal of the child process, a crashed business program can be restarted promptly after the fault is detected, which reduces the probability that a program crash causes a communication system failure. Sensitivity to the response time between fault occurrence and fault removal is reduced, which extends the period within which customers will accept a fix for the code defect. The recorded program start and exit logs make it easier for developers to locate, inspect and correct defects in the code later. Moreover, business program monitoring is realized in software alone, without adding watchdog hardware, which effectively reduces hardware cost. The method is an effective remedy when code defects overlooked during development and testing reach deployed systems, and it also serves as an auxiliary means of locating defects; it has been verified as effective in network communication engineering practice.

The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. A business program operation monitoring method for realizing abnormal automatic restarting is characterized by comprising the following steps:

S10, starting a first process;

S20, creating a second process through copying of the first process;

2. The method of claim 1, wherein the identifying the first process as a parent process and the second process as a child process based on the creation function return values of the first process and the second process comprises:

3. The method of claim 1, wherein performing initialization according to the parameter value of the count start flag comprises:

when the parameter value of the count start flag is 1, the business program is identified as being started for the first time, and the complete first-start initialization is executed;

when the parameter value of the count start flag is greater than 1, the business program is identified as being restarted, and the complete first-start initialization is not executed.

4. The method of claim 1, wherein the first process is started by executing a power-on self-starting script.

5. The method of claim 1, wherein the step S30 further comprises:

configuring the parent process not to respond to the interrupt signal.

6. A business program operation monitoring system for realizing abnormal automatic restarting, implementing the method of claim 1, comprising:

the sub-process starting module is used for creating a sub-process;

the log recording module is used for recording a start log and an exit log;

7. A computer readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 5.

8. An electronic device comprising a processor and a memory;

The memory is used for storing service programs;

The processor is configured to execute the method of any one of claims 1 to 5 by invoking a business program.