
CN113778735B - Fault processing method and device and computer readable storage medium - Google Patents

Fault processing method and device and computer readable storage medium

Info

Publication number
CN113778735B
CN113778735B (application CN202111039126.6A)
Authority
CN
China
Prior art keywords
fault
container
cause
failure
scheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111039126.6A
Other languages
Chinese (zh)
Other versions
CN113778735A (en)
Inventor
李嘉荣
黎原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202111039126.6A
Publication of CN113778735A
Application granted
Publication of CN113778735B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a virtual computing platform, e.g. logically partitioned systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/815Virtual

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a fault processing method, a fault processing device and a computer readable storage medium, relates to the field of Internet technologies, and can improve the efficiency and accuracy of container fault processing. The fault processing method comprises the following steps: when a first container fails, determining a first fault cause according to a first fault result of the first container and a pre-trained fault cause prediction model; determining a first fault repair scheme according to the first fault cause and a preset correspondence, wherein the preset correspondence is the correspondence between fault causes and fault repair schemes; creating a second container identical to the failed first container; performing a repair operation on the second container according to the first fault repair scheme; and if the repaired second container works normally, outputting the first fault repair scheme.

Description

Fault processing method and device and computer readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a fault handling method, a fault handling device, and a computer readable storage medium.
Background
The platform as a service (PaaS) cloud platform is one of the implementation modes of cloud computing services. It abstracts basic resources such as computing, storage, network, operating system, database and middleware, delivers them to users as a platform service, and provides complete development and running environments for application programs by setting up different containers. Meanwhile, the PaaS cloud platform also provides services such as application debugging, application deployment, performance monitoring, load balancing, on-demand resource adjustment and automatic scaling for the application program, which simplifies the development flow of the application program and improves its development efficiency.
In the PaaS cloud platform, different application programs run in different containers, so that the PaaS cloud platform can provide different services. However, when a container or the application program in it fails, the possible faults in each container need to be checked and analyzed manually. Such a container fault handling scheme requires a large amount of manual work, resulting in low efficiency and accuracy of fault handling.
Disclosure of Invention
The application provides a fault processing method, a fault processing device and a computer readable storage medium, which can improve the efficiency and the accuracy of container fault processing.
In a first aspect, the present application provides a fault handling method, the fault handling method comprising: under the condition that the first container fails, determining a first failure reason according to a first failure result of the first container and a pre-trained failure reason prediction model; determining a first fault repairing scheme according to a first fault cause and a preset corresponding relation, wherein the preset corresponding relation is the corresponding relation between the fault cause and the fault repairing scheme; creating a second container identical to the failed first container; performing repair operation on the second container according to the first fault repair scheme; and if the repaired second container works normally, outputting a first fault repairing scheme.
According to the technical scheme provided by the embodiments of the application, the fault cause of the first container can be predicted by the fault cause prediction model, so the fault cause is determined automatically without manual analysis, which improves the fault processing efficiency of the container. Further, a fault repair scheme can be determined according to the automatically analyzed fault cause, and a second container identical to the first container is repaired according to the fault repair scheme so as to verify that scheme. Because the fault repair scheme corresponds to the fault cause, the verified fault repair scheme better matches the fault of the first container, making the fault processing of the container more accurate. The efficiency and accuracy of container fault handling are thereby improved.
In some embodiments, the failure cause prediction model is trained by: acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, and the training samples comprise fault reasons and fault results; according to the training sample set, training to generate a fault cause prediction model; in the training process, the input of the fault cause prediction model is a fault result in a training sample, and the output is the fault cause in the training sample. In this way, the embodiment of the application can collect and sort faults possibly occurring in the container according to the training samples to form a training sample set and generate the fault cause prediction model, so that non-professional personnel can process the faults of the PaaS cloud platform according to the fault cause prediction model, and the efficiency and accuracy of fault processing of the PaaS cloud platform are improved.
In some embodiments, the method further comprises: under the condition that the first container operates normally, generating a third container which is the same as the first container which operates normally; modifying configuration information of the third container; if the modified third container fails, recording a second failure result, and taking the modified content of the configuration information of the third container as a second failure cause; and taking the second fault result and the second fault cause as training samples.
In some embodiments, the method further comprises: determining fault characteristic information of the modified third container according to the configuration information of the first container and the configuration information of the modified third container, wherein the fault characteristic information is the content of the configuration information of the modified third container, which is different from the configuration information of the first container; repairing the modified fault characteristic information of the third container; if the repaired third container works normally, recording the repairing operation in the second fault repairing scheme, and establishing a corresponding relation between the second fault repairing scheme and the second fault reason. Therefore, the fault repairing scheme can be determined according to the difference between the container with the fault and the container with the normal working, and the effectiveness of the fault repairing scheme is improved. And establishing a corresponding relation between the second fault repairing scheme and the second fault cause, so that the fault processing device can determine an effective fault repairing scheme according to the fault cause and the corresponding relation. Therefore, the effectiveness and the accuracy of the container fault treatment process are improved, and the fault treatment efficiency of the container is further improved.
In some embodiments, outputting the first fault repair scheme includes: sending the first fault repair scheme to a client used by maintenance personnel. Maintenance personnel can then process the fault of the container according to the first fault repair scheme, which improves the fault processing efficiency of the container.
In a second aspect, an embodiment of the present application further provides a fault handling apparatus, where the fault handling apparatus includes: a processing module and a communication module. The processing module is used for determining a first failure reason according to a first failure result of the first container and a pre-trained failure reason prediction model under the condition that the first container fails; determining a first fault repairing scheme according to a first fault cause and a preset corresponding relation, wherein the preset corresponding relation is the corresponding relation between the fault cause and the fault repairing scheme; creating a second container identical to the failed first container; and performing a repair operation on the second container according to the first fault repair scheme. And the communication module is used for outputting a first fault repairing scheme if the repaired second container works normally.
In some embodiments, the communication module is further configured to obtain a training sample set, where the training sample set includes a plurality of training samples, and the training samples include a failure cause and a failure result; the processing module is also used for training and generating a fault cause prediction model according to the training sample set; in the training process, the input of the fault cause prediction model is a fault result in a training sample, and the output is the fault cause in the training sample.
In some embodiments, the processing module is specifically configured to generate, in a case where the first container is operating normally, a third container identical to the first container that is operating normally; modifying configuration information of the third container; if the modified third container fails, recording a second failure result, and taking the modified content of the configuration information of the third container as a second failure cause; and taking the second fault result and the second fault cause as training samples.
In some embodiments, the processing module is further configured to determine, according to the configuration information of the first container and the configuration information of the modified third container, failure feature information of the modified third container, where the failure feature information is content of the configuration information of the modified third container that is different from the configuration information of the first container; repairing the modified fault characteristic information of the third container; if the repaired third container works normally, recording the repairing operation in the second fault repairing scheme, and establishing a corresponding relation between the second fault repairing scheme and the second fault reason.
In some embodiments, the communication module is specifically configured to send the first fault repair scheme to a client used by a maintenance person.
In a third aspect, an embodiment of the present application further provides a fault handling apparatus, including: a processor and a communication interface. The processor and the communication interface are configured to implement the fault handling method of the first aspect or any one of its possible embodiments.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium storing computer instructions that, when executed, implement the fault handling method of the first aspect or any one of the possible embodiments.
In a fifth aspect, embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the fault handling method of the first aspect or any one of the possible embodiments.
The technical effects of any one of the designs of the second aspect to the fifth aspect may be referred to as the technical effects of the corresponding design of the first aspect, and will not be described herein.
Drawings
Fig. 1 is a schematic diagram of a Paas cloud platform according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a fault handling method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of another fault handling method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of another fault handling method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of another fault handling method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another fault handling apparatus according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described more clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. All other embodiments obtained by a person skilled in the art based on the embodiments provided by the present application fall within the scope of protection of the present application.
Throughout the specification and claims, the term "comprising" is to be interpreted as having an open, inclusive meaning, i.e. "comprising, but not limited to", unless the context requires otherwise. In the description of the present specification, the terms "one embodiment," "some embodiments," "example embodiments," "examples," or "some examples," etc., are intended to indicate that a particular feature, structure, material, or characteristic associated with the embodiment or example is included in at least one embodiment or example of the application. The schematic representations of the above terms do not necessarily refer to the same embodiment or example.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
At least one of "A, B and C" has the same meaning as at least one of "A, B or C" and includes the following combinations of A, B and C: a alone, B alone, C alone, a combination of a and B, a combination of a and C, a combination of B and C, and a combination of A, B and C.
"A and/or B" includes the following three combinations: only a, only B, and combinations of a and B.
As shown in fig. 1, there are multiple nodes in the PaaS cloud platform, where the nodes are used to implement various functions such as computing, storage, network, operating system, database, middleware, and the like. For example, nodes 10, 20, 30, 40 may be computing nodes in the PaaS cloud platform that provide computing for other nodes or devices in the PaaS cloud platform by completing computing tasks.
In the architecture shown in fig. 1, a plurality of containers may be provided in a node, where the containers are used to provide a runtime environment for a service application. For example, the node 10 may be provided with a container 11, a container 12, a container 13, and a container 14.
On the one hand, the PaaS cloud platform has a large number of containers, and after long-time operation the programs of the containers and the application programs of the services are prone to fail, so that the containers fail and the services in the containers become unavailable. On the other hand, there are uncertain factors in the application parameters received by the services in the containers, and different values of these application parameters can also cause a container to fail and its service to become unavailable.
When such a failure occurs, a developer is required to perform failure analysis on the container in order to repair the unusable container. Because the containers are large in variety and number, a large amount of manual processing is needed, and the accuracy of fault prediction is low.
In order to solve the technical problems, the application provides a fault processing method, and the technical scheme provided by the embodiment of the application can be applied to the PaaS cloud platform or other platforms provided with containers.
In order to implement the fault processing method provided by the embodiment of the present application, the embodiment of the present application provides a fault processing device for executing the fault processing method, and fig. 2 is a schematic structural diagram of the fault processing device provided by the embodiment of the present application. As shown in fig. 2, the fault handling apparatus 200 comprises at least one processor 201, a communication line 202, and at least one communication interface 204, and may further comprise a memory 203. The processor 201, the memory 203, and the communication interface 204 may be connected through a communication line 202.
Processor 201 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more digital signal processors (DSPs) or one or more field programmable gate arrays (FPGAs).
Communication line 202 may include a path for communicating information between the above-described components.
The communication interface 204 is used to communicate with other nodes or devices in the PaaS cloud platform, or with a communication network outside the PaaS cloud platform, and may be any transceiver-like device, such as an Ethernet interface, a radio access network (RAN) interface, a wireless local area network (WLAN) interface, etc.
The memory 203 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer.
In a possible design, the memory 203 may exist separately from the processor 201, that is, the memory 203 may be a memory external to the processor 201, where the memory 203 may be connected to the processor 201 through a communication line 202, for storing execution instructions or application program codes, and the execution is controlled by the processor 201 to implement a fault handling method provided by the embodiments of the present application described below. In yet another possible design, the memory 203 may be integrated with the processor 201, i.e., the memory 203 may be an internal memory of the processor 201, e.g., the memory 203 may be a cache, may be used to temporarily store some data and instruction information, etc.
As one implementation, processor 201 may include one or more CPUs, such as CPU0 and CPU1 in fig. 2.
As another implementation, the fault handling apparatus 200 may include a plurality of processors, such as the processor 201 and the processor 207 in fig. 2.
As yet another implementation, the fault handling apparatus 200 may further include an output device 205 and an input device 206. The output device 205 communicates with the processor 201 and may display information in a variety of ways. For example, the output device 205 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, or the like. The input device 206 is in communication with the processor 201 and may receive user input in a variety of ways. For example, the input device 206 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
As shown in fig. 3, an embodiment of the present application provides a fault handling method, which is applied to the fault handling apparatus 200 in fig. 2, and the method includes steps S301 to S305:
s301, when the first container fails, determining a first failure reason according to a first failure result of the first container and a pre-trained failure reason prediction model.
In some embodiments, the failure of the first container may be a failure of a program of the first container or a program of a service within the first container.
In other embodiments, the failure of the first container may be a failure caused by an incorrect application parameter received by the service within the first container. Illustratively, as illustrated in connection with FIG. 1, the container 31 in the node 30 is provided with a database access service. The service in the container 31 can accept the service requests of 20 database accesses at the same time, that is, "maximum connection number=20". Assuming that the number of connections received by the container 31=30, the maximum number of connections of the container 31 is exceeded, at which time the container 31 may malfunction.
Alternatively, the first failure result of the first container may include an operational state of the first container after the first container fails. For example, the operating state of the first container after the failure may be the operating parameter information of the first container after the failure. For example, the memory occupancy of the first container or the hard disk occupancy.
Further, the first failure result of the first container further includes a feedback result of the first container after the first container fails. For example, the feedback result after the first container fails may be an operation result of the service in the first container after the first container fails. For example, the result of the operation of the service in the first container may be an indication information for indicating that the maximum connection number is exceeded.
In the embodiment of the application, the failure cause prediction model is used for determining the failure cause of the first container failure. The failure cause prediction model is obtained by training according to the failure result and the failure cause of the possible failure of the first container.
In some embodiments, one fault result may correspond to one or more fault causes, and illustratively, the first fault cause may be an error in one or more instruction codes in a program serviced within the first container. Still further exemplary, the first failure cause may be that one or more application parameters received by the service within the first container are incorrect.
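Purely as an illustration and not as part of the claimed method, the following Python sketch shows one way a first fault result (operating state plus feedback) could be encoded as features and passed to a pre-trained fault cause prediction model. The FaultResult structure, the feature names, the model file name and the use of a scikit-learn style classifier loaded with joblib are all assumptions made for the example.

```python
# Illustrative sketch (assumed data layout and model format): encode a fault
# result as numeric features and ask a pre-trained classifier for the fault cause.
from dataclasses import dataclass
from typing import Dict
import joblib  # assumption: the trained model was persisted with joblib

@dataclass
class FaultResult:
    memory_usage_pct: float   # operating state after the fault, e.g. memory occupancy
    disk_usage_pct: float     # operating state after the fault, e.g. hard disk occupancy
    feedback: str = ""        # feedback result, e.g. "max_connections_exceeded"

# hypothetical encoding of feedback messages into numeric codes
FEEDBACK_CODES: Dict[str, int] = {"": 0, "max_connections_exceeded": 1, "out_of_memory": 2}

def predict_fault_cause(result: FaultResult,
                        model_path: str = "fault_cause_model.pkl") -> str:
    """Return the predicted first fault cause for a first fault result."""
    model = joblib.load(model_path)               # pre-trained fault cause prediction model
    features = [[result.memory_usage_pct,
                 result.disk_usage_pct,
                 FEEDBACK_CODES.get(result.feedback, 0)]]
    return model.predict(features)[0]             # e.g. "max_connections_misconfigured"
```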
S302, determining a first fault restoration scheme according to a first fault reason and a preset corresponding relation.
The preset corresponding relation can be a corresponding relation between a fault reason and a fault repairing scheme.
For example, one fault cause may correspond to one or more fault repair schemes. Therefore, the one or more fault repair schemes need to be verified to improve the accuracy of fault handling.
In some embodiments, a fault repair scheme is used to repair the failed first container so that the first container operates normally.
For example, the fault repair scheme may be a modification of the program of the first container or of the program of the service within the first container; a modification of the application parameters received by the service in the first container; or a modification of application parameters in the program of the service within the first container.
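As a sketch only, the preset correspondence described above can be pictured as a mapping from fault cause labels to lists of candidate fault repair schemes; the cause labels and scheme contents below are hypothetical and not taken from the disclosure.

```python
# Illustrative sketch: preset correspondence between fault causes and candidate
# fault repair schemes (one fault cause may correspond to several schemes).
from typing import Dict, List

RepairScheme = Dict[str, object]  # e.g. {"action": "set_parameter", "key": ..., "value": ...}

PRESET_CORRESPONDENCE: Dict[str, List[RepairScheme]] = {
    # hypothetical fault cause -> candidate repair schemes, to be verified in turn
    "max_connections_misconfigured": [
        {"action": "set_parameter", "key": "max_connections", "value": 20},
        {"action": "restart_service", "service": "db-access"},
    ],
    "service_code_error": [
        {"action": "rollback_image", "tag": "previous-stable"},
    ],
}

def repair_schemes_for(fault_cause: str) -> List[RepairScheme]:
    """Look up the candidate fault repair schemes for a predicted fault cause."""
    return PRESET_CORRESPONDENCE.get(fault_cause, [])
```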
S303, creating a second container which is the same as the first container with the fault.
The second container is a replica of the failed first container and is used to test the first fault repair scheme. In the embodiment of the application, testing the first fault repair scheme on the second container stands in for testing it on the first container itself; accordingly, the test result of the first fault repair scheme on the second container is equally applicable to the first container.
In some embodiments, the number of second containers may be determined based on the number of fault repair schemes. As described in step S302, one fault cause may correspond to one or more fault repair schemes. The number of second containers may therefore equal the number of fault repair schemes, so that multiple fault repair schemes can be tested at the same time, the test results are obtained more quickly, and the test efficiency of the fault repair schemes is improved.
As a possible implementation, the fault handling apparatus may create the second container based on the configuration information of the first container. For example, the fault handling apparatus may first create, from the container environment of the first container, a container that is not yet running a service, and then run the service in it according to the program information and application parameters of the service in the first container, thereby obtaining a second container that reproduces the fault.
It will be appreciated that the second container created from the configuration information of the first container has the same failure as the first container, so that the failure handling means can replace the failure handling of the first container by performing the failure handling of the second container.
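For illustration only, a second container could be created from the first container's configuration information with the Docker SDK for Python as sketched below. The disclosure does not prescribe a container runtime, so the use of Docker, its API and the field names are assumptions.

```python
# Illustrative sketch (assumes the Docker SDK for Python): clone the failed first
# container's configuration into a fresh second container used only for testing.
import docker

def create_replica(first_container_id: str, replica_name: str = "fault-replica"):
    client = docker.from_env()
    first = client.containers.get(first_container_id)
    cfg = first.attrs["Config"]                  # image, command and environment of the first container
    return client.containers.run(
        image=cfg["Image"],
        command=cfg.get("Cmd"),
        environment=cfg.get("Env"),
        name=replica_name,
        detach=True,                             # run in the background like the original container
    )
```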
S304, repairing the second container according to the first fault repairing scheme.
As a possible implementation manner, the program of the first container and/or the program served in the first container are repaired, so as to implement the repair operation on the second container.
As another possible implementation manner, the application parameters received by the first container and/or the operation parameters of the service in the first container are repaired, so as to implement a repair operation on the second container.
It should be understood that the first container is a running container in the PaaS cloud platform, and its stability is directly related to the stability of the PaaS cloud platform or of the node where the first container is located. Directly modifying the program of the first container may therefore cause problems in, and affect the safe and stable operation of, the PaaS cloud platform or that node. By performing the repair operation on a second container identical to the failed first container, modification of the first container is avoided, while the result of the repair operation on the second container remains applicable to the first container. In this way, performing the repair operation on the second container instead of on the failed first container avoids the impact that handling the fault directly on the first container could have on the PaaS cloud platform or the node where the first container is located.
S305, if the repaired second container works normally, outputting a first fault repair scheme.
In some embodiments, outputting the first fault repair scheme includes: sending the first fault repair scheme to a client used by maintenance personnel.
It should be understood that if the repaired second container works normally, the first failure repair scheme corresponding to the second container is the correct failure repair scheme with respect to the failure occurring in the first container. The first failure cause is the correct failure cause.
If the repaired second container cannot work normally, the first fault repairing scheme corresponding to the second container is incorrect relative to the fault of the first container, and is not a fault repairing scheme for repairing the fault of the first container.
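The loop below sketches, for illustration only, how steps S303 to S305 could fit together: each candidate fault repair scheme is applied to a fresh second container, and the first scheme under which the repaired container works normally is output. It reuses the create_replica sketch above; the other three helpers are placeholders for operations the disclosure describes only functionally.

```python
# Illustrative sketch: verify candidate fault repair schemes on replica containers
# and output the first scheme that restores normal operation (steps S303-S305).
from typing import List, Optional

def apply_repair_scheme(container, scheme: dict) -> None:
    raise NotImplementedError("placeholder: perform the repair operation (S304)")

def container_works_normally(container) -> bool:
    raise NotImplementedError("placeholder: health check on the repaired container (S305)")

def notify_maintainer(scheme: dict) -> None:
    raise NotImplementedError("placeholder: send the scheme to the maintainer's client")

def verify_repair_schemes(first_container_id: str,
                          schemes: List[dict]) -> Optional[dict]:
    for scheme in schemes:
        replica = create_replica(first_container_id)   # S303: second container
        try:
            apply_repair_scheme(replica, scheme)        # S304: repair operation
            if container_works_normally(replica):       # S305: repaired second container is healthy
                notify_maintainer(scheme)               # output the first fault repair scheme
                return scheme
        finally:
            replica.remove(force=True)                  # discard the test replica
    return None                                         # no candidate scheme repaired the fault
```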
According to the technical scheme provided by the embodiments of the application, the fault cause of the first container is predicted, and the fault repair scheme corresponding to that fault cause is verified by establishing a second container with the same fault; the verified fault repair scheme corresponds to the fault cause more accurately, which improves the accuracy of fault processing of the PaaS cloud platform. In this process, a large amount of manual processing is not needed, and manual participation in the fault processing of the PaaS cloud platform is reduced. The efficiency and accuracy of fault processing of the PaaS cloud platform are therefore improved.
In some embodiments, after the maintainer receives the first fault repair scheme, the first container may be repaired according to the first fault repair scheme, so that the first container operates normally, and the efficiency of fault processing of the PaaS cloud platform is improved.
In some embodiments, the fault handling apparatus may also send information to a maintenance person indicating the course of the process in the fault handling method. The information may include, among other things, a first failure result of the first container, a first failure cause, and a repair result to the second container. Therefore, maintenance personnel can comprehensively judge faults of the first container, and the success rate of the first container repair is improved.
In some embodiments, the fault handling device may further send information indicating to reestablish the first container to the node to which the first container corresponds. That is, in the case that the first container cannot be repaired, the node corresponding to the first container may reestablish the first container and continue to provide the service.
In some embodiments, the fault handling apparatus may further receive first instruction information of a maintenance person, where the first instruction information is used to specify a cause of the fault that may be repaired automatically, and a corresponding fault repair scheme. Therefore, the fault processing device can automatically repair faults in the PaaS cloud platform according to the fault processing method, and manual operation in the fault processing of the PaaS cloud platform is reduced.
In some embodiments, the fault handling apparatus may further receive second instruction information of the maintenance personnel, wherein the second instruction information is used to specify a cause of the fault that is not automatically repairable. Therefore, the fault processing device can ensure the safety in the fault processing of the PaaS cloud platform.
Optionally, before the first container fails, the failure processing device may establish a failure cause prediction model, so that the failure processing device may process the failure occurring in the first container according to the failure cause prediction model. As shown in fig. 4, the failure cause prediction model is obtained through training in steps S401 to S402:
S401, acquiring a training sample set.
Wherein the training sample set comprises a plurality of training samples.
In some embodiments, the training samples include a failure cause and a failure result. There is a correspondence between the cause of the fault and the result of the fault. For example, one failure cause may correspond to one or more failure results. Or a fault result may correspond to one or more fault causes.
As a possible implementation, as shown in fig. 5, the training samples may be determined by steps A1-A4.
A1, under the condition that the first container operates normally, generating a third container which is the same as the first container which operates normally.
The third container is used for testing the fault reasons in the fault library. The fault library is a collection of a plurality of fault causes.
For example, the fault handling device may receive information indicating that a fault library is to be established, where the information may include a cause of a fault that may occur in the first container entered by the maintenance personnel.
As yet another example, the fault handling apparatus may receive information indicating to update the fault repository, wherein the information may include a newly added cause of the fault that may occur in the first container.
A2, modifying the configuration information of the third container.
In some embodiments, the fault handling apparatus may modify the configuration information of the first container according to the cause of the fault in the fault repository. For example, if one of the failure causes is that the application parameters to be input are changed, the failure processing apparatus may change the application parameters to be input in the third container.
A3, if the modified third container fails, recording a second failure result, and taking the modified content of the configuration information of the third container as a second failure cause.
The second failure result may be an operation state after the third container fails, or may also be a feedback result after the third container fails.
In some embodiments, the second failure cause may be one of the failure causes in the failure repository. The second fault cause may be, for example, a fault cause for indicating that the application parameter to be input by the service is changed, from among the fault causes in the fault library.
In some embodiments, if the modified third container works normally, the third container is deleted, and the third container identical to the first container that works normally is regenerated, so as to realize automatic testing of the failure cause in the failure library.
A4, taking the second fault result and the second fault cause as training samples.
In the embodiment of the application, the fault processing device also needs to store the corresponding relation between the second fault result and the second fault cause, so that the fault processing device can determine the fault cause according to the fault result.
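Steps A1 to A4 amount to fault injection: clone a healthy first container, perturb one piece of its configuration, and, if the clone then fails, keep the (fault result, fault cause) pair as a training sample. The sketch below is illustrative only; the fault library entries are hypothetical, create_replica and container_works_normally are the sketches and placeholders given earlier, and the two new helpers are further placeholders.

```python
# Illustrative sketch of steps A1-A4: generate training samples by injecting a
# single configuration change into a clone (third container) of a healthy container.
from typing import List, Tuple

def modify_configuration(container, change: dict) -> None:
    raise NotImplementedError("placeholder: apply the configuration change (A2)")

def collect_fault_result(container) -> dict:
    raise NotImplementedError("placeholder: record operating state and feedback (A3)")

# hypothetical fault library: each entry is one configuration change to inject
FAULT_LIBRARY = [
    {"key": "max_connections", "value": 5},
    {"key": "db_password", "value": "wrong-password"},
]

def generate_training_samples(first_container_id: str) -> List[Tuple[dict, dict]]:
    samples = []
    for change in FAULT_LIBRARY:
        third = create_replica(first_container_id)           # A1: clone the healthy first container
        try:
            modify_configuration(third, change)               # A2: modify one configuration item
            if not container_works_normally(third):           # A3: the modified clone now fails
                fault_result = collect_fault_result(third)    # second fault result
                samples.append((fault_result, change))        # A4: (fault result, fault cause) sample
        finally:
            third.remove(force=True)                          # discard the test clone
    return samples
```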
S402, training to generate a fault cause prediction model according to the training sample set; in the training process, the input of the fault cause prediction model is a fault result in a training sample, and the output is the fault cause in the training sample.
It should be noted that, the training process of the failure cause prediction model is supervised training, that is, the failure result, the failure cause, and the correspondence between the failure result and the failure cause are determined.
Therefore, the embodiment of the application can collect and sort the faults possibly occurring in the container according to the indication of the development and maintenance personnel to form the training sample and generate the fault cause prediction model, so that non-professional personnel can process the faults of the PaaS cloud platform according to the fault cause prediction model, and the efficiency and the accuracy of fault processing of the PaaS cloud platform are improved.
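Because the training in step S402 is supervised (fault result in, fault cause out), any off-the-shelf classifier could in principle serve as the fault cause prediction model. The scikit-learn decision tree below is only an assumed example, and the feature encoding and model file name match the earlier prediction sketch rather than anything prescribed by the disclosure.

```python
# Illustrative sketch of step S402: supervised training of the fault cause
# prediction model from (fault result features, fault cause label) samples.
from typing import List, Sequence, Tuple
import joblib
from sklearn.tree import DecisionTreeClassifier

def train_fault_cause_model(samples: List[Tuple[Sequence[float], str]],
                            model_path: str = "fault_cause_model.pkl"):
    X = [features for features, _ in samples]   # inputs: encoded fault results
    y = [cause for _, cause in samples]         # outputs: fault cause labels
    model = DecisionTreeClassifier()
    model.fit(X, y)                             # supervised training
    joblib.dump(model, model_path)              # reused later in step S301 for prediction
    return model
```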
Optionally, after the training of the failure cause prediction model is finished, the failure processing device may further repair the modified third container and determine a failure repair scheme for the failure of the modified third container. Thereby establishing a corresponding relation between the fault cause and the fault repair scheme.
As shown in fig. 6, the process of establishing the correspondence between the failure cause and the failure repair scheme may include steps B1-B3:
B1, determining fault characteristic information of the modified third container according to the configuration information of the first container and the configuration information of the modified third container.
The fault characteristic information is the content of the configuration information of the modified third container, which is different from the content of the configuration information of the first container.
Illustratively, the configuration information of the third container is taken as an example of an application parameter of the service in the third container. Assuming that the "maximum connection number=20" in the first container and the "maximum connection number=5" in the third container, the failure feature information is "maximum connection number=5".
It will be appreciated that the first container is a normal operation container, the modified third container is a failed container, and the failure characteristic information may indicate difference information between the failed container and the normal operation container, that is, configuration information of the modified third container is different from content of the configuration information of the first container. Thus, the failure processing apparatus can repair the failed container (third container) based on the difference information between the failed container and the normal-operation container.
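Determining the fault characteristic information in step B1 is essentially a comparison of two sets of configuration information; the sketch below, given only as an illustration with the configuration represented as a flat dictionary, returns the items in which the modified third container differs from the first container.

```python
# Illustrative sketch of step B1: the fault characteristic information is the part
# of the modified third container's configuration that differs from the first container's.
from typing import Dict

def fault_feature_info(first_cfg: Dict[str, object],
                       third_cfg: Dict[str, object]) -> Dict[str, object]:
    """Return configuration items where the failed third container differs from
    the normally operating first container, e.g. {"max_connections": 5}."""
    return {key: value
            for key, value in third_cfg.items()
            if first_cfg.get(key) != value}
```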
And B2, repairing the modified fault characteristic information of the third container.
As a possible implementation, to perform the repair operation on the fault characteristic information of the modified third container, the fault processing device modifies the configuration information corresponding to the fault characteristic information of the third container so that it is the same as the configuration information of the first container.
And B3, if the repaired third container works normally, recording the repairing operation in the second fault repairing scheme, and establishing a corresponding relation between the second fault repairing scheme and a second fault reason.
In some embodiments, if the repaired third container works normally, it indicates that the third container is repaired successfully, that is, the second fault repair scheme is applicable to the second fault cause. Therefore, the fault processing device can establish a corresponding relation between the second fault restoration scheme and the second fault cause, so that the fault processing device can determine the fault restoration scheme according to the fault cause and the preset corresponding relation.
In this way, the embodiment of the application can determine the fault repairing scheme according to the difference between the container with the fault and the container with the normal working, so that the fault repairing scheme can effectively repair the container with the fault, and the repairing efficiency of the fault repairing scheme is improved.
The technical solutions provided by the embodiments of the present application have been described above mainly from the perspective of the method. To achieve the above functions, the fault handling apparatus includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the fault processing device according to the method example, for example, each functional module can be divided corresponding to each function, and two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. Optionally, the division of the modules in the embodiment of the present application is schematic, which is merely a logic function division, and other division manners may be implemented in practice.
Fig. 7 is a schematic structural diagram of another fault handling apparatus according to an embodiment of the present application. The fault handling device comprises: a processing module 701 and a communication module 702.
The processing module 701 is configured to determine a first failure cause according to a first failure result of the first container and a pre-trained failure cause prediction model when the first container fails; determining a first fault repairing scheme according to a first fault cause and a preset corresponding relation, wherein the preset corresponding relation is the corresponding relation between the fault cause and the fault repairing scheme; creating a second container identical to the failed first container; performing repair operation on the second container according to the first fault repair scheme;
and the communication module 702 is configured to output the first failure repair scheme if the repaired second container is working normally.
In some embodiments, the communication module 702 is further configured to obtain a training sample set, where the training sample set includes a plurality of training samples, and the training samples include a failure cause and a failure result; the processing module 701 is further configured to train and generate a failure cause prediction model according to the training sample set; in the training process, the input of the fault cause prediction model is a fault result in a training sample, and the output is the fault cause in the training sample.
In some embodiments, the processing module 701 is specifically configured to generate, in a case where the first container is operating normally, a third container that is the same as the first container that is operating normally; modifying configuration information of the third container; if the modified third container fails, recording a second failure result, and taking the modified content of the configuration information of the third container as a second failure cause; and taking the second fault result and the second fault cause as training samples.
In some embodiments, the processing module 701 is further configured to determine, according to the configuration information of the first container and the configuration information of the modified third container, fault feature information of the modified third container, where the fault feature information is content of the configuration information of the modified third container that is different from the configuration information of the first container; repairing the modified fault characteristic information of the third container; if the repaired third container works normally, recording the repairing operation in the second fault repairing scheme, and establishing a corresponding relation between the second fault repairing scheme and the second fault reason.
In some embodiments, the communication module 702 is specifically configured to send the first fault repair scheme to a client used by maintenance personnel.
Optionally, the fault handling apparatus may further comprise a storage module for storing program code and/or data of the fault handling apparatus. Such as the program code of the fault handling method described above. And also for example, includes a fault library including various fault causes involved in the above-described fault handling method.
The processing module 701 may be a processor or a controller, which may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. A processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor, and the like. The communication module 702 may be a transceiver circuit or a communication interface, etc. The storage module may be a memory. When the processing module 701 is a processor, the communication module 702 is a communication interface, and the storage module is a memory, the fault handling apparatus according to the embodiment of the present application may be the fault handling apparatus shown in fig. 2.
The embodiment of the invention also provides a computer readable storage medium, which comprises computer execution instructions, when the computer execution instructions run on a computer, cause the computer to execute the fault processing method provided by the embodiment.
The embodiment of the invention also provides a computer program product which can be directly loaded into a memory and contains software codes, and the computer program product can realize the fault processing method provided by the embodiment after being loaded and executed by a computer.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of functional modules is illustrated, and in practical application, the above-described functional allocation may be implemented by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to implement all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and the division of modules or units, for example, is merely a logical function division, and other manners of division are possible when actually implemented. For example, multiple units or components may be combined or may be integrated into another device, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units. The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing is merely illustrative of specific embodiments of the present application, and the scope of the present application is not limited thereto, but any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (12)

1. A method of fault handling, the method comprising:
when a first container fails, determining a first fault cause according to a first fault result of the first container and a pre-trained fault cause prediction model;
determining a first fault repair scheme according to the first fault cause and a preset correspondence, wherein the preset correspondence is a correspondence between fault causes and fault repair schemes;
creating a second container identical to the failed first container;
performing a repair operation on the second container according to the first fault repair scheme; and
if the repaired second container works normally, outputting the first fault repair scheme.
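By way of non-limiting illustration of the flow recited in claim 1, the following Python sketch models the steps with plain data structures. The names Container, predict_fault_cause, REPAIR_SCHEMES, clone_container and apply_repair are hypothetical stand-ins introduced only for this illustration and are not defined by the claims.

```python
# Hypothetical sketch of the fault handling flow of claim 1.
# Container, predict_fault_cause, REPAIR_SCHEMES, clone_container and
# apply_repair are illustrative names, not APIs defined by the patent.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    name: str
    config: dict
    healthy: bool = True


# Preset correspondence between fault causes and fault repair schemes.
REPAIR_SCHEMES = {
    "memory_limit_too_low": "raise the container memory limit to the baseline value",
    "missing_env_variable": "restore the missing environment variable",
}


def predict_fault_cause(fault_result: str) -> str:
    """Stand-in for the pre-trained fault cause prediction model."""
    return ("memory_limit_too_low" if "OOM" in fault_result
            else "missing_env_variable")


def clone_container(original: Container) -> Container:
    """Create a second container identical to the failed first container."""
    return Container(name=original.name + "-repair-test",
                     config=dict(original.config))


def apply_repair(container: Container, scheme: str) -> None:
    """Perform the repair operation; in this toy model it only annotates the clone."""
    container.config["applied_repair"] = scheme
    container.healthy = True  # assume the repair succeeds in this sketch


def handle_fault(first: Container, fault_result: str) -> Optional[str]:
    cause = predict_fault_cause(fault_result)      # first fault cause
    scheme = REPAIR_SCHEMES.get(cause)             # first fault repair scheme
    if scheme is None:
        return None
    second = clone_container(first)                # second container
    apply_repair(second, scheme)                   # repair on the second container
    return scheme if second.healthy else None      # output only if it works


if __name__ == "__main__":
    failed = Container("app-1", {"mem_limit": "64m"}, healthy=False)
    print(handle_fault(failed, "container exited: OOMKilled"))
```

In a real deployment the repair would be applied to an actual replica (for example, via a container runtime API) rather than to an in-memory record; the sketch only traces the sequence of steps.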
2. The method of claim 1, wherein the fault cause prediction model is trained by:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, and each training sample comprises a fault cause and a fault result; and
training the fault cause prediction model according to the training sample set, wherein in the training process the input of the fault cause prediction model is the fault result in a training sample and the output is the fault cause in that training sample.
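The claims do not fix a particular model family. As one plausible realization of the training step in claim 2, the sketch below fits a scikit-learn text classifier whose input is a fault result string and whose output is a fault cause label; the sample data and label names are invented for illustration.

```python
# One plausible realization of the training procedure of claim 2, using
# scikit-learn; the model family and the sample data are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training sample set: each sample pairs a fault result with a fault cause.
training_samples = [
    ("container exited: OOMKilled",          "memory_limit_too_low"),
    ("process killed, out of memory",        "memory_limit_too_low"),
    ("connection refused to database host",  "missing_env_variable"),
    ("cannot resolve DB_HOST",               "missing_env_variable"),
]

fault_results = [result for result, _ in training_samples]  # model input
fault_causes = [cause for _, cause in training_samples]     # model output

# Train the fault cause prediction model on the sample set.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(fault_results, fault_causes)

# Predict the fault cause for a new fault result.
print(model.predict(["java process OOMKilled by the kernel"])[0])
```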
3. The method according to claim 2, wherein the method further comprises:
when the first container is running normally, generating a third container identical to the first container;
modifying configuration information of the third container;
if the modified third container fails, recording a second fault result and taking the modified content of the configuration information of the third container as a second fault cause; and
taking the second fault result and the second fault cause as a training sample.
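As a non-limiting illustration of the sample-collection step in claim 3, the sketch below clones a normally running container's configuration, perturbs one configuration item, and records the resulting fault as a training sample. The run_container stand-in and the configuration keys are hypothetical.

```python
# Hypothetical sketch of the training-sample collection of claim 3:
# clone a normally running container, perturb one configuration item,
# and record the resulting fault. run_container and the configuration
# keys are illustrative stand-ins, not part of the claimed method.
import copy


def run_container(config: dict) -> str:
    """Stand-in health check: returns 'ok' or a fault result string."""
    if config.get("mem_limit", "512m") == "16m":
        return "container exited: OOMKilled"
    if not config.get("env", {}).get("DB_HOST"):
        return "connection refused to database host"
    return "ok"


first_config = {"mem_limit": "512m", "env": {"DB_HOST": "db.internal"}}

# Perturbations applied to the third container's configuration information.
perturbations = [
    ("mem_limit", "16m"),
    ("env", {}),  # drop all environment variables
]

training_samples = []
for key, bad_value in perturbations:
    third_config = copy.deepcopy(first_config)  # third container, same config
    third_config[key] = bad_value               # modify configuration information
    fault_result = run_container(third_config)
    if fault_result != "ok":                    # modified third container fails
        second_fault_cause = {key: bad_value}   # modified content as the cause
        training_samples.append((fault_result, second_fault_cause))

print(training_samples)
```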
4. The method according to claim 3, wherein the method further comprises:
determining fault feature information of the modified third container according to the configuration information of the first container and the configuration information of the modified third container, wherein the fault feature information is the content of the configuration information of the modified third container that differs from the configuration information of the first container;
repairing the fault feature information of the modified third container; and
if the repaired third container works normally, recording the repair operation in a second fault repair scheme, and establishing a correspondence between the second fault repair scheme and the second fault cause.
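As a non-limiting illustration of claim 4, the sketch below derives the fault feature information as a dictionary diff between the first container's configuration and the modified third container's configuration, repairs it by restoring the original values, and records the repair against the second fault cause. All names and values are hypothetical.

```python
# Hypothetical sketch of claim 4: the fault feature information is the part of
# the modified third container's configuration that differs from the first
# container's configuration; repairing it restores the original values, and the
# repair operation is recorded against the second fault cause.
import copy


def diff_config(first: dict, third: dict) -> dict:
    """Return the items of `third` whose values differ from `first`."""
    return {k: v for k, v in third.items() if first.get(k) != v}


first_config = {"mem_limit": "512m", "env": {"DB_HOST": "db.internal"}}
third_config = copy.deepcopy(first_config)
third_config["mem_limit"] = "16m"            # fault injected as in claim 3

# Fault feature information of the modified third container.
fault_features = diff_config(first_config, third_config)

# Repair: restore every differing item to the first container's value.
repaired_config = dict(third_config)
for key in fault_features:
    repaired_config[key] = first_config[key]

second_fault_cause = fault_features                                   # e.g. {'mem_limit': '16m'}
second_repair_scheme = {key: first_config[key] for key in fault_features}

# Record the correspondence between the second fault repair scheme and the
# second fault cause only if the repaired third container works normally.
if repaired_config == first_config:
    correspondence = {str(second_fault_cause): second_repair_scheme}
    print(correspondence)
```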
5. The method of any one of claims 1 to 4, wherein the outputting the first fault repair scheme comprises:
sending the first fault repair scheme to a client used by maintenance personnel.
6. A fault handling apparatus, comprising a processing module and a communication module, wherein:
the processing module is configured to: when a first container fails, determine a first fault cause according to a first fault result of the first container and a pre-trained fault cause prediction model; determine a first fault repair scheme according to the first fault cause and a preset correspondence, wherein the preset correspondence is a correspondence between fault causes and fault repair schemes; create a second container identical to the failed first container; and perform a repair operation on the second container according to the first fault repair scheme; and
the communication module is configured to output the first fault repair scheme if the repaired second container works normally.
7. The fault handling apparatus of claim 6, wherein:
the communication module is further configured to acquire a training sample set, wherein the training sample set comprises a plurality of training samples, and each training sample comprises a fault cause and a fault result; and
the processing module is further configured to train the fault cause prediction model according to the training sample set, wherein in the training process the input of the fault cause prediction model is the fault result in a training sample and the output is the fault cause in that training sample.
8. The fault handling apparatus of claim 7, wherein:
the processing module is further configured to: when the first container is running normally, generate a third container identical to the first container; modify configuration information of the third container; if the modified third container fails, record a second fault result and take the modified content of the configuration information of the third container as a second fault cause; and take the second fault result and the second fault cause as a training sample.
9. The fault handling apparatus of claim 8, wherein:
the processing module is further configured to: determine fault feature information of the modified third container according to the configuration information of the first container and the configuration information of the modified third container, wherein the fault feature information is the content of the configuration information of the modified third container that differs from the configuration information of the first container; repair the fault feature information of the modified third container; and if the repaired third container works normally, record the repair operation in a second fault repair scheme and establish a correspondence between the second fault repair scheme and the second fault cause.
10. The fault handling apparatus according to any one of claims 6 to 9, wherein the communication module is specifically configured to send the first fault repair scheme to a client used by maintenance personnel.
11. A fault handling apparatus, comprising a processor and a communication interface, wherein the processor is configured to implement the fault handling method of any one of claims 1 to 5.
12. A computer readable storage medium storing computer instructions which, when executed, cause the fault handling method of any one of claims 1 to 5 to be carried out.
CN202111039126.6A 2021-09-06 2021-09-06 Fault processing method and device and computer readable storage medium Active CN113778735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111039126.6A CN113778735B (en) 2021-09-06 2021-09-06 Fault processing method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111039126.6A CN113778735B (en) 2021-09-06 2021-09-06 Fault processing method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113778735A (en) 2021-12-10
CN113778735B (en) 2024-06-21

Family

ID=78841282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111039126.6A Active CN113778735B (en) 2021-09-06 2021-09-06 Fault processing method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113778735B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114726101B (en) * 2022-04-25 2024-10-25 广州恒泰电力工程有限公司 Intelligent power distribution terminal monitoring method and system for power utilization control

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915263A (en) * 2015-06-30 2015-09-16 北京奇虎科技有限公司 Process fault processing method and device based on container technology
CN111880981A (en) * 2020-07-30 2020-11-03 北京浪潮数据技术有限公司 Fault repairing method and related device for docker container

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11126494B2 (en) * 2017-10-31 2021-09-21 Paypal, Inc. Automated, adaptive, and auto-remediating system for production environment
CN113328872B (en) * 2020-02-29 2023-03-28 华为技术有限公司 Fault repairing method, device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915263A (en) * 2015-06-30 2015-09-16 北京奇虎科技有限公司 Process fault processing method and device based on container technology
CN111880981A (en) * 2020-07-30 2020-11-03 北京浪潮数据技术有限公司 Fault repairing method and related device for docker container

Also Published As

Publication number Publication date
CN113778735A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
US7392148B2 (en) Heterogeneous multipath path network test system
US8201019B2 (en) Data storage device in-situ self test, repair, and recovery
US10146604B2 (en) Bad block detection and predictive analytics in NAND flash storage devices
CN112148542B (en) Reliability testing method, device and system for distributed storage cluster
CA2780370C (en) Methods and systems for preboot data verification
CN114297666A (en) Cloud deployment automation vulnerability mining system based on fuzzy test
CN110990289B (en) Method and device for automatically submitting bug, electronic equipment and storage medium
US11544048B1 (en) Automatic custom quality parameter-based deployment router
US20170123873A1 (en) Computing hardware health check
CN104067234A (en) In situ processor re-characterization
JP7611927B2 (en) Executing tests in a deterministic order
CN113778735B (en) Fault processing method and device and computer readable storage medium
CN111752824B (en) A test system, device and medium for SDN software
CN110291505A (en) Reduce the recovery time of application
US20100251029A1 (en) Implementing self-optimizing ipl diagnostic mode
CN119668916A (en) Cluster system fault handling method, system, device, equipment, medium and program
CN105027083B (en) Use the recovery routine of diagnostic result
CN107992420A (en) Put forward the management method and system of survey project
CN116521496A (en) Method, system, computer device and storage medium for verifying server performance
CN115033473A (en) Testing method and device based on software downtime location
CN111737130B (en) Public cloud multi-tenant authentication service testing method, device, equipment and storage medium
US12038828B2 (en) Distributed debugging environment for a continuous integration pipeline
CN113987065A (en) Database drift method, system, electronic device and storage medium
CN114090357A (en) A hard disk performance testing method, device, electronic device and storage medium
CN107797915B (en) Fault repairing method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant