CN111913824B - Method for determining data link fault cause and related equipment - Google Patents
- Publication number
- CN111913824B (application number CN202010578137.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/80—Database-specific techniques
Abstract
The invention provides a method for determining the cause of a data link failure, and related equipment. The method comprises the following steps: acquiring file information of a target data file; searching the data link full view for the data link corresponding to the target data file according to the file information; acquiring the data anomaly information corresponding to the target data file in each node system on the data link; generating, from the data anomaly information, a data anomaly vector of the target data file on the corresponding data link; inputting the data anomaly vector into a fault cause decision model corresponding to the data link; and acquiring the cause of the data link failure of the target data file output by the fault cause decision model. The method reduces dependence on the experience of operation and maintenance personnel, accurately determines the root cause of a data link failure, and improves troubleshooting efficiency.
Description
Technical Field
The present invention relates to the field of operation and maintenance technologies, and in particular, to a method for determining a cause of a data link failure and related devices.
Background
In recent years, with the continuous expansion of commercial banking business and the growing adoption of big data applications, the amount of data that a bank's IT (information technology) systems need to process has increased exponentially. As the pressure on the systems along a data link grows, data may fail to be generated or transmitted on time for a variety of reasons, which can severely affect a bank's important fund business, regulatory reporting, management analysis, and so on.
At present, after a data link fails, the cause is usually investigated manually. This approach is passive, and timeliness is hard to guarantee. Troubleshooting can only proceed step by step, following the vine to find the melon: when a downstream system has not received data, its upstream system must be checked. If the upstream system confirms that it supplied the data, the upstream and downstream sides must jointly check whether the transmission tool has a problem. If the upstream system did not supply the data, its own upstream system must be located and checked in turn, and so on until the cause of the fault is found.
Because different data depend on different upstream data and the architectural logic varies in complexity, this approach relies heavily on the experience of operation and maintenance personnel and is limited to faults that have already been exposed. Operation and maintenance personnel sometimes expend a great deal of effort and still cannot find the root cause of a data link failure.
Disclosure of Invention
In order to solve the above technical problems, the embodiments of the present invention provide a method for determining a cause of a data link failure and related equipment, which accurately determine the cause of the data link failure of a target data file by inputting the data anomaly vector of the target data file on its corresponding data link into the fault cause decision model corresponding to that data link, thereby reducing the dependence of troubleshooting on the experience of operation and maintenance personnel.
In a first aspect, an embodiment of the present invention provides a method for determining a cause of a data link failure, where the method includes:
acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
searching a data link corresponding to the target data file from the data link full view according to the file information;
acquiring data abnormality information corresponding to the target data file in each node system on the data link, wherein the data abnormality information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the node system to the intermediate data file is abnormal, whether system resources of the node system in a set time period are abnormal, and whether database indexes of the node system in the set time period are abnormal;
generating a data exception vector of the target data file on a corresponding data link according to the data exception information;
inputting the data anomaly vector into a fault cause decision model corresponding to the data link, wherein the fault cause decision model is obtained by training a random forest model by using a plurality of sample data, and the sample data comprises: a data anomaly vector of a sample data file on the data link, a tag identifying a cause of the fault;
and acquiring the cause of the data link failure of the target data file output by the fault cause decision model.
In one embodiment of the invention, the method further comprises:
monitoring and collecting transmission information of each data file in each node system, wherein the transmission information comprises: data file information of the current data file, upstream data file information on which the current data file depends, and downstream data file information corresponding to the current data file;
and generating the full view of the data link according to the transmission information of each data file.
In one embodiment of the invention, the method further comprises:
and monitoring and recording the data abnormality information of each data file in each node system.
In one embodiment of the present invention, the acquiring the data anomaly information corresponding to the target data file in each node system on the data link includes:
and searching data abnormality information corresponding to the target data file in each node system on the data link from the recorded data abnormality information according to the file information.
In one embodiment of the present invention, the fault cause includes: the name of the failed system and the reason why the failed system failed, wherein the reason why the failed system failed comprises: system resource strain, system database abnormality, system job error, system data not generated, and system transmission failure.
In a second aspect, an embodiment of the present invention provides an apparatus for determining a cause of a data link failure, where the apparatus includes:
the file information acquisition module is used for acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
the data link acquisition module is used for searching the data link corresponding to the target data file from the data link full view according to the file information;
the abnormal information acquisition module is used for acquiring data abnormal information corresponding to the target data file in each node system on the data link, wherein the data abnormal information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the node system to the intermediate data file is abnormal, whether system resources of the node system in a set time period are abnormal, and whether database indexes of the node system in the set time period are abnormal;
the abnormal vector generation module is used for generating a data abnormal vector of the target data file on a corresponding data link according to the data abnormal information;
the abnormal vector input module is used for inputting the data abnormal vector into a fault reason decision model corresponding to the data link, wherein the fault reason decision model is obtained by training a random forest model by using a plurality of sample data, and the sample data comprises: a data anomaly vector of a sample data file on the data link, a tag identifying a cause of the fault;
and the fault reason acquisition module is used for acquiring the reason of the data link fault of the target data file output by the fault reason decision model.
In one embodiment of the invention, the apparatus further comprises:
the transmission information acquisition module is used for monitoring and acquiring the transmission information of each data file in each node system, wherein the transmission information comprises: data file information of the current data file, upstream data file information on which the current data file depends, and downstream data file information corresponding to the current data file;
and the data link full view generation module is used for generating the data link full view according to the transmission information of each data file.
In one embodiment of the invention, the apparatus further comprises:
the data anomaly information recording module is used for monitoring and recording the data anomaly information of each data file in each node system.
In one embodiment of the present invention, the acquiring the data anomaly information corresponding to the target data file in each node system on the data link includes:
and searching data abnormality information corresponding to the target data file in each node system on the data link from the recorded data abnormality information according to the file information.
In one embodiment of the present invention, the fault cause includes: the name of the failed system and the reason why the failed system failed, wherein the reason why the failed system failed comprises: system resource strain, system database abnormality, system job error, system data not generated, and system transmission failure.
In a third aspect, an embodiment of the present invention provides a computer storage medium having stored thereon computer instructions executable by a processor to implement the method for determining a cause of a data link failure according to any of the previous embodiments.
In a fourth aspect, an embodiment of the present invention provides a computer apparatus, including:
a memory having a computer program stored thereon;
a processor configured to execute the computer program to implement the method for determining a cause of a data link failure according to any one of the foregoing embodiments.
Compared with the prior art, the method provided by the embodiment of the invention has the following beneficial technical effects:
According to the method and related equipment for determining the cause of a data link failure provided by the embodiments of the present invention, the data link corresponding to the target data file is found in the data link full view, a data anomaly vector is generated from the data anomaly information corresponding to the target data file in each node system on that data link, and the data anomaly vector is input into the fault cause decision model to determine the cause of the data link failure of the target data file. This reduces the dependence of troubleshooting on the experience of operation and maintenance personnel, accurately determines the root cause of the data link failure of the target data file, and saves the labor and time consumed in troubleshooting.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a data link full view according to an embodiment of the present invention;
Fig. 2 is a flow chart of a method for determining a cause of a data link failure according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a random forest model according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of training an initial decision tree according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an apparatus for determining a cause of a data link failure according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, terms related to the embodiments of the present invention will be briefly described.
Job: the data processing component of an IT system. A job comprises three parts, namely data receiving, data processing, and data transmitting, and processes data files mainly through custom program logic.
Data link: in IT operation and maintenance practice, a virtual line drawn according to the upstream and downstream dependency relationships of data files at each level.
Data link full view: the upstream and downstream dependency graph of multi-level data files in multiple systems, typically a mesh directed graph. Each data file may have a "one-to-one", "many-to-one", or "one-to-many" relationship with its upstream and downstream data files.
Data arrival time and job processing time: each data file has its own business definition and use. To support the specific functions of each IT system, each data file generally has a required latest generation time and arrival time, and the corresponding processing job must be completed within a fixed time range; otherwise, the business is affected.
Fig. 1 is a schematic diagram of a data link full view according to an embodiment of the present invention. As shown in Fig. 1, A, B1, B2, and C are four node systems. The A system is the most upstream supply system and generates multiple data files, some of which are supplied to the B1 system and some to the B2 system. The B1 system and the B2 system each process the received data files through different jobs and then send the processed data files to the downstream C system. The C system processes the received data files through jobs, thereby generating the final data file.
If the target data file does not arrive on time or an error occurs, i.e., the data link of the target data file fails, an alarm may be generated. There are many reasons for the failure of the data link of the target data file, which may be that the file transmission of a certain upstream node system of the data file is problematic, that the job processing process of a certain upstream node system of the data file is problematic, or that the data file in a certain upstream node system of the data file is not generated in time, etc.
In order to determine the root cause of a failure of a data link of a target data file, the present embodiment provides a method for determining the root cause of the failure of the data link. Fig. 2 shows a flow chart of a method of determining a cause of a data link failure according to an embodiment of the invention. As shown in fig. 2, the method for determining a cause of a failure of a data link according to this embodiment includes:
S101: acquiring file information of a target data file, wherein the file information uniquely identifies the target data file.
The target data file may be the final data file to be obtained. The terminal node system of the data link may monitor whether the final data file arrives on time and without error, that is, whether the target data file is abnormal. If it is abnormal, the data link corresponding to the target data file is determined to have failed, and an alarm is generated. When an alarm occurs, the file information of the target data file that triggered the alarm can be obtained from the alarm information. The file information is a unique identifier of the target data file and may be formed by combining the system in which the target data file is located, its path, and its name.
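As a minimal sketch of such an identifier, the following assumes the file information is the combination of system, path, and name described above. The class name `FileInfo`, the alarm field names, and the key format are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical unique identifier of a data file: system + path + name."""
    system: str
    path: str
    name: str

    def key(self) -> str:
        # Join the three parts into one string; the separator is an assumption.
        return f"{self.system}:{self.path.rstrip('/')}/{self.name}"


def file_info_from_alarm(alarm: dict) -> FileInfo:
    # The alarm field names used here are illustrative only.
    return FileInfo(alarm["system"], alarm["path"], alarm["file_name"])


alarm = {"system": "C", "path": "/data/out/", "file_name": "report.dat"}
target = file_info_from_alarm(alarm)
print(target.key())
```

Making the identifier immutable and hashable lets it serve directly as a lookup key in later steps.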
S102: searching the data link full view for the data link corresponding to the target data file according to the file information.
A pre-generated data link full view is acquired, and the data link corresponding to the target data file is looked up in it according to the file information of the target data file.
The data link full view represents the upstream and downstream dependency relationships of the data files in each system. In one implementation of this embodiment, the data link full view may be pre-generated as follows:
Each data file in each system is monitored, and its transmission information is collected. After the transmission information of each data file has been collected, a data link full view representing the upstream and downstream dependency relationships of the data files is generated from it.
The systems may be cross-platform systems, such as Linux, HP-UX, AIX, Windows, and other systems. A system may include multiple hosts, and agent scripts (e.g., shell scripts or Python scripts) may be deployed on each host of each system to collect the transmission information of the data files. For a given system, the collected transmission information of a data file may include: the data file information of the data file itself, for example, the name of the data file, the name of the system where it is located, and its path; the upstream data file information on which the data file depends, for example, the name of the upstream data file, the system where it is located, and its path; and the downstream data file information corresponding to the data file, for example, the name of the downstream data file, the system where it is located, and its path.
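The full-view generation and data link lookup can be sketched as a directed graph built from the collected transmission records. This is only an illustration under assumptions: the record layout, the `(system, path, name)` tuple keys, and the function names are hypothetical, and the sample records loosely follow the A, B1, B2, C example of Fig. 1.

```python
from collections import defaultdict, deque

# Each record mirrors the transmission information described above: the file
# itself, the upstream files it depends on, and its downstream files.
# Keys are (system, path, name) tuples; all sample values are made up.
records = [
    {"file": ("A", "/out", "a1.dat"), "upstream": [],
     "downstream": [("B1", "/out", "b1.dat")]},
    {"file": ("A", "/out", "a2.dat"), "upstream": [],
     "downstream": [("B2", "/out", "b2.dat")]},
    {"file": ("B1", "/out", "b1.dat"), "upstream": [("A", "/out", "a1.dat")],
     "downstream": [("C", "/out", "final.dat")]},
    {"file": ("B2", "/out", "b2.dat"), "upstream": [("A", "/out", "a2.dat")],
     "downstream": [("C", "/out", "final.dat")]},
    {"file": ("C", "/out", "final.dat"),
     "upstream": [("B1", "/out", "b1.dat"), ("B2", "/out", "b2.dat")],
     "downstream": []},
]


def build_full_view(records):
    """Return a mapping file -> set of upstream files (a directed graph)."""
    upstream_of = defaultdict(set)
    for rec in records:
        upstream_of[rec["file"]].update(rec["upstream"])
        for down in rec["downstream"]:
            upstream_of[down].add(rec["file"])
    return upstream_of


def find_data_link(upstream_of, target):
    """Collect the target file and every file it transitively depends on."""
    seen, queue = {target}, deque([target])
    while queue:
        current = queue.popleft()
        for up in upstream_of.get(current, ()):
            if up not in seen:
                seen.add(up)
                queue.append(up)
    return seen


full_view = build_full_view(records)
link = find_data_link(full_view, ("C", "/out", "final.dat"))
print(sorted({system for system, _, _ in link}))
```

Storing only upstream edges is a design choice for this sketch: fault tracing walks from the target file toward its sources, so the reverse direction is not needed here.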
S103: acquiring data abnormality information corresponding to the target data file in each node system on the data link, wherein the data abnormality information comprises: whether the file generation of the intermediate data file corresponding to the target data file in the node system is abnormal, whether the job processing of the node system in the process of generating the intermediate data file is abnormal, whether the file transmission of the node system to the intermediate data file is abnormal, whether the system resources of the node system in a set time period are abnormal, and whether the database index of the node system in the set time period is abnormal.
The data link corresponding to the target data file is a directed data link, and the node systems on the data link may sequentially include a plurality of node systems such as a first-stage node system, a second-stage node system, …, an nth-stage intermediate node system, and the like. The data files of the adjacent node systems have upstream and downstream dependency relationships or corresponding relationships. For a target data file, the data file corresponding to the target data file in each node system on the corresponding data link is an intermediate data file of the target data file. The data link corresponding to the target data file may be determined according to the file information of the target data file, and then the node system names on the data link and the file information of the intermediate data files corresponding to the target data file in the node systems may be determined.
In the process of implementing the embodiments of the present invention, the inventor found that a data link failure of the target data file can have many causes, mainly including:
1. The intermediate data file of the upstream node system is not generated;
2. Job execution in the upstream node system times out because of insufficient system resources, job scheduling congestion, job processing logic errors, abnormal data formats, and the like;
3. File transmission between the upstream node system and the downstream node system is abnormal, or transmission is interrupted by a network failure, so that the intermediate data file generated by the upstream node system is not successfully transmitted to the downstream node system;
4. The downstream node system cannot successfully receive the intermediate data file sent from upstream because of insufficient disk space, insufficient system resources, abnormal transmission components, and the like;
5. A database abnormality in the node system causes file read-write abnormalities.
Therefore, after determining the data link corresponding to the target data file, the node system name on the data link, and the file information of the intermediate data file corresponding to the target data file in the node system, the embodiment may collect the following data anomaly information in the node system on the data link:
1. Whether the file generation of the intermediate data file corresponding to the target data file in the node system is abnormal (i.e., whether file generation is abnormal). For example, whether the generation of the intermediate data file is abnormal can be determined by monitoring whether the intermediate data file has been generated under the specified directory of the node system and by checking its size. If the intermediate data file has not been generated under the specified directory, or the generated intermediate data file has an abnormal size, for example 0 KB, the generation of the intermediate data file is judged to be abnormal. The specified directory may be the directory in which the intermediate data file corresponding to the target data file should be stored in the node system.
In some embodiments, if there are a plurality of intermediate data files corresponding to the target data file in the node system, the information about whether the file generation of each intermediate data file is abnormal may be aggregated. For example, if the intermediate data files corresponding to the target data file in the B system are file 1 and file 2, and the generation of file 1 is normal (value 1) while the generation of file 2 is abnormal (value 0), a logical AND operation may be performed on the values of file 1 and file 2, which determines that the file generation of the intermediate data file corresponding to the target data file in the node system is abnormal (the value after the AND operation is 0).
2. Whether the job processing of the node system in the process of generating the intermediate data file corresponding to the target data file is abnormal. For example, whether the job processing of an intermediate data file corresponding to the target data file in the node system is abnormal may be determined by analyzing the job logs of the node system with a script. When there are a plurality of intermediate data files corresponding to the target data file, the information about whether the job processing of each intermediate data file is abnormal can be aggregated to determine whether the job processing of the node system in the process of generating the intermediate data files is abnormal.
3. Whether the node system's file transmission of the intermediate data file corresponding to the target data file is abnormal. For example, the transmission log may be analyzed with a script to determine whether the file transmission of an intermediate data file corresponding to the target data file in the node system is abnormal. When there are a plurality of intermediate data files corresponding to the target data file, the information about whether the file transmission of each intermediate data file is abnormal may be aggregated to determine whether the node system's file transmission of the intermediate data files is abnormal.
4. Whether the system resources of the node system are abnormal in a set time period. The system resources may include the CPU usage rate, the memory usage rate, the disk IO response time, the file usage rate, and the like. Whether a system resource is abnormal may be determined by checking whether its average value over a set time period is greater than a set threshold. For example, the set time period may be 10:30 to 11:00, and whether CPU resources are strained is determined by checking whether the average CPU usage rate in that period is greater than the set threshold. On the same principle, whether memory resources are strained can be determined by checking whether the memory usage rate in the set time period is greater than the set threshold, whether disk resources are strained can be determined by checking whether the disk IO response time in the set time period is greater than the set threshold, and so on. The abnormality information of the individual system resources can then be aggregated to determine whether the system resources of the node system are abnormal. For example, when any one of the system resources is abnormal, the system resources of the node system may be determined to be abnormal.
The set time period for system resource items such as the CPU usage rate, the memory usage rate, the disk IO response time, and the file usage rate may be determined from the information of the intermediate data file corresponding to the target data file in the node system; for example, it may be set according to the required arrival time of the intermediate data file.
5. Whether the database index of the node system is abnormal in the set time period (i.e., whether the database is abnormal). The database index here refers to database health indicators, which may include whether there is an excessively long SQL statement, whether there is a large transaction, whether a session is blocked, whether there is a deadlock, whether there is an invalid index, and so on. The database index information for the set time period may be obtained through a script, and whether the database of the node system is abnormal may then be determined from it. For example, when any one of the indicator items is abnormal, the database of the node system may be determined to be abnormal. The set time period for collecting the database index may be determined from the information of the intermediate data file corresponding to the target data file in the node system; for example, it may be set according to the job processing time of the intermediate data file.
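Three of the checks above (file generation, system resources, and database indicators) can be sketched as small functions that each return one flag per node system. This is an illustrative sketch under assumptions: the function names, metric names, and threshold values are hypothetical, and the 1 = normal, 0 = abnormal convention is taken from the AND example above. Job log and transmission log analysis are omitted because their formats are not specified.

```python
import os
from statistics import mean

NORMAL, ABNORMAL = 1, 0  # convention from the text: 1 is normal, 0 is abnormal


def file_generation_flag(paths):
    """AND over all intermediate files: each must exist and be non-empty."""
    flags = [
        NORMAL if os.path.isfile(p) and os.path.getsize(p) > 0 else ABNORMAL
        for p in paths
    ]
    return int(all(flags))


def resource_flag(samples, thresholds):
    """Abnormal if the window average of any resource metric exceeds its threshold."""
    for metric, values in samples.items():
        if values and mean(values) > thresholds[metric]:
            return ABNORMAL
    return NORMAL


def database_flag(indicators):
    """Abnormal if any database indicator (e.g. a deadlock) is reported as True."""
    return ABNORMAL if any(indicators.values()) else NORMAL


# Made-up samples for one node system over one set time period.
samples = {"cpu_pct": [92.0, 95.5, 97.0], "mem_pct": [60.0, 62.0]}
thresholds = {"cpu_pct": 90.0, "mem_pct": 85.0}  # assumed threshold values
print(resource_flag(samples, thresholds))
print(database_flag({"deadlock": False, "blocked_session": False}))
print(file_generation_flag(["/no/such/dir/file1.dat"]))
```

Note that `file_generation_flag` returns 1 for an empty list of paths; a real collector would need to decide how to treat a node system with no expected intermediate files.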
In one implementation of this embodiment, data anomaly information such as the file generation information, file transmission information, job processing information, resource information, and database indicator information of each data file in each system may be monitored and recorded at intervals. Thus, once the data link corresponding to the target data file has been found according to the file information of the target data file, the intermediate data files corresponding to the target data file can be looked up in the recorded information, and the data anomaly information corresponding to the target data file can be obtained.
For example, a collection script may be deployed on each host of each system to collect the data anomaly information of each data file in the system at intervals, and the collected data anomaly information of each data file may be recorded in a data anomaly information record table. After the file information of the target data file and the corresponding data link are obtained, the data anomaly information of the target data file on the corresponding data link can be looked up in the data anomaly information record table according to the file information of the target data file.
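The record table and its lookup could be sketched as follows. The in-memory dictionary, the key layout `(file_id, system)`, and all function names are assumptions for illustration; a real deployment would more likely use a database table, and the intermediate data files on each node may carry their own identifiers.

```python
# The five anomaly items recorded per node system (True means abnormal).
FLAGS = (
    "file_generation",
    "job_processing",
    "file_transmission",
    "system_resources",
    "database",
)


def record_anomalies(table, file_id, system, abnormal_items):
    """Store one row of the record table for a data file in a node system."""
    table[(file_id, system)] = {flag: flag in abnormal_items for flag in FLAGS}


def lookup_link_anomalies(table, file_id, link):
    """Fetch the recorded anomaly information for every node system on the data link.

    Node systems without a recorded row map to None.
    """
    return {system: table.get((file_id, system)) for system in link}


table = {}
record_anomalies(table, "file_1", "A", set())
record_anomalies(table, "file_1", "B", {"file_generation", "database"})
print(lookup_link_anomalies(table, "file_1", ["A", "B", "C"]))
```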
S104: generating a data anomaly vector of the target data file on the corresponding data link according to the data anomaly information.
After the data anomaly information is obtained, it can be processed, for example digitized and standardized, to form the data anomaly vector of the target data file on the corresponding data link. The data anomaly vector may be a string of 0s and 1s in which each bit represents one type of data anomaly information in one node system, where 0 may represent abnormal and 1 may represent normal.
For example, if the data link corresponding to file 1 is A -> B -> C, the data anomaly vector of file 1 on that data link may be [1,1,1,1,1,0,1,1,0,0,0,0,0,0,0]. The first 5 bits indicate, for system A, whether file generation, job processing, file transmission, system resources, and the database are abnormal; the middle 5 bits indicate the same five items for system B; and the last 5 bits indicate the same five items for system C. Of course, for the target data file, the file generation information of the end node system C may be left out, that is, the file generation anomaly bit of node system C may be omitted, which yields a 14-bit data anomaly vector.
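The encoding described above can be sketched as follows, using the stated convention that 0 means abnormal and 1 means normal. The function name, the flag names, and the input layout (a set of abnormal items per node system) are assumptions made for the example.

```python
# Bit order within each node system, as listed in the description.
FLAG_ORDER = (
    "file_generation",
    "job_processing",
    "file_transmission",
    "system_resources",
    "database",
)


def build_anomaly_vector(link, anomalies, omit_end_generation=False):
    """Encode per-node anomaly information as a flat 0/1 vector (0 = abnormal, 1 = normal).

    `link` lists the node systems in data-flow order; `anomalies` maps a node
    system to the set of its abnormal items. With `omit_end_generation=True`
    the file generation bit of the end node is left out.
    """
    vector = []
    last = len(link) - 1
    for position, system in enumerate(link):
        abnormal = anomalies.get(system, set())
        for flag in FLAG_ORDER:
            if omit_end_generation and position == last and flag == "file_generation":
                continue
            vector.append(0 if flag in abnormal else 1)
    return vector


link = ["A", "B", "C"]
anomalies = {
    "B": {"file_generation", "system_resources", "database"},
    "C": set(FLAG_ORDER),
}
print(build_anomaly_vector(link, anomalies))
print(build_anomaly_vector(link, anomalies, omit_end_generation=True))
```

With the sample input, the first call reproduces the 15-bit vector of the file 1 example, and the second call drops the file generation bit of system C to give the 14-bit form.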
S105: inputting the data anomaly vector into a fault cause decision model corresponding to the data link, wherein the fault cause decision model is obtained by training a random forest model using a plurality of groups of sample data, and each group of sample data comprises: a data anomaly vector of a sample data file on the data link, and a label identifying the fault cause.
Specifically, for a target data file, the various items of data anomaly information on the corresponding data link may be correlated with one another, so the root cause of the data link failure of the target data file often cannot be determined directly from the data anomaly vector alone. For example, if the data anomaly vector corresponding to the target data file is [0,0,0,1,1,0,0,0,1,1,0,0,0,1], the root cause of the data link failure of the target data file cannot be read off from the vector.
To determine the root cause of the data link failure of the target data file, this embodiment inputs the data anomaly vector of the target data file into the fault cause decision model corresponding to its data link.
The fault cause decision model is obtained in advance by training a random forest model on the data anomaly vectors of sample data files together with their fault cause labels.
A fault cause label may consist of the name of the faulty system and the reason why that system failed. The reasons why a system fails may include: strained system resources, a system database abnormality, a system job error, system data not generated, a system transmission fault, and the like. For example, the fault cause label may be sys_A_gen, which indicates that the fault cause is a file generation fault in system A.
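A small sketch of composing and parsing such labels is shown below. Only the `sys_A_gen` form and the `gen` suffix appear in the description; the other cause codes (`res`, `db`, `job`, `trans`) and the function names are hypothetical choices made for the example.

```python
# Cause codes: "gen" comes from the sys_A_gen example; the rest are assumed.
CAUSES = {
    "res": "system resources strained",
    "db": "system database abnormal",
    "job": "system job error",
    "gen": "system data not generated",
    "trans": "system transmission fault",
}


def make_label(system, cause_code):
    """Compose a fault cause label from a system name and a cause code."""
    if cause_code not in CAUSES:
        raise ValueError(f"unknown cause code: {cause_code}")
    return f"sys_{system}_{cause_code}"


def parse_label(label):
    """Split a label into (system name, human-readable cause)."""
    prefix, _, rest = label.partition("_")
    system, _, cause_code = rest.rpartition("_")
    if prefix != "sys" or not system or cause_code not in CAUSES:
        raise ValueError(f"malformed fault cause label: {label}")
    return system, CAUSES[cause_code]


print(make_label("A", "gen"))
print(parse_label("sys_A_gen"))
```

Splitting the cause code from the right allows system names that themselves contain underscores.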
The random forest model is an ensemble learning model that completes a learning task by constructing and combining multiple decision trees, and it therefore generalizes better and is more accurate than a single decision tree model. Fig. 3 is a schematic diagram of a random forest model according to an embodiment of the present invention. As shown in Fig. 3, in this embodiment a plurality of independent decision trees are generated, the data anomaly vector is input into each decision tree to obtain that tree's decision result, and the final decision result is determined by majority rule (the minority yields to the majority): if the number of decision trees with the same decision result is greater than a set threshold (for example, more than half of the decision trees), that decision result is taken as the final result; otherwise, the decision result of each decision tree is output for manual confirmation. The root cause of the data link failure is thereby determined. Fig. 4 is a schematic diagram of training an initial decision tree.
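The voting step can be sketched as follows, assuming each decision tree has already produced a fault cause label for the input vector. The function name, the `min_fraction` parameter, and the return shape are illustrative assumptions; the training of the trees themselves is not shown.

```python
from collections import Counter


def forest_decision(tree_votes, min_fraction=0.5):
    """Combine per-tree decisions by majority rule.

    Returns (label, None) when the most common label is chosen by more than
    `min_fraction` of the trees; otherwise returns (None, votes) so that the
    individual results can be confirmed manually.
    """
    if not tree_votes:
        raise ValueError("no decision tree results to combine")
    label, count = Counter(tree_votes).most_common(1)[0]
    if count > len(tree_votes) * min_fraction:
        return label, None
    return None, list(tree_votes)


print(forest_decision(["sys_A_gen", "sys_A_gen", "sys_B_res"]))
print(forest_decision(["sys_A_gen", "sys_B_res", "sys_C_db", "sys_A_gen"]))
```

In the second call the most common label has exactly half of the votes, which does not exceed the threshold, so all votes are returned for manual confirmation.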
In a possible implementation manner of this embodiment, different fault cause decision models may be trained through different sample data files, where each fault cause decision model corresponds to one data link. After determining the data link corresponding to the target data file, the data anomaly vector of the target data file can be input into a fault cause decision model corresponding to the data link to determine the root cause of the fault of the data link of the target data file.
S106: acquiring the cause of the data link failure of the target data file output by the fault cause decision model.
After the data anomaly vector of the target data file on the corresponding data link is input into the fault cause decision model corresponding to that data link, the model outputs the fault cause label of the target data file. For example, if the output is sys_A_gen, the root cause of the data link failure of the target data file is a file generation fault in system A.
Fig. 5 is a schematic diagram of an apparatus for determining a cause of a data link failure according to an embodiment of the present invention. As shown in fig. 5, the apparatus 10 for determining a cause of a data link failure according to the present embodiment may include: a file information acquisition module 11, a data link acquisition module 12, an anomaly information acquisition module 13, an anomaly vector generation module 14, an anomaly vector input module 15, and a failure cause acquisition module 16.
The file information obtaining module 11 is configured to obtain file information of a target data file, where the file information uniquely identifies the target data file;
a data link obtaining module 12, configured to find a data link corresponding to the target data file from a data link full view according to the file information;
an anomaly information acquisition module 13, configured to acquire data anomaly information corresponding to the target data file in each node system on the data link, where the data anomaly information includes: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system in a set time period are abnormal, and whether database indicators of the node system in the set time period are abnormal;
an anomaly vector generation module 14, configured to generate a data anomaly vector of the target data file on a data link corresponding to the target data file according to the data anomaly information;
an anomaly vector input module 15, configured to input the data anomaly vector into a fault cause decision model corresponding to the data link, where the fault cause decision model is obtained by training a random forest model using a plurality of groups of sample data, and each group of sample data includes: a data anomaly vector of a sample data file on the data link, and a label identifying the fault cause;
a fault cause acquisition module 16, configured to acquire the cause of the data link failure of the target data file output by the fault cause decision model.
In one implementation of this embodiment, the apparatus 10 further includes:
a transmission information collection module, configured to monitor and collect the transmission information of each data file in each node system, where the transmission information includes: data file information of the current data file itself, upstream data file information on which the current data file depends, and downstream data file information corresponding to the current data file;
and the data link full view generation module is used for generating the data link full view according to the transmission information of each data file.
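As an illustration of how a full view might be built from such transmission information and then queried for a target data file, consider the sketch below. It assumes a simple linear chain in which each data file has at most one upstream file; the record layout (`file`, `system`, `upstream`, `downstream`) and the function names are hypothetical.

```python
def build_full_view(transmission_records):
    """Index the collected transmission records by data file identifier."""
    return {record["file"]: record for record in transmission_records}


def find_data_link(full_view, target_file):
    """Walk upstream from the target file and return the node systems in data-flow order."""
    chain = []
    seen = set()
    current = target_file
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        record = full_view[current]
        chain.append(record["system"])
        upstream = record["upstream"]
        # Assumption: at most one upstream file per data file.
        current = upstream[0] if upstream else None
    return list(reversed(chain))


records = [
    {"file": "f_A", "system": "A", "upstream": [], "downstream": ["f_B"]},
    {"file": "f_B", "system": "B", "upstream": ["f_A"], "downstream": ["f_C"]},
    {"file": "f_C", "system": "C", "upstream": ["f_B"], "downstream": []},
]
view = build_full_view(records)
print(find_data_link(view, "f_C"))
```

A file with several upstream dependencies would need a graph traversal instead of this single-parent walk.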
In one implementation of this embodiment, the apparatus 10 further includes:
the data anomaly information recording module is used for monitoring and recording the data anomaly information of each data file in each node system.
In one implementation of this embodiment, acquiring the data anomaly information corresponding to the target data file in each node system on the data link includes:
searching the recorded data anomaly information, according to the file information, for the data anomaly information corresponding to the target data file in each node system on the data link.
In one implementation of this embodiment, the fault cause includes: the name of the faulty system and the reason why the faulty system failed, where the reason why the faulty system failed includes: strained system resources, a system database abnormality, a system job error, system data not generated, and a system transmission fault.
The apparatus for determining a cause of a data link failure in this embodiment may be used to carry out the technical solution of the method embodiment described above; its implementation principle and technical effect are similar and are not repeated here.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software in combination with a hardware platform. Based on this understanding, all or the part of the technical solution of the present invention that contributes over the background art may be embodied in the form of a software product. The software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
Yet another embodiment of the present invention provides a computer storage medium, such as a hard disk, optical disk, flash memory, floppy disk, magnetic tape, etc., having stored thereon computer readable instructions executable by a processor to implement the method of determining a cause of a data link failure described in any of the above embodiments.
Yet another embodiment of the present invention provides a computer device comprising:
a memory on which a computer program is stored,
a processor that can execute the computer program to implement the method for determining a cause of a data link failure as described in any one of the embodiments above.
The terms and expressions used in the present specification are used as examples only and are not meant to be limiting. It will be appreciated by those skilled in the art that numerous changes may be made to the details of the above-described embodiments without departing from the underlying principles of the disclosed embodiments. The scope of the invention is therefore to be determined only by the following claims, in which all terms are to be understood in their broadest reasonable sense unless otherwise indicated.
Claims (12)
1. A method of determining a cause of a data link failure, the method comprising:
acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
searching a data link corresponding to the target data file from the data link full view according to the file information;
acquiring data abnormality information corresponding to the target data file in each node system on the data link, wherein the data abnormality information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system in a set time period are abnormal, and whether database indicators of the node system in the set time period are abnormal;
generating a data exception vector of the target data file on a corresponding data link according to the data exception information;
inputting the data anomaly vector into a fault cause decision model corresponding to the data link, wherein the fault cause decision model is obtained by training a random forest model by using a plurality of sample data, and the sample data comprises: a data anomaly vector of a sample data file on the data link, a tag identifying a cause of the fault;
and acquiring the reason of the data link failure of the target data file output by the failure reason decision model.
2. The method according to claim 1, wherein the method further comprises:
monitoring and collecting transmission information of each data file in each node system, wherein the transmission information comprises: data file information of the current data file, upstream data file information on which the current data file depends, and downstream data file information corresponding to the current data file;
and generating the full view of the data link according to the transmission information of each data file.
3. The method according to claim 2, wherein the method further comprises:
and monitoring and recording the data abnormality information of each data file in each node system.
4. The method of claim 3, wherein the obtaining data exception information corresponding to the target data file in each node system on the data link comprises:
and searching data abnormality information corresponding to the target data file in each node system on the data link from the recorded data abnormality information according to the file information.
5. The method of claim 1, wherein the cause of the fault comprises: the name of the faulty system and the reason why the faulty system failed, wherein the reason why the faulty system failed comprises: strained system resources, a system database abnormality, a system job error, system data not generated, and a system transmission fault.
6. An apparatus for determining a cause of a data link failure, the apparatus comprising:
the file information acquisition module is used for acquiring file information of a target data file, wherein the file information uniquely identifies the target data file;
the data link acquisition module is used for searching the data link corresponding to the target data file from the data link full view according to the file information;
the abnormal information acquisition module is used for acquiring data abnormal information corresponding to the target data file in each node system on the data link, wherein the data abnormal information comprises: whether file generation of an intermediate data file corresponding to the target data file in the node system is abnormal, whether job processing of the node system in the process of generating the intermediate data file is abnormal, whether file transmission of the intermediate data file by the node system is abnormal, whether system resources of the node system in a set time period are abnormal, and whether database indicators of the node system in the set time period are abnormal;
the abnormal vector generation module is used for generating a data abnormal vector of the target data file on a corresponding data link according to the data abnormal information;
the abnormal vector input module is used for inputting the data abnormal vector into a fault reason decision model corresponding to the data link, wherein the fault reason decision model is obtained by training a random forest model by using a plurality of sample data, and the sample data comprises: a data anomaly vector of a sample data file on the data link, a tag identifying a cause of the fault;
and the fault reason acquisition module is used for acquiring the reason of the data link fault of the target data file output by the fault reason decision model.
7. The apparatus of claim 6, wherein the apparatus further comprises:
the transmission information acquisition module is used for monitoring and acquiring the transmission information of each data file in each node system, wherein the transmission information comprises: data file information of the current data file, upstream data file information on which the current data file depends, and downstream data file information corresponding to the current data file;
and the data link full view generation module is used for generating the data link full view according to the transmission information of each data file.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the data anomaly information recording module is used for monitoring and recording the data anomaly information of each data file in each node system.
9. The apparatus of claim 8, wherein the obtaining the data exception information corresponding to the target data file in each node system on the data link comprises:
and searching data abnormality information corresponding to the target data file in each node system on the data link from the recorded data abnormality information according to the file information.
10. The apparatus of claim 6, wherein the cause of the fault comprises: the name of the faulty system and the reason why the faulty system failed, wherein the reason why the faulty system failed comprises: strained system resources, a system database abnormality, a system job error, system data not generated, and a system transmission fault.
11. A computer storage medium having stored thereon computer instructions executable by a processor to implement the method of any of claims 1-5.
12. A computer device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program to implement the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010578137.0A CN111913824B (en) | 2020-06-23 | 2020-06-23 | Method for determining data link fault cause and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111913824A CN111913824A (en) | 2020-11-10 |
CN111913824B (en) | 2024-03-05 |
Family
ID=73226479
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010578137.0A Active CN111913824B (en) | 2020-06-23 | 2020-06-23 | Method for determining data link fault cause and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111913824B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113672776B (en) * | 2021-08-25 | 2024-04-12 | 中国农业银行股份有限公司 | Fault analysis method and device |
CN113641736B (en) * | 2021-10-13 | 2022-01-25 | 云和恩墨(北京)信息技术有限公司 | Method and device for displaying session blocking source |
CN114356617B (en) * | 2021-11-29 | 2024-03-08 | 苏州浪潮智能科技有限公司 | Error injection testing method, device, system and computing equipment |
CN114676027A (en) * | 2022-03-30 | 2022-06-28 | 中国建设银行股份有限公司 | Data processing method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102611568A (en) * | 2011-12-21 | 2012-07-25 | 华为技术有限公司 | Failure service path diagnosis method and device |
CN108809731A (en) * | 2018-06-28 | 2018-11-13 | 珠海兴业新材料科技有限公司 | A kind of control method dimming optical projection system business datum chain based on subway |
CN109218114A (en) * | 2018-11-12 | 2019-01-15 | 西安微电子技术研究所 | A kind of server failure automatic checkout system and detection method based on decision tree |
CN109298703A (en) * | 2017-07-25 | 2019-02-01 | 富泰华工业(深圳)有限公司 | Fault diagnosis system and method |
CN110493025A (en) * | 2018-05-15 | 2019-11-22 | 中国移动通信集团浙江有限公司 | Method and device for fault root cause diagnosis based on multi-layer directed graph |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3922375B2 (en) * | 2004-01-30 | 2007-05-30 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Anomaly detection system and method |
US7333962B2 (en) * | 2006-02-22 | 2008-02-19 | Microsoft Corporation | Techniques to organize test results |
Non-Patent Citations (1)
Title |
---|
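Yang Bo (杨波). Software fault localization method based on data chains (基于数据链的软件故障定位方法). Journal of Software (软件学报), 2015, Vol. 26, No. 2, 254-268. * |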
Also Published As
Publication number | Publication date |
---|---|
CN111913824A (en) | 2020-11-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||