CN111953544B - Fault detection method, device, equipment and storage medium of server - Google Patents
Fault detection method, device, equipment and storage medium of server
- Publication number
- CN111953544B (application CN202010821134.5A)
- Authority
- CN
- China
- Prior art keywords
- fault
- keywords
- server
- data information
- fault data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
This application discloses a fault detection method for a server, comprising: collecting fault data information generated when the server fails; performing word segmentation on each item of fault data information to obtain corresponding keywords; determining the number of times each keyword matches a preset keyword in a preset word segmentation library; and calculating a fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value. The method makes fault detection more convenient for servers from different manufacturers or running different operating systems, and thereby improves the efficiency of server fault detection. The application also discloses a fault detection apparatus, a device, and a computer-readable storage medium for a server, all of which have the same beneficial effects.
Description
Technical field
The present invention relates to the field of server detection, and in particular to a fault detection method, apparatus, device, and computer-readable storage medium for a server.
Background
With the rapid development of cloud computing, demand for servers grows by the day. Running large numbers of servers continuously over long periods inevitably raises the failure rate, so finding and handling server faults quickly has become a technical problem that engineers need to solve.
At present, each server manufacturer provides a monitoring platform for its own servers to detect their faults. However, because different manufacturers produce different types of servers, and servers run different operating systems, fault detection has to be performed on the monitoring platform that corresponds to each server. In other words, in the prior art the monitoring platforms used for fault detection on different servers vary widely in functionality and are mostly based on proprietary protocols, so they cannot be supervised in a unified way or produce unified output. When fault detection is needed for multiple types of servers and for servers running multiple types of operating systems, the operation is complicated and detection is inefficient.
Therefore, how to make server fault detection more convenient and more efficient is a technical problem that those skilled in the art currently need to solve.
Summary of the invention
In view of this, one object of the present invention is to provide a fault detection method for a server that makes server fault detection more convenient and more efficient; another object is to provide a fault detection apparatus, a device, and a computer-readable storage medium for a server, all of which have the same beneficial effects.
To solve the above technical problem, the present invention provides a fault detection method for a server, comprising:
collecting fault data information generated when the server fails;
performing word segmentation on each item of the fault data information to obtain corresponding keywords;
determining the number of times each of the keywords matches a preset keyword in a preset word segmentation library;
calculating a fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value.
Preferably, the fault data information includes in-band fault data information and out-of-band fault data information; correspondingly, the process of collecting the fault data information generated when the server fails specifically includes:
receiving the in-band fault information sent by the operating system of the server and/or by a preset monitoring platform;
obtaining the in-band fault information by running a preset collection script on the server;
receiving the out-of-band fault data information forwarded by the BMC of the server;
correspondingly, the process of calculating the fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value, specifically includes:
calculating the fault value of the server from the matched keywords, and their match counts, that correspond to the in-band fault data information and to the out-of-band fault data information respectively, and determining the fault condition of the server from the fault value.
Preferably, after collecting the fault data information generated when the server fails, the method further includes:
determining the data format type of the fault data information;
if the fault data information is in text format, storing the fault data information in a database and proceeding to the step of performing word segmentation on each item of the fault data information to obtain corresponding keywords;
if the fault data information is in graphic format, recognizing the text in the fault data information to obtain the fault data information in text format, storing the text-format fault data information in the database, and proceeding to the step of performing word segmentation on each item of the fault data information to obtain corresponding keywords.
Preferably, the method further includes:
presetting a fault alarm upper limit and a fault alarm lower limit;
correspondingly, the process of calculating the fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value, specifically includes:
calculating the fault value of the server from the matched keywords and their match counts;
if the fault value is greater than the fault alarm lower limit and less than the fault alarm upper limit, determining that the server has an automatically repairable fault and starting a preset fault repair program;
if the fault value is greater than the fault alarm upper limit, determining that the server has a fault that cannot be repaired automatically and issuing corresponding alarm information.
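The two-limit rule above can be sketched as a small classifier. This is a minimal illustration, not the patented implementation: the names `FaultCondition` and `classify_fault` are hypothetical, and two behaviours the text leaves open are filled in as labeled assumptions (a value exactly equal to the upper limit is treated as auto-repairable, and a value at or below the lower limit is treated as negligible).

```python
from enum import Enum


class FaultCondition(Enum):
    NEGLIGIBLE = "negligible"                    # assumption: not defined in the text
    AUTO_REPAIRABLE = "auto_repairable"          # start the preset repair program
    NOT_AUTO_REPAIRABLE = "not_auto_repairable"  # issue alarm information


def classify_fault(fault_value: float, lower: float, upper: float) -> FaultCondition:
    """Map a fault value onto a fault condition using the two preset limits."""
    if lower >= upper:
        raise ValueError("lower limit must be smaller than upper limit")
    if fault_value > upper:
        return FaultCondition.NOT_AUTO_REPAIRABLE
    if fault_value > lower:
        # Assumption: fault_value == upper also lands here; the text only
        # covers the strict cases "less than" and "greater than" the upper limit.
        return FaultCondition.AUTO_REPAIRABLE
    return FaultCondition.NEGLIGIBLE
```

A caller would start the repair program on `AUTO_REPAIRABLE` and send an alarm on `NOT_AUTO_REPAIRABLE`.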
Preferably, the process of performing word segmentation on each item of the fault data information to obtain corresponding keywords specifically includes:
determining the language of the fault data information;
if the fault data information is in Chinese, segmenting it with a Chinese word segmentation tool to obtain the corresponding keywords;
if the fault data information is in English, segmenting it with an English word segmentation tool to obtain the corresponding keywords.
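The language-dependent dispatch above might look like the following sketch. The text names no specific segmentation tools, so both segmenters here are simple stand-ins chosen for illustration: a regular-expression tokenizer for English and forward maximum matching against a word list for Chinese. The function names, the CJK range test, and the `max_len` parameter are all assumptions.

```python
import re
from typing import List, Set


def detect_language(text: str) -> str:
    """Return 'zh' if the text contains a CJK unified ideograph, else 'en'."""
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"


def segment_english(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def segment_chinese(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                break  # a single character is always accepted as a fallback
        if piece.strip():
            tokens.append(piece)
        i += size
    return tokens


def segment(text: str, zh_lexicon: Set[str]) -> List[str]:
    """Pick the segmenter that matches the detected language."""
    if detect_language(text) == "zh":
        return segment_chinese(text, zh_lexicon)
    return segment_english(text)
```

In practice a dedicated Chinese segmentation library would replace `segment_chinese`; the stand-in also splits embedded Latin text such as "CPU" into single characters, which a real tool would avoid.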
Preferably, the process of determining the number of times each of the keywords matches a preset keyword in the preset word segmentation library specifically includes:
using the removal-word library in the preset word segmentation library to identify the removal words among the keywords, and deleting those removal words;
using the keyword library in the word segmentation library to match the remaining keywords, and determining the number of times the remaining keywords match the preset keywords.
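The two-stage filtering above can be sketched with a counter. The function name `count_matches` and the use of plain sets for the removal-word library and the keyword library are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Set


def count_matches(keywords: Iterable[str],
                  removal_words: Set[str],
                  preset_keywords: Set[str]) -> Counter:
    """Drop removal words, then count how often each preset keyword occurs."""
    remaining = [word for word in keywords if word not in removal_words]
    return Counter(word for word in remaining if word in preset_keywords)
```

For example, `count_matches(["the", "disk", "error", "disk"], {"the"}, {"disk", "error", "fan"})` counts "disk" twice and "error" once, and ignores "the".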
Preferably, after calculating the fault value of the server from the matched keywords and their match counts and determining the fault condition of the server from the fault value, the method further includes:
sending the fault condition to a target terminal device by email and/or text message.
To solve the above technical problem, the present invention further provides a fault detection apparatus for a server, comprising:
a collection module, configured to collect fault data information generated when the server fails;
a word segmentation module, configured to perform word segmentation on each item of the fault data information to obtain corresponding keywords;
a matching module, configured to determine the number of times each of the keywords matches a preset keyword in a preset word segmentation library;
a determination module, configured to calculate a fault value of the server from the matched keywords and their match counts, and to determine the fault condition of the server from the fault value.
To solve the above technical problem, the present invention further provides a fault detection device for a server, comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of any of the above fault detection methods for a server when executing the computer program.
To solve the above technical problem, the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any of the above fault detection methods for a server are implemented.
The fault detection method for a server provided by the present invention comprises: collecting fault data information generated when the server fails; performing word segmentation on each item of fault data information to obtain corresponding keywords; determining the number of times each keyword matches a preset keyword in a preset word segmentation library; and calculating a fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value. The method thus uses NLP to segment and match the collected fault data information, calculates a fault value from the match results, and then determines the fault condition of the server. Segmentation, matching, and fault determination can all be performed by a single unified computer program, so there is no need to configure a separate fault detection program or monitoring platform for servers from different manufacturers or running different operating systems. The method therefore makes fault detection more convenient for such servers and thereby improves the efficiency of server fault detection.
To solve the above technical problem, the present invention further provides a fault detection apparatus, a device, and a computer-readable storage medium for a server, all of which have the same beneficial effects.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a fault detection method for a server provided by an embodiment of the present invention;
FIG. 2 is a flowchart of another fault detection method for a server provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific storage directory structure provided by an embodiment of the present invention;
FIG. 4 is a structural diagram of a fault detection apparatus for a server provided by an embodiment of the present invention;
FIG. 5 is a structural diagram of a fault detection device for a server provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
The core of the embodiments of the present invention is to provide a fault detection method for a server that makes server fault detection more convenient and more efficient; another core of the present invention is to provide a fault detection apparatus, a device, and a computer-readable storage medium for a server, all of which have the same beneficial effects.
To help those skilled in the art better understand the solution of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments.
FIG. 1 is a flowchart of a fault detection method for a server provided by an embodiment of the present invention. As shown in FIG. 1, the fault detection method for a server includes:
S10: collecting fault data information generated when the server fails;
In this embodiment, the fault data information generated when the server fails must be obtained first, specifically from fault files such as logs, scripts, and fault screenshots. Note that, in practice, the specific way of collecting the server's fault data information is not limited: for example, a preset collection script may be run on the server to obtain the fault data information and return it automatically, or the fault data information obtained and sent by the server's operating system, or by the preset monitoring platform corresponding to the server, may be received directly.
S20: performing word segmentation on each item of fault data information to obtain corresponding keywords;
S30: determining the number of times each keyword matches a preset keyword in a preset word segmentation library.
In this embodiment, after the fault data information is obtained, it is parsed using natural language processing (NLP). Specifically, before NLP parsing is performed, a word segmentation library for NLP must be built. The word segmentation library mainly defines the specialized vocabulary of server faults so that text can be segmented. Each entry in the library needs to include a word name, a part of speech, and a fault index. The word name distinguishes the names of the different faults; the part of speech (noun, adjective, and so on) helps the segmentation tool tell parts of speech apart; and the fault index is a coefficient expressing how strongly a keyword obtained by segmentation contributes to a fault, and is used to calculate the fault value.
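One possible in-memory shape for a library entry is sketched below. The three fields come from the text; the class name `LexiconEntry`, the field names, the helper `build_lexicon`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LexiconEntry:
    word: str            # word name: identifies the fault term
    part_of_speech: str  # e.g. "noun" or "adjective", used by the segmenter
    fault_index: float   # influence coefficient used to compute the fault value


def build_lexicon(entries: Iterable[LexiconEntry]) -> Dict[str, LexiconEntry]:
    """Index entries by word name so that keywords can be looked up directly."""
    return {entry.word: entry for entry in entries}


# Illustrative entries only; the indices are made-up values.
SAMPLE_LEXICON = build_lexicon([
    LexiconEntry("overheat", "noun", 3.0),
    LexiconEntry("degraded", "adjective", 1.5),
])
```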
Note that, in practice, the word segmentation library includes a general lexicon and a dedicated lexicon: the general lexicon is used to parse all server faults in a uniform way, and the dedicated lexicon handles segmentation for special faults. Once the word segmentation library has been built, the fault data information previously collected and stored in the database is read item by item, per server and in chronological order. A word segmentation tool is called to segment each item, and the segmentation results are the keywords. Each keyword is then matched against the preset keywords in the word segmentation library, and each matched keyword and its match count are recorded.
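The read order described above (per server, chronological) and the combination of the two lexicons can be sketched as follows. The record layout `(server_id, timestamp, text)`, the function names, and the rule that dedicated entries override general ones on a name clash are assumptions; the text does not specify them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, str]  # hypothetical layout: (server_id, timestamp, text)


def merge_lexicons(general: Dict[str, float],
                   dedicated: Dict[str, float]) -> Dict[str, float]:
    """Combine word -> fault index maps; dedicated entries win on a clash (assumption)."""
    merged = dict(general)
    merged.update(dedicated)
    return merged


def records_by_server(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Group record texts by server and order each group by timestamp."""
    grouped = defaultdict(list)
    for server_id, timestamp, text in records:
        grouped[server_id].append((timestamp, text))
    return {
        server_id: [text for _, text in sorted(items, key=lambda item: item[0])]
        for server_id, items in grouped.items()
    }
```

Each server's ordered texts would then be passed to the segmentation tool one item at a time.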
S40: calculating the fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value.
Specifically, after the keywords that match preset keywords in the word segmentation library, and their match counts, have been determined, a fault value is calculated for each server, and the fault condition of the server, that is, whether the server is abnormal, is determined from the fault value. For example, with a preset threshold, a fault value greater than the threshold indicates that the server has a fault that needs to be repaired; otherwise, the server's fault can be ignored.
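The text does not give the formula for the fault value. One plausible reading, shown here purely as an assumption, is a weighted sum in which each matched keyword contributes its fault index multiplied by its match count. The function names are hypothetical.

```python
from typing import Mapping


def fault_value(match_counts: Mapping[str, int],
                fault_indices: Mapping[str, float]) -> float:
    """Assumed formula: sum over matched keywords of fault index * match count."""
    return sum(fault_indices[word] * count for word, count in match_counts.items())


def needs_repair(value: float, threshold: float) -> bool:
    """Single-threshold rule from the text: a fault exists only above the threshold."""
    return value > threshold
```

With counts `{"disk": 2, "error": 1}` and indices `{"disk": 1.5, "error": 2.0}`, the assumed formula gives 1.5 × 2 + 2.0 × 1 = 5.0.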
The fault detection method for a server provided by this embodiment of the present invention comprises: collecting fault data information generated when the server fails; performing word segmentation on each item of fault data information to obtain corresponding keywords; determining the number of times each keyword matches a preset keyword in a preset word segmentation library; and calculating a fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value. The method thus uses NLP to segment and match the collected fault data information, calculates a fault value from the match results, and then determines the fault condition of the server. Segmentation, matching, and fault determination can all be performed by a single unified computer program, so there is no need to configure a separate fault detection program or monitoring platform for servers from different manufacturers or running different operating systems. The method therefore makes fault detection more convenient for such servers and thereby improves the efficiency of server fault detection.
Building on the above embodiment, this embodiment further explains and refines the technical solution. Specifically, in this embodiment the fault data information includes in-band fault data information and out-of-band fault data information; correspondingly, the process of collecting the fault data information generated when the server fails specifically includes:
receiving the in-band fault information sent by the operating system of the server and/or by a preset monitoring platform;
obtaining the in-band fault information by running a preset collection script on the server;
receiving the out-of-band fault data information forwarded by the BMC of the server;
correspondingly, the process of calculating the fault value of the server from the matched keywords and their match counts, and determining the fault condition of the server from the fault value, specifically includes:
calculating the fault value of the server from the matched keywords, and their match counts, that correspond to the in-band fault data information and to the out-of-band fault data information respectively, and determining the fault condition of the server from the fault value.
Specifically, in this embodiment the collected fault data information includes in-band fault data information and out-of-band fault data information. In-band fault information is collected in one of two ways. In the first, an information forwarding server passively receives the fault data information sent by the server's operating system or by the server's preset monitoring platform. In the second, the server's operating system is accessed directly and a preset collection script is run on it; the script collects the server's in-band fault information and returns it automatically. Out-of-band fault information is collected passively: the information forwarding server receives the fault data information forwarded by the server's BMC.
In other words, in this embodiment the fault data information has two parts, in-band and out-of-band. The keywords in the in-band fault data information that match the preset word segmentation library, and their match counts, are determined separately from those in the out-of-band fault data information. The fault value of the server is then calculated from the matched keywords and match counts of both parts, and the fault condition of the server is determined from the fault value.
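A sketch of combining the two sources is given below. How the in-band and out-of-band results are combined is not specified in the text; this example assumes that match counts are simply added with equal weight and then scored with the same assumed weighted sum of fault index times match count. The function name is hypothetical.

```python
from collections import Counter
from typing import Mapping


def combined_fault_value(in_band: Mapping[str, int],
                         out_of_band: Mapping[str, int],
                         fault_indices: Mapping[str, float]) -> float:
    """Add match counts from both sources, then apply the assumed weighted sum."""
    total = Counter(in_band) + Counter(out_of_band)
    # Keywords without a fault index contribute nothing (assumption).
    return sum(fault_indices.get(word, 0.0) * count for word, count in total.items())
```

A different design could weight out-of-band (BMC) evidence more heavily than in-band evidence; the text leaves that choice open.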
Thus, in this embodiment, obtaining both in-band and out-of-band fault data information and calculating the server's fault value from both improves the accuracy of the calculated fault value, and therefore the accuracy of server fault detection.
Building on the above embodiment, this embodiment further explains and refines the technical solution. Specifically, after collecting the fault data information generated when the server fails, this embodiment further includes:
determining the data format type of the fault data information;
if the fault data information is in text format, storing the fault data information in a database and proceeding to the step of performing word segmentation on each item of fault data information to obtain corresponding keywords;
if the fault data information is in graphic format, recognizing the text in the fault data information to obtain the fault data information in text format, storing the text-format fault data information in the database, and proceeding to the step of performing word segmentation on each item of fault data information to obtain corresponding keywords.
Specifically, in this embodiment, after the fault data information generated when the server fails has been collected, its data format type is identified first. The data format types include text format and graphic format: logs and script files are generally in text format, and fault screenshots are generally in graphic format.
If the fault data information is in text format, it is stored in the database and the method proceeds to the step of performing word segmentation on each item of fault data information to obtain corresponding keywords. If the fault data information is in graphic format, OCR (Optical Character Recognition) is first used to recognize the text in it and produce the fault data information in text format; the text-format fault data information is then stored in the database, and the method proceeds to the same word segmentation step.
可见,本实施例的方法,能够针对文本格式和图形格式的故障数据信息进行处理,增加故障数据信息的多样性和丰富性,基于更多样性的故障数据信息确定出故障值以对服务器进行故障检测,能够进一步提高故障检测的准确度。It can be seen that the method of this embodiment can process the fault data information in text format and graphic format, increase the diversity and richness of fault data information, and determine the fault value based on more diverse fault data information for server Fault detection can further improve the accuracy of fault detection.
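The format check and storage step can be sketched as follows. This is a minimal illustration, not the patented implementation: the magic-byte test, the `fault_data` table, and the injected `ocr` callable (here a fake that returns fixed text) are all assumptions introduced for the example; a real deployment would plug in an actual OCR engine.

```python
import sqlite3
from typing import Callable

# Leading bytes of a few common image formats (an illustrative subset).
IMAGE_SIGNATURES = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a")


def is_graphic(payload: bytes) -> bool:
    """Treat the payload as graphic format if it starts with a known image signature."""
    return payload.startswith(IMAGE_SIGNATURES)


def store_fault_data(conn: sqlite3.Connection, server: str, payload: bytes,
                     ocr: Callable[[bytes], str]) -> str:
    """Convert one fault record to text (via OCR if needed), store it, return the text."""
    if is_graphic(payload):
        text = ocr(payload)  # graphic format: recognize the text first
    else:
        text = payload.decode("utf-8", errors="replace")  # text format: store as-is
    conn.execute("INSERT INTO fault_data (server, content) VALUES (?, ?)", (server, text))
    return text


def fake_ocr(image_bytes: bytes) -> str:
    """Hypothetical stand-in for a real OCR engine; always returns the same text."""
    return "kernel panic"


# Hypothetical in-memory table used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fault_data (server TEXT, content TEXT)")
log_text = store_fault_data(conn, "server01", b"ERROR: disk failure", fake_ocr)
shot_text = store_fault_data(conn, "server01", b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, fake_ocr)
```

Both records end up in the database as text, so the later word segmentation step does not need to know which format a record arrived in.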
On the basis of the above embodiments, this embodiment further describes and optimizes the technical solution. Specifically, this embodiment further includes:
Presetting a fault alarm upper limit and a fault alarm lower limit;
Correspondingly, the process of calculating the fault value of the server according to the matched keywords and the corresponding numbers of matches, and determining the fault condition of the server according to the fault value, specifically includes:
Calculating the fault value of the server according to the matched keywords and the corresponding numbers of matches;
If the fault value is greater than the fault alarm lower limit and less than the fault alarm upper limit, determining that the server has an automatically repairable fault and starting a preset fault repair program;
If the fault value is greater than the fault alarm upper limit, determining that the server has a fault that cannot be automatically repaired and issuing corresponding alarm information.
Specifically, in this embodiment, a fault alarm upper limit and a fault alarm lower limit are preset. After the fault value of the server is calculated, it is compared with the upper limit and the lower limit. If the fault value is greater than the lower limit and less than the upper limit, that is, it has exceeded the lower limit but not yet reached the upper limit, the current fault of the server is automatically repairable, so the preset fault repair program is started to self-repair the server. If the fault value is greater than the upper limit, the current fault cannot be automatically repaired, so corresponding alarm information is issued to notify the person in charge of the server to repair the fault manually. If the fault value is less than the lower limit, the current fault can be ignored, so collection of the server's fault data information continues.
It can be seen that this embodiment further analyzes the fault condition of the server by comparing the calculated fault value with the preset fault alarm upper and lower limits, and starts the preset fault repair program to self-repair the server when the fault is automatically repairable. This reduces the troubleshooting and repair operations that technicians must perform, and hence their workload.
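The three-way comparison above can be sketched as a small classifier. The enum and function names are hypothetical. The text only states strict inequalities, so the handling of a fault value exactly equal to a limit is an assumption: here such values fall through to the last branch.

```python
from enum import Enum


class FaultCondition(Enum):
    IGNORABLE = "ignorable"                      # keep collecting fault data
    AUTO_REPAIRABLE = "auto_repairable"          # start the preset repair program
    NOT_AUTO_REPAIRABLE = "not_auto_repairable"  # raise an alarm for manual repair


def classify_fault(fault_value: float, lower: float, upper: float) -> FaultCondition:
    """Compare a fault value with the preset lower and upper alarm limits."""
    if lower >= upper:
        raise ValueError("lower limit must be below upper limit")
    if lower < fault_value < upper:
        return FaultCondition.AUTO_REPAIRABLE
    if fault_value > upper:
        return FaultCondition.NOT_AUTO_REPAIRABLE
    # Assumption: values equal to either limit are not covered by the text
    # and are treated like values below the lower limit.
    return FaultCondition.IGNORABLE
```

A caller would map each returned condition to an action, for example starting the repair program or sending the alarm.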
On the basis of the above embodiments, this embodiment further describes and optimizes the technical solution. Specifically, in this embodiment, the process of performing word segmentation on each piece of fault data information to obtain the corresponding keywords specifically includes:
Determining the language type of the fault data information;
If the fault data information is in Chinese, performing word segmentation on it with a Chinese word segmentation tool to obtain the corresponding keywords;
If the fault data information is in English, performing word segmentation on it with an English word segmentation tool to obtain the corresponding keywords.
Specifically, in this embodiment, when word segmentation is performed on each piece of fault data information, the language type of the information is first analyzed and determined. The language types include Chinese and English, so the word segmentation tool matching the language must be used to segment the fault data information and obtain the corresponding keywords. In a preferred implementation, the Chinese word segmentation tool may be Ansj and the English word segmentation tool may be NLTK; this embodiment does not limit the specific word segmentation tool used.
It can be seen that this embodiment distinguishes the language type of the fault data information and segments it with the tool corresponding to that language type. This further improves the accuracy of the keywords obtained by segmentation, which in turn improves the accuracy of the calculated fault value and of server fault detection.
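The language dispatch can be sketched as below. The detection rule (any CJK ideograph means Chinese) is a simplifying assumption, and both tokenizers are deliberately naive stand-ins so the example runs without third-party packages; in practice a real segmenter such as `nltk.word_tokenize` for English, or a Chinese segmenter, would be passed in instead.

```python
import re
from typing import Callable, Dict, List

CJK = re.compile(r"[\u4e00-\u9fff]")


def detect_language(text: str) -> str:
    """Heuristic (assumption): text containing any CJK ideograph is Chinese."""
    return "zh" if CJK.search(text) else "en"


def simple_english_tokenizer(text: str) -> List[str]:
    """Stand-in English tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def simple_chinese_tokenizer(text: str) -> List[str]:
    """Stand-in only: one token per ideograph, which is NOT real word segmentation."""
    return CJK.findall(text)


TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "en": simple_english_tokenizer,
    "zh": simple_chinese_tokenizer,
}


def segment(text: str, tokenizers: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Pick the tokenizer that matches the detected language and apply it."""
    return tokenizers[detect_language(text)](text)
```

Keeping the tokenizers in a dictionary means the segmentation tool for either language can be swapped without touching the dispatch logic, in line with the statement that the specific tool is not limited.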
On the basis of the above embodiments, this embodiment further describes and optimizes the technical solution. Specifically, in this embodiment, the process of determining the number of matches between each keyword and the preset keywords in the preset word segmentation library specifically includes:
Using the removal-word library in the preset word segmentation library to identify the removal words among the keywords, and deleting the removal words;
Using the keyword library in the word segmentation library to match the remaining keywords, and determining the number of matches between the remaining keywords and the preset keywords.
Specifically, in this embodiment, a removal-word library is additionally set up when the word segmentation library is set up. The removal-word library holds preset removal words, which include words that are irrelevant to determining whether the server is faulty and words that would interfere with or blur that determination. Each keyword of the fault data information is matched against the preset removal words in the removal-word library to identify the removal words among the keywords, and the matched removal words are deleted to leave the remaining keywords. The remaining keywords are then matched against the keyword library in the word segmentation library to determine which of them match the preset keywords in the keyword library, together with the corresponding numbers of matches.
In other words, this embodiment first filters the removal words out of the keywords and then matches the remaining keywords against the keyword library in the word segmentation library, determining from the remaining keywords those that match the preset keywords and the corresponding numbers of matches. This improves the accuracy of the matched keywords and their match counts, which in turn improves the accuracy of the calculated fault value and of server fault detection.
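The two-stage matching described above can be sketched with plain sets and a counter. The function name and the sample words are hypothetical; the libraries are modeled as in-memory sets purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_keyword_matches(keywords: Iterable[str],
                          removal_words: Set[str],
                          preset_keywords: Set[str]) -> Dict[str, int]:
    """Drop removal words first, then count how often each preset keyword occurs."""
    remaining = [word for word in keywords if word not in removal_words]
    return dict(Counter(word for word in remaining if word in preset_keywords))


# Hypothetical sample data.
segmented = ["the", "disk", "error", "disk", "fan", "ok"]
matches = count_keyword_matches(
    segmented,
    removal_words={"the", "ok"},
    preset_keywords={"disk", "error", "timeout"},
)
```

Here `matches` maps each matched keyword to its number of matches; words such as "fan" that survive the removal stage but are not preset keywords are simply not counted.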
On the basis of the above embodiments, this embodiment further describes and optimizes the technical solution. Specifically, after calculating the fault value of the server according to the matched keywords and the corresponding numbers of matches, and determining the fault condition of the server according to the fault value, this embodiment further includes:
Sending the fault condition to a target terminal device by email and/or SMS.
Specifically, in this embodiment, the target terminal device that receives the fault condition is set in advance. After the fault value of the server is calculated according to the matched keywords and the corresponding numbers of matches and the fault condition of the server is determined from the fault value, the fault condition is sent to the target terminal device by email and/or SMS, so that technicians can obtain the fault detection result for the server remotely through the target terminal device. This further improves the convenience of server fault detection.
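For the email path, a notification message could be assembled as in the sketch below. The subject layout, body wording, and addresses are assumptions made for illustration; actually sending the message is left out and would depend on the mail server in use.

```python
from email.message import EmailMessage


def build_fault_email(server: str, fault_value: float, condition: str,
                      sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an email describing a server's fault condition."""
    msg = EmailMessage()
    msg["Subject"] = f"[fault] {server}: {condition}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Server: {server}\n"
        f"Fault value: {fault_value}\n"
        f"Fault condition: {condition}\n"
    )
    return msg


# Hypothetical addresses; sending could be done with
# smtplib.SMTP(host).send_message(alert) against a real mail server.
alert = build_fault_email("server01", 12.5, "not automatically repairable",
                          "monitor@example.com", "ops@example.com")
```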
To help those skilled in the art better understand the technical solutions of this application, the technical solutions in the embodiments are described in detail below in combination with a practical application scenario. FIG. 2 is a flowchart of another server fault detection method provided by an embodiment of the present invention. As shown in FIG. 2, taking the target server as the executing entity and fault detection on multiple client servers as an example, the specific process of the server fault detection method is as follows:
1. Preparation stage
Enter the device information of each client server that requires fault detection, including the server name, the server address (IP), the fault alarm upper limit and the fault alarm lower limit, into the server information table. The table structure is as follows:
Table 1 Server information table
Create the NLP word segmentation library, which includes a general lexicon and an exclusive lexicon. Both lexicons contain a keyword library and a removal-word library; preset keywords are added to the keyword library and preset removal words are added to the removal-word library. The general lexicon stores general-purpose fault identification keywords, and the exclusive lexicon holds fault identification keywords that the general lexicon cannot cover. Each keyword must be entered together with its part of speech (such as noun or adjective) and its fault index. Taking Ansj Chinese word segmentation as an example, manual.dic and special.dic must be created for the general lexicon and the exclusive lexicon, and both lexicons must be loaded into the configuration file, while the fault index of each keyword must be stored separately in a database table.
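One way to picture the lexicon entries is the sketch below, which models each keyword with its part of speech and fault index and merges the general and exclusive lexicons. The class, the field names, the sample values, and the rule that an exclusive entry overrides a general entry for the same word are all assumptions for illustration; the text does not say how conflicts are resolved, and this is not the Ansj dictionary file format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LexiconEntry:
    word: str
    part_of_speech: str   # e.g. "noun", "adjective"
    fault_index: float    # weight used later when computing the fault value


def merge_lexicons(general: Iterable[LexiconEntry],
                   exclusive: Iterable[LexiconEntry]) -> Dict[str, LexiconEntry]:
    """Combine both lexicons; exclusive entries win on conflict (assumption)."""
    merged = {entry.word: entry for entry in general}
    merged.update({entry.word: entry for entry in exclusive})
    return merged


# Hypothetical sample entries.
general_lexicon = [LexiconEntry("error", "noun", 1.0), LexiconEntry("timeout", "noun", 0.5)]
exclusive_lexicon = [LexiconEntry("timeout", "noun", 2.0), LexiconEntry("ecc", "noun", 3.0)]
lexicon = merge_lexicons(general_lexicon, exclusive_lexicon)
```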
2. Information collection
The collected fault data information includes in-band fault data information and out-of-band fault data information. Specifically, the target address of the information forwarding server is configured in advance as the IP of the server where the information collection module resides, that is, the IP of the target server. The information collection program is then started; it automatically and periodically delivers the information collection script to each client server while keeping the link to the information forwarding server open, and waits to receive fault data information. After receiving the command to run the information collection script, a client server runs the script automatically and returns the execution result to the target server. When a client server fails, the corresponding fault data information is obtained through the client server's operating system or the corresponding preset monitoring platform and forwarded to the target server through the information forwarding server. When the BMC of a client server detects fault data information, it forwards that information to the target server through the information forwarding server.
FIG. 3 is a schematic diagram of a specific storage directory structure provided by an embodiment of the present invention. The collected fault data information is stored as files: in-band fault data information is stored in the in_band folder and out-of-band fault data information in the out_of_band folder. In both cases the information is stored per server (for example, server01), and faults of the same server are stored per timestamp (for example, 1594532194).
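The directory layout described above (band, then server, then timestamp) can be sketched with `pathlib`. The helper names and the file name `messages.log` are hypothetical; only the folder names `in_band` and `out_of_band` and the sample server and timestamp come from the text.

```python
import tempfile
from pathlib import Path

BANDS = ("in_band", "out_of_band")


def fault_dir(root: str, band: str, server: str, timestamp: int) -> Path:
    """Return <root>/<band>/<server>/<timestamp> for one fault occurrence."""
    if band not in BANDS:
        raise ValueError(f"band must be one of {BANDS}")
    return Path(root) / band / server / str(timestamp)


def save_fault_file(root: str, band: str, server: str, timestamp: int,
                    filename: str, data: bytes) -> Path:
    """Write one collected fault file into its band/server/timestamp folder."""
    directory = fault_dir(root, band, server, timestamp)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    path.write_bytes(data)
    return path


# Usage with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_fault_file(tmp, "in_band", "server01", 1594532194, "messages.log", b"ERROR")
    relative_parts = saved.relative_to(tmp).parts
    saved_exists = saved.exists()
```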
3. Data storage
Start the data storage program and enter the storage directory of the fault data information.
Scan the in_band folder and the out_of_band folder separately and obtain the list of all folders directly under each. Starting from the first folder, enter the server01 directory, scan the timestamp folders under it, and read the files in each directory to obtain the fault data information and identify its data format type. If it is in text format, write the fault data information directly into the server fault information table. If it is in graphic format, call OCR to recognize the text in the graphic-format fault data information, obtain the fault data information in text format, and store it in the server fault information table. Note that when fault data information is stored in the server fault information table, it must be stored according to whether it is in-band or out-of-band fault data information.
Table 2 Server fault information table
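The scan over the storage directory can be sketched as a generator that walks band, server, and timestamp folders in order. The function name and the sample file names are hypothetical; format detection and database writes are left to the caller.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_fault_files(root: str) -> Iterator[Tuple[str, str, str, Path]]:
    """Yield (band, server, timestamp, file path) for every stored fault file."""
    for band in ("in_band", "out_of_band"):
        band_dir = Path(root) / band
        if not band_dir.is_dir():
            continue  # nothing collected for this band yet
        for server_dir in sorted(p for p in band_dir.iterdir() if p.is_dir()):
            for ts_dir in sorted(p for p in server_dir.iterdir() if p.is_dir()):
                for file_path in sorted(p for p in ts_dir.iterdir() if p.is_file()):
                    yield band, server_dir.name, ts_dir.name, file_path


# Usage with a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("in_band", "server01", "1594532194", "messages.log"),
               ("out_of_band", "server01", "1594532194", "sel.png")]
    for band, server, ts, name in samples:
        folder = Path(tmp) / band / server / ts
        folder.mkdir(parents=True)
        (folder / name).write_bytes(b"x")
    found = [(b, s, t, p.name) for b, s, t, p in iter_fault_files(tmp)]
```

Because each yielded tuple carries the band name, the caller can store in-band and out-of-band records separately, as the text requires.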
4. NLP analysis
Start NLP analysis and read the entries of the server fault information table one by one. According to the language type of the fault data information read, call a Chinese word segmentation tool (for example, Ansj) or an English word segmentation tool (for example, NLTK) to segment the in-band fault data information and the out-of-band fault data information separately, and save the segmentation results (keywords) to the fault information word segmentation result table. The table structure is as follows:
Table 3 Fault information word segmentation result table
Read the fault information word segmentation result table and match the contents of its two columns, the in-band segmentation result and the out-of-band segmentation result, against the preset word segmentation library. Specifically, matching is performed with the general lexicon and the exclusive lexicon, the removal words among the keywords are discarded, and the matched keywords and the number of matches for each keyword are saved to the fault matching result table. The table structure is as follows:
Table 4 Fault matching result table
5. Fault calculation
Start the fault calculation program. Working per client server, the program reads the fault matching result table and calculates the fault value of the server. In this embodiment, the fault value of the server is calculated as follows:
Here, f_m denotes the fault value of the server; a denotes the number of keywords in the in-band fault data information that match preset keywords in the word segmentation library; b denotes the number of keywords in the out-of-band fault data information that match preset keywords in the word segmentation library; c (c = a + b) denotes the total number of keywords in the in-band and out-of-band fault data information that match preset keywords in the word segmentation library; δ denotes the coefficient level of in-band faults; γ denotes the coefficient level of out-of-band faults; n_i denotes the number of matches of keyword i among the a matched keywords; n_j denotes the number of matches of keyword j among the b matched keywords; x_i denotes the matching coefficient of keyword i among the a matched keywords; x_j denotes the matching coefficient of keyword j among the b matched keywords; and x_k (x_k = x_i ∪ x_j) denotes the matching coefficient of keyword k (k = i ∪ j) among the c matched keywords.
Store the calculated current fault value of the client server in the server information table (Table 1).
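The formula itself is not reproduced in this text, so the sketch below is not the patent's formula. It is only a hypothetical illustration of how the quantities defined above could be combined: a weighted sum in which each match count is multiplied by its keyword's matching coefficient, and the in-band and out-of-band totals are scaled by δ and γ. It does not use c or x_k, and all sample numbers are made up.

```python
from typing import Dict


def weighted_fault_value(in_band: Dict[str, int], out_of_band: Dict[str, int],
                         coefficients: Dict[str, float],
                         delta: float, gamma: float) -> float:
    """Hypothetical score: delta * sum(n_i * x_i) + gamma * sum(n_j * x_j)."""
    in_score = sum(count * coefficients[word] for word, count in in_band.items())
    out_score = sum(count * coefficients[word] for word, count in out_of_band.items())
    return delta * in_score + gamma * out_score


# Made-up match counts and coefficients, for illustration only.
fault_value = weighted_fault_value(
    in_band={"disk": 2, "error": 1},
    out_of_band={"fan": 3},
    coefficients={"disk": 1.5, "error": 2.0, "fan": 0.5},
    delta=1.0,
    gamma=2.0,
)
```

With these numbers the in-band part is 2 × 1.5 + 1 × 2.0 = 5.0 and the out-of-band part is 3 × 0.5 = 1.5, giving 1.0 × 5.0 + 2.0 × 1.5 = 8.0.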
6. Result push
Start the result push program and read, from the server information table, the fault alarm upper limit f_max(m) and the fault alarm lower limit f_min(m) entered for each client server. Read the current fault value f_m of the client server and determine whether (f_max(m) > f_m) && (f_m > f_min(m)) holds. If it does, the client server has an automatically repairable fault, so the preset fault repair program is started to repair the fault automatically. If f_max(m) < f_m holds, the current fault of the server cannot be automatically repaired, so alarm information is sent to the target terminal device by email and/or SMS to notify the person in charge of the server to repair the fault manually. If neither condition holds, that is, f_min(m) > f_m, fault data information of the client server continues to be acquired.
The server fault detection method provided by this embodiment of the present invention uses NLP to perform word segmentation and matching on the collected fault data information, calculates a fault value from the matching result, and then determines the fault condition of the server. The segmentation, matching and fault determination can all be carried out by a single computer program, with no need to set up a separate fault detection program or monitoring platform for servers from different vendors or running different operating systems. The method therefore makes fault detection more convenient across such servers and improves the efficiency of server fault detection.
The embodiments of the server fault detection method provided by the present invention are described in detail above. The present invention also provides a server fault detection apparatus, a device and a computer-readable storage medium corresponding to the method. Because the embodiments of the apparatus, device and computer-readable storage medium correspond to the embodiments of the method, refer to the description of the method embodiments for them; they are not repeated here.
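FIG. 4 is a structural diagram of a server fault detection apparatus provided by an embodiment of the present invention. As shown in FIG. 4, the server fault detection apparatus includes:
An acquisition module 41, configured to collect the fault data information generated when the server fails;
A word segmentation module 42, configured to perform word segmentation on each piece of fault data information to obtain the corresponding keywords;
A matching module 43, configured to determine the number of matches between each keyword and the preset keywords in the preset word segmentation library;
A determining module 44, configured to calculate the fault value of the server according to the matched keywords and the corresponding numbers of matches, and to determine the fault condition of the server according to the fault value.
The server fault detection apparatus provided by this embodiment of the present invention has the beneficial effects of the server fault detection method described above.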
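The four modules can be pictured as stages of one pipeline, as in the sketch below. The class and field names are hypothetical, each module is modeled as a plain callable, and the toy lambdas stand in for real collection, segmentation, matching and determination logic.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class FaultDetector:
    collect: Callable[[], Iterable[str]]          # acquisition module
    segment: Callable[[str], List[str]]           # word segmentation module
    match: Callable[[List[str]], Dict[str, int]]  # matching module
    determine: Callable[[Dict[str, int]], str]    # determining module

    def run(self) -> str:
        """Collect records, segment and match each one, then decide the fault condition."""
        totals: Counter = Counter()
        for record in self.collect():
            totals.update(self.match(self.segment(record)))
        return self.determine(dict(totals))


# Toy wiring with made-up data and rules.
detector = FaultDetector(
    collect=lambda: ["disk error", "disk timeout ok"],
    segment=str.split,
    match=lambda words: dict(Counter(w for w in words if w in {"disk", "error"})),
    determine=lambda counts: "faulty" if sum(counts.values()) >= 3 else "healthy",
)
result = detector.run()
```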
In a preferred implementation, the acquisition module specifically includes:
A first acquisition sub-module, configured to receive in-band fault information sent by the operating system of the server and/or a preset monitoring platform;
A second acquisition sub-module, configured to obtain in-band fault information by running a preset collection script on the server;
A third acquisition sub-module, configured to receive out-of-band fault data information forwarded by the BMC of the server;
Correspondingly, the matching module specifically includes:
A determining sub-module, configured to calculate the fault value of the server according to the matched keywords and the corresponding numbers of matches for the in-band fault data information and the out-of-band fault data information respectively, and to determine the fault condition of the server according to the fault value.
In a preferred implementation, the server fault detection apparatus further includes:
A judging module, configured to determine the data format type of the fault data information;
A first execution module, configured to, if the fault data information is in text format, store the fault data information in the database and invoke the word segmentation module;
A second execution module, configured to, if the fault data information is in graphic format, recognize the text in the fault data information to obtain fault data information in text format, store the text-format fault data information in the database, and invoke the word segmentation module.
In a preferred implementation, the server fault detection apparatus further includes:
An alarm value setting module, configured to preset the fault alarm upper limit and the fault alarm lower limit;
Correspondingly, the determining module specifically includes:
A calculation sub-module, configured to calculate the fault value of the server according to the matched keywords and the corresponding numbers of matches;
A third execution module, configured to, if the fault value is greater than the fault alarm lower limit and less than the fault alarm upper limit, determine that the server has an automatically repairable fault and start the preset fault repair program;
A fourth execution module, configured to, if the fault value is greater than the fault alarm upper limit, determine that the server has a fault that cannot be automatically repaired and issue corresponding alarm information.
In a preferred implementation, the server fault detection apparatus further includes:
A sending module, configured to send the fault condition to the target terminal device by email and/or SMS.
FIG. 5 is a structural diagram of a server fault detection device provided by an embodiment of the present invention. As shown in FIG. 5, the server fault detection device includes:
A memory 51, configured to store a computer program;
A processor 52, configured to implement the steps of the server fault detection method described above when executing the computer program.
The server fault detection device provided by this embodiment of the present invention has the beneficial effects of the server fault detection method described above.
To solve the above technical problem, the present invention further provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the server fault detection method described above are implemented.
The computer-readable storage medium provided by this embodiment of the present invention has the beneficial effects of the server fault detection method described above.
The server fault detection method, apparatus, device and computer-readable storage medium provided by the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make a number of improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of the present invention.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010821134.5A CN111953544B (en) | 2020-08-14 | 2020-08-14 | Fault detection method, device, equipment and storage medium of server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010821134.5A CN111953544B (en) | 2020-08-14 | 2020-08-14 | Fault detection method, device, equipment and storage medium of server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111953544A CN111953544A (en) | 2020-11-17 |
CN111953544B true CN111953544B (en) | 2023-04-07 |
Family
ID=73342966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010821134.5A Active CN111953544B (en) | 2020-08-14 | 2020-08-14 | Fault detection method, device, equipment and storage medium of server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111953544B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657096A (en) * | 2021-08-24 | 2021-11-16 | 北京来也网络科技有限公司 | Abnormal service data processing method, device, equipment and medium based on RPA and AI |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101499064A (en) * | 2008-02-01 | 2009-08-05 | 华为技术有限公司 | Method and apparatus for building pattern matching state machine |
CN108182523A (en) * | 2017-12-26 | 2018-06-19 | 新疆金风科技股份有限公司 | The treating method and apparatus of fault data, computer readable storage medium |
CN109271272B (en) * | 2018-10-15 | 2022-05-17 | 江苏物联网研究发展中心 | Big data assembly fault auxiliary repair system based on unstructured log |
CN109902153B (en) * | 2019-04-02 | 2020-11-06 | 杭州安脉盛智能技术有限公司 | Equipment fault diagnosis method and system based on natural language processing and case reasoning |
2020-08-14: Application CN202010821134.5A filed in CN; patent CN111953544B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN111953544A (en) | 2020-11-17 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN103761173A (en) | Log based computer system fault diagnosis method and device | |
CN109710518A (en) | Script review method and device | |
CN111581057B (en) | General log analysis method, terminal device and storage medium | |
CN112445912A (en) | Fault log classification method, system, device and medium | |
EP4071616A1 (en) | Method for generating topology diagram, anomaly detection method, device, apparatus, and storage medium | |
CN111696523A (en) | Accuracy testing method and device of voice recognition engine and electronic equipment | |
CN112395195A (en) | Method, device and equipment for processing automatic test data and storage medium | |
CN113037521A (en) | Method for identifying state of communication equipment, communication system and storage medium | |
CN111309596A (en) | Database testing method, device, terminal equipment and storage medium | |
CN111953544B (en) | Fault detection method, device, equipment and storage medium of server | |
CN110716843B (en) | System fault analysis processing method and device, storage medium and electronic equipment | |
CN111831528A (en) | A computer system log association method and related device | |
CN113743124B (en) | Intelligent question-answering exception processing method and device and electronic equipment | |
CN114860680A (en) | Log analysis and processing method and its device, equipment and medium | |
CN113971401B (en) | Wind power fault information extraction method and device | |
CN110515792B (en) | Monitoring method and device based on web version task management platform and computer equipment | |
CN119202708A (en) | Fault diagnosis data labeling method and device | |
CN114020432A (en) | Task exception handling method and device and task exception handling system | |
CN119127547A (en) | Error log processing method, device, equipment, storage medium and program product | |
CN112540925A (en) | New characteristic compatibility detection system and method, electronic device and readable storage medium | |
CN110807037B (en) | Data modification method and device, electronic equipment and storage medium | |
CN108959646B (en) | Method, system, device and storage medium for automatically verifying communication number | |
CN115757099B (en) | Automatic testing method and device for platform firmware protection and recovery function | |
CN115580524A (en) | Server fault positioning method and device | |
CN109614621A (en) | A method, device and device for correcting text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |