
CN114116282B - Method and device for reporting and repairing network attached storage faults - Google Patents


Info

Publication number
CN114116282B
Authority
CN
China
Prior art keywords
alarm
alarm event
reporting
error
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111342238.9A
Other languages
Chinese (zh)
Other versions
CN114116282A (en
Inventor
郑强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111342238.9A
Publication of CN114116282A
Application granted
Publication of CN114116282B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a method, system, device and storage medium for reporting and repairing network attached storage (NAS) faults. The method comprises the following steps: obtaining an alarm information file of the network attached storage, and filling alarm data information in the alarm information file; judging in sequence, according to the filled alarm data information, whether each alarm event triggers an alarm; in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event that occurred. The application makes NAS alarms visible to the user so that faults can be handled effectively and system stability is preserved. In addition, some alarms can be repaired automatically without manual intervention, so the repair is transparent to the user and user acceptance increases.

Description

Method and device for reporting and repairing network attached storage faults

Technical field

The present invention relates to the field of storage, and more specifically to a method, system, device and storage medium for reporting and repairing network attached storage faults.

Background

In the era of big data, the requirements for storage reliability and precise fault localization keep rising. However, when the NAS (Network Attached Storage) service of the current MCS system (a stripped-down Linux based on the Linux kernel) fails during use, the GUI (Graphical User Interface) shows no alarm event information related to the NAS service. Users therefore cannot obtain fault information promptly, the fault cannot be detected and handled in time, and the stable operation of the system is put at risk.

Summary of the invention

In view of this, the purpose of the embodiments of the present invention is to provide a method, system, computer device and computer-readable storage medium for reporting and repairing network attached storage faults. By displaying NAS alarms intuitively on the user's page and repairing them automatically when they occur, the invention reduces manual intervention, increases user acceptance, and improves system stability.

Based on the above purpose, one aspect of the embodiments of the present invention provides a method for reporting and repairing network attached storage faults, comprising the following steps: obtaining an alarm information file of the network attached storage, and filling alarm data information in the alarm information file; judging in sequence, according to the filled alarm data information, whether each alarm event triggers an alarm; in response to an alarm event being able to trigger an alarm, calling a reporting function to report the alarm event; and calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have already activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code and setting an error flag.

In some embodiments, the method further includes: in response to an alarm event not being able to trigger an alarm, calling a clear function to clear the alarm event.

In some embodiments, calling the clear function to clear the alarm event includes: clearing the error code information in the cache, and judging whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode.

Another aspect of the embodiments of the present invention provides a system for reporting and repairing network attached storage faults, comprising: an acquisition module configured to obtain an alarm information file of the network attached storage and fill alarm data information in the alarm information file; a judging module configured to judge in sequence, according to the filled alarm data information, whether each alarm event triggers an alarm; a reporting module configured to call a reporting function to report the alarm event in response to the alarm event being able to trigger an alarm; and a repair module configured to call a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, the reporting module is configured to: activate the error corresponding to the alarm event in the manager corresponding to the alarm event, and check whether other managers have already activated the error; and, in response to no other manager having activated the error, map the error code to the node's real error code and set an error flag.

In some embodiments, the system further includes a clearing module configured to: in response to an alarm event not being able to trigger an alarm, call a clear function to clear the alarm event.

In some embodiments, the clearing module is further configured to: clear the error code information in the cache, and judge whether the error code is a preset value; and, in response to the error code being the preset value, clear the current mode of the platform main process and set the platform main process to normal mode.

Yet another aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions implementing the steps of the above method when executed by the processor.

A further aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program that implements the steps of the above method when executed by a processor.

The present invention has the following beneficial technical effects: by displaying NAS alarms intuitively on the user's page and repairing them automatically when they occur, it reduces manual intervention, increases user acceptance, and improves system stability.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other embodiments from these drawings without creative effort.

FIG. 1 is a schematic diagram of an embodiment of the method for reporting and repairing network attached storage faults provided by the present invention;

FIG. 2 is a schematic diagram of an embodiment of the system for reporting and repairing network attached storage faults provided by the present invention;

FIG. 3 is a schematic diagram of the hardware structure of an embodiment of the computer device for reporting and repairing network attached storage faults provided by the present invention;

FIG. 4 is a schematic diagram of an embodiment of the computer storage medium for reporting and repairing network attached storage faults provided by the present invention.

Detailed description

To make the purpose, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities or parameters that share a name but are not the same. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments will not repeat this note.

The first aspect of the embodiments of the present invention proposes an embodiment of a method for reporting and repairing network attached storage faults. FIG. 1 is a schematic diagram of an embodiment of the method for reporting and repairing network attached storage faults provided by the present invention. As shown in FIG. 1, the embodiment of the present invention includes the following steps:

S1. Obtain an alarm information file of the network attached storage, and fill alarm data information in the alarm information file;

S2. Judge in sequence, according to the filled alarm data information, whether each alarm event triggers an alarm;

S3. In response to an alarm event being able to trigger an alarm, call a reporting function to report the alarm event; and

S4. Call a repair function in the failure mode library to repair the alarm event according to the identifier of the alarm event that occurred.
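Steps S1 to S4 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record fields (`id`, `status`, `active`), the rule that a non-zero status means an alarm, and the function names `fill_alarm_data` and `process_alarms` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical alarm record; the field names are illustrative only.
AlarmEvent = Dict[str, object]


def fill_alarm_data(raw: Dict[str, int]) -> List[AlarmEvent]:
    """S1: turn raw per-sensor status values into filled alarm records.

    Assumption: a non-zero status means the event should raise an alarm.
    """
    return [
        {"id": name, "status": status, "active": status != 0}
        for name, status in raw.items()
    ]


def process_alarms(
    raw: Dict[str, int],
    report: Callable[[AlarmEvent], None],
    clear: Callable[[AlarmEvent], None],
    repair_library: Dict[str, Callable[[AlarmEvent], None]],
) -> List[Tuple[str, str]]:
    """S2-S4: judge each event in order, then report and repair, or clear."""
    actions = []
    for event in fill_alarm_data(raw):            # S2: judge events in sequence
        if event["active"]:
            report(event)                         # S3: report the alarm
            repair = repair_library.get(event["id"])
            if repair is not None:                # S4: repair if a function is registered
                repair(event)
                actions.append((event["id"], "reported+repaired"))
            else:
                actions.append((event["id"], "reported"))
        else:
            clear(event)                          # no alarm: clear the event
            actions.append((event["id"], "cleared"))
    return actions
```

Passing the report, clear and repair callables in as parameters keeps the control flow separate from any particular alarm backend, which is a design choice of this sketch rather than something the text prescribes.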

Multiple fault sensors are embedded in the NAS virtual machine. If a fault occurs, a sensor can quickly capture it and report it to the MCS system. For example, alarms are collected for NAS node failover, the NFS (Network File System) service, the CIFS (Common Internet File System) service, the FTP (File Transfer Protocol) service, the Minioss service, NAS restart faults, NAS Ethernet port faults, and file system capacity, for the MCS system to call. The implementation flow is as follows:

On the MCS side, collection is implemented by the daemon vm_daemon.py, which calls nas_alarmd once every 5 seconds. nas_alarmd connects to the virtual machine over ssh (Secure Shell) and runs nas_alarm.py to perform the query; queries are made per node. nas_alarmd obtains the status of NAS node failover, the NFS service, the CIFS service, the FTP service, the Minioss service, restart, the network card, and the file system in the virtual machine. If the query succeeds, the result is written to a fifo file for the MCS alarm code to query.
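The polling loop described above might look roughly like the sketch below. Only the 5-second interval, the per-node query and the "write only on success" rule come from the text. The JSON line format, the function names, and the `query_node` callable (standing in for the ssh call that runs nas_alarm.py) are assumptions made so the example is self-contained.

```python
import io
import json
import time
from typing import Callable, Dict, List, Optional, TextIO

POLL_INTERVAL_S = 5  # from the text: nas_alarmd is called once every 5 seconds


def poll_once(
    query_node: Callable[[str], Optional[Dict[str, int]]],
    node: str,
    sink: TextIO,
) -> bool:
    """Query one node and write the result to the sink only if the query succeeded.

    query_node is a hypothetical stand-in for the ssh query; it returns None on failure.
    """
    status = query_node(node)
    if status is None:
        return False                      # failed query: nothing is written
    # Assumed record format: one JSON object per line.
    sink.write(json.dumps({"node": node, "status": status}) + "\n")
    sink.flush()
    return True


def run_daemon(
    query_node: Callable[[str], Optional[Dict[str, int]]],
    nodes: List[str],
    sink: TextIO,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll every node once per round and return the number of records written."""
    written = 0
    for i in range(rounds):
        for node in nodes:
            written += int(poll_once(query_node, node, sink))
        if i < rounds - 1:
            sleep(POLL_INTERVAL_S)        # wait before the next polling round
    return written
```

In a real deployment the sink would be the fifo file opened for writing and `query_node` would wrap the ssh invocation; both are left abstract here because the text does not give paths, hosts or output formats.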

The alarm information file of the network attached storage is obtained, and alarm data information is filled in the alarm information file. Whether each alarm event triggers an alarm is then judged in sequence according to the filled alarm data information.

In response to an alarm event being able to trigger an alarm, the reporting function is called to report the alarm event. Alarm detection and handling in the MCS system is carried out by two modules, EC and PL. The EC module reads the NAS alarm information file, judges the alarm events one by one, and fills in information such as error records, status data and activation flags. It then processes the alarm events in sequence according to the filled information: if an event has an alarm, it calls the alarm reporting function; otherwise it calls the alarm clearing function. The PL module sorts the error codes according to the alarm event information it receives and reports the alarm events. The specific flow is as follows. MCS checks whether the event state is starting, and exits if so. The MCS system reads the NAS alarm information status file and judges whether the obtained information is valid, and exits if it is invalid. The MCS system then judges the NAS alarm information item by item and fills in information such as error records, status data and activation flags. Finally, the alarm events are processed in sequence according to the alarm data information filled in the previous step: if an alarm event has an alarm, the ecmgr_sensor_report_node_error function is called to report the alarm; if there is no alarm, the ecmgr_sensor_clear_node_error function is called to clear the alarm.
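The guard conditions and the EC/PL split in this flow can be sketched as below. The function names `ec_check` and `pl_order`, the shape of the status data (error code mapped to an alarm flag), and the ascending sort order are assumptions; the text only says that PL sorts the error codes, without giving the order. Of the codes used in the usage note, only 0x522 appears in the text.

```python
from typing import Dict, List, Optional, Tuple

Decision = Tuple[str, int]  # (action, error_code), with action "report" or "clear"


def ec_check(event_state: str, status_info: Optional[Dict[int, bool]]) -> List[Decision]:
    """EC-style pass over the alarm status data.

    status_info maps an error code to True when that event currently has an alarm.
    """
    if event_state == "starting":     # event handling still starting: exit
        return []
    if not status_info:               # missing or empty status data is treated as invalid: exit
        return []
    decisions = []
    for code, has_alarm in status_info.items():
        # An event with an alarm is reported; one without is cleared.
        decisions.append(("report" if has_alarm else "clear", code))
    return decisions


def pl_order(decisions: List[Decision]) -> List[int]:
    """PL-style step: sort the error codes that are to be reported.

    Assumption: ascending numeric order.
    """
    return sorted(code for action, code in decisions if action == "report")
```

For example, `ec_check("running", {0x600: True, 0x522: True, 0x601: False})` yields two report decisions and one clear decision, and `pl_order` on that result returns the reported codes in ascending order.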

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have already activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code and setting an error flag. The error code is then checked against 0x522: if it matches, the platform main process is forced into 522 mode; if not, the function is called to report the alarm. The error code is cached so that the error information is not lost when the io process (input/output process) exits.
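One way to picture the reporting logic is the small state holder below. It is a sketch under assumptions: the class name `NodeErrorState`, its attributes, the mode strings, and the choice to compare the mapped code (rather than the raw one) against 0x522 are all hypothetical, since the text does not specify them.

```python
from typing import Dict, List, Optional, Set

SPECIAL_CODE = 0x522  # the special error code named in the text


class NodeErrorState:
    """Illustrative in-memory state for reporting node errors."""

    def __init__(self, code_map: Dict[int, int]) -> None:
        self.code_map = code_map                 # raw code -> node real error code
        self.active: Dict[str, Set[int]] = {}    # manager name -> activated raw codes
        self.flags: Set[int] = set()             # error flags that have been set
        self.cache: List[int] = []               # cached codes, kept if the io process exits
        self.reported: List[int] = []            # codes passed on as ordinary alarms
        self.main_mode = "normal"                # mode of the platform main process

    def report(self, manager: str, code: int) -> Optional[int]:
        """Activate the error; return the real code, or None if it was a duplicate."""
        already = any(
            code in codes for name, codes in self.active.items() if name != manager
        )
        self.active.setdefault(manager, set()).add(code)   # activate in this manager
        if already:
            return None                          # another manager activated it earlier
        real = self.code_map.get(code, code)     # map to the node's real error code
        self.flags.add(real)                     # set the error flag
        self.cache.append(real)                  # cache so the code is not lost
        if real == SPECIAL_CODE:
            self.main_mode = "522"               # force the main process into 522 mode
        else:
            self.reported.append(real)           # otherwise report the alarm
        return real
```

The duplicate check means an error raised by two managers surfaces only once, which matches the "no other manager has activated the error" condition in the text.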

In some embodiments, the method further includes: in response to an alarm event not being able to trigger an alarm, calling a clear function to clear the alarm event.

In some embodiments, calling the clear function to clear the alarm event includes: clearing the error code information in the cache, and judging whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode. Specifically, the error code is checked against 0x522: if it matches, the 522 mode of the platform main process is cleared; if not, the platform main process is set to normal mode.
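The clearing step can be sketched as a small function. The text describes the branch in two slightly different ways, so this sketch follows the first wording (the mode is reset to normal only when the code equals the preset value) and leaves the mode untouched for other codes. That choice, the function name and the mode strings are assumptions.

```python
from typing import Set

PRESET_CODE = 0x522  # the preset value named in the text


def clear_node_error(cache: Set[int], code: int, main_mode: str) -> str:
    """Clear a cached error code and return the resulting main-process mode."""
    cache.discard(code)              # clear the error code information from the cache
    if code == PRESET_CODE:
        return "normal"              # clear the 522 mode and fall back to normal mode
    return main_mode                 # assumption: other codes leave the mode as it is
```

Returning the new mode instead of mutating global state keeps the function easy to test; a real implementation would presumably update the platform main process directly.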

Calling the clear function to clear the alarm event also includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have already activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code.

According to the identifier of the alarm event that occurred, the repair function in the failure mode library is called to repair the alarm event.

NAS-related alarm event information can be displayed in the alarm view of the graphical user interface front end. The view lists the error code, timestamp, status, description, object type, object identifier and object name of each current alarm event. Right-clicking an alarm event offers operations such as viewing its properties, clearing the log, and running a repair. Some alarms are registered through a big data background script, which then calls the automatic repair module to locate and repair the fault automatically. The automatic repair works as follows: according to the identifier with which the alarm appears, the automatic repair module in the failure mode library is called to perform the repair.
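The lookup of a repair function by alarm identifier can be sketched with a plain registry. The registry name, the decorator, the identifier `nfs_service_down` and the example repair action are all hypothetical; the text only states that a repair function in the failure mode library is selected by the alarm's identifier.

```python
from typing import Callable, Dict, Optional

# Hypothetical failure mode library: alarm identifier -> repair function.
FAILURE_MODE_LIBRARY: Dict[str, Callable[[dict], str]] = {}


def repair_for(identifier: str):
    """Decorator that registers a repair function under an alarm identifier."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        FAILURE_MODE_LIBRARY[identifier] = func
        return func
    return register


@repair_for("nfs_service_down")          # illustrative identifier, not from the text
def restart_nfs(event: dict) -> str:
    # A real repair would act on the node; this sketch only describes the action.
    return f"restart nfs on {event['node']}"


def auto_repair(event: dict) -> Optional[str]:
    """Look up the repair by the event's identifier; None means manual handling."""
    repair = FAILURE_MODE_LIBRARY.get(event["id"])
    if repair is None:
        return None
    return repair(event)
```

Alarms without a registered repair fall through to `None`, which corresponds to the statement that only some alarms are repaired automatically.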

It should be particularly noted that the steps in the embodiments of the above method for reporting and repairing network attached storage faults can be interleaved, replaced, added or deleted. Such reasonable permutations and combinations of the method therefore also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.

Based on the above purpose, the second aspect of the embodiments of the present invention proposes a system for reporting and repairing network attached storage faults. As shown in FIG. 2, the system 200 includes the following modules: an acquisition module configured to obtain an alarm information file of the network attached storage and fill alarm data information in the alarm information file; a judging module configured to judge in sequence, according to the filled alarm data information, whether each alarm event triggers an alarm; a reporting module configured to call a reporting function to report the alarm event in response to the alarm event being able to trigger an alarm; and a repair module configured to call a repair function in the failure mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, the reporting module is configured to: activate the error corresponding to the alarm event in the manager corresponding to the alarm event, and check whether other managers have already activated the error; and, in response to no other manager having activated the error, map the error code to the node's real error code and set an error flag.

In some embodiments, the system further includes a clearing module configured to: in response to an alarm event not being able to trigger an alarm, call a clear function to clear the alarm event.

In some embodiments, the clearing module is further configured to: clear the error code information in the cache, and judge whether the error code is a preset value; and, in response to the error code being the preset value, clear the current mode of the platform main process and set the platform main process to normal mode.

Based on the above purpose, the third aspect of the embodiments of the present invention proposes a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: S1. obtain an alarm information file of the network attached storage, and fill alarm data information in the alarm information file; S2. judge in sequence, according to the filled alarm data information, whether each alarm event triggers an alarm; S3. in response to an alarm event being able to trigger an alarm, call a reporting function to report the alarm event; and S4. call a repair function in the failure mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have already activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code and setting an error flag.

In some embodiments, the steps further include: in response to an alarm event not being able to trigger an alarm, calling a clear function to clear the alarm event.

In some embodiments, calling the clear function to clear the alarm event includes: clearing the error code information in the cache, and judging whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode.

FIG. 3 is a schematic diagram of the hardware structure of an embodiment of the above computer device for reporting and repairing network attached storage faults provided by the present invention.

Taking the device shown in FIG. 3 as an example, the device includes a processor 301 and a memory 302.

The processor 301 and the memory 302 may be connected by a bus or in other ways; FIG. 3 takes connection by a bus as an example.

The memory 302, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for reporting and repairing network attached storage faults in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, it implements the method for reporting and repairing network attached storage faults.

The memory 302 may include a program storage area and a data storage area. The program storage area may store an operating system and the application program required by at least one function; the data storage area may store data created through use of the method for reporting and repairing network attached storage faults. In addition, the memory 302 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, and such remote memory may be connected to the local module through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Computer instructions 303 corresponding to one or more methods for reporting and repairing network attached storage faults are stored in the memory 302 and, when executed by the processor 301, perform the method for reporting and repairing network attached storage faults in any of the above method embodiments.

Any embodiment of the computer device that executes the above method for reporting and repairing network attached storage faults can achieve the same or similar effects as the corresponding method embodiment described above.

本发明还提供了一种计算机可读存储介质,计算机可读存储介质存储有被处理器执行时执行网络附加存储故障上报并修复的方法的计算机程序。The present invention also provides a computer-readable storage medium, and the computer-readable storage medium stores a computer program for performing a method for reporting and repairing network-attached storage faults when executed by a processor.

FIG. 4 is a schematic diagram of an embodiment of the computer storage medium for reporting and repairing NAS faults provided by the present invention. Taking the computer storage medium shown in FIG. 4 as an example, the computer-readable storage medium 401 stores a computer program 402 that, when executed by a processor, performs the above method.

Finally, it should be noted that those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program of the method for reporting and repairing NAS faults can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The above computer program embodiments can achieve effects that are the same as or similar to those of the corresponding method embodiments described above.

The above are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications can be made without departing from the scope of the disclosed embodiments as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, the plural is also contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a" and "an" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein includes any and all possible combinations of one or more of the associated listed items.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be carried out by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples. Within the idea of the embodiments of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments described above exist, which are not described in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, or the like made within the spirit and principles of the embodiments of the present invention shall fall within the protection scope of the embodiments of the present invention.

Claims (10)

1. A method for reporting and repairing network-attached storage faults, characterized by comprising the following steps:
obtaining an alarm information file of the network-attached storage, and filling alarm data information into the alarm information file;
determining in turn, according to the filled alarm data information, whether each alarm event triggers an alarm;
in response to an alarm event being able to trigger an alarm, calling a reporting function to report the alarm event; and
calling, according to the identifier associated with the occurrence of the alarm event, a repair function in a fault mode library to repair the alarm event.

2. The method according to claim 1, wherein calling the reporting function to report the alarm event comprises:
activating, in the manager corresponding to the alarm event, the error corresponding to the alarm event, and checking whether any other manager has activated the error; and
in response to no other manager having activated the error, mapping the error code to the real error code of the node and setting an error flag.

3. The method according to claim 1, further comprising:
in response to an alarm event not being able to trigger an alarm, calling a clearing function to clear the alarm event.

4. The method according to claim 3, wherein calling the clearing function to clear the alarm event comprises:
clearing the error code information in the cache, and determining whether the error code is a preset value; and
in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode.

5. A system for reporting and repairing network-attached storage faults, characterized by comprising:
an obtaining module configured to obtain an alarm information file of the network-attached storage and to fill alarm data information into the alarm information file;
a judging module configured to determine in turn, according to the filled alarm data information, whether each alarm event triggers an alarm;
a reporting module configured to call, in response to an alarm event being able to trigger an alarm, a reporting function to report the alarm event; and
a repair module configured to call, according to the identifier associated with the occurrence of the alarm event, a repair function in a fault mode library to repair the alarm event.

6. The system according to claim 5, wherein the reporting module is configured to:
activate, in the manager corresponding to the alarm event, the error corresponding to the alarm event, and check whether any other manager has activated the error; and
in response to no other manager having activated the error, map the error code to the real error code of the node and set an error flag.

7. The system according to claim 5, further comprising a clearing module configured to:
in response to an alarm event not being able to trigger an alarm, call a clearing function to clear the alarm event.

8. The system according to claim 7, wherein the clearing module is further configured to:
clear the error code information in the cache, and determine whether the error code is a preset value; and
in response to the error code being the preset value, clear the current mode of the platform main process and set the platform main process to normal mode.

9. A computer device, characterized by comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, wherein the instructions, when executed by the processor, implement the steps of the method according to any one of claims 1 to 4.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
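The flow recited in claims 1 to 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `AlarmEvent`, `Node`, `report`, `clear`, `process`, the `code_map` table, the `preset_code` parameter, and the `"degraded"` starting mode are all hypothetical, and the alarm information file is assumed to have already been parsed into a list of events whose `triggered` field reflects the filled alarm data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

NORMAL_MODE = "normal"  # hypothetical name for the "normal mode" of the platform main process


@dataclass
class AlarmEvent:
    event_id: str     # identifier used to look up a repair function (assumed)
    manager: str      # manager that owns this event (assumed)
    error_code: int   # manager-local error code (assumed)
    triggered: bool   # result of evaluating the filled alarm data


@dataclass
class Node:
    # manager name -> set of error codes that manager has activated
    active_errors: Dict[str, Set[int]] = field(default_factory=dict)
    # cached mapping: manager-local code -> node "real" error code
    error_cache: Dict[int, int] = field(default_factory=dict)
    error_flag: bool = False
    mode: str = "degraded"  # hypothetical non-normal starting mode


def report(node: Node, event: AlarmEvent, code_map: Dict[int, int]) -> None:
    """Claim 2 sketch: activate the error, then map and flag it if no other manager has it."""
    node.active_errors.setdefault(event.manager, set()).add(event.error_code)
    activated_elsewhere = any(
        event.error_code in codes
        for manager, codes in node.active_errors.items()
        if manager != event.manager
    )
    if not activated_elsewhere:
        # Fall back to the local code when no mapping exists (an assumption).
        node.error_cache[event.error_code] = code_map.get(event.error_code, event.error_code)
        node.error_flag = True


def clear(node: Node, event: AlarmEvent, preset_code: int) -> None:
    """Claim 4 sketch: drop the cached code, and reset the mode for the preset code."""
    node.error_cache.pop(event.error_code, None)
    if event.error_code == preset_code:
        node.mode = NORMAL_MODE  # clear the current mode and set normal mode


def process(
    node: Node,
    events: List[AlarmEvent],
    code_map: Dict[int, int],
    repair_library: Dict[str, Callable[[AlarmEvent], None]],
    preset_code: int,
) -> List[str]:
    """Claims 1 and 3 sketch: report and repair triggered events, clear the rest."""
    repaired: List[str] = []
    for event in events:
        if event.triggered:
            report(node, event, code_map)
            repair = repair_library.get(event.event_id)
            if repair is not None:
                repair(event)
                repaired.append(event.event_id)
        else:
            clear(node, event, preset_code)
    return repaired
```

In this sketch the fault mode library is modeled as a plain dictionary from event identifier to repair callable, which keeps the lookup by identifier explicit; a real system would presumably register repair routines per fault mode in some other way.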
CN202111342238.9A 2021-11-12 2021-11-12 Method and device for reporting and repairing network additional storage faults Active CN114116282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111342238.9A CN114116282B (en) 2021-11-12 2021-11-12 Method and device for reporting and repairing network additional storage faults

Publications (2)

Publication Number Publication Date
CN114116282A CN114116282A (en) 2022-03-01
CN114116282B true CN114116282B (en) 2023-08-18

Family

ID=80379036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111342238.9A Active CN114116282B (en) 2021-11-12 2021-11-12 Method and device for reporting and repairing network additional storage faults

Country Status (1)

Country Link
CN (1) CN114116282B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115473788B (en) * 2022-08-29 2023-08-11 苏州浪潮智能科技有限公司 A storage alarm test method, device, equipment, and storage medium
CN115842710A (en) * 2022-11-22 2023-03-24 中国农业银行股份有限公司 Service side data processing method and device and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339297A (en) * 2016-09-14 2017-01-18 郑州云海信息技术有限公司 Method and system for warning failures of storage system in real time
CN108763038A (en) * 2018-08-08 2018-11-06 平安科技(深圳)有限公司 Management method, device, computer equipment and the storage medium of alarm data
CN110688280A (en) * 2019-09-25 2020-01-14 中国建设银行股份有限公司 Management system, method, equipment and storage medium of alarm event
CN112035319A (en) * 2020-08-31 2020-12-04 浪潮云信息技术股份公司 Monitoring alarm system for multi-path state
CN112131201A (en) * 2020-09-18 2020-12-25 苏州浪潮智能科技有限公司 Method, system, equipment and medium for high availability of network additional storage
WO2021136247A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Alarm processing method and apparatus, and storage medium

Also Published As

Publication number Publication date
CN114116282A (en) 2022-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant