CN112882901B - Intelligent health state monitor of distributed processing system - Google Patents

Info

Publication number
CN112882901B
Authority
CN
China
Prior art keywords
health monitoring
health
node
functional module
network switch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110243326.7A
Other languages
Chinese (zh)
Other versions
CN112882901A (en)
Inventor
李成文
韩强
张伟栋
陈国�
丰生磊
赵子杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-03-04
Filing date
2021-03-04
Publication date
2024-06-18
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN202110243326.7A
Publication of CN112882901A
Application granted
Publication of CN112882901B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an intelligent health state monitor for a distributed processing system, comprising monitoring management nodes, a health monitoring data network switch and a health monitoring server, the health monitoring data network switch being connected to the health monitoring server. The number of monitoring management nodes corresponds to the number of monitored processors. Each monitoring management node collects health state information from the functional modules in its processor and transmits it over the data communication network, through the health monitoring data network switch, to the health monitoring server; the server analyzes the health monitoring data, makes decisions, diagnoses the cause of system faults and restores operation as quickly as possible. The invention monitors in real time the working state of components of the distributed processing system such as the power supply, CPU, memory and solid-state storage, helps the system manager rapidly diagnose the causes of system faults, and effectively improves the testability, maintainability and supportability of the system; because monitoring runs independently of the functional components, it also substantially improves the system's task processing capacity.

Description

Intelligent health state monitor of distributed processing system
Technical Field
The invention belongs to the technical field of embedded computer system design, and particularly relates to an intelligent monitor for the health state of a distributed processing system.
Background
Airborne embedded systems contain ever more types and numbers of devices, their processing systems are increasingly complex, and their health is increasingly difficult to monitor. The traditional approach relies solely on the built-in test (BIT) of the main processor: it cannot locate problems accurately, it directly interferes with the main processor's task processing, and it reduces the efficiency with which system processing resources are used.
Disclosure of Invention
The invention aims to provide an intelligent health state monitor for distributed processing systems that meets the testability, maintainability and reliability requirements that high-performance aircraft systems place on processing equipment.
To achieve this object, the invention adopts the following technical solution:
An intelligent health state monitor for a distributed processing system comprises monitoring management nodes, a health monitoring data network switch and a health monitoring server. The number of monitoring management nodes corresponds to the number of monitored processors; each monitoring management node collects health state information from the functional modules in its processor and transmits it over the data communication network, through the health monitoring data network switch, to the health monitoring server, which analyzes the health monitoring data, makes decisions, diagnoses the cause of system faults and restores operation as quickly as possible.
Further, the health monitoring data network switch performs the exchange of monitoring data; the data network is FC, AFDX or Ethernet, and the network communication rate is not lower than 1 Gbps.
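The network constraints stated above can be captured as a configuration check. The sketch below is a minimal Python illustration under our own assumptions: the `HealthNetworkConfig` record, the `validate_network` function and the string spellings of the network types are hypothetical names, not part of the patent; only the allowed network types and the 1 Gbps floor come from the text.

```python
from dataclasses import dataclass

# Network types and minimum rate stated in the description; spellings are assumed.
ALLOWED_NETWORKS = {"FC", "AFDX", "Ethernet"}
MIN_RATE_BPS = 1_000_000_000  # "not lower than 1 Gbps"


@dataclass(frozen=True)
class HealthNetworkConfig:
    """Hypothetical configuration record for the health monitoring data network."""
    network_type: str
    rate_bps: int


def validate_network(cfg: HealthNetworkConfig) -> list:
    """Return a list of violations; an empty list means the constraints are met."""
    problems = []
    if cfg.network_type not in ALLOWED_NETWORKS:
        problems.append(f"unsupported network type: {cfg.network_type}")
    if cfg.rate_bps < MIN_RATE_BPS:
        problems.append(f"rate {cfg.rate_bps} bit/s is below 1 Gbps")
    return problems


print(validate_network(HealthNetworkConfig("Ethernet", 1_000_000_000)))  # []
print(validate_network(HealthNetworkConfig("CAN", 500_000_000)))
# ['unsupported network type: CAN', 'rate 500000000 bit/s is below 1 Gbps']
```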
Further, the processor comprises child nodes and root nodes. There are two root nodes, which are two independent circuits physically located in functional modules and back each other up. A communication link is always maintained between the two root nodes: one is the active root node and the other the backup root node. The active root node monitors the health state of all functional modules, including the modules that host the active and backup root nodes; detects power failure and removal of functional modules; reports events to the corresponding monitoring management node; and receives control instructions from the monitoring management node and performs the appropriate operations to schedule tasks of the functional modules and prevent system faults.
Further, the data links of the two root nodes are physically carried on two I2C buses.
Further, the child node is an independent circuit unit located in the functional module; it collects and uploads the sensor data, CPU state data and self-test data of the functional module, reads the module slot number and device number, and controls power-on, power-off and reset of the functional module.
Further, the microcontroller in the child node runs the module function software and is responsible for receiving commands from the external root node and uploading sensor data, CPU state data and software state data. The monitoring process and content of the child node comprise: before power-on, checking the slot position: if the check passes, the functional module is powered on normally, and if it fails, the slot position is reported and the functional module cannot be powered on normally; powering on and resetting the functional module; measuring voltage, key circuit current and temperature; checking the running state of core devices, including the CPU, the switching chip and the memory; checking the running state of key applications; and detecting the link up/down state of the switching chip ports.
Further, the child node monitors the voltage and temperature of the functional module and obtains the module's working state from the CPU via the universal serial bus;
When the voltage, temperature or working state of the functional module is abnormal, the child node sends an alarm to the root node; in addition, it reports the voltage, temperature and working state of the functional module to the root node in response to the root node's query commands. The root node receives query requests from the monitoring management node over the network, issues temperature, voltage and working-state queries to the child node of each functional module, automatically reports system alarm information to the monitoring management node, and records the system work log.
Further, the monitoring management node, the root nodes and the child nodes are each powered by an independent power supply and are powered on before the functional circuits of the distributed processing system.
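The power-up ordering described above can be sketched as a simple sequencer. This is a hypothetical illustration: the `Rail` names, the `enable_rail` callback and the choice to stop sequencing at the first failed rail are our assumptions; the patent states only that the monitoring chain has independent supplies and is powered before the functional circuits.

```python
from enum import IntEnum


class Rail(IntEnum):
    """Hypothetical independent supply rails, in required power-up order."""
    MONITORING_MANAGEMENT_NODE = 0
    ROOT_NODES = 1
    CHILD_NODES = 2
    FUNCTIONAL_CIRCUITS = 3  # must come after the whole monitoring chain


def power_up_sequence(enable_rail) -> list:
    """Enable rails in order and return those that came up.

    `enable_rail(rail)` is a caller-supplied callback returning True when the rail
    is stable. Sequencing stops at the first failure (an assumed policy), so the
    functional circuits are never powered while part of the monitoring chain is down.
    """
    powered = []
    for rail in Rail:  # IntEnum iterates in definition order
        if not enable_rail(rail):
            break
        powered.append(rail)
    return powered


# Example: the child-node supply fails, so the functional circuits stay off.
print([r.name for r in power_up_sequence(lambda rail: rail != Rail.CHILD_NODES)])
# ['MONITORING_MANAGEMENT_NODE', 'ROOT_NODES']
```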
Compared with the prior art, the invention has the following technical characteristics:
To meet the increasingly demanding requirements that complex environments place on embedded systems, the invention provides an intelligent health state monitor for distributed processing systems. It monitors in real time the working state of all components of the distributed processing system, such as the power supply, CPU, memory and solid-state storage; helps the system manager rapidly diagnose the causes of system faults and restore operation as quickly as possible; effectively improves the testability, maintainability and supportability of the system; and greatly improves the system's task processing capacity.
Drawings
FIG. 1 shows the intelligent health state monitor of the distributed processing system;
FIG. 2 shows the functional architecture of monitoring within a processor.
Detailed Description
Referring to FIG. 1, the intelligent health state monitor for a distributed processing system provided by the invention comprises monitoring management nodes, a health monitoring data network switch and a health monitoring server. Multiple monitoring management nodes may be deployed according to system requirements, one for each monitored processor. Each monitoring management node collects health state information from the functional modules in its processor and transmits it over the data communication network, through the health monitoring data network switch, to the health monitoring server; the server analyzes the health monitoring data, makes decisions, rapidly diagnoses the causes of system faults and restores operation as quickly as possible. The health monitoring data network switch performs the exchange of monitoring data; the data network may be FC, AFDX, Ethernet or similar, with a communication rate of no less than 1 Gbps.
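The data flow of FIG. 1 can be modeled in a few lines of Python. In this sketch an in-process `queue.Queue` stands in for the health monitoring data network switch, and the class names, report fields and the server's grouping of faults by processor are hypothetical; the patent does not specify message formats or the server's analysis method.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class HealthReport:
    processor_id: str  # processor served by the monitoring management node
    module: str        # functional module, e.g. "compute-1"
    ok: bool
    detail: str = ""


class MonitoringManagementNode:
    """One per monitored processor; forwards module health reports to the switch."""

    def __init__(self, processor_id: str, switch: Queue):
        self.processor_id = processor_id
        self.switch = switch

    def collect(self, module_states: dict) -> None:
        # module_states maps module name -> (ok, detail)
        for module, (ok, detail) in module_states.items():
            self.switch.put(HealthReport(self.processor_id, module, ok, detail))


class HealthMonitoringServer:
    """Drains the switch and groups faulty modules by processor for diagnosis."""

    def __init__(self, switch: Queue):
        self.switch = switch

    def diagnose(self) -> dict:
        faults = {}
        while not self.switch.empty():
            report = self.switch.get()
            if not report.ok:
                faults.setdefault(report.processor_id, []).append(
                    f"{report.module}: {report.detail}")
        return faults


switch = Queue()  # stands in for the health monitoring data network switch
MonitoringManagementNode("proc-A", switch).collect(
    {"compute-1": (True, ""), "output-1": (False, "over-temperature")})
MonitoringManagementNode("proc-B", switch).collect({"compute-1": (True, "")})
server = HealthMonitoringServer(switch)
print(server.diagnose())  # {'proc-A': ['output-1: over-temperature']}
```

A real deployment would carry these reports over FC, AFDX or Ethernet rather than an in-memory queue; the queue only illustrates the many-nodes-to-one-server topology.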
As shown in FIG. 2, the processor includes child nodes and root nodes. There are two root nodes, implemented as two physically independent circuits located in functional modules, and they back each other up. A communication link is always maintained between the two root nodes: one is the active root node and the other the backup root node. The active root node monitors the health state of all functional modules, including the modules that host the active and backup root nodes; detects power failure and removal of functional modules; reports these events to the corresponding monitoring management node; and, on receiving control instructions from the monitoring management node, performs the appropriate operations to schedule tasks across the functional modules and prevent system faults. The data links of the two root nodes are physically carried on two I2C buses. A functional module is a module that implements a particular function in the processor, such as a computing module or an output module.
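The patent keeps a permanent link between the two root nodes but does not say how the backup decides to take over. One plausible scheme, sketched here purely as an assumption, is a heartbeat carried on both I2C buses: the active root node is treated as alive if its heartbeat appears on either bus, and the backup promotes itself after a configurable number of consecutive periods with no heartbeat on either. The `RootPair` class, `tick` method and `limit` parameter are hypothetical.

```python
class RootPair:
    """Hypothetical active/backup switchover for the two root nodes.

    Each heartbeat period the backup looks for the active node's heartbeat on
    both I2C buses. A heartbeat on either bus counts as alive; after `limit`
    consecutive periods with no heartbeat on either bus, the roles swap.
    """

    def __init__(self, limit: int = 3):
        self.active, self.backup = "root-0", "root-1"
        self.limit = limit
        self.missed = 0

    def tick(self, seen_on_bus0: bool, seen_on_bus1: bool) -> str:
        """Process one heartbeat period and return the current active root node."""
        if seen_on_bus0 or seen_on_bus1:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.limit:
                # Backup takes over; the failed node becomes the (inactive) backup.
                self.active, self.backup = self.backup, self.active
                self.missed = 0
        return self.active


pair = RootPair(limit=3)
print(pair.tick(True, False))  # root-0 (bus 1 silent, heartbeat still seen on bus 0)
print([pair.tick(False, False) for _ in range(3)])  # ['root-0', 'root-0', 'root-1']
```

Requiring silence on both buses before switching over avoids a spurious takeover when only one I2C bus fails; the threshold trades detection latency against tolerance of transient bus errors.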
The child node is likewise an independent circuit unit located in the functional module and is powered by an independent supply. It is mainly responsible for collecting and uploading the sensor data, CPU state data and self-test data of the functional module in which it resides, reading the module slot number and device number, and controlling power-on, power-off and reset of the functional module. The microcontroller in the child node runs the module function software, receives commands from the external root node, and uploads sensor data, CPU state data and software state data. The monitoring process and content of the child node are as follows: before power-on, the slot position is checked; if the check passes, the functional module is powered on normally, and if it fails, the slot position is reported and the functional module cannot be powered on normally; the functional module is powered on and reset; voltage, key circuit current and temperature are measured; the running state of core devices such as the CPU, switching chip and memory is checked; the running state of key applications is checked; and the link up/down state of the switching chip ports is detected.
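The child node's start-up order (slot check, then power-on and reset, then the health checks) can be sketched as follows. How the slot is judged correct is not specified in the patent; here it is assumed to be a comparison against a configured `expected_slot`, and `child_node_startup`, `ChildNodeResult` and the check names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChildNodeResult:
    powered: bool
    events: list = field(default_factory=list)


def child_node_startup(read_slot, expected_slot: int, checks: dict) -> ChildNodeResult:
    """Hypothetical child-node start-up in the order given in the description.

    read_slot:     callable returning the slot number read from the backplane.
    expected_slot: slot this module is configured for (an assumption; the patent
                   does not say how slot correctness is judged).
    checks:        ordered mapping of check name -> callable returning True if healthy.
    """
    slot = read_slot()
    if slot != expected_slot:
        # Wrong slot: report it and leave the functional module unpowered.
        return ChildNodeResult(False, [f"slot mismatch: read {slot}, expected {expected_slot}"])
    result = ChildNodeResult(True, ["power-on", "reset"])
    for name, check in checks.items():
        result.events.append(f"{name}: {'ok' if check() else 'FAULT'}")
    return result


checks = {
    "voltage/current/temperature": lambda: True,
    "CPU, switching chip, memory": lambda: True,
    "key applications": lambda: True,
    "switch port link": lambda: False,  # e.g. a port reports link down
}
print(child_node_startup(lambda: 4, 4, checks).events)
```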
The monitoring management node, the root nodes and the child nodes are each powered by independent power supplies and are powered on before the functional circuits of the distributed processing system. The child node monitors the voltage and temperature of the functional module and obtains the module's working state from the CPU over the universal serial bus. When the voltage, temperature or working state of the functional module is abnormal, the child node sends an alarm to the root node; it also reports the module's voltage, temperature, working state and similar information to the root node in response to the root node's query commands. The root node can receive query requests from the monitoring management node over the network and issues temperature, voltage, working-state and similar queries to the child node of each functional module; it automatically reports system alarm information to the monitoring management node and records the system work log.
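The alarm and query paths described above can be sketched with two small classes: the child node pushes an alarm when a reading is out of range, and the root node answers a management-node query by fanning it out to every child. The numeric limits, the `Limits`, `ChildNode` and `RootNode` names, and the in-memory lists standing in for the system work log and the report to the monitoring management node are all assumptions; the patent gives no thresholds or message formats.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical per-module limits; the patent gives no numeric thresholds."""
    v_min: float
    v_max: float
    t_max_c: float


class RootNode:
    def __init__(self):
        self.children = []
        self.work_log = []   # system work log
        self.reported = []   # alarms forwarded to the monitoring management node

    def on_alarm(self, status: dict) -> None:
        # Alarms are logged and reported upward automatically.
        self.work_log.append(f"ALARM {status['module']}")
        self.reported.append(status)

    def handle_query(self) -> list:
        # A query from the monitoring management node is fanned out to every child node.
        self.work_log.append("QUERY")
        return [child.status() for child in self.children]


class ChildNode:
    def __init__(self, module: str, limits: Limits, read_sensors, root: RootNode):
        self.module = module
        self.limits = limits
        self.read_sensors = read_sensors  # callable -> (volts, temp_c, state_ok)
        self.root = root
        root.children.append(self)

    def status(self) -> dict:
        volts, temp_c, state_ok = self.read_sensors()
        return {"module": self.module, "voltage": volts, "temp_c": temp_c, "state_ok": state_ok}

    def sample(self) -> None:
        """Read sensors once and push an alarm to the root node if anything is abnormal."""
        s, lim = self.status(), self.limits
        abnormal = (not (lim.v_min <= s["voltage"] <= lim.v_max)
                    or s["temp_c"] > lim.t_max_c
                    or not s["state_ok"])
        if abnormal:
            self.root.on_alarm(s)


readings = {"v": 12.0, "t": 40.0, "ok": True}  # simulated sensor values (12 V rail assumed)
root = RootNode()
ChildNode("compute-1", Limits(11.4, 12.6, 85.0),
          lambda: (readings["v"], readings["t"], readings["ok"]), root).sample()
readings["t"] = 95.0  # simulate over-temperature
root.children[0].sample()
print(root.work_log)                      # ['ALARM compute-1']
print(root.handle_query()[0]["temp_c"])  # 95.0
```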
The intelligent monitor is independent of the functional components and operates autonomously, so it saves their processing resources and improves the system's task processing capacity. It addresses the problems of insufficient health-monitoring information and inaccurate fault localization in complex processing systems, helps the system manager rapidly diagnose system faults and restore operation as quickly as possible, and thereby effectively improves the testability, maintainability and supportability of the system.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (4)

1. An intelligent health state monitor of a distributed processing system, characterized by comprising monitoring management nodes, a health monitoring data network switch and a health monitoring server, wherein the number of monitoring management nodes corresponds to the number of monitored processors; each monitoring management node collects health state information from the functional modules in its processor and transmits the information over the data communication network, through the health monitoring data network switch, to the health monitoring server; and the health monitoring server analyzes the health monitoring data, makes decisions and diagnoses the cause of system faults;
The processor comprises child nodes and root nodes; there are two root nodes, which are two independent circuits physically located in functional modules and back each other up; a communication link is always maintained between the two root nodes, one being the active root node and the other the backup root node; the active root node monitors the health state of all functional modules, the monitored functional modules including those in which the active root node and the corresponding backup root node are located; it detects power failure and removal of functional modules, reports events to the corresponding monitoring management node, and receives control instructions from the monitoring management node and executes the corresponding operations to schedule tasks of the functional modules;
The child node is an independent circuit unit located in the functional module and is used for collecting and uploading sensor data, CPU state data and self-test data of the functional module, reading the slot number and device number of the functional module, and controlling power-on, power-off and reset of the functional module;
The microcontroller in the child node runs module function software and is responsible for receiving commands from the external root node and uploading sensor data, CPU state data and software state data; the monitoring process and content of the child node comprise: before power-on, checking the slot position, wherein if the check passes the functional module is powered on normally, and if it fails the slot position is reported and the functional module cannot be powered on normally; powering on and resetting the functional module; detecting voltage, key circuit current and temperature; detecting the running state of core devices, including the CPU, the switching chip and the memory; detecting the running state of key applications; and detecting the link up/down state of the switching chip ports;
the child node monitors the voltage and temperature of the functional module and obtains the working state of the module from the CPU through the universal serial bus;
When the voltage, temperature or working state of the functional module is abnormal, the child node sends an alarm to the root node, and also reports the voltage, temperature and working state information of the functional module to the root node in response to the root node's query commands; the root node receives query requests from the monitoring management node over the network, issues temperature, voltage and working-state query requests to the child node of each functional module, automatically reports system alarm information to the monitoring management node, and records the system work log.
2. The intelligent health state monitor of a distributed processing system according to claim 1, wherein the health monitoring data network switch performs the exchange of monitoring data, the data network is FC, AFDX or Ethernet, and the network communication rate is not lower than 1 Gbps.
3. The intelligent health state monitor of a distributed processing system according to claim 1, wherein the data links of the two root nodes are physically carried on two I2C buses.
4. The intelligent health state monitor of a distributed processing system according to claim 1, wherein the monitoring management node, the root nodes and the child nodes are each powered by separate power supplies and are powered on before the functional circuitry of the distributed processing system.
CN202110243326.7A 2021-03-04 2021-03-04 Intelligent health state monitor of distributed processing system Active CN112882901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110243326.7A CN112882901B (en) 2021-03-04 2021-03-04 Intelligent health state monitor of distributed processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110243326.7A CN112882901B (en) 2021-03-04 2021-03-04 Intelligent health state monitor of distributed processing system

Publications (2)

Publication Number Publication Date
CN112882901A (en) 2021-06-01
CN112882901B (en) 2024-06-18

Family

ID=76055397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110243326.7A Active CN112882901B (en) 2021-03-04 2021-03-04 Intelligent health state monitor of distributed processing system

Country Status (1)

Country Link
CN (1) CN112882901B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010379B (en) * 2021-03-09 2024-03-15 爱瑟福信息科技(上海)有限公司 Electronic equipment monitoring system
CN113722012A (en) * 2021-09-07 2021-11-30 超越科技股份有限公司 Domestic system-level management system
CN114172829B (en) * 2022-02-10 2022-08-12 统信软件技术有限公司 Server health monitoring method and system and computing equipment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN109698775A (en) * 2018-11-21 2019-04-30 中国航空工业集团公司洛阳电光设备研究所 A kind of dual-machine redundancy backup system based on real-time status detection
CN111880997A (en) * 2020-07-29 2020-11-03 曙光信息产业(北京)有限公司 Distributed monitoring system, monitoring method and device

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
FR2750517B1 (en) * 1996-06-27 1998-08-14 Bull Sa METHOD FOR MONITORING A PLURALITY OF OBJECT TYPES OF A PLURALITY OF NODES FROM A ADMINISTRATION NODE IN A COMPUTER SYSTEM
US9063966B2 (en) * 2013-02-01 2015-06-23 International Business Machines Corporation Selective monitoring of archive and backup storage
GB2514833A (en) * 2013-06-07 2014-12-10 Ibm Portable computer monitoring
US9348573B2 (en) * 2013-12-02 2016-05-24 Qbase, LLC Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes
CN106126407B (en) * 2016-06-22 2018-07-17 西安交通大学 A kind of performance monitoring Operation Optimization Systerm and method for distributed memory system
CN109144802A (en) * 2018-09-12 2019-01-04 杭州智享新电科技有限公司 Internet of Things module health control diagnostic method
US10880434B2 (en) * 2018-11-05 2020-12-29 Nice Ltd Method and system for creating a fragmented video recording of events on a screen using serverless computing
CN110011829B (en) * 2019-02-28 2021-11-19 西南电子技术研究所(中国电子科技集团公司第十研究所) Comprehensive airborne task system health management subsystem

Also Published As

Publication number Publication date
CN112882901A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
CN112882901B (en) Intelligent health state monitor of distributed processing system
CN108089964A (en) A kind of device and method by BMC monitoring server CPLD states
EP2093934B1 (en) System, device, equipment and method for monitoring management
US20020152425A1 (en) Distributed restart in a multiple processor system
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN111831488B (en) TCMS-MPU control unit with safety level design
US20120221885A1 (en) Monitoring device, monitoring system and monitoring method
CN103544092A (en) Health monitoring system of avionic electronic equipment based on ARINC653 standard
EP3306422B1 (en) Arithmetic device and control apparatus
CN108287780A (en) A kind of device and method of monitoring server CPLD states
CN111815118B (en) A remote sensing satellite autonomous health management system
CN116483613B (en) Processing method and device of fault memory bank, electronic equipment and storage medium
CN102495786B (en) Server system
CN108633129A (en) A kind of fault monitoring system and monitoring process method of LED tail lamp circuits
CN103365267A (en) Bay level equipment with self-recovery function in substation and implementation method of bay level equipment
CN101110053A (en) A Method for Realizing Computer Fault Alarm Control
CN112019455A (en) A switch monitoring device and method based on programmable logic device
CN111880999B (en) High-availability monitoring management device for high-density blade server and redundancy switching method
CN106407081B (en) Case management system and server
CN208063515U (en) A kind of fault monitoring system of LED tail lamp circuits
CN114153189B (en) Automatic driving controller safety diagnosis and protection method, system and storage device
CN101741654B (en) Operating system monitoring device and method
CN114559465A (en) Robot abnormality automatic recovery method, robot and robot system
CN115801640B (en) Mutual keep-alive system between BMC management board and network switch board based on ARM array server
CN104571454A (en) Method for seamless monitoring and management of blade server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant